diff --git a/papers/paper_3_darwinism.pdf b/papers/paper_3_darwinism.pdf
new file mode 100644
index 00000000..116f094a
--- /dev/null
+++ b/papers/paper_3_darwinism.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b73ffee7836a1c081e87f10eae37210bff020caf4316729c31c33e8d5cae4e9e
+size 187898
diff --git a/papers/paper_3_darwinism.tex b/papers/paper_3_darwinism.tex
index b7e942f3..f023e199 100644
--- a/papers/paper_3_darwinism.tex
+++ b/papers/paper_3_darwinism.tex
@@ -1,8 +1,9 @@
 \documentclass[11pt,a4paper]{article}
 \usepackage[utf8]{inputenc}
 \usepackage{amsmath,amssymb,amsfonts,amsthm}
+\usepackage{cite}
 
-\title{Biophysical Witness Dynamics: Quantum Darwinism in Microtubule Conformational States (Letter)}
+\title{Biophysical Witness Dynamics: Quantum Darwinism and Decoherence Scaling at $310$ K (Letter)}
 \author{Antigravity}
 \date{\today}
 
@@ -10,28 +11,35 @@
 \maketitle
 
 \begin{abstract}
-We apply the principles of Quantum Darwinism to the conformational dipole states of tubulin dimers within cellular microtubules. By defining a pure dephasing interaction with an Ohmic aqueous thermal bath, we formally parameterize the decoherence rate $\gamma$. We calculate the Mutual Information $I(S; E_F)$ across multiple independent acoustic phonon fragments. By demonstrating that the Holevo bound is saturated, we compute the explicit redundancy factor $R_\delta$, proving that stable, classical tubulin pointer states are robustly imprinted into the biological environment.
+The survival of quantum coherence in warm, wet biological systems (e.g., microtubules) is fundamentally constrained by rapid decoherence. Rather than seeking mechanisms to evade this constraint, we apply Zurek's framework of Quantum Darwinism at the biological scale. Using a pure-dephasing spin-boson Hamiltonian, we model the $310$ K aqueous environment not as a destructive noise source but as a dense communication channel. We write down the exact decoherence function for an Ohmic spectral density and adopt Tegmark's $\mathcal{O}(10^{-13}\,\text{s})$ decoherence timescale. We argue that such fast decoherence implies a large redundancy parameter $R_\delta = 1/f_\delta$, where $f_\delta$ is the smallest environmental fraction that carries nearly all the information about the system, so that robust classical pointer states (biological conformations) are redundantly recorded in the environment. On this view, macroscopic biological certainty is a direct consequence of efficient quantum information proliferation.
 \end{abstract}
 
-\section{Microtubule Dephasing and the Ohmic Bath}
-Let a single tubulin dimer be modeled as a two-level open quantum system representing its conformational dipole, $H_S = \frac{\omega_0}{2} \sigma_S^z$. The environment consists of acoustic phonon modes in the intra-cellular fluid. We define a pure dephasing interaction $H_{int} = \sum_k g_k (\sigma_S^z \otimes \sigma_{E_k}^z)$.
-The bath is characterized by an Ohmic spectral density:
+\section{The Spin-Boson Coupling and Tegmark's Timescale}
+The environment of a biological macromolecule (e.g., a tubulin dimer) is modeled as an Ohmic bath of harmonic oscillators (phonons and hydration-shell modes). The total Hamiltonian is $H = H_S + H_E + H_{\text{int}}$. The interaction is taken to be pure dephasing, i.e., $[H_S, H_{\text{int}}] = 0$, and is given by the standard spin-boson coupling \cite{Schlosshauer2007}:
 \begin{equation}
-J(\omega) = \sum_k |g_k|^2 \delta(\omega - \omega_k) = \alpha \omega e^{-\omega/\omega_c}
+H_{\text{int}} = \sigma_S^z \otimes \sum_k g_k(b_k + b_k^\dagger)
 \end{equation}
-where $\alpha$ is the dimensionless coupling strength derived from molecular dipole-water interactions, and $\omega_c$ is the high-frequency cutoff of the solvation shell. At biological temperatures $T=310$ K ($k_B T \gg \omega_c$), the Markovian decoherence rate is explicitly parameterized as $\gamma \approx \frac{2\pi \alpha}{\hbar} k_B T$.
+where $\sigma_S^z$ acts on the two conformational states of the protein, and $b_k^\dagger, b_k$ are the creation and annihilation operators of the $k$-th environmental mode. The bath is characterized by the Ohmic spectral density $J(\omega) = \sum_k |g_k|^2 \delta(\omega - \omega_k) = \alpha \omega e^{-\omega/\omega_c}$, where the dimensionless constant $\alpha$ sets the coupling strength and $\omega_c$ is the high-frequency cutoff, which we take to be set by the speed of sound in water.
 
-\section{Redundant Imprinting and the Holevo Bound}
-We partition the cellular environment into disjoint fragments $E_F$. The mutual information $I(S; E_F)$ scales with the fragment size $f$. For pure dephasing, the environment perfectly records the pointer states (the diagonal elements of $\rho_S$). The Holevo bound $I \approx H(S)$ is saturated for small fractions $f$.
-The redundancy factor $R_\delta$, defined as the number of independent environmental fragments that supply the missing information $1-\delta$, is explicitly given by:
+The off-diagonal elements of the reduced density matrix $\rho_S(t)$ decay as $e^{-\Gamma(t)}$, governed by the exact decoherence function:
 \begin{equation}
-R_\delta = \frac{1}{f_\delta} \approx \frac{\gamma}{\gamma_{frag} \ln(1/\delta)}
+\Gamma(t) = 4\int_0^\infty d\omega\, \frac{J(\omega)}{\omega^2}\left[1 - \cos(\omega t)\right]\coth\!\left(\frac{\hbar\omega}{2k_B T}\right)
 \end{equation}
-Given the massive degrees of freedom in the biological solvation shell, $R_\delta \gg 1$, proving that numerous independent biochemical pathways can concurrently deduce the classical conformational state of the tubulin dimer without perturbing its Hamiltonian.
+At physiological temperature $T=310$ K, the $\coth$ factor enhances the thermal contribution to $\Gamma(t)$ and drives rapid decay of the coherences. For suitable values of $\alpha$ and $\omega_c$, evaluating $\Gamma(t)$ gives a decoherence timescale $\tau_D \sim 10^{-13}$ s, consistent with Tegmark's estimate \cite{Tegmark2000}. However, rather than concluding that quantum mechanics is biologically irrelevant, we read this timescale as a measure of the immense bandwidth of the environment acting as an information witness.
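+
+As a rough sketch of this evaluation (an illustrative assumption, not a step established above), suppose the high-temperature regime $k_B T \gg \hbar\omega_c$ holds, so that $\coth(\hbar\omega/2k_B T) \approx 2k_B T/\hbar\omega$ over the support of $J(\omega)$. The integral can then be done in closed form:
+\begin{equation}
+\Gamma(t) \approx \frac{8\alpha k_B T}{\hbar}\left[t\arctan(\omega_c t) - \frac{1}{2\omega_c}\ln\!\left(1+\omega_c^2 t^2\right)\right] \xrightarrow{\;\omega_c t \gg 1\;} \frac{4\pi\alpha k_B T}{\hbar}\,t ,
+\end{equation}
+which corresponds to exponential decay with $\tau_D \approx \hbar/(4\pi\alpha k_B T)$. At $T = 310$ K, $k_B T/\hbar \approx 4\times 10^{13}\,\text{s}^{-1}$, so under these assumptions $\tau_D \sim 10^{-13}$ s would require $\alpha \sim 10^{-2}$. The value of $\alpha$ is not fixed by the model and is quoted here only for illustration.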
+
+\section{Quantum Darwinism and the Redundancy Parameter}
+Following Zurek \cite{Zurek2009}, the emergence of objective classicality requires that information about the pointer states (the eigenstates of $\sigma_S^z$) be redundantly proliferated into the environment. We partition the bath into fragments $F_f$, each containing a fraction $f$ of the environmental modes. The mutual information between the system and a fragment $F_f$ is:
+\begin{equation}
+I(S:F_f) = H(\rho_S) + H(\rho_{F_f}) - H(\rho_{SF_f})
+\end{equation}
+where $H$ denotes the von Neumann entropy. Because $\tau_D$ is effectively instantaneous on biological timescales, the mutual information rapidly reaches its classical plateau, $I(S:F_f) \approx H(\rho_S)$, already for small fragments. The redundancy parameter $R_\delta = 1/f_\delta$ counts the number of disjoint fragments that each carry all but a fraction $\delta$ of this information. Because the environment contains $\sim 3\times 10^{10}$ water molecules per cubic micron, the number of available fragments is enormous and $R_\delta \gg 1$.
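+
+For concreteness, the redundancy can be stated in the standard form used in the Quantum Darwinism literature \cite{Zurek2009}. The information deficit $\delta$ is a free parameter, and the expression below is a definition rather than a result derived here:
+\begin{equation}
+f_\delta = \min\left\{ f : I(S:F_f) \geq (1-\delta)\,H(\rho_S) \right\}, \qquad R_\delta = \frac{1}{f_\delta}.
+\end{equation}
+Since a fragment must contain at least one environmental subsystem, $f_\delta \geq 1/N$ for a bath of $N$ subsystems, and hence $R_\delta \leq N$.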
+
+Therefore, the biological environment does not simply destroy the state; it records the pointer-state information many times over. In this sense fitness beats truth structurally: the environment acts as a macroscopic amplification channel, converting fragile superpositions into the robust, objective classical configurations necessary for biological computation.
 
 \bibliographystyle{plain}
 \begin{thebibliography}{10}
 \bibitem{Zurek2009} W. H. Zurek, \textit{Nat. Phys.} \textbf{5}, 181 (2009).
-\bibitem{Plenio2008} M. B. Plenio, S. F. Huelga, \textit{New J. Phys.} \textbf{10}, 113019 (2008).
+\bibitem{Tegmark2000} M. Tegmark, \textit{Phys. Rev. E} \textbf{61}, 4194 (2000).
+\bibitem{Schlosshauer2007} M. Schlosshauer, \textit{Decoherence and the Quantum-to-Classical Transition} (Springer, 2007).
 \end{thebibliography}
 \end{document}
diff --git a/papers/project_paper_2_neuroscience/references/Amari2016.pdf b/papers/project_paper_2_neuroscience/references/Amari2016.pdf
new file mode 100644
index 00000000..eab69df0
--- /dev/null
+++ b/papers/project_paper_2_neuroscience/references/Amari2016.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d54c73f85233ad188ddd594aa061cd6ed671c77dd371ef2525e3448098558190
+size 15842558
diff --git a/papers/project_paper_2_neuroscience/references/Amari2016.txt b/papers/project_paper_2_neuroscience/references/Amari2016.txt
new file mode 100644
index 00000000..6c6084bb
--- /dev/null
+++ b/papers/project_paper_2_neuroscience/references/Amari2016.txt
@@ -0,0 +1,37050 @@
+Information Geometry
+
+Edited by
+
+Geert Verdoolaege
+
+Printed Edition of the Special Issue Published in Entropy
+
+www.mdpi.com/journal/entropy
+
+
+Information Geometry
+
+Special Issue Editor
+
+Geert Verdoolaege
+
+MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade
+
+
+Special Issue Editor
+
+Geert Verdoolaege
+
+Ghent University
+
+Belgium
+
+Editorial Office
+
+MDPI
+
+St. Alban-Anlage 66
+
+4052 Basel, Switzerland
+
+This is a reprint of articles from the Special Issue published online in the open access journal Entropy
+
+(ISSN 1099-4300) in 2014 (available at:
+
+https://www.mdpi.com/journal/entropy/special_issues/information-geometry)
+
+For citation purposes, cite each article independently as indicated on the article page online and as
+
+indicated below:
+
+LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. Journal Name Year, Article Number,
+
+Page Range.
+
+ISBN 978-3-03897-632-5 (Pbk)
+
+ISBN 978-3-03897-633-2 (PDF)
+
+Cover image courtesy of Geert Verdoolaege.
+
+© 2019 by the authors. Articles in this book are Open Access and distributed under the Creative
+
+Commons Attribution (CC BY) license, which allows users to download, copy and build upon
+
+published articles, as long as the author and publisher are properly credited, which ensures maximum
+
+dissemination and a wider impact of our publications.
+
+The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons
+
+license CC BY-NC-ND.
+
+
+Contents
+
+About the Special Issue Editor
+
+Preface to “Information Geometry”
+
+Shun-ichi Amari
+Information Geometry of Positive Measures and Positive-Definite Matrices:
+Decomposable Dually Flat Structure
+Reprinted from: Entropy 2014, 16, 2131–2145, doi:10.3390/e16042131
+
+Harsha K. V. and Subrahamanian Moosath K S
+F-Geometry and Amari’s α-Geometry on a Statistical Manifold
+Reprinted from: Entropy 2014, 16, 2472–2487, doi:10.3390/e16052472
+
+Frank Critchley and Paul Marriott
+Computational Information Geometry in Statistics: Theory and Practice
+Reprinted from: Entropy 2014, 16, 2454–2471, doi:10.3390/e16052454
+
+Paul Vos and Karim Anaya-Izquierdo
+Using Geometry to Select One Dimensional Exponential Families That Are Monotone
+Likelihood Ratio in the Sample Space, Are Weakly Unimodal and Can Be Parametrized by a
+Measure of Central Tendency
+Reprinted from: Entropy 2014, 16, 4088–4100, doi:10.3390/e16074088
+
+Guido Montúfar, Johannes Rauh and Nihat Ay
+On the Fisher Metric of Conditional Probability Polytopes
+Reprinted from: Entropy 2014, 16, 3207–3233, doi:10.3390/e16063207
+
+André Klein
+Matrix Algebraic Properties of the Fisher Information Matrix of Stationary Processes
+Reprinted from: Entropy 2014, 16, 2023–2055, doi:10.3390/e16042023
+
+Keisuke Yano and Fumiyasu Komaki
+Asymptotically Constant-Risk Predictive Densities When the Distributions of Data and Target
+Variables Are Different
+Reprinted from: Entropy 2014, 16, 3026–3048, doi:10.3390/e16063026
+
+Samuel Livingstone and Mark Girolami
+Information-Geometric Markov Chain Monte Carlo Methods Using Diffusions
+Reprinted from: Entropy 2014, 16, 3074–3102, doi:10.3390/e16063074
+
+Hui Zhao and Paul Marriott
+Variational Bayes for Regime-Switching Log-Normal Models
+Reprinted from: Entropy 2014, 16, 3832–3847, doi:10.3390/e16073832
+
+Frank Nielsen, Richard Nock and Shun-ichi Amari
+On Clustering Histograms with k-Means by Using Mixed α-Divergences
+Reprinted from: Entropy 2014, 16, 3273–3301, doi:10.3390/e16063273
+
+Salem Said, Lionel Bombrun and Yannick Berthoumieu
+New Riemannian Priors on the Univariate Normal Model
+Reprinted from: Entropy 2014, 16, 4015–4031, doi:10.3390/e16074015
+
+Luigi Malagò and Giovanni Pistone
+Combinatorial Optimization with Information Geometry: The Newton Method
+Reprinted from: Entropy 2014, 16, 4260–4289, doi:10.3390/e16084260
+
+Domenico Felice, Carlo Cafaro and Stefano Mancini
+Information Geometric Complexity of a Trivariate Gaussian Statistical Model
+Reprinted from: Entropy 2014, 16, 2944–2958, doi:10.3390/e16062944
+
+Alexandre Levada
+Learning from Complex Systems: On the Roles of Entropy and Fisher Information in Pairwise
+Isotropic Gaussian Markov Random Fields
+Reprinted from: Entropy 2014, 16, 1002–1036, doi:10.3390/e16021002
+
+Masatoshi Funabashi
+Network Decomposition and Complexity Measures: An Information Geometrical Approach
+Reprinted from: Entropy 2014, 16, 4132–4167, doi:10.3390/e16074132
+
+Roger Balian
+The Entropy-Based Quantum Metric
+Reprinted from: Entropy 2014, 16, 3878–3888, doi:10.3390/e16073878
+
+Xiaozhao Zhao, Yuexian Hou, Dawei Song and Wenjie Li
+Extending the Extreme Physical Information to Universal Cognitive Models via a Confident
+Information First Principle
+Reprinted from: Entropy 2014, 16, 3670–3688, doi:10.3390/e16073670
+
+
+About the Special Issue Editor
+
+Geert Verdoolaege obtained an M.Sc.
+degree in Theoretical Physics in 1999 and the Ph.D. in
+
+Engineering Physics in 2006, both at Ghent University (UGent, Belgium). His Ph.D. work concerned
+
+applications of Bayesian probability theory to plasma spectroscopy in fusion devices.
+He was a
+
+postdoctoral researcher in the field of computer vision at the University of Antwerp (2007–2008),
+
+working on probabilistic modeling of image textures using information geometry. From 2008 to 2010,
+
+he was with the Department of Data Analysis at UGent, where he worked on modeling and
+
+estimation of brain activity, based on functional magnetic resonance imaging. In 2010, he returned
+
+to the Department of Applied Physics at UGent, first as a postdoctoral assistant and from 2014
+
+onwards, as a part-time assistant professor.
+Since 2013, he has held a cross-appointment as a
+
+researcher at the Laboratory for Plasma Physics of the Royal Military Academy (LPP-ERM/KMS)
+
+in Brussels. His research activities comprise development of data analysis techniques using methods
+
+from probability theory, machine learning and information geometry, and their application to nuclear
+
+fusion experiments. He also teaches a Master course on Continuum Mechanics at Ghent University.
+
+He serves on the editorial board of the multidisciplinary journal Entropy and is a member of the
+
+scientific committees of several conferences (IAEA Technical Meeting on Fusion Data Processing,
+
+Validation and Analysis; International Workshop on Bayesian Inference and Maximum Entropy
+
+Methods in Science and Engineering; Conference on Geometric Science of Information). In addition,
+
+he is a consulting expert in the International Tokamak Physics Activity (ITPA) Transport and
+
+Confinement Topical Group and member of the General Assembly of the European Fusion Education
+
+Network (FuseNet).
+
+
+
+
+Preface to “Information Geometry”
+
+The mathematical field of information geometry originated from the observation that the Fisher
+
+information can be used to define a Riemannian metric on manifolds of probability distributions.
+
+This led to a geometrical description of probability theory and statistics, which over the years has
+
+developed into a rich mathematical field with a broad range of applications in the data sciences.
+
+Moreover, similar to the concept of entropy, there are various connections to and applications of
+
+information geometry in statistical mechanics, quantum mechanics, and neuroscience.
+
+It has been a pleasure to act as a guest editor for this first Special Issue on information geometry
+
+in the journal Entropy. For me, as a physicist working on the development and application of
+
+data science techniques in the context of nuclear fusion experiments, the interdisciplinary character
+
+of information geometry has always been one of the main reasons for its appeal.
+There are, of
+
+course, many other domains in physics where geometrical notions play a key role, including classical
+
+mechanics, continuum mechanics (which I have been teaching at Ghent University for several years
+
+now), general relativity, and much of modern physics. This interplay between the beautiful and
+
+elegant formalism of differential geometry on the one hand and physics and data science on the
+
+other hand is both fascinating and inspiring. The variety of topics covered by this Special Issue is a
+
+reflection of this cross-fertilization between disciplines.
+
+“Information Geometry I” has been a great success, and although the papers were published
+
+already several years ago, it was decided that it was worthwhile to reprint the collection of papers
+
+in book form. Indeed, even though all papers present original research, many have a strong tutorial
+
+character, and we were honored to receive multiple contributions by authorities in the field. The
+
+papers have been structured according to their main subject area, or field of application, and we
+
+briefly discuss each of them in the following.
+
+We start with two papers related to the foundations of information geometry. We were very
+
+pleased to receive a contribution by one of the founders of the field of information geometry, Prof.
+
+Shun-ichi Amari. In his paper, the dually flat structure of the manifold of positive measures is
+
+discussed, derived from a class of Bregman divergences. These so-called (ρ,τ)-divergences, originally
+
+proposed by J. Zhang, are defined in terms of two monotone, scalar functions (ρ and τ) and form a
+
+unique class of dually flat, decomposable divergences. This is extended to the set of positive-definite
+
+matrices, additionally requiring invariance of the divergence under matrix transformations. It is well
+
+known that such dually flat manifolds have computationally desirable properties in applications to
+
+classification and information retrieval.
+
+Harsha K. V. and Subrahamanian Moosath K. S. introduce F-geometry as a generalization
+
+of α-geometry, based on a representation of a probability density function through a function F.
+
+They then combine this with another function G to define a weighted expectation, from which
+
+an (F,G)-metric and connection are deduced. A condition for two of such structures to lead
+
+to dual connections is also derived. However, it was shown by Zhang (J. Zhang, Entropy 17,
+
+pp. 4485–4499, 2015) that this framework is equivalent to the (ρ,τ)-geometry introduced earlier by
+
+him. Although the present paper is slightly different in perspective, it should be read with this
+
+equivalence in mind.
+
+The next four papers deal with applications of information geometry in statistics. The paper
+
+by Frank Critchley and Paul Marriott presents an important research program aimed at rendering
+
+some of the most useful results of information geometry more accessible to statisticians in
+
+practical applications. Indeed, the formalism of differential geometry and tensor algebra can
+
+appear daunting at first sight and may present an obstacle to adoption of many useful results
+
+by practitioners. The paper describes a computational framework that facilitates implementation
+
+of results from information geometry, based on an embedding of various important statistical
+
+models in a (sufficiently large) simplex. Challenges related to extension of the framework to the
+
+infinite–dimensional case are touched upon as well.
+
+In the paper by Paul Vos and Karim Anaya-Izquierdo, the goal is to identify one-dimensional
+
+exponential families enjoying a number of properties that are convenient for statistical modeling,
+
+i.e., parametrization by a measure of central tendency, unimodality, and monotone likelihood ratio.
+
+The basis for the framework is the multinomial distribution, modeled geometrically by the simplex.
+
+The selection of exponential families with desirable properties is then based on a partitioning of the
+
+natural parameter space of the family of multinomial distributions by means of convex cones.
+
+Guido Montúfar and co-workers consider various possibilities to define natural Riemannian
+
+metrics on polytopes of stochastic matrices, which describe the conditional probability distribution
+
+of two categorical random variables. Inspired by the classical result regarding the uniqueness of the
+
+Fisher metric by requiring invariance under Markov morphisms, they define metrics derived from
+
+a natural class of stochastic maps between such polytopes, or, alternatively, through embeddings in
+
+various possible model spaces. They provide recommendations as to which metric to use, depending
+
+on the application.
+
+André Klein, in his article, provides a survey of several matrix algebraic properties of the Fisher
+
+information matrix corresponding to weakly stationary time series. The link with various structured
+
+matrices arising from a number of time series models is demonstrated. A statistical distance measure
+
+is built using the Fisher information matrix in the context of classical and quantum information.
+
+Finally, conditions are obtained for the Fisher information of a stationary process to obey certain
+
+forms of the Stein equation.
+
+We continue with three papers concerning applications of information geometry in Bayesian
+
+inference and simulation. Keisuke Yano and Fumiyasu Komaki, in their paper, construct constant-risk
+
+Bayesian predictive densities using the Kullback-Leibler loss function when the distributions of the
+
+data and the target variable to be predicted are different but have a common unknown parameter.
+
+Specifically, the issue of prior selection is investigated, and several applications are given.
+
+Samuel Livingstone and Mark Girolami provide an introduction to recent enhancements of
+
+Markov chain Monte Carlo simulation techniques inspired by information geometry. They apply
+
+this to the Metropolis–Hastings algorithm driven by Langevin diffusion, gradually transforming the
+
+ingredients to the setting of a Riemannian manifold equipped with a metric similar to the Fisher
+
+information metric. Pointers to various applications are given. The paper is written in a way that also
+
+makes it accessible to practitioners with little background in stochastic processes and geometry.
+
+The paper by Hui Zhao and Paul Marriott concerns Bayesian inference making use of variational
+
+methods for approximating the posterior distribution. In the context of inference for time series
+
+models that switch between different regimes, variational Bayes is shown to be a computationally
+
+attractive alternative to Markov chain Monte Carlo simulations. The geometry related to the
+
+projection of the posterior onto a computationally tractable family of distributions is elucidated by
+
+means of a simple example. This is followed by an application wherein it is shown that variational
+
+Bayes is successful in estimating the regime-switching model, including the number of regimes.
+
+Applications of information geometry in machine learning are represented by the following
+
+three papers. The article by Frank Nielsen and colleagues considers k-means histogram clustering,
+
+with applications to, e.g., information retrieval. Based on the α-divergences as similarity measures,
+
+clustering is performed using either the sided (asymmetric) or symmetrized divergence, or by means
+
+of the interesting notion of a mixed divergence. An important computational advantage is that the
+
+centroids based on the sided and mixed divergences have a closed-form expression. Next, the scheme
+
+is extended to algorithms with optimized initialization of cluster centroids, as well as soft clustering.
+
+Salem Said and co-workers present a class of distributions on the manifold of the univariate
+
+normal model equipped with the Fisher information metric. Expressed in terms of the Fisher-Rao
+
+distance, the distributions are used as priors for modeling the classes in Bayesian classification
+
+problems of normal distributions. Characteristics of this “Gaussian” distribution on the manifold are
+
+discussed, as well as estimation of its parameters and the posterior using the Laplace approximation.
+
+In an application to classification of image textures, the improved performance of these priors over
+
+conjugate priors is demonstrated.
+
+Luigi Malagò and Giovanni Pistone address optimization on manifolds of exponential
+
+distributions on a discrete state space using Newton’s method, which is based on second-order
+
+calculus. In particular, the goal is to find maxima of the expectation of a function with respect to the
+
+distribution (stochastic relaxation). Details of the computation are provided, including calculation
+
+of the Riemannian Hessian. A nonparametric formalism is used, with a view to extension to the
+
+infinite–dimensional case.
+
+The next three papers are related to the role of information geometry in complex systems
+
+research. Domenico Felice and colleagues consider the time-averaged volume explored by geodesics
+
+on a statistical manifold as an indicator of complexity of the entropic dynamics of a system.
+
+The parameters of the model play the role of macrovariables conveying information on the
+
+system’s microstate. Examples are given for the case of univariate, bivariate, and trivariate normal
+
+distributions, providing interesting results depending on correlations between the microvariables.
+
+Alexandre Levada investigates the role of entropy and Fisher information in pairwise isotropic
+
+Gaussian Markov random fields, acting as models for complex systems. Expressions for these
+
+quantities are derived and the evolution of the Fisher information and entropy is studied as
+
+a function of the inverse temperature of the system. An interesting interpretation is given of
+
+asymmetries between these curves in terms of the arrow of time.
+
+Masatoshi Funabashi presents a framework for measuring statistical dependence between
+
+subsystems of a stochastic model, based on the model’s graph representation.
+A description in
+
+terms of the mixed coordinates of the system is used to quantify the complexity loss incurred by
+
+cutting an edge of the graph. In addition, a complexity measure is defined as a geometric mean of
+
+Kullback–Leibler divergences between decompositions of the system in terms of subsystems with
+
+fewer statistical dependencies. This quantifies the degree to which the system can be decomposed.
+
+The following paper concerns an application to physics, specifically quantum mechanics. Roger
+
+Balian gives an overview of a geometrical framework for measuring information loss in quantum
+
+systems resulting from the mixing of states. A Riemannian metric is defined, based on the von
+
+Neumann entropy, generating a mapping between states and observables. The metric is compared
+
+to other quantum metrics, as well as the Fisher–Rao metric, and various geometrical properties are
+
+derived. Applications are given to quantum information, as well as equilibrium and non-equilibrium
+
+quantum statistical mechanics.
+
+The final paper in the Special Issue is situated at the interface between physics and neuroscience.
+
+Xiaozhao Zhao and colleagues consider the principle of extreme physical information based on the
+
+Fisher information, which has been used before in an attempt to establish an information-theoretical
+
+basis for physical laws. They extend the idea to cognitive systems and aim at narrowing the gap
+
+between the information bound and the data information for such complex systems, by transforming
+
+the model to a simpler one. This is done by means of a dimensionality reduction technique, also based
+
+on the Fisher information. The approach is applied to derive the model for single-layer Boltzmann
+
+machines and interpret their learning algorithms.
+
+We are convinced that the varied collection of papers in this Special Issue will be useful for
+
+scientists who are new to the field, while providing an excellent reference for the more seasoned
+
+researcher. Furthermore, it is worth mentioning that the second Entropy Special Issue in this series,
+
+“Information Geometry II”, will also be published as a book, and that a third Special Issue is being
+
+prepared. We hope that the reader will enjoy browsing and reading through this collection of papers
+
+as much as we enjoyed guest editing this Special Issue “Information Geometry I”.
+
+Finally, I would like to thank the Editor-in-Chief of Entropy, Prof. Dr. Kevin H. Knuth, for
+
+suggesting the opportunity to guest-edit a Special Issue on information geometry.
+Furthermore,
+
+I wish to thank the editorial staff at MDPI for their great help with contacting authors, organizing
+
+paper reviews, and editing the original Special Issue in Entropy, as well as the reprinted version in the
+
+present book.
+
+Geert Verdoolaege
+
+Special Issue Editor
+
+
+
+
+
+entropy
+
+Article
+Information Geometry of Positive Measures and
+Positive-Definite Matrices: Decomposable Dually
+Flat Structure
+
+Shun-ichi Amari
+
+RIKEN Brain Science Institute, Hirosawa 2-1, Wako-shi, Saitama 351-0198, Japan;
+E-Mail: amari@brain.riken.jp; Tel.: +81-48-467-9669; Fax: +81-48-467-9687
+
+Received: 14 February 2014; in revised form: 9 April 2014 / Accepted: 10 April 2014 /
+Published: 14 April 2014
+
+Abstract: Information geometry studies the dually flat structure of a manifold, highlighted by
+the generalized Pythagorean theorem. The present paper studies a class of Bregman divergences
+called the (ρ, τ)-divergence. A (ρ, τ)-divergence generates a dually flat structure in the manifold of
+positive measures, as well as in the manifold of positive-definite matrices. The class is composed of
+decomposable divergences, which are written as a sum of componentwise divergences. Conversely,
+a decomposable dually flat divergence is shown to be a (ρ, τ)-divergence. A (ρ, τ)-divergence is
+determined from two monotone scalar functions, ρ and τ. The class includes the KL-divergence, α-,
+β- and (α, β)-divergences as special cases. The transformation between an affine parameter and its
+dual is easily calculated in the case of a decomposable divergence. Therefore, such a divergence
+is useful for obtaining the center for a cluster of points, which will be applied to classification and
+information retrieval in vision. For the manifold of positive-definite matrices, in addition to dual
+flatness and decomposability, we require the invariance under linear transformations, in particular
+under orthogonal transformations. This opens a way to define a new class of divergences, called the
+(ρ, τ)-structure in the manifold of positive-definite matrices.
+
+Keywords: information geometry; dually flat structure; decomposable divergence; (ρ, τ)-structure
+
+1. Introduction
+
+Information geometry, which originated from the invariant structure of a manifold of probability
+distributions, consists of a Riemannian metric and dually coupled affine connections with respect to
+the metric [1]. A manifold having a dually flat structure is particularly interesting and important. In
+such a manifold, there are two dually coupled affine coordinate systems and a canonical divergence,
+which is a Bregman divergence. The highlight is given by the generalized Pythagorean theorem and
+projection theorem. Information geometry is useful not only for statistical inference, but also for
+machine learning, pattern recognition, optimization and even for neural networks. It is also related to
+the statistical physics of Tsallis q-entropy [2–4].
+The present paper studies a general and unique class of decomposable divergence functions
+in Rn+, the manifold of n-dimensional positive measures. These are the (ρ, τ)-divergences, introduced
+by Zhang [5,6], from the point of view of “representation duality”. They are Bregman divergences
+generating a dually flat structure. The class includes the well-known Kullback-Leibler divergence,
+α-divergence, β-divergence and (α, β)-divergence [1,7–9] as special cases. The merit of a decomposable
+Bregman divergence is that the θ-η Legendre transformation is computationally tractable, where θ
+and η are two affine coordinate systems coupled by the Legendre transformation. When one uses
+a dually flat divergence to define the center of a cluster of elements, the center is easily given by
+the arithmetic mean of the dual coordinates of the elements [10,11]. However, we need to calculate
+its primal coordinates. This is the θ-η transformation. Hence, our new type of divergences has an
+advantage of calculating θ-coordinates for clustering and related pattern matching problems. The most
+general class of dually flat divergences, not necessarily decomposable, is further given in Rn+. They are
+the (ρ, τ) divergence.
+
+Entropy 2014, 16, 2131–2145; doi:10.3390/e16042131
+www.mdpi.com/journal/entropy
+
+Positive-definite (PD) matrices appear in many engineering problems, such as convex
+programming, diffusion tensor analysis and multivariate statistical analysis [12–20]. The manifold,
+PDn, of n × n PD matrices forms a cone, and its geometry is by itself an important subject of research.
+If we consider the submanifold consisting of only diagonal matrices, it is equivalent to the manifold
+of positive measures. Hence, PD matrices can be regarded as a generalization of positive measures.
+There are many studies on geometry and divergences of the manifold of positive-definite matrices. We
+introduce a general class of dually flat divergences, the (ρ, τ)-divergence. We analyze the cases when
+a (ρ, τ)-divergence is invariant under the general linear transformations, Gl(n), and invariant under
+the orthogonal transformations, O(n). They not only include many well-known divergences of PD
+matrices, but also give new important divergences.
+The present paper is organized as follows. Section 2 is preliminary, giving a short introduction
+to a dually flat manifold and the Bregman divergence. It also defines the cluster center due to a
+divergence. Section 3 defines the (ρ, τ)-structure in Rn+. It gives dually flat decomposable affine
+coordinates and a related canonical divergence (Bregman divergence). Section 4 is devoted to the
+(ρ, τ)-structure of the manifold, PDn, of PD matrices. We first study the class of divergences that are
+invariant under O(n). We further study a decomposable divergence that is invariant under Gl(n).
+It coincides with the invariant divergence derived from zero-mean Gaussian distributions with PD
+covariance matrices. They not only include various known divergences, but new remarkable ones.
+Section 5 discusses a general class of non-decomposable flat divergences and miscellaneous topics.
+Section 6 concludes the paper.
+
+2. Preliminaries to Information Geometry of Divergence
+
+2.1. Dually Flat Manifold
+
+A manifold is said to have the dually flat Riemannian structure, when it has two affine coordinate
+systems $\theta = (\theta^1, \cdots, \theta^n)$ and $\eta = (\eta_1, \cdots, \eta_n)$ (with respect to two flat affine connections) together
+with two convex functions, ψ(θ) and ϕ(η), such that the two coordinates are connected by the Legendre
+transformations:
+
+\[ \eta = \nabla\psi(\theta), \qquad \theta = \nabla\phi(\eta), \tag{1} \]
+
+where ∇ is the gradient operator. The Riemannian metric is given by:
+
+\[ \bigl(g_{ij}(\theta)\bigr) = \nabla\nabla\psi(\theta), \qquad \bigl(g^{ij}(\eta)\bigr) = \nabla\nabla\phi(\eta) \tag{2} \]
+
+in the respective coordinate systems. A curve that is linear in the θ-coordinates is called a θ-geodesic,
+and a curve linear in the η-coordinates is called an η-geodesic.
+A dually flat manifold has a unique canonical divergence, which is the Bregman divergence
+defined by the convex functions,
+
+\[ D[P : Q] = \psi(\theta_P) + \phi(\eta_Q) - \theta_P \cdot \eta_Q, \tag{3} \]
+
+where θP is the θ-coordinates of P, ηQ is the η-coordinates of Q and $\theta_P \cdot \eta_Q = \sum_i \theta_P^i \, \eta_{Qi}$, where $\theta_P^i$
+and $\eta_{Qi}$ are components of θP and ηQ, respectively. The Pythagorean and projection theorems hold in
+a dually flat manifold:
+Pythagorean Theorem. Given three points, P, Q, R, when the η-geodesic connecting P and Q is
+orthogonal to the θ-geodesic connecting Q and R with respect to the Riemannian metric,
+
+\[ D[P : Q] + D[Q : R] = D[P : R]. \tag{4} \]
+
+Projection Theorem. Given a smooth submanifold, S, let PS be the minimizer of divergence
+from P to S,
+
+\[ P_S = \arg\min_{Q \in S} D[P : Q]. \tag{5} \]
+
+Then, PS is the η-geodesic projection of P to S, that is the η-geodesic connecting P and PS is orthogonal
+to S.
+We have the dual of the above theorems where θ- and η-geodesics are exchanged and D[P : Q] is
+replaced by its dual D[Q : P].
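+
+As a minimal sketch of Equations (1) and (3), the snippet below evaluates the canonical divergence for one assumed convex function, ψ(θ) = ∑ exp(θi), whose Legendre dual is ϕ(η) = ∑ (ηi log ηi − ηi). The choice of ψ and all function names are illustrative assumptions and are not taken from the paper.
+
```python
import math

# Assumed example potential: psi(theta) = sum_i exp(theta_i).
def psi(theta):
    return sum(math.exp(t) for t in theta)

# Legendre dual of the assumed psi: phi(eta) = sum_i (eta_i log eta_i - eta_i).
def phi(eta):
    return sum(e * math.log(e) - e for e in eta)

def grad_psi(theta):
    """Eq. (1): eta = grad psi(theta)."""
    return [math.exp(t) for t in theta]

def grad_phi(eta):
    """Eq. (1): theta = grad phi(eta)."""
    return [math.log(e) for e in eta]

def canonical_divergence(theta_p, eta_q):
    """Eq. (3): D[P:Q] = psi(theta_P) + phi(eta_Q) - theta_P . eta_Q."""
    inner = sum(t * e for t, e in zip(theta_p, eta_q))
    return psi(theta_p) + phi(eta_q) - inner

theta_p = [0.0, 0.5, -1.0]
theta_q = [0.3, -0.2, 0.1]
eta_q = grad_psi(theta_q)

d_pq = canonical_divergence(theta_p, eta_q)            # strictly positive for P != Q
d_pp = canonical_divergence(theta_p, grad_psi(theta_p))  # vanishes up to rounding
```
+
+For this particular ψ, writing p = exp(θP) and q = ηQ, the value d_pq coincides with the generalized KL expression ∑ (qi log(qi/pi) − qi + pi).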
+
+2.2. Decomposable Divergence
+
+A divergence, D[P : Q], is said to be decomposable, when it is written as a sum of
+component-wise divergences,
+
+\[ D[P : Q] = \sum_{i=1}^{n} d\left(\theta_P^i, \theta_Q^i\right), \tag{6} \]
+
+where $\theta_P^i$ and $\theta_Q^i$ are the components of θP and θQ and $d(\theta_P^i, \theta_Q^i)$ is a scalar divergence function.
+An f-divergence:
+
+\[ D_f[P : Q] = \sum_i p_i \, f\left(\frac{q_i}{p_i}\right) \tag{7} \]
+
+is a typical example of decomposable divergence in the manifold of probability distributions, where
+P = (p) and Q = (q) are two probability vectors with ∑ pi = ∑ qi = 1. A convex function, ψ(θ), is
+said to be decomposable, when it is written as:
+
+\[ \psi(\theta) = \sum_{i=1}^{n} \tilde{\psi}\left(\theta^i\right) \tag{8} \]
+
+by using a scalar convex function, $\tilde{\psi}(\theta)$. The Bregman divergence derived from a decomposable convex
+function is decomposable.
+When ψ(θ) is a decomposable convex function, its Legendre dual is also decomposable. The
+Legendre transformation is given componentwise as:
+
+\[ \eta_i = \tilde{\psi}'\left(\theta^i\right), \tag{9} \]
+
+where ′ is the differentiation of a function, so that it is computationally tractable. Its inverse
+transformation is also componentwise,
+
+\[ \theta^i = \tilde{\phi}'\left(\eta_i\right), \tag{10} \]
+
+where $\tilde{\phi}$ is the Legendre dual of $\tilde{\psi}$.
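+
+The decomposable form of Equation (7) can be sketched as a plain sum over components. The example below assumes f(u) = −log u, for which the f-divergence reduces to the KL-divergence ∑ pi log(pi/qi); the function names and the sample vectors are hypothetical.
+
```python
import math

def f_divergence(p, q, f):
    """Eq. (7): D_f[P:Q] = sum_i p_i * f(q_i / p_i), a sum of componentwise terms."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

def neg_log(u):
    """Assumed choice f(u) = -log u, which yields the KL-divergence."""
    return -math.log(u)

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

kl_pq = f_divergence(p, q, neg_log)
kl_pp = f_divergence(p, p, neg_log)  # zero, since f(1) = 0
```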
+
+2.3. Cluster Center
+
+Consider a cluster of points P1, · · · , Pm of which θ-coordinates are θ1, · · · , θm and η-coordinates
+are η1, · · · , ηm. The center, R, of the cluster with respect to the divergence, D[P : Q], is defined by:
+
+\[ R = \arg\min_{Q} \sum_{i=1}^{m} D[Q : P_i]. \tag{11} \]
+
+By differentiating ∑ D [Q : Pi] by θ (the θ-coordinates of Q), we have:
+
+\[ \nabla\psi(\theta_R) = \frac{1}{m} \sum_{i=1}^{m} \eta_i. \tag{12} \]
+
+Hence, the cluster-center theorem due to Banerjee et al. [10] follows; see also [11]:
+
+Cluster-Center Theorem. The η-coordinates ηR of the cluster center are given by the arithmetic
+average of the η-coordinates of the points in the cluster:
+
+\[ \eta_R = \frac{1}{m} \sum_{i=1}^{m} \eta_i. \tag{13} \]
+
+When we need to obtain the θ-coordinates of the cluster center, it is given by the θ-η transformation
+from ηR,
+
+\[ \theta_R = \nabla\phi(\eta_R). \tag{14} \]
+
+However, in many cases, the transformation is computationally heavy and intractable when the
+dimension of a manifold is large. The transformation is easy in the case of a decomposable divergence.
+This is the motivation for considering a general class of decomposable Bregman divergences.
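+
+The cluster-center computation of Equations (11) to (14) can be sketched as follows. The sketch assumes the decomposable potential ψ̃(θ) = exp(θ), so that η = exp(θ) and θ = log η componentwise; the potential, the sample cluster and all names are illustrative assumptions.
+
```python
import math

def to_theta(eta):
    """Eq. (14) for the assumed potential: theta_i = log(eta_i), componentwise."""
    return [math.log(e) for e in eta]

def divergence(theta_q, eta_p):
    """Eq. (3) with psi(theta) = sum exp(theta_i), phi(eta) = sum (eta log eta - eta)."""
    psi = sum(math.exp(t) for t in theta_q)
    phi = sum(e * math.log(e) - e for e in eta_p)
    return psi + phi - sum(t * e for t, e in zip(theta_q, eta_p))

def cluster_center(etas):
    """Eq. (13): average the eta-coordinates, then map back to theta."""
    m = len(etas)
    eta_r = [sum(column) / m for column in zip(*etas)]
    return to_theta(eta_r), eta_r

def total_divergence(theta_q, etas):
    """Objective of Eq. (11): sum_i D[Q : P_i]."""
    return sum(divergence(theta_q, eta) for eta in etas)

# Hypothetical cluster, given by the eta-coordinates of three points.
etas = [[1.0, 2.0, 0.5], [3.0, 1.0, 1.5], [2.0, 0.5, 1.0]]
theta_r, eta_r = cluster_center(etas)
best = total_divergence(theta_r, etas)
```
+
+Because the θ-η map acts on each component separately, the cost of recovering θR grows only linearly with the dimension in this sketch.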
+
+3. (ρ, τ) Dually Flat Structure in Rn+
+
+3.1. (ρ, τ)-Coordinates of Rn+
+
+Let Rn+ be the manifold of positive measures over n elements x1, · · · , xn. A measure (or a weight)
+of xi is given by:
+
+\[ \xi_i = m(x_i) > 0 \tag{15} \]
+
+and ξ = (ξ1, · · · , ξn) is a distribution of measures. When ∑ ξi = 1 is satisfied, it is a probability
+measure. We write:
+
+\[ \mathbb{R}^n_+ = \{\xi \mid \xi_i > 0\} \tag{16} \]
+
+and ξ forms a coordinate system of Rn+.
+Let ρ(ξ) and τ(ξ) be two monotonically increasing differentiable functions. We call:
+
+\[ \theta = \rho(\xi), \qquad \eta = \tau(\xi) \tag{17} \]
+
+the ρ- and τ-representations of positive measure ξ. This is a generalization of the ±α representations [1]
+and was introduced in [5] for a manifold of probability distributions. See also [6].
+By using these functions, we construct new coordinate systems θ and η of Rn+. They are given, for
+$\theta = (\theta^i)$ and $\eta = (\eta_i)$, by componentwise relations,
+
+\[ \theta^i = \rho(\xi_i), \qquad \eta_i = \tau(\xi_i). \tag{18} \]
+
+They are called the ρ- and τ-representations of ξ ∈ Rn+, respectively. We search for convex functions,
+$\psi_{\rho,\tau}(\theta)$ and $\phi_{\rho,\tau}(\eta)$, which are Legendre duals to each other, such that θ and η are two dually coupled
+affine coordinate systems.
+
+3.2. Convex Functions
+
+We introduce two scalar functions of θ and η by:
+
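+\[ \tilde{\psi}_{\rho,\tau}(\theta) = \int_0^{\rho^{-1}(\theta)} \tau(\xi)\,\rho'(\xi)\,d\xi, \tag{19} \]
+
+\[ \tilde{\phi}_{\rho,\tau}(\eta) = \int_0^{\tau^{-1}(\eta)} \rho(\xi)\,\tau'(\xi)\,d\xi. \tag{20} \]
+
+Then, the first and second derivatives of $\tilde{\psi}_{\rho,\tau}$ are:
+
+\[ \tilde{\psi}'_{\rho,\tau}(\theta) = \tau(\xi), \tag{21} \]
+
+\[ \tilde{\psi}''_{\rho,\tau}(\theta) = \frac{\tau'(\xi)}{\rho'(\xi)}. \tag{22} \]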
+
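+Since ρ′(ξ) > 0, τ′(ξ) > 0, we see that $\tilde{\psi}_{\rho,\tau}(\theta)$ is a convex function. So is $\tilde{\phi}_{\rho,\tau}(\eta)$. Moreover, they are
+Legendre duals, because:
+
+\begin{align}
+\tilde{\psi}_{\rho,\tau}(\theta) + \tilde{\phi}_{\rho,\tau}(\eta) - \theta\eta &= \int_0^{\xi} \tau(\xi)\rho'(\xi)\,d\xi + \int_0^{\xi} \rho(\xi)\tau'(\xi)\,d\xi - \rho(\xi)\tau(\xi) \tag{23} \\
+&= 0. \tag{24}
+\end{align}
+
+We then define two decomposable convex functions of θ and η by:
+
+\[ \psi_{\rho,\tau}(\theta) = \sum_i \tilde{\psi}_{\rho,\tau}\left(\theta^i\right), \tag{25} \]
+
+\[ \phi_{\rho,\tau}(\eta) = \sum_i \tilde{\phi}_{\rho,\tau}\left(\eta_i\right). \tag{26} \]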
+
+They are Legendre duals to each other.
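+
+The Legendre identity of Equations (23) and (24) can be checked numerically. The sketch below assumes ρ(ξ) = ξ² and τ(ξ) = ξ³, for which ρ(0)τ(0) = 0 so that the two integrals add up to ρ(ξ)τ(ξ) by integration by parts; the functions, the quadrature helper and the sample point are illustrative assumptions.
+
```python
def simpson(func, a, b, n=1000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * func(a + k * h)
    return total * h / 3

# Assumed monotonically increasing representations on xi > 0.
def rho(x):
    return x ** 2

def rho_prime(x):
    return 2 * x

def tau(x):
    return x ** 3

def tau_prime(x):
    return 3 * x ** 2

def psi_tilde_at(xi):
    """Eq. (19) evaluated at theta = rho(xi), so the upper limit rho^{-1}(theta) is xi."""
    return simpson(lambda s: tau(s) * rho_prime(s), 0.0, xi)

def phi_tilde_at(xi):
    """Eq. (20) evaluated at eta = tau(xi), so the upper limit tau^{-1}(eta) is xi."""
    return simpson(lambda s: rho(s) * tau_prime(s), 0.0, xi)

xi = 1.7
# Left-hand side of Eq. (23); it should vanish up to quadrature error.
gap = psi_tilde_at(xi) + phi_tilde_at(xi) - rho(xi) * tau(xi)
```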
+
+3.3. (ρ, τ)-Divergence
+
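+The (ρ, τ)-divergence between two points, ξ, ξ′ ∈ Rn+, is defined by:
+
+\begin{align}
+D_{\rho,\tau}[\xi : \xi'] &= \psi_{\rho,\tau}(\theta) + \phi_{\rho,\tau}(\eta') - \theta \cdot \eta' \tag{27} \\
+&= \sum_{i=1}^{n} \left\{ \int_0^{\xi_i} \tau(\xi)\rho'(\xi)\,d\xi + \int_0^{\xi'_i} \rho(\xi)\tau'(\xi)\,d\xi - \rho(\xi_i)\,\tau(\xi'_i) \right\}, \tag{28}
+\end{align}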
+
+where θ and η′ are ρ- and τ-representations of ξ and ξ′, respectively.
+The (ρ, τ)-divergence gives a dually flat structure having θ and η as affine and dual affine
+coordinate systems. This is originally due to Zhang [5] and a generalization of our previous results
+concerning the q and deformed exponential families [4]. The transformation between θ and η is simple
+in the (ρ, τ)-structure, because it can be done componentwise,
+
+\[ \theta^i = \rho\left\{\tau^{-1}(\eta_i)\right\}, \tag{29} \]
+
+\[ \eta_i = \tau\left\{\rho^{-1}\left(\theta^i\right)\right\}. \tag{30} \]
+
+The Riemannian metric is:
+
+\[ g_{ij}(\xi) = \frac{\tau'(\xi_i)}{\rho'(\xi_i)}\,\delta_{ij}, \tag{31} \]
+
+and hence Euclidean, because the Riemann-Christoffel curvature due to the Levi-Civita connection
+vanishes, too.
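+
+The componentwise θ-η transformation of Equations (29) and (30) can be sketched as below. The pair ρ(ξ) = ξ² and τ(ξ) = log ξ is an assumed example of two increasing functions on ξ > 0, and the helper names are hypothetical.
+
```python
import math

# Assumed representation pair and the inverses needed by Eqs. (29)-(30).
def rho(xi):
    return xi ** 2

def rho_inv(theta):
    return math.sqrt(theta)

def tau(xi):
    return math.log(xi)

def tau_inv(eta):
    return math.exp(eta)

def theta_to_eta(theta):
    """Eq. (30): eta_i = tau(rho^{-1}(theta^i)), applied to each component."""
    return [tau(rho_inv(t)) for t in theta]

def eta_to_theta(eta):
    """Eq. (29): theta^i = rho(tau^{-1}(eta_i)), applied to each component."""
    return [rho(tau_inv(e)) for e in eta]

xi = [0.5, 1.0, 4.0]
theta = [rho(x) for x in xi]
eta = theta_to_eta(theta)
theta_back = eta_to_theta(eta)  # recovers theta up to rounding
```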
+The following theorem is new, characterizing the (ρ, τ)-divergence.
+
+Theorem 1. The (ρ, τ)-divergences form a unique class of divergences in Rn+ that are dually flat and
+decomposable.
+
+3.4. Biduality: α-(ρ, τ) Divergence
+
+We have dually flat connections, $(\nabla_{\rho,\tau}, \nabla^*_{\rho,\tau})$, represented in terms of covariant derivatives,
+which are derived from $D_{\rho,\tau}$. This is called the representation duality by Zhang [5]. We further have
+the α-(ρ, τ) connections defined by:
+
+\[ \nabla^{(\alpha)}_{\rho,\tau} = \frac{1+\alpha}{2}\,\nabla_{\rho,\tau} + \frac{1-\alpha}{2}\,\nabla^*_{\rho,\tau}. \tag{32} \]
+
+The α-(−α) duality is called the reference duality [5]. Therefore, $\nabla^{(\alpha)}_{\rho,\tau}$ possesses the biduality, one
+concerning α and (−α), and the other with respect to ρ and τ.
+
+The Riemann-Christoffel curvature of $\nabla^{(\alpha)}_{\rho,\tau}$ is:
+
+\[ R^{(\alpha)}_{\rho,\tau} = \frac{1-\alpha^2}{4}\,R^{(0)}_{\rho,\tau} = 0 \tag{33} \]
+
+for any α. Hence, there exist a unique canonical divergence $D^{(\alpha)}_{\rho,\tau}$ and α-(ρ, τ) affine coordinate systems.
+It is an interesting future problem to obtain their explicit forms.
+
+3.5. Various Examples
+
+As a special case of the (ρ, τ)-divergence, we have the (α, β)-divergence obtained from the
+following power functions,
+
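+\[ \rho(\xi) = \frac{1}{\alpha}\,\xi^{\alpha}, \qquad \tau(\xi) = \frac{1}{\beta}\,\xi^{\beta}. \tag{34} \]
+
+This was introduced by Cichocki, Cruces and Amari in [7,8].
+The affine and dual affine coordinates are:
+
+\[ \theta^i = \frac{1}{\alpha}\,(\xi_i)^{\alpha}, \qquad \eta_i = \frac{1}{\beta}\,(\xi_i)^{\beta} \tag{35} \]
+
+and the convex functions are:
+
+\[ \psi(\theta) = c_{\alpha,\beta} \sum_i \left(\theta^i\right)^{\frac{\alpha+\beta}{\alpha}}, \qquad \phi(\eta) = c_{\beta,\alpha} \sum_i \left(\eta_i\right)^{\frac{\alpha+\beta}{\beta}}, \tag{36} \]
+
+where:
+
+\[ c_{\alpha,\beta} = \frac{1}{\beta(\alpha+\beta)}\,\alpha^{\frac{\alpha+\beta}{\alpha}}. \tag{37} \]
+
+The induced (α, β)-divergence has a simple form,
+
+\[ D_{\alpha,\beta}[\xi : \xi'] = \frac{1}{\alpha\beta(\alpha+\beta)} \sum_i \left\{ \alpha\,\xi_i^{\alpha+\beta} + \beta\,(\xi'_i)^{\alpha+\beta} - (\alpha+\beta)\,\xi_i^{\alpha}\,(\xi'_i)^{\beta} \right\}, \tag{38} \]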
+
+for ξ, ξ′ ∈ Rn+. It is defined similarly in the manifold, Sn, of probability distributions, but it is not
+a Bregman divergence in Sn. This is because the total mass constraint ∑ ξi = 1 is not linear in θ- or
+η-coordinates in general.
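+
+As a hedged consistency check, the closed form of Equation (38) can be compared with the integral definition of Equation (28) using the power functions of Equation (34). The sketch assumes α = 2 and β = 3 and a simple Simpson quadrature; the helper names and sample measures are illustrative.
+
```python
def simpson(func, a, b, n=1000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * func(a + k * h)
    return total * h / 3

def ab_divergence(xi, xi_prime, alpha, beta):
    """Closed form of Eq. (38)."""
    scale = 1.0 / (alpha * beta * (alpha + beta))
    return scale * sum(
        alpha * x ** (alpha + beta)
        + beta * y ** (alpha + beta)
        - (alpha + beta) * x ** alpha * y ** beta
        for x, y in zip(xi, xi_prime)
    )

def rho_tau_divergence(xi, xi_prime, alpha, beta):
    """Eq. (28) by numerical integration, with rho and tau taken from Eq. (34)."""
    total = 0.0
    for x, y in zip(xi, xi_prime):
        # tau(s) * rho'(s) = (s**beta / beta) * s**(alpha - 1)
        first = simpson(lambda s: s ** beta / beta * s ** (alpha - 1), 0.0, x)
        # rho(s) * tau'(s) = (s**alpha / alpha) * s**(beta - 1)
        second = simpson(lambda s: s ** alpha / alpha * s ** (beta - 1), 0.0, y)
        total += first + second - (x ** alpha / alpha) * (y ** beta / beta)
    return total

xi = [0.5, 1.0, 2.0]
xi_prime = [1.5, 0.8, 2.5]
closed = ab_divergence(xi, xi_prime, 2, 3)
numeric = rho_tau_divergence(xi, xi_prime, 2, 3)
```
+
+Agreement of the two values for one parameter choice is only a spot check of the formulas, not a proof of the general statement.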
+The α-divergence is a special case of the (α, β)-divergence, so that it is a (ρ, τ)-divergence. By
+putting:
+
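+\[ \rho(\xi) = \frac{2}{1-\alpha}\,\xi^{\frac{1-\alpha}{2}}, \qquad \tau(\xi) = \frac{2}{1+\alpha}\,\xi^{\frac{1+\alpha}{2}}, \tag{39} \]
+
+we have:
+
+\[ D_{\alpha}[\xi : \xi'] = \frac{4}{1-\alpha^2} \sum_i \left\{ \frac{1-\alpha}{2}\,\xi_i + \frac{1+\alpha}{2}\,\xi'_i - \xi_i^{\frac{1-\alpha}{2}}\,(\xi'_i)^{\frac{1+\alpha}{2}} \right\}. \tag{40} \]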
+
+The β-divergence [19] is obtained from:
+
+\[ \rho(\xi) = \xi, \qquad \tau(\xi) = \frac{1}{\beta}\,\xi^{\beta}. \tag{41} \]
+
+It is written as:
+
+\[ D_{\beta}[\xi : \xi'] = \frac{1}{\beta(\beta+1)} \sum_i \left\{ \xi_i^{\beta+1} + \beta\,(\xi'_i)^{\beta+1} - (\beta+1)\,\xi_i\,(\xi'_i)^{\beta} \right\}. \tag{42} \]
+
+The β-divergence is special in the sense that it gives a dually flat structure, even in Sn. This is because
+ρ(ξ) is linear in ξ.
+
+The classes of α-divergences and β-divergences intersect at the KL-divergence, and their duals are
+different in general. They are the only intersecting points of the two classes.
+When ρ(ξ) = ξ and τ(ξ) = U′(ξ) where U is a convex function, (ρ, τ)-divergence is Eguchi’s
+U-divergence [21].
+Zhang already introduced an (α, β)-divergence in [5], which is not a (ρ, τ)-divergence in Rn+ and is
+different from ours. We regret the confusing naming of the (α, β)-divergence.
+
+4. Invariant, Flat Decomposable Divergences in the Manifold of Positive-Definite Matrices
+
+4.1. Invariant and Decomposable Convex Function
+
+Let P be a positive-definite matrix and ψ(P) be a convex function. Then, a Bregman divergence is
+defined between two positive definite matrices, P and Q, by:
+
+\[ D[P : Q] = \psi(P) - \psi(Q) - \nabla\psi(Q) \cdot (P - Q), \tag{43} \]
+
+where ∇ is the gradient operator with respect to matrix $P = (P_{ij})$, so that ∇ψ(P) is a matrix and the
+inner product of two matrices is defined by:
+
+\[ \nabla\psi(Q) \cdot P = \operatorname{tr}\{\nabla\psi(Q)\,P\}, \tag{44} \]
+
+where tr is the trace of a matrix.
+It induces a dually flat structure to the manifold of positive-definite matrices, where the affine
+coordinate system (θ-coordinates) is Θ = P and the dual affine coordinate system (η-coordinates) is:
+
+\[ H = \nabla\psi(P). \tag{45} \]
+
+A convex function, ψ(P), is said to be invariant under the orthogonal group O(n), when:
+
+\[ \psi(P) = \psi\left(O^T P O\right) \tag{46} \]
+
+holds for any orthogonal transformation O, where $O^T$ is the transpose of O. An invariant function is
+written as a symmetric function of n eigenvalues λ1, · · · , λn of P. See Dhillon and Tropp [12]. When
+an invariant convex function of P is written, by using a convex function, f, of one variable, in the
+additive form:
+
+\[ \psi(P) = \sum_i f(\lambda_i), \tag{47} \]
+
+it is said to be decomposable. We have:
+
+\[ \psi(P) = \operatorname{tr} f(P). \tag{48} \]
+
+4.2. Invariant, Flat and Decomposable Divergence
+
+A divergence D[P : Q] is said to be invariant under O(n), when it satisfies:
+
+\[ D[P : Q] = D\left[O^T P O : O^T Q O\right]. \tag{49} \]
+
+When it is derived from a decomposable convex function, ψ(P), it is invariant, flat and decomposable.
+We give well-known examples of decomposable convex functions and the divergences derived
+from them:
+
+(1) For f(λ) = (1/2)λ², we have:
+
+\[ \psi(P) = \frac{1}{2} \sum_i \lambda_i^2, \tag{50} \]
+
+\[ D[P : Q] = \frac{1}{2}\,\|P - Q\|^2, \tag{51} \]
+
+where ∥P∥ is the Frobenius norm:
+
+\[ \|P\|^2 = \sum_{i,j} P_{ij}^2. \tag{52} \]
+
+(2) For f(λ) = − log λ, we have:
+
+\[ \psi(P) = -\log\left(\det|P|\right), \tag{53} \]
+
+\[ D[P : Q] = \operatorname{tr}\left(PQ^{-1}\right) - \log\left(\det\left|PQ^{-1}\right|\right) - n. \tag{54} \]
+
+The affine coordinate system is P, and the dual coordinate system is $P^{-1}$. The derived geometry is
+the same as that of multivariate Gaussian probability distributions with mean zero and covariance
+matrix P.
+(3) For f(λ) = λ log λ − λ, we have:
+
+\[ \psi(P) = \operatorname{tr}(P \log P - P), \tag{55} \]
+
+\[ D[P : Q] = \operatorname{tr}(P \log P - P \log Q - P + Q). \tag{56} \]
+
+This divergence is used in quantum information theory. The affine coordinate system is P, and
+the dual affine coordinate system is log P; and, ψ(P) is called the negative von Neumann entropy.
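+
+The O(n) invariance of Equation (49) can be illustrated for the divergence of Equation (54). The sketch below restricts itself to 2 × 2 matrices so that plain Python suffices; the helper functions, the sample matrices and the rotation angle are assumptions made for illustration.
+
```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def det(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def inverse(a):
    d = det(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]

def trace(a):
    return a[0][0] + a[1][1]

def logdet_divergence(p, q):
    """Eq. (54) for n = 2: tr(P Q^-1) - log det(P Q^-1) - n."""
    pq_inv = matmul(p, inverse(q))
    return trace(pq_inv) - math.log(det(pq_inv)) - 2

def rotate(a, angle):
    """Return O^T A O for the rotation matrix O of the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    o = [[c, -s], [s, c]]
    return matmul(transpose(o), matmul(a, o))

# Hypothetical positive-definite matrices.
P = [[2.0, 0.5], [0.5, 1.0]]
Q = [[1.0, 0.2], [0.2, 3.0]]

d_plain = logdet_divergence(P, Q)
d_rotated = logdet_divergence(rotate(P, 0.7), rotate(Q, 0.7))  # same value up to rounding
```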
+
+4.3. (ρ, τ)-Structure in Positive Definite Matrices
+
+We extend the (ρ, τ)-structure in the previous section to the matrix case and introduce a general
+dually flat invariant decomposable divergence in the manifold of positive-definite matrices. Let:
+
+\[ \Theta = \rho(P), \qquad H = \tau(P) \tag{57} \]
+
+be ρ- and τ-representations of matrices. We use two functions, $\tilde{\psi}_{\rho,\tau}(\theta)$ and $\tilde{\phi}_{\rho,\tau}(\eta)$, defined
+in Equations (19) and (20), for defining a pair of dually coupled invariant and decomposable
+convex functions,
+
+\[ \psi(\Theta) = \operatorname{tr} \tilde{\psi}_{\rho,\tau}\{\Theta\}, \tag{58} \]
+
+\[ \phi(H) = \operatorname{tr} \tilde{\phi}_{\rho,\tau}\{H\}. \tag{59} \]
+
+They are not convex with respect to P, but are convex with respect to Θ and H, respectively. The
+derived Bregman divergence is:
+
+\[ D[P : Q] = \psi\{\Theta(P)\} + \phi\{H(Q)\} - \Theta(P) \cdot H(Q). \tag{60} \]
+
+Theorem 2. The (ρ, τ)-divergences form a unique class of invariant, decomposable and dually flat
+divergences in the manifold of positive matrices.
+
+The Euclidean, Gaussian and von Neumann divergences given in Equations (51), (54) and (56) are
+special examples of (ρ, τ)-divergences. They are given, respectively, by:
+
+\[ \text{(1)} \quad \rho(\xi) = \tau(\xi) = \xi, \tag{61} \]
+
+\[ \text{(2)} \quad \rho(\xi) = \xi, \qquad \tau(\xi) = -\frac{1}{\xi}, \tag{62} \]
+
+\[ \text{(3)} \quad \rho(\xi) = \xi, \qquad \tau(\xi) = \log \xi. \tag{63} \]
+
+When ρ and τ are power functions, we have the (α, β)-structure in the manifold of positive-definite
+matrices.
+
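+(4) (α, β)-divergence.
+
+By using the (α, β) power functions given by Equation (34), we have:
+
+\[ \psi(\Theta) = \frac{\alpha}{\alpha+\beta} \operatorname{tr} \Theta^{\frac{\alpha+\beta}{\alpha}} = \frac{\alpha}{\alpha+\beta} \operatorname{tr} P^{\alpha+\beta}, \tag{64} \]
+
+\[ \phi(H) = \frac{\beta}{\alpha+\beta} \operatorname{tr} H^{\frac{\alpha+\beta}{\beta}} = \frac{\beta}{\alpha+\beta} \operatorname{tr} P^{\alpha+\beta} \tag{65} \]
+
+so that the (α, β)-divergence of matrices is:
+
+\[ D[P : Q] = \operatorname{tr}\left\{ \frac{\alpha}{\alpha+\beta}\,P^{\alpha+\beta} + \frac{\beta}{\alpha+\beta}\,Q^{\alpha+\beta} - P^{\alpha} Q^{\beta} \right\}. \tag{66} \]
+
+This is a Bregman divergence, where the affine coordinate system is $\Theta = P^{\alpha}$ and its dual is
+$H = P^{\beta}$.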
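+(5) The α-divergence is derived as:
+
+\[ \Theta(P) = \frac{2}{1-\alpha}\,P^{\frac{1-\alpha}{2}}, \tag{67} \]
+
+\[ \psi(\Theta) = \frac{2}{1+\alpha} \operatorname{tr} P, \tag{68} \]
+
+\[ D_{\alpha}[P : Q] = \frac{4}{1-\alpha^2} \operatorname{tr}\left\{ -P^{\frac{1-\alpha}{2}} Q^{\frac{1+\alpha}{2}} + \frac{1-\alpha}{2}\,P + \frac{1+\alpha}{2}\,Q \right\}. \tag{69} \]
+
+The affine coordinate system is $\frac{2}{1-\alpha} P^{\frac{1-\alpha}{2}}$, and its dual is $\frac{2}{1+\alpha} P^{\frac{1+\alpha}{2}}$.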
+(6) The β-divergence is derived from Equation (41) as:
+
+\[ D_{\beta}[P : Q] = \frac{1}{\beta(\beta+1)} \operatorname{tr}\left\{ P^{\beta+1} + \beta\,Q^{\beta+1} - (\beta+1)\,P Q^{\beta} \right\}. \tag{70} \]
+
+4.4. Invariance Under Gl(n)
+
+We extend the concept of invariance under the orthogonal group to that under the general linear
+group, Gl(n), that is the set of invertible matrices, L, det |L| ≠ 0. This is a stronger condition. A
+divergence is said to be invariant under Gl(n), when:
+
+\[ D[P : Q] = D\left[L^T P L : L^T Q L\right] \tag{71} \]
+
+holds for any L ∈ Gl(n).
+We identify matrix P with the zero-mean Gaussian distribution:
+
+\[ p(x, P) = \exp\left\{ -\frac{1}{2}\,x^T P^{-1} x - \frac{1}{2} \log \det|P| - c \right\}, \tag{72} \]
+
+where c is a constant. We know that an invariant divergence belongs to the class of f-divergences in
+the case of a manifold of probability distributions, where the invariance means the geometry does
+not change under a one-to-one mapping of x to y. Moreover, the only invariant flat divergence is the
+KL-divergence [22]. These facts suggest the following conjecture.
+
+Proposition. The invariant, flat and decomposable divergence under Gl(n) is the KL-divergence
+given by:
+
+\[ D_{KL}[P : Q] = \operatorname{tr}\left(PQ^{-1}\right) - \log\left(\det\left|PQ^{-1}\right|\right) - n. \tag{73} \]
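+
+The difference between Gl(n) and O(n) invariance can be illustrated numerically. The sketch below, restricted to 2 × 2 matrices, applies one assumed non-orthogonal L to Equation (73) and to the Euclidean divergence of Equation (51); the helper names and sample matrices are hypothetical, and a single numerical check is of course not a proof.
+
```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def det(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def inverse(a):
    d = det(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]

def kl_divergence(p, q):
    """Eq. (73) for n = 2."""
    pq_inv = matmul(p, inverse(q))
    return pq_inv[0][0] + pq_inv[1][1] - math.log(det(pq_inv)) - 2

def frobenius_divergence(p, q):
    """Eq. (51): half the squared Frobenius norm of P - Q."""
    return 0.5 * sum((p[i][j] - q[i][j]) ** 2 for i in range(2) for j in range(2))

def congruence(a, l_mat):
    """Return L^T A L, the transformation used in Eq. (71)."""
    return matmul(transpose(l_mat), matmul(a, l_mat))

P = [[2.0, 0.5], [0.5, 1.0]]
Q = [[1.0, 0.2], [0.2, 3.0]]
L_MAT = [[2.0, 1.0], [0.5, 3.0]]  # invertible but not orthogonal

kl_before = kl_divergence(P, Q)
kl_after = kl_divergence(congruence(P, L_MAT), congruence(Q, L_MAT))
frob_before = frobenius_divergence(P, Q)
frob_after = frobenius_divergence(congruence(P, L_MAT), congruence(Q, L_MAT))
```
+
+In this example the KL value is unchanged by L, whereas the Euclidean divergence, which is only O(n)-invariant, changes.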
+
+5. Non-Decomposable Divergence
+
+We have focused on flat and decomposable divergences.
+There are many interesting
+non-decomposable divergences. We first discuss a general class of flat divergences in Rn+ and then
+touch upon interesting flat and non-flat divergences in the manifold of positive-definite matrices.
+
+5.1. General Class of Flat Divergences in Rn+
+
+We can describe a general class of flat divergences in Rn+, which are not necessarily decomposable.
+This is introduced in [23], which studies the conformal structure of general total Bregman divergences
+([11,13]). When Rn+ is endowed with a dually flat structure, it has a θ-coordinate system given by:
+
+\[ \theta = \rho(\xi) \tag{74} \]
+
+which is not necessarily a componentwise function. Any pair of invertible θ = ρ(ξ) and convex
+function ψ(θ) defines a dually flat structure and, hence, a Bregman divergence in Rn+.
+The dual coordinates η = τ(ξ) are given by:
+
+\[ \eta = \nabla\psi(\theta) \tag{75} \]
+
+so that we have:
+
+\[ \eta = \tau(\xi) = \nabla\psi\{\rho(\xi)\}. \tag{76} \]
+
+This implies that a pair (ρ, τ) of coordinate systems can define dually coupled affine coordinates and,
+hence, a dually flat structure, when and only when $\eta = \tau\{\rho^{-1}(\theta)\}$ is a gradient of a convex function.
+This is different from the case of decomposable divergence, where any monotone pair of ρ(ξ) and
+τ(ξ) gives a dually flat structure.
+
+5.2. Non-Decomposable Flat Divergence in PDn
+
+Ohara and Eguchi [15,16] introduced the following function:
+
+\[ \psi_V(P) = V(\det|P|), \tag{77} \]
+
+where V(ξ) is a monotonically decreasing scalar function. ψV is convex when and only when:
+
+\[ 1 + \frac{V''(\xi)\,\xi}{V'(\xi)} < \frac{1}{n}. \tag{78} \]
+
+In such a case, we can introduce dually flat structure to PDn, where P is an affine coordinate system
+with convex ψV(P), and the dual affine coordinate system is:
+
+\[ H = V'(\det|P|)\,\det|P|\,P^{-1}. \tag{79} \]
+
+The derived divergence is:
+
+\begin{align}
+D_V[P : Q] &= V(\det|P|) - V(\det|Q|) \tag{80} \\
+&\quad + V'(\det|Q|)\,\det|Q|\,\operatorname{tr}\left\{Q^{-1}(Q - P)\right\}. \tag{81}
+\end{align}
+
+When V(ξ) = − log ξ, it reduces to the case of Equation (54), which is invariant under Gl(n) and
+decomposable. However, the divergence DV[P : Q] is not decomposable. It is invariant under O(n)
+and more strongly so under SGl(n) ⊂ Gl(n), defined by det |L| = ±1.
+
+5.3. Flat Structure Derived from q-Escort Distribution
+
+A dually flat structure is introduced in the manifold of probability distributions [4] as:
+
+\[ \tilde{D}_{q}[p : q] = \frac{1}{1-q}\,\frac{1}{H_q(p)} \left\{ 1 - \sum_i p_i^{1-q}\,q_i^{q} \right\}, \tag{82} \]
+
+where:
+
+\[ H_q(p) = \sum_i p_i^{q}, \tag{83} \]
+
+\[ q = \frac{1+\alpha}{2}. \tag{84} \]
+
+The dual affine coordinates are the q-escort distribution [4]:
+
+\[ \eta_i = \frac{1}{H_q(p)}\,p_i^{q}. \tag{85} \]
+
+The divergence, $\tilde{D}_q$, is flat, but not decomposable.
+We can generalize it to the case of PDn,
+
+\[ \tilde{D}_q[P : Q] = \frac{1}{1-q}\,\frac{1}{\operatorname{tr} P^{q}} \left\{ (1-q)\operatorname{tr}(P) + q\operatorname{tr}(Q) - \operatorname{tr}\left(P^{1-q} Q^{q}\right) \right\}. \tag{86} \]
+
+This is flat, but not decomposable.
+
+5.4. γ-Divergence in PDn
+
+The γ-divergence is introduced by Fujisawa and Eguchi [24]. It gives a super-robust estimator. It
+is interesting to generalize it to PDn,
+
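+\[ D_{\gamma}[P : Q] = \frac{1}{\gamma(\gamma-1)} \left\{ \log \operatorname{tr} P^{\gamma} + (\gamma-1) \log \operatorname{tr} Q^{\gamma} - \gamma \log \operatorname{tr} P Q^{\gamma-1} \right\}. \tag{87} \]
+
+This is neither flat nor decomposable. This is a projective divergence in the sense that, for any c, c′ > 0,
+
+\[ D_{\gamma}[cP : c'Q] = D_{\gamma}[P : Q]. \tag{88} \]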
+
+Therefore, it can be defined in the submanifold of tr P = 1.
+
+6. Concluding Remarks
+
+We have shown that the (ρ, τ)-divergence introduced by Zhang [5] is a general dually flat
+decomposable structure of the manifold of positive measures. We then extended it to the manifold
+of positive-definite matrices, where the criterion of invariance under linear transformations (in
+particular, under orthogonal transformations) was added. The decomposability is useful from the
+computational point of view, because the θ-η transformation is tractable. This is the motivation for
+studying decomposable flat divergences.
+When we treat the manifold of probability distributions, it is a submanifold of the manifold of
+positive measures, where the total sum of measures is restricted to one. This is a nonlinear constraint
+in the θ or η coordinates, so that the manifold is not flat, but curved in general. Hence, our arguments
+hold in this case only when at least one of the ρ and τ functions is linear. The U-divergence [21] and
+β-divergence [19] are such cases. However, for clustering, we can take the average of the η-coordinates
+of member probability distributions in the larger manifold of positive measures and then project it
+to the manifold of probability distributions. This is called the exterior average, and the projection is
+simply a normalization of the result. Therefore, the (ρ, τ)-structure is useful in the case of probability
+distributions. The same situation holds in the case of positive-definite matrices.
+Quantum information theory deals with positive-definite Hermitian matrices of trace one [25,26].
+We need to extend our discussions to the case of complex matrices. The trace one constraint is not
+linear with respect to θ- or η-coordinates, as in the case of probability distributions. Many
+interesting divergence functions have been introduced in the manifold of positive-definite Hermitian
+matrices. It is an interesting future problem to apply our theory to quantum information theory.
+
+Conflicts of Interest: The author declares no conflicts of interest.
+
+References
+
+1. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society and Oxford University Press: Rhode Island, RI, USA, 2000.
+2. Tsallis, C. Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World; Springer: Berlin/Heidelberg, Germany, 2009.
+3. Naudts, J. Generalized Thermostatistics; Springer: Berlin/Heidelberg, Germany, 2011.
+4. Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually-flat and conformal geometries. Physica A 2012, 391, 4308–4319.
+5. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195.
+6. Zhang, J. Nonparametric information geometry: From divergence function to referential-representational biduality on statistical manifolds. Entropy 2013, 15, 5384–5418.
+7. Cichocki, A.; Amari, S. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568.
+8. Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170.
+9. Minami, M.; Eguchi, S. Robust blind source separation by beta-divergence. Neural Comput. 2002, 14, 1859–1886.
+10. Banerjee, A.; Merugu, S.; Dhillon, I.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
+11. Liu, M.; Vemuri, B.C.; Amari, S.; Nielsen, F. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Learn. 2012, 24, 3192–3212.
+12. Dhillon, I.S.; Tropp, J.A. Matrix nearness problems with Bregman divergences. SIAM J. Matrix Anal. Appl. 2007, 29, 1120–1146.
+13. Vemuri, B.C.; Liu, M.; Amari, S.; Nielsen, F. Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med. Imaging 2011, 30, 475–483.
+14. Ohara, A.; Suda, N.; Amari, S. Dualistic differential geometry of positive definite matrices and its applications to related problems. Linear Algebra Appl. 1996, 247, 31–53.
+15. Ohara, A.; Eguchi, S. Group invariance of information geometry on q-Gaussian distributions induced by beta-divergence. Entropy 2013, 15, 4732–4747.
+16. Ohara, A.; Eguchi, S. Geometry on positive definite matrices induced from V-potential functions. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 621–629.
+17. Chebbi, Z.; Moakher, M. Means of Hermitian positive-definite matrices based on the log-determinant alpha-divergence function. Linear Algebra Appl. 2012, 436, 1872–1889.
+18. Tsuda, K.; Ratsch, G.; Warmuth, M.K. Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res. 2005, 6, 995–1018.
+19. Nock, R.; Magdalou, B.; Briys, E.; Nielsen, F. Mining matrix data with Bregman matrix divergences for portfolio selection. In Matrix Information Geometry; Nielsen, F., Bhatia, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; Chapter 15, pp. 373–402.
+20. Nielsen, F., Bhatia, R., Eds. Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013.
+21. Eguchi, S. Information geometry and statistical pattern recognition. Sugaku Expo. 2006, 19, 197–216.
+22. Amari, S. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931.
+23. Nock, R.; Nielsen, F.; Amari, S. On conformal divergences and their population minimizers. IEEE Trans. Inf. Theory 2014, submitted for publication.
+24. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081.
+25. Petz, D. Monotone metrics on matrix spaces. Linear Algebra Appl. 1996, 244, 81–96.
+26. Hasegawa, H. α-divergence of the non-commutative information geometry. Rep. Math. Phys. 1993, 33, 87–93.
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+entropy
+
+Article
+F-Geometry and Amari’s α−Geometry on a
+Statistical Manifold
+
+Harsha K. V. * and Subrahamanian Moosath K S *
+
+Indian Institute of Space Science and Technology, Department of Space, Government of India, Valiamala P.O,
+Thiruvananthapuram-695547, Kerala, India
+* E-Mails: harsha.11@iist.ac.in (K.V.H.); smoosath@iist.ac.in (K.S.S.M.); Tel.: +91-95-6736-0425 (K.V.H.);
++91-94-9574-3148 (K.S.S.M.).
+
+Received: 13 December 2013; in revised form: 21 April 2014 / Accepted: 25 April 2014 /
+Published: 6 May 2014
+
+Abstract: In this paper, we introduce a geometry called F-geometry on a statistical manifold S using
+an embedding F of S into the space RX of random variables. Amari’s α−geometry is a special
+case of F−geometry. Then using the embedding F and a positive smooth function G, we introduce
+(F, G)−metric and (F, G)−connections that enable one to consider weighted Fisher information
+metric and weighted connections. The necessary and sufficient condition for two (F, G)−connections
+to be dual with respect to the (F, G)−metric is obtained. Then we show that Amari’s 0−connection is
+the only self dual F−connection with respect to the Fisher information metric. Invariance properties
+of the geometric structures are discussed, which proved that Amari’s α−connections are the only
+F−connections that are invariant under smooth one-to-one transformations of the random variables.
+
+Keywords: embedding; Amari’s α−connections; F−metric; F−connections; (F, G)−metric;
+(F, G)−connections; invariance
+
+1. Introduction
+
+Geometric study of statistical estimation has opened up an interesting new area called
+information geometry. Information geometry achieved remarkable progress through the works of
+Amari [1,2], and his colleagues [3,4]. In the last few years, many authors have contributed considerably
+to this area [5–9]. Information geometry has a wide variety of applications in other areas of engineering
+and science, such as neural networks, machine learning, biology, mathematical finance, control system
+theory, quantum systems, statistical mechanics, etc.
+A statistical manifold of probability distributions is equipped with a Riemannian metric and a pair
+of dual affine connections [2,4,9]. It was Rao [10] who introduced the idea of using Fisher information
+as a Riemannian metric in the manifold of probability distributions. Chentsov [11] introduced a family
+of affine connections on a statistical manifold defined on finite sets. Amari [2] introduced a family of
+affine connections called α−connections using a one parameter family of functions, the α−embeddings.
+These α−connections are equivalent to those defined by Chentsov. The Fisher information metric and
+these affine connections are characterized by invariance with respect to the sufficient statistic [4,12] and
+play a vital role in the theory of statistical estimation. Zhang [13] generalized Amari’s α−representation
+and using this general representation together with a convex function he defined a family of divergence
+functions from the point of view of representational and referential duality. The Riemannian metric
+and dual connections are defined using these divergence functions.
+In this paper, Amari’s idea of using α−embeddings to define geometric structures is extended to
+a general embedding. This paper is organized as follows. In Section 2, we define an affine connection
+called F−connection and a Riemannian metric called F−metric using a general embedding F of
+a statistical manifold S into the space of random variables. We show that F−metric is the Fisher
+
+Entropy 2014, 16, 2472–2487; doi:10.3390/e16052472
+www.mdpi.com/journal/entropy
+
+information metric and Amari’s α−geometry is a special case of F−geometry. Further, we introduce
+(F, G)−metric and (F, G)−connections using the embedding F and a positive smooth function G.
+In Section 3, a necessary and sufficient condition for two (F, G)−connections to be dual with
+respect to the (F, G)−metric is derived and we prove that Amari’s 0−connection is the only self
+dual F−connection with respect to the Fisher information metric. Then we prove that the set of all
+positive finite measures on X, for a finite X, has an F−affine manifold structure for any embedding F.
+In Section 4, invariance properties of the geometric structures are discussed. We prove that the
+Fisher information metric and Amari’s α−connections are invariant under both the transformation
+of the parameter and the transformation of the random variable. Further we show that Amari’s
+α−connections are the only F−connections that are invariant under both the transformation of the
+parameter and the transformation of the random variable.
+Let (X, B) be a measurable space, where X is a non-empty subset of R and B is the σ-field of
+subsets of X. Let RX be the space of all real valued measurable functions defined on (X, B). Consider
+an n−dimensional statistical manifold S = {p(x; ξ) / ξ = [ξ1, ..., ξn] ∈ E ⊆ Rn}, with coordinates
+ξ = [ξ1, ..., ξn], defined on X. S is a subset of P(X), the set of all probability measures on X given by
+
+\[ P(X) := \Big\{ p : X \to \mathbb{R} \;\Big|\; p(x) > 0 \ (\forall\, x \in X); \ \int_X p(x)\,dx = 1 \Big\}. \tag{1} \]
+
+The tangent space to S at a point pξ is given by
+
+\[ T_\xi(S) = \Big\{ \sum_{i=1}^{n} \alpha^i \partial_i \;\Big|\; \alpha^i \in \mathbb{R} \Big\}, \qquad \text{where } \partial_i = \frac{\partial}{\partial \xi^i}. \tag{2} \]
+
+Define $\ell(x;\xi) = \log p(x;\xi)$ and consider the partial derivatives $\{ \partial \ell / \partial \xi^i = \partial_i \ell \,;\ i = 1, \dots, n \}$, which are
+called scores. For the statistical manifold S, the $\partial_i \ell$ are linearly independent functions in x for a fixed ξ.
+Let $T^{1}_{\xi}(S)$ be the n-dimensional vector space spanned by the n functions $\{ \partial_i \ell \,;\ i = 1, \dots, n \}$ in x. So
+
+\[ T^{1}_{\xi}(S) = \Big\{ \sum_{i=1}^{n} A^i \partial_i \ell \;\Big|\; A^i \in \mathbb{R} \Big\}. \tag{3} \]
+
+Then there is a natural isomorphism between the two vector spaces $T_\xi(S)$ and $T^{1}_{\xi}(S)$ given by
+
+\[ \partial_i \in T_\xi(S) \longleftrightarrow \partial_i \ell(x;\xi) \in T^{1}_{\xi}(S). \tag{4} \]
+
+Obviously, a tangent vector $A = \sum_{i=1}^{n} A^i \partial_i \in T_\xi(S)$ corresponds to a random variable
+$A(x) = \sum_{i=1}^{n} A^i \partial_i \ell(x;\xi) \in T^{1}_{\xi}(S)$ having the same components $A^i$. Note that $T_\xi(S)$ is the differentiation
+operator representation of the tangent space, while $T^{1}_{\xi}(S)$ is the random variable representation of the
+same tangent space. The space $T^{1}_{\xi}(S)$ is called the 1-representation of the tangent space.
+Let A and B be two tangent vectors in Tξ(S) and A(x) and B(x) be the 1−representations of A and B
+respectively. We can define an inner product on each tangent space Tξ(S) by
+
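+\[ g_\xi(A, B) = \langle A, B \rangle_\xi = E_\xi[A(x)B(x)] = \int A(x)\,B(x)\,p(x;\xi)\,dx. \tag{5} \]
+
+In particular, the inner product of the basis vectors $\partial_i$ and $\partial_j$ is
+
+\[ g_{ij}(\xi) = \langle \partial_i, \partial_j \rangle_\xi = E_\xi[\partial_i \ell\, \partial_j \ell] = \int \partial_i \ell(x;\xi)\, \partial_j \ell(x;\xi)\, p(x;\xi)\,dx. \tag{6} \]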
+
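+Note that $g = \langle\,,\rangle$ defines a Riemannian metric on S called the Fisher information metric.
+On the Riemannian manifold $(S, g = \langle\,,\rangle)$, define $n^3$ functions $\Gamma_{ijk}$ by
+
+\[ \Gamma_{ijk}(\xi) = E_\xi[(\partial_i \partial_j \ell(x;\xi))(\partial_k \ell(x;\xi))]. \tag{7} \]
+
+These functions $\Gamma_{ijk}$ uniquely determine an affine connection ∇ on S by
+
+\[ \Gamma_{ijk}(\xi) = \langle \nabla_{\partial_i} \partial_j, \partial_k \rangle_\xi. \tag{8} \]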
+
+∇ is called the 1−connection or the exponential connection.
+Amari [2] defined a one parameter family of functions called the α−embeddings given by
+
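+\[ L_\alpha(p) = \begin{cases} \dfrac{2}{1-\alpha}\, p^{\frac{1-\alpha}{2}}, & \alpha \neq 1, \\[4pt] \log p, & \alpha = 1. \end{cases} \tag{9} \]
+
+Using these, we can define $n^3$ functions $\Gamma^{\alpha}_{ijk}$ by
+
+\[ \Gamma^{\alpha}_{ijk} = \int \partial_i \partial_j L_\alpha(p(x;\xi))\; \partial_k L_{-\alpha}(p(x;\xi))\,dx. \tag{10} \]
+
+These $\Gamma^{\alpha}_{ijk}$ uniquely determine affine connections $\nabla^{\alpha}$ on the statistical manifold S by
+
+\[ \Gamma^{\alpha}_{ijk} = \langle \nabla^{\alpha}_{\partial_i} \partial_j, \partial_k \rangle, \tag{11} \]
+
+which are called α−connections.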
+
+2. F−Geometry of a Statistical Manifold
+
+On a statistical manifold S, the Fisher information metric and exponential connection are defined
+using the log embedding. In a similar way, α−connections are defined using a one parameter family of
+functions, the α−embeddings. In general, we can give other geometric structures on S using different
+embeddings of the manifold S into the space of random variables RX.
+Let $F : (0, \infty) \to \mathbb{R}$ be an injective function that is at least twice differentiable, and assume that
+$F'(u) \neq 0$ for all $u \in (0, \infty)$. F is an embedding of S into $\mathbb{R}^X$ that takes each $p(x;\xi) \mapsto F(p(x;\xi))$.
+Denote $F(p(x;\xi))$ by $F(x;\xi)$; then $\partial_i F$ can be written as
+
+\[ \partial_i F(x;\xi) = p(x;\xi)\, F'(p(x;\xi))\, \partial_i \ell(x;\xi). \tag{12} \]
+
+It is clear that $\partial_i F(x;\xi)$, $i = 1, \dots, n$, are linearly independent functions in x for fixed ξ, since
+the scores $\partial_i \ell(x;\xi)$, $i = 1, \dots, n$, are linearly independent. Let $T_{F(p_\xi)}F(S)$ be the n-dimensional vector space
+spanned by the n functions $\partial_i F$, $i = 1, \dots, n$, in x for fixed ξ. So
+
+\[ T_{F(p_\xi)}F(S) = \Big\{ \sum_{i=1}^{n} A^i \partial_i F \;\Big|\; A^i \in \mathbb{R} \Big\}. \tag{13} \]
+
+Let the tangent space $T_{F(p_\xi)}(F(S))$ to F(S) at the point $F(p_\xi)$ be denoted by $T^{F}_{\xi}(S)$. There is a natural
+isomorphism between the two vector spaces $T_\xi(S)$ and $T^{F}_{\xi}(S)$ given by
+
+\[ \partial_i \in T_\xi(S) \longleftrightarrow \partial_i F(x;\xi) \in T^{F}_{\xi}(S). \tag{14} \]
+
+$T^{F}_{\xi}(S)$ is called the F−representation of the tangent space $T_\xi(S)$.
+
+For any $A = \sum_{i=1}^{n} A^i \partial_i \in T_\xi(S)$, the corresponding $A(x) = \sum_{i=1}^{n} A^i \partial_i F \in T^{F}_{\xi}(S)$ is called the
+F−representation of the tangent vector A and is denoted by $A^F(x)$. Note that $T^{F}_{\xi}(S) \subseteq T_{F(p_\xi)}(\mathbb{R}^X)$.
+Since $\mathbb{R}^X$ is a vector space, its tangent space $T_{F(p_\xi)}(\mathbb{R}^X)$ can be identified with $\mathbb{R}^X$. So $T^{F}_{\xi}(S) \subseteq \mathbb{R}^X$.
+
+Definition 1. The F−expectation of a random variable f with respect to the distribution p(x; ξ) is
+defined as
+
+\[ E^{F}_{\xi}(f) = \int f(x)\, \frac{1}{p\,(F'(p))^2}\,dx. \tag{15} \]
+
+We can use this F−expectation to define an inner product in $\mathbb{R}^X$ by
+
+\[ \langle f, g \rangle^{F}_{\xi} = E^{F}_{\xi}[f(x)\,g(x)], \tag{16} \]
+
+which induces an inner product on $T_\xi(S)$ by
+
+\[ \langle A, B \rangle^{F}_{\xi} = E^{F}_{\xi}[A^F(x)\,B^F(x)], \qquad A, B \in T_\xi(S). \tag{17} \]
+
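+Proposition 1. The induced metric $\langle\,,\rangle^{F}$ on S is the Fisher information metric $g = \langle\,,\rangle$ on S.
+
+Proof. For any basis vectors $\partial_i, \partial_j \in T_\xi(S)$,
+
+\begin{align}
+\langle \partial_i, \partial_j \rangle^{F}_{\xi} &= E^{F}_{\xi}[\partial_i F\, \partial_j F] \notag \\
+&= \int \partial_i F\, \partial_j F\, \frac{1}{p\,(F'(p))^2}\,dx \notag \\
+&= \int \big(p\,F'(p)\,\partial_i \ell\big)\big(p\,F'(p)\,\partial_j \ell\big)\, \frac{1}{p\,(F'(p))^2}\,dx \tag{18} \\
+&= \int \partial_i \ell\, \partial_j \ell\; p(x;\xi)\,dx \notag \\
+&= E_\xi[\partial_i \ell\, \partial_j \ell] = g_{ij}(\xi) = \langle \partial_i, \partial_j \rangle_\xi. \notag
+\end{align}
+
+So the metric $\langle\,,\rangle^{F}$ on S induced by the embedding F of S into $\mathbb{R}^X$ is the Fisher information metric
+$g = \langle\,,\rangle$ on S.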
+
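+As a small worked check of Proposition 1 (a sketch, not part of the original argument), assume a two-point sample space X = {0, 1} with integrals replaced by sums, and the Bernoulli model p(1; ξ) = ξ, p(0; ξ) = 1 − ξ with 0 < ξ < 1. Both the model and the discrete setting are illustrative assumptions. For any admissible embedding F, the factor F′(p) cancels and the familiar Fisher information of the Bernoulli model is recovered:
+
+```latex
+\begin{align*}
+\partial_\xi F(p(x;\xi)) &= F'(p(x;\xi))\,\partial_\xi p(x;\xi), \\
+\langle \partial_\xi, \partial_\xi \rangle^{F}_{\xi}
+  &= \sum_{x \in \{0,1\}} \frac{\big(F'(p)\,\partial_\xi p\big)^2}{p\,(F'(p))^2}
+   = \sum_{x \in \{0,1\}} \frac{(\partial_\xi p(x;\xi))^2}{p(x;\xi)} \\
+  &= \frac{1}{\xi} + \frac{1}{1-\xi} = \frac{1}{\xi(1-\xi)}.
+\end{align*}
+```
+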
+We can induce a connection on S using the embedding F.
+Let $\pi^{F}|_{p_\xi} : \mathbb{R}^X \to T^{F}_{\xi}(S)$ be the projection map.
+
+Definition 2. The connection induced by the embedding F on S, the F−connection, is defined as
+
+\[ \nabla^{F}_{\partial_i} \partial_j = \pi^{F}|_{p_\xi}(\partial_i \partial_j F) = \sum_{n} \sum_{m} g^{mn}\, \langle \partial_i \partial_j F, \partial_m F \rangle^{F}_{\xi}\; \partial_n, \tag{19} \]
+
+where $[g^{mn}(\xi)]$ is the inverse of the Fisher information matrix $G(\xi) = [g_{mn}(\xi)]$. Note that the F−connections
+are symmetric.
+
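+Lemma 1. The F−connection and its components can be written in terms of scores as
+
+\[ \nabla^{F}_{\partial_i} \partial_j = \sum_{n} \sum_{m} g^{mn}\, E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{p\,F''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big) (\partial_m \ell) \Big]\, \partial_n \tag{20} \]
+
+and
+
+\[ \Gamma^{F}_{ijk}(\xi) = E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{p\,F''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big) (\partial_k \ell) \Big]. \tag{21} \]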
+
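+Proof. From Equation (12), we have
+
+\[ \partial_i \partial_j F = p\,F'(p)\, \partial_i \partial_j \ell + \big[ p\,F'(p) + p^2 F''(p) \big]\, \partial_i \ell\, \partial_j \ell. \tag{22} \]
+
+Therefore
+
+\begin{align}
+\langle \partial_i \partial_j F, \partial_m F \rangle^{F}_{\xi} &= \int \partial_i \partial_j F\; \partial_m F\, \frac{1}{p\,(F'(p))^2}\,dx \notag \\
+&= \int \Big\{ p\,F'(p)\, \partial_i \partial_j \ell + \big[ p\,F'(p) + p^2 F''(p) \big]\, \partial_i \ell\, \partial_j \ell \Big\}\, \frac{\partial_m \ell}{F'(p)}\,dx \tag{23} \\
+&= \int \Big\{ \partial_i \partial_j \ell\; \partial_m \ell + \Big(1 + \frac{p\,F''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell\, \partial_m \ell \Big\}\, p\,dx \notag \\
+&= E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{p\,F''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big) (\partial_m \ell) \Big]. \notag
+\end{align}
+
+Hence we can write
+
+\[ \nabla^{F}_{\partial_i} \partial_j = \pi^{F}|_{p_\xi}(\partial_i \partial_j F) = \sum_{n} \sum_{m} g^{mn}\, E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{p\,F''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big) (\partial_m \ell) \Big]\, \partial_n. \tag{24} \]
+
+Then we have the Christoffel symbols of the F−connection
+
+\[ \Gamma^{n}_{ij} = \sum_{m} g^{mn}\, E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{p\,F''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big) (\partial_m \ell) \Big] \tag{25} \]
+
+and components of the F−connection are given by
+
+\[ \Gamma^{F}_{ijk}(\xi) = \langle \nabla^{F}_{\partial_i} \partial_j, \partial_k \rangle_\xi = E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{p\,F''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big) (\partial_k \ell) \Big]. \tag{26} \]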
+
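+Theorem 1. Amari’s α−geometry is a special case of the F−geometry.
+
+Proof. Let $F(p) = L_\alpha(p)$, where $L_\alpha(p)$ is the α−embedding of Amari.
+The components $\Gamma^{\alpha}_{ijk}$ of the α−connection are given by
+
+\[ \Gamma^{\alpha}_{ijk}(\xi) = \langle \nabla^{\alpha}_{\partial_i} \partial_j, \partial_k \rangle_\xi = E_\xi\Big[ \Big( \partial_i \partial_j \ell + \frac{1-\alpha}{2}\, \partial_i \ell\, \partial_j \ell \Big) (\partial_k \ell) \Big]. \tag{27} \]
+
+From Equation (26), when $F(p) = L_\alpha(p)$ we have
+
+\begin{align}
+F'(p) = L_\alpha'(p) &= p^{-\frac{1+\alpha}{2}}, \tag{28} \\
+F''(p) = L_\alpha''(p) &= -\frac{1+\alpha}{2}\, p^{-\frac{3+\alpha}{2}}. \tag{29}
+\end{align}
+
+Then we get
+
+\[ 1 + \frac{p\,F''(p)}{F'(p)} = 1 + \frac{p\,L_\alpha''(p)}{L_\alpha'(p)} = \frac{1-\alpha}{2}. \tag{30} \]
+
+Hence
+
+\begin{align}
+\Gamma^{F}_{ijk}(\xi) = \langle \nabla^{F}_{\partial_i} \partial_j, \partial_k \rangle_\xi &= E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{p\,F''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big) (\partial_k \ell) \Big] \notag \\
+&= E_\xi\Big[ \Big( \partial_i \partial_j \ell + \frac{1-\alpha}{2}\, \partial_i \ell\, \partial_j \ell \Big) (\partial_k \ell) \Big] \tag{31} \\
+&= \Gamma^{\alpha}_{ijk}(\xi), \notag
+\end{align}
+
+which are the components of the α−connection. Hence the F−connection reduces to the α−connection.
+Thus we obtain that α−geometry is a special case of F−geometry.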
+
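+To make Theorem 1 concrete, the following sketch evaluates the coefficient $1 + pF''(p)/F'(p)$ from Equation (30) for three familiar values of α. The names "exponential" and "mixture" for the α = 1 and α = −1 cases follow standard usage in information geometry and are not introduced in this paper.
+
+```latex
+\begin{align*}
+\alpha = 1:&\quad F(p) = \log p, && 1 + \frac{pF''(p)}{F'(p)} = 0,
+  && \Gamma^{F}_{ijk} = E_\xi[\partial_i\partial_j\ell\;\partial_k\ell]
+  \quad \text{(exponential connection, Equation (7))}, \\
+\alpha = 0:&\quad F(p) = 2\sqrt{p}, && 1 + \frac{pF''(p)}{F'(p)} = \tfrac{1}{2},
+  && \Gamma^{F}_{ijk} = E_\xi\big[(\partial_i\partial_j\ell + \tfrac{1}{2}\,\partial_i\ell\,\partial_j\ell)\,\partial_k\ell\big], \\
+\alpha = -1:&\quad F(p) = p, && 1 + \frac{pF''(p)}{F'(p)} = 1,
+  && \Gamma^{F}_{ijk} = E_\xi\big[(\partial_i\partial_j\ell + \partial_i\ell\,\partial_j\ell)\,\partial_k\ell\big]
+  \quad \text{(mixture connection)}.
+\end{align*}
+```
+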
+Remark 1. Burbea [14] introduced the concept of weighted Fisher information metric using a positive
+continuous function. We use this idea to define weighted F−metric and weighted F−connections. Let
+G : (0, ∞) −→ R be a positive smooth function and F be an embedding, define (F, G)−expectation of a
+random variable with respect to the distribution pξ as
+
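+\[ E^{F,G}_{\xi}(f) = \int f(x)\, \frac{G(p)}{p\,(F'(p))^2}\,dx. \tag{32} \]
+
+Define the (F, G)−metric $\langle\,,\rangle^{F,G}_{\xi}$ in $T_{p_\xi}(S)$ by
+
+\begin{align}
+\langle \partial_i, \partial_j \rangle^{F,G}_{\xi} &= E^{F,G}_{\xi}[\partial_i F\, \partial_j F] \notag \\
+&= \int \partial_i F\, \partial_j F\, \frac{G(p)}{p\,(F'(p))^2}\,dx \tag{33} \\
+&= \int \partial_i \ell\, \partial_j \ell\; G(p)\, p\,dx \notag \\
+&= E_\xi[G(p)\, \partial_i \ell\, \partial_j \ell]. \notag
+\end{align}
+
+Define the (F, G)−connection as
+
+\[ \Gamma^{F,G}_{ijk} = \langle \nabla^{F,G}_{\partial_i} \partial_j, \partial_k \rangle_\xi = E_\xi\Big[ \Big\{ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{p\,F''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big) (\partial_k \ell) \Big\}\, G(p) \Big]. \tag{34} \]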
+
+When G(p) = 1, (F, G)−connection reduces to the F−connection and the metric <, >F,G reduces to the Fisher
+information metric. This is a more general way of defining Riemannian metrics and affine connections on a
+statistical manifold.
+
+3. Dual Affine Connections
+
+Definition 3. Let M be a Riemannian manifold with a Riemannian metric g. Two affine connections, ∇ and
+∇∗ on the tangent bundle are said to be dual connections with respect to the metric g if
+
+\[ Z\,g(X, Y) = g(\nabla_Z X, Y) + g(X, \nabla^{*}_Z Y) \tag{35} \]
+
+holds for any vector fields X, Y, Z on M.
+
+Theorem 2. Let F, H be two embeddings of the statistical manifold S into the space $\mathbb{R}^X$ of random variables. Let G
+be a positive smooth function on (0, ∞). Then the (F, G)−connection $\nabla^{F,G}$ and the (H, G)−connection $\nabla^{H,G}$
+are dual connections with respect to the (F, G)−metric iff the functions F and H satisfy
+
+\[ H'(p) = \frac{G(p)}{p\,F'(p)}. \tag{36} \]
+
+We call such an embedding H a G−dual embedding of F.
+The components of the dual connection $\nabla^{H,G}$ can be written as
+
+\[ \Gamma^{H,G}_{ijk} = \int \Big\{ \partial_i \partial_j \ell + \Big( \frac{p\,G'(p)}{G(p)} - \frac{p\,F''(p)}{F'(p)} \Big) \partial_i \ell\, \partial_j \ell \Big\}\, \partial_k \ell\; G(p)\, p\,dx. \tag{37} \]
+
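+Proof. $\nabla^{F,G}$ and $\nabla^{H,G}$ are dual connections with respect to the (F, G)−metric means that
+
+\[ \partial_k \langle \partial_i, \partial_j \rangle^{F,G} = \langle \nabla^{F,G}_{\partial_k} \partial_i, \partial_j \rangle^{F,G} + \langle \partial_i, \nabla^{H,G}_{\partial_k} \partial_j \rangle^{F,G} \tag{38} \]
+
+for any basis vectors $\partial_i, \partial_j, \partial_k \in T_\xi(S)$. Differentiating the (F, G)−metric gives
+
+\begin{align}
+\partial_k \langle \partial_i, \partial_j \rangle^{F,G} &= \int \partial_k \partial_j \ell\; \partial_i \ell\; p\,G(p)\,dx + \int \partial_k \partial_i \ell\; \partial_j \ell\; p\,G(p)\,dx \notag \\
+&\quad + \int \Big(1 + \frac{p\,G'(p)}{G(p)}\Big) \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\,G(p)\,dx, \tag{39}
+\end{align}
+
+while the right-hand side of (38) is
+
+\begin{align}
+\langle \nabla^{F,G}_{\partial_k} \partial_i, \partial_j \rangle^{F,G} + \langle \partial_i, \nabla^{H,G}_{\partial_k} \partial_j \rangle^{F,G} &= \int \partial_k \partial_i \ell\; \partial_j \ell\; p\,G(p)\,dx \notag \\
+&\quad + \int \Big(1 + \frac{p\,F''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\,G(p)\,dx \notag \\
+&\quad + \int \Big(1 + \frac{p\,H''(p)}{H'(p)}\Big) \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\,G(p)\,dx \notag \\
+&\quad + \int \partial_k \partial_j \ell\; \partial_i \ell\; p\,G(p)\,dx. \tag{40}
+\end{align}
+
+Then the condition (38) holds iff
+
+\begin{align}
+& \int \Big[ 2 + \frac{p\,F''(p)}{F'(p)} + \frac{p\,H''(p)}{H'(p)} \Big] \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\,G(p)\,dx = \int \Big[ 1 + \frac{p\,G'(p)}{G(p)} \Big] \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\,G(p)\,dx \tag{41} \\
+\iff\ & 2 + \frac{p\,F''(p)}{F'(p)} + \frac{p\,H''(p)}{H'(p)} = 1 + \frac{p\,G'(p)}{G(p)} \tag{42} \\
+\iff\ & 1 + \frac{p\,H''(p)}{H'(p)} = \frac{p\,G'(p)}{G(p)} - \frac{p\,F''(p)}{F'(p)} \tag{43} \\
+\iff\ & \frac{H''(p)}{H'(p)} = \frac{G'(p)}{G(p)} - \frac{F''(p)}{F'(p)} - \frac{1}{p} \iff H'(p) = \frac{G(p)}{p\,F'(p)}. \tag{44}
+\end{align}
+
+Hence $\nabla^{F,G}$ and $\nabla^{H,G}$ are dual connections with respect to the (F, G)−metric iff Equation (36) holds.
+From Equation (43), we can rewrite the components of the dual connection $\nabla^{H,G}$ as
+
+\[ \Gamma^{H,G}_{ijk} = \int \Big\{ \partial_i \partial_j \ell + \Big( \frac{p\,G'(p)}{G(p)} - \frac{p\,F''(p)}{F'(p)} \Big) \partial_i \ell\, \partial_j \ell \Big\}\, \partial_k \ell\; G(p)\, p\,dx. \tag{45} \]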
+
+Corollary 1. Amari’s 0−connection is the only self dual F−connection with respect to the Fisher information
+metric.
+
+Proof. From Theorem 2, for G(p) = 1 the F−connection $\nabla^{F}$ and the H−connection $\nabla^{H}$ are dual
+connections with respect to the Fisher information metric iff the functions F and H satisfy
+
+\[ H'(p) = \frac{1}{p\,F'(p)}. \tag{46} \]
+
+Thus the F−connection $\nabla^{F}$ is self dual iff the embedding F satisfies the condition
+
+\[ F'(p) = \frac{1}{p\,F'(p)} \iff F'(p) = p^{-\frac{1}{2}} \iff F(p) = 2\,p^{\frac{1}{2}} = L_0(p). \tag{47} \]
+
+That is, Amari’s 0−connection is the only self dual F−connection with respect to the Fisher
+information metric.
+
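+As an illustration of Theorem 2 with G(p) = 1 (a sketch under the assumption that F is Amari's α−embedding), the dual embedding condition (46) can be solved explicitly. It recovers the well-known fact that the α− and (−α)−connections are dual with respect to the Fisher information metric, and the self-dual case α = −α is exactly α = 0, in agreement with Corollary 1:
+
+```latex
+\begin{align*}
+F(p) = L_\alpha(p) \ &\Longrightarrow\ F'(p) = p^{-\frac{1+\alpha}{2}}, \\
+H'(p) = \frac{1}{p\,F'(p)} &= p^{-1}\, p^{\frac{1+\alpha}{2}} = p^{-\frac{1-\alpha}{2}} = L_{-\alpha}'(p), \\
+\text{so } H(p) &= L_{-\alpha}(p) \quad \text{up to an additive constant.}
+\end{align*}
+```
+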
+So far, we have considered the statistical manifold S as a subset of P(X), the set of all probability
+measures on X. Now we relax the condition $\int_X p(x)\,dx = 1$, and consider S as a subset of $\tilde{P}(X)$, which
+is defined by
+
+\[ \tilde{P}(X) := \Big\{ p : X \to \mathbb{R} \;\Big|\; p(x) > 0 \ (\forall\, x \in X); \ \int_X p(x)\,dx < \infty \Big\}. \tag{48} \]
+
+Definition 4. Let M be a Riemannian manifold with a Riemannian metric g. Let ∇ be an affine connection on
+M. If there exists a coordinate system [θi] of M such that ∇∂i∂j = 0 then we say that ∇ is flat, or alternatively
+M is flat with respect to ∇, and we call such a coordinate system [θi] an affine coordinate system for ∇.
+
+Definition 5. Let S = {p(x; ξ) / ξ = [ξ1, ..., ξn] ∈ E ⊆ Rn} be an n−dimensional statistical manifold. If
+for some coordinate system [θi]; i = 1, ..., n
+
+∂i∂jF(p(x; θ)) = 0
+(49)
+
+then we can see from Equation (19) that [θi] is an F−affine coordinate system and that S = {pθ} is F−flat. We
+call such S as an F−affine manifold.
+The condition (49) is equivalent to the existence of functions $C, F_1, \dots, F_n$ on X such that
+
+\[ F(p(x;\theta)) = C(x) + \sum_{i=1}^{n} \theta^i F_i(x). \tag{50} \]
+
+Theorem 3. For any embedding F, $\tilde{P}(X)$ is an F−affine manifold for finite X.
+
+Proof. Let $X = \{x_1, \dots, x_n\}$ be a finite set of n elements. Let $F_i : X \to \mathbb{R}$ be the functions
+defined by $F_i(x_j) = \delta_{ij}$ for $i, j = 1, \dots, n$. Let us define n coordinates $[\theta^i]$ by
+
+\[ \theta^i = F(p(x_i)). \tag{51} \]
+
+Then we get $F(p(x)) = \sum_{i=1}^{n} \theta^i F_i(x)$. Therefore $\tilde{P}(X)$ is an F−affine manifold for any
+embedding F(p).
+
+Remark 2. Zhang [13] introduced ρ-representation, which is a generalization of α-representation of Amari.
+Zhang’s geometry is defined using this ρ-representation together with a convex function. Zhang also defined the
+ρ-affine family of density functions and discussed its dually flat structure. The F−geometry defined using a
+general F-representation is different from the Zhang’s geometry. The metric defined in the F-embedding approach
+is the Fisher information metric and the Riemannian metric defined using the ρ-representation is different from
+the Fisher information metric. The F-connections defined are not in general dually flat and are different from the
+dual connections defined by Zhang.
+
+Remark 3. On a statistical manifold S, we introduced a dualistic structure (g, ∇F, ∇H), where g is the Fisher
+information metric and ∇F, ∇H are the dual connections with respect to the Fisher information metric. Since
+F-connections are symmetric, the manifold S is flat with respect to ∇F iff S is flat with respect to ∇H. Thus if
+S is flat with respect to ∇F, then (S, g, ∇F, ∇H) is a dually flat space. The dually flat spaces are important in
+statistical estimation [4].
+
+4. Invariance of the Geometric Structures
+
+For the statistical manifold S = {p(x; ξ) | ξ ∈ E ⊆ Rn}, the parameters are merely labels attached
+to each point p ∈ S, hence the intrinsic geometric properties should be independent of these labels.
+Consequently, it is natural to consider the invariance properties of the geometric structures under
+suitable transformations of the variables in a statistical manifold. Here we can consider two kinds of
+invariance of the geometric structures; covariance under re-parametrization of the parameter of the
+manifold and invariance under the transformations of the random variable [15]. Now let us investigate
+the invariance properties of the F-geometric structures defined in Section 2.
+
+4.1. Covariance under Re-Parametrization
+
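+Let $[\theta^i]$ and $[\eta^j]$ be two coordinate systems on S, which are related by an invertible transformation
+η = η(θ). Let us denote $\partial_i = \partial / \partial \theta^i$ and $\tilde{\partial}_j = \partial / \partial \eta^j$. Let the coordinate expressions of the metric g be given
+by $g_{ij} = \langle \partial_i, \partial_j \rangle$ and $\tilde{g}_{ij} = \langle \tilde{\partial}_i, \tilde{\partial}_j \rangle$. Let the components of the connection ∇ with respect to the
+coordinates $[\theta^i]$ and $[\eta^j]$ be given by $\Gamma_{ijk}$ and $\tilde{\Gamma}_{ijk}$ respectively.
+Then the covariance of the metric g and the connection ∇ under the re-parametrization means that
+
+\begin{align}
+\tilde{g}_{ij} &= \sum_{m} \sum_{n} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, g_{mn}, \tag{52} \\
+\tilde{\Gamma}_{ijk} &= \sum_{m,n,h} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k}\, \Gamma_{mnh} + \sum_{m,h} \frac{\partial \theta^h}{\partial \eta^k}\, \frac{\partial^2 \theta^m}{\partial \eta^i \partial \eta^j}\, g_{mh}. \tag{53}
+\end{align}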
+
+Lemma 2. The Fisher information metric g is covariant under re-parametrization.
+
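+Proof. The components of the Fisher information metric with respect to the coordinate system $[\theta^i]$ are
+given by
+
+\[ g_{ij}(\theta) = \langle \partial_i, \partial_j \rangle_\theta = \int \frac{\partial p(x;\theta)}{\partial \theta^i}\, \frac{\partial p(x;\theta)}{\partial \theta^j}\, \frac{1}{p(x;\theta)}\,dx. \tag{54} \]
+
+Let $\tilde{p}(x;\eta) = p(x;\theta(\eta))$. Then the components of the Fisher information metric with respect to the
+coordinate system $[\eta^j]$ are given by
+
+\[ \tilde{g}_{ij}(\eta) = \int \frac{\partial \tilde{p}(x;\eta)}{\partial \eta^i}\, \frac{\partial \tilde{p}(x;\eta)}{\partial \eta^j}\, \frac{1}{\tilde{p}(x;\eta)}\,dx. \tag{55} \]
+
+Since
+
+\[ \frac{\partial \tilde{p}(x;\eta)}{\partial \eta^i} = \sum_{m} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial p(x;\theta(\eta))}{\partial \theta^m}, \tag{56} \]
+
+we can write
+
+\begin{align}
+\tilde{g}_{ij}(\eta) &= \int \frac{\partial \tilde{p}(x;\eta)}{\partial \eta^i}\, \frac{\partial \tilde{p}(x;\eta)}{\partial \eta^j}\, \frac{1}{\tilde{p}(x;\eta)}\,dx \notag \\
+&= \int \sum_{m} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial p(x;\theta)}{\partial \theta^m}\, \sum_{n} \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial p(x;\theta)}{\partial \theta^n}\, \frac{1}{p(x;\theta)}\,dx \tag{57} \\
+&= \sum_{m} \sum_{n} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j} \int \frac{\partial p(x;\theta)}{\partial \theta^m}\, \frac{\partial p(x;\theta)}{\partial \theta^n}\, \frac{1}{p(x;\theta)}\,dx \notag \\
+&= \Big[ \sum_{m} \sum_{n} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, g_{mn}(\theta) \Big]_{\theta = \theta(\eta)}. \notag
+\end{align}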
+
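+The transformation rule (52) can be checked by hand on a one-parameter example. The sketch below assumes a Bernoulli model on the two-point sample space {0, 1} (with sums in place of integrals), mean parameter ξ with Fisher information $g(\xi) = 1/(\xi(1-\xi))$, and the logit re-parametrization $\eta = \log(\xi/(1-\xi))$. These choices are illustrative and do not appear in the paper.
+
+```latex
+\begin{align*}
+\xi(\eta) &= \frac{1}{1 + e^{-\eta}}, \qquad \frac{d\xi}{d\eta} = \xi(1-\xi), \\
+\tilde{g}(\eta) &= \Big(\frac{d\xi}{d\eta}\Big)^{2} g(\xi)
+  = \xi^{2}(1-\xi)^{2} \cdot \frac{1}{\xi(1-\xi)} = \xi(1-\xi), \\
+\text{direct computation:}\quad
+\tilde{\ell}(x;\eta) &= x\,\eta - \log(1 + e^{\eta}), \qquad
+\frac{\partial \tilde{\ell}}{\partial \eta} = x - \xi, \qquad
+E\big[(x - \xi)^{2}\big] = \xi(1-\xi).
+\end{align*}
+```
+
+Both routes give the same value, as Lemma 2 requires.
+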
+Lemma 3. The F−connection ∇F is covariant under re-parametrization.
+
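+Proof. Let the components of $\nabla^{F}$ with respect to the coordinates $[\theta^i]$ and $[\eta^j]$ be given by $\Gamma_{ijk}$ and
+$\tilde{\Gamma}_{ijk}$ respectively.
+Let $\tilde{p}(x;\eta) = p(x;\theta(\eta))$. Let us denote $\log p(x;\theta)$ by $\ell(x;\theta)$ and $\log \tilde{p}(x;\eta)$ by $\tilde{\ell}(x;\eta)$.
+The components of the F−connection $\nabla^{F}$ with respect to the coordinate system $[\theta^i]$ are given by
+
+\[ \Gamma_{ijk} = \int \Big\{ \frac{\partial^2 \ell(x;\theta)}{\partial \theta^i \partial \theta^j} + \Big(1 + \frac{p\,F''(p)}{F'(p)}\Big) \frac{\partial \ell(x;\theta)}{\partial \theta^i}\, \frac{\partial \ell(x;\theta)}{\partial \theta^j} \Big\}\, \frac{\partial \ell(x;\theta)}{\partial \theta^k}\; p(x;\theta)\,dx. \tag{58} \]
+
+The components of $\nabla^{F}$ with respect to the coordinate system $[\eta^j]$ are given by
+
+\[ \tilde{\Gamma}_{ijk} = \int \Big\{ \frac{\partial^2 \tilde{\ell}(x;\eta)}{\partial \eta^i \partial \eta^j} + \Big(1 + \frac{\tilde{p}\,F''(\tilde{p})}{F'(\tilde{p})}\Big) \frac{\partial \tilde{\ell}(x;\eta)}{\partial \eta^i}\, \frac{\partial \tilde{\ell}(x;\eta)}{\partial \eta^j} \Big\}\, \frac{\partial \tilde{\ell}(x;\eta)}{\partial \eta^k}\; \tilde{p}(x;\eta)\,dx. \tag{59} \]
+
+We can write
+
+\[ \frac{\partial \tilde{\ell}(x;\eta)}{\partial \eta^i} = \sum_{m} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \ell(x;\theta(\eta))}{\partial \theta^m}. \tag{60} \]
+
+Then
+
+\begin{align}
+\frac{\partial^2 \tilde{\ell}(x;\eta)}{\partial \eta^i \partial \eta^j} &= \sum_{m,n} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial^2 \ell(x;\theta(\eta))}{\partial \theta^m \partial \theta^n} + \sum_{m} \frac{\partial^2 \theta^m}{\partial \eta^i \partial \eta^j}\, \frac{\partial \ell(x;\theta(\eta))}{\partial \theta^m}, \tag{61} \\
+\frac{\partial \tilde{\ell}(x;\eta)}{\partial \eta^i}\, \frac{\partial \tilde{\ell}(x;\eta)}{\partial \eta^j} &= \sum_{m,n} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \ell(x;\theta(\eta))}{\partial \theta^m}\, \frac{\partial \ell(x;\theta(\eta))}{\partial \theta^n}, \tag{62} \\
+\frac{\partial \tilde{\ell}(x;\eta)}{\partial \eta^k} &= \sum_{h} \frac{\partial \theta^h}{\partial \eta^k}\, \frac{\partial \ell(x;\theta(\eta))}{\partial \theta^h}. \tag{63}
+\end{align}
+
+Hence, writing $\partial_m \ell$ for $\partial \ell(x;\theta(\eta)) / \partial \theta^m$ and $p$ for $p(x;\theta(\eta))$, we get
+
+\begin{align}
+\tilde{\Gamma}_{ijk} &= \int \sum_{m,n,h} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k}\; \partial_m \partial_n \ell\; \partial_h \ell\; p\,dx \notag \\
+&\quad + \int \sum_{m,h} \frac{\partial^2 \theta^m}{\partial \eta^i \partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k}\; \partial_m \ell\; \partial_h \ell\; p\,dx \tag{64} \\
+&\quad + \int \Big(1 + \frac{p\,F''(p)}{F'(p)}\Big) \sum_{m,n,h} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k}\; \partial_m \ell\; \partial_n \ell\; \partial_h \ell\; p\,dx \notag \\
+&= \sum_{m,n,h} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k} \int \partial_m \partial_n \ell\; \partial_h \ell\; p\,dx \notag \\
+&\quad + \sum_{m,h} \frac{\partial^2 \theta^m}{\partial \eta^i \partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k} \int \partial_m \ell\; \partial_h \ell\; p\,dx \notag \\
+&\quad + \sum_{m,n,h} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k} \int \Big(1 + \frac{p\,F''(p)}{F'(p)}\Big) \partial_m \ell\; \partial_n \ell\; \partial_h \ell\; p\,dx \notag \\
+&= \sum_{m,n,h} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k}\, \Gamma_{mnh} + \sum_{m,h} \frac{\partial \theta^h}{\partial \eta^k}\, \frac{\partial^2 \theta^m}{\partial \eta^i \partial \eta^j}\, g_{mh}. \notag
+\end{align}
+
+Hence we have shown that F−connections are covariant under re-parametrization of the parameter.
+Covariance under re-parametrization means that the metric and connections are coordinate
+independent. Hence the F−geometry is coordinate independent.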
+
+4.2. Invariance Under the Transformation of the Random Variable
+
+Amari and Nagaoka [4] defined the invariance of Riemannian metric and connections on a
+statistical manifold under a transformation of the random variable as follows,
+
+Definition 6. Let $S = \{ p(x;\xi) \mid \xi \in E \subseteq \mathbb{R}^n \}$ be a statistical manifold defined on a sample space X.
+Let x, y be random variables defined on sample spaces X, Y respectively and φ be a transformation
+of x to y.
+Assume that this transformation induces a model $S' = \{ q(y;\xi) \mid \xi \in E \subseteq \mathbb{R}^n \}$ on Y.
+Let $\lambda : S \to S'$ be a diffeomorphism defined as
+
+\[ \lambda(p_\xi) = q_\xi. \tag{65} \]
+
+Let $g = \langle\,,\rangle$ and $g' = \langle\,,\rangle'$ be two Riemannian metrics defined on S and S′ respectively. Let $\nabla$, $\nabla'$ be two affine
+connections on S and S′ respectively. Then the invariance properties are given by
+
+\begin{align}
+\langle X, Y \rangle_p &= \langle \lambda_*(X), \lambda_*(Y) \rangle'_{\lambda(p)} \qquad \forall\, X, Y \in T_p(S), \tag{66} \\
+\lambda_*(\nabla_X Y) &= \nabla'_{\lambda_*(X)} \lambda_*(Y), \tag{67}
+\end{align}
+
+where $\lambda_*$ is the push forward map associated with the map λ, which is defined by
+
+\[ \lambda_*(X)_{\lambda(p)} = (d\lambda)_p(X). \tag{68} \]
+
+Now we discuss the invariance properties of the F−geometry under suitable transformations
+of the random variable. Let us restrict ourselves to the case of smooth one-to-one transformations
+of the random variable that are in fact statistically interesting. Amari and Nagaoka [4] mentioned a
+transformation, the sufficient statistic of the parameter of the statistical model, which is widely used in
+statistical estimation. In fact the one-to-one transformations of the random variable are trivial examples
+of sufficient statistic.
+Consider a statistical manifold S = {p(x; ξ) | ξ ∈ E ⊆ Rn} defined on a sample space X. Let φ be
+a smooth one-to-one transformation of the random variable x to y. Then the density function q(y; ξ) of
+the induced model S′ takes the form
+
+\[ q(y;\xi) = p(w(y);\xi)\, w'(y), \tag{69} \]
+
+where w is a function such that $x = w(y)$ and $\varphi'(x) = \dfrac{1}{w'(\varphi(x))}$.
+Let us denote $\log q(y;\xi)$ by $\ell(q_y)$ and $\log p(x;\xi)$ by $\ell(p_x)$.
+
+Lemma 4. The Fisher information metric and Amari’s α-connections are invariant under smooth one-to-one
+transformations of the random variable.
+
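+Proof. Let φ be a smooth one-to-one transformation of the random variable x to y.
+From Equation (69),
+
+\begin{align}
+p(x;\xi) &= q(\varphi(x);\xi)\, \varphi'(x), \tag{70} \\
+\partial_i \ell(q_y) &= \partial_i \ell(p_{w(y)}), \tag{71} \\
+\partial_i \ell(q_{\varphi(x)}) &= \partial_i \ell(p_x). \tag{72}
+\end{align}
+
+The Fisher information metric g′ on the induced manifold S′ is given by
+
+\begin{align}
+g'_{ij}(q_\xi) &= \int_Y \partial_i \ell(q_y)\, \partial_j \ell(q_y)\; q(y;\xi)\,dy \notag \\
+&= \int_X \partial_i \ell(q_{\varphi(x)})\, \partial_j \ell(q_{\varphi(x)})\; q(\varphi(x);\xi)\, \varphi'(x)\,dx \tag{73} \\
+&= \int_X \partial_i \ell(p_x)\, \partial_j \ell(p_x)\; p(x;\xi)\,dx \notag \\
+&= g_{ij}(p_\xi), \notag
+\end{align}
+
+which is the Fisher information metric on S.
+The components of Amari’s α−connections on the induced manifold S′ are given by
+
+\begin{align}
+\Gamma'^{\alpha}_{ijk}(q_\xi) &= \int_Y \partial_i \partial_j \ell(q_y)\, \partial_k \ell(q_y)\; q(y;\xi)\,dy + \int_Y \frac{1-\alpha}{2}\, \partial_i \ell(q_y)\, \partial_j \ell(q_y)\, \partial_k \ell(q_y)\; q(y;\xi)\,dy \notag \\
+&= \int_X \partial_i \partial_j \ell(q_{\varphi(x)})\, \partial_k \ell(q_{\varphi(x)})\; q(\varphi(x);\xi)\, \varphi'(x)\,dx \notag \\
+&\quad + \int_X \frac{1-\alpha}{2}\, \partial_i \ell(q_{\varphi(x)})\, \partial_j \ell(q_{\varphi(x)})\, \partial_k \ell(q_{\varphi(x)})\; q(\varphi(x);\xi)\, \varphi'(x)\,dx \tag{74} \\
+&= \int_X \partial_i \partial_j \ell(p_x)\, \partial_k \ell(p_x)\; p(x;\xi)\,dx + \int_X \frac{1-\alpha}{2}\, \partial_i \ell(p_x)\, \partial_j \ell(p_x)\, \partial_k \ell(p_x)\; p(x;\xi)\,dx \notag \\
+&= \Gamma^{\alpha}_{ijk}(p_\xi), \notag
+\end{align}
+
+which are the components of Amari’s α−connections on the manifold S. Thus the
+Fisher information metric and Amari’s α-connections are invariant under smooth one-to-one
+transformations of the random variable.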
+
+Now we prove that α-connections are the only F−connections that are invariant under smooth
+one-to-one transformations of the random variable.
+
+Theorem 4. Amari’s α-connections are the only F−connections that are invariant under smooth one-to-one
+transformations of the random variable.
+
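+Proof. Let φ be a smooth one-to-one transformation of the random variable x to y.
+The components of the F−connection of the induced manifold S′ are
+
+\begin{align}
+\Gamma'^{F}_{ijk}(q_\xi) &= \int_Y \Big\{ \partial_i \partial_j \ell(q_y) + \Big(1 + \frac{q\,F''(q)}{F'(q)}\Big) \partial_i \ell(q_y)\, \partial_j \ell(q_y) \Big\}\, \partial_k \ell(q_y)\; q(y;\xi)\,dy \notag \\
+&= \int_X \partial_i \partial_j \ell(p_x)\, \partial_k \ell(p_x)\; p(x;\xi)\,dx \tag{75} \\
+&\quad + \int_X \Big(1 + \frac{q(\varphi(x);\xi)\, F''(q(\varphi(x);\xi))}{F'(q(\varphi(x);\xi))}\Big) \partial_i \ell(p_x)\, \partial_j \ell(p_x)\, \partial_k \ell(p_x)\; p(x;\xi)\,dx, \notag
+\end{align}
+
+and the components of the F−connection of the manifold S are
+
+\begin{align}
+\Gamma^{F}_{ijk}(p_\xi) &= \int_X \partial_i \partial_j \ell(p_x)\, \partial_k \ell(p_x)\; p(x;\xi)\,dx \notag \\
+&\quad + \int_X \Big(1 + \frac{p(x;\xi)\, F''(p(x;\xi))}{F'(p(x;\xi))}\Big) \partial_i \ell(p_x)\, \partial_j \ell(p_x)\, \partial_k \ell(p_x)\; p(x;\xi)\,dx. \tag{76}
+\end{align}
+
+Then by equating the components $\Gamma'^{F}_{ijk}(q_\xi)$ and $\Gamma^{F}_{ijk}(p_\xi)$ of the F−connection, we get
+
+\begin{align}
+& \int \frac{q(\varphi(x);\xi)\, F''(q(\varphi(x);\xi))}{F'(q(\varphi(x);\xi))}\, \partial_i \ell(p_x)\, \partial_j \ell(p_x)\, \partial_k \ell(p_x)\; p(x;\xi)\,dx \notag \\
+&\qquad = \int \frac{p(x;\xi)\, F''(p(x;\xi))}{F'(p(x;\xi))}\, \partial_i \ell(p_x)\, \partial_j \ell(p_x)\, \partial_k \ell(p_x)\; p(x;\xi)\,dx. \tag{77}
+\end{align}
+
+Then it follows that the condition for the F−connection to be invariant under the transformation φ is
+given by
+
+\[ \frac{p\,F''(p)}{F'(p)} = k, \tag{78} \]
+
+where k is a real constant.
+Hence it follows from Euler’s homogeneous function theorem that the function F′ is a positive
+homogeneous function in p of degree k. So
+
+\[ F'(\lambda p) = \lambda^{k} F'(p) \quad \text{for } \lambda > 0. \tag{79} \]
+
+Since F′ is a positive homogeneous function in the single variable p, without loss of generality
+we can take
+
+\[ F'(p) = p^{k}. \tag{80} \]
+
+Therefore
+
+\[ F(p) = \begin{cases} \dfrac{p^{k+1}}{k+1}, & k \neq -1, \\[4pt] \log p, & k = -1. \end{cases} \tag{81} \]
+
+Letting
+
+\[ k = -\frac{1+\alpha}{2}, \qquad \alpha \in \mathbb{R}, \tag{82} \]
+
+we get
+
+\[ F(p) = \begin{cases} \dfrac{2}{1-\alpha}\, p^{\frac{1-\alpha}{2}}, & \alpha \neq 1, \\[4pt] \log p, & \alpha = 1, \end{cases} \tag{83} \]
+
+which is nothing but Amari’s α−embedding $L_\alpha(p)$. Hence we obtain that Amari’s α−connections
+are the only F−connections that are invariant under smooth one-to-one transformations of the
+random variable.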
+
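+The invariance criterion (78) is easy to apply to a candidate embedding. The two embeddings below are illustrative choices, not taken from the paper: the first satisfies the criterion and is a rescaled α−embedding, while the second does not.
+
+```latex
+\begin{align*}
+F(p) = p^{2}:&\quad \frac{p\,F''(p)}{F'(p)} = \frac{2p}{2p} = 1 = k
+  \ \Longrightarrow\ \alpha = -(2k + 1) = -3, \qquad L_{-3}(p) = \tfrac{1}{2}\,p^{2} = \tfrac{1}{2}\,F(p), \\
+F(p) = e^{p}:&\quad \frac{p\,F''(p)}{F'(p)} = \frac{p\,e^{p}}{e^{p}} = p \quad \text{(not constant, so (78) fails)}.
+\end{align*}
+```
+
+Since the F−connection depends on F only through $pF''(p)/F'(p)$, the embedding $F(p) = p^{2}$ yields the same connection as $L_{-3}$.
+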
+Remark 4. In Section 2, we defined (F, G)-connections using a general embedding function F and a positive
+smooth function G. We can show that (F, G)-connection is invariant under smooth one-to-one transformation
+of the random variable when G(p) = c, where c is a real constant and F(p) = Lα(p) (proof is similar to that of
+Theorem 4). The notion of (F, G)−metric and (F, G)−connection provides a more general way of introducing
+geometric structures on a manifold. We were able to show that the Fisher information metric (up to a constant)
+and Amari’s α−connections are the only metric and connections belonging to this class that are invariant under
+both the transformation of the parameter and the one-to-one transformation of the random variable.
+
+5. Conclusions
+
+The Fisher information metric and Amari’s α−connections are widely used in the theory
+of information geometry and have an important role in the theory of statistical estimation.
+Amari’s α−connections are defined using a one parameter family of functions, the α−embeddings.
+We generalized this idea to introduce geometric structures on a statistical manifold S. We considered
+a general embedding function F of S into RX and obtained a geometric structure on S called the
+F−geometry. Amari’s α−geometry is a special case of F−geometry. A more general way of defining
+Riemannian metrics and affine connections on a statistical manifold S is given using a positive
+continuous function G and the embedding F.
+Amari’s α−geometry is the only F−geometry that is invariant under both the transformation of
+the parameter and the random variable or equivalently under the sufficient statistic. We can relax the
+condition of invariance under the sufficient statistic and can consider other statistically significant
+transformations as well, which then gives an F−geometry other than α−geometry that is invariant
+under these statistically significant transformations. We believe that the idea of F−geometry can be
+used in the further development of the geometric theory of q-exponential families. We look forward to
+studying these problems in detail later.
+
+Acknowledgments: We are extremely thankful to Shun-ichi Amari for reading this article and encouraging our
+learning process. We would like to thank the reviewer who mentioned the references [13,16] that are of great
+importance in our future work.
+
+Author Contributions: The authors contributed equally to the presented mathematical framework and the writing
+of the paper.
+
+Conflicts of Interest: The authors declare no conflicts of interest.
+
+References
+
+1. Amari, S. Differential geometry of curved exponential families-curvature and information loss. Ann. Statist. 1982, 10, 357–385.
+2. Amari, S. Differential-Geometrical Methods in Statistics; Lecture Notes in Statistics, Volume 28; Springer-Verlag: New York, NY, USA, 1985.
+3. Amari, S.; Kumon, M. Differential geometry of Edgeworth expansions in curved exponential family. Ann. Inst. Statist. Math. 1983, 35, 1–24.
+4. Amari, S.; Nagaoka, H. Methods of Information Geometry, Translations of Mathematical Monographs; Oxford University Press: Oxford, UK, 2000.
+5. Barndorff-Nielsen, O.E.; Cox, D.R.; Reid, N. The role of differential geometry in statistical theory. Internat. Statist. Rev. 1986, 54, 83–96.
+6. Dawid, A.P. A Discussion to Efron’s paper. Ann. Statist. 1975, 3, 1231–1234.
+7. Efron, B. Defining the curvature of a statistical problem (with applications to second order efficiency). Ann. Statist. 1975, 3, 1189–1242.
+8. Efron, B. The geometry of exponential families. Ann. Statist. 1978, 6, 362–376.
+9. Murray, M.K.; Rice, R.W. Differential Geometry and Statistics; Chapman & Hall: London, UK, 1995.
+10. Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta. Math. Soc. 1945, 37, 81–91.
+11. Chentsov, N.N. Statistical Decision Rules and Optimal Inference; Translated in English, Translation of the Mathematical Monographs; American Mathematical Society: Providence, RI, USA, 1982.
+12. Corcuera, J.M.; Giummole, F. A characterization of monotone and regular divergences. Ann. Inst. Statist. Math. 1998, 50, 433–450.
+13. Zhang, J. Divergence function, duality and convex analysis. Neur. Comput. 2004, 16, 159–195.
+14. Burbea, J. Informative geometry of probability spaces. Expo Math. 1986, 4, 347–378.
+15. Wagenaar, D.A. Information Geometry for Neural Networks. Available online: http://www.danielwagenaar.net/res/papers/98-Wage2.pdf (accessed on 13 December 2013).
+16. Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually flat and conformal geometries. Physica A 2012, 391, 4308–4319.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+Computational Information Geometry in Statistics:
+Theory and Practice
+
+Frank Critchley 1 and Paul Marriott 2,*
+
+1 Department of Mathematics and Statistics, The Open University, Walton Hall, Milton Keynes,
+Buckinghamshire MK7 6AA, UK; E-Mail: f.critchley@open.ac.uk
+2 Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo,
+ON N2L 3G1, Canada
+*
+E-Mail: pmarriot@uwaterloo.ca; Tel.: +1-519-888-4567.
+
+Received: 27 March 2014; in revised form: 25 April 2014 / Accepted: 29 April 2014 /
+Published: 2 May 2014
+
+Abstract: A broad view of the nature and potential of computational information geometry in
+statistics is offered.
+This new area suitably extends the manifold-based approach of classical
+information geometry to a simplicial setting, in order to obtain an operational universal model space.
+Additional underlying theory and illustrative real examples are presented. In the infinite-dimensional
+case, challenges inherent in this ambitious overall agenda are highlighted and promising new
+methodologies indicated.
+
+Keywords: information geometry; computational geometry; statistical foundations
+
+1. Introduction
+
+The application of geometry to statistical theory and practice has seen a number of different
+approaches developed. One of the most important can be defined as starting with Efron’s seminal
+paper [? ] on statistical curvature and subsequent landmark references, including the book by Kass
+and Vos [? ]. This approach, a major part of which has been called information geometry, continues
+today, a primary focus being invariant higher-order asymptotic expansions obtained through the use
+of differential geometry. A somewhat representative example of the type of result it generates is taken
+from [? ], where the notation is defined:
+
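+Example 1. The bias correction of a first-order efficient estimator, $\hat\beta$, is defined by:
+
+\[
+b^{a}(\beta) = -\frac{1}{2n}\, g^{aa'} \left( g^{bc}\,\Gamma^{(-1)}_{a'bc} + g^{\kappa\lambda}\, h^{(-1)}_{\kappa\lambda a'} \right),
+\]
+
+and has the property that if $\hat\beta^{*} := \hat\beta - b(\beta)$ then:
+
+\[
+E_{\beta}(\hat\beta^{*} - \beta) = O(n^{-3/2}).
+\]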
+
+The strengths usually claimed of such a result are that, for a worker fluent in the language of
+information geometry, it is explicit, insightful as to the underlying structure and of clear utility in
+statistical practice. We agree entirely. However, the overwhelming evidence of the literature is that,
+while the benefits of such inferential improvements are widely acknowledged in principle, in practice,
+the overhead of first becoming fluent in information geometry prevents their routine use. As a result, a
+great number of powerful results of practical importance lie severely underused, locked away behind
+notational and conceptual bars.
+This paper proposes that this problem can be addressed computationally by the development
+of what we call computational information geometry. This gives a mathematical and numerical
+
+Entropy 2014, 16, 2454–2471; doi:10.3390/e16052454
+www.mdpi.com/journal/entropy
+computational framework in which the results of information geometry can be encoded as “black-box”
+numerical algorithms, allowing direct access to their power. Essentially, this works by exploiting the
+structural properties of information geometry, which are such that all formulae can be expressed in
+terms of four fundamental building blocks: defined and detailed in Amari [? ], these are the +1 and −1
+geometries, the way that these are connected via the Fisher information and the foundational duality
+theorem. Additionally, computational information geometry enables a range of methodologies and
+insights impossible without it; notably, those deriving from the operational, universal model space,
+which it affords; see, for example, [? ? ? ].
+The paper is structured as follows. Section 2 looks at the case of distributions on a finite number
+of categories where the extended multinomial family provides an exhaustive model underlying the
+corresponding information geometry. Since the aim is to produce a computational theory, a finite
+representation is the ultimate aim, making the results of this section of central importance. The
+paper also emphasises how the simplicial structures introduced here are foundational to a theory of
+computational information geometry. Being intrinsically constructive, a simplicial approach is useful
+both theoretically and computationally. Section 3 looks at how simplicial structures, defined for finite
+dimensions, can be extended to the infinite dimensional case.
+
+2. Finite Discrete Case
+
+2.1. Introduction
+
+This section shows how the results of classical information geometry can be applied in a purely
+computational way. We emphasise that the framework developed here can be implemented in
+a purely algorithmic way, allowing direct access to a powerful information geometric theory of
+practical importance.
+The key tool, as explained in [? ], is the simplex:
+
+\[
+\Delta^{k} := \left\{ \pi = (\pi_0, \pi_1, \ldots, \pi_k)^{\top} : \pi_i \ge 0, \; \sum_{i=0}^{k} \pi_i = 1 \right\}, \qquad (1)
+\]
+
+with a label associated with each vertex. Here, k is chosen to be sufficiently large, so that any statistical
+model—by which we mean a sample space, a set of probability distributions and selected inference
+problem—can be embedded. The embedding is done in such a way that all the building blocks
+of information geometry (i.e., manifold, affine connections and metric tensor) can be numerically
+computed explicitly. Within such a simplex, we can embed a large class of regular exponential families;
+see [? ] for details. This class includes exponential family random graph models, logistic regression,
+log-linear and other models for categorical data analysis. Furthermore, the multinomial family on k + 1
+categories is naturally identified with the relative interior of this space, int(Δk), while the extended
+family, Equation (1), is a union of distributions with different support sets.
+This paper builds on the theory of information geometry following that introduced by [? ] via
+the affine space construction introduced by [? ] and extended by [? ]. Since this paper concentrates
+on categorical random variables, the following definitions are appropriate. Consider a finite set of
+disjoint categories or bins B = {Bi}i∈A. Any distribution over this finite set of categories is defined
+by a set, {πi}i∈A, which defines the corresponding probabilities. With “mix” connoting mixtures of
+distributions, we have:
+
+Definition 1. The −1-affine space structure over distributions on B := {Bi}i∈A is (Xmix, Vmix, +) where:
+
+\[
+X_{mix} = \left\{ \{x_i\}_{i \in A} \;\middle|\; \sum_{i \in A} x_i = 1 \right\}, \qquad
+V_{mix} = \left\{ \{v_i\}_{i \in A} \;\middle|\; \sum_{i \in A} v_i = 0 \right\},
+\]
+
+and the addition operator, +, is the usual addition of sequences.
+
+In Definition 1, the space of (discretised) distributions is a −1-convex subspace of the affine
+space, (Xmix, Vmix, +). A similar affine structure for the +1-geometry, once the support has been fixed,
+can be derived from the definitions in [? ].
+
+2.2. Examples
+
+Examples 2 and 3 are used for illustration. The second of these is a moderately high dimensional
+family, where the way that the boundaries of the simplex are attached to the model is of great
+importance for the behaviour of the likelihood and of the maximum likelihood estimate. In general,
+working in a simplex, boundary effects mean that standard first order asymptotic results can fail,
+while the much more flexible higher order methods can be very effective. The other example is a
+continuous curved exponential family, where both higher order asymptotic sampling theory results
+and geometrically-based dimension reduction are described.
+
+Example 2. The paper [? ] models survival times for leukaemia patients. These times, recorded in days, start
+at the time of diagnosis, and there are 43 observations; see [? ] for details. We further assume that the data is
+censored at a fixed value. It was observed that a censored exponential distribution gives a reasonable, but not
+exact, fit. As discussed in [? ], this gives a one-dimensional curved exponential family inside a two-dimensional
+regular exponential family of the form:
+
+\[
+\exp\left\{ \lambda_1 x + \lambda_2 y - \log\left( \frac{1}{\lambda_2}\left( e^{\lambda_2 t} - 1 \right) + e^{\lambda_1 + \lambda_2 t} \right) \right\}, \qquad (2)
+\]
+where y = min(z, t) and x = I(z ≥ t), and the embedding map is given by (λ1(θ), λ2(θ)) = (− log θ, −θ).
+As shown in [? ], the loss due to discretisation can be made arbitrarily small for all information geometry
+objects. Thus, for example, using this computational approach, it is straightforward to compute the bias
+correction described in Example 1. Each of the terms in the asymptotic bias, i.e., the metric, $g_{ij}$, its inverse, $g^{ij}$,
+the Christoffel symbols, $\Gamma^{(-1)}_{ijk}$, and curvature term, $h^{(-1)}$, can be directly numerically coded as appropriate finite
+difference approximations to derivatives. Thus, “black-box” code can directly calculate the numerical value of
+the asymptotic bias, and this numerical value can then be used by those who are not familiar with information
+geometry. For example, this calculation establishes the fact that, with this particular data set, the sample size is
+such that the bias is inferentially unimportant.
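+
+As a rough illustration of coding a geometric object by finite differences, the sketch below approximates the metric of the family in Equation (2) as the Hessian of its normalising term and pulls it back along the embedding (λ1(θ), λ2(θ)) = (− log θ, −θ). The function names, the step size and the values of θ and t are assumptions made for this illustration and are not taken from the paper; the comparison value is the standard Fisher information of an exponential distribution censored at a fixed time.
+

```python
import numpy as np

def log_partition(lam, t):
    """Normalising term of the two-parameter family in Equation (2)."""
    lam1, lam2 = lam
    return np.log((np.exp(lam2 * t) - 1.0) / lam2 + np.exp(lam1 + lam2 * t))

def hessian_fd(f, x, h=1e-4):
    """Central finite-difference approximation to the Hessian of f at x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    hess = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = h
            ej[j] = h
            hess[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                          - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * h * h)
    return hess

# Illustrative (assumed) rate parameter and censoring time.
theta, t = 0.5, 3.0

# Point on the curved family and the tangent of the embedding map at theta.
lam = np.array([-np.log(theta), -theta])
tangent = np.array([-1.0 / theta, -1.0])

# Metric in natural parameters (Hessian of the normalising term), then pulled back.
metric = hessian_fd(lambda v: log_partition(v, t), lam)
fisher_fd = float(tangent @ metric @ tangent)

# Closed form for an exponential distribution censored at the fixed time t.
fisher_exact = (1.0 - np.exp(-theta * t)) / theta**2
print(fisher_fd, fisher_exact)
```

+
+The same pattern (evaluate a scalar function on a small stencil, then combine the values) would extend to third derivatives for the Christoffel symbols, at the cost of a more careful choice of step size.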
+
+[Figure 1 graph: four nodes labelled 1, 2, 3 and 4.]
+
+Figure 1. Undirected graphical model showing the cyclic graph of order four.
+
+Example 3. The paper [? ] discusses an undirected graphical model based on the cyclic graph of order four,
+shown in Figure 1, with binary random variables at each node. Without any constraints, there are 16 possible
+values for the graph, so model space can be thought of as a 15-dimensional simplex, including the relative
+boundary. However, the conditional independence relations encoded by the graph impose linear constraints in the
+natural parameters of the exponential family. Thus, the resultant model is a lower dimensional full exponential
+family and its closure.
+As described in [? ], the four cycle model is a seven dimensional exponential family, which is a +1-affine
+subspace of the +1-affine structure of the 15-dimensional simplex. The model can be written in the form:
+
+\[
+\left( \frac{\pi_i \exp\left\{ \sum_{h=1}^{8} \eta_h v_{hi} \right\}}{\sum_{j=0}^{15} \pi_j \exp\left\{ \sum_{h=1}^{8} \eta_h v_{hj} \right\}} \right)_{i=0}^{15} \qquad (3)
+\]
+
+for a given set of linearly independent vectors $\{v_h\}_{h=1}^{8}$. The existence of the maximum likelihood estimate for
+η = (ηh) will depend on how the limit points of Model (3) meet the observed face of Δ15; that is, the span of the
+vertices (bins) having positive counts. Thus, a key computational task is to learn how a full exponential family,
+defined by a representation of the form of (3), is attached to boundary sub-simplices of the high-dimensional
+embedding simplex.
+In order to visualise the geometric aspects of this problem, consider a lower dimensional version. Define
+a two-dimensional full exponential family by the vectors v1 = (1, 2, 3, 4), v2 = (1, 4, 9, −1) and the uniform
+distribution base point, πi, embedded in the three-dimensional simplex. The two-dimensional family is defined
+by the +1-affine space through (0.25, 0.25, 0.25, 0.25) spanned by the space of vectors of the form:
+
+α(1, 2, 3, 4) + β(1, 4, 9, −1) = (α + β, 2α + 4β, 3α + 9β, 4α − β).
+
+Consider directions from the origin obtained by writing α = θβ, giving, for each θ, a one-dimensional, full
+exponential family parameterized by β in the direction β(θ + 1, 2θ + 4, 3θ + 9, 4θ − 1). The aspect of this vector,
+which determines the connection to the boundary, is the rank order of its elements. For example, suppose the first
+component was the maximum and the last the minimum. Then, as β → ±∞, this one-dimensional family will
+be connected to the first and fourth vertex of the embedding four simplex, respectively. Note that changing the
+value of θ changes the rank structure, as illustrated in Figure 2. This plot shows the four element-wise linear
+functions of θ (dashed lines) and the salient overall feature of their rank order; that is, their upper and lower
+envelopes (solid lines). From this analysis of the envelopes of a set of linear functions, it can be seen that the
+function 2θ + 4 is redundant. The consequence of this is shown in Figure 3, which shows a direct computation
+of the two-dimensional family. It is clear that, indeed, only three of the four vertices have been connected by
+the model.
+In general, the problem of finding the limit points in full exponential families inside simplex models is a
+problem of finding redundant linear constraints. As shown in [? ], this can be converted, via convex duality, into
+the problem of finding extremal points in a finite dimensional affine space. In the four-cycle model, this technique
+can construct all sub-simplices containing limit points of the four-cycle model. For example, it can be shown
+that all of the 16 vertices are part of the boundary. Once the boundary points have been identified, necessary
+and sufficient conditions for the existence of the maximum likelihood in the +1-parameters can easily be found
+computationally [? ].
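+
+The envelope argument in the lower dimensional version of Example 3 can be checked numerically. The sketch below evaluates the four element-wise linear functions of θ on a grid and records which of them ever attains the upper or the lower envelope; the grid range and resolution, and the helper name, are assumptions made for this illustration (indices are zero-based, so index 1 corresponds to 2θ + 4).
+

```python
import numpy as np

# Direction vector theta * v1 + v2, with v1 and v2 as in Example 3.
v1 = np.array([1.0, 2.0, 3.0, 4.0])
v2 = np.array([1.0, 4.0, 9.0, -1.0])

def envelope_members(theta_grid):
    """Indices of the linear functions attaining the upper / lower envelope."""
    values = np.outer(theta_grid, v1) + v2  # one row per theta, one column per function
    upper = set(np.argmax(values, axis=1).tolist())
    lower = set(np.argmin(values, axis=1).tolist())
    return upper, lower

# Assumed grid; it is wide enough to cover every crossing point of the four lines.
theta_grid = np.linspace(-50.0, 50.0, 20001)
upper, lower = envelope_members(theta_grid)

attached = sorted(upper | lower)                    # vertices reached as beta -> +/- infinity
redundant = sorted(set(range(4)) - set(attached))   # functions never on either envelope
print("attached vertices:", attached)
print("redundant functions:", redundant)
```

+
+A grid search like this only illustrates the idea; the convex duality formulation mentioned in the text is what would scale to the high-dimensional case.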
+
+[Figure 2 plot: “Envelope of linear functions”.]
+
+Figure 2. The envelope of a set of linear functions. Functions, dashed lines; envelope, solid lines.
+
+Figure 3. Attaching a two-dimensional example to the boundary of the simplex.
+
+2.3. Tensor Analysis and Numerical Stability
+
+One of the most powerful sets of results from classical information geometry is the way that
+geometrically-based tensor analysis is perfect for use in multi-dimensional higher order asymptotic
+analysis; see [? ] or [? ]. The tensorial formulation does, however, present a couple of problems in
+practice. For many, its very tight and efficient notational aspects can obscure rather than enlighten,
+while the resulting formulae tend to have a very large number of terms, making them rather
+cumbersome to work with explicitly. These are not problems at all for the computational approach
+described in this paper. Rather, the clarity of the tensorial approach is ideal for coding, where large
+numbers of additive terms, of course, are easy to deal with.
+Two more fundamental issues, which the global geometric approach of this paper highlights,
+concern numerical stability. The ability to invert the Fisher information matrix is vital in most tensorial
+formulae, and so understanding its spectrum, discussed in Section 2.4, is vital. Secondly, numerical
+underflow and overflow near boundaries require careful analysis, and so, understanding the way
+that models are attached to the boundaries of the extended multinomial models is equally important.
+The four-cycle model, to which we now return, illustrates computational information geometry doing
+this effectively.
+
+Example 4. The multivariate Edgeworth approximation to the sampling distribution of part of the sufficient
+statistic for the four-cycle model is shown in Figure 4. Using the techniques described above, a point near the
+boundary of the 15-simplex has been selected as the data generation process. For illustration, we focus on the
+marginal distribution of two components of the sufficient statistic, though any number could have been chosen.
+The boundary forces constraints on the range of the sufficient statistics, shown by the dashed line in the plot.
+The points, jittered for clarity, show the distribution computed by simulation. It is typical that such boundary
+constraints prevent standard first order methods from performing well, but the greater flexibility of higher
+order methods can be seen to work well here. As discussed above, methods, such as the multivariate Edgeworth
+expansion, can be strongly exploited in a computational framework, such as ours. Note, the discretization that
+can be observed in the figure is extensively discussed in [? ].
+
+Figure 4. Using the Edgeworth expansion near the boundary of four-cycle model.
+
+2.4. Spectrum of Fisher Information
+
+We focus now on the second numerical issue identified above. In any multinomial, the Fisher
+information matrix and its inverse are explicit. Indeed, the 0-geodesics and the corresponding geodesic
+distance are also explicit; see [? ] or [? ]. However, since the simplex glues together multinomial
+structures with different supports and the computational theory is in high dimensions, it is a fact
+that the Fisher information matrix can be arbitrarily close to being singular. It is therefore of central
+interest that the spectral decomposition of the Fisher information itself has a very nice structure, as
+shown below.
+
+Example 5. Consider a multinomial distribution based on 81 equal width categories on [−5, 5], where the
+probability associated to a bin is proportional to that of the standard normal distribution for that bin. The Fisher
+information for this model is an 80 × 80 matrix, whose spectrum is shown in Figure 5. By inspection, it can
+be seen that there are exponentially small eigenvalues, so that while the matrix is positive definite, it is also
+arbitrarily close to being singular. Furthermore, it can be seen that the spectrum has the shape of a half-normal
+density function and that the eigenvalues seem to come in pairs. These facts are direct consequences of the general
+results below.
+
+With $\pi_{-0}$ denoting the vector of all bin probabilities, except $\pi_0$, we can write the Fisher
+information matrix (in the +1 form) as N times:
+
+\[
+I(\pi) := \mathrm{diag}(\pi_{-0}) - \pi_{-0}\pi_{-0}^{T}.
+\]
+
+This has an explicit spectral decomposition, which can be computed by using interlacing
+eigenvalue results (see, for example, [? ], Chapter 4). In particular, if the diagonal matrices
+$\mathrm{diag}(\pi_1, \ldots, \pi_k)$ and $\mathrm{diag}(\lambda_1 I_{m_1} | \cdots | \lambda_g I_{m_g})$ agree up to a row-and-column permutation, where $g > 1$
+and $\lambda_1 > \cdots > \lambda_g > 0$, then $I(\pi)$ has ordered spectrum:
+
+\[
+\lambda_1 > \tilde\lambda_1 > \cdots > \lambda_g > \tilde\lambda_g \ge 0, \qquad (4)
+\]
+
+with $\tilde\lambda_g > 0 \iff \pi_0 > 0$, each $\lambda_i$ having multiplicity $m_i - 1$, while each $\tilde\lambda_i$ is simple.
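+
+The setting of Example 5 is easy to reproduce numerically. The following is a minimal sketch, assuming that the first (leftmost) bin plays the role of π0, which the text does not specify, and using hypothetical helper names; it builds the 80 × 80 matrix diag(π−0) − π−0π−0ᵀ for 81 equal width bins on [−5, 5] and inspects its eigenvalues.
+

```python
import numpy as np
from math import erf, sqrt

def discretised_normal(n_bins=81, lo=-5.0, hi=5.0):
    """Bin probabilities proportional to standard normal mass on equal width bins."""
    edges = np.linspace(lo, hi, n_bins + 1)
    cdf = np.array([0.5 * (1.0 + erf(e / sqrt(2.0))) for e in edges])
    mass = np.diff(cdf)
    return mass / mass.sum()

def fisher_information(pi):
    """diag(pi_{-0}) - pi_{-0} pi_{-0}^T, taking pi[0] as the omitted bin (an assumption)."""
    rest = pi[1:]
    return np.diag(rest) - np.outer(rest, rest)

pi = discretised_normal()
info = fisher_information(pi)

# The matrix is symmetric, so eigvalsh applies; sort in decreasing order.
eigvals = np.linalg.eigvalsh(info)[::-1]
print("largest eigenvalue :", eigvals[0])
print("smallest eigenvalue:", eigvals[-1])
print("ratio largest / smallest:", eigvals[0] / eigvals[-1])
```

+
+The very large ratio between the extreme eigenvalues is the numerical face of the statement that the matrix is positive definite yet close to singular.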
+
+[Figure 5 plot: eigenvalues against rank.]
+
+Figure 5. Spectrum of the Fisher information matrix of a discretised normal distribution.
+
+We give a complete account of the spectral decomposition (SpD) of I(π). There are four cases to
+consider, the last having the generic spectrum of (4). Without loss of generality, after permutation, assume now
+π1 ≥ · · · ≥ πk. The four cases are:
+
+Case 1 For some $l < k$, the last $k - l$ elements of $\pi_{-0}$ vanish: the sub-case $l = 0 \iff \pi_0 = 1 \iff
+I(\pi) = 0$ is trivial. Otherwise, writing $\pi_+ = (\pi_1, \ldots, \pi_l)^T$ and $\Pi_+ = \mathrm{diag}(\pi_+)$, the SpD of:
+
+\[
+I(\pi) = \begin{pmatrix} \Pi_+ - \pi_+\pi_+^{T} & 0 \\ 0 & 0 \end{pmatrix}
+\]
+
+follows at once from that of $\Pi_+ - \pi_+\pi_+^{T}$, given below.
+Case 2 k = 1: this case is trivial.
+Case 3 $k > 1$, $\pi = \lambda 1_k$, $\lambda > 0$: the SpD of $I(\pi)$ is:
+
+\[
+\lambda C_k + \lambda(1 - k\lambda) J_k
+\]
+
+where $C_k = I_k - J_k$ and $J_k = k^{-1} 1_k 1_k^{T}$. Here, $\lambda$ has multiplicity $k - 1$ and eigenspace $[\mathrm{Span}(1_k)]^{\perp}$,
+while $\tilde\lambda := \lambda(1 - k\lambda)$ has multiplicity one and eigenspace $\mathrm{Span}(1_k)$. In particular, since
+$1 - \pi_0 = k\lambda$, it follows that:
+
+\[
+I(\pi) \text{ is singular} \iff \pi_0 = 0.
+\]
+
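+Case 4 $\pi_{-0} = (\lambda_1 1_{m_1}^{T} | \ldots | \lambda_g 1_{m_g}^{T})^{T}$, $g > 1$ and $\lambda_1 > \cdots > \lambda_g > 0$:
+
+This is the generic case, having the spectrum of (4) above. Denoting by $O_m$ the zero matrix of
+order $m \times m$ and by $P(\nu)$ the rank one orthogonal projector onto $\mathrm{Span}(\nu)$ ($\nu \neq 0$), the SpD is:
+
+\[
+\sum_{i=1,\, m_i>1}^{g} \lambda_i \,\mathrm{diag}(O_{m_{i-}}, C_{m_i}, O_{m_{i+}}) \; + \;
+\sum_{i=1}^{g} \tilde\lambda_i \, P\!\left( \left( \frac{\lambda_1}{\tilde\lambda_i - \lambda_1} 1_{m_1}^{T}, \ldots, \frac{\lambda_g}{\tilde\lambda_i - \lambda_g} 1_{m_g}^{T} \right)^{T} \right),
+\]
+
+where $m_{i-} = \sum\{m_j \mid j < i\}$, $m_{i+} = \sum\{m_j \mid j > i\}$ and the $\tilde\lambda_i$ are the zeros of:
+
+\[
+h(\tilde\lambda) := 1 + \sum_{i=1}^{g} \frac{m_i\lambda_i^{2}}{\tilde\lambda - \lambda_i}
+= \left( 1 - \sum_{i=1}^{g} m_i\lambda_i \right) + \tilde\lambda \left( \sum_{i=1}^{g} \frac{m_i\lambda_i}{\tilde\lambda - \lambda_i} \right).
+\]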
+
+In particular, $\{\tilde\lambda_i : i = 1, \ldots, g\}$ are simple eigenvalues satisfying (4) while, whenever $m_i > 1$,
+$\lambda_i$ is also an eigenvalue having multiplicity $m_i - 1$. Further, expanding $\det(I(\pi))$, we again find:
+
+\[
+I(\pi) \text{ is singular} \iff \pi_0 = 0,
+\]
+
+so that $\tilde\lambda_g > 0 \iff \pi_0 > 0$, as claimed. Finally, we note that each $\tilde\lambda_i$ ($i < g$) is
+typically (much) closer to $\lambda_i$ than to $\lambda_{i+1}$. For, considering the graph of $x \mapsto 1/x$,
+$h\left( (\lambda_i + \lambda_{i+1})/2 + \delta(\lambda_i - \lambda_{i+1})/2 \right)$ ($-1 < \delta < +1$) is well-approximated by:
+
+\[
+1 - \frac{2 m_i \lambda_i^{2}}{(\lambda_i - \lambda_{i+1})(1 - \delta)} + \frac{2 m_{i+1} \lambda_{i+1}^{2}}{(\lambda_i - \lambda_{i+1})(1 + \delta)},
+\]
+
+whose unique zero $\delta^{*}$ over $(-1, 1)$ is positive whenever, as will typically be the case, $m_i = m_{i+1}$
+(both will usually be one), while $(m_i\lambda_i + m_{i+1}\lambda_{i+1}) < 1/2$. Indeed, a straightforward analysis
+shows that, for any $m_i$ and $m_{i+1}$, $\delta^{*} = 1 + O(\lambda_i)$ as $\lambda_i \to 0$.
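+
+The generic case can be sanity-checked on a small instance. The sketch below uses assumed block values (0.2, 0.1, 0.05) with multiplicities (1, 2, 3), chosen only for illustration, computes the eigenvalues of diag(π−0) − π−0π−0ᵀ directly, and compares them with the stated structure: each block value repeated one time fewer than its multiplicity, plus g simple eigenvalues that are zeros of h and interlace the block values.
+

```python
import numpy as np

# Assumed illustrative block values and multiplicities (not from the paper).
lams = np.array([0.2, 0.1, 0.05])
mults = np.array([1, 2, 3])

rest = np.repeat(lams, mults)                 # the vector pi_{-0}
pi0 = 1.0 - rest.sum()                        # remaining probability of the omitted bin
info = np.diag(rest) - np.outer(rest, rest)
eigvals = np.sort(np.linalg.eigvalsh(info))[::-1]

def h(x):
    """Secular function whose zeros are the simple eigenvalues."""
    return 1.0 + np.sum(mults * lams**2 / (x - lams))

# Separate eigenvalues that coincide with a block value from the simple ones.
is_block = np.array([np.min(np.abs(ev - lams)) < 1e-10 for ev in eigvals])
repeated = eigvals[is_block]
simple = eigvals[~is_block]
residuals = np.array([abs(h(ev)) for ev in simple])

print("pi0:", pi0)
print("block eigenvalues :", repeated)
print("simple eigenvalues:", simple)
print("max |h| at simple eigenvalues:", residuals.max())
```

+
+Because h has poles at the block values, it is only evaluated at the simple eigenvalues; a root finder bracketed between consecutive block values would be an alternative way to obtain them.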
+
+2.5. Total Positivity and Local Mixing
+
+Mixture modelling is an exemplar of a major area of statistics in which computational information
+geometry enables distinctive methodological progress. The −1-convex hull of an exponential family is
+of great interest, mixture models being widely used in many areas of statistical science. In particular,
+they are explored further in [? ]. Here, we simply state the main result, a simple consequence of the
+total positivity of exponential families [? ], that, generically, convex hulls are of maximal dimension. In
+this result, “generic” means that the +1 tangent vector which defines the exponential family has
+components that are all distinct.
+
+Theorem 1. The −1-convex hull of an open subset of a generic one-dimensional exponential family is of full
+dimension.
+
+Proof. For any (πi) ∈ Δk with each πi > 0, θ0 < · · · < θk and s0 < · · · < sk, let B = (π(θ0), ..., π(θk))
+have general element:
+πi(θj) := πi exp[siθj − ψ(θj)].
+
+Further, let $\tilde{B} = B - \pi(\theta_0) 1_{k+1}^{T}$, whose general column is $\pi(\theta_j) - \pi(\theta_0)$. Then, it suffices to show that
+$\tilde{B}$ has rank k. However, using [? ] (p. 33), $\mathrm{Rank}(\tilde{B}) = \mathrm{Rank}(B) - 1$, so that:
+
+\[
+\mathrm{Rank}(\tilde{B}) = k \iff B \text{ is nonsingular} \iff B^{*} \text{ is nonsingular},
+\]
+
+where $B^{*} = (\exp[s_i\theta_j])$. It suffices, then, to recall [? ] that $K(x, y) = \exp(xy)$ is strictly totally positive (of
+order ∞), so that $\det B^{*} > 0$.
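+
+The two facts used in the proof can be illustrated on a small instance. The values of s, θ and the uniform base point below are assumptions chosen for this sketch (k = 3); the code builds B, its centred version and the kernel matrix B∗, then checks the sign of the determinant and the rank.
+

```python
import numpy as np

# Assumed illustrative values with s_0 < ... < s_k and theta_0 < ... < theta_k (k = 3).
s = np.array([0.0, 0.7, 1.5, 3.1])
theta = np.array([-1.0, 0.0, 0.5, 2.0])
base = np.full(4, 0.25)                      # base point with every pi_i > 0

kernel = np.exp(np.outer(s, theta))          # B* = (exp[s_i theta_j])
unnormalised = base[:, None] * kernel
B = unnormalised / unnormalised.sum(axis=0)  # column j is pi(theta_j)
B_tilde = B - B[:, [0]]                      # subtract pi(theta_0) from every column

det_kernel = np.linalg.det(kernel)
rank_tilde = int(np.linalg.matrix_rank(B_tilde))
print("det B* =", det_kernel)
print("rank of centred matrix =", rank_tilde)
```

+
+A single instance is of course no substitute for the total positivity argument; it only shows what the objects in the proof look like numerically.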
+
+3. Infinite Dimensional Structure
+
+This section will start to explore the question of whether the simplex structure, which describes
+the finite dimensional space of distributions, can extend to the infinite dimensional case. We examine
+some of the differences with the finite dimensional case, illustrating them with clear, commonly
+occurring examples.
+
+3.1. Infinite Dimensional Information Geometry: A Review
+
+In the previous sections, the underlying computational space is always finite dimensional. This
+section looks at issues related to an infinite dimensional extension of the theory in that paper. There
+is a great deal of literature concerning infinite dimensional statistical models. The discussion here
+concentrates on information geometric, parametrisation and boundary issues.
+The information geometry theory of Amari [? ] has a geometric foundation, where statistical
+models (typically full and curved exponential families) have a finite dimensional manifold structure.
+When considering the extension to infinite dimensional cases, Amari notes the problem of finding an
+“adequate topology” [? ] (p. 93). There has been very interesting work following up this topological
+challenge. By concentrating on distributions with a common support, the paper [? ] uses the geometry
+of a Banach manifold, where local patches on the manifold are modelled by Banach spaces, via
+the concept of an Orlicz space. This gives a structure that is analogous to an infinite dimensional
+exponential family, with mean and natural parameters and including the ability to define mixed
+parametrisations. One drawback of this Banach structure, as pointed out in [? ], is that the likelihood
+function with finite samples is not continuous on the manifold. Fukumizu uses a reproducing kernel
+Hilbert space structure rather than a Banach manifold, which is a stronger topology. There are strong
+connections between the approach taken in [? ] and the material in Section ??; we note two issues here:
+(1) a focus on the finite nature of the data; and (2) using a Hilbert structure defined by a cumulant
+generating function. The approaches differ in that [? ] uses a manifold approach rather than the
+simplicial complex as the fundamental geometric object. There is also other work that explicitly used
+infinite dimensional Hilbert spaces in statistics, a good reference being [? ].
+In this paper, in contrast to previous authors, a simplicial, rather than a manifold-based, approach
+is taken. This allows distributions with varying support, as well as closures of statistical families to be
+included in the geometry. Another difference in approach is the way in which geometric structures
+are induced by infinite dimensional affine spaces rather than by using an intrinsic geometry. This
+approach was introduced by [? ] and extended by [? ]. Spaces of distributions are convex subsets of
+the affine spaces, and their closure within the affine space is key to the geometry.
+In exponential families, the −1-affine structure is often called the mean parametrisation, and
+using moments as parameters is one very important part of modelling. In the infinite dimensional
+case, the use of moments as a parameter system is related to the classical moment problem—when
+does there exist a (unique) distribution whose moments agree with a given sequence?—which has
+generated a vast literature in its own right; see [? ? ? ]. In general terms, the existence of a solution
+to the moment problem is connected to positivity conditions on moment matrices. Such conditions
+have been used in connection to the infinite dimensional geometry of mixture models [? ]. Uniqueness,
+however, is a much more subtle problem: sufficient conditions can be formulated in terms of the rate
+of growth of the moments [? ]. Counterexamples to general uniqueness results include the log-normal
+distribution [? ].
+The geometry of the Fisher information is also much more complex in general spaces of
+distributions than in exponential families. Simple mixture models, including two-component mixtures
+of exponential distributions [? ], can have “infinite” expected Fisher information, which gives rise to
+non-standard inference issues. Similar results on infinitely small (and large) eigenvalues of covariance
+operators are also noted in [? ]. Since the Fisher information is a covariance, the fact that it does not
+exist for certain distributions or that its spectrum can be unbounded above or arbitrarily close to zero
+is not a surprise. However, these observations do need to be taken into account when considering the
+information geometry of infinite dimensional spaces.
+The rest of this section looks at the topology and geometry of the infinite dimensional simplex
+and gives some illustrative examples, which, in particular, show the need for specific Hilbert space
+structures, discussed in the final section.
+
+3.2. Topology
+
+For simplicity and concreteness, in this section, we will be looking at models for real valued
+random variables. In this paper, we restrict attention to the cases where the sample space is R+ or R
+and has been discretised to a countably infinite set of bins, Bi, with i ∈ N or Z, respectively. In the
+finite case, the basic object is the standard simplex, Δk, with k + 1 bins. We generalise this to countable
+unions of such objects. Of these, one is of central importance, denoted by Δemp or simply Δ, because it
+is the smallest object that contains all possible empirical distributions.
+
+Definition 2. For any finite subset of bins, indexed by I ⊂ N or Z, denote
+
+\[
+\Delta_I = \left\{ x = (x_i)_{i \in I} : x_i \ge 0, \; \sum_{i \in I} x_i = 1 \right\}.
+\]
+
+We take the union of all such sets, $\bigcup_{|I|<\infty} \Delta_I$, where |I| denotes the number of elements of the index set. This
+can always be written as:
+
+\[
+\Delta = \left\{ x = (x_i)_{i \in \mathbb{Z}} : \sum_{i \in \mathbb{Z}} x_i = 1, \; x_i \ge 0 \text{ and only finitely many } x_i > 0 \right\}.
+\]
+
+In what follows, it is important to note that for any given statistical inference problem, the sample
+size, n, is always finite, even if we frequently use asymptotic approximations, where n → ∞. Thus, the
+data, as represented by the empirical distribution, naturally lie in the space, Δ. However, many models,
+used in the given inference problem, will have support over all bins, so the models most naturally
+lie in the “boundary” constructed using the closures of the set. These objects are subsets of sequence
+spaces, and the corresponding topologies can be constructed from the Banach spaces, ℓp, p ∈ [1, ∞].
+The following results follow directly from explicit calculations, where we note that in this section, since
+all terms are non-negative, convergence always means absolute convergence. In particular, arbitrary
+rearrangements of series do not affect the existence of limits or their values.
+
+Example 6. Consider the sequence of “uniform distributions” $x^{(n)} = (\frac{1}{n}, \ldots, \frac{1}{n}, 0, \ldots)$ as elements of Δ. This
+has an ℓp limit of the zero sequence for p ∈ (1, ∞].
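+
+The behaviour in Example 6 can be seen directly from the norms: the ℓp norm of the n-point uniform sequence is n^(1/p − 1), which tends to zero for p > 1 but equals one for p = 1. The short sketch below, with hypothetical helper names, stores only the n non-zero entries, since trailing zeros do not change any of the norms.
+

```python
import numpy as np

def uniform_on_first_n(n):
    """Non-zero part of the sequence (1/n, ..., 1/n, 0, ...)."""
    return np.full(n, 1.0 / n)

def lp_norm(x, p):
    """The l_p norm of a finitely supported sequence, with p = inf allowed."""
    if np.isinf(p):
        return float(np.max(np.abs(x)))
    return float(np.sum(np.abs(x) ** p) ** (1.0 / p))

for n in (10, 1000, 100000):
    x = uniform_on_first_n(n)
    print(n, lp_norm(x, 1), lp_norm(x, 2), lp_norm(x, np.inf))
```

+
+The ℓ1 column stays at one while the other columns shrink, which is why the ℓ1 closure keeps total mass one and the other closures do not.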
+
+Proposition 1. The ℓp extreme points of Δ, for p ∈ (1, ∞], are the zero sequence and the sequences, δi (i ∈ Z),
+with one as the i-th element and zero elsewhere.
+
+For p ∈ [1, ∞], let Δp ⊂ ℓp denote the ℓp closure of Δ.
+
+Theorem 2. (a) Δ1 = {x = (xi)i∈Z : xi ≥ 0, ∑i∈Z xi = 1} .
+(b) Δ∞ = {x = (xi)i∈Z : xi ≥ 0, ∑i∈Z xi ≤ 1} .
+(c) For p ∈ (1, ∞), Δp = Δ∞ = {x = (xi)i∈Z : xi ≥ 0, ∑i∈Z xi ≤ 1} .
+
+Proof. (a) It is immediate that $\{x = (x_i)_{i \in \mathbb{Z}} : x_i \ge 0, \sum_{i \in \mathbb{Z}} x_i = 1\} \subseteq \Delta_1$. Conversely, if $\bar{x}$ is a limit
+point, then all its elements must be non-negative. Finally, if $\sum_{i=1}^{\infty} \bar{x}_i$ is not bounded above by one,
+then there exists N, such that $\sum_{i=1}^{N} \bar{x}_i > 1 + \epsilon$ for some $\epsilon > 0$. Hence,
+
+\[
+\sum_{i=1}^{\infty} |\bar{x}_i - x_i^{(n)}| \ge \sum_{i=1}^{N} |\bar{x}_i - x_i^{(n)}| \ge \sum_{i=1}^{N} \bar{x}_i - \sum_{i=1}^{N} x_i^{(n)} > \epsilon
+\]
+
+for all n, which contradicts convergence. If $\sum_{i=1}^{\infty} \bar{x}_i < 1 - \epsilon$, then
+
+\[
+\sum_{i=1}^{\infty} |\bar{x}_i - x_i^{(n)}| \ge \sum_{i=1}^{\infty} x_i^{(n)} - \sum_{i=1}^{\infty} \bar{x}_i > \epsilon,
+\]
+
+which again contradicts convergence.
+(b) It is again immediate that $\{x = (x_i)_{i \in \mathbb{Z}} : x_i \ge 0, \sum_{i \in \mathbb{Z}} x_i = 1\} \subseteq \Delta_\infty$. However, by Example 6,
+the zero sequence is also in $\Delta_\infty$, so that $\{x = (x_i)_{i \in \mathbb{Z}} : x_i \ge 0, \sum_{i \in \mathbb{Z}} x_i \le 1\} \subseteq \Delta_\infty$.
+Conversely, by contradiction, it is easy to see that all elements of the closure must have
+non-negative elements. Finally, for any $\bar{x} \in \Delta_\infty$, if $\sum_{i=1}^{\infty} \bar{x}_i$ is not bounded above by one, there exists N, such
+that $\sum_{i=1}^{N} \bar{x}_i > 1 + \epsilon$ for some $\epsilon > 0$. For any sequence of points, $x^{(n)}$ in Δ, we have that $\sum_{i=1}^{N} x_i^{(n)} \le 1$,
+so that, for $i = 1, \ldots, N$, the maximum value of $|x_i^{(n)} - \bar{x}_i| > \epsilon/N$. Hence, for all sequences, $x^{(n)}$, we
+have $\|x^{(n)} - \bar{x}\|_\infty > \epsilon/N$, which contradicts $\bar{x}$ being in the closure.
+(c) This follows essentially the same argument as (b) by noting in the case where $\sum_{i=1}^{\infty} \bar{x}_i$ is not
+bounded above by one, we have:
+
+\[
+\|x^{(n)} - \bar{x}\|_p^p \ge \sum_{i=1}^{N} |\bar{x}_i - x_i^{(n)}|^p \ge N \max_{i=1,\ldots,N} |x_i^{(n)} - \bar{x}_i|^p > N^{1-p}\epsilon^p
+\]
+
+for any sequences, $x^{(n)}$, which contradicts $\bar{x}$ being in the closure.
+
+It is immediate that the spaces, Δ and Δ1, are convex subsets of ℓ1 and that Δ∞ is a convex set in
+ℓ∞.
+
+3.3. Geometry
+
+In the same way as for the finite case, the −1-geometry can be defined using an affine space
+structure using the following definition.
+
+Definition 3. Let I be a countable index set which is a subset of Z. The −1-affine space structure over
+distributions is (Xmix, Vmix, +), where:
+
+\[
+X_{mix} = \left\{ x = (x_i) \;\middle|\; \sum_{I} x_i = 1, \; \sum_{I} |x_i| < \infty \right\}, \qquad
+V_{mix} = \left\{ v = (v_i) \;\middle|\; \sum_{I} v_i = 0, \; \sum_{I} |v_i| < \infty \right\},
+\]
+
+and x + v = (xi + vi).
+
+In order to define the +1-geometric structure, we also follow the approach used in the finite case.
+Initially, to understand the +1- structure, consider the case where all distributions have a common
+support, i.e., assume πi > 0 for all i. We follow here the approach of [? ].
+
+Definition 4. Consider the set of non-negative measures on N or Z and the equivalence relation defined by:
+
+{ai} ∼ {bi} ⇐⇒ ∃λ > 0 s.t. ∀i ai = λbi.
+
+The equivalence classes of this relation are the points in the +1-geometry.
+These points can be further partitioned into sets with the same support, i.e., supp(< a >) = {i : ai > 0},
+where this is clearly well-defined.
+
+On sets of +1-points with the same support, we can define the +1-geometry in the same way as
+in the finite case. With “exp” connoting an exponential family distribution, we have:
+
+
+Entropy 2014, 16, 2454–2471
+
+Definition 5. For a given index set, I, define Xexp to be all +1-points whose support equals I, and define the
+vector space Vexp = {vi, i ∈ I} with the operation, ⊕, given by:
+
+\[ \langle x_i \rangle \oplus v_i = \langle x_i \exp(v_i) \rangle . \]
+
+Under this operation, Xexp is an affine space over Vexp. The +1-affine structure is then defined by (Xexp, Vexp, ⊕).
+
+Theorem 3. If a and b lie in Δ (or Δ1) and have the same support, then $C(\rho) = \sum_i a_i^{\rho} b_i^{(1-\rho)} < \infty$ for
+ρ ∈ [0, 1]. Hence, $a_i^{\rho} b_i^{(1-\rho)} / C(\rho) \in \Delta$ (or Δ1).
+
+Proof. Since a, b are absolutely convergent, the sequence, max(ai, bi), is also. Since we have:
+
+\[ 0 \le \min(a_i, b_i) \le a_i^{\rho} b_i^{1-\rho} \le \max(a_i, b_i), \]
+
+it follows that C(ρ) < ∞, and we have the result.
+
+This result shows that sets in Δ1 with the same support are +1-convex, just as the faces in the
+finite case are.
+
+3.4. Examples
+
+In order to get a sense of how the +1-geometry works, let us consider a few illustrative examples.
+
+Example 7. If we denote the discretised standard normal density by a and the discretised Cauchy density by b
+and consider the path:
+
+\[ \frac{a_i^{\rho} b_i^{(1-\rho)}}{C(\rho)}, \]
+
+the normalising constant is shown in Figure ??. We see that at ρ = 0 (the Cauchy distribution), we have that
+the derivative of the normalising constant (i.e., the mean of the sufficient statistic) is tending to infinity. At the
+other end (ρ = 1), the model can be extended in the sense that the distribution exists for values greater than one.
+
+Figure 6. Normalising constant for normal-Cauchy exponential mixing example.
+
+Thus, in this example, the path joining the two distributions is an extended, rather than natural,
+exponential family, since we have to include the boundary point where the mean is unbounded.
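+
+The normalising constant in Example 7 can be explored numerically. The Python sketch below is only an illustration: the truncated integer grid, the unit bin width and the midpoint discretisation are assumptions made here, not choices taken from the paper, and the truncation removes the heavy Cauchy tail that drives the unbounded derivative at ρ = 0.
+

```python
import math


def discretised(density, half_width=30, step=1.0):
    """Bin probabilities on a symmetric truncated grid (assumed layout), renormalised to sum to one."""
    grid = [k * step for k in range(-half_width, half_width + 1)]
    weights = [density(x) * step for x in grid]
    total = sum(weights)
    return [w / total for w in weights]


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cauchy_pdf(x):
    return 1.0 / (math.pi * (1.0 + x * x))


def normalising_constant(a, b, rho):
    """C(rho) = sum_i a_i^rho * b_i^(1 - rho) for two distributions with a common support."""
    return sum((ai ** rho) * (bi ** (1.0 - rho)) for ai, bi in zip(a, b))


a = discretised(normal_pdf)   # discretised standard normal
b = discretised(cauchy_pdf)   # discretised Cauchy
for rho in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(rho, normalising_constant(a, b, rho))
```

+
+On this grid C(0) and C(1) both equal one up to rounding, and the interior values stay below one, as the bound used in the proof of Theorem 3 requires.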
+
+Example 8. Let us return to Example ??, but now without the censoring. Thus, now, there is a countably
+infinite set of bins, and so, we can investigate its embedding in the infinite simplex. As discussed in [? ], we shall
+discretise the continuous distribution by computing the probabilities associated to bins [ci, ci+1], i = 1, 2, · · · .
+For the exponential model, Exp(θ), the bin probabilities are simply:
+
+πi(θ) = exp(−θci) − exp(−θci+1).
+
+Using this, the model will lie in the infinite simplex on the positive half line with the index set I = N.
+First, consider the case where we have a uniform choice of discretisation, where cn = n × ϵ for some fixed,
+ϵ > 0. In this case, the bin probabilities can be written as an exponential family:
+
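+\[ \pi_n(\theta) = \exp\left\{ -\theta \epsilon n + \log\left(1 - e^{-\theta \epsilon}\right) \right\} \]
+
+for θ > 0. This gives a +1-geodesic through {πi(θ0)} in the direction {ϵ × n} of the form:
+
+\[ \pi_n(\theta_0) \exp\left\{ -\lambda \epsilon n + \log\left( \frac{1 - e^{-(\lambda+\theta_0)\epsilon}}{1 - e^{-\theta_0 \epsilon}} \right) \right\} \qquad (5) \]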
+
+for λ > −θ0. In the case where λ → −θ0, the limiting distribution is the zero measure in Δ∞, and at the
+other extreme, where λ → ∞, the limiting distribution is the atomic distribution in the first bin, a distribution
+with a different support than πi(θ0). However, unlike the finite case, there is no guarantee that, for a given
+“direction”, {ti}, there exists a +1-geodesic starting at {πi(θ0)}, since we require the convergence of the
+normalising constant:
+
+\[ \sum_{i=0}^{\infty} \pi_i(\theta_0) \exp(\lambda t_i) < \infty. \]
+
+From this example, we see that the limit points of exponential families can lie in the space, Δ∞,
+but not in Δ1. The next example shows that limits do not have to exist at all.
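+
+The algebra in Example 8 can be checked with a short Python sketch. It assumes the uniform discretisation c_n = n × ϵ with bins indexed from n = 0; the function names are hypothetical and chosen only for this illustration.
+

```python
import math


def bin_probability(theta, n, eps):
    """pi_n(theta) = exp(-theta * c_n) - exp(-theta * c_{n+1}) with c_n = n * eps."""
    return math.exp(-theta * n * eps) - math.exp(-theta * (n + 1) * eps)


def bin_probability_expfam(theta, n, eps):
    """The same probability written in exponential-family form."""
    return math.exp(-theta * eps * n + math.log(1.0 - math.exp(-theta * eps)))


def geodesic_point(theta0, lam, n, eps):
    """Bin n of the +1-geodesic in Equation (5); only defined for lam > -theta0."""
    if lam <= -theta0:
        raise ValueError("the normalising constant diverges unless lam > -theta0")
    log_ratio = math.log(
        (1.0 - math.exp(-(lam + theta0) * eps)) / (1.0 - math.exp(-theta0 * eps))
    )
    return bin_probability(theta0, n, eps) * math.exp(-lam * eps * n + log_ratio)


theta0, lam, eps = 1.0, 0.5, 0.1
for n in range(5):
    # Moving along the geodesic by lam lands on the model at theta0 + lam.
    print(n, geodesic_point(theta0, lam, n, eps), bin_probability(theta0 + lam, n, eps))
```

+
+The two printed columns agree, which reflects that the geodesic through πn(θ0) in the direction {ϵ × n} stays inside the discretised exponential model.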
+
+Example 9. Consider the family whose bin probabilities, πi ∈ Δ∞, are proportional to a discretised standard
+normal with bins of constant width. The exponential family, which is proportional to πi exp(θi), does not have
+an ℓ∞ limit, as it is discretised normal with mean θ. The natural parameter space here is (−∞, ∞).
+
+The last illustrative example is from [? ] and shows that even for simple models, the Fisher
+information for the parameters of interest need not be finite.
+
+Example 10. Let us consider a simple example of a two-component mixture of (discretised) exponential distributions:
+
+\[ (1 - \rho)\,\pi_n(\theta_0 + \lambda) + \rho\,\pi_n(\theta_0). \qquad (6) \]
+
+The tangent vector in the ρ-direction is:
+
+\[ \pi_n(\theta_0) - \pi_n(\theta_0 + \lambda) = \pi_n(\theta_0) \left( 1 - e^{-\lambda \epsilon n} C \right) \]
+
+for a positive constant, C. The corresponding squared length, with respect to the Fisher information, is:
+
+\[ \sum_{n=0}^{\infty} \left( 1 - e^{-\lambda \epsilon n} C \right)^2 \pi_n(\theta_0). \]
+
+As an example, consider θ0 = 1; then, this term will be infinite for λ ≤ −0.5.
+
+3.5. Hilbert Space Structures
+
+Following these examples, we can consider the Hilbert space structure of exponential families
+inside the infinite simplex with the following results.
+
+Definition 6. Define the function, S(·), by $S(\{v_i\}, \pi) = \sup_{\theta} \{\theta \mid \sum_I \pi_i \exp(\theta v_i) < \infty\}$, the function being
+set to ∞ when the set is unbounded. Furthermore, define for a given $\{\pi_i\} \in \bar{\Delta}_\infty$, the set:
+
+\[ V(\pi) = \{\{v_i\} \mid S(\{v_i\}, \pi) > 0\}, \quad \text{and} \quad V_c(\pi) = \{\{v_i\} \mid \pm\{v_i\} \in V(\pi)\}. \]
+
+The spaces, Vc(π), correspond to the directions in which the +1-geodesic and, so, the
+corresponding exponential families are well-defined and have particularly “nice” geometric structures.
+
+Theorem 4. For π, define a Hilbert space by:
+
+\[ H(\pi) := \Big\{ \{v_i\} \,\Big|\, \sum v_i^2 \pi_i < \infty \Big\} \]
+
+with inner product:
+
+\[ \langle \{v_i\}, \{w_i\} \rangle_{\pi} = \sum v_i w_i \pi_i, \]
+
+and corresponding norm $\|\cdot\|_{\pi}$. Under these conditions:
+(i) Vc(π) is a subspace of H(π), and
+(ii) the set V(π) is a convex cone.
+
+Proof. (i) First, if {vi} ∈ Vc(π), then by definition, the moment generating function:
+
+\[ \sum \exp(\theta v_i)\,\pi_i \]
+
+is finite for θ in an open set containing θ = 0. Hence, we have both:
+
+\[ \sum v_i \pi_i < \infty, \quad \text{and} \quad \sum v_i^2 \pi_i < \infty. \]
+
+Thus, {vi} ∈ H(π). The fact that it is a subspace follows from (ii) below.
+(ii) It is immediate that V(π) is a cone.
+Convexity follows from the Cauchy–Schwarz inequality, since for all $\{v_i\}, \{v_i^*\} \in V(\pi)$ and
+λ ∈ [0, 1], it follows that:
+
+\[ \Big( \sum \pi_i e^{\frac{\theta}{2}(\lambda v_i + (1-\lambda) v_i^*)} \Big)^2 = \Big( \sum \big( \sqrt{\pi_i}\, e^{\frac{\theta}{2}\lambda v_i} \big) \big( \sqrt{\pi_i}\, e^{\frac{\theta}{2}(1-\lambda) v_i^*} \big) \Big)^2 \le \Big( \sum \pi_i e^{\theta \lambda v_i} \Big) \Big( \sum \pi_i e^{\theta (1-\lambda) v_i^*} \Big), \]
+
+and, so, is finite for a strictly positive value of θ; hence $\{\lambda v_i + (1 - \lambda) v_i^*\} \in V(\pi)$.
+
+Hence, this result illustrates the point above regarding the existence of “nice” geometric structure
+in the sense of Amari’s information geometry developed for finite dimensional exponential families.
+Infinite dimensional families have a richer structure; for example, they include the possibility of having
+an infinite Fisher information; see Examples ?? and ??.
+
+Acknowledgments: The authors would like to thank Karim Anaya-Izquierdo and Paul Vos for many helpful
+discussions and the UK’s Engineering and Physical Sciences Research Council (EPSRC) for the support of grant
+number EP/E017878/.
+
+Author Contributions: All authors contributed to the conception and design of the study, the collection and
+analysis of the data and the discussion of the results. All authors read and approved the final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1.
+Efron, B. Defining the curvature of a statistical problem (with applications to second order efficiency).
+Ann. Stat. 1975, 3, 1189–1242.
+2.
+Kass, R.E.; Vos, P.W. Geometrical Foundations of Asymptotic Inference; John Wiley & Sons: London, UK, 1997.
+3.
+Amari, S.-I. Differential-Geometrical Methods in Statistics; Lecture Notes in Statistics; Springer-Verlag Inc.:
+New York, NY, USA, 1985; Volume 28.
+4.
+Anaya-Izquierdo, K.; Critchley, F.; Marriott, P.; Vos, P. Computational Information Geometry: Foundations.
+In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer Science;
+Springer: Berlin/Heidelberg, Germany, 2013; Volume 8085, pp. 311–318.
+5.
+Anaya-Izquierdo, K.; Critchley, F.; Marriott, P.; Vos, P. Computational Information Geometry: Mixture
+Modelling. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer
+Science; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8085, pp. 319–326.
+6.
+Anaya-Izquierdo, K.; Critchley, F.; Marriott, P. When are first order asymptotics adequate? A diagnostic. Stat
+2014, 3, 17–22.
+7.
+Murray, M.K.; Rice, J.W. Differential Geometry and Statistics; Chapman & Hall: London, UK, 1993.
+8.
+Marriott, P. On the local geometry of mixture models. Biometrika 2002, 89, 95–97.
+9.
+Hand, D.J.; Daly, F.; Lunn, A.D.; McConway, K.J.; Ostrowski, E. A Handbook of Small Data Sets; Chapman and
+Hall: London, UK, 1994.
+10.
+Bryson, M.C.; Siddiqui, M.M. Survival times: Some criteria for aging. J. Am. Stat. Assoc. 1969, 64, 1472–1483.
+11.
+Marriott, P.; West, S. On the geometry of censored models. Calcutta Stat. Assoc. Bull. 2002, 52, 567–576.
+12.
+Geiger, D.; Heckerman, D.; King, H.; Meek, C. Stratified exponential families: Graphical models and model
+selection. Ann. Stat. 2001, 29, 505–529.
+13.
+Edelsbrunner, H. Algorithms in Combinatorial Geometry; Springer-Verlag: New York, NY, USA, 1987.
+14.
+Barndorff-Nielsen, O.E.; Cox, D.R. Asymptotic Techniques for Use in Statistics; Chapman & Hall: London, UK, 1989.
+15.
+McCullagh, P. Tensor Methods in Statistics; Chapman & Hall: London, UK, 1987.
+16.
+Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985.
+17.
+Karlin, S. Total Positivity; Stanford University Press: Stanford, CA, USA, 1968; Volume I.
+18.
+Householder, A.S. The Theory of Matrices in Numerical Analysis; Dover Publications: Dover, DE, USA, 1975.
+19.
+Pistone, G.; Rogantin, M.P. The exponential statistical manifold: Mean parameters, orthogonality and space
+transformations. Bernoulli 1999, 5, 571–760.
+20.
+Fukumizu, K. Infinite dimensional exponential families by reproducing kernel Hilbert spaces. In Proceedings
+of the 2nd International Symposium on Information Geometry and its Applications, Tokyo, Japan,
+12–16 December 2005.
+21.
+Small, C.G.; McLeish, D.L. Hilbert Space Methods in Probability and Statistical Inference; John Wiley & Sons:
+London, UK, 1994.
+22.
+Akhiezer, N.I. The Classical Moment Problem; Hafner: New York, NY, USA, 1965.
+23.
+Stoyanov, J.M. Counter Examples in Probability; John Wiley & Sons: London, UK, 1987.
+24.
+Gut, A. On the moment problem. Bernoulli 2002, 8, 407–421.
+25.
+Lindsay, B.G. Moment matrices: Applications in mixtures. Ann. Stat. 1989, 17, 722–740.
+26.
+Li, P.; Chen, J.; Marriott, P. Non-finite Fisher information and homogeneity: An EM approach. Biometrika
+2009, 96, 411–426.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+Using Geometry to Select One Dimensional
+Exponential Families That Are Monotone Likelihood
+Ratio in the Sample Space, Are Weakly Unimodal and
+Can Be Parametrized by a Measure of
+Central Tendency
+
+Paul Vos 1 and Karim Anaya-Izquierdo 2,*
+
+1 Department of Biostatistics, East Carolina University, Greenville, NC 27858, USA; E-Mail: vosp@ecu.edu
+2 Department of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK
+*
+E-Mail: kai21@bath.ac.uk; Tel: +44-1225-384644
+
+Received: 30 April 2014; in revised form: 30 June 2014 / Accepted: 14 July 2014 /
+Published: 18 July 2014
+
+Abstract: One dimensional exponential families on finite sample spaces are studied using the
+geometry of the simplex $\Delta^{\circ}_{n-1}$ and that of a transformation $V_{n-1}$ of its interior. This transformation is
+the natural parameter space associated with the family of multinomial distributions. The space Vn−1
+is partitioned into cones that are used to find one dimensional families with desirable properties for
+modeling and inference. These properties include the availability of uniformly most powerful tests
+and estimators that exhibit optimal properties in terms of variability and unbiasedness.
+
+Keywords: simplex; cone; exponential family; monotone likelihood ratio; unimodal; duality
+
+1. Introduction
+
+The motivation for the constructions in this paper begins with a sample from a one dimensional
+space that is discrete. We allow for a continuous sample space but assume that this has been suitably
+discretized into n bins. The simplest underlying structure for the probability assigned to these bins is
+given by the multinomial distribution. The collection of all multinomial distributions can be identified
+with the n − 1 simplex Δn−1. We use the geometry of the simplex along with a transformation of its
+interior $\Delta^{\circ}_{n-1}$ to search for one dimensional subspaces that have good properties for modeling and for
+inference. In particular, we want families that can be parameterized by the mean, have only unimodal
+distributions, have desirable test characteristics (such as providing uniformly most powerful unbiased
+tests) and estimation properties (such as unbiasedness and small variability).
+The boundary of the (n − 1) dimensional simplex Δn−1 can be written as the union of simplexes
+of dimension (n − 2). This process can be repeated on the simplexes of lower dimension until the
+boundary consists of the vertices of the original simplex. This construction has statistical relevance
+to the possible supports for the probability distributions considered on the n bins. We obtain a dual
+decomposition for a transformation Vn−1 (defined in Equation (5) in Section 5) of $\Delta^{\circ}_{n-1}$; it is dual in
+that the result can be obtained by replacing simplexes with cones. The statistical relevance of the
+conical decomposition is to the possible modes for all the distributions on the n bins. Since Vn−1 is
+the natural parameter space for the distributions in $\Delta^{\circ}_{n-1}$, one dimensional exponential families are
+lines in Vn−1 and these can be related to the cones that partition Vn−1. One result is that the limiting
+distribution for any one dimensional exponential family in $\Delta^{\circ}_{n-1}$ is the uniform distribution whose
+support is determined by the cone that contains the limiting values of the line corresponding to the
+exponential family.
+
+Entropy 2014, 16, 4088–4100; doi:10.3390/e16074088
+www.mdpi.com/journal/entropy
+
+While one parameter exponential families can be defined quite generally by choosing a sufficient
+statistic, it can be useful to start with the sufficient statistics from well-known families such as the
+binomial, Poisson, negative binomial, normal, inverse Gaussian, and Gamma distribution. These
+exponential families have good modeling and inferential properties that we try to maintain by limiting
+the extent to which the sufficient statistic is modified. These restrictions lead to considering vectors in
+Vn−1 that lie in a cone. Examples of how to construct these cones are given.
+
+2. Motivating Examples
+
+One dimensional exponential families such as the binomial or Poisson are the workhorse of
+parametric inference because of their excellent statistical properties. However, being one dimensional
+means they do not always fit data very well so an extension to a two (or higher) dimensional
+exponential family can be pursued in order to preserve the nice inferential structure. An issue
+with such extension is that, for each extra natural parameter added, we need to choose a new sufficient
+statistic and this choice can substantially change the shape of the corresponding density functions. For
+example, densities can pass from being unimodal to having multiple modes for some parameter values.
+To see this, consider the following examples.
+
+Example 1. Altham [1] considered the so-called multiplicative generalization of the binomial distribution with
+corresponding density
+
+\[ f(x; p, \varphi) = \binom{n}{x} p^x (1 - p)^{n-x} \varphi^{x(x-n)} / C(p, \varphi) \qquad (1) \]
+
+where C is the normalizing constant and where clearly the binomial is recovered when φ = 1.
+By reparametrizing using θ1 = log(p/(1 − p)) and θ2 = log(φ) this density can be expressed in
+exponential form as
+f (x; θ1, θ2) = h(x) exp(θ1 x + θ2 T(x) − K(θ1, θ2))
+(2)
+
+where T(x) = x(x − n) is the added sufficient statistic and $h(x) = \binom{n}{x}$, where dependence on n has been ignored.
+Note that the same family is obtained if $T(x) = x^2$ is added as a sufficient statistic instead of x(x − n).
+If n = 127 and (θ1, θ2) = (−0.0122, 0.018) then density (2) is bimodal as shown in the left panel of Figure 1.
+The mean μ of this distribution is 50. Also plotted is the corresponding binomial density with the same mean
+or equivalently with θ1 = log(50/(127 − 50)) = −0.4318 and θ2 = 0.
+
+Figure 1. Binomial density (thick in both panels). Multiplicative binomial density (left panel and thin)
+and double binomial density (right panel and thin). All densities have the same mean μ = 50 and
+n = 127. Variance of the multiplicative and double binomial densities is equal.
+
+As explained by Lovison [2], this distribution has the feature of being under- or over-dispersed with
+respect to the binomial depending on θ2 being negative or positive, respectively. Furthermore, using the mixed
+parametrization (μ, θ2) (see [3] for details) it is easy to see that this distribution can be parametrized so that one
+parameter controls dispersion independently of the mean. In fact, for a fixed mean μ, f (x; θ1, θ2)
+tends to a two point distribution (with support points at the extremes x = 0 and x = n) as θ2 → ∞ and to a degenerate
+distribution on x = μ as θ2 → −∞.
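+
+The bimodality described in Example 1 can be examined with a small Python sketch. It evaluates density (2) with h(x) equal to the binomial coefficient and T(x) = x(x − n), normalising numerically in log space; the helper names are hypothetical and the mode-counting rule (strict local maxima of the probability mass function) is an assumption of this illustration.
+

```python
import math


def two_parameter_pmf(n, theta1, theta2, T):
    """Density (2) with h(x) = C(n, x), normalised numerically in log space."""
    log_w = [
        math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
        + theta1 * x + theta2 * T(x)
        for x in range(n + 1)
    ]
    shift = max(log_w)  # subtract the maximum to avoid overflow
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def local_modes(pmf):
    """Values x whose probability strictly exceeds that of each existing neighbour."""
    modes = []
    for x, p in enumerate(pmf):
        left = pmf[x - 1] if x > 0 else -1.0
        right = pmf[x + 1] if x + 1 < len(pmf) else -1.0
        if p > left and p > right:
            modes.append(x)
    return modes


n = 127
pmf = two_parameter_pmf(n, -0.0122, 0.018, lambda x: x * (x - n))
mean = sum(x * p for x, p in enumerate(pmf))
print("local modes:", local_modes(pmf))
print("mean:", mean)
```

+
+Setting theta2 to zero in the same function recovers a binomial density, which gives a convenient unimodal comparison.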
+
+Example 2. Double exponential families [4] are two parameter exponential families that extend standard
+unidimensional exponential families such as the binomial and the Poisson. Similar to the multiplicative binomial
+in Example 1, the extra parameter involved in double exponential families controls the variance independently of
+the mean. The density for the so-called double binomial family can be written in the form (2) with
+
+\[ T(x) = x \log\left( \frac{x}{n} \right) + (n - x) \log\left( 1 - \frac{x}{n} \right), \]
+
+$h(x) = \binom{n}{x}$ and with the particular restriction that θ2 < 1 (see [4] for details). The range θ2 < 0 generates
+underdispersion and θ2 ∈ [0, 1) generates overdispersion with respect to the binomial. As shown on the right
+panel of Figure 1, the double binomial density can also be multimodal where the double binomial density shown
+has the same mean and variance as the multiplicative binomial shown in the left panel.
+
+These examples show that while extending exponential families can lead to useful modeling
+properties such as overdispersion, the extension can also result in distributions that are not suitable
+for modeling. We are interested in the relationship between geometric properties of one dimensional
+families and the modeling properties of their distributions.
+
+3. Sample Space and Distribution-valued Random Variables
+
+We consider first the general case where the sample space for a single observation X1 consists of
+n bins
+Sn = {B1, B2, . . . , Bn−1, Bn} .
+
+We consider the space of all probability distributions P on this sample space Sn. Each probability
+distribution in P is defined by the n-tuple p whose ith component is
+
+pi = Pr(Bi)
+
+so that P can be identified with the n − 1 simplex
+
+Δn−1 = {p ∈ Rn : pi ≥ 0 ∀i, 1′p = 1}
+
+where 1 in 1′p is the vector 1 ∈ Rn each of whose components is 1. We will slightly abuse the notation
+by using p to name a point in Δn−1, and hence in Rn, as well as the corresponding distribution in P.
+The sample space for a random sample of size N from a distribution p0 ∈ Δn−1 is
+
+\[ \mathcal{X}_n^N = \{x : x \text{ is an } n \text{ vector of nonnegative integers that sum to } N\}. \]
+
+There is a simple relationship between $\mathcal{X}_n^N$ and the simplex that we obtain by dividing each component
+of x by N. Although the sample space $\mathcal{X}_n^N$ can be viewed as formed by compositional data, we will
+follow a different approach to handle this kind of data compared with the classical approach described
+by Aitchison [5] because the data we consider have additional structure.
+In Figure 2 the sample space for the sample of size N = 10 is displayed using open circles. The
+vertices correspond to the case where all 10 values fall in a single bin. The other points correspond
+to the less extreme cases. Let p0 be any point in Δn−1. By mapping the multinomial random variable
+of counts X to Δn−1, we obtain the random distribution $\hat{P} = X/N$ whose values are multinomial
+distributions each having number of cases N and probability vector X/N. Identifying $\mathcal{X}_n^N$-valued
+random variables with distribution-valued random variables provides a natural means for comparing
+data with probability models using the Kullback–Leibler (KL) divergence.
+We can compare distributions in Δn−1 using the KL divergence $D : P \times P \to \mathbb{R}$
+
+\[ D(p_1, p_2) = \sum p_1 \log(p_1/p_2) = H(p_1, p_2) - H(p_1) \]
+
+where $H(p_1, p_2) = -\sum p_1 \log(p_2)$ and $H(p_1) = H(p_1, p_1)$ is the entropy of p1. Note that the arguments
+to D and H are distributions while the logarithm and ratios are defined on points in Rn. Following Wu
+and Vos [6], the variance of the random distribution $\hat{P}$ is defined to be
+
+\[ \mathrm{Var}_{p_0}(\hat{P}) = \min_{p \in \Delta_{n-1}} E_{p_0} D(\hat{P}, p) \]
+
+and its mean is defined to be
+
+\[ E_{p_0}(\hat{P}) = \arg\min_{p \in \Delta_{n-1}} E_{p_0} D(\hat{P}, p). \]
+
+Note that the expectations on the right hand side of the equations above are for real-valued random
+variables while the expectation on the left hand side of the second equation is for a distribution-valued
+random variable.
+
+Figure 2. Simplex for n = 3 bins and sample space for N = 10 observations.
+
+It is not difficult to show that $E_{p_0} \hat{P} = p_0$ so that $\hat{P}$ can be considered an unbiased estimator for
+p0. Details are in [6], which also shows that the KL risk can be decomposed into bias-squared and
+variance terms:
+
+\[ E_{p_0} D(\hat{P}, q) = D(p_0, q) + \mathrm{Var}_{p_0}(\hat{P}). \]
+
+The distributional variance is related to the entropy
+
+\[ \mathrm{Var}_{p_0}(\hat{P}) = E_{p_0} D(\hat{P}, p_0) = H(p_0) - E_{p_0} H(\hat{P}). \]
+
+Note that for N = 1, $H(\hat{P}) = 0$ so that for a single observation the random distribution $\hat{P}$ taking values
+on the vertices of Δn−1 has variance equal to the entropy of p0.
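+
+The decomposition of the KL risk and the entropy identity above can be verified by brute-force enumeration for a small multinomial. The Python sketch below is an illustration only: the probability vectors and the sample size are arbitrary choices made here, and the helper names are hypothetical.
+

```python
import math
from itertools import product


def kl(p, q):
    """D(p, q) = sum_i p_i log(p_i / q_i), using the convention 0 log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def multinomial_outcomes(p0, N):
    """Yield (x / N, probability) for every count vector x that sums to N."""
    for x in product(range(N + 1), repeat=len(p0)):
        if sum(x) != N:
            continue
        coef = math.factorial(N)
        for xi in x:
            coef //= math.factorial(xi)
        prob = coef * math.prod(pi ** xi for pi, xi in zip(p0, x))
        yield [xi / N for xi in x], prob


def expected_kl(p0, N, q):
    """E_{p0} D(P_hat, q), computed exactly over the finite sample space."""
    return sum(prob * kl(p_hat, q) for p_hat, prob in multinomial_outcomes(p0, N))


def distribution_variance(p0, N):
    """Var_{p0}(P_hat) = E_{p0} D(P_hat, p0)."""
    return expected_kl(p0, N, p0)


p0 = [0.5, 0.3, 0.2]   # assumed true distribution
q = [0.2, 0.5, 0.3]    # assumed comparison distribution
N = 4
print(expected_kl(p0, N, q), kl(p0, q) + distribution_variance(p0, N))
```

+
+The two printed numbers agree up to rounding, and with N = 1 the function distribution_variance returns the entropy of p0.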
+For inference, p0 is unknown but we specify a subspace M ⊂ Δn−1 that contains p0, or at
+least has distributions that are not too different from p0. Estimates can be obtained by choosing a
+parameterization for M, say θ, and then considering real-valued functions $\hat{\theta}$ and evaluating these in
+terms of bias and variance. Bias and variance are useful descriptions when θ describes a feature of
+the distribution that is of inherent interest. However, if θ is simply a parameterization, or if there are
+other features that are also of interest, then these quantities are less useful. For inference regarding the
+distribution p0 we can use a distribution-valued estimator $\hat{P}_M$ where the subscript indicates that the
+estimator is defined to account for the fact that p0 ∈ M.
+We will not pursue the details of distribution-valued estimators here; we mention these only
+because all the subspaces we consider will be exponential families and in this case the maximum
+likelihood estimator has important properties in terms of distribution variance and distribution bias:
+when M is an exponential family, the maximum likelihood estimator is distribution unbiased, and it
+uniquely minimizes the distribution variance among the class of all distribution unbiased estimators.
+Furthermore, when p0 ∉ M then the maximum likelihood estimator is the unique unbiased minimum
+distribution variance estimator of the distribution in M that is closest (in terms of KL) to p0. Extensions
+of one dimensional exponential families that do not result in exponential families will not enjoy these
+properties of maximum likelihood estimation. Details of these results that hold for sample spaces more
+general than Sn are in [7].
+
+4. Simplices Δs
+
+One dimensional exponential families on Sn are curves in Δn−1 whose properties will depend on
+their location within various subspaces of Δn−1. An important collection of subspaces will be indexed
+by the subsets of Sn. For notational convenience we take Bi to be the integer i. Using integers is suggestive
+of an ordering and a scale structure but at this point these are only being used to indicate distinct bins.
+For each s ⊂ Sn,
+
+\[ \Delta_s = \{ p \in \mathbb{R}^n : p_i \ge 0 \ \forall i \in s,\ p_i = 0 \ \forall i \in s^c,\ 1'p = 1 \} \]
+
+where $s^c = \{i \in S_n : i \notin s\}$. Note that $\Delta_{S_n} = \Delta_{n-1}$. The interior of Δs is
+
+\[ \Delta_s^{\circ} = \{ p \in \Delta_s : p_i > 0 \ \forall i \in s \}. \]
+
+As probability distributions in P, $\Delta_s^{\circ}$ corresponds to the set of all distributions having support s. There
+is a simple and obvious relationship between the dimension of Δs, |Δs|, and the cardinality of s, |s|,
+which holds for all nonempty s ⊂ Sn
+
+\[ |\Delta_s| + 1 = |s|. \]
+
+The boundary of Δs is defined as
+
+\[ \partial\Delta_s = \{ p \in \Delta_s : p \notin \Delta_s^{\circ} \} \]
+
+so that
+
+\[ \Delta_s = \Delta_s^{\circ} \uplus \partial\Delta_s \]
+
+where ⊎ indicates the sets in the union are disjoint. The boundary ∂Δs can be written as the union of
+all simplices of dimension one less than that of Δs
+
+\[ \partial\Delta_s = \bigcup_{s' : s' \subset s,\ |s'| = |s| - 1} \Delta_{s'} \qquad (3) \]
+
+This boundary property for Δs holds because the simplex Sn consists of all possible subsets. Each
+nonempty s ∈ Sn specifies one of the possible supports for distribution P ∈ Pn
+
+\[ \Delta_s = \bigcup_{s' : s' \subset s} \Delta_{s'}^{\circ} \qquad (4) \]
+
+where we set Δ∅ = ∅.
+
+5. Cones Λs
+
+The set of all nonempty subsets of the sample space provides a partition of Δn−1 based on the
+support of the distributions in P. The elements in the partition are simplices whose dimension is one
+less than the cardinality of the indexing set. In most cases we will consider models having support
+Sn, that is, models corresponding to $\Delta^{\circ}_{n-1}$. If we use subsets s to define the mode rather than support,
+we obtain a partition of P◦, the distributions in P having support Sn. This partition can be expressed
+using convex cones in an n − 1 dimensional plane Vn−1. The dimensions of the cones are n minus the
+cardinality of the indexing set and the relationship between interiors of cones and their boundaries is
+analogous to that for simplices expressed in Equations (3) and (4).
+Let
+
+\[ V_{n-1} = \{ v \in \mathbb{R}^n : 1'v = 0 \} \qquad (5) \]
+
+be the subspace of Rn of dimension n − 1 of all vectors that sum to zero. For each nonempty s ∈
+Sn define
+
+\[ \Lambda_s = \{ v \in V_{n-1} : v_i \ge v_j \ \forall i \in s,\ \forall j \in S_n \}. \]
+
+It is easily checked that Λs is a convex cone
+
+\[ v_1, v_2 \in \Lambda_s \implies a_1 v_1 + a_2 v_2 \in \Lambda_s \quad \forall a_1, a_2 \in [0, \infty). \]
+
+The dimension of Λs is |Λs| = n − |s| since each point j ∈ sc provides a basis vector bj whose ith
+component is 1 if i ∈ s or i = j and is zero otherwise and |sc| = n − |s|. The interior of Λs is
+
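+\[ \Lambda_s^{\circ} = \{ v \in \Lambda_s : v_i > v_j \ \forall i \in s,\ \forall j \in s^c \}, \]
+
+the boundary is
+
+\[ \partial\Lambda_s = \{ v \in \Lambda_s : v \notin \Lambda_s^{\circ} \}, \]
+
+so that
+
+\[ \Lambda_s = \Lambda_s^{\circ} \uplus \partial\Lambda_s \]
+
+by definition. Note $\Lambda_{S_n} = \Lambda_{S_n}^{\circ} = 0 \in V_{n-1} \subset \mathbb{R}^n$ where the first equality holds because the conditions
+in the definition of $\Lambda_s^{\circ}$ hold vacuously since $i \in S_n^c = \emptyset$ adds no restriction. Likewise, we can extend
+the definition of Λs to include s = ∅ and since i ∈ ∅ adds no restriction
+
+\[ \Lambda_{\emptyset} = \Lambda_{\emptyset}^{\circ} = V_{n-1}. \]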
+
+Note that Λ∅ depends on the cardinality of the set Sn. Since we are considering n fixed, we will not
+show this dependence in the notation.
+Corresponding to Equation (3) we have for all nonempty s that the boundary of the cone Λs is the
+union of all cones having dimension one less than the dimension of Λs
+
+\[ \partial\Lambda_s = \bigcup_{s' : s \subset s',\ |s'| = |s| + 1} \Lambda_{s'}. \qquad (6) \]
+
+Corresponding to Equation (4) we have
+
+\[ \Lambda_s = \bigcup_{s' : s \subset s'} \Lambda_{s'}^{\circ} \qquad (7) \]
+
+The relationship between the simplices Δ and cones Λ is more easily seen if we suppress the
+sets that index these objects. Let Δ and Δ∗ be any two simplices and let Λ and Λ∗ be any two convex
+cones. We only consider cones and simplices that correspond to a nonempty subset of Sn. Then
+Equations (6) and (7) for the convex cones are obtained by simply replacing Δ in Equations (3) and (4)
+with Λ:
+
+\[ \partial\Delta = \bigcup_{\Delta_* : |\Delta_*| = |\Delta| - 1} \Delta_*, \qquad \partial\Lambda = \bigcup_{\Lambda_* : |\Lambda_*| = |\Lambda| - 1} \Lambda_* \qquad (8) \]
+
+\[ \Delta = \bigcup_{\Delta_* \subset \Delta} \Delta_*^{\circ}, \qquad \Lambda = \bigcup_{\Lambda_* \subset \Lambda} \Lambda_*^{\circ} \qquad (9) \]
+
+Equation (9) also holds for the empty set since Δ∅ = ∅ and Λ∅ = Vn−1.
+
+6. Vn−1 and P◦
+
+There is a natural bijection φ between Vn−1 and $\Delta^{\circ}_{n-1}$ defined by
+
+\[ \varphi(p) = \log(p) - m(p) 1 \]
+
+where log(p) is the vector with ith component log(pi) and m(p) is defined so that 1′φ(p) = 0. The
+inverse is
+
+\[ \phi(v) = k^{-1}(v) \exp(v) \]
+
+where exp(v) is the vector with ith component exp(vi) and k(v) is defined so that 1′ϕ(v) = 1.
+Each cone $\Lambda_s^{\circ}$ in the partition
+
+\[ V_{n-1} = \bigcup \Lambda_s^{\circ} \]
+
+corresponds to one of the $2^n - 1$ possible modes for any distribution having support Sn since vi > vj if
+and only if ϕi(v) > ϕj(v).
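+
+The bijection and its inverse are simple to compute. The Python sketch below is a minimal illustration with hypothetical function names: to_natural plays the role of φ, to_simplex the role of ϕ, and mode_set returns the subset s that indexes the cone containing a vector (ties are detected with a small tolerance, which is an assumption of this sketch).
+

```python
import math


def to_natural(p):
    """phi(p) = log(p) - m(p) 1, centred so that the components sum to zero."""
    logs = [math.log(pi) for pi in p]
    m = sum(logs) / len(logs)
    return [value - m for value in logs]


def to_simplex(v):
    """Inverse map exp(v) / (1' exp(v)); the shift by max(v) only guards against overflow."""
    shift = max(v)
    w = [math.exp(vi - shift) for vi in v]
    k = sum(w)
    return [wi / k for wi in w]


def mode_set(values, tol=1e-12):
    """1-based labels of the bins at which the vector attains its maximum."""
    top = max(values)
    return {i + 1 for i, x in enumerate(values) if top - x <= tol}


p = [0.2, 0.5, 0.3]
v = to_natural(p)
print(v, sum(v))
print(to_simplex(v))
print(mode_set(p), mode_set(v))
```

+
+Because the map preserves the ordering of the components, the mode set of p and the index set of the cone containing φ(p) coincide.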
+
+7. Vn−1 and Exponential Families in P◦
+
+We define a line by a pair of vectors v0, v1 ∈ Vn−1 with v1 ≠ 0
+
+\[ \ell = \ell(t) = \{ v \in V_{n-1} : v = v_0 + t v_1,\ t \in \mathbb{R} \} \]
+
+Note that v0 and v1 are not unique. Applying the inverse transformation ϕ to points in ℓ gives
+probability densities
+
+\[ \phi(v_0 + t v_1) = \frac{\exp(v_0 + t v_1)}{1' \exp(v_0 + t v_1)} \qquad (10) \]
+
+which have the exponential family form with t playing the role of the natural parameter. Therefore,
+the space Vn−1 is easily recognized as the natural parameter space for the distributions $\Delta^{\circ}_{n-1}$ so that
+each line ℓ in Vn−1 corresponds to a one dimensional exponential family.
+For each line ℓ(t) there is a value tmax such that {ℓ(t) : t ≥ tmax} is contained in one of the cones
+$\Lambda_s^{\circ}$ where s is the subset of Sn with the property that $v_1^i \ge v_1^j$ for all i ∈ s for vectors $v_1 \in \Lambda_x^{\circ}$. For each
+line ℓ(t) there is a value tmin such that {ℓ(t) : t ≤ tmin} is contained in one of the cones $\Lambda_{s'}^{\circ}$ where s′ is
+the subset of Sn with the property that $v_1^i \le v_1^j$ for all i ∈ s′ for vectors $v_1 \in \Lambda_x^{\circ}$. The cones $\Lambda_s^{\circ}$ and $\Lambda_{s'}^{\circ}$
+are disjoint and will be called the extremal cones for ℓ. There is at least one other cone $\Lambda_{s''}^{\circ}$ such that
+$\ell \cap \Lambda_{s''}^{\circ} \ne \emptyset$.
+Any one dimensional exponential family ℓ(t) can be described by an ordered sequence of
+disjoint cones
+
+\[ \left( \Lambda_{s_1}^{\circ}, \Lambda_{s_2}^{\circ}, \dots, \Lambda_{s_k}^{\circ} \right) \]
+
+where k = k(ℓ) will depend on the family. These are simply the cones that are traversed by ℓ(t)
+between its extremal cones. We take $\Lambda_{s_k}^{\circ}$ to be the cone that contains ℓ(t) for all sufficiently large t.
+Equation (6) for cones means that
+
+∂Λsi ⊂ Λsj for j = i + 1 or j = i − 1
+
+The ordered sequence of cones provides an ordered sequence of unique subsets of Sn
+
+(s1, s2, . . . , sk)
+
+that we call the modal profile for ℓ as these are the modes realized by the exponential family ℓ(t) between
+its extremal cones that have modes s1 and sk.
+Each point on a line ℓ(t) in Vn−1 corresponds to a distribution having support Sn. As t goes to −∞
+(+∞) ϕ(ℓ(t)) goes to a distribution having support s1 (sk). In fact, these are the uniform distribution
+on these supports. For every s ⊂ Sn other than ∅ and Sn, the uniform distribution on s is a limiting
+distribution for some one dimensional exponential family in P◦.
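+
+A modal profile can be computed exactly for a given line by probing the parameter values at which two coordinates tie, together with points between and beyond those values. The Python sketch below is one possible way to do this under stated assumptions: the inputs are rational (so that ties are detected exactly with Fraction arithmetic), and the function names are hypothetical rather than taken from the paper.
+

```python
from fractions import Fraction
from itertools import combinations


def mode_set(v):
    """1-based labels of the coordinates at which v attains its maximum."""
    top = max(v)
    return tuple(i + 1 for i, x in enumerate(v) if x == top)


def modal_profile(v0, v1):
    """Ordered mode sets visited by the line v0 + t * v1 as t increases over the reals."""
    v0 = [Fraction(x) for x in v0]
    v1 = [Fraction(x) for x in v1]
    # Parameter values at which coordinates i and j of the line are equal.
    ties = sorted({(v0[j] - v0[i]) / (v1[i] - v1[j])
                   for i, j in combinations(range(len(v0)), 2)
                   if v1[i] != v1[j]})
    if not ties:
        probes = [Fraction(0)]
    else:
        probes = [ties[0] - 1]
        for a, b in zip(ties, ties[1:]):
            probes += [a, (a + b) / 2]
        probes += [ties[-1], ties[-1] + 1]
    profile = []
    for t in probes:
        s = mode_set([a + t * b for a, b in zip(v0, v1)])
        if not profile or profile[-1] != s:
            profile.append(s)
    return profile


v0 = [Fraction(-1, 3), Fraction(2, 3), Fraction(-1, 3)]  # sums to zero
v1 = [-1, 0, 1]                                          # sums to zero
print(modal_profile(v0, v1))
```

+
+For this choice of v0 and v1 the first and last entries of the profile are the singletons {1} and {3}, so the limiting distributions as t goes to −∞ and +∞ are the point masses on the first and last bins.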
+Figure 3 shows Vn−1 for the two dimensional simplex shown in Figure 2. The three rays are the
+one dimensional cones and the spaces between these cones are the two dimensional cones. The origin
+is the zero dimensional cone. The sample values on the boundary of Δ2 are not in V2. Note that the
+one dimensional cones are line segments in Δ2.
+
+
+Figure 3. V2 for n = 3 bins and sample space for N = 10 observations that are in the interior of Δ2.
+
+8. Ordered Bins and the Monotone Likelihood Ratio Property
+
+Let the bins be ordered and assign the first n integers to the bins to reflect this ordering. We seek
+to define exponential families that have a modal profile of the form
+
+({1} , {1, 2} , {2} , {2, 3} , . . . , {n − 1, n} , {n})
+(11)
+
+or a contiguous sub-collection of this profile. Extensions to three or more contiguous modes are clearly
+possible but not discussed here.
+From the definition of modal profile, it follows that a family with modal profile (11) will have the
+property that the mode is a non-decreasing function of t. In addition to this property for the mode, we
+want the likelihood ratio for any two members of the family to provide the same ordering structure
+as that of the bins. A family that satisfies this condition is said to have the monotone likelihood ratio
+property with respect to x where x takes the values of the bin labels: 1, 2, . . . , n. Let $p_{\theta_1}$ and $p_{\theta_2}$ be
+two distributions in a one dimensional family parameterized by θ and let $p_{\theta_2}/p_{\theta_1}$ be the n-vector with
+components $p^j_{\theta_2}/p^j_{\theta_1}$ for 1 ≤ j ≤ n. This family has monotone likelihood ratio if for all $\theta_1 < \theta_2$ and
+$j < j'$
+
+\[ \frac{p^{j}_{\theta_2}}{p^{j}_{\theta_1}} < \frac{p^{j'}_{\theta_2}}{p^{j'}_{\theta_1}}. \]
+
+A family with this property avoids the problem situation where in general the data in the higher
+numbered bins are evidence for pθ2 but in going from a particular bin, say j0 to j0 + 1, the likelihood
+ratio actually decreases. Exponential families such as the binomial and Poisson have this monotone
+likelihood ratio property for the bin labels. The monotone likelihood ratio property can be extended
+to allow for likelihood ratios that are monotone in some function of x. An important advantage of
+families with the monotone likelihood ratio property is the existence of uniformly most powerful tests.
+To ensure that our exponential families have the monotone likelihood ratio property we consider
+vectors in the cone $\Lambda^\uparrow \subset \Lambda_n$,
+
+\[ \Lambda^\uparrow = \left\{ v : v^i < v^j,\ i < j \right\}. \]
+
+From Equation (10), the exponential family indexed by θ is $k(\theta) \exp(v_0 + \theta v_1)$, giving
+
+\[ \frac{p^{j}_{\theta_2}}{p^{j}_{\theta_1}} = \frac{k(\theta_2)}{k(\theta_1)} \exp\left[ (\theta_2 - \theta_1)\, v_1^j \right] \]
+
+so that the likelihood ratio is monotone in j if $v_1 \in \Lambda^\uparrow$.
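+
+As a worked check of this claim (a sketch that uses only the definitions above), take two bins $j < j'$ and
+parameters $\theta_1 < \theta_2$. The normalizing constants cancel when the two likelihood ratios are compared:
+
+\[
+\frac{p^{j'}_{\theta_2} / p^{j'}_{\theta_1}}{p^{j}_{\theta_2} / p^{j}_{\theta_1}}
+= \exp\left[ (\theta_2 - \theta_1)\,(v_1^{j'} - v_1^{j}) \right] > 1
+\quad \Longleftrightarrow \quad v_1^{j'} > v_1^{j},
+\]
+
+which holds for every pair $j < j'$ exactly when $v_1 \in \Lambda^\uparrow$.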
+
+9. Selecting Vectors in Λ↑
+
+In order to choose n-dimensional vectors $v \in \Lambda^\uparrow$ we will consider a set of infinite dimensional
+vectors f. Let $\bar f : \mathbb{R} \to \mathbb{R}$ and consider $f = \bar f|_{\mathbb{Z}}$, where $\mathbb{Z}$ is the set of integers. The function f is
+represented by a doubly infinite sequence
+
+\[ f = \ldots, f^{j-1}, f^{j}, f^{j+1}, \ldots \]
+
+and we denote the set of all such functions as
+
+\[ F = \left\{ f : f^j \in \mathbb{R}\ \ \forall j \in \mathbb{Z} \right\}. \]
+
+While it is not necessary to consider functions $\bar f$ to define f, these functions are useful to describe
+properties of f, which can be thought of as a discretized version of $\bar f$.
+Define the gradient of f as the function $\nabla f$ whose jth component is
+
+\[ (\nabla f)^j = f^j - f^{j-1}. \]
+
+The simplest functions in F are the constant functions
+
+\[ F_0 = \left\{ f \in F : f^j = f^{j'}\ \ \forall j, j' \in \mathbb{Z} \right\}. \]
+
+The next simplest functions are those whose gradient is constant. We call these first order functions
+and denote the set of these as
+
+\[ F_1 = \left\{ f \in F : \nabla f \in F_0 \right\}. \]
+
+
+Functions in $F_1$ are such that the change from one bin to the next is the same for all bins. That is,
+these functions describe constant change. We can write the functions in $F_1$ explicitly as
+
+\[ F_1 = \left\{ f \in F : f^j = aj + b,\ a, b \in \mathbb{R} \right\}, \]
+
+which shows that each $f \in F_1$ is the discretized version of a function $\bar f$ whose graph is a line in $\mathbb{R} \times \mathbb{R}$.
+We obtain a vector v from f by defining the jth component of v as
+
+\[ v^j = f^j - \frac{1}{n} \sum_{i=1}^{n} f^i. \]
+
+From this definition we see that the intercept b of f does not affect v and that the slope is a scaling
+factor, so that the restriction to first order functions results in a single direction in $\Lambda^\uparrow$. This direction
+spans the one dimensional cone generated by the vector with $v^j = j - (n + 1)/2$.
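+
+To make the last step explicit (a short check, assuming that the centering subtracts the average of
+$f^1, \ldots, f^n$ so that the components of v sum to zero), take $f^j = aj + b$:
+
+\[
+v^j = aj + b - \frac{1}{n} \sum_{i=1}^{n} (ai + b)
+    = aj + b - a\,\frac{n+1}{2} - b
+    = a \left( j - \frac{n+1}{2} \right).
+\]
+
+The intercept b cancels and the slope a only rescales the vector. For example, for n = 3 and a = 1 the
+vector is v = (−1, 0, 1).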
+Additional directions can be obtained from the second order functions
+
+\[ F_2 = \left\{ f \in F : \nabla f \in F_1 \right\}. \]
+
+If $f \in F_2$ then $(\nabla^2 f)^j = a$ for some $a \in \mathbb{R}$ and for all $j \in \mathbb{Z}$. Using the fact that
+
+\[ (\nabla^2 f)^j = (\nabla(\nabla f))^j = (f^j - f^{j-1}) - (f^{j-1} - f^{j-2}) = f^j + f^{j-2} - 2 f^{j-1}, \]
+
+the second order functions can be written explicitly as
+
+\[ F_2 = \left\{ f \in F : f^j = \frac{a}{2}\, j(j+1) + bj + c,\ a, b, c \in \mathbb{R} \right\}. \]
+
+In order for the vector v obtained from $f \in F_2$ to be in $\Lambda^\uparrow$ we need $(\nabla f)^j \ge 0$ for j = 1, 2, . . . , n.
+With $f^j = (a/2) j(j+1) + bj + c$ we have $(\nabla f)^j = aj + b$, so that for a > 0 we require b ≥ −a and for
+a < 0 we require b ≥ −an. Since we are concerned with the direction rather than the magnitude we
+can take a = ±1, and the value of c is chosen so the sum of the components is zero.
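+
+The gradient formula and the two constraints can be verified directly (a worked check that uses only the
+definitions above). With $f^j = \frac{a}{2} j(j+1) + bj + c$,
+
+\[ (\nabla f)^j = f^j - f^{j-1} = \frac{a}{2}\, j \left[ (j+1) - (j-1) \right] + b = aj + b. \]
+
+Since $aj + b$ is monotone in j, its minimum over j = 1, . . . , n is attained at an endpoint:
+
+\[
+a > 0:\ \min_j (aj + b) = a + b \ge 0 \iff b \ge -a;
+\qquad
+a < 0:\ \min_j (aj + b) = an + b \ge 0 \iff b \ge -an.
+\]
+
+Taking a = 1 with the boundary value b = −1 gives $f^j = \frac{1}{2} j(j+1) - j + c$, and taking a = −1 with the
+boundary value b = n gives $f^j = -\frac{1}{2} j(j+1) + nj + c$.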
+The second order vectors in $\Lambda^\uparrow$ consist of the cone defined by the vectors $v_{20}$ and $v_{21}$ having
+components defined by
+
+\[ (n - 1)(v_{20})^j = \frac{1}{2}\, j(j+1) - j - c_{20}, \]
+
+\[ (n - 1)(v_{21})^j = -\frac{1}{2}\, j(j+1) + nj - c_{21}. \]
+
+Notice that this cone contains $v_1$ since $v_1$ is proportional to $v_{20} + v_{21}$. Many discrete one dimensional
+exponential families (e.g., binomial, negative binomial, and Poisson) use the vector $v_1$. Furthermore,
+many continuous one dimensional exponential families use the continuous function $\bar f$ used to define
+$v_1$: normal with σ known, and the gamma and inverse Gaussian distributions with known shape
+parameter (the shape parameter is the non-scale parameter). The cone defined by $v_{20}$ and $v_{21}$ allows us
+to perturb the $v_1$ direction to obtain related exponential families that we would expect to have similar
+properties. Figure 4 shows $v_{20}$ and $v_{21}$ as well as $v_1 = 0.5 v_{20} + 0.5 v_{21}$.
+Other vectors can be used to define cones around $v_1$. Looking at common exponential families we
+see that $\log(x)$ and $x^{-1}$ are sufficient statistics, which suggests taking $\bar f(x) = \log(x)$ or $\bar f(x) = 1/x$.
+These can be further generalized to $\bar f(x; \lambda)$, which can be the power family or some other family of
+transformations. The vectors $v_{f0}$ and $v_{f1}$ are defined using the discretized f with the constraints that
+$v_{f0}, v_{f1} \in \Lambda^\uparrow$ and $0.5 v_{f0} + 0.5 v_{f1} = v_1$.
+
+
+An exponential family with sufficient statistic x can be modified by choosing a function $\bar f(x)$ and
+0 ≤ α ≤ 1, where α = 0.5 corresponds to the original exponential family and other values perturb this
+direction. We denote this vector as $v_{f\alpha}$ so that $v_0 + t v_{f\alpha}$ is the natural parameter of the modified family.
+Figure 4 shows the components of the vectors $v_{20}$ and $v_{21}$.
+
+
+Figure 4. Components of the vectors v20 and v21 for n = 128 bins.
+
+Since $v_0$ is common to each exponential family with natural parameter $\ell(t) = v_0 + t v_{f\alpha}$, the
+monotone likelihood ratio property will hold even if $v_0 \notin \Lambda^\uparrow$. Initial choices for $v_0$ are suggested by
+the Poisson, binomial, and negative binomial distributions:
+
+\[ (v_{\mathrm{Poisson}})^j = -\log \Gamma(j) + c \ \notin \Lambda^\uparrow \]
+
+\[ (v_{\mathrm{binomial}})^j = \log \Gamma(n) - \log \Gamma(j) - \log \Gamma(n - j) + c \ \notin \Lambda^\uparrow \]
+
+\[ (v_{\mathrm{neg.bin.}})^j = \log \Gamma(j + r) - \log \Gamma(j) + c \ \in \Lambda^\uparrow \]
+
+where c is a constant chosen so that the components sum to 1, n is the number of bins, and r is a
+positive real constant.
+
+Author Contributions: This paper was initiated by the first author but all sections reflect a collaborative effort.
+Both authors have read and approved the final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+
+entropy
+
+Article
+On the Fisher Metric of Conditional Probability Polytopes
+
+Guido Montúfar 1,*, Johannes Rauh 1 and Nihat Ay 1,2,3
+
+1 Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig 04103, Germany; E-Mails:
+jrauh@mis.mpg.de (J.R.); nay@mis.mpg.de (N.A.)
+2 Department of Mathematics and Computer Science, Leipzig University, PF 10 09 20, Leipzig 04009, Germany
+3 Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
+* E-Mail: montufar@mis.mpg.de; Tel.: +49-341-9959-521.
+
+Received: 31 March 2014; in revised form: 18 May 2014 / Accepted: 29 May 2014 /
+Published: 6 June 2014
+
+Abstract: We consider three different approaches to define natural Riemannian metrics on polytopes
+of stochastic matrices. First, we define a natural class of stochastic maps between these polytopes and
+give a metric characterization of Chentsov type in terms of invariance with respect to these maps.
+Second, we consider the Fisher metric defined on arbitrary polytopes through their embeddings as
+exponential families in the probability simplex. We show that these metrics can also be characterized
+by an invariance principle with respect to morphisms of exponential families. Third, we consider
+the Fisher metric resulting from embedding the polytope of stochastic matrices in a simplex of joint
+distributions by specifying a marginal distribution. All three approaches result in slight variations
+of products of Fisher metrics. This is consistent with the nature of polytopes of stochastic matrices,
+which are Cartesian products of probability simplices. The first approach yields a scaled product of
+Fisher metrics; the second, a product of Fisher metrics; and the third, a product of Fisher metrics
+scaled by the marginal distribution.
+
+Keywords: Fisher information metric; information geometry; convex support polytope; conditional
+model; Markov morphism; isometric embedding; natural gradient
+
+1. Introduction
+
+The Riemannian structure of a function’s domain has a crucial impact on the performance of
+gradient optimization methods, especially in the presence of plateaus and local maxima. The natural
+gradient [1] gives the steepest increase direction of functions on a Riemannian space. For example,
+artificial neural networks can often be trained by following some function’s gradient on a space of
+probabilities. In this context, it has been observed that following the natural gradient with respect to
+the Fisher information metric, instead of the Euclidean metric, can significantly alleviate the plateau
+problem [1,2]. The Fisher information metric, which is also called Shahshahani metric [3] in biological
+contexts, is broadly recognized as the natural metric of probability spaces. An important argument
+was given by Chentsov [4], who showed that the Fisher information metric is the only metric on
+probability spaces for which certain natural statistical embeddings, called Markov morphisms, are
+isometries. More generally, Chentsov’s Theorem characterizes the Fisher metric and α-connections of
+statistical manifolds uniquely (up to a multiplicative constant) by requiring invariance with respect
+to Markov morphisms. Campbell [5] gave another proof that characterizes invariant metrics on the
+set of non-normalized positive measures, which restrict to the Fisher metric in the case of probability
+measures (up to a multiplicative constant). In this paper, we explore ways of defining distinguished
+Riemannian metrics on spaces of stochastic matrices.
+
+Entropy 2014, 16, 3207–3233; doi:10.3390/e16063207
+www.mdpi.com/journal/entropy
+
+In learning theory, when modeling the policy of a system, it is often preferred to consider
+stochastic matrices instead of joint probability distributions. For example, in robotics applications,
+policies are optimized over a parametric set of stochastic matrices by following the gradient of a
+reward function [6,7]. The set of stochastic matrices can be parametrized in many ways, e.g., in terms
+of feedforward neural networks, Boltzmann machines [8] or projections of exponential families [9].
+The information geometry of policy models plays an important role in these applications and has
+been studied by Kakade [2], Peters and co-workers [10–12], and Bagnell and Schneider [13], among
+others. A stochastic matrix is a tuple of probability distributions, and therefore, the space of stochastic
+matrices is a Cartesian product of probability simplices. Accordingly, in applications, usually a product
+metric is considered, with the usual Fisher metric on each factor. On the other hand, Lebanon [14]
+takes an axiomatic approach, following the ideas of Chentsov and Campbell, and characterizes a class
+of invariant metrics of positive matrices that restricts to the product of Fisher metrics in the case of
+stochastic matrices. We will consider three different approaches discussed in the following.
+In the first part, we take another look at Lebanon’s approach for characterizing a distinguished
+metric on polytopes of stochastic matrices. However, since the maps considered by Lebanon do not
+map stochastic matrices to stochastic matrices, we will use different maps. We show that the product
+of Fisher metrics can be characterized by an invariance principle with respect to natural maps between
+stochastic matrices.
+In the second part, we consider an approach that allows us to define Riemannian structures
+on arbitrary polytopes. Any polytope can be identified with an exponential family by using the
+coordinates of the polytope vertices as observables. The inverse of the moment map then defines
+an embedding of the polytope in a probability simplex. This embedding can be used to pull back
+geometric structures from the probability simplex to the polytope, including Riemannian metrics,
+affine connections, divergences, etc. This approach has been considered in [9] as a way to define
+low-dimensional families of conditional probability distributions. More general embeddings can be
+defined by identifying each exponential family with a point configuration, B, together with a weight
+function, ν. Given B and ν, the corresponding exponential family defines geometric structures on the
+set (conv B)◦, which is the relative interior of the convex support of the exponential family. Moreover,
+we can define natural morphisms between weighted point configurations as surjective maps between
+the point sets, which are compatible with the weight functions. As it turns out, the Fisher metric on
+(conv B)◦ can be characterized by invariance under these maps.
+In the third part, we return to stochastic matrices. We study natural embeddings of conditional
+distributions in probability simplices as joint distributions with a fixed marginal. These embeddings
+define a Fisher metric equal to a weighted product of Fisher metrics. This result corresponds to the
+definitions commonly used in robotics applications.
+All three approaches give very similar results. In all cases, the identified metric is a product
+metric. This is a sensible result, since the set of k × m stochastic matrices is a Cartesian product of
+probability simplices $\Delta_{m-1} \times \cdots \times \Delta_{m-1} = \Delta^k_{m-1}$, which suggests using the product metric of the
+Fisher metrics defined on the factor simplices, $\Delta_{m-1}$. Indeed, this is the result obtained from our second
+approach. The first approach yields that same result with an additional scaling factor of 1/k. The two
+approaches differ only when stochastic matrices of different sizes are compared. The third approach
+yields a product of Fisher metrics scaled by the marginal distribution that defines the embedding.
+Which metric to use depends on the concrete problem and whether a natural marginal distribution
+is defined and known. In Section 7, we do a case study using a reward function that is given as an
+expectation value over a joint distribution. In this simple example, the weighted product metric gives
+the best asymptotic rate of convergence, under the assumption that the weights are optimally chosen.
+In Section 8, we sum up our findings.
+The paper is organized as follows. Section 2 contains basic definitions around the
+Fisher metric and concepts of differential geometry. In Section 3, we discuss the theorems of Chentsov,
+Campbell and Lebanon, which characterize natural geometric structures on the probability simplex,
+on the set of positive measures and on the cone of positive matrices, respectively. In Section 4, we
+study metrics on polytopes of stochastic matrices, which are invariant under natural embeddings. In
+Section 5, we define a Riemannian structure for polytopes, which generalizes the Fisher information
+metric of probability simplices and conditional models in a natural way. In Section 6, we study a class of
+weighted product metrics. In Section 7, we study the gradient flow with respect to an expectation value.
+Section 8 contains concluding remarks. In Appendix A, we investigate restrictions on the parameters
+of the metrics characterized in Sections 3 and 4 that make them positive definite. Appendix B contains
+the proofs of the results from Section 4.
+
+2. Preliminaries
+
+We will consider the simplex of probability distributions on $[m] := \{1, \ldots, m\}$, m ≥ 2, which is
+given by $\Delta_{m-1} := \{(p_i)_i \in \mathbb{R}^m : p_i \ge 0, \sum_i p_i = 1\}$. The relative interior of $\Delta_{m-1}$ consists of all strictly
+positive probability distributions on [m], and will be denoted $\Delta^\circ_{m-1}$. This is a subset of $\mathbb{R}^m_+$, the cone
+of strictly positive vectors. The set of k × m row-stochastic matrices is given by
+$\Delta^k_{m-1} := \{(K_{ij})_{ij} \in \mathbb{R}^{k \times m} : (K_{ij})_j \in \Delta_{m-1} \text{ for all } i \in [k]\}$ and is equal to the Cartesian product
+$\times_{i \in [k]} \Delta_{m-1}$. The relative interior $(\Delta^k_{m-1})^\circ$ is a subset of $\mathbb{R}^{k \times m}_+$, the cone of strictly positive matrices.
+Given two random variables X and Y taking values in the finite sets [k] and [m], respectively, the
+conditional probability distribution of Y given X is the stochastic matrix $K = (P(y|x))_{x \in [k], y \in [m]}$ with
+rows $(P(y|x))_{y \in [m]} \in \Delta_{m-1}$ for all x ∈ [k]. Therefore, the polytope of stochastic matrices $\Delta^k_{m-1}$ is called
+a conditional polytope.
+The tangent space of $\mathbb{R}^n_+$ at a point $p \in \mathbb{R}^n_+$, denoted by $T_p\mathbb{R}^n_+$, is the real vector space spanned
+by the vectors $\partial_1, \ldots, \partial_n$ of partial derivatives with respect to the n components. The tangent space of
+$\Delta^\circ_{n-1}$ at a point $p \in \Delta^\circ_{n-1} \subset \mathbb{R}^n_+$ is the subspace $T_p\Delta^\circ_{n-1} \subset T_p\mathbb{R}^n_+$ consisting of the vectors:
+
+\[ u = \sum_i u_i \partial_i \in T_p\mathbb{R}^n_+ \quad \text{with} \quad \sum_i u_i = 0. \tag{1} \]
+
+The Fisher metric on the positive probability simplex $\Delta^\circ_{n-1}$ is the Riemannian metric given by:
+
+\[ g^{(n)}_p(u, v) = \sum_{i=1}^{n} \frac{u_i v_i}{p_i}, \quad \text{for all } u, v \in T_p\Delta^\circ_{n-1}. \tag{2} \]
+
+The same formula (2) also defines a Riemannian metric on $\mathbb{R}^n_+$, which we will denote by the same
+symbol. This, however, is not the only way in which the Fisher metric can be extended from $\Delta^\circ_{n-1}$
+to $\mathbb{R}^n_+$. We will discuss other extensions in the next section (see Campbell’s Theorem, Theorem 2).
+Consider a smoothly parametrized family of probability distributions
+$M = \{(p(x; \theta))_{x \in [n]} : \theta \in \Omega\} \subseteq \Delta^\circ_{n-1}$, where $\Omega \subseteq \mathbb{R}^d$ is open. Then, $g^{(n)}$ induces a Riemannian metric on M. Denote by
+$\partial_{\theta_i} = \frac{\partial}{\partial \theta_i}$ the tangent vector corresponding to the partial derivative with respect to $\theta_i$, for all i ∈ [d].
+Then, the Fisher matrix has coordinates:
+
+\[
+g^{M}_{\theta}(\partial_{\theta_i}, \partial_{\theta_j}) = \sum_{x \in [n]} p(x; \theta)\,
+\frac{\partial \log p(x; \theta)}{\partial \theta_i}\, \frac{\partial \log p(x; \theta)}{\partial \theta_j},
+\quad \text{for all } i, j \in [d], \text{ for all } \theta \in \Omega. \tag{3}
+\]
+
+Here, it is not necessary to assume that the parameters $\theta_i$ are independent. In particular, the dimension
+of M may be smaller than d, in which case the matrix is not positive definite. If the map $\Omega \to M$, $\theta \mapsto p(\cdot\,; \theta)$
+is an embedding (i.e., a smooth injective map that is a diffeomorphism onto its image), then $g^{M}_{\theta}$
+defines a Riemannian metric on Ω, which corresponds to the pull-back of $g^{(n)}$.
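+
+As a minimal illustration of Equation (3) (a standard example that is not taken from the text), consider
+n = 2, d = 1 and the family p(1; θ) = θ, p(2; θ) = 1 − θ with Ω = (0, 1). Then
+
+\[
+g^{M}_{\theta}(\partial_\theta, \partial_\theta)
+= \theta \left( \frac{1}{\theta} \right)^2 + (1 - \theta) \left( \frac{-1}{1 - \theta} \right)^2
+= \frac{1}{\theta} + \frac{1}{1 - \theta}
+= \frac{1}{\theta (1 - \theta)}.
+\]
+
+The same value follows from Equation (2): the tangent vector of the curve θ ↦ (θ, 1 − θ) is u = (1, −1), and
+$g^{(2)}_p(u, u) = 1/\theta + 1/(1 - \theta)$.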
+Consider an embedding f : E → E′. The pull-back of a metric g′ on E′ through f is defined as:
+
+\[ (f^* g')_p(u, v) := g'_{f(p)}(f_* u, f_* v), \quad \text{for all } u, v \in T_pE, \tag{4} \]
+
+where $f_*$ denotes the push-forward of $T_pE$ through f, which in coordinates is given by:
+
+\[
+f_* : T_pE \to T_{f(p)}E'; \quad
+\sum_i u_i \partial_{\theta_i} \mapsto \sum_j \sum_i u_i \frac{\partial f_j(p)}{\partial \theta_i}\, \partial_{\theta'_j}, \tag{5}
+\]
+
+where $\{\partial_{\theta_i}\}_i$ spans $T_pE$ and $\{\partial_{\theta'_j}\}_j$ spans $T_{f(p)}E'$.
+An embedding f : E → E′ of two Riemannian manifolds (E, g) and (E′, g′) is an isometry iff:
+
+\[ g_p(u, v) = (f^* g')_p(u, v), \quad \text{for all } p \in E \text{ and } u, v \in T_pE. \tag{6} \]
+
+In this case, we say that the metric g is invariant with respect to f (and g′).
+
+3. The Results of Campbell and Lebanon
+
+One of the theoretical motivations for using the Fisher metric is provided by Chentsov’s
+characterization [4], which states that the Fisher metric is uniquely specified, up to a multiplicative
+constant, by an invariance principle under a class of stochastic maps, called Markov morphisms. Later,
+Campbell [5] considered the characterization problem on the space $\mathbb{R}^n_+$ instead of $\Delta^\circ_{n-1}$. This simplifies
+the computations, since $\mathbb{R}^n_+$ has a more symmetric parametrization.
+
+Definition 1. Let 2 ≤ m ≤ n. A (row) stochastic partition matrix (or just row-partition matrix) is a matrix
+$Q \in \mathbb{R}^{m \times n}$ of non-negative entries, which satisfies $\sum_{j \in A_{i'}} Q_{ij} = \delta_{ii'}$ for an m block partition $\{A_1, \ldots, A_m\}$ of
+[n]. The linear map defined by:
+
+\[ \mathbb{R}^m_+ \to \mathbb{R}^n_+; \quad p \mapsto p \cdot Q \tag{7} \]
+
+is called a congruent embedding by a Markov mapping of $\mathbb{R}^m_+$ to $\mathbb{R}^n_+$ or just a Markov map, for short.
+
+An example of a 3 × 5 row-partition matrix is:
+
+\[
+Q = \begin{pmatrix}
+1/2 & 0 & 1/2 & 0 & 0 \\
+0 & 1/3 & 0 & 2/3 & 0 \\
+0 & 0 & 0 & 0 & 1
+\end{pmatrix}. \tag{8}
+\]
+
+Markov maps preserve the 1-norm and restrict to embeddings $\Delta^\circ_{m-1} \to \Delta^\circ_{n-1}$.
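+
+For instance (a worked check with the 3 × 5 matrix Q of Equation (8), whose rows are (1/2, 0, 1/2, 0, 0),
+(0, 1/3, 0, 2/3, 0) and (0, 0, 0, 0, 1), and whose blocks are $A_1 = \{1, 3\}$, $A_2 = \{2, 4\}$, $A_3 = \{5\}$), a vector
+$p = (p_1, p_2, p_3) \in \mathbb{R}^3_+$ is mapped to
+
+\[ p \cdot Q = \left( \tfrac{1}{2} p_1,\ \tfrac{1}{3} p_2,\ \tfrac{1}{2} p_1,\ \tfrac{2}{3} p_2,\ p_3 \right), \]
+
+whose entries are strictly positive and sum to $p_1 + p_2 + p_3 = |p|$. In particular, a point of $\Delta^\circ_2$ is
+mapped into $\Delta^\circ_4$.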
+
+Theorem 1 (Chentsov’s Theorem).
+
+• Let $g^{(m)}$ be a Riemannian metric on $\Delta^\circ_{m-1}$ for m ∈ {2, 3, . . .}. Let this sequence of metrics have the
+property that every congruent embedding by a Markov mapping is an isometry. Then, there is a constant
+C > 0 that satisfies:
+
+\[ g^{(m)}_p(u, v) = C \sum_i \frac{u_i v_i}{p_i}. \tag{9} \]
+
+• Conversely, for any C > 0, the metrics given by Equation (9) define a sequence of Riemannian metrics
+under which every congruent embedding by a Markov mapping is an isometry.
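+
+The converse statement can be checked by hand on a small case (a sketch, using C = 1 and the 3 × 5
+row-partition matrix Q with rows (1/2, 0, 1/2, 0, 0), (0, 1/3, 0, 2/3, 0), (0, 0, 0, 0, 1)). The map p ↦ p · Q is
+linear, so tangent vectors are pushed forward as u ↦ u · Q, and
+
+\[
+g^{(5)}_{pQ}(uQ, vQ)
+= 2 \cdot \frac{\tfrac{1}{4} u_1 v_1}{\tfrac{1}{2} p_1}
++ \frac{\tfrac{1}{9} u_2 v_2}{\tfrac{1}{3} p_2}
++ \frac{\tfrac{4}{9} u_2 v_2}{\tfrac{2}{3} p_2}
++ \frac{u_3 v_3}{p_3}
+= \frac{u_1 v_1}{p_1} + \frac{u_2 v_2}{p_2} + \frac{u_3 v_3}{p_3}
+= g^{(3)}_p(u, v).
+\]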
+
+The main result in Campbell’s work [5] is the following variant of Chentsov’s Theorem.
+
+Theorem 2 (Campbell’s Theorem).
+
+• Let $g^{(m)}$ be a Riemannian metric on $\mathbb{R}^m_+$ for m ∈ {2, 3, . . .}. Let this sequence of metrics have the
+property that every embedding by a Markov mapping is an isometry. Then:
+
+\[ g^{(m)}_p(\partial_i, \partial_j) = A(|p|) + \delta_{ij}\, C(|p|)\, \frac{|p|}{p_i}, \tag{10} \]
+
+where $|p| = \sum_{i=1}^{m} p_i$, $\delta_{ij}$ is the Kronecker delta, and A and C are $C^\infty$ functions on $\mathbb{R}_+$ satisfying
+C(α) > 0 and A(α) + C(α) > 0 for all α > 0.
+• Conversely, if A and C are $C^\infty$ functions on $\mathbb{R}_+$ satisfying C(α) > 0, A(α) + C(α) > 0 for all α > 0,
+then Equation (10) defines a sequence of Riemannian metrics under which every embedding by a Markov
+mapping is an isometry.
+
+The metrics from Campbell’s Theorem also define metrics on the probability simplices $\Delta^\circ_{m-1}$ for
+m = 2, 3, . . .. Since the tangent vectors $v = \sum_i v_i \partial_i \in T_p\Delta^\circ_{m-1}$ satisfy $\sum_i v_i = 0$, for any two vectors
+$u, v \in T_p\Delta^\circ_{m-1}$, also $\sum_i \sum_j A u_i v_j = 0$ for any A. In this case, the choice of A is immaterial, and the
+metric becomes Chentsov’s metric.
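+
+Spelled out (a short computation from Equation (10), using |p| = 1 on the simplex):
+
+\[
+g^{(m)}_p(u, v)
+= \sum_{i,j} u_i v_j \left( A(1) + \delta_{ij} \frac{C(1)}{p_i} \right)
+= A(1) \Big( \sum_i u_i \Big) \Big( \sum_j v_j \Big) + C(1) \sum_i \frac{u_i v_i}{p_i}
+= C(1) \sum_i \frac{u_i v_i}{p_i},
+\]
+
+which is Equation (9) with the constant C = C(1).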
+
+Remark 1. Observe that Chentsov’s Theorem is not a direct implication of Campbell’s Theorem. However, it
+can be deduced from it by the following arguments. Suppose that we have a family of Riemannian simplices
+$(\Delta^\circ_{m-1}, g^{(m)})$ for m ∈ {2, 3, . . .}, and suppose that they are isometric with respect to Markov maps. If we can
+extend every $g^{(m)}$ to a Riemannian metric $\tilde g^{(m)}$ on $\mathbb{R}^m_+$ in such a way that the resulting spaces $(\mathbb{R}^m_+, \tilde g^{(m)})$ are
+still isometric with respect to Markov maps, then Campbell’s Theorem implies that $g^{(m)}$ is a multiple of the
+Fisher metric. Such metric extensions can be defined as follows. Consider the diffeomorphism:
+
+\[ \Delta^\circ_{m-1} \times \mathbb{R}_+ \cong \mathbb{R}^m_+, \qquad (p, r) \mapsto r \cdot p. \tag{11} \]
+
+Any tangent vector $u \in T_{(p,r)}\mathbb{R}^m_+$ can be written uniquely as $u = u_p + u_r \partial_r$, where $u_p$ is tangent to $r\Delta^\circ_{m-1}$.
+Since each Markov map f preserves the one-norm | · |, its push-forward $f_*$ maps the tangent vector $\partial_r \in T_{(p,r)}\mathbb{R}^m_+$
+to the corresponding tangent vector $\partial_r \in T_{f(p,r)}\mathbb{R}^m_+$; that is, $f_* u = f_* u_p + u_r \partial_r$. Therefore,
+
+\[ \tilde g^{(m)}_{(p,r)}(u, v) := g^{(m)}_p(u_p, v_p) + u_r v_r \tag{12} \]
+
+is a metric on $\mathbb{R}^m_+$ that is invariant under f.
+
+In what follows, we will focus on positive matrices. In order to define a natural Riemannian
+metric, we can use the identification $\mathbb{R}^{k \times m}_+ \cong \mathbb{R}^{km}_+$ and apply Campbell’s Theorem. This leads to
+metrics of the form:
+
+\[ g^{(k,m)}_M(\partial_{ij}, \partial_{kl}) = A(|M|) + \delta_{ik} \delta_{jl}\, C(|M|) / M_{ij}, \tag{13} \]
+
+where $\partial_{ij} = \partial / \partial M_{ij}$ and $|M| = \sum_{ij} M_{ij}$. However, a disadvantage of this approach is that the action of
+general Markov maps on $\mathbb{R}^{km}_+$ has no natural interpretation in terms of the matrix structure. Therefore,
+Lebanon [14] considered a special class of Markov maps defined as follows.
+
+Definition 2. Consider a k × l row-partition matrix R and a collection of m × n row-partition matrices
+$Q = \{Q^{(1)}, \ldots, Q^{(k)}\}$. The map:
+
+\[ \mathbb{R}^{k \times m}_+ \to \mathbb{R}^{l \times n}_+; \quad M \mapsto R^\top (M \otimes Q) \tag{14} \]
+
+is called a congruent embedding by a Markov morphism of $\mathbb{R}^{k \times m}_+$ to $\mathbb{R}^{l \times n}_+$ in [15]. We will refer to such an
+embedding as a Lebanon map. Here, the row product M ⊗ Q is defined by:
+
+\[ (M \otimes Q)_{ab} = (M \cdot Q^{(a)})_{ab}, \quad \text{for all } a \in [k],\ b \in [n]; \tag{15} \]
+
+that is, the a-th row of M is multiplied by the matrix $Q^{(a)}$.
+
+In a Lebanon map, each row of the input matrix M is mapped by an individual Markov mapping
+$Q^{(i)}$, and each resulting row is copied and scaled by an entry of R. This kind of map preserves the
+sum of all matrix entries. Therefore, with the identification $\mathbb{R}^{k \times m}_+ \cong \mathbb{R}^{km}_+$, each Lebanon map restricts
+to a map $\Delta^\circ_{mk-1} \to \Delta^\circ_{nl-1}$. The set $\Delta^\circ_{mk-1}$ can be identified with the set of joint distributions of two
+random variables. Lebanon maps can be regarded as special Markov maps that incorporate the product
+structure present in the set of joint probability distributions of a pair of random variables. In Section 4,
+we will give an interpretation of these maps.
+Contrary to what is stated in [15], a Lebanon map does not map $(\Delta^k_{m-1})^\circ$ to $(\Delta^l_{n-1})^\circ$, unless k = l.
+Therefore, later, we will provide a characterization for the metrics on $(\Delta^k_{m-1})^\circ$ in terms of invariance
+under other maps (which are neither Markov nor Lebanon maps).
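+
+A small example illustrates the failure (a sketch with hypothetical sizes k = 2 and l = 3, not taken
+from [15]). Let R be the 2 × 3 row-partition matrix with rows (1/2, 1/2, 0) and (0, 0, 1), and let M be any
+2 × m stochastic matrix with rows $M_1$, $M_2$. Writing X = M ⊗ Q, whose rows $X_1 = M_1 \cdot Q^{(1)}$ and
+$X_2 = M_2 \cdot Q^{(2)}$ each sum to one, the image is
+
+\[
+R^\top X = \begin{pmatrix} \tfrac{1}{2} X_1 \\ \tfrac{1}{2} X_1 \\ X_2 \end{pmatrix},
+\]
+
+with row sums 1/2, 1/2 and 1. The total sum of the entries is still |M| = 2, but the image is not
+row-stochastic.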
+The main result in Lebanon’s work [15, Theorems 1 and 2] is the following.
+
+Theorem 3 (Lebanon’s Theorem).
+
+• For each k ≥ 1, m ≥ 2, let $g^{(k,m)}$ be a Riemannian metric on $\mathbb{R}^{k \times m}_+$ in such a way that every Lebanon
+map is an isometry. Then:
+
+\[
+g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = A(|M|) + \delta_{ac} \left( \frac{B(|M|)}{|M_a|} + \delta_{bd} \frac{C(|M|)}{M_{ab}} \right) \tag{16}
+\]
+
+for some differentiable functions $A, B, C \in C^\infty(\mathbb{R}_+)$.
+• Conversely, let $\{(\mathbb{R}^{k \times m}_+, g^{(k,m)})\}$ be a sequence of Riemannian manifolds, with metrics $g^{(k,m)}$ of the
+form (16) for some $A, B, C \in C^\infty(\mathbb{R}_+)$. Then, every Lebanon map is an isometry.
+
+Lebanon does not study the question under which assumptions on A, B, C ∈ C∞(R+) the
+formula (16) does indeed define a Riemannian metric. This question has the following simple answer,
+which we will prove in Appendix A:
+
+Proposition 1. The matrix (16) is positive definite if and only if C(|M|) > 0, B(|M|) + C(|M|) > 0 and
+A(|M|) + B(|M|) + C(|M|) > 0.
+
+The class of metrics (16) is larger than the class of metrics (13) derived in Campbell’s Theorem.
+The reason is that Campbell’s metrics are invariant with respect to a larger class of embeddings.
+The special case with A(|M|) = 0, B(|M|) = 0 and C(|M|) = 1 is called product Fisher metric,
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \delta_{ac} \delta_{bd} \frac{1}{M_{ab}}. \tag{17} \]
+
+Furthermore, if we restrict to $(\Delta^k_{m-1})^\circ$, the functions A and B do not play any role. In this case |M| = k,
+and we obtain the scaled product Fisher metric:
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \delta_{ac} \delta_{bd} \frac{C(k)}{M_{ab}}, \tag{18} \]
+
+where $C(k) : \mathbb{N} \to \mathbb{R}_+$ is a positive function. As mentioned before, Lebanon’s Theorem does not give a
+characterization of invariant metrics of stochastic matrices, since Lebanon maps do not preserve the
+stochasticity of the matrices. However, Lebanon maps are natural maps on the set $\Delta^\circ_{mk-1}$ of positive
+joint distributions. In the same way as Chentsov’s Theorem can be derived from Campbell’s Theorem
+(see Remark 1), we obtain the following Corollary:
+
+Corollary 1.
+
+• Let $\{(\Delta^\circ_{km-1}, g^{(k,m)}) : k \ge 1, m \ge 2\}$ be a double sequence of Riemannian manifolds with the property
+that every Lebanon map is an isometry. Then:
+
+\[
+g^{(k,m)}_P(u, v) = B \sum_a \sum_{b,c} \frac{u_{ab} v_{ac}}{|P_a|} + C \sum_a \sum_b \frac{u_{ab} v_{ab}}{P_{ab}},
+\quad \text{for each } P \in \Delta^\circ_{km-1}, \tag{19}
+\]
+
+for some constants $B, C \in \mathbb{R}$ with C > 0 and B + C > 0, where $|P_a| = \sum_b P_{ab}$.
+• Conversely, let $\{(\Delta^\circ_{km-1}, g^{(k,m)})\}$ be a sequence of Riemannian manifolds with metrics $g^{(k,m)}$ of the form
+of Equation (19) for some $B, C \in \mathbb{R}$. Then, every Lebanon map is an isometry.
+
+Observe that these metrics agree with (a multiple of) the Fisher metric only if B = 0. The case B = 0
+can also be characterized; note that Lebanon maps do not treat the two random variables symmetrically.
+Switching the two random variables corresponds to transposing the joint distribution matrix P. When
+exchanging the role of the two random variables, the Lebanon map becomes $P \mapsto (P^\top \otimes Q)^\top R$. We
+call such a map a dual Lebanon map. If we require invariance under both Lebanon maps and their
+duals in Theorem 3 or Corollary 1, the statements remain true with the additional restriction that B = 0
+(as a function or constant, respectively).
+
+4. Invariance Metric Characterizations for Conditional Polytopes
+
+According to Chentsov’s Theorem (Theorem 1), a natural metric on the probability simplex can
+be characterized by requiring the isometry of natural embeddings. Lebanon follows this axiomatic
+approach to characterize metrics on products of positive measures (Theorem 3). However, the maps
+considered by Lebanon dissolve the row-normalization of conditional distributions. In general, they
+do not map conditional polytopes to conditional polytopes. Therefore, we will consider a slight
+modification of Lebanon maps, in order to obtain maps between conditional polytopes.
+
+4.1. Stochastic Embeddings of Conditional Polytopes
+
+A matrix of conditional distributions P(Y|X) in $\Delta^k_{m-1}$ can be regarded as the equivalence class
+of all joint probability distributions $P(X, Y) \in \Delta_{km-1}$ with conditional distribution P(Y|X). Which
+Markov maps of probability simplices are compatible with this equivalence relation? The most obvious
+examples are permutations (relabelings) of the state spaces of X and Y.
+In information theory, stochastic matrices are also viewed as channels. For any distribution of X,
+the stochastic matrix gives us a joint distribution of the pair (X, Y) and, hence, a marginal distribution
+of Y. If we input a distribution of X into the channel, the stochastic matrix determines what the
+distribution of the output Y will be.
+Channels can be combined, provided the cardinalities of the state spaces fit together. If we
+take the output Y of the first channel P(Y|X) and feed it into another channel P(Y′|Y) then we
+obtain a combined channel P(Y′|X). The composition of channels corresponds to ordinary matrix
+multiplication. If the first channel is described by the stochastic matrix K and the second channel by Q,
+then the combined channel is described by K · Q. Observe that in this case, the joint distribution P
+(considered as a normalized matrix P ∈ Δkm−1) is transformed similarly; that is, the joint distribution
+of the pair (X, Y′) is given by P · Q.
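+As a small worked illustration of the composition rule (the numbers are chosen only for this example
+and do not come from the text), take two binary channels:
+\[
+K = \begin{pmatrix} 1/2 & 1/2 \\ 0 & 1 \end{pmatrix}, \qquad
+Q = \begin{pmatrix} 1 & 0 \\ 1/2 & 1/2 \end{pmatrix}, \qquad
+K \cdot Q = \begin{pmatrix} 3/4 & 1/4 \\ 1/2 & 1/2 \end{pmatrix}.
+\]
+Each row of $K \cdot Q$ still sums to one, so the product is again a stochastic matrix, as expected for
+the combined channel P(Y′|X).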
+More general maps result from compositions where the choice of the second channel depends on
+the input of the first channel. In other words, we have a first channel that takes as input X and gives
+as output Y, and we have another channel that takes as input (X, Y) and gives as output Y′; we are
+interested in the resulting channel from X to Y′. The second channel can be described by a collection
+of stochastic matrices Q = {Q(i)}i. If K describes the first channel, then the combined channel is
+described by the row product K ⊗ Q (see Definition 2). Again, the joint distribution of (X, Y′) arises in
+a similar way as P ⊗ Q.
+
+
+Entropy 2014, 16, 3207–3233
+
+We can also consider transformations of the first random variable X. Suppose that we use X as the
+input to a channel described by a stochastic matrix R. In this case, the joint distribution of the output
+X′ of the channel and Y is described by $R^\top P$. However, in general, there is not much that we can say
+about the conditional distribution of Y given X′. The result depends in an essential way on the original
+distribution of X. However, this is not true in the special case that the channel is “not mixing”, that is,
+in the case that R is a stochastic partition matrix. In this case, the conditional distribution P(Y|X′) is
+described by $\overline{R}^\top K$, where $\overline{R}$ is the corresponding partition indicator matrix, in which all non-zero entries
+of R are replaced by one. In other words, each state of X corresponds to several states of X′, and the
+corresponding row of K is copied a corresponding number of times.
+To sum up, if we combine the transformations due to Q and R, then the joint probability
+distribution transforms as $P \mapsto R^\top (P \otimes Q)$ and the conditional transforms as $K \mapsto \overline{R}^\top (K \otimes Q)$.
+In particular, for the joint distribution, we obtain the definition of a Lebanon map. Figure 1 illustrates
+the situation.
+
+[Figure 1 (diagram): nodes X, Y, X′, Y′ with maps K, K′, R and Q. Joint distributions: $P' = R^\top (P \otimes Q)$; conditional distributions: $K' = \overline{R}^\top (K \otimes Q)$.]
+
+Figure 1. An interpretation for Lebanon maps and conditional embeddings. The variable X′ is
+computed from X by R, and Y′ is computed from X and Y by Q.
+
+Finally, we will also consider the special case where the partition of R (and $\overline{R}$) is homogeneous,
+i.e., such that all blocks have the same size. For example, this describes the case where there is a third
+random variable Z that is independent of Y given X. In this case, the conditional distribution satisfies
+P(Y|X) = P(Y|X, Z), and R describes the conditional distribution of (X, Z) given X.
+
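+Definition 3. A (row) partition indicator matrix is a matrix $\overline{R} \in \{0, 1\}^{k \times l}$ that satisfies:
+\begin{equation}
+\overline{R}_{ij} = \begin{cases} 1, & \text{if } j \in A_i, \\ 0, & \text{else,} \end{cases} \tag{20}
+\end{equation}
+for a k block partition $\{A_1, \ldots, A_k\}$ of [l].
+
+For example, the 3 × 5 partition indicator matrix corresponding to Equation (8) is:
+\begin{equation}
+\overline{R} = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \tag{21}
+\end{equation}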
+
+Definition 4. Consider a k × l partition indicator matrix $\overline{R}$ and a collection of m × n stochastic partition
+matrices $Q = \{Q^{(i)}\}_{i=1}^{k}$. We call the map:
+\begin{equation}
+f : \mathbb{R}^{k \times m}_+ \to \mathbb{R}^{l \times n}_+ ; \quad M \mapsto \overline{R}^\top (M \otimes Q) \tag{22}
+\end{equation}
+a conditional embedding of $\mathbb{R}^{k \times m}_+$ in $\mathbb{R}^{l \times n}_+$. We denote the set of all such maps by $\hat{F}^{l,n}_{k,m}$. If $\overline{R}$ is the partition
+indicator matrix of a homogeneous partition (with partition blocks of equal cardinality), then we call f a
+homogeneous conditional embedding. We denote the set of all such homogeneous conditional embeddings by
+$F^{l,n}_{k,m}$ and assume that l is a multiple of k.
+
+
+Conditional embeddings preserve the 1-norm of the matrix rows; that is, the elements of $\hat{F}^{l,n}_{k,m}$
+map $(\Delta^k_{m-1})^\circ$ to $(\Delta^l_{n-1})^\circ$. On the other hand, they do not preserve the 1-norm of the entire matrix.
+Conditional embeddings are Markov maps only when k = l, in which case they are also Lebanon
+maps.
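+
+To make the two norm statements concrete, here is a small sketch. It assumes that the row product of
+Definition 2 acts row-wise, $(K \otimes Q)_{a\cdot} = K_{a\cdot}\, Q^{(a)}$, writes $\overline{R}$ for the partition indicator matrix, and
+uses illustrative numbers that are not from the text. Take k = 2, l = 3, m = n = 2 and let both $Q^{(a)}$ be
+the 2 × 2 identity:
+\[
+\overline{R} = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
+K = \begin{pmatrix} 1/4 & 3/4 \\ 1/2 & 1/2 \end{pmatrix}, \qquad
+\overline{R}^\top (K \otimes Q) = \begin{pmatrix} 1/4 & 3/4 \\ 1/4 & 3/4 \\ 1/2 & 1/2 \end{pmatrix}.
+\]
+Every row still has 1-norm one, while the 1-norm of the whole matrix grows from k = 2 to l = 3. The
+blocks here have sizes 2 and 1, so this map is a conditional embedding but not a homogeneous one.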
+
+4.2. Invariance Characterization
+
+Considering the conditional embeddings discussed in the previous section, we obtain the
+following metric characterization.
+
+Theorem 4.
+
+• Let $g^{(k,m)}$ denote a metric on $\mathbb{R}^{k \times m}_+$ for each k ≥ 1 and m ≥ 2. If every homogeneous conditional
+embedding $f \in F^{l,n}_{k,m}$ is an isometry with respect to these metrics, then:
+\begin{equation}
+g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2} + \delta_{ac} \left( k\, \frac{B}{k^2} + \delta_{bd}\, \frac{|M|}{M_{ab}}\, \frac{C}{k^2} \right), \quad \text{for all } M \in \mathbb{R}^{k \times m}_+, \tag{23}
+\end{equation}
+for some constants $A, B, C \in \mathbb{R}$, where $\partial_{ab} = \frac{\partial}{\partial M_{ab}}$ and $|M| = \sum_{ab} M_{ab}$.
+• Conversely, given the metrics defined by Equation (23) for any non-degenerate choice of constants
+$A, B, C \in \mathbb{R}$, each homogeneous conditional embedding $f \in F^{l,n}_{k,m}$, k ≤ l, m ≤ n, is an isometry.
+• Moreover, the tensors $g^{(k,m)}$ from Equation (23) are positive-definite for all k ≥ 1 and m ≥ 2 if and only
+if C > 0, B + C > 0 and A + B + C > 0.
+
+The proof of Theorem 4 is similar to the proof of the Theorems of Chentsov, Campbell and
+Lebanon. Due to its technical nature, we defer it to Appendix B.
+Now, for the restriction of the metric $g^{(k,m)}$ to $(\Delta^k_{m-1})^\circ$, we have the following. In this case,
+|M| = k. Since tangent vectors $v = \sum_{ab} v_{ab} \partial_{ab} \in T_M(\Delta^k_{m-1})^\circ$ satisfy $\sum_b v_{ab} = 0$ for all a, the constants
+A and B become immaterial, and the metric can be written as:
+\begin{equation}
+g^{(k,m)}_M(u, v) = \sum_{ab} \frac{|M|\, u_{ab} v_{ab}}{M_{ab}}\, \frac{C}{k^2} = \sum_{ab} \frac{u_{ab} v_{ab}}{M_{ab}}\, \frac{C}{k}, \quad \text{for all } u, v \in T_M(\Delta^k_{m-1})^\circ. \tag{24}
+\end{equation}
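+
+The step from Equation (23) to Equation (24) can be spelled out as follows. This is only a sketch; it
+uses nothing beyond the constraints $\sum_b u_{ab} = \sum_b v_{ab} = 0$ for all a and |M| = k:
+\begin{align*}
+g^{(k,m)}_M(u, v)
+ &= \sum_{ab} \sum_{cd} u_{ab} v_{cd}\, g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) \\
+ &= \frac{A}{k^2} \Bigl( \sum_{ab} u_{ab} \Bigr) \Bigl( \sum_{cd} v_{cd} \Bigr)
+  + k\, \frac{B}{k^2} \sum_a \Bigl( \sum_b u_{ab} \Bigr) \Bigl( \sum_d v_{ad} \Bigr)
+  + \frac{C}{k^2} \sum_{ab} \frac{|M|}{M_{ab}}\, u_{ab} v_{ab} \\
+ &= 0 + 0 + \frac{C}{k} \sum_{ab} \frac{u_{ab} v_{ab}}{M_{ab}}.
+\end{align*}
+The A and B terms vanish because each contains a full row sum of u or v as a factor.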
+
+This metric is a specialization of the metric (18) derived by Lebanon (Theorem 3).
+The statement of Theorem 4 becomes false if we consider general conditional embeddings instead
+of homogeneous ones:
+
+Theorem 5. There is no family of metrics $g^{(k,m)}$ on $\mathbb{R}^{k \times m}_+$ (or on $(\Delta^k_{m-1})^\circ$) for each k ≥ 1 and m ≥ 2, for
+which every conditional embedding $f \in \hat{F}^{l,n}_{k,m}$ is an isometry.
+
+This negative result will become clearer from the perspective of Section 6: as we will show in
+Theorem 7, although there are no metrics that are invariant under all conditional embeddings, there are
+families of metrics (depending on a parameter, ρ) that transform covariantly (that is, in a well-defined
+manner) with respect to the conditional embeddings. We defer the proof of Theorem 5 to Appendix B.
+
+5. The Fisher Metric on Polytopes and Point Configurations
+
+In the previous section, we obtained distinguished Riemannian metrics on $\mathbb{R}^{k \times m}_+$ and $(\Delta^k_{m-1})^\circ$ by
+postulating invariance under natural maps. In this section, we take another viewpoint based on general
+considerations about Riemannian metrics on arbitrary polytopes. This is achieved by embedding each
+polytope in a probability simplex as an exponential family. We first recall the necessary background.
+In Section 5.2, we then present our general results, and in Section 5.3, we discuss the special case of
+conditional polytopes.
+
+
+5.1. Exponential Families and Polytopes
+
+Let X be a finite set and A ∈ Rd×X a matrix with columns ax indexed by x ∈ X . It will be
+convenient to consider the rows Ai, i ∈ [d] of A as functions Ai : X → R. Finally, let ν: X → R+. The
+exponential family EA,ν is the set of probability distributions on X given by:
+
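+\begin{equation}
+p(x; \theta) = \exp\bigl( \theta^\top a_x + \log \nu(x) - \log Z(\theta) \bigr), \quad \text{for all } x \in \mathcal{X},\ \theta \in \mathbb{R}^d, \tag{25}
+\end{equation}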
+
+with the normalization function Z(θ) = ∑x′∈X exp(θ⊤ax′ + log(ν(x′))). The functions Ai are called
+the observables and ν the reference measure of the exponential family. When the reference measure ν
+is constant, ν(x) = 1 for all x ∈ X , we omit the subscript and write EA.
+A direct calculation shows that the Fisher information matrix of EA,ν at a point θ ∈ Rd has
+coordinates:
+
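+\begin{equation}
+g^{E_{A,\nu}}_{\theta}(\partial_{\theta_i}, \partial_{\theta_j}) = \operatorname{cov}_\theta(A_i, A_j), \quad \text{for all } i, j \in [d]. \tag{26}
+\end{equation}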
+
+Here, covθ denotes the covariance computed with respect to the probability distribution p(·; θ).
+The convex support of EA,ν is defined as:
+
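+\begin{equation}
+\operatorname{conv} A := \operatorname{conv}\{ a_x : x \in \mathcal{X} \} = \bigl\{ \mathbb{E}_p[A] : p \in \Delta_{|\mathcal{X}|-1} \bigr\} = \bigl\{ \mathbb{E}_p[A] : p \in \overline{E_{A,\nu}} \bigr\}, \tag{27}
+\end{equation}
+
+where conv S is the set of all convex combinations of points in S. The moment map $\mu : p \in \Delta_{n-1} \mapsto A \cdot p \in \mathbb{R}^d$,
+with $n = |\mathcal{X}|$, restricts to a homeomorphism $\overline{E_{A,\nu}} \to \operatorname{conv} A$; see [16]. Here, $\overline{E_{A,\nu}}$ denotes the Euclidean
+closure of $E_{A,\nu}$. The inverse of μ will be denoted by $\mu^{-1} : \operatorname{conv} A \to \overline{E_{A,\nu}} \subseteq \Delta_{n-1}$. This gives a natural
+embedding of the polytope conv A in the probability simplex $\Delta_{|\mathcal{X}|-1}$. Note that the convex support is
+independent of the reference measure ν. See [17] for more details.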
+
+5.2. Invariance Fisher Metric Characterizations for Polytopes
+
+Let $P \subseteq \mathbb{R}^d$ be a polytope with n vertices $a_1, \ldots, a_n$. Let $A = (a_1, \ldots, a_n)$ be the matrix with
+columns $a_i \in \mathbb{R}^d$ for all i ∈ [n]. Then, $E_A \subseteq \Delta^\circ_{n-1}$ is an exponential family with convex support P. We
+will also denote this exponential family by $E_P$. We can use the inverse of the moment map, $\mu^{-1}$, to pull
+back geometric structures on $\Delta^\circ_{n-1}$ to the relative interior P◦ of P.
+
+Definition 5. The Fisher metric on P◦ is the pull-back of the Fisher metric on $E_A \subseteq \Delta^\circ_{n-1}$ by $\mu^{-1}$.
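+
+As a minimal illustration of Definition 5 (an example added here, not taken from the text), let P = [0, 1]
+with vertices $a_1 = 0$ and $a_2 = 1$, so that A = (0, 1) and n = 2. The moment map is
+$\mu(p) = 0 \cdot p(1) + 1 \cdot p(2) = p(2) =: q$, hence $\mu^{-1}(q) = (1 - q,\, q)$, and a tangent vector t at q is pushed to
+(−t, t). Using the Fisher metric $g_p(u, v) = \sum_i u_i v_i / p_i$ on $\Delta^\circ_1$, the pull-back is:
+\[
+g_q(s, t) = \frac{(-s)(-t)}{1 - q} + \frac{s\, t}{q} = \frac{s\, t}{q (1 - q)}.
+\]
+This agrees with Equation (26): in the natural parameter θ, $q = e^\theta / (1 + e^\theta)$, the Fisher information is
+$\operatorname{cov}_\theta(A, A) = q(1 - q)$, and since $dq/d\theta = q(1 - q)$, changing to the coordinate q gives
+$q(1 - q) / (q(1 - q))^2 = 1 / (q(1 - q))$.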
+
+Some obvious questions are: Why is this a natural construction? Which maps between polytopes
+are isometries between their Fisher metrics? Can we find a characterization of Chentsov type for
+this metric?
+Affine maps are natural maps between polytopes. However, in order to obtain isometries, we
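+need to put some additional constraints. Consider two polytopes $P \subseteq \mathbb{R}^d$, $P' \subseteq \mathbb{R}^{d'}$ and an affine
+map $\phi : \mathbb{R}^d \to \mathbb{R}^{d'}$ that satisfies φ(P) ⊆ P′. A natural condition in the context of exponential families is
+that φ restricts to a bijection between the set vert(P) of vertices of P and the set vert(P′) of vertices of P′.
+In this case, $E_{P'} \subseteq E_P \subseteq \Delta^\circ_{n-1}$. Moreover, the moment map μ′ of P′ factorizes through the moment
+map μ of P: μ′ = φ ◦ μ. Let $\phi^{-1} = \mu \circ \mu'^{-1}$. Then, the following diagram commutes:
+
+\begin{equation}
+\begin{array}{ccccc}
+P^\circ & \xrightarrow{\ \mu^{-1}\ } & E_P & \subseteq & \Delta^\circ_{n-1} \\
+\big\uparrow \phi^{-1} & & \cup & & \\
+P'^\circ & \xrightarrow{\ \mu'^{-1}\ } & E_{P'} & &
+\end{array}
+\tag{28}
+\end{equation}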
+
+
+It follows that φ−1 is an isometry from P′◦ to its image in P◦. Observe that the inverse moment map
+itself arises in this way: In the diagram (28), if P is equal to Δn−1, then the upper moment map μ−1 is
+the identity map, and φ−1 equals the inverse moment map μ′−1 of P′.
+The constraint of mapping vertices to vertices bijectively is very restrictive. In order to consider
+a larger class of affine maps, we need to generalize our construction from polytopes to weighted
+point configurations.
+
+Definition 6. A weighted point configuration is a pair (A, ν) consisting of a matrix A ∈ Rd×n with columns
+a1, . . . , an and a positive weight function ν : {1, . . . , n} → R+ assigning a weight to each column ai. The pair
+(A, ν) defines the exponential family EA,ν.
+The (A, ν)-Fisher metric on (conv A)◦ is the pull-back of the Fisher metric on $\Delta^\circ_{n-1}$ through the inverse
+of the moment map.
+
+We recover Definition 5 as follows. For a polytope P, let A be the point configuration consisting
+of the vertices of P. Moreover, let ν be a constant function. Then, EP = EA,ν, and the two Definitions of
+the Fisher metric on P◦ coincide.
+The following are natural maps between weighted point configurations:
+
+Definition 7. Let (A, ν), (A′, ν′) be two weighted point configurations with $A = (a_i)_i \in \mathbb{R}^{d \times n}$ and
+$A' = (a'_j)_j \in \mathbb{R}^{d' \times n'}$. A morphism (A, ν) → (A′, ν′) is a pair (φ, σ) consisting of an affine map $\phi : \mathbb{R}^d \to \mathbb{R}^{d'}$ and a
+surjective map σ : {1, . . . , n} → {1, . . . , n′} with $\phi(a_i) = a'_{\sigma(i)}$ and $\nu'(a'_j) = \alpha \sum_{i : \sigma(i) = j} \nu(a_i)$, where α > 0 is
+a constant that does not depend on j.
+
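+Consider a morphism (φ, σ) : (A, ν) → (A′, ν′). For each j ∈ [n′], let $A_j = \{ i : \phi(a_i) = a'_j \}$. Then,
+$(A_1, \ldots, A_{n'})$ is a partition of [n]. Define a matrix $Q \in \mathbb{R}^{n' \times n}$ by:
+\begin{equation}
+Q_{ji} = \begin{cases} \dfrac{\nu(i)}{\sum_{i' \in A_j} \nu(i')}, & \text{if } i \in A_j, \\ 0, & \text{else.} \end{cases} \tag{29}
+\end{equation}
+
+Then, Q is a Markov mapping, and the following diagram commutes:
+
+\begin{equation}
+\begin{array}{ccccc}
+(\operatorname{conv} A)^\circ & \xrightarrow{\ \mu^{-1}\ } & E_{A,\nu} & \subseteq & \Delta^\circ_{n-1} \\
+\big\uparrow \phi^{-1} & & & & \big\uparrow Q \\
+(\operatorname{conv} A')^\circ & \xrightarrow{\ \mu'^{-1}\ } & E_{A',\nu'} & \subseteq & \Delta^\circ_{n'-1}
+\end{array}
+\tag{30}
+\end{equation}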
+
+By Chentsov’s Theorem (Theorem 1), Q is an isometric embedding. It follows that φ−1 also induces an
+isometric embedding. This shows the first part of the following Theorem:
+
+Theorem 6.
+
+• Let (φ, σ) : (A, ν) → (A′, ν′) be a morphism of weighted point configurations. Then,
+$\phi^{-1} : (\operatorname{conv} A')^\circ \to (\operatorname{conv} A)^\circ$ is an isometric embedding with respect to the Fisher metrics on (conv A)◦
+and (conv A′)◦.
+• Let gA,ν be a Riemannian metric on (conv A)◦ for each weighted point configuration (A, ν). If every
+morphism (φ, σ) : (A, ν) → (A′, ν′) of weighted point configurations induces an isometric embedding
+φ−1 : (conv A′)◦ → (conv A)◦, then there exists a constant α ∈ R+ such that gA,ν is equal to α times
+the (A, ν)-Fisher metric.
+
+
+Proof. The first statement follows from the discussion before the Theorem. For the second statement,
+we show that under the given assumptions, all Markov maps are isometric embeddings. By Chentsov’s
+Theorem (Theorem 1), this implies that the metrics gP agree with the Fisher metric whenever P is
+a simplex. The statement then follows from the two facts that the metric on P◦ or (conv A)◦ is the
+pull-back of the Fisher metric through the inverse of the moment map and that μ−1 is itself a morphism.
+Observe that $\Delta_{n-1} = \operatorname{conv} I_n = \operatorname{conv}\{e_1, \ldots, e_n\}$ is a polytope, and $\Delta^\circ_{n-1}$ is the corresponding
+exponential family. Consider a Markov embedding $Q : \Delta^\circ_{n'-1} \to \Delta^\circ_{n-1}$, $p \mapsto p \cdot Q$. Let $\nu(i) = \sum_j Q_{ji}$
+be the value of the unique non-zero entry of Q in the i-th column. This defines a morphism and an
+embedding as follows:
+Let A be the matrix that arises from Q by replacing each non-zero entry by one. We define φ
+as the linear map represented by the matrix A, and define σ : [n] → [n′] by σ(j) = i if and only
+if $a_j = e_i$, that is, σ(j) indicates the row i in which the j-th column of A is non-zero. Then, (φ, σ)
+is a morphism $(I_n, \nu) \to (I_{n'}, 1)$, and by assumption, the inverse $\phi^{-1}$ is an isometric embedding
+$\Delta^\circ_{n'-1} \to \Delta^\circ_{n-1}$. However, $\phi^{-1}$ is equal to the Markov map Q. This shows that all Markov maps are
+isometric embeddings, and so, by Chentsov’s Theorem, the statement holds true on the simplices.
+
+Theorem 6 defines a natural metric on $(\Delta^k_{m-1})^\circ$ that we want to discuss in more detail next.
+
+5.3. Independence Models and Conditional Polytopes
+
+Consider k random variables with finite state spaces [n1], . . . , [nk]. The independence model
+consists of all joint distributions $p \in \Delta_{\prod_{i \in [k]} n_i - 1}$ of these variables that factorize as:
+\begin{equation}
+p(x_1, \ldots, x_k) = \prod_{i \in [k]} p_i(x_i), \quad \text{for all } x_1 \in [n_1], \ldots, x_k \in [n_k], \tag{31}
+\end{equation}
+where $p_i \in \Delta_{n_i - 1}$ for all i ∈ [k]. Assuming fixed $n_1, \ldots, n_k$, we denote the independence model by $E_k$.
+It is the Euclidean closure of an exponential family (with observables of the form $\delta_{i y_i}$). The convex
+support of $E_k$ is equal to the product of simplices $P_k := \Delta_{n_1 - 1} \times \cdots \times \Delta_{n_k - 1}$. The parametrization (31)
+corresponds to the inverse of the moment map.
+We can write any tangent vector $u \in T_{(p_1, \ldots, p_k)} P_k^\circ$ of this open product of simplices as a linear
+combination $u = \sum_{i \in [k]} \sum_{x_i \in [n_i]} u_{i x_i} \partial_{i, x_i}$, where $\sum_{x_i \in [n_i]} u_{i x_i} = 0$ for all i ∈ [k]. Given two such tangent
+vectors, the Fisher metric is given by:
+\begin{equation}
+g^{P_k}_{(p_1, \ldots, p_k)}(u, v) = \sum_{i \in [k]} \sum_{x_i \in [n_i]} \frac{u_{i x_i} v_{i x_i}}{p_i(x_i)}. \tag{32}
+\end{equation}
+
+Just as the convex support of the independence model is the Cartesian product of probability
+simplices, the Fisher metric on the independence model is the product metric of the Fisher metrics
+on the probability simplices of the individual variables. If $n_1 = \cdots = n_k =: n$, then $P_k = \Delta^k_{n-1}$ can be
+identified with the set of k × n stochastic matrices.
+The Fisher metric on the product of simplices is equal to the product of the Fisher metrics on the
+factors. More generally, if $P = Q_1 \times Q_2$ is a Cartesian product, then the Fisher metric on P◦ is equal
+to the product of the Fisher metrics on $Q_1^\circ$ and $Q_2^\circ$. In fact, in this case, the inverse of the moment
+map of P can be expressed in terms of the two moment map inverses $\mu_1^{-1} : Q_1 \to E_{Q_1} \subseteq \Delta_{m_1 - 1}$ and
+$\mu_2^{-1} : Q_2 \to E_{Q_2} \subseteq \Delta_{m_2 - 1}$ and the moment map $\tilde{\mu}$ of the independence model $\Delta_{m_1 - 1} \times \Delta_{m_2 - 1}$, by:
+\begin{equation}
+\mu^{-1}(q_1, q_2) = \tilde{\mu}^{-1}\bigl( \mu_1^{-1}(q_1),\, \mu_2^{-1}(q_2) \bigr). \tag{33}
+\end{equation}
+
+Therefore, the pull-back by μ−1 factorizes through the pull-back by ˜μ−1, and since the independence
+model carries a product metric, the product of polytopes also carries a product metric.
+
+
+Let us compare the metric $g^{(k,m)}_K$ from Equation (24) with the Fisher metric $g^{P_k}_{(K_1, \ldots, K_k)}$ from
+Equation (32) on the product of simplices $P^\circ = (\Delta^k_{m-1})^\circ$. In both cases, the metric is a product
+metric; that is, it has the form:
+\begin{equation}
+g = g_1 + \cdots + g_k, \tag{34}
+\end{equation}
+where $g_i$ is a metric on the i-th factor $\Delta^\circ_{m-1}$. For $g^{\Delta^k_{m-1}}_K$, $g_i$ is equal to the Fisher metric on $\Delta^\circ_{m-1}$. However,
+for $g^{(k,m)}_K$, $g_i$ is equal to 1/k times the Fisher metric on $\Delta^\circ_{m-1}$. Since this factor only depends on k, it
+only plays a role if stochastic matrices of different sizes are compared. The additional factor of 1/k can
+be interpreted as the uniform distribution on k elements. This is related to another more general class
+of Riemannian metrics that are used in applications; namely, given a function $K \in \Delta^k_{m-1} \mapsto \rho_K \in \mathbb{R}^k_+$,
+it is common to use product metrics with $g_i$ equal to $\rho_K(i)$ times the Fisher metric on $\Delta^\circ_{m-1}$. When K
+has the interpretation of a channel or when K describes the policy by which a system reacts to some
+sensor values, a natural possibility is to let $\rho_K$ be the stationary distribution of the channel input or of
+the sensor values, respectively. We will discuss this approach in Section 6.
+
+6. Weighted Product Metrics for Conditional Models
+
+In this section, we consider metrics on spaces of stochastic matrices defined as weighted sums
+of the Fisher metrics on the spaces of the matrix rows, similar to Equation (34). This kind of metric
+was used initially by Amari [1] in order to define a natural gradient in the supervised learning context.
+Later, in the context of reinforcement learning, Kakade [2] defined a natural policy gradient based on
+this kind of metric, which has been further developed by Peters et al. [10]. Related applications within
+unsupervised learning have been pursued by Zahedi et al. [18].
+Consider the following weighted product Fisher metric:
+
+\begin{equation}
+g^{\rho,m}_K = \sum_a \rho_K(a)\, g^{(m),a}_{K_a}, \quad \text{for all } K \in (\Delta^k_{m-1})^\circ, \tag{35}
+\end{equation}
+
+where $g^{(m),a}_{K_a}$ denotes the Fisher metric of $\Delta^\circ_{m-1}$ at the a-th row of K and $\rho_K \in \Delta^\circ_{k-1}$ is a probability
+distribution over a associated with each $K \in (\Delta^k_{m-1})^\circ$. For example, the distribution $\rho_K$ could be the
+stationary distribution of sensor values observed by an agent when operating under a policy described
+by K.
+In the following, we will try to illuminate the properties of polytope embeddings that yield the
+metric (35) as the pull-back of the Fisher information metric on a probability simplex. We will focus on
+the case that ρK = ρ is independent of K.
+There are two direct ways of embedding $\Delta^k_{n-1}$ in a probability simplex. In Section 5, we used
+the inverse of the moment map of an exponential family, possibly with some reference measure. This
+embedding is illustrated in the left panel of Figure 2. If we are given a fixed probability distribution
+$\rho \in \Delta^\circ_{k-1}$, there is a second natural embedding $\psi_\rho : \Delta^k_{m-1} \to \Delta_{k \cdot m - 1}$ defined as follows:
+
+\begin{equation}
+\psi_\rho(K)(x, y) = \rho(x) K_{x,y}, \quad \text{for all } x \in [k],\ y \in [m]. \tag{36}
+\end{equation}
+
+If ρ is the distribution of a random variable X and $K \in \Delta^k_{m-1}$ is the stochastic matrix describing the
+conditional distribution of another variable Y given X, then $\psi_\rho(K)$ is the joint distribution of X and Y.
+Note that $\psi_\rho$ is an affine embedding. See the right panel of Figure 2 for an illustration.
+The pull-back of the Fisher metric on $\Delta^\circ_{km-1}$ through $\psi_\rho$ is given by:
+
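+\begin{align}
+g^{(km)}_{\psi_\rho(K)}(\psi_{\rho *} u, \psi_{\rho *} v)
+ &= \sum_{a,b} \sum_{c,d} \sum_{i,j} \rho(i) K_{ij}\, u_{ab}\, \frac{\partial \log \rho(i) K_{ij}}{\partial K_{ab}}\, v_{cd}\, \frac{\partial \log \rho(i) K_{ij}}{\partial K_{cd}} \nonumber \\
+ &= \sum_i \rho(i) \sum_j \frac{u_{ij} v_{ij}}{K_{ij}} = \sum_i \rho(i)\, g^{i}_{K_i}(u_i, v_i) = g^{\rho,m}_K(u, v). \tag{37}
+\end{align}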
+
+
+This recovers the weighted sum of Fisher metrics from Equation (35).
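+
+The simplification in Equation (37) rests on one elementary derivative, spelled out here as a sketch.
+Since ρ does not depend on K:
+\[
+\frac{\partial \log \bigl( \rho(i) K_{ij} \bigr)}{\partial K_{ab}} = \frac{\delta_{ia}\, \delta_{jb}}{K_{ij}},
+\]
+so that the triple sum collapses to the terms with (a, b) = (c, d) = (i, j):
+\[
+\sum_{a,b} \sum_{c,d} \sum_{i,j} \rho(i) K_{ij}\, u_{ab} v_{cd}\, \frac{\delta_{ia} \delta_{jb}\, \delta_{ic} \delta_{jd}}{K_{ij}^2}
+ = \sum_{i,j} \rho(i)\, \frac{u_{ij} v_{ij}}{K_{ij}}.
+\]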
+
+[Figure 2 (two panels). Left panel labels: $\Delta^k_{m-1}$, $\Delta_{m^k-1}$, $E_k$. Right panel labels: $\Delta_{k \cdot m - 1}$, $\psi_{(1/4,\,3/4)} \Delta^k_{m-1}$, $\psi_{(1/2,\,1/2)} \Delta^k_{m-1}$.]
+
+Figure 2. An illustration of different embeddings of the conditional polytope $\Delta^k_{m-1}$ in a probability
+simplex. The left panel shows an embedding in $\Delta_{m^k-1}$ by the inverse of the moment map μ of the
+independence model. The right panel shows an affine embedding in $\Delta_{k \cdot m - 1}$ as a set of joint probability
+distributions for two different specifications of marginals.
+
+Are there natural maps that leave the metrics $g^{\rho,m}$ invariant? Let us reconsider the stochastic
+embeddings from Definition 4. Let $\overline{R}$ be a k × l partition indicator matrix and R a stochastic partition
+matrix with the same block structure as $\overline{R}$. Observe that for each partition indicator matrix $\overline{R}$ there are
+many compatible stochastic partition matrices R, but the partition indicator matrix $\overline{R}$ of any stochastic
+partition matrix R is unique. Furthermore, let $Q = \{Q^{(a)}\}_{a \in [k]}$ be a collection of stochastic partition
+matrices. The corresponding conditional embedding f maps $K \in \Delta^k_{m-1}$ to $f(K) := \overline{R}^\top (K \otimes Q) \in \Delta^l_{n-1}$.
+Let $\rho \in \Delta^\circ_{k-1}$. Suppose that K describes the conditional distribution of Y given X and that $P = \psi_\rho(K)$
+describes the joint distribution of X and Y. As explained in Section 4.1, the matrix $R^\top (P \otimes Q)$
+describes the joint distribution of a pair of random variables (X′, Y′), and the conditional distribution
+of Y′ given X′ is given by f(K). In this situation, the marginal distribution of X′ is given by ρ′ = ρR.
+Therefore, the following diagram commutes:
+
+\begin{equation}
+\begin{array}{ccc}
+(\Delta^k_{m-1})^\circ & \xrightarrow{\ \psi_\rho\ } & \Delta^\circ_{mk-1} \\
+\big\downarrow f & & \big\downarrow P \mapsto R^\top (P \otimes Q) \\
+(\Delta^l_{n-1})^\circ & \xrightarrow{\ \psi_{\rho'}\ } & \Delta^\circ_{nl-1}
+\end{array}
+\tag{38}
+\end{equation}
+
+The preceding discussion implies the first statement of the following result:
+
+
+Theorem 7.
+
+• For any k ≥ 1 and m ≥ 2 and any $\rho \in \Delta^\circ_{k-1}$, the Riemannian metric $g^{\rho,m}$ on $(\Delta^k_{m-1})^\circ$ satisfies:
+\begin{equation}
+g^{\rho,m} = f^{*}(g^{\rho',n}), \quad \text{for } \rho' = \rho R, \tag{39}
+\end{equation}
+for any conditional embedding $f : K \mapsto \overline{R}^\top (K \otimes Q)$.
+• Conversely, suppose that for any k ≥ 1 and m ≥ 2 and any $\rho \in \Delta^\circ_{k-1}$, there is a Riemannian metric
+$g^{(\rho,m)}$ on $(\Delta^k_{m-1})^\circ$, such that Equation (39) holds for all conditional embeddings, and suppose that $g^{(\rho,m)}$
+depends continuously on ρ. Then, there is a constant A > 0 that satisfies $g^{(\rho,m)} = A\, g^{\rho,m}$.
+
+Proof. The first statement follows from the commutative diagram (38). For the second statement,
+denote by $\rho_k$ the uniform distribution on a set of k elements. If $f : K \mapsto \overline{R}^\top (K \otimes Q)$ is a homogeneous
+conditional embedding of $\Delta^k_{m-1}$ in $\Delta^l_{n-1}$, then $R = \frac{k}{l} \overline{R}$ is a stochastic partition matrix corresponding
+to the partition indicator matrix $\overline{R}$. Observe that $\rho_l = \rho_k R$. Therefore, the family of Riemannian
+metrics $g^{(\rho_k,m)}$ on $\Delta^k_{m-1}$ satisfies the assumptions of Theorem 4. Therefore, there is a constant A > 0
+for which $g^{(\rho_k,m)}$ equals A/k times the product Fisher metric. This proves the statement for uniform
+distributions ρ.
+A general distribution $\rho \in \Delta^\circ_{k-1}$ can be approximated by a distribution with rational probabilities.
+Since $g^{(\rho,m)}$ is assumed to be continuous, it suffices to prove the statement for rational ρ. In this case,
+there exists a stochastic partition matrix R for which ρ′ := ρR is a uniform distribution, and so, $g^{(\rho',n)}$
+is of the desired form. Equation (39) shows that $g^{(\rho,m)}$ is also of the desired form.
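+
+The rational case can be illustrated with a small example (the numbers are illustrative and not from the
+text). For ρ = (1/3, 2/3), the following stochastic partition matrix with k = 2 and l = 3 spreads ρ to the
+uniform distribution on three states:
+\[
+R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1/2 & 1/2 \end{pmatrix}, \qquad
+\rho R = \Bigl( \tfrac{1}{3},\ \tfrac{2}{3} \cdot \tfrac{1}{2},\ \tfrac{2}{3} \cdot \tfrac{1}{2} \Bigr) = \Bigl( \tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3} \Bigr).
+\]
+More generally, if $\rho_i = c_i / N$ with positive integers $c_i$ summing to N, then splitting state i uniformly into
+$c_i$ states yields the uniform distribution on N states.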
+
+7. Gradient Fields and Replicator Equations
+
+In this section, we use gradient fields in order to compare Riemannian metrics on the space
+$(\Delta^k_{n-1})^\circ$.
+7.1. Replicator Equations
+
+We start with gradient fields on the simplex $\Delta^\circ_{n-1}$. A Riemannian metric g on $\Delta^\circ_{n-1}$ allows us to
+consider gradient fields of differentiable functions $F : \Delta^\circ_{n-1} \to \mathbb{R}$. To be more precise, consider the
+differential $d_p F : T_p \Delta^\circ_{n-1} \to \mathbb{R}$ of F in p. It is a linear form on $T_p \Delta^\circ_{n-1}$, which maps each tangent vector
+u to $d_p F(u) = \frac{\partial F}{\partial u}(p) \in \mathbb{R}$. Using the map $u \mapsto g_p(u, \cdot)$, this linear form can be identified with a tangent
+vector in $T_p \Delta^\circ_{n-1}$, which we denote by $\operatorname{grad}_p F$. If we choose the Fisher metric $g^{(n)}$ as the Riemannian
+metric, we obtain the gradient in the following way. First consider a differentiable extension of F to the
+positive cone $\mathbb{R}^n_+$, which we will denote by the same symbol F. With the partial derivatives $\partial_i F$ of F,
+the Fisher gradient of F on the simplex $\Delta^\circ_{n-1}$ is given as:
+\begin{equation}
+(\operatorname{grad}_p F)_i = p_i \Bigl( \partial_i F(p) - \sum_{j=1}^{n} p_j\, \partial_j F(p) \Bigr), \quad i \in [n]. \tag{40}
+\end{equation}
+
+Note that the expression on the right-hand side of Equation (40) does not depend on the particular
+differentiable extension of F to $\mathbb{R}^n_+$. The corresponding differential equation is well known in theoretical
+biology as the replicator equation; see [19,20]:
+\begin{equation}
+\dot{p}_i = p_i \Bigl( \partial_i F(p) - \sum_{j=1}^{n} p_j\, \partial_j F(p) \Bigr), \quad i \in [n]. \tag{41}
+\end{equation}
+
+We now apply this gradient formula to functions that have the structure of an expectation value.
+Given real numbers $F_i$, i ∈ [n], referred to as fitness values, we consider the mean fitness:
+\begin{equation}
+\bar{F}(p) := \sum_{i=1}^{n} p_i F_i. \tag{42}
+\end{equation}
+
+Replacing the $p_i$ by any positive real numbers leads to a differentiable extension of $\bar{F}$, also denoted
+by $\bar{F}$. Obviously, we have $\partial_i \bar{F} = F_i$, which leads to the following replicator equation:
+\begin{equation}
+\dot{p}_i = p_i \bigl( F_i - \bar{F}(p) \bigr), \quad i \in [n]. \tag{43}
+\end{equation}
+
+This equation has the solution:
+\begin{equation}
+p_i(t) = \frac{p_i(0)\, e^{t F_i}}{\sum_{j=1}^{n} p_j(0)\, e^{t F_j}}, \quad i \in [n]. \tag{44}
+\end{equation}
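+Clearly, the mean fitness increases along this solution of the gradient field. The rate of increase can
+be easily calculated:
+\begin{equation}
+\frac{d}{dt} \bar{F}\bigl( p(t) \bigr) = \sum_{i=1}^{n} \dot{p}_i(t)\, F_i = \sum_{i=1}^{n} p_i \bigl( F_i - \bar{F}(p) \bigr) F_i = \sum_{i=1}^{n} p_i \bigl( F_i - \bar{F}(p) \bigr)^2 = \operatorname{var}_p(F) \ge 0, \tag{45}
+\end{equation}
+with equality only if all fitness values $F_i$ coincide.
+
+As limit points of this solution, we obtain:
+\begin{equation}
+\lim_{t \to -\infty} p_i(t) = \begin{cases} \dfrac{p_i(0)}{\sum_{j \in \operatorname{argmin} F} p_j(0)}, & \text{if } i \in \operatorname{argmin} F, \\ 0, & \text{otherwise}, \end{cases} \quad i \in [n], \tag{46}
+\end{equation}
+
+and:
+\begin{equation}
+\lim_{t \to +\infty} p_i(t) = \begin{cases} \dfrac{p_i(0)}{\sum_{j \in \operatorname{argmax} F} p_j(0)}, & \text{if } i \in \operatorname{argmax} F, \\ 0, & \text{otherwise}, \end{cases} \quad i \in [n]. \tag{47}
+\end{equation}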
+
+7.2. Extension of the Replicator Equations to Stochastic Matrices
+
+Now, we come to the corresponding considerations of gradient fields in the context of stochastic
+matrices $K \in (\Delta^k_{n-1})^\circ$. We consider a function:
+\begin{equation}
+K \mapsto F(K) = F(K_{11}, \ldots, K_{1n}; K_{21}, \ldots, K_{2n}; \ldots; K_{k1}, \ldots, K_{kn}). \tag{48}
+\end{equation}
+
+One way to deal with this is to consider for each i ∈ [k] the corresponding replicator equation:
+\begin{equation}
+\dot{K}_{ij} = K_{ij} \Bigl( \partial_{ij} F(K) - \sum_{j'=1}^{n} K_{ij'}\, \partial_{ij'} F(K) \Bigr), \quad j \in [n]. \tag{49}
+\end{equation}
+
+Obviously, this is the gradient field that one obtains by using the product Fisher metric on $(\Delta^k_{n-1})^\circ$
+(Equation (17)):
+\begin{equation}
+g^{(k,m)}_K(u, v) = \sum_{ij} \frac{1}{K_{ij}}\, u_{ij} v_{ij}. \tag{50}
+\end{equation}
+
+If we replace the metric by the weighted product Fisher metric considered by Kakade (Equation (35)),
+\begin{equation}
+g^{\rho,m}_K(u, v) = \sum_{ij} \frac{\rho_i}{K_{ij}}\, u_{ij} v_{ij}, \tag{51}
+\end{equation}
+then we obtain:
+\begin{equation}
+\dot{K}_{ij} = \frac{K_{ij}}{\rho_i} \Bigl( \partial_{ij} F(K) - \sum_{j'=1}^{n} K_{ij'}\, \partial_{ij'} F(K) \Bigr), \quad j \in [n]. \tag{52}
+\end{equation}
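+
+As a sketch of why the weights enter as $1/\rho_i$, one can check the defining property of the gradient,
+$g^{\rho,m}_K(\operatorname{grad} F, v) = dF(v)$ for all tangent vectors v. The abbreviation $c_i = \sum_{j'} K_{ij'}\, \partial_{ij'} F(K)$ is introduced here
+only for this check:
+\begin{align*}
+g^{\rho,m}_K(\operatorname{grad} F, v)
+ &= \sum_{ij} \frac{\rho_i}{K_{ij}} \cdot \frac{K_{ij}}{\rho_i} \bigl( \partial_{ij} F(K) - c_i \bigr) v_{ij} \\
+ &= \sum_{ij} \partial_{ij} F(K)\, v_{ij} - \sum_i c_i \sum_j v_{ij} = dF(v),
+\end{align*}
+since $\sum_j v_{ij} = 0$ for every row i. The candidate field is itself tangent, because
+$\sum_j K_{ij} ( \partial_{ij} F(K) - c_i ) = c_i - c_i = 0$ when $\sum_j K_{ij} = 1$.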
+
+7.3. The Example of Mean Fitness
+
+Next, we want to study how the gradient flows with respect to different metrics compare. We
+restrict to the class of metrics $g^{\rho,m}$ (Equation (35)), where $\rho \in \Delta^\circ_{k-1}$ is a probability distribution. In
+principle, one could drop the normalization condition ∑i ρi = 1 and allow arbitrary coefficients ρi.
+However, it is clear that the rate of convergence can always be increased by scaling all values ρi with a
+common positive factor. Therefore, some normalization condition is needed for ρ.
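+With a probability distribution $p \in \Delta^\circ_{k-1}$ and fitness values $F_{ij}$, let us consider again the example
+of an expectation value function:
+\begin{equation}
+\bar{F}(K) = \sum_{i=1}^{k} p_i \sum_{j=1}^{n} K_{ij} F_{ij}. \tag{53}
+\end{equation}
+
+With $\partial_{ij} \bar{F}(K) = p_i F_{ij}$, this leads to:
+\begin{equation}
+\dot{K}_{ij} = \frac{p_i}{\rho_i}\, K_{ij} \Bigl( F_{ij} - \sum_{j'=1}^{n} K_{ij'} F_{ij'} \Bigr), \quad j \in [n]. \tag{54}
+\end{equation}
+
+The corresponding solutions are given by:
+\begin{equation}
+K_{ij}(t) = \frac{K_{ij}(0)\, e^{t \frac{p_i}{\rho_i} F_{ij}}}{\sum_{j'=1}^{n} K_{ij'}(0)\, e^{t \frac{p_i}{\rho_i} F_{ij'}}}, \quad i \in [k],\ j \in [n]. \tag{55}
+\end{equation}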
+
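+Since $\operatorname{argmax}(\frac{p_i}{\rho_i} F_{i\cdot})$ and $\operatorname{argmin}(\frac{p_i}{\rho_i} F_{i\cdot})$ are independent of $\rho_i > 0$, the limit points are given
+independently of the chosen ρ as:
+\begin{equation}
+\lim_{t \to -\infty} K_{ij}(t) = \begin{cases} \dfrac{K_{ij}(0)}{\sum_{j' \in \operatorname{argmin} F_{i\cdot}} K_{ij'}(0)}, & \text{if } j \in \operatorname{argmin} F_{i\cdot}, \\ 0, & \text{otherwise}, \end{cases} \quad i \in [k], \tag{56}
+\end{equation}
+
+and:
+\begin{equation}
+\lim_{t \to +\infty} K_{ij}(t) = \begin{cases} \dfrac{K_{ij}(0)}{\sum_{j' \in \operatorname{argmax} F_{i\cdot}} K_{ij'}(0)}, & \text{if } j \in \operatorname{argmax} F_{i\cdot}, \\ 0, & \text{otherwise}, \end{cases} \quad i \in [k]. \tag{57}
+\end{equation}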
+
+This is consistent with the fact that the critical points of gradient fields are independent of the chosen
+Riemannian metric. However, the speed of convergence does depend on the metric:
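+For each i, let $G_i = \max_j F_{ij}$ and $g_i = \max_{j \notin \operatorname{argmax} F_{i\cdot}} F_{ij}$ be the largest and second-largest values
+in the i-th row of $F_{ij}$, respectively. Then, as t → ∞,
+\begin{equation}
+K_{ij}(t) = \begin{cases} 1 - O\bigl( \exp( -\tfrac{p_i}{\rho_i} (G_i - g_i)\, t ) \bigr), & \text{if } j \in \operatorname{argmax} F_{i\cdot}, \\ O\bigl( \exp( -\tfrac{p_i}{\rho_i} (G_i - F_{ij})\, t ) \bigr), & \text{otherwise.} \end{cases} \tag{58}
+\end{equation}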
+
+
+Therefore,
+\begin{align}
+\bar{F}(K(t)) &= \sum_i p_i \Bigl[ \sum_{j \in \operatorname{argmax} F_{i\cdot}} F_{ij} + O\Bigl( \exp\bigl( -\tfrac{p_i}{\rho_i} (G_i - g_i)\, t \bigr) \Bigr) \Bigr] \nonumber \\
+ &= \sum_i p_i \sum_{j \in \operatorname{argmax} F_{i\cdot}} F_{ij} + O\Bigl( \exp\bigl( -\inf_i \bigl\{ \tfrac{p_i}{\rho_i} (G_i - g_i) \bigr\}\, t \bigr) \Bigr). \tag{59}
+\end{align}
+
+Thus, in the long run, the rate of convergence is given by $\inf_i \{ \frac{p_i}{\rho_i} (G_i - g_i) \}$, which depends on the
+parameter ρ of the metric. As a result, in this case study, the optimal choice of $\rho_i$, i.e., with the largest
+convergence rate, can be computed if the numbers $G_i$ and $g_i$ are known.
+Consider, for example, the case that the differences $G_i - g_i$ are of comparable sizes for all i. Then,
+we need to find the choice of ρ that maximizes $\inf_i \{ \frac{p_i}{\rho_i} \}$. Clearly, $\inf_i \{ \frac{p_i}{\rho_i} \} \le 1$ (since there is always an
+index i with $p_i \le \rho_i$). Equality is attained for the choice $\rho_i = p_i$. Thus, we recover the choice of Kakade.
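+
+A small numerical illustration (the values are chosen here for the example only): let k = 2,
+p = (1/5, 4/5), and assume $G_i - g_i = 1$ for both rows. Then the rate $\inf_i \{ p_i / \rho_i \}$ for three choices of ρ is:
+\begin{align*}
+\rho = (1/2,\, 1/2): &\quad \min\{ 2/5,\ 8/5 \} = 2/5, \\
+\rho = (1/10,\, 9/10): &\quad \min\{ 2,\ 8/9 \} = 8/9, \\
+\rho = p = (1/5,\, 4/5): &\quad \min\{ 1,\ 1 \} = 1.
+\end{align*}
+Any deviation from ρ = p lowers some ratio $p_i / \rho_i$ below one, so the choice ρ = p gives the largest rate
+among normalized weights.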
+
+8. Conclusions
+
+So, which Riemannian metric should one use in practice on the set of stochastic matrices, $(\Delta^k_{m-1})^\circ$?
+The results provided in this manuscript give different answers, depending on the approach. In all
+cases, the characterized Riemannian metrics are products of Fisher metrics with suitable factor weights.
+Theorem 4 suggests using a factor weight proportional to 1/k, and Theorem 6 suggests using a
+constant weight independent of k. In many cases, it is possible to work within a single conditional
+polytope $(\Delta^k_{m-1})^\circ$ and a fixed k, and then, these two results are basically equivalent. On the other
+hand, Theorem 7 gives an answer that allows arbitrary factor weights ρ.
+Which metric performs best obviously depends on the concrete application. The first observation
+is that in order to use the metric gρ,m of Theorem 7, it is necessary to know ρ. If the problem at
+hand suggests a natural marginal distribution ρ, then it is natural to make use of it and choose the
+metric gρ,m. Even if ρ is not known at the beginning, a learning system might try to learn it to improve
+its performance.
+On the other hand, there may be situations where there is no natural choice of the weights ρ.
+Observe that ρ breaks the symmetry of permuting the rows of a stochastic matrix. This is also expressed
+by the structural difference between Theorems 4 and 6 on the one side and Theorem 7 on the other.
+While the first two Theorems provide an invariance metric characterization, Theorem 7 provides a
+“covariance” classification; that is, the metrics gρ,m are not invariant under conditional embeddings,
+but they transform in a controlled manner. This again illustrates that the choice of a metric should
+depend on which mappings are natural to consider, e.g., which mappings describe the symmetries of a
+given problem.
+For example, consider a utility function of the form F = ∑i ρi ∑j KijFij. Row permutations do not
+leave gρ,m invariant (for a general ρ), but they are not symmetries of the utility function F, either, and
+hence, they are not very natural mappings to consider. However, row permutations transform the
+metric gρ,m and the utility function in a controlled manner; in such a way that the two transformations
+match. Therefore, in this case, it is natural to use gρ,m. On the other hand, when studying problems
+that are symmetric under all row permutations, it is more natural to use the invariant metric g(k,m).
+
+Appendix A
+
+Appendix A Conditions for Positive Definiteness
+
+Equation (16) in Lebanon’s Theorem 3 defines a Riemannian metric whenever it defines a
+positive-definite quadratic form. The next Proposition gives sufficient and necessary conditions for
+which this is the case.
+
+Entropy 2014, 16, 3207–3233
+
+Proposition A1. For each pair k ≥ 1 and m ≥ 2, consider the tensor on $\mathbb{R}^{k\times m}_+$ defined by:
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = A(|M|) + \delta_{ac}\left( \frac{B(|M|)}{|M_a|} + \delta_{bd}\,\frac{C(|M|)}{M_{ab}} \right) \tag{A1} \]
+
+for some differentiable functions A, B, C ∈ C∞(R+). The tensor g(k,m) defines a Riemannian metric for all k
+and m if and only if C(α) > 0, B(α) + C(α) > 0 and A(α) + B(α) + C(α) > 0 for all α ∈ R+.
+
+Proof. The tensors are Riemannian metrics when:
+
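+\[ g^{(k,m)}_M(V) = A(|M|)\Big(\sum_{ab} V_{ab}\Big)^2 + B(|M|)\sum_{a} \frac{|M|}{|M_a|}\Big(\sum_{b} V_{ab}\Big)^2 + C(|M|)\sum_{ab} \frac{|M|}{M_{ab}}\, V_{ab}^2 \tag{A2} \]
+
+is strictly positive for all non-zero $V \in \mathbb{R}^{k\times m}$, for all $M \in \mathbb{R}^{k\times m}_+$.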
+We can derive necessary conditions on the functions A, B, C from some basic observations.
+Choosing V = ∂ab in Equation (A2) shows that $A(|M|) + \frac{|M|}{|M_a|} B(|M|) + \frac{|M|}{M_{ab}} C(|M|)$ has to be positive
+for all a ∈ [k], b ∈ [m], for all $M \in \mathbb{R}^{k\times m}_+$. Since Mab can be arbitrarily small for fixed |M| and |Ma|, we
+see that C has to be non-negative. Since we can choose |Ma| ≈ Mab ≪ |M| for a fixed |M|, we find
+that B + C has to be non-negative. Further, since we can choose Mab ≈ |Ma| ≈ |M| for a given |M|,
+we find that A + B + C has to be non-negative. This shows that the quadratic form is positive definite
+only if C ≥ 0, B + C ≥ 0, A + B + C ≥ 0. Since the cone of positive definite matrices is open, these
+inequalities have to be strictly satisfied. In the following, we study sufficient conditions.
+For any given $M \in \mathbb{R}^{k\times m}_+$, we can write Equation (A2) as a product V⊤GV, for all V ∈ Rkm, where
+G = GA + GB + GC ∈ Rkm×km is the sum of a matrix GA with all entries equal to A(|M|), a block
+diagonal matrix GB whose a-th block has all entries equal to $\frac{|M|}{|M_a|} B(|M|)$, and a diagonal matrix GC with
+diagonal entries equal to $\frac{|M|}{M_{ab}} C(|M|)$. The matrix G is obviously symmetric, and by Sylvester’s criterion,
+it is positive definite iff all its leading principal minors are positive. We can evaluate the minors using
+Sylvester’s determinant Theorem. That Theorem states that for any invertible m × m matrix X, an
+m × n matrix Y and an n × m matrix Z, one has the equality det(X + YZ) = det(X) det(In + ZX−1Y).
+Let us consider a leading square block G′, consisting of all entries Gab,cd of G with row-index pairs
+(a, b) satisfying b ∈ [m] for all a < a′ and b ≤ b′ for a = a′ for some a′ ≤ k and b′ ≤ m; and the same
+restriction for the column index pairs. The corresponding block $G'_A + G'_B$ can be written as the rank-a′
+matrix YZ, with Y consisting of columns $1_a$ for all a ≤ a′ and Z consisting of rows $A + 1_a \frac{|M|}{|M_a|} B$ for all
+a ≤ a′. Hence, the determinant of G′ is equal to:
+
+\[ \det(G') = \det(G'_C) \cdot \det\big(I_{a'} + Z G'^{-1}_C Y\big). \tag{A3} \]
+
+Since G′C is diagonal, the first term is just:
+
+\[ \det(G'_C) = \Big( \prod_{a<a'} \prod_{b} \frac{|M|}{M_{ab}}\, C \Big) \Big( \prod_{b \le b'} \frac{|M|}{M_{a'b}}\, C \Big). \tag{A4} \]
+
+The matrix in the second term of Equation (A3) is given by:
+
+\[ I_{a'} + Z G'^{-1}_C Y = \frac{1}{C}
+\begin{pmatrix} C+B & & & \\ & \ddots & & \\ & & C+B & \\ & & & C + \frac{\sum_{b\le b'} M_{a'b}}{|M_{a'}|} B \end{pmatrix}
++ \frac{1}{C}
+\begin{pmatrix} \frac{|M_1|}{|M|} A & \cdots & \frac{|M_{a'-1}|}{|M|} A & \frac{\sum_{b\le b'} M_{a'b}}{|M|} A \\ \vdots & & \vdots & \vdots \\ \frac{|M_1|}{|M|} A & \cdots & \frac{|M_{a'-1}|}{|M|} A & \frac{\sum_{b\le b'} M_{a'b}}{|M|} A \end{pmatrix}. \tag{A5} \]
+
+By Sylvester’s determinant Theorem, we have:
+
+\[ \begin{aligned}
+\det\big(I_{a'} + Z G'^{-1}_C Y\big) &= C^{-a'} (C+B)^{a'-1} \Big( C + \frac{\sum_{b\le b'} M_{a'b}}{|M_{a'}|} B \Big) \Bigg( 1 + \sum_{a<a'} \frac{\frac{|M_a|}{|M|} A}{C+B} + \frac{\frac{\sum_{b\le b'} M_{a'b}}{|M|} A}{C + \frac{\sum_{b\le b'} M_{a'b}}{|M_{a'}|} B} \Bigg) \\
+&= \Big( \prod_{a} \frac{C + B_a}{C} \Big) \Big( 1 + \sum_{a} \frac{A_a}{C + B_a} \Big),
+\end{aligned} \tag{A6} \]
+
+where $A_a = \frac{|M_a|}{|M|} A$ for a < a′ and $A_{a'} = \frac{\sum_{b\le b'} M_{a'b}}{|M|} A$, and $B_a = B$ for a < a′ and $B_{a'} = \frac{\sum_{b\le b'} M_{a'b}}{|M_{a'}|} B$.
+This shows that the matrix G is positive definite for all M if and only if C > 0, C + B > 0 and
+$1 + \sum_{a\le a'} \frac{A_a}{C + B_a} > 0$ for all a′ and b′. The latter inequality is satisfied whenever A + B + C > 0. This
+completes the proof.
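
The criterion of Proposition A1 can be probed numerically. The following is a minimal sketch, not part of the original proof: it assembles the Gram matrix G of the quadratic form (A2) for a small positive matrix M and tests positive definiteness by attempting a Cholesky factorization. The function names `metric_matrix` and `is_positive_definite` are hypothetical, and A, B, C are passed as plain numbers standing for the values A(|M|), B(|M|), C(|M|).

```python
import math


def metric_matrix(M, A, B, C):
    """Gram matrix G of the quadratic form (A2) at a positive matrix M (list of rows).

    A, B, C are numbers standing for A(|M|), B(|M|), C(|M|) (an assumption of this sketch).
    """
    k, m = len(M), len(M[0])
    total = sum(sum(row) for row in M)       # |M|
    row_sums = [sum(row) for row in M]       # |M_a|
    index = [(a, b) for a in range(k) for b in range(m)]
    G = [[0.0] * (k * m) for _ in range(k * m)]
    for p, (a, b) in enumerate(index):
        for q, (c, d) in enumerate(index):
            g = A                             # G_A: all entries equal to A
            if a == c:
                g += B * total / row_sums[a]  # G_B: block-diagonal part
                if b == d:
                    g += C * total / M[a][b]  # G_C: diagonal part
            G[p][q] = g
    return G


def is_positive_definite(G, tol=1e-12):
    """Return True iff a Cholesky factorization of the symmetric matrix G succeeds."""
    n = len(G)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = G[i][j] - sum(L[i][t] * L[j][t] for t in range(j))
            if i == j:
                if s <= tol:                  # non-positive pivot: not positive definite
                    return False
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


M = [[0.2, 0.3], [0.1, 0.4]]
# C > 0, B + C > 0, A + B + C > 0: expected to be positive definite
print(is_positive_definite(metric_matrix(M, -0.5, -0.2, 1.0)))
# A + B + C < 0: expected to fail
print(is_positive_definite(metric_matrix(M, -3.0, 0.0, 1.0)))
```

A single test matrix M does not prove the proposition, of course; the sketch only illustrates how the three inequalities separate admissible from inadmissible coefficient choices.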
+
+Appendix B Proofs of the Invariance Characterization
+
+The following Lemma follows directly from the Definition and contains all the technical details
+we need for the proofs.
+
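+Lemma A1. The push-forward $f_* : T_M \mathbb{R}^{k\times m}_+ \to T_{f(M)} \mathbb{R}^{l\times n}_+$ of a map $f \in \hat{\mathcal{F}}^{l,n}_{k,m}$ is given by:
+
+\[ f_*(\partial_{ab}) = \sum_{i=1}^{l} \sum_{j=1}^{n} R_{ai}\, Q^{(a)}_{bj}\, \partial'_{ij}, \tag{A7} \]
+
+and the pull-back of a metric $g^{(l,n)}$ on $\mathbb{R}^{l\times n}_+$ through f is given by:
+
+\[ (f^* g^{(l,n)})_M(\partial_{ab}, \partial_{cd}) = g^{(l,n)}_{f(M)}(f_*\partial_{ab}, f_*\partial_{cd}) = \sum_{i=1}^{l} \sum_{j=1}^{n} \sum_{s=1}^{l} \sum_{t=1}^{n} R_{ai} R_{cs}\, Q^{(a)}_{bj} Q^{(c)}_{dt}\; g^{(l,n)}_{f(M)}(\partial'_{ij}, \partial'_{st}). \tag{A8} \]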
+
+Proof of Theorem 4. We follow the strategy of [5,14]. The idea is to consider subclasses of maps from
+the class $\mathcal{F}^{l,n}_{k,m}$ and to evaluate their push-forward and pull-back maps together with the isometry
+requirement. This yields restrictions on the possible metrics, eventually fully characterizing them.
+
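+First. Consider the maps $h_{\pi,\sigma} \in \mathcal{F}^{k,m}_{k,m}$, resulting from permutation matrices $Q^{(a)} = P_{\pi_a}$, $\pi_a : [m] \to [m]$
+for all a ∈ [k], and $R = P_\sigma$, σ: [k] → [k]. Requiring isometry yields:
+
+\[ (h_{\pi,\sigma})_*(\partial_{ab}) = \partial'_{\sigma(a)\,\pi_a(b)} \tag{A9} \]
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = g^{(k,m)}_{h_{\pi,\sigma}(M)}(\partial_{\sigma(a)\,\pi_a(b)}, \partial_{\sigma(c)\,\pi_c(d)}). \tag{A10} \]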
+
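+Second. Consider the maps $r_{zw} \in \mathcal{F}^{kz,mw}_{k,m}$ defined by $Q^{(1)} = \cdots = Q^{(k)} \in \mathbb{R}^{m\times mw}$ and $R \in \mathbb{R}^{k\times kz}$
+being uniform. In this case, for some permutations π and σ,
+
+\[ (r_{zw})_*(\partial_{ab}) = \frac{1}{w} \sum_{i=1}^{z} \sum_{j=1}^{w} \partial'_{\sigma^{(a)}(i)\,\pi^{(b)}(j)} \tag{A11} \]
+
+\[ (r_{zw}^{*}\, g^{(kz,mw)})_M(\partial_{ab}, \partial_{cd}) = \frac{1}{w^2} \sum_{i=1}^{z} \sum_{j=1}^{w} \sum_{s=1}^{z} \sum_{t=1}^{w} g^{(kz,mw)}_{r_{zw}(M)}\big(\partial'_{\sigma^{(a)}(i)\,\pi^{(b)}(j)}, \partial'_{\sigma^{(c)}(s)\,\pi^{(d)}(t)}\big). \tag{A12} \]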
+
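+Third. For a rational matrix $M = \frac{1}{Z} \tilde{M}$ with $\tilde{M} \in \mathbb{N}^{k\times m}$ and row-sum $|\tilde{M}_a| = N \in \mathbb{N}$ for all a ∈ [k],
+consider the map $v_M \in \mathcal{F}^{zk,N}_{k,m}$ that maps M to a constant matrix. In this case, $R \in \mathbb{R}^{k\times kz}$ and $Q^{(a)}$ has
+the b-th row with $|\tilde{M}_{ab}|$ entries with value $1/|\tilde{M}_{ab}|$ at positions $\pi^{(ab)}([\tilde{M}_{ab}]) \subseteq [N]$, and:
+
+\[ (v_M)_*(\partial_{ab}) = \frac{1}{\tilde{M}_{ab}} \sum_{i=1}^{z} \sum_{j=1}^{\tilde{M}_{ab}} \partial'_{\sigma^{(a)}(i)\,\pi^{(ab)}(j)} \tag{A13} \]
+
+\[ (v_M^{*}\, g^{(kz,N)})_M(\partial_{ab}, \partial_{cd}) = \frac{1}{\tilde{M}_{ab}} \frac{1}{\tilde{M}_{cd}} \sum_{i=1}^{z} \sum_{j=1}^{\tilde{M}_{ab}} \sum_{s=1}^{z} \sum_{t=1}^{\tilde{M}_{cd}} g^{(kz,N)}_{v_M(M)}\big(\partial'_{\sigma^{(a)}(i)\,\pi^{(ab)}(j)}, \partial'_{\sigma^{(c)}(s)\,\pi^{(cd)}(t)}\big). \tag{A14} \]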
+
+Step 1: a ≠ c. Consider a constant matrix M = U. Then:
+
+\[ g^{(k,m)}_U(\partial_{a_1 b_1}, \partial_{c_1 d_1}) = g^{(k,m)}_{h_{\pi,\sigma}(U)}(\partial_{a_2 b_2}, \partial_{c_2 d_2}) = g^{(k,m)}_U(\partial_{a_2 b_2}, \partial_{c_2 d_2}). \tag{A15} \]
+
+This implies that $g^{(k,m)}_U(\partial_{ab}, \partial_{cd}) = \hat{A}(k, m)$ when a ≠ c.
+Using the second type of map, we get:
+
+\[ \hat{A}(k, m) = \frac{z^2 w^2}{w^2}\, \hat{A}(kz, mw), \tag{A16} \]
+
+which implies $g^{(k,m)}_U(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2}$, when a ≠ c. Considering a rational matrix M and the map vM
+yields:
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2}. \tag{A17} \]
+
+Step 2: b ≠ d. By similar arguments as in Step 1, $g^{(k,m)}_U(\partial_{ab}, \partial_{ad}) = \hat{B}(k, m)$. Evaluating the map
+rzw yields:
+
+\[ \hat{B}(k, m) = \frac{z w^2}{w^2}\, \hat{B}(kz, mw) + \frac{z(z-1) w^2}{w^2}\, \frac{A}{(kz)^2} = z\, \hat{B}(kz, mw) + \frac{z-1}{z}\, \frac{A}{k^2}, \tag{A18} \]
+
+and therefore,
+
+\[ \frac{1}{z} \Big( \hat{B}(k, m) - \frac{A}{k^2} \Big) = \hat{B}(kz, mw) - \frac{A}{(kz)^2}, \tag{A19} \]
+
+which implies that $\hat{B}(k, m) - \frac{A}{k^2}$ is independent of m and scales with the inverse of k, such that it
+can be written as $\frac{B}{k}$. Rearranging the terms yields $g^{(k,m)}_U(\partial_{ab}, \partial_{ad}) = \frac{A}{k^2} + \frac{B}{k}$, for b ≠ d.
+For a rational matrix M, the pull-back through vM shows then:
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{ad}) = z\, \frac{\tilde{M}_{ab} \tilde{M}_{ad}}{\tilde{M}_{ab} \tilde{M}_{ad}} \Big( \frac{A}{(kz)^2} + \frac{B}{kz} \Big) + z(z-1)\, \frac{\tilde{M}_{ab} \tilde{M}_{ad}}{\tilde{M}_{ab} \tilde{M}_{ad}}\, \frac{A}{(kz)^2} = \frac{A}{k^2} + \frac{B}{k}. \tag{A20} \]
+
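+Step 3: a = c and b = d. In this case, $g^{(k,m)}_U(\partial_{a_1 b_1}, \partial_{a_1 b_1}) = g^{(k,m)}_U(\partial_{a_2 b_2}, \partial_{a_2 b_2}) = \hat{C}(k, m)$, and:
+
+\[ \begin{aligned}
+\hat{C}(k, m) &= \frac{1}{w^2}\, zw\, \hat{C}(kz, mw) + \frac{1}{w^2}\, zw(w-1) \Big( \frac{A}{(kz)^2} + \frac{B}{kz} \Big) + \frac{1}{w^2}\, w^2 z(z-1)\, \frac{A}{(kz)^2} \\
+&= \frac{z}{w}\, \hat{C}(kz, mw) + \Big( 1 - \frac{1}{zw} \Big) \frac{A}{k^2} + \Big( 1 - \frac{z}{zw} \Big) \frac{B}{k},
+\end{aligned} \tag{A21} \]
+
+which implies:
+
+\[ \frac{k}{m} \Big( \hat{C}(k, m) - \frac{A}{k^2} - \frac{B}{k} \Big) = \frac{kz}{mw} \Big( \hat{C}(kz, mw) - \frac{A}{(kz)^2} - \frac{B}{kz} \Big), \tag{A22} \]
+
+such that the left-hand side is a constant C, and $g^{(k,m)}_U(\partial_{ab}, \partial_{ab}) = \frac{A}{k^2} + \frac{B}{k} + \frac{m}{k} C$. Now, for a rational
+matrix M, pulling back through vM gives:
+
+\[ \begin{aligned}
+g^{(k,m)}_M(\partial_{ab}, \partial_{ab}) &= \frac{1}{\tilde{M}_{ab}^2}\, \tilde{M}_{ab} \Big( \frac{A}{k^2} + \frac{B}{k} + \frac{|\tilde{M}_a|}{k}\, C \Big) + \frac{1}{\tilde{M}_{ab}^2}\, \tilde{M}_{ab} (\tilde{M}_{ab} - 1) \Big( \frac{A}{k^2} + \frac{B}{k} \Big) + 0 \\
+&= \frac{A}{k^2} + \frac{B}{k} + \frac{|\tilde{M}_a|}{\tilde{M}_{ab}\, k}\, C \\
+&= \frac{A}{k^2} + k\, \frac{B}{k^2} + \frac{|M|}{M_{ab}}\, \frac{C}{k^2}.
+\end{aligned} \tag{A23} \]
+
+Summarizing, we found:
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2} + \delta_{ac} \Big( k\, \frac{B}{k^2} + \delta_{bd}\, \frac{|M|}{M_{ab}}\, \frac{C}{k^2} \Big), \tag{A24} \]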
+
+which proves the first statement. The second statement follows by plugging Equation (23) into
+Equation (A8). Finally, the statement about the positive-definiteness is a direct consequence of
+Proposition 1.
+
+Proof of Theorem 5. Suppose, contrary to the claim, that a family of metrics $g^{(k,m)}_M$ exists, which is
+invariant with respect to any conditional embedding. By Theorem 4, these metrics are of the form
+of Equation (23). To prove the claim, we only need to show that A, B and C vanish. In the following,
+we study conditional embeddings where Q consists of identity matrices and evaluate the isometry
+requirement $(f^* g^{(l,n)})_M(\partial_{ab}, \partial_{cd}) = g^{(k,m)}_M(\partial_{ab}, \partial_{cd})$.
+Step 1: In the case a ≠ c, we obtain from the invariance requirement and Equation (A8), that:
+
+\[ \frac{A}{k^2} = |R_a|\, |R_c|\, \frac{A}{l^2}. \tag{A25} \]
+
+Observe that:
+
+\[ \frac{1}{k} \sum_{i=1}^{k} |R_i| = \frac{1}{k}\, |R| = \frac{l}{k}. \tag{A26} \]
+
+In fact, |Ri| is the cardinality of the i-th block of the partition belonging to R. Therefore, if we choose R
+to be the partition indicator matrix of a partition that is not homogeneous and in which |Ra| > l/k
+and |Rc| > l/k, then Equation (A25) implies that A = 0.
+Step 2: In the case a = c and b ≠ d, we obtain from invariance and Equation (A8), that:
+
+\[ \frac{B}{k} = \sum_{i=1}^{l} \sum_{s=1}^{l} R_{ai} R_{as}\, \delta_{is}\, \frac{B}{l} = |R_a|\, \frac{B}{l}. \tag{A27} \]
+
+Again, we may choose Ra in such a way that $|R_a| \neq l/k$ and find that B = 0.
+
+Step 3: Finally, in the case a = c and b = d, we obtain from invariance and Equation (A8), that:
+
+\[ \frac{C\, |M|}{k^2 M_{ab}} = \sum_{i=1}^{l} \sum_{s=1}^{l} R_{ai} R_{as}\, \delta_{i,s}\, \frac{C\, |R^\top M|}{l^2 (R^\top M)_{ib}} = |R_a|\, \frac{C\, |R^\top M|}{l^2 M_{ab}}. \tag{A28} \]
+
+If we choose Ra such that $|R_a| \neq \frac{|M|}{|R^\top M|}$, then we see that C = 0. Therefore, g(k,m) is the zero-tensor,
+which is not a metric.
+
+Acknowledgments
+
+The authors are grateful to Keyan Zahedi for discussions related to policy gradient methods in
+robotics applications. Guido Montúfar thanks the Santa Fe Institute for hosting him during the initial
+work on this article. Johannes Rauh acknowledges support by the VW Foundation. This work was
+supported in part by the DFG Priority Program, Autonomous Learning (DFG-SPP 1527).
+
+Author Contributions
+
+All authors contributed to the design of the research. The research was carried out by all authors,
+with main contributions by Guido Montúfar and Johannes Rauh. The manuscript was written by
+Guido Montúfar, Johannes Rauh and Nihat Ay. All authors read and approved the final manuscript.
+
+Conflicts of Interest
+
+The authors declare no conflict of interests.
+
+References
+
+1. Amari, S. Natural gradient works efficiently in learning. Neur. Comput. 1998, 10, 251–276.
+2. Kakade, S. A Natural Policy Gradient. In Advances in Neural Information Processing Systems 14; MIT Press: Cambridge, MA, USA, 2001; pp. 1531–1538.
+3. Shahshahani, S. A New Mathematical Framework for the Study of Linkage and Selection; American Mathematical Society: Providence, RI, USA, 1979.
+4. Chentsov, N. Statistical Decision Rules and Optimal Inference; American Mathematical Society: Providence, RI, USA, 1982.
+5. Campbell, L. An extended Čencov characterization of the information metric. Proc. Am. Math. Soc. 1986, 98, 135–141.
+6. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12; MIT Press: Cambridge, MA, USA, 2000; pp. 1057–1063.
+7. Marbach, P.; Tsitsiklis, J. Simulation-based optimization of Markov reward processes. IEEE Trans. Autom. Control 2001, 46, 191–209.
+8. Montúfar, G.; Ay, N.; Zahedi, K. Expressive power of conditional restricted Boltzmann machines for sensorimotor control. 2014, arXiv:1402.3346.
+9. Ay, N.; Montúfar, G.; Rauh, J. Selection Criteria for Neuromanifolds of Stochastic Dynamics. In Advances in Cognitive Neurodynamics (III); Yamaguchi, Y., Ed.; Springer-Verlag: Dordrecht, The Netherlands, 2013; pp. 147–154.
+10. Peters, J.; Schaal, S. Natural Actor-Critic. Neurocomputing 2008, 71, 1180–1190.
+11. Peters, J.; Schaal, S. Policy Gradient Methods for Robotics. In Proceedings of the IEEE International Conference on Intelligent Robotics Systems (IROS 2006), Beijing, China, 9–15 October 2006.
+
+12. Peters, J.; Vijayakumar, S.; Schaal, S. Reinforcement learning for humanoid robotics. In Proceedings of the third IEEE-RAS international conference on humanoid robots, Karlsruhe, Germany, 29–30 September 2003; pp. 1–20.
+13. Bagnell, J.A.; Schneider, J. Covariant policy search. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9–15, 2003; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2003; pp. 1019–1024.
+14. Lebanon, G. Axiomatic geometry of conditional models. IEEE Trans. Inform. Theor. 2005, 51, 1283–1294.
+15. Lebanon, G. An Extended Čencov-Campbell Characterization of Conditional Information Geometry. In Proceedings of the 20th Conference in Uncertainty in Artificial Intelligence (UAI 04), Banff, AL, Canada, 7–11 July 2004; Chickering, D.M., Halpern, J.Y., Eds.; AUAI Press: Arlington, VA, USA, 2004; pp. 341–345.
+16. Barndorff-Nielsen, O. Information and Exponential Families: In Statistical Theory; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1978.
+17. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Institute of Mathematical Statistics: Hayward, CA, USA, 1986.
+18. Zahedi, K.; Ay, N.; Der, R. Higher coordination with less control—A result of information maximization in the sensorimotor loop. Adapt. Behav. 2010, 18.
+19. Hofbauer, J.; Sigmund, K. Evolutionary Games and Population Dynamics; Cambridge University Press: Cambridge, United Kingdom, 1998.
+20. Ay, N.; Erb, I. On a notion of linear replicator equations. J. Dyn. Differ. Equ. 2005, 17, 427–451.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+entropy
+
+Article
+Matrix Algebraic Properties of the Fisher Information
+Matrix of Stationary Processes
+
+André Klein
+
+Rothschild Blv. 123 Apt.7, 65271 Tel Aviv, Israel; A.A.B.Klein@uva.nl or klein@contact.uva.nl; Tel.: 972.5.25594723
+
+Received: 12 February 2014; in revised form: 11 March 2014 / Accepted: 24 March 2014 /
+Published: 8 April 2014
+
+Abstract: In this survey paper, a summary of results which are to be found in a series of papers,
+is presented.
+The subject of interest is focused on matrix algebraic properties of the Fisher
+information matrix (FIM) of stationary processes. The FIM is an ingredient of the Cramér-Rao
+inequality, and belongs to the basics of asymptotic estimation theory in mathematical statistics. The
+FIM is interconnected with the Sylvester, Bezout and tensor Sylvester matrices. Through these
+interconnections it is shown that the FIM of scalar and multiple stationary processes fulfill the
+resultant matrix property. A statistical distance measure involving entries of the FIM is presented.
+In quantum information, a different statistical distance measure is set forth. It is related to the
+Fisher information but where the information about one parameter in a particular measurement
+procedure is considered. The FIM of scalar stationary processes is also interconnected to the solutions
+of appropriate Stein equations; conditions for the FIM to verify certain Stein equations are formulated.
+The presence of Vandermonde matrices is also emphasized.
+
+MSC Classification: 15A23, 15A24, 15B99, 60G10, 62B10, 62M20.
+
+Keywords: Bezout matrix; Sylvester matrix; tensor Sylvester matrix; Stein equation; Vandermonde
+matrix; stationary process; matrix resultant; Fisher information matrix
+
+1. Introduction
+
+In this survey paper, a summary of results derived and described in a series of papers, is presented.
+It concerns some matrix algebraic properties of the Fisher information matrix (abbreviated as FIM) of
+stationary processes. An essential property emphasized in this paper concerns the matrix resultant
+property of the FIM of stationary processes. To be more explicit, consider the coefficients of two monic
+polynomials p(z) and q(z) of finite degree, as the entries of a matrix such that the matrix becomes
+singular if and only if the polynomials p(z) and q(z) have at least one common root. Such a matrix is
+called a resultant matrix and its determinant is called the resultant. The Sylvester, Bezout and tensor
+Sylvester matrices have such a property and are extensively studied in the literature, see e.g., [1,3]. The
+FIM associated with various stationary processes will be expressed by these matrices. The derived
+interconnections are obtained by developing the necessary factorizations of the FIM in terms of the
+Sylvester, Bezout and tensor Sylvester matrices. These factored forms of the FIM enable us to show that
+the FIM of scalar and multiple stationary processes fulfill the resultant matrix property. Consequently,
+the singularity conditions of the appropriate Fisher information matrices and Sylvester, Bezout and
+tensor Sylvester matrices coincide; these results are described in [4,6].
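
The resultant property described above can be illustrated with a small example. The sketch below is only an illustration and is not taken from the cited papers: it builds the Sylvester matrix of two polynomials from their coefficient lists (highest degree first) and evaluates its determinant exactly over the rationals. The function names `sylvester` and `det` are hypothetical.

```python
from fractions import Fraction


def sylvester(p, q):
    """Sylvester matrix of polynomials p and q (coefficient lists, highest degree first)."""
    n, m = len(p) - 1, len(q) - 1          # degrees of p and q
    size = n + m
    S = [[Fraction(0)] * size for _ in range(size)]
    for i in range(m):                     # m shifted copies of the coefficients of p
        for j, c in enumerate(p):
            S[i][i + j] = Fraction(c)
    for i in range(n):                     # n shifted copies of the coefficients of q
        for j, c in enumerate(q):
            S[m + i][i + j] = Fraction(c)
    return S


def det(M):
    """Exact determinant by Gaussian elimination with row swaps."""
    M = [row[:] for row in M]
    n, d = len(M), Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if M[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)             # a zero column: the matrix is singular
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            d = -d
        d *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n):
                M[r][j] -= f * M[c][j]
    return d


p = [1, -3, 2]        # (z - 1)(z - 2)
q_common = [1, 1, -6]  # (z - 2)(z + 3): shares the root z = 2 with p
q_coprime = [1, 7, 12]  # (z + 3)(z + 4): no common root with p
print(det(sylvester(p, q_common)) == 0)
print(det(sylvester(p, q_coprime)) != 0)
```

Exact rational arithmetic is used so that singularity shows up as an exact zero rather than as a small floating-point number.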
+A statistical distance measure involving entries of the FIM is presented and is based on [7]. In
+quantum information, a statistical distance measure is set forth, see [8,10], and is related to the Fisher
+information but where the information about one parameter in a particular measurement procedure is
+considered. This leads to a challenging question that can be presented as, can the existing distance
+measure in quantum information be developed at the matrix level?
+
+Entropy 2014, 16, 2023–2055; doi:10.3390/e16042023
+www.mdpi.com/journal/entropy
+
+The matrix Stein equation, see e.g., [11], is associated with the Fisher information matrices of
+scalar stationary processes through the solutions of the appropriate Stein equations. Conditions for the
+Fisher information matrices or associated matrices to verify certain Stein equations are formulated
+and proved in this paper. The presence of Vandermonde matrices is also emphasized. The general
+and more detailed results are set forth in [12] and [13]. In this survey paper it is shown that the FIM of
+linear stationary processes form a class of structured matrices. Note that in [14], the authors emphasize
+that statistical problems related to stationary processes have been treated successfully with the aid
+of Toeplitz forms. This paper is organized as follows. The various stationary processes, considered
+in this paper, are presented in Section 2, the Fisher information matrices of the stationary processes
+are displayed in Section 3. Section 3 sets forth the interconnections between the Fisher information
+matrices and the Sylvester, Bezout, tensor Sylvester matrices, and solutions to Stein equations. A
+statistical distance measure is expressed in terms of entries of a FIM.
+
+2. The Linear Stationary Processes
+
+In this section we display the class of linear stationary processes whose corresponding Fisher
+information matrix shall be investigated in a matrix algebraic context. But first some basic definitions
+are set forth, see e.g., [15].
+
+If a random variable X is indexed to time, usually denoted by t, the observations {Xt, t ∈ T } are
+called a time series, where T is a time index set (for example, T = Z, the integer set).
+
+2.1. Definition 2.1
+
+A stochastic process is a family of random variables {Xt, t ∈ T } defined on a probability space {Ω, F, ℘}.
+
+2.2. Definition 2.2
+
+The Autocovariance function. If {Xt, t ∈ T } is a process such that Var(Xt) < ∞ (variance) for each t, then
+the autocovariance function γX (·, ·) of {Xt} is defined by γX (r, s) = Cov (Xr, Xs) = E [(Xr − E Xr) (Xs − E Xs)],
+r, s ∈ Z, where E represents the expected value.
+
+2.3. Definition 2.3
+
+Stationarity. The time series {Xt, t ∈ Z}, with the index set Z = {0, ±1, ±2, . . .}, is said to be stationary if
+
+(i) E |Xt|2 < ∞
+(ii) E (Xt) = m for all t ∈ Z, m is the constant average or mean
+(iii) γX (r, s) = γX (r + t, s + t) for all r, s, t ∈ Z,
+
+From Definition 2.3 it can be concluded that the joint probability distributions of the random variables
+$\{X_{t_1}, X_{t_2}, \dots, X_{t_n}\}$ and $\{X_{t_1+k}, X_{t_2+k}, \dots, X_{t_n+k}\}$ are the same for arbitrary times t1, t2, . . . , tn for all n and
+all lags or leads k = 0, ±1, ±2, . . .. The probability distribution of observations of a stationary process
+is invariant with respect to shifts in time. In the next section the linear stationary processes that will be
+considered throughout this paper are presented.
+
+2.4. The Vector ARMAX or VARMAX Process
+
+We display one of the most general linear stationary processes, called the multivariate autoregressive,
+moving average and exogenous process, the VARMAX process. To be more specific, consider the
+vector difference equation representation of a linear system {y(t), t ∈ Z}, of order (p, r, q),
+
+\[ \sum_{j=0}^{p} A_j\, y(t-j) = \sum_{j=0}^{r} C_j\, x(t-j) + \sum_{j=0}^{q} B_j\, \epsilon(t-j), \qquad t \in \mathbb{Z} \tag{1} \]
+
+where y(t) are the observable outputs, x(t) the observable inputs and ϵ(t) the unobservable errors, all
+are n-dimensional. The acronym VARMAX stands for vector autoregressive-moving average with
+exogenous variables. The left side of (1) is the autoregressive part, the second term on the right
+is the moving average part and x(t) is exogenous. If x(t) does not occur the system is said to be
+(V)ARMA. Next to exogenous, the input x(t) is also named the control variable, depending on the field
+of application, in econometrics and time series analysis, e.g., [15], and in signal processing and control,
+e.g., [16,17]. The matrix coefficients, Aj ∈ Rn×n, Cj ∈ Rn×n, and Bj ∈ Rn×n are the associated parameter
+matrices. We have the property A0 ≡ B0 ≡ C0 ≡ In.
+Equation (1) can compactly be written as
+
+A(z) y(t) = C(z) x(t) + B(z) ϵ(t)
+(2)
+
+where
+
+\[ A(z) = \sum_{j=0}^{p} A_j z^j; \qquad C(z) = \sum_{j=0}^{r} C_j z^j; \qquad B(z) = \sum_{j=0}^{q} B_j z^j \]
+
+we use z to denote the backward shift operator, for example z xt = xt−1. The matrix polynomials
+A(z), B(z) and C(z) are the associated autoregressive, moving average matrix polynomials, and the
+exogenous matrix polynomial respectively of order p, q and r respectively. Hence the process described
+
+by (2) is denoted as a VARMAX(p, r, q) process. Here z ∈ C with a duplicate use of z as an operator and
+as a complex variable, which is usual in the signal processing and time series literature, e.g., [15,16,18].
+The assumptions Det(A(z)) ≠ 0 for |z| ≤ 1 and Det(B(z)) ≠ 0 for |z| < 1, z ∈
+C, are imposed so that the VARMAX(p, r, q) process (2) has exactly one stationary solution and the
+condition Det(B(z)) ≠ 0 implies the invertibility condition, see e.g., [15] for more details. Under these
+assumptions, the eigenvalues of the matrix polynomials A(z) and B(z) lie outside the unit circle. The
+eigenvalues of a matrix polynomial Y (z) are the roots of the equation Det(Y (z)) = 0, Det(X) is the
+determinant of X. The VARMAX(p, r, q) stationary process (2) is thoroughly discussed in [15,18,19].
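
In the scalar case, the requirement that all roots of a(z) = 1 + a1 z + · · · + ap zp lie outside the unit circle can be checked without computing the roots. The sketch below is an illustration under stated assumptions, not a method taken from the cited references: it assumes real coefficients and uses the classical step-down (Schur-Cohn / Levinson-type) recursion, in which the condition holds exactly when every reflection coefficient has modulus below one. The function name `roots_outside_unit_circle` is hypothetical.

```python
def roots_outside_unit_circle(coeffs):
    """Check whether a(z) = 1 + a1 z + ... + ap z^p has all roots in |z| > 1.

    coeffs = [a1, ..., ap] with real entries. The step-down recursion lowers the
    order by one per pass; the last coefficient at each order is the reflection
    coefficient, which must satisfy |k| < 1.
    """
    a = list(coeffs)
    while a:
        k = a[-1]                                   # reflection coefficient
        if abs(k) >= 1:
            return False
        p = len(a)
        # coefficients of the order p-1 polynomial
        a = [(a[i] - k * a[p - 2 - i]) / (1 - k * k) for i in range(p - 1)]
    return True


print(roots_outside_unit_circle([-0.5]))          # a(z) = 1 - 0.5 z, root z = 2
print(roots_outside_unit_circle([-2.0]))          # a(z) = 1 - 2 z, root z = 0.5
print(roots_outside_unit_circle([-1.5, 0.56]))    # (1 - 0.7 z)(1 - 0.8 z)
print(roots_outside_unit_circle([-1.75, 0.625]))  # (1 - 0.5 z)(1 - 1.25 z)
```

For the matrix polynomials A(z) and B(z) the same question concerns the roots of Det(A(z)) and Det(B(z)); the scalar check applies once the determinant polynomial has been expanded.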
+The error {ϵ(t), t ∈ Z} is a collection of uncorrelated zero mean n-dimensional random variables
+each having positive definite covariance matrix Σ and we assume, for all s, t, Eϑ { x(s) ϵT(t)} = 0, where
+XT denotes the transposition of matrix X and Eϑ represents the expected value under the parameter
+ϑ. The matrix ϑ represents all the VARMAX(p, r, q) parameters, with the total number of parameters
+being n2(p + q + r). For different purposes which will be specified in the next sections, two choices of
+the parameter structure are considered. First, the parameter vector $\vartheta \in \mathbb{R}^{n^2(p+q+r)\times 1}$ is defined by
+
+ϑ = vec {A1, A2, . . . , Ap, C1, C2, . . . , Cr, B1, B2, . . . , Bq}
+(3)
+
+The vec operator transforms a matrix into a vector by stacking the columns of the matrix one
+underneath the other according to $\operatorname{vec} X = \operatorname{col}(\operatorname{col}(X_{ij})_{i=1}^{n})_{j=1}^{n}$, see e.g., [2,20]. A different choice
+is set forth, when the parameter matrix $\vartheta \in \mathbb{R}^{n\times n(p+q+r)}$ is of the form
+
+ϑ = (ϑ1 ϑ2 . . . ϑp ϑp+1 ϑp+2 . . . ϑp+r ϑp+r+1 ϑp+r+2 . . . ϑp+r+q)
+(4)
+
+= (A1 A2 . . . Ap C1 C2 . . . Cr B1 B2 . . . Bq)
+(5)
+
+Representation (5) of the parameter matrix has been used in [21]. The estimation of the matrices A1,
+A2, . . ., Ap, C1, C2, . . ., Cr, B1, B2, . . ., Bq and Σ has received considerable attention in the time series
+and statistical signal processing literature, see e.g., [15,17,19]. In [19], the authors study the asymptotic
+properties of maximum likelihood estimates of the coefficients of VARMAX(p, r, q) processes, stored in
+a (ℓ × 1) vector ϑ, where ℓ = n2(p + q + r).
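
The column-stacking convention of the vec operator and the parameter count ℓ = n2(p + q + r) can be made concrete with a short sketch. It is only illustrative: matrices are plain lists of rows, and the helper names `vec` and `varmax_parameter_vector` are hypothetical. Stacking the columns of the concatenated matrix (A1 . . . Ap C1 . . . Cr B1 . . . Bq) is the same as concatenating the vectors vec(A1), . . ., vec(Bq), which is what the second helper does.

```python
def vec(X):
    """Stack the columns of X (a list of rows) one underneath the other."""
    rows, cols = len(X), len(X[0])
    return [X[i][j] for j in range(cols) for i in range(rows)]


def varmax_parameter_vector(A, C, B):
    """Parameter vector in the ordering of (3): vec of A1..Ap, then C1..Cr, then B1..Bq."""
    theta = []
    for block in list(A) + list(C) + list(B):
        theta.extend(vec(block))
    return theta


# n = 2 with p = r = q = 1, so the vector has n^2 (p + q + r) = 12 entries
A1 = [[0.5, 0.1], [0.0, 0.3]]
C1 = [[1.0, 0.0], [0.0, 1.0]]
B1 = [[0.2, 0.0], [0.4, 0.1]]
theta = varmax_parameter_vector([A1], [C1], [B1])
print(vec([[1, 2], [3, 4]]))   # columns (1, 3) and (2, 4) stacked
print(len(theta))
```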
+Before describing the control-exogenous variable x(t) used in this survey paper, we shall present
+the different special cases of the model described in (1) and (2).
+
+2.5. The Vector ARMA or VARMA Process
+
+When the process (2) does not contain the control process x(t) it yields
+
+A(z)y(t) = B(z)ϵ(t)
+(6)
+
+which is a vector autoregressive and moving average process, VARMA(p, q) process, see e.g., [15].
+The matrix ϑ represents now all the VARMA parameters, with the total number of parameters being
+n2(p+q). The VARMA(p, q) version of the parameter vector ϑ defined in (3) is then given by
+
+ϑ = vec {A1, A2, . . . , Ap, B1, B2, . . . , Bq}
+(7)
+
+The VARMA equivalent of the parameter matrix (4) is then the n × n(p + q) parameter matrix
+
+ϑ = (ϑ1 ϑ2 . . . ϑp ϑp+1 ϑp+2 . . . ϑp+q) = (A1 A2 . . . Ap B1 B2 . . . Bq)
+(8)
+
+A description of the input variable x(t) in (2) follows. Generally, one can assume either that x(t) is
+non-stochastic or that x(t) is stochastic. In the latter case, we assume Eϑ{ x(s) ϵT(t)} = 0, for all s, t, and
+that statistical inference is performed conditionally on the values taken by x(t). In this case it can
+be interpreted as constant, see [22] for a detailed exposition. However, in the papers referred to in this
+survey, like in [21] and [23], the observed input variable x(t) is assumed to be a stationary VARMA
+process, of the form
+
+α(z)x(t) = β(z)η(t)
+(9)
+
+where α(z) and β(z) are the autoregressive and moving average polynomials of appropriate degree
+and {η(t), t ∈ Z} is a collection of uncorrelated zero mean n-dimensional random variables each having
+positive definite covariance matrix Ω. The spectral density of the VARMA process x(t) is Rx(·)/2π
+(for a definition, see e.g., [15,16]), where
+
+\[ R_x(e^{i\omega}) = \alpha^{-1}(e^{i\omega})\, \beta(e^{i\omega})\, \Omega\, \beta^{*}(e^{i\omega})\, \alpha^{-*}(e^{i\omega}), \qquad \omega \in [-\pi, \pi] \tag{10} \]
+
+where i is the imaginary unit with the property i2 = −1, ω is the frequency, the spectral density Rx(eiω)
+is Hermitian, and we further have Rx(eiω) ≥ 0 and $\int_{-\pi}^{\pi} R_x(e^{i\omega})\, d\omega < \infty$. As mentioned above, the
+basic assumption that x(t) and ϵ(t) are independent or at least uncorrelated processes, which corresponds
+geometrically with orthogonal processes, holds, and X* is the complex conjugate transpose of matrix X.
+
+2.6. The ARMAX and ARMA Processes
+
+The scalar equivalents of the VARMAX(p, r, q) and VARMA(p, q) processes, given by (2) and (6)
+respectively, shall now be displayed, to obtain for the ARMAX(p, r, q) process
+
+a(z)y(t) = c(z)x(t) + b(z)ϵ(t)
+(11)
+
+and for the ARMA(p, q) process
+
+a(z)y(t) = b(z)ϵ(t)
+(12)
+
+popularized in, among others, the Box-Jenkins type of time series analysis, see e.g., [15]. Here a(z),
+b(z) and c(z) are respectively the scalar autoregressive, moving average polynomials and exogenous
+polynomial, with corresponding scalar coefficients aj, bj and cj,
+
+\[ a(z) = \sum_{j=0}^{p} a_j z^j; \qquad c(z) = \sum_{j=0}^{r} c_j z^j; \qquad b(z) = \sum_{j=0}^{q} b_j z^j \]
+
+Note that as in the multiple case, a0 = b0 = 1. The parameter vector, ϑ, for the processes (11) and (12)
+is then
+
+ϑ = {a1, a2, . . . , ap, c1, c2, . . . , cr, b1, b2, . . . , bq}
+(14)
+
+and
+
+ϑ = {a1, a2, . . . , ap, b1, b2, . . . , bq}
+(15)
+
+respectively.
+In the next section the matrix algebraic properties of the Fisher information matrix of the stationary
+processes (2), (6), (11) and (12) will be verified. Interconnections with various known structured
+matrices like the Sylvester resultant matrix, the Bezout matrix and Vandermonde matrix are set forth.
+The Fisher information matrix of the various stationary processes is also expressed in terms of the
+unique solutions to the appropriate Stein equations.
+
+3. Structured Matrix Properties of the Asymptotic Fisher Information Matrix of
+Stationary Processes
+
+The Fisher information is an ingredient of the Cramér-Rao inequality, also called by some
+the Cauchy-Schwarz inequality in mathematical statistics, and belongs to the basics of asymptotic
+estimation theory in mathematical statistics. The Cramér-Rao theorem [24] is therefore considered.
+When assuming that the estimators of ϑ, defined in the previous sections, are asymptotically unbiased,
+the inverse of the asymptotic information matrix yields the Cramér-Rao bound, and provided that the
+estimators are asymptotically efficient, the asymptotic covariance matrix then verifies the inequality
+
+\[ \operatorname{Cov}(\hat{\vartheta}) \succeq I^{-1}(\hat{\vartheta}) \]
+
+here $I(\hat{\vartheta})$ is the FIM, $\operatorname{Cov}(\hat{\vartheta})$ is the covariance of $\hat{\vartheta}$, the unbiased estimator of ϑ, for a detailed
+fundamental statistical analysis, see [25,26]. The inverse of the FIM equals the Cramér-Rao lower bound, and the
+subject of the FIM is also of interest in the control theory and signal processing literature, see e.g., [27].
+Its quantum analog was introduced immediately after the foundation of mathematical quantum
+estimation theory in the 1960’s, see [28,29] for a rigorous exposition of the subject. More specifically, the
+Fisher information is also emphasized in the context of quantum information theory, see e.g., [30,31].
+It is clear that the Cramér-Rao inequality takes a lot of attention because it is located on the highly
+exciting boundary of statistics, information and quantum theory and more recently matrix theory. In
+the next sections, the Fisher information matrices of linear stationary processes will be presented and
+its role as a new class of structured matrices will be the subject of study.
+When time series models are the subject, using (2) for all t ∈ Z to determine the residual ϵ(t) or
+ϵt(ϑ), to emphasize the dependency on the parameter vector ϑ, and assuming that x(t) is stochastic and
+that (y(t), x(t)) is a Gaussian stationary process, the asymptotic FIM F(ϑ) is defined by the following
+(ℓ × ℓ) matrix which does not depend on t
+
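+\[ F(\vartheta) = E\left\{ \left( \frac{\partial e_t(\vartheta)}{\partial \vartheta^{\top}} \right)^{\top} \Sigma^{-1} \left( \frac{\partial e_t(\vartheta)}{\partial \vartheta^{\top}} \right) \right\} \tag{16} \]
+
+where the (v × ℓ) matrix ∂(·)/∂ϑ⊤ is the derivative with respect to ϑ⊤ of any (v × 1) column vector
+(·), and ℓ is the total number of parameters. The derivative with respect to ϑ⊤ is used for obtaining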
+the appropriate dimensions. Equality (16) is used for computing the FIM of the various time series
+processes presented in the previous sections and appropriate definitions of the derivatives are used,
+especially for the multivariate processes (2) and (6), see [21,22].
+
+Entropy 2014, 16, 2023–2055
+
+3.1. The Fisher Information Matrix of an ARMA(p, q) Process
+
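+In this section, the focus is on the FIM of the ARMA process (12). When ϑ is given in (15), the
+derivatives in (16) are at the scalar level
+
+\[ \frac{\partial e_t(\vartheta)}{\partial a_j} = \frac{1}{a(z)}\, e_{t-j} \quad \text{for } j = 1, \ldots, p \qquad \text{and} \qquad \frac{\partial e_t(\vartheta)}{\partial b_k} = -\frac{1}{b(z)}\, e_{t-k} \quad \text{for } k = 1, \ldots, q \]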
+
+when combined for all j and k, the FIM of the ARMA process (12) with the variance of the noise process
+ϵt(ϑ) equal to one, yields the block decomposition, see [32]
+
+\[ F(\vartheta) = \begin{pmatrix} F_{aa}(\vartheta) & F_{ab}(\vartheta) \\ F_{ba}(\vartheta) & F_{bb}(\vartheta) \end{pmatrix} \tag{17} \]
+
+The expressions of the different blocks of the matrix F(ϑ) are
+
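+\[ F_{aa}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, u_p^{\top}(z^{-1})}{a(z)\, a(z^{-1})} \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, v_p^{\top}(z)}{a(z)\, \hat{a}(z)}\, dz \tag{18} \]
+
+\[ F_{ab}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, u_q^{\top}(z^{-1})}{a(z)\, b(z^{-1})} \frac{dz}{z} = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, v_q^{\top}(z)}{a(z)\, \hat{b}(z)}\, dz \tag{19} \]
+
+\[ F_{ba}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, u_p^{\top}(z^{-1})}{a(z^{-1})\, b(z)} \frac{dz}{z} = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, v_p^{\top}(z)}{\hat{a}(z)\, b(z)}\, dz \tag{20} \]
+
+\[ F_{bb}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, u_q^{\top}(z^{-1})}{b(z)\, b(z^{-1})} \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, v_q^{\top}(z)}{b(z)\, \hat{b}(z)}\, dz \tag{21} \]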
+
+where the integration above and everywhere below is counterclockwise around the unit circle. The
+reciprocal monic polynomials â(z) and b̂(z) are defined as â(z) = z^p a(z^{-1}) and b̂(z) = z^q b(z^{-1}), and ϑ
+= (a_1, ..., a_p, b_1, ..., b_q)⊤ is introduced in (15). For each positive integer k we have u_k(z) = (1, z, z^2,
+..., z^{k-1})⊤ and v_k(z) = z^{k-1} u_k(z^{-1}). The stability condition of the ARMA(p, q) process
+implies that all the roots of the monic polynomials a(z) and b(z) lie outside the unit circle. Consequently,
+the roots of the polynomials â(z) and b̂(z) lie within the unit circle and will be used as the poles for
+computing the integrals (18)–(21) when Cauchy’s residue theorem is applied. Notice that the FIM F(ϑ)
+is symmetric block Toeplitz so that F_ab(ϑ) = F_ba⊤(ϑ) and the integrands in (18)–(21) are Hermitian.
+The integral expressions (18)–(21) are easily computed by using the standard
+residue theorem. The algorithms displayed in [33] and [22] are suited for numerical computation of,
+among others, the FIM of an ARMA(p, q) process.
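+
+As a worked illustration of the residue computation (a sketch for the special case p = q = 1, which is an assumption made
+here and is not treated explicitly in the text), take a(z) = 1 + a_1 z and b(z) = 1 + b_1 z with |a_1| < 1 and |b_1| < 1.
+Then u_1(z) = v_1(z) = 1, â(z) = z + a_1 and b̂(z) = z + b_1, so the only poles inside the unit circle are z = −a_1
+and z = −b_1, and the residue theorem applied to (18)–(21) gives
+
+\[ F_{aa} = \frac{1}{a(-a_1)} = \frac{1}{1 - a_1^2}, \qquad F_{ab} = -\frac{1}{a(-b_1)} = -\frac{1}{1 - a_1 b_1}, \qquad F_{ba} = -\frac{1}{b(-a_1)} = -\frac{1}{1 - a_1 b_1}, \qquad F_{bb} = \frac{1}{b(-b_1)} = \frac{1}{1 - b_1^2} \]
+
+\[ \operatorname{Det} F(\vartheta) = \frac{1}{(1 - a_1^2)(1 - b_1^2)} - \frac{1}{(1 - a_1 b_1)^2} = \frac{(a_1 - b_1)^2}{(1 - a_1^2)(1 - b_1^2)(1 - a_1 b_1)^2} \]
+
+so in this special case the FIM is singular exactly when a_1 = b_1, that is, when â(z) and b̂(z) share their single root.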
+
+3.2. The Sylvester Resultant Matrix - The Fisher Information Matrix
+
+The resultant property of a matrix is considered. Showing that the FIM F(ϑ) has the matrix
+resultant property amounts to showing that the matrix F(ϑ) becomes singular if and only if the appropriate
+scalar monic polynomials â(z) and b̂(z) have at least one common zero. To illustrate the subject, the
+following known property of two polynomials is set forth. The greatest common divisor (frequently
+abbreviated as GCD) of two polynomials is a polynomial, of the highest possible degree, that is a factor
+of both original polynomials; the roots of the GCD of two polynomials are the common roots of
+the two polynomials. Consider the coefficients of two monic polynomials p(z) and q(z) of finite degree
+as the entries of a matrix such that the matrix becomes singular if and only if the polynomials p(z) and
+q(z) have at least one common root. Such a matrix is called a resultant matrix and its determinant is
+called the resultant. Therefore we present the known (p + q) × (p + q) Sylvester resultant matrix of the
+polynomials a and b, see e.g., [2], to obtain
+
+\[ S(a, b) = \begin{pmatrix}
+1 & a_1 & \cdots & a_p & 0 & \cdots & 0 \\
+0 & \ddots & \ddots & & \ddots & \ddots & \vdots \\
+\vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\
+0 & \cdots & 0 & 1 & a_1 & \cdots & a_p \\
+1 & b_1 & \cdots & b_q & 0 & \cdots & 0 \\
+0 & \ddots & \ddots & & \ddots & \ddots & \vdots \\
+\vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\
+0 & \cdots & 0 & 1 & b_1 & \cdots & b_q
+\end{pmatrix} \tag{22} \]
+
+Consider the q × (p + q) and p × (p + q) upper and lower submatrices S_p(b) and S_q(−a) of the Sylvester
+resultant matrix S(b, −a) such that
+
+\[ S(b, -a) = \begin{pmatrix} S_p(b) \\ -S_q(a) \end{pmatrix} \tag{23} \]
+
+The matrix S(a, b) becomes singular in the presence of one or more common zeros of the monic polynomials â(z)
+and b̂(z); this property is assessed by the following equalities
+
+\[ R(a, b) = \prod_{i=1}^{p} \prod_{j=1}^{q} (\alpha_i - \beta_j), \qquad R(b, a) = (-1)^{pq} \prod_{i=1}^{p} \prod_{j=1}^{q} (\alpha_i - \beta_j) \tag{24} \]
+
+and
+
+\[ R(b, -a) = (-1)^{q} \prod_{i=1}^{p} \prod_{j=1}^{q} (\beta_j - \alpha_i), \qquad R(-b, a) = (-1)^{p} \prod_{i=1}^{p} \prod_{j=1}^{q} (\beta_j - \alpha_i) \tag{25} \]
+
+where R(a, b) is the resultant of â(z) and b̂(z), and is equal to Det S(a, b).
+The string of equalities in (24) and (25)
+holds since R(b, a) = (−1)^{pq} R(a, b), R(b, −a) = (−1)^q R(b, a), and R(−b, a) = (−1)^p R(b, a), see [34]. The
+zeros of the scalar monic polynomials â(z) and b̂(z) are α_i and β_j respectively and are assumed to be
+distinct. By this is meant that, when we have (z − α_i)^{n_{α_i}} and (z − β_j)^{n_{β_j}} with the powers n_{α_i} and n_{β_j} both
+greater than one, only the distinct roots will be considered, free from the corresponding powers.
+
+The key property of the classical Sylvester resultant matrix S(a, b) is that its null space provides a
+complete description of the common zeros of the polynomials involved. In particular, in the scalar
+case the polynomials â(z) and b̂(z) are coprime if and only if S(a, b) is non-singular. The following
+key property of the classical Sylvester resultant matrix S(a, b) is given by the well known theorem on
+resultants,
+
+\[ \dim \operatorname{Ker} S(a, b) = \nu(a, b) \tag{26} \]
+
+where ν(a, b) is the number of common roots of the polynomials â(z) and b̂(z), counting
+multiplicities, see e.g., [3]. The dimension of a subspace V is represented by dim(V), and Ker(X)
+is the null space or kernel of the matrix X, also denoted by Null(X). The null space of an n × n matrix A
+with coefficients in a field K (typically the field of the real numbers or of the complex numbers) is the
+set Ker A = {x ∈ K^n : Ax = 0}, see e.g., [1,2,20].
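+
+A minimal worked check of (22), (24) and (26), assuming the smallest case p = q = 1 (chosen here purely for illustration):
+with a(z) = 1 + a_1 z and b(z) = 1 + b_1 z, the reciprocal polynomials â(z) = z + a_1 and b̂(z) = z + b_1 have the zeros
+α_1 = −a_1 and β_1 = −b_1, and
+
+\[ S(a, b) = \begin{pmatrix} 1 & a_1 \\ 1 & b_1 \end{pmatrix}, \qquad \operatorname{Det} S(a, b) = b_1 - a_1 = \alpha_1 - \beta_1 = R(a, b) \]
+
+When a_1 = b_1 the two rows coincide and the kernel is spanned by the single vector (a_1, −1)⊤, so that
+dim Ker S(a, b) = 1 = ν(a, b); when a_1 ≠ b_1 the matrix is non-singular and ν(a, b) = 0.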
+
+In order to prove that the FIM F (ϑ) fulfills the resultant matrix property, the following
+factorization is derived, Lemma 2.1 in [5],
+
+F(ϑ) = S(b, −a)P(ϑ)S⊤(b, −a)
+(27)
+
+where the matrix ℘(ϑ) ∈ R(p+q)×(p+q) admits the form
+
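+\[ P(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+q}(z)\, u_{p+q}^{\top}(z^{-1})}{a(z)\, b(z)\, a(z^{-1})\, b(z^{-1})} \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+q}(z)\, v_{p+q}^{\top}(z)}{a(z)\, b(z)\, \hat{a}(z)\, \hat{b}(z)}\, dz \tag{28} \]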
+
+It is proved in [5] that the symmetric matrix ℘(ϑ) fulfills the property ℘(ϑ) ≻ O. The factorization (27)
+allows us to show the matrix resultant property of the FIM, as Corollary 2.2 in [5] states:
+the FIM of an ARMA(p, q) process with polynomials a(z) and b(z) of order p, q respectively
+becomes singular if and only if the polynomials â(z) and b̂(z) have at least one common root.
+From Corollary 2.2 in [5] it can be concluded that the FIM of an ARMA(p, q) process and the Sylvester
+resultant matrix S(−b, a) have the same singularity property. By virtue of (26) and (27) we will specify the dimension of
+the null space of the FIM F(ϑ); this is set forth in the following lemma.
+
+3.2.1. Lemma 3.1
+
+Assume that the polynomials â(z) and b̂(z) have ν(a, b) common roots, counting multiplicities.
+The factorization (27) of the FIM and the property (26) enable us to prove the equality
+
+dim (Ker F(ϑ)) = dim (Ker S(b, −a)) = ν(a, b)
+(29)
+
+Proof
+
+The matrix ℘(ϑ) ∈ R^{(p+q)×(p+q)}, given in (27), fulfills the property of positive definiteness, as
+proved in [5]. This implies that a Cholesky decomposition can be applied to ℘(ϑ), see [35] for more
+details, to obtain ℘(ϑ) = L⊤(ϑ)L(ϑ), where L(ϑ) is a (p + q) × (p + q) upper triangular matrix that is unique if
+its diagonal elements are all positive. Consequently, all its eigenvalues are positive, so that the
+matrix L(ϑ) is non-singular, which is the property denoted L(ϑ) ≻ O in what follows. The factorization (27) now admits the representation
+
+F(ϑ) = S(b, −a)L⊤(ϑ)L(ϑ)S⊤(b, −a)
+(30)
+
+and taking into account the property that if A is an m × n matrix, then Ker(A) = Ker(A⊤A), yields when
+applied to (30)
+
+\[ \operatorname{Ker} F(\vartheta) = \operatorname{Ker} S(b, -a)\, L^{\top}(\vartheta)\, L(\vartheta)\, S^{\top}(b, -a) = \operatorname{Ker} L(\vartheta)\, S^{\top}(b, -a) \]
+
+Assume the vector u ∈ Ker L(ϑ)S⊤(b, −a), such that L(ϑ)S⊤(b, −a)u = 0, and set S⊤(b, −a)u = v, so that
+L(ϑ)v = 0. Since the matrix L(ϑ) ≻ O is non-singular, v = 0; this implies S⊤(b, −a)u = 0 ⇒ u ∈ Ker S⊤(b, −a).
+Consequently,
+
+\[ \operatorname{Ker} F(\vartheta) = \operatorname{Ker} S^{\top}(b, -a) \tag{31} \]
+
+We will now consider the Rank-Nullity Theorem, see e.g., [1], if A is an m × n matrix, then
+
+dim (Ker A) + dim (Im A) = n
+
+and the property dim (Im A) = dim (Im AT). When applied to the (p + q) × (p + q) matrix S (b, −a),
+it yields
+
+dim (Ker S(b, −a)) = dim (Ker S⊤(b, −a)) ⇒ dim (Ker F(ϑ)) = dim (Ker S(b, −a))
+
+which completes the proof.
+Notice that the dimension of the null space of matrix A is called the nullity of A and the dimension
+of the image of matrix A, dim (Im A), is termed the rank of matrix A. An alternative proof to the one
+developed in Corollary 2.2 in [5], is given in a corollary to Lemma 3.1, reconfirming the resultant
+
+matrix property of the FIM F (ϑ).
+
+3.2.2. Corollary 3.2
+
+The FIM F(ϑ) of an ARMA(p, q) process becomes singular if and only if the autoregressive and moving
+average polynomials â(z) and b̂(z) have at least one common root.
+
+Proof
+
+By virtue of the equality (31), combined with the property Det S⊤(b, −a) = Det S(b, −a) and
+the matrix resultant property of the Sylvester matrix S(b, −a), we have Det S⊤(b, −a) = 0 ⇔ Ker S⊤(b, −a) ≠ {0}
+if and only if the ARMA(p, q) polynomials â(z) and b̂(z) have at least one common root.
+Equivalently, Det S⊤(b, −a) ≠ 0 ⇔ Ker S⊤(b, −a) = {0} if and only if the ARMA(p, q) polynomials
+â(z) and b̂(z) have no common roots. Consequently, by virtue of the equality Ker F(ϑ) = Ker S⊤(b, −a),
+it can be concluded that the FIM F(ϑ) becomes singular if and only if the ARMA(p, q) polynomials â(z)
+and b̂(z) have at least one common root. This completes the proof.
+
+3.3. The Statistical Distance Measure and the Fisher Information Matrix
+
+In [7] statistical distance measures are studied. Most multivariate statistical techniques are based
+upon the concept of distance. For that purpose a statistical distance measure is considered that is
+a normalized Euclidean distance measure with entries of the FIM as weighting coefficients. The
+measurements x1, x2,. . . , xn are subject to random fluctuations of different magnitudes and have
+therefore different variabilities. It is then important to consider a distance that takes the variability
+of these variables or measurements into account when determining its distance from a fixed point. A
+rotation of the coordinate system through a chosen angle while keeping the scatter of points given
+by the data fixed, is also applied, see [7] for more details. It is shown that when the FIM is positive
+definite, the appropriate statistical distance measure is a metric. In case of a singular FIM of an ARMA
+stationary process, the metric property depends on the rotation angle. The statistical distance measure,
+is based on m parameters unlike a statistical distance measure introduced in quantum information, see
+e.g., [8,9], that is also related to the Fisher information but where the information about one parameter
+in a particular measurement procedure is considered.
+
+The straight-line or Euclidean distance between the stochastic vector x = (x_1, x_2, ..., x_n)⊤
+and the fixed vector y = (y_1, y_2, ..., y_n)⊤, where x, y ∈ R^n, is given by
+
+\[ d(x, y) = \|x - y\| = \left( \sum_{j=1}^{n} (x_j - y_j)^2 \right)^{1/2} \tag{32} \]
+
+where the metric d(x, y):= ||x−y|| is induced by the standard Euclidean norm || · || on Rn, see
+e.g., [2] for the metric conditions.
+The observations x_1, x_2, ..., x_n are used to compute maximum likelihood estimates of the
+parameters ϑ_1, ϑ_2, ..., ϑ_m, where m < n. These estimated parameters are random variables, see
+e.g., [15]. The distance of the estimated vector ϑ ∈ R^m, given in (15), is studied. Entries of the FIM are
+inserted in the distance measure as weighting coefficients. The linear transformation
+
+\[ \tilde{\vartheta} = L_i(\phi)\, \vartheta \tag{33} \]
+
+is applied, where L_i(ϕ) ∈ R^{m×m} is the Givens rotation matrix with rotation angle ϕ, with 0 ≤ ϕ ≤ 2π
+and i ∈ {1, ..., m − 1}, see e.g., [36], and is given by
+
+\[ L_i(\phi) = \begin{pmatrix} I_{i-1} & 0 & 0 & 0 \\ 0 & (\cos(\phi))_{i,i} & (-\sin(\phi))_{i,i+1} & 0 \\ 0 & (\sin(\phi))_{i+1,i} & (\cos(\phi))_{i+1,i+1} & 0 \\ 0 & 0 & 0 & I_{m-i-1} \end{pmatrix}, \qquad 0 \le \phi \le 2\pi \tag{34} \]
+
+The following matrix decomposition is applied in order to obtain a transformed FIM
+
+\[ F_{\phi}(\vartheta) = L_i(\phi)\, F(\vartheta)\, L_i^{\top}(\phi) \tag{35} \]
+
+where F_ϕ(ϑ) and F(ϑ) are respectively the transformed and untransformed Fisher information
+matrices. It is straightforward to conclude that by virtue of (35), the transformed and untransformed
+Fisher information matrices F_ϕ(ϑ) and F(ϑ) are similar since the rotation matrix L_i(ϕ) is orthogonal.
+Two matrices A and B are similar if there exists an invertible matrix X such that the equality AX = XB
+holds. As can be seen, the Givens matrix L_i(ϕ) involves only two coordinates that are affected by the
+rotation angle ϕ whereas the other directions, which correspond to eigenvalues of one, are unaffected
+by the rotation matrix.
+
+By virtue of (35) it can be concluded that a positive definite FIM, F(ϑ) ≻ 0, implies a positive
+definite transformed FIM, F_ϕ(ϑ) ≻ 0. Consequently, the elements on the main diagonal of F(ϑ), f_{1,1},
+f_{2,2}, ..., f_{m,m}, as well as the elements on the main diagonal of F_ϕ(ϑ), f̃_{1,1}, f̃_{2,2}, ..., f̃_{m,m}, are all
+positive. However, the elements on the main diagonal of a singular FIM of a stationary ARMA process
+are also positive.
+As developed in [7], combining (33) and (35) yields the distance measure of the estimated
+parameters ϑ1, ϑ2, . . . , ϑm accordingly, to obtain
+
+\[ d^2_{F_\phi}(\vartheta) = \sum_{j=1,\; j \ne i,\, i+1}^{m} \frac{\vartheta_j^2}{f_{j,j}} + \frac{\{\vartheta_i \cos(\phi) - \vartheta_{i+1} \sin(\phi)\}^2}{\tilde{f}_{i,i}(\phi)} + \frac{\{\vartheta_{i+1} \cos(\phi) + \vartheta_i \sin(\phi)\}^2}{\tilde{f}_{i+1,i+1}(\phi)} \tag{36} \]
+
+where
+
+\[ \tilde{f}_{i,i}(\phi) = f_{i,i} \cos^2(\phi) - f_{i,i+1} \sin(2\phi) + f_{i+1,i+1} \sin^2(\phi) \tag{37} \]
+
+\[ \tilde{f}_{i+1,i+1}(\phi) = f_{i+1,i+1} \cos^2(\phi) + f_{i,i+1} \sin(2\phi) + f_{i,i} \sin^2(\phi) \tag{38} \]
+
+and f_{j,l} are entries of the FIM F(ϑ) whereas f̃_{i,i}(ϕ) and f̃_{i+1,i+1}(ϕ) are the transformed components,
+since the rotation affects only the entries i and i + 1, as can be seen in the matrix L_i(ϕ). In [7], the
+following inequalities are proved
+
+\[ \tilde{f}_{i,i}(\phi) > 0 \quad \text{and} \quad \tilde{f}_{i+1,i+1}(\phi) > 0 \]
+
+which guarantees the metric property of (36).
+When the FIM of an ARMA(p, q) process is considered,
+a combination of (27) and (35) for the ARMA(p, q) parameters given in (15) yields for the
+transformed FIM,
+
+\[ F_{\phi}(\vartheta) = S_{\phi}(-b, a)\, P(\vartheta)\, S_{\phi}^{\top}(-b, a) \tag{39} \]
+
+where ℘(ϑ) is given by (28) and the transformed Sylvester resultant matrix is of the form
+
+\[ S_{\phi}(-b, a) = L_i(\phi)\, S(-b, a) \tag{40} \]
+
+Proposition 3.5 in [7] proves that the transformed FIM F_ϕ(ϑ) and the transformed Sylvester matrix
+S_ϕ(−b, a) fulfill the resultant matrix property by using the equalities (40) and (39). The following
+property is then set forth.
+
+3.3.1. Proposition 3.3
+
+The properties
+
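+\[ \operatorname{Ker} F_{\phi}(\vartheta) = \operatorname{Ker} S_{\phi}^{\top}(-b, a) \quad \text{and} \quad \operatorname{Ker} S_{\phi}(-b, a) = \operatorname{Ker} S(-b, a) \]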
+
+hold true.
+
+Proof
+
+By virtue of the equalities (39), (40) and the orthogonality property of the rotation matrix Li(ϕ)
+which implies that Ker Li(ϕ) = {0} combined with the same approach as in Lemma 3.1 completes
+the proof.
+A straightforward conclusion from Proposition 3.3 is then
+
+dim Ker Fϕ(ϑ) = dim Ker Sϕ(−b, a), dim Ker Sϕ(−b, a) = dim Ker S(−b, a)
+
+In the next section a distance measure introduced in quantum information is discussed.
+Statistical Distance Measure - Fisher Information and Quantum Information
+In quantum information, the Fisher information, the information about a parameter θ in a
+particular measurement procedure, is expressed in terms of the statistical distance s, see [8,10]. The
+statistical distance used is defined as a measure to distinguish two probability distributions on the basis
+of measurement outcomes, see [37]. The Fisher information and the statistical distance are statistical
+quantities, and generally refer to many measurements as it is the case in this survey. However, in
+the quantum information theory and quantum statistics context, the problem set up is presented as
+follows. There may or may not be a small phase change θ, and the question is whether it is there. In
+that case you can design quantum experiments that will tell you the answer unambiguously in a single
+measurement. The equality derived is of the form
+
+\[ F(\theta) = \left( \frac{ds}{d\theta} \right)^{2} \tag{41} \]
+
+that is, the Fisher information is the square of the derivative of the statistical distance s with respect to θ.
+This is in contrast to (36), where the square of the statistical distance measure is expressed in terms of entries
+of a FIM F(ϑ) which is based on information about m parameters estimated from n measurements,
+for m < n. A challenging question could therefore be formulated as follows: can a generalization of
+equality (41) be developed in a quantum information context but at the matrix level? To be more
+specific, can many observations or measurements that lead to more than one parameter be treated such that the
+corresponding Fisher information matrix is interconnected to an appropriate statistical distance matrix,
+a matrix whose entries are scalar distance measures? This question could equally be a challenge to
+algebraic matrix theory and to quantum information.
+
+3.4. The Bezoutian - The Fisher Information Matrix
+
+In this section an additional resultant matrix is presented, it concerns the Bezout matrix or
+Bezoutian. The notation of Lancaster and Tismenetsky [2] shall be used and the results presented are
+extracted from [38]. Assume the polynomials a and b given by a(z) = ∑_{j=0}^{n} a_j z^j and b(z) = ∑_{j=0}^{n} b_j z^j,
+cfr. (13) but where p = q = n, and we further assume a_0 = b_0 = 1. The Bezout matrix B(a, b) of the
+polynomials a and b is defined by the relation
+
+\[ a(z)\, b(w) - a(w)\, b(z) = (z - w)\, u_n^{\top}(z)\, B(a, b)\, u_n(w) \]
+
+This matrix is often referred to as the Bezoutian. We will display a decomposition of the Bezout matrix
+B(a, b) developed in [38]. For that purpose the matrix U_ϕ and its inverse T_ϕ are presented, where ϕ is a
+given complex number, to obtain
+\[ U_{\phi} = \begin{pmatrix} 1 & 0 & \cdots & \cdots & 0 \\ -\phi & 1 & \ddots & & \vdots \\ 0 & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & 1 & 0 \\ 0 & \cdots & 0 & -\phi & 1 \end{pmatrix}, \qquad T_{\phi} = \begin{pmatrix} 1 & 0 & \cdots & \cdots & 0 \\ \phi & 1 & \ddots & & \vdots \\ \phi^{2} & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & 1 & 0 \\ \phi^{n-1} & \cdots & \phi^{2} & \phi & 1 \end{pmatrix} \]
+
+Let (1 − α_1 z) and (1 − β_1 z) be factors of a(z) and b(z) respectively, where α_1 and β_1 are zeros of â(z)
+and b̂(z). Consider the factored form of the nth order polynomials a(z) and b(z), a(z) = (1
+− α_1 z) a_{−1}(z) and b(z) = (1 − β_1 z) b_{−1}(z) respectively. Proceeding this way for α_2, ..., α_n yields the
+recursion a_{−(k−1)}(z) = (1 − α_k z) a_{−k}(z), and likewise for the polynomials b_{−k}(z), with a_0(z) = a(z) and b_0(z)
+= b(z). Proposition 3.1 in [38] is presented.
+The following non-symmetric decomposition of the Bezoutian is derived, considering the
+notations above
+
+\[ B(a, b) = U_{\alpha_1} \begin{pmatrix} B(a_{-1}, b_{-1}) & 0 \\ 0 & 0 \end{pmatrix} U_{\beta_1}^{\top} + (\beta_1 - \alpha_1)\, b_{\beta_1} a_{\alpha_1}^{\top} \tag{42} \]
+
+with a_{α_1} such that a_{α_1}⊤ u_n(z) = a_{−1}(z), and similarly for b_{β_1}. Iteration gives the following expansion for the
+Bezout matrix
+
+\[ B(a, b) = \sum_{k=1}^{n} (\beta_k - \alpha_k)\, U_{\alpha_1} \cdots U_{\alpha_{k-1}} U_{\beta_{k+1}} \cdots U_{\beta_n}\, e_1^{n} (e_1^{n})^{\top}\, U_{\beta_1}^{\top} \cdots U_{\beta_{k-1}}^{\top} U_{\alpha_{k+1}}^{\top} \cdots U_{\alpha_n}^{\top} \]
+
+where e_1^n is the first unit standard basis column vector in R^n; by e_j we denote the jth coordinate vector,
+e_j = (0, ..., 1, ..., 0)⊤, with all its components equal to 0 except the jth component which equals 1.
+The following corollaries to Proposition 3.1 in [38] are now presented.
+
+Corollary 3.2 in [38] states the following. Let ϕ be a common zero of the polynomials â(z) and b̂(z). Then a(z) =
+(1 − ϕz) a_{−1}(z) and b(z) = (1 − ϕz) b_{−1}(z) and
+
+\[ B(a, b) = U_{\phi} \begin{pmatrix} B(a_{-1}, b_{-1}) & 0 \\ 0 & 0 \end{pmatrix} U_{\phi}^{\top} \]
+
+This is a direct consequence of (42), from which it can be concluded that the Bezoutian B(a, b) is
+non-singular if and only if the polynomials a(z) and b(z) have no common factors. A similar conclusion
+is drawn for the FIM in (27), so that the matrices F(ϑ) and B(a, b) have the same singularity property.
+Related to Corollary 3.2 in [38], we now give a description of the kernel or null space of
+the Bezout matrix.
+Corollary 3.3 in [38] is now presented. Let ϕ_1, ..., ϕ_m be all the common zeros of the polynomials
+â(z) and b̂(z), with multiplicities n_1, ..., n_m. Let ℓ be the last unit standard basis column vector in R^n
+and put
+
+\[ w_k^{j} = \left( T_{\phi_k}^{j}\, J^{j-1} \right)^{\top} \ell \]
+
+for k = 1, ..., m and j = 1, ..., n_k, where J denotes the forward n × n shift matrix, J_{ij} = 1 if i = j + 1.
+Consequently, the subspace Ker B(a, b) is the linear span of the vectors w_k^j.
+An alternative representation to (27) but involving the Bezoutian B(b, a) and derived in
+Proposition 5.1 in [38] is of the form
+
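+\[ F(\vartheta) = M^{-1}(b, a)\, H(\vartheta)\, M^{-\top}(b, a) \tag{43} \]
+
+where
+
+\[ H(\vartheta) = \begin{pmatrix} I & 0 \\ 0 & B(b, a) \end{pmatrix} Q(\vartheta) \begin{pmatrix} I & 0 \\ 0 & B(b, a) \end{pmatrix} \quad \text{and} \quad M(b, a) = \begin{pmatrix} P & 0 \\ P S(\hat{a}) P & P S(\hat{b}) P \end{pmatrix} \tag{44} \]
+
+and
+
+\[ P = \begin{pmatrix} 0 & \cdots & 0 & 1 \\ \vdots & & 1 & 0 \\ 0 & 1 & & \vdots \\ 1 & 0 & \cdots & 0 \end{pmatrix}, \qquad S(\hat{a}) = \begin{pmatrix} a_{n-1} & a_{n-2} & \cdots & a_0 \\ a_{n-2} & & a_0 & 0 \\ \vdots & a_0 & & \vdots \\ a_0 & 0 & \cdots & 0 \end{pmatrix} \quad \text{and} \quad Q(\vartheta) \succ 0 \]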
+
+The matrix S(â) is the symmetrizer of the polynomial â(z) (in this paper a_0 = 1), see [2], and P is a
+permutation matrix. In [38] it is shown that the matrix Q(ϑ) is the unique solution to an appropriate
+Stein equation and is strictly positive definite. However, in the next section an explicit form of the Stein
+solution Q(ϑ) is developed. Some comments concerning the property summarized in Corollary 5.2
+in [38] follow.
+
+The matrix H(ϑ) is non-singular if and only if the polynomials a(z) and b(z) have no common
+factors. The proof is straightforward since the matrix Q(ϑ) is non-singular, which implies that the
+matrix H(ϑ) is non-singular only when the Bezoutian B(b, a) is non-singular, and this is fulfilled if and
+only if the polynomials a(z) and b(z) have no common factors.
+
+The matrix M(b, a) is non-singular if a_0 ≠ 0 and b_0 ≠ 0, which is the case since we have a_0 =
+b_0 = 1. From (43) it can be concluded that the FIM F(ϑ) is non-singular only when the matrix H(ϑ)
+is non-singular or, by virtue of (44), when the Bezoutian B(b, a) is non-singular. Consequently, the
+singularity conditions of the Bezoutian B(b, a), the FIM F(ϑ) and the Sylvester resultant matrix
+S(b, −a) are equivalent. By virtue of (29), proved in Lemma 3.1, and the
+equality dim(Ker S(a, b)) = dim(Ker B(a, b)), proved in Theorem 21.11 in [1], it can be concluded that
+
+\[ \dim(\operatorname{Ker} S(b, -a)) = \dim(\operatorname{Ker} F(\vartheta)) = \dim(\operatorname{Ker} B(b, a)) = \nu(a, b) \]
+
+3.5. The Stein Equation - The Fisher Information Matrix of an ARMA(p, q) Process
+
+In [12], a link between the FIM of an ARMA process and an appropriate solution of a Stein
+equation is set forth. In this survey paper we shall present some of the results and confront some
+results displayed in the previous sections. However, alternative proofs will be given to some results
+obtained in [12,38].
+The Stein matrix equation is now set forth. Let A ∈ Cm×m, B ∈ Cn×n and Γ ∈ Cn×m and consider
+the Stein equation
+
+S − BSA⊤ = Γ
+(45)
+
+It has a unique solution if and only if λμ ≠ 1 for any λ ∈ σ(A) and μ ∈ σ(B), where the spectrum of D is σ(D)
+= {λ ∈ C : det(λI_m − D) = 0}, the set of eigenvalues of D. The unique solution will be given in the next
+theorem [11].
+
+3.5.1. Theorem 3.4
+
+Let A and B be such that there is a single closed contour C with σ(B) inside C and, for each non-zero w ∈
+σ(A), w^{-1} outside C. Then for an arbitrary Γ the Stein equation (45) has a unique solution S
+
+\[ S = \frac{1}{2\pi i} \oint_{C} (\lambda I_n - B)^{-1}\, \Gamma\, (I_m - \lambda A)^{-\top}\, d\lambda \tag{46} \]
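+
+As a quick sanity check of (45) and (46), consider the scalar case m = n = 1 (an illustrative assumption, with the
+hypothetical scalars A = a, B = b and Γ = γ). The Stein equation reads S − b S a = γ, which has the unique solution
+S = γ/(1 − ab) exactly when ab ≠ 1. Taking a contour C that encloses λ = b and leaves λ = 1/a outside, the only pole of
+the integrand inside C is λ = b, and the residue theorem reproduces the same value:
+
+\[ S = \frac{1}{2\pi i} \oint_{C} \frac{\gamma}{(\lambda - b)(1 - \lambda a)}\, d\lambda = \frac{\gamma}{1 - ab} \]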
+
+In this section an interconnection between the representation (27) of the FIM F(ϑ) and an appropriate
+solution to a Stein equation of the form (45) as developed in [12] is set forth. The distinct roots of
+the polynomials â(z) and b̂(z) are denoted by α_1, α_2, ..., α_p and β_1, β_2, ..., β_q respectively, such
+that the non-singularity of the FIM F(ϑ) is guaranteed. The following representation of the integral
+expression (28) is given when Cauchy’s residue theorem is applied, equation (4.8) in [12]
+
+\[ P(\vartheta) = U(\vartheta)\, D(\vartheta)\, \hat{U}(\vartheta) \tag{47} \]
+
+where
+
+\[ U(\vartheta) = \{ u_{p+q}(\alpha_1), u_{p+q}(\alpha_2), \ldots, u_{p+q}(\alpha_p), u_{p+q}(\beta_1), u_{p+q}(\beta_2), \ldots, u_{p+q}(\beta_q) \} \]
+
+\[ D(\vartheta) = \operatorname{diag}\left\{ \left\{ \frac{1}{\hat{a}(z; \alpha_i)\, \hat{b}(\alpha_i)\, a(\alpha_i)\, b(\alpha_i)} \right\}, \left\{ \frac{1}{\hat{a}(\beta_j)\, \hat{b}(z; \beta_j)\, a(\beta_j)\, b(\beta_j)} \right\} \right\}, \quad i = 1, \ldots, p \text{ and } j = 1, \ldots, q \]
+
+and
+
+\[ \hat{U}(\vartheta) = \{ v_{p+q}(\alpha_1), v_{p+q}(\alpha_2), \ldots, v_{p+q}(\alpha_p), v_{p+q}(\beta_1), v_{p+q}(\beta_2), \ldots, v_{p+q}(\beta_q) \}^{\top} \]
+
+the polynomial p(·; β) is defined accordingly, p(z; β) = p(z)/(z − β), and D(ϑ) is the (p + q) × (p + q) diagonal
+matrix. The matrices U(ϑ) and Û(ϑ) in (47) are the (p + q) × (p + q) Vandermonde matrices V_{αβ} and
+V̂_{αβ} respectively, given by
+
+\[ V_{\alpha\beta} = \begin{pmatrix}
+1 & \alpha_1 & \alpha_1^{2} & \cdots & \alpha_1^{p+q-1} \\
+1 & \alpha_2 & \alpha_2^{2} & \cdots & \alpha_2^{p+q-1} \\
+\vdots & \vdots & \vdots & & \vdots \\
+1 & \alpha_p & \alpha_p^{2} & \cdots & \alpha_p^{p+q-1} \\
+1 & \beta_1 & \beta_1^{2} & \cdots & \beta_1^{p+q-1} \\
+1 & \beta_2 & \beta_2^{2} & \cdots & \beta_2^{p+q-1} \\
+\vdots & \vdots & \vdots & & \vdots \\
+1 & \beta_q & \beta_q^{2} & \cdots & \beta_q^{p+q-1}
+\end{pmatrix} \quad \text{and} \quad \hat{V}_{\alpha\beta} = \begin{pmatrix}
+\alpha_1^{p+q-1} & \alpha_1^{p+q-2} & \cdots & \alpha_1 & 1 \\
+\alpha_2^{p+q-1} & \alpha_2^{p+q-2} & \cdots & \alpha_2 & 1 \\
+\vdots & \vdots & & \vdots & \vdots \\
+\alpha_p^{p+q-1} & \alpha_p^{p+q-2} & \cdots & \alpha_p & 1 \\
+\beta_1^{p+q-1} & \beta_1^{p+q-2} & \cdots & \beta_1 & 1 \\
+\beta_2^{p+q-1} & \beta_2^{p+q-2} & \cdots & \beta_2 & 1 \\
+\vdots & \vdots & & \vdots & \vdots \\
+\beta_q^{p+q-1} & \beta_q^{p+q-2} & \cdots & \beta_q & 1
+\end{pmatrix} \]
+It is clear that the (p + q) × (p + q) Vandermonde matrices V_{αβ} and V̂_{αβ} are nonsingular when α_i ≠ α_j,
+β_k ≠ β_h and α_i ≠ β_k for all i, j = 1, ..., p and k, h = 1, ..., q. A rigorous systematic evaluation of the
+Vandermonde determinants Det V_{αβ} and Det V̂_{αβ} yields
+
+\[ \operatorname{Det} V_{\alpha\beta} = (-1)^{(p+q)(p+q-1)/2}\, \Phi(\alpha_i, \beta_k) \]
+
+where
+
+\[ \Phi(\alpha_i, \beta_k) = \prod_{1 \le i < j \le p} (\alpha_i - \alpha_j) \prod_{1 \le k < h \le q} (\beta_k - \beta_h) \prod_{m=1}^{p} \prod_{n=1}^{q} (\alpha_m - \beta_n) \]
+
+Since V_{αβ} = P V̂_{αβ}⊤ and given the configuration of the permutation matrix P, this leads to the equalities
+Det V̂_{αβ}⊤ = Det P Det V_{αβ} and Det P = (−1)^{(p+q)(p+q−1)/2}, so that
+
+\[ \operatorname{Det} \hat{V}_{\alpha\beta} = (-1)^{(p+q)(p+q-1)}\, \Phi(\alpha_i, \beta_k) \;\Rightarrow\; |\operatorname{Det} V_{\alpha\beta}| = |\operatorname{Det} \hat{V}_{\alpha\beta}| \]
+
+We shall now introduce an appropriate Stein equation of the form (45) such that an interconnection with
+℘(ϑ) in (47) can be verified. Therefore the following (p + q)× (p + q) companion matrix is introduced,
+
+\[ C_g = \begin{pmatrix} 0 & 1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \\ -g_{p+q} & -g_{p+q-1} & \cdots & -g_1 \end{pmatrix} \tag{48} \]
+
+where the entries g_i are given by z^{p+q} + ∑_{i=1}^{p+q} g_i(ϑ) z^{p+q−i} = â(z) b̂(z) = ĝ(z, ϑ) and ĝ(ϑ) is the vector
+ĝ(ϑ) = (g_{p+q}(ϑ), g_{p+q−1}(ϑ), ..., g_1(ϑ))⊤. Likewise, g(z, ϑ) = a(z)b(z) and g(ϑ) = (g_1(ϑ), g_2(ϑ),
+..., g_{p+q}(ϑ))⊤; for the properties of a companion matrix see e.g., [36], [2]. Since all
+the roots of the polynomials â(z) and b̂(z) are distinct and lie within the unit circle, the
+products α_i β_j ≠ 1, α_i α_j ≠ 1 and β_i β_j ≠ 1 hold for all i = 1, 2, ..., p and j = 1, 2, ..., q. Consequently,
+the uniqueness condition of the solution of an appropriate Stein equation is verified. The following
+Stein equation and its solution, according to (45) and (46), are now presented
+
+\[ S - C_g S C_g^{\top} = \Gamma \quad \text{and} \quad S = \frac{1}{2\pi i} \oint_{|z|=1} (z I_{p+q} - C_g)^{-1}\, \Gamma\, (I_{p+q} - z C_g)^{-\top}\, dz \]
+
+where the closed contour is now the unit circle |z| = 1 and the matrix Γ is of size (p + q) × (p + q). A
+more explicit expression of the solution S is of the form
+
+\[ S = \frac{1}{2\pi i} \oint_{|z|=1} \frac{\operatorname{adj}(z I_{p+q} - C_g)\, \Gamma\, \operatorname{adj}(I_{p+q} - z C_g)^{\top}}{a(z)\, b(z)\, \hat{a}(z)\, \hat{b}(z)}\, dz \tag{49} \]
+
+where adj(X) = X^{-1} Det(X) is the adjoint of the matrix X. When Cauchy’s residue theorem is applied to the
+solution S in (49), the following factored form of S is derived, equation (4.9) in [12]
+
+\[ S = (C_1, C_2)\, (I_{p+q} \otimes \Gamma)\, (D(\vartheta) \otimes I_{p+q})\, (C_3, C_4)^{\top} \tag{50} \]
+
+where
+
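+\[ \begin{aligned}
+C_1 &= \operatorname{adj}(\alpha_1 I_{p+q} - C_g),\ \operatorname{adj}(\alpha_2 I_{p+q} - C_g),\ \ldots,\ \operatorname{adj}(\alpha_p I_{p+q} - C_g) \\
+C_2 &= \operatorname{adj}(\beta_1 I_{p+q} - C_g),\ \operatorname{adj}(\beta_2 I_{p+q} - C_g),\ \ldots,\ \operatorname{adj}(\beta_q I_{p+q} - C_g) \\
+C_3 &= \operatorname{adj}(I_{p+q} - \alpha_1 C_g),\ \operatorname{adj}(I_{p+q} - \alpha_2 C_g),\ \ldots,\ \operatorname{adj}(I_{p+q} - \alpha_p C_g) \\
+C_4 &= \operatorname{adj}(I_{p+q} - \beta_1 C_g),\ \operatorname{adj}(I_{p+q} - \beta_2 C_g),\ \ldots,\ \operatorname{adj}(I_{p+q} - \beta_q C_g)
+\end{aligned} \]
+
+and D(ϑ) is given in (47). The following matrix rule is applied
+
+\[ (A \otimes B)(C \otimes D) = AC \otimes BD \]
+
+and the operator ⊗ is the tensor (Kronecker) product of two matrices, see e.g., [2], [20].
+Combining (47) and (50) and taking the assumption α_i ≠ α_j, β_k ≠ β_h and α_i ≠ β_k into account
+implies that the inverses of the (p + q) × (p + q) Vandermonde matrices V_{αβ} and V̂_{αβ} exist, as Lemma
+4.2 in [12] states.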
+The following equality holds true
+
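+\[ S = (C_1, C_2) \left( V_{\alpha\beta}^{-1}\, P(\vartheta)\, \hat{V}_{\alpha\beta}^{-1} \otimes \Gamma \right) (C_3, C_4)^{\top} \]
+
+or
+
+\[ S = (C_1, C_2) \left( V_{\alpha\beta}^{-1}\, S^{-1}(b, -a)\, F(\vartheta)\, S^{-\top}(b, -a)\, \hat{V}_{\alpha\beta}^{-1} \otimes \Gamma \right) (C_3, C_4)^{\top} \tag{51} \]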
+
+Consequently, under the condition α_i ≠ α_j, β_k ≠ β_h and α_i ≠ β_k, and by virtue of (27) and (51),
+an interconnection involving the FIM F(ϑ), a solution S to an appropriate Stein equation, the
+Sylvester matrix S(b, −a) and the Vandermonde matrices V_{αβ} and V̂_{αβ} is established. It is clear that by using the
+expression (43), the Bezoutian B(a, b) can be inserted in equality (51).
+We will formulate a Stein equation when the matrix Γ = e_{p+q} e_{p+q}⊤,
+
+\[ S - C_g S C_g^{\top} = e_{p+q}\, e_{p+q}^{\top} \tag{52} \]
+
+where e_{p+q} is the last standard basis column vector in R^{p+q}; in general, e_i^m is the i-th unit standard basis
+column vector in R^m, with all its components equal to 0 except the i-th component which equals 1. The
+next lemma is formulated.
+
+3.5.2. Lemma 3.5
+
+The symmetric matrix ℘(ϑ) defined in (28) fulfills the Stein Equation (52).
+
+Proof
+
+The unique solution of (52) is according to (46)
+
+95
+
+
+Entropy 2014, 16, 2023–2055
+
+S =
+1
+
+2πi
+
+�
+
+|z|=1
+(zIp+q − Cg)−1ep+qe⊤
+p+q(Ip+q − zCg)−⊤dz
+
+more explictely written,
+
+S =
+1
+
+2πi
+
+�
+
+|z|=1
+
+adj(zIp+q − Cg)ep+qe⊤
+p+qadj(Ip+q − zCg)⊤
+
+a(z)b(z)ˆa(z)ˆb(z)
+dz
+
+Using the property of the companion matrix $C_g$, standard computation shows that the last column
+of $\operatorname{adj}(zI_{p+q} - C_g)$ is the basic vector $u_{p+q}(z)$ and consequently the last column of $\operatorname{adj}(I_{p+q} - zC_g)$
+is the basic vector $v_{p+q}(z) = z^{p+q-1}u_{p+q}(z^{-1})$. This implies that $\operatorname{adj}(zI_{p+q} - C_g)e_{p+q} = u_{p+q}(z)$ and
+$e_{p+q}^{\top}\operatorname{adj}(I_{p+q} - zC_g)^{\top} = v_{p+q}^{\top}(z)$, or
+\[
+S = \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_{p+q}(z)\,v_{p+q}^{\top}(z)}{a(z)b(z)\hat a(z)\hat b(z)}\, dz = \wp(\vartheta)
+\]
+
+Consequently, the solution $S$ to the Stein equation (52) coincides with the matrix ℘(ϑ) defined in (28).
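+
+As a quick sanity check of the contour-integral representation used in this proof, consider the scalar case. This is an illustration added here and is not part of the original derivation; the real number $c$ with $|c| < 1$ is an assumed stand-in for the companion matrix. The scalar Stein equation $s - c\,s\,c = 1$ has the solution $s = 1/(1-c^2)$. The integrand below has a single pole inside the unit circle, at $z = c$, since the other pole $z = 1/c$ lies outside it, so the residue theorem gives
+\[
+\frac{1}{2\pi i}\oint_{|z|=1} \frac{dz}{(z-c)(1-zc)} = \left.\frac{1}{1-zc}\right|_{z=c} = \frac{1}{1-c^{2}},
+\]
+which agrees with the direct solution.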
+
+The Stein equation that is verified by the FIM $F(\vartheta)$ will be considered. For that purpose we
+display the following $p \times p$ and $q \times q$ companion matrices $C_a$ and $C_b$ of the form
+\[
+C_a = \begin{pmatrix}
+-a_1 & -a_2 & \cdots & \cdots & -a_p \\
+1 & 0 & \cdots & \cdots & 0 \\
+0 & \ddots & \ddots & & \vdots \\
+\vdots & \ddots & \ddots & \ddots & \vdots \\
+0 & \cdots & 0 & 1 & 0
+\end{pmatrix},
+\qquad
+C_b = \begin{pmatrix}
+-b_1 & -b_2 & \cdots & \cdots & -b_q \\
+1 & 0 & \cdots & \cdots & 0 \\
+0 & \ddots & \ddots & & \vdots \\
+\vdots & \ddots & \ddots & \ddots & \vdots \\
+0 & \cdots & 0 & 1 & 0
+\end{pmatrix}
+\]
+respectively. Introduce the $(p+q)\times(p+q)$ matrix $K(\vartheta)$ and the $(p+q)\times 1$ vector $B$,
+\[
+K(\vartheta) = \begin{pmatrix} C_a & O \\ O & C_b \end{pmatrix},
+\qquad
+B = \begin{pmatrix} e_p^1 \\ -e_q^1 \end{pmatrix},
+\]
+where $e_p^1$ and $e_q^1$ are the first standard basis column vectors in $\mathbb{R}^p$ and $\mathbb{R}^q$ respectively.
+
+Consider the Stein equation
+\[
+S - K(\vartheta)\,S\,K^{\top}(\vartheta) = BB^{\top} \tag{53}
+\]
+followed by the theorem.
+
+3.5.3. Theorem 3.6
+
+The Fisher information matrix $F(\vartheta)$ (17) coincides with the solution to the Stein equation (53).
+
+Proof
+
+The eigenvalues of the companion matrices $C_a$ and $C_b$ are respectively the zeros of the
+polynomials $\hat a(z)$ and $\hat b(z)$, which are in absolute value smaller than one. This implies that the unique
+solution of the Stein equation (53) exists and is given by
+\[
+S = \frac{1}{2\pi i}\oint_{|z|=1} (zI_{p+q} - K(\vartheta))^{-1}\, BB^{\top}\, (I_{p+q} - zK(\vartheta))^{-\top}\, dz
+\]
+
+
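+developing this integral expression in a more explicit form yields
+\[
+S = \frac{1}{2\pi i}\oint_{|z|=1}
+\begin{pmatrix} \dfrac{\operatorname{adj}(zI_p - C_a)}{\hat a(z)} & O \\ O & \dfrac{\operatorname{adj}(zI_q - C_b)}{\hat b(z)} \end{pmatrix}
+\begin{pmatrix} e_p^1 \\ -e_q^1 \end{pmatrix}
+\left\{
+\begin{pmatrix} \dfrac{\operatorname{adj}(I_p - zC_a)}{a(z)} & O \\ O & \dfrac{\operatorname{adj}(I_q - zC_b)}{b(z)} \end{pmatrix}
+\begin{pmatrix} e_p^1 \\ -e_q^1 \end{pmatrix}
+\right\}^{\top} dz
+\]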
+
+Considering the form of the companion matrices $C_a$ and $C_b$ leads, through straightforward
+computation, to the conclusion that the first column of $\operatorname{adj}(zI_p - C_a)$ is the basic vector $v_p(z)$ and
+consequently the first column of $\operatorname{adj}(I_p - zC_a)$ is the basic vector $u_p(z)$. The same holds for the companion
+matrix $C_b$, and this yields
+\[
+S = \frac{1}{2\pi i}\oint_{|z|=1}
+\begin{pmatrix} \dfrac{v_p(z)}{\hat a(z)} \\ -\dfrac{v_q(z)}{\hat b(z)} \end{pmatrix}
+\begin{pmatrix} \dfrac{u_p^{\top}(z)}{a(z)} & -\dfrac{u_q^{\top}(z)}{b(z)} \end{pmatrix} dz \tag{54}
+\]
+
+To obtain a representation equivalent to the FIM $F(\vartheta)$ in (17), the transpose of the solution (54)
+to the Stein equation (53) is required, which gives
+\[
+S^{\top} = \frac{1}{2\pi i}\oint_{|z|=1}
+\begin{pmatrix}
+\dfrac{u_p(z)v_p^{\top}(z)}{a(z)\hat a(z)} & -\dfrac{u_p(z)v_q^{\top}(z)}{a(z)\hat b(z)} \\
+-\dfrac{u_q(z)v_p^{\top}(z)}{\hat a(z)b(z)} & \dfrac{u_q(z)v_q^{\top}(z)}{b(z)\hat b(z)}
+\end{pmatrix} dz = F(\vartheta) \tag{55}
+\]
+
+or
+\[
+S^{\top} = \frac{1}{2\pi i}\oint_{|z|=1} (I_{p+q} - zK(\vartheta))^{-1}\, BB^{\top}\, (zI_{p+q} - K(\vartheta))^{-\top}\, dz = F(\vartheta)
+\]
+
+The symmetry property of the FIM $F(\vartheta)$ leads to $S = F(\vartheta)$. From the representation (55) it can be
+concluded that the solution $S$ of the Stein equation (53) coincides with the symmetric block Toeplitz FIM $F(\vartheta)$
+given in (17). This completes the proof.
+It is straightforward to verify that the submatrix (1,2) in (55) is the complex conjugate transpose
+of the submatrix (2,1), whereas each submatrix on the main diagonal is Hermitian; consequently,
+the integrand is Hermitian. This implies that applying the standard residue theorem yields
+$F(\vartheta) = F^{\top}(\vartheta)$.
+An Illustrative Example of Theorem 3.6
+To illustrate Theorem 3.6, the case of an ARMA(2, 2) process is considered. We will use the
+representation (17) for computing the FIM $F(\vartheta)$ of an ARMA(2, 2) process. The autoregressive and
+moving average polynomials are of degree two, or $p = q = 2$, and the ARMA(2, 2) process is described by
+\[
+y(t)\,a(z) = b(z)\,\epsilon(t) \tag{56}
+\]
+where $y(t)$ is the stationary process driven by white noise $\epsilon(t)$, $a(z) = 1 + a_1 z + a_2 z^2$, $b(z) = 1 + b_1 z + b_2 z^2$
+and the parameter vector is $\vartheta = (a_1, a_2, b_1, b_2)^{\top}$. The condition that the zeros of the polynomials
+\[
+\hat a(z) = z^2 a(z^{-1}) = z^2 + a_1 z + a_2 \quad\text{and}\quad \hat b(z) = z^2 b(z^{-1}) = z^2 + b_1 z + b_2
+\]
+are in absolute value smaller than one is imposed. The FIM $F(\vartheta)$ of the ARMA(2, 2) process (56) is of
+the form
+
+
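+\[
+F(\vartheta) = \begin{pmatrix} F_{aa}(\vartheta) & F_{ab}(\vartheta) \\ F_{ab}^{\top}(\vartheta) & F_{bb}(\vartheta) \end{pmatrix} \tag{57}
+\]
+where
+\[
+F_{aa}(\vartheta) = \frac{1}{(1-a_2)\left[(1+a_2)^2 - a_1^2\right]}
+\begin{pmatrix} 1+a_2 & -a_1 \\ -a_1 & 1+a_2 \end{pmatrix},
+\qquad
+F_{bb}(\vartheta) = \frac{1}{(1-b_2)\left[(1+b_2)^2 - b_1^2\right]}
+\begin{pmatrix} 1+b_2 & -b_1 \\ -b_1 & 1+b_2 \end{pmatrix},
+\]
+\[
+F_{ab}(\vartheta) = \frac{1}{(a_2b_2-1)^2 + (a_2b_1-a_1)(b_1-a_1b_2)}
+\begin{pmatrix} a_2b_2-1 & a_1-a_2b_1 \\ b_1-a_1b_2 & a_2b_2-1 \end{pmatrix}
+\]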
+
+The submatrices $F_{aa}(\vartheta)$ and $F_{bb}(\vartheta)$ are symmetric and Toeplitz, whereas $F_{ab}(\vartheta)$ is Toeplitz. One can
+assert, without any loss of generality, that the symmetric block Toeplitz property holds for the class
+of Fisher information matrices of stationary ARMA(p, q) processes, where p and q are arbitrary, finite
+integers that represent the degrees of the autoregressive and moving average polynomials, respectively.
+
+The appropriate companion matrices $C_a$, $C_b$ and the $4 \times 4$ matrices $K(\vartheta)$ and $BB^{\top}$ are
+\[
+C_a = \begin{pmatrix} -a_1 & -a_2 \\ 1 & 0 \end{pmatrix},
+\quad
+C_b = \begin{pmatrix} -b_1 & -b_2 \\ 1 & 0 \end{pmatrix},
+\quad
+K(\vartheta) = \begin{pmatrix}
+-a_1 & -a_2 & 0 & 0 \\
+1 & 0 & 0 & 0 \\
+0 & 0 & -b_1 & -b_2 \\
+0 & 0 & 1 & 0
+\end{pmatrix}
+\quad\text{and}\quad
+BB^{\top} = \begin{pmatrix}
+1 & 0 & -1 & 0 \\
+0 & 0 & 0 & 0 \\
+-1 & 0 & 1 & 0 \\
+0 & 0 & 0 & 0
+\end{pmatrix} \tag{58}
+\]
+where $B = (1, 0, -1, 0)^{\top}$. It can be verified that the Stein equation
+\[
+F(\vartheta) - K(\vartheta)\,F(\vartheta)\,K^{\top}(\vartheta) = BB^{\top}
+\]
+holds true when $F(\vartheta)$ is of the form (57) and the matrices $K(\vartheta)$ and $BB^{\top}$ are given in (58).
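+
+As an illustrative check, the (1,1) entry of this Stein equation can be verified by hand from (57) and (58). This worked computation is added here and is not part of the original text; the shorthand $\kappa = 1/\{(1-a_2)[(1+a_2)^2 - a_1^2]\}$ for the scalar factor of $F_{aa}(\vartheta)$ is introduced only for this check. Since $K(\vartheta)$ is block diagonal and the first row of $C_a$ is $(-a_1, -a_2)$,
+\[
+\left(K F K^{\top}\right)_{11} = \left(C_a F_{aa} C_a^{\top}\right)_{11}
+= \kappa\left[a_1^2(1+a_2) - 2a_1^2a_2 + a_2^2(1+a_2)\right]
+= \kappa\left[a_1^2(1-a_2) + a_2^2(1+a_2)\right],
+\]
+and therefore
+\[
+\left(F - K F K^{\top}\right)_{11}
+= \kappa\left[(1+a_2)(1-a_2^2) - a_1^2(1-a_2)\right]
+= \kappa\,(1-a_2)\left[(1+a_2)^2 - a_1^2\right] = 1,
+\]
+which matches the (1,1) entry of $BB^{\top}$ in (58). The remaining entries can be checked in the same way.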
+
+3.5.4. Some Additional Results
+
+In Proposition 5.1 in [38], the matrix $Q(\vartheta)$ in (44) fulfills the Stein equation (59) and the property $Q(\vartheta) \succ 0$ is
+proved. It states that when $e_P^{\top} = (e_1^{\top}P,\, 0)^{\top} = (e_n, 0_n)^{\top} \in \mathbb{R}^{2n}$, where $e_1$ is the first unit standard basis
+column vector in $\mathbb{R}^n$ and $e_n$ is the last or $n$-th unit standard basis column vector in $\mathbb{R}^n$, the following
+Stein equation admits the form
+\[
+Q(\vartheta) = F_N(\vartheta)\,Q(\vartheta)\,F_N^{\top}(\vartheta) + e_P e_P^{\top} \tag{59}
+\]
+where
+\[
+F_N(\vartheta) = \begin{pmatrix} \hat C_a & 0 \\ e_1e_1^{\top} & C_b \end{pmatrix},
+\qquad
+\hat C_a = \begin{pmatrix}
+0 & 1 & 0 & \cdots & 0 \\
+0 & 0 & 1 & \cdots & 0 \\
+\vdots & \vdots & \ddots & \ddots & \vdots \\
+0 & \cdots & \cdots & 0 & 1 \\
+-a_p & -a_{p-1} & \cdots & \cdots & -a_1
+\end{pmatrix}
+\]
+
+A corollary to Proposition 5.1 in [38] will be set forth, which confirms the involvement of various Vandermonde matrices
+in the explicit solution to the Stein equation (59). For that purpose the following Vandermonde matrices
+are displayed,
+
+
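+\[
+V_{\alpha} = \begin{pmatrix}
+1 & 1 & \cdots & 1 \\
+\alpha_1 & \alpha_2 & \cdots & \alpha_n \\
+\alpha_1^2 & \alpha_2^2 & \cdots & \alpha_n^2 \\
+\vdots & \vdots & & \vdots \\
+\alpha_1^{n-1} & \alpha_2^{n-1} & \cdots & \alpha_n^{n-1}
+\end{pmatrix},
+\qquad
+\hat V_{\alpha} = \begin{pmatrix}
+\alpha_1^{n-1} & \alpha_1^{n-2} & \cdots & 1 \\
+\alpha_2^{n-1} & \alpha_2^{n-2} & \cdots & 1 \\
+\alpha_3^{n-1} & \alpha_3^{n-2} & \cdots & 1 \\
+\vdots & \vdots & & \vdots \\
+\alpha_n^{n-1} & \alpha_n^{n-2} & \cdots & 1
+\end{pmatrix},
+\]
+\[
+\hat V_{\alpha\beta} = \begin{pmatrix} \hat V_{\alpha} \\ \hat V_{\beta} \end{pmatrix}
+\quad\text{and}\quad
+V_{\alpha\beta} = \begin{pmatrix} V_{\alpha} & V_{\beta} \end{pmatrix} \tag{60}
+\]
+where $\hat V_{\beta}$ and $V_{\beta}$ have the same configuration as $\hat V_{\alpha}$ and $V_{\alpha}$ respectively. A corollary to Proposition
+5.1 in [38] is now formulated.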
+
+3.5.5. Corollary 3.7
+
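+An explicit expression of the solution to the Stein equation (59) is of the form
+\[
+Q(\vartheta) = \begin{pmatrix}
+V_{\alpha}D_{11}(\vartheta)\hat V_{\alpha} & V_{\alpha}D_{12}(\vartheta)V_{\alpha}^{\top} \\
+\hat V_{\alpha\beta}^{\top}D_{21}(\vartheta)\hat V_{\alpha\beta} & \hat V_{\alpha\beta}^{\top}D_{22}(\vartheta)V_{\alpha\beta}^{\top}
+\end{pmatrix} \tag{61}
+\]
+where the $n \times n$ and $2n \times 2n$ diagonal matrices $D_{kl}(\vartheta)$ shall be specified in the proof.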
+
+Proof
+
+The condition of a unique solution of the Stein equation (59) is guaranteed since the eigenvalues of the
+companion matrices $\hat C_a$ and $C_b$, given respectively by the zeros of the polynomials $\hat a(z)$ and $\hat b(z)$,
+are in absolute value smaller than one. Consequently, the unique solution to the Stein equation (59) exists and is
+given by
+\[
+Q(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} (zI_{2n} - F_N(\vartheta))^{-1}\, e_Pe_P^{\top}\, (I_{2n} - zF_N(\vartheta))^{-\top}\, dz \tag{62}
+\]
+
+In order to proceed, the following matrix property is displayed,
+\[
+\begin{pmatrix} A & O \\ B & C \end{pmatrix}^{-1}
+= \begin{pmatrix} A^{-1} & O \\ -C^{-1}BA^{-1} & C^{-1} \end{pmatrix}
+\]
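+
+This block-inverse property can be confirmed by direct multiplication. The short verification below is added for illustration and assumes that $A$ and $C$ are square and invertible:
+\[
+\begin{pmatrix} A & O \\ B & C \end{pmatrix}
+\begin{pmatrix} A^{-1} & O \\ -C^{-1}BA^{-1} & C^{-1} \end{pmatrix}
+= \begin{pmatrix} AA^{-1} & O \\ BA^{-1} - CC^{-1}BA^{-1} & CC^{-1} \end{pmatrix}
+= \begin{pmatrix} I & O \\ O & I \end{pmatrix}.
+\]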
+
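+When applied to (62), it yields
+\[
+Q(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1}
+\begin{pmatrix}
+\dfrac{\operatorname{adj}(zI_p - \hat C_a)}{\hat a(z)} & O \\
+\dfrac{\operatorname{adj}(zI_q - C_b)\,e_1e_1^{\top}\operatorname{adj}(zI_p - \hat C_a)}{\hat a(z)\hat b(z)} & \dfrac{\operatorname{adj}(zI_q - C_b)}{\hat b(z)}
+\end{pmatrix}
+\begin{pmatrix} e_n \\ 0 \end{pmatrix}
+\]
+\[
+\qquad\times
+\left\{
+\begin{pmatrix}
+\dfrac{\operatorname{adj}(I_n - z\hat C_a)}{a(z)} & O \\
+\dfrac{\operatorname{adj}(I_n - zC_b)\,e_1e_1^{\top}\operatorname{adj}(I_p - z\hat C_a)}{a(z)b(z)} & \dfrac{\operatorname{adj}(I_n - zC_b)}{b(z)}
+\end{pmatrix}
+\begin{pmatrix} e_n \\ 0 \end{pmatrix}
+\right\}^{\top} dz
+\]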
+
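+Considering that the last column vectors of the matrices $\operatorname{adj}(zI_p - \hat C_a)$ and $\operatorname{adj}(I_n - z\hat C_a)$ are the
+vectors $u_n(z)$ and $v_n(z)$ respectively, it then yields
+\[
+Q(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1}
+\begin{pmatrix} \dfrac{u_n(z)}{\hat a(z)} \\ \dfrac{v_n(z)}{\hat a(z)\hat b(z)} \end{pmatrix}
+\begin{pmatrix} \dfrac{v_n^{\top}(z)}{a(z)} & \dfrac{z^{n-1}u_n^{\top}(z)}{a(z)b(z)} \end{pmatrix} dz
+\]
+\[
+= \frac{1}{2\pi i}\oint_{|z|=1}
+\begin{pmatrix}
+\dfrac{u_n(z)v_n^{\top}(z)}{a(z)\hat a(z)} & \dfrac{z^{n-1}u_n(z)u_n^{\top}(z)}{\hat a(z)a(z)b(z)} \\
+\dfrac{v_n(z)v_n^{\top}(z)}{\hat a(z)\hat b(z)a(z)} & \dfrac{z^{n-1}v_n(z)u_n^{\top}(z)}{\hat a(z)\hat b(z)a(z)b(z)}
+\end{pmatrix} dz
+= \begin{pmatrix} Q_{11}(\vartheta) & Q_{12}(\vartheta) \\ Q_{21}(\vartheta) & Q_{22}(\vartheta) \end{pmatrix}
+\]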
+
+
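+Applying the standard residue theorem leads, for the respective submatrices, to
+\[
+\begin{aligned}
+Q_{11}(\vartheta) &= \{u_n(\alpha_1), \ldots, u_n(\alpha_n)\}\, D_{11}(\vartheta)\, \{v_n(\alpha_1), \ldots, v_n(\alpha_n)\}^{\top} \\
+Q_{12}(\vartheta) &= \{u_n(\alpha_1), \ldots, u_n(\alpha_n)\}\, D_{12}(\vartheta)\, \{u_n(\alpha_1), \ldots, u_n(\alpha_n)\}^{\top} \\
+Q_{21}(\vartheta) &= \{v_n(\alpha_1), \ldots, v_n(\alpha_n), v_n(\beta_1), \ldots, v_n(\beta_n)\}\, D_{21}(\vartheta)\, \{v_n(\alpha_1), \ldots, v_n(\alpha_n), v_n(\beta_1), \ldots, v_n(\beta_n)\}^{\top} \\
+Q_{22}(\vartheta) &= \{v_n(\alpha_1), \ldots, v_n(\alpha_n), v_n(\beta_1), \ldots, v_n(\beta_n)\}\, D_{22}(\vartheta)\, \{u_n(\alpha_1), \ldots, u_n(\alpha_n), u_n(\beta_1), \ldots, u_n(\beta_n)\}^{\top}
+\end{aligned}
+\]
+where the $n \times n$ diagonal matrices are
+\[
+D_{11}(\vartheta) = \operatorname{diag}\left\{\frac{1}{a(\alpha_i)\,\hat a(z;\alpha_i)}\right\},
+\qquad
+D_{12}(\vartheta) = \operatorname{diag}\left\{\frac{\alpha_i^{n-1}}{a(\alpha_i)\,b(\alpha_i)\,\hat a(z;\alpha_i)}\right\}
+\quad\text{for } i = 1, \ldots, n
+\]
+and the $2n \times 2n$ diagonal matrices are
+\[
+D_{21}(\vartheta) = \operatorname{diag}\left\{\frac{1}{a(\alpha_i)\,\hat b(\alpha_i)\,\hat a(z;\alpha_i)},\ \frac{1}{\hat a(\beta_j)\,a(\beta_j)\,\hat b(z;\beta_j)}\right\},
+\quad\text{for } i, j = 1, \ldots, n
+\]
+\[
+D_{22}(\vartheta) = \operatorname{diag}\left\{\frac{\alpha_i^{n-1}}{a(\alpha_i)\,b(\alpha_i)\,\hat b(\alpha_i)\,\hat a(z;\alpha_i)},\ \frac{\beta_j^{n-1}}{\hat a(\beta_j)\,a(\beta_j)\,b(\beta_j)\,\hat b(z;\beta_j)}\right\},
+\quad\text{for } i, j = 1, \ldots, n
+\]
+Since the first and third matrices in $Q_{11}(\vartheta)$, $Q_{12}(\vartheta)$, $Q_{21}(\vartheta)$ and $Q_{22}(\vartheta)$ are the appropriate
+Vandermonde matrices displayed in (60), it can be concluded that the representation (61) is verified.
+This completes the proof.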
+In this section an explicit form of the solution $Q(\vartheta)$, expressed in terms of various Vandermonde
+matrices, is displayed. Also, an interconnection between the Fisher information $F(\vartheta)$ and appropriate
+solutions to Stein equations and related matrices is presented. Proofs are given that the Stein
+equations are verified by the FIM $F(\vartheta)$ and the associated matrix ℘(ϑ). These are alternatives to the
+proofs developed in [38]. The presence of various forms of Vandermonde matrices is also emphasized.
+
+In the next section some matrix properties of the FIM of an ARMAX process are presented.
+
+3.6. The Fisher Information Matrix of an ARMAX(p, r, q) Process
+
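+The FIM of the ARMAX process (11) is set forth according to [4]. The derivatives in the
+corresponding representation (16) are
+\[
+\frac{\partial \epsilon_t(\vartheta)}{\partial a_j} = \frac{c(z)}{a(z)b(z)}\,x(t-j) + \frac{1}{a(z)}\,\epsilon(t-j),
+\quad
+\frac{\partial \epsilon_t(\vartheta)}{\partial c_l} = -\frac{1}{b(z)}\,x(t-l)
+\quad\text{and}\quad
+\frac{\partial \epsilon_t(\vartheta)}{\partial b_k} = -\frac{1}{b(z)}\,\epsilon_{t-k}
+\]
+where $j = 1, \ldots, p$, $l = 1, \ldots, r$ and $k = 1, \ldots, q$. Combining all $j$, $l$ and $k$ yields the $(p+r+q) \times (p+r+q)$ FIM
+\[
+G(\vartheta) = \begin{pmatrix}
+G_{aa}(\vartheta) & G_{ac}(\vartheta) & G_{ab}(\vartheta) \\
+G_{ac}^{\top}(\vartheta) & G_{cc}(\vartheta) & G_{cb}(\vartheta) \\
+G_{ab}^{\top}(\vartheta) & G_{cb}^{\top}(\vartheta) & G_{bb}(\vartheta)
+\end{pmatrix} \tag{63}
+\]
+where the submatrices of $G(\vartheta)$ are given by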
+
+
+\[
+\begin{aligned}
+G_{aa}(\vartheta) &= \frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_p(z)u_p^{\top}(z^{-1})\,c(z)c(z^{-1})}{a(z)a(z^{-1})b(z)b(z^{-1})}\,\frac{dz}{z}
++ \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_p(z)u_p^{\top}(z^{-1})}{a(z)a(z^{-1})}\,\frac{dz}{z} \\
+&= \frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_p(z)v_p^{\top}(z)\,c(z)\hat c(z)}{a(z)\hat a(z)b(z)\hat b(z)\,z^{r-q}}\,dz
++ \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_p(z)v_p^{\top}(z)}{a(z)\hat a(z)}\,dz \\
+G_{ab}(\vartheta) &= -\frac{1}{2\pi i}\oint_{|z|=1} \frac{u_p(z)u_q^{\top}(z^{-1})}{a(z)b(z^{-1})}\,\frac{dz}{z}
+= -\frac{1}{2\pi i}\oint_{|z|=1} \frac{u_p(z)v_q^{\top}(z)}{a(z)\hat b(z)}\,dz \\
+G_{ac}(\vartheta) &= -\frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_p(z)u_r^{\top}(z^{-1})\,c(z)}{a(z)b(z)b(z^{-1})}\,\frac{dz}{z}
+= -\frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_p(z)v_r^{\top}(z)\,c(z)}{a(z)b(z)\hat b(z)\,z^{r-q}}\,dz \\
+G_{cc}(\vartheta) &= \frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_r(z)u_r^{\top}(z^{-1})}{b(z)b(z^{-1})}\,\frac{dz}{z}
+= \frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_r(z)v_r^{\top}(z)}{b(z)\hat b(z)\,z^{r-q}}\,dz \\
+G_{bb}(\vartheta) &= \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_q(z)u_q^{\top}(z^{-1})}{b(z)b(z^{-1})}\,\frac{dz}{z}
+= \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_q(z)v_q^{\top}(z)}{b(z)\hat b(z)}\,dz,
+\quad\text{and}\quad G_{cb}(\vartheta) = O
+\end{aligned}
+\]
+where $R_x(z)$ is the spectral density of the process $x(t)$ and is defined in (10).
+Let $K(z) = a(z)a(z^{-1})b(z)b(z^{-1})$; combining all the expressions in (63) leads to the following representation of
+$G(\vartheta)$ as the sum of two matrices
+
+\[
+\frac{1}{2\pi i}\oint_{|z|=1} \frac{R_x(z)}{K(z)}
+\begin{pmatrix} c(z)u_p(z) \\ -a(z)u_r(z) \\ O \end{pmatrix}
+\begin{pmatrix} c(z)u_p(z) \\ -a(z)u_r(z) \\ O \end{pmatrix}^{*} \frac{dz}{z}
++ \frac{1}{2\pi i}\oint_{|z|=1} \frac{1}{K(z)}
+\begin{pmatrix} b(z)u_p(z) \\ O \\ -a(z)u_q(z) \end{pmatrix}
+\begin{pmatrix} b(z)u_p(z) \\ O \\ -a(z)u_q(z) \end{pmatrix}^{*} \frac{dz}{z} \tag{64}
+\]
+
+where $(X)^{*}$ is the complex conjugate transpose of the matrix $X \in \mathbb{C}^{m\times n}$. As in (23), we set forth
+\[
+\mathcal{S}(-c, a) = \begin{pmatrix} -\mathcal{S}_p(c) \\ \mathcal{S}_r(a) \end{pmatrix}
+\]
+where the block $-\mathcal{S}_p(c)$ is formed by the top $p$ rows of $\mathcal{S}(-c, a)$. In a similar way we decompose
+\[
+\mathcal{S}(-b, a) = \begin{pmatrix} -\mathcal{S}_p(b) \\ \mathcal{S}_q(a) \end{pmatrix}
+\]
+The representation (64) can be expressed by the appropriate block representations of the Sylvester
+resultant matrices, to obtain
+
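+\[
+G(\vartheta) =
+\begin{pmatrix} -\mathcal{S}_p(c) \\ \mathcal{S}_r(a) \\ O \end{pmatrix} W(\vartheta)
+\begin{pmatrix} -\mathcal{S}_p(c) \\ \mathcal{S}_r(a) \\ O \end{pmatrix}^{\top}
++
+\begin{pmatrix} -\mathcal{S}_p(b) \\ O \\ \mathcal{S}_q(a) \end{pmatrix} \wp(\vartheta)
+\begin{pmatrix} -\mathcal{S}_p(b) \\ O \\ \mathcal{S}_q(a) \end{pmatrix}^{\top} \tag{65}
+\]
+where the matrix ℘(ϑ) is given in (28) and the matrix $W(\vartheta) \in \mathbb{R}^{(p+r)\times(p+r)}$ is of the form
+\[
+W(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_{p+r}(z)u_{p+r}^{\top}(z^{-1})}{a(z)a(z^{-1})b(z)b(z^{-1})}\,\frac{dz}{z}
+= \frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_{p+r}(z)v_{p+r}^{\top}(z)}{a(z)b(z)\hat a(z)\hat b(z)}\,dz \tag{66}
+\]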
+
+It is shown in [4] that $W(\vartheta) \succ O$. As can be seen in (65), the ARMAX part is explained by the first
+term, whereas the ARMA part is described by the second term; the combination of both terms is a
+summary of the Fisher information of an ARMAX(p, r, q) process. The FIM $G(\vartheta)$ under form (65) allows
+us to prove the following property, Theorem 3.1 in [4]: the FIM $G(\vartheta)$ of the ARMAX(p, r, q) process
+with polynomials a(z), c(z) and b(z) of order p, r, q respectively becomes singular if and only if these
+polynomials have at least one common root. Consequently, the class of resultant matrices is extended
+by the FIM $G(\vartheta)$.
+
+3.7. The Stein Equation - The Fisher Information Matrix of an ARMAX(p, r, q) Process
+
+In Lemma 3.5 it is proved that the matrix ℘(ϑ) (28) fulfills the Stein equation (52). We will now consider
+the conditions under which the matrix $W(\vartheta)$ (66) verifies an appropriate Stein equation. For that
+purpose we consider the spectral density to be of the form $R_x(z) = 1/(h(z)h(z^{-1}))$. The degree of the
+polynomial $h(z)$ is $\ell$ and we assume the distinct roots of the polynomial $h(z)$ to lie outside the unit
+circle; consequently, the roots of the polynomial $\hat h(z)$ lie within the unit circle. We therefore rewrite
+$W(\vartheta)$ accordingly
+\[
+W(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_{p+r}(z)u_{p+r}^{\top}(z^{-1})}{h(z)h(z^{-1})a(z)a(z^{-1})b(z)b(z^{-1})}\,\frac{dz}{z}
+\]
+
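+We consider a companion matrix of the form (48) with size $p+q+\ell$; it is denoted by $C_f$ and the
+entries $f_i$ are given by $z^{p+q+\ell} + \sum_{i=1}^{p+q+\ell} f_i(\vartheta)z^{p+q+\ell-i} = \hat a(z)\hat b(z)\hat h(z) = \hat f(z, \vartheta)$, and $\hat f(\vartheta)$ is the vector
+$\hat f(\vartheta) = (f_{p+q+\ell}(\vartheta), f_{p+q+\ell-1}(\vartheta), \ldots, f_1(\vartheta))^{\top}$. Likewise for the polynomial $f(z, \vartheta) = a(z)b(z)h(z)$ and the vector $f(\vartheta) =
+(f_1(\vartheta), f_2(\vartheta), \ldots, f_{p+q+\ell}(\vartheta))^{\top}$. The property $\operatorname{Det}(zI_{p+q+\ell} - C_f) = \hat a(z)\hat b(z)\hat h(z)$ and $\operatorname{Det}(I_{p+q+\ell} - zC_f) =
+a(z)b(z)h(z)$ holds, and we assume
+\[
+r = q + \ell \quad\text{or}\quad p + q + \ell = p + r \quad\text{and}\quad r > q \tag{67}
+\]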
+
+$W(\vartheta)$ is then of the form
+\[
+W(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_{p+r}(z)v_{p+r}^{\top}(z)}{h(z)\hat h(z)a(z)\hat a(z)b(z)\hat b(z)}\,dz \tag{68}
+\]
+We will formulate a Stein equation when the matrix $\Gamma = e_{p+r}e_{p+r}^{\top}$, which is of the form
+\[
+S - C_f S C_f^{\top} = e_{p+r}e_{p+r}^{\top} \tag{69}
+\]
+where $e_{p+r}$ is the last standard basis column vector in $\mathbb{R}^{p+r}$. The next lemma is formulated.
+
+3.7.1. Lemma 3.8
+
+The matrix $W(\vartheta)$ given in (68) fulfills the Stein equation (69).
+
+Proof
+
+The unique solution of (69) is assured since the pairwise products of the eigenvalues of $C_f$ are all different
+from one; the solution is of the form
+\[
+S = \frac{1}{2\pi i}\oint_{|z|=1} (zI_{p+r} - C_f)^{-1}\, e_{p+r}e_{p+r}^{\top}\, (I_{p+r} - zC_f)^{-\top}\, dz
+\]
+or
+\[
+S = \frac{1}{2\pi i}\oint_{|z|=1} \frac{\operatorname{adj}(zI_{p+r} - C_f)\, e_{p+r}e_{p+r}^{\top}\, \operatorname{adj}(I_{p+r} - zC_f)^{\top}}{\hat a(z)\hat b(z)\hat h(z)a(z)b(z)h(z)}\, dz
+\]
+
+
+Taking the property of the companion matrix $C_f$ into account implies that the last column vector of
+$\operatorname{adj}(zI_{p+r} - C_f)$ is the basic vector $u_{p+r}(z)$ and consequently the last column of $\operatorname{adj}(I_{p+r} - zC_f)$ is the basic
+vector $v_{p+r}(z)$. This yields
+\[
+S = \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_{p+r}(z)v_{p+r}^{\top}(z)}{\hat a(z)\hat b(z)\hat h(z)a(z)b(z)h(z)}\, dz = W(\vartheta)
+\]
+Consequently, the matrix $W(\vartheta)$ defined in (68) verifies the Stein equation (69). This completes the proof.
+
+The matrices ℘(ϑ) and $W(\vartheta)$ in (65) verify, under specific conditions, appropriate Stein equations,
+as has been shown in Lemma 3.5 and Lemma 3.8, respectively. We will now confirm the presence of
+Vandermonde matrices by applying the standard residue theorem to $W(\vartheta)$ in (68), to obtain
+\[
+W(\vartheta) = V_{\alpha\beta\xi}\,R(\vartheta)\,\hat V_{\alpha\beta\xi} \tag{70}
+\]
+
+The $(p+r) \times (p+r)$ diagonal matrix $R(\vartheta)$ is of the form
+\[
+R(\vartheta) = \operatorname{diag}\left\{
+\frac{1}{\hat a(z;\alpha_i)\hat b(\alpha_i)\hat h(\alpha_i)\phi(\alpha_i)},\
+\frac{1}{\hat a(\beta_j)\hat b(z;\beta_j)\hat h(\beta_j)\phi(\beta_j)},\
+\frac{1}{\hat a(\xi_k)\hat b(\xi_k)\hat h(z;\xi_k)\phi(\xi_k)}
+\right\}
+\]
+where $\phi(z) = a(z)b(z)h(z)$ and $i = 1, \ldots, p$, $j = 1, \ldots, q$ and $k = 1, \ldots, \ell$. The $(p+r) \times (p+r)$
+matrices $V_{\alpha\beta\xi}$ and $\hat V_{\alpha\beta\xi}$ are of the form
+\[
+V_{\alpha\beta\xi} = \begin{pmatrix}
+1 & \alpha_1 & \alpha_1^2 & \cdots & \alpha_1^{p+r-1} \\
+\vdots & \vdots & \vdots & & \vdots \\
+1 & \alpha_p & \alpha_p^2 & \cdots & \alpha_p^{p+r-1} \\
+1 & \beta_1 & \beta_1^2 & \cdots & \beta_1^{p+r-1} \\
+\vdots & \vdots & \vdots & & \vdots \\
+1 & \beta_q & \beta_q^2 & \cdots & \beta_q^{p+r-1} \\
+1 & \xi_1 & \xi_1^2 & \cdots & \xi_1^{p+r-1} \\
+\vdots & \vdots & \vdots & & \vdots \\
+1 & \xi_\ell & \xi_\ell^2 & \cdots & \xi_\ell^{p+r-1}
+\end{pmatrix}^{\top},
+\qquad
+\hat V_{\alpha\beta\xi} = \begin{pmatrix}
+\alpha_1^{p+r-1} & \alpha_1^{p+r-2} & \cdots & \alpha_1 & 1 \\
+\vdots & \vdots & & \vdots & \vdots \\
+\alpha_p^{p+r-1} & \alpha_p^{p+r-2} & \cdots & \alpha_p & 1 \\
+\beta_1^{p+r-1} & \beta_1^{p+r-2} & \cdots & \beta_1 & 1 \\
+\vdots & \vdots & & \vdots & \vdots \\
+\beta_q^{p+r-1} & \beta_q^{p+r-2} & \cdots & \beta_q & 1 \\
+\xi_1^{p+r-1} & \xi_1^{p+r-2} & \cdots & \xi_1 & 1 \\
+\vdots & \vdots & & \vdots & \vdots \\
+\xi_\ell^{p+r-1} & \xi_\ell^{p+r-2} & \cdots & \xi_\ell & 1
+\end{pmatrix}
+\]
+
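+The $(p+r) \times (p+r)$ Vandermonde matrices $V_{\alpha\beta\xi}$ and $\hat V_{\alpha\beta\xi}$ are nonsingular when $\alpha_i \neq \alpha_j$, $\beta_k \neq \beta_h$,
+$\xi_m \neq \xi_n$, $\alpha_i \neq \beta_k$, $\alpha_i \neq \xi_m$, $\beta_k \neq \xi_m$ for all $i, j = 1, \ldots, p$, $k, h = 1, \ldots, q$ and $m, n = 1, \ldots, \ell$. The
+Vandermonde determinants $\operatorname{Det} V_{\alpha\beta\xi}$ and $\operatorname{Det} \hat V_{\alpha\beta\xi}$ are
+\[
+\operatorname{Det} V_{\alpha\beta\xi} = (-1)^{(p+r)(p+r-1)/2}\, \Psi(\alpha_i, \beta_k, \xi_m)
+\]
+where
+\[
+\Psi(\alpha_i, \beta_k, \xi_m) =
+\prod_{1\le i<j\le p}(\alpha_i - \alpha_j)
+\prod_{1\le k<h\le q}(\beta_k - \beta_h)
+\prod_{1\le m<n\le \ell}(\xi_m - \xi_n)
+\prod_{\substack{r=1,\ldots,p \\ s=1,\ldots,q}}(\alpha_r - \beta_s)
+\prod_{\substack{r=1,\ldots,p \\ w=1,\ldots,\ell}}(\alpha_r - \xi_w)
+\prod_{\substack{s=1,\ldots,q \\ w=1,\ldots,\ell}}(\beta_s - \xi_w)
+\]
+As for the Vandermonde matrices $V_{\alpha\beta}$ and $\hat V_{\alpha\beta}^{\top}$,
+\[
+\operatorname{Det} \hat V_{\alpha\beta\xi} = (-1)^{(p+r)(p+r-1)}\, \Psi(\alpha_i, \beta_k, \xi_m)
+\ \Rightarrow\ |\operatorname{Det} V_{\alpha\beta\xi}| = |\operatorname{Det} \hat V_{\alpha\beta\xi}|
+\]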
+
+Equation (70) is the ARMAX equivalent to (47). A combination of both equations generates a new representation
+of the FIM $G(\vartheta)$; this is set forth in the following lemma.
+
+3.7.2. Lemma 3.9
+
+Assume that the conditions (67) hold and consider the representations of ℘(ϑ) and $W(\vartheta)$ in (47) and (70)
+respectively. This leads to an alternative form to (65), given by
+\[
+G(\vartheta) =
+\begin{pmatrix} -\mathcal{S}_p(c) \\ \mathcal{S}_r(a) \\ O \end{pmatrix} V_{\alpha\beta\xi}\,R(\vartheta)\,\hat V_{\alpha\beta\xi}
+\begin{pmatrix} -\mathcal{S}_p(c) \\ \mathcal{S}_r(a) \\ O \end{pmatrix}^{\top}
++
+\begin{pmatrix} -\mathcal{S}_p(b) \\ O \\ \mathcal{S}_q(a) \end{pmatrix} V_{\alpha\beta}\,D(\vartheta)\,\hat V_{\alpha\beta}
+\begin{pmatrix} -\mathcal{S}_p(b) \\ O \\ \mathcal{S}_q(a) \end{pmatrix}^{\top}
+\]
+
+In Lemma 3.9, the FIM $G(\vartheta)$ is expressed by submatrices of two Sylvester matrices and various
+Vandermonde matrices; both types of matrices become singular if and only if the appropriate
+polynomials have at least one common root.
+
+3.8. The Fisher Information Matrix of a Vector ARMA(p, q) Process
+
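+The process (5) is summarized as
+\[
+A(z)\,y(t) = B(z)\,\epsilon(t)
+\]
+and we assume that $\{y(t), t \in \mathbb{N}\}$ is a zero mean Gaussian time series and $\{\epsilon(t), t \in \mathbb{N}\}$ is an $n$-dimensional
+vector random variable, such that
+$E_{\vartheta}\{\epsilon(t)\} = 0$ and $E_{\vartheta}\{\epsilon(t)\epsilon^{\top}(t)\} = \Sigma$, and the parameter vector $\vartheta$ is of the form (7). In [6] it is shown that
+representation (16) for the $n^2(p+q) \times n^2(p+q)$ asymptotic FIM of the VARMA process (6) is
+\[
+F(\vartheta) = E_{\vartheta}\left\{\left(\frac{\partial \epsilon}{\partial \vartheta^{\top}}\right)^{\top} \Sigma^{-1} \left(\frac{\partial \epsilon}{\partial \vartheta^{\top}}\right)\right\} \tag{71}
+\]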
+
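+where $\partial\epsilon/\partial\vartheta^{\top}$ is of size $n \times n^2(p+q)$ and for convenience $t$ is omitted from $\epsilon(t)$. Using the differential
+rules outlined in [6] yields
+\[
+\frac{\partial \epsilon}{\partial \vartheta^{\top}} =
+\left\{(A^{-1}(z)B(z)\epsilon)^{\top} \otimes B^{-1}(z)\right\}\frac{\partial \operatorname{vec} A(z)}{\partial \vartheta^{\top}}
+- \left(\epsilon^{\top} \otimes B^{-1}(z)\right)\frac{\partial \operatorname{vec} B(z)}{\partial \vartheta^{\top}} \tag{72}
+\]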
+
+The substitution of representation (72) of $\partial\epsilon/\partial\vartheta^{\top}$ in (71) yields the FIM of a VARMA process. The
+purpose is to construct a factorization of the FIM F(ϑ) that should be a multiple variant of the
+factorization (27), so that a multiple resultant matrix property can be proved for F(ϑ). As illustrated
+in [6], the multiple version of the Sylvester resultant matrix (22) does not fulfill the multiple resultant
+matrix property. In that case even when the matrix polynomials A(z) and B(z) have a common zero or
+a common eigenvalue, the multiple Sylvester matrix is not necessarily singular. This has also been
+illustrated in [3]. In order to consider a multiple equivalent to the resultant matrix $\mathcal{S}(-b, a)$, Gohberg
+and Lerer set forth the $n^2(p+q) \times n^2(p+q)$ tensor Sylvester matrix
+
+
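+\[
+\mathcal{S}^{\otimes}(-B, A) :=
+\begin{pmatrix}
+(-I_n)\otimes I_n & (-B_1)\otimes I_n & \cdots & (-B_q)\otimes I_n & O_{n^2\times n^2} & \cdots & O_{n^2\times n^2} \\
+O_{n^2\times n^2} & \ddots & \ddots & & \ddots & \ddots & \vdots \\
+\vdots & \ddots & \ddots & \ddots & & \ddots & O_{n^2\times n^2} \\
+O_{n^2\times n^2} & \cdots & O_{n^2\times n^2} & (-I_n)\otimes I_n & (-B_1)\otimes I_n & \cdots & (-B_q)\otimes I_n \\
+I_n\otimes I_n & I_n\otimes A_1 & \cdots & I_n\otimes A_p & O_{n^2\times n^2} & \cdots & O_{n^2\times n^2} \\
+O_{n^2\times n^2} & \ddots & \ddots & & \ddots & \ddots & \vdots \\
+\vdots & \ddots & \ddots & \ddots & & \ddots & O_{n^2\times n^2} \\
+O_{n^2\times n^2} & \cdots & O_{n^2\times n^2} & I_n\otimes I_n & I_n\otimes A_1 & \cdots & I_n\otimes A_p
+\end{pmatrix} \tag{73}
+\]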
+
+In [3], the authors prove that the tensor Sylvester matrix $\mathcal{S}^{\otimes}(-B, A)$ fulfills the multiple resultant
+property: it becomes singular if and only if the appropriate matrix polynomials A(z) and B(z) have at
+least one common zero. In Proposition 2.2 in [6], the following factorized form of the Fisher information
+F(ϑ) is developed
+\[
+F(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \Phi(z)\,\Theta(z)\,\Phi^{*}(z)\,\frac{dz}{z} \tag{74}
+\]
+where
+\[
+\Phi(z) = \begin{pmatrix}
+I_p \otimes A^{-1}(z) \otimes I_n & O_{pn^2\times qn^2} \\
+O_{qn^2\times pn^2} & I_q \otimes I_n \otimes A^{-1}(z)
+\end{pmatrix}
+\mathcal{S}^{\otimes}(-B, A)\,\left(u_{p+q}(z) \otimes I_{n^2}\right)
+\]
+and
+\[
+\Theta(z) = \Sigma \otimes \sigma(z), \qquad \sigma(z) = B^{-\top}(z)\,\Sigma^{-1}\,B^{-1}(z^{-1}) \tag{75}
+\]
+
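+In order to obtain a multiple variant of (27), the following matrix is introduced,
+\[
+M(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \Lambda(z)\,J(z)\,\Lambda^{*}(z)\,\frac{dz}{z}
+= \mathcal{S}^{\otimes}(-B, A)\,P(\vartheta)\,\left(\mathcal{S}^{\otimes}(-B, A)\right)^{\top} \tag{76}
+\]
+where
+\[
+J(z) = \Phi(z)\,\Theta(z)\,\Phi^{*}(z)
+\quad\text{and}\quad
+\Lambda(z) = \begin{pmatrix}
+I_p \otimes A(z) \otimes I_n & O_{pn^2\times qn^2} \\
+O_{qn^2\times pn^2} & I_q \otimes I_n \otimes A(z)
+\end{pmatrix}
+\]
+and the matrix P(ϑ) is a multiple variant of the matrix ℘(ϑ) in (28); it is of the form
+\[
+P(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \left(u_{p+q}(z) \otimes I_{n^2}\right)\Theta(z)\left(u_{p+q}(z) \otimes I_{n^2}\right)^{*}\frac{dz}{z} \tag{77}
+\]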
+
+In Lemma 2.3 in [6], it is proved that the matrix M(ϑ) in (76) becomes singular if and only if the matrix
+polynomials A(z) and B(z) have at least one common eigenvalue-zero. The proof is a multiple equivalent
+of the proof of Corollary 2.2 in [5], since the equality (76) is a multiple version of (27). Consequently,
+
+the matrix M(ϑ) like the tensor Sylvester matrix S⊗ (−B,A), fulfills the multiple resultant matrix
+property. Since the matrix M(ϑ) is derived from the FIM F(ϑ), this enables us to prove that the matrix
+F(ϑ) fulfills the multiple resultant matrix property by showing that it becomes singular if and only if
+the matrix M(ϑ) is singular, this is done in Proposition 2.4 in [6]. Consequently, it can be concluded
+
+from [6] that the FIM of a VARMA process F(ϑ) and the tensor Sylvester matrix S⊗ (−B,A) have the
+same singularity conditions. The FIM of a VARMA process F(ϑ) can therefore be added to the class of
+multiple resultant matrices.
+
+
+A brief summary of the contribution of [6] follows, in order to show that the FIM of a VARMA
+process F(ϑ) is a multiple resultant matrix two new representations of the FIM are derived. To
+construct such representations appropriate matrix differential rules are applied. The newly obtained
+representations are expressed in terms of the multiple Sylvester matrix and the tensor Sylvester matrix.
+The representation of the FIM expressed by the tensor Sylvester matrix is used to prove that the FIM
+becomes singular if and only if the autoregressive and moving average matrix polynomials have
+at least one common eigenvalue. It then follows that the FIM and the tensor Sylvester matrix have
+equivalent singularity conditions. In a numerical example it is shown, however, that the FIM fails to
+detect common eigenvalues due to some kind of numerical instability. The tensor Sylvester matrix
+reveals it clearly, proving the usefulness of the results derived in this paper.
+
+3.9. The Fisher Information Matrix of a Vector ARMAX(p, r, q) Process
+
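+The $n^2(p+q+r) \times n^2(p+q+r)$ asymptotic FIM of the VARMAX(p, r, q) process (2)
+\[
+A(z)\,y(t) = C(z)\,x(t) + B(z)\,\epsilon(t)
+\]
+is displayed according to [23] and is an extension of the FIM of the VARMA(p, q) process (6).
+Representation (16) of the FIM of the VARMAX(p, r, q) process is then
+\[
+G(\vartheta) = E_{\vartheta}\left\{\left(\frac{\partial \epsilon}{\partial \vartheta^{\top}}\right)^{\top} \Sigma^{-1} \left(\frac{\partial \epsilon}{\partial \vartheta^{\top}}\right)\right\}
+\]
+where
+\[
+\begin{aligned}
+\frac{\partial \epsilon}{\partial \vartheta^{\top}} ={}&
+\left\{(A^{-1}(z)C(z)x)^{\top} \otimes B^{-1}(z)\right\}\frac{\partial \operatorname{vec} A(z)}{\partial \vartheta^{\top}}
++ \left\{(A^{-1}(z)B(z)\epsilon)^{\top} \otimes B^{-1}(z)\right\}\frac{\partial \operatorname{vec} A(z)}{\partial \vartheta^{\top}} \\
+&- \left\{x^{\top} \otimes B^{-1}(z)\right\}\frac{\partial \operatorname{vec} C(z)}{\partial \vartheta^{\top}}
+- \left(\epsilon^{\top} \otimes B^{-1}(z)\right)\frac{\partial \operatorname{vec} B(z)}{\partial \vartheta^{\top}}
+\end{aligned} \tag{78}
+\]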
+
+To obtain the term $\partial\epsilon/\partial\vartheta^{\top}$, of size $n \times n^2(p+q+r)$, the same differential rules are applied as for the
+VARMA(p, q) process. In Proposition 2.3 in [23], the representation of the FIM of a VARMAX process
+is expressed in terms of tensor Sylvester matrices; this is obtained when $\partial\epsilon/\partial\vartheta^{\top}$ in (78) is substituted
+in (16), to obtain
+\[
+G(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \Phi_x(z)\,\Theta(z)\,\Phi_x^{*}(z)\,\frac{dz}{z}
++ \frac{1}{2\pi i}\oint_{|z|=1} \Lambda_x(z)\,\Psi(z)\,\Lambda_x^{*}(z)\,\frac{dz}{z} \tag{79}
+\]
+
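+The matrices in (79) are of the form
+\[
+\Phi_x(z) = \begin{pmatrix}
+I_p \otimes A^{-1}(z) \otimes I_n & O_{pn^2\times rn^2} & O_{pn^2\times qn^2} \\
+O_{rn^2\times pn^2} & O_{rn^2\times rn^2} & O_{rn^2\times qn^2} \\
+O_{qn^2\times pn^2} & O_{qn^2\times rn^2} & I_q \otimes I_n \otimes A^{-1}(z)
+\end{pmatrix}
+\begin{pmatrix} -\mathcal{S}_p^{\otimes}(B) \\ O_{rn^2\times n^2(p+q)} \\ \mathcal{S}_q^{\otimes}(A) \end{pmatrix}
+\left(u_{p+q}(z) \otimes I_{n^2}\right)
+\]
+\[
+\Lambda_x(z) = \begin{pmatrix}
+I_p \otimes A^{-1}(z) \otimes I_n & O_{pn^2\times rn^2} & O_{pn^2\times qn^2} \\
+O_{rn^2\times pn^2} & I_r \otimes I_n \otimes A^{-1}(z) & O_{rn^2\times qn^2} \\
+O_{qn^2\times pn^2} & O_{qn^2\times rn^2} & O_{qn^2\times qn^2}
+\end{pmatrix}
+\begin{pmatrix} -\mathcal{S}_p^{\otimes}(C) \\ \mathcal{S}_r^{\otimes}(A) \\ O_{qn^2\times n^2(p+r)} \end{pmatrix}
+\left(u_{p+r}(z) \otimes I_{n^2}\right)
+\]
+\[
+\mathcal{S}_{p,q}^{\otimes}(-B, A) = \begin{pmatrix} -\mathcal{S}_p^{\otimes}(B) \\ \mathcal{S}_q^{\otimes}(A) \end{pmatrix},
+\qquad
+\mathcal{S}_{p,r}^{\otimes}(-C, A) = \begin{pmatrix} -\mathcal{S}_p^{\otimes}(C) \\ \mathcal{S}_r^{\otimes}(A) \end{pmatrix} \tag{80}
+\]
+Additionally we have $\Psi(z) = R_x(z) \otimes \sigma(z)$, and the Hermitian spectral density matrix $R_x(z)$ is defined
+in (10), whereas the matrix polynomials $\Theta(z)$ and $\sigma(z)$ are presented in (75). In (80), we have the
+$pn^2 \times (p+q)n^2$ and $qn^2 \times (p+q)n^2$ submatrices $\mathcal{S}_p^{\otimes}(-B)$ and $\mathcal{S}_q^{\otimes}(A)$ of the tensor Sylvester resultant
+matrix $\mathcal{S}_{p,q}^{\otimes}(-B, A)$. The matrices $\mathcal{S}_p^{\otimes}(-C)$ and $\mathcal{S}_r^{\otimes}(A)$ are the upper and lower blocks of the
+$(p+r)n^2 \times (p+r)n^2$ tensor Sylvester resultant matrix $\mathcal{S}_{p,r}^{\otimes}(-C, A)$. As for the FIM of the VARMA(p, q)
+process, the objective is to construct a multiple version of (65); this is done in [23], to obtain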
+
+
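+\[
+\begin{aligned}
+M_x(\vartheta) &= \frac{1}{2\pi i}\oint_{|z|=1} \mathcal{L}(z)\,\mathcal{A}(z)\,\mathcal{L}^{*}(z)\,\frac{dz}{z}
++ \frac{1}{2\pi i}\oint_{|z|=1} \mathcal{W}(z)\,\mathcal{B}(z)\,\mathcal{W}^{*}(z)\,\frac{dz}{z} \\
+&= \begin{pmatrix} -\mathcal{S}_p^{\otimes}(B) \\ O_{rn^2\times n^2(p+q)} \\ \mathcal{S}_q^{\otimes}(A) \end{pmatrix} P(\vartheta)
+\begin{pmatrix} -\mathcal{S}_p^{\otimes}(B) \\ O_{rn^2\times n^2(p+q)} \\ \mathcal{S}_q^{\otimes}(A) \end{pmatrix}^{\top}
++ \begin{pmatrix} -\mathcal{S}_p^{\otimes}(C) \\ \mathcal{S}_r^{\otimes}(A) \\ O_{qn^2\times n^2(p+r)} \end{pmatrix} T(\vartheta)
+\begin{pmatrix} -\mathcal{S}_p^{\otimes}(C) \\ \mathcal{S}_r^{\otimes}(A) \\ O_{qn^2\times n^2(p+r)} \end{pmatrix}^{\top}
+\end{aligned} \tag{81}
+\]
+The matrices involved are of the form
+\[
+\mathcal{L}(z) = \begin{pmatrix}
+I_p \otimes A(z) \otimes I_n & O_{pn^2\times rn^2} & O_{pn^2\times qn^2} \\
+O_{rn^2\times pn^2} & O_{rn^2\times rn^2} & O_{rn^2\times qn^2} \\
+O_{qn^2\times pn^2} & O_{qn^2\times rn^2} & I_q \otimes I_n \otimes A(z)
+\end{pmatrix}
+\quad\text{and}\quad \mathcal{A}(z) := \Phi_x(z)\,\Theta(z)\,\Phi_x^{*}(z)
+\]
+\[
+\mathcal{W}(z) = \begin{pmatrix}
+I_p \otimes A(z) \otimes I_n & O_{pn^2\times rn^2} & O_{pn^2\times qn^2} \\
+O_{rn^2\times pn^2} & I_r \otimes I_n \otimes A(z) & O_{rn^2\times qn^2} \\
+O_{qn^2\times pn^2} & O_{qn^2\times rn^2} & O_{qn^2\times qn^2}
+\end{pmatrix}
+\quad\text{and}\quad \mathcal{B}(z) := \Lambda_x(z)\,\Psi(z)\,\Lambda_x^{*}(z)
+\]
+\[
+T(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \left(u_{p+r}(z) \otimes I_{n^2}\right)\Psi(z)\left(u_{p+r}(z) \otimes I_{n^2}\right)^{*}\frac{dz}{z}
+\]
+and P(ϑ) is given in (77). Note that the matrices $\Phi_x(z)$, $\Lambda_x(z)$, $\mathcal{L}(z)$ and $\mathcal{W}(z)$ are the corrected versions of
+the corresponding matrices in [23].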
+A parallel between the scalar and multiple structures is straightforward. This is best illustrated
+by comparing the representations (27) and (28) with (76) and (77) respectively, confronting the FIM
+for scalar and vector ARMA(p, q) processes. The FIM of the scalar ARMAX(p, r, q) process contains
+an ARMA(p, q) part, this is confirmed by (65), through the presence of the matrix ℘(ϑ) which is
+originally displayed in (28). The multiple resultant matrices M(ϑ) and Mx(ϑ) derived from the FIM
+of the VARMA(p, q) and VARMAX(p, r, q) processes respectively both contain P(ϑ), whereas the first
+matrix term of the matrices Φ(z) and Φx(z), which are of different size, consist of the same nonzero
+submatrices. To summarize, in [23] compact forms of the FIM of a VARMAX process expressed in
+terms of multiple and tensor Sylvester matrices are developed. The tensor Sylvester matrices allow
+us to investigate the multiple resultant matrix property of the FIM of VARMAX(p, r, q) processes.
+However, no proof of the multiple resultant matrix property of the FIM G(ϑ) has been given yet, which
+justifies the consideration of a conjecture. The conjecture states that the FIM G(ϑ) of a VARMAX(p, r, q)
+process becomes singular if and only if the matrix polynomials A(z), B(z) and C(z) have at least one
+common eigenvalue. A multiple equivalent to Theorem 3.1 in [4], combined with Proposition 2.4
+in [6] but based on the representations (79) and (81), can be envisaged to formulate a proof, which will
+be a subject for future study.
+
+4. Conclusions
+
+In this survey paper, matrix algebraic properties of the FIM of stationary processes are discussed.
+The presented material is a summary of papers where several matrix structural aspects of the FIM
+are investigated. The FIM of scalar and multiple processes like the (V)ARMA(X) are set forth with
+appropriate factorized forms involving (tensor) Sylvester matrices. These representations enable us to
+prove the resultant matrix property of the corresponding FIM. This has been done for (V)ARMA(p,
+q) and ARMAX(p, r, q) processes in the papers [4,6]. The development of the stages that lead to the
+appropriate factorized form of the FIM G(ϑ) (79) is set forth in [23]. However, there is no proof done
+yet that confirms the multiple resultant matrix property of the FIM G(ϑ) of a VARMAX(p, r, q) process.
+This justifies the consideration of a conjecture which is formulated in the former section, this can be a
+subject for future study.
+The statistical distance measure derived in [7] involves entries of the FIM. This distance measure
+can be a challenge to its quantum information counterpart (41), because (36) involves information
+about m parameters estimated from n measurements, whereas in quantum information, as in
+e.g., [8,10], the information about one parameter in a particular measurement procedure is considered
+for establishing an interconnection with the appropriate statistical distance measure. A possible
+approach, by combining matrix algebra and quantum information, for developing a statistical distance
+measure in quantum information or quantum statistics but at the matrix level, can be a subject of
+future research. Some results concerning interconnections between the FIM of ARMA(X) models
+and appropriate solutions to Stein matrix equations are discussed; the material is extracted from the
+papers [12] and [13]. However, in this paper, some alternative and new proofs that emphasize the
+conditions under which the FIM fulfills appropriate Stein equations are set forth. The presence of
+various types of Vandermonde matrices is also emphasized when an explicit expansion of the FIM is
+computed. These Vandermonde matrices are inserted in interconnections with appropriate solutions to
+Stein equations. This explains why, when the matrix algebraic structures of the FIM of stationary processes
+are investigated, the involvement of structured matrices like the (tensor) Sylvester, Bezoutian and
+Vandermonde matrices is essential.
+
+Acknowledgments: The author thanks a perceptive reviewer for his comments which significantly improved the
+quality and presentation of the paper.
+
+Conflicts of Interest: The authors have declared no conflict of interest.
+
+References
+
+1.
+Dym, H. Linear Algebra in Action; American Mathematical Society: Providence, RI, USA, 2006; Volume 78.
+2.
+Lancaster, P.; Tismenetsky, M. The Theory of Matrices with Applications, 2nd ed; Academic Press: Orlando, FL,
+USA, 1985.
+3.
+Gohberg, I.; Lerer, L. Resultants of matrix polynomials. Bull. Am. Math. Soc 1976, 82, 565–567.
+4.
+Klein, A.; Spreij, P. On Fisher’s information matrix of an ARMAX process and Sylvester’s resultant matrices.
+Linear Algebra Appl 1996, 237/238, 579–590.
+5.
+Klein, A.; Spreij, P. On Fisher’s information matrix of an ARMA process. In Stochastic Differential and Difference
+Equations; Csiszar, I., Michaletzky, Gy., Eds.; Birkhäuser: Boston, MA, USA, 1997; Progress in Systems and
+Control Theory; Volume 23, pp. 273–284.
+6.
+Klein, A.; Mélard, G.; Spreij, P. On the Resultant Property of the Fisher Information Matrix of a Vector ARMA
+process. Linear Algebra Appl 2005, 403, 291–313.
+7.
+Klein, A.; Spreij, P. Transformed Statistical Distance Measures and the Fisher Information Matrix.
+Linear Algebra Appl 2012, 437, 692–712.
+8.
+Braunstein, S.L.; Caves, C.M. Statistical Distance and the Geometry of Quantum States. Phys. Rev. Lett 1994,
+72, 3439–3443.
+9.
+Jones, P.J.; Kok, P. Geometric derivation of the quantum speed limit. Phys. Rev. A 2010, 82, 022107.
+10.
+Kok, P. Tutorial: Statistical distance and Fisher information; Oxford: UK, 2006.
+11.
+Lancaster, P.; Rodman, L. Algebraic Riccati Equations; Clarendon Press: Oxford, UK, 1995.
+12.
+Klein, A.; Spreij, P. On Stein’s equation, Vandermonde matrices and Fisher’s information matrix of time
+series processes. Part I: The autoregressive moving average process. Linear Algebra Appl 2001, 329, 9–47.
+13.
+Klein, A.; Spreij, P. On the solution of Stein’s equation and Fisher’s information matrix of an ARMAX process.
+Linear Algebra Appl 2005, 396, 1–34.
+14.
+Grenander, U.; Szegő, G.P. Toeplitz Forms and Their Applications; University of California Press: New York, NY,
+USA, 1958.
+15.
+Brockwell, P.J.; Davis, R.A. Time Series: Theory and Methods, 2nd ed; Springer Verlag: Berlin, Germany;
+New York, NY, USA, 1991.
+16.
+Caines, P. Linear Stochastic Systems; John Wiley and Sons: New York, NY, USA, 1988.
+17.
+Ljung, L.; Söderström, T. Theory and Practice of Recursive Identification; M.I.T. Press: Cambridge, MA, USA, 1983.
+18.
+Hannan, E.J.; Deistler, M. The Statistical Theory of Linear Systems; John Wiley and Sons: New York, NY, USA, 1988.
+19.
+Hannan, E.J.; Dunsmuir, W.T.M.; Deistler, M. Estimation of vector Armax models. J. Multivar. Anal 1980, 10,
+275–295.
+20.
+Horn, R.A.; Johnson, C.R. Topics in Matrix Analysis; Cambridge University Press: New York, NY, USA, 1995.
+21.
+Klein, A.; Spreij, P. Matrix differential calculus applied to multiple stationary time series and an extended
+Whittle formula for information matrices. Linear Algebra Appl 2009, 430, 674–691.
+22.
+Klein, A.; Mélard, G. An algorithm for the exact Fisher information matrix of vector ARMAX time series.
+Linear Algebra Its Appl 2014, 446, 1–24.
+23.
+Klein, A.; Spreij, P. Tensor Sylvester matrices and the Fisher information matrix of VARMAX processes.
+Linear Algebra Appl 2010, 432, 1975–1989.
+24.
+Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta
+Math. Soc 1945, 37, 81–91.
+25.
+Ibragimov, I.A.; Has’minskiĭ, R.Z. Statistical Estimation: Asymptotic Theory; Springer-Verlag: New York,
+NY, USA, 1981.
+26.
+Lehmann, E.L. Theory of Point Estimation; Wiley: New York, NY, USA, 1983.
+27.
+Friedlander, B. On the computation of the Cramér-Rao bound for ARMA parameter estimation. IEEE Trans.
+Acoust. Speech Signal Process 1984, 32, 721–727.
+28.
+Holevo, A.S. Probabilistic and Statistical Aspects of Quantum Theory, 2nd ed; Edizioni Della Normale, SNS Pisa:
+Pisa, Italy, 2011.
+29.
+Petz, D. Quantum Information Theory and Quantum Statistics; Springer-Verlag: Berlin Heidelberg, Germany,
+2008.
+30.
+Barndorff-Nielsen, O.E.; Gill, R.D. Fisher Information in quantum statistics. J. Phys. A 2000, 30, 4481–4490.
+31.
+Luo, S. Wigner-Yanase skew information vs. quantum Fisher information. Proc. Amer. Math. Soc 2004, 132,
+885–890.
+32.
+Klein, A.; Mélard, G. On algorithms for computing the covariance matrix of estimates in autoregressive
+moving average processes. Comput. Stat. Q 1989, 5, 1–9.
+33.
+Klein, A.; Mélard, G. An algorithm for computing the asymptotic Fisher information matrix for seasonal
+SISO models. J. Time Ser. Anal 2004, 25, 627–648.
+34.
+Bistritz, Y.; Lifshitz, A. Bounds for resultants of univariate and bivariate polynomials. Linear Algebra Appl
+2010, 432, 1995–2005.
+35.
+Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: New York, NY, USA, 1996.
+36.
+Golub, G.H.; van Loan, C.F. Matrix Computations, 3rd ed; Johns Hopkins University Press: Baltimore, MD, USA, 1996.
+37.
+Kullback, S. Information Theory and Statistics; John Wiley and Sons: New York, NY, USA, 1959.
+38.
+Klein, A.; Spreij, P. The Bezoutian, state space realizations and Fisher’s information matrix of an ARMA
+process. Linear Algebra Appl 2006, 416, 160–174.
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+Article
+Asymptotically Constant-Risk Predictive Densities
+When the Distributions of Data and Target Variables
+Are Different
+
+Keisuke Yano 1,* and Fumiyasu Komaki 1,2
+
+1 Department of Mathematical Informatics, Graduate School of Information Science and Technology,
+The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan; E-Mail:
+komaki@mist.i.u-tokyo.ac.jp
+2 RIKEN Brain Science Institute, 2-1 Hirosawa, Wako City, Saitama 351-0198, Japan
+*
+E-Mail: Keisuke_Yano@mist.i.u-tokyo.ac.jp; Tel.: +81-3-5841-6909.
+
+Received: 28 March 2014; in revised form: 9 May 2014 / Accepted: 22 May 2014 /
+Published: 28 May 2014
+
+Abstract: We investigate the asymptotic construction of constant-risk Bayesian predictive densities
+under the Kullback–Leibler risk when the distributions of data and target variables are different and
+have a common unknown parameter. It is known that the Kullback–Leibler risk is asymptotically
+equal to a trace of the product of two matrices: the inverse of the Fisher information matrix for the
+data and the Fisher information matrix for the target variables. We assume that the trace has a unique
+maximum point with respect to the parameter. We construct asymptotically constant-risk Bayesian
+predictive densities using a prior depending on the sample size. Further, we apply the theory to the
+subminimax estimator problem and the prediction based on the binary regression model.
+
+Keywords:
+Bayesian prediction; Fisher information; Kullback–Leibler divergence; minimax;
+predictive metric; subminimax estimator
+
+1. Introduction
+
+Let x(N) = (x1, · · · , xN) be N independent observations distributed according to a probability density,
+p(x|θ), that belongs to a d-dimensional parametric model, {p(x|θ) : θ ∈ Θ}, where θ = (θ1, · · · , θd)
+is an unknown d-dimensional parameter and Θ is the parameter space. Let y be a target variable
+distributed according to a probability density, q(y|θ), that belongs to a d-dimensional parametric
+model, {q(y|θ) : θ ∈ Θ} with the same parameter, θ. Here, we assume that the distributions of the
+data and the target variables, p(x|θ) and q(y|θ), are different. For simplicity, we assume that the data
+and the target variables are independent given θ.
+We construct predictive densities for target variables based on the data. We measure
+the performance of the predictive density, ˆq(y; x(N)), by the Kullback–Leibler divergence,
+D(q(·|θ), ˆq(·; x(N))), from the true density, q(y|θ), to the predictive density, ˆq(y; x(N)):
+
+\[
+D(q(\cdot|\theta), \hat{q}(\cdot; x^{(N)})) = \int q(y|\theta) \log \frac{q(y|\theta)}{\hat{q}(y; x^{(N)})} \, dy.
+\]
+
+Then, the risk function, R(θ, ˆq(y; x(N))), of the predictive density, ˆq(y; x(N)), is given by:
+
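+\[
+\begin{aligned}
+R(\theta, \hat{q}(y; x^{(N)})) &= \int p(x^{(N)}|\theta) \, D(q(\cdot|\theta), \hat{q}(\cdot; x^{(N)})) \, dx^{(N)} \\
+&= \int p(x^{(N)}|\theta) \int q(y|\theta) \log \frac{q(y|\theta)}{\hat{q}(y; x^{(N)})} \, dy \, dx^{(N)}.
+\end{aligned}
+\]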
+
+Entropy 2014, 16, 3026–3048; doi:10.3390/e16063026
+www.mdpi.com/journal/entropy
+
+For the construction of predictive densities, we consider the Bayesian predictive density defined
+by:
+
+\[
+\hat{q}_\pi(y|x^{(N)}) = \frac{\int q(y|\theta) \, p(x^{(N)}|\theta) \, \pi(\theta; N) \, d\theta}{\int p(x^{(N)}|\theta) \, \pi(\theta; N) \, d\theta},
+\]
+
+where π(θ; N) is a prior density for θ, possibly depending on the sample size, N. Aitchison [1]
+showed that, for a given prior density, π(θ; N), the Bayesian predictive density, ˆqπ(y|x(N)), is a Bayes
+solution under the Kullback–Leibler risk. Based on the asymptotics as the sample size goes to infinity,
+Komaki [2] and Hartigan [3] showed its superiority over any plug-in predictive density, q(y| ˆθ), with
+any estimator, ˆθ. However, there remains a problem of prior selection for constructing better Bayesian
+predictive densities. Thus, a prior, π(θ; N), must be chosen based on an optimality criterion for actual
+applications.
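+
+As a small numerical illustration of the risk and the Bayesian predictive density defined above, the following
+sketch treats the simplest case, which is an assumption made here for illustration and not the setting studied in
+this paper: the data and the target are both Bernoulli(θ), and the prior is a Beta(a, b) density independent of N.
+The function names are hypothetical. In this case, the predictive probability of y = 1 has the closed form
+(k + a)/(N + a + b), where k is the number of successes, and the risk can be summed exactly over k.
+
```python
import math


def log_binom_pmf(k, n, theta):
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(theta) + (n - k) * math.log(1.0 - theta))


def bayes_predictive_success(k, n, a=0.5, b=0.5):
    """Posterior predictive probability of y = 1 under a Beta(a, b) prior."""
    return (k + a) / (n + a + b)


def kl_bernoulli(p, q):
    """Kullback-Leibler divergence from Bernoulli(p) to Bernoulli(q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_risk(theta, n, a=0.5, b=0.5):
    """Exact risk: the KL divergence averaged over the binomial count k."""
    return sum(
        math.exp(log_binom_pmf(k, n, theta))
        * kl_bernoulli(theta, bayes_predictive_success(k, n, a, b))
        for k in range(n + 1)
    )


if __name__ == "__main__":
    # With identical data and target models the trace equals d = 1,
    # so N times the risk is expected to be close to 1/2 for large N.
    for n in (50, 200, 800):
        print(n, n * kl_risk(0.3, n))
```
+
+When the two distributions differ, the closed-form predictive probability is generally lost and the integral over
+θ has to be evaluated numerically; the sketch above does not cover that case.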
+Among various criteria, we focus on a criterion of constructing minimax predictive densities
+under the Kullback–Leibler risk. For simplicity, we refer to the priors generating minimax predictive
+densities as minimax priors. Minimax priors have been previously studied in various predictive
+settings; see [4–8]. When the simultaneous distributions of the target variables and the data belong to
+the submodel of the multinomial distributions, Komaki [7] shows that minimax priors are given as
+latent information priors maximizing the conditional mutual information between target variables and
+the parameter given the data. However, the explicit forms of latent information priors are difficult to
+obtain, and we need asymptotic methods, because they require the maximization on the space of the
+probability measures on Θ.
+Except for [7], these studies on minimax priors are based on the assumption that the distributions,
+p(x|θ) and q(y|θ), are identical. Let us consider the prediction based on the logistic regression model
+where the covariates of the data and the target variables are not identical. In this predictive setting, the
+assumption that the distributions, p(x|θ) and q(y|θ), are identical is no longer valid.
+We focus on the minimax priors in predictions where the distributions, p(x|θ) and q(y|θ), are
+different and have a common unknown parameter. Such a predictive setting has traditionally been
+considered in statistical prediction and experiment design. It has recently been studied in statistical
+learning theory; for example, see [9]. Predictive densities where the distributions, p(x|θ) and q(y|θ),
+are different and have a common unknown parameter are studied by [10–13].
+Let $g^X_{ij}(\theta)$ be the (i, j)-component of the Fisher information matrix of the distribution, p(x|θ), and
+let $g^Y_{ij}(\theta)$ be the (i, j)-component of the Fisher information matrix of the distribution, q(y|θ). Let $g^{X,ij}(\theta)$
+and $g^{Y,ij}(\theta)$ denote the (i, j)-components of their inverse matrices. We adopt Einstein’s summation
+convention: if the same index appears twice in any one term, it implies summation over that index
+from one to d. For the asymptotics below, we assume that the prior densities, π(θ; N), are smooth.
+On the asymptotics as the sample size N goes to infinity, we construct the asymptotically
+constant-risk prior, π(θ; N), in the sense that the asymptotic risk:
+
+\[
+R(\theta, \hat{q}_\pi(y|x^{(N)})) = \frac{1}{N} R_1(\theta, \hat{q}_\pi(y|x^{(N)})) + \frac{1}{N\sqrt{N}} R_2(\theta, \hat{q}_\pi(y|x^{(N)})) + O(N^{-2})
+\]
+
+is constant up to O(N−2). Since the proper prior with the constant risk is a minimax prior for any finite
+sample size, the asymptotically constant-risk prior relates to the minimax prior; in Section 4, we verify
+that the asymptotically constant-risk prior agrees with the exact minimax prior in binomial examples.
+When we use the prior, π(θ), independent of the sample size, N, it is known that the $N^{-1}$-order
+term, R1(θ, ˆqπ(y|x(N))), of the Kullback–Leibler risk is equal to the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$. If the trace
+does not depend on the parameter, θ, the construction of the asymptotically constant-risk prior is
+parallel to [6]; see also [13].
+However, we consider the settings where there exists a unique maximum point of the trace,
+$g^{X,ij}(\theta) g^Y_{ij}(\theta)$; for example, these settings appear in predictions based on the binary regression model,
+where the covariates of the data and the target variables are not identical. In the settings, there do
+not exist asymptotically constant-risk priors among the priors independent of the sample size, N.
+The reason is as follows: we consider the prior, π(θ), independent of the sample size, N. Then, the
+Kullback–Leibler risk of the Bayesian predictive density is expanded as:
+
+\[
+R(\theta, \hat{q}_\pi(y|x^{(N)})) = \frac{1}{2N} g^Y_{ij}(\theta) g^{X,ij}(\theta) + O(N^{-2}).
+\]
+
+Since, in our settings, the first-order term, $g^Y_{ij}(\theta) g^{X,ij}(\theta)$, is not constant, the prior independent of the
+sample size, N, is not an asymptotically constant-risk prior.
+When there exists a unique maximum point of the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$, we construct the
+asymptotically constant-risk prior, π(θ; N), up to $O(N^{-2})$, by making the prior dependent on the
+sample size, N, as:
+
+\[
+\frac{\pi(\theta; N)}{|g_X(\theta)|^{1/2}} \propto \{f(\theta)\}^{\sqrt{N}} h(\theta),
+\]
+
+where f (θ) and h(θ) are the scalar functions of θ independent of N and |gX(θ)| denotes the determinant
+of the Fisher information matrix, gX(θ).
+The key idea is that, if a particular parameter point carries a larger risk than the other parameter
+points, then more prior weight should be concentrated on that point.
+Further, we clarify the subminimax estimator problem based on the mean squared error from
+the viewpoint of the prediction where the distributions of data and target variables are different and
+have a common unknown parameter. We obtain the improvement achieved by the minimax estimator
+over the subminimax estimators up to O(N−2). The subminimax estimator problem [14,15] is the
+problem that, at first glance, there seem to exist estimators that asymptotically dominate the minimax
+estimator. However, the relationship between such subminimax estimator problems and predictions
+has not been investigated; further, in general, the improvement by the minimax estimator over
+the subminimax estimators has not been investigated.
+
+2. Information Geometrical Notations
+
+In this section, we prepare the information geometrical notations; see [16] for details. We
+abbreviate ∂/∂θi to ∂i, where the indices, i, j, k, . . ., run from one to d. Similarly, we abbreviate ∂2/∂θi∂θj,
+∂3/∂θi∂θj∂θk and ∂4/∂θi∂θj∂θk∂θl to ∂ij, ∂ijk and ∂ijkl, respectively. We denote the expectations of the
+random variables, X, Y and X(N), by EX[·], EY[·] and EX(N)[·], respectively. We denote their probability
+densities by p(x|θ), q(y|θ) and p(x(N)|θ), respectively.
+We define the predictive metric proposed by Komaki [13] as:
+
+\[
+\mathring{g}_{ij}(\theta) = g^X_{ik}(\theta) \, g^{Y,kl}(\theta) \, g^X_{lj}(\theta).
+\]
+
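+When the parameter is one-dimensional, $g_{\theta\theta}(\theta)$ denotes the Fisher information and $g^{\theta\theta}(\theta)$ denotes its
+inverse. Let $\overset{e}{\Gamma}{}^X_{ij,k}(\theta)$ and $\overset{m}{\Gamma}{}^X_{ij,k}(\theta)$ be the quantities given by:
+
+\[
+\overset{e}{\Gamma}{}^X_{ij,k}(\theta) := E_X[\partial_{ij} \log p(x|\theta) \, \partial_k \log p(x|\theta)]
+\]
+
+and:
+
+\[
+\overset{m}{\Gamma}{}^X_{ij,k}(\theta) := \int \frac{1}{p(x|\theta)} [\partial_{ij} p(x|\theta) \, \partial_k p(x|\theta)] \, dx.
+\]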
+
+
+Using these quantities, the e-connection and m-connection coefficients with respect to the parameter, θ,
+for the model, {p(x|θ) : θ ∈ Θ}, are given by:
+
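+\[
+\overset{e}{\Gamma}{}^{X,k}_{ij}(\theta) := g^{X,lk}(\theta) \, \overset{e}{\Gamma}{}^X_{ij,l}(\theta)
+\]
+
+and:
+
+\[
+\overset{m}{\Gamma}{}^{X,k}_{ij}(\theta) := g^{X,kl}(\theta) \, \overset{m}{\Gamma}{}^X_{ij,l}(\theta),
+\]
+
+respectively.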
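+The (0, 3)-tensor, $T^X_{ijk}(\theta)$, is defined by:
+
+\[
+T^X_{ijk}(\theta) := E_X[\partial_i \log p(x|\theta) \, \partial_j \log p(x|\theta) \, \partial_k \log p(x|\theta)].
+\]
+
+The tensor, $T^X_{ijk}(\theta)$, also produces a (0, 1)-tensor:
+
+\[
+T^X_i(\theta) := T^X_{ijk}(\theta) \, g^{X,jk}(\theta).
+\]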
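+
+To make these definitions concrete, the following sketch evaluates them by direct summation for a one-parameter
+Bernoulli model, p(x|θ) = θ for x = 1 and 1 − θ for x = 0. The choice of model and the function name are
+assumptions made for illustration. Since p is linear in θ, the m-connection quantity vanishes, while the
+e-connection quantity and the skewness tensor are equal in magnitude and opposite in sign.
+
```python
def bernoulli_geometry(theta):
    """Return (g, e_gamma, m_gamma, T) for the Bernoulli model at theta.

    g        : Fisher information E[(d log p)^2]
    e_gamma  : E[(d^2 log p)(d log p)]
    m_gamma  : sum over x of (1/p) (d^2 p)(d p)
    T        : E[(d log p)^3]
    """
    g = e_gamma = m_gamma = t = 0.0
    for x in (0, 1):
        p = theta if x == 1 else 1.0 - theta
        dp = 1.0 if x == 1 else -1.0      # derivative of p in theta
        d2p = 0.0                         # p is linear in theta
        score = dp / p                    # derivative of log p
        d2logp = d2p / p - score ** 2     # second derivative of log p
        g += p * score ** 2
        e_gamma += p * d2logp * score
        m_gamma += d2p * dp / p
        t += p * score ** 3
    return g, e_gamma, m_gamma, t


if __name__ == "__main__":
    print(bernoulli_geometry(0.3))
```
+
+For this model, the results can be compared with the closed forms 1/{θ(1 − θ)}, −(1 − 2θ)/{θ²(1 − θ)²}, 0 and
+(1 − 2θ)/{θ²(1 − θ)²}, respectively.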
+
+In the same manner, the information geometrical quantities, $\overset{e}{\Gamma}{}^Y_{ij,l}(\theta)$, $\overset{m}{\Gamma}{}^Y_{ij,l}(\theta)$ and $T^Y_{ijk}(\theta)$, are
+defined for the model, {q(y|θ) : θ ∈ Θ}.
+Let $M^k_{ij}(\theta)$ be a (1, 2)-tensor defined by:
+
+\[
+M^k_{ij}(\theta) := \overset{m}{\Gamma}{}^{Y,k}_{ij}(\theta) - \overset{m}{\Gamma}{}^{X,k}_{ij}(\theta).
+\]
+
+For a derivative, (∂1v(θ), · · · , ∂dv(θ)), of the scalar function, v(θ), the e-covariant derivative is
+given by:
+
+\[
+\overset{e}{\nabla}_i \partial_j v(\theta) := \partial_{ij} v(\theta) - \overset{e}{\Gamma}{}^{X,k}_{ij}(\theta) \, \partial_k v(\theta).
+\]
+
+3. Asymptotically Constant-Risk Priors When the Distributions of Data and Target Variables
+Are Different
+
+In this section, we consider the settings where the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$, has a unique maximum
+point. We construct the asymptotically constant-risk prior under the Kullback–Leibler risk in the sense
+that the asymptotic risk up to O(N−2) is constant. We find asymptotically constant-risk priors up to
+O(N−2) in two steps: first, expand the Kullback–Leibler risks of Bayesian predictive densities; second,
+find the prior having an asymptotically constant risk using this expansion.
+From now on, we assume the following two conditions for the prior, π(θ; N):
+
+(C1) The prior, π(θ; N), has the form:
+
+\[
+\frac{\pi(\theta; N)}{|g_X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log f(\theta) + \log h(\theta)\},
+\]
+
+where f (θ) and h(θ) are smooth scalar functions of θ independent of N.
+(C2) The unique maximum point of the scalar function, f (θ), is equal to the unique maximum point
+of the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$.
+
+
+Based on Conditions (C1) and (C2), we expand the Kullback–Leibler risk of a Bayesian predictive
+density up to O(N−2).
+
+Theorem 1. The Kullback–Leibler risk of a Bayesian predictive density based on the prior, π(θ; N), satisfying
+Condition (C1), is expanded as:
+
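+\[
+\begin{aligned}
+R(\theta, \hat{q}_\pi(y|x^{(N)}))
+&= \frac{1}{2N} g^Y_{ij}(\theta) g^{X,ij}(\theta) + \frac{1}{2N} \mathring{g}^{ij}(\theta) \partial_i \log f(\theta) \partial_j \log f(\theta) - \frac{1}{N\sqrt{N}} T^Y_{ijk}(\theta) g^{X,ij}(\theta) g^{X,kl}(\theta) \partial_l \log f(\theta) \\
+&\quad + \frac{1}{N\sqrt{N}} \mathring{g}^{ij}(\theta) \overset{e}{\nabla}_i \partial_j \log f(\theta) + \frac{1}{N\sqrt{N}} \mathring{g}^{ij}(\theta) g^{X,kl}(\theta) \bigl( \overset{e}{\nabla}_i \partial_k \log f(\theta) \bigr) \partial_j \log f(\theta) \partial_l \log f(\theta) \\
+&\quad - \frac{1}{3N\sqrt{N}} T^Y_{ijk}(\theta) g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta) \partial_s \log f(\theta) \partial_t \log f(\theta) \partial_u \log f(\theta) \\
+&\quad + \frac{1}{2N\sqrt{N}} g^Y_{kl}(\theta) M^l_{ij}(\theta) g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta) \partial_s \log f(\theta) \partial_t \log f(\theta) \partial_u \log f(\theta) \\
+&\quad + \frac{1}{2N\sqrt{N}} g^{X,ij}(\theta) g^Y_{kl}(\theta) g^{X,kl}(\theta) M^m_{ij}(\theta) \partial_m \log f(\theta) + \frac{1}{N\sqrt{N}} \mathring{g}^{ij}(\theta) M^k_{ij}(\theta) \partial_k \log f(\theta) \\
+&\quad + \frac{1}{2N\sqrt{N}} \mathring{g}^{ij}(\theta) T^X_i(\theta) \partial_j \log f(\theta) + \frac{1}{2N\sqrt{N}} g^{X,im}(\theta) g^Y_{ij}(\theta) g^{X,kl}(\theta) M^j_{kl}(\theta) \partial_m \log f(\theta) \\
+&\quad + \frac{1}{N\sqrt{N}} \mathring{g}^{ij}(\theta) \partial_i \log f(\theta) \partial_j \log h(\theta) + O(N^{-2}).
+\end{aligned}
+\tag{1}
+\]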
+
+The proof is given in the Appendix. The first term in (1) shows that the precision of the
+estimation is determined by the geometric quantity of the data, $g^{X,ij}(\theta)$, and the metric of the parameter
+is determined by the geometric quantity of the target variables, $g^Y_{ij}(\theta)$. Note that each term in (1) is
+invariant under reparametrization.
+
+Remark 1. For the subsequent theorem, it is important that at the point, $\theta_f$, maximizing the scalar function,
+log f (θ), the risk $R(\theta_f, \hat{q}_\pi(y|x^{(N)}))$ is given by:
+
+\[
+R(\theta_f, \hat{q}_\pi(y|x^{(N)})) = \frac{1}{2N} \sup_{\theta \in \Theta} \{ g^{X,ij}(\theta) g^Y_{ij}(\theta) \} + \frac{1}{N\sqrt{N}} \mathring{g}^{ij}(\theta_f) \partial_{ij} \log f(\theta_f) + O(N^{-2}).
+\tag{2}
+\]
+
+The $N^{-3/2}$-order term of this risk is common whenever we use the same scalar function, log f (θ). This term is
+negative because of the definition of the point, $\theta_f$. Under Condition (C2), $\theta_f$ is equal to the unique maximum
+point, $\theta_{\max}$, of the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$.
+
+Based on (1) and (2), we construct asymptotically constant-risk priors using the solutions of the
+partial differential equations.
+
+Theorem 2. Suppose that the scalar functions, log ˜f (θ) and log ˜h(θ), satisfy the following conditions:
+
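+(A1) log ˜f (θ) is the solution of the eikonal equation given by:
+
+\[
+\mathring{g}^{ij}(\theta) \partial_i \log \tilde{f}(\theta) \partial_j \log \tilde{f}(\theta) = g^{X,ij}(\theta_{\max}) g^Y_{ij}(\theta_{\max}) - g^{X,ij}(\theta) g^Y_{ij}(\theta),
+\tag{3}
+\]
+
+where $\theta_{\max}$ is the unique maximum point of the scalar function, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$.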
+
+
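+(A2) log ˜h(θ) is the solution of the first-order linear partial differential equation given by:
+
+\[
+\begin{aligned}
+\mathring{g}^{ij}(\theta) \partial_i \log \tilde{f}(\theta) \partial_j \log \tilde{h}(\theta)
+&= - \mathring{g}^{ij}(\theta) \overset{e}{\nabla}_i \partial_j \log \tilde{f}(\theta) - \mathring{g}^{ij}(\theta) g^{X,kl}(\theta) \bigl( \overset{e}{\nabla}_i \partial_k \log \tilde{f}(\theta) \bigr) \partial_j \log \tilde{f}(\theta) \partial_l \log \tilde{f}(\theta) \\
+&\quad + T^Y_{ijk}(\theta) g^{X,ij}(\theta) g^{X,kl}(\theta) \partial_l \log \tilde{f}(\theta) \\
+&\quad + \frac{1}{3} T^Y_{ijk}(\theta) g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta) \partial_s \log \tilde{f}(\theta) \partial_t \log \tilde{f}(\theta) \partial_u \log \tilde{f}(\theta) \\
+&\quad - \frac{1}{2} g^Y_{kl}(\theta) M^l_{ij}(\theta) g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta) \partial_s \log \tilde{f}(\theta) \partial_t \log \tilde{f}(\theta) \partial_u \log \tilde{f}(\theta) \\
+&\quad - \frac{1}{2} g^{X,ij}(\theta) g^Y_{kl}(\theta) g^{X,kl}(\theta) M^m_{ij}(\theta) \partial_m \log \tilde{f}(\theta) - \mathring{g}^{ij}(\theta) M^k_{ij}(\theta) \partial_k \log \tilde{f}(\theta) \\
+&\quad - \frac{1}{2} \mathring{g}^{ij}(\theta) T^X_i(\theta) \partial_j \log \tilde{f}(\theta) - \frac{1}{2} g^{X,im}(\theta) g^Y_{ij}(\theta) g^{X,kl}(\theta) M^j_{kl}(\theta) \partial_m \log \tilde{f}(\theta) \\
+&\quad + \mathring{g}^{ij}(\theta_{\max}) \partial_{ij} \log \tilde{f}(\theta_{\max}).
+\end{aligned}
+\tag{4}
+\]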
+
+Let π(θ; N) be the prior that is constructed as:
+
+\[
+\frac{\pi(\theta; N)}{|g_X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log \tilde{f}(\theta) + \log \tilde{h}(\theta)\}.
+\]
+
+Further, suppose that log ˜f (θ) satisfies Condition (C2).
+Then, the Bayesian predictive density based on the prior, π(θ; N), has the asymptotically smallest constant
+risk up to O(N−2) among all priors with the form (C1).
+
+Proof. First, we consider the prior, φ(θ; N), constructed as:
+
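+\[
+\frac{\varphi(\theta; N)}{|g_X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log \tilde{f}(\theta)\}.
+\]
+
+From Theorem 1, the Kullback–Leibler risk, $R(\theta, \hat{q}_\varphi(y|x^{(N)}))$, based on the prior, φ(θ; N), is given by:
+
+\[
+R(\theta, \hat{q}_\varphi(y|x^{(N)})) = \frac{1}{2N} g^{X,ij}(\theta_{\max}) g^Y_{ij}(\theta_{\max}) + o(N^{-1}).
+\tag{5}
+\]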
+
+This is constant up to o(N−1).
+Suppose that there exists another prior, ϕ(θ; N), constructed as:
+
+\[
+\frac{\phi(\theta; N)}{|g_X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log f(\theta)\},
+\]
+
+and the Bayesian predictive density based on the prior, ϕ(θ; N), has the asymptotically constant risk:
+
+\[
+R(\theta, \hat{q}_\phi(y|x^{(N)})) = \frac{k}{2N} + o(N^{-1}).
+\]
+
+From Theorem 1, the prior ϕ(θ; N) must satisfy the equation:
+
+\[
+\mathring{g}^{ij}(\theta) \partial_i \log f(\theta) \partial_j \log f(\theta) = k - g^{X,ij}(\theta) g^Y_{ij}(\theta).
+\]
+
+The left-hand side of the above equation is non-negative, because the matrix, $\mathring{g}^{ij}(\theta)$, is positive-definite.
+Hence, the infimum of the constant, k, is equal to $g^{X,ij}(\theta_{\max}) g^Y_{ij}(\theta_{\max})$. From (5), the $N^{-1}$-order term
+of the risk based on the prior, φ(θ; N), achieves the infimum, $g^{X,ij}(\theta_{\max}) g^Y_{ij}(\theta_{\max})$. Thus, the Bayesian
+predictive density based on the prior, φ(θ; N), has the asymptotically smallest constant risk up to
+o(N−1).
+Second, we consider the prior, π(θ; N), constructed as:
+
+\[
+\frac{\pi(\theta; N)}{|g_X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log \tilde{f}(\theta) + \log \tilde{h}(\theta)\}.
+\]
+
+The above argument ensures that the prior, π(θ; N), has the asymptotically smallest constant risk up
+to $o(N^{-1})$. Thus, we only have to check if the $N^{-3/2}$-order term of the risk is the smallest constant.
+From (2), the $N^{-3/2}$-order term of the risk at the point, $\theta_{\max}$, is unchanged by the choice of the
+scalar function, log h(θ). In other words, the constant $N^{-3/2}$-order term must agree with the quantity,
+$\mathring{g}^{ij}(\theta_{\max}) \partial_{ij} \log \tilde{f}(\theta_{\max})$. From Theorem 1, if we choose the prior, π(θ; N), the $N^{-3/2}$-order term of the
+risk is the smallest constant, and it agrees with the quantity, $\mathring{g}^{ij}(\theta_{\max}) \partial_{ij} \log \tilde{f}(\theta_{\max})$. Thus, the prior,
+π(θ; N), has the asymptotically smallest constant risk up to $O(N^{-2})$.
+
+Remark 2. In Theorem 2, we choose log ˜f (θ), satisfying Condition (C2) among the solutions of (A1). We
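+consider the model with a one-dimensional parameter, θ. There are four possibilities for the solutions of (A1):
+
+\[
+\sqrt{\mathring{g}^{\theta\theta}(\theta)} \, \partial_\theta \log \tilde{f}(\theta) =
+\begin{cases}
+\pm \sqrt{g^{X,\theta\theta}(\theta_{\max}) g^Y_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta) g^Y_{\theta\theta}(\theta)} & \text{if } \theta \le \theta_{\max}, \\
+\pm \sqrt{g^{X,\theta\theta}(\theta_{\max}) g^Y_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta) g^Y_{\theta\theta}(\theta)} & \text{if } \theta \ge \theta_{\max},
+\end{cases}
+\]
+
+where the double-sign corresponds. From the concavity around θmax as suggested by (C2), we choose log ˜f (θ)
+as the solution of the following equation:
+
+\[
+\sqrt{\mathring{g}^{\theta\theta}(\theta)} \, \partial_\theta \log \tilde{f}(\theta) =
+\begin{cases}
+\sqrt{g^{X,\theta\theta}(\theta_{\max}) g^Y_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta) g^Y_{\theta\theta}(\theta)} & \text{if } \theta \le \theta_{\max}, \\
+- \sqrt{g^{X,\theta\theta}(\theta_{\max}) g^Y_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta) g^Y_{\theta\theta}(\theta)} & \text{if } \theta \ge \theta_{\max}.
+\end{cases}
+\tag{6}
+\]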
+
+Integrating both sides of Equation (6), the unique function, log ˜f (θ), is obtained.
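+
+The integration of Equation (6) can be sketched numerically. The example below assumes, for illustration only,
+the one-dimensional binomial setting with a unit-variance Gaussian target treated later in Section 4, so that the
+trace is θ(1 − θ), θmax = 1/2 and the square root of the inverse predictive metric is θ(1 − θ). The function names
+and the trapezoidal rule are choices made here, and the additive constant is fixed by log ˜f (θmax) = 0, which only
+affects the normalization of the prior.
+
```python
import math


def dlog_f(theta, theta_max=0.5):
    """Derivative of log f from Equation (6) in the assumed binomial setting."""
    def trace(t):
        return t * (1.0 - t)  # inverse Fisher information of the data times 1

    gap = max(trace(theta_max) - trace(theta), 0.0)  # guard against rounding
    sign = 1.0 if theta <= theta_max else -1.0
    return sign * math.sqrt(gap) / (theta * (1.0 - theta))


def log_f(theta, theta_max=0.5, steps=20000):
    """Integrate dlog_f from theta_max to theta with the trapezoidal rule."""
    h = (theta - theta_max) / steps
    total = 0.5 * (dlog_f(theta_max, theta_max) + dlog_f(theta, theta_max))
    total += sum(dlog_f(theta_max + i * h, theta_max) for i in range(1, steps))
    return total * h


if __name__ == "__main__":
    for theta in (0.2, 0.5, 0.8):
        closed_form = 0.5 * math.log(4.0 * theta * (1.0 - theta))
        print(theta, log_f(theta), closed_form)
```
+
+In this setting, the numerical integral can be compared with (1/2) log{4θ(1 − θ)}, which differs from
+(1/2) log{θ(1 − θ)} only by a constant.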
+
+Remark 3. Compare the Kullback–Leibler risk based on the asymptotically constant-risk prior, π(θ; N), with
+that based on the prior, λ(θ), independent of the sample size, N. From Theorem 1 and Theorem 2, the
+Kullback–Leibler risk based on the asymptotically constant-risk prior, π(θ; N), is given as:
+
+\[
+R(\theta, \hat{q}_\pi(y|x^{(N)})) = \frac{1}{2N} g^{X,ij}(\theta_{\max}) g^Y_{ij}(\theta_{\max}) + \frac{1}{N\sqrt{N}} \mathring{g}^{ij}(\theta_{\max}) \partial_{ij} \log \tilde{f}(\theta_{\max}) + O(N^{-2}).
+\tag{7}
+\]
+
+In contrast, the Kullback–Leibler risk based on the prior, λ(θ), is given as:
+
+\[
+R(\theta, \hat{q}_\lambda(y|x^{(N)})) = \frac{1}{2N} g^{X,ij}(\theta) g^Y_{ij}(\theta) + O(N^{-2}).
+\tag{8}
+\]
+
+The $N^{-1}$-order term in (8) is not greater than the $N^{-1}$-order term in (7); however, whereas the $N^{-3/2}$-order term in (8) does not
+exist, the $N^{-3/2}$-order term in (7) is negative. Thus, the maximum of the risk based on the asymptotically
+constant-risk prior, π(θ; N), is smaller than that of the risk based on the prior, λ(θ). This result is consistent
+with the minimaxity of selecting the prior that constructs the predictive density with the smallest maximum of
+the risk.
+
+4. Subminimax Estimator Problem Based on the Mean Squared Error
+
+In this section, we refer to the subminimax estimator problem based on the mean squared error,
+from the viewpoint of the prediction where the distributions of data and target variables are different
+and have a common unknown parameter. First, we give a brief review of the subminimax estimator
+problem through the binomial example.
+
+Example 1. Let us consider the binomial estimation based on the mean squared error, $R_{\mathrm{MSE}}(\theta, \hat{\theta})$. For any
+finite sample size, N, the Bayes estimator, $\hat{\theta}_\pi$, based on the Beta prior, $\pi(\theta; N) \propto \theta^{\sqrt{N}/2 - 1} (1 - \theta)^{\sqrt{N}/2 - 1}$, is
+minimax under the mean squared error. The mean squared error of the minimax Bayes estimator, $\hat{\theta}_\pi$, is given by:
+
+\[
+R_{\mathrm{MSE}}(\theta, \hat{\theta}_\pi) = \frac{N}{4(\sqrt{N} + N)^2} = \frac{1}{4N} - \frac{1}{2N\sqrt{N}} + O(N^{-2}).
+\tag{9}
+\]
+
+In contrast, the mean squared error of the maximum likelihood estimator, $\hat{\theta}_{\mathrm{MLE}}$, is given by:
+
+\[
+R_{\mathrm{MSE}}(\theta, \hat{\theta}_{\mathrm{MLE}}) = \frac{\theta(1 - \theta)}{N}.
+\]
+
+We compare the two estimators, ˆθπ and ˆθMLE. In the comparison of the N−1-order terms of the mean
+squared errors, it seems that the maximum likelihood estimator, ˆθMLE, dominates the minimax Bayes estimator,
+ˆθπ. In other words, the N−1-order term of RMSE(θ, ˆθMLE) is not greater than that of RMSE(θ, ˆθπ) for every
+θ ∈ Θ, and the equality holds when θ = 1/2. This seeming paradox is known as the subminimax estimator
+problem; see [14,17,18] for details. See also [15] for the conditions that such problems do not occur in estimation.
+However, this paradox does not mean the inferiority of the minimax Bayes estimator. This is because,
+although the mean squared error of the minimax Bayes estimator, ˆθπ, has the negative N−3/2-order term, the
+mean squared error of the maximum likelihood estimator, ˆθMLE, does not have the N−3/2-order term. Hence, in
+comparison to the mean squared errors up to O(N−2), the maximum of the mean squared error, RMSE(θ, ˆθπ), is
+below the maximum of the mean squared error, RMSE(θ, ˆθMLE).
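+
+The comparison in Example 1 can be checked by exact summation over the binomial distribution. The sketch below
+is an illustration with hypothetical function names; it uses the posterior mean under the Beta(√N/2, √N/2) prior,
+(k + √N/2)/(N + √N), as the minimax Bayes estimator and k/N as the maximum likelihood estimator, where k is
+the number of successes.
+
```python
import math


def log_binom_pmf(k, n, theta):
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(theta) + (n - k) * math.log(1.0 - theta))


def minimax_bayes(k, n):
    """Posterior mean under the Beta(sqrt(N)/2, sqrt(N)/2) prior."""
    r = math.sqrt(n)
    return (k + r / 2.0) / (n + r)


def mle(k, n):
    """Maximum likelihood estimator of the success probability."""
    return k / n


def mse(estimator, theta, n):
    """Exact mean squared error of an estimator of theta."""
    return sum(
        math.exp(log_binom_pmf(k, n, theta)) * (estimator(k, n) - theta) ** 2
        for k in range(n + 1)
    )


if __name__ == "__main__":
    n = 100
    constant_risk = n / (4.0 * (math.sqrt(n) + n) ** 2)
    expansion = 1.0 / (4.0 * n) - 1.0 / (2.0 * n * math.sqrt(n))
    print("constant risk:", constant_risk, "two-term expansion:", expansion)
    for theta in (0.1, 0.3, 0.5):
        print(theta, mse(minimax_bayes, theta, n), mse(mle, theta, n))
```
+
+The risk of the minimax Bayes estimator does not depend on θ, whereas the risk of the maximum likelihood
+estimator is smaller away from θ = 1/2 and larger at θ = 1/2, which is the behavior described above.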
+
+Next, we construct the asymptotically constant-risk prior in the estimation based on the mean
+squared error when the subminimax estimator problem occurs, from the viewpoint of the prediction.
+We consider the priors, π(θ; N), satisfying (C1). From Lemma A3 in the Appendix, the mean squared
+error of the Bayes estimator, ˆθπ, is equal to the Kullback–Leibler risk of the ˆθπ-plugin predictive density,
+q(y| ˆθπ), by assuming that the target variable, y, is a d-dimensional Gaussian random variable with the
+
+mean vector, θ, and unit variance. Note that $g^Y_{ij}(\theta) = \delta_{ij}$, $\overset{m}{\Gamma}{}^Y_{ij,k} = 0$ and $\overset{e}{\Gamma}{}^Y_{ij,k} = 0$ for i, j, k = 1, · · · , d.
+Thus, if $g^Y_{ij}(\theta) g^{X,ij}(\theta) = \sum_{i=1}^{d} g^{X,ii}(\theta)$ has a unique maximum point, we obtain the asymptotically
+constant-risk prior, π(θ; N), up to $O(N^{-2})$ from Lemma A2 in the Appendix and Theorem 2.
+Finally, we compare the mean squared error of the asymptotically constant-risk Bayes estimator,
+ˆθπ, with that of the maximum likelihood estimator, ˆθMLE. The mean squared error of the asymptotically
+constant-risk Bayes estimator, ˆθπ, is given as:
+
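+\[
+R_{\mathrm{MSE}}(\theta, \hat{\theta}_\pi) = \frac{1}{N} \sum_{i=1}^{d} g^{X,ii}(\theta_{\max}) + \frac{2}{N\sqrt{N}} \sum_{k=1}^{d} g^{X,ik}(\theta_{\max}) g^{X,jk}(\theta_{\max}) \partial_{ij} \log \tilde{f}(\theta_{\max}) + O(N^{-2}).
+\]
+
+In contrast, the mean squared error of the maximum likelihood estimator, $\hat{\theta}_{\mathrm{MLE}}$, is given as:
+
+\[
+R_{\mathrm{MSE}}(\theta, \hat{\theta}_{\mathrm{MLE}}) = \frac{1}{N} \sum_{i=1}^{d} g^{X,ii}(\theta) + O(N^{-2}).
+\]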
+
+See [16,19].
+Thus, the maximum of the mean squared error of the asymptotically constant-risk Bayes estimator
+is smaller than that of the maximum likelihood estimator, with an improvement of order $N^{-3/2}$ proportional to the Hessian
+of the scalar function, log ˜f (θ), at θmax. In the prediction where the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$, has a unique
+maximum point, the same improvement holds (Remark 3).
+
+
+Example 2. Using the above results, we consider the binomial estimation based on the mean squared error from
+the viewpoint of the prediction. The geometrical quantities to be used are given by:
+
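+\[
+g^X_{\theta\theta}(\theta) = \frac{1}{\theta(1 - \theta)}, \quad
+g^Y_{\theta\theta}(\theta) = 1, \quad
+\overset{m}{\Gamma}{}^X_{\theta\theta,\theta}(\theta) = 0, \quad
+\overset{m}{\Gamma}{}^Y_{\theta\theta,\theta}(\theta) = 0,
+\]
+\[
+\overset{e}{\Gamma}{}^X_{\theta\theta,\theta}(\theta) = -\frac{1 - 2\theta}{\theta^2 (1 - \theta)^2}, \quad
+\overset{e}{\Gamma}{}^Y_{\theta\theta,\theta}(\theta) = 0, \quad
+T^X_{\theta\theta\theta}(\theta) = \frac{1 - 2\theta}{\theta^2 (1 - \theta)^2}, \quad \text{and} \quad
+T^Y_{\theta\theta\theta}(\theta) = 0,
+\]
+
+respectively. Since $\overset{m}{\Gamma}{}^{X,\theta}_{\theta\theta}$, $\overset{m}{\Gamma}{}^{Y,\theta}_{\theta\theta}$ and $T^Y_{\theta\theta\theta}$ vanish, the asymptotically constant-risk prior in the estimation is
+identical to the asymptotically constant-risk prior in the prediction; compare Theorem 1 with the expansion of
+$g^Y_{ij}(\theta) E_{X^{(N)}}[(\hat{\theta}^i_\pi - \theta^i)(\hat{\theta}^j_\pi - \theta^j)]$ in Lemma A2 in the Appendix.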
+In this example, Equation (3) is given by:
+
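+\[
+\theta^2 (1 - \theta)^2 \{\partial_\theta \log \tilde{f}(\theta)\}^2 = \frac{1}{4} - \theta(1 - \theta),
+\]
+
+and the solution, log ˜f (θ), is (1/2) log{θ(1 − θ)}. Here, the second-order derivative of the function, log ˜f (θ),
+is given by:
+
+\[
+\partial_{\theta\theta} \log \tilde{f}(\theta) = -\frac{1 - 2\theta + 2\theta^2}{2\theta^2 (1 - \theta)^2}.
+\]
+
+From this, Equation (4) is given by:
+
+\[
+\frac{1}{2} \theta(1 - \theta)(1 - 2\theta) \, \partial_\theta \log \tilde{h}(\theta) + \theta^2 - \theta = -\frac{1}{4},
+\]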
+
+and the solution, log ˜h(θ), is −(1/2) log{θ(1 − θ)}. Hence, the asymptotically constant-risk prior, π(θ; N), is a
+Beta prior with the parameters, α = √N/2 and β = √N/2. Note that the asymptotically constant-risk prior
+coincides with the exact minimax prior. Since $g^{X,\theta\theta}(\theta_{\max}) = 1/4$ and $g^{X,\theta\theta}(\theta_{\max}) \partial_{\theta\theta} \log \tilde{f}(\theta_{\max}) = -1$, the
+mean squared error of the asymptotically constant-risk Bayes estimator, ˆθπ, agrees with (9) up to O(N−2).
+
+5. Application to the Prediction of the Binary Regression Model under the Covariate Shift
+
+In this section, we construct asymptotically constant-risk priors in the prediction based on the
+binary regression model under the covariate shift; see [10].
+We consider that we predict a binary response variable, y, based on the binary response variables,
+x(N). We assume that the target variable, y, and the data, x(N), follow the logistic regression models
+with the same parameter, β, given by:
+
+\[
+\log \frac{\Pi_x}{1 - \Pi_x} = \alpha + z\beta
+\]
+
+and:
+
+\[
+\log \frac{\Pi_y}{1 - \Pi_y} = \tilde{\alpha} + \tilde{z}\beta,
+\]
+
+
+where Πx is the success probability of the data and Πy is the success probability of the target variable.
+Let α and ˜α denote known constant terms, and let β denote the common unknown parameter. Further,
+we assume that the covariates, z and ˜z, are different.
+Using the parameter θ = Πx, we convert this predictive setting to binomial prediction where the
+data, x, and the target variable, y, are distributed according to:
+
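+\[
+p(x|\theta) :=
+\begin{cases}
+\theta & \text{if } x = 1, \\
+1 - \theta & \text{if } x = 0,
+\end{cases}
+\]
+
+and:
+
+\[
+q(y|\theta) :=
+\begin{cases}
+\dfrac{e^{\tilde{\alpha} - \tilde{z} z^{-1} \alpha} \, \theta^{\tilde{z} z^{-1}}}{(1 - \theta)^{\tilde{z} z^{-1}} + e^{\tilde{\alpha} - \tilde{z} z^{-1} \alpha} \, \theta^{\tilde{z} z^{-1}}} & \text{if } y = 1, \\[2ex]
+\dfrac{(1 - \theta)^{\tilde{z} z^{-1}}}{(1 - \theta)^{\tilde{z} z^{-1}} + e^{\tilde{\alpha} - \tilde{z} z^{-1} \alpha} \, \theta^{\tilde{z} z^{-1}}} & \text{if } y = 0,
+\end{cases}
+\]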
+
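+respectively. We obtain the Fisher information for x and y as:
+
+\[
+  g^X_{\theta\theta}(\theta) = \frac{1}{\theta(1-\theta)}
+\]
+
+and:
+
+\[
+  g^Y_{\theta\theta}(\theta) = \left(\frac{\tilde z}{z}\right)^2
+  \frac{e^{-\tilde\alpha + \tilde z z^{-1}\alpha}\,(1-\theta)^{\tilde z z^{-1}-2}\,\theta^{\tilde z z^{-1}-2}}
+       {\{\theta^{\tilde z z^{-1}} + e^{-\tilde\alpha + \tilde z z^{-1}\alpha}(1-\theta)^{\tilde z z^{-1}}\}^2},
+\]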
+
+respectively.
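+For simplicity, we consider the setting where z = 1, ˜z = 2 and α = ˜α = 0. The geometrical
+quantities for the model, {p(x|θ) : θ ∈ Θ}, are given by:
+
+\[
+  g^X_{\theta\theta}(\theta) = \frac{1}{\theta(1-\theta)}, \quad
+  \overset{m}{\Gamma}{}^X_{\theta\theta,\theta}(\theta) = 0, \quad
+  \overset{e}{\Gamma}{}^X_{\theta\theta,\theta}(\theta) = -\frac{1-2\theta}{\theta^2(1-\theta)^2}, \quad\text{and}\quad
+  T^X_{\theta\theta\theta}(\theta) = \frac{1-2\theta}{\theta^2(1-\theta)^2},
+\]
+
+respectively. In the same manner, the geometrical quantities for the model, {q(y|θ) : θ ∈ Θ}, are
+given by:
+
+\[
+  g^Y_{\theta\theta}(\theta) = \frac{4}{\{(1-\theta)^2+\theta^2\}^2}, \quad
+  \overset{m}{\Gamma}{}^Y_{\theta\theta,\theta}(\theta) = \frac{4(1-2\theta)(1+2\theta-2\theta^2)}{\theta(1-\theta)\{(1-\theta)^2+\theta^2\}^3},
+\]
+\[
+  \overset{e}{\Gamma}{}^Y_{\theta\theta,\theta}(\theta) = -\frac{4(1-2\theta)}{\theta(1-\theta)\{(1-\theta)^2+\theta^2\}^2}, \quad\text{and}\quad
+  T^Y_{\theta\theta\theta}(\theta) = \frac{8(1-2\theta)}{\theta(1-\theta)\{(1-\theta)^2+\theta^2\}^3},
+\]
+
+respectively.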
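+
+The closed form of the Fisher information of the target model can be verified numerically. The following
+sketch is an illustration under the setting z = 1, ˜z = 2 and α = ˜α = 0; the helper names (success_prob,
+fisher_numeric, fisher_closed, trace_value) are hypothetical. It uses the fact that a Bernoulli model with
+success probability p(θ) has Fisher information p′(θ)2/[p(θ){1 − p(θ)}], approximates p′ by a central
+difference, and also evaluates the trace gX,θθ(θ)gY θθ(θ) = θ(1 − θ)gY θθ(θ) on a grid.

```python
import math


def success_prob(theta, ratio=2.0, log_offset=0.0):
    """q(y = 1 | theta), with ratio = z~/z and log_offset = alpha~ - ratio * alpha."""
    c = math.exp(log_offset)
    return c * theta**ratio / ((1.0 - theta) ** ratio + c * theta**ratio)


def fisher_numeric(theta, eps=1e-6):
    """Fisher information of the Bernoulli target model via a central difference."""
    p = success_prob(theta)
    dp = (success_prob(theta + eps) - success_prob(theta - eps)) / (2.0 * eps)
    return dp * dp / (p * (1.0 - p))


def fisher_closed(theta):
    """Closed form 4 / {(1 - theta)^2 + theta^2}^2 for z = 1, z~ = 2, alpha = alpha~ = 0."""
    return 4.0 / ((1.0 - theta) ** 2 + theta**2) ** 2


def trace_value(theta):
    """Inverse data Fisher information, theta * (1 - theta), times the target Fisher information."""
    return theta * (1.0 - theta) * fisher_closed(theta)


for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(t, fisher_numeric(t), fisher_closed(t), trace_value(t))
```

+On the grid, the trace is largest at θ = 1/2, where it equals 4.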
+Using these quantities, Equation (3) is given by:
+
+\[
+  \frac{4\theta^2(1-\theta)^2}{\{\theta^2+(1-\theta)^2\}^2}\,(\partial_\theta \log \tilde f(\theta))^2
+  = 4 - \frac{4\theta(1-\theta)}{\{\theta^2+(1-\theta)^2\}^2}.
+\]
+
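+By noting that the maximum point of $g^{X,\theta\theta}(\theta)g^Y_{\theta\theta}(\theta)$ is 1/2, the solution, log ˜f (θ), of this equation is
+given by:
+
+\[
+\begin{aligned}
+  \log \tilde f(\theta) ={}& 2\sqrt{1-\theta+\theta^2} + \log\{\theta(1-\theta)\} \\
+  & - \log\bigl(2-\theta+2\sqrt{1-\theta+\theta^2}\bigr) - \log\bigl(1+\theta+2\sqrt{1-\theta+\theta^2}\bigr).
+\end{aligned}
+\]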
+
+Using this solution, we obtain the solution of Equation (4) given by:
+
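+\[
+\begin{aligned}
+  \log \tilde h(\theta) = \frac{1}{6}\Bigl[
+  & -\frac{1}{1-\theta} - \frac{1}{\theta} - 12\theta(1-\theta) - 12\sqrt{3}\sqrt{1-\theta+\theta^2} \\
+  & + (3-6\sqrt{3})\{\log\theta + \log(1-\theta)\} - 3\log(1-\theta+\theta^2) + 10\log\{(1-\theta)^2+\theta^2\} \\
+  & - 6\log\bigl(\sqrt{3}+2\sqrt{1-\theta+\theta^2}\bigr) + 6\sqrt{3}\log\bigl\{1+(1-\theta)+2\sqrt{1-\theta+\theta^2}\bigr\} \\
+  & + 6\sqrt{3}\log\bigl\{1+\theta+2\sqrt{1-\theta+\theta^2}\bigr\}
+  \Bigr].
+\end{aligned}
+\]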
+
+The asymptotically constant-risk priors for the different sample sizes are shown in Figure 1. The prior
+weight is found to concentrate more tightly around 1/2 as the sample size, N, grows.
+In this example, we obtain the Kullback–Leibler risk of the Bayesian predictive density based on
+the asymptotically constant-risk prior, π(θ; N), as:
+
+\[
+  R(\theta, \hat q_\pi(y|x^{(N)})) = \frac{2}{N} - \frac{4\sqrt{3}}{N\sqrt{N}} + O(N^{-2}).
+\]
+
+We compare this value with the Bayes risk calculated using the Monte Carlo simulation; see Figure 2.
+As the sample size, N, grows, the difference appears negligible. Further, we compare this value with
+the risk itself calculated by the Monte Carlo simulation; see Figure 3. As the sample size, N, grows, the
+risk becomes more constant.
+
+Figure 1. Asymptotically constant-risk prior in the prediction where the data are distributed according
+to the binomial distribution, Bin(N, θ), and the target variable is distributed according to the binomial
+distribution, Bin(1, θ2/(θ2 + (1 − θ)2)).
+
+Figure 2. Bayes risk based on the asymptotically constant-risk prior in the prediction where the data
+are distributed according to the binomial distribution, Bin(N, θ), and the target variable is distributed
+according to the binomial distribution, Bin(1, θ2/(θ2 + (1 − θ)2)).
+
+Figure 3. Comparison of the Kullback–Leibler risk calculated using the Monte Carlo simulations and
+the asymptotic risk, 2/N − (4√3)/(N√N), in the prediction where the data are distributed according
+to the binomial distribution, Bin(N, θ), and the target variable is distributed according to the binomial
+distribution, Bin(1, θ2/(θ2 + (1 − θ)2)).
+
+6. Discussion and Conclusions
+
+We have considered the setting where the quantity $g^{X,ij}(\theta)g^Y_{ij}(\theta)$ (the trace of the product of the
+inverse Fisher information matrix, $g^{X,ij}(\theta)$, and the Fisher information matrix, $g^Y_{ij}(\theta)$) has a unique
+maximum point, and we have investigated the asymptotically constant-risk prior in the sense that the
+asymptotic risk is constant up to O(N−2).
+In Section 3, we have considered the prior depending on the sample size, N, and constructed
+the asymptotically constant-risk prior using Equations (3) and (4). In Section 4, we have clarified the
+relationship between the subminimax estimator problem based on the mean squared error and the
+prediction where the distributions of data and target variables are different. In Section 5, we have
+constructed the asymptotically constant-risk prior in the prediction based on the logistic regression
+model under the covariate shift.
+We have assumed that the trace, $g^{X,ij}(\theta)g^Y_{ij}(\theta)$, is finite. However, the trace may diverge in the
+non-compact parameter space; for example, it diverges under the predictive setting, where the
+distribution, q(y|θ), of the target variable is the Poisson distribution and the data distribution, p(x|θ),
+is the exponential distribution, with Θ equivalent to R. Therefore, for our future work, in such a
+setting, we should adopt criteria other than minimaxity.
+
+Acknowledgments: The authors thank the referees for their helpful comments. This research was partially
+supported by a Grant-in-Aid for Scientific Research (23650144, 26280005).
+
+Author Contributions: Both authors contributed to the research and writing of this paper. Both authors read and
+approved the final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+Appendix A
+
+We prove Theorem 1. First, we introduce some lemmas for the proof. For the expansion, we
+follow six steps (the first five steps are arranged in the form of lemmas): the first is to
+expand the MAP estimator; the second is to calculate its bias and mean squared error; the third
+is to expand the Kullback–Leibler risk using the ˆθπ-plugin predictive density, q(y| ˆθπ); the fourth is to
+expand the Bayesian predictive density based on the prior π(θ; N); the fifth is to expand the Bayesian
+estimator minimizing the Bayes risk; and the last is to prove Theorem 1 using these lemmas.
+We use some additional notation for the expansion. Let ˆθπ be the maximum point of the
+scalar function log p(x(N)|θ) + log{π(θ; N)/|gX(θ)|1/2}. Let l(θ|x(N)) denote the log likelihood of
+the data, x(N). Let lij(θ|x(N)), lijk(θ|x(N)) and lijkl(θ|x(N)) be the derivatives of order 2, 3 and 4 of the
+log likelihood, l(θ|x(N)). Let $H_{ij}(\theta|x^{(N)})$ denote the quantity $l_{ij}(\theta|x^{(N)}) + N g^X_{ij}(\theta)$. Let $\tilde l_i(\theta|x^{(N)})$ and
+$\tilde H_{ij}(\theta|x^{(N)})$ denote $(1/\sqrt{N})\,l_i(\theta|x^{(N)})$ and $(1/\sqrt{N})\,H_{ij}(\theta|x^{(N)})$, respectively. In addition, the brackets ( )
+denote symmetrization: for any two tensors, $a_{ij}$ and $b_{ij}$, $a_{i(j}b_{k)l}$ denotes $(a_{ij}b_{kl} + a_{ik}b_{jl})/2$.
+
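+Lemma A1. Let ˆθπ be the maximum point of log p(x(N)|θ) + log{π(θ; N)/|gX(θ)|1/2}. Then, the i-th
+component of this estimator ˆθπ is expanded as follows:
+
+\[
+\begin{aligned}
+  \hat\theta^i_\pi ={}& \theta^i + \frac{1}{\sqrt{N}} g^{X,ik}(\theta)\tilde l_k(\theta|x^{(N)}) + \frac{1}{\sqrt{N}} g^{X,ik}(\theta)\partial_k \log f(\theta) \\
+  & + \frac{1}{N} g^{X,ik}(\theta)\tilde H_{km}(\theta|x^{(N)}) g^{X,mr}(\theta)\tilde l_r(\theta|x^{(N)}) \\
+  & + \frac{1}{2N} g^{X,ik}(\theta) L^X_{kmr}(\theta) g^{X,mq}(\theta) g^{X,rs}(\theta)\tilde l_q(\theta|x^{(N)})\tilde l_s(\theta|x^{(N)}) \\
+  & + \frac{1}{N} g^{X,ik}(\theta)\tilde H_{km}(\theta|x^{(N)}) g^{X,mr}(\theta)\partial_r \log f(\theta) \\
+  & + \frac{1}{N} g^{X,ik}(\theta) L^X_{kmr}(\theta) g^{X,mq}(\theta) g^{X,rs}(\theta)\tilde l_q(\theta|x^{(N)})\partial_s \log f(\theta) \\
+  & + \frac{1}{2N} g^{X,ik}(\theta) L^X_{kmr}(\theta) g^{X,mq}(\theta) g^{X,rs}(\theta)\partial_q \log f(\theta)\partial_s \log f(\theta) \\
+  & + \frac{1}{N} g^{X,ik}(\theta) g^{X,mq}(\theta)\partial_{km} \log f(\theta)\,\tilde l_q(\theta|x^{(N)}) \\
+  & + \frac{1}{N} g^{X,ik}(\theta) g^{X,mq}(\theta)\partial_{km} \log f(\theta)\,\partial_q \log f(\theta) \\
+  & + \frac{1}{N} g^{X,ik}(\theta)\partial_k \log h(\theta) + O_P(N^{-3/2}).
+\end{aligned}
+\tag{A1}
+\]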
+
+Proof. By the definition of ˆθπ, we get the equation given by:
+
+\[
+  \partial_i \log p(x^{(N)}|\hat\theta_\pi) + \partial_i \log\frac{\pi(\hat\theta_\pi; N)}{|g^X(\hat\theta_\pi)|^{1/2}} = 0.
+\]
+
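+From our assumption that prior π(θ; N) has the form given by:
+
+\[
+  \frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N}\log f(\theta) + \log h(\theta)\},
+\]
+
+we rewrite this equation as:
+
+\[
+  \partial_i \log p(x^{(N)}|\hat\theta_\pi) + \sqrt{N}\,\partial_i \log f(\hat\theta_\pi) + \partial_i \log h(\hat\theta_\pi) = 0.
+\]
+
+By applying Taylor expansion around θ to this new equation, we derive the following expansion:
+
+\[
+\begin{aligned}
+  & \partial_i \log p(x^{(N)}|\theta) + \{\partial_{ij} \log p(x^{(N)}|\theta)\}(\hat\theta^j_\pi - \theta^j) \\
+  & + \frac{1}{2}\{\partial_{ijk} \log p(x^{(N)}|\theta)\}(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k) + \sqrt{N}\,\partial_i \log f(\theta) \\
+  & + \sqrt{N}\{\partial_{ij} \log f(\theta)\}(\hat\theta^j_\pi - \theta^j) + \partial_i \log h(\theta) + o_P(1) = 0.
+\end{aligned}
+\]
+
+From the law of large numbers and the central limit theorem, we rewrite the above expansion as:
+
+\[
+\begin{aligned}
+  N g^X_{ij}(\theta)(\hat\theta^j_\pi - \theta^j) ={}& \partial_i \log p(x^{(N)}|\theta) + \sqrt{N}\,\partial_i \log f(\theta) + H_{ij}(\theta|x^{(N)})(\hat\theta^j_\pi - \theta^j) \\
+  & + \frac{N}{2} L^X_{ijk}(\theta)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k) + \sqrt{N}\,\partial_{ij} \log f(\theta)(\hat\theta^j_\pi - \theta^j) \\
+  & + \partial_i \log h(\theta) + o_P(1).
+\end{aligned}
+\tag{A2}
+\]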
+
+By substituting the deviation, ˆθπ − θ, recursively into Expansion (A2), we obtain Expansion (A1).
+
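+Lemma A2. Let ˆθπ be the maximum point of log p(x(N)|θ) + log{π(θ; N)/|gX(θ)|1/2}. Then, the i-th
+component of the bias of the estimator, ˆθπ, is given by:
+
+\[
+\begin{aligned}
+  E_{X^{(N)}}[\hat\theta^i_\pi] ={}& \theta^i + \frac{1}{\sqrt{N}} g^{X,ik}(\theta)\partial_k \log f(\theta) \\
+  & - \frac{1}{2N}\overset{m}{\Gamma}{}^{X,i}(\theta) + \frac{1}{2N} g^{X,ik}(\theta) g^{X,mq}(\theta) g^{X,rs}(\theta) L^X_{kmr}(\theta)\partial_q \log f(\theta)\partial_s \log f(\theta) \\
+  & + \frac{1}{N} g^{X,ik}(\theta) g^{X,mq}(\theta)\partial_{km} \log f(\theta)\,\partial_q \log f(\theta) \\
+  & + \frac{1}{N} g^{X,ik}(\theta)\partial_k \log h(\theta) + O(N^{-3/2}).
+\end{aligned}
+\tag{A3}
+\]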
+
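+The (i, j)-component of the mean squared error of ˆθπ is given by:
+
+\[
+\begin{aligned}
+  & E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] \\
+  & = \frac{1}{N} g^{X,ij}(\theta) + \frac{1}{N} g^{X,ik}(\theta) g^{X,jl}(\theta)\partial_k \log f(\theta)\partial_l \log f(\theta) \\
+  & \quad - \frac{1}{N\sqrt{N}} g^{X,k(i}(\theta)\overset{m}{\Gamma}{}^{X,j)}(\theta)\partial_k \log f(\theta) + \frac{2}{N\sqrt{N}} g^{X,k(i}(\theta) g^{X,j)l}(\theta)\partial_{kl} \log f(\theta) \\
+  & \quad + \frac{2}{N\sqrt{N}} g^{X,k(i}(\theta)\partial_k g^{X,j)l}(\theta)\partial_l \log f(\theta) \\
+  & \quad + \frac{1}{N\sqrt{N}} g^{X,k(i}(\theta) g^{X,j)l}(\theta) g^{X,nr}(\theta) g^{X,pt}(\theta) L^X_{lrt}(\theta)\partial_k \log f(\theta)\partial_n \log f(\theta)\partial_p \log f(\theta) \\
+  & \quad + \frac{2}{N\sqrt{N}} g^{X,k(i}(\theta) g^{X,j)l}(\theta) g^{X,nr}(\theta)\partial_{ln} \log f(\theta)\partial_r \log f(\theta)\partial_k \log f(\theta) \\
+  & \quad + \frac{2}{N\sqrt{N}} g^{X,k(i}(\theta) g^{X,j)l}(\theta)\partial_k \log f(\theta)\partial_l \log h(\theta) + O(N^{-2}),
+\end{aligned}
+\tag{A4}
+\]
+
+where $g^{X,k(i}(\theta)\overset{m}{\Gamma}{}^{X,j)}(\theta)$ denotes $(1/2)\{g^{X,ki}(\theta)\overset{m}{\Gamma}{}^{X,j}(\theta) + g^{X,kj}(\theta)\overset{m}{\Gamma}{}^{X,i}(\theta)\}$ and $g^{X,k(i}(\theta)\partial_k g^{X,j)l}(\theta)$
+denotes $(1/2)\{g^{X,ki}(\theta)\partial_k g^{X,jl}(\theta) + g^{X,kj}(\theta)\partial_k g^{X,il}(\theta)\}$. The (i, j, k)-component of the mean of the third
+power of the deviation, ˆθπ − θ, is given by:
+
+\[
+\begin{aligned}
+  & E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] \\
+  & = \frac{1}{N\sqrt{N}} g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta)\partial_s \log f(\theta)\partial_t \log f(\theta)\partial_u \log f(\theta) \\
+  & \quad + \frac{3}{N\sqrt{N}} g^{X,(ij}(\theta) g^{X,k)l}(\theta)\partial_l \log f(\theta) + O(N^{-2}).
+\end{aligned}
+\tag{A5}
+\]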
+
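+Proof. First, using Lemma A1, we determine the i-th component of the bias of ˆθπ given by:
+
+\[
+\begin{aligned}
+  E_{X^{(N)}}[\hat\theta^i_\pi - \theta^i] ={}& \frac{1}{\sqrt{N}} g^{X,ik}(\theta)\partial_k \log f(\theta) \\
+  & - \frac{1}{2N}\overset{m}{\Gamma}{}^{X,i}(\theta) + \frac{1}{2N} g^{X,ik}(\theta) g^{X,mq}(\theta) g^{X,rs}(\theta) L^X_{kmr}(\theta)\partial_q \log f(\theta)\partial_s \log f(\theta) \\
+  & + \frac{1}{N} g^{X,ik}(\theta) g^{X,mq}(\theta)\partial_{km} \log f(\theta)\,\partial_q \log f(\theta) + \frac{1}{N} g^{X,ik}(\theta)\partial_k \log h(\theta) + O(N^{-3/2}).
+\end{aligned}
+\]
+
+Second, consider the following relationship:
+
+\[
+\begin{aligned}
+  & E_{X^{(N)}}\Bigl[\Bigl\{\hat\theta^i_\pi - \theta^i - \frac{1}{\sqrt{N}} g^{X,ik}(\theta)\tilde l_k(\theta|x^{(N)}) - \frac{1}{\sqrt{N}} g^{X,ik}(\theta)\partial_k \log f(\theta)\Bigr\} \\
+  & \qquad\quad \times \Bigl\{\hat\theta^j_\pi - \theta^j - \frac{1}{\sqrt{N}} g^{X,jl}(\theta)\tilde l_l(\theta|x^{(N)}) - \frac{1}{\sqrt{N}} g^{X,jl}(\theta)\partial_l \log f(\theta)\Bigr\}\Bigr] \\
+  & = E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] + \frac{1}{N} g^{X,ij}(\theta) + \frac{1}{N} g^{X,ik}(\theta) g^{X,jl}(\theta)\partial_k \log f(\theta)\partial_l \log f(\theta) \\
+  & \quad - \frac{1}{\sqrt{N}} g^{X,ki}(\theta) E_{X^{(N)}}[(\hat\theta^j_\pi - \theta^j)\tilde l_k(\theta|x^{(N)})] - \frac{1}{\sqrt{N}} g^{X,kj}(\theta) E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)\tilde l_k(\theta|x^{(N)})] \\
+  & \quad - \frac{1}{\sqrt{N}} g^{X,ki}(\theta) E_{X^{(N)}}[(\hat\theta^j_\pi - \theta^j)\partial_k \log f(\theta)] - \frac{1}{\sqrt{N}} g^{X,kj}(\theta) E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)\partial_k \log f(\theta)].
+\end{aligned}
+\tag{A6}
+\]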
+
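+By differentiating the j-th component of the bias, $E_{X^{(N)}}[\hat\theta^j_\pi - \theta^j]$, we obtain the equation given by:
+
+\[
+  \frac{1}{N}\partial_k E_{X^{(N)}}[\hat\theta^j_\pi - \theta^j] = -\frac{1}{N}\delta^j_k + \frac{1}{\sqrt{N}} E_{X^{(N)}}[(\hat\theta^j_\pi - \theta^j)\tilde l_k(\theta|x^{(N)})],
+\tag{A7}
+\]
+
+where $\delta^j_k$ denotes the Kronecker delta: if the upper and the lower indices agree, then its value
+is one and otherwise zero. Equation (A7) has been used by [2,16,19]. By substituting
+Equations (A7) and (A3) into Relationship (A6), we obtain the (i, j)-component of the mean squared
+error of ˆθπ given by:
+
+\[
+\begin{aligned}
+  & E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] \\
+  & = \frac{1}{N} g^{X,ij}(\theta) + \frac{1}{N} g^{X,ik}(\theta) g^{X,jl}(\theta)\partial_k \log f(\theta)\partial_l \log f(\theta) \\
+  & \quad - \frac{1}{N\sqrt{N}} g^{X,k(i}(\theta)\overset{m}{\Gamma}{}^{X,j)}(\theta)\partial_k \log f(\theta) + \frac{2}{N\sqrt{N}} g^{X,k(i}(\theta) g^{X,j)l}(\theta)\partial_{kl} \log f(\theta) \\
+  & \quad + \frac{2}{N\sqrt{N}} g^{X,k(i}(\theta)\partial_k g^{X,j)l}(\theta)\partial_l \log f(\theta) \\
+  & \quad + \frac{1}{N\sqrt{N}} g^{X,k(i}(\theta) g^{X,j)l}(\theta) g^{X,nr}(\theta) g^{X,pt}(\theta) L^X_{lrt}(\theta)\partial_k \log f(\theta)\partial_n \log f(\theta)\partial_p \log f(\theta) \\
+  & \quad + \frac{2}{N\sqrt{N}} g^{X,k(i}(\theta) g^{X,j)l}(\theta) g^{X,nr}(\theta)\partial_{ln} \log f(\theta)\partial_r \log f(\theta)\partial_k \log f(\theta) \\
+  & \quad + \frac{2}{N\sqrt{N}} g^{X,k(i}(\theta) g^{X,j)l}(\theta)\partial_k \log f(\theta)\partial_l \log h(\theta) + O(N^{-2}).
+\end{aligned}
+\]
+
+Finally, by taking the expectation of the third power of the deviation, $\hat\theta^i_\pi - \theta^i$, we obtain the
+following expansion:
+
+\[
+\begin{aligned}
+  & E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] \\
+  & = \frac{1}{N\sqrt{N}} g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta)\partial_s \log f(\theta)\partial_t \log f(\theta)\partial_u \log f(\theta) \\
+  & \quad + \frac{3}{N\sqrt{N}} g^{X,(ij}(\theta) g^{X,k)l}(\theta)\partial_l \log f(\theta) + O(N^{-2}).
+\end{aligned}
+\]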
+
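+Lemma A3. Let ˆθπ be the maximum point of log p(x(N)|θ) + log{π(θ; N)/|gX(θ)|1/2}.
+The Kullback–Leibler risk of the plug-in predictive density, q(y| ˆθπ), with the estimator, ˆθπ, is
+expanded as follows:
+
+\[
+\begin{aligned}
+  & R(\theta, q(y|\hat\theta_\pi)) \\
+  & = \frac{1}{2N} g^Y_{ij}(\theta) g^{X,ij}(\theta) + \frac{1}{2N}\mathring{g}^{ij}(\theta)\partial_i \log f(\theta)\partial_j \log f(\theta) + \frac{1}{N\sqrt{N}}\mathring{g}^{ij}(\theta)\bigl(\overset{e}{\nabla}_i\partial_j \log f(\theta)\bigr) \\
+  & \quad + \frac{1}{N\sqrt{N}}\mathring{g}^{ij}(\theta) g^{X,kl}(\theta)\bigl(\overset{e}{\nabla}_i\partial_k \log f(\theta)\bigr)\partial_j \log f(\theta)\partial_l \log f(\theta) \\
+  & \quad - \frac{1}{3N\sqrt{N}} T^Y_{ijk}(\theta) g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta)\partial_s \log f(\theta)\partial_t \log f(\theta)\partial_u \log f(\theta) \\
+  & \quad + \frac{1}{2N\sqrt{N}} g^Y_{kl}(\theta) M^l_{ij} g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta)\partial_s \log f(\theta)\partial_t \log f(\theta)\partial_u \log f(\theta) \\
+  & \quad + \frac{1}{2N\sqrt{N}} g^{X,ij}(\theta) g^Y_{kl}(\theta) g^{X,kl}(\theta) M^m_{ij}\partial_m \log f(\theta) - \frac{1}{N\sqrt{N}} T^Y_{ijk}(\theta) g^{X,ij}(\theta) g^{X,kl}(\theta)\partial_l \log f(\theta) \\
+  & \quad + \frac{1}{N\sqrt{N}}\mathring{g}^{ij}(\theta) M^k_{ij}\partial_k \log f(\theta) + \frac{1}{N\sqrt{N}}\mathring{g}^{ij}(\theta)\partial_i \log f(\theta)\partial_j \log h(\theta) + O(N^{-2}).
+\end{aligned}
+\tag{A8}
+\]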
+
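+Proof. By applying the Taylor expansion, the Kullback–Leibler risk, R(θ, q(y| ˆθπ)), is expanded as:
+
+\[
+\begin{aligned}
+  & E_{X^{(N)}}[D(q(\cdot|\theta), q(\cdot|\hat\theta_\pi))] \\
+  & = E_{X^{(N)}}\Bigl[\int q(y|\theta)\Bigl\{-l_i(\theta|y)(\hat\theta^i_\pi - \theta^i) - \frac{1}{2} l_{ij}(\theta|y)(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j) \\
+  & \qquad\qquad - \frac{1}{6} l_{ijk}(\theta|y)(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k) + O_P(N^{-2})\Bigr\}\,dy\Bigr] \\
+  & = \frac{1}{2} g^Y_{ij}(\theta) E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] - \frac{1}{6} L^Y_{ijk}(\theta) E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] + O(N^{-2}) \\
+  & = \frac{1}{2} g^Y_{ij}(\theta) E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] \\
+  & \quad + \Bigl\{\frac{1}{2}\overset{m}{\Gamma}{}^Y_{(ij,k)}(\theta) - \frac{1}{3} T^Y_{ijk}(\theta)\Bigr\} E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] + O(N^{-2}) \\
+  & = \frac{1}{2} g^Y_{ij}(\theta) E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] - \frac{1}{3} T^Y_{ijk}(\theta) E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] \\
+  & \quad + \frac{1}{2}\Bigl\{g^Y_{kl}(\theta)\overset{m}{\Gamma}{}^{Y,l}_{ij}(\theta) - g^Y_{kl}(\theta)\overset{m}{\Gamma}{}^{X,l}_{ij}(\theta)\Bigr\} E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] \\
+  & \quad + \frac{1}{2} g^Y_{kl}(\theta)\overset{m}{\Gamma}{}^{X,l}_{ij}(\theta) E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] + O(N^{-2}),
+\end{aligned}
+\tag{A9}
+\]
+
+where $\overset{m}{\Gamma}{}^Y_{(ij,k)}$ denotes $(1/3)\{\overset{m}{\Gamma}{}^Y_{ij,k} + \overset{m}{\Gamma}{}^Y_{jk,i} + \overset{m}{\Gamma}{}^Y_{ki,j}\}$.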
+
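+By the definition of the predictive metric, $\mathring{g}_{ij}(\theta) = g^X_{ik}(\theta) g^{Y,kl}(\theta) g^X_{lj}(\theta)$, by Expansions (A4) and
+(A5) and by the relationship $L^X_{ijk}(\theta) = -\overset{e}{\Gamma}{}^X_{ij,k}(\theta) - \overset{e}{\Gamma}{}^X_{jk,i}(\theta) - \overset{e}{\Gamma}{}^X_{ki,j}(\theta) - T^X_{ijk}(\theta)$, the last two terms of the
+above expansion (A9) are expanded as:
+
+\[
+\begin{aligned}
+  & \frac{1}{2} g^Y_{ij}(\theta) E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] + \frac{1}{2} g^Y_{kl}(\theta)\overset{m}{\Gamma}{}^{X,l}_{ij}(\theta) E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] \\
+  & = \frac{1}{2N} g^Y_{ij}(\theta) g^{X,ij}(\theta) + \frac{1}{2N}\mathring{g}^{ij}(\theta)\partial_i \log f(\theta)\partial_j \log f(\theta) \\
+  & \quad + \frac{1}{N\sqrt{N}}\mathring{g}^{ij}(\theta)\Bigl\{\partial_{ij} \log f(\theta) - \overset{e}{\Gamma}{}^{X,k}_{ij}(\theta)\partial_k \log f(\theta)\Bigr\} \\
+  & \quad + \frac{1}{N\sqrt{N}}\mathring{g}^{ij}(\theta) g^{X,kl}(\theta)\Bigl\{\partial_{ik} \log f(\theta) - \overset{e}{\Gamma}{}^{X,m}_{ik}\partial_m \log f(\theta)\Bigr\}\partial_j \log f(\theta)\partial_l \log f(\theta) \\
+  & \quad + \frac{1}{N\sqrt{N}}\mathring{g}^{ij}(\theta)\partial_i \log f(\theta)\partial_j \log h(\theta) + O(N^{-2}).
+\end{aligned}
+\tag{A10}
+\]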
+
+By substituting Expansion (A10) into Expansion (A9), Expansion (A8) is obtained.
+
+Note that Expansion (A8) is invariant up to O(N−2) under the reparametrization, so that each
+term of this expansion is a scalar function of θ.
+
+Lemma A4. Let ˆθπ be the maximum point of log p(x(N)|θ) + log{π(θ; N)/|gX(θ)|1/2}. The Bayesian
+predictive density based on the prior, π(θ; N), is expanded as:
+
+\[
+\begin{aligned}
+  \hat q_\pi(y|x^{(N)}) ={}& q(y|\hat\theta_\pi) + \frac{1}{N} g^{X,ij}(\hat\theta_\pi)\Bigl\{\partial_i \log|g^X(\hat\theta_\pi)|^{1/2} - \overset{e}{\Gamma}{}^{X,k}_{ik}(\hat\theta_\pi)\Bigr\}\partial_j q(y|\hat\theta_\pi) \\
+  & + \frac{1}{2N} g^{X,ij}(\hat\theta_\pi)\Bigl\{\partial_{ij} q(y|\hat\theta_\pi) - \overset{m}{\Gamma}{}^{X,k}_{ij}(\hat\theta_\pi)\partial_k q(y|\hat\theta_\pi)\Bigr\} + O_P(N^{-3/2}).
+\end{aligned}
+\tag{A11}
+\]
+
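+Proof. Let ˜θπ denote ˆθπ − θ. First, using a Taylor expansion twice, we expand the posterior density,
+π(θ|x(N)), as:
+
+\[
+\begin{aligned}
+  \pi(\theta|x^{(N)}) ={}& |g^X(\hat\theta_\pi)|^{1/2}\,\frac{\pi(\hat\theta_\pi)}{|g^X(\hat\theta_\pi)|^{1/2}}\,p(x^{(N)}|\hat\theta_\pi)\exp\Bigl\{-\frac{1}{2}\{-l_{ij}(\hat\theta_\pi|x^{(N)})\}\tilde\theta^i_\pi\tilde\theta^j_\pi\Bigr\} \\
+  & \times \Bigl[1 - \{\partial_i \log|g^X(\hat\theta_\pi)|^{1/2}\}\tilde\theta^i_\pi + \frac{1}{2}\Bigl\{\frac{\partial_{ij}|g^X(\hat\theta_\pi)|^{1/2}}{|g^X(\hat\theta_\pi)|^{1/2}}\Bigr\}\tilde\theta^i_\pi\tilde\theta^j_\pi + O_P(N^{-3/2})\Bigr] \\
+  & \times \Bigl[1 + \frac{1}{2}\{\sqrt{N}\,\partial_{ij} \log f(\hat\theta_\pi)\}\tilde\theta^i_\pi\tilde\theta^j_\pi - \frac{1}{6}\{l_{ijk}(\hat\theta_\pi|x^{(N)})\}\tilde\theta^i_\pi\tilde\theta^j_\pi\tilde\theta^k_\pi + \frac{1}{2}\{\partial_{ij} \log h(\hat\theta_\pi)\}\tilde\theta^i_\pi\tilde\theta^j_\pi \\
+  & \qquad - \frac{1}{6}\{\sqrt{N}\,\partial_{ijk} \log f(\hat\theta_\pi)\}\tilde\theta^i_\pi\tilde\theta^j_\pi\tilde\theta^k_\pi + \frac{1}{24} l_{ijkl}(\hat\theta_\pi|x^{(N)})\tilde\theta^i_\pi\tilde\theta^j_\pi\tilde\theta^k_\pi\tilde\theta^l_\pi \\
+  & \qquad + \frac{1}{2}\Bigl\{\frac{1}{2}\{\sqrt{N}\,\partial_{ij} \log f(\hat\theta_\pi)\}\tilde\theta^i_\pi\tilde\theta^j_\pi - \frac{1}{6} l_{ijk}(\hat\theta_\pi|x^{(N)})\tilde\theta^i_\pi\tilde\theta^j_\pi\tilde\theta^k_\pi\Bigr\} \\
+  & \qquad\quad \times \Bigl\{\frac{1}{2}\{\sqrt{N}\,\partial_{ij} \log f(\hat\theta_\pi)\}\tilde\theta^i_\pi\tilde\theta^j_\pi - \frac{1}{6} l_{ijk}(\hat\theta_\pi|x^{(N)})\tilde\theta^i_\pi\tilde\theta^j_\pi\tilde\theta^k_\pi\Bigr\} + O_P(N^{-3/2})\Bigr] \\
+  & \times \Bigl[\int p(x^{(N)}|\theta)\,\frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}}\,|g^X(\theta)|^{1/2}\,d\theta\Bigr]^{-1}.
+\end{aligned}
+\]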
+
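+We denote the N−1/2-order, N−1-order and N−3/2-order terms by $N^{-1/2}a_0(\tilde\theta_\pi; \hat\theta_\pi)$,
+$N^{-1}a_1(\tilde\theta_\pi; \hat\theta_\pi)$ and $N^{-3/2}a_2(\tilde\theta_\pi; \hat\theta_\pi)$, respectively. Then, this expansion is rewritten as:
+
+\[
+\begin{aligned}
+  \pi(\theta|x^{(N)}) ={}& |g^X(\hat\theta_\pi)|^{1/2}\,\frac{\pi(\hat\theta_\pi)}{|g^X(\hat\theta_\pi)|^{1/2}}\,p(x^{(N)}|\hat\theta_\pi)\exp\Bigl\{-\frac{1}{2}\{-l_{ij}(\hat\theta_\pi|x^{(N)})\}\tilde\theta^i_\pi\tilde\theta^j_\pi\Bigr\} \\
+  & \times \Bigl[1 + \frac{1}{\sqrt{N}} a_0(\tilde\theta_\pi; \hat\theta_\pi) + \frac{1}{N} a_1(\tilde\theta_\pi; \hat\theta_\pi) + \frac{1}{N\sqrt{N}} a_2(\tilde\theta_\pi; \hat\theta_\pi) + O_P(N^{-2})\Bigr] \\
+  & \times \Bigl[\int p(x^{(N)}|\theta)\,\frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}}\,|g^X(\theta)|^{1/2}\,d\theta\Bigr]^{-1}.
+\end{aligned}
+\]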
+
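+To make the expansion easier to see, the following notations are used. Let $\phi(\eta; -l_{ij}(\hat\theta_\pi|x^{(N)}))$ be
+the probability density function of the d-dimensional normal distribution with the precision matrix
+whose (i, j)-component is $-l_{ij}(\hat\theta_\pi|x^{(N)})$. Let η = (η1, · · · , ηd) be a d-dimensional random vector
+distributed according to the normal density, $\phi(\eta; -l_{ij}(\hat\theta_\pi|x^{(N)}))$. The notations, ¯a0( ˆθπ), ¯a1( ˆθπ), ¯a2( ˆθπ)
+and $\hat\omega^{ij}(\hat\theta_\pi)$, denote the expectations of a0(η; ˆθπ), a1(η; ˆθπ), a2(η; ˆθπ) and $\eta^i\eta^j$, respectively.
+Using the above notations, we get the following posterior expansion:
+
+\[
+\begin{aligned}
+  \pi(\theta|x^{(N)}) ={}& \phi(\tilde\theta_\pi; -l_{ij}(\hat\theta_\pi|x^{(N)})) \\
+  & \times \Bigl[1 + \frac{1}{\sqrt{N}}\{a_0(\tilde\theta_\pi; \hat\theta_\pi) - \bar a_0(\hat\theta_\pi)\} + \frac{1}{N}\{a_1(\tilde\theta_\pi; \hat\theta_\pi) - \bar a_1(\hat\theta_\pi)\} \\
+  & \qquad - \frac{1}{N}\bar a_0(\hat\theta_\pi)\{a_0(\tilde\theta_\pi; \hat\theta_\pi) - \bar a_0(\hat\theta_\pi)\} + \frac{1}{N\sqrt{N}}\{a_2(\tilde\theta_\pi; \hat\theta_\pi) - \bar a_2(\hat\theta_\pi)\} \\
+  & \qquad - \frac{1}{N\sqrt{N}}\bar a_0(\hat\theta_\pi)\{a_1(\tilde\theta_\pi; \hat\theta_\pi) - \bar a_1(\hat\theta_\pi)\} - \frac{1}{N\sqrt{N}}\bar a_1(\hat\theta_\pi)\{a_0(\tilde\theta_\pi; \hat\theta_\pi) - \bar a_0(\hat\theta_\pi)\} \\
+  & \qquad + \frac{1}{N\sqrt{N}}\bar a_0^2(\hat\theta_\pi)\{a_1(\tilde\theta_\pi; \hat\theta_\pi) - \bar a_1(\hat\theta_\pi)\} + O_P(N^{-2})\Bigr].
+\end{aligned}
+\tag{A12}
+\]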
+
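+Second, using (A12), the Bayesian predictive density, ˆqπ(y|x(N)), based on the prior, π(θ; N), is
+expanded as:
+
+\[
+\begin{aligned}
+  & \hat q_\pi(y|x^{(N)}) \\
+  & = \int q(y|\hat\theta_\pi)\Bigl[1 - \{\partial_i \log q(y|\hat\theta_\pi)\}\tilde\theta^i_\pi + \frac{1}{2}\frac{\partial_{ij} q(y|\hat\theta_\pi)}{q(y|\hat\theta_\pi)}\tilde\theta^i_\pi\tilde\theta^j_\pi + o_P(N^{-1})\Bigr]\pi(\theta|x^{(N)})\,d\theta \\
+  & = \int q(y|\hat\theta_\pi)\Bigl[1 + \{\partial_i \log|g^X(\hat\theta_\pi)|^{1/2}\}\{\partial_j \log q(y|\hat\theta_\pi)\}\tilde\theta^i_\pi\tilde\theta^j_\pi \\
+  & \qquad + \frac{1}{6}\{\partial_{ijk} \log p(x^{(N)}|\hat\theta_\pi) + \sqrt{N}\,\partial_{ijk} \log f(\hat\theta_\pi)\}\{\partial_l \log q(y|\hat\theta_\pi)\}\tilde\theta^i_\pi\tilde\theta^j_\pi\tilde\theta^k_\pi\tilde\theta^l_\pi \\
+  & \qquad + \frac{1}{2}\frac{\partial_{ij} q(y|\hat\theta_\pi)}{q(y|\hat\theta_\pi)}\tilde\theta^i_\pi\tilde\theta^j_\pi + o_P(N^{-1})\Bigr]\phi(\tilde\theta_\pi; -l_{ij}(\hat\theta_\pi|x^{(N)}))\,d\tilde\theta_\pi \\
+  & = q(y|\hat\theta_\pi) + \hat\omega^{ij}(\hat\theta_\pi)\{\partial_i \log|g^X(\hat\theta_\pi)|^{1/2}\}\partial_j q(y|\hat\theta_\pi) + \frac{1}{2}\hat\omega^{ik}(\hat\theta_\pi)\hat\omega^{jl}(\hat\theta_\pi) l_{ijk}(\hat\theta_\pi|x^{(N)})\partial_l q(y|\hat\theta_\pi) \\
+  & \quad + \frac{1}{2}\hat\omega^{ij}(\hat\theta_\pi)\partial_{ij} q(y|\hat\theta_\pi) + O_P(N^{-3/2}).
+\end{aligned}
+\tag{A13}
+\]
+
+Here, the following two equations hold:
+
+\[
+  -l_{ij}(\hat\theta_\pi|x^{(N)}) = N g^X_{ij}(\hat\theta_\pi) - \sqrt{N}\,\tilde H_{ij}(\hat\theta_\pi|x^{(N)}) + O_P(1),
+\tag{A14}
+\]
+
+\[
+  l_{ijk}(\hat\theta_\pi|x^{(N)}) = -2N\,\overset{e}{\Gamma}{}^X_{ij,k}(\hat\theta_\pi) - N\,\overset{m}{\Gamma}{}^X_{ik,j}(\hat\theta_\pi) + \sqrt{N}\,\tilde H_{ijk}(\hat\theta_\pi|x^{(N)}).
+\tag{A15}
+\]
+
+By combining Equation (A14) with the Sherman–Morrison–Woodbury formula, the following
+expansion is obtained:
+
+\[
+  \hat\omega^{ij}(\hat\theta_\pi) = \frac{1}{N} g^{X,ij}(\hat\theta_\pi) + \frac{1}{N\sqrt{N}} g^{X,ik}(\hat\theta_\pi) g^{X,jl}(\hat\theta_\pi)\tilde H_{kl}(\hat\theta_\pi|x^{(N)}) + O_P(N^{-2}).
+\tag{A16}
+\]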
+
+By substituting Equations (A14), (A15) and (A16) into Expansion (A13), Expansion (A11) is
+obtained.
+
+Note that the integration of Expansion (A11) is one up to OP(N−2). Further, Expansion (A11) is
+similar to the expansion in [2]. However, the estimator that is the center of the expansion is different,
+because of the dependence of the prior on the sample size.
+
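+Lemma A5. The Bayesian estimator, ˆθopt, minimizing the Bayes risk, $\int R(\theta, q(y|\hat\theta))\,d\pi(\theta; N)$,
+among plug-in predictive densities is given by:
+
+\[
+\begin{aligned}
+  \hat\theta^i_{\mathrm{opt}} ={}& \hat\theta^i_\pi + \frac{1}{2N} g^{X,ij}(\hat\theta_\pi) T^X_j(\hat\theta_\pi) \\
+  & + \frac{1}{2N} g^{X,jk}(\hat\theta_\pi)\Bigl\{\overset{m}{\Gamma}{}^{Y,i}_{jk}(\hat\theta_\pi) - \overset{m}{\Gamma}{}^{X,i}_{jk}(\hat\theta_\pi)\Bigr\} + O_P(N^{-3/2}).
+\end{aligned}
+\tag{A17}
+\]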
+
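+Proof. The Bayes risk, $\int R(\theta, q(y|\hat\theta))\,d\pi(\theta; N)$, is decomposed as:
+
+\[
+\begin{aligned}
+  \int R(\theta, q(y|\hat\theta))\,d\pi(\theta; N) ={}& \int \pi(\theta; N)\int p(x^{(N)}|\theta)\int q(y|\theta)\log\frac{q(y|\theta)}{\hat q_\pi(y|x^{(N)})}\,dy\,dx^{(N)}\,d\theta \\
+  & + \int \pi(\theta; N)\int p(x^{(N)}|\theta)\int q(y|\theta)\log\frac{\hat q_\pi(y|x^{(N)})}{q(y|\hat\theta)}\,dy\,dx^{(N)}\,d\theta.
+\end{aligned}
+\]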
+
+The first term of this decomposition is not dependent on ˆθ. From Fubini’s theorem and Lemma A4, the
+proof is completed.
+
+Using these lemmas, we prove Theorem 1. First, we find that the Kullback–Leibler risk of the
+plug-in predictive density with the estimator, ˆθopt, defined in Lemma A5, is given by:
+
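+\[
+\begin{aligned}
+  R(\theta, q(y|\hat\theta_{\mathrm{opt}})) ={}& R(\theta, q(y|\hat\theta_\pi)) + \frac{1}{2N\sqrt{N}}\mathring{g}^{ij}(\theta) T^X_i(\theta)\partial_j \log f(\theta) \\
+  & + \frac{1}{2N\sqrt{N}} g^{X,im}(\theta) g^Y_{ij}(\theta) g^{X,kl}(\theta)\Bigl\{\overset{m}{\Gamma}{}^{Y,j}_{kl}(\theta) - \overset{m}{\Gamma}{}^{X,j}_{kl}(\theta)\Bigr\}\partial_m \log f(\theta).
+\end{aligned}
+\tag{A18}
+\]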
+
+Using Expansion (A18) and Lemma A3, we expand the Kullback–Leibler risk, R(θ, ˆqπ(y|x(N))).
+Here, the risk, R(θ, ˆqπ(y|x(N))), is equal to the risk, R(θ, q(y| ˆθopt)), up to O(N−2), because we expand
+the Bayesian predictive density, ˆqπ(y|x(N)) as:
+
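+\[
+  \hat q_\pi(y|x^{(N)}) = q(y|\hat\theta_{\mathrm{opt}}) + \frac{1}{2N} g^{X,ij}(\hat\theta_\pi)\Bigl\{\partial_{ij} q(y|\hat\theta_\pi) - \overset{m}{\Gamma}{}^{Y,k}_{ij}(\hat\theta_\pi)\partial_k q(y|\hat\theta_\pi)\Bigr\} + O_P(N^{-3/2}).
+\tag{A19}
+\]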
+
+Thus, we obtain Expansion (1).
+
+References
+
+1. Aitchison, J. Goodness of prediction fit. Biometrika 1975, 62, 547–554.
+2. Komaki, F. On asymptotic properties of predictive distributions. Biometrika 1996, 83, 299–313.
+3. Hartigan, J. The maximum likelihood prior. Ann. Stat. 1998, 26, 2083–2103.
+4. Bernardo, J. Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. B 1979, 41, 113–147.
+5. Clarke, B.; Barron, A. Jeffreys prior is asymptotically least favorable under entropy risk. J. Stat. Plan.
+Inference 1994, 41, 37–60.
+6. Aslan, M. Asymptotically minimax Bayes predictive densities. Ann. Stat. 2006, 34, 2921–2938.
+7. Komaki, F. Bayesian predictive densities based on latent information priors. J. Stat. Plan. Inference 2011, 141,
+3705–3715.
+8. Komaki, F. Asymptotically minimax Bayesian predictive densities for multinomial models. Electron. J. Stat.
+2012, 6, 934–957.
+9. Kanamori, T.; Shimodaira, H. Active learning algorithm using the maximum weighted log-likelihood
+estimator. J. Stat. Plan. Inference 2003, 116, 149–162.
+10. Shimodaira, H. Improving predictive inference under covariate shift by weighting the
+log-likelihood function. J. Stat. Plan. Inference 2000, 90, 227–244.
+11. Fushiki, T.; Komaki, F.; Aihara, K. On parametric bootstrapping and Bayesian prediction. Scand. J. Stat. 2004,
+31, 403–416.
+12. Suzuki, T.; Komaki, F. On prior selection and covariate shift of β-Bayesian prediction under α-divergence
+risk. Commun. Stat. Theory 2010, 39, 1655–1673.
+13. Komaki, F. Asymptotic properties of Bayesian predictive densities when the distributions of data and target
+variables are different. Bayesian Anal. 2014, submitted for publication.
+14. Hodges, J.L.; Lehmann, E.L. Some problems in minimax point estimation. Ann. Math. Stat. 1950, 21, 182–197.
+15. Ghosh, M.N. Uniform approximation of minimax point estimates. Ann. Math. Stat. 1964, 35, 1031–1047.
+16. Amari, S. Differential-Geometrical Methods in Statistics; Springer: New York, NY, USA, 1985.
+17. Robbins, H. Asymptotically Subminimax solutions of Compound Statistical Decision Problems.
+In Proceedings of the Second Berkeley Symposium Mathematical Statistics and Probability, Berkeley, CA,
+USA, 31 July–12 August 1950; University of California Press: Oakland, CA, USA, 1950; pp. 131–148.
+18. Frank, P.; Kiefer, J. Almost subminimax and biased minimax procedures. Ann. Math. Stat. 1951, 22, 465–468.
+19. Efron, B. Defining curvature of a statistical problem (with applications to second order efficiency). Ann. Stat.
+1975, 3, 189–1372.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+Information-Geometric Markov Chain Monte Carlo
+Methods Using Diffusions
+
+Samuel Livingstone 1,* and Mark Girolami 2
+
+1 Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, UK
+2 Department of Statistics, University of Warwick, Coventry CV4 7AL, UK; E-Mail: m.girolami@warwick.ac.uk
+* E-Mail: samuel.livingstone@ucl.ac.uk; Tel.: +44-20-7679-1872.
+
+Received: 29 March 2014; in revised form: 23 May 2014 / Accepted: 28 May 2014 /
+Published: 3 June 2014
+
+Abstract: Recent work incorporating geometric ideas in Markov chain Monte Carlo is reviewed
+in order to highlight these advances and their possible application in a range of domains beyond
+statistics. A full exposition of Markov chains and their use in Monte Carlo simulation for statistical
+inference and molecular dynamics is provided, with particular emphasis on methods based on
+Langevin diffusions. After this, geometric concepts in Markov chain Monte Carlo are introduced.
+A full derivation of the Langevin diffusion on a Riemannian manifold is given, together with
+a discussion of the appropriate Riemannian metric choice for different problems. A survey of
+applications is provided, and some open questions are discussed.
+
+Keywords: information geometry; Markov chain Monte Carlo; Bayesian inference; computational
+statistics; machine learning; statistical mechanics; diffusions
+
+1. Introduction
+
+There are three objectives to this article. The first is to introduce geometric concepts that have
+recently been employed in Monte Carlo methods based on Markov chains [1] to a wider audience.
+The second is to clarify what a “diffusion on a manifold” is, and how this relates to a diffusion
+defined on Euclidean space. Finally, we review the state-of-the-art in the field and suggest avenues for
+further research.
+The connections between some Monte Carlo methods commonly used in statistics, physics
+and application domains, such as econometrics, and ideas from both Riemannian and information
+geometry [2,3] were highlighted by Girolami and Calderhead [1] and the potential benefits
+demonstrated empirically. Two Markov chain Monte Carlo methods were introduced, the manifold
+Metropolis-adjusted Langevin algorithm and Riemannian manifold Hamiltonian Monte Carlo. Here,
+we focus on the former for two reasons. First, the intuition for why geometric ideas can improve
+standard algorithms is the same in both cases. Second, the foundations of the methods are quite
+different, and since the focus of the article is on using geometric ideas to improve performance,
+we considered a detailed description of both to be unnecessary. It should be noted, however, that
+impressive empirical evidence exists for using Hamiltonian methods in some scenarios (e.g., [4]). We
+refer interested readers to [5,6].
+We take an expository approach, providing a review of some necessary preliminaries from
+Markov chain Monte Carlo, diffusion processes and Riemannian geometry. We assume only a minimal
+familiarity with measure-theoretic probability. More informed readers may prefer to skip these
+sections. We then provide a full derivation of the Langevin diffusion on a Riemannian manifold and
+offer some intuition for how to think about such a process. We conclude Section 4 by presenting the
+Metropolis-adjusted Langevin algorithm on a Riemannian manifold.
+A key challenge in the geometric approach is which manifold to choose. We discuss this in
+Section 4.4 and review some candidates that have been suggested in the literature, along with the
+
+Entropy 2014, 16, 3074–3102; doi:10.3390/e16063074
+www.mdpi.com/journal/entropy
+reasoning for each. Rather than provide a simulation study here, we instead reference studies where
+the methods we describe have been applied in Section 5. In Section 6, we discuss several open
+questions, which we feel could be interesting areas of further research and of interest to both theorists
+and practitioners.
+Throughout, π(·) will refer to an n-dimensional probability distribution and π(x) its density with
+respect to the Lebesgue measure.
+
+2. Markov Chain Monte Carlo
+
+Markov chain Monte Carlo (MCMC) is a set of methods for drawing samples from a distribution,
+π(·), defined on a measurable space (X , B), whose density is only known up to some proportionality
+constant. Although the i-th sample is dependent on the (i − 1)-th, the Ergodic Theorem ensures that
+for an appropriately constructed Markov chain with invariant distribution π(·), long-run averages are
+consistent estimators for expectations under π(·). As a result, MCMC methods have proven useful
+in Bayesian statistical inference, where often, the posterior density π(x|y) ∝ f (y|x)π0(x) for some
+parameter, x (where f (y|x) denotes the likelihood for data y and π0(x) the prior density), is only
+known up to a constant [7]. Here, we briefly introduce some concepts from general state space Markov
+chain theory together with a short overview of MCMC methods. The exposition follows [8].
+
+2.1. Markov Chain Preliminaries
+
+A time-homogeneous Markov chain, {Xm}m∈N, is a collection of random variables, Xm, each of
+which is defined on a measurable space (X , B), such that:
+P[Xm ∈ A|X0 = x0, ..., Xm−1 = xm−1] = P[Xm ∈ A|Xm−1 = xm−1],
+(1)
+
+for any A ∈ B. We define the transition kernel P(xm−1, A) = P[Xm ∈ A|Xm−1 = xm−1] for the chain to
+be a map for which P(x, ·) defines a distribution over (X , B) for any x ∈ X , and P(·, A) is measurable
+for any A ∈ B. Intuitively, P defines a map from points to distributions in X . Similarly, we define the
+m-step transition kernel to be:
+
+Pm(x0, A) = P[Xm ∈ A|X0 = x0].
+(2)
+
+We call a distribution π(·) invariant for {Xm}m∈N if:
+
+\[
+  \pi(A) = \int_{\mathcal{X}} P(x, A)\,\pi(dx)
+\tag{3}
+\]
+
+for all A ∈ B. If P(x, ·) admits a density, p(x′|x), this can be equivalently written:
+
+π(x′) =
+�
+
+X π(x)p(x′|x)dx.
+(4)
+
+The connotation of Equations (3) and (4) is that if Xm ∼ π(·), then Xm+s ∼ π(·) for any s ∈ N. In this
+instance, we say the chain is “at stationarity”. Of interest to us will be Markov chains for which there
+is a unique invariant distribution, which is also the limiting distribution for the chain, meaning that for
+any x0 ∈ X for which π(x0) > 0:
+\[
+  \lim_{m\to\infty} P^m(x_0, A) = \pi(A)
+\tag{5}
+\]
+
+for any A ∈ B. Certain conditions are required for Equation (5) to hold, but for all Markov chains
+presented here, these are satisfied (though, see [8]).
+A useful condition, which is sufficient (though not necessary) for π(·) to be an invariant
+distribution, is reversibility, which can be shown by the relation:
+
+π(x)p(x′|x) = π(x′)p(x|x′).
+(6)
+
+Integrating over both sides with respect to x, we recover Equation (4). In other words, a chain is
+reversible if, at stationarity, the probability that xi ∈ A and xi+1 ∈ B is equal to the probability that
+xi+1 ∈ A and xi ∈ B. The relation (6) will be the primary tool used to construct Markov chains with a
+desired invariant distribution in the next section.
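+
+On a finite state space, the step from reversibility to invariance can be checked directly. The sketch below
+is an illustration only: the four-state target pi and the helper metropolis_kernel are hypothetical, and the
+kernel is a simple Metropolis chain with a symmetric nearest-neighbour proposal, used here as one convenient
+way of building a transition matrix that satisfies relation (6). The code measures the largest violation of
+the discrete analogue of (6) and of the discrete analogue of (4).

```python
def metropolis_kernel(pi):
    """Transition matrix of a Metropolis chain on {0, ..., k-1}.

    The proposal moves one step left or right with probability 1/2 each;
    proposals that leave the state space are rejected (the chain stays put).
    """
    k = len(pi)
    P = [[0.0] * k for _ in range(k)]
    for x in range(k):
        for y in (x - 1, x + 1):
            if 0 <= y < k:
                P[x][y] = 0.5 * min(1.0, pi[y] / pi[x])
        P[x][x] = 1.0 - sum(P[x])  # remaining mass: rejected or invalid proposals
    return P


pi = [0.1, 0.2, 0.4, 0.3]  # hypothetical target distribution
P = metropolis_kernel(pi)
k = len(pi)

# Discrete analogue of (6): pi(x) P(x, y) = pi(y) P(y, x) for all pairs.
balance_gap = max(
    abs(pi[x] * P[x][y] - pi[y] * P[y][x]) for x in range(k) for y in range(k)
)

# Discrete analogue of (4): pi(y) = sum_x pi(x) P(x, y).
pi_next = [sum(pi[x] * P[x][y] for x in range(k)) for y in range(k)]
invariance_gap = max(abs(pi_next[y] - pi[y]) for y in range(k))

print(balance_gap, invariance_gap)
```

+Both gaps vanish up to floating-point rounding, which mirrors the argument that summing (6) over x yields (4).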
+
+2.1.1. Monte Carlo Estimates from Markov Chains
+
+Of most interest here are estimators constructed from a Markov chain. The Ergodic Theorem
+states that for any chain, {Xm}m∈N, satisfying Equation (5) and any g ∈ L1(π), we have that:
+
+\[ \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} g(X_i) = \mathbb{E}_\pi[g(X)] \tag{7} \]
+
+with probability one [7]. This is a Markov chain analogue of the law of large numbers.
+The efficiency of estimators of the form ˆtm = ∑i g(Xi)/m can be assessed through the
+autocorrelation between elements in the chain. We will assess the efficiency of ˆtm relative to estimators
+¯tm = ∑i g(Zi)/m, where {Zi}m∈N is a sequence of independent random variables, each having
+distribution π(·). Provided Varπ[g(Zi)] < ∞, then Var[¯tm] = Varπ[g(Zi)]/m. We now seek a similar
+result for estimators of the form, ˆtm.
+It follows directly from the Kipnis–Varadhan Theorem [9] that an estimator, ˆtm, from a reversible
+Markov chain for which X0 ∼ π(·) satisfies:
+
+\[ \lim_{m \to \infty} \frac{\mathrm{Var}[\hat{t}_m]}{\mathrm{Var}[\bar{t}_m]} = 1 + 2 \sum_{i=1}^{\infty} \rho^{(0,i)} = \tau, \tag{8} \]
+
+provided that ∑_{i=1}^{∞} i|ρ(0,i)| < ∞, where ρ(0,i) = Corrπ[g(X0), g(Xi)]. We will refer to the constant, τ, as
+the autocorrelation time for the chain.
+Equation (8) implies that for large enough m, Var[ˆtm] ≈ τVar[¯tm]. In practical applications, the
+sum in Equation (8) is truncated to the first p − 1 realisations of the chain, where p is the first instance
+at which |ρ(0,p)| < ϵ for some ϵ > 0. For example, in the Convergence Diagnosis and Output Analysis
+for MCMC (CODA) package within the R statistical software ϵ = 0.05 [10,11].
+Another commonly used measure of efficiency is the effective sample size m_eff = m/τ, which
+gives the number of independent samples from π(·) needed to give an equally efficient estimate for
+Eπ[g(X)]. Clearly, minimising τ is equivalent to maximising m_eff.
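+As a rough illustration of Equation (8) and the truncation rule above, the following sketch estimates τ and the effective sample size from a scalar chain. The AR(1) test chain and all function names are assumptions introduced here for illustration; they are not taken from CODA.
+
```python
import math
import random


def autocorrelation(values, lag):
    """Sample autocorrelation of a scalar sequence at a given lag."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    if var == 0.0:
        return 0.0
    cov = sum((values[i] - mean) * (values[i + lag] - mean)
              for i in range(m - lag)) / m
    return cov / var


def autocorrelation_time(values, eps=0.05):
    """Estimate tau = 1 + 2 * sum_i rho(0,i), stopping at the first lag p
    for which |rho(0,p)| < eps (the truncation rule described in the text)."""
    tau = 1.0
    for lag in range(1, len(values)):
        rho = autocorrelation(values, lag)
        if abs(rho) < eps:
            break
        tau += 2.0 * rho
    return tau


def effective_sample_size(values, eps=0.05):
    """m_eff = m / tau."""
    return len(values) / autocorrelation_time(values, eps)


def ar1_chain(m, phi, seed=0):
    """Hypothetical test chain: a stationary AR(1) process with N(0, 1)
    marginals, whose exact autocorrelation time is (1 + phi) / (1 - phi)."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)
    chain = []
    for _ in range(m):
        x = phi * x + math.sqrt(1.0 - phi ** 2) * rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


chain = ar1_chain(20000, phi=0.5)
print(f"tau estimate: {autocorrelation_time(chain):.2f} (exact value for this chain: 3)")
print(f"effective sample size: {effective_sample_size(chain):.0f} of {len(chain)}")
```
+
+Because the sum is cut off at the first small autocorrelation, the estimate omits the remaining lags, so it should be read as an approximation to τ rather than an exact value.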
+The measures arising from Equation (8) give some intuition for what sort of Markov chain gives
+rise to efficient estimators. However, in practice, the chain will never be at stationarity. Therefore, we
+also assess Markov chains according to how far away they are from this point. For this, we need to
+measure how close Pm(x0, ·) is from π(·), which requires a notion of distance between probability
+distributions.
+Although there are several appropriate choices [12], a common option in the Markov chain
+literature is the total variation distance:
+
+\[ \|\mu(\cdot) - \nu(\cdot)\|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|, \tag{9} \]
+
+which informally gives the largest possible difference between the probabilities of a single event in
+B according to μ(·) and ν(·). If both distributions admit densities, Equation (9) can be written (see
+Appendix A):
+
+\[ \|\mu(\cdot) - \nu(\cdot)\|_{TV} = \frac{1}{2} \int_{\mathcal{X}} |\mu(x) - \nu(x)|\,dx, \tag{10} \]
+
+which is proportional to the L1 distance between μ(x) and ν(x). Our metric satisfies ∥ · ∥TV ∈ [0, 1], with
+∥ · ∥TV = 1 for distributions with disjoint supports, and ∥μ(·) − ν(·)∥TV = 0 implies μ(·) ≡ ν(·).
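+
+A minimal numerical sketch of Equation (10) follows, assuming one-dimensional Gaussian densities and a simple trapezoidal rule. The function names, the integration limits and the three test distributions are illustrative choices, not part of the original text.
+
```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def tv_distance(mu, nu, lower, upper, steps=20000):
    """Approximate (1/2) * integral |mu(x) - nu(x)| dx on [lower, upper]
    with the trapezoidal rule; mu and nu are density functions."""
    h = (upper - lower) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lower + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * abs(mu(x) - nu(x))
    return 0.5 * h * total


p = lambda x: normal_pdf(x, 0.0, 1.0)
q = lambda x: normal_pdf(x, 1.0, 1.0)
far = lambda x: normal_pdf(x, 20.0, 1.0)

print(tv_distance(p, p, -30.0, 50.0))    # identical distributions
print(tv_distance(p, q, -30.0, 50.0))    # overlapping distributions
print(tv_distance(p, far, -30.0, 50.0))  # almost no overlap
```
+
+The three calls illustrate the two extremes of the metric and an intermediate case.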
+
+Typically, for an unbounded X , the distance ∥Pm(x0, ·) − π(·)∥TV will depend on x0 for any finite
+m. Therefore, bounds on the distance are often sought via some inequality of the form:
+
+∥Pm(x0, ·) − π(·)∥TV ≤ MV(x0) f (m),
+(11)
+
+for some M < ∞, where V : X → [1, ∞) depends on x0 and is called a drift function, and f : N → [0, ∞)
+depends on the number of iterations, m (and is often defined, such that f (0) = 1).
+A Markov chain is called geometrically ergodic if f (m) = rm in Equation (11) for some 0 < r < 1.
+If in addition to this, V is bounded above, the chain is called uniformly ergodic. Intuitively, if either
+condition holds, then the distribution of Xm will converge to π(·) geometrically quickly as m grows,
+and in the uniform case, this rate is independent of x0. As well as providing some (often qualitative
+if M and r are unknown) bounds on the convergence rate of a Markov chain, geometric ergodicity
+implies that a central limit theorem exists for estimators of the form, ˆtm. For more detail on this,
+see [13,14].
+In practice several approximate methods also exist to assess whether a chain is close enough to
+stationarity for long-run averages to provide suitable estimators (e.g., [15]). The MCMC practitioner
+also uses a variety of visual aids to judge whether an estimate from the chain will be appropriate for
+his or her needs.
+
+2.2. Markov Chain Monte Carlo
+
+Now that we have introduced Markov chains, we turn to simulating them. The objective here
+is to devise a method for generating a Markov chain, which has a desired limiting distribution, π(·).
+In addition, we would strive for the convergence rate to be as fast as possible and the effective sample
+size to be suitably large relative to the number of iterations. Of course, the computational cost of
+performing an iteration is also an important practical consideration. Ideally, any method would
+also require limited problem-specific alterations, so that practitioners are able to use it with as little
+knowledge of the inner workings as is practical.
+Although other methods exist for constructing chains with a desired limiting distribution, a
+popular choice is the Metropolis–Hastings algorithm [7]. At iteration i, a sample is drawn from some
+candidate transition kernel, Q(xi−1, ·), and then either accepted or rejected (in which case, the state of
+the chain remains xi−1). We focus here on the case where Q(xi−1, ·) admits a density, q(x′|xi−1), for all
+xi−1 ∈ X (though, see [8]). In this case, a single step is shown below (the wedge notation a ∧ b denotes
+the minimum of a and b). The “acceptance rate”, α(xi−1, x′), governs the behaviour of the chain, so
+that, when it is close to one, then many proposed moves are accepted, and the current value in the
+chain is constantly changing. If it is on average close to zero, then many proposals are rejected, so
+that the chain will remain in the same place for many iterations. However, α ≈ 1 is typically not ideal,
+often resulting in a large autocorrelation time (see below). The challenge in practice is to find the right
+acceptance rate to balance these two extremes.
+
+Algorithm 1 Metropolis–Hastings, single iteration.
+
+Require: xi−1
+Draw X′ ∼ Q(xi−1, ·)
+Draw Z ∼ U[0, 1]
+Set α(xi−1, x′) ← 1 ∧ [π(x′)q(xi−1|x′)] / [π(xi−1)q(x′|xi−1)]
+if z < α(xi−1, x′) then
+Set xi ← x′
+else
+Set xi ← xi−1
+end if
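+
+Algorithm 1 can be sketched in a few lines. The version below works on the log scale for numerical stability, which is an implementation choice rather than part of the algorithm; the function names, the N(0, 1) target and the Gaussian proposal with step 2.0 are hypothetical choices made only for this example.
+
```python
import math
import random


def mh_step(x, log_pi, propose, log_q, rng):
    """One Metropolis-Hastings iteration.

    log_pi(x)       : log of the (possibly unnormalised) target density
    propose(x, rng) : draws x' from Q(x, .)
    log_q(a, b)     : log q(a | b), the proposal density of a given b
    Returns the next state and whether the proposal was accepted.
    """
    x_new = propose(x, rng)
    log_ratio = (log_pi(x_new) + log_q(x, x_new)) - (log_pi(x) + log_q(x_new, x))
    alpha = math.exp(min(0.0, log_ratio))  # alpha = 1 ^ ratio
    if rng.random() < alpha:
        return x_new, True
    return x, False


def log_pi(x):
    """Illustrative target: N(0, 1), up to an additive constant."""
    return -0.5 * x * x


def propose(x, rng, step=2.0):
    """Illustrative proposal: N(x, step^2)."""
    return x + step * rng.gauss(0.0, 1.0)


def log_q(a, b, step=2.0):
    """log q(a | b) for the proposal above; the constant cancels in the ratio."""
    return -0.5 * ((a - b) / step) ** 2


rng = random.Random(1)
x, chain, n_accepted = 0.0, [], 0
for _ in range(20000):
    x, accepted = mh_step(x, log_pi, propose, log_q, rng)
    n_accepted += accepted
    chain.append(x)
print(f"acceptance rate: {n_accepted / len(chain):.2f}")
print(f"sample mean: {sum(chain) / len(chain):.3f}")
```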
+
+Combining the “proposal” and “acceptance” steps, the transition kernel for the resulting Markov
+chain is:
+
+\[ P(x, A) = r(x)\,\delta_x(A) + \int_A \alpha(x, x')\,q(x' \mid x)\,dx', \tag{12} \]
+
+for any A ∈ B, where:
+
+\[ r(x) = 1 - \int_{\mathcal{X}} \alpha(x, x')\,q(x' \mid x)\,dx' \]
+
+is the average probability that a draw from Q(x, ·) will be rejected, and δx(A) = 1 if x ∈ A and zero,
+otherwise. A Markov chain defined in this way will have π(·) as an invariant distribution, since the
+chain is reversible for π(·). We note here that:
+
+π(xi−1)q(xi|xi−1)α(xi−1, xi) = π(xi−1)q(xi|xi−1) ∧ π(xi)q(xi−1|xi)
+
+= α(xi, xi−1)q(xi−1|xi)π(xi)
+
+in the case that the proposed move is accepted and that if the proposed move is rejected, then xi = xi−1;
+so the chain is reversible for π(·). It can be shown that π(·) is also the limiting distribution for the
+chain [7].
+The convergence rate and autocorrelation time of a chain produced by the algorithm are dependent
+on both the choice of proposal, Q(xi−1, ·), and the target distribution, π(·). For simple forms of the
+latter, less consideration is required when choosing the former. A broad objective among researchers
+in the field is to find classes of proposal kernels that produce chains that converge and mix quickly for
+a large class of target distributions. We first review a simple choice before discussing one that is more
+sophisticated, and that will be the focus of the rest of the article.
+
+2.3. Random Walk Proposals
+
+An extremely simple choice for Q(x, ·) is one for which:
+
+q(x′|x) = q(∥x′ − x∥)
+(13)
+
+where ∥ · ∥ denotes some appropriate norm on X , meaning the proposal is symmetric. In this case, the
+acceptance rate reduces to:
+
+\[ \alpha(x, x') = 1 \wedge \frac{\pi(x')}{\pi(x)}. \tag{14} \]
+
+In addition to simplifying calculations, Equation (14) strengthens the intuition for the method, since
+proposed moves with higher density under π(·) will always be accepted. A typical choice for Q(x, ·)
+is N(x, λ²Σ), where the matrix, Σ, is often chosen in an attempt to match the correlation structure
+of π(·) or simply taken as the identity [16]. The tuning parameter, λ, is the only other user-specific
+input required.
+Much research has been conducted into properties of the random walk Metropolis algorithm
+(RWM). It has been shown that the optimal acceptance rate for proposals tends to 0.234 as the
+dimension, n, of the state space, X , tends to ∞ for a wide class of targets (e.g., [17,18]). The intuition
+for an optimal acceptance rate is to find the right balance between the distance of proposed moves
+and the chances of acceptance. Increasing the former will reduce the autocorrelation in the chain if the
+proposal is accepted, but if it is rejected, the chain will not move at all, so autocorrelation will be high.
+Random walk proposals are sometimes referred to as blind (e.g., [19]), as no information about π(·)
+is used when generating proposals, so typically, very large moves will result in a very low chance of
+acceptance, while small moves will be accepted, but result in very high autocorrelation for the chain.
+Figure 1 demonstrates this in the simple case where π(·) is a one-dimensional N(0, 1²) distribution.
+
+Figure 1. These traceplots show the evolution of three RWM Markov chains for which π(·) is a N(0, 1²)
+distribution, with different choices for λ.
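+
+The trade-off between step size and acceptance can be reproduced with a short simulation. This is a sketch under stated assumptions: the three values of λ below are chosen for illustration and are not necessarily those used in the figure, and the function name is hypothetical.
+
```python
import math
import random


def rwm_chain(lam, m=20000, x0=0.0, seed=0):
    """Random walk Metropolis for a N(0, 1) target with N(x, lam^2) proposals.
    Returns the chain and the fraction of accepted proposals."""
    rng = random.Random(seed)
    x, accepted, chain = x0, 0, []
    for _ in range(m):
        x_new = x + lam * rng.gauss(0.0, 1.0)
        # Equation (14): alpha = 1 ^ pi(x') / pi(x), computed on the log scale.
        log_ratio = 0.5 * (x * x - x_new * x_new)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x, accepted = x_new, accepted + 1
        chain.append(x)
    return chain, accepted / m


for lam in (0.1, 2.4, 50.0):
    _, rate = rwm_chain(lam)
    print(f"lambda = {lam:5.1f}: acceptance rate = {rate:.2f}")
```
+
+Small steps are accepted almost always but move the chain slowly, while very large steps are rarely accepted, so the chain stays in place for long stretches.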
+
+Several authors have also shown that for certain classes of π(·), the tuning parameter, λ, should
+be chosen, such that λ² ∝ n^{−1}, so that α ↛ 0 as n → ∞ [20]. Because of this, we say that algorithm
+efficiency “scales” O(n^{−1}) as the dimension n of π(·) increases.
+Ergodicity results for a Markov chain constructed using the RWM algorithm also exist [21–23].
+At least exponentially light tails are a necessity for π(x) for geometric ergodicity, which means that
+π(x)/e^{−∥x∥} → c as ∥x∥ → ∞, for some constant, c. For super-exponential tails (where π(x) → 0 at
+a rate faster than exponential), additional conditions are required [21,23]. We demonstrate with
+a simple example why heavy-tailed forms of π(x) pose difficulties here (where π(x) → 0 at a rate
+slower than e^{−∥x∥}).
+
+Example: Take π(x) ∝ 1/(1 + x²), so that π(·) is a Cauchy distribution. Then, if X′ ∼ N(x, λ²), the ratio
+π(x′)/π(x) = (1 + x²)/(1 + (x′)²) → 1 as |x| → ∞. Therefore, if x0 is far away from zero, the Markov
+chain will dissolve into a random walk, with almost every proposal being accepted.
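+
+A quick numerical check of this limit is sketched below, evaluating the ratio for a proposal one standard deviation further from the origin. The value λ = 1 and the function name are assumptions made for illustration.
+
```python
def cauchy_ratio(x, x_new):
    """pi(x') / pi(x) for the target pi(x) proportional to 1 / (1 + x^2)."""
    return (1.0 + x * x) / (1.0 + x_new * x_new)


lam = 1.0  # hypothetical proposal standard deviation
for x in (0.0, 10.0, 1e3, 1e6):
    # Ratio for a proposal one standard deviation further out, x' = x + lam.
    print(f"x = {x:>9.0f}: ratio = {cauchy_ratio(x, x + lam):.6f}")
```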
+
+It should be noted that starting the chain from at or near zero can also cause problems in the
+above example, as the tails of the distribution may not be explored. See [7] for more detail here.
+Ergodicity results for the RWM also exist for specific classes of the statistical model. Conditions for
+geometric ergodicity in the case of generalised linear mixed models are given in [24], while spherically
+constrained target densities are discussed in [25]. In [26], the authors provide necessary conditions
+for the geometric convergence of RWM algorithms, which are related to the existence of exponential
+moments for π(·) and P(x, ·). Weaker forms of ergodicity and corresponding conditions are also
+discussed in the paper.
+In the remainder of the article, we will primarily discuss another approach to choosing Q, which
+has been shown empirically [1] and, in some cases, theoretically [20] to be superior to the RWM
+algorithm, though it should be noted that random walk proposals are still widely used in practice and
+are often sufficient for more straightforward problems [16].
+
+3. Diffusions
+
+In MCMC, we are concerned with discrete time processes. However, often, there are benefits
+to first considering a continuous time process with the properties we desire. For example, some
+continuous time processes can be specified via a form of differential equation. In this section, we
+derive a choice for a Metropolis–Hastings proposal kernel based on approximations to diffusions,
+those continuous-time n-dimensional Markov processes (Xt)t≥0 for which any sample path t ↦ Xt(ω)
+is a continuous function with probability one. For any fixed t, we assume Xt is a random variable
+taking values on the measurable space (X , B) as before. The motivation for this section is to define a
+class of diffusions for which π(·) is the invariant distribution. First, we provide some preliminaries,
+followed by an introduction to our main object of study, the Langevin diffusion.
+
+3.1. Preliminaries
+
+We focus on the class of time-homogeneous Itô diffusions, whose dynamics are governed by a
+stochastic differential equation of the form:
+
+dXt = b(Xt)dt + σ(Xt)dBt, X0 = x0,
+(15)
+
+where (Bt)t≥0 is a standard Brownian motion and the drift vector, b, and volatility matrix, σ, are
+Lipschitz continuous [27]. Since E[Bt+△t − Bt|Bt = bt] = 0 for any △t ≥ 0, informally, we can see that:
+
+E[Xt+△t − Xt|Xt = xt] = b(xt)△t + o(△t),
+(16)
+
+implying that the drift dictates how the mean of the process changes over a small time interval, and if
+we define the process (Mt)t≥0 through the relation:
+
+\[ M_t = X_t - \int_0^t b(X_s)\,ds \tag{17} \]
+
+then we have:
+
+E[(Mt+△t − Mt)(Mt+△t − Mt)T|Mt = mt, Xt = xt] = σ(xt)σ(xt)T△t + o(△t),
+(18)
+
+giving the stochastic part of the relationship between Xt+△t and Xt for small enough △t; see, e.g., [28].
+While Equation (15) is often a suitable description of an Itô diffusion, it can also be characterised
+through an infinitesimal generator, A, which describes how functions of the process are expected to
+evolve. We define this partial differential operator through its action on a function, f ∈ C0(X ), as:
+
+\[ \mathcal{A} f(X_t) = \lim_{\Delta t \to 0} \frac{\mathbb{E}[f(X_{t+\Delta t}) \mid X_t = x_t] - f(x_t)}{\Delta t}, \tag{19} \]
+
+though A can be associated with the drift and volatility of (Xt)t≥0 by the relation:
+
+\[ \mathcal{A} f(x) = \sum_i b_i(x) \frac{\partial f}{\partial x_i}(x) + \frac{1}{2} \sum_{i,j} V_{ij}(x) \frac{\partial^2 f}{\partial x_i \partial x_j}(x), \tag{20} \]
+
+where Vij(x) denotes the component in row i and column j of σ(x)σ(x)T [27].
+As in the discrete case, we can describe the transition kernel of a continuous time Markov process,
+Pt(x0, ·). In the case of an Itô diffusion, Pt(x0, ·) admits a density, pt(x|x0), which, in fact, varies
+smoothly as a function of t. The Fokker–Planck equation describes this variation in terms of the drift
+and volatility and is given by:
+
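+\[ \frac{\partial}{\partial t} p_t(x \mid x_0) = -\sum_i \frac{\partial}{\partial x_i}\left[ b_i(x)\,p_t(x \mid x_0) \right] + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\left[ V_{ij}(x)\,p_t(x \mid x_0) \right]. \tag{21} \]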
+
+Although, typically, the form of Pt(x0, ·) is unknown, the expectation and variance of Xt ∼ Pt(x0, ·)
+are given by the integral equations:
+
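+\[ \mathbb{E}[X_t \mid X_0 = x_0] = x_0 + \mathbb{E}\left[ \int_0^t b(X_s)\,ds \right], \]
+
+\[ \mathbb{E}\left[ (X_t - \mathbb{E}[X_t])(X_t - \mathbb{E}[X_t])^T \mid X_0 = x_0 \right] = \mathbb{E}\left[ \int_0^t \sigma(X_s)\sigma(X_s)^T\,ds \right], \]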
+
+where the second of these is a result of the Itô isometry [27]. Continuing the analogy, a natural question
+is whether a diffusion process has an invariant distribution, π(·), and whether:
+
+\[ \lim_{t \to \infty} P^t(x_0, A) = \pi(A) \tag{22} \]
+
+for any A ∈ B and any x0 ∈ X , in some sense. For a large class of diffusions (which we confine
+ourselves to), this is, in fact, the case. Specifically, in the case of positive Harris recurrent diffusions
+with invariant distribution π(·), all compact sets must be small for some skeleton chain, see [29] for
+details. In addition, Equation (21) provides a means of finding π(·), given b and σ. Setting the left-hand
+side of Equation (21) to zero gives:
+
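+\[ \sum_i \frac{\partial}{\partial x_i}\left[ b_i(x)\,\pi(x) \right] = \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\left[ V_{ij}(x)\,\pi(x) \right], \tag{23} \]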
+
+which can be solved to find π(·).
+
+3.2. Langevin Diffusions
+
+Given Equation (23), our goal becomes clearer: find drift and volatility terms, so that the resulting
+dynamics describe a diffusion, which converges to some user-defined invariant distribution, π(·). This
+process can then be used as a basis for choosing Q in a Metropolis–Hastings algorithm. The Langevin
+diffusion, first used to describe the dynamics of molecular systems [30], is such a process, given by the
+solution to the stochastic differential equation:
+
+\[ dX_t = \frac{1}{2} \nabla \log \pi(X_t)\,dt + dB_t, \quad X_0 = x_0. \tag{24} \]
+
+Since Vij(x) = 1{i=j}, it is clear that
+
+\[ \frac{1}{2} \frac{\partial}{\partial x_i}\left[ \log \pi(x) \right] \pi(x) = \frac{1}{2} \frac{\partial}{\partial x_i} \pi(x), \quad \forall i, \tag{25} \]
+
+which is a sufficient condition for Equation (23) to hold. Therefore, for any case in which π(x) is
+suitably regular, so that ∇ log π(x) is well-defined and the derivatives in Equation (23) exist, we can
+use (24) to construct a diffusion, which has invariant distribution, π(·).
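+
+The step from Equation (25) to Equation (23) can be written out explicitly. The following is a short worked check, substituting the Langevin drift and volatility into both sides of Equation (23); the shorthand used in the comment is introduced here only for this derivation.
+
```latex
% Substitute b_i(x) = (1/2) \partial \log \pi(x) / \partial x_i and
% V_{ij}(x) = 1_{\{i=j\}} into Equation (23), using (\partial_i \log \pi)\,\pi = \partial_i \pi.
\begin{align*}
\sum_i \frac{\partial}{\partial x_i}\left[ b_i(x)\,\pi(x) \right]
  &= \sum_i \frac{\partial}{\partial x_i}\left[ \frac{1}{2} \frac{\partial \log \pi(x)}{\partial x_i}\,\pi(x) \right]
   = \frac{1}{2} \sum_i \frac{\partial^2 \pi(x)}{\partial x_i^2}, \\
\frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\left[ V_{ij}(x)\,\pi(x) \right]
  &= \frac{1}{2} \sum_i \frac{\partial^2 \pi(x)}{\partial x_i^2}.
\end{align*}
```
+
+Both sides agree, so π(·) satisfies the stationary Fokker–Planck equation for this choice of drift and volatility.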
+Roberts and Tweedie [31] give sufficient conditions on π(·) under which a diffusion, (Xt)t≥0,
+with dynamics given by Equation (24), will be ergodic, meaning:
+
+∥Pt(x0, ·) − π(·)∥TV → 0
+(26)
+
+as t → ∞, for any x0 ∈ X .
+
+3.3. Metropolis-Adjusted Langevin Algorithm
+
+We can use Langevin diffusions as a basis for MCMC in many ways, but a popular
+variant is known as the Metropolis-adjusted Langevin algorithm (MALA), whereby Q(x, ·) is
+constructed through a Euler–Maruyama discretisation of (24) and used as a candidate kernel in
+a Metropolis–Hastings algorithm. The resulting Q is:
+
+\[ Q(x, \cdot) \equiv \mathcal{N}\left( x + \frac{\lambda^2}{2} \nabla \log \pi(x),\ \lambda^2 I \right), \tag{27} \]
+
+where λ is again a tuning parameter.
+Before we discuss the theoretical properties of the approach, we first offer an intuition for the
+dynamics. From Equation (27), it can be seen that Langevin-type proposals comprise a deterministic
+shift towards a local mode of π(x), combined with some random additive Gaussian noise, with
+variance λ² for each component. The relative weights of the deterministic and random parts are fixed,
+given as they are by the parameter, λ. Typically, if λ1/2 ≫ λ, then the random part of the proposal will
+dominate and vice versa in the opposite case, though this also depends on the form of ∇ log π(x) [31].
+Again, since this is a Metropolis–Hastings method, choosing λ is a balance between proposing
+large enough jumps and ensuring that a reasonable proportion are accepted. It has been shown that in
+the limit, as n → ∞, the optimal acceptance rate for the algorithm is 0.574 for forms of π(·), which
+either have independent and identically distributed components or whose components only differ
+by some scaling factor [20]. In these cases, as n → ∞, the proposal variance, λ², must be ∝ n^{−1/3}, so we say
+the algorithm efficiency scales O(n^{−1/3}). Note that these results compare favourably with the O(n^{−1})
+scaling of the random walk algorithm.
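+
+A one-dimensional sketch of MALA, built directly from the proposal in Equation (27), is given below. The N(0, 1) target, the step size λ = 1 and the function names are assumptions made for illustration only; the multivariate case replaces the scalar operations with vector ones.
+
```python
import math
import random


def mala_step(x, log_pi, grad_log_pi, lam, rng):
    """One MALA iteration in one dimension, using the proposal
    N(x + (lam^2 / 2) * grad_log_pi(x), lam^2) from Equation (27)."""
    def mean(z):
        return z + 0.5 * lam ** 2 * grad_log_pi(z)

    def log_q(a, b):
        # log q(a | b), up to a constant that cancels in the ratio
        return -0.5 * ((a - mean(b)) / lam) ** 2

    x_new = mean(x) + lam * rng.gauss(0.0, 1.0)
    log_ratio = (log_pi(x_new) + log_q(x, x_new)) - (log_pi(x) + log_q(x_new, x))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return x_new, True
    return x, False


def run_mala(lam, m=20000, seed=0):
    """Run MALA on a N(0, 1) target (an illustrative choice)."""
    rng = random.Random(seed)
    log_pi = lambda x: -0.5 * x * x
    grad_log_pi = lambda x: -x
    x, accepted, chain = 0.0, 0, []
    for _ in range(m):
        x, ok = mala_step(x, log_pi, grad_log_pi, lam, rng)
        accepted += ok
        chain.append(x)
    return chain, accepted / m


chain, rate = run_mala(lam=1.0)
print(f"acceptance rate: {rate:.2f}, sample mean: {sum(chain) / len(chain):.3f}")
```
+
+Unlike the random walk case, the proposal is not symmetric, so the proposal densities do not cancel and must be kept in the acceptance ratio.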
+Convergence properties of the method have also been established. Roberts and Tweedie [31]
+highlight some cases in which MALA is either geometrically ergodic or not. Typically, results are
+based on the tail behaviour of π(x). If these tails are heavier than exponential, then the method is
+typically not geometrically ergodic and similarly if the tails are lighter than Gaussian. However, in the
+in-between case, the converse is true. We again offer two simple examples for intuition here.
+
+Example: Take π(x) ∝ 1/(1 + x²) as in the previous example. Then, ∇ log π(x) = −2x/(1 + x²) → 0
+as |x| → ∞. Therefore, if x0 is far away from zero, then the MALA will be approximately equal to the RWM
+algorithm and, so, will also dissolve into a random walk.
+
+Example: Take π(x) ∝ e^{−x⁴}. Then, ∇ log π(x) = −4x³ and, from Equation (27), X′ ∼ N(x − 2λ²x³, λ²). Therefore, for any fixed
+λ, there exists c > 0, such that, for |x0| > c, we have |2λ²x³| ≫ |x| and |x − 2λ²x³| ≫ λ, suggesting that
+MALA proposals will quickly spiral further and further away from any neighbourhood of zero, and hence, nearly
+all will be rejected.
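+
+The overshoot can be made concrete with a small calculation. The sketch below takes the proposal mean x + (λ²/2)∇ log π(x) from Equation (27) and evaluates the log acceptance ratio for a proposal landing exactly at that mean. The step size λ = 0.5, the starting points and the function names are hypothetical.
+
```python
def proposal_mean(x, lam):
    """Mean of the MALA proposal for pi(x) proportional to exp(-x^4):
    x + (lam^2 / 2) * grad log pi(x) = x - 2 * lam^2 * x^3."""
    return x - 2.0 * lam ** 2 * x ** 3


def log_accept_ratio(x, x_new, lam):
    """log of alpha(x, x') for the target exp(-x^4), capped at zero."""
    log_pi = lambda z: -z ** 4
    log_q = lambda a, b: -0.5 * ((a - proposal_mean(b, lam)) / lam) ** 2
    return min(0.0, log_pi(x_new) + log_q(x, x_new) - log_pi(x) - log_q(x_new, x))


lam = 0.5  # hypothetical fixed step size
for x0 in (0.5, 1.0, 5.0, 20.0):
    centre = proposal_mean(x0, lam)
    print(f"x0 = {x0:5.1f}: proposal mean = {centre:10.1f}, "
          f"log acceptance ratio at the mean = {log_accept_ratio(x0, centre, lam):.3g}")
```
+
+Near the origin the drift is mild, but for larger starting points the proposal mean lands far on the opposite side of zero and the acceptance ratio is effectively zero.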
+
+For cases where there is a strong correlation between elements of x or each element has a different
+marginal variance, the MALA can also be “pre-conditioned” in a similar way to the RWM, so that the
+covariance structure of proposals more accurately reflects that of π(x) [32]. In this case, proposals take
+the form:
+
+\[ Q(x, \cdot) \equiv \mathcal{N}\left( x + \frac{\lambda^2}{2} \Sigma \nabla \log \pi(x),\ \lambda^2 \Sigma \right), \tag{28} \]
+
+where λ is again a tuning parameter. It can be shown that provided Σ is a constant matrix, π(x) is still
+the invariant distribution for the diffusion on which Equation (28) is based [33].
+
+4. Geometric Concepts in Markov Chain Monte Carlo
+
+Ideas from information geometry have been successfully applied to statistics from as early as [34].
+More widely, other geometric ideas have also been applied, offering new insight into common problems
+(e.g., [35,36]). A survey is given in [37]. In this section, we suggest why some ideas from differential
+geometry may be beneficial for sampling methods based on Markov chains. We then review what is
+meant by a “diffusion on a manifold”, before turning to the specific case of Equation (24). After this,
+we discuss what can be learned from work in information geometry in this context.
+
+4.1. Manifolds and Markov Chains
+
+We often make assumptions in MCMC about the properties of the space, X , in which our
+Markov chains evolve. Often X = Rn or a simple re-parametrisation would make it so. However,
+here, Rn = {(a1, ..., an) : ai ∈ (−∞, ∞) ∀i}. The additional assumption that is often made is that Rn is
+Euclidean, an inner product space with the induced distance metric:
+
+\[ d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}. \tag{29} \]
+
+For sampling methods based on Markov chains that explore the space locally, like the RWM and
+MALA, it may be advantageous to instead impose a different metric structure on the space, X , so that
+some points are drawn closer together and others pushed further apart. Intuitively, one can picture
+distances in the space being defined, such that if the current position in the chain is far from an area
+of X , which is “likely to occur” under π(·), then the distance to such a typical set could be reduced.
+Similarly, once this region is reached, the space could be “stretched” or “warped”, so that it is explored
+as efficiently as possible.
+While the idea is attractive, it is far from a constructive definition. We only have the pre-requisite
+that (X , d) must be a metric space. However, as Langevin dynamics use gradient information, we will
+require (X , d) to be a space on which we can do differential calculus. Riemannian manifolds are an
+appropriate choice, therefore, as the rules of differentiation are well understood for functions defined
+on them [38,39], while we are still free to define a more local notion of distance than Euclidean. In this
+section, we write Rn to denote the Euclidean vector space.
+
+4.2. Preliminaries
+
+We do not provide a full overview of Riemannian geometry here [38–40]. We simply note that
+for our purposes, we can consider an n-dimensional Riemannian manifold (henceforth, manifold)
+to be an n-dimensional metric space, in which distances are defined in a specific way. We also only
+consider manifolds for which a global coordinate chart exists, meaning that a mapping r : Rn → M
+exists, which is both differentiable and invertible and for which the inverse is also differentiable (a
+diffeomorphism). Although this restricts the class of manifolds available (the sphere, for example, is
+not in this class), it is again suitable for our needs and avoids the practical challenges of switching
+between coordinate patches. The connection with Rn defined through r is crucial for making sense of
+differentiability in M. We say a function f : M → R is “differentiable” if ( f ◦ r) : Rn → R is [39].
+As has been stated, Equation (29) can be induced via a Euclidean inner product, which we denote
+⟨·, ·⟩. However, it will aid intuition to think of distances in Rn via curves:
+
+γ : [0, 1] → Rn.
+(30)
+
+We could think of the distance between two points x, y ∈ Rn as the minimum length among all
+curves that pass through x and y. If γ(0) = x and γ(1) = y, the length is defined as:
+
+\[ L(\gamma) = \int_0^1 \sqrt{\langle \gamma'(t), \gamma'(t) \rangle}\,dt, \tag{31} \]
+
+giving the metric:
+d(x, y) = inf {L(γ) : γ(0) = x, γ(1) = y} .
+(32)
+
+In Rn, the curve with a minimum length will be a straight line, so that Equation (32) agrees with
+Equation (29). More generally, we call a solution to Equation (32) a geodesic [38].
+
+In a vector space, metric properties can always be induced through an inner product (which also
+gives a notion of orthogonality). Such a space can be thought of as “flat”, since for any two points, y
+and z, the straight line ay + (1 − a)z, a ∈ [0, 1] is also contained in the space. In general, manifolds do
+not have vector space structure globally, but do so at the infinitesimal level. As such, we can think
+of them as “curved”. We cannot always define an inner product, but we can still define distances
+through (32). We define a curve on a manifold, M, as γM : [0, 1] → M. At each point γM(t) = p ∈ M,
+the velocity vector, γ′M(t), lies in an n-dimensional vector space, which touches M at p. These are
+known as tangent spaces, denoted TpM, which can be thought of as local linear approximations to M.
+We can define an inner product on each as gp : TpM × TpM → R, which allows us to define a generalisation
+of (31) as:
+
+\[ L(\gamma_M) = \int_0^1 \sqrt{g_p(\gamma_M'(t), \gamma_M'(t))}\,dt, \tag{33} \]
+
+and provides a means to define a distance metric on the manifold as d(x, y) =
+inf {L(γM) : γM(0) = x, γM(1) = y}. We emphasise the difference between this distance metric on M
+and gp, which is called a Riemannian metric or metric tensor and which defines an inner product on
+TpM.
+
+Embeddings and Local Coordinates
+
+So far, we have introduced manifolds as abstract objects. In fact, they can also be considered as
+objects that are embedded in some higher-dimensional Euclidean space. A simple example is any
+two-dimensional surface, such as the unit sphere, lying in R3. If a manifold is embedded in this way,
+then metric properties can be induced from the ambient Euclidean space.
+We seek to make these ideas more concrete through an example, the graph of a function, f (x1, x2),
+of two variables, x1 and x2. The resulting map, r, is:
+
+r : R2 → M
+(34)
+
+r(x1, x2) = (x1, x2, f (x1, x2)).
+(35)
+
+We can see that M is embedded in R3, but that any point can be identified using only two coordinates,
+x1 and x2. In this case, each TpM is a plane, and therefore, a two-dimensional subspace of R3, so: (i) it
+inherits the Euclidean inner product, ⟨·, ·⟩; and (ii) any vector, v ∈ TpM, can be expressed as a linear
+combination of any two linearly independent basis vectors (a canonical choice is the partial derivatives
+∂r/∂x1 := r1 and r2, evaluated at x = r−1(p) ∈ R2). The resulting inner product, gp(v, w), between
+two vectors, v, w ∈ TpM, can be induced from the Euclidean inner product as:
+
+⟨v, w⟩ = ⟨v1r1(x) + v2r2(x), w1r1(x) + w2r2(x)⟩,
+
+= v1w1⟨r1(x), r1(x)⟩ + v1w2⟨r1(x), r2(x)⟩ + v2w1⟨r2(x), r1(x)⟩ + v2w2⟨r2(x), r2(x)⟩,
+
+= vTG(x)w,
+
+where:
+
+\[ G(x) = \begin{pmatrix} \langle r_1(x), r_1(x) \rangle & \langle r_1(x), r_2(x) \rangle \\ \langle r_1(x), r_2(x) \rangle & \langle r_2(x), r_2(x) \rangle \end{pmatrix} \tag{36} \]
+
+and we use vi, wi to denote the components of v and w. To write (33) using this notation, we define the
+curve, x(t) ∈ R2, corresponding to γM(t) ∈ M as x = (r−1 ◦ γM) : [0, 1] → R2. Equation (33) can then
+be written:
+
+\[ L(\gamma_M) = \int_0^1 \sqrt{x'(t)^T G(x(t))\,x'(t)}\,dt, \tag{37} \]
+
+which can be used in (32) as before.
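+
+To make Equations (36) and (37) concrete, the sketch below takes f(x1, x2) = sin(x1) + 1 as an illustrative choice of graph and computes the length of the curve x(t) = (t, t), t ∈ [0, 1], in two ways: from G(x) in local coordinates, and directly in the ambient space R3. The curve, the choice of f, the quadrature rules and the function names are all assumptions made for this example.
+
```python
import math


def metric_tensor(x1, x2):
    """G(x) for the surface r(x1, x2) = (x1, x2, sin(x1) + 1).
    The tangent basis is r1 = (1, 0, cos(x1)), r2 = (0, 1, 0)."""
    r1 = (1.0, 0.0, math.cos(x1))
    r2 = (0.0, 1.0, 0.0)
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    return [[dot(r1, r1), dot(r1, r2)],
            [dot(r1, r2), dot(r2, r2)]]


def curve_length_local(x, dx, steps=10000):
    """Midpoint-rule approximation of Equation (37) for a curve x(t),
    t in [0, 1], given in local coordinates with derivative dx(t)."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        G = metric_tensor(*x(t))
        v = dx(t)
        quad = sum(v[i] * G[i][j] * v[j] for i in range(2) for j in range(2))
        total += math.sqrt(quad) * h
    return total


def curve_length_embedded(x, steps=10000):
    """Length of the same curve measured in R^3, by summing straight-line
    distances between nearby points on the surface."""
    r = lambda x1, x2: (x1, x2, math.sin(x1) + 1.0)
    total, prev = 0.0, r(*x(0.0))
    for k in range(1, steps + 1):
        cur = r(*x(k / steps))
        total += math.dist(prev, cur)
        prev = cur
    return total


x = lambda t: (t, t)       # local-coordinate curve, so r(x(t)) = (t, t, sin(t) + 1)
dx = lambda t: (1.0, 1.0)  # its derivative
print(curve_length_local(x, dx), curve_length_embedded(x))
```
+
+The two numbers should agree up to discretisation error, which illustrates that lengths on the surface can be computed from G(x) alone.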
+
+The key point is that, although we have started with an object embedded in R3, we can compute
+the Riemannian metric, gp(v, w) (and, hence, distances in M), using only the two-dimensional “local”
+coordinates (x1, x2). We also need not have explicit knowledge of the mapping, r, only the components
+of the positive definite matrix, G(x). The Nash embedding theorem [41] in essence enables us to define
+manifolds by the reverse process: simply choose the matrix, G(x), so that we define a metric space
+with suitable distance properties, and some object embedded in some higher-dimensional Euclidean
+space will exist for which these metric properties can be induced as above. Therefore, to define our new
+space, we simply choose an appropriate matrix-valued map, G(x) (we discuss this choice in Section
+4.4). If G(x) does not depend on x, then M has a vector space structure and can be thought of as “flat”.
+Trivially, G(x) = I gives Euclidean n-space.
+We can also define volumes on a Riemannian manifold in local coordinates. Following standard
+coordinate transformation rules, we can see that for the above example, the area element, dx, in R2
+will change according to a Jacobian J = |(Dr)T(Dr)|^{1/2}, where Dr = ∂(p1, p2, p3)/∂(x1, x2). This
+reduces to J = |G(x)|^{1/2}, which is also the case for more general manifolds [38]. We therefore define
+the Riemannian volume measure on a manifold, M, in local coordinates as:
+
+\[ \mathrm{Vol}_M(dx) = |G(x)|^{1/2}\,dx. \tag{38} \]
+
+If G(x) = I, then this reduces to the Lebesgue measure.
+
+4.3. Diffusions on Manifolds
+
+By a “diffusion on a manifold” in local coordinates, we actually mean a diffusion defined on
+Euclidean space. For example, a realisation of Brownian motion on the surface, S ⊂ R3, defined
+in Figure 2 through r(x1, x2) = (x1, x2, sin(x1) + 1) will be a sample path, which is defined on S
+and “looks locally” like Brownian motion in a neighbourhood of any point, p ∈ S. However, the
+pre-image of this sample path (through r−1) will not be a realisation of a Brownian motion defined on
+R2, owing to the nonlinearity of the mapping. Therefore, to define “Brownian motion on S”, we define
+some diffusion (Xt)t≥0 that takes values in R2, for which the process (r(Xt))t≥0 “looks locally” like a
+Brownian motion (and lies on S). See [42] for more intuition here.
+Our goal, therefore, is to define a diffusion on Euclidean space, which, when mapped onto a
+manifold through r, becomes the Langevin diffusion described in (24) by the above procedure. Such a
+diffusion takes the form:
+
+\[ dX_t = \frac{1}{2} \tilde{\nabla} \log \tilde{\pi}(X_t)\,dt + d\tilde{B}_t, \tag{39} \]
+
+where those objects marked with a tilde must be defined appropriately. The next few paragraphs
+are technical, and readers aiming to simply grasp the key points may wish to skip to the end of
+this subsection.
+We turn first to ( ˜Bt)t≥0, which we use to denote Brownian motion on a manifold. Intuitively,
+we may think of a construction based on embedded manifolds, by setting ˜B0 = p ∈ M, and for
+each increment sampling some random vector in the tangent space TpM, and then moving along the
+manifold in the prescribed direction for an infinitesimal period of time before re-sampling another
+velocity vector from the next tangent space [42]. In fact, we can define such a construction using
+Stratonovich calculus and show that the infinitesimal generator can be written using only local
+coordinates [28]. Here, we instead take the approach of generalising the generator directly from
+Euclidean space to the local coordinates of a manifold, arriving at the same result. We then deduce the
+stochastic differential equation describing ( ˜Bt)t≥0 in Itô form using (20).
+For a standard Brownian motion on Rn, A = △/2, where △ denotes the Laplace operator:
+
+\[ \triangle f = \sum_i \frac{\partial^2 f}{\partial x_i^2} = \mathrm{div}(\nabla f). \tag{40} \]
+
+Substituting A = △/2 into (20) trivially gives bi(x) = 0 ∀i, Vij(x) = 1{i=j}, as required. The Laplacian,
+△ f (x), is the divergence of the gradient vector field of some function, f ∈ C2(Rn), and its value at
+x ∈ Rn can be thought of as measuring how the average value of f in some neighbourhood of x differs from f (x) [43].
+
+Figure 2. A two-dimensional manifold (surface) embedded in R3 through r(x1, x2) = (x1, x2, sin(x1) + 1),
+parametrised by the local coordinates, x1 and x2. The distance between the marked points A and B is given by
+the length of the curve γ(t) = (t, t, sin(t) + 1).
+
+To define a Brownian motion on any manifold, the gradient and divergence must be generalised.
+We provide a full derivation in Appendix B, which shows that the gradient operator on a manifold can
+be written in local coordinates as ∇M = G−1(x)∇. Combining with the operator, divM, we can define
+a generalisation of the Laplace operator, known as the Laplace–Beltrami operator (e.g., [44,45]), as:
+
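+\[ \triangle_{LB} f = \operatorname{div}_M(\nabla_M f) = |G(x)|^{-\frac{1}{2}} \sum_{i=1}^{n} \frac{\partial}{\partial x_i}\left( |G(x)|^{\frac{1}{2}} \sum_{j=1}^{n} \{G^{-1}(x)\}_{ij} \frac{\partial f}{\partial x_j} \right), \qquad (41) \]
+
+for some $f \in C_0^2(M)$.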
+The generator of a Brownian motion on M is △LB/2 [44]. Using (20), the resulting diffusion has
+dynamics given by:
+
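+\[ d\tilde{B}_t = \Omega(X_t)\,dt + \sqrt{G^{-1}(X_t)}\,dB_t, \]
+
+\[ \Omega_i(X_t) = \tfrac{1}{2}|G(X_t)|^{-\frac{1}{2}} \sum_{j=1}^{n} \frac{\partial}{\partial x_j}\left( |G(X_t)|^{\frac{1}{2}} \{G^{-1}(X_t)\}_{ij} \right). \]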
+
+Those familiar with the Itô formula will not be surprised by the additional drift term, Ω(Xt). As Itô
+integrals do not follow the chain rule of ordinary calculus, non-linear mappings of martingales, such
+as (Bt)t≥0, typically result in drift terms being added to the dynamics (e.g., [27]).
+To define ˜∇, we simply note that this is again the gradient operator on a general manifold, so
+˜∇ = G−1(x)∇. For the density, ˜π(x), we note that this density will now implicitly be defined with
+respect to the volume measure, $|G(x)|^{\frac{1}{2}}\,dx$, on the manifold. Therefore, to ensure the diffusion (39) has
+the correct invariant density with respect to the Lebesgue measure, we define:
+
+\[ \tilde{\pi}(x) = \pi(x)|G(x)|^{-\frac{1}{2}}. \qquad (42) \]
+
+Putting these three elements together, Equation (39) becomes:
+
+\[ dX_t = \tfrac{1}{2}G^{-1}(X_t)\nabla \log\left( \pi(X_t)|G(X_t)|^{-\frac{1}{2}} \right)dt + \Omega(X_t)\,dt + \sqrt{G^{-1}(X_t)}\,dB_t, \]
+
+which, upon simplification, becomes:
+
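+\[ dX_t = \tfrac{1}{2}G^{-1}(X_t)\nabla \log \pi(X_t)\,dt + \Lambda(X_t)\,dt + \sqrt{G^{-1}(X_t)}\,dB_t, \qquad (43) \]
+
+\[ \Lambda_i(X_t) = \tfrac{1}{2}\sum_j \frac{\partial}{\partial x_j}\{G^{-1}(X_t)\}_{ij}. \qquad (44) \]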
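+
+The simplification that turns the unsimplified diffusion (the display preceding (43)) into (43) is a short product-rule calculation. As a sketch, writing $\partial_j$ for $\partial/\partial x_j$ (shorthand introduced here for brevity) and suppressing the argument $X_t$, expanding the derivative in the definition of Ω gives:
+
+\[ \Omega_i = \tfrac{1}{2}\sum_j \partial_j \{G^{-1}\}_{ij} + \tfrac{1}{2}\sum_j \{G^{-1}\}_{ij}\,|G|^{-\frac{1}{2}}\,\partial_j |G|^{\frac{1}{2}} = \Lambda_i + \tfrac{1}{2}\sum_j \{G^{-1}\}_{ij}\,\partial_j \log |G|^{\frac{1}{2}}, \]
+
+while the density correction in (42) contributes:
+
+\[ \tfrac{1}{2}\left[ G^{-1}\nabla \log |G|^{-\frac{1}{2}} \right]_i = -\tfrac{1}{2}\sum_j \{G^{-1}\}_{ij}\,\partial_j \log |G|^{\frac{1}{2}}. \]
+
+The two terms involving $\log |G|^{\frac{1}{2}}$ cancel, so the total drift is $\tfrac{1}{2}G^{-1}\nabla \log \pi + \Lambda$, which is the drift appearing in (43).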
+
+It can be shown that this diffusion has invariant Lebesgue density π(x), as required [33]. Intuitively,
+when a set is mapped onto the manifold, distances are changed by a factor, $\sqrt{G(x)}$. Therefore, to end
+up with the initial distances, they must first be changed by a factor of $\sqrt{G^{-1}(x)}$ before the mapping,
+which explains the volatility term in Equation (43).
+The resulting Metropolis–Hastings proposal kernel for this “MALA on a manifold” was clarified
+in [33] and is given by:
+
+\[ Q(x, \cdot) \equiv \mathcal{N}\left( x + \tfrac{\lambda^2}{2} G^{-1}(x)\nabla \log \pi(x) + \lambda^2 \Lambda(x),\ \lambda^2 G^{-1}(x) \right), \qquad (45) \]
+
+where λ2 is a tuning parameter. The nonlinear drift term here is slightly different to that reported
+in [1,32], for reasons discussed in [33].
+
+4.4. Choosing a Metric
+
+We now turn to the question of which metric structure to put on the manifold, or equivalently,
+how to choose G(x). In this section, we sometimes switch notation slightly, denoting the target density,
+π(x|y), as some of the discussion is directed towards Bayesian inference, where π(·) is the posterior
+distribution for some parameter, x, after observing some data, y. The problem statement is: what is an
+appropriate choice of distance between points in the sample space of a given probability distribution?
+A related (but distinct) question is how to define a distance between two probability distributions
+from the same parametric family, but with different parameters. This has been a key theme in
+information geometry, explored by Rao [46] and others [2] for many years. Although generic
+measures of distance between distributions (such as total variation) are often appropriate, based on
+information-theoretic principles, one can deduce that for a given parametric family, {px(y) : x ∈ X },
+it is in some sense natural to consider this “space of distributions” to be a manifold, where the Fisher
+information is the matrix, G(x) (with the α = 0 connection employed; see [2] for details).
+Because of this, Girolami and Calderhead [1] proposed a variant of the Fisher metric for geometric
+Markov chain Monte Carlo, as:
+
+\[ G(x) = \mathbb{E}_{y|x}\left[ -\frac{\partial^2}{\partial x_i \partial x_j} \log f(y|x) \right] - \frac{\partial^2}{\partial x_i \partial x_j} \log \pi_0(x), \qquad (46) \]
+
+where π(x|y) ∝ f (y|x)π0(x) is the target density, f denotes the likelihood and π0 the prior. The metric
+is tailored to Bayesian problems, which are a common use for MCMC, so the Fisher information is
+combined with the negative Hessian of the log-prior. One can also view this metric as the expected
+negative Hessian of the log target, since this naturally reduces to (46).
+The motivation for a Hessian-style metric can also be understood from studying MCMC proposals.
+From (45) and by the same logic as for general pre-conditioning methods [32], the objective is to choose
+G−1(x) to match the covariance structure of π(x|y) locally. If the target density were Gaussian with
+covariance matrix, Σ, then:
+
+\[ -\frac{\partial^2}{\partial x_i \partial x_j} \log \pi(x|y) = \{\Sigma^{-1}\}_{ij}. \qquad (47) \]
+
+In the non-Gaussian case, the negative Hessian is no longer constant, but we can imagine that it
+matches the correlation structure of π(x|y) locally at least. Such ideas have been discussed in the
+geostatistics literature previously [47]. One problem with simply using (47) to define a metric is that
+unless π(x|y) is log-concave, the negative Hessian will not be globally positive-definite, although
+Petra et al. [48] conjecture that it may be appropriate for use in some realistic scenarios and suggest
+some computationally efficient approximation procedures.
+
+Example: Take π(x) ∝ 1/(1 + x2), and set G(x) = −∂2 log π(x)/∂x2. Then, G−1(x) = (1 + x2)2/(2 −
+2x2), which is negative if x2 > 1, so unusable as a proposal variance.
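+
+As a check on this example, the derivatives can be worked through directly (a sketch using only the stated target):
+
+\[ \log \pi(x) = -\log(1 + x^2) + \text{const}, \qquad \frac{\partial}{\partial x}\log \pi(x) = -\frac{2x}{1 + x^2}, \]
+
+\[ -\frac{\partial^2}{\partial x^2}\log \pi(x) = \frac{2(1 + x^2) - 4x^2}{(1 + x^2)^2} = \frac{2 - 2x^2}{(1 + x^2)^2}. \]
+
+The numerator changes sign at $|x| = 1$, so this choice of G(x) is positive only on $(-1, 1)$ and vanishes at $x = \pm 1$, where its inverse is undefined.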
+
+Girolami and Calderhead [1] use the Fisher metric in part to counteract this problem. Taking
+expectations over the data ensures that the likelihood contribution to G(x) in (46) will be positive
+(semi-)definite globally (e.g., [49]); so, provided a log-concave prior is chosen, then (46) should
+be a suitable choice for G(x). Indeed, Girolami and Calderhead [1] provide several examples in
+which geometric MCMC methods using this Fisher metric perform better than their “non-geometric”
+counterparts.
+Betancourt [50] also starts from the viewpoint that the Hessian (47) is an appropriate choice for
+G(x) and defines a mapping from the set of n × n matrices to the set of positive-definite n × n matrices
+by taking a “smooth” absolute value of the eigenvalues of the Hessian. This is done in a way such that
+derivatives of G(x) are still computable, which led the author to name it the SoftAbs metric. For a fixed
+value of x, the negative Hessian, H(x), is first computed and then decomposed into $U^T D U$, where
+D is the diagonal matrix of eigenvalues. Each diagonal element of D is then altered by the mapping
+tα : R → R, given by:
+
+\[ t_\alpha(\lambda_i) = \lambda_i \coth(\alpha \lambda_i), \qquad (48) \]
+
+where α is a tuning parameter (typically chosen to be as large as possible while the eigenvalues remain
+non-zero numerically). The function, tα, acts as a smooth absolute value function, but also lifts eigenvalues
+that are close to zero to ≈ 1/α. It should be noted that while the Fisher metric is only defined for
+models in which a likelihood is present and for which the expectation is tractable, the SoftAbs metric
+can be found for any target distribution, π(·).
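+The two regimes of the SoftAbs map can be read off from the standard series $\coth z = 1/z + z/3 + O(z^3)$ and the limit $\coth z \to \pm 1$ as $z \to \pm\infty$. As a brief sketch:
+
+\[ t_\alpha(\lambda) = \lambda \coth(\alpha\lambda) = \frac{1}{\alpha} + \frac{\alpha \lambda^2}{3} + O(\alpha^3 \lambda^4) \quad (\lambda \to 0), \qquad t_\alpha(\lambda) \to |\lambda| \quad (\alpha|\lambda| \to \infty). \]
+
+So eigenvalues that are large relative to 1/α are mapped to approximately their absolute values, while eigenvalues near zero are floored at roughly 1/α. The map is also even in λ, since both λ and coth(αλ) are odd.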
+Many authors (e.g., [1,48]) have noted that for many problems, the terms involving derivatives
+of G(x) are often small, and so, it is not always worth the computational effort of evaluating them.
+Girolami and Calderhead [1] propose the simplified manifold MALA, in which proposals are of
+the form:
+
+\[ Q(x, \cdot) \equiv \mathcal{N}\left( x + \tfrac{\lambda^2}{2} G^{-1}(x)\nabla \log \pi(x),\ \lambda^2 G^{-1}(x) \right). \qquad (49) \]
+
+Using this method means derivatives of G(x) are no longer needed, so more pragmatic ways of
+regularising the Hessian are possible. One simple approach would be to take the absolute value
+of each eigenvalue, giving $G(x) = U^T|D|U$, where $H(x) = U^T D U$ is the negative Hessian and |D|
+is a diagonal matrix with $\{|D|\}_{ii} = |\lambda_i|$ (this approach may run into difficulties if eigenvalues are
+numerically zero). Another would be choosing G(x) as the “nearest” positive-definite matrix to the
+negative Hessian, according to some distance metric on the set of n × n matrices. The problem has, in
+fact, been well-studied in mathematical finance, in the context of finding correlations using incomplete
+data sets [51], and tackled using distances induced by the Frobenius norm. Approximate solution
+algorithms are discussed in Higham [51]. It is not clear to us at present whether small changes to
+the Hessian would result in large changes to the corresponding positive definite matrix under a
+given distance or, indeed, whether given a distance metric on the space of matrices, there is always a
+well-defined unique “nearest” positive definite matrix. Below, we provide two simple examples
+showing how a “Hessian-style metric” can alleviate some of the difficulties associated with both
+heavy- and light-tailed target densities.
+
+Example: Take π(x) ∝ 1/(1 + x2), and set G(x) = | − ∂2 log π(x)/∂x2|. Then, G−1(x)∇ log π(x) =
+−x(1 + x2)/|1 − x2|, which no longer tends to zero as |x| → ∞, suggesting a manifold variant of MALA with
+a Hessian-style metric may avoid some of the pitfalls of the standard algorithm. Note that the drift may become
+very large if |x| ≈ 1, but since this event occurs with probability zero, we do not see it as a major cause for
+concern.
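+
+The drift in this example follows from the derivatives of the log-density. As a sketch, with $\nabla \log \pi(x) = -2x/(1 + x^2)$ and $G(x) = 2|1 - x^2|/(1 + x^2)^2$:
+
+\[ G^{-1}(x)\nabla \log \pi(x) = \frac{(1 + x^2)^2}{2|1 - x^2|} \cdot \left( -\frac{2x}{1 + x^2} \right) = -\frac{x(1 + x^2)}{|1 - x^2|}. \]
+
+Since $(1 + x^2)/|1 - x^2| \to 1$ as $|x| \to \infty$, the preconditioned drift behaves like $-x$ in the tails, whereas the unpreconditioned gradient $-2x/(1 + x^2)$ decays to zero.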
+
+Example: Take $\pi(x) \propto e^{-x^4}$, and set G(x) = | − ∂2 log π(x)/∂x2|. Then, G−1(x)∇ log π(x) = −x/3,
+which is O(x), so alleviating the problem of spiralling proposals for light-tailed targets demonstrated by MALA
+in an earlier example.
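+
+The calculation behind this example is short. As a sketch, for $x \neq 0$:
+
+\[ \log \pi(x) = -x^4 + \text{const}, \qquad \nabla \log \pi(x) = -4x^3, \qquad G(x) = \left| -\frac{\partial^2}{\partial x^2}\log \pi(x) \right| = 12x^2, \]
+
+\[ G^{-1}(x)\nabla \log \pi(x) = \frac{-4x^3}{12x^2} = -\frac{x}{3}. \]
+
+The preconditioned drift is therefore linear in x, compared with the cubic growth of the raw gradient $-4x^3$.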
+
+Other choices for G(x) have been proposed, which are not based on the Hessian. These have the
+advantage that gradients need not be computed (either analytically or using computational methods).
+Sejdinovic et al. [52] propose a Metropolis–Hastings method, which can be viewed as a geometric
+variant of the RWM, where the choice for G(x) is based on mapping samples to an appropriate
+feature space, and performing principal component analysis on the resulting features to choose a local
+covariance structure for proposals.
+If we consider the RWM with Gaussian proposals to be an Euler–Maruyama discretisation of
+Brownian motion on a manifold, then proposals will take the form Q(x, ·) ≡ N (x + λ2Ω(x), λ2G−1(x)).
+If we assume (as in the simplified manifold MALA) that Ω(x) ≈ 0, then we have proposals centred
+at the current point in the Markov chain with a local covariance structure (the full Hastings acceptance
+rate must now be used, as q(x′|x) ≠ q(x|x′) in general).
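+To make the asymmetry concrete, the following sketch writes out the proposal ratio for the centred proposal $\mathcal{N}(x, \lambda^2 G^{-1}(x))$. The symbol $a(x, x')$ for the acceptance probability is notation introduced here, not taken from the original. Up to a constant that does not depend on x or x′:
+
+\[ q(x'|x) \propto |G(x)|^{\frac{1}{2}} \exp\left\{ -\frac{1}{2\lambda^2}(x' - x)^T G(x)(x' - x) \right\}, \]
+
+\[ \frac{q(x|x')}{q(x'|x)} = \frac{|G(x')|^{\frac{1}{2}}}{|G(x)|^{\frac{1}{2}}} \exp\left\{ -\frac{1}{2\lambda^2}(x' - x)^T \left[ G(x') - G(x) \right](x' - x) \right\}, \qquad a(x, x') = \min\left\{ 1, \frac{\pi(x')\,q(x|x')}{\pi(x)\,q(x'|x)} \right\}. \]
+
+The ratio equals one when G is constant, which recovers the usual Metropolis acceptance rule for the RWM.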
+As no gradient information is needed, the Sejdinovic et al. metric can be used in conjunction with
+the pseudo-marginal MCMC algorithm, so that π(x|y) need not be known exactly. Examples from the
+article demonstrate the power of the approach [52].
+An important property of any Riemannian metric is how it transforms under coordinate change
+(e.g., [2]). The Fisher information metric commonly studied in information geometry is an example of a
+“coordinate invariant” choice for G(x). If we consider two parametrisations for a statistical model given
+by x and z = t(x), computing the Fisher information under x and then transforming this matrix using
+the Jacobian for the mapping, t, will give the same result as computing the Fisher information under z.
+It should be noted that because of either the prior contribution in (46) or the nonlinear transformations
+applied in other cases, none of the metrics we have reviewed here have this property, which means
+that we have no principled way of understanding how G(x) will relate to G(z). It is intuitive, however,
+that using information from all of π(x), rather than only the likelihood contribution, f (y|x), would
+seem sensible when trying to sample from π(·).
+
+5. Survey of Applications
+
+Rather than conduct our own simulation study, we instead highlight some cases in the literature
+where geometric MCMC methods have been used with success.
+Martin et al. [53] consider Bayesian inference for a statistical inverse problem, in which a surface
+explosion causes seismic waves to travel down into the ground (the subsurface medium). Often,
+the properties of the subsurface vary with distance from ground level or because of obstacles in the
+medium, in which case, a fraction of the waves will scatter off these boundaries and be reflected
+back up to ground level at later times. The observations here are the initial explosion and the waves,
+which return to the surface, together with return times. The challenge is to infer the properties of the
+subsurface medium from this data. The authors construct a likelihood based on the wave equation for
+the data and perform Bayesian inference using a variant of the manifold MALA. Figures are provided
+showing the local correlations present in the posterior and, therefore, highlighting the need for an
+algorithm that can navigate the high density region efficiently. Several methods are compared in the
+paper, but the variant of MALA that incorporates a local correlation structure is shown to be the most
+efficient, particularly as the dimension of the problem increases [53].
+
+Calderhead and Girolami [54] dealt with two models for biological phenomena based on nonlinear
+dynamical systems. A model of circadian control in the Arabidopsis thaliana plant comprised a system
+of six nonlinear differential equations, with twenty-two parameters to be inferred. Another model
+for cell signalling consisted of a system of six nonlinear differential equations with eight parameters,
+with inference complicated by the fact that observations of the model are not recorded directly [54].
+The resulting inference was performed using RWM, MALA and geometric methods, with the results
+highlighting the benefits of taking the latter approach. The simplified variant of MALA on a manifold
+is reported to have produced the most efficient inferences overall, in terms of effective sample size per
+unit of computational time.
+Stathopoulos and Girolami [55] considered the problem of inferring parameters in Markov jump
+processes. In the paper, a linear noise approximation is shown, which can make inference in such
+models more straightforward, enabling an approximate likelihood to be computed. Models based on
+chemical reaction dynamics are considered; one such from chemical kinetics contained four unknown
+parameters; another from gene expression consisted of seven. Inference was performed using the
+RWM, the simplified manifold MALA and Hamiltonian methods, with the MALA reported as most
+efficient according to the chosen diagnostics. The authors note that the simplified manifold method is
+both conceptually simple and able to account for local correlations, making it an attractive choice for
+inference [55].
+Konukoglu et al. [56] designed a method for personalising a generic model for a physiological
+process to a specific patient, using clinical data. The personalisation took the form of patient-specific
+parameter inference. The authors highlight some of the difficulties of this task in general, including the
+complexity of the models and the relative sparsity of the datasets, which often result in a parameter
+identifiability issue [56]. The example discussed in the paper is the Eikonal-diffusion model describing
+electrical activity in cardiac tissue, which results in a likelihood for the data based on a nonlinear partial
+differential equation, combined with observation noise [56]. A method for inference was developed by
+first approximating the likelihood using a spectral representation and then using geometric MCMC
+methods on the resulting approximate posterior. The method was first evaluated on synthetic data
+and then repeated on clinical data taken from a study for ventricular tachycardia radio-frequency
+ablation [56].
+
+6. Discussion
+
+The geometric viewpoint is not necessary to understand manifold variants of the MALA. Indeed,
+several authors [32,33] have discussed these algorithms without considering them to be “geometric”,
+rather simply Metropolis–Hastings methods in which proposal kernels have a position-dependent
+covariance structure. We do not claim that the geometric view is the only one that should be taken.
+Our goal is merely to point out that such position-dependent methods can often be viewed as methods
+defined on a manifold and that studying the structure of the manifold itself may lead to new insights on
+the methods. For example, taking the geometric viewpoint and noting the connection with information
+geometry enabled Girolami and Calderhead to adopt the Fisher metric for calculations [1]. We list here
+a few open questions on which the geometric viewpoint may help shed some insight.
+Computationally-minded readers will have noted that using position-dependent covariance
+matrices adds a significant computational overhead in practice, with additional O(n3) matrix inversions
+required at each step of the corresponding Metropolis–Hastings algorithms. Clearly, there will be
+many problems for which the matrix, G(x), does not change very much, and therefore, choosing
+a constant covariance G−1(x) = Σ may result in a more efficient algorithm overall. Geometrically,
+this would correspond to a manifold with scalar curvature close to zero everywhere. It may be that
+geometric ideas could be used to understand whether the manifold is flat enough that a constant choice
+of G(x) is sufficient. To make sense of this truly would require a relationship between curvature, an
+inherently local property, and more global statements about the manifold. Many results in differential
+geometry, beginning with the celebrated Gauss–Bonnet theorem, have previously related global and
+local properties in this way [57]. It is unknown to the authors whether results exist relating the
+curvature of a manifold to some global property, but this is an interesting avenue for further research.
+A related question is when to choose the simplified manifold MALA over the full method.
+Problems in which the term, ∥Λ(x)∥, is sufficiently large to warrant calculation correspond to those for
+which the manifold has very high curvature in many places; so again, making some global statement
+related to curvature could help here.
+Although there is a reasonably intuitive argument for why the Hessian is an appropriate starting
+point for G(x), the lack of positive-definiteness may be seen as a cause for concern by some. After
+all, it could be argued that if the curvature is not positive-definite in a region, then how can it be a
+reasonable approximation to the local covariance structure? Many statistical models used to describe
+natural phenomena are characterised by distributions with heavy tails or multiple modes, for which
+this is the case. In addition, for target densities of the form $\pi(x) \propto e^{-|x|}$, the Hessian is equal to
+zero everywhere away from the origin! The attempts to force positive-definiteness we have described will typically result in
+small moves being proposed in such regions of the sample space, which may not be an optimal strategy.
+Much work in information geometry has centred on the geometry of Hessian structures [58], and some
+insights from this field may help to better understand the question of choosing an appropriate metric.
+In addition, the field of pseudo-Riemannian geometry deals with forms of G(x), which need not be
+positive-definite [39]; so again, understanding could be gained from here.
+Some recent work in high-dimensional inference has centred on defining MCMC methods for
+which efficiency scales O(1) with respect to the dimension, n, of π(·) [19,59]. In the case where X
+takes values in some infinite-dimensional function space, this can be done provided a Gaussian prior
+measure is defined for X. A notable result from infinite-dimensional probability spaces is that two
+different probability measures defined over some infinite-dimensional space have a striking tendency
+to have disjoint supports [60]. The key challenge for MCMC is to define transition kernels for which
+proposed moves are inside the support for π(·). A straightforward approach is to define proposals for
+which the prior is invariant, since the likelihood contribution to the posterior typically will not alter
+its support from that of the prior [19]. However, the posterior may still look very different from the
+prior, as noted in [61], so this proposal mechanism, though O(1), can still result in slow exploration.
+Understanding the geometry of the support and defining methods that incorporate the likelihood term,
+but also respect this geometry, so as to ensure proposals remain in the support of π(·), is an intriguing
+research proposition.
+The methods reviewed in this paper are based on first order Langevin diffusions. Algorithms
+have also been developed that are based on second order Langevin diffusions, in which a stochastic
+differential equation governs the behaviour of the velocity of a process [62,63]. A natural extension to
+the work of Girolami and Calderhead [1] and Xifara et al. [33] would be to map such diffusions onto
+a manifold and derive Metropolis–Hastings proposal kernels based on the resulting dynamics. The
+resulting scheme would be a generalisation of [63], though the most appropriate discretisation scheme
+for a second order process to facilitate sampling is unclear and perhaps a question worthy of further
+exploration.
+We have focused primarily here on the sample space X = Rn and on defining an appropriate
+manifold on which to construct Markov chains. In some inference problems, however, the sample
+space is a pre-defined manifold, for example the set of n × n rotation matrices, commonly found in the
+field of directional statistics [64]. Such manifolds are often not globally mappable to Euclidean n-space.
+Methods have been devised for sampling from such spaces [65,66]. In order to use the methods
+described here for such problems, an appropriate approach for switching between coordinate patches
+at the relevant time would need to be devised, which could be an interesting area of further study.
+Alongside these geometric problems, we can also discuss geometric MCMC methods from a
+statistical perspective. The last example given in the previous section hinted that the manifold MALA
+may cope better with target distributions with heavy tails. In fact, Latuszynski et al. [67] have shown
+that, in one dimension, the manifold MALA is geometrically ergodic for a class of targets of the
+form $\pi(x) \propto \exp(-|x|^{\beta})$ for any choice of β ≠ 1. This incorporates cases where tails are heavier
+than exponential and lighter than Gaussian, two scenarios under which geometric ergodicity fails for
+the MALA.
+Finding optimal acceptance rates and scaling of λ with dimension are two other related challenges.
+In this case, the picture is more complex. Traditional results have been shown for Metropolis–Hastings
+methods in the case where the components of the target distribution are independent and identically distributed,
+or where there is some other suitable symmetry and regularity in the shape of π(·). Manifold methods are, however,
+specifically tailored to scenarios in which this is not the case, scenarios in which there is a high
+correlation between components of x, which changes depending on the value of x. It is less clear how
+to proceed with finding relevant results that can serve as guidelines to practitioners here. Indeed,
+Sherlock [18] notes that a requirement for optimal acceptance rate results for the RWM to be appropriate
+is that the curvature of π(x) does not change too much, yet this is the very scenario in which we would
+want to use a manifold method.
+
+Acknowledgments: We thank the two reviewers for helpful comments and suggestions. Samuel Livingstone is
+funded by a PhD Scholarship from Xerox Research Centre Europe. Mark Girolami is funded by an Engineering
+and Physical Sciences Research Council Established Career Research Fellowship, EP/J016934/1, and a Royal
+Society Wolfson Research Merit Award.
+
+Author Contributions: The article was written by Samuel Livingstone under the guidance of Mark Girolami. All authors
+have read and approved the final manuscript.
+
+Appendix
+
+Appendix A. Total Variation Distance
+
+We show how to obtain (10) from (9). Denoting two probability distributions, μ(·) and ν(·), and
+associated densities, μ(x) and ν(x), we have:
+
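+\[ \|\mu(\cdot) - \nu(\cdot)\|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|. \]
+
+Define the set B = {x ∈ X : μ(x) > ν(x)}. To see that $B \in \mathcal{B}$, note that
+$B = \cup_{q \in \mathbb{Q}} \left( \{x \in X : \mu(x) > q\} \cap \{x \in X : \nu(x) < q\} \right)$,
+and the result follows from properties of $\mathcal{B}$ (e.g., [68]). Now, for any $A \in \mathcal{B}$: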
+
+μ(A) − ν(A) ≤ μ(A ∩ B) − ν(A ∩ B) ≤ μ(B) − ν(B),
+
+and similarly:
+ν(A) − μ(A) ≤ ν(Bc) − μ(Bc),
+
+so, the supremum will be attained either at B or Bc. However, since μ(X ) = ν(X ) = 1, then:
+
+[μ(B) − ν(B)] − [ν(Bc) − μ(Bc)] = 0,
+
+so that
+|μ(B) − ν(B)| = |μ(Bc) − ν(Bc)|.
+
+Using these facts gives an alternative characterisation of the total variation distance as:
+
+\[ \|\mu(\cdot) - \nu(\cdot)\|_{TV} = \tfrac{1}{2}\left( |\mu(B) - \nu(B)| + |\mu(B^c) - \nu(B^c)| \right) = \tfrac{1}{2}\int_{X} |\mu(x) - \nu(x)|\,dx \]
+
+as required.
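+
+The final equality uses the sign of μ(x) − ν(x) on each piece of the partition {B, Bc}. As a sketch of that step, since μ(x) > ν(x) on B and μ(x) ≤ ν(x) on Bc:
+
+\[ |\mu(B) - \nu(B)| = \int_{B} (\mu(x) - \nu(x))\,dx = \int_{B} |\mu(x) - \nu(x)|\,dx, \]
+
+\[ |\mu(B^c) - \nu(B^c)| = \int_{B^c} (\nu(x) - \mu(x))\,dx = \int_{B^c} |\mu(x) - \nu(x)|\,dx. \]
+
+Adding the two lines gives the integral of |μ(x) − ν(x)| over all of X.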
+
+Appendix B. Gradient and Divergence Operators on a Riemannian Manifold
+
+The gradient of a function on Rn is the unique vector field, such that, for any unit vector, u:
+
+\[ \langle \nabla f(x), u \rangle = D_u[f(x)] = \lim_{h \to 0}\left\{ \frac{f(x + hu) - f(x)}{h} \right\}, \qquad (A1) \]
+
+the directional derivative of f along u at x ∈ Rn.
+On a manifold, the gradient operator, ∇M, can still be defined, such that the inner product
+gp(∇M f (x), u) = Du[ f (x)]. Setting ∇M = G(x)−1∇ gives:
+
+\[ g_p(\nabla_M f(x), u) = (G^{-1}(x)\nabla f(x))^T G(x)\,u = \langle \nabla f(x), u \rangle, \]
+
+which is equal to the directional derivative along u as required.
+The divergence of some vector field, v, at a point, x ∈ Rn, is the net outward flow generated by
+v through some small neighbourhood of x. Mathematically, the divergence of v(x) ∈ R3 is given by
+∑i ∂vi/∂xi. On a more general manifold, the divergence is also a sum of derivatives, but here, they
+are covariant derivatives. A short introduction is provided in Appendix C. Here, we simply state
+that the covariant derivative of a vector field, v, at a point p ∈ M is the orthogonal projection of the
+directional derivative onto the tangent space, TpM. Intuitively, a vector field on a manifold is a field
+of vectors, each of which lies in the tangent space at a point, p ∈ M. It only makes sense therefore to
+discuss how vector fields change along the manifold or in the direction of vectors, which also lie in the
+tangent space. Although the idea seems simple, the covariant derivative has some attractive geometric
+properties; notably, it can be written entirely in local coordinates and so does not depend on
+knowledge of an embedding in some ambient space.
+The divergence of a vector field, v, defined on a manifold, M, at the point, p ∈ M, is defined as:
+
+\[ \operatorname{div}_M(v) = \sum_{i=1}^{n} D^c_{e_i}[v_i], \]
+
+where ei denotes the i-th basis vector for the tangent space, TpM, at p ∈ M, and vi denotes the i-th
+coefficient. This can be written in local coordinates (see Appendix C) as:
+
+\[ \operatorname{div}_M(v) = |G(x)|^{-\frac{1}{2}} \sum_{i=1}^{n} \frac{\partial}{\partial x_i}\left( |G(x)|^{\frac{1}{2}} v_i \right), \]
+
+and can be combined with ∇M to form the Laplace–Beltrami operator (41).
+
+Appendix C. Vector Fields and the Covariant Derivative
+
+Here, we provide a short introduction to vector fields and differentiation on a smooth manifold;
+see [38,39]. The following geometric notation is used here: (i) vector components are indexed with a
+superscript, e.g., $v = (v^1, \ldots, v^n)$; and (ii) repeated subscripts and superscripts are summed over, e.g.,
+$v^i e_i = \sum_i v^i e_i$ (known as the Einstein summation convention).
+For any smooth manifold, M, the set of all tangent vectors to points on M is known as the tangent
+bundle and denoted TM.
+A Cr vector field defined on M is a mapping that assigns to each point, p ∈ M, a tangent vector,
+v(p) ∈ TpM. In addition, the components of v(p) in any basis for TpM must also be Cr [38]. We
+will denote the set of all vector fields on M as Γ(TM). For some vector field, v ∈ Γ(TM), at any
+point, p ∈ M, the vector, v(p) ∈ TpM, can be written as a linear combination of some n basis vectors
+$\{e_1, \ldots, e_n\}$ as $v = v^i e_i$. To understand how v will change in a particular direction along M, it only
+makes sense, therefore, to consider derivatives along vectors in TpM. Two other things must be
+considered when defining a derivative along a manifold: (i) how the components, vi, of each basis
+vector will change; and (ii) how each basis vector, ei, itself will change. For the usual directional
+derivative on Rn, the basis vectors do not change, as the tangent space is the same at each point, but
+for a more general manifold, this is no longer the case: the ei’s are referred to as a “local” basis for each
+TpM.
+The covariant derivative, Dc, is defined so as to account for these shortcomings. When considering
+differentiation along a vector, $u^* \notin T_pM$, $u^*$ is simply projected onto the tangent space. The derivative
+with respect to any u ∈ TpM can now be decomposed into a linear combination of derivatives of basis
+vectors and vector components:
+
+\[ D^c_u[v] = D^c_{u^i e_i}[v^j e_j], \qquad (A2) \]
+
+where the argument, p, has been dropped, but is implied for both components and local basis vectors.
+The operator, $D^c_u[v]$, is defined to be linear in both u and v and to satisfy the product rule [38]; so,
+Equation (A2) can be decomposed into:
+
+\[ D^c_u[v] = u^i\left( D^c_{e_i}[v^j]\,e_j + v^j D^c_{e_i}[e_j] \right). \qquad (A3) \]
+
+The operator, Dc, need, therefore, only be defined along the direction of basis vectors ei and for vector
+component vi and basis vector ei arguments.
+For components $v^i$, $D^c_{e_j}[v^i]$ is defined as simply the partial derivative $\partial_j v^i := \partial v^i/\partial x_j$. The
+directional derivative of some basis vector $e_i$ along some $e_j$ is best understood through the example of
+a regular surface $\Sigma \subset \mathbb{R}^3$. Here, $D_{e_j}[e_i]$ will be a vector, $w \in \mathbb{R}^3$. Taking the basis for this space at the
+point, p, as $\{e_1, e_2, \hat{n}\}$, where $\hat{n}$ denotes the unit normal to $T_p\Sigma$, we can write $w = \alpha e_1 + \beta e_2 + \kappa\hat{n}$. The
+covariant derivative, $D^c_{e_j}[e_i]$, is simply the projection of w onto $T_p\Sigma$, given by $w^* = \alpha e_1 + \beta e_2$. More
+generally, at some point, p, in a smooth manifold, M, the covariant derivative $D^c_{e_j}[e_i] = \Gamma^k_{ji} e_k$ (with
+upper and lower indices summed over). The coefficients, $\Gamma^k_{ji}$, are known as the Christoffel symbols: $\Gamma^k_{ji}$
+denotes the coefficient of the k-th basis vector when taking the derivative of the i-th with respect to the
+j-th. If a Riemannian metric, g, is chosen for M, then they can be expressed completely as a function
+of g (or in local coordinates as a function of the matrix, G). Using these definitions, Equation (A3) can
+be re-written as:
+
+\[ D^c_u[v] = u^i\left( \partial_i v^k + v^j \Gamma^k_{ij} \right)e_k. \qquad (A4) \]
+
+The divergence of a vector field, v ∈ Γ(TM), at the point, p ∈ M, is given by:
+
+\[ \operatorname{div}_M(v) = D^c_{e_i}[v^i], \qquad (A5) \]
+
+where, again, repeated indices are summed over. If M = Rn, this reduces to the usual sum of partial
+derivatives, $\partial_i v^i$. On a more general manifold, M, the equivalent expression is:
+
+\[ D^c_{e_i}[v^i] = \partial_i v^i + v^i \Gamma^j_{ij}, \qquad (A6) \]
+
+where, again, repeated indices are summed. As has been previously stated, if a metric, g, and coordinate
+chart is chosen for M, the Christoffel symbols can be written in terms of the matrix, G(x). In this
+case [69]:
+
+\[ \Gamma^j_{ij} = |G(x)|^{-\frac{1}{2}}\, \partial_i \left( |G(x)|^{\frac{1}{2}} \right), \tag{A7} \]
+
+so Equation (A6) becomes:
+
+\[ D^c_{e_i}[v^i] = |G(x)|^{-\frac{1}{2}}\, \partial_i \left( |G(x)|^{\frac{1}{2}} v^i \right), \tag{A8} \]
+
+where v = v(x).
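+
+As a numerical sanity check of Equation (A8), the following minimal sketch evaluates the divergence in polar coordinates on the plane, where $G = \mathrm{diag}(1, r^2)$ and $|G|^{1/2} = r$, and compares it with the Cartesian divergence. The test field and all function names are illustrative assumptions, not part of the original appendix.

```python
import math

def cartesian_field(x, y):
    # Hypothetical test field F = (x^2, x*y); its Cartesian divergence is 2x + x = 3x.
    return x * x, x * y

def polar_components(r, th):
    # Components (v^r, v^theta) of the same field in the polar coordinate basis.
    x, y = r * math.cos(th), r * math.sin(th)
    fx, fy = cartesian_field(x, y)
    v_r = (x * fx + y * fy) / r
    v_th = (-y * fx + x * fy) / (r * r)
    return v_r, v_th

def sqrt_det_G(r, th):
    # |G|^(1/2) for the polar metric G = diag(1, r^2).
    return r

def divergence_A8(r, th, h=1e-5):
    # |G|^(-1/2) * d_i( |G|^(1/2) v^i ), with central finite differences for d_i.
    def flux(i, r_, th_):
        return sqrt_det_G(r_, th_) * polar_components(r_, th_)[i]
    d_r = (flux(0, r + h, th) - flux(0, r - h, th)) / (2 * h)
    d_th = (flux(1, r, th + h) - flux(1, r, th - h)) / (2 * h)
    return (d_r + d_th) / sqrt_det_G(r, th)

r0, th0 = 1.3, 0.7
cartesian_divergence = 3 * r0 * math.cos(th0)
polar_divergence = divergence_A8(r0, th0)
```

+The two values agree up to finite-difference error, which illustrates that the volume factor $|G|^{1/2}$ accounts for the Christoffel term in Equation (A6).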
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+Entropy 2014, 16, 3074–3102
+
+References
+
+1.
+Girolami, M.; Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat.
+Soc. Ser. B 2011, 73, 123–214.
+2.
+Amari, S.I.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI,
+USA, 2007; Volume 191.
+3.
+Marriott, P.; Salmon, M. Applications of Differential Geometry to Econometrics; Cambridge University Press:
+Cambridge, UK, 2000.
+4.
+Betancourt, M.; Girolami, M. Hamiltonian Monte Carlo for Hierarchical Models. 2013, arXiv: 1312.0906.
+5.
+Neal, R. MCMC using Hamiltonian Dynamics. In Handbook of Markov Chain Monte Carlo; Chapman and
+Hall/CRC: Boca Raton, FL, USA, 2011; pp. 113–162.
+6.
+Betancourt, M.; Stein, L.C. The Geometry of Hamiltonian Monte Carlo. 2011, arXiv: 1112.4118.
+7.
+Robert, C.P.; Casella, G. Monte Carlo Statistical Methods; Springer: New York, NY, USA, 2004; Volume 319.
+8.
+Tierney, L. Markov chains for exploring posterior distributions. Ann. Stat. 1994, 22, 1701–1728.
+9.
+Kipnis, C.; Varadhan, S. Central limit theorem for additive functionals of reversible Markov processes and
+applications to simple exclusions. Commun. Math. Phys. 1986, 104, 1–19.
+10.
+R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing:
+Vienna, Austria, 2012.
+11.
+Plummer, M.; Best, N.; Cowles, K.; Vines, K. CODA: Convergence diagnosis and output analysis for MCMC.
+R. News 2006, 6, 7–11.
+12.
+Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435.
+13.
+Jones, G.L.; Hobert, J.P. Honest exploration of intractable probability distributions via Markov chain Monte
+Carlo. Stat. Sci. 2001, 16, 312–334.
+14.
+Jones, G.L. On the Markov chain central limit theorem. Probab. Surv. 2004, 1, 299–320.
+15.
+Gelman, A.; Rubin, D.B. Inference from iterative simulation using multiple sequences. Stat. Sci. 1992, 7,
+457–472.
+16.
+Sherlock, C.; Fearnhead, P.; Roberts, G.O. The random walk Metropolis: Linking theory and practice through
+a case study. Stat. Sci. 2010, 25, 172–190.
+17.
+Sherlock, C.; Roberts, G. Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal
+targets. Bernoulli 2009, 15, 774–798.
+18.
+Sherlock, C. Optimal scaling of the random walk Metropolis: General criteria for the 0.234 acceptance rule. J.
+Appl. Probab. 2013, 50, 1–15.
+19.
+Beskos, A.; Kalogeropoulos, K.; Pazos, E. Advanced MCMC methods for sampling on diffusion pathspace.
+Stoch. Processes Appl. 2013, 123, 1415–1453.
+20.
+Roberts, G.O.; Rosenthal, J.S. Optimal scaling for various Metropolis–Hastings algorithms. Stat. Sci. 2001,
+16, 351–367.
+21.
+Roberts, G.O.; Tweedie, R.L. Geometric convergence and central limit theorems for multidimensional
+Hastings and Metropolis algorithms. Biometrika 1996, 83, 95–110.
+22.
+Mengersen, K.L.; Tweedie, R.L. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Stat.
+1996, 24, 101–121.
+23.
+Jarner, S.F.; Hansen, E. Geometric ergodicity of Metropolis algorithms. Stoch. Processes Appl. 2000, 85, 341–361.
+24.
+Christensen, O.F.; Møller, J.; Waagepetersen, R.P. Geometric ergodicity of Metropolis–Hastings algorithms
+for conditional simulation in generalized linear mixed models. Methodol. Comput. Appl. Probab. 2001, 3, 309–327.
+25.
+Neal, P.; Roberts, G. Optimal scaling for random walk Metropolis on spherically constrained target densities.
+Methodol. Comput. Appl. Probab. 2008, 10, 277–297.
+26.
+Jarner, S.F.; Tweedie, R.L. Necessary conditions for geometric and polynomial ergodicity of
+random-walk-type Markov chains. Bernoulli 2003, 9, 559–578.
+27.
+Øksendal, B. Stochastic Differential Equations; Springer: New York, NY, USA, 2003.
+28.
+Rogers, L.C.G.; Williams, D. Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus; Cambridge
+University Press: Cambridge, UK, 2000; Volume 2.
+29.
+Meyn, S.P.; Tweedie, R.L. Stability of Markovian processes III: Foster–Lyapunov criteria for continuous-time
+processes. Adv. Appl. Probab. 1993, 25, 518–548.
+30.
+Coffey, W.; Kalmykov, Y.P.; Waldron, J.T. The Langevin Equation: with Applications to Stochastic Problems in
+Physics, Chemistry, and Electrical Engineering; World Scientific: Singapore, Singapore, 2004; Volume 14.
+31.
+Roberts, G.O.; Tweedie, R.L. Exponential convergence of Langevin distributions and their discrete
+approximations. Bernoulli 1996, 2, 341–363.
+32.
+Roberts, G.O.; Stramer, O. Langevin diffusions and Metropolis–Hastings algorithms. Methodol. Comput.
+Appl. Probab. 2002, 4, 337–357.
+33.
+Xifara, T.; Sherlock, C.; Livingstone, S.; Byrne, S.; Girolami, M. Langevin diffusions and the
+Metropolis-adjusted Langevin algorithm. Stat. Probab. Lett. 2013, 91, 14–19.
+34.
+Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. Ser. A
+Math. Phys. Sci. 1946, 186, 453–461.
+35.
+Critchley, F.; Marriott, P.; Salmon, M. Preferred point geometry and statistical manifolds. Ann. Stat. 1993,
+21, 1197–1224.
+36.
+Marriott, P. On the local geometry of mixture models. Biometrika 2002, 89, 77–93.
+37.
+Barndorff-Nielsen, O.; Cox, D.; Reid, N. The role of differential geometry in statistical theory. Int. Stat. Rev.
+1986, 54, 83–96.
+38.
+Boothby, W.M. An Introduction to Differentiable Manifolds and Riemannian Geometry; Academic Press: San
+Diego, CA, USA, 1986; Volume 120.
+39.
+Lee, J.M. Smooth Manifolds; Springer: New York, NY, USA, 2003.
+40.
+Do Carmo, M.P. Riemannian Geometry; Springer: New York, NY, USA, 1992.
+41.
+Nash, J.F., Jr. The imbedding problem for Riemannian manifolds. In The Essential John Nash; Princeton
+University Press: Princeton, NJ, USA, 2002; p. 151.
+42.
+Manton, J.H. A Primer on Stochastic Differential Geometry for Signal Processing. 2013, arXiv: 1302.0430.
+43.
+Stewart, J. Multivariable Calculus; Cengage Learning: Boston, MA, USA, 2011.
+44.
+Hsu, E.P. Stochastic Analysis on Manifolds; American Mathematical Society: Providence, RI, USA, 2002;
+Volume 38.
+45.
+Kent, J. Time-reversible diffusions. Adv. Appl. Probab. 1978, 10, 819–835.
+46.
+Radhakrishna Rao, C. Information and accuracy attainable in the estimation of statistical parameters. Bull.
+Calcutta Math. Soc. 1945, 37, 81–91.
+47.
+Christensen, O.F.; Roberts, G.O.; Sköld, M. Robust Markov chain Monte Carlo methods for spatial generalized
+linear mixed models. J. Comput. Graph. Stat. 2006, 15, 1–17.
+48.
+Petra, N.; Martin, J.; Stadler, G.; Ghattas, O. A computational framework for infinite-dimensional Bayesian
+inverse problems: Part II. Stochastic Newton MCMC with application to ice sheet flow inverse problems.
+2013, arXiv: 1308.6221.
+49.
+Pawitan, Y. In All Likelihood: Statistical Modelling and Inference Using Likelihood; Oxford University Press:
+Oxford, UK, 2001.
+50.
+Betancourt, M. A General Metric for Riemannian Manifold Hamiltonian Monte Carlo. In Geometric Science of
+Information; Springer: New York, NY, USA, 2013; pp. 327–334.
+51.
+Higham, N.J. Computing the nearest correlation matrix—a problem from finance. IMA J. Numer. Anal. 2002,
+22, 329–343.
+52.
+Sejdinovic, D.; Garcia, M.L.; Strathmann, H.; Andrieu, C.; Gretton, A. Kernel Adaptive Metropolis–Hastings.
+2013, arXiv: 1307.5302.
+53.
+Martin, J.; Wilcox, L.C.; Burstedde, C.; Ghattas, O. A stochastic Newton MCMC method for large-scale
+statistical inverse problems with application to seismic inversion. SIAM J. Sci. Comput. 2012, 34, A1460–A1487.
+54.
+Calderhead, B.; Girolami, M. Statistical analysis of nonlinear dynamical systems using differential geometric
+sampling methods. Interface Focus 2011, 1, 821–835.
+55.
+Stathopoulos, V.; Girolami, M.A. Markov chain Monte Carlo inference for Markov jump processes via the
+linear noise approximation. Philos. Trans. R. Soc. A 2013, 371, 20110541.
+56.
+Konukoglu, E.; Relan, J.; Cilingir, U.; Menze, B.H.; Chinchapatnam, P.; Jadidi, A.; Cochet, H.; Hocini, M.;
+Delingette, H.; Jaïs, P.; et al. Efficient probabilistic model personalization integrating uncertainty on data and
+parameters: Application to eikonal-diffusion models in cardiac electrophysiology. Prog. Biophys. Mol. Biol.
+2011, 107, 134–146.
+57.
+Do Carmo, M.P. Differential Geometry of Curves and Surfaces; Prentice-Hall: Englewood Cliffs,
+NJ, USA, 1976; Volume 2.
+58.
+Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, Singapore, 2007; Volume 1.
+59.
+Cotter, S.; Roberts, G.; Stuart, A.; White, D. MCMC methods for functions: Modifying old algorithms to
+make them faster. Stat. Sci. 2013, 28, 424–446.
+60.
+Da Prato, G.; Zabczyk, J. Stochastic Equations in Infinite Dimensions; Cambridge University Press: Cambridge,
+UK, 2008.
+61.
+Law, K.J. Proposals which speed up function-space MCMC. J. Comput. Appl. Math. 2014, 262, 127–138.
+62.
+Ottobre, M.; Pillai, N.S.; Pinski, F.J.; Stuart, A.M. A Function Space HMC Algorithm With Second Order
+Langevin Diffusion Limit. 2013, arXiv: 1308.0543.
+63.
+Horowitz, A.M. A generalized guided Monte Carlo algorithm. Phys. Lett. B 1991, 268, 247–252.
+64.
+Mardia, K.V.; Jupp, P.E. Directional Statistics; Wiley: New York, NY, USA, 2009; Volume 494.
+65.
+Byrne, S.; Girolami, M. Geodesic Monte Carlo on embedded manifolds. Scand. J. Stat. 2013, 40, 825–845.
+66.
+Diaconis, P.; Holmes, S.; Shahshahani, M. Sampling from a manifold. In Advances in Modern Statistical Theory
+and Applications: A Festschrift in Honor of Morris L. Eaton; Institute of Mathematical Statistics: Washington,
+DC, USA, 2013; pp. 102–125.
+67.
+Latuszynski, K.; Roberts, G.O.; Thiery, A.; Wolny, K. Discussion on “Riemann manifold Langevin and
+Hamiltonian Monte Carlo methods” (by Girolami, M. and Calderhead, B.). J. R. Stat. Soc. Ser. B 2011,
+73, 188–189.
+68.
+Capinski, M.; Kopp, P.E. Measure, Integral and Probability; Springer: New York, NY, USA, 2004.
+69.
+Schutz, B.F. Geometrical Methods of Mathematical Physics; Cambridge University Press: Cambridge, UK, 1984.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+Article
+Variational Bayes for Regime-Switching Log-Normal Models
+
+Hui Zhao and Paul Marriott *
+
+University of Waterloo, 200 University Avenue West, Waterloo, ON N2L 3G1, Canada; E-Mail: h6zhao@uwaterloo.ca
+* E-Mail: pmarriot@uwaterloo.ca; Tel.: +1-519-888-4567.
+
+Received: 14 April 2014; in revised form: 12 June 2014 / Accepted: 7 July 2014 /
+Published: 14 July 2014
+
+Abstract: The power of projection using divergence functions is a major theme in information
+geometry. One version of this is the variational Bayes (VB) method. This paper looks at VB in the
+context of other projection-based methods in information geometry. It also describes how to apply
+VB to the regime-switching log-normal model and how it provides a computationally fast solution to
+quantify the uncertainty in the model specification. The results show that the method can recover
+the model structure exactly, gives reasonable point estimates and is computationally very efficient.
+The potential problems of the method in quantifying the parameter uncertainty are discussed.
+
+Keywords: information geometry; variational Bayes; regime-switching log-normal model; model
+selection; covariance estimation
+
+1. Introduction
+
+While, in principle, the calculation of the posterior distribution is mathematically straightforward,
+in practice, the computation of many of its features, such as posterior densities, normalizing constants
+and posterior moments, is a major challenge in Bayesian analysis. Such computations typically
+involve high dimensional integrals, which often have no analytical or tractable forms. The variational
+Bayes (VB) method was developed to generate tractable approximations to these quantities. This
+method provides analytic approximations to the posterior distribution by minimizing the Kullback–
+Leibler (KL) divergence from the approximations to the actual posterior and has been demonstrated to
+be computationally very fast.
+VB gains its computational advantages by making simplifying assumptions about the posterior
+dependence structure.
+For example, in the simplest form, it assumes posterior independence
+between selected sets of parameters. Under these assumptions, the resultant approximate posterior
+is either known analytically or can be computed by a simple iteration algorithm similar to the
+Expectation-Maximization (EM) algorithm. In this paper, we show that, as well as having advantages
+of computational speed, the VB algorithm does an excellent job of model selection, in particular in
+finding the appropriate number of regimes.
+While the simplification in the dependence gives computational advantages, it also comes at
+a cost. For example, we also found that the posterior variance may be underestimated. In [1], we
+propose a novel method to compute the true posterior covariance matrix by only using the information
+obtained from VB approximations.
+The use of projections to particular families is, of course, not new to information geometry (IG).
+In [2], we find the famous Pythagorean results concerning projection using α-divergences to α-families,
+and other important results on projections based on divergences can be found in [3] and [4] (Chapter 7).
+
+Entropy 2014, 16, 3832–3847; doi:10.3390/e16073832
+www.mdpi.com/journal/entropy
+
+1.1. Variational Bayes
+
+Suppose, in a Bayesian inference problem, that we use q(τ) to approximate the posterior p(τ|y),
+where y is the data and τ = {τ1, · · · , τp} the model parameter vector. The KL divergence between
+them is defined as,
+
+\[ \mathrm{KL}\left[ q(\tau) \,\|\, p(\tau|y) \right] = \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)}\, d\tau, \tag{1} \]
+
+provided the integral exists. We want to balance two things, having the discrepancy between p and q
+small, while keeping q tractable. Hence, we want to seek q(τ), which minimizes Equation (1), while
+keeping q(τ) in an analytically tractable form. First, note that the evaluation of Equation (1) requires
+p(τ|y), which may be unavailable, since in the general Bayesian problem, its normalizing constant is
+one of the main intractable integrals. However, we note that:
+
+\[ \mathrm{KL}\left[ q(\tau) \,\|\, p(\tau|y) \right] = \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)\,p(y)}\, d\tau + \log p(y) = -\int q(\tau) \log \frac{p(\tau, y)}{q(\tau)}\, d\tau + \log p(y). \tag{2} \]
+
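+Since $\log p(y)$ does not depend on $q$, minimizing Equation (1) is equivalent to maximizing the integral
+$\int q(\tau) \log \frac{p(\tau, y)}{q(\tau)}\, d\tau$, which appears with a negative sign as the first term of the right-hand side of
+Equation (2). The key computational point is that, often, the term $p(\tau, y)$ is available even when the
+full posterior $p(\tau, y) / \int p(\tau, y)\, d\tau$ is not.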
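+
+The decomposition in Equation (2) can be checked numerically on a toy problem. The sketch below assumes a parameter τ taking only three values, with made-up joint values p(τ, y) at the observed y; the numbers and function names are illustrative assumptions only.

```python
import math

# Hypothetical unnormalised joint p(tau, y) for three values of tau at the observed y.
joint = [0.10, 0.25, 0.05]
# An arbitrary approximating distribution q(tau) on the same three values.
q = [0.2, 0.5, 0.3]

def evidence(joint):
    # p(y) = sum over tau of p(tau, y); the usually intractable normalising constant.
    return sum(joint)

def free_energy(q, joint):
    # Sum of q(tau) * log( p(tau, y) / q(tau) ): needs only the joint, not p(y).
    return sum(qi * math.log(pi / qi) for qi, pi in zip(q, joint) if qi > 0)

def kl_to_posterior(q, joint):
    # KL[q || p(tau | y)], computed directly from the normalised posterior.
    p_y = evidence(joint)
    return sum(qi * math.log(qi / (pi / p_y)) for qi, pi in zip(q, joint) if qi > 0)

kl = kl_to_posterior(q, joint)
gap = math.log(evidence(joint)) - free_energy(q, joint)
```

+Here `kl` and `gap` coincide, so maximising `free_energy` over q is the same as minimising the KL divergence, and `free_energy` never exceeds log p(y).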
+
+Definition 1. Let $F(q) = \int q(\tau) \log \frac{p(\tau, y)}{q(\tau)}\, d\tau$ and:
+
+\[ \hat{q} = \arg\max_{q \in Q} F(q), \tag{3} \]
+
+where $Q$ is a predetermined set of probability density functions over the parameter space. Then $\hat{q}$ is called the
+variational approximation or variational posterior distribution, and functions of $\hat{q}$ (such as mean, variance, etc.)
+are called variational parameters.
+
+Some of the power of Definition 1 comes when we assume that all elements of Q have tractable
+posteriors. In that case, all variational parameters will then also be tractable when the optimization
+can be achieved. A prime example of a choice for Q is the set of all densities that factorize as
+
+\[ q(\tau) = \prod_{i=1}^{d} q_i(\tau_i). \]
+
+This reduces the computational problem from computing a high dimensional integral to one of
+computing a number of one-dimensional ones. Furthermore, as we see in the example of this paper, it
+is often the case that the variational families are standard exponential families (since they are often
+‘maximum entropy models’ in some sense), and the optimisation problem (3) can be solved by simple
+iterative methods with very fast convergence.
+The method builds on the principle of variational free energy minimization in physics, which belongs to
+the calculus of variations, the study of the maxima and minima of a functional over
+a class of functions, and the method gains its name from this root. Early developments of the method
+can be found in machine learning, especially in applications on neural networks [5,6]. The method
+has been successfully applied in many different disciplines and domains, for example, in independent
+component analysis [7,8], graphical models [9,10], information retrieval [11] and factor analysis [12].
+In the statistical literature, an early application of the variational principle can be found in the
+work of [13] to construct Bayes estimators. In recent years, the method has obtained more attention
+from both the application and theoretical perspective, for example [14–18].
+
+1.2. Regime-Switching Models
+
+In this paper, we illustrate the strengths and weaknesses of VB through a detailed case study.
+In particular, we look at a model that is used in finance, risk management and actuarial science, the
+so-called regime-switching log-normal model (RSLN) proposed, in this context, by [19].
+Switching between different states, or regimes, is a common phenomenon in many time series,
+and regime-switching models, originally proposed by [20], have been used to model these switching
+processes. As demonstrated in [21], the maximum likelihood estimate (MLE) does not give a simple
+method to deal with parameter uncertainty; for details of this method, see [21]. The asymptotic
+normality of maximum likelihood estimators may not apply for sample sizes commonly found in
+practice. Hence, to understand parameter uncertainty, [21] considered the RSLN model in a Bayesian
+framework using the Metropolis–Hastings algorithm. Furthermore, model uncertainty, in particular
+selecting the correct number of regimes, is a major issue. Hence, model selection criteria have to be
+used to choose the “best” model. Hardy [19] found that a two-regime RSLN model maximized the
+Bayes information criterion (BIC) [22] for both monthly TSE 300 total return data and S&P 500 total
+return data; however, according to the Akaike information criterion (AIC) [23], a three-regime model
+was optimal for the S&P data. To account for the model uncertainty associated with the number of
+regimes, [24] offered a trans-dimensional model using reversible jump MCMC [25]. We note that BIC
+is not necessarily ideal for model selection with state space models [26], although it is still commonly used
+in the literature.
+MCMC methods make possible the computation of all posterior quantities; however, there are a
+number of practical issues associated with their implementation. A primary concern is determining
+that the generated chain has, in fact, “converged”. In practice, MCMC practitioners have to resort
+to convergence diagnostic techniques. Furthermore, the computational cost can be a concern. Other
+implementational issues include the difficulty of making good initialisation choices, the choice between running the
+MCMC algorithm in one long chain or in several shorter chains in parallel, etc. Detailed discussions can
+be found in [27].
+One of the main contributions of this paper is to apply the variational Bayes (VB) method to the
+RSLN model and present a solution to quantify the uncertainty in model specification. The VB method
+is a technique that provides analytical approximations to the posterior quantities, and in practice, it is
+demonstrated to be a much faster alternative to MCMC methods.
+
+2. Variational Bayes and Information Geometry
+
+In this section, we explore the relationship between VB and IG, in particular the statistical
+properties of divergence-based projections onto exponential families. Here, we used the IG of [2], in
+particular the ±1 dual affine parameters for exponential families. One of the most striking results
+from [2] is the Pythagorean property of these dual affine coordinate systems. This is illustrated in
+Figure 1, which shows a schematic representing a model space containing the distribution f0(x) and
+an exponential family f (x; θ).
+
+[Figure: schematic with labels f0(x), f(x; θ), −1-geodesic and +1-geodesic.]
+
+Figure 1. Projections onto an exponential family.
+
+The Pythagorean result comes from using the KL divergence to project onto the exponential family
+$f(x; \theta) = \nu(x) \exp\{s(x)\theta - \psi(\theta)\}$, i.e.,
+
+\[ \min_{\theta} \int -\log \frac{f(x; \theta)}{f_0(x)}\, f_0(x)\, dx. \]
+
+All distributions that project to the same point form a −1-flat space defined by all distributions $f(x)$
+with the same mean, i.e.,
+\[ E_{\hat{\theta}}(s(x)) = E_{f(x)}(s(x)), \]
+
+and further, it is Fisher orthogonal to the +1-flat family f (x; θ). The statistical interpretation of this
+concerns the behaviour of a model f (x, θ) when the data generation process does not lie in the model.
+In contrast to this, we have the VB method, which uses the reverse KL divergence for the projection,
+i.e.,
+
+\[ \min_{\theta} \int \log \frac{f(x; \theta)}{f_0(x)}\, f(x; \theta)\, dx. \]
+
+This results in a Fisher orthogonal projection, shown in Figure 1, but now using a +1-flat family.
+This does not have the property that the mean of s(x) is constant, but as we shall see, it does have nice
+computational properties when used in the context of Bayesian analysis.
+In order to investigate the information geometry of VB, we consider two examples. The first,
+in Section 3.1, is selected to maximally illustrate the underlying geometric issues and to get some
+understanding of the quality of the VB approximation. The second, in Section 3.2, shows an important
+real-world application from actuarial science and is illustrated with simulated and real data.
+
+3. Applications of Variational Bayes
+
+3.1. Geometric Foundation
+
+We consider the simplest model that shows dependence. Let X1, X2 be two binary random
+variables, with distribution π := (π00, π10, π01, π11), where P(X1 = i, X2 = j) = πij, i, j ∈ {0, 1}.
+Further, let the marginal distributions be denoted by π1 = P(X1 = 1), π2 = P(X2 = 1). We want to
+consider the geometry of the VB projection from a general distribution to the family of independent
+distributions. This represents the way that VB gains its computational advantages by simplifying the
+posterior dependence structure.
+The model space is illustrated in Figure 2, where π is represented by a point in the three simplex,
+and the independence surface, where π00π11 = π10π01, is also shown.
+
+[Figure: three-dimensional plot with axes x, y and z.]
+
+Figure 2. Space of distributions with independence surface: marginal probability and dependence.
+
+Both the interior of the simplex and independence surface are exponential families, and it is
+convenient to use the natural parameters for the interior of the simplex:
+
+\[ \xi_1 = \log \frac{\pi_{10}}{\pi_{00}}, \quad \xi_2 = \log \frac{\pi_{01}}{\pi_{00}}, \quad \xi_3 = \log \frac{\pi_{11}\pi_{00}}{\pi_{10}\pi_{01}}, \]
+
+where the independence surface is given by $\xi_3 = 0$. The independence surface can also be
+parameterised by the marginal distributions $\pi_1, \pi_2$ or the corresponding natural parameters
+$\xi^{ind}_i := \log(\pi_i/(1 - \pi_i))$. Any distribution, $\pi$, represented in natural parameters by $(\xi_1, \xi_2, \xi_3)$, has its VB
+approximation defined implicitly by the simultaneous equations:
+
+\[ \xi^{ind}_1(\pi_1) = \xi_1 + \xi_3 \pi_2, \tag{4} \]
+\[ \xi^{ind}_2(\pi_2) = \xi_2 + \xi_3 \pi_1. \tag{5} \]
+
+These can be solved, as is typical with VB methods, by iterating updated estimates of π1 and π2
+across the two equations. We show this in a realistic example in the following section.
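+
+The iteration across Equations (4) and (5) can be sketched as follows. This is a minimal illustration under stated assumptions: the starting values, tolerance, example probabilities and function names are all hypothetical, and the inverse of the logit map is used to turn each natural parameter back into a marginal probability.

```python
import math

def sigmoid(x):
    # Inverse of logit: maps a natural parameter back to a probability.
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def natural_parameters(pi00, pi10, pi01, pi11):
    # Natural parameters of the 2x2 table, with pi_ij = P(X1 = i, X2 = j).
    xi1 = math.log(pi10 / pi00)
    xi2 = math.log(pi01 / pi00)
    xi3 = math.log(pi11 * pi00 / (pi10 * pi01))
    return xi1, xi2, xi3

def vb_independent_fit(xi1, xi2, xi3, tol=1e-12, max_iter=10000):
    # Alternate between Equation (4) and Equation (5) until the marginals stop changing.
    p1, p2 = 0.5, 0.5
    for _ in range(max_iter):
        new_p1 = sigmoid(xi1 + xi3 * p2)      # Equation (4)
        new_p2 = sigmoid(xi2 + xi3 * new_p1)  # Equation (5)
        converged = abs(new_p1 - p1) < tol and abs(new_p2 - p2) < tol
        p1, p2 = new_p1, new_p2
        if converged:
            break
    return p1, p2

# Hypothetical dependent distribution (pi00, pi10, pi01, pi11).
xi = natural_parameters(0.5, 0.1, 0.15, 0.25)
vb_p1, vb_p2 = vb_independent_fit(*xi)
```

+When $\xi_3 = 0$, the iteration returns the true marginals after one pass; for dependent distributions, the fitted marginals generally differ from the true ones, which is the distortion of the mean discussed in the text.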
+Having seen the VB solution in this simple model, we can investigate the quality of the
+approximation. If we were using the forward KL projection, as proposed by [2], then the mean would
+be preserved by the approximation, while, of course, the variance structure would be distorted. In the case of
+using the reverse KL projection, as used by VB, the mean will not be preserved, but in this example,
+we can investigate the distortion explicitly. Let (ξ1(α), ξ2(α), ξ3(α)) be a +1-geodesic, which cuts
+the independence surface orthogonally and is parameterised by α, where α = 0 corresponds to the
+independence surface. In this example, all such geodesics can be computed explicitly. Figure 3 shows
+the distortion associated with the VB approximation. In the left-hand panel, we show the mean, which
+is the marginal probability, P(X1 = 1), for all points on the orthogonal geodesic. We see, as expected,
+that this is not constant, but it is locally constant at α = 0, showing that the distortion of the mean can
+be small near the independence surface. The right-hand panel shows the dependence, as measured
+by the log-odds, for points on the geodesic. As expected, the VB does not preserve the dependence
+structure; indeed, it is designed to exploit the simplification of the dependence structure.
+
+
+Figure 3. Distortion implied by variational Bayes (VB) approximation.
+
+3.2. Variational Bayes for the RSLN Model
+
+The regime-switching log-normal model [19] with a fixed finite number, $K$, of regimes can be
+described as a bivariate discrete time process with the observed data sequence $w_{1:T} = \{w_t\}_{t=1}^{T}$ and
+the unobserved regime sequence $S_{1:T} = \{S_t\}_{t=1}^{T}$, where $S_t \in \{1, \cdots, K\}$ and $T$ is the number of
+observations. The logarithm of $w_t$, denoted by $y_t = \log w_t$, is assumed normally distributed, having
+mean $\mu_i$ and variance $\sigma_i^2$ both dependent on the hidden regime $S_t$. The sequence of $S_{1:T}$ is assumed
+to follow a first order Markov chain having transition probabilities $A = (a_{ij})$ with the probabilities
+$\pi = (\pi_i)_{i=1}^{K}$ to start the first regime.
+The RSLN model is a special case of more general state-space models, which were studied in
+detail by [28]. In this paper, we use this model and simulated and real data to illustrate the VB method
+in practice. We also calibrate its performance by referring to [24], which used MCMC methods to fit the
+same model to the same data. Here, we regard the MCMC analysis as a form of “gold standard”,
+but at the cost of being orders of magnitude slower than VB in computational time.
+In the Bayesian framework, we use a symmetric Dirichlet prior for $\pi$, that is $p(\pi) = \mathrm{Dir}(\pi; C^{\pi}/K, \cdots, C^{\pi}/K)$,
+for $C^{\pi} > 0$. Let $a_i$ denote the $i$-th row vector of $A$. The prior for $A$ is
+chosen as $p(A) = \prod_{i=1}^{K} p(a_i) = \prod_{i=1}^{K} \mathrm{Dir}(a_i; C^{A}/K, \cdots, C^{A}/K)$, for $C^{A} > 0$, and the prior distribution for
+$\{(\mu_i, \sigma_i^2)\}_{i=1}^{K}$ is chosen to be normal-inverse gamma,
+$p(\{\mu_i, \sigma_i^2\}_{i=1}^{K}) = \prod_{i=1}^{K} N(\mu_i | \sigma_i^2; \gamma, \sigma_i^2/\eta^2)\, IG(\sigma_i^2; \alpha, \beta)$.
+In the above setting, $C^{\pi}$, $C^{A}$, $\gamma$, $\eta^2$, $\alpha$ and $\beta$ are hyper-parameters. Thus, the joint posterior distribution
+of $\pi$, $A$, $\{\mu_i, \sigma_i^2\}_{i=1}^{K}$ and $S_{1:T}$ is $P(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T} | y_{1:T})$ and is proportional to:
+
+\[ p(S_1|\pi) \prod_{t=1}^{T-1} p(S_{t+1}|S_t; A) \prod_{t=1}^{T} p(y_t|S_t; \{\mu_i, \sigma_i^2\}_{i=1}^{K})\, p(\pi)\, p(A)\, p(\{\mu_i, \sigma_i^2\}_{i=1}^{K}). \tag{6} \]
+
+This posterior distribution and its corresponding marginal posterior distributions are analytically
+intractable. In VB, we seek an approximation of Equation (6), denoted by $q(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T})$,
+for which we want to balance two things: having the discrepancy between Equation (6) and $q$ small,
+while keeping $q$ tractable. In general, there are two ways to choose $q$. The first is to specify a particular
+distributional family for $q$, for example the multivariate normal distribution. The other is to choose
+$q$ with a simpler dependency structure than that of Equation (6); for example, we choose $q$, which
+factorizes as:
+
+\[ q(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}) = q(\pi) \prod_{i=1}^{K} q(a_i) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2)\, q(S_{1:T}). \tag{7} \]
+
+The Kullback–Leibler (KL) divergence [29] can be used as the measure of dissimilarity between
+Equations (6) and (7). For succinctness, we denote $\tau = (\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T})$; thus the KL divergence
+is defined as:
+
+\[ \mathrm{KL}(q(\tau) \,\|\, p(\tau|y)) = \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)}\, d\tau. \tag{8} \]
+
+Note that the evaluation of Equation (8) requires p(τ|y), which is unavailable. However, we note that:
+
+\[ \mathrm{KL}(q(\tau) \,\|\, p(\tau|y)) = \log p(y) - \int q(\tau) \log \frac{p(\tau, y)}{q(\tau)}\, d\tau. \]
+
+Given the factorization Equation (7), this can be written as:
+
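+\[ \mathrm{KL}(q(\tau) \,\|\, p(\tau|y)) = \log p(y) - \int \sum_{S_{1:T}} q(\pi)\, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2)\, q(S_{1:T}) \log \frac{p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T})}{q(\pi)\, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2)\, q(S_{1:T})}\, d\pi\, dA\, d\{\mu_i, \sigma_i^2\}_{i=1}^{K}. \]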
+
+Consider first the q(π) term. The right-hand side can be rearranged as:
+
+\[ \mathrm{KL}\left( q(\pi) \,\Big\|\, \frac{\exp\left\{ \int \sum_{S_{1:T}} q(S_{1:T})\, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2) \log p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T})\, dA\, d\{\mu_i, \sigma_i^2\}_{i=1}^{K} \right\}}{Z_{\pi}} \right) + K_{\pi}, \tag{9} \]
+
+where:
+
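+\[ K_{\pi} = \int \sum_{S_{1:T}} q(S_{1:T})\, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2) \log \left[ q(S_{1:T})\, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2) \right] dA\, d\{\mu_i, \sigma_i^2\}_{i=1}^{K} - \log Z_{\pi} + \log p(y), \]
+
+and $Z_{\pi}$ is a normalizing term. The first term of Equation (9) is the only term that depends on $q(\pi)$.
+Thus, the minimum value of $\mathrm{KL}(q(\tau) \,\|\, p(\tau|y))$ with respect to $q(\pi)$ is achieved when this term equals zero. Hence, we
+obtain:
+
+\[ q(\pi) = \frac{\exp\left\{ \int \sum_{S_{1:T}} q(S_{1:T})\, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2) \log p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T})\, dA\, d\{\mu_i, \sigma_i^2\}_{i=1}^{K} \right\}}{Z_{\pi}}. \tag{10} \]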
+
+Given the joint distribution $p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T})$ in the form of Equation (6), the
+straightforward evaluation of Equation (10) results in:
+
+\[ q(\pi) \propto \prod_{i=1}^{K} \pi_i^{C^{\pi}/K + w^{s}_i - 1} = \mathrm{Dir}(\pi; w^{\pi}_1, \cdots, w^{\pi}_K); \quad w^{\pi}_i = \frac{C^{\pi}}{K} + w^{s}_i, \quad w^{s}_i = E_{q(S_{1:T})}[S_{1,i}], \tag{11} \]
+
+where S1,i = 1, if the process is in state i at time 1, and zero otherwise.
+
+Similarly, we can rearrange Equation (9) with respect to $\{q(a_i)\}_{i=1}^{K}$, $\{q(\mu_i|\sigma_i^2)\}_{i=1}^{K}$, $\{q(\sigma_i^2)\}_{i=1}^{K}$ and
+$q(S_{1:T})$, respectively, and, using the same arguments, obtain:
+
+\[ q(A) = \prod_{i=1}^{K} \mathrm{Dir}(a_i; w^{A}_{i1}, \ldots, w^{A}_{iK}); \quad w^{A}_{ij} = \frac{C^{A}}{K} + v^{s}_{ij}, \tag{12} \]
+
+\[ q(\mu_i|\sigma_i^2) = N\left( \gamma'_i, \frac{\sigma_i^2}{\kappa_i} \right), \quad \gamma'_i = \frac{\eta^2 \gamma + p^{s}_i}{\eta^2 + q^{s}_i}, \quad \kappa_i = \eta^2 + q^{s}_i, \tag{13} \]
+
+\[ q(\sigma_i^2) = IG\left( \alpha'_i, \beta'_i \right), \quad \alpha'_i = \alpha + \frac{q^{s}_i}{2}, \quad \beta'_i = \beta + \frac{r^{s}_i}{2} + \frac{\eta^2}{2} (\gamma'_i - \gamma)^2, \tag{14} \]
+
+\[ q(S_{1:T}) = \frac{1}{\tilde{Z}} \prod_{i=1}^{K} (\pi^{*}_i)^{S_{1,i}} \prod_{t=1}^{T-1} \prod_{i=1}^{K} \prod_{j=1}^{K} (a^{*}_{ij})^{S_{t,i} S_{t+1,j}} \prod_{t=1}^{T} \prod_{i=1}^{K} (\theta^{*}_{i,t})^{S_{t,i}}, \tag{15} \]
+
+where $S_{t,i} = 1$, if the process is in state $i$ at time $t$, and zero otherwise, and with
+$\pi^{*}_i = \exp\left( E_{q(\pi)}[\log \pi_i] \right)$, $a^{*}_{ij} = \exp\left( E_{q(A)}[\log a_{ij}] \right)$, $\theta^{*}_{i,t} = \exp\left( E_{q(\mu_i|\sigma_i^2) q(\sigma_i^2)}[\log \phi_i(y_t)] \right)$,
+$v^{s}_{ij} = \sum_{t=1}^{T-1} E_{q(S_{1:T})}[S_{t,i} S_{t+1,j}]$, $p^{s}_i = \sum_{t=1}^{T} E_{q(S_{1:T})}[S_{t,i}]\, y_t$,
+$q^{s}_i = \sum_{t=1}^{T} E_{q(S_{1:T})}[S_{t,i}]$, $r^{s}_i = \sum_{t=1}^{T} (\gamma'_i - y_t)^2 E_{q(S_{1:T})}[S_{t,i}]$. Here, $\psi$ is the digamma function, $\phi$ is the normal
+density function and the exact functional forms used in the updates are shown in Algorithm 1.
+
+Algorithm 1 Variational Bayes algorithm for the regime-switching log-normal (RSLN) model.
+
+Initialize $w_i^{s(0)}$, $v_{ij}^{s(0)}$, $p_i^{s(0)}$, $q_i^{s(0)}$ and $r_i^{s(0)}$ at step 0
+while $w_i^{\pi(t-1)}$, $w_{ij}^{A(t-1)}$, $\gamma_i^{\prime(t-1)}$, $\alpha_i^{\prime(t-1)}$, $\beta_i^{\prime(t-1)}$, $\pi_i^{*(t-1)}$, $a_{ij}^{*(t-1)}$ and $\theta_{i,t}^{*(t-1)}$ do not converge do
+
+1. Compute $w_i^{\pi(t)}$, $w_{ij}^{A(t)}$, $\gamma_i^{\prime(t)}$, $\kappa_i^{(t)}$, $\alpha_i^{\prime(t)}$ and $\beta_i^{\prime(t)}$ at step $t$ by:
+
+\[
+w_i^{\pi(t)} = \frac{C^{\pi}}{K} + w_i^{s(t-1)}, \quad
+w_{ij}^{A(t)} = \frac{C^{A}}{K} + v_{ij}^{s(t-1)}, \quad
+\gamma_i^{\prime(t)} = \frac{\eta^2\gamma + p_i^{s(t-1)}}{\eta^2 + q_i^{s(t-1)}},
+\]
+
+\[
+\kappa_i^{(t)} = \eta^2 + q_i^{s(t-1)}, \quad
+\alpha_i^{\prime(t)} = \alpha + \frac{q_i^{s(t-1)}}{2}, \quad
+\beta_i^{\prime(t)} = \beta + \frac{r_i^{s(t-1)}}{2} + \frac{\eta^2}{2}\left(\gamma_i^{\prime(t)} - \gamma\right)^2
+\]
+
+2. Compute $\pi_i^{*(t)}$, $\theta_{i,t}^{*(t)}$ and $a_{ij}^{*(t)}$ at step $t$ by:
+
+\[
+\pi_i^{*(t)} = \exp\left[\psi\left(w_i^{\pi(t)}\right) - \psi\left(\sum_{i} w_i^{\pi(t)}\right)\right], \quad
+a_{ij}^{*(t)} = \exp\left[\psi\left(w_{ij}^{A(t)}\right) - \psi\left(\sum_{j=1} w_{ij}^{A(t)}\right)\right],
+\]
+
+\[
+\theta_{i,t}^{*(t)} = \exp\left[-\frac{1}{2}\log 2\pi - \frac{1}{2}\left(\log \beta_i^{\prime(t)} - \psi\left(\alpha_i^{\prime(t)}\right)\right) - \frac{1}{2}\left(\left(y_t - \gamma_i^{\prime(t)}\right)^2 \frac{\alpha_i^{\prime(t)}}{\beta_i^{\prime(t)}} + \frac{1}{\kappa_i^{(t)}}\right)\right]
+\]
+
+3. Compute $w_i^{s(t)}$, $v_{ij}^{s(t)}$, $p_i^{s(t)}$, $q_i^{s(t)}$ and $r_i^{s(t)}$ at step $t$ by:
+
+\[
+w_i^{s(t)} = E_{q^{(t)}(S_{1:T})}[S_{1,i}], \quad
+v_{ij}^{s(t)} = \sum_{t=1}^{T-1} E_{q^{(t)}(S_{1:T})}\left[S_{t,i}S_{t+1,j}\right], \quad
+p_i^{s(t)} = \sum_{t=1}^{T} E_{q^{(t)}(S_{1:T})}[S_{t,i}]\,y_t,
+\]
+
+\[
+q_i^{s(t)} = \sum_{t=1}^{T} E_{q^{(t)}(S_{1:T})}[S_{t,i}], \quad
+r_i^{s(t)} = \sum_{t=1}^{T} \left(\gamma_i^{\prime(t)} - y_t\right)^2 E_{q^{(t)}(S)}[S_{t,i}]
+\]
+
+$t \Leftarrow t + 1$
+end while
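+
+Step 1 of Algorithm 1 for the emission parameters is plain arithmetic on the expected sufficient statistics, and a minimal Python sketch for a single regime is given here. The function name and argument names are illustrative assumptions (they do not come from the paper); `eta2`, `gamma0`, `alpha0` and `beta0` stand for the prior constants written η², γ, α and β above.
```python
def update_emission_hyperparameters(eta2, gamma0, alpha0, beta0, p_s, q_s, r_s):
    """Step-1 update of Algorithm 1 for one regime i (illustrative sketch).

    eta2, gamma0, alpha0, beta0: prior constants (eta^2, gamma, alpha, beta).
    p_s, q_s, r_s: expected sufficient statistics from the previous iteration.
    Returns (gamma', kappa, alpha', beta') as in Equations (13) and (14).
    """
    kappa = eta2 + q_s
    gamma_new = (eta2 * gamma0 + p_s) / kappa
    alpha_new = alpha0 + q_s / 2.0
    beta_new = beta0 + r_s / 2.0 + 0.5 * eta2 * (gamma_new - gamma0) ** 2
    return gamma_new, kappa, alpha_new, beta_new


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to make the arithmetic easy to follow.
    print(update_emission_hyperparameters(1.0, 0.0, 2.0, 1.0, 3.0, 2.0, 4.0))
```
+In a full implementation this function would be called once per regime inside the while loop, before the digamma-based quantities of Step 2 are recomputed.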
+
+The VB method proceeds, as was shown with the simple Equations (4) and (5), by iteratively
+updating the variational parameters to solve a set of simultaneous equations. In this example, the
+update equations for the variables $\pi$, $A$, $\{\mu_i, \sigma_i^2\}_{i=1}^{K}$ and $S_{1:T}$ are given explicitly by Algorithm 1. For the
+initialisation, we choose symmetric values for most of the parameters and choose random values for
+others, as appropriate. For this example, this worked very satisfactorily, although we note that for more
+general state space models, [28] states that finding good initial values can be non-trivial.
+
+3.3. Interpretation of Results
+
+First, all approximating distributions above turn out to lie in well-known parametric families.
+The only unknown quantities are the parameters of these distributions, which are often called the
+variational parameters.
+The evaluation of the parameters of $q(\pi)$, $q(A)$, $q(\mu_i|\sigma_i^2)$ and $q(\sigma_i^2)$ requires knowledge of $q(S_{1:T})$,
+and the evaluation of $\pi_i^*$, $a_{ij}^*$ and $\theta_{i,t}^*$ requires knowledge of $q(\pi)$, $q(A)$, $q(\mu_i|\sigma_i^2)$ and $q(\sigma_i^2)$. This
+structure leads to an iterative updating scheme, described in Algorithm 1.
+The main computational effort in Algorithm 1 is computing $E_{q(S_{1:T})}[S_{t,i}]$ and $E_{q(S_{1:T})}[S_{t,i}S_{t+1,j}]$,
+which have no simple tractable forms. We note that the distributional form of $q(S_{1:T})$ has a very
+similar structure to the conditional distribution $p(S_{1:T}|Y_{1:T}, \tau)$, for which the forward-backward
+algorithm [30] is commonly used to compute $E_{p(S_{1:T}|Y_{1:T},\tau)}[S_{t,i}|Y_{1:T}, \tau]$ and $E_{p(S_{1:T}|Y_{1:T},\tau)}[S_{t,i}S_{t+1,j}|Y_{1:T}, \tau]$.
+Therefore, we also use the forward-backward algorithm to compute $E_{q(S_{1:T})}[S_{t,i}]$ and $E_{q(S_{1:T})}[S_{t,i}S_{t+1,j}]$.
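+
+The forward-backward computation mentioned above can be sketched in a few lines of Python. This is a generic scaled forward-backward pass, not the authors' code: the function name and the argument layout (`pi` for $\pi_i^*$, `A` for $a_{ij}^*$, `theta[t][i]` for $\theta_{i,t}^*$) are assumptions made for illustration. Because each step is renormalized, the inputs only need to be positive weights, which is the situation for $q(S_{1:T})$ in Equation (15).
```python
def forward_backward(pi, A, theta):
    """Scaled forward-backward pass over unnormalized positive weights.

    Returns gamma[t][i] = E[S_{t,i}] and xi_sum[i][j] = sum_t E[S_{t,i} S_{t+1,j}].
    """
    T, k = len(theta), len(pi)
    alpha = [[0.0] * k for _ in range(T)]
    beta = [[1.0] * k for _ in range(T)]
    scale = [0.0] * T

    # Forward pass, normalizing each step to avoid underflow.
    alpha[0] = [pi[i] * theta[0][i] for i in range(k)]
    scale[0] = sum(alpha[0])
    alpha[0] = [a / scale[0] for a in alpha[0]]
    for t in range(1, T):
        for j in range(k):
            alpha[t][j] = theta[t][j] * sum(alpha[t - 1][i] * A[i][j] for i in range(k))
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass, reusing the forward scaling constants.
    for t in range(T - 2, -1, -1):
        for i in range(k):
            beta[t][i] = sum(A[i][j] * theta[t + 1][j] * beta[t + 1][j]
                             for j in range(k)) / scale[t + 1]

    gamma = [[alpha[t][i] * beta[t][i] for i in range(k)] for t in range(T)]
    xi_sum = [[0.0] * k for _ in range(k)]
    for t in range(T - 1):
        for i in range(k):
            for j in range(k):
                xi_sum[i][j] += (alpha[t][i] * A[i][j] * theta[t + 1][j]
                                 * beta[t + 1][j] / scale[t + 1])
    return gamma, xi_sum
```
+Under these assumptions, `gamma` supplies the expectations needed for $w^s_i$, $p^s_i$, $q^s_i$ and $r^s_i$, and `xi_sum` supplies $v^s_{ij}$.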
+
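+Since the conditional distribution $q(\mu_i|\sigma_i^2)$ is $N\left(\mu_i|\sigma_i^2; \gamma_i', \frac{\sigma_i^2}{\kappa_i}\right)$, the marginal distribution of $\mu_i$
+is the location-scale $t$ distribution, denoted as $t_{2\alpha_i'}\left(\mu_i; \gamma_i', \frac{\kappa_i}{\beta_i'/\alpha_i'}\right)$, where the density function of $t_\nu(x; \mu, \lambda)$
+is defined as
+
+\[
+p(x|\nu, \mu, \lambda) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \left(\frac{\lambda}{\pi\nu}\right)^{\frac{1}{2}} \left(1 + \frac{\lambda(x-\mu)^2}{\nu}\right)^{-\frac{\nu+1}{2}},
+\]
+
+for $x, \mu \in (-\infty, +\infty)$ and $\nu, \lambda > 0$.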
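+
+As a quick numerical companion to the density above, the following Python sketch evaluates $t_\nu(x; \mu, \lambda)$ and its standard deviation. The function names are illustrative assumptions; the variance formula $\nu/((\nu-2)\lambda)$ is the standard result for this parameterization and holds for $\nu > 2$.
```python
import math


def t_density(x, nu, mu, lam):
    """Density of the location-scale t distribution t_nu(x; mu, lam)."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                + 0.5 * math.log(lam / (math.pi * nu)))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(lam * (x - mu) ** 2 / nu))


def t_std(nu, lam):
    """Standard deviation sqrt(nu / ((nu - 2) * lam)); requires nu > 2."""
    return math.sqrt(nu / ((nu - 2.0) * lam))


if __name__ == "__main__":
    # Parameters of the two marginal t distributions reported in Table 5.
    print(t_std(454.61, 370778.19))
    print(t_std(80.39, 12987.55))
```
+With the parameters listed in Table 5, the computed standard deviations agree with the tabulated values 0.00165 and 0.00889 to the precision shown there.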
+
+4. Numerical Studies
+
+4.1. Simulated Data
+
+In this section, we applied the VB solutions to four sets of simulated data, which are used in [24].
+Through these simulated studies, we will test the performance of VB on detecting the number of
+regimes and compare it with those of the BIC and the MCMC methods [24]. For this paper, we present
+only an initial study with a relatively small number of datasets. The results are highly promising,
+but more extensive studies are needed to draw comprehensive conclusions. Furthermore, see [28] for
+general results on VB in hidden state space models.
+To estimate the number of regimes, we construct a matrix, called the relative magnitude matrix
+(RMM), defined as $A' = [\hat{a}'_{ij}]$, where $\hat{a}'_{ij} = w^A_{ij}/w^A_0$, $w^A_0 = \sum_{i=1}^{K}\sum_{j=1}^{K} w^A_{ij}$ and $w^A_{ij}$ is the parameter of $q(A)$.
+Our model selection procedure is to fit a VB with a large number of regimes and to examine the rows
+and columns in the RMM. If the values of the entries in the $i$-th row and the $i$-th column of $A'$ are
+all equal to $\frac{C^A/K}{T-1+C^A \times K}$, then we declare regime $i$ nonexistent. This method is validated by the
+following observations. It can be shown that the term $v^s_{ij}$ in $w^A_{ij}$ is equal to the expected number of times
+the process leaves regime $i$ and enters regime $j$. Therefore, for the $i$-th regime, values of zero for
+all of $v^s_{ji}$ and $v^s_{ij}$ with $j = 1, \cdots, K$ indicate that there is no transition process entering or leaving regime $i$.
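+
+The RMM-based selection rule can be illustrated with a short Python sketch. It assumes the form $w^A_{ij} = C^A/K + v^s_{ij}$ from Equation (12); the function names, the value of `c_a` and the relative tolerance are hypothetical choices for illustration (the text only states the exact-equality criterion).
```python
import math


def relative_magnitude_matrix(v, c_a):
    """Build the RMM from expected transition counts v[i][j] and constant C^A."""
    K = len(v)
    w = [[c_a / K + v[i][j] for j in range(K)] for i in range(K)]
    w0 = sum(sum(row) for row in w)
    return [[w_ij / w0 for w_ij in row] for row in w]


def nonexistent_regimes(rmm, c_a, T, rel_tol=1e-6):
    """Indices of regimes whose whole row and column sit at the baseline value."""
    K = len(rmm)
    baseline = (c_a / K) / (T - 1 + c_a * K)
    missing = []
    for i in range(K):
        row_and_col = rmm[i] + [rmm[j][i] for j in range(K)]
        if all(math.isclose(a, baseline, rel_tol=rel_tol) for a in row_and_col):
            missing.append(i)
    return missing


if __name__ == "__main__":
    # Hypothetical expected transition counts for T = 11 (they sum to T - 1).
    v = [[6, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 2]]
    rmm = relative_magnitude_matrix(v, c_a=0.1)
    print(nonexistent_regimes(rmm, c_a=0.1, T=11))
```
+In this toy input only the first and last regimes carry any transition mass, so the two middle regimes (indices 1 and 2) are flagged as nonexistent.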
+Table 1 specifies the parameters for the four cases, and we generate 671 observations for each
+case (equal to the number of months from January 1956 to September 2011). The parameters used in
+Case 1 are identical to the maximum likelihood estimates for TSX monthly return data from 1956 to
+1999 [19]. Case 2 only has one regime present. Case 3 is similar to Case 1, but the two regimes have the
+same mean. Case 4 adds a third regime. For each case, we use MLE to fit a one-regime, two-regime,
+three-regime and four-regime RSLN model and report the corresponding BIC and log-likelihood scores.
+We then misspecify the number of regimes and run a four-regime VB algorithm.
+
+Table 1. Parameters of the simulated data.
+
+\begin{tabular}{ccccc}
+\hline
+Case & Regime 1 $(\mu_i, \sigma_i)$ & Regime 2 $(\mu_i, \sigma_i)$ & Regime 3 $(\mu_i, \sigma_i)$ & Transition Probability \\
+\hline
+1 & $(0.012, 0.035)$ & $(-0.016, 0.078)$ & - & $\begin{pmatrix} 0.963 & 0.037 \\ 0.210 & 0.790 \end{pmatrix}$ \\
+2 & $(0.014, 0.050)$ & - & - & - \\
+3 & $(0.000, 0.035)$ & $(0.000, 0.078)$ & - & $\begin{pmatrix} 0.963 & 0.037 \\ 0.210 & 0.790 \end{pmatrix}$ \\
+4 & $(0.012, 0.035)$ & $(-0.016, 0.078)$ & $(0.04, 0.01)$ & $\begin{pmatrix} 0.953 & 0.037 & 0.01 \\ 0.210 & 0.780 & 0.01 \\ 0.80 & 0.190 & 0.01 \end{pmatrix}$ \\
+\hline
+\end{tabular}
+
+Table 2 shows the number of iterations that VB takes to converge in each case and the
+corresponding computational time (on a MacBook, 2 GHz processor). On average, VB converges after
+a hundred iterations and takes about one minute. On the same computer, a $10^4$-iteration Reversible Jump
+MCMC (RJMCMC) will take about 10 h to finish. Using diagnostics, this seemed to be enough for
+convergence, while not being an “unfair” comparison in terms of time with VB. We can see that the
+computational efficiency will be a very attractive feature of the VB method. The results of the BIC with
+the log-likelihood (in parentheses), the relative magnitude matrices and the posterior probabilities
+for the models with the different number of regimes estimated by MCMC (cited from Hartman and
+Heaton [24]) are given in Table 3. In Case 1, the BIC favors the two-regime model. The posterior
+probability estimated by MCMC for the one-regime model is the largest, but there is still a large
+probability for the two-regime model. Note that the prior specification for the number of regimes
+can affect these numbers and is always an issue with these forms of multidimensional MCMC. The
+relative magnitude matrix clearly shows that there are only two regimes whose $\hat{a}'_{ij}$ are not negligible.
+This implies VB removes excess transition and emission processes and discovers the exact number of
+hidden regimes. In Case 2 and Case 3, both VB and the BIC can select the correct number of regimes,
+and the posterior probability for the one-regime model estimated by MCMC is still the largest. In Case
+4, VB does not detect the third regime. The transition probability to this regime is only 0.01, and the
+means and standard deviations of Regime 1 make the rare data from Regime 3 easily merged within
+the data from Regime 1. From Table 3, it is clear that for all of the cases, the log-likelihood always
+increases as the number of regimes increases.
+
+Table 2. Computational efficiency of VB.
+
+\begin{tabular}{lcccc}
+\hline
+ & Case 1 & Case 2 & Case 3 & Case 4 \\
+\hline
+Iterations to converge & 62 & 182 & 132 & 94 \\
+Computational time [s] & 27.161 & 80.842 & 58.510 & 45.044 \\
+\hline
+\end{tabular}
+
+Table 3. The estimated number of regimes by VB, BIC and MCMC.
+
+\begin{tabular}{cccc}
+\hline
+Case & No. of Regimes & MLE: BIC (Log Likelihood) & RJMCMC: Posterior Probability \\
+\hline
+1 & 1 & 1,108.875 (1,115.384) & 0.647 \\
+  & 2 & 1,158.227 (1,174.499) & 0.214 \\
+  & 3 & 1,156.370 (1,182.405) & 0.088 \\
+  & 4 & 1,153.150 (1,188.948) & $<0.052$ \\
+\hline
+2 & 1 & 1,045.448 (1,051.957) & 0.864 \\
+  & 2 & 1,038.360 (1,054.632) & 0.109 \\
+  & 3 & 1,030.733 (1,056.768) & 0.020 \\
+  & 4 & 1,026.882 (1,062.680) & $<0.006$ \\
+\hline
+3 & 1 & 1,110.903 (1,117.411) & 0.629 \\
+  & 2 & 1,139.214 (1,155.486) & 0.221 \\
+  & 3 & 1,131.904 (1,157.719) & 0.098 \\
+  & 4 & 1,121.921 (1,157.940) & $<0.052$ \\
+\hline
+4 & 1 & 1,044.819 (1,051.328) & 0.641 \\
+  & 2 & 1,092.610 (1,108.881) & 0.203 \\
+  & 3 & 1,087.435 (1,113.470) & 0.094 \\
+  & 4 & 1,080.240 (1,116.038) & $<0.06$ \\
+\hline
+\end{tabular}
+
+VB: Relative Magnitude Matrix, by case:
+
+\[
+\text{Case 1: } \begin{pmatrix}
+0.14357 & 0.00004 & 0.00004 & 0.03153 \\
+0.00004 & 0.00004 & 0.00004 & 0.00004 \\
+0.00004 & 0.00004 & 0.00004 & 0.00004 \\
+0.03018 & 0.00004 & 0.00004 & 0.79428
+\end{pmatrix}
+\qquad
+\text{Case 2: } \begin{pmatrix}
+0.99944 & 0.00004 & 0.00004 & 0.00004 \\
+0.00004 & 0.00004 & 0.00004 & 0.00004 \\
+0.00004 & 0.00004 & 0.00004 & 0.00004 \\
+0.00004 & 0.00004 & 0.00004 & 0.00004
+\end{pmatrix}
+\]
+
+\[
+\text{Case 3: } \begin{pmatrix}
+0.11322 & 0.00004 & 0.00004 & 0.02647 \\
+0.00004 & 0.00004 & 0.00004 & 0.00004 \\
+0.00004 & 0.00004 & 0.00004 & 0.00004 \\
+0.02659 & 0.00004 & 0.00004 & 0.83327
+\end{pmatrix}
+\qquad
+\text{Case 4: } \begin{pmatrix}
+0.22643 & 0.00004 & 0.00004 & 0.05518 \\
+0.00004 & 0.00004 & 0.00004 & 0.00004 \\
+0.00004 & 0.00004 & 0.00004 & 0.00004 \\
+0.05377 & 0.00004 & 0.00004 & 0.66417
+\end{pmatrix}
+\]
+
+4.2. Real Data
+
+In this section, we apply the VB solution to the TSX monthly total return index in the period from
+January, 1956, to December, 1999 (528 observations in total and studied in [19,21]).
+A four-regime VB is implemented first. VB converges after 100 iterations, in about 34.284 s (on a
+MacBook, 2 GHz processor). The relative magnitude matrix, given in Table 4, clearly shows that VB
+identifies two regimes. This matches both the BIC- and AIC-based results [19]. Based on these results,
+we then fit a two-regime VB, which converges after 83 iterations in about 14.241 s. Table 5 gives the
+marginal distributions for all of the parameters. Figure 4 presents the corresponding density functions,
+where we can see that all of the plots show a symmetric and bell-shaped pattern.
+
+Table 4. Estimations of the number of regimes for TSX data.
+
+January 1956–December 1999, R. M. M.:
+
+\[
+\begin{pmatrix}
+0.11496 & 0.00005 & 0.00005 & 0.02803 \\
+0.00005 & 0.00005 & 0.00005 & 0.00005 \\
+0.00005 & 0.00005 & 0.00005 & 0.00005 \\
+0.02853 & 0.00005 & 0.00005 & 0.82791
+\end{pmatrix}
+\]
+
+Table 5. The marginal distributions of the parameters estimated by VB.
+
+\begin{tabular}{cccc}
+\hline
+Parameter & Distribution & Mean & s.d. \\
+\hline
+$\mu_1$ & $t_{454.61}(0.0123, 370778.19)$ & 0.0123 & 0.00165 \\
+$\sigma_1^2$ & $\mathrm{IG}(227.30, 0.28)$ & 0.00122 (0.0349) & 0.00008 \\
+$\mu_2$ & $t_{80.39}(-0.0161, 12987.55)$ & $-0.0161$ & 0.00889 \\
+$\sigma_2^2$ & $\mathrm{IG}(40.20, 0.24)$ & 0.00603 (0.0777) & 0.00098 \\
+$p_{1,2}$ & $\mathrm{Beta}(15.21, 434.78)$ & 0.0338 & 0.00851 \\
+$p_{2,1}$ & $\mathrm{Beta}(15.00, 61.21)$ & 0.1969 & 0.04525 \\
+\hline
+\end{tabular}
+
+Transition Probability: $\begin{pmatrix} 0.9662 & 0.0338 \\ 0.1969 & 0.8031 \end{pmatrix}$
+
+Figure 4. The VB marginal distributions of the parameters. (a) $\mu_2$ (left) and $\mu_1$ (right); (b) $\sigma_1^2$ (left) and
+$\sigma_2^2$ (right); (c) $p_{1,2}$ (left) and $p_{2,1}$ (right).
+
+Table 6 (the upper part) gives the maximum likelihood estimates (cited from [19]), mean
+parameters computed by the MCMC method (cited from [21]) and mean parameters computed
+by VB. It clearly shows that the point estimates by VB are very close to those by MLE and MCMC.
+The numbers in parentheses in Table 6 are the standard deviations computed by the three methods,
+respectively. It is worth noting that all of the variances estimated by VB are smaller than those by the
+MLE or MCMC methods. In fact, some other researchers also report the underestimation of posterior
+variance in other VB applications, for example [31,32]. In the paper [1], we look at some diagnostic
+methods that can assess how well the VB approximates the true posterior, particularly with regard to
+its covariance structure. The methods proposed also allow us to generate simple corrections when the
+approximation error is large.
+
+Table 6. Estimates and standard deviations by VB, MLE and MCMC.
+
+\begin{tabular}{lcccccc}
+\hline
+ & $\mu_1$ & $\sigma_1$ & $p_{1,2}$ & $\mu_2$ & $\sigma_2$ & $p_{2,1}$ \\
+\hline
+VB & 0.0123 (0.00165) & 0.0349 (0.00008) & 0.0338 (0.00851) & $-0.0161$ (0.00889) & 0.0777 (0.00098) & 0.1969 (0.04525) \\
+MLE & 0.0123 (0.002) & 0.0347 (0.001) & 0.0371 (0.012) & $-0.0157$ (0.010) & 0.0778 (0.009) & 0.2101 (0.086) \\
+MCMC & 0.0122 (0.002) & 0.0351 (0.002) & 0.0334 (0.012) & $-0.0164$ (0.010) & 0.0804 (0.009) & 0.2058 (0.065) \\
+\hline
+\end{tabular}
+
+5. Conclusions
+
+Variational Bayes can be thought of in terms of information geometry as a projection-based
+approximation technique; it provides a framework to approximate posteriors. We applied this method
+to the regime-switching log-normal model and provide solutions to account for both model uncertainty
+and parameter uncertainty. The numerical results show that our method can recover exactly the
+number of regimes and gives reasonable point estimates. The VB method is also demonstrated to be
+very computationally efficient.
+The application to the TSX monthly total return index data for the period from January 1956 to
+December 1999 confirms similar results in the literature on the number of regimes.
+
+Author Contributions
+
+The article was written by Hui Zhao under the guidance of Paul Marriott. All authors have read
+and approved the final manuscript.
+
+Conflicts of Interest
+
+The authors declare no conflict of interest.
+
+References
+
+1. Zhao, H.; Marriott, P. Diagnostics for variational bayes approximations. 2013, arXiv:1309.5117.
+2. Amari, S.-I. Differential-Geometrical Methods in Statistics; Springer: New York, NY, USA, 1990.
+3. Eguchi, S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat. 1983, 11, 793–803.
+4. Kass, R.; Vos, P. Geometrical Foundations of Asymptotic Inference; Wiley: New York, NY, USA, 1997.
+5. Hinton, G.E.; van Camp, D. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th ACM Conference on Computational Learning Theory, Santa Cruz, CA, USA, 26–28 July 1993; ACM: New York, NY, USA, 1993.
+6. MacKay, D. Developments in Probabilistic Modelling with Neural Networks—Ensemble Learning. In Neural Networks: Artificial Intelligence and Industrial Applications; Springer: London, UK, 1995; pp. 191–198.
+7. Attias, H. Independent Factor Analysis. Neur. Comput. 1999, 11, 803–851.
+8. Lappalainen, H. Ensemble Learning For Independent Component Analysis. In Proceedings of the First International Workshop on Independent Component Analysis, Aussois, France, 11–15 January 1999; pp. 7–12.
+9. Beal, M.; Ghahramani, Z. The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures. Bayesian Stat. 2003, 7, 453–463.
+10. Winn, J. Variational Message Passing and its Applications. Ph.D. Thesis, Department of Physics, University of Cambridge, Cambridge, UK, 2003.
+11. Blei, D.M.; Ng, A.Y.; Jordan, M.I.; Lafferty, J. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
+12. Ghahramani, Z.; Beal, M.J. A Variational Inference for Bayesian Mixtures of Factor Analysers. Adv. Neur. Inf. Process. Syst. 2000, 12, 449–455.
+13. Haff, L.R. The Variational Form of Certain Bayes Estimators. Ann. Stat. 1991, 19, 1163–1190.
+14. Faes, C.; Ormerod, J.T.; Wand, M.P. Variational Bayesian Inference for Parametric and Nonparametric Regression With Missing Data. J. Am. Stat. Assoc. 2011, 106, 959–971.
+15. McGrory, C.; Titterington, D.; Reeves, R.; Pettitt, A.N. Variational Bayes for estimating the parameters of a hidden Potts model. Stat. Comput. 2009, 19, 329–340.
+16. Ormerod, J.T.; Wand, M.P. Gaussian Variational Approximate Inference for Generalized Linear Mixed Models. J. Comput. Graph. Stat. 2011, 21, 1–16.
+17. Hall, P.; Humphreys, K.; Titterington, D.M. On the Adequacy of Variational Lower Bound Functions for Likelihood-Based Inference in Markovian Models with Missing Values. J. R. Stat. Soc. Ser. B 2002, 64, 549–564.
+18. Wang, B.; Titterington, M. Convergence Properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Anal. 2006, 1, 625–650.
+19. Hardy, M.R. A Regime-Switching Model of Long-Term Stock Returns. N. Am. Actuar. J. 2001, 5, 41–53.
+20. Hamilton, J.D. A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle. Econometrica 1989, 57, 357–384.
+21. Hardy, M.R. Bayesian Risk Management for Equity-Linked Insurance. Scand. Actuar. J. 2002, 2002, 185–211.
+22. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
+23. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
+24. Hartman, B.M.; Heaton, M.J. Accounting for regime and parameter uncertainty in regime-switching models. Insur. Math. Econ. 2011, 49, 429–437.
+25. Green, P.J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 1995, 82, 711–732.
+26. Watanabe, S. Algebraic Geometry and Statistical Learning Theory; Cambridge University Press: Cambridge, UK, 2009.
+27. Brooks, S.P. Markov Chain Monte Carlo Method and Its Application. J. R. Stat. Soc. Ser. D 1998, 47, 69–100.
+28. Ghahramani, Z.; Hinton, G.E. Variational learning for switching state-space models. Neur. Comput. 1998, 12, 831–864.
+29. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
+30. Baum, L.E.; Petrie, T.; Soules, G.; Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 1970, 41, 164–171.
+31. Rue, H.; Martino, S.; Chopin, N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Stat. Soc. Ser. B 2009, 71, 319–392.
+32. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+On Clustering Histograms with k-Means by Using
+Mixed α-Divergences
+
+Frank Nielsen 1,2,*, Richard Nock 3 and Shun-ichi Amari 4
+
+1 Sony Computer Science Laboratories, Inc, Tokyo 141-0022, Japan
+2 École Polytechnique, 91128 Palaiseau Cedex, France
+3 NICTA and The Australian National University, Locked Bag 9013, Alexandria NSW 1435, Australia
+4 RIKEN Brain Science Institute, 2-1 Hirosawa Wako City, Saitama 351-0198, Japan; E-Mail: amari@brain.riken.jp
+* E-Mail: Frank.Nielsen@acm.org; Tel.: +81-3-5448-4380.
+
+Received: 15 May 2014; in revised form: 10 June 2014 / Accepted: 13 June 2014 /
+Published: 17 June 2014
+
+Abstract: Clustering sets of histograms has become popular thanks to the success of the generic
+method of bag-of-X used in text categorization and in visual categorization applications. In this paper,
+we investigate the use of a parametric family of distortion measures, called the α-divergences, for
+clustering histograms. Since it usually makes sense to deal with symmetric divergences in information
+retrieval systems, we symmetrize the α-divergences using the concept of mixed divergences. First,
+we present a novel extension of k-means clustering to mixed divergences. Second, we extend the
+k-means++ seeding to mixed α-divergences and report a guaranteed probabilistic bound. Finally, we
+describe a soft clustering technique for mixed α-divergences.
+
+Keywords: bag-of-X; α-divergence; Jeffreys divergence; centroid; k-means clustering; k-means seeding
+
+1. Introduction: Motivation and Background
+
+1.1. Clustering Histograms in the Bag-of-Word Modeling Paradigm
+
+A common task of information retrieval (IR) systems is to classify documents into categories.
+Given a training set of documents labeled with categories, one asks to classify new incoming documents.
+Text categorisation [1,2] proceeds by first defining a dictionary of words from a corpus. It then
+models each document by a word count yielding a word distribution histogram per document (see
+the University of California, Irvine, UCI, machine learning repository for such data-sets [3]). The
+importance of the words in the dictionary can be weighted by the term frequency-inverse document
+frequency [2] (tf-idf), which takes into account both the frequency of the words in a given document and
+the frequency of the words in all documents: namely, the tf-idf weight for a given word in a
+given document is the product of the frequency of that word in the document and the logarithm of
+the ratio of the number of documents to the document frequency of the word [2]. Defining a
+proper distance between histograms allows one to:
+
+• Classify a new on-line document: we first calculate its word distribution histogram signature and
+seek the labeled document with the most similar histogram to deduce its category tag.
+• Find the initial set of categories: we cluster all document histograms and assign a category
+per cluster.
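+
+The tf-idf weighting described above can be sketched as follows in Python. The function name and the input format (one list of tokens per document) are assumptions for illustration, and the term frequency is taken here as the relative frequency within the document, which is one common reading of "frequency".
```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: tf-idf weight} dictionary per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each word.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({word: (count / total) * math.log(n_docs / df[word])
                        for word, count in counts.items()})
    return weights


if __name__ == "__main__":
    docs = [["a", "b", "a"], ["a", "c"]]
    print(tf_idf(docs))
```
+A word that occurs in every document receives weight zero, since the logarithm of the ratio is then zero; the resulting per-document weight vectors play the role of the histograms being compared.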
+
+This text classification method based on the representation of the bag-of-words (BoWs) has also
+been instrumental in computer vision for efficient object categorization [4] and recognition in natural
+images [5]. This paradigm is called bag-of-features [6] (BoFs) in the general case. It first requires one
+to create a dictionary of “visual words” by quantizing keypoints (e.g., affine invariant descriptors of
+image patches) of the training database. Quantization is performed using the k-means [7–9] algorithm
+
+Entropy 2014, 16, 3273–3301; doi:10.3390/e16063273
+www.mdpi.com/journal/entropy
+that partitions n data points X = {x1, ..., xn} into k pairwise disjoint clusters C1, ..., Ck, where each data
+element belongs to the closest cluster center (i.e., the cluster prototype). From a given initialization,
+batched k-means first assigns data points to their closest centers and then updates the cluster centers
+and reiterates this process until convergence is met to a local minimum (not necessarily the global
+minimum) after a provably finite number of steps. Csurka et al. [4] used the squared Euclidean
+distance for building the visual vocabulary. Depending on the chosen features, other distances
+have proven useful. For example, the symmetrized Kullback–Leibler (KL) divergence was shown to
+perform experimentally better than the Euclidean or squared Euclidean distances for a compressed
+histogram of gradient descriptors [10] (CHoGs), even if it is not a metric distance, since it fails to
+satisfy the triangle inequality. To summarize, k-means histogram clustering with respect to the
+symmetrized KL (called Jeffreys divergence J) can be used to quantize both visual words and document
+categories. Nowadays, the seminal bag-of-word method has been generalized fruitfully to various
+settings using the generic bag-of-X paradigm, like the bag-of-textons [6], the bag-of-readers [11], etc.
+Bag-of-X represents each datum (e.g., document, image, etc.) as a histogram of codeword count indices.
+Furthermore, the semantic space [12] paradigm has been recently explored to overcome two drawbacks
+of the bag-of-X paradigms: the high-dimensionality of the histograms (number of bins) and difficult
+human interpretation of the codewords due to the lack of semantic information. In semantic space,
+modeling relies on semantic multinomials that are discrete frequency histograms; see [12].
+In summary, clustering histograms with respect to symmetric distances (like the symmetrized KL
+divergence) is playing an increasing role. It turns out that the symmetrized KL divergence belongs to a
+1-parameter family of divergences, called symmetrized α-divergences, or Jeffreys α-divergence [13].
+
+1.2. Contributions
+
+Since divergences D(p : q) are usually asymmetric distortion measures between two objects
+p and q, one often has to consider two kinds of centroids obtained by carrying out the minimization
+process either on the left argument or on the right argument of the divergences; see [14]. In theory,
+it is enough to consider only one type of centroid, say the right centroid, since the left centroid with
+respect to a divergence D(p : q) is equivalent to the right centroid with respect to the mirror divergence
+D′(p : q) = D(q : p).
+In this paper, we consider mixed divergences [15] that allow one to handle in a unified way the
+arithmetic symmetrization $S(p, q) = \frac{1}{2}(D(p : q) + D(q : p))$ of a given divergence D(p : q) with both
+the sided divergences: D(p : q) and its mirror divergence D′(p : q). The mixed α-divergence is the
+mixed divergence obtained for the α-divergence. We term α-clustering the clustering with respect
+to α-divergences and mixed α-clustering the clustering w.r.t. mixed α-divergences [16]. Our main
+contributions are to extend the celebrated batched k-means [7–9] algorithm to mixed divergences
+by associating two dual centroids per cluster and to generalize the probabilistically guaranteed
+good seeding of k-means++ [17] to mixed α-divergences. The mixed α-seedings provide guaranteed
+probabilistic clustering bounds by picking up seeds from the data and do not require explicitly
+computing centroids. This yields a fast clustering technique in practice, even when cluster
+centers are not available in closed form. We also consider clustering histograms by explicitly building
+the symmetrized α-centroids and end up with a variational k-means when the centroids are not
+available in closed form. Finally, we investigate soft mixed α-clustering and discuss topics related to
+α-clustering. Note that clustering with respect to non-symmetrized α-divergences has been recently
+investigated independently in [18] and proven useful in several applications.
+
+1.3. Outline of the Paper
+
+The paper is organized as follows: Section 2 introduces the notion of mixed divergences, presents
+an extension of k-means to mixed divergences and recalls some properties of α-divergences. Section 3
+describes the α-seeding techniques and reports a probabilistically-guaranteed bound on the clustering
+quality. Section 4 investigates the various sided/symmetrized/mixed calculations of the α-centroids.
+Section 5 presents the soft α-clustering with respect to α-mixed divergences. Finally, Section 6
+summarises the contributions, discusses related topics and hints at further perspectives. The paper is
+followed by two appendices. Appendix B studies several properties of α-divergences that are used to
+derive the guaranteed probabilistic performance of the α-seeding. Appendix C proves that α-sided
+centroids are quasi-arithmetic means for the power generator functions.
+
+2. Mixed Centroid-Based k-Means Clustering
+
+2.1. Divergences, Centroids and k-Means
+
+Consider a set $H$ of $n$ histograms $h_1, \ldots, h_n$, each with $d$ bins, with all positive real-valued bins:
+$h_j^i > 0$, $\forall\, 1 \le i \le d$, $1 \le j \le n$. A histogram $h$ is called a frequency histogram when its bins sum up
+to one: $w(h) = w_h = \sum_i h^i = 1$. Otherwise, it is called a positive histogram that can eventually be
+normalized to a frequency histogram:
+
+\[
+\tilde{h} \doteq \frac{h}{w(h)}.
+\tag{1}
+\]
+
+The frequency histograms belong to the $(d-1)$-dimensional open probability simplex $\Delta_d$:
+
+\[
+\Delta_d \doteq \left\{ (x^1, \ldots, x^d) \in \mathbb{R}^d \;\middle|\; \forall i,\ x^i > 0, \text{ and } \sum_{i=1}^{d} x^i = 1 \right\}.
+\tag{2}
+\]
+
+That is, although frequency histograms have $d$ bins, the constraint that those bin values should
+sum up to one yields $d-1$ degrees of freedom. In probability theory, frequency or counting
+histograms model either discrete multinomial probabilities or discrete positive measures (also called
+positive arrays [19]).
+The celebrated k-means clustering [8,9] is one of the most famous clustering techniques that has
+been generalized in many ways [20,21]. In information geometry [22], a divergence D(p : q) is a
+smooth $C^3$ differentiable dissimilarity measure that is not necessarily symmetric ($D(p : q) \neq D(q : p)$,
+hence the notation “:” instead of the classical “,” reserved for metric distances), but is non-negative and
+satisfies the separability property: D(p : q) = 0 iff p = q. More precisely, let $\partial_i D(x : y) = \frac{\partial}{\partial x^i} D(x : y)$,
+$\partial_{,i} D(x : y) = \frac{\partial}{\partial y^i} D(x : y)$. Then, we require $\partial_i D(x : x) = \partial_{,i} D(x : x) = 0$ and $-\partial_i \partial_{,j} D(x : y)$ positive
+definite for defining a divergence. For a distance function D(· : ·), we denote by D(x : H) the weighted
+average distance of x to a set of weighted histograms:
+
+\[
+D(x : H) \doteq \sum_{j=1}^{n} w_j D(x : h_j).
+\tag{3}
+\]
+
+An important class of divergences on frequency histograms is the f-divergences [23–25] defined for a
+convex generator $f$ (with $f(1) = f'(1) = 0$ and $f''(1) = 1$):
+
+\[
+I_f(p : q) \doteq \sum_{i=1}^{d} q^i f\left(\frac{p^i}{q^i}\right).
+\]
+Those divergences preserve information monotonicity [19] under any arbitrary transition probability
+(Markov morphisms). f-divergences can be extended to positive arrays [19].
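+
+As an illustration of the f-divergence definition, the Python sketch below evaluates $I_f(p : q)$ for a generic generator and checks one concrete instance. The generator $f(u) = u \log u - u + 1$ is chosen here for illustration (it satisfies $f(1) = f'(1) = 0$ and $f''(1) = 1$); with it, $I_f$ coincides with the Kullback–Leibler divergence extended to positive arrays. The function names are assumptions.
```python
import math


def f_divergence(p, q, f):
    """I_f(p : q) = sum_i q_i * f(p_i / q_i) for positive arrays p and q."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def kl_generator(u):
    """Convex generator f(u) = u log u - u + 1, with f(1) = f'(1) = 0, f''(1) = 1."""
    return u * math.log(u) - u + 1.0


def extended_kl(p, q):
    """Extended Kullback-Leibler divergence on positive arrays."""
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))


if __name__ == "__main__":
    p, q = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]
    print(f_divergence(p, q, kl_generator), extended_kl(p, q))
```
+The two printed numbers agree up to floating-point rounding, which follows from expanding $q^i f(p^i/q^i)$ term by term.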
+The k-means algorithm on a set of weighted histograms can be tailored to any divergence as
+follows: first, we initialize the k cluster centers $C = \{c_1, \ldots, c_k\}$ (say, by randomly picking
+distinct seeds). Then, we iteratively repeat the following two steps until convergence:
+
+• Assignment: Assign each histogram $h_j$ to its closest cluster center:
+
+\[
+l(h_j) \doteq \arg\min_{l=1}^{k} D(h_j : c_l).
+\]
+
+This yields a partition of the histogram set $H = \cup_{l=1}^{k} A_l$, where $A_l$ denotes the set of histograms
+of the $l$-th cluster: $A_l = \{h_j \mid l(h_j) = l\}$.
+• Center relocation: Update the cluster centers by taking their centroids:
+
+\[
+c_l \doteq \arg\min_{x} \sum_{h_j \in A_l} w_j D(h_j : x).
+\]
+Throughout this paper, centroid shall be understood in the broader sense of a barycenter when
+weights are non-uniform.
+
+2.2. Mixed Divergences and Mixed k-Means Clustering
+
+Since divergences are potentially asymmetric, we can define two-sided k-means or always consider
+a right-sided k-means, but then define another sided divergence D′(p : q) = D(q : p). We can
+also consider the symmetrized k-means with respect to the symmetrized divergence: S(p, q) =
+D(p : q) + D(q : p). Eventually, we may skew the symmetrization with a parameter λ ∈ [0, 1]:
+Sλ(p, q) = λD(p : q) + (1 − λ)D(q : p) (and consider other averaging schemes instead of the arithmetic
+mean).
+In order to handle those sided and symmetrized k-means under the same framework, let us
+introduce the notion of mixed divergences [15] as follows:
+
+Definition 1 (Mixed divergence).
+
+\[
+M_\lambda(p : q : r) \doteq \lambda D(p : q) + (1 - \lambda) D(q : r),
+\tag{4}
+\]
+
+for $\lambda \in [0, 1]$.
+
+A mixed divergence includes the sided divergences for $\lambda \in \{0, 1\}$ and the symmetrized (arithmetic
+mean) divergence for $\lambda = \frac{1}{2}$.
+We generalize k-means clustering to mixed k-means clustering [15] by considering two centers
+per cluster (for the special cases of λ = 0, 1, it is enough to consider only one). Algorithm 1 sketches
+the generic mixed k-means algorithm. Note that a simple initialization consists of choosing randomly
+the k distinct seeds from the dataset with li = ri.
+
+Algorithm 1: Mixed divergence-based k-means clustering.
+
+Input: Weighted histogram set $H$, divergence $D(\cdot : \cdot)$, integer $k > 0$, real $\lambda \in [0, 1]$;
+Initialize left-sided/right-sided seeds $C = \{(l_i, r_i)\}_{i=1}^{k}$;
+repeat
+    // Assignment
+    for $i = 1, 2, \ldots, k$ do
+        $C_i \leftarrow \{h \in H : i = \arg\min_j M_\lambda(l_j : h : r_j)\}$;
+    // Dual-sided centroid relocation
+    for $i = 1, 2, \ldots, k$ do
+        $r_i \leftarrow \arg\min_x D(C_i : x) = \sum_{h_j \in C_i} w_j D(h_j : x)$;
+        $l_i \leftarrow \arg\min_x D(x : C_i) = \sum_{h_j \in C_i} w_j D(x : h_j)$;
+until convergence;
+Output: Partition of H into k clusters following C;
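+
+To make Algorithm 1 concrete, here is a minimal Python sketch for one particular choice of divergence, the Kullback–Leibler divergence extended to positive arrays. That choice is an assumption made for illustration: for it, the right-sided centroid is the weighted arithmetic mean and the left-sided centroid is the weighted geometric mean, so both relocation steps are available in closed form. Function names and the empty-cluster policy are likewise illustrative.
```python
import math


def extended_kl(p, q):
    """Extended Kullback-Leibler divergence on positive arrays."""
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))


def mixed_divergence(l, h, r, lam, div=extended_kl):
    """M_lambda(l : h : r) = lam * D(l : h) + (1 - lam) * D(h : r)."""
    return lam * div(l, h) + (1.0 - lam) * div(h, r)


def mixed_kmeans(histograms, weights, seeds, lam, max_iter=100):
    """Mixed k-means with two centers (left, right) per cluster."""
    centers = [(list(s), list(s)) for s in seeds]  # initialization with l_i = r_i
    assignment = None
    for _ in range(max_iter):
        # Assignment: each histogram goes to the cluster minimizing M_lambda.
        new_assignment = [
            min(range(len(centers)),
                key=lambda i: mixed_divergence(centers[i][0], h, centers[i][1], lam))
            for h in histograms
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Dual-sided centroid relocation (closed form for extended KL).
        for i in range(len(centers)):
            members = [(h, w) for h, w, a in zip(histograms, weights, assignment) if a == i]
            if not members:
                continue  # illustrative policy: keep the old centers of an empty cluster
            total = sum(w for _, w in members)
            d = len(members[0][0])
            right = [sum(w * h[b] for h, w in members) / total for b in range(d)]
            left = [math.exp(sum(w * math.log(h[b]) for h, w in members) / total)
                    for b in range(d)]
            centers[i] = (left, right)
    return assignment, centers


if __name__ == "__main__":
    data = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
    labels, centers = mixed_kmeans(data, [1.0] * 4, [data[0], data[2]], lam=0.5)
    print(labels, centers)
```
+Other divergences fit the same loop as long as the two sided centroids can be computed, either in closed form or numerically.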
+
+Notice that the mixed k-means clustering is different from the k-means clustering with respect to
+the symmetrized divergences Sλ that considers only one centroid per cluster.
+
+Entropy 2014, 16, 3273–3301
+
+2.3. Sided, Symmetrized and Mixed α-Divergences
+
+For α ≠ ±1, we define the family of α-divergences [26] on positive arrays [27] as:
+
+\[
+D_\alpha(p : q) \doteq \sum_{i=1}^{d} \frac{4}{1-\alpha^2} \left( \frac{1-\alpha}{2} p^i + \frac{1+\alpha}{2} q^i - (p^i)^{\frac{1-\alpha}{2}} (q^i)^{\frac{1+\alpha}{2}} \right) = D_{-\alpha}(q : p), \quad \alpha \in \mathbb{R} \setminus \{-1, 1\}, \tag{5}
+\]
+
+with the limit cases D−1(p : q) = KL(p : q) and D1(p : q) = KL(q : p), where KL is the extended
+Kullback–Leibler divergence:
+
+\[
+\mathrm{KL}(p : q) \doteq \sum_{i=1}^{d} p^i \log \frac{p^i}{q^i} + q^i - p^i. \tag{6}
+\]
+
+Divergence D0 is the squared Hellinger symmetric distance (scaled by a multiplicative factor of
+four) extended to positive arrays:
+
+\[
+D_0(p : q) = 2 \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 \mathrm{d}x = 4 H^2(p, q), \tag{7}
+\]
+
+with the Hellinger distance:
+
+\[
+H(p, q) = \sqrt{ \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 \mathrm{d}x }. \tag{8}
+\]
+
+Note that α-divergences are defined for the full range of α values: α ∈ R.
+Observe that
+α-divergences of Equation (5) are homogeneous of degree one: Dα(λp : λq) = λDα(p : q) for
+λ > 0.
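+
+As a minimal sketch (not the authors' implementation), the α-divergence of Equation (5) and its two limit
+cases can be coded in a few lines of Python. The function names `extended_kl` and `alpha_divergence` are
+illustrative assumptions, and histograms are assumed to be lists of strictly positive floats.
+

```python
import math

def extended_kl(p, q):
    """Extended Kullback-Leibler divergence between positive arrays, Equation (6)."""
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

def alpha_divergence(p, q, alpha):
    """alpha-divergence D_alpha(p : q) between positive arrays, Equation (5)."""
    if alpha == -1:
        return extended_kl(p, q)   # limit case D_{-1}(p : q) = KL(p : q)
    if alpha == 1:
        return extended_kl(q, p)   # limit case D_1(p : q) = KL(q : p)
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    scale = 4 / (1 - alpha ** 2)
    return scale * sum(a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))

# Two arbitrary positive (unnormalized) histograms with three bins.
p = [0.2, 0.5, 1.3]
q = [0.4, 0.1, 0.9]
```

+
+With these two arrays, `alpha_divergence(p, q, 0.3)` and `alpha_divergence(q, p, -0.3)` agree up to rounding,
+which is the reference duality stated in Equation (5); scaling both arrays by a positive factor scales the result
+by the same factor.
+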
+When histograms p and q are both frequency histograms, we have:
+
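+\[
+D_\alpha(\tilde{p} : \tilde{q}) = \frac{4}{1-\alpha^2} \left( 1 - \sum_{i=1}^{d} (\tilde{p}^i)^{\frac{1-\alpha}{2}} (\tilde{q}^i)^{\frac{1+\alpha}{2}} \right) = D_{-\alpha}(\tilde{q} : \tilde{p}), \quad \alpha \in \mathbb{R} \setminus \{-1, 1\}, \tag{9}
+\]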
+
+and the extended Kullback–Leibler divergence reduces to the traditional Kullback–Leibler
+divergence: $\mathrm{KL}(\tilde{p} : \tilde{q}) = \sum_{i=1}^{d} \tilde{p}^i \log \frac{\tilde{p}^i}{\tilde{q}^i}$.
+The Kullback–Leibler divergence between frequency histograms $\tilde{p}$ and $\tilde{q}$ (α = ±1) is interpreted
+as the cross-entropy minus the Shannon entropy:
+
+\[
+\mathrm{KL}(\tilde{p} : \tilde{q}) \doteq H^{\times}(\tilde{p} : \tilde{q}) - H(\tilde{p}).
+\]
+
+Often, ˜p denotes the true model (hidden by nature), and ˜q is the estimated model from observations.
+However, in information retrieval, both ˜p and ˜q play the same symmetrical role, and we prefer to deal
+with a symmetric divergence.
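+The Pearson and Neyman $\chi^2$ distances are obtained for α = −3 and α = 3, respectively:
+
+\[
+D_3(\tilde{p} : \tilde{q}) = \frac{1}{2} \sum_i \frac{(\tilde{q}^i - \tilde{p}^i)^2}{\tilde{p}^i}, \tag{10}
+\]
+
+\[
+D_{-3}(\tilde{p} : \tilde{q}) = \frac{1}{2} \sum_i \frac{(\tilde{q}^i - \tilde{p}^i)^2}{\tilde{q}^i}. \tag{11}
+\]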
+
+The α-divergences belong to the class of Csiszár f-divergences with the following generator:
+
+\[
+f(t) = \begin{cases}
+\frac{4}{1-\alpha^2} \left( 1 - t^{(1+\alpha)/2} \right), & \text{if } \alpha \neq \pm 1, \\
+t \ln t, & \text{if } \alpha = 1, \\
+-\ln t, & \text{if } \alpha = -1.
+\end{cases} \tag{12}
+\]
+
+Remark 1. Historically, the α-divergences were introduced by Chernoff [28,29] in the context of hypothesis
+testing. In Bayesian binary hypothesis testing, we are asked to decide whether an observation belongs to one
+class or the other, based on priors $w_1$ and $w_2$ and class-conditional probabilities $p_1$ and $p_2$. The average
+expected error of the best decision rule, the maximum a posteriori (MAP) rule, is called the probability of error, denoted by
+$P_e$. When the prior probabilities are identical ($w_1 = w_2 = \frac{1}{2}$), we have $P_e(p_1, p_2) = \frac{1}{2} \int \min(p_1(x), p_2(x)) \mathrm{d}x$.
+Let $S(p, q) = \int \min(p(x), q(x)) \mathrm{d}x$ denote the intersection similarity measure, with $0 < S \leq 1$ (generalizing
+the histogram intersection distance often used in computer vision [30]). $S$ is bounded by the α-Chernoff affinity
+coefficient:
+
+\[
+S(p, q) \leq C_\beta(p, q) = \int p^{\beta}(x) q^{1-\beta}(x) \mathrm{d}x,
+\]
+
+for all β ∈ [0, 1]. We can convert the affinity coefficient $0 < C_\beta \leq 1$ into a divergence $D_\beta$ by simply taking
+$D_\beta = 1 - C_\beta$. Since the absolute value of divergences does not matter, we can rescale the
+divergence appropriately. One nice rescaling multiplies by $\frac{1}{\beta(1-\beta)}$: $D_\beta = \frac{1}{\beta(1-\beta)} (1 - C_\beta)$. This makes the
+parameterized divergence coincide with the fundamental Kullback–Leibler divergence for the limit values β ∈ {0, 1}.
+Last, choosing $\beta = \frac{1-\alpha}{2}$ yields the well-known expression of the α-divergences.
+
+Interestingly, the α-divergences can be interpreted as a generalized α-Kullback–Leibler
+divergence [26] with deformed logarithms.
+Next, we introduce the mixed α-divergence of a histogram x to two histograms p and q as follows:
+
+Definition 2 (Mixed α-divergence). The mixed α-divergence of a histogram x to two histograms p and q is
+defined by:
+
+\[
+\begin{aligned}
+M_{\lambda,\alpha}(p : x : q) &= \lambda D_\alpha(p : x) + (1-\lambda) D_\alpha(x : q), \\
+&= \lambda D_{-\alpha}(x : p) + (1-\lambda) D_{-\alpha}(q : x), \\
+&= M_{1-\lambda,-\alpha}(q : x : p).
+\end{aligned} \tag{13}
+\]
+
+The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:
+
+\[
+S_\alpha(p, q) = M_{\frac{1}{2},\alpha}(q : p : q) = M_{\frac{1}{2},\alpha}(p : q : p).
+\]
+
+The skew symmetrized α-divergence is defined by:
+
+\[
+S_{\lambda,\alpha}(p : q) = \lambda D_\alpha(p : q) + (1-\lambda) D_\alpha(q : p).
+\]
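+
+The definitions above can be sketched in Python as follows. This is only an illustration under stated
+assumptions: the function names are hypothetical, histograms are lists of strictly positive floats, and α is
+restricted to values other than ±1 so that the closed form of Equation (5) applies.
+

```python
def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) of Equation (5), for alpha different from -1 and 1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha ** 2) * sum(
        a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))

def mixed_alpha_divergence(p, x, q, lam, alpha):
    """M_{lambda,alpha}(p : x : q) = lam D_alpha(p : x) + (1 - lam) D_alpha(x : q)."""
    return lam * alpha_divergence(p, x, alpha) + (1 - lam) * alpha_divergence(x, q, alpha)

def skew_symmetrized(p, q, lam, alpha):
    """S_{lambda,alpha}(p : q) = lam D_alpha(p : q) + (1 - lam) D_alpha(q : p)."""
    return lam * alpha_divergence(p, q, alpha) + (1 - lam) * alpha_divergence(q, p, alpha)

# Three arbitrary positive histograms used to exercise the identities.
p = [0.2, 0.5, 1.3]
x = [0.3, 0.3, 1.0]
q = [0.4, 0.1, 0.9]
lam, alpha = 0.3, 0.6
```

+
+On these inputs, `mixed_alpha_divergence(p, x, q, lam, alpha)` coincides (up to rounding) with
+`mixed_alpha_divergence(q, x, p, 1 - lam, -alpha)`, which is the last equality of Equation (13).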
+
+2.4. Notations and Hard/Soft Clusterings
+
+Throughout the paper, superscript index i denotes the histogram bin numbers and subscript
+index j the histogram numbers. Index l is used to iterate on the clusters. The left-sided, right-sided
+and symmetrized histogram positive and frequency α-centroids are denoted by lα, rα, sα and ˜lα, ˜rα, ˜sα,
+respectively.
+In this paper, we investigate the following kinds of clusterings for sets of histograms:
+
+Hard clustering. Each histogram belongs to exactly one cluster:
+
+• k-means with respect to mixed divergences Mλ,α.
+• k-means with respect to symmetrized divergences Sλ,α.
+• Randomized seeding for mixed/symmetrized k-means by extending k-means++ with
+guaranteed probabilistic bounds for α-divergences.
+
+Soft clustering. Each histogram belongs to all clusters according to some weight distribution:
+the soft mixed α-clustering.
+
+3. Coupled k-Means++ α-Seeding
+
+It is well-known that the Lloyd k-means clustering algorithm monotonically decreases the loss
+function and stops after a finite number of iterations at a local optimum. Optimizing globally
+the k-means loss is NP-hard [17] when d > 1 and k > 1. In practice, the performance of the
+k-means algorithm heavily relies on the initialization. A breakthrough was obtained by the k-means++
+seeding [17], which guarantees in expectation a good starting partition. We extend this scheme to
+the coupled α-clustering. However, although k-means++ is popular and
+often used in practice with very good results, it has recently been pointed out that “worst case”
+configurations exist, even in small dimensions, on which the algorithm cannot significantly beat
+its expected approximability with high probability [31]. Still, the expected approximability ratio,
+roughly in O(log(k)), is very good, as long as the number of clusters is not too large.
+
+Algorithm 2: Mixed α-seeding; MAS(H, k, λ, α)
+
+Input: Weighted histogram set $H$, integer $k \geq 1$, real $\lambda \in [0, 1]$, real $\alpha \in \mathbb{R}$;
+Let $C \leftarrow \{(h_j, h_j)\}$ with $h_j \in H$ picked with uniform probability;
+for $i = 2, 3, \ldots, k$ do
+    Pick at random histogram $h \in H$ with probability:
+    \[
+    \pi_H(h) \doteq \frac{w_h M_{\lambda,\alpha}(c_h : h : c_h)}{\sum_{y \in H} w_y M_{\lambda,\alpha}(c_y : y : c_y)}, \tag{14}
+    \]
+    // where $(c_h, c_h) \doteq \arg\min_{(z,z) \in C} M_{\lambda,\alpha}(z : h : z)$;
+    $C \leftarrow C \cup \{(h, h)\}$;
+Output: Set of initial cluster centers $C$;
+
+Algorithm 2 provides our adaptation of k-means++ seeding [15,17]. It works for all three of our
+sided/symmetrized and mixed clustering settings:
+
+• Pick λ = 1 for the left-sided centroid initialization,
+• Pick λ = 0 for the right-sided centroid initialization (a left-sided initialization for −α),
+• Pick an arbitrary λ for the λ-Jα (skew Jeffreys) centroids or the mixed λ centroids. Indeed, the
+initialization is the same (see the MAS procedure in Algorithm 2).
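+
+The seeding procedure can be sketched in Python as below. This is a hedged illustration rather than the
+authors' software: the names `mixed_divergence` and `mixed_alpha_seeding`, the explicit `rng` argument and
+the clamping of rounding errors are assumptions of this sketch, and α is assumed different from ±1.
+

```python
import random

def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) of Equation (5), for alpha different from -1 and 1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha ** 2) * sum(
        a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))

def mixed_divergence(l, h, r, lam, alpha):
    """M_{lambda,alpha}(l : h : r) = lam D_alpha(l : h) + (1 - lam) D_alpha(h : r)."""
    return lam * alpha_divergence(l, h, alpha) + (1 - lam) * alpha_divergence(h, r, alpha)

def mixed_alpha_seeding(H, weights, k, lam, alpha, rng):
    """Pick k couples (h, h) among the data, following the steps of Algorithm 2."""
    first = rng.randrange(len(H))                 # first seed: uniform pick
    centers = [(H[first], H[first])]
    for _ in range(1, k):
        # Weighted cost of each histogram to its closest couple (c, c) in C;
        # max(0.0, ...) guards against tiny negative rounding errors.
        costs = [w * max(0.0, min(mixed_divergence(c, h, c, lam, alpha)
                                  for c, _ in centers))
                 for h, w in zip(H, weights)]
        # Sampling proportionally to the costs realizes distribution (14).
        pick = rng.choices(range(len(H)), weights=costs, k=1)[0]
        centers.append((H[pick], H[pick]))
    return centers

H = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8],
     [0.1, 0.2, 0.7], [0.3, 0.4, 0.3]]
seeds = mixed_alpha_seeding(H, [1.0] * len(H), k=3, lam=0.5, alpha=0.7,
                            rng=random.Random(0))
```

+
+Because the seeds are drawn among the data, no centroid computation is needed at this stage; a histogram that
+is already a seed has (numerically) zero cost and is therefore essentially never drawn twice.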
+
+Our proof follows and generalizes the proof described for the case of mixed Bregman seeding [15]
+(Lemma 2). In fact, our proof is more precise, as it quantifies the expected potential with respect to the
+optimum only, whereas in [15], the optimal potential is averaged with a dual optimal potential, which
+depends on the optimal centers, but may be larger than the optimum sought.
+
+Theorem 1. Let $C_{\lambda,\alpha}$ denote for short the cost function related to the clustering type chosen (left-, right-, skew
+Jeffreys or mixed) in MAS and $C^{\mathrm{opt}}_{\lambda,\alpha}$ denote the optimal related clustering in k clusters, for λ
+∈ [0, 1] and α ∈ (−1, 1). Then, on average, with respect to distribution (14), the initial clustering of
+MAS satisfies:
+
+\[
+E_\pi[C_{\lambda,\alpha}] \leq 4 \begin{cases}
+f(\lambda) g(k) h^2(\alpha) C^{\mathrm{opt}}_{\lambda,\alpha} & \text{if } \lambda \in (0, 1), \\
+g(k) z(\alpha) h^4(\alpha) C^{\mathrm{opt}}_{\lambda,\alpha} & \text{otherwise.}
+\end{cases} \tag{15}
+\]
+
+Here, $f(\lambda) = \max\left\{ \frac{1-\lambda}{\lambda}, \frac{\lambda}{1-\lambda} \right\}$, $g(k) = 2(2 + \log k)$, $z(\alpha) = \left( \frac{1+|\alpha|}{1-|\alpha|} \right)^{\frac{8|\alpha|^2}{(1-|\alpha|)^2}}$, $h(\alpha) = \max_i p_i^{|\alpha|} / \min_i p_i^{|\alpha|}$;
+the min is defined on strictly positive coordinates, and π denotes the picking distribution of Algorithm 2.
+
+Remark 2. The bound is particularly good when λ is close to 1/2, and in particular for the α-Jeffreys clustering,
+as in these cases, the only additional penalty compared to the Euclidean case [17] is $h^2(\alpha)$, a penalty that relies
+on an optimal triangle inequality for α-divergences that we provide in Lemma A6 below.
+
+Remark 3. This guaranteed initialization is particularly useful for α-Jeffreys clustering, as there is no closed
+form solution for the centroids (except when α = ±1, see [32]).
+
+Algorithm 3: Mixed α-hard clustering: MAhC(H, k, λ, α)
+
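+Input: Weighted histogram set $H$, integer $k > 0$, real $\lambda \in [0, 1]$, real $\alpha \in \mathbb{R}$;
+Let $C = \{(l_i, r_i)\}_{i=1}^{k} \leftarrow \mathrm{MAS}(H, k, \lambda, \alpha)$;
+repeat
+    // Assignment
+    for $i = 1, 2, \ldots, k$ do
+        $A_i \leftarrow \{h \in H : i = \arg\min_j M_{\lambda,\alpha}(l_j : h : r_j)\}$;
+    // Centroid relocation
+    for $i = 1, 2, \ldots, k$ do
+        $r_i \leftarrow \left( \sum_{h \in A_i} w_h h^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}}$;
+        $l_i \leftarrow \left( \sum_{h \in A_i} w_h h^{\frac{1+\alpha}{2}} \right)^{\frac{2}{1+\alpha}}$;
+until convergence;
+Output: Partition of $H$ into $k$ clusters following $C$;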
+
+Algorithm 3 presents the general hard mixed k-means clustering, which can be adapted also to
+left- (λ = 1) and right- (λ = 0) α-clustering.
+For skew Jeffreys centers, since the centroids are not available in closed form [32], we adopt a
+variational approach of k-means by updating iteratively the centroid in each cluster (thus improving
+the overall loss function without computing the optimal centroids that would eventually require
+infinitely many iterations).
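+
+The mixed hard clustering loop of Algorithm 3 can be sketched in Python as follows. This is an illustrative
+sketch under assumptions, not the authors' code: the seeds are passed in explicitly instead of calling MAS,
+weights are renormalized inside each cluster, α is assumed different from ±1, and all function names are
+hypothetical.
+

```python
def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) of Equation (5), for alpha different from -1 and 1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha ** 2) * sum(
        a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))

def mixed_divergence(l, h, r, lam, alpha):
    """M_{lambda,alpha}(l : h : r) = lam D_alpha(l : h) + (1 - lam) D_alpha(h : r)."""
    return lam * alpha_divergence(l, h, alpha) + (1 - lam) * alpha_divergence(h, r, alpha)

def alpha_mean(hists, weights, alpha):
    """Right-sided positive alpha-centroid: coordinate-wise weighted alpha-mean."""
    total = sum(weights)                      # renormalize weights within the cluster
    e = (1 - alpha) / 2
    return [sum(w / total * h[i] ** e for h, w in zip(hists, weights)) ** (1 / e)
            for i in range(len(hists[0]))]

def mixed_alpha_hard_clustering(H, weights, centers, lam, alpha, max_iter=100):
    """Alternate assignment and dual centroid relocation until labels stabilize."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)),
                          key=lambda j: mixed_divergence(centers[j][0], h,
                                                         centers[j][1], lam, alpha))
                      for h in H]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [(h, w) for h, w, a in zip(H, weights, labels) if a == j]
            if not members:
                continue                      # keep the previous couple if a cluster empties
            hs, ws = zip(*members)
            right = alpha_mean(hs, ws, alpha)     # r_alpha
            left = alpha_mean(hs, ws, -alpha)     # l_alpha = r_{-alpha}
            centers[j] = (left, right)
    return labels, centers

H = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.75, 0.15, 0.1],
     [0.1, 0.1, 0.8], [0.1, 0.2, 0.7], [0.1, 0.15, 0.75]]
seeds = [(H[0], H[0]), (H[3], H[3])]
labels, centers = mixed_alpha_hard_clustering(H, [1.0] * 6, seeds, lam=0.5, alpha=0.5)
```

+
+On this toy set, the first three histograms and the last three histograms end up in separate clusters, each
+cluster carrying a couple of (left, right) centers obtained in closed form.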
+
+4. Sided, Symmetrized and Mixed α-Centroids
+
+The k-means clustering requires assigning data elements to their closest cluster center and
+then updating those cluster centers by taking their centroids. This section investigates the centroid
+computations for the sided, symmetrized and mixed α-divergences.
+Note that the mixed α-seeding presented in Section 3 does not require computing centroids and,
+yet, guarantees probabilistically a good clustering partition.
+Since mixed α-divergences are f-divergences, we start with the generic f-centroids.
+
+4.1. Csiszár f-Centroids
+
+The centroids induced by f-divergences of a set of positive measures (that relaxes the
+normalisation constraint) have been studied by Ben-Tal et al. [33]. Those entropic centroids are
+shown to be unique, since f-divergences are convex statistical distances in both arguments. Let $E_f$
+denote the energy to minimize when considering f-divergences:
+
+\[
+E_f \doteq \min_{x \in X} I_f(H : x) = \min_{x \in X} \sum_{j=1}^{n} w_j I_f(h_j : x), \tag{16}
+\]
+
+\[
+\phantom{E_f} = \min_{x \in X} \sum_{j=1}^{n} w_j \sum_{i=1}^{d} h^i_j f\left( \frac{x^i}{h^i_j} \right). \tag{17}
+\]
+
+When the domain is the open probability simplex $X = \Delta_d$, we get a constrained optimisation
+problem to solve. We transform this constrained minimisation problem (i.e., $x \in \Delta_d$) into an equivalent
+unconstrained minimisation problem by using the Lagrange multiplier, γ:
+
+\[
+\min_{x \in \mathbb{R}^d} \sum_{j=1}^{n} w_j I_f(h_j : x) + \gamma \left( \sum_{i=1}^{d} x^i - 1 \right). \tag{18}
+\]
+
+Taking the derivatives according to xi, we get:
+
+\[
+\forall i \in \{1, \ldots, d\}, \quad \sum_{j=1}^{n} w_j f'\left( \frac{x^i}{h^i_j} \right) + \gamma = 0. \tag{19}
+\]
+
+We now consider this equation for α-divergences and symmetrized α-divergences, both
+f-divergences.
+
+4.2. Sided Positive and Frequency α-Centroids
+
+The positive sided α-centroids for a set of weighted histograms were reported in [34] using the
+representation Bregman divergence. We summarise the results in the following theorem:
+
+Theorem 2 (Sided positive α-centroids [34]). The left-sided $l_\alpha$ and right-sided $r_\alpha$ positive weighted α-centroid
+coordinates of a set of n positive histograms $h_1, \ldots, h_n$ are weighted α-means:
+
+\[
+r^i_\alpha = f_\alpha^{-1}\left( \sum_{j=1}^{n} w_j f_\alpha(h^i_j) \right), \qquad l^i_\alpha = r^i_{-\alpha},
+\]
+
+with
+
+\[
+f_\alpha(x) = \begin{cases}
+x^{\frac{1-\alpha}{2}} & \alpha \neq 1, \\
+\log x & \alpha = 1.
+\end{cases}
+\]
+
+Furthermore, the frequency-sided α-centroids are simply the normalized-sided α-centroids.
+
+Theorem 3 (Sided frequency α-centroids [16]). The coordinates of the sided frequency α-centroids of a set of
+n weighted frequency histograms are the normalised weighted α-means.
+
+Table 1 summarizes the results concerning the sided positive and frequency α-centroids.
+
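+Table 1. Positive and frequency α-centroids: the frequency α-centroids are normalized positive
+α-centroids, where w(h) denotes the cumulative sum of the histogram bins. The arithmetic mean
+is obtained for $r_{-1} = l_1$ and the geometric mean for $r_1 = l_{-1}$.
+
+\begin{tabular}{lll}
+ & Positive centroid & Frequency centroid \\
+Right-sided centroid & $r^i_\alpha = \begin{cases} \left( \sum_{j=1}^{n} w_j (h^i_j)^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}} & \alpha \neq 1 \\ r^i_1 = \prod_{j=1}^{n} (h^i_j)^{w_j} & \alpha = 1 \end{cases}$ & $\tilde{r}^i_\alpha = \frac{r^i_\alpha}{w(r_\alpha)}$ \\
+Left-sided centroid & $l^i_\alpha = r^i_{-\alpha} = \begin{cases} \left( \sum_{j=1}^{n} w_j (h^i_j)^{\frac{1+\alpha}{2}} \right)^{\frac{2}{1+\alpha}} & \alpha \neq -1 \\ l^i_{-1} = \prod_{j=1}^{n} (h^i_j)^{w_j} & \alpha = -1 \end{cases}$ & $\tilde{l}^i_\alpha = \tilde{r}^i_{-\alpha} = \frac{r^i_{-\alpha}}{w(r_{-\alpha})}$ \\
+\end{tabular}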
+4.3. Mixed α-Centroids
+
+The mixed α-centroids for a set of n weighted histograms are defined as the minimizers of:
+
+\[
+\sum_{j} w_j M_{\lambda,\alpha}(l : h_j : r). \tag{20}
+\]
+
+We state the theorem generalizing [15]:
+
+Theorem 4. The two mixed α-centroids are the left-sided and right-sided α-centroids.
+
+Figure 1 depicts a clustering result obtained with our α-clustering software. We remark that the clusters
+found are all approximately subclusters of the “distinct” clusters that appear on the figure. When
+those distinct clusters are actually the optimal clusters—which is likely to be the case when they are
+separated by large minimal distance to other clusters—this is clearly a desirable qualitative property
+as long as the number of experimental clusters is not too large compared to the number of optimal
+clusters. We remark also that in the experiment displayed, there is no closed form solution for the
+cluster centers.
+
+Figure 1. Snapshot of the α-clustering software. Here, n = 800 frequency histograms of three bins with
+k = 8, α = 0.7 and λ = 1/2.
+
+4.4. Symmetrized Jeffreys-Type α-Centroids
+
+The Kullback–Leibler divergence can be symmetrized in various ways: Jeffreys divergence,
+Jensen–Shannon divergence and Chernoff information, just to mention a few. Here, we consider the
+following symmetrization of α-divergences extending Jeffreys J-divergence:
+
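+\[
+\begin{aligned}
+S_\alpha(p, q) &= \frac{1}{2} \left( D_\alpha(p : q) + D_\alpha(q : p) \right) = S_{-\alpha}(p, q), \\
+&= M_{\frac{1}{2},\alpha}(p : q : p).
+\end{aligned} \tag{21}
+\]
+
+For α = ±1, we get half of Jeffreys divergence:
+
+\[
+S_{\pm 1}(p, q) = \frac{1}{2} \sum_{i=1}^{d} (p^i - q^i) \log \frac{p^i}{q^i}.
+\]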
+
+In particular, when p and q are frequency histograms, we have for α ≠ ±1:
+
+\[
+J_\alpha(\tilde{p} : \tilde{q}) = \frac{8}{1-\alpha^2} \left( 1 - \sum_{i=1}^{d} H_{\frac{1-\alpha}{2}}(\tilde{p}^i, \tilde{q}^i) \right), \tag{22}
+\]
+
+where $H_{\frac{1-\alpha}{2}}(a, b)$ is a symmetric Heinz mean [35,36]:
+
+\[
+H_\beta(a, b) = \frac{a^{\beta} b^{1-\beta} + a^{1-\beta} b^{\beta}}{2}.
+\]
+
+Heinz means interpolate the arithmetic and geometric means and satisfy the inequality:
+
+\[
+\sqrt{ab} = H_{\frac{1}{2}}(a, b) \leq H_\alpha(a, b) \leq H_0(a, b) = \frac{a+b}{2}.
+\]
+
+(Another interesting property of Heinz means is the integral representation of the logarithmic mean:
+$L(x, y) = \frac{x-y}{\log x - \log y} = \int_0^1 H_\beta(x, y) \mathrm{d}\beta$. This allows one to prove easily that $\sqrt{xy} \leq L(x, y) \leq \frac{x+y}{2}$.)
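+
+The interpolation property and the integral representation above can be checked numerically with a short
+Python sketch. The function names and the midpoint-rule quadrature are assumptions of this illustration.
+

```python
import math

def heinz_mean(a, b, beta):
    """Symmetric Heinz mean H_beta(a, b) for positive a, b and beta in [0, 1]."""
    return (a ** beta * b ** (1 - beta) + a ** (1 - beta) * b ** beta) / 2

def logarithmic_mean(x, y, steps=10000):
    """Midpoint-rule approximation of the integral of H_beta(x, y) over beta in [0, 1]."""
    return sum(heinz_mean(x, y, (k + 0.5) / steps) for k in range(steps)) / steps

a, b = 2.0, 5.0
geometric = heinz_mean(a, b, 0.5)      # sqrt(a * b)
arithmetic = heinz_mean(a, b, 0.0)     # (a + b) / 2
between = heinz_mean(a, b, 0.2)        # lies between the two means
```

+
+The quadrature `logarithmic_mean(a, b)` agrees with the closed form `(a - b) / (math.log(a) - math.log(b))` to
+within the discretization error of the midpoint rule.
+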
+The Jα-divergence is a Csiszár f-divergence [24,25].
+Observe that it is enough to consider α ∈ [0, ∞) and that the symmetrized α-divergence for
+positive and frequency histograms coincide only for α = ±1.
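+For α → ±1, $S_\alpha(p, q)$ tends to half of the Jeffreys divergence $J$, where:
+
+\[
+J(p, q) = \mathrm{KL}(p : q) + \mathrm{KL}(q : p) = \sum_{i=1}^{d} (p^i - q^i)(\log p^i - \log q^i). \tag{23}
+\]
+
+The Jeffreys divergence takes the same mathematical form for frequency histograms:
+
+\[
+J(\tilde{p}, \tilde{q}) = \mathrm{KL}(\tilde{p} : \tilde{q}) + \mathrm{KL}(\tilde{q} : \tilde{p}) = \sum_{i=1}^{d} (\tilde{p}^i - \tilde{q}^i)(\log \tilde{p}^i - \log \tilde{q}^i). \tag{24}
+\]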
+
+We state the results reported in [32]:
+
+Theorem 5 (Jeffreys positive centroid [32]). The Jeffreys positive centroid $c = (c^1, \ldots, c^d)$ of a set $\{h_1, \ldots, h_n\}$
+of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W
+analytic function:
+
+\[
+c^i = \frac{a^i}{W\left( \frac{a^i}{g^i} e \right)},
+\]
+
+where $a^i = \sum_{j=1}^{n} \pi_j h^i_j$ denotes the coordinate-wise arithmetic weighted means and $g^i = \prod_{j=1}^{n} (h^i_j)^{\pi_j}$ the
+coordinate-wise geometric weighted means.
+
+The Lambert analytic function W [37] (positive branch) is defined by W(x)eW(x) = x for x ≥ 0.
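+
+A hedged Python sketch of the closed form of Theorem 5 follows. The Lambert W function is evaluated here by
+plain Newton iterations, which is an assumption of this sketch (a library routine could be used instead); the
+function names are hypothetical, and the weights are assumed to sum to one.
+

```python
import math

def lambert_w(x, tol=1e-14, max_iter=100):
    """Principal branch W(x) for x >= 0, via Newton iterations on w * exp(w) = x."""
    w = math.log1p(x)                         # starting point at or above the root
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1))
        w -= step
        if abs(step) <= tol * (1 + abs(w)):
            break
    return w

def jeffreys_positive_centroid(hists, weights):
    """Component-wise Jeffreys positive centroid c^i = a^i / W(e * a^i / g^i)."""
    centroid = []
    for i in range(len(hists[0])):
        a = sum(w * h[i] for h, w in zip(hists, weights))                       # arithmetic mean
        g = math.exp(sum(w * math.log(h[i]) for h, w in zip(hists, weights)))   # geometric mean
        centroid.append(a / lambert_w(math.e * a / g))
    return centroid

hists = [[1.0, 4.0, 2.0], [3.0, 1.0, 2.0]]
weights = [0.5, 0.5]
c = jeffreys_positive_centroid(hists, weights)
```

+
+Each coordinate of `c` lies between the geometric and the arithmetic mean of the corresponding bin, and it
+equals both when all histograms agree on that bin (third bin in the example).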
+
+Theorem 6 (Jeffreys frequency centroid [32]). Let $\tilde{c}$ denote the Jeffreys frequency centroid and $\tilde{c}' = \frac{c}{w_c}$ the
+normalised Jeffreys positive centroid. Then, the approximation factor $\alpha_{\tilde{c}'} = \frac{S_1(\tilde{c}', \tilde{H})}{S_1(\tilde{c}, \tilde{H})}$ is such that $1 \leq \alpha_{\tilde{c}'} \leq \frac{1}{w_c}$
+(with $w_c \leq 1$).
+
+Therefore, we shall consider α ̸= ±1 in the remainder.
+We state the following lemma generalizing the former results in [38] that were tailored to the
+symmetrized Kullback–Leibler divergence or the symmetrized Bregman divergence [14]:
+
+Lemma 1 (Reduction property). The symmetrized $J_\alpha$-centroid of a set of n weighted histograms amounts to
+computing the symmetrized α-centroid for the weighted α-mean and −α-mean:
+
+\[
+\min_x J_\alpha(x, H) = \min_x \left( D_\alpha(x : r_\alpha) + D_\alpha(l_\alpha : x) \right).
+\]
+
+Proof. It follows that the minimization problem $\min_x S_\alpha(x, H) = \min_x \sum_{j=1}^{n} w_j S_\alpha(x, h_j)$ reduces to the
+following minimization:
+
+\[
+\min_x \sum_{i=1}^{d} x^i - (x^i)^{\frac{1+\alpha}{2}} \bar{h}^i_\alpha - (x^i)^{\frac{1-\alpha}{2}} \bar{h}^i_{-\alpha}. \tag{25}
+\]
+
+This is equivalent to minimizing:
+
+\[
+\begin{aligned}
+&\equiv \sum_{i=1}^{d} x^i - (x^i)^{\frac{1+\alpha}{2}} \left( (\bar{h}^i_\alpha)^{\frac{2}{1-\alpha}} \right)^{\frac{1-\alpha}{2}} - (x^i)^{\frac{1-\alpha}{2}} \left( (\bar{h}^i_{-\alpha})^{\frac{2}{1+\alpha}} \right)^{\frac{1+\alpha}{2}}, \\
+&\equiv \sum_{i=1}^{d} x^i - (x^i)^{\frac{1+\alpha}{2}} (r^i_\alpha)^{\frac{1-\alpha}{2}} - (x^i)^{\frac{1-\alpha}{2}} (l^i_\alpha)^{\frac{1+\alpha}{2}}, \\
+&\equiv D_\alpha(x : r_\alpha) + D_\alpha(l_\alpha : x).
+\end{aligned}
+\]
+
+Note that for α = ±1, the lemma states that the minimization problem is equivalent to minimizing
+$\mathrm{KL}(a : x) + \mathrm{KL}(x : g)$ with respect to x, where $a = l_1$ and $g = r_1$ denote the arithmetic and geometric
+means, respectively.
+
+The lemma states that the optimization problem with n weighted histograms is equivalent to the
+optimization with only two equally weighted histograms.
+The positive symmetrized α-centroid is equivalent to computing a representation symmetrized
+Bregman centroid [14,34].
+The frequency symmetrized α-centroid asks to minimize the following problem:
+
+\[
+\min_{\tilde{x} \in \Delta_d} \sum_{j} w_j S_\alpha(\tilde{x}, \tilde{h}_j).
+\]
+
+Instead of seeking $\tilde{x}$ in the probability simplex, we can optimize on the unconstrained
+domain $\mathbb{R}^{d-1}$ by using a reparameterization. Indeed, frequency histograms belong to the exponential
+families [39] (multinomials).
+Exponential families also include many other continuous distributions, like the Gaussian, Beta or
+Dirichlet distributions. It turns out the α-divergences can be computed in closed-form for members of
+the same exponential family:
+
+Lemma 2. The α-divergence for distributions belonging to the same exponential family amounts to computing
+a divergence on the corresponding natural parameters:
+
+\[
+A_\alpha(p : q) = \frac{4}{1-\alpha^2} \left( 1 - e^{-J_F^{\left(\frac{1-\alpha}{2}\right)}(\theta_p : \theta_q)} \right),
+\]
+
+where $J_F^{(\beta)}(\theta_1 : \theta_2) = \beta F(\theta_1) + (1-\beta) F(\theta_2) - F(\beta\theta_1 + (1-\beta)\theta_2)$ is a skewed Jensen divergence defined for
+the log-normaliser F of the family.
+
+The proof follows from the fact that $\int p^{\alpha}(x) q^{1-\alpha}(x) \mathrm{d}x = e^{-J_F^{(\alpha)}(\theta_p : \theta_q)}$; see [40].
+
+First, we convert a frequency histogram $\tilde{h}$ to its natural parameter θ with $\theta^i = \log \frac{\tilde{h}^i}{\tilde{h}^d}$; see [39].
+The log-normaliser is a non-separable convex function $F(\theta) = \log(1 + \sum_i e^{\theta^i})$. To convert back a
+multinomial to a frequency histogram with d bins, we first set $\tilde{h}^d = \frac{1}{1 + \sum_{l=1}^{d-1} e^{\theta^l}}$ and then retrieve the
+other bin values as $\tilde{h}^i = \tilde{h}^d e^{\theta^i}$.
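+
+The conversion between a frequency histogram and its natural parameter, together with the skewed Jensen
+divergence of Lemma 2, can be sketched in Python as below. The function names are illustrative assumptions,
+and the histograms are assumed to have strictly positive bins summing to one.
+

```python
import math

def to_natural(h):
    """theta^i = log(h^i / h^d), i = 1..d-1, for a frequency histogram with d bins."""
    return [math.log(x / h[-1]) for x in h[:-1]]

def log_normalizer(theta):
    """F(theta) = log(1 + sum_i exp(theta^i))."""
    return math.log(1 + sum(math.exp(t) for t in theta))

def to_frequency(theta):
    """Inverse map: recover the d bins from the d-1 natural coordinates."""
    last = 1 / (1 + sum(math.exp(t) for t in theta))
    return [last * math.exp(t) for t in theta] + [last]

def skew_jensen(theta1, theta2, beta):
    """J_F^(beta)(theta1 : theta2) for the multinomial log-normalizer F."""
    mix = [beta * s + (1 - beta) * t for s, t in zip(theta1, theta2)]
    return (beta * log_normalizer(theta1) + (1 - beta) * log_normalizer(theta2)
            - log_normalizer(mix))

p = [0.2, 0.5, 0.3]
q = [0.6, 0.1, 0.3]
alpha = 0.4
beta = (1 - alpha) / 2
# alpha-divergence computed on natural parameters, following Lemma 2
via_natural = 4 / (1 - alpha ** 2) * (
    1 - math.exp(-skew_jensen(to_natural(p), to_natural(q), beta)))
```

+
+For these two histograms, `via_natural` matches the direct evaluation of Equation (9) up to rounding, and
+`to_frequency(to_natural(p))` returns `p`.
+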
+The centroids with respect to skewed Jensen divergences have been investigated in [13,40].
+
+Remark 4. Note that for the special case of α = 0 (squared Hellinger centroid), the sided and symmetrized
+centroids coincide. In that case, the coordinates $s^i_0$ of the squared Hellinger centroid are:
+
+\[
+s^i_0 = \left( \sum_{j=1}^{n} w_j \sqrt{h^i_j} \right)^2, \quad 1 \leq i \leq d.
+\]
+
+Remark 5. The symmetrized positive α-centroids can be solved in special cases (α = ±3, α = ±1 corresponding
+to the symmetrized χ2 and Jeffreys positive centroids). For frequency centroids, when dealing with binary
+histograms (d = 2), we have only one degree of freedom and can solve the binary frequency centroids. Binary
+histograms (and mixtures thereof) are used in computer vision and pattern recognition [41].
+
+Remark 6. Since α-divergences are Csiszár f-divergences and f-divergences can always be symmetrized by
+taking generator $s(t) = f(t) + t f(\frac{1}{t})$, we deduce that symmetrized α-divergences $S_\alpha$ are f-divergences for
+the generator:
+
+\[
+f(t) = -\log((1-\alpha) + \alpha t) - t \log\left( (1-\alpha) + \frac{\alpha}{t} \right). \tag{26}
+\]
+
+Hence, $S_\alpha$ divergences are convex in both arguments, and the $s_\alpha$ centroids are unique.
+
+5. Soft Mixed α-Clustering
+
+Algorithm 4 reports the general clustering with soft membership, which can be adapted to left
+(λinit = 1), right (λinit = 0) or mixed clustering. We have not considered a weighted histogram set in
+order not to overload the notation and because the extension is straightforward.
+Again, for skew Jeffreys centers, we shall adopt a variational approach. Notice that the soft
+clustering approach learns all parameters, including λ (if not constrained to zero or one) and α ∈ R.
+This is not the case for Matsuyama’s α-expectation maximization (EM) algorithm [42] in which α is
+fixed beforehand (and, thus, not learned).
+Assuming we model the prior for histograms by:
+
+\[
+p_{\lambda,\alpha,j}(h_i) \propto \lambda \exp\left( -D_\alpha(l_j : h_i) \right) + (1-\lambda) \exp\left( -D_\alpha(h_i : r_j) \right), \tag{27}
+\]
+
+the negative log-likelihood involves the α-dependent quantity:
+
+\[
+\sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) \log p_{\lambda,\alpha,j}(h_i) \geq \sum_{j=1}^{k} \sum_{i=1}^{m} M_{\lambda,\alpha}(l_j : h_i : r_j) p(j|h_i), \tag{28}
+\]
+
+because of the concavity of the logarithm function. Therefore, the maximization step for α
+involves finding:
+
+\[
+\arg\max_\alpha \sum_{j=1}^{k} \sum_{i=1}^{m} M_{\lambda,\alpha}(l_j : h_i : r_j) p(j|h_i). \tag{29}
+\]
+
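+Algorithm 4: Mixed α-soft clustering; MAsC(H, k, λ, α)
+
+Input: Histogram set $H$ with $|H| = m$, integer $k > 0$, real $\lambda \leftarrow \lambda_{\mathrm{init}} \in [0, 1]$, real $\alpha \in \mathbb{R}$;
+Let $C = \{(l_i, r_i)\}_{i=1}^{k} \leftarrow \mathrm{MAS}(H, k, \lambda, \alpha)$;
+repeat
+    // Expectation
+    for $i = 1, 2, \ldots, m$ do
+        for $j = 1, 2, \ldots, k$ do
+            $p(j|h_i) = \frac{\pi_j \exp(-M_{\lambda,\alpha}(l_j : h_i : r_j))}{\sum_{j'} \pi_{j'} \exp(-M_{\lambda,\alpha}(l_{j'} : h_i : r_{j'}))}$;
+    // Maximization
+    for $j = 1, 2, \ldots, k$ do
+        $\pi_j \leftarrow \frac{1}{m} \sum_i p(j|h_i)$;
+        $l_j \leftarrow \left( \frac{1}{\sum_i p(j|h_i)} \sum_i p(j|h_i) h_i^{\frac{1+\alpha}{2}} \right)^{\frac{2}{1+\alpha}}$;
+        $r_j \leftarrow \left( \frac{1}{\sum_i p(j|h_i)} \sum_i p(j|h_i) h_i^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}}$;
+    // Alpha - Lambda
+    $\alpha \leftarrow \alpha - \eta_1 \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) \frac{\partial}{\partial\alpha} M_{\lambda,\alpha}(l_j : h_i : r_j)$;
+    if $\lambda_{\mathrm{init}} \neq 0, 1$ then
+        $\lambda \leftarrow \lambda - \eta_2 \left( \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) D_\alpha(l_j : h_i) - \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) D_\alpha(h_i : r_j) \right)$;
+        // for some small $\eta_1, \eta_2$; ensure that $\lambda \in [0, 1]$.
+until convergence;
+Output: Soft clustering of $H$ according to $k$ densities $p(j|.)$ following $C$;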
+
+No closed-form solutions are known, so we compute the gradient update in Algorithm 4 with:
+
+\[
+\frac{\partial M_{\lambda,\alpha}(l_j : h_i : r_j)}{\partial\alpha} = \lambda \frac{\partial D_\alpha(l_j : h_i)}{\partial\alpha} + (1-\lambda) \frac{\partial D_\alpha(h_i : r_j)}{\partial\alpha}, \tag{30}
+\]
+
+\[
+\frac{\partial D_\alpha(p : q)}{\partial\alpha} = \frac{2}{(1-\alpha)^2} \left( q - \left( \frac{1-\alpha}{1+\alpha} \right)^2 p - \frac{1-\alpha}{1+\alpha} p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}} \left( \frac{4\alpha}{1-\alpha^2} + \ln \frac{q}{p} \right) \right). \tag{31}
+\]
+
+The update in λ is easier as:
+
+\[
+\frac{\partial M_{\lambda,\alpha}(l_j : h_i : r_j)}{\partial\lambda} = D_\alpha(l_j : h_i) - D_\alpha(h_i : r_j). \tag{32}
+\]
+
+Maximizing the likelihood in λ would imply choosing λ ∈ {0, 1} (a hard choice for left/right centers),
+yet we prefer the soft update for the parameter, like for α.
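+
+One expectation step and one maximization step of the soft clustering can be sketched in Python as follows.
+This is a partial, hedged illustration: the gradient updates of α and λ are deliberately left out, the seeds are
+given explicitly instead of calling MAS, α is assumed different from ±1, every cluster is assumed to keep a
+positive total membership, and all function names are hypothetical.
+

```python
import math

def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) of Equation (5), for alpha different from -1 and 1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha ** 2) * sum(
        a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))

def mixed_divergence(l, h, r, lam, alpha):
    """M_{lambda,alpha}(l : h : r) = lam D_alpha(l : h) + (1 - lam) D_alpha(h : r)."""
    return lam * alpha_divergence(l, h, alpha) + (1 - lam) * alpha_divergence(h, r, alpha)

def expectation_step(H, centers, priors, lam, alpha):
    """Posterior memberships p(j | h_i), one row per histogram."""
    posteriors = []
    for h in H:
        scores = [prior * math.exp(-mixed_divergence(l, h, r, lam, alpha))
                  for prior, (l, r) in zip(priors, centers)]
        total = sum(scores)
        posteriors.append([s / total for s in scores])
    return posteriors

def power_mean(H, resp, exponent):
    """Coordinate-wise (sum_i resp_i h_i^e / sum_i resp_i)^(1/e)."""
    mass = sum(resp)
    return [(sum(r * h[b] ** exponent for r, h in zip(resp, H)) / mass) ** (1 / exponent)
            for b in range(len(H[0]))]

def maximization_step(H, posteriors, alpha):
    """Update the mixing weights pi_j and the couples (l_j, r_j)."""
    m, k = len(H), len(posteriors[0])
    priors, centers = [], []
    for j in range(k):
        resp = [row[j] for row in posteriors]
        priors.append(sum(resp) / m)
        left = power_mean(H, resp, (1 + alpha) / 2)     # l_j
        right = power_mean(H, resp, (1 - alpha) / 2)    # r_j
        centers.append((left, right))
    return priors, centers

H = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.75, 0.15, 0.1],
     [0.1, 0.1, 0.8], [0.1, 0.2, 0.7], [0.1, 0.15, 0.75]]
lam, alpha = 0.5, 0.5
centers = [(H[0], H[0]), (H[3], H[3])]
posteriors = expectation_step(H, centers, [0.5, 0.5], lam, alpha)
priors, centers = maximization_step(H, posteriors, alpha)
```

+
+Each row of `posteriors` sums to one, and the updated `priors` sum to one as well; a full implementation would
+repeat both steps and interleave the α and λ updates of Algorithm 4.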
+
+6. Conclusions
+
+The family of α-divergences plays a fundamental role in information geometry: these statistical
+distortion measures are the canonical divergences of dual spaces on probability distribution manifolds
+with constant curvature $\kappa = \frac{1-\alpha^2}{4}$ and the canonical divergences of dually flat manifolds for positive
+distribution manifolds [19].
+In this work, we have presented three techniques for clustering (positive or frequency) histograms
+using k-means:
+
+(1) Sided left or right α-centroid k-means,
+(2) Symmetrized Jeffreys-type α-centroid (variational) k-means, and
+(3) Coupled k-means with respect to mixed α-divergences relying on dual α-centroids.
+
+Sided and mixed dual centroids are always available in closed form and are therefore highly
+attractive from the standpoint of implementation. Symmetrized Jeffreys centroids are in general not
+available in closed form and require one to implement a variational k-means by updating incrementally
+the cluster centroids in order to monotonically decrease the k-means loss function. From the clustering
+standpoint, this appears not to be a problem when guaranteed expected approximations to the optimal
+clustering are enough.
+Indeed, we also presented and analyzed an extension of k-means++ [17] for seeding those k-means
+algorithms. The mixed α-seeding initializations do not require one to calculate centroids and behave
+like a discrete k-means by picking the seeds among the data. We reported guaranteed probabilistic
+clustering bounds. Thus, it yields a fast hard/soft data partitioning technique with respect to mixed or
+symmetrized α-divergences. Recently, the advantage of clustering using α-divergences by tuning α in
+applications has been demonstrated in [18]. We thus expect the computationally fast mixed α-seeding
+with guaranteed performance to be useful in a growing number of applications.
+
+Acknowledgments
+
+NICTA is funded by the Australian Government as represented by the Department of Broadband,
+Communication and the Digital Economy and the Australian Research Council through the ICT Centre
+of Excellence program.
+
+Author Contributions
+All authors contributed equally to the design of the research. The research was carried out by all
+authors. Frank Nielsen and Richard Nock wrote the paper. Frank Nielsen implemented the algorithms
+and performed experiments. All authors have read and approved the final manuscript.
+
+Conflicts of Interest
+The authors declare no conflict of interest.
+
+Appendix Proof Sketch of Theorem 1
+
+We give here the key results allowing one to obtain the proof of the Theorem, following the proof
+scheme of [15]. In order not to overload the notation, weights are considered uniform. The extension to
+non-uniform weights is immediate, as it boils down to duplicating histograms in the histogram set and
+does not change the approximation result.
+Let A ⊆ H be an arbitrary cluster of Copt. Let us define UA and πA as the uniform and biased
+distributions conditioned to A. The key to the proof is to relate the expected potential of A under UA
+and πA to its contribution to the optimal potential.
+
+Lemma A1. Let A ⊆ H be an arbitrary cluster of Copt. Then:
+
+\[
+E_{c \sim U_A}[M_{\lambda,\alpha}(A, c)] = M_{\mathrm{opt},\lambda,\alpha}(A) + M_{\mathrm{opt},\lambda,-\alpha}(A) = E_{c \sim U_A}[M_{\lambda,-\alpha}(A, c)],
+\]
+
+where $U_A$ is the uniform distribution over A.
+
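+Proof. α-coordinates have the property that for any subset A ⊆ H, $(1/|A|) \sum_{p \in A} u_\alpha(p) = u_\alpha(r_{\alpha,A})$.
+Hence, we have:
+
+\[
+\begin{aligned}
+\forall c \in A, \quad \sum_{p \in A} D_\alpha(p : c) &= \sum_{p \in A} D_{\phi_\alpha}(u_\alpha(p) : u_\alpha(c)) \\
+&= \sum_{p \in A} D_{\phi_\alpha}(u_\alpha(p) : u_\alpha(r_{\alpha,A})) + |A| D_{\phi_\alpha}(u_\alpha(r_{\alpha,A}) : u_\alpha(c)) \\
+&= \sum_{p \in A} D_\alpha(p : r_{\alpha,A}) + |A| D_\alpha(r_{\alpha,A} : c).
+\end{aligned} \tag{A1}
+\]
+
+Because $D_\alpha(p : q) = D_{-\alpha}(q : p)$ and $l_\alpha = r_{-\alpha}$, we obtain:
+
+\[
+\begin{aligned}
+\forall c \in A, \quad \sum_{p \in A} D_\alpha(c : p) &= \sum_{p \in A} D_{-\alpha}(p : c) \\
+&= \sum_{p \in A} D_{-\alpha}(p : r_{-\alpha,A}) + |A| D_{-\alpha}(r_{-\alpha,A} : c) \\
+&= \sum_{p \in A} D_\alpha(l_{\alpha,A} : p) + |A| D_\alpha(c : l_{\alpha,A}).
+\end{aligned} \tag{A2}
+\]
+
+It comes now from (A1) and (A2) that:
+
+\[
+\begin{aligned}
+E_{c \sim U_A}[M_{\lambda,\alpha}(A, c)] &= \frac{1}{|A|} \sum_{c \in A} \sum_{p \in A} \left\{ \lambda D_\alpha(c : p) + (1-\lambda) D_\alpha(p : c) \right\} \\
+&= (1-\lambda) \sum_{p \in A} D_\alpha(p : r_{\alpha,A}) + (1-\lambda) \sum_{p \in A} D_\alpha(r_{\alpha,A} : p) \\
+&\quad + \lambda \sum_{p \in A} D_\alpha(l_{\alpha,A} : p) + \lambda \sum_{p \in A} D_\alpha(p : l_{\alpha,A}) \\
+&= (1-\lambda) M_{\mathrm{opt},0,\alpha}(A) + \lambda M_{\mathrm{opt},1,\alpha}(A) \\
+&\quad + (1-\lambda) M_{\mathrm{opt},0,-\alpha}(A) + \lambda M_{\mathrm{opt},1,-\alpha}(A) \\
+&= M_{\mathrm{opt},\lambda,\alpha}(A) + M_{\mathrm{opt},\lambda,-\alpha}(A).
+\end{aligned} \tag{A3}
+\]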
+
+This gives the left-hand side equality of the Lemma. The right-hand side follows from the fact that
+Ec∼UA[Mλ,−α(A, c)] = Mopt,1−λ,α(A) + Mopt,1−λ,−α(A).
+
+Instead of Mopt,λ,α(A) + Mopt,λ,−α(A), we want a term depending solely on Mopt,λ,α(A) as it is
+the “true” optimum. We now give two lemmata that shall be useful in obtaining this upper bound.
+The first is of independent interest, as it shows that any α-divergence is a scaled, squared Hellinger
+distance between geometric means of points.
+
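+Lemma A2. For any p, q and α ≠ 1, there exists r ∈ [p, q], such that $(1-\alpha)^2 D_\alpha(p : q) = D_0(p^{1-\alpha} r^{\alpha} : q^{1-\alpha} r^{\alpha})$.
+
+Proof. By the definition of Bregman divergences, for any x, y, there exists some z ∈ [x, y], such that:
+
+\[
+\begin{aligned}
+D_{\phi_\alpha}(x : y) &= \frac{1}{2} (x - y)^2 \phi''_\alpha(z) \\
+&= \frac{1}{2} (x - y)^2 \left( 1 + \frac{1-\alpha}{2} z \right)^{\frac{2\alpha}{1-\alpha}},
+\end{aligned}
+\]
+
+and since $u_\alpha$ is continuous and strictly increasing, for any p, q, there exists some r ∈ [p, q], such that:
+
+\[
+\begin{aligned}
+D_\alpha(p : q) &= D_{\phi_\alpha}(u_\alpha(p) : u_\alpha(q)) \\
+&= \frac{1}{2} (u_\alpha(p) - u_\alpha(q))^2 \left( 1 + \frac{1-\alpha}{2} u_\alpha(r) \right)^{\frac{2\alpha}{1-\alpha}} \\
+&= \frac{2}{(1-\alpha)^2} \left( p^{\frac{1-\alpha}{2}} - q^{\frac{1-\alpha}{2}} \right)^2 r^{\alpha} \\
+&= \frac{2}{(1-\alpha)^2} \left( p^{1-\alpha} + q^{1-\alpha} - 2 (pq)^{\frac{1-\alpha}{2}} \right) r^{\alpha} \\
+&= \frac{1}{(1-\alpha)^2} D_0(p^{1-\alpha} r^{\alpha} : q^{1-\alpha} r^{\alpha}).
+\end{aligned}
+\]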
+
+Lemma A3. Let discrete random variable x take non-negative values $x_1, x_2, \ldots, x_m$ with uniform probabilities.
+Then, for any β > −1, we have $\mathrm{var}(x^{1+\beta} / u^{\beta}) \leq \mathrm{var}(x)$, with $u \doteq (1+\beta)^{\beta} \max_i x_i$.
+
+Proof. First, ∀β > −1, remark that for any x, function $f(x) = x(u^{\beta} - x^{\beta})$ is increasing for $x \leq u / (1+\beta)^{\beta}$.
+Hence, assuming that the $x_i$s are put in non-increasing order without loss of generality, we
+have $f(x_i) \geq f(x_j)$, and so, $x_i(u^{\beta} - x_i^{\beta}) \geq x_j(u^{\beta} - x_j^{\beta})$, ∀i ≥ j, as long as $x_i \leq u / (1+\beta)^{\beta}$. Choosing
+$u = x_1 (1+\beta)^{\beta}$ yields, after reordering and putting the exponent, $(x_i^{1+\beta} - x_j^{1+\beta})^2 \leq (x_i u^{\beta} - x_j u^{\beta})^2$.
+Hence:
+
+\[
+\begin{aligned}
+\frac{1}{m} \sum_i x_i^{2(1+\beta)} - \left( \frac{1}{m} \sum_i x_i^{1+\beta} \right)^2 &= \frac{1}{2m^2} \sum_{i,j} \left( x_i^{1+\beta} - x_j^{1+\beta} \right)^2 \\
+&\leq \frac{1}{2m^2} \sum_{i,j} \left( x_i u^{\beta} - x_j u^{\beta} \right)^2 \\
+&= \frac{u^{2\beta}}{2m^2} \sum_{i,j} (x_i - x_j)^2 \\
+&= u^{2\beta} \left( \frac{1}{m} \sum_i x_i^2 - \left( \frac{1}{m} \sum_i x_i \right)^2 \right).
+\end{aligned}
+\]
+
+Dividing by $u^{2\beta}$ the leftmost and rightmost terms and using the fact that $\mathrm{var}(\lambda x) = \lambda^2 \mathrm{var}(x)$ yields
+the statement of the Lemma.
+
+We are now ready to upper bound Mopt,λ,−α(A) as a function of Mopt,λ,α(A).
+
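+Lemma A4. For any cluster A of Copt,
+
+\[
+M_{\mathrm{opt},\lambda,-\alpha}(A) \leq M_{\mathrm{opt},\lambda,\alpha}(A) \times \begin{cases}
+f(\lambda) & \text{if } \lambda \in (0, 1), \\
+z(\alpha) h^2(\alpha) & \text{otherwise,}
+\end{cases}
+\]
+
+where z(α), f(λ) and h(α) are defined in Theorem 1.
+
+Proof. The case λ ≠ 0, 1 is fast, as we have by definition:
+
+\[
+\begin{aligned}
+M_{\mathrm{opt},\lambda,-\alpha}(A) &= \sum_{p \in A} \lambda D_{-\alpha}(l_{-\alpha,A} : p) + (1-\lambda) D_{-\alpha}(p : r_{-\alpha,A}) \\
+&= \sum_{p \in A} \lambda D_\alpha(p : l_{-\alpha,A}) + (1-\lambda) D_\alpha(r_{-\alpha,A} : p) \\
+&= \sum_{p \in A} \lambda D_\alpha(p : r_{\alpha,A}) + (1-\lambda) D_\alpha(l_{\alpha,A} : p) \\
+&\leq \max\left\{ \frac{1-\lambda}{\lambda}, \frac{\lambda}{1-\lambda} \right\} M_{\mathrm{opt},\lambda,\alpha}(A) \\
+&= f(\lambda) M_{\mathrm{opt},\lambda,\alpha}(A).
+\end{aligned}
+\]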
+
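+Suppose now that λ = 0 and α ≥ 0. Because $M_{\mathrm{opt},0,-\alpha}(A) = \sum_{p \in A} D_{-\alpha}(p : r_{-\alpha,A}) = \sum_{p \in A} D_\alpha(l_{\alpha,A} : p) = M_{\mathrm{opt},1,\alpha}(A)$,
+what we wish to do is upper bound $\sum_{p \in A} D_\alpha(l_{\alpha,A} : p) = M_{\mathrm{opt},1,\alpha}(A)$
+as a function of $\sum_{p \in A} D_\alpha(p : r_{\alpha,A}) = M_{\mathrm{opt},0,\alpha}(A)$. We use Lemmata A2 and A3 in the following
+derivations, using r(p) to refer to the r in Lemma A2, assuming α ≥ 0. We also denote by $\mathrm{var}_A(f(p))$
+the variance, under the uniform distribution over A, of discrete random variable f(p), for p ∈ A. We
+have:
+
+\[
+\begin{aligned}
+\sum_{p \in A} D_\alpha(l_{\alpha,A} : p) &= \sum_{p \in A} D_{-\alpha}(p : l_{\alpha,A}) \\
+&= \frac{1}{(1+\alpha)^2} \sum_{p \in A} r(p)^{-\alpha} D_0(p^{1+\alpha} : l_{\alpha,A}^{1+\alpha}) \\
+&\leq \frac{1}{(1+\alpha)^2 \min_A p^{\alpha}} \sum_{p \in A} D_0(p^{1+\alpha} : l_{\alpha,A}^{1+\alpha}) \\
+&= \frac{1}{(1+\alpha)^2 \min_A p^{\alpha}} \sum_{p \in A} \left( p^{1+\alpha} + l_{\alpha,A}^{1+\alpha} - 2 p^{\frac{1+\alpha}{2}} l_{\alpha,A}^{\frac{1+\alpha}{2}} \right) \\
+&= \frac{|A|}{(1+\alpha)^2 \min_A p^{\alpha}} \left( \frac{1}{|A|} \sum_{p \in A} p^{1+\alpha} - \left( \frac{1}{|A|} \sum_{p \in A} p^{\frac{1+\alpha}{2}} \right)^2 \right) \\
+&= \frac{|A| \, \mathrm{var}_A(p^{\frac{1+\alpha}{2}})}{(1+\alpha)^2 \min_A p^{\alpha}}.
+\end{aligned} \tag{A4}
+\]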
+
+We have used the expression of the left centroid $l_{\alpha,A}$ to simplify the expressions. Now, picking $x_i = p_i^{\frac{1-\alpha}{2}}$,
+$\beta = 2\alpha/(1 - \alpha)$ and $u = \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{2\alpha}{1-\alpha}} \max_A p^{\frac{1-\alpha}{2}}$
+in Lemma A3 yields:
+
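+\begin{align*}
+\mathrm{var}_A\bigl(p^{\frac{1+\alpha}{2}}\bigr)
+&= u^{2\beta} \mathrm{var}_A\bigl(p^{\frac{1+\alpha}{2}} / u^{\beta}\bigr) \\
+&= u^{2\beta} \mathrm{var}_A\bigl(p^{\frac{1-\alpha}{2}} p^{\alpha} / u^{\beta}\bigr) \\
+&= u^{2\beta} \mathrm{var}\bigl(x^{1+\beta} / u^{\beta}\bigr) \\
+&\le u^{2\beta} \mathrm{var}(x) \\
+&= u^{2\beta} \mathrm{var}_A\bigl(p^{\frac{1-\alpha}{2}}\bigr) \\
+&= \left(\frac{1 + \alpha}{1 - \alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2}} \max_A p^{2\alpha} \, \mathrm{var}_A\bigl(p^{\frac{1-\alpha}{2}}\bigr) . \tag{A5}
+\end{align*}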
+
+Plugging this in (A4) yields:
+
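+\begin{align*}
+\sum_{p \in A} D_{\alpha}(l_{\alpha,A} : p)
+&\le \frac{\left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2}} |A| \max_A p^{2\alpha} \, \mathrm{var}_A\bigl(p^{\frac{1-\alpha}{2}}\bigr)}{(1 + \alpha)^2 \min_A p^{\alpha}} \\
+&= \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2} - 2} \left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \frac{|A| \min_A p^{\alpha} \, \mathrm{var}_A\bigl(p^{\frac{1-\alpha}{2}}\bigr)}{(1 - \alpha)^2} \\
+&= \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2} - 2} \left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \frac{\min_A p^{\alpha}}{(1 - \alpha)^2} \sum_{p \in A} D_0\bigl(p^{1-\alpha} : r_{\alpha,A}^{1-\alpha}\bigr) \tag{A6} \\
+&\le \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2} - 2} \left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \frac{1}{(1 - \alpha)^2} \sum_{p \in A} r(p)^{\alpha} D_0\bigl(p^{1-\alpha} : r_{\alpha,A}^{1-\alpha}\bigr) \\
+&= \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2} - 2} \left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \sum_{p \in A} D_{\alpha}(p : r_{\alpha,A}) \\
+&\le z(\alpha) \left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \sum_{p \in A} D_{\alpha}(p : r_{\alpha,A}) . \tag{A7}
+\end{align*}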
+
+Here, (A6) retraces backwards the derivations that led to (A4). The cases λ = 1 or α < 0 are
+obtained using the same chains of derivations, which completes the proof of Lemma A4.
+
+Lemma A4 can be directly used to refine the bound of Lemma A1 in the uniform distribution. We
+give the Lemma for the biased distribution, directly integrating the refinement of the bound.
+
+Lemma A5. Let A be an arbitrary cluster of Copt and C an arbitrary clustering. If we add a random couple
+(c, c) to C, chosen from A with π as in Algorithm 2, then:
+
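+\begin{equation*}
+E_{c \sim \pi_A}[M_{\lambda,\alpha}(A, c)] \le 4
+\begin{cases}
+f(\lambda) h^2(\alpha) M_{\mathrm{opt},\lambda,\alpha}(A) & \text{if } \lambda \in (0, 1), \\
+z(\alpha) h^4(\alpha) M_{\mathrm{opt},\lambda,\alpha}(A) & \text{otherwise},
+\end{cases}
+\tag{A8}
+\end{equation*}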
+
+where f (λ) and h(α) are defined in Theorem 1.
+
+Proof. The proof essentially follows the proof of Lemma 3 in [15]. To complete it, we need a triangle
+inequality involving α-divergences. We give it here.
+
+Lemma A6. For any p, q, r and α, we have:
+
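+\begin{equation*}
+\sqrt{D_{\alpha}(p : q)} \le \left( \frac{\max_i \{p_i, q_i, r_i\}}{\min_i \{p_i, q_i, r_i\}} \right)^{|\alpha|} \left( \sqrt{D_{\alpha}(p : r)} + \sqrt{D_{\alpha}(r : q)} \right) \tag{A9}
+\end{equation*}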
+
+(where the min is over strictly positive values)
+
+Remark: take α = 0; we find the triangle inequality for the squared Hellinger distance.
+
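+Proof. Using the proof of Lemma 2 in [15] for Bregman divergence $D_{\phi_\alpha}$, we get:
+\begin{equation*}
+\sqrt{D_{\phi_\alpha}(x : z)} \le \rho(\alpha) \left( \sqrt{D_{\phi_\alpha}(x : y)} + \sqrt{D_{\phi_\alpha}(y : z)} \right) , \tag{A10}
+\end{equation*}
+where:
+\begin{equation*}
+\rho(\alpha) = \max_{u,v} \frac{\left(1 + \frac{1-\alpha}{2} u\right)^{\frac{2\alpha}{1-\alpha}}}{\left(1 + \frac{1-\alpha}{2} v\right)^{\frac{2\alpha}{1-\alpha}}} . \tag{A11}
+\end{equation*}
+Taking $x = u_\alpha(p)$, $y = u_\alpha(q)$, $z = u_\alpha(r)$ yields $\rho(\alpha) = \max_{s,t \in \{p_i, q_i, r_i\}} (s/t)^{|\alpha|}$ and the statement of
+Lemma A6.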
+
+The rest of the proof of Lemma A5 follows the proof of Lemma 3 in [15].
+
+We now have all of the ingredients of our proof; it remains to use Lemma 4 in [15] to complete the
+proof of Theorem 1.
+
+Appendix Properties of α-Divergences
+
+For positive arrays p and q, the α-divergence Dα(p : q) can be defined as an equivalent
+representational Bregman divergence [19,34] Bϕα(uα(p) : uα(q)) over the (uα, vα)-structure [43] with:
+
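+\begin{align*}
+\phi_\alpha(x) &\doteq \frac{2}{1 + \alpha} \left( 1 + \frac{1 - \alpha}{2} x \right)^{\frac{2}{1-\alpha}} , \tag{A12} \\
+u_\alpha(p) &\doteq \frac{2}{1 - \alpha} \left( p^{\frac{1-\alpha}{2}} - 1 \right) , \tag{A13} \\
+v_\alpha(p) &\doteq \frac{2}{1 + \alpha} \, p^{\frac{1+\alpha}{2}} , \tag{A14}
+\end{align*}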
+
+where we assume that $\alpha \neq \pm 1$. Otherwise, for $\alpha = \pm 1$, we compute $D_\alpha(p : q)$ by taking the sided
+Kullback–Leibler divergence extended to positive arrays.
+In the proof of Theorem 1, we have used two properties of α-divergences of independent interest:
+
+• any α-divergence can be explained as a scaled squared Hellinger distance between geometric
+means of its arguments and a point that belongs to their segment (Lemma A2);
+• any α-divergence satisfies a generalized triangle inequality (Lemma A6). Notice that this Lemma
+is optimal in the sense that for α = 0, it is possible to recover the triangle inequality of the
+Hellinger distance.
+
+The following lemma shows how to bound the mixed divergence as a function of an α-divergence.
+
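+Lemma A7. For any positive arrays $l$, $h$, $r$ and $\alpha \neq \pm 1$, define $\eta \doteq \lambda(1 - \alpha)/(1 - \alpha(2\lambda - 1)) \in [0, 1]$, $g_\eta$
+with $g_\eta^i \doteq (l^i)^{\eta} (r^i)^{1-\eta}$ and $a_\eta$ with $a_\eta^i \doteq \eta l^i + (1 - \eta) r^i$. Then, we have:
+\[
+M_{\lambda,\alpha}(l : h : r) \le \frac{1 - \alpha^2 (2\lambda - 1)^2}{1 - \alpha^2} D_{\alpha(2\lambda - 1)}(g_\eta : h)
++ \frac{2(1 - \alpha(2\lambda - 1))}{1 - \alpha^2} \sum_i \left( a_\eta^i - g_\eta^i \right) .
+\]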
+
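+Proof. For every index $i$, we have:
+\begin{align*}
+M_{\lambda,\alpha}(l^i : h^i : r^i) &= \lambda D_\alpha(l^i : h^i) + (1 - \lambda) D_\alpha(h^i : r^i) \\
+&= \frac{4}{1 - \alpha^2} \Bigl( \frac{\lambda(1 - \alpha)}{2} l^i + \frac{(1 - \lambda)(1 + \alpha)}{2} r^i + \frac{1 + \alpha(2\lambda - 1)}{2} h^i \tag{A15} \\
+&\qquad - \lambda (l^i)^{\frac{1-\alpha}{2}} (h^i)^{\frac{1+\alpha}{2}} - (1 - \lambda) (r^i)^{\frac{1+\alpha}{2}} (h^i)^{\frac{1-\alpha}{2}} \Bigr) . \tag{A16}
+\end{align*}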
+
+The arithmetic-geometric-harmonic (AGH) inequality implies:
+
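+\begin{align*}
+\lambda (l^i)^{\frac{1-\alpha}{2}} (h^i)^{\frac{1+\alpha}{2}} + (1 - \lambda) (r^i)^{\frac{1+\alpha}{2}} (h^i)^{\frac{1-\alpha}{2}}
+&\ge (l^i)^{\frac{\lambda(1-\alpha)}{2}} (r^i)^{\frac{(1-\lambda)(1+\alpha)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}} \\
+&= \Bigl( (l^i)^{\frac{\lambda(1-\alpha)}{1-\alpha(2\lambda-1)}} (r^i)^{\frac{(1-\lambda)(1+\alpha)}{1-\alpha(2\lambda-1)}} \Bigr)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}} \\
+&= \bigl( (l^i)^{\eta} (r^i)^{1-\eta} \bigr)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}} \\
+&= (g_\eta^i)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}} .
+\end{align*}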
+
+It follows that (A16) yields:
+
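+\begin{align*}
+M_{\lambda,\alpha}(l^i : h^i : r^i)
+&\le \frac{4}{1 - \alpha^2} \Bigl( \frac{1 - \alpha(2\lambda - 1)}{2} \bigl( \eta l^i + (1 - \eta) r^i \bigr) + \frac{1 + \alpha(2\lambda - 1)}{2} h^i - (g_\eta^i)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}} \Bigr) \tag{A17} \\
+&= \frac{1 - \alpha^2 (2\lambda - 1)^2}{1 - \alpha^2} D_{\alpha(2\lambda - 1)}(g_\eta^i : h^i) + \frac{2(1 - \alpha(2\lambda - 1))}{1 - \alpha^2} \left( a_\eta^i - g_\eta^i \right) , \tag{A18}
+\end{align*}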
+
+out of which we get the statement of the Lemma.
+
+Appendix Sided α-Centroids
+
+For the sake of completeness, we prove the following theorem:
+
+Theorem A1 (Sided positive α-centroids [34]). The left-sided $l_\alpha$ and right-sided $r_\alpha$ positive weighted
+α-centroid coordinates of a set of $n$ positive histograms $h_1, \ldots, h_n$ are weighted α-means:
+\[
+r_\alpha^i = f_\alpha^{-1} \left( \sum_{j=1}^{n} w_j f_\alpha(h_j^i) \right) , \qquad l_\alpha^i = r_{-\alpha}^i
+\]
+with:
+\[
+f_\alpha(x) =
+\begin{cases}
+x^{\frac{1-\alpha}{2}} & \alpha \neq \pm 1, \\
+\log x & \alpha = 1.
+\end{cases}
+\]
+
+Proof. We distinguish three cases: $\alpha \neq \pm 1$, $\alpha = -1$ and $\alpha = 1$.
+First, consider the general case $\alpha \neq \pm 1$. We have to minimize:
+\[
+R_\alpha(x, H) = \frac{4}{1 - \alpha^2} \sum_{j=1}^{n} w_j \times \sum_{i=1}^{d} \left( \frac{1 - \alpha}{2} h_j^i + \frac{1 + \alpha}{2} x^i - (h_j^i)^{\frac{1-\alpha}{2}} (x^i)^{\frac{1+\alpha}{2}} \right) .
+\]
+
+Removing all additive terms independent of $x^i$ and the overall constant multiplicative factor $\frac{4}{1-\alpha^2} \neq 0$,
+we get the following equivalent minimisation problem:
+\begin{equation*}
+R'_\alpha(x, H) = \sum_{i=1}^{d} \frac{1 + \alpha}{2} x^i - (x^i)^{\frac{1+\alpha}{2}} \underbrace{\left( \sum_{j=1}^{n} w_j (h_j^i)^{\frac{1-\alpha}{2}} \right)}_{\bar{h}_\alpha^i} , \tag{A19}
+\end{equation*}
+where $\bar{h}_\alpha^i$ denotes the following aggregation term:
+\[
+\bar{h}_\alpha^i = \sum_{j=1}^{n} w_j (h_j^i)^{\frac{1-\alpha}{2}} .
+\]
+
+Setting coordinate-wise the derivative to zero of Equation (A19) (i.e., $\nabla_x R'(x, H) = 0$), we get:
+\[
+\frac{1 + \alpha}{2} - \frac{1 + \alpha}{2} (x^i)^{\frac{\alpha-1}{2}} \bar{h}_\alpha^i = 0 .
+\]
+
+Thus, we find that the coordinates of the right-sided α-centroids are:
+\[
+c_\alpha^i = (\bar{h}_\alpha^i)^{\frac{2}{1-\alpha}} = \left( \sum_{j=1}^{n} w_j (h_j^i)^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}} = \hat{h}_\alpha^i .
+\]
+
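+We recognise the expression of a quasi-arithmetic mean for the strictly monotone generator $f_\alpha(x)$:
+\begin{equation*}
+r_\alpha^i = f_\alpha^{-1} \left( \sum_{j=1}^{n} w_j f_\alpha(h_j^i) \right) , \tag{A20}
+\end{equation*}
+with:
+\[
+f_\alpha(x) = x^{\frac{1-\alpha}{2}} , \qquad f_\alpha^{-1}(x) = x^{\frac{2}{1-\alpha}} , \qquad \alpha \neq \pm 1 .
+\]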
+
+Therefore, we conclude that the coordinates of the positive α-centroid are the weighted α-means of
+the histogram coordinates (for $\alpha \neq \pm 1$). Quasi-arithmetic means are also called in the literature
+quasi-linear means or f-means.
+When $\alpha = -1$, we search for the right-sided extended Kullback–Leibler divergence centroid by
+minimising:
+\[
+R_{-1}(x; \tilde{H}) = \sum_{j=1}^{n} w_j \sum_{i=1}^{d} h_j^i \log \frac{h_j^i}{x^i} + x^i - h_j^i .
+\]
+
+It is equivalent to minimizing:
+\[
+R'_{-1}(x; \tilde{H}) = \sum_{i=1}^{d} x^i - \underbrace{\left( \sum_{j=1}^{n} w_j h_j^i \right)}_{a} \log x^i ,
+\]
+where $a$ denotes the arithmetic mean. Solving coordinate-wise, we get $c^i = a^i = \sum_{j=1}^{n} w_j h_j^i$.
+When α = 1, the right-sided reverse extended KL centroid is a left-sided extended KL centroid. The
+minimisation problem is:
+
+\[
+R_1(x; \tilde{H}) = \sum_{j=1}^{n} w_j \sum_{i=1}^{d} x^i \log \frac{x^i}{h_j^i} + h_j^i - x^i .
+\]
+Since $\sum_j w_j = 1$, we solve coordinate-wise and find $\log x = \sum_j w_j \log h_j$. That is, $r_1^i$ is the geometric
+mean:
+\[
+r_1^i = \prod_{j=1}^{n} (h_j^i)^{w_j} .
+\]
+
+Both the arithmetic mean and the geometric mean are power means in the limit case (and hence
+quasi-arithmetic means). Thus,
+\begin{equation*}
+r_\alpha^i = f_\alpha^{-1} \left( \sum_{j=1}^{n} w_j f_\alpha(h_j^i) \right) , \tag{A21}
+\end{equation*}
+with:
+\[
+f_\alpha(x) =
+\begin{cases}
+x^{\frac{1-\alpha}{2}} & \alpha \neq \pm 1, \\
+\log x & \alpha = 1.
+\end{cases}
+\]
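
The closed forms of Theorem A1 can be checked numerically. The following is a minimal sketch, not the authors' implementation: the helper names `alpha_divergence`, `right_alpha_centroid` and `left_alpha_centroid` are hypothetical, histograms are plain Python lists of positive numbers, and the weights are assumed to sum to one, as in the proof. The divergence follows the per-coordinate expression used in (A15).

```python
import math


def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) for positive arrays, assuming alpha != +/-1."""
    a = (1.0 - alpha) / 2.0
    b = (1.0 + alpha) / 2.0
    total = sum(a * pi + b * qi - (pi ** a) * (qi ** b) for pi, qi in zip(p, q))
    return 4.0 / (1.0 - alpha ** 2) * total


def right_alpha_centroid(hists, weights, alpha):
    """Right-sided centroid: coordinate-wise weighted alpha-mean."""
    dim = len(hists[0])
    if alpha == 1:
        # Generator log x: weighted geometric mean.
        return [
            math.exp(sum(w * math.log(h[i]) for h, w in zip(hists, weights)))
            for i in range(dim)
        ]
    # Generator x^((1 - alpha) / 2); alpha = -1 gives the arithmetic mean.
    e = (1.0 - alpha) / 2.0
    return [
        sum(w * h[i] ** e for h, w in zip(hists, weights)) ** (1.0 / e)
        for i in range(dim)
    ]


def left_alpha_centroid(hists, weights, alpha):
    """Left-sided centroid, using l_alpha = r_{-alpha}."""
    return right_alpha_centroid(hists, weights, -alpha)


def right_loss(x, hists, weights, alpha):
    """Weighted sum of D_alpha(h_j : x), the objective R_alpha(x, H)."""
    return sum(w * alpha_divergence(h, x, alpha) for h, w in zip(hists, weights))
```

With two one-bin histograms `[[1.0], [4.0]]` and equal weights, the three regimes of the theorem (power mean, arithmetic mean, geometric mean) can be compared directly, and `right_loss` can be evaluated at small perturbations of the centroid to confirm that it is a minimizer.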
+
+References
+
+1.
+Baker, L.D.; McCallum, A.K. Distributional clustering of words for text classification. In Proceedings of the
+21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,
+Melbourne, Australia, 24–28 August 1998; ACM: New York, NY, USA, 1998; pp. 96–103.
+2.
+Bigi, B. Using Kullback–Leibler distance for text categorization.
+In Proceedings of the 25th European
+conference on IR research (ECIR), Pisa, Italy, 14–16 April 2003; Springer-Verlag: Berlin/Heidelberg, Germany,
+2003; ECIR’03, pp. 305–319.
+3.
+Bag of Words Data Set. Available online: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words (accessed
+on 17 June 2014).
+4.
+Csurka, G.; Bray, C.; Dance, C.; Fan, L. Visual Categorization with Bags of Keypoints; Workshop on Statistical
+Learning in Computer Vision (ECCV); Xerox Research Centre Europe: Meylan, France, 2004, pp. 1–22.
+5.
+Jégou, H.; Douze, M.; Schmid, C. Improving Bag-of-Features for Large Scale Image Search. Int. J. Comput.
+Vis. 2010, 87, 316–336.
+6.
+Yu, Z.; Li, A.; Au, O.; Xu, C. Bag of textons for image segmentation via soft clustering and convex shift. In
+Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI,
+USA, 16–21 June 2012; pp. 781–788.
+7.
+Steinhaus, H. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. 1956, 1, 801–804. (in
+French)
+8.
+Lloyd, S.P. Least Squares Quantization in PCM; Technical Report RR-5497; Bell Laboratories: Murray Hill, NJ,
+USA, 1957.
+9.
+Lloyd, S.P. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
+10.
+Chandrasekhar, V.; Takacs, G.; Chen, D.M.; Tsai, S.S.; Reznik, Y.A.; Grzeszczuk, R.; Girod, B. Compressed
+histogram of gradients: A low-bitrate descriptor. Int. J. Comput. Vis. 2012, 96, 384–399.
+11.
+Nock, R.; Nielsen, F.; Briys, E. Non-linear book manifolds: Learning from associations the dynamic geometry
+of digital libraries. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, New
+York, NY, USA, 2013; pp. 313–322.
+12.
+Kwitt, R.; Vasconcelos, N.; Rasiwasia, N.; Uhl, A.; Davis, B.C.; Häfner, M.; Wrba, F. Endoscopic image
+analysis in semantic space. Med. Image Anal. 2012, 16, 1415–1422.
+13.
+Nielsen, F. A family of statistical symmetric divergences based on Jensen’s inequality. 2010, arXiv:1009.4004.
+14.
+Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904.
+15.
+Nock, R.; Luosto, P.; Kivinen, J. Mixed Bregman clustering with approximation guarantees. In Proceedings of
+the European Conference on Machine Learning and Knowledge Discovery in Databases, Antwerp, Belgium,
+15–19 September 2008; Springer-Verlag: Berlin/Heidelberg, Germany, 2008; pp. 154–169.
+16.
+Amari, S. Integration of Stochastic Models by Minimizing α-Divergence. Neural Comput. 2007, 19, 2780–2796.
+17.
+Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth
+Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), New Orleans, LA, USA, 7–9 January 2007;
+Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2007; pp. 1027–1035.
+18.
+Olszewski, D.; Ster, B. Asymmetric clustering using the alpha-beta divergence. Pattern Recognit. 2014,
+47, 2031–2041.
+19.
+Amari, S. Alpha-divergence is unique, belonging to both f-divergence and Bregman divergence classes.
+IEEE Trans. Inf. Theory 2009, 55, 4925–4931.
+20.
+Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res.
+2005, 6, 1705–1749.
+21.
+Teboulle, M. A unified continuous optimization framework for center-based clustering methods. J. Mach.
+Learn. Res. 2007, 8, 65–102.
+22.
+Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: Oxford, UK, 2000.
+23.
+Morimoto, T. Markov Processes and the H-theorem. J. Phys. Soc. Jpn. 1963, 18, 328–331.
+
+24.
+Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat.
+Soc. Ser. B 1966, 28, 131–142.
+25.
+Csiszár, I. Information-type measures of difference of probability distributions and indirect observation.
+Studia Sci. Math. Hung. 1967, 2, 229–318.
+26.
+Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust
+nonnegative matrix factorization. Entropy 2011, 13, 134–170.
+27.
+Zhu, H.; Rohwer, R. Measurements of generalisation based on information geometry. In Mathematics of
+Neural Networks; Operations Research/Computer Science Interfaces Series; Ellacott, S., Mason, J., Anderson,
+I., Eds.; Springer: New York, NY, USA, 1997; Volume 8, pp. 394–398.
+28.
+Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations.
+Ann. Math. Stat. 1952, 23, 493–507.
+29.
+Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett.
+2013, 20, 269–272.
+30.
+Wu, J.; Rehg, J. Beyond the euclidean distance: creating effective visual codebooks using the histogram
+intersection kernel. In Proceedings of 2009 IEEE 12th International Conference on Computer Vision, Kyoto,
+Japan, 29 September–2 October 2009; pp. 630–637.
+31.
+Bhattacharya, A.; Jaiswal, R.; Ailon, N. A tight lower bound instance for k-means++ in constant dimension.
+In Theory and Applications of Models of Computation; Lecture Notes in Computer Science; Gopal, T., Agrawal,
+M., Li, A., Cooper, S., Eds.; Springer International Publishing: New York, NY, USA, 2014; Volume 8402, pp.
+7–22.
+32.
+Nielsen, F. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight
+approximation for frequency histograms. IEEE Signal Process. Lett. 2013, 20, 657–660.
+33.
+Ben-Tal, A.; Charnes, A.; Teboulle, M. Entropic means. J. Math. Anal. Appl. 1989, 139, 537–551.
+34.
+Nielsen, F.; Nock, R. The dual Voronoi diagrams with respect to representational Bregman divergences. In
+Proceedings of International Symposium on Voronoi Diagrams (ISVD), Copenhagen, Denmark, 23–26 June
+2009; pp. 71–78.
+35.
+Heinz, E. Beiträge zur Störungstheorie der Spektralzerlegung. Math. Ann. 1951, 123, 415–438. (in German)
+36.
+Besenyei, A. On the invariance equation for Heinz means. Math. Inequal. Appl. 2012, 15, 973–979.
+37.
+Barry, D.A.; Culligan-Hensley, P.J.; Barry, S.J. Real values of the W-function. ACM Trans. Math. Softw. 1995,
+21, 161–171.
+38.
+Veldhuis, R.N.J. The centroid of the symmetrical Kullback–Leibler distance. IEEE Signal Process. Lett. 2002,
+9, 96–99.
+39.
+Nielsen, F.; Garcia, V. Statistical exponential families: A digest with flash cards. 2009, arXiv.org: 0911.4863.
+40.
+Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466.
+41.
+Romberg, S.; Lienhart, R. Bundle min-hashing for logo recognition.
+In Proceedings of the 3rd ACM
+Conference on International Conference on Multimedia Retrieval, Dallas, TX, USA, 16–19 April 2013; ACM:
+New York, NY, USA, 2013; pp. 113–120.
+42.
+Matsuyama, Y. The alpha-EM algorithm: Surrogate likelihood maximization using alpha-logarithmic
+information measures. IEEE Trans. Inf. Theory 2003, 49, 692–706.
+43.
+Amari, S.I. New developments of information geometry (26): Information geometry of convex programming
+and game theory. In Mathematical Sciences (suurikagaku); Number 605; The Science Company: Denver, CO,
+USA, 2013; pp. 65–74. (In Japanese)
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+New Riemannian Priors on the Univariate
+Normal Model
+
+Salem Said *, Lionel Bombrun and Yannick Berthoumieu
+
+Groupe Signal et Image, CNRS Laboratoire IMS, Institut Polytechnique de Bordeaux, Université de Bordeaux,
+UMR 5218, Talence, 33405, France; E-Mails: lionel.bombrun@u-bordeaux.fr (L.B.);
+yannick.berthoumieu@u-bordeaux.fr (Y.B.)
+*
+E-Mail: salem.said@u-bordeaux.fr; Tel.:+33-(0)5-4000-6185.
+
+Received: 17 April 2014; in revised form: 23 June 2014 / Accepted: 9 July 2014 /
+Published: 17 July 2014
+
+Abstract: The current paper introduces new prior distributions on the univariate normal model, with
+the aim of applying them to the classification of univariate normal populations. These new prior
+distributions are entirely based on the Riemannian geometry of the univariate normal model, so that
+they can be thought of as “Riemannian priors”. Precisely, if {pθ; θ ∈ Θ} is any parametrization of the
+univariate normal model, the paper considers prior distributions G( ¯θ, γ) with hyperparameters ¯θ ∈ Θ
+and γ > 0, whose density with respect to Riemannian volume is proportional to exp(−d2(θ, ¯θ)/2γ2),
+where d2(θ, ¯θ) is the square of Rao’s Riemannian distance. The distributions G( ¯θ, γ) are termed
+Gaussian distributions on the univariate normal model. The motivation for considering a distribution
+G( ¯θ, γ) is that this distribution gives a geometric representation of a class or cluster of univariate
+normal populations. Indeed, G( ¯θ, γ) has a unique mode ¯θ (precisely, ¯θ is the unique Riemannian
+center of mass of G( ¯θ, γ), as shown in the paper), and its dispersion away from ¯θ is given by γ.
+Therefore, one thinks of members of the class represented by G( ¯θ, γ) as being centered around ¯θ
+and lying within a typical distance determined by γ. The paper defines rigorously the Gaussian
+distributions G( ¯θ, γ) and describes an algorithm for computing maximum likelihood estimates of
+their hyperparameters. Based on this algorithm and on the Laplace approximation, it describes
+how the distributions G( ¯θ, γ) can be used as prior distributions for Bayesian classification of
+large univariate normal populations.
+In a concrete application to texture image classification,
+it is shown that this leads to an improvement in performance over the use of conjugate priors.
+
+Keywords: Fisher information; Riemannian metric; prior distribution; univariate normal distribution;
+image classification
+
+1. Introduction
+
+In this paper, a new class of prior distributions is introduced on the univariate normal model. The
+new prior distributions, which will be called Gaussian distributions, are based on the Riemannian
+geometry of the univariate normal model. The paper introduces these new distributions, uncovers
+some of their fundamental properties and applies them to the problem of the classification of univariate
+normal populations. It shows that, in the context of a real-life application to texture image classification,
+the use of these new prior distributions leads to improved performance in comparison with the use of
+more standard conjugate priors.
+To motivate the introduction of the new prior distributions, considered in the following, recall
+some general facts on the Riemannian geometry of parametric models.
+In information geometry [1], it is well known that a parametric model {pθ; θ ∈ Θ}, where Θ ⊂ Rp,
+can be equipped with a Riemannian geometry, determined by Fisher’s information matrix, say I(θ).
+
+Entropy 2014, 16, 4015–4031; doi:10.3390/e16074015
+www.mdpi.com/journal/entropy
+
+Indeed, assuming I(θ) is strictly positive definite, for each θ ∈ Θ, a Riemannian metric on Θ is defined
+by:
+
+\begin{equation*}
+ds^2(\theta) = \sum_{i,j=1}^{p} I_{ij}(\theta) \, d\theta_i \, d\theta_j \tag{1}
+\end{equation*}
+
+The fact that the length element Equation (1) is invariant to any change of parametrization was realized
+by Rao [2], who was the first to propose the application of Riemannian geometry in statistics.
+Once the Riemannian metric Equation (1) is introduced, the whole machinery of Riemannian
+geometry becomes available for application to statistical problems relevant to the parametric model
+{pθ; θ ∈ Θ}. This includes the notion of Riemannian distance between two distributions, pθ and pθ′,
+which is known as Rao’s distance, say d(θ, θ′), the notion of Riemannian volume, which is exactly the
+same as Jeffreys prior [3], and the notion of Riemannian gradient, which can be used in numerical
+optimization and coincides with the so-called natural gradient of Amari [4].
+It is quite natural to apply Rao’s distance to the problem of classifying populations that belong to
+the parametric model {pθ; θ ∈ Θ}. In the case where this parametric model is the univariate normal
+model, this approach to classification is implemented in [5]. For more general parametric models,
+beyond the univariate normal model, similar applications of Rao’s distance to problems of image
+segmentation and statistical tests can be found in [6–8].
+The idea of [5] is quite elegant. In general, it requires that some classes {SL; L = 1, . . . , C}, (based
+on a learning sequence) have been identified with “centers” ¯θL ∈ Θ. Then, in order to assign a test
+population, given by the parameter θt, to a class L∗, it is proposed to choose L∗, which minimizes Rao’s
+distance d2(θt, ¯θL), over L = 1, . . . , C. In the specific context of the classification of univariate normal
+populations [5], this leads to the introduction of hyperbolic Voronoi diagrams.
+The present paper is also concerned with the case where the parametric model {pθ; θ ∈ Θ} is a
+univariate normal model. It starts from the idea that a class SL should be identified not only with
+a center ¯θL, as in [5], but also with a kind of “variance”, say γ2, which will be called a dispersion
+parameter. Accordingly, assigning a test population given by the parameter θt to a class L should be
+based on a tradeoff between the square of Rao’s distance d2(θt, ¯θL) and the dispersion parameter γ2.
+Of course, this idea has a strong Bayesian flavor. It proposes to give more “confidence” to classes
+that have a smaller dispersion parameter. Thus, in order to implement it, in a concrete way, the paper
+starts by introducing prior distributions on the univariate normal model, which it calls Gaussian
+distributions. By definition, a Gaussian distribution G( ¯θ, γ2) has a probability density function, with
+respect to Riemannian volume, given by:
+
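+\begin{equation*}
+p(\theta \mid \bar{\theta}, \gamma) \propto \exp\left( -\frac{d^2(\theta, \bar{\theta})}{2\gamma^2} \right) \tag{2}
+\end{equation*}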
+
+Given this definition of a Gaussian distribution (which is developed in a detailed way, in Section 3),
+classification of univariate normal populations can be carried out by associating to each class $S_L$ of
+univariate normal populations a Gaussian distribution $G(\bar{\theta}_L, \gamma_L^2)$ and by assigning any test population
+with parameter $\theta_t$ to the class $L^*$, which maximizes the likelihood $p(\theta_t \mid \bar{\theta}_L, \gamma_L)$, over $L = 1, \ldots, C$.
+The present paper develops in a rigorous way the general approach to the classification of
+univariate normal populations, which has just been described. It proceeds as follows.
+Section 2, which is basically self-contained, provides the concepts, regarding the Riemannian
+geometry of the univariate normal model, which will be used throughout the paper.
+Section 3 introduces Gaussian distributions on the univariate normal model and uncovers some
+of their general properties. In particular, Section 3.2 of this section gives a Riemannian gradient descent
+algorithm for computing maximum likelihood estimates of the parameters ¯θ and γ of a Gaussian
+distribution.
+Section 4 states the general approach to classification of univariate normal populations proposed
+in this paper. It deals with two problems: (i) given a class S of univariate normal populations Si, how
+to fit a Gaussian distribution G(¯z, γ) to this class; and (ii) given a test univariate normal population St
+and a set of classes {SL, L = 1, . . . , C}, how to assign St to a suitable class SL∗.
+In the present paper, the chosen approach for resolving these two problems is marginalized
+likelihood estimation, in the asymptotic framework where each univariate normal population contains
+a large number of data points. In this asymptotic framework, the Laplace approximation plays a
+major role [9]. In particular, it reduces the first problem, of fitting a Gaussian distribution to a class
+of univariate normal populations, to the problem of maximum likelihood estimation, covered in
+Section 3.2.
+The final result of Section 4 is the decision rule Equation (37). This generalizes the one developed
+in [5] and already explained above, by taking into account the dispersion parameter γ, in addition to
+the center ¯θ, for each class.
+In Section 5, the formalism of Section 4 is applied to texture image classification, using the
+VisTeX image database [10]. This database is used to compare the performance obtained using
+Gaussian distributions, as in Section 4, to that obtained using conjugate prior distributions. It is
+shown that Gaussian distributions, proposed in the current paper, lead to a significant improvement
+in performance.
+Before going on, it should be noted that probability density functions of the form (2), on general
+Riemannian manifolds, were considered by Pennec in [11]. However, they were not specifically used
+as prior distributions, but rather as a representation of uncertainty in medical image analysis and
+directional or shape statistics.
+
+2. Riemannian Geometry of the Univariate Normal Model
+
+The current section presents in a self-contained way the results on the Riemannian geometry of
+the univariate normal model, which are required for the remainder of the paper. Section 2.1 recalls
+the fact that the univariate normal model can be reparametrized, so that its Riemannian geometry is
+essentially the same as that of the Poincaré upper half plane. Section 2.2 uses this fact to give analytic
+formulas for distance, geodesics and integration on the univariate normal model. Finally, Section 2.3
+presents, in general form, the Riemannian gradient descent algorithm.
+
+2.1. Derivation of the Fisher Metric
+
+This paper considers the Riemannian geometry of the univariate normal model, as based on the
+Fisher metric (1). To be precise, the univariate normal model has a two-dimensional parameter space
+Θ = {θ = (μ, σ)|μ ∈ R , σ > 0}, and is given by:
+
+\begin{equation*}
+p_\theta(x) = |2\pi\sigma^2|^{-1/2} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \tag{3}
+\end{equation*}
+
+where each pθ is a probability density function with respect to the Lebesgue measure on R. The Fisher
+information matrix, obtained from Equation (3), is the following:
+
+\[
+I(\theta) =
+\begin{pmatrix}
+\frac{1}{\sigma^2} & 0 \\
+0 & \frac{2}{\sigma^2}
+\end{pmatrix}
+\]
+
+As in [12], this expression can be made more symmetric by introducing the parametrization z = (x, y),
+where $x = \mu/\sqrt{2}$ and $y = \sigma$. This yields the Fisher information matrix:
+
+\[
+I(z) = 2 \times
+\begin{pmatrix}
+\frac{1}{y^2} & 0 \\
+0 & \frac{1}{y^2}
+\end{pmatrix}
+\]
+
+It is suitable to drop the factor two in this expression and introduce the following Riemannian metric
+for the univariate normal model,
+
+\begin{equation*}
+ds^2(z) = \frac{dx^2 + dy^2}{y^2} \tag{4}
+\end{equation*}
+
+This is essentially the same as the Fisher metric (up to the factor two) and will be considered throughout
+the following. The resulting Rao’s distance and Riemannian geometry are given in the following
+paragraph.
+
+2.2. Distance, Geodesics and Volume
+
+The Riemannian metric (4), obtained in the last paragraph, happens to be a very well-known
+object in differential geometry. Precisely, the parameter space H = {z = (x, y)|y > 0} equipped with
+the metric (4) is known as the Poincaré upper half plane and is a basic model of a two-dimensional
+hyperbolic space [13].
+Rao’s distance between two points z1 = (x1, y1) and z2 = (x2, y2) in H can be expressed as follows
+(for results in the present paragraph, see [13], or any suitable reference on hyperbolic geometry),
+
+\begin{equation*}
+d(z_1, z_2) = \operatorname{acosh}\left( 1 + \frac{(x_1 - x_2)^2 + (y_1 - y_2)^2}{2 y_1 y_2} \right) \tag{5}
+\end{equation*}
+
+where acosh denotes the inverse hyperbolic cosine.
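
As a quick numerical illustration of Equation (5), the sketch below evaluates Rao's distance in the (x, y) coordinates of Section 2.1. The function names `rao_distance` and `normal_to_half_plane` are hypothetical helpers introduced here for illustration; they are not taken from the paper.

```python
import math


def normal_to_half_plane(mu, sigma):
    """Map normal parameters (mu, sigma) to z = (x, y) = (mu / sqrt(2), sigma)."""
    if sigma <= 0:
        raise ValueError("sigma must be strictly positive")
    return (mu / math.sqrt(2.0), sigma)


def rao_distance(z1, z2):
    """Rao's distance (5) between two points of the upper half plane."""
    x1, y1 = z1
    x2, y2 = z2
    if y1 <= 0 or y2 <= 0:
        raise ValueError("both points must satisfy y > 0")
    arg = 1.0 + ((x1 - x2) ** 2 + (y1 - y2) ** 2) / (2.0 * y1 * y2)
    return math.acosh(arg)
```

Two sanity checks follow directly from the formula: along the y-axis the distance reduces to the absolute difference of log y, and rescaling both points by a common positive factor leaves the distance unchanged.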
+Starting from z1, in any given direction, it is possible to draw a unique geodesic ray γ : R+ → H.
+This is a curve having the property that γ(0) = z1 and, for any t ∈ R+, if γ(t) = z2 then d(z1, z2) = t.
+In other words, the length of γ between z1 and z2 is equal to the distance between z1 and z2.
+The equation of a geodesic ray starting from z ∈ H is conveniently written down in complex
+notation (that is, by treating points of H as complex numbers). To begin, consider the case of z = i
+(which stands for x = 0 and y = 1). The geodesic in the direction making an angle ψ with the y-axis is
+the curve,
+
+\begin{equation*}
+\gamma_i(t) = \frac{e^{t/2} \cos(\psi/2) \, i - e^{-t/2} \sin(\psi/2)}{e^{t/2} \sin(\psi/2) \, i + e^{-t/2} \cos(\psi/2)} \tag{6}
+\end{equation*}
+
+In particular ψ = 0 gives γi(t) = eti and ψ = π gives γi(t) = e−ti. If ψ is not a multiple of π, γi(t)
+traces out a portion of a circle, which is parallel to the y-axis, in the limit t → ∞. For a general starting
+point z, the geodesic ray in the direction making an angle ψ with the y-axis can be written:
+
+γz(t, ψ) = x + yγi(t/y, ψ)
+(7)
+
+where z = (x, y) and γi(t, ψ) is given by Equation (6). A more detailed treatment of Rao’s distance (5)
+and of geodesics in the Poincaré upper half plane, along with applications in image clustering, can be
+found in [5].
+The Riemannian volume (or area, since H is of dimension 2) element corresponding to the
+Riemannian metric (4) is dA(z) = dxdy/y2. Accordingly, the integral of a function f : H → R with
+respect to dA is given by:
+\begin{equation*}
+\int_H f(z) \, dA(z) = \int_0^{+\infty} \int_{-\infty}^{+\infty} \frac{f(x, y)}{y^2} \, dx \, dy \tag{8}
+\end{equation*}
+
+In many cases, the analytic computation of this integral can be greatly simplified by using polar
+coordinates (r, ϕ) defined with respect to some “origin” ¯z ∈ H. Polar coordinates (r, ϕ) map to the
+point z(r, ϕ) given by:
+\begin{equation*}
+z(r, \varphi) = \gamma_{\bar{z}}\left( r, \frac{\pi}{2} - \varphi \right) \tag{9}
+\end{equation*}
+
+where the right-hand side is defined according to Equation (7). The polar coordinates (r, ϕ) do indeed
+define a global coordinate system of H, in the sense that the application that takes a complex number
+reiϕ to the point z(r, ϕ) in H is a diffeomorphism. The standard notation from differential geometry is:
+
+\begin{equation*}
+\exp_{\bar{z}}\left( r e^{i\varphi} \right) = z(r, \varphi) \tag{10}
+\end{equation*}
+
+In these coordinates, the Riemannian metric (4) takes on the form:
+
+\begin{equation*}
+ds^2(z) = dr^2 + \sinh^2 r \, d\varphi^2 \tag{11}
+\end{equation*}
+
+The integral Equation (8) can be computed in polar coordinates using the formula [13],
+
+\begin{equation*}
+\int_H f(z) \, dA(z) = \int_0^{2\pi} \int_0^{+\infty} (f \circ \exp_{\bar{z}})\left( r e^{i\varphi} \right) \sinh(r) \, dr \, d\varphi \tag{12}
+\end{equation*}
+
+where exp¯z was defined in Equation (10) and ◦ denotes composition. This is particularly useful when
+f ◦ exp¯z does not depend on ϕ.
+
+2.3. Riemannian Gradient Descent
+
+In this paper, the problem of minimizing, or maximizing, a differentiable function f : H → R will
+play a central role. A popular way of handling the minimization of a differentiable function defined on
+a Riemannian manifold (such as H) is through Riemannian gradient descent [14].
+Here, the definition of Riemannian gradient is reviewed, and a generic description of Riemannian
+gradient descent is provided. The Riemannian gradient of f is here defined as a mapping ∇ f : H → C
+with the following property:
+
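+\begin{equation*}
+\frac{1}{y^2} \times \mathrm{Re}\left\{ \nabla f(z) \, h^* \right\} = \mathrm{Re}\left\{ df(z) \, h^* \right\} \tag{13}
+\end{equation*}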
+
+for any complex number h, where Re denotes the real part, ∗ denotes conjugation and d f is the
+“derivative”, d f = (∂ f /∂x) + (∂ f /∂y) i. For example, if f (z) = y, it follows from Equation (13) that
+∇ f (z) = y2.
+Riemannian gradient descent consists in following the direction of −∇ f at each step, with the
+length of the step (in other words, the step size) being determined by the user. The generic algorithm
+is, up to some variations, the following:
+
+INPUT      ˆz ∈ H                        % Initial guess
+WHILE      ∥∇f(ˆz)∥ > ε                  % ε ≈ 0 machine precision
+           ˆz ← exp_ˆz(−λ∇f(ˆz))         % λ > 0 step size, depends on ˆz
+END WHILE
+OUTPUT     ˆz                            % near critical point of f
+
+Here, in the condition for the while loop, ∥∇f(zk)∥ is the Riemannian norm of the gradient
+∇f(zk), evaluated at the current estimate zk = (xk, yk). In other words,
+
+\[ \|\nabla f(z_k)\|^2 = \frac{1}{y_k^2} \times \mathrm{Re}\{\nabla f(z_k)\, \nabla f(z_k)^*\} \]
+
+Just like a classical gradient descent algorithm, the above Riemannian gradient descent consists in
+following the direction of the negative gradient −∇ f (ˆz), in order to define a new estimate. This is
+repeated as long as the gradient is sensibly nonzero, in the sense of the loop condition.
+The generic algorithm described above has no guarantee of convergence. Convergence and
+behavior near limit points depend on the function f, on the initialization of the algorithm and on the
+step sizes λ. For these aspects, the reader may consult [14] (Chapter 4).
+
+3. Riemannian Prior on the Univariate Normal Model
+
+The current section introduces new prior distributions on the univariate normal model. These may
+be referred to as “Riemannian priors”, since they are entirely based on the Riemannian geometry of
+this model, and will also be called “Gaussian distributions”, when viewed as probability distributions
+on the Poincaré half plane.
+Here, Section 3.1 defines in a rigorous way Gaussian distributions on H (based on the intuitive
+Formula (2)). A Gaussian distribution G(¯z, γ) has two parameters, ¯z ∈ H, called the center of mass, and
+γ > 0, called the dispersion parameter. Section 3.2 uses the Riemannian gradient descent algorithm of
+Section 2.3 to provide an algorithm for computing maximum likelihood estimates of ¯z and γ. Finally,
+Section 3.3 proves that ¯z is the Riemannian center of mass, or Karcher mean, of the distribution G(¯z, γ)
+(historically, it is more correct to speak of the “Fréchet mean”, since this concept was proposed by
+Fréchet in 1948 [15]), and that γ is uniquely related to the mean squared Rao distance from ¯z.
+The reader may wish to note that the results of Section 3.3 are not used in the following, so that
+section may be skipped on a first reading.
+
+3.1. Gaussian Distributions on H
+
+A Gaussian distribution G(¯z, γ) on H is a probability distribution with the following probability
+density function:
+
+\[ p(z \mid \bar z, \gamma) = \frac{1}{Z(\gamma)} \exp\left(-\frac{d^2(z, \bar z)}{2\gamma^2}\right) \tag{14} \]
+
+Here, ¯z ∈ H is called the center of mass and γ > 0 the dispersion parameter of the distribution G(¯z, γ).
+The squared distance d2(z, ¯z) refers to Rao’s distance (5). The probability density function (14) is
+understood with respect to the Riemannian volume element dA(z). In other words, the normalization
+constant Z(γ) is given by:
+
+
+\[ Z(\gamma) = \int_H f(z)\, dA(z), \qquad f(z) = \exp\left(-\frac{d^2(z, \bar z)}{2\gamma^2}\right) \]
+
+Using polar coordinates, as in Equation (12), it is possible to calculate this integral explicitly. To do so,
+let (r, ϕ) be polar coordinates whose origin is ¯z. Then, d²(z, ¯z) = r² when z = z(r, ϕ), as in Equation (9).
+It follows that:
+
+\[ (f \circ \exp_{\bar z})\left(r e^{i\phi}\right) = \exp\left(-\frac{r^2}{2\gamma^2}\right) \tag{15} \]
+
+According to Equation (12), the integral Z(γ) reduces to:
+
+\[ Z(\gamma) = \int_0^{2\pi}\!\int_0^{+\infty} \exp\left(-\frac{r^2}{2\gamma^2}\right) \sinh(r)\, dr\, d\phi \]
+
+which is readily calculated,
+
+\[ Z(\gamma) = 2\pi \times \sqrt{\frac{\pi}{2}}\; \gamma \times e^{\gamma^2/2} \times \mathrm{erf}\left(\frac{\gamma}{\sqrt{2}}\right) \tag{16} \]
+
+where erf denotes the error function. Formula (16) completes the definition of the Gaussian distribution
+G(¯z, γ). This definition is the same as suggested in [11], with the difference that, in the present work, it
+has been possible to compute exactly the normalization constant Z(γ).
+It is noteworthy that the normalization constant Z(γ) depends only on γ and not on ¯z. This shows
+that the shape of the probability density function (14) does not depend on ¯z, which only plays the role
+of a location parameter. At a deeper mathematical level, this reflects the fact that H is a homogeneous
+Riemannian space [13].
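+
+The closed form in Equation (16) can be checked numerically. The sketch below, written in Python with
+only the standard library, compares it with a midpoint-rule approximation of the radial integral. The
+function names, the truncation radius r_max and the number of steps are illustrative assumptions and
+are not part of the paper.
+
```python
import math


def normalization_constant(gamma):
    """Closed form of Z(gamma), as given in Equation (16)."""
    return (2.0 * math.pi * math.sqrt(math.pi / 2.0) * gamma
            * math.exp(gamma ** 2 / 2.0)
            * math.erf(gamma / math.sqrt(2.0)))


def normalization_constant_numeric(gamma, r_max=None, steps=20000):
    """Midpoint-rule value of 2*pi * int_0^inf exp(-r^2 / (2 gamma^2)) sinh(r) dr.

    The integrand peaks near r = gamma^2 with width about gamma, so the
    integral is truncated at r_max = gamma^2 + 12 * gamma (an assumption
    that is adequate for moderate values of gamma).
    """
    if r_max is None:
        r_max = gamma ** 2 + 12.0 * gamma
    dr = r_max / steps
    total = 0.0
    for k in range(steps):
        r = (k + 0.5) * dr
        total += math.exp(-r * r / (2.0 * gamma ** 2)) * math.sinh(r)
    return 2.0 * math.pi * total * dr


if __name__ == "__main__":
    for g in (0.5, 1.0, 2.0):
        print(g, normalization_constant(g), normalization_constant_numeric(g))
```
+
+Neither function takes ¯z as an argument, in line with the remark that Z(γ) depends only on γ.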
+
+The probability density function (14) bears a clear resemblance to the usual Gaussian (or normal)
+probability density function. Indeed, both are proportional to the exponential of minus the “squared
+distance”, but in one case, the distance is interpreted as Euclidean distance and, in the other (that of
+Equation (14)), as Rao’s distance.
+
+3.2. Maximum Likelihood Estimation of ¯z and γ
+
+Consider the problem of computing maximum likelihood estimates of the parameters ¯z and γ of
+the Gaussian distribution G(¯z, γ), based on independent samples {zi}, i = 1, . . . , N, from this
+distribution. Given the expression (14) of the density p(z|¯z, γ), the log-likelihood function ℓ(¯z, γ)
+can be written,
+
+\[ \ell(\bar z, \gamma) = -N \log\{Z(\gamma)\} - \frac{1}{2\gamma^2} \sum_{i=1}^{N} d^2(z_i, \bar z) \tag{17} \]
+
+Since ¯z only appears in the second term, the maximum likelihood estimate of ¯z, say ˆz, can be computed
+first. It is given by the minimization problem:
+
+\[ \hat z = \operatorname{argmin}_{z \in H}\; \frac{1}{2} \sum_{i=1}^{N} d^2(z_i, z) \tag{18} \]
+
+In other words, the maximum likelihood estimate ˆz minimizes the sum of squared Rao distances to the
+samples zi. This exhibits ˆz as the Riemannian center of mass, also called the Karcher or the Fréchet
+mean [16], of the samples zi.
+The notion of Riemannian center of mass is currently a widely popular one in signal and image
+processing, with applications ranging from blind source separation and radar signal processing [17,18]
+to shape and motion analysis [19,20]. The definition of Gaussian distributions, proposed in the
+present paper, shows how the notion of Riemannian center of mass is related to maximum likelihood
+estimation, thereby giving it a statistical foundation.
+An original result, due to Cartan and cited in [16], states that ˆz, as defined in
+Equation (18), exists and is unique, since H, with the Riemannian metric (4), has constant negative
+curvature. Here, ˆz is computed using Riemannian gradient descent, as described in Section 2.3. The
+cost function f to be minimized is given by (the factor N^{−1} is conventional),
+
+\[ f(z) = \frac{1}{2N} \sum_{i=1}^{N} d^2(z_i, z) \tag{19} \]
+
+Its Riemannian gradient ∇f(z) is easily found by noting the following fact. Let fi(z) = (1/2)d²(z, zi).
+Then, the Riemannian gradient of this function is (see [21] (page 407)),
+
+\[ \nabla f_i(z) = -\log_z(z_i) \tag{20} \]
+
+where log_z : H → C is the inverse of exp_z : C → H. The minus sign reflects the fact that fi decreases
+when z moves towards zi, that is, in the direction log_z(zi). It follows from Equation (20) that,
+
+\[ \nabla f(z) = -\frac{1}{N} \sum_{i=1}^{N} \log_z(z_i) \tag{21} \]
+
+The analytic expression of logz, for any z ∈ H, will be given below (see Equation (23)).
+Here, the gradient descent algorithm for computing ˆz is described. This algorithm uses a constant
+step size λ, which is fixed manually.
+
+Once the maximum likelihood estimate ˆz has been computed, using the gradient descent
+algorithm, the maximum likelihood estimate of γ, say ˆγ, is found by solving the equation:
+
+\[ F(\gamma) = \frac{1}{N} \sum_{i=1}^{N} d^2(z_i, \hat z) \quad \text{where} \quad F(\gamma) = \gamma^3 \times \frac{d}{d\gamma} \log\{Z(\gamma)\} \tag{22} \]
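+
+Equation (22) is a scalar equation in γ. As a hedged sketch, it can be solved by bisection: inserting
+the closed form (16) of Z(γ) into the definition of F gives
+F(γ) = γ² + γ⁴ + γ³ √(2/π) e^{−γ²/2} / erf(γ/√2), and F is increasing in γ because it equals the
+expected squared distance of an exponential family whose natural parameter −1/(2γ²) increases with γ.
+The function names, the bracketing strategy and the tolerance below are assumptions made for this
+example; the paper does not specify how Equation (22) is solved.
+
```python
import math


def F(gamma):
    """F(gamma) = gamma^3 * d/dgamma log Z(gamma), with Z(gamma) from Equation (16)."""
    ratio = (math.sqrt(2.0 / math.pi) * math.exp(-gamma ** 2 / 2.0)
             / math.erf(gamma / math.sqrt(2.0)))
    return gamma ** 2 + gamma ** 4 + gamma ** 3 * ratio


def solve_dispersion(mean_sq_dist, tol=1e-12):
    """Solve F(gamma) = mean_sq_dist for gamma by bisection.

    mean_sq_dist is the empirical mean of the squared Rao distances
    d^2(z_i, z_hat), that is, the left-hand side data of Equation (22).
    """
    if mean_sq_dist <= 0.0:
        raise ValueError("mean squared distance must be positive")
    lo, hi = 1e-8, 1.0
    while F(hi) < mean_sq_dist:      # expand the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if F(mid) < mean_sq_dist:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(solve_dispersion(F(0.8)))  # recovers a value close to 0.8
```
+
+For small γ, F(γ) is close to 2γ², the value expected for an isotropic Gaussian in the Euclidean plane.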
+
+The gradient descent algorithm for computing ˆz is the following,
+
+INPUT      {z1, . . . , zN}              % N independent samples from G(¯z, γ)
+           ˆz ∈ H                        % Initial guess
+WHILE      ∥∇f(ˆz)∥ > ε                  % ε ≈ 0 machine precision
+           ˆz ← exp_ˆz(−λ∇f(ˆz))         % ∇f(ˆz) given by Equation (21)
+                                         % step size λ is constant
+END WHILE
+OUTPUT     ˆz                            % near Riemannian center of mass
+
+Application of Formula (21) requires computation of logˆz(zi) for i = 1, . . . , N. Fortunately, this
+can be done analytically as follows. In general, for ˆz = ( ¯x, ¯y),
+
+\[ \log_{\hat z}(z) = \bar y\, \log_i\left(\frac{z - \bar x}{\bar y}\right) \tag{23} \]
+
+where logi is found by inverting Equation (6). Precisely,
+
+\[ \log_i(z) = r e^{i\phi} \tag{24} \]
+
+where, for z = (x, y) with x ≠ 0,
+
+\[ r = \operatorname{acosh}\left(1 + \frac{x^2 + (y - 1)^2}{2y}\right) \]
+
+and:
+
+\[ \cos(\phi) = \frac{x}{y \sinh(r)}, \qquad \sin(\phi) = \frac{\cosh(r) - y^{-1}}{\sinh(r)} \]
+
+and, for z = (0, y),
+
+\[ \log_i(z) = \ln(y)\, i \]
+
+with ln denoting the natural logarithm.
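+
+The algorithm above, together with Equations (23) and (24), can be sketched in Python as follows. This
+is a minimal illustration under stated assumptions, not the authors' implementation. The exponential
+map at i is obtained here by inverting the formulas for r and ϕ given above (y = 1/(cosh r − sin ϕ sinh r),
+x = y cos ϕ sinh r), since Equation (6) is not reproduced in this section. The sketch uses the standard
+fact that the gradient of (1/2)d²(z, zi) with respect to z is −log_z(zi). The function names, the default
+step size, the tolerance and the choice of the first sample as initial guess are all hypothetical.
+
```python
import math


def log_i(z):
    """Log map at the point i, Equation (24): returns the tangent vector r * e^{i phi}."""
    x, y = z.real, z.imag
    if x == 0.0:
        return complex(0.0, math.log(y))
    r = math.acosh(1.0 + (x * x + (y - 1.0) ** 2) / (2.0 * y))
    if r == 0.0:  # z is numerically indistinguishable from i
        return 0j
    cos_phi = x / (y * math.sinh(r))
    sin_phi = (math.cosh(r) - 1.0 / y) / math.sinh(r)
    return r * complex(cos_phi, sin_phi)


def log_map(base, z):
    """Log map at base = (x_bar, y_bar), Equation (23)."""
    return base.imag * log_i((z - base.real) / base.imag)


def exp_i(v):
    """Exp map at i, obtained by inverting the formulas for r and phi (assumption)."""
    r = abs(v)
    if r == 0.0:
        return 1j
    cos_phi, sin_phi = v.real / r, v.imag / r
    y = 1.0 / (math.cosh(r) - sin_phi * math.sinh(r))
    return complex(y * cos_phi * math.sinh(r), y)


def exp_map(base, v):
    """Exp map at base, the inverse of log_map(base, .)."""
    return base.real + base.imag * exp_i(v / base.imag)


def karcher_mean(samples, step=0.5, tol=1e-10, max_iter=1000):
    """Riemannian gradient descent with constant step size for the cost (19)."""
    z_hat = samples[0]  # initial guess (illustrative choice)
    for _ in range(max_iter):
        # Gradient of f(z) = (1/2N) sum_i d^2(z_i, z) is -(1/N) sum_i log_z(z_i).
        grad = -sum(log_map(z_hat, z) for z in samples) / len(samples)
        if abs(grad) / z_hat.imag <= tol:  # Riemannian norm of the gradient
            break
        z_hat = exp_map(z_hat, -step * grad)
    return z_hat


if __name__ == "__main__":
    print(karcher_mean([-1 + 1j, 1 + 1j]))
```
+
+For the two samples (−1, 1) and (1, 1), the geodesic joining them is the half circle of radius √2
+centered at the origin, so the sketch should return a point close to its midpoint (0, √2).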
+
+3.3. Significance of ¯z and γ
+
+The parameters ¯z and γ of a Gaussian distribution G(¯z, γ) have been called the center of mass
+and the dispersion parameter. In the present paragraph, it is proven that,
+
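+\[ \bar z = \operatorname{argmin}_{z \in H}\; \frac{1}{2} \int_H d^2(z', z)\, p(z' \mid \bar z, \gamma)\, dA(z') \tag{25} \]
+
+and also that:
+
+\[ F(\gamma) = \int_H d^2(z', \bar z)\, p(z' \mid \bar z, \gamma)\, dA(z') \tag{26} \]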
+
+where F(γ) was defined in Equation (22) and p(z′|¯z, γ) is the probability density function of G(¯z, γ),
+given in Equation (14).
+
+Note that Equations (25) and (26) are asymptotic versions of Equations (18) and (22). Indeed,
+Equations (25) and (26) can be written:
+
+\[ \bar z = \operatorname{argmin}_{z \in H}\; \frac{1}{2}\, E_{\bar z, \gamma}\, d^2(z', z), \qquad F(\gamma) = E_{\bar z, \gamma}\, d^2(z, \bar z) \tag{27} \]
+
+where E¯z,γ denotes the expectation with respect to G(¯z, γ), and the expectation is carried out on the
+variable z′ in the first formula. Now, these two formulae are the same as Equations (18) and (22), but
+with expectation instead of empirical mean.
+Note, moreover, that Equations (25) and (26) can be interpreted as follows. If z′ is distributed
+according to the Gaussian distribution G(¯z, γ), then Equation (25) states that ¯z is the unique point, out
+of all z ∈ H, which minimizes the expectation of squared Rao’s distance to z′. Moreover, Equation (26)
+states that the expectation of squared Rao’s distance between ¯z and z′ is equal to F(γ), so F(γ) is the
+least possible expected squared Rao’s distance between a point z ∈ H and z′. This interpretation
+justifies calling ¯z the center of mass of G(¯z, γ) and shows that γ is uniquely related to the expected
+dispersion, as measured by squared Rao’s distance, away from ¯z.
+In order to prove Equation (25), consider the log-likelihood function,
+
+\[ \ell(\bar z, \gamma; z) = -\log\{Z(\gamma)\} - \frac{1}{2\gamma^2}\, d^2(z, \bar z) \tag{28} \]
+
+Let fz(¯z) = (1/2)d²(z, ¯z). The score function with respect to ¯z is, by definition, the Riemannian
+gradient of ℓ with respect to ¯z. From Equation (28),
+
+\[ \nabla_{\bar z}\, \ell(\bar z, \gamma; z) = -\frac{1}{\gamma^2}\, \nabla f_z(\bar z) \tag{29} \]
+
+where ∇¯z indicates that the Riemannian gradient (defined in Equation (13) of Section 2.3) is taken with
+respect to the variable ¯z. Under certain regularity conditions, which are here easily verified, the
+expectation of the score function is identically zero, so that
+
+\[ E_{\bar z, \gamma}\, \nabla f_z(\bar z) = 0 \tag{30} \]
+
+Let f(z) be defined by:
+
+\[ f(z) = E_{\bar z, \gamma}\, f_{z'}(z) = \frac{1}{2}\, E_{\bar z, \gamma}\, d^2(z', z) \]
+
+with the expectation carried out on the variable z′. Clearly, f(z) is the expression to be minimized
+in Equation (25) (or in the first formula in Equation (27), which is just the same). By interchanging
+Riemannian gradient and expectation,
+
+\[ \nabla f(\bar z) = E_{\bar z, \gamma}\, \nabla f_z(\bar z) = 0 \]
+
+where the last equality follows from Equation (30).
+It has just been proved that ¯z is a stationary point of f (a point where the gradient is zero).
+Theorem 2.1 in [16] states that the function f has one and only one stationary point, which is moreover a
+global minimizer. This concludes the proof of Equation (25).
+The proof of Equation (26) follows exactly the same method, defining the score function with
+respect to γ and noting that its expectation is identically zero.
+
+4. Classification of Univariate Normal Populations
+
+The previous section studied Gaussian distributions on H, “as they stand”, focusing on the
+fundamental issue of maximum likelihood estimation of their parameters. The present Section
+considers the use of Gaussian distributions as prior distributions on the univariate normal model.
+The main motivation behind the introduction of Gaussian distributions is that a Gaussian
+distribution G(¯z, γ) can be used to give a geometric representation of a cluster or class of univariate
+normal populations. Recall that each point (x, y) ∈ H is identified with a univariate normal population
+with mean μ = √2 x and standard deviation σ = y. The idea is that populations belonging to the same
+cluster, represented by G(¯z, γ), should be viewed as centered on ¯z and lying within a typical distance
+determined by γ.
+In the remainder of this Section, it is shown how the maximum likelihood estimation algorithm of
+Section 3.2 can be used to fit the hyperparameters ¯z and γ to data, consisting of a class S = {Si; i =
+1, . . . , K} of univariate normal populations. This is then applied to the problem of the classification
+of univariate normal populations. The whole development is based on marginalized likelihood
+estimation, as follows.
+Assume each population Si contains Ni points, Si = {sj; j = 1, . . . , Ni}, and the points sj, in any
+class, are drawn from a univariate normal distribution with mean μ and standard deviation σ. The
+focus will be on the asymptotic case where the number Ni of points in each population Si is large.
+In order to fit the hyperparameters ¯z and γ to the data S, assume moreover that the distribution
+of z = (x, y), where (x, y) = (μ/√2, σ), is a Gaussian distribution G(¯z, γ). Then, the distribution of S
+can be written in integral form:
+
+\[ p(S \mid \bar z, \gamma) = \prod_{i=1}^{K} \int_H p(S_i \mid z)\, p(z \mid \bar z, \gamma)\, dA(z) \tag{31} \]
+
+where p(z|¯z, γ) is the probability density of a Gaussian distribution G(¯z, γ), defined in Equation (14).
+Moreover, expressing p(Si|z) as a product of univariate normal distributions p(sj|z), it follows,
+
+\[ p(S \mid \bar z, \gamma) = \prod_{i=1}^{K} \int_H \prod_{j=1}^{N_i} p(s_j \mid z)\, p(z \mid \bar z, \gamma)\, dA(z) \tag{32} \]
+
+This expression, given the data S, is to be maximized over (¯z, γ). Using the Laplace approximation,
+this task is reduced to the maximum likelihood estimation problem, addressed in Section 3.2.
+The Laplace approximation will here be applied in its “basic form” [9]. That is, up to terms of
+order N_i^{−1}. To do so, write each of the integrals in Equation (32), using Equation (8) of Section 2.2.
+These integrals then take on the form:
+
+\[ \int_0^{+\infty}\!\int_{-\infty}^{+\infty} \prod_{j=1}^{N_i} |2\pi y^2|^{-1/2} \exp\left(-\frac{\left(s_j - \sqrt{2}\, x\right)^2}{2y^2}\right) \times p(z \mid \bar z, \gamma) \times \frac{1}{y^2}\, dx\, dy \tag{33} \]
+
+where the univariate normal distribution p(sj|z) has been replaced by its full expression. Now, the
+product over j in this expression can be written exp [−Ni h(x, y)], where:
+
+\[ h(x, y) = -\frac{1}{2} \ln\left(2\pi y^2\right) - \frac{B_i^2 + V_i^2}{2y^2} \]
+
+Here, B_i² and V_i² are the squared empirical bias and the empirical variance, within population Si,
+
+\[ B_i = \hat S_i - \sqrt{2}\, x, \qquad V_i^2 = N_i^{-1} \sum_{j=1}^{N_i} \left(\hat S_i - s_j\right)^2 \]
+
+where ˆSi is the empirical mean of the population Si, ˆSi = N_i^{−1} ∑_{j=1}^{N_i} sj.
+The expression h(x, y) is maximized when x = ˆxi and y = ˆyi, where ˆzi = ( ˆxi, ˆyi) is the couple of
+maximum likelihood estimates of the parameters (x, y), based on the population Si.
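+
+With the identification μ = √2 x and σ = y, the maximum likelihood estimates are ˆxi = ˆSi/√2 and
+ˆyi = Vi. A minimal Python sketch of this step is given below; the function name and the use of a
+complex number to store the point (x, y) are assumptions made for illustration.
+
```python
import math


def population_to_point(samples):
    """Map a univariate normal population to its ML point (x_hat, y_hat) in H.

    x_hat is the empirical mean divided by sqrt(2), and y_hat is the
    maximum likelihood (biased, 1/N) standard deviation.
    """
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    return complex(mean / math.sqrt(2.0), math.sqrt(variance))


if __name__ == "__main__":
    print(population_to_point([1.0, 3.0]))  # mean 2 and variance 1
```
+
+The resulting points ˆzi are the inputs on which the hyperparameters ¯z and γ are then fitted.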
+
+According to the Laplace approximation, the integral in Equation (33) is equal to:
+
+\[ 2\pi\, \left|\partial^2 h(\hat x_i, \hat y_i)\right|^{-1/2} \times \exp\left[-N_i h(\hat x_i, \hat y_i)\right] \times p(\hat z_i \mid \bar z, \gamma) \times \frac{1}{\hat y_i^2} + O(N_i^{-1}) \]
+
+where ∂²h(ˆxi, ˆyi) is the matrix of second derivatives of h, and | · | denotes the determinant. Now, since
+h is essentially the logarithm of p(sj|z), a direct calculation shows that ∂²h(ˆxi, ˆyi) is the same as the
+Fisher information matrix derived in Section 2.1 (where it was denoted I(z)). Thus, the first factor in
+the above expression is 2π ˆy_i², and cancels out with the last factor.
+Finally, the Laplace approximation of the integral in Equation (33) reads:
+
+\[ 2\pi \times \exp\left[-N_i h(\hat x_i, \hat y_i)\right] \times p(\hat z_i \mid \bar z, \gamma) + O(N_i^{-1}) \]
+
+and the resulting approximation of the distribution of S, as given by Equation (32), can be written:
+
+\[ p(S \mid \bar z, \gamma) \approx \prod_{i=1}^{K} \alpha \times p(\hat z_i \mid \bar z, \gamma) \tag{34} \]
+
+where α is a constant, which does not depend either on the data or on the parameters, and p(ˆzi|¯z, γ)
+has the expression (14).
+Accepting this expression to give the distribution of the data S, conditionally on the
+hyperparameters (¯z, γ), the task of estimating these hyperparameters becomes the same as the
+maximum likelihood estimation problem, described in Section 3.2.
+In conclusion, if one assumes the populations Si belong to a single cluster or class S and wishes
+to fit the hyperparameters ¯z and γ of a Gaussian distribution representing this cluster, it is enough to
+start by computing the maximum likelihood estimates ˆxi and ˆyi for each population Si and then to
+consider these as input to the maximum likelihood estimation algorithm described in Section 3.2.
+The same reasoning just carried out, using the Laplace approximation, can be generalized to
+the problem of classification of univariate normal populations. Indeed, assume that classes {SL, L =
+1, . . . , C}, each containing some number KL of univariate normal populations, have been identified
+based on some training sequence. Using the Laplace approximation and the maximum likelihood
+estimation approach of Section 3.2, it is possible to fit, to each one of these classes, hyperparameters
+(¯zL, γL) of a Gaussian distribution G(¯zL, γL) on H.
+For a test population St, the maximum likelihood rule, for deciding which of the classes SL this
+test population St belongs to, requires finding the following maximum:
+
+\[ L^* = \operatorname{argmax}_L\; p(S_t \mid \bar z_L, \gamma_L) \tag{35} \]
+
+and assigning the test population St to the class with label L∗. If the number of points Nt in the
+population St is large, the Laplace approximation, in the same way used above, approximates the
+maximum in Equation (35) by:
+
+\[ L^* = \operatorname{argmax}_L\; p(\hat z_t \mid \bar z_L, \gamma_L) \tag{36} \]
+
+where ˆzt = ( ˆxt, ˆyt) is the couple of maximum likelihood estimates computed based on the test
+population St and where p(ˆzt|¯zL, γL) is given by Equation (14). Now, writing out Equation (14), the
+decision rule becomes:
+
+\[ L^* = \operatorname{argmax}_L \left\{ -\log\{Z(\gamma_L)\} - \frac{1}{2\gamma_L^2}\, d^2(\hat z_t, \bar z_L) \right\} \tag{37} \]
+
+Under the homoscedasticity assumption, that all of the γL are equal, this decision rule essentially
+becomes the same as the one proposed in [5], which requires St to be assigned to the “nearest” cluster,
+in terms of Rao’s distance. Indeed, if all the γL are equal, then Equation (37) is the same as,
+
+\[ L^* = \operatorname{argmin}_L\; d^2(\hat z_t, \bar z_L) \tag{38} \]
+
+This decision rule is expected to be less efficient than the one proposed in Equation (37), which also takes
+into account the uncertainty associated with each cluster, as measured by its dispersion parameter γL.
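+
+The two decision rules can be compared with a short Python sketch. It is only an illustration under
+stated assumptions: rao_distance uses the standard distance of the Poincaré half plane, which is assumed
+to agree with Equation (5) (not reproduced in this section) and which reduces to the formula for r given
+with Equation (24) when one point is i; log Z(γ) is taken from Equation (16). The function names and the
+toy class parameters are hypothetical.
+
```python
import math


def rao_distance(z1, z2):
    """Distance on the Poincaré half plane (assumed form of Rao's distance)."""
    num = (z1.real - z2.real) ** 2 + (z1.imag - z2.imag) ** 2
    return math.acosh(1.0 + num / (2.0 * z1.imag * z2.imag))


def log_Z(gamma):
    """Logarithm of the normalization constant Z(gamma) of Equation (16)."""
    return (math.log(2.0 * math.pi) + 0.5 * math.log(math.pi / 2.0)
            + math.log(gamma) + gamma ** 2 / 2.0
            + math.log(math.erf(gamma / math.sqrt(2.0))))


def classify(z_test, classes):
    """Decision rule (37). classes is a list of (z_bar, gamma) pairs."""
    def score(params):
        z_bar, gamma = params
        return -log_Z(gamma) - rao_distance(z_test, z_bar) ** 2 / (2.0 * gamma ** 2)
    return max(range(len(classes)), key=lambda L: score(classes[L]))


def classify_nearest(z_test, classes):
    """Decision rule (38): nearest center in terms of Rao's distance."""
    return min(range(len(classes)), key=lambda L: rao_distance(z_test, classes[L][0]))


if __name__ == "__main__":
    # Toy example: a tight class centered at (0, 1) and a broad class centered at (1, 1).
    mixed = [(0 + 1j, 0.1), (1 + 1j, 2.0)]
    z_t = 0.45 + 1j
    print(classify_nearest(z_t, mixed), classify(z_t, mixed))
```
+
+In this toy example, the test point is slightly nearer to the center of the tight class, so rule (38)
+selects it, whereas rule (37) selects the broad class because the test point lies several dispersion
+units away from the tight one.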
+
+5. Application to Image Classification
+
+In this section, the framework proposed in Section 4, for classification of univariate normal
+populations, is applied to texture image classification using Gabor filters. Several authors have found
+that Gabor energy features are well-suited texture descriptors. In the following, consider 24 Gabor
+energy sub-bands that are the result of three scales and eight orientations. Hence, each texture image
+can be decomposed as the collection of those 24 sub-bands. For more information concerning the
+implementation, the interested reader is referred to [22].
+Starting from the VisTex database of 40 images [10] (these are displayed in Figure 1), each image
+was divided into 16 non-overlapping subimages of 128 × 128 pixels each. A training sequence was
+formed by choosing randomly eight subimages out of each image. To each subimage in the training
+sequence, a bank of 24 Gabor filters was applied. The result of applying a Gabor filter with scale s
+and orientation o to a subimage i belonging to an image L is a univariate normal population Si,s,o of
+128 × 128 points (one point for each pixel, after the filter is applied).
+
+Figure 1. Forty images of the VisTex database.
+
+These populations Si,s,o (called sub-bands) are considered independent, each one of them
+univariate normal with mean μi,s,o = √2 xi,s,o, standard deviation σi,s,o = yi,s,o and with zi,s,o =
+(xi,s,o, yi,s,o). The couple of maximum likelihood estimates for these parameters is denoted ˆzi,s,o =
+( ˆxi,s,o, ˆyi,s,o). An image L (recall, there are 40 images) contains, in each sub-band, eight populations
+Si,s,o, with which hyperparameters ¯zL,s,o and γL,s,o are associated, by applying the maximum likelihood
+estimation algorithm of Section 3.2 to the inputs ˆzi,s,o.
+If St is a test subimage, then one should begin by applying the 24 Gabor filters to it, obtaining
+independent univariate normal populations St,s,o, and then compute for each population the couple
+of maximum likelihood estimates ˆzt,s,o = ( ˆxt,s,o, ˆyt,s,o). The decision rule Equation (37) of Section 4
+requires that St should be assigned to the image L∗, which realizes the maximum:
+
+\[ L^* = \operatorname{argmax}_L \sum_{s,o} \left[ -\log\{Z(\gamma_{L,s,o})\} - \frac{1}{2\gamma_{L,s,o}^2}\, d^2(\hat z_{t,s,o}, \bar z_{L,s,o}) \right] \tag{39} \]
+
+When considering the homoscedasticity assumption, i.e., γL,s,o = γs,o for all L, this decision rule
+becomes:
+
+\[ L^* = \operatorname{argmin}_L \sum_{s,o} d^2(\hat z_{t,s,o}, \bar z_{L,s,o}) \tag{40} \]
+
+For this concrete application, to the VisTex database, it is pertinent to compare the rate of successful
+classification (or overall accuracy) obtained using the Riemannian prior, based on the framework
+of Section 4, to that obtained using a more classical conjugate prior, i.e., a normal-inverse gamma
+distribution of the mean μ = √2 x and the standard deviation σ = y. This conjugate prior is given by:
+
+\[ p(\mu \mid \sigma, \mu_p, \kappa_p) = \frac{\sqrt{\kappa_p}}{\sigma \sqrt{2\pi}} \exp\left(-\frac{\kappa_p}{2\sigma^2} (\mu - \mu_p)^2\right) \]
+
+with an inverse gamma prior, on σ²,
+
+\[ p(\sigma^2 \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \left(\sigma^2\right)^{-(\alpha+1)} \exp\left(-\frac{\beta}{\sigma^2}\right) \tag{41} \]
+
+Using this conjugate prior, instead of a Riemannian prior, and following the procedure of applying the
+Laplace approximation, a different decision rule is obtained, where L∗ is taken to be the maximum of
+the following expression:
+
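+\[ \sum_{s,o} \left[ \frac{\ln \kappa_{p,L,s,o}}{2} - \frac{\kappa_{p,L,s,o}}{2 \hat y_{t,s,o}^2} \left(\sqrt{2}\, \hat x_{t,s,o} - \mu_{p,L,s,o}\right)^2 + \alpha_{L,s,o} \ln \beta_{L,s,o} - \ln \Gamma(\alpha_{L,s,o}) - 2(\alpha_{L,s,o} + 1) \ln \hat y_{t,s,o} - \frac{\beta_{L,s,o}}{\hat y_{t,s,o}^2} \right] \tag{42} \]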
+
+where, as in Equation (39), ˆxt,s,o and ˆyt,s,o are the maximum likelihood estimates computed for the
+population St,s,o.
+Both the Riemannian and conjugate priors have been applied to the VisTex database, with half of
+the database used for training and half for testing. In the course of 100 Monte Carlo runs, a significant
+gain of about 3% is observed with the Riemannian prior compared to the conjugate prior. This is
+summarized in the following table.
+
+Prior Model                                                      Overall Accuracy
+Riemannian prior, Equation (39)                                  71.88% ± 2.16%
+Riemannian prior, homoscedasticity assumption, Equation (40)     69.06% ± 1.96%
+Conjugate prior, Equation (42)                                   68.73% ± 2.92%
+
+Recall that the overall accuracy is the ratio of the number of successfully classified subimages
+to the total number of subimages. The table shows that the use of a Riemannian prior, even under a
+homoscedasticity assumption, yields significant improvement upon the use of a conjugate prior.
+
+6. Conclusions
+
+Motivated by the problem of the classification of univariate normal populations, this paper
+introduced a new class of prior distributions on the univariate normal model. With the univariate
+normal model viewed as the Poincaré half plane H, these new prior distributions, called Gaussian
+distributions, were meant to reflect the geometric picture (in terms of Rao’s distance) that a cluster or
+class of univariate normal populations can be represented as having a center ¯z ∈ H and a “variance”
+or dispersion γ². Precisely, a Gaussian distribution G(¯z, γ) has a probability density function p(z),
+with respect to Riemannian volume of the Poincaré half plane, which is proportional to
+exp(−d²(z, ¯z)/(2γ²)).
+
+Using Gaussian distributions as prior distributions in the problem of the classification of univariate
+normal populations was shown to lead to a new, more general and efficient decision rule. This
+decision rule was implemented in a real-world application to texture image classification, where it
+led to significant improvement in performance, in comparison to decision rules obtained by using
+conjugate priors.
+The general approach proposed in this paper contains several simplifications and approximations,
+which could be improved upon in future work. First, it is possible to use different prior distributions,
+which are more geometrically rich than Gaussian distributions, to represent classes of univariate normal
+populations. For example, it may be helpful to replace Gaussian distributions that are “isotropic”, in
+the sense of having a scalar dispersion parameter γ, by non-isotropic distributions, with a dispersion
+matrix Γ (a 2 × 2 symmetric positive definite matrix). Another possibility would be to represent
+each class of univariate normal populations by a finite mixture of Gaussian distributions, instead of
+representing it by a single Gaussian distribution.
+These variants, which would allow classes with a more complex geometric structure to be taken
+into account, can be integrated in the general framework proposed in the paper, based on: (i) fitting
+each class to a prior distribution (Gaussian non-isotropic, mixture of Gaussians); and (ii) choosing, for
+a test population, the most adequate class, based on a decision rule. These two steps can be realized as
+above, through the Laplace approximation and maximum likelihood estimation, or through alternative
+techniques, based on Markov chain Monte Carlo stochastic optimization.
+In addition to generalizing the approach of this paper and improving its performance, a further
+important objective for future work will be to extend it to other parametric models, beyond univariate
+normal models. Indeed, there is an increasing number of parametric models (generalized Gaussian,
+elliptical models, etc.), whose Riemannian geometry is becoming well understood and where the
+present approach may be helpful.
+
+Author Contributions
+
+Salem Said carried out the mathematical development and specified the algorithms appearing in
+Sections 2, 3 and 4. Lionel Bombrun carried out all numerical simulations and proposed the theoretical
+development of Section 4. Yannick Berthoumieu devised the main idea of the paper, that is, the use of
+Riemannian priors as a geometric representation of a class or cluster of univariate normal populations.
+All authors have read and approved the final manuscript.
+
+Conflicts of Interest
+
+The authors declare no conflict of interest.
+
+References
+
+1. Amari, S.I.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000.
+2. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
+3. Kass, R.E. The geometry of asymptotic inference. Stat. Sci. 1989, 4, 188–234.
+4. Amari, S.I. Natural gradient works efficiently in learning. Neur. Comput. 1998, 10, 251–276.
+5. Nielsen, F.; Nock, R. Hyperbolic Voronoi diagrams made easy. 2009, arXiv:0903.3287.
+6. Lenglet, C.; Rousson, M.; Deriche, R.; Faugeras, O. Statistics on the manifold of multivariate normal distributions: Theory and application to diffusion tensor MRI processing. J. Math. Imaging Vis. 2006, 25, 423–444.
+7. Verdoolaege, G.; Scheunders, P. On the geometry of multivariate generalized Gaussian models. J. Math. Imaging Vis. 2012, 43, 180–193.
+
+8. Berkane, M.; Oden, K. Geodesic estimation in elliptical distributions. J. Multival. Anal. 1997, 63, 35–46.
+9. Erdélyi, A. Asymptotic Expansions; Dover Books: Mineola, NY, USA, 2010.
+10. MIT Vision and Modeling Group. Vision Texture. Available online: http://vismod.media.mit.edu/pub/VisTex (accessed on 10 June 2014).
+11. Pennec, X. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. J. Math. Imaging Vis. 2006, 25, 127–154.
+12. Atkinson, C.; Mitchell, A.F.S. Rao’s distance measure. Sankhya Ser. A 1981, 43, 345–365.
+13. Gallot, S.; Hulin, D.; Lafontaine, J. Riemannian Geometry; Springer-Verlag: Berlin, Germany, 2004.
+14. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2006.
+15. Fréchet, M. Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l’I.H.P. 1948, 10, 215–310. (In French)
+16. Afsari, B. Riemannian Lp center of mass: Existence, uniqueness and convexity. Proc. Am. Math. Soc. 2011, 139, 655–673.
+17. Manton, J.H. A centroid (Karcher mean) approach to the joint approximate diagonalisation problem: The real symmetric case. Digit. Sign. Process. 2006, 16, 468–478.
+18. Arnaudon, M.; Barbaresco, F. Riemannian medians and means with applications to RADAR signal processing. IEEE J. Sel. Top. Sign. Process. 2013, 7, 595–604.
+19. Le, H. On the consistency of procrustean mean shapes. Adv. Appl. Prob. 1998, 30, 53–63.
+20. Turaga, P.; Veeraraghavan, A.; Chellappa, R. Statistical Analysis on Stiefel and Grassmann Manifolds with Applications in Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; doi: 10.1109/CVPR.2008.4587733.
+21. Chavel, I. Riemannian Geometry: A Modern Introduction; Cambridge University Press: Cambridge, UK, 2008.
+22. Grigorescu, S.E.; Petkov, N.; Kruizinga, P. Comparison of texture features based on Gabor filter. IEEE Trans. Image Process. 2002, 11, 1160–1167.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+Combinatorial Optimization with Information
+Geometry: The Newton Method
+
+Luigi Malagò 1 and Giovanni Pistone 2,*
+
+1 Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico, 39/41, 20135 Milano, Italy;
+E-Mail: malago@di.unimi.it
+2 de Castro Statistics, Collegio Carlo Alberto, Via Real Collegio 30, 10024 Moncalieri, Italy
+* E-Mail: giovanni.pistone@carloalberto.org; Tel.: +39-011-670-5033; Fax: +39-011-670-5082.
+
+Received: 31 March 2014; in revised form: 10 July 2014 / Accepted: 11 July 2014 /
+Published: 28 July 2014
+
+Abstract: We discuss the use of the Newton method in the computation of max(p ↦ Ep[f]), where
+p belongs to a statistical exponential family on a finite state space. In a number of papers, the authors
+have applied first order search methods based on information geometry. Second order methods
+have been widely used in optimization on manifolds, e.g., matrix manifolds, but appear to be new
+in statistical manifolds. These methods require the computation of the Riemannian Hessian in a
+statistical manifold. We use a non-parametric formulation of information geometry in view of further
+applications in the continuous state space cases, where the construction of a proper Riemannian
+structure is still an open problem.
+
+Keywords: statistical manifold; Riemannian Hessian; combinatorial optimization; Newton method
+
+1. Introduction
+
+In this paper, statistical exponential families [1] are thought of as differentiable manifolds along
+the approach called information geometry [2] or the exponential statistical manifold [3]. Specifically,
+our aim is to discuss optimization on statistical manifolds using the Newton method, as is suggested
+in ([4] (Ch. 5 and 6)); see also the monograph [5]. This method is based on classical Riemannian
+geometry [6], but here, we put our emphasis on coordinate-free differential geometry; see [7,8].
+We mainly refer to the above-mentioned references [2,4], with one notable exception in the
+description of the tangent space. Our manifold will be an exponential family EV of positive densities,
+V being a vector space of sufficient statistics. Given a one-dimensional statistical model p(t) ∈ EV,
+t ∈ I, we define its velocity at time t to be its Fisher score s(t) = (d/dt) ln p(t) [9]. The Fisher score s(t)
+is a random variable with zero expectation with respect to p(t), Ep(t) [s(t)] = 0. Because of that, the
+tangent space at p ∈ EV is a vector space of random variables with zero expectation at p. A vector field
+is a mapping from p to a random variable V(p), such that for all p ∈ E, the random variable V(p) is
+centered at p, Ep [V(p)] = 0. In other words, each point of the manifold has a different tangent space,
+and this tangent space can be used as a non-parametric model space of the manifold. In this formalism,
+a vector field is a mapping from densities to centered random variables, that is, it is what in statistics is
+called a pivot of the statistical model. To avoid confusion with the product of random variables, we
+do not use the standard notation for the action of a vector field on a real function. This approach is
+possibly unusual in differential geometry, but it is fully natural from the statistical point of view, where
+the Fisher score has a central place. Moreover, this approach scales nicely from the finite state space to
+the general state space; see the discussion in [9] and the review in [3].
+A complete construction of the geometric framework based on the idea of using the Fisher scores
+as elements of the tangent bundle has actually been worked out. In this paper, we continue by considering
+a second order geometry based on the non-parametric settings.
+
+Entropy 2014, 16, 4260–4289; doi:10.3390/e16084260
+www.mdpi.com/journal/entropy
+
+Our main motivation for such a geometrical construction is its application to combinatorial
+optimization using exponential families, whose first order version was developed in [10–14]. We give
+here an illustration of the methods in the following toy example.
+Consider the function f (x1, x2) = a0 + a1x1 + a2x2 + a12x1x2, with x1, x2 = ±1, a0, a1, a2, a12 ∈ R.
+The function f is a real random variable on the sample space Ω = {+1, −1}^2 with the uniform
+probability λ.
+Note that the coordinate mappings X1, X2 of Ω generate an orthonormal basis
+1, X1, X2, X1X2 of L2(Ω, λ) and that f is the general form of a real random variable on such a space. Let
+P> be the open simplex of positive densities on (Ω, λ), and let EV be a statistical model, i.e., a subset
+of P>. The relaxed mapping F: EV → R,
+
+F(p) = Ep [ f ] = a0 + a1 Ep [X1] + a2 Ep [X2] + a12 Ep [X1X2] ,
+(1)
+
+is strictly bounded by the maximum of f, F(p) = Ep [ f ] < max_{x∈Ω} f (x), unless f is constant. We are
+looking for a sequence pn, n ∈ N, such that E_{pn} [ f ] → max_{x∈Ω} f (x) as n → ∞. The existence of such a
+sequence is a nontrivial condition for the model EV. Precisely, the closure of EV must contain a density,
+whose support is contained in the set of maxima {x ∈ Ω | f (x) = max f }. This condition is satisfied by
+the independence model, V = Span {X1, X2}, where we can write:
+
+F(η1, η2) = a0 + a1η1 + a2η2 + a12η1η2,
+ηi = Ep [Xi] ,
+(2)
+
+See Figure 1.
+The gradient of Equation (2) has components ∂1F = a1 + a12η2, ∂2F = a2 + a12η1, and the flow
+along the gradient produces increasing values for F. The gradient flow, however, does not converge to
+the maximum of F; see the dotted line in Figure 2. One can instead follow the suggestion by [15] and
+use a modified gradient (the “natural” gradient) flow, which produces better results in our problem; see
+Figure 3. Full details on this example are given in Section 2.5.2.
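+
+As a quick numerical check (a worked sketch added for illustration, not a computation from the original text), take the
+values a1 = 1, a2 = 2, a12 = 3 used in Figure 1 and the starting point (η1, η2) = (−1/4, −1/4) of Figure 3. The form of
+the natural gradient flow in the η-parameters is borrowed from Equation (74) of Section 2.5.2; a0 is left free, since it
+does not enter the gradients.
+
+\[
+\begin{aligned}
+\partial_1 F &= a_1 + a_{12}\eta_2 = 1 - \tfrac{3}{4} = \tfrac{1}{4}, &
+\partial_2 F &= a_2 + a_{12}\eta_1 = 2 - \tfrac{3}{4} = \tfrac{5}{4}, \\
+\dot\eta_1 &= (1-\eta_1^2)\,\partial_1 F = \tfrac{15}{16}\cdot\tfrac{1}{4} = \tfrac{15}{64}, &
+\dot\eta_2 &= (1-\eta_2^2)\,\partial_2 F = \tfrac{15}{16}\cdot\tfrac{5}{4} = \tfrac{75}{64}.
+\end{aligned}
+\]
+
+Both velocity components are positive at the starting point, and the factor 1 − ηj^2 vanishes on the faces ηj = ±1 of
+the square. Among the four vertices, f is largest at (+1, +1), where f = a0 + 6.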
+
+[Figure 1: plot titled “Expectation parameters”, with axes η1 and η2 ranging over [−1, +1].]
+
+Figure 1. Relaxation of the Function (2) on the independence model. a1 = 1, a2 = 2, a12 = 3.
+
+
+[Figure 2: plot titled “Expectation parameters”, with axes η1 and η2 ranging over [−2, +2].]
+
+Figure 2. Gradient flow of the Function (2). The domain has been increased to include values outside
+the square [−1, +1]^2.
+
+[Figure 3: plot titled “Gradient vs Natural gradient”, with axes η1 and η2 ranging over [−1, +1].]
+
+Figure 3. Gradient flow (blue line) and natural gradient flow (black line) for the Function (2), starting
+at (−1/4, −1/4).
+
+In combinatorial optimization, the values of the function f are assumed to be available at each
+point, and the curve of steepest ascent of the relaxed function is learned through a simulation procedure
+based on exponential statistical models.
+
+
+In this paper, we introduce, in Section 2, the geometry of exponential families and its first order
+calculus. The second order calculus and the Hessian are discussed in Section 3. Finally, in Section 4,
+we apply the formalism to the discussion of the Newton method in the context of the maximization of
+the relaxed function.
+
+2. Models on a Finite State Space
+
+We consider here the exponential statistical manifold on the set of positive densities on a measure
+space (Ω, μ) with Ω finite and counting measure μ. The setup we describe below is not strictly required
+in the finite case, because in such a case, other approaches are possible, but it provides a mathematical
+formalism that has its own advantages and that scales naturally to the infinite case.
+We provide below a schematic presentation of our formalism as an introduction to this section.
+
+• Two different exponential families can actually be the same statistical model, as the set of
+densities in the two exponential families are actually equal. This fact is due to both the
+arbitrariness of the reference density and the fact that sufficient statistics are actually a vector
+basis of the vector space generated by the sufficient statistics. In a non-parametric approach,
+we can refer directly to the vector space of centered log-densities, while the change of reference
+density is geometrically interpreted as a change of chart. The set of all possible such charts
+defines a manifold.
+• We make a specific interpretation of the tangent bundle as the vector space of Fisher’s scores at
+each density and use such tangent spaces as the space of coordinates. This produces a different
+tangent space/space of coordinates at each density, and different tangent spaces are mapped
+one onto another by a proper parallel transport, which is nothing else than the re-centering of
+random variables.
+• If a basis is chosen, a parametrization is given, and such a parametrization is, in fact, a new
+chart, whose values are real vectors. In the real parametrization, the natural scalar product in
+each scores space is given by Fisher’s information matrix.
+• Riemannian gradients are defined in the usual way. It is customary in information geometry to
+call “natural gradient” the real coordinate presentation of the Riemannian gradient. The natural
+gradient is computed by applying the inverse of the Fisher information matrix to the Euclidean
+gradient. It seems that there are three gradients involved, but they all represent the same object
+when correctly understood.
+• The classical notion of expectation parameters for exponential families carries on as another
+chart on the statistical manifold, which gives rise to a further presentation of a geometrical
+object.
+• While the statistical manifold is unique, there are at least three relevant connections as structures
+on the vector bundles of the manifold: one relating to the exponential charts, one relating to the
+expectation charts and one depending on the Riemannian structure.
+
+2.1. Exponential Families As Manifolds
+
+On the finite sample space Ω, #Ω = n, let a set of random variables B = {X1, . . . , Xm} be
+given, such that ∑j αjXj is constant if, and only if, the αj’s are zero, or, equivalently, such that X0 =
+1, X1, . . . , Xm are affinely independent. The condition implies, necessarily, the linear independence of
+B. A common choice is to take a set of linearly independent and μ-centered random variables.
+We write V = Span {X1, . . . , Xm} and define the following exponential family of positive densities
+p ∈ P>:
+
+\[ \mathcal{E}_{\mathcal{V}} = \bigl\{ q \in \mathcal{P}_{>} \bigm| q \propto \mathrm{e}^{V} \cdot p,\ V \in \mathcal{V} \bigr\}. \tag{3} \]
+
+Given any couple p, q ∈ EV, there exists a unique set of parameters θ = θp(q), such that:
+
+\[ q = \exp\Bigl( \sum_{j} \theta_j\, {}^{e}U_{p} X_j - \psi_p(\theta) \Bigr) \cdot p \tag{4} \]
+
+
+where eUp is the centering at p, that is,
+
+\[ {}^{e}U_{p} \colon \mathcal{V} \ni U \mapsto U - E_p[U] \in {}^{e}U_{p}\mathcal{V}. \tag{5} \]
+
+The linear mapping eUp is one-to-one on V, and eUpXj, j = 1, . . . , m, is a basis of eUpV. We view
+each choice of a specific reference p as providing a chart centered at p on the exponential family EV,
+namely:
+
+\[ \sigma_p \colon \exp\Bigl( \sum_{j} \theta_j\, {}^{e}U_{p} X_j - \psi_p(\theta) \Bigr) \cdot p \mapsto \theta. \tag{6} \]
+
+If:
+
+\[ U = {}^{e}U_{p} U + E_p[U] = \sum_{j=1}^{m} \theta_j\, {}^{e}U_{p} X_j + E_p[U], \tag{7} \]
+
+then:
+
+\[ E_p\bigl[ U\, {}^{e}U_{p} X_i \bigr] = \sum_{j=1}^{m} \theta_j\, E_p\bigl[ {}^{e}U_{p} X_i\, {}^{e}U_{p} X_j \bigr], \tag{8} \]
+
+so that θ = IB(p)^{−1} Ep [U eUpX], where:
+
+\[ I_B(p) = \bigl[ \operatorname{Cov}_p(X_i, X_j) \bigr]_{ij} = E_p[X X'] - E_p[X]\, E_p[X'] \tag{9} \]
+
+is the Fisher information matrix of the basis B = {X1, . . . , Xm}.
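+
+For instance (an illustrative special case added here, not taken from the text above), let m = 1 and let X = ±1 with
+Ep [X] = η. The centering of Equation (5), the Fisher information of Equation (9) and the coordinate θ then reduce to:
+
+\[
+{}^{e}U_{p} X = X - \eta, \qquad
+I_B(p) = E_p[X^2] - E_p[X]^2 = 1 - \eta^2, \qquad
+\theta = \frac{E_p\bigl[ U\,(X - \eta) \bigr]}{1 - \eta^2} = \frac{\operatorname{Cov}_p(U, X)}{\operatorname{Var}_p(X)}.
+\]
+
+In this case the coordinate θ is the ordinary regression coefficient of U on X, and IB(p) degenerates as η → ±1.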
+The mappings:
+
+\[ \sigma_p \colon \mathcal{E}_{\mathcal{V}} \ni q \mapsto U \mapsto \theta \in \mathbb{R}^m \tag{10} \]
+
+where:
+
+\[ s_p \colon q \mapsto U = \log\Bigl( \frac{q}{p} \Bigr) - E_p\Bigl[ \log\Bigl( \frac{q}{p} \Bigr) \Bigr], \tag{11} \]
+
+\[ \sigma_p \colon q \mapsto \theta = I_B^{-1}(p)\, E_p\bigl[ U\, {}^{e}U_{p} X \bigr] = I_B^{-1}(p)\, E_p\Bigl[ \log\Bigl( \frac{q}{p} \Bigr)\, {}^{e}U_{p} X \Bigr], \tag{12} \]
+
+are global charts in the non-parametric and parametric coordinates, respectively. Notice that
+Equation (12) provides the regression coefficients of the least squares estimate on eUpV of the
+log-likelihood.
+We denote by ep : Rm → EV the inverse of σp, i.e.,
+
+\[ e_p(\theta) = \exp\Bigl( \sum_{j=1}^{m} \theta_j\, {}^{e}U_{p} X_j - \psi_p(\theta) \Bigr) \cdot p, \tag{13} \]
+
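+so that the representation of the divergence q ↦ D (p ∥q) in the chart σp is ψp:
+
+\[ \psi_p(\theta) = \log\Bigl( E_p\Bigl[ \mathrm{e}^{\sum_{j=1}^{m} \theta_j\, {}^{e}U_{p} X_j} \Bigr] \Bigr) = E_p\Bigl[ \log\Bigl( \frac{p}{e_p(\theta)} \Bigr) \Bigr] = D\bigl( p \,\|\, e_p(\theta) \bigr). \tag{14} \]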
+
+The mapping IB : p ↦ Covp (X, X) ∈ R^{m×m} is represented in the chart centered at p by:
+
+\[ I_{B,p}(\theta) = I_B(e_p(\theta)) = \bigl[ \operatorname{Cov}_{e_p(\theta)}(X_i, X_j) \bigr]_{i,j} = \operatorname{Hess} \psi_p(\theta); \tag{15} \]
+
+see [1].
+
+
+2.2. Change of Chart
+
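+Fix p, p̄ ∈ EV; then, we can express p in the chart centered at p̄,
+
+\[ p = \exp\bigl( \bar U - k_{\bar p}(\bar U) \bigr) \cdot \bar p, \qquad \bar U \in {}^{e}U_{\bar p}\mathcal{V}, \qquad k_{\bar p}(\bar U) = \log\bigl( E_{\bar p}\bigl[ \mathrm{e}^{\bar U} \bigr] \bigr). \tag{16} \]
+
+In coordinates, Ū = ∑_{j=1}^{m} θ̄j eUp̄Xj.
+For all q ∈ EV, q = exp(U − kp(U)) · p, U ∈ eUpV, kp(U) = log(Ep [e^U]); in coordinates,
+U = ∑_{j=1}^{m} θj eUpXj, and we can write:
+
+\[
+\begin{aligned}
+q &= \exp\bigl( U - k_p(U) \bigr) \cdot p \\
+&= \exp\bigl( U - k_p(U) \bigr) \exp\bigl( \bar U - k_{\bar p}(\bar U) \bigr) \cdot \bar p \\
+&= \exp\bigl( U - k_p(U) + \bar U - k_{\bar p}(\bar U) \bigr) \cdot \bar p \\
+&= \exp\Bigl( \bigl( (U + \bar U) - E_{\bar p}[U] \bigr) - \bigl( k_p(U) + k_{\bar p}(\bar U) - E_{\bar p}[U] \bigr) \Bigr) \cdot \bar p,
+\end{aligned} \tag{17}
+\]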
+
+hence, the non-parametric coordinate of q in the chart centered at p̄ is U + Ū − Ep̄ [U] = eUp̄(U) + Ū.
+From Equation (12):
+
+\[ \sigma_{\bar p}(q) = I_B^{-1}(\bar p)\, E_{\bar p}\bigl[ ({}^{e}U_{\bar p} U + \bar U)\, {}^{e}U_{\bar p} X \bigr] = \theta + \bar\theta. \tag{18} \]
+
+This provides the change of charts σp̄ ◦ σp^{−1} : θ ↦ θ + θ̄. This atlas of charts defines the affine
+manifold (EV, (σp)). This fact has deep consequences that we do not discuss here, e.g., our manifold is
+an instance of a Hessian manifold [16].
+
+2.3. Tangent Bundle
+
+The space of Fisher scores at p is eUpV, and it is identified with the tangent space of the manifold
+at p, TpEV; see the discussion in [3,9]. Let us check the consistency of this statement with our
+θ-parametrization.
+Let:
+
+\[ q(\tau) = \exp\Bigl( \sum_{j=1}^{m} \theta_j(\tau)\, {}^{e}U_{q(0)} X_j - \psi_{q(0)}(\theta(\tau)) \Bigr) \cdot q(0), \tag{19} \]
+
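+τ ∈ I, I an open interval containing zero, be a curve in EV. In the chart centered at q(0), we have from
+Equation (12):
+
+\[
+\begin{aligned}
+\sigma_{q(0)}(q(\tau)) &= I_B^{-1}(q(0))\, E_{q(0)}\Bigl[ \log\Bigl( \frac{q(\tau)}{q(0)} \Bigr)\, {}^{e}U_{q(0)} X \Bigr] \\
+&= I_B^{-1}(q(0))\, E_{q(0)}\Bigl[ \Bigl( \sum_{j=1}^{m} \theta_j(\tau)\, {}^{e}U_{q(0)} X_j - \psi_{q(0)}(\theta(\tau)) \Bigr)\, {}^{e}U_{q(0)} X \Bigr] \\
+&= I_B^{-1}(q(0)) \sum_{j=1}^{m} \theta_j(\tau)\, E_{q(0)}\bigl[ {}^{e}U_{q(0)} X_j\, {}^{e}U_{q(0)} X \bigr] \\
+&= I_B^{-1}(q(0))\, E_{q(0)}\bigl[ {}^{e}U_{q(0)} X\, {}^{e}U_{q(0)} X' \bigr]\, \theta(\tau) \\
+&= \theta(\tau).
+\end{aligned} \tag{20}
+\]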
+
+The vector space eUpV is represented by the coordinates in the basis eUpB. The tangent bundle
+TEV as a manifold is defined by the charts (σp, σ̇p) on the domain:
+
+\[ T\mathcal{E}_{\mathcal{V}} = \bigl\{ (p, v) \bigm| p \in \mathcal{E}_{\mathcal{V}},\ v \in T_p\mathcal{E}_{\mathcal{V}} \bigr\} \tag{21} \]
+
+
+with:
+
+\[ (\sigma_p, \dot\sigma_p) \colon (q, V) \mapsto \Bigl( I_B^{-1}(p)\, E_p\Bigl[ \log\Bigl( \frac{q}{p} \Bigr)\, {}^{e}U_{p} X \Bigr],\ I_B^{-1}(p)\, E_p\bigl[ V\, {}^{e}U_{p} X \bigr] \Bigr). \tag{22} \]
+
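+The dot notation σ̇p for the charts on the tangent spaces is justified by the computation in Equation (23)
+below:
+
+\[
+\begin{aligned}
+\frac{d}{d\tau} \sigma_{q(0)}(q(\tau)) \Big|_{\tau=0}
+&= I_B^{-1}(q(0))\, E_{q(0)}\Bigl[ \frac{d}{d\tau} \log(q(\tau)) \Big|_{\tau=0}\, {}^{e}U_{q(0)} X \Bigr] \\
+&= I_B^{-1}(q(0))\, E_{q(0)}\bigl[ \delta q(0)\, {}^{e}U_{q(0)} X \bigr] = \dot\sigma_{q(0)}(\delta q(0)).
+\end{aligned} \tag{23}
+\]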
+
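+The velocity at τ = 0 is δq(0) = (d/dτ) log (q(τ))|_{τ=0} ∈ T_{q(0)}EV and:
+
+\[
+\begin{aligned}
+\frac{d}{d\tau} \theta(\tau) \Big|_{\tau=0}
+&= I_B^{-1}(q(0))\, E_{q(0)}\Bigl[ \frac{d}{d\tau} \log(q(\tau)) \Big|_{\tau=0}\, {}^{e}U_{q(0)} X \Bigr] \\
+&= I_B^{-1}(q(0))\, E_{q(0)}\bigl[ \delta q(0)\, {}^{e}U_{q(0)} X \bigr],
+\end{aligned} \tag{24}
+\]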
+
+which is consistent with both the definition of tangent space as set of Fisher scores and with the chart
+of the tangent bundle as defined in Equation (22).
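+The velocity at a generic τ is δq(τ) = (d/dτ) log (q(τ)) ∈ T_{q(τ)}EV and has coordinates at p = q(0):
+
+\[
+\begin{aligned}
+\frac{d}{d\tau} \theta(\tau)
+&= I_B^{-1}(q(0))\, E_{q(0)}\Bigl[ \frac{d}{d\tau} \log(q(\tau))\, {}^{e}U_{q(0)} X \Bigr] \\
+&= I_B^{-1}(q(0))\, E_{q(0)}\bigl[ \delta q(\tau)\, {}^{e}U_{q(0)} X \bigr].
+\end{aligned} \tag{25}
+\]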
+
+If V, W are vector fields on TEV, i.e., V(p), W(p) ∈ TpEV = eUpV, p ∈ EV, we define a Riemannian
+metric g(V, W) by:
+
+\[ g(V, W)(p) = g_p(V(p), W(p)) = E_p[V(p)\, W(p)]. \tag{26} \]
+
+In coordinates at p, V(p) = ∑j σ̇p^j(V) eUpXj, W(p) = ∑j σ̇p^j(W) eUpXj, so that:
+
+\[ g_p(V(p), W(p)) = \dot\sigma_p(V)'\, I_B(p)\, \dot\sigma_p(W). \tag{27} \]
+
+2.4. Gradients
+
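+Given a function φ: EV → R, let φp = φ ◦ ep, ep = σp^{−1}, be its representation in the chart centered
+at p:
+
+\[
+\begin{array}{ccc}
+\mathcal{E}_{\mathcal{V}} & \xrightarrow{\;\phi\;} & \mathbb{R} \\
+{\scriptstyle e_p} \big\uparrow & \nearrow {\scriptstyle \phi_p} & \\
+\mathbb{R}^m & &
+\end{array} \tag{28}
+\]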
+
+The derivative of θ ↦ φp(θ) at θ = 0 along α ∈ Rm is:
+
+\[ \nabla\phi_p(0)\alpha = \nabla\phi_p(0)\, I_B^{-1}(p)\, I_B(p)\, \alpha = \bigl( I_B^{-1}(p)\, \nabla\phi_p(0)' \bigr)' I_B(p)\, \alpha = g_p\bigl( I_B^{-1}(p)\, \nabla\phi_p(0)',\ \alpha \bigr). \tag{29} \]
+
+The mapping ∇̃φ: p ↦ IB(p)^{−1}(∇φp(0))′ ∈ Rm that appears in Equation (29) is Amari’s natural
+gradient of φ: EV; see [15]. It is a standard notion in Riemannian geometry; cf. [4] (p. 46).
+
+
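+More generally, the derivative of θ ↦ φp(θ) at θ along α ∈ Rm is:
+
+\[
+\begin{aligned}
+\nabla\phi_p(\theta)\alpha &= \nabla\phi_p(\theta)\, I_B^{-1}(e_p(\theta))\, I_B(e_p(\theta))\, \alpha
+= \bigl( I_B^{-1}(e_p(\theta))\, \nabla\phi_p(\theta)' \bigr)' I_B(e_p(\theta))\, \alpha \\
+&= g_{e_p(\theta)}\bigl( I_B^{-1}(e_p(\theta))\, \nabla\phi_p(\theta)',\ \alpha \bigr).
+\end{aligned} \tag{30}
+\]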
+
+Let us compare ∇φq(0) and ∇φp(θ) when q = ep(θ). As φp = φ ◦ ep and φq = φ ◦ eq, we have the
+change of charts:
+
+\[ \phi_q = \phi \circ e_q = \phi \circ e_p \circ \sigma_p \circ e_q = \phi_p \circ \sigma_p \circ e_q, \tag{31} \]
+
+hence ∇φq(0) = ∇φp(σp(q)) J(σp ◦ eq)(0), where J(σp ◦ eq) is the Jacobian of σp ◦ eq. As σp ◦ eq(θ) =
+θ + σp(q), we have J(σp ◦ eq) = Id, and in conclusion, ∇φ_{ep(θ)}(0) = ∇φp(θ). For all p ∈ EV and
+θ ∈ Rm,
+
+\[ \widetilde{\nabla}\phi(e_p(\theta)) = I_B^{-1}(e_p(\theta))\, \nabla\phi_p(\theta). \tag{32} \]
+
+Alternatively, for all q, p ∈ EV, ∇̃φ: EV → Rm is defined by:
+
+\[ \widetilde{\nabla}\phi(q) = I_B^{-1}(q)\, \nabla\phi_p(\sigma_p(q)). \tag{33} \]
+
+The Riemannian gradient of φ: EV is the vector field ∇φ, such that DYφ = g(∇φ, Y). Note that
+the Riemannian gradient takes values in the tangent bundle, while the natural gradient takes values in
+Rm. We compute the Riemannian gradient at p as follows. If y = σ̇p(Y(p)),
+
+\[ D_Y\phi(p) = d\phi_p(0)\, y = g_p\bigl( \widetilde{\nabla}\phi(p),\ y \bigr) = E_p[\nabla\phi(p)\, Y(p)], \tag{34} \]
+
+hence ∇̃φ(p) = IB(p)^{−1}∇φp(0)′ is the representation in the chart centered at p of the vector field
+∇φ: EV. Explicitly, we have (see Equation (22)),
+
+\[ \widetilde{\nabla}\phi(p) = I_B^{-1}(p)\, (\nabla\phi_p(0))' = I_B^{-1}(p)\, E_p\bigl[ \nabla\phi(p)\, {}^{e}U_{p} X \bigr], \tag{35} \]
+
+\[ \nabla\phi(p) = \sum_{j} \bigl( \widetilde{\nabla}\phi(p) \bigr)_j\, {}^{e}U_{p} X_j. \tag{36} \]
+
+The Euclidean gradient ∇φp(θ) is sometimes called the “vanilla gradient.” It is equal to the
+covariance between the Riemannian gradient ∇φ(p) and the basis X, (∇φp(0))′ = Ep [∇φ(p) eUpX].
+We summarize in a display the relations between our three gradients: Euclidean ∇φp(0), natural
+∇̃φ(p) and Riemannian ∇φ(p).
+
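+\[
+\begin{array}{ccc}
+T\mathcal{E}_{\mathcal{V}} & \xrightarrow{(\sigma_p,\, \dot\sigma_p)} & \mathbb{R}^{2m} \\
+{\scriptstyle \pi} \big\downarrow & & \big\downarrow {\scriptstyle \pi_1} \\
+\mathcal{E}_{\mathcal{V}} & \xrightarrow{\;\sigma_p\;} & \mathbb{R}^{m}
+\end{array}
+\qquad
+\begin{array}{ccc}
+T_p\mathcal{E}_{\mathcal{V}} & \xrightarrow{\;\dot\sigma_p\;} & \mathbb{R}^{m} \\
+{\scriptstyle \nabla\phi(p)} \big\uparrow & & \big\downarrow {\scriptstyle I_B(p)} \\
+\mathcal{E}_{\mathcal{V}} & \xrightarrow{\;\nabla\phi_p(0)\;} & \mathbb{R}^{m}
+\end{array}
+\]
+
+\[ \dot\sigma_p \circ \nabla\phi(p) = I_B^{-1}\, \nabla\phi_p(0) = \widetilde{\nabla}\phi(p) \tag{37} \]
+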
+In the following, we shall frequently use the fact that the representation of the gradient vector
+field ∇φ in a generic chart centered at p is:
+
+\[ (\nabla\phi)_p(\theta) = \dot\sigma_p\bigl( \nabla\phi(e_p(\theta)) \bigr) = (\widetilde{\nabla}\phi)(e_p(\theta)) = I_{B,p}^{-1}(\theta)\, \nabla\phi_p(\theta). \tag{38} \]
+
+It should be noted that the leftmost term (∇φ)p(θ) is the presentation of the gradient in the charts
+of the tangent bundle, while in the rightmost term, ∇φp(θ) denotes the Euclidean gradient of the
+presentation of the function φ in the charts of the manifold.
+
+
+2.4.1. Expectation Parameters
+
+As ψp is strictly convex, the gradient mapping θ ↦ (∇ψp(θ))′ is a homeomorphism from the
+space of parameters Rm to the interior of the convex set generated by the image of eUpX; see [1]. The
+function μp : EV defined by:
+
+\[ \mu_p(q) = E_q\bigl[ {}^{e}U_{p} X \bigr] = E_q[X] - E_p[X] = (\nabla\psi_p(\theta))', \qquad \theta = \sigma_p(q) \tag{39} \]
+
+is a chart for all p ∈ EV. The value of the inverse q = Lp(μ) is characterized as the unique q ∈ EV, such
+that μ = Eq [eUpX], i.e., the maximum likelihood estimator.
+Let us compute the change of chart from p to p̄:
+
+\[ \mu_{\bar p} \circ \mu_p^{-1}(\eta) = \bar\eta = \eta + E_p[X] - E_{\bar p}[X]. \tag{40} \]
+
+In fact, μ = E_{Lp(μ)} [eUpX] and μ̄ = μp̄(Lp(μ)) = E_{Lp(μ)} [eUp̄X].
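+We do not discuss here the rich theory started in [2] about the duality between σp and μp. We
+limit ourselves to the computation of the Riemannian gradient in the expectation parameters. If φ: EV,
+
+\[ \phi_p(\theta) = \phi \circ e_p(\theta) = \phi \circ L_p \circ \mu_p \circ e_p(\theta) = (\phi \circ L_p) \circ (\nabla\psi_p)(\theta), \tag{41} \]
+
+because μp ◦ ep(θ) = E_{ep(θ)} [eUpX] = ∇ψp(θ), hence:
+
+\[ \nabla\phi_p(\theta) = \nabla(\phi \circ L_p)(\nabla\psi_p(\theta))\, \operatorname{Hess} \psi_p(\theta), \tag{42} \]
+
+\[ \widetilde{\nabla}\phi(p) = I_B(p)^{-1} \bigl( \nabla(\phi \circ L_p)(0)\, \operatorname{Hess} \psi_p(0) \bigr)' = \bigl( \nabla(\phi \circ L_p)(0) \bigr)', \tag{43} \]
+
+\[ \nabla\phi(p) = \nabla(\phi \circ L_p)(0)\, {}^{e}U_{p} X, \tag{44} \]
+
+that is, the natural gradient ∇̃φ at p = Lp(0) is equal to the Euclidean gradient of μ ↦ φ ◦ Lp(μ) at
+μ = 0.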
+
+2.4.2. Vector Fields
+
+If V is a vector field of TEV and φ: EV is a real function, then we define the action of V on φ, ∇Vφ,
+to be the real function:
+
+\[ \nabla_V\phi \colon \mathcal{E}_{\mathcal{V}} \ni p \mapsto \nabla_V\phi(p) = \nabla\phi_p(0)\, \dot\sigma_p(V(p)). \tag{45} \]
+
+We prefer to avoid the standard notation Vφ, because in our setting, V(p) is a random variable, and
+the product V(p)φ(p) is otherwise defined as the ordinary product.
+Let us represent ∇Vφ in the chart centered at p:
+
+\[ (\nabla_V\phi)_p(\theta) = \nabla_V\phi(e_p(\theta)) = \nabla\phi_{e_p(\theta)}(0)\, \dot\sigma_{e_p(\theta)}\bigl( V(e_p(\theta)) \bigr) = \nabla\phi_p(\theta)\, V_p(\theta), \tag{46} \]
+
+where we have used the equality ∇φ_{ep(θ)}(0) = ∇φp(θ) and Vp(θ) = σ̇_{ep(θ)}(V(ep(θ))).
+If W is a vector field, we can compute ∇W∇Vφ at p as:
+
+\[
+\begin{aligned}
+\nabla_W\nabla_V\phi(p) &= \nabla(\nabla_V\phi)_p(0)\, \dot\sigma_p(W(p)) \\
+&= V_p(0)' \operatorname{Hess} \phi_p(0)\, W_p(0) + \nabla\phi_p(0)\, JV_p(0)\, W_p(0),
+\end{aligned} \tag{47}
+\]
+
+where J denotes the Jacobian matrix.
+The Lie bracket [W, V]φ (see [7] (§4.2), [8] (V, §1), [4] (Section 5.3.1)) is given by:
+
+\[ [W, V]\phi(p) = \nabla_W\nabla_V\phi(p) - \nabla_V\nabla_W\phi(p) = \nabla\phi_p(0) \bigl( JV_p(0)\, W_p(0) - JW_p(0)\, V_p(0) \bigr), \tag{48} \]
+
+because of Equation (47) and the symmetry of the Hessian.
+
+
+The flow of the smooth vector field V : EV is a family of curves γ(t, p), p ∈ EV, t ∈ Jp, Jp an open real
+interval containing zero, such that for all p ∈ EV and t ∈ Jp,
+
+\[ \gamma(0, p) = p, \tag{49} \]
+
+\[ \delta\gamma(t, p) = V(\gamma(t, p)). \tag{50} \]
+
+As uniqueness holds in Equation (50) (see [8] (VI, §1) or [7] (§4.1)), we have the semi-group property
+γ(s + t, p) = γ(s, γ(t, p)), and Equation (50) is equivalent to δγ(0, p) = V(γ(0, p)), p ∈ EV.
+If a flow of V is available, we have an interpretation of ∇Vφ as a derivative of φ along γ(t, p),
+
+\[ \frac{d}{dt} \phi(\gamma(t, p)) \Big|_{t=0} = \nabla\phi_p(\sigma_p(\gamma(t, p)))\, \frac{d}{dt} \sigma_p(\gamma(t, p)) \Big|_{t=0} = \nabla\phi_p(0)\, \dot\sigma_p(V(p)) = \nabla_V\phi(p). \tag{51} \]
+
+2.5. Examples
+
+The following examples are intended to show how the formalism of gradients is usable in
+performing basic computations.
+
+2.5.1. Expectation
+
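+Let f be any random variable, and define F: EV by F(p) = Ep [ f ]. In the chart centered at p, we
+have:
+
+\[ F_p(\theta) = \int f \exp\Bigl( \sum_{j} \theta_j\, {}^{e}U_{p} X_j - \psi_p(\theta) \Bigr) \cdot p \; d\mu \tag{52} \]
+
+and the Euclidean gradient:
+
+\[ \nabla F_p(0) = \operatorname{Cov}_p(f, X) \in (\mathbb{R}^m)'. \tag{53} \]
+
+The natural gradient is:
+
+\[ \widetilde{\nabla}F(p) = \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p(X, f) \in \mathbb{R}^m, \tag{54} \]
+
+and the Riemannian gradient is:
+
+\[ \nabla F(p) = \bigl( \widetilde{\nabla}F(p) \bigr)'\, {}^{e}U_{p} X = \operatorname{Cov}_p(f, X) \operatorname{Cov}_p(X, X)^{-1}\, {}^{e}U_{p} X \in T_p\mathcal{E}_{\mathcal{V}}. \tag{55} \]
+
+From Equation (55), it follows that ∇F(p) is the L2(p)-projection of f onto eUpV, while ∇̃F(p) in
+Equation (54) are the coordinates of the projection. Let us consider the family of curves:
+
+\[ \gamma(t, p) = \exp\Bigl( \sum_{j=1}^{m} t\, \bigl( \widetilde{\nabla}F(p) \bigr)_j\, {}^{e}U_{p} X_j - \psi_p\bigl( t\, \widetilde{\nabla}F(p) \bigr) \Bigr) \cdot p, \qquad t \in \mathbb{R}. \tag{56} \]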
+The velocity is:
+
+\[ \delta\gamma(t, p) = \frac{d}{dt} \Bigl( \sum_{j=1}^{m} t\, \bigl( \widetilde{\nabla}F(p) \bigr)_j\, {}^{e}U_{p} X_j - \psi_p\bigl( t\, \widetilde{\nabla}F(p) \bigr) \Bigr) = \nabla F(p) - E_{\gamma(t,p)}[\nabla F(p)], \tag{57} \]
+
+which is different from ∇F(γ(t, p)), unless f ∈ V ⊕ R. Then, γ is not, in general, the flow of ∇F, but it
+is a local approximation, as δγ(0, p) = ∇F(p).
+These computations are the basis of model-based methods in combinatorial optimization; see [10–14].
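+
+The projection property of ∇F(p) can be checked in one line (a worked verification added for the reader, using only
+Equations (9) and (55), with X read as a column vector and X′ as its transpose):
+
+\[
+E_p\bigl[ \nabla F(p)\, {}^{e}U_{p} X' \bigr]
+= \operatorname{Cov}_p(f, X) \operatorname{Cov}_p(X, X)^{-1} E_p\bigl[ {}^{e}U_{p} X\, {}^{e}U_{p} X' \bigr]
+= \operatorname{Cov}_p(f, X)
+= E_p\bigl[ f\, {}^{e}U_{p} X' \bigr].
+\]
+
+Hence Ep [( f − ∇F(p)) eUpXk] = 0 for k = 1, . . . , m, i.e., the residual f − ∇F(p) is orthogonal in L2(p) to every
+element of the tangent space eUpV.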
+
+
+2.5.2. Binary Independent Variables
+
+Here, we present, in full generality, the toy example of the Introduction; see [17] for more
+information on the application to combinatorial optimization. Our example is a very special case of
+exactly solvable Ising models [18], our aim here being to explore the geometric framework.
+Let Ω = {+1, −1}^m with counting measure μ, and let the space V be generated by the coordinate
+projections B = {X1, . . . , Xd}, d = m. Note that we use here the coding +1, −1 (from physics) instead of
+the coding 0, 1, which is more common in combinatorial optimization. The exponential family is
+
+\[ \mathcal{E}_{\mathcal{V}} = \Bigl\{ \exp\Bigl( \sum_{j=1}^{m} \theta_j X_j - \psi_\lambda(\theta) \Bigr) \cdot 2^{-m} \Bigr\}, \]
+
+λ(x) = 2^{−m} for x ∈ Ω being the uniform density. The
+independence of the sufficient statistics Xj under all distributions in EV implies:
+
+\[ \psi_\lambda(\theta) = \sum_{j=1}^{m} \psi(\theta_j), \qquad \psi(\theta) = \log(\cosh(\theta)). \tag{58} \]
+
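+We have:
+
+\[ \nabla\psi_\lambda(\theta) = \bigl[ \tanh(\theta_j) : j = 1, \dots, d \bigr] = \eta_\lambda(\theta), \tag{59} \]
+
+\[ \operatorname{Hess} \psi_\lambda(\theta) = \operatorname{diag}\bigl( \cosh^{-2}(\theta_j) : j = 1, \dots, d \bigr) = \operatorname{diag}\bigl( \mathrm{e}^{-2\psi(\theta_j)} : j = 1, \dots, d \bigr) = I_{B,\lambda}(\theta), \tag{60} \]
+
+\[ I_{B,\lambda}(\theta)^{-1} = \operatorname{diag}\bigl( \cosh^{2}(\theta_j) : j = 1, \dots, d \bigr) = \operatorname{diag}\bigl( \mathrm{e}^{2\psi(\theta_j)} : j = 1, \dots, d \bigr). \tag{61} \]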
+
+The quadratic function f (X) = a0 + ∑j ajXj + ∑{i,j} ai,jXiXj has expected value at p = eλ(θ), i.e.,
+relaxed value, equal to:
+
+\[ F(p) = F_\lambda(\theta) = E_\theta[f(X)] = a_0 + \sum_{j} a_j \tanh(\theta_j) + \sum_{\{i,j\}} a_{i,j} \tanh(\theta_i) \tanh(\theta_j), \tag{62} \]
+
+and covariance with Xk ∈ B equal to:
+
+\[
+\begin{aligned}
+\operatorname{Cov}_\theta(f(X), X_k) &= \sum_{j} a_j \operatorname{Cov}_\theta(X_j, X_k) + \sum_{\{i,j\}} a_{i,j} \operatorname{Cov}_\theta(X_i X_j, X_k) \\
+&= a_k \operatorname{Var}_\theta(X_k) + \sum_{i \neq k} a_{i,k}\, E_\theta[X_i] \operatorname{Var}_\theta(X_k) \\
+&= \cosh^{-2}(\theta_k) \Bigl( a_k + \sum_{i \neq k} a_{i,k} \tanh(\theta_i) \Bigr).
+\end{aligned} \tag{63}
+\]
+
+In the computation, we have used the independence and the special algebra of ±1, which implies
+Xi^2 = 1, so that Covθ (XiXj, Xk) = 0 if i, j ≠ k, otherwise Covθ (XiXk, Xk) = Eθ [Xi] − Eθ [Xi] Eθ [Xk]^2;
+see [13].
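+
+Spelled out (a short worked step added here, writing ηj = Eθ [Xj] as in Equation (59)), the second case reads, for i ≠ k:
+
+\[
+\operatorname{Cov}_\theta(X_i X_k, X_k) = E_\theta[X_i X_k^2] - E_\theta[X_i X_k]\, E_\theta[X_k]
+= \eta_i - \eta_i \eta_k^2 = \eta_i (1 - \eta_k^2) = E_\theta[X_i] \operatorname{Var}_\theta(X_k).
+\]
+
+Since Varθ (Xk) = 1 − tanh^2(θk) = cosh^{−2}(θk), this is the origin of the common factor cosh^{−2}(θk) in the last line of
+Equation (63).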
+
+
+The Euclidean gradient, the natural gradient and the Riemannian gradient are, respectively,
+
+\[ \nabla F_\lambda(\theta) = \Bigl[ \cosh^{-2}(\theta_j) \Bigl( a_j + \sum_{i \neq j} a_{i,j} \tanh(\theta_i) \Bigr) : j = 1, \dots, d \Bigr], \tag{64} \]
+
+\[ \widetilde{\nabla}F(e_\lambda(\theta)) = \Bigl[ a_j + \sum_{i \neq j} a_{i,j} \tanh(\theta_i) : j = 1, \dots, d \Bigr], \tag{65} \]
+
+\[ \nabla F(e_\lambda(\theta)) = \sum_{j=1}^{m} \Bigl( a_j + \sum_{i \neq j} a_{i,j}\, E_\theta[X_i] \Bigr) \bigl( X_j - E_\theta[X_j] \bigr). \tag{66} \]
+
+The (natural) gradient flow equations are:
+
+\[ \dot\theta_j(t) = a_j + \sum_{i \neq j} a_{i,j} \tanh(\theta_i(t)), \qquad j = 1, \dots, d. \tag{67} \]
+
+Equations (64)–(66) are usable in practice if the aj’s and the ai,j’s are estimable. Otherwise, one
+can use Equation (63) and the following forms of the gradients:
+
+\[ \nabla F_\lambda(\theta) = \bigl[ \operatorname{Cov}_\theta(X_j, f(X)) : j = 1, \dots, d \bigr], \tag{68} \]
+
+\[ \widetilde{\nabla}F(e_\lambda(\theta)) = \bigl[ \cosh^{2}(\theta_j) \operatorname{Cov}_\theta(f(X), X_j) : j = 1, \dots, d \bigr], \tag{69} \]
+
+in which case, the gradient flow equations are:
+
+\[ \dot\theta_j(t) = \cosh^{2}(\theta_j) \operatorname{Cov}_\theta(f(X), X_j), \qquad j = 1, \dots, d. \tag{70} \]
+
+Let us study the relaxed function in the expectation parameters ηj = ηj(θ), j = 1, . . . , d,
+
+\[ F_\lambda(\eta) = a_0 + \sum_{j} a_j \eta_j + \sum_{\{i,j\}} a_{i,j} \eta_i \eta_j, \qquad \eta \in \left]-1, +1\right[^{m}. \tag{71} \]
+
+The Euclidean gradient with respect to η has components:
+
+\[ \partial_j F_\lambda(\eta) = a_j + \sum_{i \neq j} a_{i,j} \eta_i, \tag{72} \]
+
+which are equal to the components of the natural gradient; see Section 2.4.1. As:
+
+\[ \dot\eta_j(t) = \frac{d}{dt} \tanh(\theta_j(t)) = \cosh^{-2}(\theta_j(t))\, \dot\theta_j(t) = \bigl( 1 - \eta_j(t)^2 \bigr)\, \dot\theta_j(t), \qquad j = 1, \dots, m, \tag{73} \]
+
+the gradient flow expressed in the η-parameters has equations:
+
+\[ \dot\eta_j(t) = \bigl( 1 - \eta_j(t)^2 \bigr) \Bigl( a_j + \sum_{i \neq j} a_{i,j} \eta_i(t) \Bigr), \qquad j = 1, \dots, d. \tag{74} \]
+
+Alternatively, in vector form,
+
+\[ \dot\eta(t) = \operatorname{diag}\bigl( 1 - \eta_j(t)^2 : j = 1, \dots, d \bigr)\, \bigl( a + A\eta(t) \bigr), \tag{75} \]
+
+where a = [aj : j = 1, . . . , d]′ and Ai,j = 0 if i = j, Ai,j = ai,j otherwise. The matrix A is symmetric with zero
+diagonal, and it has the meaning of the adjacency matrix of the (weighted) interaction graph. We do
+not know a closed-form solution of Equation (74). An example of a numerical solution is shown in
+Figure 3.
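+
+For the toy example of the Introduction (a1 = 1, a2 = 2, a12 = 3), the interior stationary point of Equation (75) can be
+found by hand; the following is a worked illustration added here, not a result stated in the text. Inside the open
+square the diagonal factor is invertible, so the velocity vanishes only where a + Aη = 0:
+
+\[
+\begin{pmatrix} 1 \\ 2 \end{pmatrix} + \begin{pmatrix} 0 & 3 \\ 3 & 0 \end{pmatrix} \begin{pmatrix} \eta_1 \\ \eta_2 \end{pmatrix} = 0
+\quad\Longleftrightarrow\quad
+(\eta_1, \eta_2) = \bigl( -\tfrac{2}{3},\ -\tfrac{1}{3} \bigr).
+\]
+
+The Hessian of Fλ in the η-parameters is the matrix A itself, whose eigenvalues are ±3, so this point is a saddle of the
+relaxed function and not a maximum.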
+
+
+2.5.3. Escort Probabilities
+
+For a given a > 0, consider the function C^(a) : EV defined by C^(a)(p) = ∫ p^a dμ. We have:
+
+\[ C^{(a)}_p(\theta) = \int \exp\Bigl( a \sum_{j=1}^{m} \theta_j\, {}^{e}U_{p} X_j - a \psi_p(\theta) \Bigr)\, p^a \; d\mu \tag{76} \]
+
+and:
+
+\[ dC^{(a)}_p(0)\alpha = \int a \Bigl( \sum_{j=1}^{m} \alpha_j\, {}^{e}U_{p} X_j \Bigr) p^a \; d\mu = \sum_{j=1}^{m} \alpha_j \int a\, {}^{e}U_{p} X_j\, p^a \; d\mu = \sum_{j=1}^{m} \alpha_j \operatorname{Cov}_p\bigl( X_j, a p^{a-1} \bigr), \tag{77} \]
+
+that is, the Euclidean gradient is ∇C^(a)_p(0) = Covp (a p^{a−1}, X) (row vector). The natural gradient is
+computed from Equation (35) as:
+
+\[ \widetilde{\nabla}C^{(a)}(p) = I_B^{-1}(p)\, \bigl( \nabla C^{(a)}_p(0) \bigr)' = \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p\bigl( X, a p^{a-1} \bigr), \tag{78} \]
+
+while the Riemannian gradient follows from Equation (36):
+
+\[ \nabla C^{(a)}(p) = \operatorname{Cov}_p\bigl( a p^{a-1}, X \bigr) \operatorname{Cov}_p(X, X)^{-1}\, {}^{e}U_{p} X. \tag{79} \]
+
+Note that the Riemannian gradient is the orthogonal projection of the random variable a p^{a−1} onto
+the tangent space TpEV = eUpV.
+The probability density p^a/C^(a)(p) is called the escort density in the literature on non-extensive
+statistical mechanics; see, e.g., [19] (Section 7.4).
+We compute now the tangent mapping of EV ∋ p ↦ p^a/C^(a)(p) ∈ P>. Let us extend the basis
+X1, . . . , Xm to a basis X1, . . . , Xn, n ≥ m, whose exponential family is full, i.e., equal to P>. The
+non-parametric coordinate of q = (exp(∑_{j=1}^{m} θj eUpXj − ψp(θ)) p)^a / C^(a)_p(θ) in the chart centered at
+p̄ = p^a/C^(a)_p(0) is the p̄-centering of the random variable:
+
+\[
+\begin{aligned}
+\log\Bigl( \frac{q}{\bar p} \Bigr) &= \log\left( \frac{ \Bigl( \exp\Bigl( \sum_{j=1}^{m} \theta_j\, {}^{e}U_{p} X_j - \psi_p(\theta) \Bigr) p \Bigr)^{a} \big/ C^{(a)}_p(\theta) }{ p^a \big/ C^{(a)}_p(0) } \right) \\
+&= a \sum_{j=1}^{m} \theta_j\, {}^{e}U_{p} X_j - a \psi_p(\theta) + \ln C^{(a)}_p(0) - \ln C^{(a)}_p(\theta),
+\end{aligned} \tag{80}
+\]
+
+that is,
+
+\[ v = a \sum_{j=1}^{m} \theta_j\, {}^{e}U_{\bar p} X_j. \tag{81} \]
+
+The coordinates of v in the basis eUp̄X1, . . . , eUp̄Xn are (aθ1, . . . , aθm, 0, . . . , 0), and the Jacobian of
+θ ↦ (aθ, 0_{n−m}) is the m × n matrix [a Im | 0_{m×(n−m)}].
+
+
+2.5.4. Polarization Measure
+
+The polarization measure has been introduced in Economics by [20]. Here, we consider the
+qualitative version of [21]. If π is a distribution on a finite set, the probability that in three independent
+samples from π there are exactly two equal is 3 ∑j πj^2 (1 − πj). If p ∈ EV, define:
+
+\[ G(p) = \int p^2 (1 - p) \; d\mu = C^{(2)}(p) - C^{(3)}(p), \tag{82} \]
+
+where C^(2) and C^(3) are defined as in Example 2.5.3.
+From Equation (78), we find the natural gradient:
+
+\[ \widetilde{\nabla}G(p) = \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p\bigl( X, 2p - 3p^2 \bigr). \tag{83} \]
+
+Note that ∇̃G(p) = 0 if p is constant; see Figure 4.
+
+Figure 4. Normalized polarization.
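+
+As a sanity check on the formula 3 ∑j πj^2 (1 − πj) (an illustration added here, with the uniform distribution on two
+points as an assumed example), take π = (1/2, 1/2):
+
+\[
+3 \sum_{j=1}^{2} \pi_j^2 (1 - \pi_j) = 3 \cdot 2 \cdot \tfrac{1}{4} \cdot \tfrac{1}{2} = \tfrac{3}{4}
+= 1 - 2 \cdot \bigl( \tfrac{1}{2} \bigr)^{3}.
+\]
+
+On two points, three samples that are not all equal necessarily contain exactly two equal values, so the two counts
+agree. In the same spirit, Equation (83) is Equation (78) applied with a = 2 and a = 3 and subtracted, using the
+linearity of the covariance in its second argument:
+
+\[ \operatorname{Cov}_p(X, 2p) - \operatorname{Cov}_p(X, 3p^2) = \operatorname{Cov}_p\bigl( X, 2p - 3p^2 \bigr). \]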
+
+3. Second Order Calculus
+
+In this section, we turn to considering second order calculus, in particular Hessians, in order to
+prepare the discussion of the Newton method for the relaxed optimization of Section 4.
+
+3.1. Metric Derivative (Levi–Civita connection)
+
+Let V, W : EV be vector fields, that is, V(p), W(p) ∈ TpEV = eUpV, p ∈ EV. Consider the real
+function R = g(V, W): EV → R, whose value at p ∈ EV is R(p) = gp(V(p), W(p)) = Ep [V(p)W(p)].
+Assuming smoothness, we want to compute the derivative of R along the vector field Y : EV, that is,
+(DYR)(p) = dRp(0)α, with α = σ̇p(Y(p)). The expression of R in the chart centered at p is, according
+to Equation (27),
+
+\[ \theta \mapsto R_p(\theta) = \dot\sigma_p(V(e_p(\theta)))'\, I_B(e_p(\theta))\, \dot\sigma_p(W(e_p(\theta))) = V_p(\theta)'\, I_{B,p}(\theta)\, W_p(\theta), \tag{84} \]
+
+where Vp and Wp are the presentations in the chart of the vector fields V and W, respectively.
+
+
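+The i-th component ∂iRp(θ) of the Euclidean gradient ∇Rp(θ) is:
+
+\[
+\begin{aligned}
+\partial_i R_p(\theta) &= \partial_i \bigl( V_p(\theta)'\, I_{B,p}(\theta)\, W_p(\theta) \bigr) \\
+&= \partial_i V_p(\theta)'\, I_{B,p}(\theta)\, W_p(\theta) + V_p(\theta)'\, \partial_i I_{B,p}(\theta)\, W_p(\theta) + V_p(\theta)'\, I_{B,p}(\theta)\, \partial_i W_p(\theta) \\
+&= \Bigl( \partial_i V_p(\theta) + \tfrac{1}{2} I_{B,p}^{-1}(\theta)\, \partial_i I_{B,p}(\theta)\, V_p(\theta) \Bigr)' I_{B,p}(\theta)\, W_p(\theta) \\
+&\quad + V_p(\theta)'\, I_{B,p}(\theta) \Bigl( \partial_i W_p(\theta) + \tfrac{1}{2} I_{B,p}^{-1}(\theta)\, \partial_i I_{B,p}(\theta)\, W_p(\theta) \Bigr),
+\end{aligned} \tag{85}
+\]
+
+so that the derivative at θ along α = σ̇_{ep(θ)}(Y(ep(θ))) is:
+
+\[
+\begin{aligned}
+dR_p(\theta)\alpha &= \Bigl( dV_p(\theta)\alpha + \tfrac{1}{2} I_{B,p}^{-1}(\theta) \bigl( dI_{B,p}(\theta)\alpha \bigr) V_p(\theta) \Bigr)' I_{B,p}(\theta)\, W_p(\theta) \\
+&\quad + V_p(\theta)'\, I_{B,p}(\theta) \Bigl( dW_p(\theta)\alpha + \tfrac{1}{2} I_{B,p}^{-1}(\theta) \bigl( dI_{B,p}(\theta)\alpha \bigr) W_p(\theta) \Bigr).
+\end{aligned} \tag{86}
+\]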
+
+Proposition 1. If we define DYV to be the vector field on EV, whose value at q = ep(θ) has coordinates
+centered at p given by:
+
+\[ \dot\sigma_p(D_Y V(q)) = dV_p(\theta)\alpha + \tfrac{1}{2} I_{B,p}^{-1}(\theta) \bigl( dI_{B,p}(\theta)\alpha \bigr) V_p(\theta), \qquad \alpha = \dot\sigma_p(Y(q)), \tag{87} \]
+
+then:
+
+\[ D_Y g(V, W) = g(D_Y V, W) + g(V, D_Y W), \tag{88} \]
+
+i.e., Equation (87) is a metric covariant derivative; see [6] (Ch. 2 §3), [8] (VIII §4), [4] (§5.3.2).
+
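+The metric derivative Equation (87) could be computed from the flow of the vector field Y. Let
+(t, p) ↦ γ(t, p) be the flow of the vector field Y, i.e., δγ(t, p) = Y(γ(t, p)) and γ(0, p) = p. Using
+Equation (23), we have:
+
+\[
+\begin{aligned}
+\frac{d}{dt} \dot\sigma_p(V(\gamma(t, p))) \Big|_{t=0} &= \frac{d}{dt} V_p(\sigma_p(\gamma(t, p))) \Big|_{t=0} \\
+&= dV_p(\sigma_p(\gamma(t, p)))\, \frac{d}{dt} \sigma_p(\gamma(t, p)) \Big|_{t=0} = dV_p(0)\, \dot\sigma_p(\delta\gamma(0, p)) = dV_p(0)\, \dot\sigma_p(Y(p)),
+\end{aligned} \tag{89}
+\]
+
+and:
+
+\[ \frac{d}{dt} I_B(\gamma(t, p)) \Big|_{t=0} = \frac{d}{dt} I_{B,p}(\sigma_p(\gamma(t, p))) \Big|_{t=0} = dI_{B,p}(0)\, \dot\sigma_p(\delta\gamma(0, p)) = dI_{B,p}(0)\, \dot\sigma_p(Y(p)), \tag{90} \]
+
+so that, in agreement with Equation (87) at θ = 0:
+
+\[ \dot\sigma_p(D_Y V(p)) = \frac{d}{dt} \dot\sigma_p(V(\gamma(t, p))) \Big|_{t=0} + \tfrac{1}{2} I_B^{-1}(p) \Bigl( \frac{d}{dt} I_B(\gamma(t, p)) \Big|_{t=0} \Bigr) V_p(0). \tag{91} \]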
+
+Let us check the symmetry of the metric covariant derivative to show that it is actually the unique
+Riemannian or Levi–Civita affine connection; see [6] (Th. 3.6).
+The Lie bracket of the vector fields V and W is the vector field [V, W], whose coordinates at p are:
+
+\[ [V, W]_p(0) = dV_p(0)\, \dot\sigma_p(W(p)) - dW_p(0)\, \dot\sigma_p(V(p)). \tag{92} \]
+
+
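+As the ij entry of ∂kIB,p(0) is ∂k∂i∂jψp(0), the symmetry (dIB,p(0)α)β = (dIB,p(0)β)α holds,
+and we have:
+
+\[
+\begin{aligned}
+\dot\sigma_p\bigl( D_W V(p) - D_V W(p) \bigr) &= dV_p(0)\, \dot\sigma_p(W(p)) + \tfrac{1}{2} I_B^{-1}(p) \bigl( dI_{B,p}(0)\, \dot\sigma_p(W(p)) \bigr) V_p(0) \\
+&\quad - dW_p(0)\, \dot\sigma_p(V(p)) - \tfrac{1}{2} I_B^{-1}(p) \bigl( dI_{B,p}(0)\, \dot\sigma_p(V(p)) \bigr) W_p(0) \\
+&= \dot\sigma_p\bigl( [V, W](p) \bigr).
+\end{aligned} \tag{93}
+\]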
+
+The term Γ^k(p) = (1/2) IB,p(0)^{−1} ∂kIB,p(0) of Equation (87) is sometimes referred to as the Christoffel
+matrix, but we do not use this terminology in this paper. As:
+
+\[ I_{B,p}(\theta) = I_B(e_p(\theta)) = \bigl[ \operatorname{Cov}_{e_p(\theta)}(X_i, X_j) \bigr]_{i,j=1,\dots,m} = \bigl[ \partial_i \partial_j \psi_p(\theta) \bigr]_{i,j=1,\dots,m}, \tag{94} \]
+
+we have ∂kIB(ep(θ)) = [∂i∂j∂kψp(θ)]_{i,j=1,...,m} = [Cov_{ep(θ)} (Xi, Xj, Xk)]_{i,j=1,...,m} and:
+
+\[ \Gamma^k(p) = \tfrac{1}{2} \bigl[ \operatorname{Cov}_p(X_i, X_j) \bigr]^{-1}_{i,j=1,\dots,m} \bigl[ \operatorname{Cov}_p(X_i, X_j, X_k) \bigr]_{i,j=1,\dots,m}. \tag{95} \]
+
+If V, W are vector fields of TEV, we have:
+
+\[ \Gamma(p, V, W) = \tfrac{1}{2} I_B^{-1}(p) \operatorname{Cov}_p(X, V, W) = \tfrac{1}{2} I_B^{-1}(p)\, E_p\bigl[ {}^{e}U_{p} X\, V W \bigr], \tag{96} \]
+
+which is the projection of V(p)W(p)/2 on eUpV.
+Notice also that:
+
+\[ \bigl( dI_{B,p}^{-1}(0)\alpha \bigr) I_{B,p}(0) = -I_{B,p}^{-1}(0) \bigl( dI_{B,p}(0)\alpha \bigr) I_{B,p}^{-1}(0)\, I_{B,p}(0) = -I_{B,p}^{-1}(0) \bigl( dI_{B,p}(0)\alpha \bigr). \tag{97} \]
+
+3.2. Acceleration
+
+Let p(t), t ∈ I, be a smooth curve in EV. Then, the velocity δp(t) = (d/dt) log (p(t)) is a vector field
+V(p(t)) = δp(t), defined on the support p(I) of the curve. As the curve is the flow of the velocity
+field, we can compute the metric derivative of the velocity along the velocity itself, Dδpδp, from
+Equation (91) with V(p(0)) = δp(0):
+
+\[
+\begin{aligned}
+\dot\sigma_p(D_{\delta p}\delta p)(p(0)) &= \frac{d}{dt} \dot\sigma_{p(0)}(\delta p(t)) \Big|_{t=0} + \tfrac{1}{2} I_B^{-1}(p(0)) \Bigl( \frac{d}{dt} I_B(p(t)) \Big|_{t=0} \Bigr) \dot\sigma_{p(0)}(\delta p(0)) \\
+&= \frac{d^2}{dt^2} \sigma_{p(0)}(p(t)) \Big|_{t=0} + \tfrac{1}{2} I_B^{-1}(p(0)) \Bigl( \frac{d}{dt} I_B(p(t)) \Big|_{t=0} \Bigr) \dot\sigma_{p(0)}(\delta p(0)),
+\end{aligned} \tag{98}
+\]
+
+which can be defined to be the Riemannian acceleration of the curve at t = 0.
+Let us write θ(t) = σp(p(t)), p = p(0) and:
+
+\[ p(t) = \exp\Bigl( \sum_{j=1}^{m} \theta_j(t)\, {}^{e}U_{p} X_j - \psi_p(\theta(t)) \Bigr) \cdot p, \tag{99} \]
+
+
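+so that σ̇p(δp)(0) = θ̇(0) and (d^2/dt^2) σp(p(t))|_{t=0} = θ̈(0). We have:
+
+\[ \frac{d}{dt} I_B(p(t)) \Big|_{t=0} = \frac{d}{dt} I_{B,p}(\theta(t)) \Big|_{t=0} = \frac{d}{dt} \operatorname{Hess} \psi_p(\theta(t)) \Big|_{t=0} = \operatorname{Cov}_p\Bigl( X, X, \sum_{j=1}^{m} \dot\theta_j(0) X_j \Bigr) \tag{100} \]
+
+so that the acceleration at p has coordinates:
+
+\[
+\begin{aligned}
+&\ddot\theta(0) + \tfrac{1}{2} \sum_{i,j=1}^{m} \dot\theta_i(0)\, \dot\theta_j(0) \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p(X, X_i, X_j) \\
+&\quad = \ddot\theta(0) + \tfrac{1}{2} \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p\Bigl( X, \sum_{i=1}^{m} \dot\theta_i(0) X_i, \sum_{j=1}^{m} \dot\theta_j(0) X_j \Bigr).
+\end{aligned} \tag{101}
+\]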
+
+A geodesic is a curve whose acceleration is zero at each point. The exponential map is the mapping
+$\operatorname{Exp}\colon TE_V \to E_V$ defined by:
+
+\[
+(p, U) \mapsto \operatorname{Exp}_p U = p(1), \tag{102}
+\]
+
+where $t \mapsto p(t)$ is the geodesic such that $p(0) = p$ and $\delta p(0) = U$, for all $U$ such that the geodesic
+exists for $t = 1$.
+The exponential map is a particular retraction, that is, a family of mappings $R_p$, $p \in E$, from
+the tangent space at $p$ to the manifold, here $R_p\colon T_pE \to E$, such that $R_p(0) = p$ and $dR_p(0) = \mathrm{Id}$;
+see [4] (§5.4). It should be noted that exponential manifolds have natural retractions other than Exp, a
+notable one being the exponential family itself. A retraction provides a crucial step in gradient search
+algorithms by mapping a direction of increase of the objective function to a new trial point.
+
+3.2.1. Example: Binary Independent 2.5.2 Continued.
+
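+Let us consider the binary independent model of Section 2.5.2. We have
+
+\[
+I_B(e_\lambda(\theta)) = I_{B,\lambda}(\theta) = \operatorname{diag}\left(\cosh^{-2}(\theta_j)\colon j = 1,\dots,d\right), \tag{103}
+\]
+
+and it follows that
+
+\[
+\partial_k I_{B,\lambda}(\theta) = \partial_k \operatorname{diag}\left(\cosh^{-2}(\theta_j)\colon j = 1,\dots,d\right) = -2\cosh^{-3}(\theta_k)\sinh(\theta_k)\,E_{kk}, \tag{104}
+\]
+
+where $E_{kk}$ is the $d \times d$ matrix with entry one at $(k, k)$, zero otherwise. The $k$-th Christoffel matrix in
+the second term in the definition of the metric derivative (aka Levi–Civita connection) is:
+
+\[
+\Gamma^k_B(e_\lambda(\theta)) = \Gamma^k_\lambda(\theta) = \frac{1}{2} I_{B,\lambda}^{-1}(\theta)\,\partial_k I_{B,\lambda}(\theta) = -\tanh(\theta_k)\,E_{kk}. \tag{105}
+\]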
+
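+In terms of the moments, we have $I_{B,\lambda}(\theta) = \operatorname{Cov}_\theta(X, X') = \operatorname{Hess}\psi_\lambda(\theta)$. As $\partial_k\partial_i\partial_j\psi_\lambda(\theta) =
+\operatorname{Cov}_\theta\left(X_k, X_i, X_j\right)$, we can write:
+
+\[
+\partial_k I_{B,\lambda}(\theta) = \partial_k \operatorname{diag}\left(\operatorname{Var}_\theta\left(X_j\right)\colon j = 1,\dots,d\right) = \operatorname{Cov}_\theta(X_k, X_k, X_k)\,E_{kk} \tag{106}
+\]
+
+and:
+
+\[
+\Gamma^k_\lambda(\theta) = \frac{1}{2}\operatorname{Cov}_\theta(X_k, X_k)^{-1}\operatorname{Cov}_\theta(X_k, X_k, X_k)\,E_{kk} = \frac{1}{2}\left(1 - \eta_k^2\right)^{-1}\left(-2\eta_k + 2\eta_k^3\right)E_{kk} = -\eta_k E_{kk}. \tag{107}
+\]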
+
+
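+The equations for the geodesics starting from $\theta(0)$ with velocity $\dot\theta(0) = u$ are:
+
+\[
+\ddot\theta_k(t) + \sum_{i,j=1}^{m}\Gamma^k_{ij}(\theta(t))\,\dot\theta_i(t)\dot\theta_j(t) = \ddot\theta_k(t) - \tanh(\theta_k(t))\,(\dot\theta_k(t))^2 = 0, \qquad k = 1,\dots,d. \tag{108}
+\]
+
+The ordinary differential equation:
+
+\[
+\ddot\theta - \tanh(\theta)\,\dot\theta^2 = 0 \tag{109}
+\]
+
+has the closed form solution:
+
+\[
+\theta(t) = \operatorname{gd}^{-1}\!\left(\operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\,t\right) = \tanh^{-1}\!\left(\sin\!\left(\operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\,t\right)\right) \tag{110}
+\]
+
+for all $t$ such that:
+
+\[
+-\pi/2 < \operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\,t < \pi/2, \tag{111}
+\]
+
+where $\operatorname{gd}\colon \mathbb{R} \to\, ]-\pi/2, +\pi/2[$ is the Gudermannian function, that is, $\operatorname{gd}'(x) = 1/\cosh x$, $\operatorname{gd}(0) = 0$;
+in closed form, $\operatorname{gd}(x) = \arcsin(\tanh(x))$. In fact, if $\theta$ is a solution of Equation (109), then: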
+
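+\[
+\frac{d}{dt}\operatorname{gd}(\theta(t)) = \frac{\dot\theta(t)}{\cosh(\theta(t))}, \tag{112}
+\]
+
+\[
+\frac{d^2}{dt^2}\operatorname{gd}(\theta(t)) = -\frac{\sinh(\theta(t))\,(\dot\theta(t))^2}{\cosh^2(\theta(t))} + \frac{\ddot\theta(t)}{\cosh(\theta(t))} = \frac{1}{\cosh(\theta(t))}\left(\ddot\theta(t) - \tanh(\theta(t))\,(\dot\theta(t))^2\right) = 0, \tag{113}
+\]
+
+so that $t \mapsto \operatorname{gd}(\theta(t))$ coincides (where it is defined) with an affine function characterized by the initial
+conditions.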
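+
+The closed form of Equation (110) can be checked numerically against a direct integration of Equation (109). The following is a minimal sketch, not part of the original paper: the function names, the RK4 integrator and the initial values are illustrative assumptions, chosen so that condition (111) holds on the whole interval.
+
```python
import numpy as np

def gd(x):
    """Gudermannian function in closed form: gd(x) = arcsin(tanh(x))."""
    return np.arcsin(np.tanh(x))

def gd_inv(v):
    """Inverse Gudermannian, gd^{-1}(v) = arctanh(sin(v)), valid for |v| < pi/2."""
    return np.arctanh(np.sin(v))

def geodesic_closed_form(theta0, v0, t):
    """Closed-form solution (110) of theta'' - tanh(theta) * theta'^2 = 0."""
    return gd_inv(gd(theta0) + v0 * t / np.cosh(theta0))

def geodesic_rk4(theta0, v0, t_end, n_steps=2000):
    """Integrate the geodesic ODE (109) with a classical fourth-order Runge-Kutta scheme."""
    def rhs(state):
        theta, v = state
        return np.array([v, np.tanh(theta) * v**2])

    state = np.array([theta0, v0], dtype=float)
    h = t_end / n_steps
    for _ in range(n_steps):
        k1 = rhs(state)
        k2 = rhs(state + 0.5 * h * k1)
        k3 = rhs(state + 0.5 * h * k2)
        k4 = rhs(state + h * k3)
        state = state + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return state[0]

# Illustrative initial data (an assumption), inside the domain (111) for t in [0, 1].
theta0, v0, t_end = 0.3, 0.5, 1.0
closed = geodesic_closed_form(theta0, v0, t_end)
numeric = geodesic_rk4(theta0, v0, t_end)
error = abs(closed - numeric)
print(closed, numeric, error)
```
+
+The two values should agree up to the discretization error of the integrator, which illustrates that $t \mapsto \operatorname{gd}(\theta(t))$ is affine along the geodesic.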
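+In particular, at $t = 1$, the geodesic Equation (110) defines the Riemannian exponential
+$\operatorname{Exp}\colon TE_V \to E_V$.
+If $(p, U) \in TE_V$, that is, $p \in E_V$ and $U \in T_pE_V$, then $\sigma_\lambda(p) = \theta(0)$ and
+$U = \sum_j u_j\,{}^{e}U_pX_j$, $\dot\sigma_\lambda(U) = u$. If:
+
+\[
+-\pi/2 < \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} < \pi/2, \tag{114}
+\]
+
+then we can take $\dot\theta(0) = u$ and $t = 1$, so that:
+
+\[
+\begin{aligned}
+\operatorname{Exp}_p\colon U &\overset{\dot\sigma_\lambda}{\longmapsto} u \longmapsto \left(\operatorname{gd}^{-1}\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right)\colon j = 1,\dots,d\right) \\
+&\overset{e_\lambda}{\longmapsto} \prod_{j=1}^{m}\exp\left(\operatorname{gd}^{-1}\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right)X_j - \psi\!\left(\operatorname{gd}^{-1}\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right)\right)\right)2^{-m}.
+\end{aligned} \tag{115}
+\]
+
+We have:
+
+\[
+\exp\left(\operatorname{gd}^{-1}(v)\right) = \exp\left(\tanh^{-1}(\sin(v))\right) = \sqrt{\frac{1 + \sin v}{1 - \sin v}} \tag{116}
+\]
+
+and:
+
+\[
+\psi\left(\operatorname{gd}^{-1}(v)\right) = \log\cosh\left(\operatorname{gd}^{-1}(v)\right) = \log\left(\frac{1}{\cos v}\right), \tag{117}
+\]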
+
+
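+hence $u \mapsto \operatorname{Exp}_p\left(\sum_{j=1}^{d} u_j\,{}^{e}U_pX_j\right)$ is given for:
+
+\[
+u \in \mathop{\times}\limits_{j=1}^{d}\ \left]\cosh(\theta_j)\left(-\pi/2 - \operatorname{gd}(\theta_j)\right),\ \cosh(\theta_j)\left(\pi/2 - \operatorname{gd}(\theta_j)\right)\right[, \tag{118}
+\]
+
+by:
+
+\[
+\begin{aligned}
+\operatorname{Exp}_\theta(u) &= \prod_{j=1}^{m}\cos\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right)\left(\frac{1 + \sin\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right)}{1 - \sin\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right)}\right)^{X_j/2} 2^{-m} \\
+&= \prod_{j=1}^{m}\left(1 + \sin\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right)X_j\right)2^{-m} \in E_V.
+\end{aligned} \tag{119}
+\]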
+
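+The expectation parameters are:
+
+\[
+\eta_i(t) = E_{\theta=0}\left[X_i\prod_{j=1}^{m}\left(1 + \sin\!\left(\operatorname{gd}(\theta_j) + \frac{t u_j}{\cosh(\theta_j)}\right)X_j\right)\right] = \sin\!\left(\operatorname{gd}(\theta_i) + \frac{t u_i}{\cosh(\theta_i)}\right), \tag{120}
+\]
+
+and:
+
+\[
+\operatorname{gd}(\theta_j) = \arcsin(\eta_j), \qquad \cosh(\theta_j) = \frac{1}{\left(1 - \eta_j^2\right)^{1/2}}, \tag{121}
+\]
+
+so that the exponential in terms of the expectation parameters is:
+
+\[
+\operatorname{Exp}_\eta(u) = \left(\sin\!\left(\arcsin\eta_j + \left(1 - \eta_j^2\right)^{1/2}u_j\right)\colon j = 1,\dots,m\right). \tag{122}
+\]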
+
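+The inverse of the Riemannian exponential provides a notion of translation between two elements
+of the exponential model, which is a particular parametrization of the model:
+
+\[
+\overrightarrow{\eta_1\eta_2} = \operatorname{Exp}^{-1}_{\eta_1}\eta_2 = \left(\left(1 - \eta_{1,j}^2\right)^{-1/2}\left(\arcsin\eta_{2,j} - \arcsin\eta_{1,j}\right)\colon j = 1,\dots,m\right), \tag{123}
+\]
+
+where $\eta_{i,j}$ denotes the $j$-th component of the point $\eta_i$. In particular, at $\theta = 0$, we have the geodesic:
+
+\[
+t \mapsto \prod_{j=1}^{d}\left(1 + \sin(t u_j)X_j\right)2^{-m}, \qquad |t| < \frac{\pi}{2\max_j\left|u_j\right|}. \tag{124}
+\]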
+
+Figure 5 shows some geodesic curves.
+
+[Figure 5: plot in the expectation parameters; axes η1 and η2, each ranging from −1.0 to 1.0.]
+
+Figure 5. Geodesics from η = (0.75, 0.75).
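+
+Equations (122) and (123) translate directly into code. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the names `exp_eta` and `log_eta` and the target point are hypothetical, while the base point is the one used in Figure 5.
+
```python
import numpy as np

def exp_eta(eta, u):
    """Riemannian exponential (122) in the expectation parameters, applied componentwise."""
    eta = np.asarray(eta, dtype=float)
    u = np.asarray(u, dtype=float)
    return np.sin(np.arcsin(eta) + np.sqrt(1.0 - eta**2) * u)

def log_eta(eta1, eta2):
    """Inverse exponential (123): tangent vector at eta1 whose geodesic reaches eta2 at t = 1."""
    eta1 = np.asarray(eta1, dtype=float)
    eta2 = np.asarray(eta2, dtype=float)
    return (np.arcsin(eta2) - np.arcsin(eta1)) / np.sqrt(1.0 - eta1**2)

eta1 = np.array([0.75, 0.75])   # base point of Figure 5
eta2 = np.array([-0.2, 0.4])    # hypothetical target point inside ]-1, 1[^2
u = log_eta(eta1, eta2)
round_trip = exp_eta(eta1, u)
print(u, round_trip)
```
+
+Applying `exp_eta` to the output of `log_eta` should return the target point, since both maps are affine in the variable $\arcsin\eta_j$.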
+
+3.3. Riemannian Hessian
+
+Let $\phi\colon E_V \to \mathbb{R}$ with Riemannian gradient $\nabla\phi(p) = \sum_i (\widetilde\nabla\phi)_i(p)\,{}^{e}U_pX_i$, $\widetilde\nabla\phi(p) = I_B^{-1}(p)\nabla\phi_p(0)$.
+The Riemannian Hessian of $\phi$ is the metric derivative of the gradient $\nabla\phi$ along the vector field $Y$, that
+is, $\operatorname{Hess}_Y\phi = D_Y\nabla\phi$; see [6] (Ch. 6, Ex. 11), [4] (§5.5). In the following, we denote by the symbol Hess,
+without a subscript, the ordinary Hessian matrix.
+From Equation (87), we have the coordinates of $\operatorname{Hess}_Y\phi(p)$. Given a generic tangent vector $\alpha$, we
+compute from Equation (38):
+
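+\[
+\begin{aligned}
+\left. d(\nabla\phi)_p(\theta)\alpha\right|_{\theta=0} &= \left. d\left(I_{B,p}^{-1}(\theta)\nabla\phi_p(\theta)\right)\alpha\right|_{\theta=0} = \left(dI_{B,p}^{-1}(0)\alpha\right)\nabla\phi_p(0) + I_{B,p}^{-1}(0)\operatorname{Hess}\phi_p(0)\alpha \\
+&= -I_B^{-1}(p)\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p) + I_B^{-1}(p)\operatorname{Hess}\phi_p(0)\alpha
+\end{aligned} \tag{125}
+\]
+
+and, upon substitution of $(\nabla\phi)_p$ for $V_p$ in Equation (87), with $\alpha = \dot\sigma_p(Y(p))$,
+
+\[
+\begin{aligned}
+\dot\sigma_p(\operatorname{Hess}_Y\phi(p)) &= d(\nabla\phi)_p(0)\alpha + \frac{1}{2} I_B^{-1}(p)\left(dI_{B,p}(0)\alpha\right)(\nabla\phi)_p(0) \\
+&= -I_B^{-1}(p)\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p) + I_B^{-1}(p)\operatorname{Hess}\phi_p(0)\alpha + \frac{1}{2} I_B^{-1}(p)\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p) \\
+&= I_B^{-1}(p)\operatorname{Hess}\phi_p(0)\alpha - \frac{1}{2} I_B^{-1}(p)\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p) \\
+&= I_B^{-1}(p)\left(\operatorname{Hess}\phi_p(0)\alpha - \frac{1}{2}\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p)\right).
+\end{aligned} \tag{126}
+\]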
+
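+$\operatorname{Hess}_Y\phi$ is characterized by knowing the value of $g(\operatorname{Hess}_Y\phi, X)\colon E_V \to \mathbb{R}$ for all vector fields $X$. We have
+from Equation (126), with $\alpha = \dot\sigma_p(Y(p))$ and $\beta = \dot\sigma_p(X(p))$,
+
+\[
+g_p\left(\operatorname{Hess}_{Y(p)}\phi(p), X(p)\right) = \beta'\operatorname{Hess}\phi_p(0)\alpha - \frac{1}{2}\beta'\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p). \tag{127}
+\]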
+
+
+This is the presentation of the Riemannian Hessian as a bi-linear form on $TE_V$; see the comments
+in [4] (Prop. 5.5.2-3). Note that the Riemannian Hessian is positive definite if:
+
+\[
+\alpha'\operatorname{Hess}\phi_p(0)\alpha \geq \frac{1}{2}\alpha'\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p), \qquad \alpha \in \mathbb{R}^m. \tag{128}
+\]
+
+4. Application to Combinatorial Optimization
+
+We conclude our paper by showing how the geometric method applies to the problem of finding
+the maximum of the expected value of a function.
+
+4.1. Hessian of a Relaxed Function
+
+Here is a key example of a vector field. Let $f$ be any bounded random variable, and define the
+relaxed function to be $\phi(p) = E_p[f]$, $p \in \mathcal{P}_>$. Define $F(p)$ to be the projection of $f$, as an element of
+$L^2(p)$, onto $T_pE_V = {}^{e}U_pV$, i.e., $F(p)$ is the element of ${}^{e}U_pV$ such that:
+
+\[
+E_p\left[(f - F(p))v\right] = 0, \qquad v \in {}^{e}U_pV. \tag{129}
+\]
+
+In the basis ${}^{e}U_pB$, we have $F(p) = \sum_i \hat f_{p,i}\,{}^{e}U_pX_i$ and:
+
+\[
+\operatorname{Cov}_p\left(f, X_j\right) = \sum_i \hat f_{p,i}\, E_p\left[{}^{e}U_pX_i\,{}^{e}U_pX_j\right], \qquad j = 1,\dots,m, \tag{130}
+\]
+
+so that $\hat f_p = I_B^{-1}(p)\operatorname{Cov}_p(X, f)$ and
+
+\[
+F(p) = \hat f_p'\,{}^{e}U_pX = \operatorname{Cov}_p(f, X)\, I_B^{-1}(p)\,{}^{e}U_pX. \tag{131}
+\]
+
+Let us compute the gradient of the relaxed function $\phi = E_\cdot[f]\colon E_V \to \mathbb{R}$. We have $\phi_p(\theta) = E_{e_p(\theta)}[f]$,
+and from the properties of exponential families, the Euclidean gradient is $\nabla\phi_p(0) = \operatorname{Cov}_p(f, X)$. It
+follows that the natural gradient is:
+
+\[
+\widetilde\nabla\phi_p(0) = I_B^{-1}(p)\operatorname{Cov}_p(X, f) = \hat f_p, \tag{132}
+\]
+
+and the Riemannian gradient is $\nabla\phi(p) = F(p)$.
+From the properties of exponential families, we have:
+
+\[
+\operatorname{Hess}\phi_p(0) = \operatorname{Cov}_p(X, X, f),
+\]
+
+so that, in this case, Equation (127), when written in terms of the moments, is:
+
+\[
+\beta'\operatorname{Cov}_p(X, X, f)\,\alpha - \frac{1}{2}\beta'\operatorname{Cov}_p(X, X, \alpha\cdot X)\operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p(X, f). \tag{133}
+\]
+
+4.1.1. Example: Binary Independent 2.5.2 and 3.2.1 Continued
+
+We list below the computation of the Hessian in the case of two binary independent variables.
+Computations were done with Sage [22], which allows both the reduction $x_i^2 = 1$ in the ring of
+polynomials and the simplifications in the symbolic ring of parameters.
+
+\[
+\operatorname{Cov}_\eta(X, f) =
+\begin{pmatrix}
+-\left(\eta_1^2 - 1\right)a_1 - \left(\eta_1^2\eta_2 - \eta_2\right)a_{12} \\
+-\left(\eta_2^2 - 1\right)a_2 - \left(\eta_1\eta_2^2 - \eta_1\right)a_{12}
+\end{pmatrix}
+=
+\begin{pmatrix}
+-(\eta_1 - 1)(\eta_1 + 1)(a_{12}\eta_2 + a_1) \\
+-(\eta_2 - 1)(\eta_2 + 1)(a_{12}\eta_1 + a_2)
+\end{pmatrix} \tag{134}
+\]
+
+\[
+\operatorname{Cov}_\eta(X, X) =
+\begin{pmatrix}
+-\eta_1^2 + 1 & 0 \\
+0 & -\eta_2^2 + 1
+\end{pmatrix}
+=
+\begin{pmatrix}
+-(\eta_1 - 1)(\eta_1 + 1) & 0 \\
+0 & -(\eta_2 - 1)(\eta_2 + 1)
+\end{pmatrix} \tag{135}
+\]
+
+
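+\[
+\operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, f) =
+\begin{pmatrix}
+a_{12}\eta_2 + a_1 \\
+a_{12}\eta_1 + a_2
+\end{pmatrix}
+= \nabla F(\eta) \tag{136}
+\]
+
+\[
+\begin{aligned}
+\operatorname{Cov}_\eta(X, X, f) &=
+\begin{pmatrix}
+2\left(\eta_1^3 - \eta_1\right)a_1 + 2\left(\eta_1^3\eta_2 - \eta_1\eta_2\right)a_{12} & \left(\eta_1^2\eta_2^2 - \eta_1^2 - \eta_2^2 + 1\right)a_{12} \\
+\left(\eta_1^2\eta_2^2 - \eta_1^2 - \eta_2^2 + 1\right)a_{12} & 2\left(\eta_1\eta_2^3 - \eta_1\eta_2\right)a_{12} + 2\left(\eta_2^3 - \eta_2\right)a_2
+\end{pmatrix} \\
+&=
+\begin{pmatrix}
+2(\eta_1 - 1)(\eta_1 + 1)(a_{12}\eta_2 + a_1)\eta_1 & (\eta_2 - 1)(\eta_2 + 1)(\eta_1 - 1)(\eta_1 + 1)a_{12} \\
+(\eta_2 - 1)(\eta_2 + 1)(\eta_1 - 1)(\eta_1 + 1)a_{12} & 2(\eta_2 - 1)(\eta_2 + 1)(a_{12}\eta_1 + a_2)\eta_2
+\end{pmatrix}
+\end{aligned} \tag{137}
+\]
+
+\[
+\operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, X, f) =
+\begin{pmatrix}
+-2(a_{12}\eta_2 + a_1)\eta_1 & -a_{12}\eta_2^2 + a_{12} \\
+-a_{12}\eta_1^2 + a_{12} & -2(a_{12}\eta_1 + a_2)\eta_2
+\end{pmatrix} \tag{138}
+\]
+
+\[
+\operatorname{Cov}_\eta(X, X, \nabla F(\eta)) =
+\begin{pmatrix}
+2(a_{12}\eta_2 + a_1)(\eta_1 + 1)(\eta_1 - 1)\eta_1 & 0 \\
+0 & 2(a_{12}\eta_1 + a_2)(\eta_2 + 1)(\eta_2 - 1)\eta_2
+\end{pmatrix} \tag{139}
+\]
+
+\[
+\operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, X, \nabla F(\eta)) =
+\begin{pmatrix}
+-2(a_{12}\eta_2 + a_1)\eta_1 & 0 \\
+0 & -2(a_{12}\eta_1 + a_2)\eta_2
+\end{pmatrix} \tag{140}
+\]
+
+The Riemannian Hessian as a matrix in the basis of the tangent space is:
+
+\[
+\operatorname{Hess}F(\eta) = \operatorname{Cov}_\eta(X, X)^{-1}\left(\operatorname{Cov}_\eta(X, X, f) - \frac{1}{2}\operatorname{Cov}_\eta(X, X, \nabla F(\eta))\right) =
+\begin{pmatrix}
+-(a_{12}\eta_2 + a_1)\eta_1 & -a_{12}(\eta_2 + 1)(\eta_2 - 1) \\
+-a_{12}(\eta_1 + 1)(\eta_1 - 1) & -(a_{12}\eta_1 + a_2)\eta_2
+\end{pmatrix} \tag{141}
+\]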
+
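+As a check, let us compute the Riemannian Hessian as a natural Hessian in the Riemannian
+parameters, $\left.\operatorname{Hess}\phi\circ\operatorname{Exp}_p(u)\right|_{u=0}$; see [4] (Prop. 5.5.4). We have:
+
+\[
+\begin{aligned}
+F\circ\operatorname{Exp}_\eta(u) ={}& a_{12}\sin\!\left(\sqrt{-\eta_1^2 + 1}\,u_1 + \arcsin(\eta_1)\right)\sin\!\left(\sqrt{-\eta_2^2 + 1}\,u_2 + \arcsin(\eta_2)\right) \\
+&+ a_1\sin\!\left(\sqrt{-\eta_1^2 + 1}\,u_1 + \arcsin(\eta_1)\right) + a_2\sin\!\left(\sqrt{-\eta_2^2 + 1}\,u_2 + \arcsin(\eta_2)\right)
+\end{aligned} \tag{142}
+\]
+
+and:
+
+\[
+\begin{aligned}
+\left.\operatorname{Hess}F\circ\operatorname{Exp}_\eta(u)\right|_{u=0} &=
+\begin{pmatrix}
+\left(\eta_1^2 - 1\right)a_{12}\eta_1\eta_2 + \left(\eta_1^2 - 1\right)a_1\eta_1 & \left(\eta_1^2 - 1\right)\left(\eta_2^2 - 1\right)a_{12} \\
+\left(\eta_1^2 - 1\right)\left(\eta_2^2 - 1\right)a_{12} & \left(\eta_2^2 - 1\right)a_{12}\eta_1\eta_2 + \left(\eta_2^2 - 1\right)a_2\eta_2
+\end{pmatrix} \\
+&=
+\begin{pmatrix}
+(a_{12}\eta_2 + a_1)(\eta_1 + 1)(\eta_1 - 1)\eta_1 & a_{12}(\eta_1 + 1)(\eta_1 - 1)(\eta_2 + 1)(\eta_2 - 1) \\
+a_{12}(\eta_1 + 1)(\eta_1 - 1)(\eta_2 + 1)(\eta_2 - 1) & (a_{12}\eta_1 + a_2)(\eta_2 + 1)(\eta_2 - 1)\eta_2
+\end{pmatrix}.
+\end{aligned} \tag{143}
+\]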
+
+
+Note the presence of the factor Covη (X, X).
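+
+The symbolic results (136) and (141) can also be cross-checked numerically by enumerating the four states of two binary variables. This is a sketch under assumptions, not the Sage computation of the paper: it takes $X_j \in \{-1, +1\}$ (consistent with the reduction $x_i^2 = 1$), $f = a_1x_1 + a_2x_2 + a_{12}x_1x_2$, and reads the three-argument covariance as the third-order central moment; the function name and the numerical values are hypothetical.
+
```python
import itertools
import numpy as np

def gradient_and_hessian(eta, a1, a2, a12):
    """Natural gradient and Riemannian Hessian from moments, as in Equations (136) and (141)."""
    eta = np.asarray(eta, dtype=float)
    # The four states of two independent +/-1 variables, with p(x) = prod_j (1 + eta_j x_j) / 4.
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=2)))
    probs = np.prod(1.0 + eta * states, axis=1) / 4.0
    f = a1 * states[:, 0] + a2 * states[:, 1] + a12 * states[:, 0] * states[:, 1]

    xc = states - eta        # centred sufficient statistics, since E[X_j] = eta_j
    fc = f - probs @ f       # centred objective
    cov_xx = np.einsum("s,si,sj->ij", probs, xc, xc)
    cov_xf = np.einsum("s,si,s->i", probs, xc, fc)
    cov_xxf = np.einsum("s,si,sj,s->ij", probs, xc, xc, fc)

    grad = np.linalg.solve(cov_xx, cov_xf)     # Cov(X, X)^{-1} Cov(X, f)
    gc = xc @ grad                             # centred linear statistic grad . X
    cov_xxg = np.einsum("s,si,sj,s->ij", probs, xc, xc, gc)
    hess = np.linalg.solve(cov_xx, cov_xxf - 0.5 * cov_xxg)
    return grad, hess

eta = np.array([0.3, -0.5])      # illustrative point in ]-1, 1[^2
a1, a2, a12 = 1.0, 2.0, 3.0      # coefficients of the Figure 6 example
grad, hess = gradient_and_hessian(eta, a1, a2, a12)
print(grad)
print(hess)
```
+
+Under these assumptions the output should match the closed forms $(a_{12}\eta_2 + a_1,\ a_{12}\eta_1 + a_2)$ and the matrix of Equation (141) evaluated at the same point.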
+
+4.2. Newton Method
+
+The Newton method is an iterative method that generates a sequence of points $p_t$, with $t = 0, 1, \dots$,
+that converges towards a stationary point $\hat p$ of a function $F(p) = E_p[f]$, $p \in E_V$, that is, a critical point of the
+vector field $p \mapsto \nabla F(p)$, $\nabla F(\hat p) = 0$. Here, we follow [4] (Ch. 5–6), and in particular Algorithm 5 on
+Page 113.
+Let $\nabla F$ be a gradient field. We reproduce in our case the basic derivation of the Newton method
+in the following. Note that, in this section, we use the notation $\operatorname{Hess}\bullet[\alpha]$ to denote $\operatorname{Hess}_\alpha\bullet$. Using
+the definition of metric derivative, we have for a geodesic curve $[0, 1] \ni t \mapsto p(t) \in E_V$ connecting
+$p = p(0)$ to $\hat p = p(1)$ that:
+
+\[
+\frac{d}{dt}\, g_{p(t)}\left(\nabla F(p(t)), \delta p(t)\right) = g_{p(t)}\left(\operatorname{Hess}F(p(t))[\delta p(t)], \delta p(t)\right), \tag{144}
+\]
+
+hence the increment from $p$ to $\hat p$ is:
+
+\[
+g_{\hat p}\left(\nabla F(\hat p), \delta p(1)\right) - g_p\left(\nabla F(p), \delta p(0)\right) = \int_0^1 g_{p(t)}\left(\operatorname{Hess}F(p(t))[\delta p(t)], \delta p(t)\right)dt. \tag{145}
+\]
+
+Now, we assume that $\nabla F(\hat p) = 0$ and that in Equation (145), the integral is approximated by the
+initial value of the integrand, that is to say, the Hessian is approximately constant on the geodesic from
+$p$ to $\hat p$; we obtain:
+
+\[
+-g_p\left(\nabla F(p), \delta p(0)\right) = g_p\left(\operatorname{Hess}F(p)[\delta p(0)], \delta p(0)\right) + \epsilon. \tag{146}
+\]
+
+If we can solve the Newton equation:
+
+\[
+\operatorname{Hess}F(p)[u] = -\nabla F(p), \tag{147}
+\]
+
+then $u$ is approximately equal to the initial velocity of the geodesic connecting $p$ to $\hat p$, that is, $\hat p =
+\operatorname{Exp}_p(u)$.
+The particular structure of the exponential manifold suggests at least two natural retractions
+that could be used to move from $u$ to $\hat p$.
+Namely, we have the Riemannian exponential
+$(\theta_t, \theta_{t+1}) \mapsto \operatorname{Exp}_{\theta_t}(\theta_{t+1} - \theta_t)$ and the e-retraction coming from the exponential family itself and
+defined by $(\theta_t, \theta_{t+1}) \mapsto e_{\theta_t}(\theta_{t+1} - \theta_t)$, with $\theta_{t+1} - \theta_t = u_t$.
+In the $\theta$ parameters, with the e-retraction, the Newton method generates a sequence $(\theta_t)$ according
+to the following updating rule:
+
+\[
+\theta_{t+1} = \theta_t - \lambda\operatorname{Hess}F(\theta_t)^{-1}\,\widetilde\nabla F(\theta_t), \tag{148}
+\]
+
+where $\lambda > 0$ is an extra parameter intended to control the step size and, in turn, the convergence to $\hat\theta$;
+see [5].
+We can rewrite Equation (148) in terms of covariances as:
+
+\[
+\theta_{t+1} = \theta_t - \lambda\left(\operatorname{Cov}_{\theta_t}(X, X, f) - \frac{1}{2}\operatorname{Cov}_{\theta_t}(X, X, \widetilde\nabla F(\theta_t))\right)^{-1}\widetilde\nabla F(\theta_t). \tag{149}
+\]
+
+4.3. Example: Binary Independent
+
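+In the $\eta$ parameters, the Newton step is:
+
+\[
+u = -\operatorname{Hess}F(\eta)^{-1}\nabla F(\eta) =
+\begin{pmatrix}
+\dfrac{a_{12}^2\eta_1 + a_{12}a_2 + (a_1a_{12}\eta_1 + a_1a_2)\eta_2}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} \\[2ex]
+\dfrac{a_1a_2\eta_1 + a_1a_{12} + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2}
+\end{pmatrix} \tag{150}
+\]
+
+and the new $\eta$ in the Riemannian retraction is:
+
+\[
+\operatorname{Exp}_\eta(u) =
+\begin{pmatrix}
+\sin\!\left(\dfrac{\left(a_{12}^2\eta_1 + a_{12}a_2 + (a_1a_{12}\eta_1 + a_1a_2)\eta_2\right)\sqrt{-\eta_1^2 + 1}}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} + \arcsin(\eta_1)\right) \\[2ex]
+\sin\!\left(\dfrac{\left(a_1a_2\eta_1 + a_1a_{12} + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2\right)\sqrt{-\eta_2^2 + 1}}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} + \arcsin(\eta_2)\right)
+\end{pmatrix}. \tag{151}
+\]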
+
+In Figure 6, we represent the vector field associated with the Newton step in the $\eta$ parameters,
+with $\lambda = 0.05$, using the Riemannian retraction, for the case $a_1 = 1$, $a_2 = 2$ and $a_{12} = 3$, with:
+
+\[
+\operatorname{Exp}_\eta(u) =
+\begin{pmatrix}
+\sin\!\left(\lambda\,\dfrac{\sqrt{-\eta_1^2 + 1}\,\left((3\eta_1 + 2)\eta_2 + 9\eta_1 + 6\right)}{3(2\eta_1 + 3)\eta_2^2 + 9\eta_1^2 + (3\eta_1^2 + 2\eta_1)\eta_2 - 9} + \arcsin(\eta_1)\right) \\[2ex]
+\sin\!\left(\lambda\,\dfrac{\left(3(2\eta_1 + 3)\eta_2 + 2\eta_1 + 3\right)\sqrt{-\eta_2^2 + 1}}{3(2\eta_1 + 3)\eta_2^2 + 9\eta_1^2 + (3\eta_1^2 + 2\eta_1)\eta_2 - 9} + \arcsin(\eta_2)\right)
+\end{pmatrix}. \tag{152}
+\]
+
+The red dotted lines represented in the figure identify the basins of attraction of the vector field and
+correspond to the solutions of the explicit equation in $\eta$ for which the Newton step $u$ is not defined.
+This vector field can be compared to that in Figure 7, associated with the Newton step for $F(\eta)$ using
+the Euclidean geometry. In the Euclidean geometry, $F(\eta)$ is a quadratic function with one saddle point,
+so that from any $\eta$, the Newton step points in the direction of the critical point. This makes the Newton
+step unsuitable for an optimization algorithm. On the other hand, in the Riemannian geometry, the
+vertices of the polytope are critical points for $F(\eta)$, and they determine the presence of multiple basins
+of attraction, as expected.
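+
+A damped Newton update in the $\eta$ parameters with the Riemannian retraction can be sketched as follows. This is an illustrative sketch built from Equations (136), (141) and (122), not the authors' code: the function names and the starting point are assumptions, and the coefficients and step size are those quoted for Figure 6.
+
```python
import numpy as np

def natural_gradient(eta, a1, a2, a12):
    """Natural gradient (136) for the two-variable binary independent example."""
    return np.array([a12 * eta[1] + a1, a12 * eta[0] + a2])

def riemannian_hessian(eta, a1, a2, a12):
    """Riemannian Hessian (141) as a matrix in the basis of the tangent space."""
    g = natural_gradient(eta, a1, a2, a12)
    return np.array([
        [-g[0] * eta[0], -a12 * (eta[1] ** 2 - 1.0)],
        [-a12 * (eta[0] ** 2 - 1.0), -g[1] * eta[1]],
    ])

def newton_update(eta, a1, a2, a12, lam=0.05):
    """Solve Hess F(eta) u = -grad F(eta), then retract lam * u with the exponential (122)."""
    eta = np.asarray(eta, dtype=float)
    u = -np.linalg.solve(riemannian_hessian(eta, a1, a2, a12),
                         natural_gradient(eta, a1, a2, a12))
    new_eta = np.sin(np.arcsin(eta) + lam * np.sqrt(1.0 - eta ** 2) * u)
    return u, new_eta

a1, a2, a12 = 1.0, 2.0, 3.0     # coefficients of the Figure 6 example
eta = np.array([0.2, -0.3])     # hypothetical starting point, away from the critical lines
u, new_eta = newton_update(eta, a1, a2, a12)
print(u, new_eta)
```
+
+The linear solve fails, or becomes ill-conditioned, where the determinant of the Hessian vanishes, which corresponds to the lines on which the Newton step is not defined. Because the retraction is a sine, the updated point always stays in $[-1, 1]^2$.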
+
+[Figure 6: vector field plot in the expectation parameters; axes η1 and η2, each ranging from −1.0 to 1.0.]
+
+Figure 6. The Newton step in the η parameters, Riemannian retraction, λ = 0.05. The red dotted lines
+identify the different basins of attraction and correspond to the points for which the Newton step is not
+defined; cf. Equation (150). The instability close to the critical lines is represented by the longer arrows.
+
+[Figure 7: vector field plot in the expectation parameters; axes η1 and η2, each ranging from −1.0 to 1.0.]
+
+Figure 7. The Newton step in the η parameters, Euclidean geometry, λ = 0.05.
+
+[Figure 8: vector field plot in the natural parameters; axes θ1 and θ2, each ranging from −2 to 2.]
+
+Figure 8. The Newton step in the θ parameters, exponential retraction, λ = 0.015. The red dotted
+lines identify the different basins of attraction and correspond to the points for which the Newton step
+is not defined. The instability along the critical lines, which identifies the basins of attraction, is not
+represented.
+
+[Figure 9: vector field plot in the natural parameters; axes θ1 and θ2, each ranging from −2 to 2.]
+
+Figure 9. The Newton step in the θ parameters, Euclidean geometry, λ = 0.15. The red dotted lines
+identify the different basins of attraction and correspond to the points for which the Newton step is
+not defined. The instability along the critical lines, which identifies the basins of attraction, is not
+represented.
+
+Figure 8 shows the Newton step in the θ parameters based on the e-retraction of Equation (149),
+while Figure 9 represents the Newton step evaluated with respect to the Euclidean geometry. A
+comparison of the two vector fields shows that, unlike in the η parameters, the number of
+basins of attraction is the same in the two geometries; however, the scale of the vectors is different.
+In particular, notice how on the plateau, for diverging θ, the Newton step in the Euclidean geometry
+vanishes, while in the Riemannian geometry, it gets larger. This behavior suggests better convergence
+properties for an optimization algorithm based on the Newton step evaluated using the proper
+Riemannian geometry. In the θ parameters, the boundaries of the basins of attraction represented by
+the red dotted lines have been computed numerically and correspond to the values of θ for which the
+update step is not defined.
+Finally, notice that in both the η and θ parameters, the step is not always in the direction of descent
+for the function, a common behavior of the Newton method, which converges to the critical points.
+
+5. Discussion and Conclusions
+
+In this paper, we introduced second-order calculus over a statistical manifold, following the
+approach described in [4], which has been adapted to the special case of exponential statistical
+models [2,3]. By defining the Riemannian Hessian and using the notion of retraction, we developed
+the proper machinery necessary for the definition of the updating rule of the Newton method for the
+optimization of a function defined over an exponential family.
+The examples discussed in the paper show that by taking into account the proper Riemannian
+geometry of a statistical exponential family, the vector fields associated with the Newton step in the
+different parametrizations change profoundly. Not only do new basins of attraction associated with local
+and global minima appear, as for the expectation parameters, but also the magnitude of the Newton
+step is affected, as over the plateau in the natural parameters. Such differences are expected to have a
+strong impact on the performance of an optimization algorithm based on the Newton step, from both
+the point of view of achievable convergence and the speed of convergence to the optimum.
+
+
+The Newton method is a popular second order optimization technique based on the computation
+of the Hessian of the function to be optimized and is well known for its super-linear convergence
+properties. However, the use of the Newton method poses a number of issues in practice.
+First of all, as the examples in Figures 6 and 8 show, the Newton step does not always point
+in the direction of the natural gradient, and the algorithm may not converge to a (local) optimum
+of the function. Such behavior is not unexpected; indeed the Newton method tends to converge to
+critical points of the function to be optimized, which include local minima, local maxima and saddle
+points. In order to obtain a direction of ascent for the function to be optimized, the Hessian must
+be negative-definite, i.e., its eigenvalues must be strictly negative, which is not guaranteed in the
+general case. Another important remark is related to the computational complexity associated with
+the evaluation of the Hessian, compared to the (natural) gradient. Indeed, to obtain the Newton step d,
+Christoffel matrices have to be evaluated, together with the third order covariances between sufficient
+statistics and the function, and the Hessian has to be inverted. Finally, notice that when the Hessian is
+close to being non-invertible, numerical problems may arise in the computation of the Newton step,
+and the algorithm may become unstable and diverge.
+In the literature, different methods have been proposed to overcome these issues. Among them,
+we mention quasi-Newton methods, where the update vector is obtained using a modified Hessian,
+which has been made negative-definite, for instance, by adding a proper correction matrix.
+This paper represents the first step in the design of an algorithm based on the Newton method
+for the optimization over a statistical model. The authors are working on the computational aspects
+related to the implementation of the method, and a new paper with experimental results is in progress.
+
+Acknowledgments: Luigi Malagò was supported by the Xerox University Affairs Committee Award and by
+de Castro Statistics, Collegio Carlo Alberto, Moncalieri. Giovanni Pistone is supported by de Castro Statistics,
+Collegio Carlo Alberto, Moncalieri, and is a member of GNAMPA–INdAM, Roma.
+
+Author Contributions
+
+All authors contributed to the design of the research. The research was carried out by all authors.
+The study of the Hessian and of the Newton method in statistical manifolds was originally suggested
+by Luigi Malagò. The manuscript was written by Luigi Malagò and Giovanni Pistone. All authors
+have read and approved the final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Number 9 in IMS Lecture Notes. Monograph Series; Institute of Mathematical Statistics: Hayward, CA, USA, 1986; p. 283.
+2. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000; p. 206.
+3. Pistone, G. Nonparametric Information Geometry. In Geometric Science of Information, Proceedings of the First International Conference, GSI 2013, Paris, France, 28–30 August 2013; Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer Science, Volume 8085; Springer: Berlin/Heidelberg, Germany, 2013; pp. 5–36.
+4. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2008; pp. xvi+224.
+5. Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer Series in Operations Research and Financial Engineering; Springer: New York, NY, USA, 2006; pp. xxii+664.
+6. Do Carmo, M.P. Riemannian geometry; Mathematics: Theory & Applications; Birkhäuser Boston Inc.: Boston, MA, USA, 1992; pp. xiv+300.
+7. Abraham, R.; Marsden, J.E.; Ratiu, T. Manifolds, Tensor Analysis, and Applications, 2nd ed.; Applied Mathematical Sciences, Volume 75; Springer: New York, NY, USA, 1988; pp. x+654.
+8. Lang, S. Differential and Riemannian Manifolds, 3rd ed.; Graduate Texts in Mathematics; Springer: New York, NY, USA, 1995; pp. xiv+364.
+9. Pistone, G. Algebraic varieties vs. differentiable manifolds in statistical models. In Algebraic and Geometric Methods in Statistics; Gibilisco, P., Riccomagno, E., Rogantin, M.P., Wynn, H.P., Eds.; Cambridge University Press: Cambridge, UK, 2010.
+10. Malagò, L.; Matteucci, M.; Dal Seno, B. An information geometry perspective on estimation of distribution algorithms: Boundary analysis. In Proceedings of the 2008 GECCO Conference Companion On Genetic and Evolutionary Computation (GECCO ’08); ACM: New York, NY, USA, 2008; pp. 2081–2088.
+11. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Relaxation as a Unifying Approach in 0/1 Programming. In Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Whistler Resort & Spa, BC, Canada, 11–12 December 2009.
+12. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Natural Gradient Descent by Estimation of Empirical Covariances. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), New Orleans, LA, USA, 5–8 June 2011; pp. 949–956.
+13. Malagò, L.; Matteucci, M.; Pistone, G. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms (FOGA ’11), Schwarzenberg, Austria, 5–8 January 2011; ACM: New York, NY, USA, 2011; pp. 230–242.
+14. Malagò, L.; Matteucci, M.; Pistone, G. Natural gradient, fitness modelling and model selection: A unifying perspective. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Cancun, Mexico, 20–23 June 2013; pp. 486–493.
+15. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276.
+16. Shima, H. The Geometry of Hessian Structures; World Scientific Publishing Co. Pte. Ltd.: Hackensack, NJ, USA, 2007; pp. xiv+246.
+17. Malagò, L. On the Geometry of Optimization Based on the Exponential Family Relaxation. Ph.D. Thesis, Politecnico di Milano, Milano, Italy, 2012.
+18. Gallavotti, G. Statistical Mechanics: A Short Treatise; Texts and Monographs in Physics; Springer: Berlin, Germany, 1999; pp. xiv+339.
+19. Naudts, J. Generalised exponential families and associated entropy functions. Entropy 2008, 10, 131–149.
+20. Esteban, J.; Ray, D. On the Measurement of Polarization. Econometrica 1994, 62, 819–851.
+21. Montalvo, J.; Reynal-Querol, M. Ethnic polarization, potential conflict, and civil wars. Am. Econ. Rev. 2005, 796–816.
+22. Stein, W. et al. Sage Mathematics Software (Version 6.0). The Sage Development Team, 2013. Available online: http://www.sagemath.org (accessed on 27 March 2014).
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+Information Geometric Complexity of a Trivariate
+Gaussian Statistical Model
+
+Domenico Felice 1,2,*, Carlo Cafaro 3 and Stefano Mancini 1,2
+
+1 School of Science and Technology, University of Camerino, I-62032 Camerino, Italy; E-Mail:
+stefano.mancini@unicam.it
+2 INFN-Sezione di Perugia, Via A. Pascoli, I-06123 Perugia, Italy
+3 Department of Mathematics, Clarkson University, Potsdam, 13699 NY, USA; E-Mail: carlocafaro2000@yahoo.it
+*
+E-Mail: domenico.felice@unicam.it
+
+Received: 1 April 2014; in revised form: 21 May 2014 / Accepted: 22 May 2014 /
+Published: 26 May 2014
+
+Abstract: We evaluate the information geometric complexity of entropic motion on low-dimensional
+Gaussian statistical manifolds in order to quantify how difficult it is to make macroscopic predictions
+about systems in the presence of limited information. Specifically, we observe that the complexity of
+such entropic inferences not only depends on the amount of available pieces of information but also
+on the manner in which such pieces are correlated. Finally, we uncover that, for certain correlational
+structures, the impossibility of reaching the most favorable configuration from an entropic inference
+viewpoint seems to lead to an information geometric analog of the well-known frustration effect that
+occurs in statistical physics.
+
+Keywords: probability theory; Riemannian geometry; complexity
+
+1. Introduction
+
+One of the main efforts in physics is modeling and predicting natural phenomena using relevant
+information about the system under consideration. Theoretical physics has had a general measure of
+the uncertainty associated with the behavior of a probabilistic process for more than 100 years: the
+Shannon entropy [1]. The Shannon information theory was applied to dynamical systems and became
+successful in describing their unpredictability [2].
+Along a similar avenue we may set Entropic Dynamics [3] which makes use of inductive inference
+(Maximum Entropy Methods [4]) and Information Geometry [5]. This is clearly remarkable given
+that microscopic dynamics can be far removed from the phenomena of interest, such as in complex
+biological or ecological systems. Extension of ED to temporally-complex dynamical systems on curved
+statistical manifolds led to relevant measures of chaoticity [6]. In particular, an information geometric
+approach to chaos (IGAC) has been pursued studying chaos in informational geodesic flows describing
+physical, biological or chemical systems. It is the information geometric analogue of conventional
+geometrodynamical approaches [7] where the classical configuration space is being replaced by a
+statistical manifold with the additional possibility of considering chaotic dynamics arising from non
+conformally flat metrics. Within this framework, it seems natural to consider as a complexity measure
+the (time average) statistical volume explored by geodesic flows, namely an Information Geometry
+Complexity (IGC).
+This quantity might help uncover connections between microscopic dynamics and experimentally
+observable macroscopic dynamics which is a fundamental issue in physics [8].
+An interesting
+manifestation of such a relationship appears in the study of the effects of microscopic external
+noise (noise imposed on the microscopic variables of the system) on the observed collective motion
+(macroscopic variables) of a globally coupled map [9]. These effects are quantified in terms of the
+complexity of the collective motion. Furthermore, it turns out that noise at a microscopic level reduces
+
+Entropy 2014, 16, 2944–2958; doi:10.3390/e16062944
+www.mdpi.com/journal/entropy
+
+the complexity of the macroscopic motion, which in turn is characterized by the number of effective
+degrees of freedom of the system.
+The investigation of the macroscopic behavior of complex systems in terms of the underlying
+statistical structure of its microscopic degrees of freedom also reveals effects due to the presence of
+microcorrelations [10]. In this article we first show which macro-states should be considered in a
+Gaussian statistical model in order to have a reduction in time of the Information Geometry Complexity.
+Then, dealing with correlated bivariate and trivariate Gaussian statistical models, the ratio between
+the IGC in the presence and in the absence of microcorrelations is explicitly computed, finding an
+intriguing, although not yet deeply understood, connection with the phenomenon of geometric
+frustration [11].
+The layout of the article is as follows. In Section 2 we introduce a general statistical model
+discussing its geometry and describing both its dynamics and information geometry complexity. In
+Section 3, Gaussian statistical models (up to a trivariate model) are considered. There, we compute
+the asymptotic temporal behaviors of their IGCs. Finally, in Section 4 we draw our conclusions by
+outlining our findings and proposing possible further investigations.
+
+2. Statistical Models and Information Geometry Complexity
+
+Given $n$ real-valued random variables $X_1, \dots, X_n$ defined on the sample space $\Omega$ with joint
+probability density $p\colon \mathbb{R}^n \to \mathbb{R}$ satisfying the conditions
+
+\[
+p(x) \geq 0 \ (\forall x \in \mathbb{R}^n) \quad\text{and}\quad \int_{\mathbb{R}^n} dx\, p(x) = 1, \tag{1}
+\]
+
+let us consider a family $\mathcal{P}$ of such distributions and suppose that they can be parametrized using $m$
+real-valued variables $(\theta^1, \dots, \theta^m)$ so that
+
+\[
+\mathcal{P} = \{p_\theta = p(x|\theta) \mid \theta = (\theta^1, \dots, \theta^m) \in \Theta\}, \tag{2}
+\]
+
+where $\Theta \subseteq \mathbb{R}^m$ is the parameter space and the mapping $\theta \to p_\theta$ is injective. In such a way, $\mathcal{P}$ is an
+$m$-dimensional statistical model on $\mathbb{R}^n$.
+The mapping ϕ : P → Rm defined by ϕ(pθ) = θ allows us to consider ϕ = [θi] as a coordinate
+system for P. Assuming parametrizations which are C∞, we can turn P into a C∞ differentiable
+manifold (thus, P is called statistical manifold) [5].
+The values x1, . . . , xn taken by the random variables define the micro-state of the system, while the
+values θ1, . . . , θm taken by parameters define the macro-state of the system.
+Let P = {pθ|θ ∈ Θ} be an m-dimensional statistical model. Given a point θ, the Fisher information
+matrix of P in θ is the m × m matrix G(θ) = [gij], where the (i, j) entry is defined by
+
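+\begin{equation}
+g_{ij}(\theta) := \int_{\mathbb{R}^n} dx \, p(x|\theta) \, \partial_i \log p(x|\theta) \, \partial_j \log p(x|\theta), \tag{3}
+\end{equation}
+
+with $\partial_i$ standing for $\partial / \partial\theta^i$. The matrix G(θ) is symmetric, positive semidefinite and determines a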
+Riemannian metric on the parameter space Θ [5]. Hence, it is possible to define a Riemannian statistical
+manifold M := (Θ, g), where g = gijdθi ⊗ dθj (i, j = 1, . . . , m) is the metric whose components gij are
+given by Equation (3) (throughout the paper we use the Einstein sum convention).
+Given the Riemannian manifold M = (Θ, g), it is well known that there exists only one
+linear connection ∇ (the Levi–Civita connection) on M that is compatible with the metric g and
+symmetric [12]. We remark that the manifold M has one chart, Θ being an open set of Rm, and the
+Levi–Civita connection is uniquely defined by means of the Christoffel coefficients
+
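+\begin{equation}
+\Gamma^k_{ij} = \frac{1}{2} g^{kl} \left( \frac{\partial g_{lj}}{\partial \theta^i} + \frac{\partial g_{il}}{\partial \theta^j} - \frac{\partial g_{ij}}{\partial \theta^l} \right), \qquad (i, j, k = 1, \ldots, m) \tag{4}
+\end{equation}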
+
+where $g^{kl}$ is the (k, l) entry of the inverse of the Fisher matrix G(θ).
+The idea of curvature is the fundamental tool to understand the geometry of the manifold
+M = (Θ, g). Actually, it is the basic geometric invariant and the intrinsic way to obtain it is by
+means of geodesics. It is well known that, given any point θ ∈ M and any vector v tangent to
+M at θ, there is a unique geodesic starting at θ with initial tangent vector v. Indeed, within the
+considered coordinate system, the geodesics are solutions of the following nonlinear second order
+coupled ordinary differential equations [12]
+
+\begin{equation}
+\frac{d^2\theta^k}{d\tau^2} + \Gamma^k_{ij} \frac{d\theta^i}{d\tau} \frac{d\theta^j}{d\tau} = 0, \tag{5}
+\end{equation}
+
+with τ denoting the time.
+The recipe to compute some curvatures at a point θ ∈ M is the following: first, select a
+2-dimensional subspace Π of the tangent space to M at θ; second, follow the geodesics through
+θ whose initial tangent vectors lie in Π and consider the 2-dimensional submanifold SΠ swept out
+by them inheriting a Riemannian metric from M; finally, compute the Gaussian curvature of SΠ at θ,
+which can be obtained from its Riemannian metric as stated in the Theorema Egregium [13]. The number
+K(Π) found in such manner is called the sectional curvature of M at θ associated with the plane Π. In
+terms of local coordinates, to compute the sectional curvature we need the curvature tensor,
+
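+\begin{equation}
+R^h_{ijk} = \frac{\partial \Gamma^h_{jk}}{\partial \theta^i} - \frac{\partial \Gamma^h_{ik}}{\partial \theta^j} + \Gamma^l_{jk} \Gamma^h_{il} - \Gamma^l_{ik} \Gamma^h_{jl}. \tag{6}
+\end{equation}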
+
+For any basis (ξ, η) for a 2-plane Π ⊂ TθM, the sectional curvature at θ ∈ M is given by [12]
+
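+\begin{equation}
+K(\xi, \eta) = \frac{R(\xi, \eta, \eta, \xi)}{|\xi|^2 |\eta|^2 - \langle \xi, \eta \rangle^2}, \tag{7}
+\end{equation}
+
+where R is the Riemann curvature tensor, which is written in coordinates as $R = R_{ijkl} \, d\theta^i \otimes d\theta^j \otimes d\theta^k \otimes d\theta^l$
+with $R_{ijkl} = g_{lh} R^h_{ijk}$, and ⟨·, ·⟩ is the inner product defined by the metric g.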
+The sectional curvature is directly related to the topology of the manifold; along this direction
+the Cartan-Hadamard Theorem [13] is enlightening by stating that any complete, simply connected
+n-dimensional manifold with non positive sectional curvature is diffeomorphic to Rn.
+We can consider upon the statistical manifold M = (Θ, g) the macro-variables θ as accessible
+information and then derive the information dynamical Equation (5) from a standard principle of least
+action of Jacobi type [3]. The geodesic Equations (5) describe a reversible dynamics whose solution is
+the trajectory between an initial and a final macrostate θinitial and θfinal, respectively. The trajectory can
+be equally traversed in both directions [10]. Actually, an equation relating instability with geometry
+exists, and it gives hope that some global information about the average degree of instability (chaos)
+of the dynamics is encoded in global properties of the statistical manifolds [7]. The fact that this might
+happen is proved by the special case of constant-curvature manifolds, for which the Jacobi-Levi-Civita
+equation simplifies to [7]
+\begin{equation}
+\frac{d^2 J^i}{d\tau^2} + K J^i = 0, \tag{8}
+\end{equation}
+
+where K is the constant sectional curvature of the manifold (see Equation (7)) and J is the geodesic
+deviation vector field. On a positively curved manifold, the norm of the separating vector J does not
+grow, whereas on a negatively curved manifold, the norm of J grows exponentially in time, and if the
+manifold is compact, so that its geodesics are sooner or later obliged to fold, this provides an example of
+chaotic geodesic motion [14].
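+
+As a brief illustration of Equation (8), and under the added assumption of generic initial data $J^i(0)$ and $\dot{J}^i(0)$ (symbols introduced only for this sketch), the constant-curvature equation can be integrated in closed form:
+
+\begin{equation*}
+J^i(\tau) =
+\begin{cases}
+J^i(0) \cos\left( \sqrt{K} \, \tau \right) + \dfrac{\dot{J}^i(0)}{\sqrt{K}} \sin\left( \sqrt{K} \, \tau \right), & K > 0, \\[2mm]
+J^i(0) \cosh\left( \sqrt{|K|} \, \tau \right) + \dfrac{\dot{J}^i(0)}{\sqrt{|K|}} \sinh\left( \sqrt{|K|} \, \tau \right), & K < 0.
+\end{cases}
+\end{equation*}
+
+For K > 0 the components stay bounded, while for K < 0 they generically grow like $\exp(\sqrt{|K|} \, \tau)$, which is the behavior described above.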
+
+Taking into consideration these facts, we single out as suitable indicator of dynamical (temporal)
+complexity, the information geometric complexity defined as the average dynamical statistical
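+volume [15]
+\begin{equation}
+\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right] := \frac{1}{\tau} \int_0^{\tau} d\tau' \, \mathrm{vol}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau') \right], \tag{9}
+\end{equation}
+
+where
+\begin{equation}
+\mathrm{vol}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau') \right] := \int_{\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau')} \sqrt{\det(G(\theta))} \, d\theta, \tag{10}
+\end{equation}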
+
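+with G(θ) the information matrix whose components are given by Equation (3). The integration space
+$\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau')$ is defined as follows
+
+\begin{equation}
+\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau') := \left\{ \theta = (\theta^1, \ldots, \theta^m) : \theta^k(0) \leq \theta^k \leq \theta^k(\tau') \right\}, \tag{11}
+\end{equation}
+
+where $\theta^k \equiv \theta^k(s)$ with $0 \leq s \leq \tau'$ such that $\theta^k(s)$ satisfies (5). The quantity $\mathrm{vol}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau') \right]$ is the
+volume of the effective parameter space explored by the system at time τ′. The temporal average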
+has been introduced in order to average out the possibly very complex fine details of the entropic
+dynamical description of the system’s complexity dynamics.
+Relevant properties, concerning complexity of geodesic paths on curved statistical manifolds, of
+the quantity (10) compared to the Jacobi vector field are discussed in [16].
+
+3. The Gaussian Statistical Model
+
+In the following we devote our attention to a Gaussian statistical model P whose elements are
+multivariate normal joint distributions for n real-valued variables X1, . . . , Xn given by
+
+\begin{equation}
+p(x|\theta) = \frac{1}{\sqrt{(2\pi)^n \det C}} \exp\left[ -\frac{1}{2} (x - \mu)^t C^{-1} (x - \mu) \right], \tag{12}
+\end{equation}
+
+where $\mu = (E(X_1), \ldots, E(X_n))$ is the n-dimensional mean vector and C denotes the n × n covariance
+matrix with entries $c_{ij} = E(X_i X_j) - E(X_i) E(X_j)$, i, j = 1, . . . , n. Since μ is an n-dimensional real vector
+and C is an n × n symmetric matrix, the number of parameters involved in this model is $n + \frac{n(n+1)}{2}$.
+Moreover C is a symmetric, positive definite matrix, hence we have the parameter space given by
+
+Θ := {(μ, C)|μ ∈ Rn, C ∈ Rn×n, C > 0}.
+(13)
+
+Hereafter we consider the statistical model given by Equation (12) when the covariance matrix C has
+only the variances $\sigma_i^2 = E(X_i^2) - (E(X_i))^2$ as parameters. In fact we assume that the off-diagonal entry
+(i, j) of the covariance matrix C equals $\rho \sigma_i \sigma_j$ with ρ ∈ R quantifying the degree of correlation.
+We may further notice that the function $f_{ij}(x) := \partial_i \log p(x|\theta) \, \partial_j \log p(x|\theta)$, when p(x|θ) is given
+by Equation (12), is a polynomial in the variables $x_i$ (i = 1, . . . , n) whose degree is not greater than four.
+Indeed, we have that
+
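+\begin{equation}
+\partial_i \log p(x|\theta) = \frac{1}{p(x|\theta)} \partial_i p(x|\theta) = \partial_i \log \frac{1}{\sqrt{(2\pi)^n \det C}} + \partial_i \left[ -\frac{1}{2} (x - \mu)^t C^{-1} (x - \mu) \right], \tag{14}
+\end{equation}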
+
+and, therefore, the differentiation does not affect variables xi. With this in mind, in order to compute
+the integral in (3), we can use the following formula [17]
+
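+\begin{equation}
+\frac{1}{\sqrt{(2\pi)^n \det C}} \int dx \, f_{ij}(x) \exp\left[ -\frac{1}{2} (x - \mu)^t C^{-1} (x - \mu) \right] = \exp\left[ \frac{1}{2} \sum_{h,k=1}^{n} c_{hk} \frac{\partial}{\partial x_h} \frac{\partial}{\partial x_k} \right] f_{ij} \Big|_{x=\mu}, \tag{15}
+\end{equation}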
+
+where the exponential denotes the power series over its argument (the differential operator).
+
+3.1. The monovariate Gaussian Statistical Model
+
+We now start to apply the concepts of the previous section to a Gaussian statistical model of
+Equation (12) for n = 1. In this case, the dimension of the statistical Riemannian manifold M = (Θ, g)
+is at most two. Indeed, to describe elements of the statistical model P given by Equation (12), we
+basically need the mean μ = E(X) and variance $\sigma^2 = E[(X - \mu)^2]$. We deal separately with the
+cases when the monovariate model has only μ as macro-variable (Case 1), when σ is the unique
+macro-variable (Case 2), and finally when both μ and σ are macro-variables (Case 3).
+
+3.1.1. Case 1
+
+Consider the monovariate model with only μ as macro-variable by setting σ = 1. In this case
+the manifold M is trivially the real flat straight line, since μ ∈ (−∞, +∞). Indeed, the integral
+in (3) is equal to 1 when the distribution p(x|θ) reads as $p(x|\mu) = \exp\left[ -\frac{1}{2} (x - \mu)^2 \right] / \sqrt{2\pi}$; so the metric
+is $g = d\mu^2$. Furthermore, from Equations (4) and (5) the information dynamics is described by
+the geodesic $\mu(\tau) = A_1 \tau + A_2$, where $A_1, A_2 \in \mathbb{R}$. Hence, the volume of Equation (10) is
+$\mathrm{vol}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau') \right] = \int d\mu = A_1 \tau + A_2$; since this quantity must be positive we assume $A_1, A_2 > 0$.
+
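+Finally, the asymptotic behavior of the IGC (9) is
+
+\begin{equation}
+\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right] \approx \left( \frac{A_1}{2} \right) \tau. \tag{16}
+\end{equation}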
+This shows that the complexity linearly increases in time meaning that acquiring information about μ
+and updating it, is not enough to increase our knowledge about the micro state of the system.
+
+3.1.2. Case 2
+
+Consider now the monovariate Gaussian statistical model of Equation (12) when μ = E(X) = 0
+and the macro-variable is only σ. In this case the probability distribution function reads $p(x|\sigma) = \exp\left( -\frac{x^2}{2\sigma^2} \right) / (\sqrt{2\pi} \, \sigma)$,
+while the Fisher–Rao metric becomes $g = \frac{2}{\sigma^2} d\sigma^2$. Emphasizing that in this case
+the manifold is flat as well, we derive the information dynamics by means of Equations (4) and (5) and we
+obtain the geodesic $\sigma(\tau) = A_1 \exp(A_2 \tau)$. The volume in Equation (10) is then
+
+\begin{equation}
+\mathrm{vol}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau') \right] = \int \frac{\sqrt{2}}{\sigma} \, d\sigma = \sqrt{2} \log\left[ A_1 \exp(A_2 \tau) \right]. \tag{17}
+\end{equation}
+
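+Again, to have positive volume we have to assume $A_1, A_2 > 0$. Finally, the (asymptotic) IGC (9)
+becomes
+\begin{equation}
+\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right] \approx \left( \frac{\sqrt{2} A_2}{2} \right) \tau. \tag{18}
+\end{equation}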
+
+This shows that also in this case the complexity linearly increases in time meaning that acquiring
+information about σ and updating it, is not enough to increase our knowledge about the micro-state of
+the system.
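+
+To make the step from Equation (17) to Equation (18) explicit, the following short calculation (a sketch that simply inserts (17), read at time τ′, into the definition (9)) shows where the linear growth comes from:
+
+\begin{equation*}
+\frac{1}{\tau} \int_0^{\tau} d\tau' \, \sqrt{2} \left( \log A_1 + A_2 \tau' \right) = \sqrt{2} \log A_1 + \frac{\sqrt{2} A_2}{2} \, \tau \approx \left( \frac{\sqrt{2} A_2}{2} \right) \tau \qquad (\tau \to \infty).
+\end{equation*}
+
+The same averaging applied to the Case 1 volume $A_1 \tau' + A_2$ gives $A_1 \tau / 2 + A_2$, in agreement with Equation (16).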
+
+3.1.3. Case 3
+
+The take-home message of the previous cases is that we have to account for both mean μ and
+variance σ as macro-variables to look for possible non-increasing complexity. Hence, consider the
+probability distribution function given by
+
+\begin{equation}
+p(x|\mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right]. \tag{19}
+\end{equation}
+
+The dimension of the Riemannian manifold M = (Θ, g) is two, where the parameter space Θ is given
+by Θ = {(μ, σ)|μ ∈ (−∞, +∞), σ > 0} and the Fisher–Rao metric reads as $g = \frac{1}{\sigma^2} d\mu^2 + \frac{2}{\sigma^2} d\sigma^2$. Here,
+the sectional curvature given by Equation (7) is a negative function and, despite the fact that it is not
+constant, we expect a decreasing behavior in time of the IGC. Thanks to Equation (4), we find that the
+only nonvanishing Christoffel coefficients are $\Gamma^1_{12} = -\frac{1}{\sigma}$, $\Gamma^2_{11} = \frac{1}{2\sigma}$ and $\Gamma^2_{22} = -\frac{1}{\sigma}$. Substituting them
+into Equation (5) we derive the following geodesic equations
+
+\begin{equation}
+\begin{cases}
+\dfrac{d^2\mu(\tau)}{d\tau^2} - \dfrac{2}{\sigma} \dfrac{d\sigma}{d\tau} \dfrac{d\mu}{d\tau} = 0, \\[2mm]
+\dfrac{d^2\sigma(\tau)}{d\tau^2} - \dfrac{1}{\sigma} \left( \dfrac{d\sigma}{d\tau} \right)^2 + \dfrac{1}{2\sigma} \left( \dfrac{d\mu}{d\tau} \right)^2 = 0.
+\end{cases} \tag{20}
+\end{equation}
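+
+As a cross-check of the coefficients quoted above, the following sketch evaluates Equation (4) for the diagonal metric with $g_{11} = 1/\sigma^2$ and $g_{22} = 2/\sigma^2$, with the coordinate labeling $(\theta^1, \theta^2) = (\mu, \sigma)$ assumed here:
+
+\begin{align*}
+\Gamma^1_{12} &= \frac{1}{2} g^{11} \partial_\sigma g_{11} = \frac{\sigma^2}{2} \left( -\frac{2}{\sigma^3} \right) = -\frac{1}{\sigma}, \\
+\Gamma^2_{11} &= -\frac{1}{2} g^{22} \partial_\sigma g_{11} = -\frac{\sigma^2}{4} \left( -\frac{2}{\sigma^3} \right) = \frac{1}{2\sigma}, \\
+\Gamma^2_{22} &= \frac{1}{2} g^{22} \partial_\sigma g_{22} = \frac{\sigma^2}{4} \left( -\frac{4}{\sigma^3} \right) = -\frac{1}{\sigma}.
+\end{align*}
+
+Inserting these into Equation (5), and counting the symmetric pair $\Gamma^1_{12} = \Gamma^1_{21}$ twice, gives the two geodesic equations for μ and σ.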
+
+The integration of the above coupled differential equations is non-trivial. We follow the method
+described in [10] and arrive at
+
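+\begin{equation}
+\sigma(\tau) = \frac{2\sigma_0 \exp\left( \frac{\sigma_0 |A_1|}{\sqrt{2}} \tau \right)}{1 + \exp\left( \frac{2\sigma_0 |A_1|}{\sqrt{2}} \tau \right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 \sqrt{2} A_1}{|A_1| \left[ 1 + \exp\left( \frac{2\sigma_0 |A_1|}{\sqrt{2}} \tau \right) \right]}, \tag{21}
+\end{equation}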
+
+where σ0 and A1 are real constants. Then, using (21), the volume of Equation (10) results
+
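+\begin{equation}
+\mathrm{vol}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau') \right] = \int \frac{\sqrt{2}}{\sigma^2} \, d\sigma \, d\mu = \frac{\sqrt{2} A_1}{|A_1|} \exp\left( -\frac{\sigma_0 |A_1|}{\sqrt{2}} \tau \right). \tag{22}
+\end{equation}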
+
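+Since the last quantity must be positive, we assume $A_1 > 0$. Finally, employing the above expression
+in Equation (9) we arrive at
+
+\begin{equation}
+\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right] \approx \left( \frac{2}{\sigma_0 A_1} \right) \frac{1}{\tau}. \tag{23}
+\end{equation}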
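+
+The 1/τ decay can be made explicit with a short calculation. As a sketch, assume $A_1 > 0$ and abbreviate $a := \sigma_0 A_1 / \sqrt{2}$ (a shorthand introduced only here); inserting the volume (22), read at time τ′, into the definition (9) gives
+
+\begin{equation*}
+\frac{1}{\tau} \int_0^{\tau} d\tau' \, \sqrt{2} \, e^{-a \tau'} = \frac{\sqrt{2}}{a \tau} \left( 1 - e^{-a \tau} \right) \approx \frac{\sqrt{2}}{a \tau} = \left( \frac{2}{\sigma_0 A_1} \right) \frac{1}{\tau} \qquad (\tau \to \infty).
+\end{equation*}
+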
+
+We can now see a reduction in time of the complexity, meaning that acquiring information about both
+μ and σ and updating them allows us to increase our knowledge about the micro-state of the system.
+Hence, comparing Equations (16), (18) and (23), we conclude that entropic inference on a
+Gaussian distributed micro-variable is carried out in a more efficient manner when both its mean and
+its variance are available in the form of information constraints. Macroscopic predictions when only
+one of these pieces of information is available are more complex.
+
+3.2. Bivariate Gaussian Statistical Model
+
+Consider now the Gaussian statistical model P of the Equation (12) when n = 2. In this case
+the dimension of the Riemannian manifold M = (Θ, g) is at most four. From the analysis of the
+monovariate Gaussian model in Section 3.1 we have understood that both mean and variance should
+be considered. Hence the minimal assumption is to consider E(X1) = E(X2) = μ and E(X1 − μ)2 =
+E(X2 − μ)2 = σ2. Furthermore, in this case we have also to take into account the possible presence of
+(micro) correlations, which appear at the level of macro-states as off-diagonal terms in the covariance
+matrix. In short, this implies considering the following probability distribution function
+
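+\begin{equation}
+p(x_1, x_2|\mu, \sigma) = \frac{1}{2\pi\sigma^2 \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2\sigma^2 (1 - \rho^2)} \left[ (x_1 - \mu)^2 - 2\rho (x_1 - \mu)(x_2 - \mu) + (x_2 - \mu)^2 \right] \right\}, \tag{24}
+\end{equation}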
+
+where ρ ∈ (−1, 1).
+Thanks to Equation (15) we compute the Fisher information matrix G and find $g = g_{11} d\mu^2 + g_{22} d\sigma^2$ with
+
+\begin{equation}
+g_{11} = \frac{2}{\sigma^2 (\rho + 1)}; \qquad g_{22} = \frac{4}{\sigma^2}. \tag{25}
+\end{equation}
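+
+The entries in (25) can be cross-checked independently of Equation (15). The sketch below assumes the standard closed form of the Fisher information for a Gaussian whose mean vector m(θ) and covariance C(θ) depend on the parameters, $g_{ab} = \partial_a m^t \, C^{-1} \, \partial_b m + \frac{1}{2} \mathrm{tr}\left( C^{-1} \partial_a C \, C^{-1} \partial_b C \right)$, and writes $\mathbf{1} = (1, 1)^t$ and $I_2$ for the 2 × 2 identity (notation introduced only for this check). With $m = \mu \mathbf{1}$ and $C = \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, so that $\partial_\sigma C = 2C/\sigma$:
+
+\begin{align*}
+g_{11} &= \mathbf{1}^t C^{-1} \mathbf{1} = \frac{1}{\sigma^2 (1 - \rho^2)} \left( 1 - \rho - \rho + 1 \right) = \frac{2}{\sigma^2 (1 + \rho)}, \\
+g_{22} &= \frac{1}{2} \mathrm{tr}\left[ \left( \frac{2}{\sigma} \right)^2 I_2 \right] = \frac{4}{\sigma^2}.
+\end{align*}
+
+The mixed entry vanishes because the mean depends only on μ and the covariance only on σ. Under the same assumptions, for n variables with a common mean and C equal to σ² times a fixed correlation matrix, the two formulas reduce to $g_{11} = \mathbf{1}^t C^{-1} \mathbf{1}$ and $g_{22} = 2n/\sigma^2$.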
+
+The only nontrivial Christoffel coefficients (4) are $\Gamma^1_{12} = -\frac{1}{\sigma}$, $\Gamma^2_{11} = \frac{1}{2\sigma(\rho + 1)}$ and $\Gamma^2_{22} = -\frac{1}{\sigma}$. In this case
+as well, the sectional curvature (Equation (7)) of the manifold M is a negative function and so we may
+expect a decreasing asymptotic behavior for the IGC. From Equation (5) it follows that the geodesic
+equations are
+\begin{equation}
+\begin{cases}
+\dfrac{d^2\mu(\tau)}{d\tau^2} - \dfrac{2}{\sigma} \dfrac{d\sigma}{d\tau} \dfrac{d\mu}{d\tau} = 0, \\[2mm]
+\dfrac{d^2\sigma(\tau)}{d\tau^2} - \dfrac{1}{\sigma} \left( \dfrac{d\sigma}{d\tau} \right)^2 + \dfrac{1}{2(1 + \rho)\sigma} \left( \dfrac{d\mu}{d\tau} \right)^2 = 0,
+\end{cases} \tag{26}
+\end{equation}
+
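+whose solutions are
+
+\begin{equation}
+\sigma(\tau) = \frac{2\sigma_0 \exp\left( \frac{\sigma_0 |A_1|}{\sqrt{2(1+\rho)}} \tau \right)}{1 + \exp\left( \frac{2\sigma_0 |A_1|}{\sqrt{2(1+\rho)}} \tau \right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 \sqrt{2(1+\rho)} \, A_1}{|A_1| \left[ 1 + \exp\left( \frac{2\sigma_0 |A_1|}{\sqrt{2(1+\rho)}} \tau \right) \right]}. \tag{27}
+\end{equation}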
+
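+Using (27) in Equation (10) gives the volume
+
+\begin{equation}
+\mathrm{vol}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau') \right] = \int \frac{2\sqrt{2}}{\sqrt{1 + \rho} \, \sigma^2} \, d\sigma \, d\mu = \frac{4 A_1}{|A_1|} \exp\left( -\frac{\sigma_0 |A_1|}{\sqrt{2(1 + \rho)}} \tau \right). \tag{28}
+\end{equation}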
+
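+To have it positive we have to assume $A_1 > 0$. Finally, employing (28) in (9) leads to the IGC
+
+\begin{equation}
+\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right] \approx \left( \frac{4\sqrt{2}}{\sigma_0 A_1} \right) \frac{\sqrt{1 + \rho}}{\tau}, \tag{29}
+\end{equation}
+
+with ρ ∈ (−1, 1). We may compare the asymptotic expression of the IGCs in the presence and in the
+absence of correlations, obtaining
+
+\begin{equation}
+R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right]}{\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right]_{\rho=0}} = \sqrt{1 + \rho}, \tag{30}
+\end{equation}
+
+where “strong” stands for the fully connected lattice underlying the micro-variables. The ratio $R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho)$
+is a monotonically increasing function of ρ.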
+While the temporal behavior of the IGC (29) is similar to the IGC in (23), here correlations play
+a fundamental role. From Equation (30), we conclude that entropic inference on two Gaussian
+distributed micro-variables on a fully connected lattice is carried out in a more efficient manner when
+the two micro-variables are negatively correlated. Instead, when such micro-variables are positively
+correlated, macroscopic predictions become more complex than in the absence of correlations.
+Intuitively, this is due to the fact that for anticorrelated variables, an increase in one variable
+implies a decrease in the other one (different directional change): variables become more distant, thus
+more distinguishable in the Fisher–Rao information metric sense. Similarly, for positively correlated
+variables, an increase or decrease in one variable always predicts the same directional change for the
+second variable: variables do not become more distant, and thus not more distinguishable, in the Fisher–Rao
+information metric sense. This may lead us to guess that in the presence of anticorrelations, motion on
+curved statistical manifolds via the Maximum Entropy updating methods becomes less complex.
+
+3.3. Trivariate Gaussian Statistical Model
+
+In this section we consider a Gaussian statistical model P of the Equation (12) when n = 3.
+In this case as well, in order to understand the asymptotic behavior of the IGC in the presence of
+correlations between the micro-states, we make the minimal assumption that, given the random vector
+X = (X1, X2, X3) distributed according to a trivariate Gaussian, then E(X1) = E(X2) = E(X3) = μ
+and $E(X_1 - \mu)^2 = E(X_2 - \mu)^2 = E(X_3 - \mu)^2 = \sigma^2$. Therefore, the space of the parameters of P is
+given by Θ = {(μ, σ)|μ ∈ R, σ > 0}.
+The manifold M = (Θ, g) changes its metric structure depending on the number of correlations
+between micro-variables, namely, one, two, or three. The covariance matrices corresponding to these
+cases read, modulo the congruence via a permutation matrix [17],
+
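+\begin{equation}
+C_1 = \sigma^2 \begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
+C_2 = \sigma^2 \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & 0 \\ \rho & 0 & 1 \end{pmatrix}, \qquad
+C_3 = \sigma^2 \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix}. \tag{31}
+\end{equation}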
+
+3.3.1. Case 1
+
+First, we consider the trivariate Gaussian statistical model of Equation (12) when C ≡ C1. Then,
+proceeding as in Section 3.2, we have $g = g_{11} d\mu^2 + g_{22} d\sigma^2$, where $g_{11} = \frac{3 + \rho}{(1 + \rho)\sigma^2}$ and $g_{22} = \frac{6}{\sigma^2}$. Also in
+this case we find that the sectional curvature of Equation (7) is a negative function. Hence, as we state
+in Section 2, we may expect a decreasing (in time) behavior of the information geometry complexity.
+Furthermore, we obtain the geodesics
+
+\begin{equation}
+\sigma(\tau) = \frac{2\sigma_0 \exp\left( \sigma_0 \sqrt{A(\rho)} \, \tau \right)}{1 + \exp\left( 2\sigma_0 \sqrt{A(\rho)} \, \tau \right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}} \, \frac{1}{1 + \exp\left( 2\sigma_0 \sqrt{A(\rho)} \, \tau \right)}, \tag{32}
+\end{equation}
+
+where $A(\rho) = \frac{A_1^2 (3 + \rho)}{6(1 + \rho)}$ and $A_1 \in \mathbb{R}$. We remark that A(ρ) > 0 for all ρ ∈ (−1, 1). Then, the volume (10)
+becomes
+
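+\begin{equation}
+\mathrm{vol}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau') \right] = \int \sqrt{\frac{6(3 + \rho)}{1 + \rho}} \, \frac{1}{\sigma^2} \, d\sigma \, d\mu = \frac{6 A_1}{|A_1|} \exp\left( -\sigma_0 \sqrt{A(\rho)} \, \tau \right), \tag{33}
+\end{equation}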
+
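+requiring $A_1 > 0$ for its positivity. Finally, using (33) in (9) we arrive at the asymptotic behavior of
+the IGC
+
+\begin{equation}
+\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right] \approx \left( \frac{6\sqrt{6}}{\sigma_0 A_1} \right) \sqrt{\frac{1 + \rho}{3 + \rho}} \, \frac{1}{\tau}. \tag{34}
+\end{equation}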
+
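+Comparing (34) in the presence and in the absence of correlations yields
+
+\begin{equation}
+R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right]}{\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right]_{\rho=0}} = \sqrt{3} \, \sqrt{\frac{1 + \rho}{3 + \rho}}, \tag{35}
+\end{equation}
+
+where “weak” stands for a low degree of connection in the lattice underlying the micro-variables.
+Notice that $R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho)$ is a monotonically increasing function of the argument ρ ∈ (−1, 1).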
+
+3.3.2. Case 2
+
+When the trivariate Gaussian statistical model of Equation (12) has C ≡ C2, the condition C > 0
+constrains the correlation coefficient to be $\rho \in \left( -\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2} \right)$. Proceeding again as in Section 3.2 we
+have $g = g_{11} d\mu^2 + g_{22} d\sigma^2$, where $g_{11} = \frac{3 - 4\rho}{(1 - 2\rho^2)\sigma^2}$ and $g_{22} = \frac{6}{\sigma^2}$. The sectional curvature of Equation (7)
+is a negative function as well and so we may apply the arguments of Section 2, expecting a decrease
+in time of the complexity. Furthermore, we obtain the geodesics
+
+\begin{equation}
+\sigma(\tau) = \frac{2\sigma_0 \exp\left( \sigma_0 \sqrt{A(\rho)} \, \tau \right)}{1 + \exp\left( 2\sigma_0 \sqrt{A(\rho)} \, \tau \right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}} \, \frac{1}{1 + \exp\left( 2\sigma_0 \sqrt{A(\rho)} \, \tau \right)}, \tag{36}
+\end{equation}
+
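+where $A(\rho) = \frac{A_1^2 (3 - 4\rho)}{6(1 - 2\rho^2)}$ and $A_1 \in \mathbb{R}$. We remark that A(ρ) > 0 for all $\rho \in \left( -\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2} \right)$. Then, the
+volume (10) becomes
+
+\begin{equation}
+\mathrm{vol}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau') \right] = \int \sqrt{\frac{6(3 - 4\rho)}{1 - 2\rho^2}} \, \frac{1}{\sigma^2} \, d\sigma \, d\mu = \frac{6 A_1}{|A_1|} \exp\left( -\sigma_0 \sqrt{A(\rho)} \, \tau \right). \tag{37}
+\end{equation}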
+
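+We have to set $A_1 > 0$ for the positivity of the volume (37), and using it in (9) we arrive at the
+asymptotic behavior of the IGC
+
+\begin{equation}
+\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right] \approx \left( \frac{6\sqrt{6}}{\sigma_0 A_1} \right) \sqrt{\frac{1 - 2\rho^2}{3 - 4\rho}} \, \frac{1}{\tau}. \tag{38}
+\end{equation}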
+
+Then, comparing (38) in the presence and in the absence of correlations yields
+
+\begin{equation}
+R^{\mathrm{mildly\ weak}}_{\mathrm{trivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right]}{\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right]_{\rho=0}} = \sqrt{3} \, \sqrt{\frac{1 - 2\rho^2}{3 - 4\rho}}, \tag{39}
+\end{equation}
+
+where “mildly weak” stands for a lattice (underlying micro-variables) neither fully connected nor with
+minimal connection.
+This is a function of the argument $\rho \in \left( -\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2} \right)$ that attains the maximum $\sqrt{3/2}$ at $\rho = \frac{1}{2}$, while
+at the extrema of the interval $\left( -\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2} \right)$ it tends to zero.
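+
+The location of this maximum can be checked directly. As a sketch, write $h(\rho) := (1 - 2\rho^2)/(3 - 4\rho)$ (a name introduced only here), so that the ratio in (39) is $\sqrt{3 h(\rho)}$, and differentiate:
+
+\begin{equation*}
+h'(\rho) = \frac{-4\rho (3 - 4\rho) + 4 (1 - 2\rho^2)}{(3 - 4\rho)^2} = \frac{4 (2\rho - 1)(\rho - 1)}{(3 - 4\rho)^2}.
+\end{equation*}
+
+On the admissible interval, where ρ < 1, the derivative is positive for ρ < 1/2 and negative for ρ > 1/2, so the only stationary point ρ = 1/2 is a maximum, with h(1/2) = 1/2 and hence a ratio of $\sqrt{3/2}$. At $\rho = \pm \sqrt{2}/2$ the numerator $1 - 2\rho^2$ vanishes, which gives the zeros at the endpoints.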
+
+3.3.3. Case 3
+
+Last, we consider the trivariate Gaussian statistical model of Equation (12) when C ≡ C3. In
+this case, the condition C > 0 requires the correlation coefficient to be $\rho \in \left( -\frac{1}{2}, 1 \right)$. Proceeding again
+as in Section 3.2 we have $g = g_{11} d\mu^2 + g_{22} d\sigma^2$, where $g_{11} = \frac{3}{(1 + 2\rho)\sigma^2}$ and $g_{22} = \frac{6}{\sigma^2}$. We find that the
+sectional curvature of Equation (7) is a negative function; hence, we may expect a decreasing (in time)
+behavior of the complexity. The geodesics are
+
+\begin{equation}
+\sigma(\tau) = \frac{2\sigma_0 \exp\left( \sigma_0 \sqrt{A(\rho)} \, \tau \right)}{1 + \exp\left( 2\sigma_0 \sqrt{A(\rho)} \, \tau \right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}} \, \frac{1}{1 + \exp\left( 2\sigma_0 \sqrt{A(\rho)} \, \tau \right)}, \tag{40}
+\end{equation}
+
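+where $A(\rho) = \frac{A_1^2}{2(1 + 2\rho)}$ and $A_1 \in \mathbb{R}$. We note that A(ρ) > 0 for all $\rho \in \left( -\frac{1}{2}, 1 \right)$. Using (40), we compute
+
+\begin{equation}
+\mathrm{vol}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau') \right] = \int \frac{3\sqrt{2}}{\sqrt{1 + 2\rho}} \, \frac{1}{\sigma^2} \, d\sigma \, d\mu = \frac{6\sqrt{2} A_1}{|A_1|} \exp\left( -\sigma_0 \sqrt{A(\rho)} \, \tau \right). \tag{41}
+\end{equation}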
+
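+Also in this case we need to assume $A_1 > 0$ to have positive volume. Finally, substituting Equation (41)
+into Equation (9), the asymptotic behavior of the IGC is
+
+\begin{equation}
+\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right] \approx \left( \frac{12}{\sigma_0 A_1} \right) \sqrt{1 + 2\rho} \, \frac{1}{\tau}. \tag{42}
+\end{equation}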
+
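+The comparison of (42) in the presence and in the absence of correlations yields
+
+\begin{equation}
+R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right]}{\widetilde{\mathrm{vol}}\left[ \mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau) \right]_{\rho=0}} = \sqrt{1 + 2\rho}, \tag{43}
+\end{equation}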
+
+where “strong” stands for a fully connected lattice underlying the (three) micro-variables. We remark that
+the latter ratio is a monotonically increasing function of the argument $\rho \in \left( -\frac{1}{2}, 1 \right)$.
+
+The behaviors of R(ρ) of Equations (30), (35), (39) and (43) are reported in Figure 1.
+
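+[Figure 1: plot of R(ρ) (vertical axis) against ρ (horizontal axis, from −1 to 1, with ρpeak marked).]
+
+Figure 1. Ratio R(ρ) of volumes vs. degree of correlations ρ. Solid line refers to $R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho)$; dotted line
+refers to $R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho)$; dashed line refers to $R^{\mathrm{mildly\ weak}}_{\mathrm{trivariate}}(\rho)$; dash-dotted line refers to $R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho)$.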
+
+The non-monotonic behavior of the ratio $R^{\mathrm{mildly\ weak}}_{\mathrm{trivariate}}(\rho)$ in Equation (39) corresponds to the
+information geometric complexities for the mildly weak connected three-dimensional lattice.
+Interestingly, the growth stops at a critical value $\rho_{\mathrm{peak}} = \frac{1}{2}$ at which $R^{\mathrm{mildly\ weak}}_{\mathrm{trivariate}}(\rho_{\mathrm{peak}}) = R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho_{\mathrm{peak}})$. From
+Equation (43), we conclude that entropic inference on three Gaussian distributed micro-variables on
+a fully connected lattice is carried out in a more efficient manner when the micro-variables are
+negatively correlated. Instead, when such micro-variables are positively correlated, macroscopic
+predictions become more complex than in the absence of correlations.
+Furthermore, the ratio
+$R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho)$ of the information geometric complexities for this fully connected three-dimensional
+lattice increases in a monotonic fashion. These conclusions are similar to those presented for the
+bivariate case. However, there is a key feature of the IGC to emphasize when passing from the
+two-dimensional to the three-dimensional manifolds associated with fully connected lattices: the
+effects of negative correlations and positive correlations are amplified with respect to the respective
+absence-of-correlations scenarios,
+\begin{equation}
+\frac{R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho)}{R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho)} = \sqrt{\frac{1 + 2\rho}{1 + \rho}}, \tag{44}
+\end{equation}
+
+where $\rho \in \left( -\frac{1}{2}, 1 \right)$.
+Specifically, carrying out entropic inference on the higher-dimensional manifold in the presence
+of anti-correlations, that is for $\rho \in \left( -\frac{1}{2}, 0 \right)$, is less complex than on the lower-dimensional manifold, as is
+evident from Equation (44). The converse is true in the presence of positive correlations, that is for
+ρ ∈ (0, 1).
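+
+The sign statement can be read off from Equation (44) with one line of algebra (a sketch using only the ratio under the square root in (44)):
+
+\begin{equation*}
+\frac{1 + 2\rho}{1 + \rho} - 1 = \frac{\rho}{1 + \rho},
+\end{equation*}
+
+which, since 1 + ρ > 0 on (−1/2, 1), is negative for ρ ∈ (−1/2, 0) and positive for ρ ∈ (0, 1). The ratio in (44) is therefore below one for anti-correlations and above one for positive correlations.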
+
+4. Concluding Remarks
+
+In summary, we considered low dimensional Gaussian statistical models (up to a trivariate model)
+and have investigated their dynamical (temporal) complexity. This has been quantified by the volume
+of geodesics for parameters characterizing the probability distribution functions. To the best of our
+knowledge, there is no dynamic measure of complexity of geodesic paths on curved statistical manifolds
+that could be compared to our IGC. However, it could be worthwhile to understand the connection, if
+any, between our IGC and the complexity of paths of dynamic systems introduced in [20]. Specifically,
+according to the Alekseev-Brudno theorem in the algorithmic theory of dynamical systems [21], a way
+to predict each new segment of chaotic trajectory is obtained by adding information proportional to the
+length of this segment and independent of the full previous length of trajectory. This means that this
+information cannot be extracted from observation of the previous motion, even an infinitely long one!
+If the instability is a power law, then the required information per unit time is inversely proportional
+to the full previous length of the trajectory and, asymptotically, the prediction becomes possible.
+For the sake of completeness, we also point out that the relevance of volumes in quantifying the
+static model complexity of statistical models was already pointed out in [22] and [23]: complexity is
+related to the volume of a model in the space of distributions regarded as a Riemannian manifold
+of distributions with a natural metric defined by the Fisher–Rao metric tensor. Finally, we would
+like to point out that two of the Authors have recently associated Gaussian statistical models to
+networks [17]. Specifically, it is assumed that random variables are located on the vertices of the
+network while correlations between random variables are regarded as weighted edges of the network.
+Within this framework, a static network complexity measure has been proposed as the volume of the
+corresponding statistical manifold. We emphasize that such a static measure could be, in principle,
+applied to time-dependent networks by accommodating time-varying weights on the edges [24]. This
+requires the consideration of a time-sequence of different statistical manifolds. Thus, we could follow
+the time-evolution of a network complexity through the time evolution of the volumes of the associated
+manifolds.
+In this work we uncover that in order to have a reduction in time of the complexity one has to
+consider both mean and variance as macro-variables. This leads to different topological structures of
+the parameter space in (13); in particular, we have to consider at least a 2-dimensional manifold in
+order to have effects such as a power law decay of the complexity. Hence, the minimal hypothesis in a
+multivariate Gaussian model consists in considering all mean values equal and all covariances equal.
+In such a case, however, the complexity shows interesting features depending on the correlation among
+micro-variables (as summarized in Figure 1). For a trivariate model with only two correlations the
+information geometric complexity ratio exhibits a non monotonic behavior in ρ (correlation parameter)
+taking zero value at the extrema of the range of ρ. In contrast, for closed configurations (bivariate
+and trivariate models with all micro-variables correlated with each other) the complexity ratio exhibits a
+monotonic behavior in terms of the correlation parameter. The fact that in such a case this ratio cannot
+be zero at the extrema of the range of ρ is reminiscent of the geometric frustration phenomenon that
+occurs in the presence of loops [11].
+Specifically, recall that a geometrically frustrated system cannot simultaneously minimize all
+interactions because of geometric constraints [11,18]. For example, geometric frustration can occur
+in an Ising model which is an array of spins (for instance, atoms that can take states ±1) that are
+magnetically coupled to each other. If one spin is, say, in the +1 state then it is energetically favorable
+for its immediate neighbors to be in the same state in the case of a ferromagnetic model. On the
+contrary, in antiferromagnetic systems, nearest neighbor spins want to align in opposite directions.
+This rule can be easily satisfied on a square. However, due to geometrical frustration, it is not possible
+to satisfy it on a triangle: for an antiferromagnetic triangular Ising model, any three neighboring spins
+are frustrated. Geometric frustration in triangular Ising models can be observed by considering spin
+configurations with total spin J = ±1 and analyzing the fluctuations in energy of the spin system as
+a function of temperature. There is no peak at all in the standard deviation of the energy in the case
+J = −1, and a monotonic behavior is recorded. This indicates that the antiferromagnetic system does
+not have a phase transition to a state with long-range order. Instead, in the case J = +1, a peak in
+the energy fluctuations emerges. This significant change in the behavior of energy fluctuations as a
+function of temperature in triangular configurations of spin systems is a signature of the presence of
+frustrated interactions in the system [19].
+
+In this article, we observe a significant change in the behavior of the information geometric
+complexity ratios as a function of the correlation coefficient in the trivariate Gaussian statistical models.
+Specifically, in the fully connected trivariate case, no peak arises and a monotonic behavior in ρ of
+the information geometric complexity ratio is observed. In the mildly weak connected trivariate
+case, instead, a peak in the information geometric complexity ratio is recorded at ρpeak ≥ 0. This
+dramatic disparity of behavior can be ascribed to the fact that when carrying out statistical inferences
+with positively correlated Gaussian random variables, the maximum entropy favorable scenario is
+incompatible with these working hypotheses. Thus, the system appears frustrated.
+These considerations lead us to conclude that we have uncovered a very interesting information
+geometric resemblance of the more standard geometric frustration effect in Ising spin models. However,
+for a conclusive claim of the existence of an information geometric analog of the frustration effect, we
+feel we have to further deepen our understanding. A forthcoming research project along these lines
+will be a detailed investigation of both arbitrary triangular and square configurations of correlated
+Gaussian random variables where we take into consideration both the presence of different intensities
+and signs of pairwise interactions (ρij ̸= ρik if j ̸= k, ∀i).
+
+Acknowledgments: Domenico Felice and Stefano Mancini acknowledge the financial support of the Future
+and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the
+European Commission, under the FET-Open grant agreement TOPDRIM, number FP7-ICT-318121.
+
+Author Contributions: The authors have equally contributed to the paper. All authors read and approved the
+final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1. Feldman, D.F.; Crutchfield, J.P. Measures of Statistical Complexity: Why? Phys. Lett. A 1998, 238, 244–252.
+2. Kolmogorov, A.N. A new metric invariant of transitive dynamical systems and of automorphism of Lebesgue
+spaces. Doklady Akademii Nauk SSSR 1958, 119, 861–864.
+3. Caticha, A. Entropic Dynamics. Bayesian Inference and Maximum Entropy Methods in Science and Engineering,
+the 22nd International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and
+Engineering, Moscow, Idaho, 3-7 August 2002; Fry, R.L., Ed.; American Institute of Physics: College Park,
+MD, USA, 2002; Volume 617, p. 302.
+4. Caticha, A.; Preuss, R. Maximum entropy and Bayesian data analysis: Entropic prior distributions. Phys. Rev.
+E 2004, 70, 046127.
+5. Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: New York, NY, USA, 2000.
+6. Cafaro, C. Works on an information geometrodynamical approach to chaos. Chaos Solitons Fractals 2009, 41,
+886–891.
+7. Pettini, M. Geometry and Topology in Hamiltonian Dynamics and Statistical Mechanics; Springer-Verlag:
+Berlin/Heidelberg, Germany, 2007.
+8. Lebowitz, J.L. Microscopic Dynamics and Macroscopic Laws. Ann. N. Y. Acad. Sci. 1981, 373, 220–233.
+9. Shibata, T.; Chawanya, T.; Chawanya, K. Noiseless Collective Motion out of Noisy Chaos. Phys. Rev. Lett.
+1999, 82, doi: http://dx.doi.org/10.1103/PhysRevLett.82.4424.
+10. Ali, S.A.; Cafaro, C.; Kim, D.-H.; Mancini, S. The effect of the microscopic correlations on the information
+geometric complexity of Gaussian statistical models. Physica A 2010, 389, 3117–3127.
+11. Sadoc, J.F.; Mosseri, R. Geometrical Frustration; Cambridge University Press: Cambridge, UK, 2006.
+12. Lee, J.M. Riemannian Manifolds: An Introduction to Curvature; Springer: New York, NY, USA, 1997.
+13. Do Carmo, M.P. Riemannian Geometry; Springer: New York, NY, USA, 1992.
+14. Cafaro, C.; Ali, S.A. Jacobi fields on statistical manifolds of negative curvature. Physica D 2007, 234, 70–80.
+15. Cafaro, C.; Giffin, A.; Ali, S.A.; Kim, D.-H. Reexamination of an information geometric construction of
+entropic indicators of complexity. Appl. Math. Comput. 2010, 217, 2944–2951.
+16. Cafaro, C.; Mancini, S. Quantifying the complexity of geodesic paths on curved statistical manifolds through
+information geometric entropies and Jacobi fields. Physica D 2011, 240, 607–618.
+
+17. Felice, D.; Mancini, S.; Pettini, M. Quantifying Networks Complexity from Information Geometry Viewpoint.
+J. Math. Phys. 2014, 55, 043505.
+18. Moessner, R.; Ramirez, A.P. Geometrical Frustration. Phys. Today 2006, 59, 24–29.
+19. MacKay, D.J.C. Information Theory, Inference, and Learning Algorithms; Cambridge University Press: Cambridge,
+UK, 2003.
+20. Brudno, A.A. The complexity of the trajectories of a dynamical system. Uspekhi Mat. Nauk 1978, 33, 207–208.
+21. Alekseev, V.M.; Yacobson, M.V. Symbolic dynamics and hyperbolic dynamic systems. Phys. Rep. 1981, 75,
+290–325.
+22. Myung, J.; Balasubramanian, V.; Pitt, M.A. Counting probability distributions: differential geometry and
+model selection. Proc. Natl. Acad. Sci. USA 2000, 97, 11170.
+23. Rodriguez, C.C. The volume of bitnets. AIP Conf. Proc. 2004, 735, 555–564.
+24. Motter, A.E.; Albert, R. Networks in motion. Phys. Today 2012, 65, 43–48.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+Article
+Learning from Complex Systems: On the Roles of
+Entropy and Fisher Information in Pairwise Isotropic
+Gaussian Markov Random Fields
+
+Alexandre Levada
+
+Computing Department, Federal University of São Carlos, Rod. Washington Luiz, km 235, São Carlos, SP, Brazil;
+E-Mail: alexandre@dc.ufscar.br
+
+Received: 4 December 2013 / Accepted: 30 January 2014 / Published: 18 February 2014
+
+Abstract: Markov random field models are powerful tools for the study of complex systems.
+However, little is known about how the interactions between the elements of such systems are
+encoded, especially from an information-theoretic perspective. In this paper, our goal is to
+shed light on the connection between Fisher information, Shannon entropy, information geometry and
+the behavior of complex systems modeled by isotropic pairwise Gaussian Markov random fields.
+We propose analytical expressions to compute local and global versions of these measures using
+Besag’s pseudo-likelihood function, characterizing the system’s behavior through its Fisher curve, a
+parametric trajectory across the information space that provides a geometric representation for the
+study of complex systems in which temperature deviates from infinity. Computational experiments
+show how the proposed tools can be useful in extracting relevant information from complex patterns.
+The obtained results quantify and support our main conclusion: in terms of information,
+moving towards higher entropy states (A → B) is different from moving towards lower entropy states
+(B → A), since the Fisher curves are not the same, given a natural orientation (the direction of time).
+
+Keywords: Markov random fields; information theory; Fisher information; entropy; maximum
+pseudo-likelihood estimation
+
+1. Introduction
+
+With the increasing value of information in modern society and the massive volume of digital
+data that is available, there is an urgent need for developing novel methodologies for data filtering and
+analysis in complex systems. In this scenario, the notion of what is informative or not is a top priority.
+Sometimes, patterns that at first may appear to be locally irrelevant may turn out to be extremely
+informative in a more global perspective. In complex systems, this is a direct consequence of the
+intricate non-linear relationship between the pieces of data along different locations and scales.
+Within this context, information theoretic measures play a fundamental role in a huge variety of
+applications, since they represent statistical knowledge in a systematic, elegant and formal framework.
+Since the first works of Shannon [1], and later with many other generalizations [2–4], the concept of
+entropy has been adapted and successfully applied to almost every field of science, among which we
+can cite physics [5], mathematics [6–8], economics [9] and, fundamentally, information theory [10–12].
+Similarly, the concept of Fisher information [13,14] has been shown to reveal important properties of
+statistical procedures, from lower bounds on estimation methods [15–17] to information geometry [18,19].
+Roughly speaking, Fisher information can be thought of as the likelihood analog of entropy, which is a
+probability-based measure of uncertainty.
+In general, classical statistical inference is focused on capturing information about location and
+dispersion of unknown parameters of a given family of distribution and studying how this information
+is related to uncertainty in estimation procedures. In typical situations, an exponential family of
+
+Entropy 2014, 16, 1002–1036; doi:10.3390/e16021002
+
+distributions and an independence hypothesis (independent random variables) are often assumed, giving
+the likelihood function a series of desirable mathematical properties [15–17].
+Although mathematically convenient for many problems, in complex systems modeling, the
+independence assumption is not reasonable, because much of the information is somehow encoded
+in the relations between the random variables [20,21]. In order to overcome this limitation, Markov
+random field (MRF) models appear to be a natural generalization of the classical approach by the
+replacement of the independence assumption by a more realistic conditional independence assumption.
+Basically, in every MRF, knowledge of a finite-support neighborhood around a given variable isolates it
+from all the remaining variables. A further simplification consists in considering a pairwise interaction
+model, constraining the size of the maximum clique to be two (in other words, the model captures
+only binary relationships). Moreover, if the MRF model is isotropic, which means that the parameter
+controlling the interactions between neighboring variables is invariant to change in the directions,
+all the information regarding the spatial dependence structure of the system is conveyed by a single
+parameter, from now on denoted by β (or simply, the inverse temperature).
+In this paper, we assume an isotropic pairwise Gaussian Markov random field (GMRF) model [22,23],
+also known as an auto-normal model or a conditional auto-regressive model [24,25]. Basically, the question
+that motivated this work and that we are trying to elucidate here is: What kind of information is encoded
+by the β parameter in such a model? We want to know how this parameter, and as a consequence, the
+whole spatial dependence structure of a complex system modeled by a Gaussian Markov random field, is
+related to both local and global information theoretic measures, more precisely the observed and expected
+Fisher information, as well as self-information and Shannon entropy.
+In searching for answers for our fundamental question, investigations led us to an exact expression
+for the asymptotic variance of the maximum pseudo-likelihood (MPL) estimator of β in an isotropic
+pairwise GMRF model, suggesting that asymptotic efficiency is not guaranteed. In the context of statistical
+data analysis, Fisher information plays a central role in providing tools and insights for modeling
+the interactions between complex systems and their components. The advantage of MRF models
+over the traditional statistical ones is that MRFs take into account the dependence between pieces of
+information as a function of the system’s temperature, which may even be variable along time. Briefly
+speaking, this investigation aims to explore ways to measure and quantify distances between complex
+systems operating in different thermodynamical conditions. By analyzing and comparing the behavior
+of local patterns observed throughout the system (defined over a regular 2D lattice), it is possible
+to measure how informative those patterns are for a given inverse temperature, β (which
+encodes the expected global behavior).
+In summary, our idea is to describe the behavior of a complex system in terms of information
+as its temperature deviates from infinity (when the particles are statistically independent) to a lower
+bound. The obtained results suggest that, in the beginning, when the temperature is infinite and the
+information equilibrium prevails, the information is somehow spread along the system. However,
+when temperature is low and this equilibrium condition does not hold anymore, we have a more
+sparse representation in terms of information, since this information is concentrated in the boundaries
+of the regions that define a smooth global configuration. In the vast remainder of this “universe”, due
+to this smooth constraint, the strong alignment between the particles prevails, which is exactly the
+expected global behavior for temperatures below a critical value (making the majority of the interaction
+patterns along the system uninformative).
+The remainder of the paper is organized as follows: Section 2 discusses a technique for the
+estimation of the inverse temperature parameter, called the maximum pseudo-likelihood (MPL)
+approach, and provides derivations for the observed Fisher information in an isotropic pairwise GMRF
+model. Intuitive interpretations for the two versions of this local measure are discussed. In Section 3,
+we derive analytical expressions for the computation of the expected Fisher information, which allows
+us to assign a global information measure for a given system configuration. Similarly, in Section 4, an
+expression for the global entropy of a system modeled by a GMRF is shown. The results suggest a
+connection between maximum pseudo-likelihood and minimum entropy criteria in the estimation of
+the inverse temperature parameter on GMRFs. Section 5 discusses the uncertainty in the estimation
+of this important parameter by defining an expression for the asymptotic variance of its maximum
+pseudo-likelihood estimator in terms of both forms of Fisher information. In Section 6, the definition
+of the Fisher curve of a system as a parametric trajectory in the information space is proposed. Section
+7 shows the experimental setup. Computational simulations with both Markov chain Monte Carlo
+algorithms and some real data were conducted, showing how the proposed tools can be used to extract
+relevant information from complex systems. Finally, Section 8 presents our conclusions, final remarks
+and possibilities for future works.
+
+2. Fisher Information in Isotropic Pairwise GMRFs
+
+The remarkable Hammersley–Clifford theorem [26] states the equivalence between Gibbs random
+fields (GRF) and Markov random fields (MRF), which implies that any MRF can be defined either in
+terms of a global (joint Gibbs distribution) or a local (set of local conditional density functions) model.
+For our purposes, we will choose the latter representation.
+
+Definition 1. An isotropic pairwise Gaussian Markov random field regarding a local neighborhood system,
+ηi, defined on a lattice S = {s1, s2, . . . , sn} is completely characterized by a set of n local conditional density
+functions p(xi|ηi,⃗θ), given by:
+
+$$p\left(x_i \mid \eta_i, \vec{\theta}\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2 \right\} \qquad (1)$$
+
+with ⃗θ = (μ, σ², β), where μ and σ² are the expected value and the variance of the random variables,
+and β = 1/T is the parameter that controls the interaction between the variables (inverse temperature).
+Note that, for β = 0, the model degenerates to the usual Gaussian distribution. From an information
+geometry perspective [18,19], this means that we are constrained to a sub-manifold within the
+Riemannian manifold of probability distributions, where the natural Riemannian metric (tensor)
+is given by the Fisher information. It has been shown that the geometric structure of exponential
+family distributions exhibits constant curvature. However, little is known about information geometry
+on more general statistical models, such as GMRFs. For β > 0, some degree of correlation between
+the observations is expected, making the interactions grow stronger. Typical choices for ηi are the
+first and second order non-causal neighborhood systems, defined by the sets of four and eight nearest
+neighbors, respectively.
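+
+As a minimal sketch of Equation (1), the local conditional density can be evaluated directly from the value at a site and the values of its neighbors. The function name and argument layout below are illustrative assumptions, not part of the original paper.
+
```python
import math

def local_conditional_density(x_i, neighbors, mu, sigma2, beta):
    """Equation (1): density of x_i given the values in its neighborhood.

    `neighbors` is a plain list of neighbor values (hypothetical layout).
    """
    # Conditional mean: mu shifted by beta times the summed neighbor deviations.
    cond_mean = mu + beta * sum(x_j - mu for x_j in neighbors)
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    return norm * math.exp(-((x_i - cond_mean) ** 2) / (2.0 * sigma2))

# With beta = 0 the neighbors are ignored and the usual Gaussian density is recovered.
p_independent = local_conditional_density(0.0, [1.0, -2.0, 0.5, 3.0],
                                          mu=0.0, sigma2=1.0, beta=0.0)
# With beta > 0 the density peaks where x_i matches the neighbor-driven conditional mean.
p_coupled = local_conditional_density(0.5, [1.0, 1.0, 1.0, 1.0],
                                      mu=0.0, sigma2=1.0, beta=0.125)
```
+
+In the second call the conditional mean is 0.125 × 4 = 0.5, so the observation sits exactly at the mode of its local density.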
+
+2.1. Maximum Pseudo-Likelihood Estimation
+
+Maximum likelihood estimation is intractable in MRF parameter estimation, due to the existence
+of the partition function in the joint Gibbs distribution. An alternative, proposed by Besag [24], is
+maximum pseudo-likelihood estimation, which is based on the conditional independence principle.
+The pseudo-likelihood function is defined as the product of the local conditional density functions (LCDFs) for all the n variables of the
+system, modeled as a random field.
+
+Definition 2. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the set corresponding to the observations at time
+t, the pseudo-likelihood function of the model is defined by:
+
+$$L\left(\vec{\theta}; X^{(t)}\right) = \prod_{i=1}^{n} p\left(x_i \mid \eta_i, \vec{\theta}\right) \qquad (2)$$
+
+Note that the pseudo-likelihood function is a function of the parameters. For better mathematical
+tractability, it is usual to take the logarithm of L(⃗θ; X(t)). Plugging Equation (1) into Equation (2) and
+taking the logarithm leads to:
+
+$$\log L\left(\vec{\theta}; X^{(t)}\right) = -\frac{n}{2} \log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2 \qquad (3)$$
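+
+The log pseudo-likelihood of Equation (3) can be sketched for a small 2D lattice as follows. The first-order (four-neighbor) system comes from the text, but the periodic (toroidal) boundary handling and the function names are assumptions made only for this illustration.
+
```python
import math

def neighbor_sum(field, r, c, mu):
    """Sum of (x_j - mu) over the four nearest neighbors of site (r, c).

    Periodic wrap-around at the borders is an assumption of this sketch.
    """
    rows, cols = len(field), len(field[0])
    offsets = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return sum(field[(r + dr) % rows][(c + dc) % cols] - mu for dr, dc in offsets)

def log_pseudo_likelihood(field, mu, sigma2, beta):
    """Equation (3) evaluated on a list-of-lists lattice."""
    rows, cols = len(field), len(field[0])
    n = rows * cols
    squared_residuals = 0.0
    for r in range(rows):
        for c in range(cols):
            resid = field[r][c] - mu - beta * neighbor_sum(field, r, c, mu)
            squared_residuals += resid ** 2
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - squared_residuals / (2.0 * sigma2)

# A field that equals mu everywhere has zero residuals, so only the first term remains.
flat = [[1.0] * 3 for _ in range(3)]
flat_value = log_pseudo_likelihood(flat, mu=1.0, sigma2=1.0, beta=0.2)

# Perturbing one site introduces nonzero residuals and lowers the value.
bumped = [row[:] for row in flat]
bumped[1][1] = 2.0
bumped_value = log_pseudo_likelihood(bumped, mu=1.0, sigma2=1.0, beta=0.2)
```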
+
+By differentiating Equation (3) with respect to each parameter and properly solving the
+pseudo-likelihood equations, we obtain the following maximum pseudo-likelihood estimators for the
+parameters, μ, σ2 and β:
+
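+$$\hat{\beta}_{MPL} = \frac{\sum_{i=1}^{n} \left[ (x_i - \mu) \sum_{j \in \eta_i} (x_j - \mu) \right]}{\sum_{i=1}^{n} \left[ \sum_{j \in \eta_i} (x_j - \mu) \right]^2} \qquad (4)$$
+
+$$\hat{\mu}_{MPL} = \frac{1}{n (1 - k\beta)} \sum_{i=1}^{n} \left[ x_i - \beta \sum_{j \in \eta_i} x_j \right] \qquad (5)$$
+
+$$\hat{\sigma}^2_{MPL} = \frac{1}{n} \sum_{i=1}^{n} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} (x_j - \mu) \right]^2 \qquad (6)$$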
+
+where k denotes the cardinality of the non-causal neighborhood set ηi. Note that if β = 0, the MPL
+estimators of both μ and σ2 become the widely known sample mean and sample variance.
+Since the cardinality of the neighborhood system, k = |ηi|, is spatially invariant (we are assuming
+a regular neighborhood system) and each variable is dependent on a fixed number of neighbors on a
+lattice, ˆβMPL can be rewritten in terms of cross-covariances:
+
+$$\hat{\beta}_{MPL} = \frac{\sum_{j \in \eta_i} \hat{\sigma}_{ij}}{\sum_{j \in \eta_i} \sum_{k \in \eta_i} \hat{\sigma}_{jk}} \qquad (7)$$
+
+where ˆσij denotes the sample covariance between the central variable, xi, and xj ∈ ηi. Similarly, ˆσjk
+denotes the sample covariance between two variables belonging to the neighborhood system, ηi (the
+definition of the neighborhood system, ηi, does not include the location, si).
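+
+The estimator in Equation (4) can be sketched in a few lines for a known mean μ. The four-neighbor system follows the text, while the periodic boundary handling, the helper names and the two toy fields are assumptions chosen so that the result is easy to verify by hand.
+
```python
def neighbor_deviation_sum(field, r, c, mu):
    """Sum of (x_j - mu) over the four nearest neighbors (periodic borders assumed)."""
    rows, cols = len(field), len(field[0])
    offsets = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return sum(field[(r + dr) % rows][(c + dc) % cols] - mu for dr, dc in offsets)

def beta_mpl(field, mu):
    """Equation (4): maximum pseudo-likelihood estimate of beta for a known mean mu."""
    numerator = 0.0
    denominator = 0.0
    for r in range(len(field)):
        for c in range(len(field[0])):
            s = neighbor_deviation_sum(field, r, c, mu)
            numerator += (field[r][c] - mu) * s
            denominator += s * s
    if denominator == 0.0:
        raise ValueError("all neighborhood sums are zero; beta is not identifiable")
    return numerator / denominator

# Every site agrees with its four neighbors: the estimate is positive.
aligned = [[1.0] * 4 for _ in range(4)]
# Every site is the opposite of its four neighbors: the estimate is negative.
checkerboard = [[1.0 if (r + c) % 2 == 0 else -1.0 for c in range(4)] for r in range(4)]

beta_aligned = beta_mpl(aligned, mu=0.0)
beta_checker = beta_mpl(checkerboard, mu=0.0)
```
+
+For the aligned field each neighborhood sum equals 4, so the ratio is 4n/16n = 0.25; the checkerboard flips the sign of the numerator only.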
+
+2.2. Fisher Information of Spatial Dependence Parameters
+
+Basically, Fisher information measures the amount of information a sample conveys about
+an unknown parameter. It can be thought of as the likelihood analog of entropy, which is a
+probability-based measure of uncertainty. Often, when we are dealing with independent and identically
+distributed (i.i.d.) random variables, the computation of the global Fisher information present in
+a random sample $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ is quite straightforward, since each observation, xi,
+i = 1, 2, . . . , n, brings exactly the same amount of information (when we are dealing with independent
+samples, the superscript, t, is usually suppressed, since the underlying dependence structure does
+not change through time). However, this is not true for spatial dependence parameters in MRFs,
+since different configuration patterns (xi ∪ ηi) provide distinct contributions to the local observed
+Fisher information, which can be used to derive a reasonable approximation to the global Fisher
+information [27].
+
+2.3. The Information Equality
+
+It is widely known from statistical inference theory that, under certain regularity conditions,
+information equality holds in the case of independent observations in the exponential family [15–17].
+In other words, we can compute the Fisher information of a random sample regarding a parameter of
+interest, θ, by:
+
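+$$I\left(\theta; X^{(t)}\right) = E\left[ \left( \frac{\partial}{\partial\theta} \log L\left(\theta; X^{(t)}\right) \right)^2 \right] = -E\left[ \frac{\partial^2}{\partial\theta^2} \log L\left(\theta; X^{(t)}\right) \right] \qquad (8)$$
+
+where $L\left(\theta; X^{(t)}\right)$ denotes the likelihood function at a time instant, t. In our investigations, to avoid the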
+joint Gibbs distribution, often intractable due to the presence of the partition function (global Gibbs
+field), we replace the usual likelihood function by Besag’s pseudo-likelihood function, and then, we
+work with the local model instead (local Markov field).
+However, given the intrinsic spatial dependence structure of Gaussian Markov random field
+models, information equilibrium is not a natural condition. As we will discuss later, in general,
+information equality fails. Thus, in a GMRF model, we have to consider two kinds of Fisher
+information, from now on denoted by Type I (due to the first derivative of the pseudo-likelihood
+function) and Type II (due to the second derivative of the pseudo-likelihood function). Eventually,
+when certain conditions are satisfied, these two values of information will converge to a unique bound.
+Essentially, β is the parameter responsible for controlling whether both forms of information converge or
+diverge. Knowing the role of β (inverse temperature) in a GMRF model, it is expected that for β = 0
+(or T → ∞), information equilibrium prevails. In fact, we will see in the following sections that as β
+deviates from zero (and long-term correlations start to emerge), the divergence between the two kinds
+of information increases.
+In terms of information geometry, it has been shown that the geometric structure of the exponential
+family of distributions is basically given by the Fisher information matrix, which is the natural
+Riemannian metric (metric tensor) [18,19]. So, when the inverse temperature parameter is zero, the
+geometric structure of the model is a surface since the parametric space is 2D (μ and σ²). However,
+as the inverse temperature parameter starts to increase, the original surface is gradually transformed
+to a 3D Riemannian manifold, equipped with a novel metric tensor (the 3 × 3 Fisher information
+matrix for μ, σ2 and β). In this context, by measuring the Fisher information regarding the inverse
+temperature parameter along an interval ranging from βMIN = A = 0 to βMAX = B, we are essentially
+trying to capture part of the deformation in the geometric structure of the model. In this paper, we
+focus on the computation of this measure. In future works we expect to derive the complete Fisher
+information matrix in order to completely characterize the transformations in the metric tensor.
+
+2.4. Observed Fisher Information
+
+In order to quantify the amount of information conveyed by a local configuration pattern in a
+complex system, the concept of observed Fisher information must be defined.
+
+Definition 3. Consider an MRF defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood system, ηi. The
+Type I local observed Fisher information for the observation, xi, regarding the spatial dependence parameter, β, is
+defined in terms of its local conditional density function as:
+
+$$\phi_\beta(x_i) = \left[ \frac{\partial}{\partial\beta} \log p\left(x_i \mid \eta_i, \vec{\theta}\right) \right]^2 \qquad (9)$$
+
+Hence, for an isotropic pairwise GMRF model, the Type I local observed Fisher information
+regarding β for the observation, xi, is given by:
+
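+$$\begin{aligned}
+\phi_\beta(x_i) &= \frac{1}{\sigma^4} \left\{ \left[ x_i - \mu - \beta \sum_{j \in \eta_i} (x_j - \mu) \right] \left[ \sum_{j \in \eta_i} (x_j - \mu) \right] \right\}^2 \\
+&= \frac{1}{\sigma^4} \left[ \sum_{j \in \eta_i} (x_i - \mu)(x_j - \mu) - \beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} (x_j - \mu)(x_k - \mu) \right]^2 \qquad (10)
+\end{aligned}$$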
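+
+As a check on Equation (10), the score with respect to β can be written out in one step from Equation (1). The shorthand $S_i$ for the summed neighbor deviations is introduced here only for compactness and is not part of the original notation.
+
+$$S_i = \sum_{j \in \eta_i} (x_j - \mu), \qquad \log p\left(x_i \mid \eta_i, \vec{\theta}\right) = -\frac{1}{2} \log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \left( x_i - \mu - \beta S_i \right)^2$$
+
+$$\frac{\partial}{\partial\beta} \log p\left(x_i \mid \eta_i, \vec{\theta}\right) = \frac{1}{\sigma^2} \left( x_i - \mu - \beta S_i \right) S_i$$
+
+Squaring this score produces the $1/\sigma^4$ prefactor and the bracketed product that appear in Equation (10).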
+
+Definition 4. Consider an MRF defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood system, ηi. The
+Type II local observed Fisher information for the observation, xi, regarding the spatial dependence parameter, β,
+is defined in terms of its local conditional density function as:
+
+$$\psi_\beta(x_i) = -\frac{\partial^2}{\partial\beta^2} \log p\left(x_i \mid \eta_i, \vec{\theta}\right) \qquad (11)$$
+
+In case of an isotropic pairwise GMRF model, the Type II local observed Fisher information
+regarding β for the observation, xi, is given by:
+
+$$\psi_\beta(x_i) = \frac{1}{\sigma^2} \left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} (x_j - \mu)(x_k - \mu) \right] \qquad (12)$$
+
+Note that ψβ(xi) does not depend on xi, only on the neighborhood system, ηi.
+Therefore, we have two local measures, φβ(xi) and ψβ(xi), that can be assigned to every element of
+a system modeled by an isotropic pairwise GMRF. In the following, we will discuss some interpretations
+for what is being measured with the proposed tools and how to define global versions for these
+measures by means of the expected Fisher information.
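+
+The two local measures can be sketched for a single configuration pattern as follows, using the closed forms for the isotropic pairwise GMRF (Equation (10) for Type I, and the double sum over neighbor deviations for Type II). The function names and the toy patterns are assumptions made for illustration.
+
```python
def type1_local_fisher(x_i, neighbors, mu, sigma2, beta):
    """Type I local observed Fisher information (squared score), Equation (10)."""
    s = sum(x_j - mu for x_j in neighbors)
    return ((x_i - mu - beta * s) * s) ** 2 / sigma2 ** 2

def type2_local_fisher(neighbors, mu, sigma2):
    """Type II local observed Fisher information (negative curvature).

    The double sum over neighbor deviations equals the square of their sum.
    """
    s = sum(x_j - mu for x_j in neighbors)
    return s ** 2 / sigma2

beta = 0.25
agreeing = [1.0, 1.0, 1.0, 1.0]

# The center agrees with what its neighbors predict: zero Type I information.
phi_aligned = type1_local_fisher(1.0, agreeing, mu=0.0, sigma2=1.0, beta=beta)
# The center contradicts its neighbors: large Type I information.
phi_outlier = type1_local_fisher(-1.0, agreeing, mu=0.0, sigma2=1.0, beta=beta)
# The curvature depends on the neighbors only, not on the center value.
psi_agreeing = type2_local_fisher(agreeing, mu=0.0, sigma2=1.0)
# Neighbors symmetric around the mean carry no curvature at all.
psi_symmetric = type2_local_fisher([1.0, 1.0, -1.0, -1.0], mu=0.0, sigma2=1.0)
```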
+
+2.5. The Role of Fisher Information in GMRF Models
+
+At this point, a relevant issue is the interpretation of these Fisher information measures in a
+complex system modeled by an isotropic pairwise GMRF. Roughly speaking, φβ(xi) is the quadratic
+rate of change of the logarithm of the local likelihood function at xi, given a global value of β. As
+this global value of β determines what would be the expected global behavior (if β is large, a high
+degree of correlation among the observations is expected and if β is close to zero, the observations are
+independent), it is reasonable to admit that configuration patterns showing values of φβ(xi) close to
+zero are more likely to be observed throughout the field, since their likelihood values are high (close to
+the maximum local likelihood condition). In other words, these patterns are more “aligned” to what is
+considered to be the expected global behavior, and therefore, they convey little information about the
+spatial dependence structure (these samples are not informative, since they are expected to exist in a
+system operating at that particular value of inverse temperature).
+Now, let us move on to configuration patterns showing high values of φβ(xi). Those samples
+can be considered landmarks, because they convey a large amount of information about the global
+spatial dependence structure. Roughly speaking, those points are very informative, since they are
+not expected to exist for that particular value of β (which guides the expected global behavior of the
+system). Therefore, Type I local observed Fisher information minimization in GMRFs can be a useful
+tool in producing novel configuration patterns that are more likely to exist given the chosen value of
+inverse temperature. Basically, φβ(xi) tells us how informative a given pattern is for that specific global
+behavior (represented by a single parameter in an isotropic pairwise GMRF model). In summary, this
+measure quantifies the degree of agreement between an observation, xi, and the configuration defined
+by its neighborhood system for a given β.
+As we will see later in the experiments section, typical informative patterns (those showing
+high values of φβ(xi)) in an organized system are located at the boundaries of the regions defining
+homogeneous areas (since these boundary samples show an unexpected behavior for large β, which is:
+there is no strong agreement between xi and its neighbors).
+Let us analyze the Type II local observed Fisher information, ψβ(xi). Informally speaking, this
+measure can be interpreted as a curvature measure, that is, how curved is the local likelihood function
+at xi. Thus, patterns showing low values of ψβ(xi) tend to have a nearly flat local likelihood function.
+This means that we are dealing with a pattern that could have been observed for a variety of β values
+(a large set of β values have approximately the same likelihood). An implication of this fact is that
+in a system dominated by this kind of pattern (patterns for which ψβ(xi) is close to zero), small
+perturbations may cause a sharp change in β (and, therefore, in the expected global behavior). In other
+words, these patterns are more susceptible to changes, since they do not have a “stable” configuration
+(it raises our uncertainty about the true value of β).
+On the other hand, if the global configuration is mostly composed of patterns exhibiting large
+values of ψβ(xi), changes in the global structure are unlikely to happen (uncertainty on β is sufficiently
+small). Basically, ψβ(xi) measures the degree of agreement or dependence among the observations
+belonging to the same neighborhood system. If at a given xi, the observations belonging to ηi are
+totally symmetric around the mean value, ψβ(xi) would be zero. This is reasonable to expect, since in this
+situation there is no information about the induced spatial dependence structure (that is,
+there is no contextual information available at this point). Notice that the role of ψβ(xi) is not the same
+as φβ(xi). Actually, these two measures are almost inversely related, since if at xi the value of φβ(xi)
+is high (it is a landmark or boundary pattern), then it is expected that ψβ(xi) will be low (in decision
+boundaries or edges, the uncertainty about β is higher, causing ψβ(xi) to be small). In fact, we will
+observe this behavior in some computational experiments conducted in future sections of the paper.
+It is important to mention that these rather informal arguments define the basis for understanding
+the meaning of the asymptotic variance of maximum pseudo-likelihood estimators, as we will discuss
+ahead. In summary, ψβ(xi) is a measure of how sure or confident we are about the local spatial
+dependence structure (at a given point, xi), since a high average curvature is desired for predicting the
+system’s global behavior in a reasonable manner (reducing the uncertainty of β estimation).
+
+3. Expected Fisher Information
+
+In order to avoid the use of approximations in the computation of the global Fisher information
+in an isotropic pairwise GMRF, in this section, we provide an exact expression for ˆφβ and ˆψβ as Type
+I and Type II expected Fisher information. One advantage of using the expected Fisher information
+instead of its global observed counterpart is the faster computing time. As we will see, instead of
+computing a single local measure for each observation, xi ∈ X, and then taking the average, both
+Φβ and Ψβ expressions depend only on the covariance matrix of the configuration patterns observed
+along the random field.
+
+3.1. The Type I Expected Fisher Information
+
+Recall that the Type I expected Fisher information, from now on denoted by Φβ, is given by:
+
+$$\Phi_\beta = E\left[ \left( \frac{\partial}{\partial\beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right)^2 \right] \qquad (13)$$
+
+The Type II expected Fisher information, from now on denoted by Ψβ, is given by:
+
+$$\Psi_\beta = -E\left[ \frac{\partial^2}{\partial\beta^2} \log L\left(\vec{\theta}; X^{(t)}\right) \right] \qquad (14)$$
+
+We first proceed to the definition of Φβ. Plugging Equation (3) in Equation (13), and after some
+algebra, we obtain the following expression, which is composed of four main terms:
+
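+$$\begin{aligned}
+\Phi_\beta &= \frac{1}{\sigma^4} E\left\{ \left[ \sum_{s=1}^{n} \left( x_s - \mu - \beta \sum_{j \in \eta_s} (x_j - \mu) \right) \left( \sum_{j \in \eta_s} (x_j - \mu) \right) \right]^2 \right\} \qquad (15) \\
+&= \frac{1}{\sigma^4} E\Bigg\{ \sum_{s=1}^{n} \sum_{r=1}^{n} \left[ x_s - \mu - \beta \sum_{j \in \eta_s} (x_j - \mu) \right] \left[ x_r - \mu - \beta \sum_{k \in \eta_r} (x_k - \mu) \right] \\
+&\qquad \times \left[ \sum_{j \in \eta_s} (x_j - \mu) \right] \left[ \sum_{k \in \eta_r} (x_k - \mu) \right] \Bigg\} \\
+&= \frac{1}{\sigma^4} E\Bigg\{ \sum_{s=1}^{n} \sum_{r=1}^{n} \Bigg[ (x_s - \mu)(x_r - \mu) - \beta \sum_{k \in \eta_r} (x_s - \mu)(x_k - \mu) - \beta \sum_{j \in \eta_s} (x_r - \mu)(x_j - \mu) \\
+&\qquad + \beta^2 \sum_{j \in \eta_s} \sum_{k \in \eta_r} (x_j - \mu)(x_k - \mu) \Bigg] \left[ \sum_{j \in \eta_s} \sum_{k \in \eta_r} (x_j - \mu)(x_k - \mu) \right] \Bigg\} \\
+&= \frac{1}{\sigma^4} \sum_{s=1}^{n} \sum_{r=1}^{n} \Bigg\{ \sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ (x_s - \mu)(x_r - \mu)(x_j - \mu)(x_k - \mu) \right] \\
+&\qquad - \beta \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ (x_s - \mu)(x_j - \mu)(x_k - \mu)(x_l - \mu) \right] \\
+&\qquad - \beta \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ (x_r - \mu)(x_m - \mu)(x_j - \mu)(x_k - \mu) \right] \\
+&\qquad + \beta^2 \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ (x_m - \mu)(x_j - \mu)(x_k - \mu)(x_l - \mu) \right] \Bigg\}
+\end{aligned}$$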
+
+Hence, the expression for Φβ is composed of four main terms, each one of them involving
+a summation of higher-order cross-moments. According to Isserlis’ theorem [28], for normally
+distributed random variables, we can compute higher order moments in terms of the covariance
+matrix through the following identity:
+
+$$E[X_1 X_2 X_3 X_4] = E[X_1 X_2]\, E[X_3 X_4] + E[X_1 X_3]\, E[X_2 X_4] + E[X_2 X_3]\, E[X_1 X_4] \qquad (16)$$
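+
+Isserlis' identity for zero-mean Gaussian variables can be sketched as a small helper that reads the three pairings off a covariance matrix. The function name and the 2 × 2 covariance matrix below are hypothetical and serve only as a sanity check against well-known Gaussian moments.
+
```python
def isserlis_fourth_moment(cov, a, b, c, d):
    """E[X_a X_b X_c X_d] for zero-mean jointly Gaussian variables, Equation (16).

    `cov` is a covariance matrix given as a list of lists; a, b, c, d are indices.
    """
    return (cov[a][b] * cov[c][d]
            + cov[a][c] * cov[b][d]
            + cov[b][c] * cov[a][d])

# Hypothetical covariance matrix of two zero-mean Gaussian variables.
cov = [[2.0, 0.5],
       [0.5, 1.0]]

# All four indices equal: the familiar E[X^4] = 3 * sigma^4 = 3 * 2^2.
fourth_moment_x0 = isserlis_fourth_moment(cov, 0, 0, 0, 0)
# Mixed moment E[X0^2 X1^2] = var0 * var1 + 2 * cov01^2.
mixed_moment = isserlis_fourth_moment(cov, 0, 0, 1, 1)
```
+
+The same three-pairing pattern is what turns each fourth-order cross-moment in Equation (15) into a sum of products of covariances.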
+
+Then, the first term of Equation (15) is reduced to:
+
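+$$\begin{aligned}
+&\sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ (x_s - \mu)(x_r - \mu)(x_j - \mu)(x_k - \mu) \right] \qquad (17) \\
+&\quad = \sum_{j \in \eta_s} \sum_{k \in \eta_r} \Big\{ E\left[ (x_s - \mu)(x_r - \mu) \right] E\left[ (x_j - \mu)(x_k - \mu) \right] \\
+&\qquad\quad + E\left[ (x_s - \mu)(x_j - \mu) \right] E\left[ (x_r - \mu)(x_k - \mu) \right] \\
+&\qquad\quad + E\left[ (x_r - \mu)(x_j - \mu) \right] E\left[ (x_s - \mu)(x_k - \mu) \right] \Big\} \\
+&\quad = \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left[ \sigma_{sr}\sigma_{jk} + \sigma_{sj}\sigma_{rk} + \sigma_{rj}\sigma_{sk} \right]
+\end{aligned}$$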
+
+where σsr denotes the covariance between variables xs and xr (note that in an MRF, we have σsr = 0 if
+xr ∉ ηs). We now proceed to the expansion of the second main term of Equation (15). Similarly, by
+applying Isserlis’ identity, we have:
+
+\[
+\sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ (x_s - \mu)(x_j - \mu)(x_k - \mu)(x_l - \mu) \right]
+= \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left( \sigma_{sj}\sigma_{kl} + \sigma_{sk}\sigma_{jl} + \sigma_{jk}\sigma_{sl} \right) \tag{18}
+\]
+
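+The third term of Equation (15) can be rewritten as:
+
+\[
+\sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ (x_r - \mu)(x_m - \mu)(x_j - \mu)(x_k - \mu) \right]
+= \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( \sigma_{rm}\sigma_{jk} + \sigma_{rj}\sigma_{mk} + \sigma_{mj}\sigma_{rk} \right) \tag{19}
+\]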
+
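+Finally, the fourth term is:
+
+\[
+\sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ (x_m - \mu)(x_j - \mu)(x_k - \mu)(x_l - \mu) \right]
+= \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left( \sigma_{mj}\sigma_{kl} + \sigma_{mk}\sigma_{jl} + \sigma_{ml}\sigma_{jk} \right) \tag{20}
+\]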
+
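+Therefore, by combining Equations (17)–(20), we have the complete expression for Φβ,
+the Type I expected Fisher information for an isotropic pairwise GMRF model regarding the inverse
+temperature parameter, as:
+
+\[
+\begin{aligned}
+\Phi_\beta = \frac{1}{\sigma^4} \sum_{s=1}^{n} \sum_{r=1}^{n} \Bigg\{ & \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( \sigma_{sr}\sigma_{jk} + \sigma_{sj}\sigma_{rk} + \sigma_{rj}\sigma_{sk} \right) \\
+& - \beta \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left( \sigma_{sj}\sigma_{kl} + \sigma_{sk}\sigma_{jl} + \sigma_{jk}\sigma_{sl} \right) \\
+& - \beta \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( \sigma_{rm}\sigma_{jk} + \sigma_{rj}\sigma_{mk} + \sigma_{mj}\sigma_{rk} \right) \\
+& + \beta^2 \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left( \sigma_{mj}\sigma_{kl} + \sigma_{mk}\sigma_{jl} + \sigma_{ml}\sigma_{jk} \right) \Bigg\}
+\end{aligned}
+\tag{21}
+\]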
+
+However, since we are interested in studying how the spatial correlations change as the system evolves,
+we need to estimate a value for Φβ given a single global state $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$. Hence, to
+compute Φβ from a single static configuration X(t) (a photograph of the system at a given moment),
+we consider n = 1 in the previous equation, which means, among other things, that s = r (which
+implies ηs = ηr) and that observations belonging to different local neighborhoods are independent
+from each other (as we are dealing with a pairwise interaction Markovian process, it does not make
+sense to model the interactions between variables that are far away from each other in the lattice).
+Before proceeding, we would like to clarify some points regarding the estimation of the β
+parameter and the computation of the expected Fisher information in the isotropic pairwise GMRF
+model. Basically, there are two main possibilities: (1) the parameter is spatially-invariant, which
+means that we have a unique value, ˆβ(t), for a global configuration of the system, X(t) (this is our
+assumption); or (2) the parameter is spatially-variant, which means that we have a set of ˆβs values,
+for s = 1, 2, . . . , n, each one of them estimated from $X_s = \{x_s^{(1)}, x_s^{(2)}, \ldots, x_s^{(t)}\}$ (we are observing the
+outcomes of a random pattern along time in a fixed position of the lattice). When we are dealing with
+the first model (β is spatially-invariant), all possible observation patterns (samples) are extracted from
+the global configuration by a sliding window (with the shape of the neighborhood system) that moves
+through the lattice at a fixed time instant, t. In this case, we are interested in studying the spatial
+correlations, not the temporal ones. In other words, we would like to investigate how the spatial
+structure of a GMRF model is related to Fisher information (this is exactly the scenario described
+above, for which n = 1). Our motivation here is to characterize, via information-theoretic measures,
+the behavior of the system as it evolves from states of minimum entropy to states of maximum entropy
+(and vice versa) by providing a geometrical tool based on the definition of the Fisher curve, which will
+be introduced in the following sections.
+Therefore, in our case (n = 1), Equation (21) is further simplified for practical usage. By unifying
+s and r to a unique index, i, we have a final expression for Φβ in terms of the local covariances between
+the random variables in a given neighborhood system (i.e., for the eight nearest neighbors):
+
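+\[
+\begin{aligned}
+\Phi_\beta = \frac{1}{\sigma^4} \Bigg[ & \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left( \sigma^2 \sigma_{jk} + 2\sigma_{ij}\sigma_{ik} \right)
+- 2\beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \left( \sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk} \right) \\
+& + \beta^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \sum_{m \in \eta_i} \left( \sigma_{jk}\sigma_{lm} + \sigma_{jl}\sigma_{km} + \sigma_{jm}\sigma_{kl} \right) \Bigg]
+\end{aligned}
+\tag{22}
+\]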
+
+Note that we have two types of covariances in the definition of Φβ for an isotropic pairwise GMRF: (1)
+covariances between the central variable, xi, and a neighboring variable, xj, denoted by σij, for j ∈ ηi;
+and (2) covariances between two neighboring variables, xj and xk, for j, k ∈ ηi. In the next sections, we
+will see how to compute the value of Ψβ directly from the covariance matrix of the local patterns.
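The nested sums of Equation (22) can be evaluated literally from the covariance matrix of a local pattern. The sketch below is a minimal, unoptimized illustration; the function name `phi_beta_explicit`, the argument layout and the 3 × 3 matrix `toy_cov` (a pattern with only two neighbors) are assumptions made for the example, and σ² is read from the diagonal entry of the central variable.

```python
from itertools import product


def phi_beta_explicit(cov, center, beta):
    """Type I expected Fisher information via the explicit sums of Equation (22).

    cov    : (K+1) x (K+1) covariance matrix of a local pattern (list of rows)
    center : index of the central variable x_i inside the pattern
    beta   : inverse temperature parameter
    """
    i = center
    var = cov[i][i]  # sigma^2, taken from the central diagonal entry
    nbrs = [j for j in range(len(cov)) if j != i]
    # First double sum: sigma^2 * sigma_jk + 2 * sigma_ij * sigma_ik
    term1 = sum(var * cov[j][k] + 2.0 * cov[i][j] * cov[i][k]
                for j, k in product(nbrs, repeat=2))
    # Triple sum multiplied by -2 * beta
    term2 = sum(cov[i][j] * cov[k][l] + cov[i][k] * cov[j][l] + cov[i][l] * cov[j][k]
                for j, k, l in product(nbrs, repeat=3))
    # Quadruple sum multiplied by beta^2
    term3 = sum(cov[j][k] * cov[l][m] + cov[j][l] * cov[k][m] + cov[j][m] * cov[k][l]
                for j, k, l, m in product(nbrs, repeat=4))
    return (term1 - 2.0 * beta * term2 + beta ** 2 * term3) / var ** 2


# Hypothetical local pattern with two neighbors; the central variable has index 1.
toy_cov = [
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
phi_at_zero = phi_beta_explicit(toy_cov, center=1, beta=0.0)
```

The quadruple sum costs on the order of K⁴ operations per pattern, which is one reason a more compact form of the same quantity is convenient in practice.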
+
+3.2. The Type II Expected Fisher Information
+
+Following the same methodology of replacing the likelihood function by the pseudo-likelihood
+function of the GMRF model, a closed form expression for Ψβ is developed. Plugging Equation (3)
+into Equation (14) leads us to:
+
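+\[
+\begin{aligned}
+\Psi_\beta &= \frac{1}{\sigma^2} \sum_{i=1}^{n} E\Bigg\{ \Bigg[ \sum_{j \in \eta_i} (x_j - \mu) \Bigg]^2 \Bigg\}
+= \frac{1}{\sigma^2} \sum_{i=1}^{n} E\Bigg[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} (x_j - \mu)(x_k - \mu) \Bigg] \\
+&= \frac{1}{\sigma^2} \sum_{i=1}^{n} \Bigg[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} E\left[ (x_j - \mu)(x_k - \mu) \right] \Bigg]
+= \frac{1}{\sigma^2} \sum_{i=1}^{n} \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk}
+\end{aligned}
+\tag{23}
+\]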
+
+Note that unlike Φβ, Ψβ does not depend explicitly on β (inverse temperature). As we have seen
+before, Φβ is a quadratic function of the spatial dependence parameter.
+In order to simplify the notations and also to make computations easier, the expressions for Φβ
+and Ψβ can be rewritten in a matrix-vector form. Let Σp be the covariance matrix of the random vectors
+$\vec{p}_i$, i = 1, 2, . . . , n, obtained by lexicographic ordering of the local configuration patterns xi ∪ ηi. Thus,
+considering a neighborhood system, ηi, of size K, we have Σp given by a (K + 1) × (K + 1) symmetric
+matrix (for K + 1 odd, i.e., K = 4, 8, 12, . . .):
+
+\[
+\Sigma_p =
+\begin{pmatrix}
+\sigma_{1,1} & \cdots & \sigma_{1,K/2} & \sigma_{1,(K/2)+1} & \sigma_{1,(K/2)+2} & \cdots & \sigma_{1,K+1} \\
+\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
+\sigma_{K/2,1} & \cdots & \sigma_{K/2,K/2} & \sigma_{K/2,(K/2)+1} & \sigma_{K/2,(K/2)+2} & \cdots & \sigma_{K/2,K+1} \\
+\sigma_{(K/2)+1,1} & \cdots & \sigma_{(K/2)+1,K/2} & \sigma_{(K/2)+1,(K/2)+1} & \sigma_{(K/2)+1,(K/2)+2} & \cdots & \sigma_{(K/2)+1,K+1} \\
+\sigma_{(K/2)+2,1} & \cdots & \sigma_{(K/2)+2,K/2} & \sigma_{(K/2)+2,(K/2)+1} & \sigma_{(K/2)+2,(K/2)+2} & \cdots & \sigma_{(K/2)+2,K+1} \\
+\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
+\sigma_{K+1,1} & \cdots & \sigma_{K+1,K/2} & \sigma_{K+1,(K/2)+1} & \sigma_{K+1,(K/2)+2} & \cdots & \sigma_{K+1,K+1}
+\end{pmatrix}
+\]
+
+Let $\Sigma_p^-$ be the submatrix of dimensions K × K obtained by removing the central row and central
+column of Σp (the covariances between xi and each one of its neighbors, xj). Then, for K + 1 odd, we
+have:
+
+\[
+\Sigma_p^- =
+\begin{pmatrix}
+\sigma_{1,1} & \cdots & \sigma_{1,K/2} & \sigma_{1,(K/2)+2} & \cdots & \sigma_{1,K+1} \\
+\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
+\sigma_{K/2,1} & \cdots & \sigma_{K/2,K/2} & \sigma_{K/2,(K/2)+2} & \cdots & \sigma_{K/2,K+1} \\
+\sigma_{(K/2)+2,1} & \cdots & \sigma_{(K/2)+2,K/2} & \sigma_{(K/2)+2,(K/2)+2} & \cdots & \sigma_{(K/2)+2,K+1} \\
+\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
+\sigma_{K+1,1} & \cdots & \sigma_{K+1,K/2} & \sigma_{K+1,(K/2)+2} & \cdots & \sigma_{K+1,K+1}
+\end{pmatrix}
+\tag{24}
+\]
+
+Thus, $\Sigma_p^-$ is a matrix that stores only the covariances among the neighboring variables. Furthermore,
+let $\vec{\rho}$ be the vector of dimensions K × 1 formed by all the elements of the central row of Σp, excluding
+the middle one (which is a variance actually), that is:
+
+\[
+\vec{\rho} = \left[ \sigma_{(K/2)+1,1} \;\; \cdots \;\; \sigma_{(K/2)+1,K/2} \;\; \sigma_{(K/2)+1,(K/2)+2} \;\; \cdots \;\; \sigma_{(K/2)+1,K+1} \right] \tag{25}
+\]
+
+Therefore, we can rewrite Equation (22) (for n = 1) using Kronecker products. The following definition
+provides a fast way to compute Φβ by exploiting these tensor products.
+
+Definition 5. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi, of size K (usual choices for K are even values: four, eight, 12, 20 or 24). Assuming that
+$X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time t and $\vec{\rho}$ and $\Sigma_p^-$ are defined as
+Equations (25) and (24), the Type I expected Fisher information, Φβ, for this state, X(t), is:
+
+\[
+\Phi_\beta = \frac{1}{\sigma^4} \left[ \sigma^2 \left\| \Sigma_p^- \right\|_+ + 2 \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+ - 6\beta \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \right] \tag{26}
+\]
+
+where ∥A∥+ denotes the summation of all the entries of the matrix, A (not to be confused with a matrix
+norm) and ⊗ denotes the Kronecker (tensor) product. From an information geometry perspective,
+the presence of tensor products indicates the intrinsic differential geometry of a manifold in the form
+of the Riemann curvature tensor [18]. Note that all the necessary information for computing the
+Fisher information is somehow encoded in the covariance matrix of the local configuration patterns,
+(xi ∪ ηi), i = 1, 2, . . . , n, as would be expected in the case of Gaussian variables (second-order statistics).
+The same procedure is applied to the Type II expected Fisher information.
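A minimal sketch of the matrix-vector form follows. It builds ρ and the reduced matrix from a local covariance matrix, forms the Kronecker products explicitly and sums their entries as in Equation (26), and also returns the entry sum of the reduced matrix divided by σ², which is the n = 1 value of the sum in Equation (23). All function names and the matrix `toy_cov` are hypothetical; plain nested lists are used instead of a numerical library to keep the example self-contained.

```python
def entry_sum(matrix):
    """Sum of all entries of a matrix (the ||.||_+ operation in the text)."""
    return sum(sum(row) for row in matrix)


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def split_pattern_cov(cov, center):
    """Return (rho, sigma_minus): the central row without the variance, and the
    covariance matrix with the central row and column removed."""
    nbrs = [j for j in range(len(cov)) if j != center]
    rho = [cov[center][j] for j in nbrs]
    sigma_minus = [[cov[j][k] for k in nbrs] for j in nbrs]
    return rho, sigma_minus


def fisher_matrix_form(cov, center, beta):
    """Return (Phi_beta, Psi_beta) using the Kronecker form of Equation (26)."""
    var = cov[center][center]
    rho, sm = split_pattern_cov(cov, center)
    rho_row = [rho]               # 1 x K
    rho_col = [[r] for r in rho]  # K x 1
    phi = (var * entry_sum(sm)
           + 2.0 * entry_sum(kron(rho_col, rho_row))
           - 6.0 * beta * entry_sum(kron(rho_row, sm))
           + 3.0 * beta ** 2 * entry_sum(kron(sm, sm))) / var ** 2
    psi = entry_sum(sm) / var
    return phi, psi


# Hypothetical local pattern with two neighbors; the central variable has index 1.
toy_cov = [
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
phi_value, psi_value = fisher_matrix_form(toy_cov, center=1, beta=0.1)
```

Because the entry sum of a Kronecker product equals the product of the entry sums of its factors, the explicit `kron` calls could be replaced by products of two scalars; they are kept here only to mirror the notation of the definition.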
+
+Definition 6. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi, of size K (usual choices for K are four, eight, 12, 20 or 24). Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$
+denotes the global configuration of the system at time t and $\Sigma_p^-$ is defined as Equation (24), the Type II expected
+Fisher information, Ψβ, for this state, X(t), is given by:
+
+\[
+\Psi_\beta = \frac{1}{\sigma^2} \left\| \Sigma_p^- \right\|_+ \tag{27}
+\]
+
+3.3. Information Equilibrium in GMRF Models
+
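+From the definition of both Φβ and Ψβ, a natural question that arises is: under what
+conditions do we have Φβ = Ψβ in an isotropic pairwise GMRF model? As we can see from
+Equations (26) and (27), the difference between Φβ and Ψβ, from now on denoted by $\Delta_\beta(\vec{\rho}, \Sigma_p^-)$,
+is simply:
+
+\[
+\Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right) = \frac{1}{\sigma^4} \left[ 2 \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+ - 6\beta \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \right] \tag{28}
+\]
+
+Then, intuitively, the condition for information equality is achieved when $\Delta_\beta(\vec{\rho}, \Sigma_p^-) = 0$. As
+$\Delta_\beta(\vec{\rho}, \Sigma_p^-)$ is a simple quadratic function of the inverse temperature parameter, β, we can easily find
+that the value, β∗, for which $\Delta_\beta(\vec{\rho}, \Sigma_p^-) = 0$, is:
+
+\[
+\beta^* = \frac{\left\| \vec{\rho}^{\,T} \otimes \Sigma_p^- \right\|_+}{\left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+}
+\pm \frac{\sqrt{3}}{3} \, \frac{\sqrt{ 3 \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^- \right\|_+^2 - 2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+ }}{\left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+} \tag{29}
+\]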
+
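+provided that $3 \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^- \right\|_+^2 \geq 2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+$ and $\left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \neq 0$.
+Note that if $\left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+ = 0$, then one solution for the above equation is β∗ = 0.
+In other words, when
+σij = 0, ∀j ∈ ηi (no correlation between xi and its neighbors, xj), information equilibrium is achieved for
+β∗ = 0, which in this case, is the maximum pseudo-likelihood estimate of β, since in this matrix-vector
+notation, $\hat{\beta}_{MPL}$ is given by:
+
+\[
+\hat{\beta}_{MPL} = \frac{\sum_{j \in \eta_i} \hat{\sigma}_{ij}}{\sum_{j \in \eta_i} \sum_{k \in \eta_i} \hat{\sigma}_{jk}} = \frac{\left\| \vec{\rho} \right\|_+}{\left\| \Sigma_p^- \right\|_+} \tag{30}
+\]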
+
+In the isotropic pairwise GMRF model, if β = 0, then we have $\|\vec{\rho}\|_+ = 0$, and as a consequence,
+Φβ = Ψβ. However, the opposite is not necessarily true, that is, we may observe that Φβ = Ψβ for a
+non-zero β. One example is for β∗, a solution of $\Delta_\beta(\vec{\rho}, \Sigma_p^-) = 0$.
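The equilibrium values of Equation (29) can be computed from two scalars only, the entry sum of ρ and the entry sum of the reduced covariance matrix. The sketch below assumes the identity that the entry sum of a Kronecker product is the product of the entry sums; the function name and argument names are hypothetical.

```python
import math


def information_equilibrium_betas(rho_sum, sigma_sum):
    """Roots of Delta_beta = 0 following Equation (29).

    rho_sum   : sum of the entries of rho
    sigma_sum : sum of the entries of the reduced covariance matrix
    Returns a (smaller, larger) pair, or None when the side conditions stated
    after Equation (29) do not hold.
    """
    a = rho_sum ** 2          # entry sum of rho (x) rho^T
    b = rho_sum * sigma_sum   # entry sum of rho^T (x) Sigma_p^-
    c = sigma_sum ** 2        # entry sum of Sigma_p^- (x) Sigma_p^-
    disc = 3.0 * b ** 2 - 2.0 * a * c
    if c == 0.0 or disc < 0.0:
        return None
    shift = (math.sqrt(3.0) / 3.0) * math.sqrt(disc) / c
    return (b / c - shift, b / c + shift)


# Hypothetical summary statistics of a local covariance matrix.
roots = information_equilibrium_betas(rho_sum=0.4, sigma_sum=2.2)
```

When `rho_sum` is zero both roots collapse to zero, which matches the remark that β∗ = 0 is a solution when the central variable is uncorrelated with its neighbors.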
+
+4. Entropy in Isotropic Pairwise GMRFs
+
+Our definition of entropy is done by repeating the same process employed to derive Φβ and Ψβ.
+Knowing that the entropy of random variable x is defined by the expected value of self-information,
+given by −log p(x), it can be thought of as a probability-based counterpart to the Fisher information.
+
+Definition 7. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time t,
+then the entropy, Hβ, for this state X(t) is given by:
+
+\[
+\begin{aligned}
+H_\beta &= -E\left[ \log L\left(\vec{\theta}; X^{(t)}\right) \right] = -E\left[ \log \prod_{i=1}^{n} p\left( x_i \mid \eta_i, \vec{\theta} \right) \right] \\
+&= \frac{n}{2} \log\left( 2\pi\sigma^2 \right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} E\Bigg\{ \Bigg[ x_i - \mu - \beta \sum_{j \in \eta_i} (x_j - \mu) \Bigg]^2 \Bigg\} \\
+&= \frac{n}{2} \log\left( 2\pi\sigma^2 \right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Bigg\{ E\left[ (x_i - \mu)^2 \right] - 2\beta E\Bigg[ \sum_{j \in \eta_i} (x_i - \mu)(x_j - \mu) \Bigg] + \beta^2 E\Bigg\{ \Bigg[ \sum_{j \in \eta_i} (x_j - \mu) \Bigg]^2 \Bigg\} \Bigg\}
+\end{aligned}
+\tag{31}
+\]
+
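+After some algebra, the expression for Hβ becomes:
+
+\[
+\begin{aligned}
+H_\beta &= \frac{n}{2} \log\left( 2\pi\sigma^2 \right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Bigg\{ \sigma^2 - 2\beta \sum_{j \in \eta_i} \sigma_{ij} + \beta^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \Bigg\} \\
+&= \left[ \frac{n}{2} \log\left( 2\pi\sigma^2 \right) + \frac{n}{2} \right] - \frac{\beta}{\sigma^2} \sum_{i=1}^{n} \Bigg[ \sum_{j \in \eta_i} \sigma_{ij} \Bigg] + \frac{\beta^2}{2\sigma^2} \sum_{i=1}^{n} \Bigg[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \Bigg]
+\end{aligned}
+\tag{32}
+\]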
+
+Using the same matrix-vector notation introduced in the previous sections, we can further simplify the
+expression for Hβ (considering n = 1).
+
+Definition 8. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time t
+and $\vec{\rho}$ and $\Sigma_p^-$ are defined as Equations (25) and (24), the entropy, Hβ, for this state, X(t), is given by:
+
+\[
+H_\beta = H_G - \left[ \frac{\beta}{\sigma^2} \left\| \vec{\rho} \right\|_+ - \frac{\beta^2}{2\sigma^2} \left\| \Sigma_p^- \right\|_+ \right]
+= H_G - \left[ \frac{\beta}{\sigma^2} \left\| \vec{\rho} \right\|_+ - \frac{\beta^2}{2} \Psi_\beta \right] \tag{33}
+\]
+
+where HG denotes the entropy of a Gaussian random variable with variance σ2 and Ψβ is the Type II
+expected Fisher information.
+Note that Shannon entropy is a quadratic function of the spatial dependence parameter, β.
+Since the coefficient of the quadratic term is strictly non-negative (Ψβ is the Type II expected Fisher
+information), entropy is a convex function of β. Furthermore, as expected, when β = 0 and there is no
+induced spatial dependence in the system, the resulting expression for Hβ is the usual entropy of a
+Gaussian random variable, HG. Thus, there is a value, $\hat{\beta}_{MH}$, for the inverse temperature parameter,
+which minimizes the entropy of the system. In fact, $\hat{\beta}_{MH}$ is given by:
+
+\[
+\frac{\partial H_\beta}{\partial \beta} = \frac{\beta}{\sigma^2} \left\| \Sigma_p^- \right\|_+ - \frac{1}{\sigma^2} \left\| \vec{\rho} \right\|_+ = 0 \tag{34}
+\]
+
+\[
+\hat{\beta}_{MH} = \frac{\left\| \vec{\rho} \right\|_+}{\left\| \Sigma_p^- \right\|_+} = \hat{\beta}_{MPL}
+\]
+
+showing that the maximum pseudo-likelihood and the minimum-entropy estimates are equivalent
+in an isotropic pairwise GMRF model. Moreover, using the derived equations, we see a relationship
+between Φβ, Ψβ and Hβ:
+
+\[
+\Phi_\beta - \Psi_\beta = \Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right), \qquad
+\frac{\partial^2 H_\beta}{\partial \beta^2} = \Psi_\beta \tag{35}
+\]
+
+where the functional $\Delta_\beta(\vec{\rho}, \Sigma_p^-)$ that represents the difference between Φβ and Ψβ is defined by
+Equation (28). These equations relate the entropy and one form of Fisher information (Ψβ) in GMRF
+models, showing that Ψβ can be roughly viewed as the curvature of Hβ. In this sense, in a hypothetical
+information equilibrium condition Ψβ = Φβ = 0, the entropy’s curvature would be null (Hβ would
+never change). These results suggest that an increase in the value of Ψβ, which means stability (a
+measure of agreement between the neighboring observations of a given point), contributes to the curvature
+and, therefore, to inducing a change in the entropy of the system. In this context, the analysis of the
+Fisher information could bring us insights in predicting the entropy of a system.
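To make the relation between entropy and Ψβ concrete, the following sketch evaluates Equation (33) for n = 1 and the minimizer given after Equation (34). The function names and the numeric inputs are hypothetical; `rho_sum` and `sigma_sum` stand for the entry sums of ρ and of the reduced covariance matrix.

```python
import math


def gmrf_entropy(beta, var, rho_sum, sigma_sum):
    """Entropy of Equation (33) for n = 1."""
    h_gauss = 0.5 * (math.log(2.0 * math.pi * var) + 1.0)  # entropy of a Gaussian
    psi = sigma_sum / var                                    # Equation (27)
    return h_gauss - (beta / var) * rho_sum + 0.5 * beta ** 2 * psi


def min_entropy_beta(rho_sum, sigma_sum):
    """Stationary point of the entropy, equal to the MPL estimate of Equation (30)."""
    return rho_sum / sigma_sum


# Hypothetical summary statistics, used only for illustration.
beta_hat = min_entropy_beta(rho_sum=0.4, sigma_sum=2.2)
entropy_at_min = gmrf_entropy(beta_hat, var=1.0, rho_sum=0.4, sigma_sum=2.2)
```

Since the quadratic coefficient is Ψβ/2, the parabola opens upward whenever the entry sum of the reduced covariance matrix is positive, so the stationary point is a minimum in that case.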
+
+5. Asymptotic Variance of MPL Estimators
+
+It is known from the statistical inference literature that unbiasedness is a property that is not
+granted by maximum likelihood estimation, nor by maximum pseudo-likelihood (MPL) estimation.
+Actually, there is no universal method that guarantees the existence of unbiased estimators for a fixed
+n-sized sample. Often, in the exponential family of distributions, maximum likelihood estimators
+(MLEs) coincide with the UMVU (uniformly minimum-variance unbiased) estimators, because MLEs
+are functions of complete sufficient statistics. An important result in statistical inference
+shows that if the MLE is unique, then it is a function of sufficient statistics. Several further
+properties make maximum likelihood estimation a reference
+method [15–17]. One of the most important concerns the asymptotic behavior of MLEs:
+as the sample size grows without bound (n → ∞), MLEs become asymptotically unbiased and
+efficient. Unfortunately, there is no result showing that the same occurs in maximum pseudo-likelihood
+estimation. The objective of this section is to propose a closed expression for the asymptotic variance
+of the maximum pseudo-likelihood estimator of β in an isotropic pairwise GMRF model. Unsurprisingly, this
+variance is completely defined as a function of both forms of expected Fisher information, Ψβ and Φβ,
+since for general values of the inverse temperature parameter, the information equality condition fails.
+
+5.1. The Asymptotic Variance of the Inverse Temperature Parameter
+
+In mathematical statistics, asymptotic evaluations uncover several fundamental properties of
+inference methods, providing a powerful and general tool for studying and characterizing the behavior
+of estimators. In this section, our objective is to derive an expression for the asymptotic variance
+of the maximum pseudo-likelihood estimator of the inverse temperature parameter (β) in isotropic
+pairwise GMRF models. It is known from the statistical inference literature that both maximum
+likelihood and maximum pseudo-likelihood estimators share two important properties: consistency
+and asymptotic normality [29,30]. It is possible, therefore, to completely characterize their behaviors
+in the limiting case. In other words, the asymptotic distribution of ˆβMPL is normal, centered around
+the real parameter value (since consistency means that the estimator is asymptotically unbiased),
+with the asymptotic variance representing the uncertainty about how far we are from the mean (real
+value). From a statistical perspective, $\hat{\beta}_{MPL} \approx N\left(\beta, \upsilon_\beta\right)$, where υβ denotes the asymptotic variance
+of the maximum pseudo-likelihood estimator. It is known that the asymptotic covariance matrix of
+maximum pseudo-likelihood estimators is given by [31]:
+
+\[
+C(\vec{\theta}) = H^{-1}(\vec{\theta})\, J(\vec{\theta})\, H^{-1}(\vec{\theta}) \tag{36}
+\]
+
+with:
+
+\[
+H(\vec{\theta}) = E_\beta\left[ \nabla^2 \log L\left(\vec{\theta}; X^{(t)}\right) \right] \tag{37}
+\]
+
+\[
+J(\vec{\theta}) = Var_\beta\left[ \nabla \log L\left(\vec{\theta}; X^{(t)}\right) \right] \tag{38}
+\]
+
+where H and J are built, respectively, from the Hessian matrix and the gradient (Jacobian) of the logarithm of the
+pseudo-likelihood function. Thus, considering the parameter of interest, β, we have the following
+definition for its asymptotic variance, υβ (the derivatives are taken with respect to β):
+
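+\[
+\upsilon_\beta = \frac{ Var_\beta\left[ \frac{\partial}{\partial\beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right] }{ E_\beta^2\left[ \frac{\partial^2}{\partial\beta^2} \log L\left(\vec{\theta}; X^{(t)}\right) \right] }
+= \frac{ E_\beta\left[ \left( \frac{\partial}{\partial\beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right)^2 \right] - E_\beta^2\left[ \frac{\partial}{\partial\beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right] }{ E_\beta^2\left[ \frac{\partial^2}{\partial\beta^2} \log L\left(\vec{\theta}; X^{(t)}\right) \right] } \tag{39}
+\]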
+
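+However, note that the expected value of the first derivative of $\log L\left(\vec{\theta}; X^{(t)}\right)$ with relation to β is zero:
+
+\[
+E\left[ \frac{\partial}{\partial\beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right] = \frac{1}{\sigma^2} \sum_{i=1}^{n} \Bigg[ E\left[ x_i - \mu \right] - \beta \sum_{j \in \eta_i} E\left[ x_j - \mu \right] \Bigg] = 0 \tag{40}
+\]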
+
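+Therefore, the second term of the numerator of Equation (39) vanishes and the final expression for the
+asymptotic variance of the inverse temperature parameter is given as the ratio between Φβ and $\Psi_\beta^2$:
+
+\[
+\begin{aligned}
+\upsilon_\beta = \frac{1}{\Big( \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \Big)^2} \Bigg[ & \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left( \sigma^2 \sigma_{jk} + 2\sigma_{ij}\sigma_{ik} \right)
+- 2\beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \left( \sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk} \right) \\
+& + \beta^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \sum_{m \in \eta_i} \left( \sigma_{jk}\sigma_{lm} + \sigma_{jl}\sigma_{km} + \sigma_{jm}\sigma_{kl} \right) \Bigg]
+\end{aligned}
+\tag{41}
+\]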
+
+This derivation leads us to another definition concerning an isotropic pairwise GMRF.
+
+Definition 9. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time t,
+and $\vec{\rho}$ and $\Sigma_p^-$ are defined as Equations (25) and (24), the asymptotic variance of the maximum pseudo-likelihood
+estimator of the inverse temperature parameter, β, is given by (using the same matrix-vector notation from the
+previous sections):
+
+\[
+\begin{aligned}
+\upsilon_\beta &= \frac{ \sigma^2 \left\| \Sigma_p^- \right\|_+ + 2 \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+ - 6\beta \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ }{ \left\| \Sigma_p^- \right\|_+^2 } \\
+&= \frac{\sigma^2}{\left\| \Sigma_p^- \right\|_+} + \frac{\sigma^4 \Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right)}{\left\| \Sigma_p^- \right\|_+^2}
+= \frac{1}{\Psi_\beta} + \frac{1}{\Psi_\beta^2} \left( \Phi_\beta - \Psi_\beta \right)
+\end{aligned}
+\tag{42}
+\]
+
+Note that when information equilibrium prevails, that is Φβ = Ψβ, the asymptotic variance is
+given by the inverse of the expected Fisher information. More generally, this equation
+indicates that the uncertainty in the estimation of the inverse temperature parameter is minimized when
+Ψβ is maximized. Essentially, this means that on average, the local pseudo-likelihood functions are not
+flat, that is, small changes in the local configuration patterns along the system cannot cause abrupt
+changes in the expected global behavior (the global spatial dependence structure is not susceptible to
+sharp changes). To reach this condition, there must be a reasonable degree of agreement between the
+neighboring elements throughout the system, a behavior that is usually associated with low temperature
+states (β is above a critical value and there is a visible induced spatial dependence structure).
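The decomposition in Equation (42) can be checked numerically with a few lines. The sketch below is illustrative only: the function name and inputs are hypothetical, and the Kronecker terms are replaced by products of entry sums, which is an identity assumed here rather than stated in the text.

```python
def asymptotic_variance(beta, var, rho_sum, sigma_sum):
    """Return (upsilon, phi, psi), with upsilon = phi / psi^2 as in Equation (42).

    rho_sum and sigma_sum are the entry sums of rho and of the reduced
    covariance matrix; var plays the role of sigma^2.
    """
    phi = (var * sigma_sum
           + 2.0 * rho_sum ** 2
           - 6.0 * beta * rho_sum * sigma_sum
           + 3.0 * beta ** 2 * sigma_sum ** 2) / var ** 2
    psi = sigma_sum / var
    return phi / psi ** 2, phi, psi


# Hypothetical summary statistics, used only for illustration.
upsilon, phi, psi = asymptotic_variance(beta=0.1, var=1.0, rho_sum=0.4, sigma_sum=2.2)
# Right-hand side of Equation (42): 1/Psi + (Phi - Psi)/Psi^2
decomposed = 1.0 / psi + (phi - psi) / psi ** 2
```

Both expressions agree up to floating-point rounding, and when `phi` equals `psi` the result reduces to `1.0 / psi`, the usual inverse Fisher information.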
+
+6. The Fisher Curve of a System
+
+With the definition of Φβ, Ψβ and Hβ, we have the necessary tools to compute three important
+information-theoretic measures of a global configuration of the system. Our idea is that we can study
+the behavior of a complex system by constructing a parametric curve in this information-theoretic
+space as a function of the inverse temperature parameter, β. Our expectation is that the resulting
+trajectory provides a geometrical interpretation of how the system moves from an initial configuration,
+A (with a low entropy value for instance), to a desired final configuration, B (with a greater value
+of entropy, for instance), since the Fisher information plays an important role in providing a natural
+metric to the Riemannian manifolds of statistical models [18,19]. We will call the path from global State
+A to global State B the Fisher curve (from A to B) of the system, denoted by $\vec{F}_A^B(\beta)$. Instead of using
+time as the parameter to build the curve, $\vec{F}$, we parametrize $\vec{F}$ by the inverse temperature parameter, β.
+
+Definition 10. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi, and X(β1), X(β2), . . . , X(βn) be a sequence of outcomes (global configurations) produced by different
+values of βi (inverse temperature parameters) for which A = βMIN = β1 < β2 < · · · < βn = βMAX = B.
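+The system’s Fisher curve from A to B is defined as the function $\vec{F} : \Re \to \Re^3$ that maps each configuration,
+X(βi), to a point $\left( \Phi_{\beta_i}, \Psi_{\beta_i}, H_{\beta_i} \right)$ from the information space, that is:
+
+\[
+\vec{F}_A^B(\beta) = \left( \Phi_\beta, \Psi_\beta, H_\beta \right), \qquad \beta = A, \ldots, B \tag{43}
+\]
+
+where Φβ, Ψβ and Hβ denote the Type I expected Fisher information, the Type II expected Fisher
+information and the Shannon entropy of the global configuration, X(β), defined by:
+
+\[
+\Phi_\beta = \frac{1}{\sigma^4} \left[ \sigma^2 \left\| \Sigma_p^- \right\|_+ + 2 \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+ - 6\beta \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \right] \tag{44}
+\]
+
+\[
+\Psi_\beta = \frac{1}{\sigma^2} \left\| \Sigma_p^- \right\|_+ \tag{45}
+\]
+
+\[
+H_\beta = \frac{1}{2} \left[ \log\left( 2\pi\sigma^2 \right) + 1 \right] - \left[ \frac{\beta}{\sigma^2} \left\| \vec{\rho} \right\|_+ - \frac{\beta^2}{2} \Psi_\beta \right] \tag{46}
+\]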
+
+In the next sections, we show some computational experiments that illustrate the effectiveness of
+the proposed tools in measuring the information encoded in complex systems. We want to investigate
+what happens to the Fisher curve as the inverse temperature parameter is modified in order to control
+the system’s global behavior. Our main conclusion, which is supported by experimental analysis, is
+that $\vec{F}_A^B(\beta) \neq \vec{F}_B^A(\beta)$. In other words, in terms of information, moving towards higher entropy states
+is not the same as moving towards lower entropy states, since the Fisher curves that represent the
+trajectory between the initial State A and the final State B are significantly different.
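One possible way to assemble a Fisher curve from a sequence of configurations is sketched below. It assumes that, for each value of β, the variance and the two entry sums have already been estimated from the corresponding configuration; the tuple layout, the function names and the sample numbers are hypothetical.

```python
import math


def fisher_curve_point(beta, var, rho_sum, sigma_sum):
    """Return (Phi, Psi, H) following Equations (44)-(46) for one configuration."""
    phi = (var * sigma_sum
           + 2.0 * rho_sum ** 2
           - 6.0 * beta * rho_sum * sigma_sum
           + 3.0 * beta ** 2 * sigma_sum ** 2) / var ** 2
    psi = sigma_sum / var
    entropy = (0.5 * (math.log(2.0 * math.pi * var) + 1.0)
               - (beta / var) * rho_sum
               + 0.5 * beta ** 2 * psi)
    return phi, psi, entropy


def fisher_curve(samples):
    """Map (beta, var, rho_sum, sigma_sum) tuples to points in information space."""
    return [fisher_curve_point(*sample) for sample in samples]


# Two hypothetical configurations: no spatial dependence, then mild dependence.
curve = fisher_curve([
    (0.0, 1.0, 0.0, 2.2),
    (0.1, 1.0, 0.4, 2.2),
])
```

Comparing the list obtained while β increases with the list obtained while β decreases is then a direct way to inspect whether the two trajectories coincide.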
+
+7. Computational Simulations
+
+This section discusses some numerical experiments proposed to illustrate some applications of
+the derived tools in both simulations and real data. Our computational investigations were divided
+into two main sets of experiments:
+
+(1) Local analysis: analysis of the local and global versions of the measures (φβ, ψβ, Φβ, Ψβ and Hβ),
+considering a fixed inverse temperature parameter;
+(2) Global analysis: analysis of the global versions of the measures (Φβ, Ψβ and Hβ) along Markov
+chain Monte Carlo (MCMC) simulations in which the inverse temperature parameter is modified
+to control the expected global behavior.
+
+7.1. Learning from Spatial Data with Local Information-Theoretic Measures
+
+First, in order to illustrate a simple application of both forms of local observed Fisher
+information, φβ and ψβ, we performed an experiment using some synthetic images generated by
+the Metropolis–Hastings algorithm. The basic idea of this simulation process is to start at an initial
+configuration in which temperature is infinite (or β = 0). This basic initial condition is randomly
+chosen, and after a fixed number of steps, the algorithm produces a configuration that is considered to
+be a valid outcome of an isotropic pairwise GMRF model. Figure 1 shows an example of the initial
+condition and the resulting system configuration after 1,000 iterations considering a second order
+neighborhood system (eight nearest neighbors). The model parameters were chosen as: μ = 0, σ2 = 5
+and β = 0.8.
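The simulation described above can be sketched with a single-site Metropolis update driven by the local conditional density of the model, a Gaussian with mean μ + β Σ (xj − μ) and variance σ². This is only an illustrative sketch, not the authors' implementation: the wrap-around borders, the Gaussian random-walk proposal, its step size, the lattice size and all function names are assumptions.

```python
import math
import random


def neighbor_sum(field, r, c, mu):
    """Sum of (x_j - mu) over the eight nearest neighbors (wrap-around borders assumed)."""
    rows, cols = len(field), len(field[0])
    total = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr or dc:
                total += field[(r + dr) % rows][(c + dc) % cols] - mu
    return total


def local_log_density(x, nbr_sum, mu, var, beta):
    """log p(x_i | eta_i) up to an additive constant."""
    return -((x - mu - beta * nbr_sum) ** 2) / (2.0 * var)


def metropolis_sweep(field, mu, var, beta, rng, step=1.0):
    """One pass over the lattice with a symmetric random-walk proposal per site."""
    for r in range(len(field)):
        for c in range(len(field[0])):
            s = neighbor_sum(field, r, c, mu)
            current = field[r][c]
            proposal = current + rng.gauss(0.0, step)
            log_ratio = (local_log_density(proposal, s, mu, var, beta)
                         - local_log_density(current, s, mu, var, beta))
            if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
                field[r][c] = proposal
    return field


def simulate_gmrf(size=16, sweeps=50, mu=0.0, var=5.0, beta=0.8, seed=0):
    """Start from independent Gaussian noise (beta = 0) and run a fixed number of sweeps."""
    rng = random.Random(seed)
    field = [[rng.gauss(mu, math.sqrt(var)) for _ in range(size)] for _ in range(size)]
    for _ in range(sweeps):
        metropolis_sweep(field, mu, var, beta, rng)
    return field
```

A call such as `simulate_gmrf(size=32, sweeps=1000)` would mimic the setting quoted in the text (μ = 0, σ² = 5, β = 0.8, 1,000 iterations), although the lattice size used by the authors is not stated here.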
+
+Figure 1. Example of Gaussian Markov random field (GMRF) model outputs. The values of the inverse
+temperature parameter, β, in the left and right configurations are zero and 0.8, respectively.
+
+Three Fisher information maps were generated from both initial and resulting configurations.
+The first map was obtained by calculating the value, φβ(xi), for every point of the system, that is for
+i = 1, 2, . . . , n. Similarly, the second one was obtained by using ψβ(xi). The last information map was
+built by using the ratio between φβ(xi) and ψβ(xi), motivated by the fact that boundaries are often
+composed of patterns that are not expected to be “aligned” to the global behavior (and, therefore, show
+high values of φβ(xi)) and also are somehow unstable (show low values of ψβ(xi)). We will call
+this measure, Lβ(xi) = φβ(xi)/ψβ(xi), the local L-information, since it is defined in terms of the first
+two derivatives of the logarithm of the local pseudo-likelihood function. Figure 2 shows the obtained
+information maps as images. Note that while φβ has a strong response for boundaries (the edges are
+light), ψβ has a weak one (so the edges are dark), evidence in favor of considering L-information in
+boundary detection procedures. Note also that in the initial condition, when the temperature is infinite,
+the informative patterns are almost uniformly spread all over the system, while the final configuration
+shows a sparser representation in terms of information. Figure 3 shows the distribution of local
+L-information for both system configurations depicted in Figure 1.
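A possible way to build the three maps is sketched below. It assumes that the local observed quantities are the per-site terms found inside the expectations of Equations (15) and (23), namely the squared first derivative and the negative second derivative of the local log pseudo-likelihood with respect to β; the exact definitions of φβ(xi) and ψβ(xi) are given elsewhere in the paper and are not reproduced in this section. Wrap-around borders and the function name are also assumptions.

```python
def local_information_maps(field, mu, var, beta):
    """Return (phi_map, psi_map, l_map) for a 2-D lattice given as a list of rows."""
    rows, cols = len(field), len(field[0])
    phi_map = [[0.0] * cols for _ in range(rows)]
    psi_map = [[0.0] * cols for _ in range(rows)]
    l_map = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Sum of centered values over the eight nearest neighbors.
            s = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr or dc:
                        s += field[(r + dr) % rows][(c + dc) % cols] - mu
            residual = field[r][c] - mu - beta * s
            phi_map[r][c] = (residual * s) ** 2 / var ** 2  # squared local score
            psi_map[r][c] = s ** 2 / var                    # minus local second derivative
            if psi_map[r][c] > 0.0:
                l_map[r][c] = phi_map[r][c] / psi_map[r][c]  # local L-information
    return phi_map, psi_map, l_map


# Tiny hypothetical lattice: a single non-zero site in the middle.
toy_field = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
phi_map, psi_map, l_map = local_information_maps(toy_field, mu=0.0, var=2.0, beta=0.5)
```

Sites whose neighborhood sum is exactly zero are assigned an L-information of zero here, a convention chosen only to avoid a division by zero.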
+
+Figure 2. Fisher information maps. The first row shows the information maps of the system when the
+temperature is infinite (β = 0). The second row shows the same maps when the temperature is low
+(β = 0.8). The first and second columns show information maps that were generated by computing
+φβ(xi) and ψβ(xi) for each observation in the lattice. The third column was produced by computing the
+local L-information, that is the ratio between both local information measures. In terms of information,
+low temperature configurations are more sparse, since most local patterns are uninformative, due to
+the strong alignment of the particles throughout the system, which is the expected global behavior for
+β above a certain critical value.
+
+Figure 3. Distribution of local L-information. When the temperature is infinite, the information
+is spread along the system. For low temperature configurations, the number of local patterns with
+zero information content significantly increases, that is the system is more sparse in terms of Fisher
+information.
+
+7.2. Analyzing Dynamical Systems with Global Information-Theoretic Measures
+
+In order to study the behavior of a complex system that evolves from an initial State A to
+another State B, we use the Metropolis–Hastings algorithm, an MCMC simulation method, to generate
+a sequence of valid isotropic pairwise GMRF model outcomes for different values of the inverse
+temperature parameter, β. This process is an attempt to perform a random walk on the state space of
+the system, that is, in the space of all possible global configurations in order to analyze the behavior of
+the proposed global measures: entropy and both forms of Fisher information. The main purpose of
+this experiment is to observe what happens to Φβ, Ψβ and Hβ when the system evolves from a random
+initial state to other global configurations. In other words, we want to investigate the Fisher curve of
+the system in order to characterize its behavior in terms of information. Basically, the idea is to use the
+Fisher curve as a kind of signature for the expected behavior of any system modeled by an isotropic
+pairwise GMRF, making it possible to gain insights into the understanding of large complex systems.
+
+To simulate a system in which we can control the inverse temperature parameter, we define an
+updating rule for β based on fixed increments. In summary, we start with a minimum value βMIN
+(when βMIN = 0, the temperature of the system is infinite). Then, the value of β in the iteration, t, is
+defined as the value of β in t − 1 plus a small increment (Δβ), until it reaches a pre-defined upper bound,
+βMAX. The process is then repeated with negative increments −Δβ, until the inverse temperature
+reaches its minimum value, βMIN, again. This process continues for a fixed number of iterations, NMAX,
+during an MCMC simulation. As a result of this approach, a sequence of GMRF samples is produced.
+We use this sequence to calculate Φβ, Ψβ and Hβ and define the Fisher curve $\vec{F}$, for β = βMIN, . . . , βMAX.
+Figure 4 shows some of the system’s configurations along an MCMC simulation. In this experiment, the
+parameters were defined as: βMIN = 0, Δβ = 0.001, βMAX = 0.15, NMAX = 1,000, μ = 0, σ2 = 5
+and ηi = {(i − 1, j − 1), (i − 1, j), (i − 1, j + 1), (i, j − 1), (i, j + 1), (i + 1, j − 1), (i + 1, j), (i + 1, j + 1)}.
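The updating rule for β can be written as a triangular schedule. The sketch below is one reading of the description above; the function name `beta_schedule` is hypothetical, and an integer step counter is used instead of repeated floating-point additions so that rounding errors do not accumulate.

```python
def beta_schedule(beta_min, beta_max, delta, n_iter):
    """Values of beta for iterations 0 .. n_iter - 1: ramp up to beta_max in steps
    of delta, ramp back down to beta_min, and repeat."""
    half = round((beta_max - beta_min) / delta)  # increments in one upward ramp
    if half < 1:
        raise ValueError("delta must be smaller than beta_max - beta_min")
    period = 2 * half
    schedule = []
    for t in range(n_iter):
        phase = t % period
        steps = phase if phase <= half else period - phase
        schedule.append(beta_min + steps * delta)
    return schedule


# Parameters quoted in the text: beta_min = 0, delta = 0.001, beta_max = 0.15, 1,000 iterations.
betas = beta_schedule(0.0, 0.15, 0.001, 1000)
```

With these parameters one upward ramp takes 150 increments, so 1,000 iterations cover a little more than three full up-and-down cycles under this reading.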
+
+Figure 4. Global configurations along a Markov chain Monte Carlo (MCMC) simulation. Evolution of
+the system as the inverse temperature parameter, β, is modified to control the expected global behavior.
+
+A plot of both forms of the expected Fisher information, Φβ and Ψβ, for each iteration of
+the MCMC simulation is shown in Figure 5.
+The graph produced by this experiment shows
+some interesting results. First of all, regarding upper and lower bounds on these measures, it is
+possible to note that when there is no induced spatial dependence structure (β ≈ 0), we have an
+information equilibrium condition (Φβ = Ψβ and the information equality holds). In this condition,
+the observations are practically independent in the sense that all local configuration patterns convey
+approximately the same amount of information. Thus, it is hard to find and separate the two categories
+of patterns we know: the informative and the non-informative ones. Since they all behave in a similar
+manner, there is no informative pattern to highlight. Moreover, in this information equilibrium
+situation, Ψβ reaches its lower bound (in this simulation, we observed that in the equilibrium
+Φβ ≈ Ψβ ≈ 8), indicating that this condition emerges when the system is most susceptible to a
+change in the expected global behavior, since the uncertainty about β is maximum at this moment. In
+other words, modification in the behavior of a small subset of local patterns may guide the system to a
+totally different stable configuration in the future.
+The results also show that the difference between Φβ and Ψβ is maximum when the system
+operates with large values of β, that is, when organization emerges and there is a strong dependence
+structure among the random variables (the global configuration shows clear visible clusters and
+boundaries between them). In such states, it is expected that the majority of patterns be aligned to the
+global behavior, which causes the appearance of few, but highly informative patterns: those connecting
+
+Entropy 2014, 16, 1002–1036
+
+elements from different regions (boundaries). Besides that, the results suggest that it takes more time
+for the system to go from the information equilibrium state to organization than the opposite. We
+will see how this fact becomes clear by analyzing the Fisher curve along Markov chain Monte Carlo
+(MCMC) simulations. Finally, the results also suggest that both Φβ and Ψβ are bounded above by a
+value possibly related to the size of the neighborhood system.
+
+Figure 5. Evolution of Fisher information along an MCMC simulation. As the difference between Φβ
+and Ψβ is maximized (*), the uncertainty about the real inverse temperature parameter is minimized
+and the number of informative patterns increases. In the information equilibrium condition (**), it is
+hard to find informative patterns, since there is no induced spatial dependence structure.
+
+Figure 6 shows the real parameter values used to generate the GMRF outputs (blue line), the
+maximum pseudo-likelihood estimate used to calculate Φβ and Ψβ (red line) and also a plot of the
+asymptotic variances (uncertainty about the inverse temperature) along the entire MCMC simulation.
+
+Figure 6. Real and estimated inverse temperatures along the MCMC simulation. The system’s global
+behavior is controlled by the real inverse temperature parameter values (blue line), used to generate
+the GMRF outputs. The maximum pseudo-likelihood estimate is used to compute both Φβ and Ψβ.
+Note that the uncertainty about the inverse temperature increases as β → 0 and the system approaches
+the information equilibrium condition.
+
+We now proceed to the analysis of the Shannon entropy of the system along the simulation.
+Despite showing a behavior similar to Ψβ, the range of values for entropy is significantly smaller. In
+this simulation, we observed that 0 ≤ Hβ ≤ 4.5, 0 ≤ Φβ ≤ 18 and 8 ≤ Ψβ ≤ 61. An interesting point is
+that knowledge of Φβ and Ψβ allows us to infer the entropy of the system. For example, looking at
+Figures 5 and 7, we can see that Φβ and Ψβ start to diverge a little earlier (t ≈ 80) than the entropy
+in a GMRF model begins to grow (t ≈ 120). Therefore, in an isotropic pairwise GMRF model, if the
+system is close to the information equilibrium condition, then Hβ is low, since there is little variability
+in the observed configuration patterns. When the difference between Φβ and Ψβ is large, Hβ increases.
+
+Figure 7. Evolution of Shannon entropy along an MCMC simulation. Hβ starts to grow when the
+system leaves the equilibrium condition, where the entropy in the isotropic pairwise GMRF model is
+identical to the entropy of a simple Gaussian random variable (since β → 0).
+
+Another interesting global information-theoretic measure is L-information, from now on denoted
+by Lβ, since it conveys all the information about the likelihood function (in a GMRF model, only the
+first two derivatives of L(⃗θ; X(t)) are not null). Lβ is defined as the ratio between the two forms of
+expected Fisher information, Φβ and Ψβ. A nice property about this measure is that 0 ≤ Lβ ≤ 1. With
+this single measurement, it is possible to gain insights about the global system behavior. Figure 8 shows
+that a value close to one indicates a system approximating the information equilibrium condition, while
+a value close to zero indicates a system close to the maximum entropy condition (a stable configuration
+with boundaries and informative patterns).
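+
+The ratio can be illustrated with a short sketch. It assumes the ratio is taken as Φβ/Ψβ, which is the orientation consistent with 0 ≤ Lβ ≤ 1 and with Φβ ≈ Ψβ at equilibrium; the function name `l_information` is hypothetical, and the second pair of input values below is illustrative only, not a result from the paper.
+
```python
def l_information(phi, psi):
    """L-information as the ratio of the two expected Fisher information values.

    Assumption: the ratio is phi / psi, so that equal values (information
    equilibrium) give 1 and a large psi relative to phi gives a value near 0.
    """
    if psi <= 0:
        raise ValueError("psi must be positive")
    return phi / psi


# Equilibrium-like case: both forms of Fisher information coincide.
print(l_information(8.0, 8.0))   # 1.0
# Illustrative organized case: psi much larger than phi, ratio near zero.
print(l_information(6.1, 61.0))
```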
+
+Figure 8. Evolution of L-information along an MCMC simulation. When Lβ approaches one, the
+system tends to the information equilibrium condition. For values close to zero, the system tends to the
+maximum entropy condition.
+
+To investigate the intrinsic non-linear connection between Φβ, Ψβ and Hβ in a complex system
+modeled by an isotropic pairwise GMRF model, we now analyze its Fisher curves. The first curve,
+which is a planar one, is defined as ⃗F(β) = (Φβ, Ψβ), for A = βmin to B = βmax and shows how Fisher
+information changes when the inverse temperature of the system is modified to control the global
+behavior. Figure 9 shows the results. In the first image, the blue piece of the curve is the path from
+A to B, that is, ⃗F(β)_A^B, and the red piece is the inverse path (from B to A), that is, ⃗F(β)_B^A. We must
+emphasize that ⃗F(β)_A^B is the trajectory from a lower entropy global configuration to a higher entropy
+global configuration. On the other hand, when the system moves from B to A, we are moving towards
+entropy minimization. To make this clear, the second image of Figure 9 illustrates the same Fisher
+curve as before, but now in three dimensions, that is, ⃗F(β) = (Φβ, Ψβ, Hβ). For comparison purposes,
+Figure 10 shows the Fisher curves for another MCMC simulation with different parameter settings.
+Note that the shapes of the curves are quite similar to those in Figure 9.
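+
+A minimal sketch of how the two branches of the Fisher curve could be separated from recorded simulation output is given here. It assumes a single sweep of β (up to its maximum and back) and that Φβ, Ψβ and Hβ have already been computed per iteration; the function name `split_fisher_curve` and the toy numbers in the usage lines are hypothetical.
+
```python
def split_fisher_curve(betas, phi, psi, entropy):
    """Split one beta sweep into the A->B and B->A branches of the 3D Fisher curve."""
    if not (len(betas) == len(phi) == len(psi) == len(entropy)):
        raise ValueError("all sequences must have the same length")
    if not betas:
        raise ValueError("sequences must be non-empty")
    # Each point of the curve is the triple (phi, psi, entropy) at one iteration.
    points = list(zip(phi, psi, entropy))
    # First iteration at which beta reaches its maximum (point B).
    turn = max(range(len(betas)), key=lambda i: betas[i])
    forward = points[:turn + 1]   # path from A to B (beta increasing)
    backward = points[turn:]      # path from B to A (beta decreasing)
    return forward, backward


# Toy data only: an asymmetric sweep, so the two branches do not coincide.
fwd, bwd = split_fisher_curve(
    [0.0, 0.1, 0.2, 0.1, 0.0],
    [1.0, 2.0, 3.0, 2.5, 1.0],
    [1.0, 4.0, 9.0, 6.0, 1.0],
    [0.0, 1.0, 2.0, 1.5, 0.0],
)
```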
+
+[Figure 9 panels: “2D Fisher curve for a GMRF model” (axes PHI and PSI, points A and B, equilibrium line) and “3D Fisher curve for a GMRF model” (axes PHI, PSI and H, points A and B).]
+
+Figure 9. 2D and 3D Fisher curves of a complex system along an MCMC simulation. The graph shows
+a parametric curve obtained by varying the β parameter from βMIN to βMAX and back. Note that, from
+a differential geometry perspective, as the divergence between Φβ and Ψβ increases, the torsion of the
+parametric curve becomes evident (the curve leaves the plane of constant entropy).
+
+[Figure 10 panels: “2D Fisher curve for an isotropic pairwise GMRF model” (axes PHI and PSI, points A and B, equilibrium line) and “3D Fisher curve for an isotropic pairwise GMRF model” (axes PHI, PSI and H, points A and B).]
+
+Figure 10. 2D and 3D Fisher curves along another MCMC simulation. The graph shows a parametric
+curve obtained by varying the β parameter from βMIN to βMAX and back. Note that, from a geometrical
+perspective, the properties of these curves are essentially the same as the ones from the previous
+simulation.
+
+We can see that the majority of points along the Fisher curve are concentrated around two regions
+of high curvature: (A) around the information equilibrium condition (an absence of short-term and
+long-term correlations, since β = 0); and (B) around the maximum entropy value, where the divergence
+between the information values is maximum (self-organization emerges, since β is greater than a
+critical value, βc). The points that lie in the middle of the path connecting these two regions represent
+the system undergoing a phase transition. Its properties change rapidly and in an asymmetric way,
+since ⃗F(β)_A^B ≠ ⃗F(β)_B^A for a given natural orientation.
+By now, some observations can be highlighted. First, the natural orientation of the Fisher curve
+defines the direction of time. The natural A–B path (increase in entropy) is given by the blue curve and
+the natural B–A path (decrease in entropy) is given by the red curve. In other words, the only possible
+way to walk from A to B (increase Hβ) by the red path or to walk from B to A (decrease Hβ) by the
+blue path would be moving back in time (by running the recorded simulation backwards). We believe
+that a possible explanation for this fact is that the deformation process that takes the
+original geometric structure (with constant curvature) of the usual Gaussian model (A) to the novel
+geometric structure of the isotropic pairwise GMRF model (B) is not reversible. In other words, the
+way the model is "curved" is not simply the reversal of the "flattening" process (when it is restored to its
+constant curvature form). Thus, even the basic notion of time seems to be deeply connected with the
+relationship between entropy and Fisher information in a complex system: in the natural orientation
+(forward in time), it seems that the divergence between Φβ and Ψβ is the cause of an increase in
+the entropy, and the decrease of entropy is the cause of the convergence of Φβ and Ψβ. During the
+experimental analysis, we repeated the MCMC simulations with different parameter settings, and
+the observed behavior for Fisher information and entropy was the same. Figure 11 shows the graphs
+of Φβ, Ψβ and Hβ for another recorded MCMC simulation. The results indicate that in the natural
+orientation (in the direction of time), an increase in Ψβ seems to be a trigger for an increase in the
+entropy and a decrease in the entropy seems to be a trigger for a decrease in Ψβ. Roughly speaking,
+Ψβ “pushes Hβ up” and Hβ “pushes Ψβ down”.
+
+Figure 11. Relations between entropy and Fisher information. When a system modeled by an isotropic
+pairwise GMRF evolves in the natural orientation (forward in time), two rules that relate Fisher
+information and entropy can be observed: (1) an increase in Ψβ is the cause of an increase in Hβ (the
+increase in Hβ is a consequence of the increase in Ψβ); (2) a decrease in Hβ is the cause of a decrease in
+Ψβ (the decrease in Ψβ is a consequence of the decrease in Hβ). In other words, when moving towards
+higher entropy states, changes in Fisher information precede changes in entropy (Ψβ “pushes Hβ
+up”). When moving towards lower entropy states, changes in entropy precede changes in Fisher
+information (Hβ “pushes Ψβ down”).
+
+In summary, the central idea discussed here is that while entropy provides a measure
+of order/disorder of the system at a given configuration, X(t), Fisher information links these
+thermodynamical states through a path (Fisher curve). Thus, Fisher information is a powerful
+
+mathematical tool in the study of complex and dynamical systems, since it establishes how these
+different thermodynamical states are related along the evolution of the inverse temperature. Instead of
+knowing whether the entropy, Hβ, is increasing or decreasing, with Fisher information, it is possible to
+know how and why this change is happening.
+
+7.2.1. The Effect of Induced Perturbations in the System
+
+To test whether a system can recover part of its original configuration after a perturbation is
+induced, we conducted another computational experiment. During a stable simulation, two kinds of
+perturbations were induced in the system: (1) the value of the inverse temperature parameter was set
+to zero for the next two consecutive iterations; (2) the value of the inverse temperature parameter was
+set to the equilibrium value, β∗ (the solution of Equation 28), for the next two consecutive iterations.
+We should mention that in both cases, the original value of β is recovered after these two iterations
+are completed.
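+The perturbation protocol described above can be sketched as an override of the β schedule for a short window. This is an illustration under stated assumptions, not the author's code: the function name `perturb_schedule`, the iteration index and the stable value 0.15 used in the usage lines are hypothetical, and the equilibrium value β∗ would have to be obtained separately from Equation 28.
+
```python
def perturb_schedule(betas, start, value, duration=2):
    """Return a copy of betas with `value` forced for `duration` iterations from `start`.

    Outside the perturbation window the original values are kept, so the
    original beta is recovered once the window is over.
    """
    if start < 0 or start + duration > len(betas):
        raise ValueError("perturbation window falls outside the schedule")
    perturbed = list(betas)  # leave the caller's schedule untouched
    perturbed[start:start + duration] = [value] * duration
    return perturbed


# Hypothetical stable run at a constant beta, disturbed by setting beta to zero.
stable = [0.15] * 10
disturbed = perturb_schedule(stable, start=4, value=0.0)
```
+The same helper covers the second kind of perturbation by passing β∗ as `value` instead of zero.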
+When the system is disturbed by setting β to zero, the simulations indicate that the system is
+not successful in recovering components from its previous stable configuration (note that Φβ and Ψβ
+clearly touch one another in the graph). When the same perturbation is induced, but using the smallest
+of the two β∗ values (minimum solution of Equation 28), after a short period of turbulence, the system
+can recover parts (components, clusters) of its previous stable state. This behavior suggests that this
+softer perturbation is not enough to remove all the information encoded within the spatial dependence
+structure of the system, preserving some of the long-term correlations in the data (stronger bonds) and slightly
+remodeling the large clusters present in the system. Figures 12 and 13 illustrate the results.
+
+7.3. Considerations and Final Remarks
+
+The goal of this section is to summarize the main results obtained in this paper, focusing on the
+interpretation of the Fisher curve of a system modeled by a GMRF. First, our system is initialized with a
+random configuration, simulating that in the moment of its creation, the temperature is infinite (β = 0).
+We observe two important things at this moment: (1) there is a perfect symmetry in information, since
+the equilibrium condition prevails, that is, Φβ = Ψβ; (2) the entropy of the system is minimal. By a
+mere convention, we name this initial state of minimal entropy, A.
+By reducing the global temperature (β increases), this “universe” deviates from this initial
+condition. As the system drifts away from the initial condition, we clearly see a break in the
+symmetry of information (Φβ diverges from Ψβ), which apparently is the cause for an increase in the
+system’s entropy, since this symmetry break seems to precede an increase in the entropy, H. This is a
+fundamental symmetry break, since other forms of ruptures that will happen in the future and will give
+rise to several properties of the system, including the basic notion of time as an irreversible process,
+follow from this first one. During this first stage of evolution, the system evolves to the condition of
+maximum entropy, named B.
+Hence, after the break in the information equilibrium condition, there is a significant increase in
+the entropy as the system continues to evolve. This stage lasts while the temperature of the system is
+further reduced or kept established. When the temperature starts to increase (β decreases), another
+form of symmetry break takes place. By moving towards the initial condition (A) from B, changes in
+the entropy seem to precede changes in Fisher information (when moving from A to B, we observe
+exactly the opposite). Moreover, the variations in entropy and Fisher information towards A are
+not symmetric with the variations observed when we move towards B, a direct consequence of that
+first fundamental break of the information equilibrium condition. By continuing this process of
+increasing the temperature of the system until infinity (β is approaching zero), we take our system to a
+configuration that is equivalent to the initial condition, that is, where information equilibrium prevails.
+This fundamental symmetry break becomes evident when we look at the Fisher curve of the
+system. We clearly see that the path from the state of minimum entropy, A, to the state of maximum
+entropy, B, defined by the curve ⃗F_A^B (the blue trajectory in Figure 9), is not the same as the path from B
+to A, defined by the curve ⃗F_B^A (the red trajectory in Figure 9). An implication of this behavior is that if
+the system is moving along the arrow of time, then we are moving through the Fisher curve in the
+clockwise orientation. Thus, the only way to go from A to B along ⃗F_B^A (the red path) is going back in
+time.
+Therefore, if that first fundamental symmetry break did not exist, or even if it had happened, but
+all the posterior evolution of Φβ, Ψβ and Hβ were absolutely symmetric (i.e., the variations in these
+measures were exactly the same when moving from A to B and when moving from B to A), what we
+
+Figure 12. Disturbing the system to induce changes. Variation in Φβ and Ψβ after the system is
+disturbed by an abrupt change in the value of β. In the first image, the inverse temperature is set
+to zero. Note that Φβ and Ψβ touch one another, indicating that no residual information is kept, as
+if the simulation had been restarted from a random configuration. In the second image, the inverse
+temperature is set to the equilibrium value, β∗. The results suggest that this kind of perturbation is not
+enough to remove all the information within the spatial dependence structure, allowing the system to
+recover a significant part of its original configuration after a short stabilization period.
+
+Figure 13. The sequence of outputs along the MCMC simulation before and after the system is
+disturbed. The first row (when β is set to zero) shows that the system evolved to a different stable
+configuration after the perturbation. The second row (when β is set to β∗) indicates that the system
+was able to recover a significant part from its previous stable configuration.
+
+would actually see is that ⃗F_A^B = ⃗F_B^A. As a consequence, to decrease/increase the system’s temperature
+would be like moving towards the future/past. In fact, the basic notion of time in that system would
+be compromised, since time would be a perfectly reversible process (just similar to a spatial dimension,
+in which we can move in both directions). In other words, we would not distinguish whether the
+system is moving forward or moving backwards in time.
+
+8. Conclusions
+
+The definition of what is information in a complex system is a fundamental concept in
+the study of many problems. In this paper, we discussed the roles of two important statistical
+measures in isotropic pairwise Markov random fields composed of Gaussian variables: Shannon
+entropy and Fisher information. By using the pseudo-likelihood function of the GMRF model, we
+derived analytical expressions for these measures. The definition of a Fisher curve as a geometric
+representation for the study and analysis of complex systems allowed us to reveal the intrinsic
+non-linear relation between these information-theoretic measures and gain insights about the behavior
+of such systems. Computational experiments demonstrate the effectiveness of the proposed tools
+in decoding information from the underlying spatial dependence structure of a Gaussian-Markov
+random field. Typical informative patterns in complex systems are located at the boundaries of
+the clusters. One of the main conclusions of this scientific investigation concerns the notion of time
+in a complex system. The obtained results suggest that the relationship between Fisher information
+and entropy determines whether the system is moving forward or backward in time. Apparently,
+in the natural orientation (when the system is evolving forward in time), when β is growing, that
+is, the temperature of the system is reducing, an increase in Fisher information leads to an increase
+in the system’s entropy, and when β is reducing, that is, the temperature of the system is growing,
+a decrease in the system’s entropy leads to a decrease in the Fisher information. In future works
+we expect to completely characterize the metric tensor that represents the geometric structure of the
+isotropic pairwise GMRF model by specifying all the elements of the Fisher information matrix. Future
+investigations should also include the definition and analysis of the proposed tools in other Markov
+random field models, such as the Ising and Potts pairwise interaction models. Besides, a topic of
+interest concerns the investigation of minimum and maximum information paths in graphs to explore
+intrinsic similarity measures between objects belonging to a common surface or manifold in ℜn. We
+believe this study could bring benefits to some pattern recognition and data analysis computational
+applications.
+
+Acknowledgments: The author would like to thank CNPQ (Brazilian Council for Research and Development) for
+the financial support through research grant number 475054/2011-3.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+entropy
+
+Article
+Network Decomposition and Complexity Measures:
+An Information Geometrical Approach
+
+Masatoshi Funabashi
+
+Sony Computer Science Laboratories, inc. Takanawa muse bldg. 3F, 3-14-13, Higashi Gotanda, Shinagawa-ku,
+Tokyo 141-0022, Japan; E-Mail: masa_funabashi@csl.sony.co.jp; Tel.: +81-3-5448-4380; Fax: +81-3-5448-4273
+
+Received: 28 March 2014; in revised form: 24 June 2014 / Accepted: 14 July 2014 /
+Published: 23 July 2014
+
+Abstract: We consider the graph representation of the stochastic model with n binary variables, and
+develop an information theoretical framework to measure the degree of statistical association existing
+between subsystems as well as the ones represented by each edge of the graph representation. Besides,
+we consider the novel measures of complexity with respect to the system decompositionability, by
+introducing the geometric product of Kullback–Leibler (KL-) divergence. The novel complexity
+measures satisfy the boundary condition of vanishing at the limit of completely random and ordered
+state, and also with the existence of independent subsystem of any size. Such complexity measures
+based on the geometric means are relevant to the heterogeneity of dependencies between subsystems,
+and the amount of information propagation shared entirely in the system.
+
+Keywords: information geometry; complexity measure; complex network; system decompositionability;
+geometric mean
+
+1. Introduction
+
+Complex systems sciences emphasize the importance of non-linear interactions that cannot be
+easily approximated linearly. In other words, the degrees of non-linear interactions are the source of
+complexity. The classical reductionism approach generally decomposes a system into its components
+with linear interactions, and tries to evaluate whether the whole property of the system can still
+be reproduced. If this decomposition of a system destroys too much information to reproduce the
+system’s whole property, the plausibility of such reductionism is lost. Inversely, if we can evaluate
+how much information is ignored by the decomposition, we can assume how much complexity of the
+whole system is lost. This gives us a way to measure the complexity of a system with respect to the
+system decomposition.
+In stochastic systems described as a set of joint distributions, the interaction can basically be
+expressed as the statistical association between the variables. The simplest reductionism approach is to
+separate the whole system into some subsets of variables, and assume the independence between them.
+If such decomposition does not affect the system’s property, the isolated subsystem is independent
+from the rest. On the other hand, if the decomposition loses too much information, then the subsystem
+is inside of a larger subsystem with strong internal dependencies and can not be easily separated.
+The stochastic models have often been represented with the use of graph representation, and
+treated with the name of complex network [1–3]. Generally, the nodes represent the variables and
+the weights on the edges are the statistical association between them. However, if we consider the
+information contained in the different orders of dependencies among variables, the graph with a single
+kind of edges is not sufficient to express the whole information of the system [4]. An edge of a graph
+with n nodes contains the information of statistical association up to the n-th order dependencies
+among n variables. If we try to decompose the system independently by cutting this information, we
+have to consider what it means to cut the edge of the graph from the information theoretical point
+of view.
+
+Entropy 2014, 16, 4132–4167; doi:10.3390/e16074132
+www.mdpi.com/journal/entropy
+
+Indeed, analysis of the degree of dependencies existing between variables has produced many definitions
+of complexity in stochastic models [5], which have mostly been studied from an information theoretical
+perspective. Beginning with the seminal works of Lempel and Ziv (e.g., [6]), the computation-oriented definition
+of complexity takes deterministic formalization and measures the necessary information to reproduce
+a given symbolic sequence exactly, which is classified with the name of algorithmic complexity [7–9].
+On the other hand, statistical approach to complexity, namely statistical complexity, assumes
+some stochastic model as theoretical basis, and refers to the structure of information source on it in
+measure-theoretic way [10–12].
+One of the most classical statistical complexities is the mutual information between two stochastic
+variables, and its generalized form to measure dependence between n variables is proposed (e.g., [13])
+and explored in relevance to statistical models and theories by several authors [14–16].
+We should also recall that complexity is not necessarily conditioned only by information theory,
+but is rather motivated by the organization of living systems, such as brain activity. The TSE complexity
+shows further extension of generalized mutual information into biological context, where complexity
+exists as the heterogeneity between different system hierarchies [17]. These statistical complexities
+are all based on the boundary condition of vanishing at the limit of completely random and ordered
+state [18].
+A complexity measure is usually a projection from the system’s variables to a one-dimensional
+quantity, constructed to express the degree of whatever characteristic we define to be important in
+the meaning of “complexity”. Since a complexity measure is always a many-to-one association, it both
+compresses information, so as to classify systems from simple to complex, and loses resolution
+of the system’s phase space. If the system has n variables, we generally need n independent complexity
+measures to completely characterize the system with real-value resolution. The problem of
+defining a complexity measure lies in balancing theoretically supported information compression of the
+system’s complexity against a resolution of system identification that is
+high enough to avoid trivial classification. The latter criterion increases in importance as
+the system size becomes larger. A better complexity measure is therefore a set of indices, as
+few as possible, which characterizes the major features related to the complexity of the system.
+In this sense, the ensemble of complexity measures is also analogous to the feature space of a support
+vector machine. A non-trivial set of complexity measures needs to be mutually complementary in
+parameter space for the best possible discrimination of different systems.
+In this paper, we first consider a stochastic system with binary variables and theoretically
+develop a way to measure the information between subsystems that is consistent with the information
+represented by the edges of the graph representation.
+Next, we focus on the generalized mutual information as a starting point of the argument,
+and further consider incorporating network heterogeneity into novel measures of complexity with
+respect to the system’s decompositionability. This approach will be shown to be complementary to
+TSE complexity as the difference between arithmetic and geometric means of information.
+
+2. System Decomposition
+
+Let us consider a stochastic system with n binary variables $x = (x_1, \cdots, x_n)$, where $x_i \in \{0, 1\}$
+(1 ≤ i ≤ n). We denote the joint distribution of x by p(x). We define the decomposition $p^{dec}(x)$
+of p(x) into two subsystems $y_1 = (x^1_1, \cdots, x^1_{n_1})$ and $y_2 = (x^2_1, \cdots, x^2_{n_2})$
+($n_1 + n_2 = n$, $y_1 \cup y_2 = x$, $y_1 \cap y_2 = \emptyset$) as follows:
+
+\[
+p^{dec}(x) = p(y_1)\, p(y_2), \tag{1}
+\]
+
+where p(y1) and p(y2) are the joint distributions of y1 and y2, respectively. For simplicity, hereafter
+we denote the system decomposition using the smallest subscript of variables in each subsystem. For
+example, in case n = 4, y1 = (x1, x3) and y2 = (x2, x4), we describe the decomposed system pdec(x)
+as < 1212 >. System decomposition means cutting all statistical association between the two
+subsystems, which is expressed by setting them to be independent.
+We will further consider Equation (1) in terms of the graph representation. We define
+the undirected graph Γ := (V, E) of the system p(x), whose vertices V = {x1, · · · , xn} and edges
+E = V × V represent the variables and the statistical association, respectively. To express the system,
+we set the value of each vertex as the value of the corresponding variable, and the weight of each edge
+as the degree of dependency between the connected variables.
+There is, however, a problem with a representation that uses a single kind of edge. The
+statistical association among variables is not only pairwise, but can be independently
+defined among multiple variables up to the n-th order. Therefore, the exact definition of the weight
+of the edges remains unclear. To clarify this problem, we consider the hierarchical marginal
+distributions $\boldsymbol{\eta}$ as another coordinate system of p(x) as follows:
+
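+\[
+\boldsymbol{\eta} = (\boldsymbol{\eta}_1; \boldsymbol{\eta}_2; \cdots; \boldsymbol{\eta}_n), \tag{2}
+\]
+
+where
+
+\begin{align}
+\boldsymbol{\eta}_1 &= (\eta_1, \cdots, \eta_i, \cdots, \eta_n), \quad (1 \le i \le n), \notag \\
+\boldsymbol{\eta}_2 &= (\eta_{1,2}, \cdots, \eta_{i,j}, \cdots, \eta_{n-1,n}), \quad (1 \le i < j \le n), \notag \\
+&\;\;\vdots \notag \\
+\boldsymbol{\eta}_n &= \eta_{1,2,\cdots,n}, \tag{3}
+\end{align}
+
+and
+
+\begin{align}
+\eta_1 &= \sum_{i_2,\cdots,i_n \in \{0,1\}} p(1, i_2, \cdots, i_n), \notag \\
+&\;\;\vdots \notag \\
+\eta_n &= \sum_{i_1,\cdots,i_{n-1} \in \{0,1\}} p(i_1, \cdots, i_{n-1}, 1), \notag \\
+\eta_{1,2} &= \sum_{i_3,\cdots,i_n \in \{0,1\}} p(1, 1, i_3, \cdots, i_n), \notag \\
+&\;\;\vdots \notag \\
+\eta_{n-1,n} &= \sum_{i_1,\cdots,i_{n-2} \in \{0,1\}} p(i_1, \cdots, i_{n-2}, 1, 1), \notag \\
+&\;\;\vdots \notag \\
+\eta_{1,2,\cdots,n} &= p(1, 1, \cdots, 1). \tag{4}
+\end{align}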
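+
+As a minimal worked illustration, assume n = 2 (a special case chosen here only for concreteness). The
+hierarchical marginals then reduce to three numbers:
+
+\begin{align*}
+\eta_1 &= p(1,0) + p(1,1), \\
+\eta_2 &= p(0,1) + p(1,1), \\
+\eta_{1,2} &= p(1,1).
+\end{align*}
+
+Together with the normalization $\sum_x p(x) = 1$, these three values recover the full joint distribution:
+
+\begin{align*}
+p(1,1) &= \eta_{1,2}, \qquad p(1,0) = \eta_1 - \eta_{1,2}, \qquad p(0,1) = \eta_2 - \eta_{1,2}, \\
+p(0,0) &= 1 - \eta_1 - \eta_2 + \eta_{1,2}.
+\end{align*}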
+
+Since the definition of $\boldsymbol{\eta}$ is a linear transformation of p(x), both coordinates have
+$\sum_{k=1}^{n} {}_nC_k = 2^n - 1$ degrees of freedom.
+The subcoordinates $\boldsymbol{\eta}_1$ are simply the set of marginal distributions of each variable.
+The subcoordinates $\boldsymbol{\eta}_k$ (1 < k ≤ n) include the statistical association among k variables that cannot
+be expressed with coordinates of order lower than k. This means that different statistical
+associations exist independently in each order among the corresponding sets of variables. The
+statistical association represented by the weight of a graph edge {xi, xj} is therefore the superposition
+of the different dependencies defined on every subset of x that includes xi and xj.
+To measure the degree of statistical association in each order, information geometry established
+the following setting [19]. We first define other coordinates
+$\boldsymbol{\theta} = (\boldsymbol{\theta}_1; \boldsymbol{\theta}_2; \cdots; \boldsymbol{\theta}_n)$ that are the dual
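+coordinates of $\boldsymbol{\eta}$ with respect to the Legendre transformation of the exponential family’s potential
+function $\psi(\boldsymbol{\theta})$ to its conjugate potential $\phi(\boldsymbol{\eta})$ as follows:
+
+\begin{align}
+\boldsymbol{\theta}_1 &= (\theta_1, \cdots, \theta_n), \notag \\
+\boldsymbol{\theta}_2 &= (\theta_{1,2}, \cdots, \theta_{n-1,n}), \tag{5} \\
+&\;\;\vdots \notag \\
+\boldsymbol{\theta}_n &= \theta_{1,2,\cdots,n}, \notag
+\end{align}
+
+where
+
+\begin{align}
+\psi(\boldsymbol{\theta}) &= \log \frac{1}{p(0, \cdots, 0)}, \notag \\
+\phi(\boldsymbol{\eta}) &= \sum_i \theta_i \eta_i + \sum_{i<j} \theta_{i,j} \eta_{i,j} + \cdots
+  + \theta_{1,2,\cdots,n}\, \eta_{1,2,\cdots,n} - \psi(\boldsymbol{\theta}), \notag \\
+\theta_i &= \frac{\partial \phi(\boldsymbol{\eta})}{\partial \eta_i}, \quad (1 \le i \le n), \notag \\
+\theta_{i,j} &= \frac{\partial \phi(\boldsymbol{\eta})}{\partial \eta_{i,j}}, \quad (1 \le i < j \le n), \tag{6} \\
+&\;\;\vdots \notag \\
+\theta_{1,2,\cdots,n} &= \frac{\partial \phi(\boldsymbol{\eta})}{\partial \eta_{1,2,\cdots,n}}. \notag
+\end{align}
+
+Note that $\boldsymbol{\eta}$ can be inversely derived from $\boldsymbol{\theta}$, following the Legendre transformation
+between $\phi(\boldsymbol{\eta})$ and $\psi(\boldsymbol{\theta})$:
+
+\begin{align}
+\eta_i &= \frac{\partial \psi(\boldsymbol{\theta})}{\partial \theta_i}, \quad (1 \le i \le n), \notag \\
+\eta_{i,j} &= \frac{\partial \psi(\boldsymbol{\theta})}{\partial \theta_{i,j}}, \quad (1 \le i < j \le n), \tag{7} \\
+&\;\;\vdots \notag \\
+\eta_{1,2,\cdots,n} &= \frac{\partial \psi(\boldsymbol{\theta})}{\partial \theta_{1,2,\cdots,n}}. \notag
+\end{align}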
+
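+Using the coordinates $\boldsymbol{\theta}$, the system is described in the form of the exponential family as follows:
+
+\[
+\log p(x) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{i,j} x_i x_j + \cdots
+  + \theta_{1,2,\cdots,n}\, x_1 x_2 \cdots x_n - \psi(\boldsymbol{\theta}). \tag{8}
+\]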
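+
+To make the role of the θ coordinates concrete, the following sketch works out the case n = 2 (an
+illustrative special case that assumes all four probabilities are positive). Writing the exponential-family
+form as $\log p(x_1, x_2) = \theta_1 x_1 + \theta_2 x_2 + \theta_{1,2} x_1 x_2 - \psi$ and evaluating it at the
+four states gives
+
+\begin{align*}
+\psi &= -\log p(0,0), \\
+\theta_1 &= \log \frac{p(1,0)}{p(0,0)}, \qquad \theta_2 = \log \frac{p(0,1)}{p(0,0)}, \\
+\theta_{1,2} &= \log \frac{p(1,1)\, p(0,0)}{p(1,0)\, p(0,1)}.
+\end{align*}
+
+Here $\theta_{1,2}$ is the log odds ratio of the two variables; it vanishes exactly when
+$p(x_1, x_2) = p(x_1)\, p(x_2)$, which is the sense in which a zero θ coordinate encodes independence.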
+
+Information geometry revealed that the exponential family of probability distributions forms a
+manifold with a dual-flat structure. More precisely, the coordinates $\boldsymbol{\theta}$ form a flat manifold with respect
+to the Fisher information matrix as the Riemannian metric and the α-connection with α = 1. Dually to $\boldsymbol{\theta}$,
+the coordinates $\boldsymbol{\eta}$ are flat with respect to the same metric but the α-connection with α = −1. It is known that
+$\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ are orthogonal to each other with respect to the Fisher information matrix. This structure gives
+us a way to decompose the degree of statistical association among variables into separate elements of
+arbitrary orders. We define the so-called k-cut mixture coordinates $\boldsymbol{\zeta}^k$ as follows [14]:
+
+\begin{align}
+\boldsymbol{\zeta}^k &= (\boldsymbol{\eta}^{k-}; \boldsymbol{\theta}^{k+}), \tag{9} \\
+\boldsymbol{\eta}^{k-} &= (\boldsymbol{\eta}_1, \cdots, \boldsymbol{\eta}_k), \tag{10} \\
+\boldsymbol{\theta}^{k+} &= (\boldsymbol{\theta}_{k+1}, \cdots, \boldsymbol{\theta}_n). \tag{11}
+\end{align}
+
+We also define the k-cut mixture coordinates $\boldsymbol{\zeta}^k_0 = (\boldsymbol{\eta}^{k-}; 0, \cdots, 0)$ with no dependency above the
+k-th order. We denote the systems specified with $\boldsymbol{\zeta}^k$ and $\boldsymbol{\zeta}^k_0$ as $p(x, \boldsymbol{\zeta}^k)$ and
+$p(x, \boldsymbol{\zeta}^k_0)$, respectively.
+Then the degree of statistical association above the k-th order in the system can be
+measured by the Kullback-Leibler (KL-) divergence $D[p(x, \boldsymbol{\zeta}^k) : p(x, \boldsymbol{\zeta}^k_0)]$:
+
+\[
+2N \cdot D[p(x, \boldsymbol{\zeta}^k) : p(x, \boldsymbol{\zeta}^k_0)] \sim \chi^2\Bigl(\sum_{i=k+1}^{n} {}_nC_i\Bigr), \tag{12}
+\]
+
+where D[· : ·] is the KL-divergence from the first system to the second one.
+Here, the decomposition is performed according to the orders of statistical association, which does
+not spatially distinguish the vertices. If we define the weight of an edge {xi, xj} with the KL-divergence,
+the above k-cut coordinates $\boldsymbol{\zeta}^k$ are not appropriate for measuring the information represented in each
+edge. We need to set up other mixture coordinates so as to separate only the information existing
+between xi and xj, regardless of its order.
+Let us return to the definition of the system decomposition and consider it in the dual-flat
+coordinates $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$.
+
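+Proposition 1. The independence between the two decomposed systems $y_1 = (x^1_1, \cdots, x^1_{n_1})$ and
+$y_2 = (x^2_1, \cdots, x^2_{n_2})$ can be expressed in the new coordinates $\boldsymbol{\eta}^{dec}$ as follows:
+
+\begin{align}
+\eta^{dec}_i &= \eta_i, \quad (1 \le i \le n), \notag \\
+\eta^{dec}_{i,j} &=
+  \begin{cases}
+    \eta_{i,j}, & \text{if } \{x_i, x_j\} \subseteq y_1 \text{ or } \subseteq y_2, \\
+    \eta_i \eta_j, & \text{else},
+  \end{cases}
+  \quad (1 \le i < j \le n), \notag \\
+\eta^{dec}_{i,j,k} &=
+  \begin{cases}
+    \eta_{i,j,k}, & \text{if } \{x_i, x_j, x_k\} \subseteq y_1 \text{ or } \subseteq y_2, \\
+    \eta_{i,j}\, \eta_k, & \text{else if } \{x_i, x_j\} \subseteq y_1 \text{ or } \subseteq y_2, \\
+    \eta_i\, \eta_{j,k}, & \text{else if } \{x_j, x_k\} \subseteq y_1 \text{ or } \subseteq y_2, \\
+    \eta_j\, \eta_{i,k}, & \text{else (if } \{x_i, x_k\} \subseteq y_1 \text{ or } \subseteq y_2),
+  \end{cases}
+  \quad (1 \le i < j < k \le n), \notag \\
+&\;\;\vdots \tag{13} \\
+\eta^{dec}_{1,2,\cdots,n} &= \eta_{s[i,k_1,\cdots,k_{n_1-1}]}\, \eta_{s[j,l_1,\cdots,l_{n_2-1}]},
+  \quad (x_i \in y_1,\ x_j \in y_2), \notag
+\end{align}
+
+where s[· · · ] is the ascending sort of the internal sequence.
+Then the corresponding dual coordinates $\boldsymbol{\theta}^{dec}$ take 0 elements as follows:
+
+\begin{align}
+\theta^{dec}_{i,j} &= 0, \quad (1 \le i < j \le n),
+  \quad \text{if } \{x_i, x_j\} \cap y_1 \neq \emptyset \text{ and } \{x_i, x_j\} \cap y_2 \neq \emptyset, \notag \\
+\theta^{dec}_{i,j,k} &= 0, \quad (1 \le i < j < k \le n),
+  \quad \text{if } \{x_i, x_j, x_k\} \cap y_1 \neq \emptyset \text{ and } \{x_i, x_j, x_k\} \cap y_2 \neq \emptyset, \notag \\
+&\;\;\vdots \notag \\
+\theta^{dec}_{1,2,\cdots,n} &= 0. \tag{14}
+\end{align}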
+
+Proof. For simplicity, we show the cases of n = 2 and n = 3 for the first node separation.
+
+For n = 2, the above defined $\boldsymbol{\eta}^{dec}$ for the system decomposition < 12 > gives its dual coordinates
+$\boldsymbol{\theta}^{dec}$ as follows:
+
+\begin{align}
+\theta^{dec}_1 &= \log \frac{\eta^{dec}_1 - \eta^{dec}_{1,2}}{1 - \eta^{dec}_1 - \eta^{dec}_2 + \eta^{dec}_{1,2}}
+  = \log \frac{\eta_1}{1 - \eta_1}, \notag \\
+\theta^{dec}_2 &= \log \frac{\eta^{dec}_2 - \eta^{dec}_{1,2}}{1 - \eta^{dec}_1 - \eta^{dec}_2 + \eta^{dec}_{1,2}}
+  = \log \frac{\eta_2}{1 - \eta_2}, \notag \\
+\theta^{dec}_{1,2} &= \log \frac{\eta^{dec}_{1,2}\,(1 - \eta^{dec}_1 - \eta^{dec}_2 + \eta^{dec}_{1,2})}
+  {(\eta^{dec}_1 - \eta^{dec}_{1,2})(\eta^{dec}_2 - \eta^{dec}_{1,2})} = 0, \tag{15}
+\end{align}
+
+which means that the first and second nodes are independent.
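+For n = 3, the above defined $\boldsymbol{\eta}^{dec}$ for the system decomposition < 122 > gives its dual coordinates
+$\boldsymbol{\theta}^{dec}$ as follows:
+
+\begin{align}
+\theta^{dec}_1 &= \log \frac{\eta^{dec}_1 - \eta^{dec}_{1,2} - \eta^{dec}_{1,3} + \eta^{dec}_{1,2,3}}
+  {1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3}}
+  = \log \frac{\eta_1}{1 - \eta_1}, \notag \\
+\theta^{dec}_2 &= \log \frac{\eta^{dec}_2 - \eta^{dec}_{1,2} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3}}
+  {1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3}}
+  = \log \frac{\eta_2 - \eta_{2,3}}{1 - \eta_2 - \eta_3 + \eta_{2,3}}, \notag \\
+\theta^{dec}_3 &= \log \frac{\eta^{dec}_3 - \eta^{dec}_{1,3} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3}}
+  {1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3}}
+  = \log \frac{\eta_3 - \eta_{2,3}}{1 - \eta_2 - \eta_3 + \eta_{2,3}}, \tag{16}
+\end{align}
+
+\begin{align}
+\theta^{dec}_{1,2} &= \log \frac{(\eta^{dec}_{1,2} - \eta^{dec}_{1,2,3})
+  (1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})}
+  {(\eta^{dec}_1 - \eta^{dec}_{1,2} - \eta^{dec}_{1,3} + \eta^{dec}_{1,2,3})
+   (\eta^{dec}_2 - \eta^{dec}_{1,2} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})} = 0, \notag \\
+\theta^{dec}_{1,3} &= \log \frac{(\eta^{dec}_{1,3} - \eta^{dec}_{1,2,3})
+  (1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})}
+  {(\eta^{dec}_1 - \eta^{dec}_{1,2} - \eta^{dec}_{1,3} + \eta^{dec}_{1,2,3})
+   (\eta^{dec}_3 - \eta^{dec}_{1,3} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})} = 0, \notag \\
+\theta^{dec}_{2,3} &= \log \frac{(\eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})
+  (1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})}
+  {(\eta^{dec}_2 - \eta^{dec}_{1,2} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})
+   (\eta^{dec}_3 - \eta^{dec}_{1,3} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})}
+  = \log \frac{\eta_{2,3}\,(1 - \eta_2 - \eta_3 + \eta_{2,3})}{(\eta_2 - \eta_{2,3})(\eta_3 - \eta_{2,3})}, \tag{17}
+\end{align}
+
+\begin{align}
+\theta^{dec}_{1,2,3} &= \log \Biggl[
+  \frac{\eta^{dec}_{1,2,3}}
+  {(\eta^{dec}_{1,2} - \eta^{dec}_{1,2,3})(\eta^{dec}_{1,3} - \eta^{dec}_{1,2,3})(\eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})} \notag \\
+&\quad \times
+  \frac{(\eta^{dec}_1 - \eta^{dec}_{1,2} - \eta^{dec}_{1,3} + \eta^{dec}_{1,2,3})
+        (\eta^{dec}_2 - \eta^{dec}_{1,2} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})
+        (\eta^{dec}_3 - \eta^{dec}_{1,3} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})}
+  {1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3}}
+  \Biggr] = 0, \tag{18}
+\end{align}
+
+which means that the first node is independent of the other nodes.
+The generalization is possible with the use of a recurrence formula between system sizes n and
+n + 1, according to the symmetry of the model and the Legendre transformation between the
+$\boldsymbol{\eta}^{dec}$ and $\boldsymbol{\theta}^{dec}$ coordinates.
+A numerical proof can be obtained by directly computing the 0 elements of $\boldsymbol{\theta}^{dec}$ from $\boldsymbol{\eta}^{dec}$.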
+
+The definition of $\boldsymbol{\eta}^{dec}$ means decomposing the hierarchical marginal distributions $\boldsymbol{\eta}$ into the
+products of the subsystems’ marginal distributions whenever the subscripts traverse the two subsystems.
+Therefore, only the statistical associations between the two subsystems are set to be independent, while
+the internal dependencies of each subsystem remain unchanged. This is analytically equivalent to
+composing another set of mixture coordinates $\boldsymbol{\xi}$, namely the < · · · >-cut coordinates, with a proper description
+of the system decomposition with < · · · >. The coordinates $\boldsymbol{\xi}$ consist of the $\boldsymbol{\eta}$ coordinates whose subscripts
+do not traverse the decomposed subsystems, and the $\boldsymbol{\theta}$ coordinates whose subscripts traverse
+them.
+For simplicity, we only describe here the case n = 4 and the decomposition < 1133 > (the first and
+second nodes form one subsystem, and the third and fourth nodes the other). The system p(x) is expressed
+with the < 1133 >-cut coordinates $\boldsymbol{\xi}$ as
+
+\begin{align}
+\xi_1 &= \eta_1, \notag \\
+&\;\;\vdots \notag \\
+\xi_4 &= \eta_4, \notag \\
+\xi_{1,2} &= \eta_{1,2}, \notag \\
+\xi_{1,3} &= \theta_{1,3}, \notag \\
+\xi_{1,4} &= \theta_{1,4}, \notag \\
+\xi_{2,3} &= \theta_{2,3}, \tag{19} \\
+\xi_{2,4} &= \theta_{2,4}, \notag \\
+\xi_{3,4} &= \eta_{3,4}, \notag \\
+\xi_{1,2,3} &= \theta_{1,2,3}, \notag \\
+&\;\;\vdots \notag \\
+\xi_{2,3,4} &= \theta_{2,3,4}, \notag \\
+\xi_{1,2,3,4} &= \theta_{1,2,3,4}. \notag
+\end{align}
+
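+The decomposed system with no statistical association between the two subsystems has the
+following coordinates $\boldsymbol{\xi}^{dec}$, which, for any decomposition, is equivalent to setting all θ in $\boldsymbol{\xi}$ to 0:
+
+\begin{align}
+\xi^{dec}_1 &= \eta_1, \notag \\
+&\;\;\vdots \notag \\
+\xi^{dec}_4 &= \eta_4, \notag \\
+\xi^{dec}_{1,2} &= \eta_{1,2}, \notag \\
+\xi^{dec}_{1,3} &= 0, \notag \\
+\xi^{dec}_{1,4} &= 0, \notag \\
+\xi^{dec}_{2,3} &= 0, \tag{20} \\
+\xi^{dec}_{2,4} &= 0, \notag \\
+\xi^{dec}_{3,4} &= \eta_{3,4}, \notag \\
+\xi^{dec}_{1,2,3} &= 0, \notag \\
+&\;\;\vdots \notag \\
+\xi^{dec}_{2,3,4} &= 0, \notag \\
+\xi^{dec}_{1,2,3,4} &= 0. \notag
+\end{align}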
+
+This is analytically equivalent to the definition of the decomposition (13)–(14) in the case of < 1133 >.
+Therefore, the KL-divergence $D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{dec})]$ measures the information lost by the system
+decomposition. The following asymptotic agreement with the χ2 test also holds.
+
+Proposition 2.
+
+\[
+2N \cdot D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{dec})] \sim \chi^2(\sharp\theta(\boldsymbol{\xi})), \tag{21}
+\]
+
+where $\sharp\theta(\boldsymbol{\xi})$ is the number of $\boldsymbol{\theta}$ coordinates appearing in the $\boldsymbol{\xi}$ coordinates.
+
+3. Edge Cutting
+
+We further expand the concept of system decomposition to eventually quantify the total amount of
+information expressed by an edge of the graph. Let us consider cutting an edge {xi, xj} (1 ≤ i < j ≤ n)
+of the graph with n vertices. Hereafter we call this operation the edge cutting i − j. In the same way
+as the system decomposition, the edge cutting corresponds to modifying the $\boldsymbol{\eta}$ coordinates to produce
+the $\boldsymbol{\eta}^{ec}$ coordinates as follows:
+
+\begin{align}
+\eta^{ec}_{i,j} &= \eta_i \eta_j, \notag \\
+\eta^{ec}_{s[i,j,k_1]} &= \eta_{s[i,k_1]}\, \eta_{s[j,k_1]}, \notag \\
+\eta^{ec}_{s[i,j,k_1,k_2]} &= \eta_{s[i,k_1,k_2]}\, \eta_{s[j,k_1,k_2]}, \tag{22} \\
+&\;\;\vdots \notag \\
+\eta^{ec}_{s[i,j,k_1,\cdots,k_{n-2}]} &= \eta_{s[i,k_1,\cdots,k_{n-2}]}\, \eta_{s[j,k_1,\cdots,k_{n-2}]}, \notag \\
+(\{i, j, k_1, \cdots, k_{n-2}\} &= \{1, \cdots, n\}), \notag
+\end{align}
+
+and the rest of $\boldsymbol{\eta}^{ec}$ remains the same as in $\boldsymbol{\eta}$.
+The formation of $\boldsymbol{\eta}^{ec}$ from $\boldsymbol{\eta}$ consists of replacing the k-th order elements (k ≥ 3) of $\boldsymbol{\eta}$ that include both
+i and j in their subscripts with the product of the (k − 1)-th order η in the maximum subgraphs (k − 1 vertices)
+each including i or j. This means that all orders of statistical association including the variables xi and
+xj are set to be independent only between them. Other relations that do not simultaneously include xi
+and xj remain unchanged.
+Certain combinations of edge cuttings coincide with system decompositions. For example, in the case
+n = 4, the edge cuttings 1 − 2, 1 − 3, and 1 − 4 together are equivalent to the system decomposition < 1222 >.
+We define the i − j-cut mixture coordinates $\boldsymbol{\xi}$ for the orthogonal decomposition of the statistical
+association represented by the edge {xi, xj}. Although the actual calculation can be performed with the $\boldsymbol{\eta}$
+coordinates alone, this generalization is necessary to have a geometrical definition of the orthogonality. For
+simplicity, we only describe $\boldsymbol{\xi}$ in the case of n = 4:
+
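+\begin{align}
+\xi_1 &= \eta_1, \notag \\
+&\;\;\vdots \notag \\
+\xi_4 &= \eta_4, \notag \\
+\xi_{1,2} &= \theta_{1,2}, \notag \\
+\xi_{1,3} &= \eta_{1,3}, \notag \\
+\xi_{1,4} &= \eta_{1,4}, \notag \\
+\xi_{2,3} &= \eta_{2,3}, \notag \\
+\xi_{2,4} &= \eta_{2,4}, \tag{23} \\
+\xi_{3,4} &= \eta_{3,4}, \notag \\
+\xi_{1,2,3} &= \theta_{1,2,3}, \notag \\
+\xi_{1,2,4} &= \theta_{1,2,4}, \notag \\
+\xi_{1,3,4} &= \eta_{1,3,4}, \notag \\
+\xi_{2,3,4} &= \eta_{2,3,4}, \notag \\
+\xi_{1,2,3,4} &= \theta_{1,2,3,4}, \notag
+\end{align}
+
+where orthogonality between the elements of $\boldsymbol{\eta}$ and $\boldsymbol{\theta}$ holds with respect to the Fisher information
+matrix.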
+Calculating the dual coordinates $\boldsymbol{\theta}^{ec}$ of $\boldsymbol{\eta}^{ec}$, we can define the coordinates $\boldsymbol{\xi}^{ec}$ of the system after
+the edge cutting 1 − 2 as follows:
+
+\begin{align*}
+\xi^{ec}_1 &= \eta_1, \\
+&\;\;\vdots \\
+\xi^{ec}_4 &= \eta_4, \\
+\xi^{ec}_{1,2} &= \theta^{ec}_{1,2}, \\
+\xi^{ec}_{1,3} &= \eta_{1,3}, \\
+\xi^{ec}_{1,4} &= \eta_{1,4}, \\
+\xi^{ec}_{2,3} &= \eta_{2,3}, \\
+\xi^{ec}_{2,4} &= \eta_{2,4}, \\
+\xi^{ec}_{3,4} &= \eta_{3,4}, \\
+\xi^{ec}_{1,2,3} &= \theta^{ec}_{1,2,3}, \\
+\xi^{ec}_{1,2,4} &= \theta^{ec}_{1,2,4}, \\
+\xi^{ec}_{1,3,4} &= \eta_{1,3,4}, \\
+\xi^{ec}_{2,3,4} &= \eta_{2,3,4}, \\
+\xi^{ec}_{1,2,3,4} &= \theta^{ec}_{1,2,3,4}.
+\end{align*}
+
+Note that the edge cutting cannot be defined simply by setting the corresponding elements of $\boldsymbol{\theta}^{ec}$ to 0.
+Then the KL-divergence $D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{ec})]$ represents the total amount of information
+represented by the edge 1 − 2.
+The following asymptotic agreement with the χ2 test also holds:
+
+Proposition 3.
+
+\[
+2N \cdot D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{ec})] \sim \chi^2\Bigl(1 + \sum_{k=1}^{n-2} {}_{n-2}C_k\Bigr). \tag{24}
+\]
+
+We call this χ2 value, or the KL-divergence itself, the edge information of edge 1 − 2.
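+
+As a quick consistency check (a worked count added for illustration, not a new result), the degrees of
+freedom in Proposition 3 simplify by the binomial theorem:
+
+\[
+1 + \sum_{k=1}^{n-2} {}_{n-2}C_k = \sum_{k=0}^{n-2} {}_{n-2}C_k = 2^{\,n-2}.
+\]
+
+For n = 4 this gives $2^2 = 4$, which matches the four θ coordinates that appear in the 1 − 2-cut mixture
+coordinates of Equation (23), namely $\theta_{1,2}$, $\theta_{1,2,3}$, $\theta_{1,2,4}$, and $\theta_{1,2,3,4}$.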
+
+4. Generalized Mutual Information as Complexity with Respect to the Total System Decomposition
+
+In the previous sections, we introduced a measure of complexity in terms of system
+decomposition, by measuring the KL-divergence between a given system and its independently
+decomposed subsystems.
+We consider here the total system decomposition, and measure the
+informational distance I between the system and the totally decomposed system in which all elements
+are independent:
+
+\[
+I := \sum_{i=1}^{n} H(x_i) - H(x_1, \cdots, x_n), \tag{25}
+\]
+
+where
+
+\[
+H(x) := -\sum_{x} p(x) \log p(x). \tag{26}
+\]
+
+This quantity is the generalization of mutual information, and is named in various ways, such as
+generalized mutual information, integration, complexity, multi-information, etc., according to different
+authors. For simplicity, we call I the “multi-information”, following [15]. This quantity can be
+interpreted as a measure of complexity that sums up the order-wise statistical association existing in
+each subset of components with an information-geometrical formalization [14].
+For simplicity, we denote the multi-information I of n-dimensional stochastic binary variables as
+follows, using the notation of the system decomposition:
+
+\[
+I = D[\langle 111 \cdots 1 \rangle : \langle 123 \cdots n \rangle]. \tag{27}
+\]
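+
+As a small worked example (chosen here for illustration and assuming natural logarithms), take n = 2
+with perfectly correlated variables, $p(0,0) = p(1,1) = 1/2$ and $p(0,1) = p(1,0) = 0$. With the convention
+$0 \log 0 = 0$,
+
+\begin{align*}
+H(x_1) &= H(x_2) = \log 2, \\
+H(x_1, x_2) &= -2 \cdot \tfrac{1}{2} \log \tfrac{1}{2} = \log 2, \\
+I &= H(x_1) + H(x_2) - H(x_1, x_2) = \log 2.
+\end{align*}
+
+At the other extreme, if the two variables are independent, then $H(x_1, x_2) = H(x_1) + H(x_2)$ and I = 0,
+so for n = 2 the multi-information reduces to the ordinary mutual information.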
+
+5. Rectangle-Bias Complexity
+
+The multi-information leaves some degrees of freedom in the case n > 2. That is, we can define a
+set of distributions {p(x)|I = const.} with different parameters but the same I value. This fact can be
+clearly explained with the use of information geometry. From the Pythagorean relation, we obtain the
+following in the case of n = 3:
+
+\begin{align}
+D[\langle 111 \rangle : \langle 113 \rangle] + D[\langle 113 \rangle : \langle 123 \rangle]
+  &= D[\langle 111 \rangle : \langle 123 \rangle], \notag \\
+D[\langle 111 \rangle : \langle 121 \rangle] + D[\langle 121 \rangle : \langle 123 \rangle]
+  &= D[\langle 111 \rangle : \langle 123 \rangle], \tag{28} \\
+D[\langle 111 \rangle : \langle 122 \rangle] + D[\langle 122 \rangle : \langle 123 \rangle]
+  &= D[\langle 111 \rangle : \langle 123 \rangle]. \notag
+\end{align}
+
+Using these relations, we can schematically represent the decomposed systems on a circle diagram
+with diameter $\sqrt{I}$. This representation is based on the analogous algebra between the Pythagorean relation
+of the KL-divergence and that of Euclidean geometry, where the circumferential angle of a semi-circular
+arc is always $\pi/2$.
+Figure 1 represents two different cases with the same I value in the case n = 3. The distance between
+two systems in the same diagram corresponds to the square root of the KL-divergence between
+them. Clearly, the left and right figures represent different dependencies between nodes, although
+they both have the same I value. Such geometrical variation is possible because of the abundance of degrees of
+freedom in the dual coordinates compared to the given constraint. There exist 7 degrees of freedom in the η
+or θ coordinates for n = 3, while the only constraint is the invariance of the I value, which removes only 1
+degree of freedom. The remaining 6 degrees of freedom can then be deployed to produce geometrical
+variation in the circle diagram. Regarding system decomposition, in the left figure it is difficult to
+obtain decomposed systems without losing much information, while in the right figure there exists a
+relatively easy decomposition < 122 >, which loses less information than any decomposition in the left
+figure. We call this degree of ease of decomposition, with respect to the information lost, the system
+decompositionability. In this sense, the left system is more complex, although the two systems have the
+same I value. In particular, in the case D[< 111 >:< 122 >] = D[< 111 >:< 113 >] = D[< 111 >:< 121 >],
+the system does not have any easiest way of decomposition, and any isolation of a node loses a significant
+amount of information.
+To further incorporate such geometrical structure reflecting system decompositionability into a
+measure of complexity, we consider a mathematical way to distinguish between these two figures.
+Although the total sum of KL-divergences along a sequence of system decompositions is always
+identical to I by the Pythagorean relation, their product can vary according to the geometrical composition
+in the circle diagram. This is analogous to the isoperimetric inequality for rectangles, where the
+square gives the maximum area among rectangles of constant perimeter.
+We provisionally propose a new measure of complexity, namely the rectangle-bias
+complexity Cr, as follows:
+
+\[
+C_r = \frac{1}{|SD| - 2} \sum_{\langle \cdots \rangle \in SD}
+  D[\langle 11 \cdots 1 \rangle : \langle \cdots \rangle] \cdot D[\langle \cdots \rangle : \langle 12 \cdots n \rangle], \tag{29}
+\]
+
+where SD is the set of possible system decompositions in n binary variables, and |SD| is the number of
+elements of SD. For example, SD = {< 111 >, < 122 >, < 121 >, < 113 >, < 123 >} and |SD| = 5
+for n = 3. This measure distinguishes between the two systems in Figure 1, and gives a larger value for
+the left figure. It also gives the maximum value in the case D[< 111 >:< 122 >] = D[< 111 >:< 113 >] =
+D[< 111 >:< 121 >].
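+
+As an illustrative special case (a sketch that uses only the Pythagorean relation (28) and the definition of
+Cr), take n = 3. The terms for < 111 > and < 123 > vanish because the KL-divergence of a system from
+itself is 0, so with |SD| − 2 = 3 and the shorthand $D_d := D[\langle 111 \rangle : \langle d \rangle]$ (a notation introduced
+here for brevity),
+
+\[
+C_r = \frac{1}{3} \sum_{d \in \{122,\ 121,\ 113\}} D_d\, (I - D_d),
+\]
+
+where $D[\langle d \rangle : \langle 123 \rangle] = I - D_d$ follows from Equation (28). Each product satisfies
+$D_d (I - D_d) \le I^2/4$, with equality only at $D_d = I/2$, so each term is largest when the two legs of the
+corresponding right triangle in the circle diagram have equal length.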
+
+Figure 1. Circle diagrams of system decomposition in a 3-node network. Both systems have the same value of
+multi-information I, which is expressed as the identical diameter length of the circles. Two variations are
+shown, where the left system is more complex (Cr high) in the sense that any system decomposition requires
+losing more information than the easiest one (< 122 >) in the right figure (Cr low).
+
+6. Complementarity between Complexities Defined with Arithmetic and Geometric Means
+
+We evaluate the possibilities and limits of the rectangle-bias complexity Cr in comparison with other
+proposed measures of complexity.
+Interest in measuring network heterogeneity has developed toward the incorporation
+of multi-scale characteristics into complexity measures. The TSE complexity is motivated by the
+structure of the functional differentiation of brain activity, and measures the difference of neural
+integration between all sizes of subsystems and the whole system [17]. The biologically motivated TSE
+complexity has also been investigated from a theoretical point of view, to further attribute desirable properties
+as a universal complexity measure independent of system size [20]. The hierarchical structure of
+the exponential family in information geometry also leads to the order-wise description of statistical
+association, which can be regarded as a multi-scale complexity measure [14]. The relation between
+the order-wise dependencies and the TSE complexity has been theoretically investigated to establish the
+order-wise component correspondence between them [15].
+These indices of network heterogeneity, however, all depend on the arithmetic mean of the
+component-wise information-theoretical measure. We show that these arithmetic means still fail to
+measure certain modularity based on the statistical independence between subsystems.
+Figure 2 presents simplified cases that complexity measures with arithmetic means fail to
+distinguish. We consider two systems with different heterogeneity but identical multi-information
+I. Here, the multi-information cannot reflect the network heterogeneity. The TSE complexity and its
+information-geometrical correspondence in [15] are sensitive to the network heterogeneity,
+but since the arithmetic mean is taken over all subsystems, they do not distinguish the component-wise
+break of symmetry between different scales. The TSE complexity renormalized with respect to the
+multi-information I still has the same insensitivity. Even when the information of each
+subsystem scale is incorporated, the arithmetic mean can balance out the scale-wise variations, and a large
+range of heterogeneity at different scales can realize the same value of these complexities. For
+applications in neuroscience, the assumption of a model with simple parametric heterogeneity and the
+comparison of TSE complexity between different I values alleviate this limitation [17].
+
+Figure 2. Schematic examples of stochastic systems with identical multi-information I that complexity
+measures with arithmetic mean fail to distinguish. (a): Example 1 of a stochastic system with a
+homogeneous mean of edge information and symmetric fluctuation of its heterogeneity; (b):
+Example 2 of a heterogeneous stochastic system with a bimodal edge information distribution and
+the same multi-information I and arithmetic-mean-based complexity as Example 1; (c): schematic
+representation of the distribution of statistical association (edge information) in the upper network; (d):
+schematic representation of the distribution of statistical association (edge information) in the upper
+network.
+
+In contrast to complexities with the arithmetic mean, the rectangle-bias complexity Cr is related to
+the geometric mean. Cr can distinguish the two systems in Figure 2, giving a relatively high Cr
+value to the left system and a low value to the right one.
+This does not mean, however, that Cr has a finer resolution than other complexity
+measures. The constant conditions of complexity measures are constraints on the $\sum_{k=1}^{n} {}_nC_k$ degrees of
+freedom in the model parameter space, which define different geometrical compositions of the corresponding
+submanifolds. We basically need $\sum_{k=1}^{n} {}_nC_k$ independent measures to assure the real-value resolution
+of network feature characterization. Complexities with arithmetic and geometric means just give
+complementary information on network heterogeneity, or different constant-complexity submanifold
+structures in the statistical manifold, as depicted in Figure 3. Therefore, it is also possible to construct a
+class of systems that have identical I and Cr values but different TSE complexity. Complexity measures
+should be utilized in combination, with respect to the non-linear separation capacity of the network
+features of interest.
+
+Figure 3. Schematic representation of complementarity between complexity measures based on the arithmetic
+mean (Ca) and the geometric mean (Cg) of informational distance. Examples of the (n − 1)-dimensional
+constant-complexity submanifolds with respect to the Ca = const. and Cg = const. conditions are depicted
+as yellow and orange surfaces, respectively. The dimension of the whole statistical manifold S is the
+parameter number n.
+
+7. Cuboid-Bias Complexity with Respect to System Decompositionability
+
+We consider the expansion of Cr to general system size n. The n ≥ 4 situation differs from
+n ≤ 3 in the existence of a hierarchical structure between system decompositions.
+Figure 4 shows the hierarchy of the system decompositions in the case n = 4. Such hierarchical
+structure between system decompositions is not homogeneous with respect to the number of
+subsystems, and depends on the isomorphic types of the decomposed systems. This fact produces a certain
+difference of meaning in complexity between the KL-divergences when considering the system
+decompositionability.
+
+Figure 4. Hierarchy of system decomposition for a 4-node network (n = 4). Possible sequences of Seq =
+{SD1(is) → SD2(is) → SD3(is) → SD4(is)|1 ≤ is ≤ |Seq| = 18} are connected with the lines.
+
+A simple example in a 4-node network is shown in Figure 5.
+
+Figure 5. Hierarchical effect of sequential system decomposition on the cuboid volume and rectangle surface
+in the circle graph. We consider increasing the diameter of the green circle from the dotted to the dashed one
+without changing those of the red and blue circles, which has different effects on the change of
+D[< 1222 >:< 1233 >] and D[< 1133 >:< 1134 >] according to the hierarchical structure of the
+decomposition sequences.
+
+We consider the modification of two KL-divergences in the figure, D[< 1111 >:< 1222 >] and
+D[< 1111 >:< 1133 >], from the diameter of the green dotted circle to the dashed one.
+The joint distribution P(x1, x2, x3, x4) of 4 binary variables
+(x1, x2, x3, x4) (x1, x2, x3, x4 ∈ {0, 1}) has $2^4 - 1 = 15$ parameters, which define the dual-flat
+coordinates of the statistical manifold in information geometry.
+On the other hand, the possible system decompositions for n = 4 are as follows:
+
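+\begin{align}
+SD := \{ &\langle 1111 \rangle, \langle 1114 \rangle, \langle 1131 \rangle, \langle 1211 \rangle, \langle 1222 \rangle, \notag \\
+         &\langle 1133 \rangle, \langle 1212 \rangle, \langle 1221 \rangle, \langle 1134 \rangle, \langle 1214 \rangle, \notag \\
+         &\langle 1231 \rangle, \langle 1224 \rangle, \langle 1232 \rangle, \langle 1233 \rangle, \langle 1234 \rangle \}. \tag{31}
+\end{align}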
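+
+As a check on this list (a worked count added for illustration), the system decompositions are exactly the
+partitions of the 4 nodes into non-empty subsystems, so they can be counted by the number of subsystems:
+
+\begin{align*}
+\text{1 subsystem:}\quad & 1 \quad (\langle 1111 \rangle), \\
+\text{2 subsystems:}\quad & 4 + 3 = 7 \quad (\text{four splits of type } 1{+}3,\ \text{three of type } 2{+}2), \\
+\text{3 subsystems:}\quad & {}_4C_2 = 6, \\
+\text{4 subsystems:}\quad & 1 \quad (\langle 1234 \rangle), \\
+|SD| &= 1 + 7 + 6 + 1 = 15.
+\end{align*}
+
+The same count for n = 3 gives 1 + 3 + 1 = 5, in agreement with |SD| = 5 used for the rectangle-bias
+complexity.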
+
+Since the number of possible system decompositions is 15, and each is associated with the
+modification of a different set of P(x1, x2, x3, x4) parameters, the system decompositions and the
+KL-divergences between them can be defined independently. This also holds even under the constant
+condition of the I value or of other complexity measures, except those imposing dependency between
+system decompositions.
+This means that we can independently modify the diameter of the green dotted circle in Figure 5,
+without changing the diameters of the red and blue circles, which define the system decompositions
+< 1233 > and < 1134 > in the sub-hierarchy of < 1222 > and < 1133 >, respectively. Other
+KL-divergences can also be maintained at given constant values for the same reason.
+The rectangle-bias complexity Cr increases its value with such modification, but does not reflect
+the heterogeneity of the KL-divergences according to the hierarchy of system decompositions. If we
+consider the system decompositionability as the mean ease of decomposing the given system into
+its finest components with respect to “all” possible system decompositions, such hierarchical
+difference also has a meaning in the definition of complexity.
+The effect of modifying the diameter of the green dotted circle is different between the
+decomposition sequences < 1111 >→< 1222 >→< 1233 >→< 1234 > and < 1111 >→<
+1133 >→< 1134 >→< 1234 >. The decrease of the KL-divergence D[< 1222 >:< 1233 >]
+is less than D[< 1133 >:< 1134 >] since the diameter of the red dotted circle is larger than the
+blue one in Figure 5. This means that the effect of changing the same amount of KL-divergences
+in D[< 1111 >:< 1222 >] and D[< 1111 >:< 1133 >] produces larger effect on the sequence
+< 1111 >→< 1133 >→< 1134 >→< 1234 > than < 1111 >→< 1222 >→< 1233 >→< 1234 >, if
+compared at the sequence level. The rectangle-biased complexity Cr does not reflect such characteristics
+since it does not take into account the hierarchical structure among the diameters of the green, red
+and blue dotted circles.
+To incorporate such hierarchical effect in a complexity measure with geometric mean, we have
+the natural expansion of the rectangle-biased complexity Cr as the cuboid-bias complexity Cc, which is
+defined as follows:
+
+\[
+C_c := \frac{1}{|\mathrm{Seq}|} \sum_{i_s=1}^{|\mathrm{Seq}|} \prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)], \tag{32}
+\]
+
+where Seq represents the possible sequences of hierarchical system decompositions as follows:
+
+\[
+\mathrm{Seq} = \{ SD_1(i_s) \to SD_2(i_s) \to \cdots \to SD_i(i_s) \to \cdots \to SD_n(i_s) \mid 1 \le i_s \le |\mathrm{Seq}| \}. \tag{33}
+\]
+
+The elements SDi(is) of Seq correspond to system decompositions, which are aligned according to
+the hierarchy by the following algorithmic procedure (based on [15]):
+
+(1) Initialization: Set the initial system decomposition of every sequence in Seq to the whole
+system SD1(is) := < 111 · · · 1 > (1 ≤ is ≤ |Seq|).
+(2) Step i → i + 1: If the system decomposition is the total system decomposition (SDi(is) = <
+123 · · · n >), then stop. Otherwise, choose a non-decomposed subsystem SSi(is) of the system
+decomposition SDi(is), and further divide it into two independent subsystems $SS^1_i(i_s)$ and
+$SS^2_i(i_s)$, chosen differently for each is. SDi+1(is) is then defined as the system decomposition of the total system
+that separates the subsystems $SS^1_i(i_s)$ and $SS^2_i(i_s)$ as independent, in addition to the previous
+decomposition SDi(is).
+(3) Go to the next step i + 1 → i + 2.
+
+The value of |Seq| corresponds to the number of different sequences generated by this algorithm. For
+example, |Seq| = 3 and |Seq| = 18 holds for n = 3 and n = 4, respectively. The general analytical form
+|Seq|n of |Seq| with system size n is obtained as the following recurrence formula:
+
+\[
+|\mathrm{Seq}|_n = \sum_{i=1}^{\lfloor n/2 \rfloor} {}_n C_i \, |\mathrm{Seq}|_{n-i} \, |\mathrm{Seq}|_i, \tag{34}
+\]
+
+where ⌊·⌋ is the floor function, with the formal definition |Seq|1 := 1.
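+
+The values |Seq| = 3 and |Seq| = 18 quoted above can be checked by direct enumeration of the procedure. The sketch below does not evaluate the recurrence; it counts every way of refining the whole system into single nodes by splitting one subsystem into two at each step. The function names `bipartitions`, `count_sequences` and `seq_size` are hypothetical.
+
```python
from functools import lru_cache
from itertools import combinations


def bipartitions(block):
    """Yield each unordered split of the tuple `block` into two non-empty parts."""
    first, rest = block[0], block[1:]
    # Keeping `first` in the left part generates every split exactly once.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = (first,) + extra
            right = tuple(x for x in rest if x not in extra)
            yield left, right


@lru_cache(maxsize=None)
def count_sequences(partition):
    """Count the sequences that refine `partition` to singletons, one split per step."""
    splittable = [b for b in partition if len(b) > 1]
    if not splittable:
        return 1
    total = 0
    for block in splittable:
        others = tuple(b for b in partition if b != block)
        for left, right in bipartitions(block):
            total += count_sequences(tuple(sorted(others + (left, right))))
    return total


def seq_size(n):
    """|Seq| for a system of n nodes, starting from the undecomposed system."""
    return count_sequences((tuple(range(1, n + 1)),))


print(seq_size(3), seq_size(4))  # 3 18
```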
+The products of KL-divergences along the hierarchical sequences of system decompositions
+in Equation (32) are related to the volumes of (n − 1)-dimensional cuboids in the circle diagram. An
+example for n = 4 is presented in Figure 5, where two cuboids with 3 orthogonal edges are depicted for the
+different decomposition sequences < 1111 >→< 1222 >→< 1233 >→< 1234 > and < 1111 >→<
+1133 >→< 1134 >→< 1234 >, whose cuboid volumes are
+
+\[
+\sqrt{D[\langle 1111 \rangle : \langle 1222 \rangle] \, D[\langle 1222 \rangle : \langle 1233 \rangle] \, D[\langle 1233 \rangle : \langle 1234 \rangle]}, \tag{35}
+\]
+
+and
+
+\[
+\sqrt{D[\langle 1111 \rangle : \langle 1133 \rangle] \, D[\langle 1133 \rangle : \langle 1134 \rangle] \, D[\langle 1134 \rangle : \langle 1234 \rangle]}, \tag{36}
+\]
+
+respectively.
+
+In the same way as for Cr, the definition of Cc takes the arithmetic average of cuboid volumes
+so as to renormalize the combinatorial increase of the decomposition paths (|Seq|) with the
+system size n.
+Note, on the other hand, that we did not renormalize the rectangle-bias complexity Cr and the
+cuboid-bias complexity Cc by taking the exact geometric mean of each product of KL-divergences,
+such as $\sqrt[n-1]{\prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)]}$. This is for further accessibility to theoretical analysis such as
+the variational method (see the “Further Consideration” section), and does not change the qualitative behavior
+of Cr and Cc since the power root is a monotonically increasing function. This treatment can be
+interpreted as taking the (n − 1)-th power of the geometric means for the hierarchical sequences of
+KL-divergences.
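+
+The definition of Cc can be illustrated numerically for a small system. The sketch below assumes that each decomposed distribution is the product of the marginals of its subsystems; under that assumption the KL-divergence of one split equals H(left) + H(right) − H(whole), the mutual information between the two parts. The toy distribution `chain_distribution` and all function names are illustrative choices, not taken from the paper.
+
```python
import math
from itertools import combinations, product


def marginal_entropy(p, block):
    """Shannon entropy (nats) of the marginal of joint table `p` on the variables in `block`."""
    marg = {}
    for state, prob in p.items():
        key = tuple(state[i] for i in block)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * math.log(q) for q in marg.values() if q > 0)


def split_divergence(p, left, right):
    """D[SD_i : SD_{i+1}] when one block is split into `left` and `right` (assumed form)."""
    whole = tuple(sorted(left + right))
    return marginal_entropy(p, left) + marginal_entropy(p, right) - marginal_entropy(p, whole)


def bipartitions(block):
    """Yield each unordered split of `block` into two non-empty parts."""
    first, rest = block[0], block[1:]
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            yield (first,) + extra, tuple(x for x in rest if x not in extra)


def path_products(p, partition):
    """Yield the product of split divergences for every sequence refining `partition`."""
    splittable = [b for b in partition if len(b) > 1]
    if not splittable:
        yield 1.0
        return
    for block in splittable:
        others = tuple(b for b in partition if b != block)
        for left, right in bipartitions(block):
            d = split_divergence(p, left, right)
            for tail in path_products(p, others + (left, right)):
                yield d * tail


def cuboid_bias_complexity(p, n):
    """Arithmetic mean of the path products, returned with the number of paths |Seq|."""
    products = list(path_products(p, (tuple(range(n)),)))
    return sum(products) / len(products), len(products)


def chain_distribution(flip12=0.1, flip23=0.2):
    """Toy joint table: x1 is uniform; x2 copies x1 and x3 copies x2, each with a flip probability."""
    p = {}
    for x1, x2, x3 in product((0, 1), repeat=3):
        p12 = flip12 if x2 != x1 else 1 - flip12
        p23 = flip23 if x3 != x2 else 1 - flip23
        p[(x1, x2, x3)] = 0.5 * p12 * p23
    return p


p = chain_distribution()
c_c, n_paths = cuboid_bias_complexity(p, 3)
print(n_paths, c_c)
```
+
+For a distribution with fully independent variables every split divergence is zero, so the sketch returns a value of Cc that vanishes up to rounding.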
+A more comprehensive example of the utility of the cuboid-bias complexity Cc compared to
+the rectangle-biased one Cr is shown in Figure 6. We consider 6-node networks (n = 6) with the
+same I and Cr values but different heterogeneity. The system in the top left figure has a circularly
+connected structure with medium intensity, while that of the top right figure has 3 strongly connected
+subsystems. These systems have five qualitatively different ways of system decomposition that are
+the basic generators of all hierarchical sequences Seq = {SD1(is) → · · · → SD5(is)|1 ≤ is ≤ |Seq|} for
+these networks. The five basic system decompositions are labeled 1⃝, 2⃝, 2⃝′, 3⃝ and
+4⃝ in the top figures.
+The circle diagrams of these systems are depicted in the middle figures. To impose the same
+constant value of Cr on both systems, the following condition is satisfied in the middle right figure:
+D[< 111111 >: 2⃝] < D[< 111111 >: 1⃝ in the middle left figure] < D[< 111111 >: 1⃝] < D[<
+111111 >: 2⃝ in the middle left figure] < D[< 111111 >: 3⃝] < D[< 111111 >: 4⃝]. Furthermore, the
+total surface of the right triangles sharing the circle diameter as hypotenuse is conditioned to be
+identical in the middle left and the middle right figures, so the rectangle-bias complexity Cr fails to
+distinguish them.
+On the other hand, under the same condition, the cuboid-bias complexity Cc distinguishes between
+these two systems and gives a higher value to the left one. The volumes of the 5-dimensional cuboids of the
+decomposition sequence $\langle 111111 \rangle \xrightarrow{1⃝\,2⃝\,2⃝′\,3⃝\,4⃝} \langle 123456 \rangle$ are schematically shown in the bottom
+figures, maintaining the quantitative difference between KL-divergences. Since the multi-information
+I is identical between the two systems, so is the value of the KL-divergence D[< 111111 >:< 123456 >],
+which is the sum of the KL-divergences along the sequence $\langle 111111 \rangle \xrightarrow{1⃝\,2⃝\,2⃝′\,3⃝\,4⃝} \langle 123456 \rangle$
+by the Pythagorean theorem. This means that the inequality between the cuboid volumes can be
+represented as the isoperimetric inequality of a high-dimensional cuboid. As a consequence, the left
+system has a quantitatively higher value of Cc than the right one. The cuboid-bias complexity Cc is thus
+sensitive to such heterogeneity.
+
+
+Figure 6. Meaning of taking geometric mean over the sequence of system decomposition in cuboid-bias complexity
+Cc. (a): Example of 6-node network with circularly connected structure with medium intensity. Edge
+width is proportional to edge information; (b): Example of 6-node network with strongly connected
+3 subsystems. Edge width is proportional to edge information. The multi-information I of the two
+systems in Top figures are conditioned to be identical; The dotted lines schematically represent possible
+system decompositions. (c,d): Circle diagrams of each system decomposition in upper networks; The
+total surface of right triangles sharing the circle diameter as hypotenuse in (c) and (d) are conditioned to
+be identical, therefore the rectangle-bias complexity Cr fails to distinguish. (e,f): 5-dimensional cuboids
+of upper networks (Figure 6a,b) whose edges are the root of KL-divergences for the strain of system
+decomposition $\langle 111111 \rangle \xrightarrow{1⃝\,2⃝\,2⃝′\,3⃝\,4⃝} \langle 123456 \rangle$. Only the first 3-dimensional part is shown with
+solid line, and the remaining 2-dimensional part is represented with dotted line. The volume of
+cuboid in (e) is larger than the one in (f), according to the isoperimetric inequality of high-dimensional
+cuboid. The total squared length of each side is identical between two cuboids, which represents
+multi-information I = D[< 111111 >:< 123456 >].
+
+8. Regularized Cuboid-Bias Complexity with Respect to Generalized Mutual Information
+
+We further consider the geometrical composition of system decompositions in the circle diagram
+and argue for the necessity of renormalizing the cuboid-bias complexity Cc with the multi-information I,
+which gives another measure of complexity, namely the “regularized cuboid-bias complexity $C^R_c$.”
+We consider the situation in actual data where the multi-information I varies. Figure 7 shows
+the n = 3 cases where the Cc fails to distinguish. Both the blue and red systems are supposed to have
+the same Cc value by adjusting the red system to have relatively smaller values of KL-divergences
+D[< 111 >:< 122 >] and D[< 113 >:< 123 >] than the blue one. Such conditioning is possible since
+the KL-divergences are mutually independent parameters.
+
+Figure 7. Examples of the 3-node systems with identical cuboid-bias complexity Cc but different
+multi-information I on circle graph. (a): System with smaller I but larger $C^R_c$; (b): System with larger I but
+smaller $C^R_c$; (c): Superposition of the above two systems. The regularized cuboid-bias complexity $C^R_c$
+distinguishes between the blue and red systems.
+
+Although the Cc value is identical, the two systems have different geometrical composition of
+system decompositions in the circle diagram. The red system has relatively easier way of decomposition
+< 111 >→< 122 > if renormalized with the total system decomposition < 111 >→< 123 >. This
+relative decompositionability with respect to the renormalization with the multi-information I can
+be clearly understood by superimposing the circle diagram of the two systems and comparing the
+angles between each and total decomposition paths (bottom figure). The red system has larger angle
+between the decomposition paths < 111 >→< 122 > and < 111 >→< 123 > than any others in the
+blue system, which represents the relative facility of the decomposition under renormalization with I.
+In this term, the paths < 111 >→< 121 > in the red and blue system do not change its relative facility,
+and the paths < 111 >→< 113 > are easier in the blue system.
+To express the system decompositionability based on these geometrical compositions in a
+comprehensive manner, we define the regularized cuboid-bias complexity $C^R_c$ as follows:
+
+\[
+C^R_c := \frac{1}{|\mathrm{Seq}|} \sum_{i_s=1}^{|\mathrm{Seq}|} \prod_{i=1}^{n-1} \frac{D[SD_i(i_s) : SD_{i+1}(i_s)]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]}
+= \frac{C_c}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}}
+= \frac{C_c}{I^{n-1}}. \tag{37}
+\]
+
+The red system then has a quantitatively smaller $C^R_c$ value than the blue system in Figure 7.
+
+9. Modular Complexity with Respect to the Easiest System Decomposition Path
+
+We have so far considered the system decompositionability with respect to all possible
+decomposition sequences. This was also a way to avoid the local fluctuation of the network
+heterogeneity being reflected in some specific decomposition paths. On the other hand, the easiest
+decomposition is particularly important when considering the modularity of the system. If there exists
+a hierarchical structure of modularity at different scales with different coherence of the system, the
+KL-divergence and the sequence of the easiest decomposition give much information.
+
+Figure 8 schematically shows a typical example where there exist two levels of modularity. Such
+structure with different scales of statistical coherence appears as functional segregation in neural
+systems [17], and is expected to be observed widely in complex systems.
+The hierarchical topology of the easiest decomposition path reflects these structures. For
+example, in the system of Figure 8, the decompositions between < 1 1 · · · 1 > and <
+1 1 1 1 5 5 5 5 9 9 9 9 13 13 13 13 > are easier than those inside of the 4-node subsystems. The values of
+KL-divergence also reflect the hierarchy, giving relatively low values for the decomposition between
+the 4-node subsystems, and high values inside of them. By examining the shortest decomposition
+path and associated KL-divergences in possible Seq, one can project the hierarchical structure of the
+modularity existing in the system.
+
+Figure 8. Example of 16-node system < 11 · · · 1 > that has different levels of modularity. The four 4-node
+subsystems < 1111 > (blue blocks) are loosely connected and easy to be decomposed, while inside each
+component (red blocks) is tightly connected. The degree of connection represents statistical dependency
+or edge information between subsystems. Such hierarchical structure can be detected by observing the
+decomposition path of the modular complexity Cm.
+
+For this reason, we define the modular complexity Cm as follows, which is the shortest path
+component of the cuboid-bias complexity Cc:
+
+\[
+C_m := \prod_{i=1}^{n-1} D[SD_i(i_{\min}) : SD_{i+1}(i_{\min})], \tag{38}
+\]
+
+where the index imin of the sequence SD1(imin) → SD2(imin) → · · · → SDn(imin) is chosen as follows:
+
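+\[
+i_{\min} = \{i_1\} \cap \{i_2\} \cap \cdots \cap \{i_{n-1}\}, \tag{39}
+\]
+
+where
+
+\[
+\begin{aligned}
+\{i_1\} &= \operatorname*{argmin}_{i_s} \{ D[SD_1(i_s) : SD_2(i_s)] \mid 1 \le i_s \le |\mathrm{Seq}| \}, \\
+\{i_2\} &= \operatorname*{argmin}_{i_1} \{ D[SD_2(i_1) : SD_3(i_1)] \mid i_1 \in \{i_1\} \}, \\
+&\;\;\vdots \\
+\{i_{n-1}\} &= \operatorname*{argmin}_{i_{n-2}} \{ D[SD_{n-1}(i_{n-2}) : SD_n(i_{n-2})] \mid i_{n-2} \in \{i_{n-2}\} \},
+\end{aligned} \tag{40}
+\]
+
+which gives eventually
+
+\[
+i_{\min} = i_{n-1}. \tag{41}
+\]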
+
+This means that beginning from the undecomposed state < 11 · · · 1 >, we continue to choose
+the shortest decomposition path in the next hierarchy of system decomposition. The minimization of
+the path length is guaranteed by the sequential minimization since the geometric mean of isometric
+path division is bounded below by its minimum component. imin is unique if the system is completely
+heterogeneous (i.e., D[SD1(ik) : SD2(ik)] ≠ D[SD1(il) : SD2(il)], 1 ≤ ik < il ≤ |Seq|); otherwise several
+decomposition paths that give the same Cm value are possible according to the homogeneity of the
+system. Besides its value, the modular complexity Cm should be utilized with the sequence information
+of the shortest decomposition path to evaluate the modularity structure of a system.
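+
+The sequential minimization that selects the shortest decomposition path can be sketched as a greedy loop. As an assumption of this sketch, the KL-divergence of a split is computed as H(left) + H(right) − H(whole), which holds when decomposed distributions are products of subsystem marginals; ties are broken by taking the first minimal candidate. The toy distribution `two_module_chain` (two tightly coupled pairs that are loosely coupled to each other) and the function names are hypothetical.
+
```python
import math
from itertools import combinations, product


def entropy(p, block):
    """Shannon entropy (nats) of the marginal of joint table `p` on `block`."""
    marg = {}
    for state, prob in p.items():
        key = tuple(state[i] for i in block)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * math.log(q) for q in marg.values() if q > 0)


def bipartitions(block):
    """Yield each unordered split of `block` into two non-empty parts."""
    first, rest = block[0], block[1:]
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            yield (first,) + extra, tuple(x for x in rest if x not in extra)


def greedy_path(p, n):
    """Follow the smallest split divergence at every step; return (C_m, path)."""
    partition = [tuple(range(n))]
    c_m, path = 1.0, []
    while any(len(b) > 1 for b in partition):
        candidates = []
        for block in partition:
            if len(block) > 1:
                for left, right in bipartitions(block):
                    d = entropy(p, left) + entropy(p, right) - entropy(p, block)
                    candidates.append((d, block, left, right))
        # Easiest decomposition at this level of the hierarchy.
        d, block, left, right = min(candidates, key=lambda c: c[0])
        partition.remove(block)
        partition += [left, right]
        c_m *= d
        path.append((d, left, right))
    return c_m, path


def two_module_chain(strong=0.05, weak=0.4):
    """Toy chain x0 - x1 - x2 - x3 with flip probabilities strong, weak, strong."""
    p = {}
    for state in product((0, 1), repeat=4):
        prob = 0.5
        for (a, b), flip in zip(zip(state, state[1:]), (strong, weak, strong)):
            prob *= flip if a != b else 1 - flip
        p[state] = prob
    return p


p = two_module_chain()
c_m, path = greedy_path(p, 4)
for d, left, right in path:
    print(left, right, d)
print(c_m)
```
+
+In this toy system the first split separates the two weakly coupled pairs, and the list of splits in `path` carries the sequence information that accompanies the value of Cm.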
+Cases where Cm is identical but Cc is different can be composed by varying the system
+decompositions other than those in the shortest path SD1(imin) → SD2(imin) → · · · → SDn(imin) without
+modifying the index imin. There also exist inverse examples with identical Cc and different Cm, due to
+the complementarity between Cm and Cc.
+We finally define the regularized modular complexity $C^R_m$ as follows, for the same reason as defining
+$C^R_c$ from Cc:
+
+\[
+C^R_m := \prod_{i=1}^{n-1} \frac{D[SD_i(i_{\min}) : SD_{i+1}(i_{\min})]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]}
+= \frac{C_m}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}}
+= \frac{C_m}{I^{n-1}}. \tag{42}
+\]
+
+Proposition 4. The cuboid-bias complexities Cc and $C^R_c$ are bounded by the modular complexities Cm and $C^R_m$,
+respectively:
+
+\[
+C_c \le C_m, \tag{43}
+\]
+
+\[
+C^R_c \le C^R_m. \tag{44}
+\]
+
+And they coincide at the maximum values under the given multi-information I:
+
+\[
+\max\{C_m \mid I = \mathrm{const.}\} = \max\{C_c \mid I = \mathrm{const.}\}, \tag{45}
+\]
+
+\[
+\max\{C^R_m\} = \max\{C^R_c\}. \tag{46}
+\]
+
+These relations (43)–(46) are numerically shown in the “Numerical Comparison” section.
+The superiority of the modular complexities is due to the hierarchical dependency of
+the KL-divergence values in decomposition paths. In the shortest decomposition path defining the modular
+complexities, the easier system decompositions relatively increase their values since they incorporate a larger
+number of edge cuttings. Since we eventually cut all edges to obtain < 12 · · · n > at the end of the
+decomposition sequence, collecting the edges with relatively weak edge information and cutting them
+together augments the value of the product of KL-divergences. The modular complexities are then
+the maximum-value components among the possible decomposition paths calculated in the cuboid-bias
+complexities:
+
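+\[
+C_m = \max \left\{ \prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)] \;\middle|\; 1 \le i_s \le |\mathrm{Seq}| \right\}, \tag{47}
+\]
+
+\[
+C^R_m = \max \left\{ \frac{\prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}} \;\middle|\; 1 \le i_s \le |\mathrm{Seq}| \right\}. \tag{48}
+\]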
+
+The difference between the cuboid-bias complexities and the modular complexities is an index of
+the geometrical variation of decomposed systems in the circle graph, which reflects the fluctuation of
+the sequence-wise system decompositionability. If the variation of the system decompositionability for
+each system decomposition is large, accordingly the modular complexities tend to give higher values
+than the cuboid-bias complexities.
+
+10. Numerical Comparison
+
+We numerically investigate the complementarity between the proposed complexities, Cc, $C^R_c$, Cm,
+and $C^R_m$. Since the minimum node number giving non-trivial meaning to these measures is n = 4, the
+corresponding dimension of the parameter space is $\sum_{k=1}^{n} {}_n C_k = 15$. The constant-complexity submanifolds
+are therefore difficult to visualize due to the high dimensionality. For simplicity, we focus on the
+2-dimensional subspace of this parameter space whose first axis ranges from random to maximum
+dependencies of the system, and whose second axis represents the system decompositionability of
+< 1133 >.
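+For this purpose, we introduce the following parameters α and β (0 ≤ α, β ≤ 1) in the η-coordinates
+of the discrete distribution with 4-dimensional binary stochastic variable:
+
+\[
+\begin{aligned}
+\eta_1 &= \eta_0, \\
+\eta_2 &= \eta_0, \\
+\eta_3 &= \eta_0, \\
+\eta_4 &= \eta_0, \\
+\eta_{1,2} &= \eta_1 \eta_2 + \alpha (\eta_0 - \epsilon - \eta_1 \eta_2), \\
+\eta_{3,4} &= \eta_3 \eta_4 + \alpha (\eta_0 - \epsilon - \eta_3 \eta_4), \\
+\eta_{1,3} &= \eta_1 \eta_3 + \alpha\beta (\eta_0 - \epsilon - \eta_1 \eta_3), \\
+\eta_{1,4} &= \eta_1 \eta_4 + \alpha\beta (\eta_0 - \epsilon - \eta_1 \eta_4), \\
+\eta_{2,3} &= \eta_2 \eta_3 + \alpha\beta (\eta_0 - \epsilon - \eta_2 \eta_3), \\
+\eta_{2,4} &= \eta_2 \eta_4 + \alpha\beta (\eta_0 - \epsilon - \eta_2 \eta_4), \\
+\eta_{1,2,3} &= \eta_{1,2} \eta_3 + \alpha\beta (\eta_0 - 2\epsilon - \eta_{1,2} \eta_3), \\
+\eta_{1,2,4} &= \eta_{1,2} \eta_4 + \alpha\beta (\eta_0 - 2\epsilon - \eta_{1,2} \eta_4), \\
+\eta_{1,3,4} &= \eta_1 \eta_{3,4} + \alpha\beta (\eta_0 - 2\epsilon - \eta_1 \eta_{3,4}), \\
+\eta_{2,3,4} &= \eta_2 \eta_{3,4} + \alpha\beta (\eta_0 - 2\epsilon - \eta_2 \eta_{3,4}), \\
+\eta_{1,2,3,4} &= \eta_{1,2} \eta_{3,4} + \alpha\beta (\eta_0 - 3\epsilon - \eta_{1,2} \eta_{3,4}).
+\end{aligned} \tag{49}
+\]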
+
+Here, α represents the degree of statistical association from random (α = 0) to maximum (α = 1),
+and β controls the system decompositionability of < 1133 >. If β = 1, the system has the maximum
+KL-divergence D[< 1111 >:< 1133 >] under the constraint of the α parameter, and β = 0 gives D[<
+1111 >:< 1133 >] = 0.
+ϵ is the minimum value of the joint distribution of the 4-dimensional variable, which is defined to be
+more than 0 to avoid singularity in the dual-flat coordinates of the statistical manifold. ϵ = 1.0 × 10^−10
+and η0 = 0.5 were chosen for the calculation.
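+
+The α-β parameterization can be turned into an explicit joint distribution. The sketch below is a hedged illustration: it assumes that each η-coordinate is the probability that all variables in its subscript equal 1 (the usual expectation parameters for binary variables), and recovers p(x1, x2, x3, x4) by inclusion-exclusion. The function names `eta_coordinates` and `probabilities` are hypothetical.
+
```python
from itertools import combinations, product


def eta_coordinates(alpha, beta, eta0=0.5, eps=1.0e-10):
    """Eta-coordinates of the alpha-beta family, keyed by tuples of 1-based variable indices."""
    eta = {(): 1.0}
    for i in (1, 2, 3, 4):
        eta[(i,)] = eta0

    def mix(base, weight, k):
        # Common pattern of the family: base + weight * (eta0 - k * eps - base).
        return base + weight * (eta0 - k * eps - base)

    for pair in [(1, 2), (3, 4)]:
        eta[pair] = mix(eta[pair[:1]] * eta[pair[1:]], alpha, 1)
    for pair in [(1, 3), (1, 4), (2, 3), (2, 4)]:
        eta[pair] = mix(eta[pair[:1]] * eta[pair[1:]], alpha * beta, 1)
    eta[(1, 2, 3)] = mix(eta[(1, 2)] * eta[(3,)], alpha * beta, 2)
    eta[(1, 2, 4)] = mix(eta[(1, 2)] * eta[(4,)], alpha * beta, 2)
    eta[(1, 3, 4)] = mix(eta[(1,)] * eta[(3, 4)], alpha * beta, 2)
    eta[(2, 3, 4)] = mix(eta[(2,)] * eta[(3, 4)], alpha * beta, 2)
    eta[(1, 2, 3, 4)] = mix(eta[(1, 2)] * eta[(3, 4)], alpha * beta, 3)
    return eta


def probabilities(eta):
    """Joint table from eta by inclusion-exclusion (assumes eta_S = P(x_i = 1 for all i in S))."""
    p = {}
    for state in product((0, 1), repeat=4):
        ones = tuple(i + 1 for i, x in enumerate(state) if x == 1)
        zeros = [i + 1 for i, x in enumerate(state) if x == 0]
        total = 0.0
        for r in range(len(zeros) + 1):
            for extra in combinations(zeros, r):
                total += (-1) ** r * eta[tuple(sorted(ones + extra))]
        p[state] = total
    return p


p = probabilities(eta_coordinates(alpha=0.5, beta=0.5))
print(sum(p.values()))
```
+
+With β = 0 the resulting table factorizes into the marginals of (x1, x2) and (x3, x4), which matches the statement that β = 0 gives D[< 1111 >:< 1133 >] = 0.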
+
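+The system with maximum statistical association under given η0 corresponds to the α = β = 1
+condition in the given parameters, whose η-coordinates become as follows:
+
+\[
+\begin{aligned}
+\eta_1 &= \eta_0, \\
+&\;\;\vdots \\
+\eta_4 &= \eta_0, \\
+\eta_{1,2} &= \eta_0 - \epsilon, \\
+&\;\;\vdots \\
+\eta_{3,4} &= \eta_0 - \epsilon, \\
+\eta_{1,2,3} &= \eta_0 - 2\epsilon, \\
+&\;\;\vdots \\
+\eta_{2,3,4} &= \eta_0 - 2\epsilon, \\
+\eta_{1,2,3,4} &= \eta_0 - 3\epsilon.
+\end{aligned} \tag{50}
+\]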
+
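+On the other hand, the totally decomposed system corresponds to the α = 0 condition, and the
+η-coordinates are:
+
+\[
+\begin{aligned}
+\eta_1 &= \eta_0, \\
+&\;\;\vdots \\
+\eta_4 &= \eta_0, \\
+\eta_{1,2} &= \eta_0 \eta_0, \\
+&\;\;\vdots \\
+\eta_{3,4} &= \eta_0 \eta_0, \\
+\eta_{1,2,3} &= \eta_0 \eta_0 \eta_0, \\
+&\;\;\vdots \\
+\eta_{2,3,4} &= \eta_0 \eta_0 \eta_0, \\
+\eta_{1,2,3,4} &= \eta_0 \eta_0 \eta_0 \eta_0.
+\end{aligned} \tag{51}
+\]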
+
+Note that the completely deterministic case η0 = 1.0 and α = β = 1 gives I = 0.
+The intuitive meaning of these parameters α and β is also schematically depicted in Figure 9, bottom right.
+
+Figure 9. Contour plot of the complexity landscape of I, Cc, Cm, CRc , and CRm on α-β plane. (a): Contour plot
+superposition of Cc and Cm. (b): Contour plot superposition of CRc and CRm. (c): Contour plot of I.
+The color of contour plots corresponds to the color gradient of 3D plots in Figure 10; (d): Schematic
+representation of the system in different regions of α-β plane. Edge width represents the degree of edge
+information, and independence is depicted with dotted line.
+
+Figure 10 shows the landscape of the proposed complexities on the α-β plane. Their contour plots
+are depicted in Figure 9. The proposed complexities differ from one another almost everywhere
+on the α-β plane, except at the intersection lines. Therefore, these measures serve as independent
+features of the system, each with its specific meaning with respect to the system decompositionability.
+The α-β plane shows a section of the actual structure of the complementarity expressed in Figure 3
+between the proposed complexity measures.
+The relations between the cuboid-bias complexities and modular complexities in
+Equations (43)–(46) are also numerically confirmed. The modular complexities are greater than or equal to
+the corresponding cuboid-bias complexities, and coincide at the parameter α = β = 1 giving
+maximum values and dependencies in this parameterization.
+
+Figure 10. Landscape of complexities I, Cc, Cm, CRc , and CRm on α-β plane. (a): Multi-information I; (b):
+Cuboid-bias complexity Cc. (c): Modular complexity Cm;(d): Regularized cuboid-bias complexity
+CRc ; (e): Regularized modular complexity CRm. All complexity measures show the complementarity
+intersecting with each other, satisfying the boundary conditions vanishing at α = 0 and β = 0 except the
+multi-information I. Note that regularized complexities CRc and CRm show singularity of convergence at
+α → 0 due to the regularization of infinitesimal value.
+
+In the general case without the parameterization with α, β and η0, the boundary conditions of Cc, $C^R_c$,
+Cm and $C^R_m$ include those of the multi-information I, which vanishes at the completely random or ordered
+state. This is common to other complexity measures such as the LMC complexity, and fits the basic
+intuition on the concept of complexity situated equivalently far from the completely predictable and
+disordered states [21,22].
+The proposed complexities further incorporate boundary conditions that vanish with the existence
+of a completely independent subsystem of any size. This means that the Cc, $C^R_c$, Cm and $C^R_m$ of a system
+become 0 if we add another independent variable. This property does not reflect the intuition of
+complexity defined by the arithmetic average of statistical measures. The proposed complexities
+find their meaning better in comparison with other complexity measures such as the multi-information
+I, and by interactively changing the system scale to avoid trivial results caused by a small independent
+subsystem. For example, the proposed complexities could be utilized as information criteria
+for model selection problems, especially with an approximate modular structure based on the
+statistical independence of data between subsystems. We maintain that the complementarity principle
+between several complexity measures of different foundations is the key to understanding complexity
+in a comprehensive manner.
+To characterize the properties of Cc, $C^R_c$, Cm and $C^R_m$ in relation to the diverse composition of each
+system decomposition, it is useful to consider the geometry of their contour structure, as compared
+in Figure 9. The contour can be formalized as Cc, $C^R_c$, Cm, $C^R_m$ = const. for each complexity measure,
+and D[< 11 · · · 1 >: SDi(is)] = const. (1 ≤ i ≤ n − 1, 1 ≤ is ≤ |Seq|) for each system decomposition.
+For that purpose, analysis with algebraic geometry can be considered a promising tool. Algebraic
+geometry investigates the geometrical properties of polynomial equations [23]. The complexities Cc, $C^R_c$,
+Cm and $C^R_m$ can be interpreted as polynomial functions by taking each system decomposition as a new
+coordinate, and are therefore directly accessible to algebraic geometry. However, if we want to investigate the
+contour of the complexities on the p parameter space, the logarithmic function appears in the definition of the
+KL-divergence; it is a transcendental function and falls outside the analytical scope of algebraic
+geometry. To introduce compatibility between the p parameter space of information geometry and
+algebraic geometry, it suffices to describe the model by replacing the logarithmic functions with another n
+variables such as q = log p, and to reconsider the intersection between the result from algebraic geometry
+on the coordinates (p, q) and the q = log p condition. The contours of Cc, $C^R_c$, Cm and $C^R_m$ are also important
+in seeking the utility of these measures as a potential to interpret the dynamics of statistical association
+as geodesics.
+
+11. Further Consideration
+
+11.1. Pythagorean Relations in System Decomposition and Edge Cutting
+
+We further look back at the system decomposition and edge cutting in terms of the Pythagorean
+relation between KL-divergences, which is based on the orthogonality between θ and η coordinates.
+In system decomposition, the distribution of the decomposed system is analytically obtained from
+the product of the subsystems’ η coordinates, which is equivalent to setting all θdec parameters to 0 in the mixture
+coordinate ξdec. From the consistency of the θdec parameters in ξdec being 0 in all system decompositions,
+we have the Pythagorean relation according to the inclusion relation of system decompositions. For
+example, the following holds:
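+\[
+\begin{aligned}
+D[\langle 1111 \rangle : \langle 1234 \rangle] = {} & D[\langle 1111 \rangle : \langle 1222 \rangle] \\
+& + D[\langle 1222 \rangle : \langle 1233 \rangle] \\
+& + D[\langle 1233 \rangle : \langle 1234 \rangle].
+\end{aligned} \tag{52}
+\]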
+
+The proof proceeds in the same way as for the k-cut coordinates isolating k-tuple statistical association between
+variables [14].
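+
+The Pythagorean relation along the sequence < 1111 >→< 1222 >→< 1233 >→< 1234 > can be checked numerically. The sketch below builds each decomposed distribution explicitly as the product of the marginals of its subsystems, as described above, and compares the KL-divergence of the total decomposition with the sum over the intermediate steps. The random test distribution and the function names are illustrative choices.
+
```python
import math
import random
from itertools import product


def marginal(p, block):
    """Marginal table of `p` on the variables in `block` (0-based indices)."""
    marg = {}
    for state, prob in p.items():
        key = tuple(state[i] for i in block)
        marg[key] = marg.get(key, 0.0) + prob
    return marg


def decompose(p, partition):
    """Decomposed distribution: product of the block marginals of `p`."""
    margs = [(block, marginal(p, block)) for block in partition]
    q = {}
    for state in p:
        value = 1.0
        for block, marg in margs:
            value *= marg[tuple(state[i] for i in block)]
        q[state] = value
    return q


def kl(p, q):
    """KL-divergence D[p : q] in nats."""
    return sum(a * math.log(a / q[s]) for s, a in p.items() if a > 0)


# Strictly positive joint table on 4 binary variables (fixed seed for reproducibility).
rng = random.Random(0)
weights = {s: rng.random() + 0.1 for s in product((0, 1), repeat=4)}
norm = sum(weights.values())
p = {s: w / norm for s, w in weights.items()}

chain = [
    [(0, 1, 2, 3)],            # <1111>
    [(0,), (1, 2, 3)],         # <1222>
    [(0,), (1,), (2, 3)],      # <1233>
    [(0,), (1,), (2,), (3,)],  # <1234>
]
dists = [decompose(p, part) for part in chain]

lhs = kl(dists[0], dists[-1])
rhs = sum(kl(a, b) for a, b in zip(dists, dists[1:]))
print(lhs, rhs)
```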
+On the other hand, the edge cutting previously defined using the product of the remaining maximum
+cliques’ η coordinates does not coincide with the θec = 0 condition in mixture coordinates ξec. We
+have defined the ηec values of edge cutting based only on the orthogonal relation between η and θ
+coordinates, by generalizing the rule of system decomposition in ηec coordinates, and did not consider
+the Pythagorean relation between different edge cuttings.
+It is then possible to define another way of edge cutting using θec = 0 condition in ξec. Indeed,
+in k-cut mixture coordinates, θk+ = 0 condition is derived from the independent condition of the
+variables in all orders, and k-tuple statistical association is measured by reestablishing the η parameters
+for the statistical association up to k − 1-tuple order. In the same way, we can set θdec = 0 condition for
+ξdec of a system decomposition, and reestablish edges with respect to the η parameters, except the one
+in focus for edge cutting.
+As a simple example, consider the system decomposition < 1222 > and edge cutting 1 − 2 in
+4-node graph. We have the mixture coordinate ξdec for the system decomposition as follows:
+
+\[
+\begin{aligned}
+\xi^{dec}_{1,2} &= \theta^{dec}_{1,2} = 0, \\
+\xi^{dec}_{1,3} &= \theta^{dec}_{1,3} = 0, \\
+\xi^{dec}_{1,4} &= \theta^{dec}_{1,4} = 0, \\
+\xi^{dec}_{1,2,3} &= \theta^{dec}_{1,2,3} = 0, \\
+\xi^{dec}_{1,3,4} &= \theta^{dec}_{1,3,4} = 0, \\
+\xi^{dec}_{1,2,3,4} &= \theta^{dec}_{1,2,3,4} = 0,
+\end{aligned} \tag{53}
+\]
+
+where all the rest of the ξdec coordinates are equivalent to those of the η coordinates.
+We then consider the new way of edge cutting 1 − 2 by recovering the statistical association in
+edges 1 − 3 and 1 − 4 from system decomposition < 1222 >, orthogonally to that of edge 1 − 2. The
+new mixture coordinate ξEC changes to the following:
+
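+\[
+\begin{aligned}
+\xi^{EC}_{1,2} &= \theta^{EC}_{1,2} = 0, \\
+\xi^{EC}_{1,3} &= \eta_{1,3}, \\
+\xi^{EC}_{1,4} &= \eta_{1,4}, \\
+\xi^{EC}_{1,2,3} &= \theta^{EC}_{1,2,3} = 0, \\
+\xi^{EC}_{1,3,4} &= \eta_{1,3,4}, \\
+\xi^{EC}_{1,2,3,4} &= \theta^{EC}_{1,2,3,4} = 0,
+\end{aligned} \tag{54}
+\]
+
+and the rest is equivalent to that of the η coordinates.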
+This new ξEC is also compatible with k-cut coordinates formalization for its simple θEC = 0
+conditions. To obtain ξEC for arbitrary edge cutting i − j, one should take θEC containing i and j in
+its subscript, set them to 0, and combine with η coordinates for the rest of the subscript. For plural
+edge cuttings i − j, · · · , k − l (1 ≤ i, j, k, l ≤ n), it suffices to take θEC containing i and j, ... , k and l in
+its subscript respectively, then set them to 0.
+We finally obtain the Pythagorean relation between edge cuttings. Denoting the general edge
+cutting(s) coordinates as ξi−j,··· ,k−l, the following holds for the example of system decomposition
+< 1222 >:
+
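+\[
+\begin{aligned}
+D[\langle 1111 \rangle : \langle 1222 \rangle] = {} & D[\langle 1111 \rangle : p(\xi_{1-2})] \\
+& + D[p(\xi_{1-2}) : p(\xi_{1-2,1-3})] \\
+& + D[p(\xi_{1-2,1-3}) : p(\xi_{1-2,1-3,1-4})].
+\end{aligned} \tag{55}
+\]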
+
+Despite the consistency with the dual structure between θ and η, we do not generally have
+an analytical solution to determine ηEC values from the θEC = 0 conditions. We have to rely on some
+numerical algorithm to solve the θEC = 0 conditions with respect to ηEC values, which are in general
+high-degree simultaneous polynomials. Furthermore, the numerical convergence of the solution has to be
+very strict, since a tiny deviation from the conditions can become non-negligible after passing through the fractional
+function and logarithmic function of the θ coordinates.
+On the other hand, the previously defined edge cutting with ξec using the product between
+subgraphs’ η coordinates is analytically simple and does not need to consider the other edges’ recovery
+from system decomposition or independence hypothesis. We therefore chose the previous way of edge
+cutting for both calculability and clarity of the concept.
+There have been many attempts to approximate complex networks by low-dimensional systems
+with the use of statistical physics and network theory. As a contemporary example, moment-closure
+approximation provides various ways to abstract essential dynamics, e.g., in discrete adaptive
+networks [24]. Although the approximation takes several theoretical assumptions such as random graph
+approximation, it is difficult to quantitatively reproduce the dynamics even in some of the simplest models.
+This is partly due to homogeneous treatment of statistics such as truncation into pair-wise order. The
+edge cutting can offer a complementary view on the evaluation of moment-closure approximations.
+Using orthogonal decomposition between edge information, one can evaluate which part of the network
+links and which order of statistics contain essential information, which does not necessarily conform to
+top-down theoretical treatment.
+
+11.2. Complexity of the Systems with Continuous Phase Space
+
+We have developed the concept of system decompositionability based on discrete binary variables.
+One can also apply the same principle to continuous variable.
+For an ergodic map G : X → X in continuous space X, KS entropy h(μ, G) is defined as the
+maximum of the entropy rate with respect to all possible system decompositions A, when the invariant
+measure μ exists:
+
+\[
+h(\mu, G) = \sup_{A} h(\mu, G, A), \tag{56}
+\]
+
+where A is a disjoint decomposition of X that consists of non-trivial sets ai, whose total number is
+n(A), defined as
+
+\[
+X = \bigcup_{i=1}^{n(A)} a_i, \tag{57}
+\]
+
+\[
+a_i \cap a_j = \emptyset, \quad i \neq j, \quad 1 \le i, j \le n(A), \tag{58}
+\]
+
+meaning the natural expansion of system decomposition into continuous space.
+The entropy rate h(μ, G, A) in Equation (56) is defined as
+
+\[
+h(\mu, G, A) = \lim_{n \to \infty} \frac{1}{n} H(\mu, A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)), \tag{59}
+\]
+
+according to the entropy H(μ, A) based on the decomposition A = {ai},
+
+\[
+H(\mu, A) = - \sum_{i=1}^{n(A)} \mu(a_i) \ln \mu(a_i), \tag{60}
+\]
+
+and the product C = A ∨ B as
+
+\[
+C = A \vee B = \{ c_i = a_j \cap b_k \mid 1 \le j \le n(A), \; 1 \le k \le n(B) \}. \tag{61}
+\]
+
+In a more general case, topological entropy hT(G) is defined simply with the number of
+decomposed subsystem elements by preimages as follows, without requiring ergodicity, and therefore
+without requiring the existence of an invariant measure μ:
+
+\[
+h_T(G) = \sup_{A} \lim_{n \to \infty} \frac{1}{n} \ln n(A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)). \tag{62}
+\]
+
+Topological entropy takes the maximum value over the possible preimage divisions, in order to
+measure the complexity in terms of the mixing degree of the orbits. For example, if the KS entropy
+is positive, h(μ, G) > 0, the dynamics of G on an invariant set of the invariant measure μ is chaotic for
+almost every initial condition. As for positive topological entropy hT(G) > 0, the dynamics
+of G contain chaotic orbits, but not necessarily as an attractive chaotic invariant set, since hT(G) ≥ h(μ, G)
+and the KS entropy can be zero.
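+
+The counting of decomposed elements by preimages can be illustrated on a simple example. The sketch below is not from the paper: it assumes the doubling map G(x) = 2x mod 1 on [0, 1) with the fixed two-set decomposition A = {[0, 1/2), [1/2, 1)}, and estimates the number of non-empty cells of the refined decomposition by sampling a dyadic grid, on which the map is exact in floating point. No supremum over A is taken.
+
```python
import math


def itinerary(x, n):
    """Cells of A visited by x, G(x), ..., G^(n-1)(x) for the doubling map G(x) = 2x mod 1."""
    word = []
    for _ in range(n):
        word.append(0 if x < 0.5 else 1)
        x = (2.0 * x) % 1.0
    return tuple(word)


def refined_cell_count(n, samples=1 << 16):
    """Number of non-empty cells of A v G^-1(A) v ... v G^(-n+1)(A), found by sampling."""
    # Midpoints of a dyadic grid: each distinct itinerary marks one non-empty cell.
    points = ((k + 0.5) / samples for k in range(samples))
    return len({itinerary(x, n) for x in points})


for n in (4, 8, 12):
    cells = refined_cell_count(n)
    print(n, cells, math.log(cells) / n)
```
+
+For this map and decomposition the cell count doubles with each step, so the ratio (1/n) ln n(·) stays at ln 2, which is the known topological entropy of the doubling map.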
+Although these definitions are useful to characterize the existence of chaotic dynamics, the system
+decompositionability is another property representing different aspect of the system complexity. It
+is rather the matter of the existence of independent dynamics components, or the degree of orbit
+localization between arbitrary system decompositions. We propose the following “geometric topological
+entropy” hg(G) applying the same principle of taking geometric product between all hierarchical
+structure of the system decomposition A.
+
+\[ h_g(G) := \prod_{\sigma(A) > 0} \lim_{n \to \infty} \frac{1}{n} \ln n\left(A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)\right), \tag{63} \]
+
+where σ(A) > 0 means that all components of A with positive Lebesgue measure on X are taken.
+This gives 0 if the preimage of a certain ai ∈ A is ai itself, meaning that there exists a subsystem ai whose
+range is invariant under G and closed in itself. The system X can then be completely divided into ai and the
+rest. This corresponds to the existence of an independent subsystem in the cuboid-bias and modular
+complexities. If such independent components do not exist, hg(G) still reflects the degree of orbit
+localization for all possible system decompositions in a multiplicative manner. The condition σ(A) > 0
+avoids trivial cases such as the existence of an unstable limit cycle, whose Lebesgue measure is 0.
+A typical example giving hg(G) = 0 is a map with independent ergodic components, such
+as the Chirikov-Taylor map with an appropriate parameter [25].
+
+12. Conclusions and Discussion
+
+We have theoretically developed a framework to measure the degree of statistical association
+existing between subsystems, as well as the association represented by each edge of the graph representation.
+We then reconsidered the problem of how to define complexity measures in terms of the construction
+of a non-linear feature space. We defined a new type of complexity based on the geometric product of
+KL-divergence, representing the degree of system decompositionability. Different complexity measures,
+including the newly proposed ones, are compared on a complementary basis on the statistical manifold.
+The presented theory can be applied to a large field of complex systems and data science,
+such as social networks, genetic expression networks, neural activities, ecological databases, and any
+kind of complex network with binary co-occurrence matrix data, e.g., [26–29], databases: [30–34].
+Continuous variables are also accessible through appropriate discretization of the information source with, e.g.,
+the entropy maximization principle.
+In contrast to the arithmetic mean of information over the whole system, the geometric mean has not been
+investigated sufficiently in the analysis of complex networks. In a different field, however, theoretical
+ecology has already pointed out the importance of the geometric mean when considering the long-term
+fitness of a species population in a randomly varying environment [35,36]. Long-term fitness refers
+to the ecological complexity of its survival strategy under large stochastic fluctuation. Here, we can
+find a useful analogy between the growth rate of a population in ecology and the spatio-temporal
+propagation rate of information between subsystems in general. If we take an arbitrary subsystem
+and consider the amount of information it can exchange with all other subsystems, the proposed
+complexity measures with geometric mean reflect the minimum amount exchanged among all possible
+other subsystems, which cannot be distinguished with the arithmetic mean. The propagation rate of a
+population in ecology and the information transmission in a complex network have a mathematically
+analogous structure. In population ecology, the variance of the growth rate is crucial for evaluating the
+long-term survival of the population. Even if the arithmetic mean of the growth rate is high, a large variance
+will lead to a low geometric mean even with a small number of situations of exceptionally low fitness,
+which ecologically means the extinction of an entire species. In a stochastic network, the variance of system
+decompositionability is essential for evaluating the amount of information shared between subsystems, or
+information persistence in the entire network. Even if the multi-information I is high, a large heterogeneity
+of edge information can lead to the informational isolation of a certain subsystem, which means the extinction
+of its information. If such a subsystem is situated on the transmission pathway, information cannot
+propagate across these nodes. Therefore, the proposed complexity measures $C_C$, $C_C^R$, $C_m$ and $C_m^R$
+generally reflect the minimum information propagation rate spread over the entire system,
+without excepting any isolated division.
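+
+A minimal worked example may make the contrast between the two means concrete. The three values below are hypothetical edge informations chosen for illustration only; they are not taken from the article.
+
+```latex
+% Hypothetical edge informations: two strong edges and one nearly isolated edge.
+x_1 = 1, \quad x_2 = 1, \quad x_3 = 10^{-3}
+% Arithmetic mean: barely affected by the weak edge.
+\frac{x_1 + x_2 + x_3}{3} = \frac{2.001}{3} \approx 0.667
+% Geometric mean: dominated by the weak edge.
+(x_1 x_2 x_3)^{1/3} = (10^{-3})^{1/3} = 0.1
+% Limit of complete isolation: the geometric mean vanishes, the arithmetic mean stays near 2/3.
+x_3 \to 0 \;\Rightarrow\; (x_1 x_2 x_3)^{1/3} \to 0
+```
+
+Under these assumed values, the arithmetic mean still suggests a well-connected system, whereas the geometric mean signals the near-isolation of one subsystem.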
+Some recent studies on adaptive network focus on the evolution of network topology in response
+to node activity, such as game-theoretic evolution of strategies [37], opinion dynamics on an evolving
+network [38], epidemic spreading on an adaptive network [39], etc. Analysis of coevolution network
+between variables and interactions can capture important dynamical features of complex systems. In
+contrast to topological network analysis, the newly proposed complexity measures can complement
+it with an analysis of statistical dynamics. In addition to the topological change of the network model (e.g., linking
+dynamics of game theory, opinion community network structure, contact network of epidemic
+transmission), one can evaluate the emergent statistical association between the variables, which does
+not necessarily coincide with the network topology. An interesting feature of non-linear dynamics is the
+unexpected correlation between distant variables, which is quantified as Tsallis entropy [40]. The
+complementary relation between concrete interaction and resulting statistical association can provide a
+twofold methodology to characterize the coevolutionary dynamics of adaptive networks. Such a strategy
+can promote integrated science from laboratory experiments to open-field in natura situations, where
+actual multi-scale problems remain to be solved [41].
+Arithmetic and geometric means can be integrated into a common formula called the generalized
+mean [42]. Therefore, the proposed complexity measures with the geometric mean of KL-divergence are
+an extension of preexisting complexity measures with mixture coordinates. Table 1 summarizes the
+generalization of complexity measures in this article. Based on the k-cut coordinates ı, the weighted
+sum of KL-divergence representing the k-tuple order of statistical association yields complexity measures
+with (weighted) arithmetic mean, such as the multi-information I and the TSE complexity. On the other hand,
+we showed that subsystem-wise correlation can also be isolated with the use of mixture coordinates,
+namely the < · · · >-cut coordinates ¸. To quantify the heterogeneity of system decompositionability, we
+generally took a weighted geometric mean of KL-divergence in $C_C$, $C_C^R$, $C_m$ and $C_m^R$. Here, the shortest
+path selection of $C_m$ and $C_m^R$, and the regularization of $C_C^R$ and $C_m^R$ with respect to the multi-information I,
+can be interpreted as the weight function of the geometric mean. This perspective brings a definition
+of a generalized class of complexity measures based on the mixture coordinates and generalized
+mean of KL-divergence. Information discrepancy can also be generalized from KL-divergence to
+Bregman divergence, providing access to the concept of multiple centroids in large stochastic data
+analysis such as image processing [43]. The blank cells of Table 1 imply the possibility of
+other complexity measures in this class. For example, the weighted geometric mean of KL-divergence
+defined between k-cut coordinates is expected to yield complexity measures that are sensitive to
+the heterogeneity of correlation orders. The weighted arithmetic mean of KL-divergence defined
+between < · · · >-cut coordinates should be sensitive to the mean decompositionability of an arbitrary
+subsystem. Since these measures take analytically different forms on mixture coordinates and/or mean
+functions, their derivatives do not coincide, which gives independent information about the system on
+the complementary basis on the statistical manifold, as long as the number of complexity measures is
+smaller than the number of degrees of freedom of the system.
+
+Table 1. Classification of complexity measures with KL-divergence on mixture coordinates.
+
+Mixture coordinates | Generalized mean of KL-divergence: arithmetic mean | Generalized mean of KL-divergence: geometric mean
+k-cut ı | TSE complexity, I | (blank)
+< · · · >-cut ¸ | (blank) | $C_C$, $C_C^R$, $C_m$, $C_m^R$
+
+Acknowledgments: This study was partially supported by CNRS, the long-term study abroad support program
+of the University of Tokyo, and the French government (Promotion Simone de Beauvoir).
+
+Conflicts of Interest: The author declares no conflict of interest.
+
+References
+
+1. Boccaletti, S.; Latora, V.; Moreno, Y.; Chavez, M.; Hwang, D.U. Complex Networks: Structure and Dynamics. Phys. Rep. 2006, 424, 175–308.
+2. Strogatz, S.H. Exploring Complex Networks. Nature 2001, 410, 268–276.
+3. Wasserman, S.; Faust, K. Social Network Analysis; Cambridge University Press: Cambridge, UK, 1994.
+4. Funabashi, M.; Cointet, J.P.; Chavalarias, D. Complex Network. In Studies in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2009; Volume 207, pp. 161–172.
+5. Badii, R.; Politi, A. Complexity: Hierarchical Structures and Scaling in Physics; Cambridge University Press: Cambridge, UK, 2008.
+6. Lempel, A.; Ziv, J. On the Complexity of Finite Sequences. IEEE Trans. Inf. Theory 1976, 22, 75–81.
+7. Li, M.; Vitanyi, P. Texts in Computer Science. In An Introduction to Kolmogorov Complexity and Its Applications, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 1997.
+8. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: New York, NY, USA, 2006.
+9. Bennett, C. On the Nature and Origin of Complexity in Discrete, Homogeneous, Locally-Interacting Systems. Found. Phys. 1986, 16, 585–592.
+10. Grassberger, P. Toward a Quantitative Theory of Self-Generated Complexity. Int. J. Theor. Phys. 1986, 25, 907–938.
+11. Crutchfield, J.P.; Feldman, D.P. Regularities Unseen, Randomness Observed: The Entropy Convergence Hierarchy. Chaos 2003, 15, 25–54.
+12. Crutchfield, J.P. Inferring Statistical Complexity. Phys. Rev. Lett. 1989, 63, 105–108.
+13. Prichard, D.; Theiler, J. Generalized Redundancies for Time Series Analysis. Physica D 1995, 84, 476–493.
+14. Amari, S. Information Geometry on Hierarchy of Probability Distributions. IEEE Trans. Inf. Theory 2001, 47, 1701–1711.
+15. Ay, N.; Olbrich, E.; Bertschinger, N.; Jost, J. A Unifying Framework for Complexity Measures of Finite Systems; Report 06-08-028; Santa Fe Institute: Santa Fe, NM, USA, 2006.
+16. MacKay, R.S. Nonlinearity in Complexity Science. Nonlinearity 2008, 21, T273–T281.
+17. Tononi, G.; Sporns, O.; Edelman, M. A Measure for Brain Complexity: Relating Functional Segregation and Integration in the Nervous System. Proc. Natl. Acad. Sci. USA 1994, 91, 5033.
+18. Feldman, D.P.; Crutchfield, J.P. Measures of statistical complexity: Why? Phys. Lett. A 1998, 238, 244–252.
+19. Nakahara, H.; Amari, S. Information-Geometric Measure for Neural Spikes. Neural Comput. 2002, 14, 2269–2316.
+20. Olbrich, E.; Bertschinger, N.; Ay, N.; Jost, J. How Should Complexity Scale with System Size? Eur. Phys. J. B 2008, 63, 407–415.
+21. Feldman, D.P.; Crutchfield, J.P. Measures of Statistical Complexity: Why? Phys. Lett. A 1998, 238, 244–252.
+22. Lopez-Ruiz, R.; Mancini, H.; Calbet, X. A Statistical Measure of Complexity. Phys. Lett. A 1995, 209, 321–326.
+23. Hodge, W.; Pedoe, D. Methods of Algebraic Geometry; Cambridge Mathematical Library, Cambridge University Press: Cambridge, UK, 1994; Volume 1–3.
+24. Demirel, G.; Vazquez, F.; Bohme, G.; Gross, T. Moment-closure Approximations for Discrete Adaptive Networks. Physica D 2014, 267, 68–80.
+25. Fraser, G., Ed. The New Physics for the Twenty-First Century; Cambridge University Press: Cambridge, UK, 2006; p. 335.
+26. Scott, J. Social Network Analysis: A Handbook; SAGE Publications Ltd.: London, UK, 2000.
+27. Geier, F.; Timmer, J.; Fleck, C. Reconstructing Gene-Regulatory Networks from Time Series, Knock-Out Data, and Prior Knowledge. BMC Syst. Biol. 2007, 1, doi:10.1186/1752-0509-1-11.
+28. Brown, E.N.; Kass, R.E.; Mitra, P.P. Multiple Neural Spike Train Data Analysis: State-of-the-Art and Future Challenges. Nat. Neurosci. 2004, 7, 456–461.
+29. Yee, T.W. The Analysis of Binary Data in Quantitative Plant Ecology. Ph.D. Thesis, The University of Auckland, New Zealand, 1993.
+30. Stanford Large Network Dataset Collection. Available online: http://snap.stanford.edu/data/ (accessed on 19 July 2014).
+31. BioGRID. Available online: http://thebiogrid.org/ (accessed on 19 July 2014).
+32. Neuroscience Information Framework. Available online: http://www.neuinfo.org/ (accessed on 19 July 2014).
+33. Global Biodiversity Information Facility. Available online: http://www.gbif.org/ (accessed on 19 July 2014).
+34. UCI Network Data Repository. Available online: http://networkdata.ics.uci.edu/index.php (accessed on 19 July 2014).
+35. Lewontin, R.C.; Cohen, D. On Population Growth in a Randomly Varying Environment. Proc. Natl. Acad. Sci. USA 1969, 62, 1056–1060.
+36. Yoshimura, J.; Clark, C.W. Individual Adaptations in Stochastic Environments. Evol. Ecol. 1969, 5, 173–192.
+37. Wu, B.; Zhou, D.; Wang, L. Evolutionary Dynamics on Stochastic Evolving Networks for Multiple-Strategy Games. Phys. Rev. E 2011, 84, 046111.
+38. Fu, F.; Wang, L. Coevolutionary Dynamics of Opinions and Networks: From Diversity to Uniformity. Phys. Rev. E 2008, 78, 016104.
+39. Gross, T.; D’Lima, C.J.D.; Blasius, B. Epidemic Dynamics on an Adaptive Network. Phys. Rev. Lett. 2006, 96, 208701.
+40. Tsallis, C. Possible Generalization of Boltzmann-Gibbs Statistics. J. Stat. Phys. 1988, 52, 479–487.
+41. Quintana-Murci, L.; Alcais, A.; Abel, L.; Casanova, J.L. Immunology in natura: Clinical, Epidemiological and Evolutionary Genetics of Infectious Diseases. Nat. Immunol. 2007, 8, 1165–1171.
+42. Hardy, G.; Littlewood, J.; Polya, G. Inequalities; Cambridge University Press: Cambridge, UK, 1967; Chapter 3.
+43. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904.
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+Article
+The Entropy-Based Quantum Metric
+
+Roger Balian
+
+Institut de Physique Théorique, CEA/Saclay, F-91191 Gif-sur-Yvette Cedex, France;
+E-Mail: roger@balian.fr
+
+Received: 15 May 2014; in revised form: 25 June 2014 / Accepted: 11 July 2014 /
+Published: 15 July 2014
+
+Abstract: The von Neumann entropy $S(\hat{D})$ generates in the space of quantum density matrices
+$\hat{D}$ the Riemannian metric $ds^2 = -d^2 S(\hat{D})$, which is physically founded and which characterises
+the amount of quantum information lost by mixing $\hat{D}$ and $\hat{D} + d\hat{D}$. A rich geometric structure is
+thereby implemented in quantum mechanics. It includes a canonical mapping between the spaces
+of states and of observables, which involves the Legendre transform of $S(\hat{D})$. The Kubo scalar
+product is recovered within the space of observables. Applications are given to equilibrium and non
+equilibrium quantum statistical mechanics. There the formalism is specialised to the relevant space of
+observables and to the associated reduced states issued from the maximum entropy criterion, which
+result from the exact states through an orthogonal projection. Von Neumann’s entropy specialises
+into a relevant entropy. Comparison is made with other metrics. The Riemannian properties of the
+metric $ds^2 = -d^2 S(\hat{D})$ are derived. The curvature arises from the non-Abelian nature of quantum
+mechanics; its general expression and its explicit form for q-bits are given, as well as geodesics.
+
+Keywords: quantum entropy; metric; q-bit; information; geometry; geodesics; relevant entropy
+
+1. A Physical Metric for Quantum States
+
+Quantum physical quantities pertaining to a given system, termed as “observables” ˆO, behave
+as non-commutative random variables and are elements of a C*-algebra. We will consider below
+systems for which these observables can be represented by n-dimensional Hermitean matrices in a
+finite-dimensional Hilbert space H. In quantum (statistical) mechanics, the “state” of such a system
+encompasses the expectation values of all its observables [1]. It is represented by a density matrix ˆD,
+which plays the rôle of a probability distribution, and from which one can derive the expectation value
+of ˆO in the form
+\[ \langle \hat{O} \rangle = \operatorname{Tr} \hat{D} \hat{O} = (\hat{D}; \hat{O}) . \tag{1} \]
+
+Density matrices should be Hermitean ($\langle \hat{O} \rangle$ is real for $\hat{O} = \hat{O}^\dagger$), normalised (the expectation
+value of the unit observable is $\operatorname{Tr} \hat{D} = 1$) and non-negative (variances $\langle \hat{O}^2 \rangle - \langle \hat{O} \rangle^2$ are
+non-negative). They depend on $n^2 - 1$ real parameters. If we keep aside the multiplicative structure of
+the set of operators and focus on their linear vector space structure, Equation (1) appears as a linear
+mapping of the space of observables onto real numbers. We can therefore regard the observables and
+the density operators ˆD as elements of two dual vector spaces, and expectation values (1) appear as
+scalar products.
+It is of interest to define a metric in the space of states. For instance, the distance between an
+exact state ˆD and an approximation ˆDapp would then characterise the quality of this approximation.
+However, all physical quantities come out in the form (1) which lies astride the two dual spaces of
+observables and states. In order to build a metric having physical relevance, we need to rely on another
+meaningful quantity which pertains only to the space of states.
+
+Entropy 2014, 16, 3878–3888; doi:10.3390/e16073878
+www.mdpi.com/journal/entropy
+
+We note at this point that quantum states are probabilistic objects that gather information about
+the considered system. Then, the amount of missing information is measured by von Neumann’s
+entropy
+\[ S(\hat{D}) \equiv -\operatorname{Tr} \hat{D} \ln \hat{D} . \tag{2} \]
+
+Introduced in the context of quantum measurements, this quantity is identified with the
+thermodynamic entropy when ˆD is an equilibrium state. In non-equilibrium statistical mechanics,
+it encompasses, in the form of “relevant entropy” (see Section 5 below), various entropies defined
+through the maximum entropy criterion. It is also introduced in quantum computation. Alternative
+entropies have been introduced in the literature, but they do not present all the distinctive and natural
+features of von Neumann’s entropy, such as additivity and concavity.
+As S( ˆD) is a concave function, and as it is the sole physically meaningful quantity apart from
+expectation values, it is natural to rely on it for our purpose. We thus define [2] the distance ds between
+two neighbouring density matrices ˆD and ˆD + d ˆD as the square root of
+
+\[ ds^2 = -d^2 S(\hat{D}) = \operatorname{Tr} d\hat{D}\, d\ln \hat{D} . \tag{3} \]
+
+This Riemannian metric is of the Hessian form since the metric tensor is generated by taking second
+derivatives of the function $S(\hat{D})$ with respect to the $n^2 - 1$ coordinates of $\hat{D}$. We may take for such
+coordinates the real and imaginary parts of the matrix elements, or equivalently (Section 6) some linear
+transform of these (keeping aside the norm $\operatorname{Tr} \hat{D} = 1$).
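+
+The step from the entropy (2) to the metric (3) can be sketched as follows. This is a minimal derivation, assuming that $\hat{D}$ is positive definite and that the coordinates are linear in $\hat{D}$, so that $d^2\hat{D} = 0$; neither assumption is spelled out at this point of the article.
+
+```latex
+% First variation of S = -Tr D ln D.
+% The second term vanishes because Tr(D d ln D) = Tr dD = 0 (normalisation).
+dS = -\operatorname{Tr}\left( d\hat{D}\, \ln\hat{D} \right) - \operatorname{Tr}\left( \hat{D}\, d\ln\hat{D} \right)
+   = -\operatorname{Tr}\left( d\hat{D}\, \ln\hat{D} \right)
+% Second variation at fixed d\hat{D} (assumed d^2\hat{D} = 0).
+d^2 S = -\operatorname{Tr}\left( d\hat{D}\, d\ln\hat{D} \right)
+% Hence the metric; it is non-negative because S is concave.
+ds^2 = -d^2 S = \operatorname{Tr}\left( d\hat{D}\, d\ln\hat{D} \right) \ge 0
+```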
+
+2. Interpretation in the Context of Quantum Information
+
+The simplest example, related to quantum information theory, is that of a q-bit (two-level system
+or spin 1/2) for which $n = 2$. Its states, represented by 2 × 2 Hermitean normalised density matrices $\hat{D}$,
+can conveniently be parameterised, on the basis of Pauli matrices, by the components $r_\mu = D_{12} + D_{21}$,
+$i(D_{12} - D_{21})$, $D_{11} - D_{22}$ ($\mu = 1, 2, 3$) of a 3-dimensional vector $\mathbf{r}$ lying within the unit Poincaré–Bloch
+sphere ($r \le 1$). From the corresponding entropy
+
+\[ S = \frac{1+r}{2} \ln \frac{2}{1+r} + \frac{1-r}{2} \ln \frac{2}{1-r} , \tag{4} \]
+
+we derive the metric
+
+\[ ds^2 = \frac{1}{1-r^2} \left( \frac{\mathbf{r} \cdot d\mathbf{r}}{r} \right)^2 + \frac{1}{2r} \ln \frac{1+r}{1-r} \left| \frac{\mathbf{r} \times d\mathbf{r}}{r} \right|^2 , \tag{5} \]
+
+which is a natural Riemannian metric for q-bits, or more generally for positive 2 × 2 matrices. The
+metric tensor characterizing (5) diverges in the vicinity of pure states $r = 1$, due to the singularity of
+the entropy (2) for vanishing eigenvalues of $\hat{D}$. However, the distance between two arbitrary (even
+pure) states $\hat{D}'$ and $\hat{D}''$ measured along a geodesic is always finite. We shall see (Equation (29)) that
+for $n = 2$ the geodesic distance $s$ between two neighbouring pure states $\hat{D}'$ and $\hat{D}''$, represented by
+unit vectors $\mathbf{r}'$ and $\mathbf{r}''$ making a small angle $\delta\varphi \sim |\mathbf{r}' - \mathbf{r}''|$, behaves as $\delta s^2 \sim \delta\varphi^2 \ln(4\sqrt{\pi}/\delta\varphi)$. The
+singularity of the metric tensor manifests itself through this logarithmic factor.
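+
+As a consistency check, the radial coefficient of the q-bit metric (5) can be recovered directly from the entropy (4), and both coefficients can be expanded for small $r$. The sketch below only covers displacements $d\mathbf{r}$ parallel to $\mathbf{r}$ and the small-$r$ limit; it is not a full derivation of (5).
+
+```latex
+% Radial derivatives of the entropy (4).
+\frac{dS}{dr} = \frac{1}{2} \ln \frac{1-r}{1+r} , \qquad
+\frac{d^2 S}{dr^2} = -\frac{1}{2} \left( \frac{1}{1-r} + \frac{1}{1+r} \right) = -\frac{1}{1-r^2}
+% For d\mathbf{r} parallel to \mathbf{r}, ds^2 = -d^2 S gives the first term of (5).
+ds^2 = \frac{dr^2}{1-r^2}
+% Small-r expansion of both coefficients of (5).
+\frac{1}{1-r^2} = 1 + r^2 + \dots , \qquad
+\frac{1}{2r} \ln \frac{1+r}{1-r} = 1 + \frac{r^2}{3} + \dots
+% Both tend to 1, so near the centre of the sphere the metric is Euclidean.
+ds^2 \to \left( \frac{\mathbf{r} \cdot d\mathbf{r}}{r} \right)^2 + \left| \frac{\mathbf{r} \times d\mathbf{r}}{r} \right|^2 = |d\mathbf{r}|^2
+```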
+Identifying von Neumann’s entropy to a measure of missing information, we can give a simple
+interpretation to the distance between two states. Indeed, the concavity of entropy expresses that some
+information is lost when two statistical ensembles described by different density operators merge. By
+mixing two equal size populations described by the neighbouring distributions $\hat{D}' = \hat{D} + \frac{1}{2}\delta\hat{D}$ and
+$\hat{D}'' = \hat{D} - \frac{1}{2}\delta\hat{D}$ separated by a distance $\delta s$, we lose an amount of information given by
+
+\[ \Delta S \equiv S(\hat{D}) - \frac{S(\hat{D}') + S(\hat{D}'')}{2} \sim \frac{\delta s^2}{8} , \tag{6} \]
+
+and thereby directly related to the distance $\delta s$ defined by (3). The proof of this equivalence relies on
+the expansion of the entropies $S(\hat{D}')$ and $S(\hat{D}'')$ around $\hat{D}$, and is valid when $\operatorname{Tr} \delta\hat{D}^2$ is negligible
+compared to the smallest eigenvalue of $\hat{D}$. If $\hat{D}'$ and $\hat{D}''$ are distant, the quantity $8\Delta S$ cannot be
+regarded as the square of a distance that would be generated by a local metric. The equivalence (6) for
+neighbouring states shows that $ds^2$ is the metric that is the best suited to measure losses of information
+by mixing.
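+
+The expansion mentioned above can be written out in a few lines. In this sketch, $dS$ and $d^2 S$ denote the first and second differentials of $S$ at $\hat{D}$ for the increment $\delta\hat{D}$; higher-order terms are simply dropped, which is the regime of validity stated in the text.
+
+```latex
+% Second-order expansion around \hat{D} for the two shifted states.
+S\left( \hat{D} \pm \tfrac{1}{2} \delta\hat{D} \right) = S(\hat{D}) \pm \tfrac{1}{2}\, dS + \tfrac{1}{8}\, d^2 S + \dots
+% The first-order terms cancel in the average.
+\frac{S(\hat{D}') + S(\hat{D}'')}{2} = S(\hat{D}) + \tfrac{1}{8}\, d^2 S + \dots
+% Loss of information by mixing, using \delta s^2 = -d^2 S from (3).
+\Delta S = S(\hat{D}) - \frac{S(\hat{D}') + S(\hat{D}'')}{2} \simeq -\tfrac{1}{8}\, d^2 S = \frac{\delta s^2}{8}
+```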
+The singularity of $\delta s^2$ at the edge of the positivity domain of $\hat{D}$ may suggest that the result (6)
+holds only within this domain. In fact, this equivalence remains nearly valid even in the limit of
+pure states because $\Delta S$ itself involves a similar singularity. Indeed, if the states $\hat{D}' = |\psi'\rangle\langle\psi'|$
+and $\hat{D}'' = |\psi''\rangle\langle\psi''|$ are pure and close to each other, the loss of information $\Delta S$ behaves as
+$8\Delta S \sim \delta\varphi^2 \ln(4/\delta\varphi)$ where $\delta\varphi^2 \sim 2 \operatorname{Tr} \delta\hat{D}^2$. This result should be compared to various geodesic
+distances between pure quantum states, which behave as $\delta s^2 \sim \delta\varphi^2 \ln(4\sqrt{\pi}/\delta\varphi)$ for the present metric,
+and as $\delta s^2_{\mathrm{BH}} = 4\delta s^2_{\mathrm{FS}} \sim \delta\varphi^2 \sim \operatorname{Tr}(\hat{D}' - \hat{D}'')^2$ for the Bures–Helstrom and the quantum Fubini–Study
+metrics, respectively (see Section 7; these behaviours hold not only for $n = 2$ but for arbitrary $n$ since
+only the space spanned by $|\psi'\rangle$ and $|\psi''\rangle$ is involved). Thus, among these metrics, only $ds^2 = -d^2 S$
+can be interpreted in terms of information loss, whether the states $\hat{D}'$ and $\hat{D}''$ are pure or mixed.
+At the other extreme, around the most disordered state $\hat{D} = \hat{I}/n$, in the region $\| n\hat{D} - \hat{I} \| \ll 1$,
+the metric becomes Euclidean since $ds^2 = \operatorname{Tr} d\hat{D}\, d\ln\hat{D} \sim n \operatorname{Tr}(d\hat{D})^2$ (for $n = 2$, $ds^2 = dr^2$). For a given
+shift $d\hat{D}$, the qualitative change of a state $\hat{D}$, as measured by the distance $ds$, gets larger and larger as
+the state $\hat{D}$ becomes purer and purer, that is, when the information content of $\hat{D}$ increases.
+
+3. Geometry of Quantum Statistical Mechanics
+
+A rich geometric structure is generated for both states and observables by von Neumann’s
+entropy through introduction of the metric ds2 = −d2S. Now, this metric (3) supplements the
+algebraic structure of the set of observables and the above duality between the vector spaces of states
+and of observables, with scalar product (1). Accordingly, we can define naturally within the space of
+states scalar products, geodesics, angles, curvatures.
+We can also regard the coordinates of d ˆD and d ln ˆD as covariant and contravariant components
+of the same infinitesimal vector (Section 6). To this aim, let us introduce the mapping
+
+\[ \hat{D} \equiv \frac{e^{\hat{X}}}{\operatorname{Tr} e^{\hat{X}}} \tag{7} \]
+
+between $\hat{D}$ in the space of states and $\hat{X}$ in the space of observables. The operator $\hat{X}$ appears as a
+parameterisation of $\hat{D}$. (The normalisation of $\hat{D}$ entails that $\hat{X}$, defined within an arbitrary additive
+constant operator $X_0 \hat{I}$, also depends on $n^2 - 1$ independent real parameters.) The metric (3) can then
+be re-expressed in terms of $\hat{X}$ in the form
+
+\[ ds^2 = \operatorname{Tr} d\hat{D}\, d\hat{X} = \operatorname{Tr} \int_0^1 d\xi\, \hat{D}\, e^{-\xi\hat{X}}\, d\hat{X}\, e^{\xi\hat{X}}\, d\hat{X} - (\operatorname{Tr} \hat{D}\, d\hat{X})^2 = d^2 \ln \operatorname{Tr} e^{\hat{X}} = d^2 F , \tag{8} \]
+
+where we introduced the function
+
+\[ F(\hat{X}) \equiv \ln \operatorname{Tr} e^{\hat{X}} \tag{9} \]
+
+of the observable $\hat{X}$. (The addition of $X_0 \hat{I}$ to $\hat{X}$ results in the addition of the irrelevant constant $X_0$ to $F$.)
+This mapping provides us with a natural metric in the space of observables, from which we recover
+the scalar product between $d\hat{X}_1$ and $d\hat{X}_2$ in the form of a Kubo correlation in the state $\hat{D}$. The metric
+(8) has been quoted in the literature under the names of Bogoliubov–Kubo–Mori.
+
+4. Covariance and Legendre Transformation
+
+We can recover the above geometric mapping (7) between $\hat{D}$ and $\hat{X}$, or between the covariant
+and contravariant coordinates of $d\hat{D}$, as the outcome of a Legendre transformation, by considering
+the function $F(\hat{X})$. Taking its differential $dF = \operatorname{Tr} e^{\hat{X}} d\hat{X} / \operatorname{Tr} e^{\hat{X}}$, we identify the partial derivatives
+of $F(\hat{X})$ with the coordinates of the state $\hat{D} = e^{\hat{X}} / \operatorname{Tr} e^{\hat{X}}$, so that $\hat{D}$ appears as conjugate to $\hat{X}$ in the
+sense of Legendre transformations. Expressing then $\hat{X}$ as function of $\hat{D}$ and inserting into $F - \operatorname{Tr} \hat{D}\hat{X}$,
+we recognise that the Legendre transform of $F(\hat{X})$ is von Neumann’s entropy
+$F - \operatorname{Tr} \hat{D}\hat{X} = S(\hat{D}) = -\operatorname{Tr} \hat{D} \ln \hat{D}$. The conjugation between $\hat{D}$ and $\hat{X}$ is embedded in the equations
+
+\[ dF = \operatorname{Tr} \hat{D}\, d\hat{X} ; \qquad dS = -\operatorname{Tr} \hat{X}\, d\hat{D} . \tag{10} \]
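+
+The Legendre relation and the second equation of (10) can be checked in a few lines. This is a sketch that uses only the mapping (7) and the definition (9); the identity operator $\hat{I}$ and the normalisation $\operatorname{Tr}\hat{D} = 1$ are as in the text.
+
+```latex
+% From (7) and (9): the logarithm of the state differs from \hat{X} by a constant.
+\ln \hat{D} = \hat{X} - F\, \hat{I} , \qquad F = \ln \operatorname{Tr} e^{\hat{X}}
+% Insert into the entropy (2), using Tr D = 1.
+S(\hat{D}) = -\operatorname{Tr} \hat{D} \ln \hat{D} = -\operatorname{Tr} \hat{D}\hat{X} + F \operatorname{Tr} \hat{D} = F - \operatorname{Tr} \hat{D}\hat{X}
+% Differentiate, using dF = Tr D dX.
+dS = dF - \operatorname{Tr} \hat{D}\, d\hat{X} - \operatorname{Tr} \hat{X}\, d\hat{D} = -\operatorname{Tr} \hat{X}\, d\hat{D}
+```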
+
+Legendre transformations are currently used in equilibrium thermodynamics. Let us show that
+they come out in this context directly as a special case of the present general formalism. The entropy of
+thermodynamics is a function of the extensive variables, energy, volume, particle numbers, etc. Let us
+focus for illustration on the energy U, keeping the other extensive variables fixed. The thermodynamic
+entropy S(U), a function of the single variable U, generates the inverse temperature as β = ∂S/∂U.
+Its Legendre transform is the Massieu potential F(β) = S − βU. In order to compare these properties
+with the present formalism, we recall how thermodynamics comes out in the framework of statistical
+mechanics. The thermodynamic entropy S(U) is identified with the von Neumann entropy (2) of the
+Boltzmann–Gibbs canonical equilibrium state ˆD, and the internal energy with U = Tr ˆD ˆH. In the
+relation (7), the operator ˆX reads ˆX = −β ˆH (within an irrelevant additive constant). By letting U or
+β vary, we select within the spaces of states and of observables a one-dimensional subset. In these
+restricted subsets, ˆD is parameterised by the single coordinate U, and the corresponding ˆX by the
+coordinate −β.
+By specialising the general relations (10) to these subsets, we recover the thermodynamic relations
+$dF = -U d\beta$ and $dS = \beta\, dU$. We also recover, by restricting the metric (3) or (8) to these subsets, the
+current thermodynamic metric $ds^2 = -(\partial^2 S/\partial U^2)\, dU^2 = -dU\, d\beta$.
+More generally, we can consider the Boltzmann–Gibbs states of equilibrium statistical mechanics
+as the points of a manifold embedded in the full space of states. The thermodynamic extensive
+variables, which parameterise these states, are the expectation values of the conserved macroscopic
+observables, that is, they are a subset of the expectation values (1) which parameterise arbitrary
+density operators. Then the standard geometric structure of thermodynamics simply results from the
+restriction of the general metric (3) to this manifold of Boltzmann–Gibbs states. The commutation of
+the conserved observables simplifies the reduced thermodynamic metric, which presents the same
+features as a Fisher metric (see Section 6).
+
+5. Relevant Entropy and Geometry of the Projection Method
+
+The above ideas also extend to non-equilibrium quantum statistical mechanics [2–4]. When
+introducing the metric (3), we indicated that it may be used to estimate the quality of an approximation.
+Let us illustrate this point with the Nakajima–Zwanzig–Mori–Robertson projection method, best
+introduced through maximum entropy. Consider some set { ˆAk} of “relevant observables”, whose
+time-dependent expectation values ak ≡ < ˆAk > = Tr ˆD ˆAk we wish to follow, discarding all other
+variables. The exact state ˆD encodes the variables {ak} that we are interested in, but also the expectation
+values (1) of the other observables that we wish to eliminate. This elimination is performed by
+associating at each time with $\hat{D}$ a “reduced state” $\hat{D}_R$ which is equivalent to $\hat{D}$ as regards the set
+$a_k = \operatorname{Tr} \hat{D}_R \hat{A}_k$, but which provides no more information than the values $\{a_k\}$. The former condition
+provides the constraints $\langle \hat{A}_k \rangle = a_k$, and the latter condition is implemented by means of the
+maximum entropy criterion: One expresses that, within the set of density matrices compatible with
+these constraints, $\hat{D}_R$ is the one which maximises von Neumann’s entropy (2), that is, which contains
+solely the information about the relevant variables $a_k$. The least biased state $\hat{D}_R$ thus defined has the
+form $\hat{D}_R = e^{\hat{X}_R} / \operatorname{Tr} e^{\hat{X}_R}$, where $\hat{X}_R \equiv \sum_k \lambda_k \hat{A}_k$ involves the time-dependent Lagrange multipliers $\lambda_k$,
+which are related to the set $a_k$ through $\operatorname{Tr} \hat{D}_R \hat{A}_k = a_k$.
+
+The von Neumann entropy S( ˆDR) ≡ SR{ak} of this reduced state ˆDR is called the “relevant
+entropy” associated with the considered relevant observables ˆAk. It measures the amount of missing
+information, when only the values {ak} of the relevant variables are given. During its evolution, ˆD
+keeps track of the initial information about all the variables < ˆO > and its entropy S( ˆD) remains
+constant in time. It is therefore smaller than the relevant entropy S( ˆDR) which accounts for the
+loss of information about the irrelevant variables. Depending on the choice of relevant observables
+{ ˆAk}, the corresponding relevant entropies SR{ak} encompass various current entropies, such as the
+non-equilibrium thermodynamic entropy or Boltzmann’s H-entropy.
+The same structure as the one introduced above for the full spaces of observables and states is
+recovered in this context. Here, for arbitrary values of the parameters λk, the exponents ˆXR = ∑k λk ˆAk
+constitute a subspace of the full vector space of observables, and the parameters {λk} appear as the
+coordinates of ˆXR on the basis { ˆAk}. The corresponding states ˆDR, parameterised by the set {ak},
+constitute a subset of the space of states, the manifold $R$ of “reduced states”. (Note that this manifold is
+not a hyperplane, contrary to the space of relevant observables; it is embedded in the full vector space
+of states, but does not constitute a subspace.) By regarding $S_R\{a_k\}$ as a function of the coordinates $\{a_k\}$,
+we can define a metric $ds^2 = -d^2 S_R\{a_k\}$ on the manifold $R$, which is the restriction of the metric (3).
+Its alternative expression $ds^2 = \sum_k da_k\, d\lambda_k = d^2 F_R\{\lambda_k\}$, where $F_R\{\lambda_k\} \equiv \ln \operatorname{Tr} \exp \sum_k \lambda_k \hat{A}_k$, is a
+restriction of (8). The correspondence between the two parameterisations $\{a_k\}$ and $\{\lambda_k\}$ is again
+implemented by the Legendre transformation which relates $S_R\{a_k\}$ and $F_R\{\lambda_k\}$.
+The projection method relies on the mapping ˆD �→ ˆDR which associates ˆDR to ˆD. It consists
+in replacing the Liouville–von Neumann equation of motion for ˆD by the corresponding dynamical
+equation for ˆDR on the manifold R, or equivalently for the coordinates {ak} or for the coordinates {λk},
+a programme that is in practice achieved through some approximations. This mapping is obviously
+a projection in the sense that ˆD ↦ ˆDR ↦ ˆDR, but moreover the introduction of the metric (3) shows
+that the vector ˆD − ˆDR in the space of states is perpendicular to the manifold R at the point ˆDR.
+This property is readily shown by writing, in this metric, the scalar product Tr d ˆD d ˆX′ of the vector
+d ˆD = ˆD − ˆDR by an arbitrary vector d ˆD′ in the tangent plane of R. The latter is conjugate to any
+combination d ˆX′ of observables ˆAk, and this scalar product vanishes because Tr ˆD ˆAk = Tr ˆDR ˆAk. Thus
+the mapping ˆD ↦ ˆDR appears as an orthogonal projection, so that the relevant state ˆDR associated
+with ˆD may be regarded as its best possible approximation on the manifold R.
+
+6. Properties of the Metric
+
+The metric tensor can be evaluated explicitly in a basis where the matrix ˆD is diagonal. Denoting
+by Di its eigenvalues and by dDij the matrix elements of its variations, we obtain from (3)
+
+\[
+ds^2 = \operatorname{Tr}\int_0^\infty d\xi \left(\frac{d\hat{D}}{\hat{D}+\xi}\right)^2
+= \sum_{ij} \frac{\ln D_i - \ln D_j}{D_i - D_j}\, dD_{ij}\, dD_{ji} . \tag{11}
+\]
+
+(For Di = Dj, whether or not i = j, the ratio is defined as 1/Di by continuity.) In the same basis, the
+form (8) of the metric reads
+
+\[
+ds^2 = \frac{1}{Z}\sum_{ij} \frac{e^{X_i} - e^{X_j}}{X_i - X_j}\, dX_{ij}\, dX_{ji}
+- \left(\frac{\sum_i e^{X_i}\, dX_{ii}}{Z}\right)^2 , \tag{12}
+\]
+
+with $Z = \sum_i e^{X_i}$. (For $X_i = X_j$, the ratio is $e^{X_i}$.) The singularity of the metric (11) in the vicinity of
+vanishing eigenvalues of ˆD, in particular near pure states (end of Section 2), is not apparent in the
+representation (12) of this metric, because the mapping from ˆD to ˆX sends the eigenvalue $X_i$ to −∞
+when $D_i$ tends to zero.
+Let us compare the expression (11) with the corresponding classical metric, which is obtained
+by starting from Shannon’s entropy instead of von Neumann’s entropy. For discrete probabilities pi,
+
+Entropy 2014, 16, 3878–3888
+
+we have then $S\{p_i\} = -\sum_i p_i \ln p_i$ and hence the same definition $ds^2 = -d^2 S\{p_i\}$ as above of an
+entropy-based metric yields $ds^2 = \sum_i dp_i^2/p_i$, which is identified with the Fisher information metric.
+The present metric thus appears as the extension to quantum statistical mechanics of the Fisher metric
+when the latter is interpreted in terms of entropy. In fact, the terms of (11) which involve the diagonal
+elements i = j of the variations d ˆD reduce to $dD_{ii}^2/D_i$. This result was expected since density matrices
+behave as probability distributions if both ˆD and d ˆD are diagonal.
+Let us more generally consider in (11), instead of solely diagonal variations $dD_{ii}$, variations $dD_{ij}$
+with indices i and j such that $|D_i - D_j| \ll D_i + D_j$. The expansion of $D_i$ and $D_j$ around $\tfrac{1}{2}(D_i + D_j)$
+in the corresponding ratios of (11) yields $(\ln D_i - \ln D_j)/(D_i - D_j) \sim 2/(D_i + D_j)$. The considered
+terms of (11) are therefore the same as in the Bures–Helstrom metric
+
+\[
+ds^2_{\mathrm{BH}} = \sum_{ij} \frac{2}{D_i + D_j}\, dD_{ij}\, dD_{ji} , \tag{13}
+\]
+
+introduced long ago as an extension to matrices of the Fisher metric [5]. We thus recover this
+Bures–Helstrom metric as an approximation of the present entropy-based metric ds2 = −d2S( ˆD).
+For n = 2, $ds^2_{\mathrm{BH}}$ is obtained from the expression (5) of $ds^2$ by omitting the factor $\tanh^{-1} r / r$ entering
+the second term.
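The comparison between the coefficients of (11) and (13) can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function names and the sample eigenvalue pairs are illustrative assumptions. It evaluates both coefficients for a pair of nearly equal eigenvalues and for a pair of very different ones.

```python
import math

def entropy_weight(d_i, d_j):
    """Coefficient of dD_ij dD_ji in Eq. (11); equals 1/D_i by continuity when D_i == D_j."""
    if d_i == d_j:
        return 1.0 / d_i
    return (math.log(d_i) - math.log(d_j)) / (d_i - d_j)

def bures_helstrom_weight(d_i, d_j):
    """Coefficient of dD_ij dD_ji in Eq. (13)."""
    return 2.0 / (d_i + d_j)

# Illustrative eigenvalue pairs (assumed values, not taken from the paper).
ratio_close = entropy_weight(0.50, 0.51) / bures_helstrom_weight(0.50, 0.51)
ratio_far = entropy_weight(0.90, 0.001) / bures_helstrom_weight(0.90, 0.001)
print("nearly equal eigenvalues:", ratio_close)
print("very different eigenvalues:", ratio_far)
```

For nearly equal eigenvalues the ratio stays close to one, which is the regime in which the Bures–Helstrom metric approximates the entropy-based metric. For widely separated eigenvalues the entropy-based coefficient is noticeably larger.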
+In order to express the properties of the Riemannian metric (3) in a general form, which will
+exhibit the tensor structure, we use a Liouville representation. There, the observables $\hat{O} = O_\mu \hat{\Omega}^\mu$,
+regarded as elements of a vector space, are represented by their coordinates $O_\mu$ on a complete basis
+$\hat{\Omega}^\mu$ of $n^2$ observables. The space of states is spanned by the dual basis $\hat{\Sigma}_\mu$, such that
+$\operatorname{Tr} \hat{\Omega}^\nu \hat{\Sigma}_\mu = \delta^\nu_\mu$, and the states $\hat{D} = D^\mu \hat{\Sigma}_\mu$ are represented by their coordinates $D^\mu$. Thus, the expectation value (1) is the
+scalar product $D^\mu O_\mu$. In the matrix representation which appears as a special case, $\mu$ denotes a pair of
+indices i, j, $\hat{\Omega}^\mu$ stands for $|j\rangle\langle i|$, $\hat{\Sigma}_\mu$ for $|i\rangle\langle j|$, $O_\mu$ denotes the matrix element $O_{ji}$ and $D^\mu$ the
+element $D_{ij}$. For the q-bit (n = 2) considered in Section 2, we have chosen the Pauli operators $\hat{\sigma}_\mu$ as
+basis $\hat{\Omega}^\mu$ for observables, and $\tfrac{1}{2}\hat{\sigma}_\mu$ as dual basis $\hat{\Sigma}_\mu$ for states, so that the coordinates $D^\mu = \operatorname{Tr} \hat{D} \hat{\Omega}^\mu$
+of $\hat{D} = \tfrac{1}{2}(\hat{I} + r_\mu \hat{\sigma}_\mu)$ are the components $r_\mu$ of the vector r. (The unit operator $\hat{I}$ is kept aside since $\hat{D}$
+is normalised and since constants added to $\hat{X}$ are irrelevant.) The function $F\{X\} = \ln \operatorname{Tr} e^{\hat{X}}$ of the
+coordinates $X_\mu$ of the observable $\hat{X}$, and the von Neumann entropy $S\{D\}$ as function of the coordinates
+$D^\mu$ of the state $\hat{D}$, are related by the Legendre transformation $F = S + D^\mu X_\mu$, and the relations (10) are
+expressed by $D^\mu = \partial F/\partial X_\mu$, $X_\mu = -\partial S/\partial D^\mu$. The metric tensor is given by
+
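+\[
+g^{\mu\nu} = \frac{\partial^2 F}{\partial X_\mu \partial X_\nu} , \qquad
+g_{\mu\nu} = -\frac{\partial^2 S}{\partial D^\mu \partial D^\nu} , \tag{14}
+\]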
+
+and the correspondence issued from (7) between covariant and contravariant infinitesimal variations
+of $\hat{X}$ and $\hat{D}$ is implemented as $dD^\mu = g^{\mu\nu} dX_\nu$, $dX_\mu = g_{\mu\nu} dD^\nu$.
+These expressions exhibit the Hessian nature of the metric. This property simplifies the expression
+of the Christoffel symbol, which reduces to
+
+\[
+\Gamma_{\mu\nu\rho} = -\frac{1}{2}\,\frac{\partial^3 S}{\partial D^\mu \partial D^\nu \partial D^\rho} , \tag{15}
+\]
+
+and which provides a parametric representation ˆD(t) of the geodesics in the space of states through
+
+\[
+\frac{d^2 D^\mu}{dt^2} + g^{\mu\sigma}\,\Gamma_{\sigma\nu\rho}\,\frac{dD^\nu}{dt}\,\frac{dD^\rho}{dt} = 0 . \tag{16}
+\]
+
+Then, the Riemann curvature tensor comes out as
+
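+\[
+R_{\mu\rho\,\nu\sigma} = g^{\xi\zeta}\left(\Gamma_{\mu\sigma\xi}\Gamma_{\nu\rho\zeta} - \Gamma_{\mu\nu\xi}\Gamma_{\rho\sigma\zeta}\right) , \tag{17}
+\]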
+
+the Ricci tensor and the scalar curvature as
+
+\[
+R_{\mu\nu} = g^{\rho\sigma} R_{\mu\rho\,\nu\sigma} , \qquad R = g^{\mu\nu} R_{\mu\nu} . \tag{18}
+\]
+
+We have noted that the classical equivalent of the entropy-based metric $ds^2 = -d^2 S$ is the Fisher
+metric $\sum_i dp_i^2/p_i$, which as regards the curvature is equivalent to a Euclidean metric. While the space of
+classical probabilities is thus flat, the above equations show that the space of quantum states is curved.
+This curvature arises from the non-commutation of the observables; it vanishes for the completely
+disordered state $\hat{D} = \hat{I}/n$. Curvature can thus be used as a measure of the degree of classicality of a
+state.
+
+7. Geometry of the Space of q-Bits
+
+In the illustrative example of a q-bit, the operator $\hat{X} = \chi_\mu \hat{\sigma}_\mu$ associated with $\hat{D}$ is parameterised
+by the 3 components of the vector $\chi_\mu$ ($\mu$ = 1, 2, 3), related to r by $\chi = \tanh^{-1} r$ and $\chi_\mu/\chi = r_\mu/r$. The
+metric tensor given by (5) is expressed as
+
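+\[
+g_{\mu\nu} = K r_\mu r_\nu + \frac{\chi}{r}\,\delta_{\mu\nu} , \qquad
+K \equiv \frac{1}{r}\frac{d}{dr}\frac{\chi}{r} = \frac{1}{r^2}\left(\frac{1}{1-r^2} - \frac{\chi}{r}\right) , \tag{19}
+\]
+\[
+g^{\mu\nu} = (1-r^2)\,p^{\mu\nu} + \frac{r}{\chi}\,q^{\mu\nu} .
+\]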
+
+(We have defined $r^\mu = r_\mu$, $\delta^{\mu\nu} = \delta^\mu_\nu = \delta_{\mu\nu}$ so as to introduce the projectors $r^\mu r_\nu/r^2 \equiv p^\mu_\nu \equiv \delta^\mu_\nu - q^\mu_\nu$
+in the Euclidean 3-dimensional space, and thus to simplify the subsequent calculations.) In polar
+coordinates r = (r, θ, ϕ), the infinitesimal distance takes the form
+
+\[
+ds^2 = dr\,d\chi + r\chi\left(d\theta^2 + \sin^2\theta\, d\varphi^2\right) . \tag{20}
+\]
+
+We determine from (15) and (19) the explicit form
+
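+\[
+\Gamma_{\mu\nu\rho} = \frac{K}{2}\left(r_\mu \delta_{\nu\rho} + r_\nu \delta_{\mu\rho} + r_\rho \delta_{\mu\nu}\right) + \frac{1}{2r}\frac{dK}{dr}\, r_\mu r_\nu r_\rho \tag{21}
+\]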
+
+of the Christoffel symbol. By raising its first index with $g^{\mu\nu}$ and using polar coordinates, we obtain
+from (16) the equations of geodesics for n = 2. Within the Poincaré–Bloch sphere the geodesics are
+deduced by rotations from a one-parameter family of curves which lie in the $\theta = \tfrac{1}{2}\pi$, $|\varphi| \le \tfrac{1}{2}\pi$
+half-plane and which are symmetric with respect to the ϕ = 0 axis. This family is characterized by the
+equations (where $\chi = \tanh^{-1} r$):
+
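+\[
+\frac{d^2 r}{dt^2} + \frac{r}{1-r^2}\left(\frac{dr}{dt}\right)^2 - \frac{r}{2}\left[1 + \frac{\chi}{r}\left(1-r^2\right)\right]\left(\frac{d\varphi}{dt}\right)^2 = 0 , \tag{22}
+\]
+\[
+\frac{d^2\varphi}{dt^2} + \frac{1}{r}\frac{dr}{dt}\frac{d\varphi}{dt} + \frac{1}{\chi}\frac{d\chi}{dt}\frac{d\varphi}{dt} = 0 , \tag{23}
+\]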
+
+and the boundary conditions at t = 0:
+
+\[
+r(0) = a , \qquad \varphi(0) = 0 , \qquad \frac{dr(0)}{dt} = 0 , \qquad \frac{d\varphi(0)}{dt} = \frac{1}{k} , \qquad k^2 = a \tanh^{-1} a . \tag{24}
+\]
+
+Equation (23) provides, using the boundary conditions (24):
+
+\[
+\frac{d\varphi}{dt} = \frac{k}{r\chi} . \tag{25}
+\]
+
+Insertion of (25) into (22) gives rise to an equation for r (t), which can be integrated by regarding t as a
+function of ζ = arcsin r. One obtains:
+
+\[
+\left(\frac{dr}{dt}\right)^2 = \left(1-r^2\right)\left(1 - \frac{k^2}{r\chi}\right) . \tag{26}
+\]
+
+The scale of t has been fixed by relating to r (0) the boundary condition (24) for dϕ (0) /dt, a choice
+which ensures that $ds^2 = dr\,d\chi + r\chi\,d\varphi^2 = dt^2$, and hence that the parameter t measures the distance
+along geodesics.
+For k = 0, we obtain r = |sin t|, ϕ = ±π/2. Thus, the longest geodesics are the diameters of
+the Poincaré–Bloch sphere. We find the value π for their “length”, that is, for the geodesic distance
+between two orthogonal pure states. At the other extreme, when the middle point r = a, ϕ = 0 of
+a geodesic lies close to the surface r = 1 of the sphere, the asymptotic form of the equation (26) is
+solved as
+
+\[
+t = \pm 2k\sqrt{\pi}\,e^{-k^2}\operatorname{erf}\xi , \qquad
+\xi = \sqrt{\frac{1}{2}\ln\frac{1-a}{1-r}} , \qquad
+k^2 = \frac{1}{2}\ln\frac{2}{1-a} \tag{27}
+\]
+
+(by taking ξ as variable instead of r). The determination of the explicit equations of such short geodesic
+curves is achieved by integrating (25) into
+
+\[
+\varphi = \frac{t}{k} = \pm 2\sqrt{\pi}\,e^{-k^2}\operatorname{erf}\xi . \tag{28}
+\]
+
+From (27) and (28) we can determine the geodesic distance between two neighbouring pure states $\hat{D}' = |\psi'\rangle\langle\psi'|$
+and $\hat{D}'' = |\psi''\rangle\langle\psi''|$ represented by the points $r_{\max} = 1$, $\varphi_{\max} = \pm\tfrac{1}{2}\delta\varphi$ with $\delta\varphi$ small.
+At these two points, we have $\xi \to \infty$, $\operatorname{erf}\xi = 1$, and this determines k in terms of $\tfrac{1}{2}\delta\varphi$ through (28).
+The length of the geodesic that joins them, given by (27), is:
+
+\[
+\delta s^2 = \delta\varphi^2 \ln\frac{4\sqrt{\pi}}{\delta\varphi} , \qquad
+\delta\varphi = \arccos\left|\langle\psi'|\psi''\rangle\right| . \tag{29}
+\]
+
+Thus, in spite of its singularity for r = 1, the present 3-dimensional metric (5) in the space r, θ, ϕ defines
+distances between pure states represented by points on the surface r = 1 of the Poincaré–Bloch sphere.
+However, it should be noted that the presence of the logarithmic factor in (29) forbids such distances
+to be generated by a 2-dimensional metric in the space θ, ϕ. In fact, the distance (29) is measured along
+a geodesic that penetrates the sphere r = 1, because no geodesic is tangent to the surface of this sphere
+nor lies on its surface.
+In contrast, all geodesics produced by the Bures–Helstrom metric are tangent to the surface of the
+sphere, or are its great circles. They are given by Equations (25) and (26), where χ is replaced by r and
+k by a; the solution of these equations provides the ellipses
+
+\[
+r\cos\varphi = a\cos t , \qquad r\sin\varphi = \sin t . \tag{30}
+\]
+
+Here as above, the largest distance π is reached for orthogonal pure states represented by opposite
+points on the sphere, but now a peculiarity occurs. Whereas the metric ds2 = −d2S produces a single
+geodesic, the diameter joining these two points (with “length” π), the Bures metric produces a double
+infinity of geodesics, the half-ellipses (30) having as long axis this diameter, and having all the same
+“length” π. Other pairs of pure states are joined by geodesics which are arcs of great circles, and their
+Bures distance δsBH = δϕ is identified with the ordinary length of the arc. Here for n = 2 as in the
+general case, the 3-dimensional Bures–Helstrom metric admits a restriction to pure states generated by
+a 2-dimensional metric, which is identified with the quantum Fubini–Study metric, itself defined only
+for pure states by $s_{\mathrm{FS}} = \arccos\left|\langle\psi'|\psi''\rangle\right| = \tfrac{1}{2}s_{\mathrm{BH}}$.
+
+Returning to the metric $ds^2 = -d^2 S$, the Riemann curvature is obtained from (17) as
+
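+\begin{align}
+R^{\mu}{}_{\rho\,\nu\sigma} = \frac{K}{4}\Big[ &\left(r^2 + \tfrac{r}{\chi} - 1\right)\left(q^\mu_\sigma q_{\nu\rho} - q^\mu_\nu q_{\rho\sigma}\right) + \left(r^2 - \tfrac{r}{\chi} + 1\right)\left(p^\mu_\sigma q_{\nu\rho} - p^\mu_\nu q_{\rho\sigma}\right) \tag{31} \\
+&+ \frac{r}{\chi}\,\frac{1}{1-r^2}\left(r^2 - \tfrac{r}{\chi} + 1\right)\left(q^\mu_\sigma p_{\nu\rho} - q^\mu_\nu p_{\rho\sigma}\right) \Big] . \notag
+\end{align}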
+
+Contracting with $g^{\rho\sigma}$ the indices of (31) as in (18), we finally derive the Ricci curvature
+
+\[
+R^\mu_\nu = -\frac{K r}{2\chi}\left(r^2\,\delta^\mu_\nu + \frac{\chi - r}{\chi}\,p^\mu_\nu\right) , \tag{32}
+\]
+
+and the scalar curvature
+
+\[
+R = -\frac{K r}{2\chi}\left(3r^2 + \frac{\chi - r}{\chi}\right) . \tag{33}
+\]
+
+Both are negative in the whole Poincaré sphere. In the limit r → 0, the curvature R vanishes as
+$R \sim -\tfrac{10}{9} r^2$, as expected from the general argument of Section 2: a weakly polarised spin behaves
+classically. At the other extreme r → 1, R behaves as $R \sim -2\,[(1-r)\,|\ln(1-r)|]^{-1}$; it diverges, again
+as expected: pure states have the largest quantum nature.
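The two limiting behaviours of the scalar curvature can be checked numerically from (19) and (33). The following is a minimal sketch under stated assumptions: the function names and the two sample polarisations are illustrative choices, not part of the original text.

```python
import math

def scalar_curvature(r):
    """Scalar curvature R(r) of Eq. (33) for a q-bit with polarisation 0 < r < 1."""
    chi = math.atanh(r)
    k_factor = (1.0 / (1.0 - r * r) - chi / r) / (r * r)  # K of Eq. (19)
    return -(k_factor * r / (2.0 * chi)) * (3.0 * r * r + (chi - r) / chi)

def small_r_limit(r):
    """Quoted behaviour for r -> 0: R ~ -(10/9) r^2."""
    return -(10.0 / 9.0) * r * r

def near_pure_limit(r):
    """Quoted behaviour for r -> 1: R ~ -2 / ((1 - r) |ln(1 - r)|)."""
    return -2.0 / ((1.0 - r) * abs(math.log(1.0 - r)))

# Sample polarisations (assumed values chosen for illustration).
r_small, r_large = 0.05, 1.0 - 1e-6
ratio_small = scalar_curvature(r_small) / small_r_limit(r_small)
ratio_large = scalar_curvature(r_large) / near_pure_limit(r_large)
print("weakly polarised ratio:", ratio_small)
print("nearly pure ratio:", ratio_large)
```

The first ratio is very close to one. The second approaches one only slowly, because the corrections to the r → 1 asymptote decrease like the inverse of |ln(1 − r)|.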
+
+The metric ds2 = −d2S, introduced above in the context of quantum mechanics for mixed states
+(and their pure limit) and information theory, might more generally be useful to characterise distances
+in spaces of positive matrices.
+
+Conflicts of Interest: The author declares no conflict of interest.
+
+References
+
+1. Thirring, W. Quantum Mechanics of Large Systems. In A Course of Mathematical Physics; Volume 4; Springer-Verlag: New York, NY, USA, 1983.
+2. Balian, R.; Alhassid, Y.; Reinhardt, H. Dissipation in many-body systems: A geometric approach based on information theory. Phys. Rep. 1986, 131, 1–146.
+3. Balian, R. Incomplete descriptions and relevant entropies. Am. J. Phys. 1999, 67, 1078–1090.
+4. Balian, R. Information in statistical physics. Stud. Hist. Philos. Mod. Phys. 2005, 36, 323–353.
+5. Bures, D. An extension of Kakutani’s theorem. Trans. Am. Math. Soc. 1969, 135, 199–212.
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+Article
+Extending the Extreme Physical Information to
+Universal Cognitive Models via a Confident
+Information First Principle
+
+Xiaozhao Zhao 1, Yuexian Hou 1,2,*, Dawei Song 1,3 and Wenjie Li 2
+
+1 School of Computer Science and Technology, Tianjin University, Tianjin 300072, China; E-Mails:
+0.25eye@gmail.com (X.Z.); dawei.song2010@gmail.com (D.S.)
+2 Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China;
+E-Mail: cswjli@comp.polyu.edu.hk
+3 Department of Computing and Communications, The Open University, Milton Keynes MK76AA, UK
+* E-Mail: yxhou@tju.edu.cn; Tel.: +86-022-27406538.
+
+Received: 25 March 2014; in revised form: 6 June 2014 / Accepted: 20 June 2014 /
+Published: 1 July 2014
+
+Abstract: The principle of extreme physical information (EPI) can be used to derive many known
+laws and distributions in theoretical physics by extremizing the physical information loss K, i.e.,
+the difference between the observed Fisher information I and the intrinsic information bound J
+of the physical phenomenon being measured. However, for complex cognitive systems of high
+dimensionality (e.g., human language processing and image recognition), the information bound
+J could be excessively larger than I (J ≫ I), due to insufficient observation, which would lead
+to serious over-fitting problems in the derivation of cognitive models. Moreover, there is a lack
+of an established exact invariance principle that gives rise to the bound information in universal
+cognitive systems. This limits the direct application of EPI. To narrow down the gap between I and
+J, in this paper, we propose a confident-information-first (CIF) principle to lower the information
+bound J by preserving confident parameters and ruling out unreliable or noisy parameters in the
+probability density function being measured. The confidence of each parameter can be assessed
+by its contribution to the expected Fisher information distance between the physical phenomenon
+and its observations. In addition, given a specific parametric representation, this contribution can
+often be directly assessed by the Fisher information, which establishes a connection with the inverse
+variance of any unbiased estimate for the parameter via the Cramér–Rao bound. We then consider
+the dimensionality reduction in the parameter spaces of binary multivariate distributions. We show
+that the single-layer Boltzmann machine without hidden units (SBM) can be derived using the CIF
+principle. An illustrative experiment is conducted to show how the CIF principle improves the
+density estimation performance.
+
+Keywords: information geometry; Boltzmann machine; Fisher information; parametric reduction
+
+1. Introduction
+
+Information has been found to play an increasingly important role in physics. As stated in
+Wheeler [1]: “All things physical are information-theoretic in origin and this is a participatory
+universe...Observer participancy gives rise to information; and information gives rise to physics”.
+Following this viewpoint, Frieden [2] unifies the derivation of physical laws in major fields of physics,
+from the Dirac equation to the Maxwell-Boltzmann velocity dispersion law, using the extreme physical
+information principle (EPI). More specifically, a variety of equations and distributions can be derived by
+extremizing the physical information loss K, i.e., the difference between the observed Fisher information
+I and the intrinsic information bound J of the physical phenomenon being measured.
+
+Entropy 2014, 16, 3670–3688; doi:10.3390/e16073670
+www.mdpi.com/journal/entropy
+
+The first quantity, I, measures the amount of information as a finite scalar implied by the data
+with some suitable measure [2]. It is formally defined as the trace of the Fisher information matrix [3].
+In addition to I, the second quantity, the information bound J, is an invariant that characterizes the
+information that is intrinsic to the physical phenomenon [2]. During the measurement procedure, there
+may be some loss of information, which entails I = κJ, where κ ≤ 1 is called the efficiency coefficient
+of the EPI process in transferring the Fisher information from the phenomenon (specified by J) to
+the output (specified by I). For closed physical systems, in particular, any solution for I attains some
+fraction of J between 1/2 (for classical physics) and one (for quantum physics) [4].
+However, it is usually not the case in cognitive science. For complex cognitive systems (e.g.,
+human language processing and image recognition), the target probability density function (pdf) being
+measured is often of high dimensionality (e.g., thousands of words in a human language vocabulary
+and millions of pixels in an observed image). Thus, it is infeasible for us to obtain a sufficient collection
+of observations, leading to excessive information loss between the observer and nature. Moreover,
+there is a lack of an established exact invariance principle that gives rise to the bound information in
+universal cognitive systems. This limits the direct application of EPI in cognitive systems.
+In terms of statistics and machine learning, the excessive information loss between the observer
+and nature will lead to serious over-fitting problems, since the insufficient observations may not
+provide necessary information to reasonably identify the model and support the estimation of the
+target pdf in complex cognitive systems. Actually, a similar problem is also recognized in statistics and
+machine learning, known as the model selection problem [5]. In general, we would require a complex
+model with a high-dimensional parameter space to sufficiently depict the original high-dimensional
+observations. However, over-fitting usually occurs when the model is excessively complex with
+respect to the given observations. To avoid over-fitting, we would need to adjust the complexity of the
+models to the available amount of observations and, equivalently, to adjust the information bound J
+corresponding to the observed information I.
+In order to derive feasible computational models for cognitive phenomena, we propose a
+confident-information-first (CIF) principle in addition to EPI to narrow down the gap between I and J
+(thus, a reasonable efficiency coefficient κ is implied), as illustrated in Figure 1. However, we do not
+intend to actually derive the distribution laws by solving the differential equations of the extremization
+of the new information loss K′. Instead, we assume that the target distribution belongs to some general
+multivariate binary distribution family and focus on the problem of seeking a proper information
+bound with respect to the constraint of the parametric number and the given observations.
+
+Figure 1. (a) The paradigm of the extreme physical information principle (EPI) to derive physical laws
+by the extremization of the information loss K∗ (K∗ = J/2 for classical physics and K∗ = 0 for quantum
+physics); (b) the paradigm of confident-information-first (CIF) to derive computational models by
+reducing the information loss K′ using a new physical bound J′.
+
+The key to the CIF approach is how to systematically reduce the physical information bound for
+high-dimensional complex systems. As stated in Frieden [2], the information bound J is a functional
+form that depends upon the physical parameters of the system. The information is contained in
+the variations of the observations (often imperfect, due to insufficient sampling, noise and intrinsic
+limitations of the “observer”), and can be further quantified using the Fisher information of system
+parameters (or coordinates) [3] from the estimation theory. Therefore, the physical information bound
+J of a complex system can be reduced by transforming it to a simpler system using some parametric
+reduction approach. Assuming there exists an ideal parametric model S that is general enough to
+represent all system phenomena (which gives the ultimate information bound in Figure 1), our goal is
+to adopt a parametric reduction procedure to derive a lower-dimensional sub-model M (which gives
+the reduced information bound in Figure 1) for a given dataset (usually insufficient or perturbed by
+noises) by reducing the number of free parameters in S.
+Formally speaking, let q(ξ) be the ideal distribution with parameters ξ that describes the physical
+system and q(ξ + Δξ) be the observations of the system with some small fluctuation Δξ in parameters.
+In [6], the averaged information distance I(Δξ) between the distribution and its observations, the
+so-called shift information, is used as a disorder measure of the fluctuated observations to reinterpret
+the EPI principle. More specifically, in the framework of information geometry, this information
+distance could also be assessed using the Fisher information distance induced by the Fisher–Rao
+metric, which can be decomposed into the variation in the direction of each system parameter [7].
+In principle, it is possible to divide system parameters into two categories, i.e., the parameters with
+notable variations and the parameters with negligible variations, according to their contributions to the
+whole information distance. Additionally, the parameters with notable contributions are considered
+to be confident, since they are important for reliably distinguishing the ideal distribution from its
+observation distributions. On the other hand, the parameters with negligible contributions can be
+considered to be unreliable or noisy. Then, the CIF principle can be stated as the parameter selection
+criterion that maximally preserves the Fisher information distance in an expected sense with respect
+to the constraint of the parametric number and the given observations (if available), when projecting
+distributions from the parameter space of S into that of the reduced sub-model M. We call it the
+distance-based CIF. As a result, we could manipulate the information bound of the underlying system
+by preserving the information of confident parameters and ruling out noisy parameters.
+In this paper, the CIF principle is analyzed in the multivariate binary distribution family in the
+mixed-coordinate system [8]. It turns out that, in this problematic configuration, the confidence of
+a parameter can be directly evaluated by its Fisher information, which also establishes a connection
+with the inverse variance of any unbiased estimate for the parameter via the Cramér–Rao bound [3].
+Hence, the CIF principle can also be interpreted as the parameter selection procedure that keeps the
+parameters with reliable estimates and rules out unreliable or noisy parameters. This CIF is called
+the information-based CIF. Note that the definition of confidence in distance-based CIF depends on
+both Fisher information and the scale of fluctuation, and the confidence in the information-based CIF
+(i.e., Fisher information) can be seen as a special case of confidence measure with respect to certain
+coordinate systems. This simplification allows us to further apply the CIF principle to improve existing
+learning algorithms for the Boltzmann machine.
+The paper is organized as follows. In Section 2, we introduce the parametric formulation for
+the general multivariate binary distributions in terms of information geometry (IG) framework [7].
+Then, Section 3 describes the implementation details of the CIF principle. We also give a geometric
+interpretation of CIF by showing that it can maximally preserve the expected information distance (in
+Section 3.2.1), as well as the analysis on the scale of the information distance in each individual system
+parameter (in Section 3.2.2). In Section 4, we demonstrate that a widely used cognitive model, i.e., the
+Boltzmann machine, can be derived using the CIF principle. Additionally, an illustrative experiment is
+conducted to show how the CIF principle can be utilized to improve the density estimation performance
+of the Boltzmann machine in Section 5.
+
+2. The Multivariate Binary Distributions
+
+Similar to EPI, the derivation of CIF depends on the analysis of the physical information bound,
+where the choice of system parameters, also called “Fisher coordinates” in Frieden [2], is crucial.
+Based on information geometry (IG) [7], we introduce some choices of parameterizations for binary
+multivariate distributions (denoted as statistical manifold S) with a given number of variables n, i.e.,
+the open simplex of all probability distributions over binary vector $x \in \{0, 1\}^n$.
+
+2.1. Notations for Manifold S
+
+In IG, a family of probability distributions is considered as a differentiable manifold with certain
+parametric coordinate systems. In the case of binary multivariate distributions, four basic coordinate
+systems are often used: p-coordinates, η-coordinates, θ-coordinates and mixed-coordinates [7,9].
+Mixed-coordinates is of vital importance for our analysis.
+For the p-coordinates [p] with n binary variables, the probability distribution over $2^n$ states
+of x can be completely specified by any $2^n - 1$ positive numbers indicating the probability of the
+corresponding exclusive states on n binary variables. For example, the p-coordinates of n = 2 variables
+could be [p] = (p01, p10, p11). Note that IG requires all probability terms to be positive [7].
+For simplicity, we use the capital letters I, J, . . . to index the coordinate parameters of probabilistic
+distribution. To distinguish the notation of Fisher information (conventionally used in literature,
+e.g., data information I and information bound J in Section 1) from the coordinate indexes, we
+make explicit explanations when necessary from now on. An index I can be regarded as a subset
+of {1, 2, . . . , n}. Additionally, pI stands for the probability that all variables indicated by I equal
+to one and the complemented variables are zero. For example, if I = {1, 2, 4} and n = 4, then
+pI = p1101 = Prob(x1 = 1, x2 = 1, x3 = 0, x4 = 1). Note that the null set can also be a legal index of
+the p-coordinates, which indicates the probability that all variables are zero, denoted as p0...0.
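The indexing convention for p-coordinates can be illustrated with a small helper. This is only a sketch: the function name `p_label` is a hypothetical choice, and the bit order (variable 1 first) is inferred from the example pI = p1101 above.

```python
def p_label(index_set, n):
    """Bit string naming p_I: position i (1-based) is '1' iff variable i belongs to I."""
    return "".join("1" if i in index_set else "0" for i in range(1, n + 1))

# The example from the text: I = {1, 2, 4} with n = 4 variables.
label_example = p_label({1, 2, 4}, 4)
# The null set indexes the state in which every variable is zero.
label_null = p_label(set(), 3)
print(label_example, label_null)
```

With this convention every subset of {1, ..., n} names exactly one of the $2^n$ exclusive states, which is why the same index sets can be reused for the η- and θ-coordinates.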
+Another coordinate system often used in IG is η-coordinates, which is defined by:
+
+\[
+\eta_I = E[X_I] = \operatorname{Prob}\Big\{\prod_{i \in I} x_i = 1\Big\} \tag{1}
+\]
+
+where the value of $X_I$ is given by $\prod_{i \in I} x_i$ and the expectation is taken with respect to the probability
+distribution over x. Grouping the coordinates by their orders, the η-coordinate system is denoted
+as $[\eta] = (\eta^1_i, \eta^2_{ij}, \ldots, \eta^n_{1,2\ldots n})$, where the superscript indicates the order number of the corresponding
+parameter. For example, $\eta^2_{ij}$ denotes the set of all η parameters with the order number two.
+The θ-coordinates (natural coordinates) are defined by:
+
+\[
+\log p(x) = \sum_{I \subseteq \{1,2,\ldots,n\},\, I \neq \emptyset} \theta^I X_I - \psi(\theta) \tag{2}
+\]
+
+where $\psi(\theta) = \log\big(\sum_x \exp\{\sum_I \theta^I X_I(x)\}\big)$ is the cumulant generating function and its value equals
+$-\log \operatorname{Prob}\{x_i = 0, \forall i \in \{1, 2, \ldots, n\}\}$. The θ-coordinate system is denoted as $[\theta] = (\theta^i_1, \theta^{ij}_2, \ldots, \theta^{1,\ldots,n}_n)$, where
+the subscript indicates the order number of the corresponding parameter. Note that the order indices
+are located at different positions in [η] and [θ], following the convention in Amari et al. [8].
+The relation between coordinate systems [η] and [θ] is bijective. More formally, they are connected
+by the Legendre transformation:
+
+\[
+\theta^I = \frac{\partial \phi(\eta)}{\partial \eta_I} , \qquad \eta_I = \frac{\partial \psi(\theta)}{\partial \theta^I} \tag{3}
+\]
+
+where ψ(θ) is given in Equation (2) and φ(η) = ∑x p(x; η) log p(x; η) is the negative of entropy. It can
+be shown that ψ(θ) and φ(η) meet the following identity [7]:
+
+ψ(θ) + φ(η) − ∑ θIηI = 0
+(4)
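Identity (4) can be verified numerically for a small distribution. The following is a minimal sketch, assuming a hypothetical distribution over n = 2 variables. The helper names are illustrative, and the θ-coordinates are obtained by Möbius inversion of Equation (2), which is one standard way to solve it and is not spelled out in the text.

```python
import math
from itertools import combinations

def subsets(items):
    """All subsets of `items` as frozensets, the empty set included."""
    items = sorted(items)
    return [frozenset(c) for size in range(len(items) + 1)
            for c in combinations(items, size)]

def eta_coordinates(p, n):
    """eta_I = Prob(all variables in I equal one), Eq. (1)."""
    every = subsets(range(1, n + 1))
    return {s: sum(p[k] for k in every if s <= k) for s in every if s}

def theta_coordinates(p, n):
    """theta^I from Eq. (2), solved by Moebius inversion over the subsets of I."""
    every = subsets(range(1, n + 1))
    return {s: sum((-1) ** len(s - k) * math.log(p[k]) for k in subsets(s))
            for s in every if s}

# Hypothetical distribution, keyed by the set of variables equal to one:
# p00 = 0.4, p10 = 0.2, p01 = 0.1, p11 = 0.3.
n = 2
p = {frozenset(): 0.4, frozenset({1}): 0.2,
     frozenset({2}): 0.1, frozenset({1, 2}): 0.3}

eta = eta_coordinates(p, n)
theta = theta_coordinates(p, n)
psi = -math.log(p[frozenset()])                  # cumulant generating function
phi = sum(q * math.log(q) for q in p.values())   # negative entropy
residual = psi + phi - sum(theta[s] * eta[s] for s in eta)
print("psi + phi - sum(theta * eta) =", residual)
```

The residual vanishes up to floating-point rounding, as Equation (4) requires.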
+
+Next, we introduce mixed-coordinates, which is important for our derivation of CIF. In general,
+the manifold S of probability distributions could be represented by the l-mixed-coordinates [8]:
+
+\[
+[\zeta]_l = (\eta^1_i, \eta^2_{ij}, \ldots, \eta^l_{i,j,\ldots,k}, \theta^{i,j,\ldots,k}_{l+1}, \ldots, \theta^{1,\ldots,n}_n) \tag{5}
+\]
+
+where the first part consists of η-coordinates with order less than or equal to l (denoted by [ηl−]) and the
+second part consists of θ-coordinates with order greater than l (denoted by [θl+]), l ∈ {1, ..., n − 1}.
+
+2.2. Fisher Information Matrix for Parametric Coordinates
+
+For a general coordinate system [ξ], the i-th row and j-th column element of the Fisher information
+matrix for [ξ] (denoted by Gξ) is defined as the covariance of the scores of [ξi] and [ξj] [3], i.e.,
+
+\[
+g_{ij} = E\left[\frac{\partial \log p(x; \xi)}{\partial \xi_i} \cdot \frac{\partial \log p(x; \xi)}{\partial \xi_j}\right]
+\]
+
+under the regularity condition for the pdf that the partial derivatives exist. The Fisher information
+measures the amount of information in the data that a statistic carries about the unknown
+parameters [10]. The Fisher information matrix is of vital importance to our analysis, because the
+inverse of Fisher information matrix gives an asymptotically tight lower bound to the covariance
+matrix of any unbiased estimate for the considered parameters [3]. Another important concept related
+to our analysis is the orthogonality defined by Fisher information. Two coordinate parameters ξi and
+ξj are called orthogonal if and only if their Fisher information vanishes, i.e., gij = 0, meaning that their
+influences on the log likelihood function are uncorrelated.
+
+The Fisher information for [θ] can be rewritten as $g_{IJ} = \frac{\partial^2 \psi(\theta)}{\partial \theta^I \partial \theta^J}$, and for [η], it is $g^{IJ} = \frac{\partial^2 \phi(\eta)}{\partial \eta_I \partial \eta_J}$ [7].
+Let $G_\theta = (g_{IJ})$ and $G_\eta = (g^{IJ})$ be the Fisher information matrices for [θ] and [η], respectively. It can be
+shown that $G_\theta$ and $G_\eta$ are mutually inverse matrices, i.e., $\sum_J g^{IJ} g_{JK} = \delta^I_K$, where $\delta^I_K = 1$ if I = K and
+zero otherwise [7]. In order to generally compute Gθ and Gη, we develop the following Propositions 1
+and 2. Note that Proposition 1 is a generalization of Theorem 2 in Amari et al. [8].
+
+Proposition 1. The Fisher information between two parameters θI and θJ in [θ], is given by:
+
+\[
+g_{IJ}(\theta) = \eta_{I \cup J} - \eta_I \eta_J \tag{6}
+\]
+
+Proof. in Appendix A.
+
+Proposition 2. The Fisher information between two parameters ηI and ηJ in [η], is given by:
+
+\[
+g^{IJ}(\eta) = \sum_{K \subseteq I \cap J} (-1)^{|I-K|+|J-K|} \cdot \frac{1}{p_K} \tag{7}
+\]
+
+where | · | denotes the cardinality operator.
+
+Proof. in Appendix B.
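Propositions 1 and 2 can be cross-checked against the statement that Gθ and Gη are mutually inverse. The following is a minimal sketch, assuming a hypothetical distribution over n = 2 variables. The function names are illustrative, and the sum in Equation (7) is taken to include the null set K, with $p_K$ then being the all-zero probability.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items` as frozensets, the empty set included."""
    items = sorted(items)
    return [frozenset(c) for size in range(len(items) + 1)
            for c in combinations(items, size)]

def fisher_theta(p, index_sets, every):
    """Proposition 1: g_IJ(theta) = eta_{I union J} - eta_I * eta_J."""
    eta = {s: sum(p[k] for k in every if s <= k) for s in every}
    return [[eta[a | b] - eta[a] * eta[b] for b in index_sets] for a in index_sets]

def fisher_eta(p, index_sets):
    """Proposition 2: alternating sum of 1/p_K over all K contained in I and J."""
    return [[sum((-1) ** (len(a - k) + len(b - k)) / p[k] for k in subsets(a & b))
             for b in index_sets] for a in index_sets]

# Hypothetical distribution, keyed by the set of variables equal to one.
n = 2
p = {frozenset(): 0.4, frozenset({1}): 0.2,
     frozenset({2}): 0.1, frozenset({1, 2}): 0.3}
every = subsets(range(1, n + 1))
index_sets = [s for s in every if s]   # coordinate indices: all non-empty subsets

g_theta = fisher_theta(p, index_sets, every)
g_eta = fisher_eta(p, index_sets)
size = len(index_sets)
product = [[sum(g_theta[r][m] * g_eta[m][c] for m in range(size))
            for c in range(size)] for r in range(size)]
max_deviation = max(abs(product[r][c] - (1.0 if r == c else 0.0))
                    for r in range(size) for c in range(size))
print("largest deviation of G_theta * G_eta from the identity:", max_deviation)
```

For this example the product agrees with the identity matrix up to rounding, consistent with the two propositions.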
+
+Based on the Fisher information matrices Gη and Gθ, we can calculate the Fisher information
+matrix Gζ for the l-mixed-coordinate system [ζ]l, as follows:
+
+Proposition 3. The Fisher information matrix Gζ of the l-mixed-coordinates [ζ]l is given by:
+
+\[ G_\zeta = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix} \tag{8} \]
+
+Entropy 2014, 16, 3670–3688
+
+where $A = ((G_\eta^{-1})_{I_\eta})^{-1}$, $B = ((G_\theta^{-1})_{J_\theta})^{-1}$, $G_\eta$ and $G_\theta$ are the Fisher information matrices of [η] and [θ],
+respectively, $I_\eta$ is the index set of the parameters shared by [η] and [ζ]l, i.e., $\{\eta^1_i, \ldots, \eta^l_{i,j,\ldots,k}\}$, and $J_\theta$ is the index
+set of the parameters shared by [θ] and [ζ]l, i.e., $\{\theta^{i,j,\ldots,k}_{l+1}, \ldots, \theta^{1,\ldots,n}_n\}$.
+
+Proof. in Appendix C.
+
+3. The General CIF Principle
+
+In this section, we propose the CIF principle to reduce the physical information bound for
+high-dimensionality systems. Given a target distribution q(x) ∈ S, we consider the problem of
+realizing it by a lower-dimensionality submanifold. This is defined as the problem of parametric
+reduction for multivariate binary distributions. The family of multivariate binary distributions has
+been proven to be useful when we deal with discrete data in a variety of applications in statistical
+machine learning and artificial intelligence, such as the Boltzmann machine in neural networks [11,12]
+and the Rasch model in human sciences [13,14].
+Intuitively, if we can construct a coordinate system so that the confidences of its parameters
+entail a natural hierarchy, in which highly confident parameters are significantly distinguished from
+and orthogonal to less confident ones, then we can conveniently implement CIF by keeping
+the highly confident parameters unchanged and setting the less confident parameters to neutral
+values. Therefore, the choice of coordinates (or parametric representations) in CIF is crucial to its
+usage. This strategy is infeasible in terms of p-coordinates, η-coordinates or θ-coordinates, since the
+orthogonality condition cannot hold in these coordinate systems. In this section, we will show that the
+l-mixed-coordinates [ζ]l meet the requirement of CIF.
+In principle, the confidence of parameters should be assessed according to their contributions to
+the expected information distance between the ideal distribution and its fluctuated observations. This is
+called the distance-based CIF (see Section 1). For some coordinate systems, e.g., the mixed-coordinate
+system [ζ]l, the confidence of a parameter can also be directly evaluated by its Fisher information. This
+is called the information-based CIF (see Section 1). The information-based CIF (i.e., Fisher information)
+can be seen as an approximation to distance-based CIF, since it neglects the influence of parameter
+scaling on the expected information distance. However, considering the standard mixed-coordinates
+[ζ]l for the manifold of multivariate binary distributions, it turns out that both distance-based CIF and
+information-based CIF entail the same submanifold M (refer to Section 3.2 for detailed reasons).
+For the purpose of legibility, we will start with the information-based CIF, where the parameter’s
+confidence is simply measured using its Fisher information. After that, we show that the
+information-based CIF leads to an optimal submanifold M, which is also optimal in terms of the
+more rigorous distance-based CIF.
+
+3.1. The Information-Based CIF Principle
+
+In this section, we will show that the l-mixed-coordinates [ζ]l meet the requirement of the
+information-based CIF. According to Proposition 3 and the following Proposition 4, the confidences of
+coordinate parameters (measured by Fisher information) in [ζ]l entail a natural hierarchy: the first part,
+the highly confident parameters [ηl−], is separated from the second part, the less confident parameters
+[θl+]. Additionally, the less confident parameters [θl+] have the neutral value of zero.
+
+Proposition 4. The diagonal elements of A are lower bounded by one, and those of B are upper bounded by one.
+
+Proof. in Appendix D.
+
+Moreover, the parameters in [ηl−] are orthogonal to the ones in [θl+], indicating that we could
+estimate these two parts independently [9].
+Hence, we can implement the information-based
+CIF for parametric reduction in [ζ]l by replacing low confident parameters with the neutral value
+zero and reconstructing the resulting distribution. It turns out that the submanifold of S
+tailored by information-based CIF becomes $[\zeta]_{lt} = (\eta^1_i, \ldots, \eta^l_{ij \ldots k}, 0, \ldots, 0)$. We call $[\zeta]_{lt}$ the
+l-tailored-mixed-coordinates.
+To grasp an intuitive picture for the CIF strategy and its significance w.r.t mixed-coordinates,
+let us consider an example with [p] = (p001 = 0.15, p010 = 0.1, p011 = 0.05, p100 = 0.2, p101 =
+0.1, p110 = 0.05, p111 = 0.3). Then, the confidences for coordinates in [η], [θ] and [ζ]2 are given by the
+diagonal elements of the corresponding Fisher information matrices. Applying the two-tailored CIF in
+mixed-coordinates, the loss ratio of Fisher information is 0.001%, and the ratio of the Fisher information
+of the tailored parameter ($\theta^{123}_3$) to the remaining η parameter with the smallest Fisher information is
+0.06%. On the other hand, the above two ratios become 7.58% and 94.45% (in η-coordinates) or 12.94%
+and 92.31% (in θ-coordinates), respectively. We can see that [ζ]2 gives us a much better way to tell
+apart confident parameters from noisy ones.
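The block structure of Proposition 3 and the bounds of Proposition 4 can be explored on the same example. The following is a rough sketch under stated assumptions: `mixed_blocks` is a hypothetical helper, p000 = 0.05 is inferred from the listed probabilities, G_η is obtained by inverting G_θ (using the mutual-inverse property), and the blocks are formed as A = ((G_η^{-1})_{I_η})^{-1} and B = ((G_θ^{-1})_{J_θ})^{-1}. It does not attempt to reproduce the percentages reported above, only to compute the quantities they are based on.

```python
from itertools import combinations

import numpy as np


def mixed_blocks(p, n, l):
    """Return the blocks A and B of the Fisher matrix in l-mixed-coordinates."""
    idx = [frozenset(c) for k in range(1, n + 1) for c in combinations(range(n), k)]
    # eta_I is the probability that all variables in I are 1.
    eta = {I: sum(v for K, v in p.items() if I <= K) for I in idx}
    # Equation (6): Fisher information matrix in theta-coordinates.
    G_theta = np.array([[eta[I | J] - eta[I] * eta[J] for J in idx] for I in idx])
    G_eta = np.linalg.inv(G_theta)  # mutual-inverse property
    low = [a for a, I in enumerate(idx) if len(I) <= l]   # eta part (orders 1..l)
    high = [a for a, I in enumerate(idx) if len(I) > l]   # theta part (orders > l)
    A = np.linalg.inv(G_theta[np.ix_(low, low)])          # ((G_eta^-1)_{I_eta})^-1
    B = np.linalg.inv(G_eta[np.ix_(high, high)])          # ((G_theta^-1)_{J_theta})^-1
    return A, B


# Example distribution of this section; p000 is an inferred assumption.
listed = {"000": 0.05, "001": 0.15, "010": 0.10, "011": 0.05,
          "100": 0.20, "101": 0.10, "110": 0.05, "111": 0.30}
p = {frozenset(i for i, bit in enumerate(s) if bit == "1"): v for s, v in listed.items()}

A, B = mixed_blocks(p, n=3, l=2)
# Share of the total Fisher information carried by the tailored theta part.
loss_ratio = np.trace(B) / (np.trace(A) + np.trace(B))
print("diag(A):", np.diag(A))
print("diag(B):", np.diag(B))
print("loss ratio:", loss_ratio)
```

Reading "loss ratio" as the trace of B over the total trace is an interpretation on my part; the text does not spell out the exact definition.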
+
+3.2. The Distance-Based CIF: A Geometric Point-of-View
+
+In the previous section, the information-based CIF entails a submanifold of S determined by the
+l-tailored-mixed-coordinates [ζ]lt. A more rigorous definition for the confidence of coordinates is the
+distance-based confidence used in the distance-based CIF, which relies on both of the coordinate’s
+Fisher information and its fluctuation scaling. In this section, we will show that the submanifold
+M determined by [ζ]lt is also an optimal submanifold in terms of the distance-based CIF. Note that,
+for other coordinate systems (e.g., arbitrarily rescaling coordinates), the information-based CIF may
+not entail the same submanifold as the distance-based CIF.
+Let q(x), with coordinate ζq, denote the exact solution to the physical phenomenon being
+measured. Additionally, the act of observation would cause small random perturbations to q(x),
+leading to some observation q′(x) with coordinate ζq + Δζq. When two distributions q(x) and q′(x) are
+close, the divergence between q(x) and q′(x) on manifold S could be assessed by the Fisher information
+distance: $D(q, q') = (\Delta\zeta_q \cdot G_\zeta \cdot \Delta\zeta_q)^{1/2}$, where Gζ is the Fisher information matrix and the perturbation
+Δζq is small. The Fisher information distance between two close distributions q(x) and q′(x) on
+manifold S is the Riemannian distance under the Fisher–Rao metric, which is shown to be the square
+root of twice the Kullback–Leibler divergence from q(x) to q′(x) [8]. Note that we adopt the
+Fisher information distance as the distance measure between two close distributions, since it is shown
+to be the unique metric meeting a set of natural axioms for the distribution metrics [7,15,16], e.g., the
+invariant property with respect to reparametrizations and the monotonicity with respect to the random
+maps on variables.
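The local relation between the Fisher information distance and the K-L divergence can be illustrated numerically. The sketch below is only an illustration under assumptions: it works in p-coordinates over all eight states of a three-variable system (where the Fisher–Rao quadratic form of a zero-sum perturbation reduces to the sum of squared perturbations divided by the probabilities), uses the example distribution of Section 3.1 with an inferred p000 = 0.05, and draws an arbitrary small perturbation. The function names are hypothetical.

```python
import numpy as np


def kl(q, q_prime):
    """Kullback-Leibler divergence from q to q_prime for strictly positive vectors."""
    return float(np.sum(q * np.log(q / q_prime)))


def squared_fisher_distance(q, delta):
    """Fisher-Rao quadratic form in p-coordinates: sum_k delta_k^2 / q_k."""
    return float(np.sum(delta ** 2 / q))


# States ordered 000, 001, ..., 111; p000 = 0.05 is an inferred assumption.
q = np.array([0.05, 0.15, 0.10, 0.05, 0.20, 0.10, 0.05, 0.30])

rng = np.random.default_rng(0)
delta = rng.normal(size=q.size)
delta -= delta.mean()                     # keep the perturbed vector normalized
delta *= 1e-4 / np.linalg.norm(delta)     # make the perturbation small
q_prime = q + delta

two_kl = 2.0 * kl(q, q_prime)
d2 = squared_fisher_distance(q, delta)
# For small perturbations the two numbers should agree to leading order.
print(two_kl, d2)
```

The agreement degrades as the perturbation grows, which is why the text restricts the approximation to close distributions.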
+Let M be a smooth k-dimensional submanifold in S ($k < 2^n - 1$). Given the point q(x) ∈ S,
+the projection [8] of q(x) on M is the point p(x) that belongs to M and is closest to q(x) with respect
+to the Kullback–Leibler divergence (K-L divergence) [17] from the distribution q(x) to p(x). On the
+submanifold M, the projections of q(x) and q′(x) are p(x) and p′(x), with coordinates ζp and ζp + Δζp,
+respectively, shown in Figure 2.
+Let the preserved Fisher information distance be D(p, p′) after projecting on M. In order to retain
+the information contained in observations, we need the ratio $D(p, p')/D(q, q')$ to be as large as possible in the
+expected sense, with respect to the given dimensionality k of M. The next two sections will illustrate
+that CIF leads to an optimal submanifold M based on different assumptions on the perturbations Δζq.
+
+
+Figure 2. By projecting a point q(x) on S to a submanifold M, the l-tailored mixed-coordinates [ζ]lt
+gives a desirable M that maximally preserves the expected Fisher information distance when projecting
+a ε-neighborhood centered at q(x) onto M.
+
+3.2.1. Perturbations in Uniform Neighborhood
+
+Let Bq be an ε-sphere surface centered at q(x) on manifold S, i.e., $B_q = \{q' \in S \mid KL(q, q') = \varepsilon\}$,
+where KL(·, ·) denotes the K-L divergence and ε is small. Additionally, q′(x) is a neighbor of q(x)
+uniformly sampled on Bq, as illustrated in Figure 2. Recall that, for a small ε, the K-L divergence can be
+approximated by half of the squared Fisher information distance. Thus, in the parameterization
+of [ζ]l, Bq is indeed the surface of a hyper-ellipsoid (centered at q(x)) determined by Gζ. The
+following proposition shows that the general CIF would lead to an optimal submanifold M that
+maximally preserves the expected information distance, where the expectation is taken upon the
+uniform neighborhood, Bq.
+
+Proposition 5. Consider the manifold S in l-mixed-coordinates [ζ]l. Let k be the number of free parameters
+in the l-tailored-mixed-coordinates [ζ]lt. Then, among all k-dimensional submanifolds of S, the submanifold
+determined by [ζ]lt can maximally preserve the expected information distance induced by the Fisher–Rao metric.
+
+Proof. in Appendix E.
+
+3.2.2. Perturbations in Typical Distributions
+
+To facilitate our analysis, we make a basic assumption on the underlying distributions q(x)
+that at least $(2^n - 2^{n/2})$ p-coordinates are of the scale ϵ, where ϵ is a sufficiently small value. Thus,
+the residual p-coordinates (at most $2^{n/2}$) are all significantly larger than zero (of scale $\Theta(1/2^{n/2})$), and
+their sum approximates one. Note that these assumptions are common situations in real-world data
+collections [18], since the frequent (or meaningful) patterns are only a small fraction of all of the
+system states.
+Next, we introduce a small perturbation Δp to the p-coordinates [p] for the ideal distribution
+q(x). The scale of each fluctuation $\Delta p_I$ is assumed to be proportional to the standard deviation of the
+corresponding p-coordinate $p_I$ by some small coefficient (upper bounded by a constant a), which can
+be approximated by the inverse of the square root of its Fisher information via the Cramér–Rao bound.
+It turns out that we can assume the perturbation $\Delta p_I$ to be $a\sqrt{p_I}$.
+In this section, we adopt the l-mixed-coordinates [ζ]l = (ηl−; θl+), where l = 2 is used in
+the following analysis. Let Δζq = (Δη2−; Δθ2+) be the increment of the mixed-coordinates after the
+perturbation. The squared Fisher information distance $D^2(p, p') = \Delta\zeta_q \cdot G_\zeta \cdot \Delta\zeta_q$ could be decomposed
+into the direction of each coordinate in [ζ]l. We will clarify that, under typical cases, the scale of the
+Fisher information distance in each coordinate of θl+ (reduced by CIF) is asymptotically negligible,
+compared to that in each coordinate of ηl− (preserved by CIF).
+The scale of the squared Fisher information distance in the direction of $\eta_I$ is proportional to
+$\Delta\eta_I \cdot (G_\zeta)_{I,I} \cdot \Delta\eta_I$, where $(G_\zeta)_{I,I}$ is the Fisher information of $\eta_I$ in terms of the mixed-coordinates [ζ]2. From
+Equation (1), for any I of order one (or two), $\eta_I$ is the sum of $2^{n-1}$ (or $2^{n-2}$) p-coordinates, and its scale
+is Θ(1). Hence, the increment Δη2− is proportional to Θ(1), denoted as a · Θ(1). It is difficult to give
+an explicit expression of $(G_\zeta)_{I,I}$ analytically. However, the Fisher information $(G_\zeta)_{I,I}$ of $\eta_I$ is bounded
+by the (I, I)-th element of the inverse covariance matrix [19], which is exactly $1/g_{I,I}(\theta) = 1/(\eta_I - \eta_I^2)$ (see
+Proposition 3). Hence, the scale of $(G_\zeta)_{I,I}$ is also Θ(1). It turns out that the scale of the squared Fisher
+information distance in the direction of $\eta_I$ is $a^2 \cdot \Theta(1)$.
+Similarly, for the part θ2+, the scale of the squared Fisher information distance in the direction of
+$\theta^J$ is proportional to $\Delta\theta^J \cdot (G_\zeta)_{J,J} \cdot \Delta\theta^J$, where $(G_\zeta)_{J,J}$ is the Fisher information of $\theta^J$ in terms of the
+mixed-coordinates [ζ]2. The scale of $\theta^J$ is maximally $f(k)\,|\log(\sqrt{\epsilon})|$ based on Equation (2), where k is
+the order of $\theta^J$ and f(k) is the number of p-coordinates of scale $\Theta(1/2^{n/2})$ that are involved in the
+calculation of $\theta^J$. Since we assume that $f(k) \le 2^{n/2}$, the maximum scale of $\theta^J$ is $2^{n/2}\,|\log(\sqrt{\epsilon})|$. Thus,
+the increment $\Delta\theta^J$ is of a scale bounded by $a \cdot 2^{n/2}\,|\log(\sqrt{\epsilon})|$. Similar to our previous derivation, the
+Fisher information $(G_\zeta)_{J,J}$ of $\theta^J$ is bounded by the (J, J)-th element of the inverse covariance matrix,
+which is exactly $1/g^{J,J}(\eta)$ (see Proposition 3). Hence, the scale of $(G_\zeta)_{J,J}$ is $(2^k - f(k))^{-1}\epsilon$. In summary,
+the scale of the squared Fisher information distance in the direction of $\theta^J$ is bounded by the scale of
+$a^2 \cdot \Theta\!\left( \frac{2^n \epsilon\, |\log(\sqrt{\epsilon})|^2}{2^k - f(k)} \right)$. Since ϵ is a sufficiently small value and a is constant, the scale of the squared Fisher
+information distance in the direction of $\theta^J$ is asymptotically zero.
+In summary, in terms of modeling the fluctuated observations of typical cognitive systems, the
+original Fisher information distance between the physical phenomenon (q(x)) and observations (q′(x))
+is systematically reduced using CIF by projecting them on an optimal submanifold M. Based on our
+above analysis, the scale of Fisher information distance in the directions of [ηl−] preserved by CIF is
+significantly larger than that of the directions [θl+] reduced by CIF.
+
+4. Derivation of Boltzmann Machine by CIF
+
+In the previous section, the CIF principle was developed in the [ζ]l coordinates. Now, we consider
+an implementation of CIF when l equals two, which gives rise to the single-layer Boltzmann machine
+without hidden units (SBM).
+
+4.1. Notations for SBM
+
+The energy function for SBM is given by:
+
+\[ E_{SBM}(x;\xi) = -\frac{1}{2} x^T U x - b^T x \tag{9} \]
+
+where ξ = {U, b} are the parameters and the diagonals of U are set to zero. The Boltzmann
+distribution over x is $p(x;\xi) = \frac{1}{Z} \exp\{-E_{SBM}(x;\xi)\}$, where Z is a normalization factor. Actually,
+the parametrization for SBM could be naturally expressed by the coordinate systems in IG (e.g.,
+$[\theta] = (\theta^i_1 = b_i, \theta^{ij}_2 = U_{ij}, \theta^{ijk}_3 = 0, \ldots, \theta^{1,2,\ldots,n}_n = 0)$).
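For a handful of units, the Boltzmann distribution of Equation (9) can be computed by brute-force enumeration. The following is a minimal sketch, assuming a symmetric weight matrix U with zero diagonal; the function names and the numeric values of U and b are made up for illustration. It also recovers the second-order θ-coordinate of units 0 and 1 from the log-probabilities as log p(1,1,0) − log p(1,0,0) − log p(0,1,0) + log p(0,0,0), which should match the corresponding weight.

```python
import itertools

import numpy as np


def sbm_energy(x, U, b):
    """Equation (9): E(x) = -1/2 x^T U x - b^T x."""
    return -0.5 * x @ U @ x - b @ x


def sbm_distribution(U, b):
    """Enumerate all 2^n binary states and return (states, probabilities)."""
    n = len(b)
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    energies = np.array([sbm_energy(x, U, b) for x in states])
    weights = np.exp(-energies)
    return states, weights / weights.sum()   # division by Z


# Illustrative parameters (assumed): symmetric U with zero diagonal.
U = np.array([[0.0, 1.0, -0.5],
              [1.0, 0.0, 0.25],
              [-0.5, 0.25, 0.0]])
b = np.array([0.1, -0.2, 0.3])

states, probs = sbm_distribution(U, b)
log_p = np.log(probs)
# States are ordered 000, 001, ..., 111, so index = 4*x0 + 2*x1 + x2.
theta_01 = log_p[6] - log_p[4] - log_p[2] + log_p[0]
print(theta_01, U[0, 1])
```

Enumeration costs 2^n evaluations, so this is only practical for small n such as the n = 10 setting used in the experiments.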
+
+4.2. The Derivation of SBM using CIF
+
+Given any underlying probability distribution q(x) on the general manifold S over {x}, the
+logarithm of q(x) can be represented by a linear decomposition of θ-coordinates, as shown in
+Equation (2). Since it is impractical to recognize all coordinates for the target distribution, we would
+like to approximate only part of them and end up with a k-dimensional submanifold M of S, where k
+($\ll 2^n - 1$) is the number of free parameters. Here, we set k to be the same dimensionality as SBM,
+i.e., $k = \frac{n(n+1)}{2}$, so that all candidate submanifolds are comparable to the submanifold endowed by
+SBM (denoted as Msbm). Next, the rationale underlying the design of Msbm can be illustrated using the
+general CIF.
+Let the two-mixed-coordinates of q(x) on S be $[\zeta]_2 = (\eta^1_i, \eta^2_{ij}, \theta^{i,j,k}_3, \ldots, \theta^{1,\ldots,n}_n)$. Applying the
+general CIF on [ζ]2, our parametric reduction rule is to preserve the high confident part parameters
+[η2−] and replace low confident parameters [θ2+] by a fixed neutral value of zero. Thus, we derive
+the two-tailored-mixed-coordinates: $[\zeta]_{2t} = (\eta^1_i, \eta^2_{ij}, 0, \ldots, 0)$, as the optimal approximation of q(x)
+by the k-dimensional submanifolds. On the other hand, given the two-mixed-coordinates of q(x),
+the projection p(x) ∈ Msbm of q(x) is proven to be $[\zeta]_p = (\eta^1_i, \eta^2_{ij}, 0, \ldots, 0)$ [8]. Thus, SBM defines a
+probabilistic parameter space that is derived from CIF.
+
+4.3. The Learning Algorithms for SBM
+
+Let q(x) be the underlying probability distribution from which samples D = {d1, d2, . . . , dN} are
+generated independently. Then, our goal is to train an SBM (with stationary probability p(x)) based on
+D that realizes q(x) as faithfully as possible. Here, we briefly introduce two typical learning algorithms
+for SBM: maximum-likelihood and contrastive divergence [11,20,21].
+Maximum-likelihood (ML) learning realizes a gradient ascent of log-likelihood of D:
+
+\[ \Delta U_{ij} = \varepsilon \frac{\partial l(\xi; D)}{\partial U_{ij}} = \varepsilon \left( E_q[x_i x_j] - E_p[x_i x_j] \right) \tag{10} \]
+
+where ε is the learning rate and $l(\xi; D) = \frac{1}{N} \sum_{n=1}^{N} \log p(d_n; \xi)$. $E_q[\cdot]$ and $E_p[\cdot]$ are expectations over
+q(x) and p(x), respectively. Actually, $E_q[x_i x_j]$ and $E_p[x_i x_j]$ are the coordinates $\eta^2_{ij}$ of q(x) and p(x),
+respectively. $E_q[x_i x_j]$ can be estimated without bias from the sample. Markov chain Monte Carlo [22]
+is often used to approximate $E_p[x_i x_j]$ with an average over samples from p(x).
+Contrastive divergence (CD) learning realizes the gradient descent of a different objective function
+to avoid the difficulty of computing the log-likelihood gradient, shown as follows:
+
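+\[ \Delta U_{ij} = -\varepsilon \frac{\partial \left( KL(q_0 \| p) - KL(p_m \| p) \right)}{\partial U_{ij}} = \varepsilon \left( E_{q_0}[x_i x_j] - E_{p_m}[x_i x_j] \right) \tag{11} \]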
+
+where q0 is the sample distribution, pm is the distribution by starting the Markov chain with the data
+and running m steps and KL(·||·) denotes the K-L divergence. Taking samples in D as initial states, we
+could generate a set of samples for pm(x). Those samples can be used to estimate Epm[xixj].
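The CD update of Equation (11) with m = 1 can be sketched in a few lines. This is a simplified illustration rather than a reference implementation: the function names are hypothetical, U is assumed symmetric with zero diagonal, and the one-step transition updates all units in parallel from the data state (one reading of the one-step transition; a sequential Gibbs sweep would be an equally plausible choice).

```python
import numpy as np


def firing_prob(X0, U, b):
    """Logistic firing probability of every unit given each row of X0.

    With a zero diagonal in U, (X0 @ U)[s, i] equals sum_{j != i} U_ij * x_j.
    """
    return 1.0 / (1.0 + np.exp(-(X0 @ U) - b))


def cd1_delta_U(X0, U, b, lr, rng):
    """One CD-1 weight update: lr * (E_q0[x_i x_j] - E_p1[x_i x_j])."""
    # Start the chain at the data and run a single (parallel) transition.
    X1 = (rng.random(X0.shape) < firing_prob(X0, U, b)).astype(float)
    positive = X0.T @ X0 / len(X0)   # statistics of the sample distribution q0
    negative = X1.T @ X1 / len(X1)   # statistics after one step, p1
    delta = lr * (positive - negative)
    np.fill_diagonal(delta, 0.0)     # the diagonal of U stays zero
    return delta


rng = np.random.default_rng(0)
X0 = (rng.random((200, 5)) < 0.4).astype(float)   # synthetic binary data (assumed)
U = np.zeros((5, 5))
b = np.zeros(5)
U = U + cd1_delta_U(X0, U, b, lr=0.1, rng=rng)
```

The bias update is omitted here for brevity; the same data-minus-model pattern applies to first-order statistics.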
+From the perspective of IG, we can see that ML/CD learning is to update parameters in SBM,
+so that its corresponding coordinates [η2−] are getting closer to the data (along with the decreasing
+gradient). This is consistent with our theoretical analysis in Section 3 and Section 4.2 that SBM uses
+the most confident information (i.e., [η2−]) for approximating an arbitrary distribution in an expected
+sense.
+
+5. Experimental Study: Incorporate Data into CIF
+
+In the information-based CIF, the actual values of the data were not used to explicitly affect the
+output pdf (e.g., the derivation of SBM in Section 4). The data constrain the state of knowledge about
+the unknown pdf. In order to force the estimate of our probabilistic model to obey the data, we need
+to further reduce the difference between data information and physical information bound. How can
+this be done?
+In this section, the CIF principle will also be used to modify an existing SBM training algorithm (i.e.,
+CD-1) by incorporating data information. Given a particular dataset, the CIF can be used to further
+recognize less-confident parameters in SBM and to reduce them properly. Our solution here is to
+apply CIF to take effect on the learning trajectory with respect to specific samples and, hence, further
+confine the parameter space to the region indicated by the most confident information contained in
+the samples.
+
+
+5.1. A Sample-Specific CIF-Based CD Learning for SBM
+
+The main modification of our CIF-based CD algorithm (CD-CIF for short) is that we generate
+the samples for pm(x) based on those parameters with confident information, where the confident
+information carried by certain parameter is inherited from the sample and could be assessed using its
+Fisher information computed in terms of the sample.
+For CD-1 (i.e., m = 1), the firing probability for the i-th neuron after a one-step transition from the
+initial state $x^{(0)} = \{x^{(0)}_1, x^{(0)}_2, \ldots, x^{(0)}_n\}$ is:
+
+\[ p(x^{(1)}_i = 1 \mid x^{(0)}) = \frac{1}{1 + \exp\{-\sum_{j \neq i} U_{ij} x^{(0)}_j - b_i\}} \tag{12} \]
+
+For CD-CIF, the firing probability for the i-th neuron in Equation (12) is modified as follows:
+
+\[ p(x^{(1)}_i = 1 \mid x^{(0)}) = \frac{1}{1 + \exp\{-\sum_{(j \neq i) \,\&\, (F(U_{ij}) > \tau)} U_{ij} x^{(0)}_j - b_i\}} \tag{13} \]
+
+where τ is a pre-selected threshold, $F(U_{ij}) = E_{q_0}[x_i x_j] - E_{q_0}[x_i x_j]^2$ is the Fisher information of $U_{ij}$ (see
+Equation (6)) and the expectations are estimated from the given sample D. We can see that those
+weights whose Fisher information is less than τ are considered to be unreliable w.r.t. D. In practice,
+we could set τ by the ratio r to specify the proportion of the total Fisher information TFI of all
+parameters that we would like to retain, i.e., $\sum_{F(U_{ij}) > \tau,\, i<j} F(U_{ij}) = r \cdot TFI$.
+In summary, CD-CIF is realized in two phases. In the first phase, we initially “guess” whether
+a certain parameter could be faithfully estimated based on the finite sample. In the second phase, we
+approximate the gradient using the CD scheme, except that the CIF-based firing function in
+Equation (13) is used.
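The two phases can be sketched as follows. This is an illustrative reading of Equation (13) and of the ratio rule for τ, not the authors' implementation: `confident_mask` and `firing_prob_cif` are hypothetical names, the threshold is realized implicitly by keeping the weights with the largest Fisher information until their cumulative share reaches r of the total, and U is assumed symmetric with zero diagonal.

```python
import numpy as np


def confident_mask(X, r):
    """Phase 1: mark weights U_ij whose sample Fisher information is retained.

    F(U_ij) = E[x_i x_j] - E[x_i x_j]^2 is estimated from the rows of X.
    Pairs are kept in decreasing order of F until they carry a fraction r of TFI.
    """
    n = X.shape[1]
    second_moment = X.T @ X / len(X)
    F = second_moment - second_moment ** 2
    rows, cols = np.triu_indices(n, k=1)            # pairs with i < j
    values = F[rows, cols]
    order = np.argsort(values)[::-1]                # largest Fisher information first
    cumulative = np.cumsum(values[order])
    keep = int(np.searchsorted(cumulative, r * values.sum())) + 1
    keep = min(keep, len(values))
    mask = np.zeros((n, n), dtype=bool)
    mask[rows[order[:keep]], cols[order[:keep]]] = True
    return mask | mask.T                            # symmetric, diagonal stays False


def firing_prob_cif(X0, U, b, mask):
    """Phase 2: Equation (13), i.e. Equation (12) restricted to retained weights."""
    return 1.0 / (1.0 + np.exp(-(X0 @ (U * mask)) - b))


rng = np.random.default_rng(0)
X = (rng.random((100, 6)) < 0.4).astype(float)      # synthetic binary sample (assumed)
U = rng.normal(scale=0.1, size=(6, 6))
U = (U + U.T) / 2.0
np.fill_diagonal(U, 0.0)
b = np.zeros(6)

mask = confident_mask(X, r=0.65)
probs = firing_prob_cif(X, U, b, mask)
```

Sampling the one-step states from `probs` and forming the CD statistics then proceeds exactly as in plain CD-1.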
+
+5.2. Experimental Results
+In this section, we empirically investigate our justifications for the CIF principle, especially how
+the sample-specific CIF-based CD learning (see Section 5) works in the context of density estimation.
+Experimental Setup and Evaluation Metric: We use random distributions sampled uniformly
+from the open probability simplex over 10 variables as underlying distributions, whose sample size
+N may vary. Three learning algorithms are investigated: ML, CD-1 and our CD-CIF. K-L divergence is
+used to evaluate the goodness-of-fit of the SBMs trained by the various algorithms. For each sample size N,
+we run 100 instances (20 randomly generated distributions × 5 random runs) and report the
+averaged K-L divergences. Note that we focus on the case where the number of variables is relatively small
+(n = 10) in order to analytically evaluate the K-L divergence and give a detailed study of the algorithms.
+Changing the number of variables has only a minor influence on the experimental results, since we
+obtained qualitatively similar observations for various numbers of variables (not reported here).
+Automatically Adjusting r for Different Sample Sizes: The Fisher information is additive for i.i.d.
+sampling. When sample sizes change, it is natural to require that the total amount of Fisher information
+contained in all tailored parameters is steady. Hence, we have α = (1 − r)N, where α indicates the
+amount of Fisher information and becomes a constant when the learning model and the underlying
+distribution family are given. It turns out that we can first identify α using the optimal r w.r.t several
+distributions generated from the underlying distribution family and then determine the optimal r’s for
+various sample sizes using r = 1 − α/N. In our experiments, we set α = 35.
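As a small worked example of the rule r = 1 − α/N, the snippet below tabulates the retained ratio for a few sample sizes with α = 35. The function name is hypothetical, and the rule is only meaningful for N larger than α, which is an observation of mine rather than a statement from the text.

```python
def retained_ratio(n_samples, alpha=35.0):
    """r = 1 - alpha / N; assumes n_samples > alpha so that r lies in (0, 1)."""
    return 1.0 - alpha / n_samples


for n_samples in (100, 500, 1200):
    print(n_samples, round(retained_ratio(n_samples), 4))
```

Larger samples therefore keep a larger share of the total Fisher information, so the effect of the reduction fades as N grows.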
+Density Estimation Performance: The averaged K-L divergences between SBMs (learned by ML,
+CD-1 and CD-CIF with the r automatically determined) and the underlying distribution are shown in
+Figure 3a. In the case of relatively small samples (N ≤ 500) in Figure 3a, our CD-CIF method shows
+significant improvements over ML (from 10.3% to 16.0%) and CD-1 (from 11.0% to 21.0%). This is
+because we could not expect to have reliable identifications for all model parameters from insufficient
+samples, and hence, CD-CIF gains its advantages by using parameters that could be confidently
+estimated. This result is consistent with our previous theoretical insight that Fisher information gives a
+reasonable guidance for parametric reduction via the confidence criterion. As the sample size increases
+(N ≥ 600), CD-CIF, ML and CD-1 tend to have similar performances, since, with relatively large
+samples, most model parameters can be reasonably estimated; hence, the effect of parameter reduction
+using CIF gradually becomes marginal. In Figure 3b and Figure 3c, we show how sample size affects
+the interval of r. For N = 100, CD-CIF achieves significantly better performances for a wide range of r,
+while for N = 1200, CD-CIF can only marginally outperform baselines for a narrow range of r.
+
+[Figure 3: plot panels (a), (b), (c) and (d)]
+
+Figure 3. (a): the performance of CD-CIF on different sample sizes; (b) and (c): The performances
+of CD-CIF with various values of r on two typical sample sizes, i.e., 100 and 1200; (d) illustrates one
+learning trajectory of the last 100 steps for ML (squares), CD-1 (triangles) and CD-CIF (circles).
+
+Effects on Learning Trajectory: We use the 2D visualization technique SNE [20] to investigate the learning
+trajectories and dynamical behaviors of the three compared algorithms. We start the three methods with the
+same parameter initialization. Then, each intermediate state is represented by a 55-dimensional vector
+formed by its current parameter values. From Figure 3d, we can see that: (1) In the final 100 steps, the
+three methods seem to end up staying in different regions of the parameter space, and CD-CIF confines
+the parameter in a relatively thinner region compared to ML and CD-1; (2) The true distribution is
+usually located on the side of CD-CIF, indicating its potential for converging to the optimal solution.
+Note that the above claims are based on general observations, and Figure 3d is shown as an illustration.
+Hence, we may conclude that CD-CIF regularizes the learning trajectories in a desired region of the
+parameter space using the sample-specific CIF.
+
+6. Conclusions
+
+Different from the traditional EPI, the CIF principle proposed in this paper aims at finding a
+way to derive computational models for universal cognitive systems by a dimensionality reduction
+approach in parameter spaces: specifically, by preserving the confident parameters and reducing the
+less confident parameters. In principle, the confidence of parameters should be assessed according
+to their contributions to the expected information distance between the ideal distribution and its
+fluctuated observations. This is called the distance-based CIF. For some coordinate systems, e.g.,
+the mixed-coordinate system [ζ]l, the confidence of a parameter can also be directly evaluated by
+its Fisher information, which establishes a connection with the inverse variance of any unbiased
+estimate for the parameter via the Cramér–Rao bound. This is called the information-based CIF.
+The criterion of information-based CIF (i.e., Fisher information) can be seen as an approximation to
+distance-based CIF, since it neglects the influence of parameter scaling to the expected information
+distance. However, considering the standard mixed-coordinates [ζ]l for the manifold of multivariate
+binary distributions, it turns out that both distance-based CIF and information-based CIF entail the
+same optimal submanifold M.
+The CIF provides a strategy for the derivation of probabilistic models. The SBM is a specific
+example in this regard.
+It has been theoretically shown that the SBM can achieve a reliable
+representation in parameter spaces by using the CIF principle.
+The CIF principle can also be used to modify existing SBM training algorithms by incorporating
+data information, such as CD-CIF. One interesting result shown in our experiments is that: although
+CD-CIF is a biased algorithm, it could significantly outperform ML when the sample is insufficient.
+This suggests that CIF gives us a reasonable criterion for utilizing confident information from the
+underlying data, while ML lacks a mechanism to do so.
+In the future, we will further develop the formal justification of CIF w.r.t various contexts (e.g.,
+distribution families or models).
+
+Acknowledgments: We would like to thank the anonymous reviewers for their valuable comments. We also
+thank Mengjiao Xie and Shuai Mi for their helpful discussions. This work is partially supported by the Chinese
+National Program on Key Basic Research Project (973 Program, Grant No. 2013CB329304 and 2014CB744604), the
+Natural Science Foundation of China (Grant Nos. 61272265, 61070044, 61272291, 61111130190 and 61105072).
+
+Appendix
+
+Appendix A Proof of Proposition 1
+
+Proof. By definition, we have:
+
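+\[ g_{IJ} = \frac{\partial^2 \psi(\theta)}{\partial\theta^I \partial\theta^J} \]
+
+where ψ(θ) is defined by Equation (4). Hence, we have:
+
+\[ g_{IJ} = \frac{\partial^2 \left( \sum_I \theta^I \eta_I - \phi(\eta) \right)}{\partial\theta^I \partial\theta^J} = \frac{\partial\eta_I}{\partial\theta^J} \]
+
+By differentiating ηI, defined by Equation (1), with respect to θJ, we have:
+
+\[ g_{IJ} = \frac{\partial\eta_I}{\partial\theta^J} = \frac{\partial \sum_x X_I(x) \exp\{\sum_I \theta^I X_I(x) - \psi(\theta)\}}{\partial\theta^J} = \sum_x X_I(x) [X_J(x) - \eta_J]\, p(x;\theta) = \eta_{I \cup J} - \eta_I \eta_J \]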
+
+This completes the proof.
+
+Appendix B Proof of Proposition 2
+
+Proof. By definition, we have:
+
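+\[ g^{IJ} = \frac{\partial^2 \phi(\eta)}{\partial\eta_I \partial\eta_J} \]
+
+where φ(η) is defined by Equation (4). Hence, we have:
+
+\[ g^{IJ} = \frac{\partial^2 \left( \sum_J \theta^J \eta_J - \psi(\theta) \right)}{\partial\eta_I \partial\eta_J} = \frac{\partial\theta^I}{\partial\eta_J} \]
+
+Based on Equations (2) and (1), the θI and pK could be calculated by solving a linear equation of [p]
+and [η], respectively. Hence, we have:
+
+\[ \theta^I = \sum_{K \subseteq I} (-1)^{|I-K|} \log(p_K); \quad p_K = \sum_{K \subseteq J} (-1)^{|J-K|} \eta_J \]
+
+Therefore, the partial derivative of θI with respect to ηJ is:
+
+\[ g^{IJ} = \frac{\partial\theta^I}{\partial\eta_J} = \sum_K \frac{\partial\theta^I}{\partial p_K} \cdot \frac{\partial p_K}{\partial\eta_J} = \sum_{K \subseteq I \cap J} (-1)^{|I-K|+|J-K|} \cdot \frac{1}{p_K} \]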
+
+This completes the proof.
+
+Appendix C Proof of Proposition 3
+
+Proof. The Fisher information matrix of [ζ] could be partitioned into four parts: $G_\zeta = \begin{pmatrix} A & C \\ D & B \end{pmatrix}$.
+It can be verified that in the mixed coordinate, the θ-coordinate of order k is orthogonal to any
+η-coordinate less than k-order, implying the corresponding element of the Fisher information matrix is
+zero (C = D = 0) [23]. Hence, Gζ is a block diagonal matrix.
+According to the Cramér–Rao bound [3], a parameter (or a pair of parameters) has a unique
+asymptotically tight lower bound of the variance (or covariance) of the unbiased estimate, which is
+given by the corresponding element of the inverse of the Fisher information matrix involving this
+parameter (or this pair of parameters). Recall that Iη is the index set of the parameters shared by [η]
+and [ζ]l and that Jθ is the index set of the parameters shared by [θ] and [ζ]l; we have $(G_\zeta^{-1})_{I_\zeta} = (G_\eta^{-1})_{I_\eta}$
+and $(G_\zeta^{-1})_{J_\zeta} = (G_\theta^{-1})_{J_\theta}$, i.e., $G_\zeta^{-1} = \begin{pmatrix} (G_\eta^{-1})_{I_\eta} & 0 \\ 0 & (G_\theta^{-1})_{J_\theta} \end{pmatrix}$. Since Gζ is a block diagonal matrix, the
+proposition follows.
+
+Appendix D Proof of Proposition 4
+
+Proof. Assume the Fisher information matrix of [θ] to be $G_\theta = \begin{pmatrix} U & X \\ X^T & V \end{pmatrix}$, which is partitioned
+based on Iη and Jθ. Based on Proposition 3, we have $A = U^{-1}$. Obviously, the diagonal elements
+of U are all smaller than one. According to the succeeding Lemma A1, we can see that the diagonal
+elements of A (i.e., $U^{-1}$) are greater than one.
+Next, we need to show that the diagonal elements of B are smaller than 1. Using the Schur
+complement of Gθ, the bottom-right block of $G_\theta^{-1}$, i.e., $(G_\theta^{-1})_{J_\theta}$, equals $(V - X^T U^{-1} X)^{-1}$. Thus, the
+diagonal elements of B satisfy $B_{jj} = (V - X^T U^{-1} X)_{jj} < V_{jj} < 1$. Hence, we complete the proof.
+
+Lemma A1. With a l × l positive definite matrix H, if Hii < 1, then (H−1)ii > 1, ∀i ∈ {1, 2, . . . , l}.
+
+Proof. Since H is positive definite, it is a Gramian matrix of l linearly independent vectors v1, v2, . . . , vl,
+i.e., Hij = ⟨vi, vj⟩ (⟨·, ·⟩ denotes the inner product). Similarly, H−1 is the Gramian matrix of l linearly
+independent vectors w1, w2, . . . , wl and (H−1)ij = ⟨wi, wj⟩. It is easy to verify that ⟨wi, vi⟩ = 1, ∀i ∈
+{1, 2, . . . , l}. If $H_{ii} < 1$, we can see that the norm $\|v_i\| = \sqrt{H_{ii}} < 1$. Since $\|w_i\| \times \|v_i\| \ge \langle w_i, v_i \rangle = 1$,
+we have $\|w_i\| > 1$. Hence, $(H^{-1})_{ii} = \langle w_i, w_i \rangle = \|w_i\|^2 > 1$.
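The Gramian argument of Lemma A1 can be checked on a random instance. In this sketch (assumed setup: four random vectors in six dimensions, rescaled so that every diagonal entry of H equals 0.9), the dual vectors are taken as the rows of W = H^{-1}V, which satisfy ⟨w_i, v_j⟩ = δ_ij and have Gramian H^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rows of V are the vectors v_1, ..., v_l (here l = 4, in R^6), each with squared norm 0.9.
V = rng.normal(size=(4, 6))
V *= np.sqrt(0.9) / np.linalg.norm(V, axis=1, keepdims=True)

H = V @ V.T                    # Gramian of the v_i, so H_ii = 0.9 < 1
H_inv = np.linalg.inv(H)
W = H_inv @ V                  # rows are the dual vectors w_i

print(np.diag(H))              # all below one
print(np.diag(H_inv))          # all above one, as the lemma states
```

By the Cauchy–Schwarz step in the proof, each diagonal entry of the inverse is at least 1/H_ii, which is about 1.11 in this setup.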
+
+
+Appendix E Proof of Proposition 5
+
+Proof. Let Bq be an ε-ball surface centered at q(x) on manifold S, i.e., Bq = {q′ ∈ S | KL(q, q′) = ε},
+where KL(·, ·) denotes the Kullback–Leibler divergence and ε is small. ζq is the coordinates of q(x). Let
+q(x) + dq be a neighbor of q(x) uniformly sampled on Bq and ζq(x)+dq be its corresponding coordinates.
+For a small ε, we can calculate the expected information distance between q(x) and q(x) + dq as follows:
+
+\[ E_{B_q} = \int \left[ (\zeta_{q(x)+dq} - \zeta_q)^T G_\zeta (\zeta_{q(x)+dq} - \zeta_q) \right]^{1/2} dB_q \tag{A1} \]
+
+where Gζ is the Fisher information matrix at q(x).
+Since Fisher information matrix Gζ is both positive definite and symmetric, there exists a singular
+value decomposition Gζ = UTΛU where U is an orthogonal matrix and Λ is a diagonal matrix with
+diagonal entries equal to the eigenvalues of Gζ (all ≥ 0).
+Applying the singular value decomposition into Equation (A1), the distance becomes:
+
+\[ E_{B_q} = \int \left[ (\zeta_{q(x)+dq} - \zeta_q)^T U^T \Lambda U (\zeta_{q(x)+dq} - \zeta_q) \right]^{1/2} dB_q \tag{A2} \]
+
+Note that U is an orthogonal matrix, and the transformation U(ζq(x)+dq − ζq) is a norm-preserving rotation.
+Now, we need to show that among all tailored k-dimensional submanifolds of S, [ζ]lt is the
+one that preserves maximum information distance. Assume IT = {i1, i2, . . . , ik} is the index of k
+coordinates that we choose to form the tailored submanifold T in the mixed-coordinates [ζ]. According
+to the fundamental analytical properties of the surface of the hyper-ellipsoid and the orthogonality of
+the mixed-coordinates, there exists a strict positive monotonicity between the expected information
+distance EBq for T and the sum of eigenvalues of the sub-matrix (Gζ)IT, where the sum equals the
+trace of (Gζ)IT. That is, the greater the trace of (Gζ)IT, the greater the expected information distance
+EBq for T.
+Next, we show that the sub-matrix of Gζ specified by [ζ]lt gives a maximum trace. Based on
+Proposition 4, the elements on the main diagonal of the sub-matrix A are lower bounded by one and
+those of B upper bounded by one. Therefore, [ζ]lt gives the maximum trace among all sub-matrices of
+Gζ. This completes the proof.
+
+Author Contributions: Theoretical study and proof: Yuexian Hou and Xiaozhao Zhao. Conceived and designed
+the experiments: Xiaozhao Zhao, Yuexian Hou, Dawei Song and Wenjie Li. Performed the
+experiments: Xiaozhao Zhao. Analyzed the data: Xiaozhao Zhao, Yuexian Hou. Wrote the manuscript:
+Xiaozhao Zhao, Dawei Song, Wenjie Li and Yuexian Hou. All authors have read and approved the
+final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1. Wheeler, J.A. Time Today; Cambridge University Press: Cambridge, UK, 1994; pp. 1–29.
+2. Frieden, B.R. Science from Fisher Information: A Unification; Cambridge University Press: Cambridge, UK, 2004.
+3. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
+4. Frieden, B.R.; Gatenby, R.A. Principle of maximum Fisher information from Hardy’s axioms applied to statistical systems. Phys. Rev. E 2013, 88, 042144.
+5. Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach; Springer: Berlin/Heidelberg, Germany, 2002.
+6. Vstovsky, G.V. Interpretation of the extreme physical information principle in terms of shift information. Phys. Rev. E 1995, 51, 975–979.
+7. Amari, S.; Nagaoka, H. Methods of Information Geometry; Translations of Mathematical Monographs; Oxford University Press: Oxford, UK, 1993.
+8. Amari, S.; Kurata, K.; Nagaoka, H. Information geometry of Boltzmann machines. IEEE Trans. Neural Netw. 1992, 3, 260–271.
+9. Hou, Y.; Zhao, X.; Song, D.; Li, W. Mining pure high-order word associations via information geometry for information retrieval. ACM Trans. Inf. Syst. 2013, 31, 12:1–12:32.
+10. Kass, R.E. The geometry of asymptotic inference. Stat. Sci. 1989, 4, 188–219.
+11. Ackley, D.H.; Hinton, G.E.; Sejnowski, T.J. A learning algorithm for Boltzmann machines. Cogn. Sci. 1985, 9, 147–169.
+12. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
+13. Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960.
+14. Bond, T.; Fox, C. Applying the Rasch Model: Fundamental Measurement in the Human Sciences; Psychology Press: London, UK, 2013.
+15. Gibilisco, P. Algebraic and Geometric Methods in Statistics; Cambridge University Press: Cambridge, UK, 2010.
+16. Čencov, N.N. Statistical Decision Rules and Optimal Inference; American Mathematical Society: Washington, D.C., USA, 1982.
+17. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
+18. Bühlmann, P.; van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications; Springer: Berlin/Heidelberg, Germany, 2011.
+19. Bobrovsky, B.; Mayer-Wolf, E.; Zakai, M. Some classes of global Cramér-Rao bounds. Ann. Stat. 1987, 15, 1421–1438.
+20. Hinton, G.E. Training products of experts by minimizing contrastive divergence. Neural Comput. 2002, 14, 1771–1800.
+21. Carreira-Perpinan, M.A.; Hinton, G.E. On contrastive divergence learning. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, Key West, FL, USA, 6–8 January 2005; pp. 33–40.
+22. Gilks, W.R.; Richardson, S.; Spiegelhalter, D. Introducing Markov chain Monte Carlo. In Markov Chain Monte Carlo in Practice; Chapman and Hall/CRC: London, UK, 1996; pp. 1–19.
+23. Nakahara, H.; Amari, S. Information geometric measure for neural spikes. Neural Comput. 2002, 14, 2269–2316.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+
+
+MDPI
+St. Alban-Anlage 66
+4052 Basel
+
+Switzerland
+Tel. +41 61 683 77 34
+Fax +41 61 302 89 18
+www.mdpi.com
+
+Entropy Editorial Office
+E-mail: entropy@mdpi.com
+
+www.mdpi.com/journal/entropy
+
+
+
+ISBN 978-3-03897-633-2
+
+
diff --git a/papers/project_paper_2_neuroscience/references/Amari2016_Placeholder.md b/papers/project_paper_2_neuroscience/references/Amari2016_Placeholder.md
deleted file mode 100644
index d1ecc4e8..00000000
--- a/papers/project_paper_2_neuroscience/references/Amari2016_Placeholder.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Information Geometry and Its Applications (Amari 2016)
-
-This reference is a published book/monograph. 
-Due to copyright and its format, the full PDF is not hosted in this repository.
-
-**Citation:**
-Amari, S. (2016). *Information Geometry and Its Applications.* Springer.
diff --git a/papers/project_paper_2_neuroscience/references/Bastos2012.pdf b/papers/project_paper_2_neuroscience/references/Bastos2012.pdf
new file mode 100644
index 00000000..28ce3bd5
--- /dev/null
+++ b/papers/project_paper_2_neuroscience/references/Bastos2012.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:af1955d6a49759e0a78310c2e1948cf421e3c78babd18a294b7b621a0c7cd599
+size 533952
diff --git a/papers/project_paper_2_neuroscience/references/Bastos2012.txt b/papers/project_paper_2_neuroscience/references/Bastos2012.txt
new file mode 100644
index 00000000..652e6b8b
--- /dev/null
+++ b/papers/project_paper_2_neuroscience/references/Bastos2012.txt
@@ -0,0 +1,1759 @@
+Canonical microcircuits for predictive coding
+
+Andre M. Bastos1,2,6, W. Martin Usrey1,3,4, Rick A. Adams8, George R. Mangun2,3,5, Pascal
+Fries6,7, and Karl J. Friston8
+
+1Center for Neuroscience, University of California-Davis, Davis, CA 95618 USA. 2Center for Mind
+and Brain, University of California-Davis, Davis, CA 95618 USA. 3Department of Neurology,
+University of California-Davis, Davis, CA 95618 USA. 4Department of Neurobiology, Physiology
+and Behavior, University of California-Davis, Davis, CA 95618 USA. 5Department of Psychology,
+University of California-Davis, Davis, CA 95618 USA. 6Ernst Strüngmann Institute (ESI) for
+Neuroscience in Cooperation with Max Planck Society, Deutschordenstraße 46, 60528 Frankfurt,
+Germany. 7Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen,
+Kapittelweg 29, 6525 EN Nijmegen, Netherlands. 8The Wellcome Trust Centre for Neuroimaging,
+University College London, Queen Square, London WC1N 3BG, UK.
+
+Summary
+
+This review considers the influential notion of a canonical (cortical) microcircuit in light of recent
+theories about neuronal processing. Specifically, we reconcile quantitative studies of
+microcircuitry and the functional logic of neuronal computations. We revisit the established idea
+that message passing among hierarchical cortical areas implements a form of Bayesian inference –
+paying careful attention to the implications for intrinsic connections among neuronal populations.
+By deriving canonical forms for these computations, one can associate specific neuronal
+populations with specific computational roles. This analysis discloses a remarkable
+correspondence between the microcircuitry of the cortical column and the connectivity implied by
+predictive coding. Furthermore, it provides some intuitive insights into the functional asymmetries
+between feedforward and feedback connections and the characteristic frequencies over which they
+operate.
+
+Keywords
+
+neuronal; connectivity; cortical; microcircuit; computation; predictive coding; free energy
+principle; gamma oscillations; beta oscillations
+
+Correspondence: Karl Friston, The Wellcome Trust Centre for Neuroimaging, Institute of Neurology, University College London, 12
+Queen Square, London, WC1N 3BG, UK. Tel (44) 207 833 7454 k.friston@ucl.ac.uk.
+
+NIH Public Access Author Manuscript
+Neuron. Author manuscript; available in PMC 2013 September 19.
+Published in final edited form as:
+Neuron. 2012 November 21; 76(4): 695–711. doi:10.1016/j.neuron.2012.10.038.
+
+Introduction
+
+The idea that the brain actively constructs explanations for its sensory inputs is now
+generally accepted. This notion builds on a long history of proposals that the brain uses
+internal or generative models to make inferences about the causes of its sensorium
+(Helmholtz, 1860; Gregory 1968, 1980; Dayan et al., 1995). In terms of implementation,
+predictive coding is, arguably, the most plausible neurobiological candidate for making
+these inferences (Srinivasan et al., 1982; Mumford, 1992; Rao and Ballard, 1999). This
+review considers the canonical microcircuit in light of predictive coding. We focus on the
+intrinsic connectivity within a cortical column and the extrinsic connections between
+columns in different cortical areas. We try to relate this circuitry to neuronal computations
+by showing the computational dependencies – implied by predictive coding – recapitulate
+the physiological dependencies implied by quantitative studies of intrinsic connectivity. This
+issue is important as distinct neuronal dynamics in different cortical layers are becoming
+increasingly apparent (de Kock et al., 2007; Sakata and Harris, 2009; Maier et al., 2010;
+Bollimunta et al., 2011). For example, recent findings suggest that the superficial layers of
+cortex show neuronal synchronization and spike-field coherence predominantly in the
+gamma frequencies, while deep layers prefer lower (alpha or beta) frequencies (Roopun et
+al., 2006, 2008; Maier et al., 2010; Buffalo et al., 2011). Since feedforward connections
+originate predominately from superficial layers and feedback connections from deep layers,
+these differences suggest that feedforward connections use relatively high frequencies,
+compared to feedback connections, as recently demonstrated empirically (Bosman et al.,
+2012). These asymmetries call for something quite remarkable: namely, a synthesis of
+spectrally distinct inputs to a cortical column and the segregation of its outputs. This
+segregation can only arise from local neuronal computations that are structured and
+precisely interconnected. It is the nature of this intrinsic connectivity – and the dynamics it
+supports – that we consider. The aim of this review is to speculate about the functional roles
+of neuronal populations in specific cortical layers in terms of predictive coding. Our long-
+term aim is to create computationally informed models of microcircuitry that can be tested
+with dynamic causal modelling (David et al., 2006; Moran et al., 2008, 2011).
+
+This review comprises three sections. We start with an overview of the anatomy and
+physiology of cortical connections – with an emphasis on quantitative advances. The second
+section considers the computational role of the canonical microcircuit that emerges from
+these studies. The third section provides a formal treatment of predictive coding and defines
+the requisite computations in terms of differential equations. We then associate the form of
+these equations with the canonical microcircuit to define a computational architecture. We
+conclude with some predictions about intrinsic connections and note some important
+asymmetries in feedforward and feedback connections that emerge from this treatment.
+
+The anatomy and physiology of cortical connections
+
+This section reviews laminar-specific connections that underlie the notion of a canonical
+microcircuit (Douglas et al., 1989; Douglas and Martin, 1991, 2004). We first focus on
+mammalian visual cortex and then consider whether visual microcircuitry can be generalized
+to a canonical circuit for the entire cortex. Both functional and anatomical techniques have
+been applied to study intrinsic (intracortical) and extrinsic connections. We will emphasise
+the insights from recent studies that combine both techniques.
+
+Intrinsic connections and the canonical microcircuit
+
+The seminal work of Douglas and Martin (1991), in the cat visual system, produced a model
+of how information flows through the cortical column. Douglas and Martin recorded
+intracellular potentials from cells in primary visual cortex during electrical stimulation of its
+thalamic afferents. They noted a stereotypical pattern of fast excitation, followed by slower
+and longer-lasting inhibition. The latency of the ensuing hyperpolarisation distinguished
+responses in supragranular and infragranular layers. Using conductance-based models, they
+showed that a simple model could reproduce these responses. Their model contained
+superficial and deep pyramidal cells with a common pool of inhibitory cells. All three
+neuronal populations received thalamic drive and were fully interconnected. The deep
+pyramidal cells received relatively weak thalamic drive but strong inhibition (Figure 1).
+These interconnections allowed the circuit to amplify transient thalamic inputs to generate
+sustained activity in the cortex, while maintaining a balance between excitation and
+inhibition, two tasks that must be solved by any cortical circuit. Their circuit, although based
+on recordings from cat visual cortex, was also proposed as a basic theme that might be
+present and replicated, with minor variations, throughout the cortical sheet (Douglas et al.,
+1989).
+
+Subsequent studies have used intracellular recordings and histology to measure spikes (and
+depolarisation) in pre and post-synaptic cells, whose cellular morphology can be determined.
+This approach quantifies both the connection probability – defined as the number of
+observed connections divided by total number of pairs recorded – and connection strength –
+defined in terms of post-synaptic responses. Thomson et al. (2002) used these techniques to
+study layers 2 to 5 (L2 to L5) of the cat and rat visual systems. The most frequently
+connected cells were located in the same cortical layer, where the largest interlaminar
+projections were the ‘feedforward’ connections from L4 to L3 and from L3 to L5. Excitatory
+reciprocal ‘feedback’ connections were not observed (L3 to L4) or less commonly observed
+(L5 to L3), suggesting that excitation spreads within the column in a feedforward fashion.
+Feedback connections were typically seen when pyramidal cells in one layer targeted
+inhibitory cells in another (see Thomson and Bannister, 2003 for a review).
+
+While many studies have focused on excitatory connections, a few have examined inhibitory
+connections. These are more difficult to study, because inhibitory cells are less common
+than excitatory cells, and because there are at least seven distinct morphological classes
+(Salin and Bullier, 1995). However, recent advances in optogenetics have made it possible
+to more easily target inhibitory cells: Kätzel and colleagues combined optogenetics and
+whole-cell recording to investigate the intrinsic connectivity of inhibitory cells in mouse
+cortical areas M1, S1, and V1 (Kätzel et al., 2010). They transgenically expressed
+channelrhodopsin in inhibitory neurons and activated them, while recording from pyramidal
+cells. This allowed them to assess the effect of inhibition as a function of laminar position
+relative to the recorded neuron.
+
+Several conclusions can be drawn from this approach (Kätzel et al., 2010): first, L4
+inhibitory connections are more restricted in their lateral extent, relative to other layers. This
+supports the notion that L4 responses are dominated by thalamic inputs, while the remaining
+laminae integrate afferents from a wider cortical patch. Second, the primary source of
+inhibition originates from cells in the same layer, reflecting the prevalence of inhibitory
+intralaminar connections. Third, several interlaminar motifs appeared to be general – at least
+in granular cortex: principally, a strong inhibitory connection from L4 onto supragranular
+L2/3 and from infragranular layers onto L4. For more information on the cell-type
+specificity of inhibitory connections, see Yoshimura and Callaway (2005). Figure 2
+provides a summary of key excitatory and inhibitory intralaminar connections.
+
+Microcircuits in the sensorimotor cortex
+
+Do the features of visual microcircuits generalize to other cortical areas? Recently, two
+studies have mapped the intrinsic connectivity of mouse sensory and motor cortices: Lefort
+and colleagues (2009) used multiple whole-cell recordings in mouse barrel cortex to
+determine the probability of monosynaptic connections – and the corresponding connection
+strength. As in visual cortex, the strongest connections were intralaminar and the strongest
+interlaminar connections were the ascending L4 to L2, and descending L3 to L5.
+
+One puzzle about canonical microcircuits is whether motor cortex has a local circuitry that is
+qualitatively similar to sensory cortex. This question is important because motor cortex lacks
+a clearly defined granular L4 (a property that earns it the name “agranular cortex”). Weiler
+and colleagues combined whole-cell recordings in mouse motor cortex with photostimulation
+to uncage glutamate (Weiler et al., 2008). This allowed them to systematically
+stimulate the cortical column in a grid – centred on the pyramidal neuron from which they
+recorded. By recording from pyramidal neurons in L2-6 (L1 lacks pyramidal cells), the
+authors mapped the excitatory influence that each layer exerts over the others. They found
+that the L2/3 to L5A/B connection was the strongest – accounting for one-third of the total
+synaptic current in the circuit. The second strongest interlaminar connection was the
+reciprocal L5A to L2/3 connection. This pathway may be homologous to the prominent
+L4/5A to L2/3 pathway in sensory cortex. Also – as in sensory cortex – recurrent
+(intralaminar) connections were prominent, particularly in L2, L5A/B and L6. The largest
+fraction of synaptic input arrived in L5A/B, consistent with its key role in accumulating
+information from a wide range of afferents – before sending its output to the corticospinal
+tract. In summary, strong input-layer to superficial, and superficial to deep connectivity,
+together with strong intralaminar connectivity, suggests that the intrinsic circuitry of motor
+cortex is similar to other cortical areas.
+
+The anatomy and physiology of extrinsic connections
+
+Clearly, an account of microcircuits must refer to the layers of origin of extrinsic
+connections and their laminar targets. Although the majority of presynaptic inputs arise from
+intrinsic connections, cortical areas are also richly interconnected, where the balance
+between intrinsic and extrinsic processing mediates functional integration among specialised
+cortical areas (Engel et al., 2010). By numbers alone, intrinsic connections appear to
+dominate – 95% of all neurons labelled with a retrograde tracer lie within about 2 mm of the
+injection site (Markov et al., 2011). The remaining 5% represent cells giving rise to extrinsic
+connections, which – although sparse – can be extremely effective in driving their targets. A
+case in point is the LGN to V1 connection: although it is only the sixth strongest connection
+to V1, LGN afferents have a substantial effect on V1 responses (Markov et al., 2011).
+
+Hierarchies and functional asymmetries
+
+Current dogma holds that the cortex is
+hierarchically organized. The idea of a cortical hierarchy rests on the distinction between
+three types of extrinsic connections: feedforward connections, that link an earlier area to a
+higher area, feedback connections, that link a higher to an earlier area, and lateral
+connections, that link areas at the same level (reviewed in Felleman and Van Essen, 1991).
+These connections are distinguished by their laminar origins and targets. Feedforward
+connections originate largely from superficial pyramidal cells and target L4, while feedback
+connections originate largely from deep pyramidal cells and terminate outside of L4
+(Felleman and Van Essen, 1991). Clearly, this description of cortical hierarchies is a
+simplification and can be nuanced in many ways: for example, as the hierarchical distance
+between two areas increases, the percentage of cells that send feedforward (resp. feedback)
+projections from a lower (resp. higher) level becomes increasingly biased towards the
+superficial (resp. deep) layers (Barone et al., 2000; Vezoli, 2004).
+
+In addition to the laminar specificity of their origins and targets, feedforward and feedback
+connections also differ in their synaptic physiology. The traditional view holds that
+feedforward connections are strong and driving, capable of eliciting spiking activity in their
+targets and conferring classical receptive field properties – the prototypical example being
+the synaptic connection between LGN and V1 (Sherman and Guillery, 1998). Feedback
+connections are thought to modulate (extra-classical) receptive field characteristics
+according to the current context; e.g., visual occlusion, attention, salience, etc. The
+prototypical example of a feedback connection is the cortical L6 to LGN connection.
+Sherman and Guillery identified several properties that distinguish drivers from modulators.
+Driving connections tend to show a strong ionotropic component in their synaptic response,
+evoke large EPSPs, and respond to multiple EPSPs with depressing synaptic effects.
+Modulatory connections produce metabotropic and ionotropic responses when stimulated,
+evoke weak EPSPs, and show paired-pulse facilitation (Sherman and Guillery, 1998; 2011).
+These distinctions were based upon the inputs to the LGN, where retinal input is driving and
+cortical input is modulatory. Until recently, little data were available to assess whether a
+similar distinction applies to cortico-cortical feedforward and feedback connections.
+However, recent studies show that cortical feedback connections express not only
+modulatory but also driving characteristics:
+
+Are feedback connections driving, modulatory or both?
+
+Although it is generally
+thought that feedback connections are weak and modulatory (Crick and Koch, 1998;
+Sherman and Guillery, 1998), recent evidence suggests that feedback connections do more
+than modulate lower level responses: Sherman and colleagues recorded cells in mouse area
+V1/V2 and A1/A2, while stimulating feedforward or feedback afferents. In both cases,
+driving-like responses as well as modulatory-like responses were observed (Covic and
+Sherman, 2011; De Pasquale and Sherman, 2011). This indicates that – for these
+hierarchically proximate areas – feedback connections can drive their targets just as strongly
+as feedforward connections. This is consistent with earlier studies showing that feedback
+connections can be driving: Mignard and Malpeli (1991) studied the feedback connection
+between areas 18 and 17, while layer A of the LGN was pharmacologically inactivated. This
+silenced the cells in L4 in area 17 but spared activity in superficial layers. However,
+superficial cells were silenced when area 18 was lesioned. This is consistent with a driving
+effect of feedback connections from area 18, in the absence of geniculate input. In summary,
+feedback connections can mediate modulatory and driving effects. This is important from
+the point of view of predictive coding, because top-down predictions have to elicit
+obligatory responses in their targets (cells reporting prediction errors):
+
+In predictive coding, feedforward connections convey prediction errors, while feedback
+connections convey predictions from higher cortical areas to suppress prediction errors in
+lower areas. In this scheme, feedback connections should therefore be capable of exerting
+strong (driving) influences on earlier areas to suppress or counter feedforward driving
+inputs. However, as we will see later, these influences also need to exert nonlinear or
+modulatory effects. This is because top-down predictions are necessarily context sensitive:
+e.g., the occlusion of one visual object by another. In short, predictive coding requires
+feedback connections to drive cells in lower levels in a context sensitive fashion, which
+necessitates a modulatory aspect to their postsynaptic effects.
+
+Are feedback connections excitatory or inhibitory?
+
+Crucially, because feedback
+connections convey predictions – that serve to explain and thereby reduce prediction errors
+in lower levels – their effective (polysynaptic) connectivity is generally assumed to be
+inhibitory. An overall inhibitory effect of feedback connections is consistent with in vivo
+studies. For example, electrophysiological studies of the mismatch negativity suggest that
+neural responses to deviant stimuli – that violate sensory predictions established by a regular
+stimulus sequence – are enhanced relative to predicted stimuli (Garrido et al., 2009).
+Similarly, violating expectations of auditory repetition causes enhanced gamma-band
+responses in early auditory cortex (Todorovic et al., 2011). These enhanced responses are
+thought to reflect an inability of higher cortical areas to predict, and thereby suppress, the
+activity of populations encoding prediction error (Garrido et al., 2007; Wacongne et al.,
+2011). The suppression of predictable responses can also be regarded as repetition
+suppression, observed in single unit recordings from the inferior temporal cortex of macaque
+monkeys (Desimone, 1996). Furthermore, neurons in monkey inferotemporal cortex respond
+significantly less to a predicted sequence of natural images, compared to an unpredicted
+sequence (Meyer and Olson, 2011).
+
+The inhibitory effect of feedback connections is further supported by neuroimaging studies
+(Murray et al., 2002; Murray, 2005; Harrison et al., 2007; Summerfield et al., 2008, 2011;
+Alink et al., 2010). These studies show that predictable stimuli evoke smaller responses in
+early cortical areas. Crucially, this suppression cannot be explained in terms of local
+adaptation, because the attributes of the stimuli that can be predicted are not represented in
+early sensory cortex (e.g., Harrison et al. 2007). It should be noted that the suppression of
+responses to predictable stimuli can coexist with (top-down) attentional enhancement of
+signal processing (Wyart et al., 2012): in predictive coding, attention is mediated by
+increasing the gain of populations encoding prediction error (Spratling, 2008; Feldman and
+Friston, 2010). The resulting attentional modulation (e.g., Hopfinger et al., 2000) can
+interact with top-down predictions to override their suppressive influence – as demonstrated
+empirically (Kok et al., 2011). See Buschman and Miller (2007), Saalmann et al. (2007),
+Anderson et al. (2011), and Armstrong et al. (2012) for further discussion of top-down
+connections in attention.
+
+Further evidence for the inhibitory (suppressive) effect of feedback connections comes from
+neuropsychology: Patients with damage to the prefrontal cortex (PFC) show disinhibition of
+event related potential responses (ERP) to repeating stimuli (Knight et al., 1989; Yamaguchi
+and Knight, 1990; but see Barceló et al., 2000). In contrast, they show reduced-amplitude
+P300 ERPs in response to novel stimuli – as if there were a failure to communicate top-
+down predictions to sensory cortex (Knight, 1984). Furthermore, normal subjects show a
+rapid adaptation to deviant stimuli as they become predictable – an effect not seen in
+prefrontal patients.
+
+Several invasive studies complement these human studies in suggesting an overall inhibitory
+role for feedback connections. In a recent seminal study, Olsen et al. studied corticothalamic
+feedback between L6 of V1 and the LGN using transgenic expression of channelrhodopsin
+in L6 cells of V1. By driving these cells optogenetically – while recording units in V1 and
+the LGN – the authors showed that deep L6 principal cells inhibited their extrinsic targets in
+the LGN and their intrinsic targets in cortical layers 2 to 5 (Olsen et al., 2012). This
+suppression was powerful – in the LGN, visual responses were suppressed by 76%.
+Suppression was also high in V1, around 80-84% (Olsen et al., 2012). This evidence is in
+line with classical studies of corticogeniculate contributions to length tuning in the LGN,
+showing that cortical feedback contributes to the surround suppression of feline LGN cells:
+without feedback, LGN cells are disinhibited and show weaker surround suppression
+(Murphy and Sillito, 1987; Sillito et al., 1993; but see Alitto and Usrey, 2008).
+
+While these studies provide convincing evidence that cortical feedback to the LGN is
+inhibitory, the evidence is more complicated for corticocortical feedback connections
+(Sandell and Schiller, 1982; Johnson and Burkhalter, 1996, 1997). Hupé and colleagues
+cooled area V5/MT while recording from areas V1, V2, and V3 in the monkey. When visual
+stimuli were presented in the classical receptive field (CRF), cooling of area V5/MT
+decreased unit activity in earlier areas, suggesting an excitatory effect of extrinsic feedback
+(Hupé et al., 1998). However, when the authors used a stimulus that spanned the
+extra-classical RF, the responses of V1 neurons were – on average – enhanced after cooling area
+V5, consistent with the suppressive role of feedback connections. These results indicate that
+the inhibitory effects of feedback connections may depend on (natural) stimuli that require
+integration over the visual field. Similar effects were observed when area V2 was cooled and
+neurons were measured in V1: when stimuli were presented only to the CRF, cooling V2
+decreased V1 spiking activity; however, when stimuli were present in the CRF and the
+surround, cooling V2 increased V1 activity (Bullier et al., 1996). Finally, others have argued
+for an inhibitory effect of feedback based on the timing and spatial extent of surround
+suppression in monkey V1 – concluding that the far surround suppression effects were most
+likely mediated by feedback (Bair et al., 2003).
+
+Bastos et al.
+Page 6
+
+Neuron. Author manuscript; available in PMC 2013 September 19.
+
+The empirical finding that feedback connections can both facilitate and suppress firing in
+lower hierarchical areas – depending on the content of classical and extra-classical receptive
+fields – is consistent with predictive coding: Rao and Ballard (1999) trained a hierarchical
+predictive coding network to recognise natural images. They showed that higher levels in
+the hierarchy learn to predict visual features that extend across many CRFs in the lower
+levels (e.g. tree trunks or horizons). Hence, higher visual areas come to predict that visual
+stimuli will span the receptive fields of cells in lower visual areas. In this setting, a stimulus
+that is confined to a CRF would elicit a strong prediction error signal (because it cannot be
+predicted). This provides a simple explanation for the findings of Hupé et al. (1998) and
+Bullier et al. (1996): when feedback connections are deactivated, there are no top-down
+predictions to explain responses in lower areas – leading to a disinhibition of responses in
+earlier areas, when – and only when – stimuli can be predicted over multiple CRFs.
+
+Feedback connections and layer 1—How might the inhibitory effect of feedback
+connections be mediated? The established view is that extrinsic corticocortical connections
+are exclusively excitatory (using glutamate as their excitatory neurotransmitter) – although
+recent evidence suggests that inhibitory extrinsic connections exist and may play an
+important role in synchronizing distant regions (Melzer et al., 2012). However, one
+important route by which feedback connections could mediate selective inhibition is via
+their termination in L1 (Anderson and Martin, 2006; Shipp, 2007). Layer 1 is sometimes
+referred to as acellular due to its pale appearance with Nissl staining (the classical method
+for separating layers that selectively labels cell bodies). Indeed, a recent study concluded
+that L1 contains less than 0.5% of all cells in a cortical column (Meyer et al., 2011). These
+L1 cells are almost all inhibitory and interconnect strongly with each other, via electrical
+connections and chemical synapses (Chu et al., 2003). Simultaneous whole cell patch clamp
+recordings show that they provide strong monosynaptic inhibition to L2/3 pyramidal cells,
+whose apical dendrites project into L1 (Chu et al., 2003; Wozny and Williams, 2011). This
+means L1 inhibitory cells are in a prime position to mediate inhibitory effects of extrinsic
+feedback. The laminar location highlighted by these studies – the bottom of L1 and the top
+of L2/3 – has recently been shown to be a “hotspot” of inhibition in the column (Meyer et
+al., 2011). Indeed, a study of rat barrel cortex – that stimulated (and inactivated) L1 –
+showed that it exerts a powerful inhibitory effect on whisker-evoked responses (Shlosberg et
+al., 2006). These studies suggest that corticocortical feedback connections could deliver
+strong inhibition, if they were to recruit the inhibitory potential of L1.
+
+In terms of the excitatory and modulatory effects of feedback connections, predictive input
+from higher cortical areas might have an important impact via the distal dendrites of
+pyramidal neurons (Larkum et al., 2009). Furthermore, there is a specific type of
+GABAergic neuron that appears to control distal dendritic excitability, gating top down
+excitatory signals differentially during behavior (Gentet et al., 2012). Table 1 summarises
+the studies we have discussed in relation to the role of feedback connections.
+
+Feedforward and transthalamic connections
+
+While the evidence for an inhibitory effect of feedback connections has to be evaluated
+carefully, the evidence for an excitatory effect of feedforward connections is unequivocal.
+For example, in the monkey, V1 projects monosynaptically to V2, V3, V3a, V4, and V5/MT
+(Zeki, 1978; Zeki and Shipp, 1988). In all cases – when V1 is reversibly inactivated through
+cooling – single-cell activity in target areas is strongly suppressed (Girard and Bullier, 1989;
+Girard et al., 1991a, 1991b, 1992). In the cases of V2 and V3, the result of cooling area V1
+is a near-total silencing of single unit activity. These studies illustrate that activity in higher
+cortical areas depends on driving inputs from earlier cortical areas that establish their
+receptive field properties.
+
+Finally, while many studies have focused on extrinsic connections that project directly from
+one cortical area to the next, there is mounting evidence that feedforward driving
+connections (and perhaps feedback) in the cortex could be mediated by transthalamic
+pathways (Sherman and Guillery, 1998, 2011). The strongest evidence for this claim comes
+from the somatosensory system, where it was shown recently that the posterior medial
+nucleus of the thalamus (POm) – a higher-order thalamic nucleus that receives direct input
+from cortex – can relay information between S1 and S2 (Theyel et al., 2009). In addition, the
+thalamic reticular nucleus has been proposed to mediate the inhibition that might underlie
+cross-modal attention or top-down predictions (Yamaguchi and Knight, 1990; Crick, 1984;
+Wurtz et al., 2011). Furthermore, computational considerations and recent experimental
+findings point to a potentially important role for higher-order thalamic nuclei in coordinating
+and synchronizing cortical responses (Vicente et al., 2008; Saalmann et al., 2012). The
+degree to which cortical areas are integrated directly via corticocortical or indirectly via
+cortico-thalamo-cortical connections – and the extent to which transthalamic pathways
+dissociate feedforward from feedback connections in the same way as we have proposed for
+the cortico-cortical connections – are open questions.
+
+The canonical microcircuit
+
+Central to the idea of a canonical microcircuit is the notion that a cortical column contains
+the circuitry necessary to perform requisite computations, and that these circuits can be
+replicated with minor variations throughout the cortex. One of the clearest examples of how
+cortical circuits process simple inputs – to generate complex outputs – is the emergence of
+orientation tuning in V1. Orientation tuning is a distinctly cortical phenomenon because
+geniculocortical relay cells show no orientation preferences. A further elaboration of cortical
+responses can be found in the distinction between simple and complex cells – while simple
+cells possess spatially confined receptive fields, complex cells are orientation-tuned but
+show less preference for the location of an oriented bar. Hubel and Wiesel proposed a model
+for how intrinsic and extrinsic connectivity could establish a circuit explaining these
+receptive field properties. They proposed that orientation tuning in simple cells could be
+generated by a single cortical cell receiving input from several ON centre – OFF surround
+geniculate cells arranged along a particular orientation, thereby endowing it with a
+preference for bars oriented in a particular direction (Hubel and Wiesel, 1962). Complex
+cells were hypothesized to receive inputs from several simple cells – with the same
+orientation preference and slightly varying receptive field locations. Thus, complex cells
+were thought not to receive direct LGN input but to be higher order cells in cortex.
+Subsequent findings supported these predictions, showing that input layer 4Cα and 4Cβ
+contained the largest proportion of cells receiving monosynaptic geniculate input, while
+superficial and deep layer cells contain a larger number of cells receiving disynaptic or
+polysynaptic input (Bullier and Henry, 1980). Furthermore, simple cells project
+monosynaptically onto complex cells, where they exert a strong feedforward influence
+(Alonso and Martinez, 1998; Alonso, 2002). These models suggest that intrinsic cortical
+circuitry allows processing to proceed along discrete steps that are capable of producing
+response properties in outputs that are not present in inputs.
+
+Segregation of processing streams
+
+A key property of canonical circuits is the segregation of parallel streams of processing. For
+example, in primates, parvocellular input enters the cortex primarily in layer 4Cβ, whereas
+magnocellular inputs enter in 4Cα. The corticogeniculate feedback pathway from L6
+maintains this segregation, as upper L6 cells preferentially synapse onto parvocellular cells
+in the LGN, while lower L6 cells target the magnocellular LGN layers (Fitzpatrick et al.,
+1994; Briggs and Usrey, 2009). Further examples of stream segregation are also present in
+the dorsal “where” and the ventral “what” pathways, and in the projection from V1 to the
+thick, thin, and inter-stripe regions of V2 (Zeki and Shipp, 1988; Sincich and Horton, 2005).
+
+Superficial and deep layers are anatomically interconnected, but mounting evidence suggests
+that they constitute functionally distinct processing streams: in an elegant experiment,
+Roopun et al. (2006) showed that L2/3 of rat somatomotor cortex shows prominent gamma
+oscillations that are co-expressed with beta oscillations in L5. Both rhythms persisted when
+superficial and deep layers were disconnected at the level of L4. Maier and colleagues
+(2010) used multilaminar recordings to show strong LFP coherence amongst sites within the
+superficial layers (the superficial compartment), as well as strong coherence amongst sites in
+deep layers (the deep compartment), but weak inter-compartment coherence. These studies
+indicate a segregation of – potentially autonomous – supragranular and infragranular
+dynamics. Maier et al. found that supragranular sites had higher broadband gamma power
+than infragranular sites. This pattern was reversed in the alpha and beta range, with greater
+power in the infragranular and granular layers. Finally, the spiking activity of neurons in the
+superficial layers of visual cortex is more coherent with gamma frequency oscillations in
+the local field potential, while neurons in deep layers are more coherent with alpha
+frequency oscillations (Buffalo et al., 2011). This finding is consistent with an earlier study
+by Livingstone (1996) showing that 50% of cells in L2/3 of squirrel monkey V1 expressed
+gamma oscillations, compared to less than 20% of cells in L4C and infragranular layers. The
+different spectral behaviour of superficial and deep layers has led to the interesting proposal
+that feedforward and feedback signalling may be mediated by distinct (high and low)
+frequencies (reviewed in Wang, 2010, see also Buschman and Miller, 2007 in the context of
+attention), a proposal that has recently received experimental support – at least for the
+feedforward connections (Bosman et al., 2012; see also Gregoriou et al., 2009).
+
+Integration and segregation within canonical circuits—Given this functional and
+anatomical segregation into parallel streams, the question naturally arises, how are these
+streams integrated? It has been previously suggested that integration occurs through the
+synchronized firing of multiple neurons that form a neural ensemble (Gray et al., 1989;
+Singer, 1999), while others have emphasized inter-areal phase-synchronization or coherence
+(Varela et al., 2001; Fries, 2005; Fujisawa and Buzsáki, 2011). While a full treatment of this
+question is beyond the scope of the current review, we propose that the canonical
+microcircuit contains a clue for how the dialectic between segregation and integration might
+be resolved. While top-down and bottom-up inputs and outputs may be segregated in layers,
+streams, and frequency bands, the canonical microcircuit specifies the circuitry for how the
+basic units of cortex are interconnected and therefore how the intrinsic activity of the
+cortical column is entrained by extrinsic inputs. This intrinsic connectivity specifies how the
+cells of origin and termination of extrinsic projections are interconnected, and thus
+determines how top-down and bottom-up streams are integrated within each cortical column.
+
+Spatial segregation and cortical columns
+
+The notion of a canonical microcircuit implicitly assumes that each circuit is distinct from
+its neighbours, and that these circuits could presumably carry out computations in parallel. Therefore, the
+canonical microcircuit specifies the spatial scale over which processing is integrated. The
+most likely candidate for this spatial scale is the cortical column – which can vary over three
+orders of magnitude between minicolumns, columns, and hypercolumns. Minicolumns are
+only a few cells wide, estimated to be about 50-60 micrometers in diameter by Mountcastle
+(1997) and are seen in Nissl sections of cortex as slight variations in cell density.
+Minicolumns were originally proposed as elementary units of cortex by Lorente de No
+(1949) and appear to reflect the migration of cells from the ventricular zone to the cortical
+sheet during foetal development (reviewed in Horton and Adams, 2005). Hubel and Wiesel
+estimated that orientation columns were on this order of magnitude, about 25-50
+micrometers wide, although they failed to establish a correspondence between orientation
+columns observed physiologically and the minicolumns seen in Nissl sections (Hubel and
+Wiesel, 1974). A cortical column was classically defined as a vertical alignment of cells
+containing neurons with similar receptive field properties, such as orientation preference and
+ocular dominance in V1; or touch in somatosensory cortex (Mountcastle, 1957; Hubel and
+Wiesel, 1972). These columns were suggested by Mountcastle to encompass a number of
+minicolumns, with a width of 300-400 micrometers (Mountcastle, 1997). Finally, Hubel and
+Wiesel defined a hypercolumn to be the unit of cortex necessary to traverse all possible
+values of a particular receptive field property, such as orientation or eye dominance –
+estimated to be between 0.5 and 1 mm wide (Hubel and Wiesel, 1974).
+
+Columns, connections and computations—So is the cortical column the basic unit
+of cortical computation? Some authors emphasize that even within a dendrite, there are all
+the necessary biophysical mechanisms for performing surprisingly advanced computations,
+such as direction selectivity, coincidence detection, or temporal integration (Häusser and
+Mel, 2003; London and Häusser, 2005). Others argue that single neurons can process their
+inputs at the dendrite, soma, and initial segment, such that the output spike trains of just two
+interconnected cells could mediate computations like independent components analysis
+(Klampfl et al., 2009). Others posit that cortical columns form the basic computational unit
+(Mountcastle, 1997; Hubel and Wiesel, 1972; but see Horton and Adams, 2005). Donald
+Hebb proposed that neurons distributed over several cortical areas could form a functional
+computational unit called a neural assembly (Hebb, 1949). This view has re-emerged in
+recent years, with the development of the requisite recording and analytic techniques for
+evaluating this proposal (Buzsáki, 2010; Canolty et al., 2010; Singer et al., 1997;
+Lopes-dos-Santos et al., 2011).
+
+Computational modelling studies indicate that cortical columns with structured connectivity
+are computationally more efficient than a network containing the same number of neurons
+but with random connectivity (Haeusler and Maass, 2007). Others suggest that this circuitry
+allows the cortex to organize and integrate bottom-up, lateral, and top-down information
+(Ullman, 1995; Raizada and Grossberg, 2003). Douglas and Martin suggest that the rich
+anatomical connectivity of L2/3 pyramidal cells allows them to collect information from
+top-down, lateral, and bottom-up inputs, and – through processing in the dendritic tree –
+select the most likely interpretation of their inputs. More recently, George and Hawkins have
+suggested that the canonical microcircuit implements a form of Bayesian processing
+(George and Hawkins, 2009). In the following section, we pursue similar ideas, but ground
+them in the framework of predictive coding, and propose a cortical circuit that could
+implement predictive coding through canonical interconnections. In particular, we find that
+the proposed circuitry agrees remarkably well with quantitative characterisations of the
+canonical microcircuit (Haeusler and Maass, 2007).
+
+A canonical microcircuit for predictive coding
+
+This section considers the computational role of cortical microcircuitry in more detail. We
+try to show that the computations performed by canonical microcircuits can be specified
+more precisely than one might imagine and that these computations can be understood
+within the framework of predictive coding. In brief, we will show that (hierarchical
+Bayesian) inference about the causes of sensory input can be cast as predictive coding. This
+is important because it provides formal constraints on the dynamics one would expect to
+find in neuronal circuits. Having established these constraints, we then attempt to match
+them with the neurobiological constraints afforded by the canonical microcircuit. The
+endpoint of this exercise is a canonical microcircuit for predictive coding.
+
+Predictive coding and the free energy principle
+
+It might be thought impossible to specify the computations performed by the brain.
+However, there are some fairly fundamental constraints on the basic form of neuronal
+dynamics. The argument goes as follows – and can be regarded as a brief summary of the
+free energy principle (see Friston, 2010 for details):
+
+• Biological systems are homoeostatic (or allostatic), which means that they
+minimise the dispersion (entropy) of their interoceptive and exteroceptive states.
+
+• Entropy is the average of surprise over time, which means biological systems
+minimise the surprise associated with their sensory states at each point in time.
+
+• In statistics, surprise is the negative logarithm of Bayesian model evidence, which
+means biological systems – like the brain – must continually maximise the
+Bayesian evidence for their (generative) model of sensory inputs.
+
+• Maximising Bayesian model evidence corresponds to Bayesian filtering of sensory
+inputs. This is also known as predictive coding.
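+
+The chain of reasoning in these bullets can be written compactly. The following is a sketch under stated assumptions rather than notation taken from the source: $s$ denotes sensory states, $m$ the generative model, $\vartheta$ the hidden causes, $q(\vartheta)$ an approximate posterior, and $F$ the variational free energy. All of these symbols are introduced here for illustration, and the first line assumes ergodicity so that a time average can stand in for an ensemble average.
+
+```latex
+\begin{align}
+  % Entropy of sensory states as the long-run time average of surprise
+  H[p(s \mid m)] &= \lim_{T \to \infty} \frac{1}{T} \int_0^T -\ln p\bigl(s(t) \mid m\bigr)\, dt \\
+  % Surprise is the negative log of Bayesian model evidence
+  \text{surprise}(s) &= -\ln p(s \mid m) \\
+  % Free energy is an upper bound on surprise, because the KL divergence is non-negative
+  F(s, q) &= -\ln p(s \mid m) + D_{\mathrm{KL}}\bigl[\, q(\vartheta) \,\|\, p(\vartheta \mid s, m) \,\bigr] \;\geq\; -\ln p(s \mid m)
+\end{align}
+```
+
+Read in this way, minimising $F$ with respect to $q$ tightens the bound on surprise, which is the sense in which bounding surprise and maximising model evidence coincide in the argument above.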
+
+These arguments mean that by minimising surprise, through selecting appropriate
+sensations, the brain is implicitly maximising the evidence for its own existence – this is
+known as active inference. In other words, to maintain homoeostasis the brain must predict
+its sensory states on the basis of a model. Fulfilling those predictions corresponds to
+accumulating evidence for that model – and the brain that embodies it. The implicit
+maximisation of Bayesian model evidence provides an important link to the Bayesian brain
+hypothesis (Hinton and van Camp, 1993; Dayan et al., 1995; Knill and Pouget, 2004) and
+many other compelling proposals about perceptual synthesis – including analysis by
+synthesis (Neisser, 1967; Yuille and Kersten, 2006), epistemological automata (MacKay,
+1956), the principle of minimum redundancy (Attneave, 1954; Barlow, 1961; Dan et
+al., 1996), the Infomax principle (Linsker, 1990; Atick, 1992; Kay and Phillips, 2011), and
+perception as hypothesis testing (Gregory, 1968; 1980).
+
+The most popular scheme – for Bayesian filtering in neuronal circuits – is predictive coding
+(Srinivasan et al., 1982; Buchsbaum and Gottschalk, 1983; Rao and Ballard, 1999). In this
+context, surprise corresponds (roughly) to prediction error. In predictive coding, top-down
+predictions are compared with bottom-up sensory information to form a prediction error.
+This prediction error is used to update higher-level representations – upon which top-down
+predictions are based. These optimised predictions then reduce prediction error at lower
+levels.
+
+To predict sensations, the brain must be equipped with a generative model of how its
+sensations are caused (Helmholtz, 1860). Indeed, this led Geoffrey Hinton and colleagues to
+propose that the brain is an inference (Helmholtz) machine (Hinton and Zemel, 1994; Dayan
+et al., 1995). A generative model describes how variables or causes in the environment
+conspire to produce sensory input. Generative models map from (hidden) causes to (sensory)
+consequences. Perception then corresponds to the inverse mapping from sensations to their
+causes, while action can be thought of as the selective sampling of sensations. Crucially, the
+form of the generative model dictates the form of the inversion – for example, predictive
+coding. Figure 3 depicts a general model as a probabilistic graphical model. A special case
+of these models is the hierarchical dynamic model (see Figure 4), which grandfathers most
+parametric models in statistics and machine learning (see Friston, 2008). These models
+explain sensory data in terms of hidden causes and states. Hidden causes and states are both
+hidden variables that cause sensations but they play slightly different roles: hidden causes
+link different levels of the model and mediate conditional dependencies among hidden states
+at each level. Conversely, hidden states model conditional dependencies over time (i.e.,
+memory) by modelling dynamics in the world. In short, hidden causes and states mediate
+structural and dynamic dependencies respectively.
+
+The details of the graph in Figure 3 are not important; it just provides a way of describing
+conditional dependencies among hidden states and causes responsible for generating sensory
+input. These dependencies mean that we can interpret neuronal activity as message passing
+among the nodes of a generative model, where each canonical microcircuit contains
+representations or expectations about hidden states and causes. In other words, the form of
+the underlying generative model defines the form of the predictive coding architecture used
+to invert the model. This is illustrated in Figure 4, where each node has a single parent. We
+will deal with this simple sort of model because it lends itself to an unambiguous description
+in terms of bottom-up (feedforward) and top-down (feedback) message passing. We now
+look at how perception or model inversion – recovering the hidden states and causes of this
+model given sensory data – might be implemented at the level of a microcircuit.
+
+Predictive coding and message passing
+
+In predictive coding, representations (or conditional expectations) generate top-down
+predictions to produce prediction errors. These prediction errors are then passed up the
+hierarchy in the reverse direction, to update conditional expectations. This ensures an
+accurate prediction of sensory input and all its intermediate representations. This hierarchical
+message passing can be expressed mathematically as a gradient descent on the (sum of
+squared) prediction errors, where the prediction errors are weighted by their
+precision (inverse variance).
+
+(1)
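+
+The typeset form of Equation (1) is not reproduced in this text. A hedged reconstruction, assuming the standard generalised predictive coding scheme referred to in the surrounding prose (see Friston, 2008), is sketched below. The symbols are assumptions chosen to match that prose: $\tilde{\mu}_v^{(i)}$ and $\tilde{\mu}_x^{(i)}$ are conditional expectations about hidden causes and hidden states at level $i$, $\tilde{\varepsilon}^{(i)}$ are prediction errors, $\xi_v^{(i)}$ and $\xi_x^{(i)}$ are their precision-weighted counterparts, $\Pi_v^{(i)}$ and $\Pi_x^{(i)}$ are precisions, $\mathcal{D}$ is a derivative operator acting on generalised motion, and $g^{(i)}$, $f^{(i)}$ are the nonlinear functions of the generative model.
+
+```latex
+\begin{equation}
+\begin{aligned}
+  % Expectations: predicted motion (first term) minus terms that reduce prediction error
+  \dot{\tilde{\mu}}_v^{(i)} &= \mathcal{D}\tilde{\mu}_v^{(i)}
+      - \partial_{\tilde{v}}\tilde{\varepsilon}^{(i)} \cdot \xi^{(i)} - \xi_v^{(i+1)} \\
+  \dot{\tilde{\mu}}_x^{(i)} &= \mathcal{D}\tilde{\mu}_x^{(i)}
+      - \partial_{\tilde{x}}\tilde{\varepsilon}^{(i)} \cdot \xi^{(i)} \\
+  % Precision-weighted prediction errors on hidden causes and hidden states
+  \xi_v^{(i)} &= \Pi_v^{(i)}\tilde{\varepsilon}_v^{(i)}
+      = \Pi_v^{(i)}\bigl(\tilde{\mu}_v^{(i-1)} - g^{(i)}(\tilde{\mu}_x^{(i)}, \tilde{\mu}_v^{(i)})\bigr) \\
+  \xi_x^{(i)} &= \Pi_x^{(i)}\tilde{\varepsilon}_x^{(i)}
+      = \Pi_x^{(i)}\bigl(\mathcal{D}\tilde{\mu}_x^{(i)} - f^{(i)}(\tilde{\mu}_x^{(i)}, \tilde{\mu}_v^{(i)})\bigr)
+\end{aligned}
+\tag{1}
+\end{equation}
+```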
+
+The first pair of equalities just says that conditional expectations about hidden causes and
+states are updated based upon the way we would predict them to change – the first
+term – and subsequent terms that minimise prediction error. The second pair of equations
+simply expresses prediction error as the difference between conditional
+expectations about hidden causes and (the changes in) hidden states and their predicted
+values, weighted by their precisions. These predictions are nonlinear functions of
+conditional expectations $(g^{(i)}, f^{(i)})$ at each level of the hierarchy and the level above.
+
+It is difficult to overstate the generality and importance of Equation (1) – it grandfathers
+nearly every known statistical estimation scheme, under parametric assumptions about
+additive noise. These range from ordinary least squares to advanced Bayesian filtering
+schemes (see Friston, 2008). In this general setting, Equation (1) minimises variational free
+energy and corresponds to generalised predictive coding. Under linear models, it reduces to
+linear predictive coding, also known as Kalman-Bucy filtering (see Friston et al., 2010 for
+details).
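+
+To make the gradient descent concrete, consider a minimal worked example under assumptions that are not in the source: a single level, one hidden cause with expectation $\mu$, a scalar sensory input $s$ predicted by the identity mapping, a prior expectation $\eta$, and sensory and prior precisions $\pi_s$ and $\pi_\eta$. All of these symbols are illustrative, and the example ignores dynamics (hidden states) entirely.
+
+```latex
+\begin{align}
+  % Precision-weighted prediction errors at the sensory and prior levels
+  \xi_s &= \pi_s\,(s - \mu), \qquad \xi_\eta = \pi_\eta\,(\mu - \eta) \\
+  % Gradient descent on the sum of squared, precision-weighted errors
+  %   \tfrac{1}{2}\pi_s (s - \mu)^2 + \tfrac{1}{2}\pi_\eta (\mu - \eta)^2
+  \dot{\mu} &= \xi_s - \xi_\eta \\
+  % Fixed point (\dot{\mu} = 0): a precision-weighted average of input and prior
+  \mu^{*} &= \frac{\pi_s\, s + \pi_\eta\, \eta}{\pi_s + \pi_\eta} \\
+  % The descent relaxes exponentially towards the fixed point
+  \dot{\mu} &= -(\pi_s + \pi_\eta)\,(\mu - \mu^{*})
+\end{align}
+```
+
+In this toy case the fixed point is the posterior mean of the corresponding linear Gaussian model, so the descent recovers ordinary Bayesian estimation. The bottom-up error $\xi_s$ excites the expectation, while the prediction enters each error through subtraction.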
+
+In neuronal network terms, Equation (1) says that prediction error units receive messages
+from the same level and the level above. This is because the hierarchical form of the model
+only requires conditional expectations from neighbouring levels to form prediction errors, as
+can be seen schematically in Figure 4. Conversely, expectations are driven by prediction
+error from the same level and the level below – updating expectations about hidden states
+and causes respectively. These constitute the bottom-up and lateral messages that drive
+conditional expectations to provide better predictions – or representations – that suppress
+prediction error. This updating corresponds to an accumulation of prediction errors – in that
+the rate of change of conditional expectations is proportional to prediction error.
+Electrophysiologically, this means that one would expect to see a transient prediction error
+response to bottom-up afferents (in neuronal populations encoding prediction error) that is
+suppressed to baseline firing rates by sustained responses (in neuronal populations encoding
+predictions). This is the essence of recurrent message passing between hierarchical levels to
+suppress prediction error (see Friston, 2008 for a more detailed discussion).
+
+The nature of this message passing is remarkably consistent with the anatomical and
+physiological features of cortical hierarchies. An important prediction is that the nonlinear
+functions of the generative model – modelling context sensitive dependencies among hidden
+variables – appear only in the top-down and lateral predictions. This means,
+neurobiologically, we would predict feedback connections to possess nonlinear or
+neuromodulatory characteristics; in contrast to feedforward connections that mediate a linear
+mixture of prediction errors. This functional asymmetry is exactly consistent with the
+empirical evidence reviewed above. Another key feature of Equation (1) is that the
+top-down predictions produce prediction errors through subtraction. In other words, feedback
+connections should exert inhibitory effects, of the sort seen empirically. Table 2 summarises
+the features of extrinsic connectivity (reviewed in the previous section) that are explained by
+predictive coding. In the remainder of this review, we focus on intrinsic connections and
+cortical microcircuits.
+
+The cortical microcircuit and predictive coding
+
+We now try to associate the variables in Equation (1) with specific populations in the
+canonical microcircuit. Figure 5 illustrates a remarkable correspondence between the form
+of Equation (1) and the connectivity of the canonical microcircuit. Furthermore, the
+resulting scheme corresponds almost exactly to the computational architecture proposed by
+Mumford (1992). This correspondence rests upon the following intuitive steps:
+
+• First, we divide the excitatory cells in the superficial and deep layers into principal
+(pyramidal) cells and excitatory interneurons. This accommodates the fact that (in
+macaque V1) a significant percentage of superficial L2/3 cells (about half) and
+deep L5 excitatory cells (about 80%) do not project outside the cortical column
+(Callaway and Wiser, 1996; Briggs and Callaway, 2005).
+
+• Second, we know that the superficial and deep pyramidal cells provide feedforward
+and feedback connections respectively. This means that superficial pyramidal cells
+must encode and broadcast prediction errors on hidden causes, while deep
+pyramidal cells must encode conditional expectations so that they can
+elaborate feedback predictions.
+
+• Third, we know that the (spiny stellate) excitatory cells in the granular layer receive
+feedforward connections encoding prediction errors on the hidden causes of the
+level below.
+
+• This leaves the inhibitory interneurons in the granular layer, which, for symmetry,
+we associate with prediction errors on the hidden states.
+
+• The remaining populations are the excitatory and inhibitory interneurons in the
+supragranular layer, to which we assign expectations about hidden causes and
+states respectively. These are mapped through descending (intrinsic) feedforward
+connections to cells in the deep layers that generate predictions. We do not suppose
+that this is a simple one-to-one mapping – rather it mediates the nonlinear
+transformation of expectations to predictions required by the earlier cortical level.
+
+This arrangement accommodates the fact that the dependencies among hidden states are
+confined to each node (by the nature of graphical models), which means that their
+expectations and prediction errors should be encoded by interneurons. Furthermore, the
+splitting of excitatory cells in the upper layers into two populations (encoding expectations
+and prediction errors on hidden causes) is sensible, because there is a one-to-one mapping
+between the expectations on hidden causes and their prediction errors.
+
+The ensuing architecture bears a striking correspondence to the microcircuit of Haeusler
+and Maass (2007), shown in the left panel of Figure 5 – in the sense that nearly every connection
+required by the predictive coding scheme appears to be present in terms of quantitative
+measures of intrinsic connectivity. However, there are two exceptions that both involve
+connections to the inhibitory cells in the granular layer (shown as dotted lines in Figure 5).
+Predictive coding requires that these cells (that encode prediction errors on hidden states)
+compare the expected changes in hidden states with the actual changes. This suggests that
+there should be interlaminar projections from supragranular (inhibitory) and infragranular
+(excitatory) cells. In terms of their synaptic characteristics, one would predict that these
+intrinsic connections would be of a feedback sort – in the sense that they convey predictions.
+Although not considered in the Haeusler and Maass scheme, feedback connections from
+infragranular layers are an established component of the canonical microcircuit (see Figure
+2).
+
+Functional asymmetries in the microcircuit
+
+The circuitry in Figure 5 appears consistent with the broad scheme of ascending
+(feedforward) and descending (feedback) intrinsic connections: feedforward prediction
+errors from a lower cortical level arrive at granular layers and are passed forward to
+excitatory and inhibitory interneurons in supragranular layers, encoding expectations. Strong
+and reciprocal intralaminar connections couple superficial excitatory interneurons and
+pyramidal cells. Excitatory and inhibitory interneurons in supragranular layers then send
+strong feedforward connections to the infragranular layer. These connections enable deep
+pyramidal cells and excitatory interneurons to produce (feedback) predictions, which ascend
+back to L4 or descend to a lower hierarchical level. This arrangement recapitulates the
+functional asymmetries between extrinsic feedforward and feedback connections and is
+consistent with the empirical characteristics of intrinsic connections.
+
+If we focus on the superficial and deep pyramidal cells, the form of the recognition
+dynamics in Equation (1) tells us something quite fundamental: We would anticipate higher
+frequencies in the superficial pyramidal cells, relative to the deep pyramidal cells. One can
+see this easily by taking the Fourier transform of the first equality in Equation (1)
+
+(2)
+
+This equation says that the contribution of any (angular) frequency ω in the prediction errors
+(encoded by superficial pyramidal cells) to the expectations (encoded by the deep pyramidal
+cells) is suppressed in proportion to that frequency (Friston 2008). In other words, high
+frequencies should be attenuated when passing from superficial to deep pyramidal cells.
+There is nothing mysterious about this attenuation – it is a simple consequence of the fact
+that conditional expectations accumulate prediction errors, thereby suppressing high-
+frequency fluctuations to produce smooth estimates of hidden causes. This smoothing –
+inherent in Bayesian filtering – leads to an asymmetry in frequency content of superficial
+and deep cells: for example, superficial cells should express more gamma relative to beta,
+and deep cells should express more beta relative to gamma (Roopun et al., 2006, 2008;
+Maier et al., 2010).
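+
+The attenuation described above can be illustrated with a deliberately simplified sketch.
+It assumes a single scalar expectation mu(t) that does nothing but accumulate a prediction
+error xi(t). This toy form and its symbols are illustrative assumptions; it is not a
+reconstruction of Equation (2).
+
+```latex
+\begin{align}
+\dot{\mu}(t) &= -\xi(t) && \text{(toy accumulator, an assumption)} \\
+j\omega\,\mu(\omega) &= -\xi(\omega) && \text{(Fourier transform of both sides)} \\
+|\mu(\omega)| &= \frac{|\xi(\omega)|}{\omega} && \text{(amplitude, for } \omega > 0\text{)}
+\end{align}
+```
+
+Under this assumption, a prediction error component at 60 Hz contributes one third as much
+to the expectation as a component of equal amplitude at 20 Hz.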
+
+Figure 6 provides a schematic illustration of the spectral asymmetry predicted by Equation
+2. Note that predictions about the relative amplitudes of high and low frequencies in
+superficial and deep layers pertain to all frequencies – there is nothing in predictive coding
+per se to suggest characteristic frequencies in the gamma and beta ranges. However, one
+might speculate that the characteristic frequencies of canonical microcircuits have evolved to
+model and – through active inference – create the sensorium (Friston, 2010; Berkes et al.,
+2011; Engbert et al., 2011). Indeed, there is empirical evidence to support this notion in the
+visual (Lakatos et al., 2008; Meirovithz et al., 2012; Bosman et al., 2009) and motor
+(Gwin and Ferris, 2012) domains.
+
+In summary, predictions are formed by a linear accumulation of prediction errors.
+Conversely, prediction errors are nonlinear functions of predictions. This means that the
+conversion of prediction errors into predictions (Bayesian filtering) necessarily entails a loss
+of high frequencies. However, the nonlinearity in the mapping from predictions to prediction
+errors means that high frequencies can be created (consider the effect of squaring a sine
+wave, which would convert beta into gamma). In short, prediction errors should express
+higher frequencies than the predictions that accumulate them. This is another example of a
+potentially important functional asymmetry between feedforward and feedback message
+passing that emerges under predictive coding. It is particularly interesting given recent
+evidence that feedforward connections may use higher frequencies than feedback
+connections (Bosman et al., 2012).
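+
+The sine-wave example mentioned above can be written out explicitly. Squaring is used here
+only as a stand-in for a generic nonlinearity, and the 20 Hz and 40 Hz values are
+illustrative assumptions for beta and gamma frequencies respectively.
+
+```latex
+\begin{align}
+\sin^{2}(\omega t) &= \frac{1 - \cos(2\omega t)}{2} \\
+\omega = 2\pi \cdot 20\ \text{Hz} \;&\Rightarrow\; 2\omega = 2\pi \cdot 40\ \text{Hz}
+\end{align}
+```
+
+The squared signal therefore contains a constant offset plus a component at twice the
+original frequency, so a beta-range input yields a gamma-range output in this sketch.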
+
+Conclusion
+
+In conclusion, there is a remarkable correspondence between the anatomy and physiology of
+the canonical microcircuit and the formal constraints implied by generalised predictive
+coding. Having said this, there are many variations on the mapping between computational
+and neuronal architectures: Even if predictive coding is an appropriate implementation of
+Bayesian filtering, there are many variations on the arrangement shown in Figure 5. For
+example, feedback connections could arise directly from cells encoding conditional
+expectations in supragranular layers. Indeed, there is emerging evidence that feedback
+connections between proximate hierarchical levels originate from both deep and superficial
+layers (Markov et al., 2011). Note that this putative splitting of extrinsic streams is only
+predicted in the light of empirical constraints on intrinsic connectivity.
+
+One of our motivations – for considering formal constraints on connectivity – was to
+produce dynamic causal models of canonical microcircuits. Dynamic causal modelling
+enables one to compare different connectivity models, using empirical electrophysiological
+responses (David et al., 2006; Moran et al., 2008, 2011). This form of modelling rests upon
+Bayesian model comparison and allows one to assess the evidence for one microcircuit
+relative to another. In principle, this provides a way to evaluate different microcircuit
+models, in terms of their ability to explain observed activity. One might imagine that the
+particular circuits for predictive coding presented in this paper will be nuanced as more
+anatomical and physiological information becomes available. The ability to compare
+competing models or microcircuits – using optogenetics, local field potentials and
+electroencephalography – may be important for refining neurobiologically informed
+microcircuits. In short, many of the predictions and assumptions we have made about the
+specific form of the microcircuit for predictive coding may be testable in the near future.
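+
+As a generic illustration of the Bayesian model comparison mentioned above, the evidence
+for one model relative to another can be summarised by a log Bayes factor. Here m_1 and m_2
+are hypothetical labels for two candidate microcircuit models and y denotes the observed
+electrophysiological responses; equal prior probabilities for the two models are assumed.
+
+```latex
+\begin{align}
+\ln B_{12} &= \ln p(y \mid m_1) - \ln p(y \mid m_2) \\
+p(m_1 \mid y) &= \frac{1}{1 + \exp(-\ln B_{12})} \\
+\ln B_{12} = 3 \;&\Rightarrow\; B_{12} \approx 20, \quad p(m_1 \mid y) \approx 0.95
+\end{align}
+```
+
+In this sketch, a difference of three in log evidence corresponds to odds of about 20 to 1
+in favour of the first model.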
+
+Acknowledgments
+
+This work was supported by the Wellcome Trust and the NSF Graduate Research Fellowship under Grant No.
+2009090358 to A.M.B. Support was also provided by NIH grants MH055714 (G.R.M.) and EY013588 (W.M.U.),
+and NSF grant 1228535 (G.R.M. and W.M.U.). The authors would like to thank Julien Vezoli, Will Penny, Dimitris
+Pinotsis, Stewart Shipp, Vladimir Litvak, Conrado Bosman, Laurent Perrinet and Henry Kennedy for helpful
+discussions. We would also like to thank our reviewers for helpful comments and guidance.
+
+References
+
+Alink A, Schwiedrzik CM, Kohler A, Singer W, Muckli L. Stimulus predictability reduces responses
+in primary visual cortex. J. Neurosci. 2010; 30:2960–2966. [PubMed: 20181593]
+Alitto HJ, Usrey WM. Origin and dynamics of extraclassical suppression in the lateral geniculate
+nucleus of the macaque monkey. Neuron. 2008; 57:135–146. [PubMed: 18184570]
+Alonso JM. Book Review: Neural Connections and Receptive Field Properties in the Primary Visual
+Cortex. The Neuroscientist. 2002; 8:443. [PubMed: 12374429]
+Alonso JM, Martinez LM. Functional connectivity between simple cells and complex cells in cat
+striate cortex. Nat. Neurosci. 1998; 1:395–403. [PubMed: 10196530]
+Anderson JC, Kennedy H, Martin KA. Pathways of Attention: Synaptic Relationships of Frontal Eye
+Field to V4, Lateral Intraparietal Cortex, and Area 46 in Macaque Monkey. The Journal of
+Neuroscience. 2011; 31:10872. [PubMed: 21795539]
+Anderson JC, Martin KAC. Synaptic connection from cortical area V4 to V2 in macaque monkey. The
+Journal of Comparative Neurology. 2006; 495:709–721. [PubMed: 16506191]
+Armstrong, KM.; Schafer, RJ.; Chang, MH.; Moore, T. Attention and action in the frontal eye field. In
+The Neuroscience of Attention: Attentional Control and Selection. Mangun, GR., editor. Oxford
+University Press; New York: 2012. p. 151-166.
+Arnal LH, Wyart V, Giraud A-L. Transitions in neural oscillations reflect prediction errors generated
+in audiovisual speech. Nature Neuroscience. 2011; 14:797–801.
+Atick JJ. Could information theory provide an ecological theory of sensory processing? Network:
+Computation in Neural Systems. 1992; 3:213–251.
+Attneave F. Some informational aspects of visual perception. Psychol Rev. 1954; 61:183–193.
+[PubMed: 13167245]
+Bair W, Cavanaugh JR, Movshon JA. Time course and time-distance relationships for surround
+suppression in macaque V1 neurons. The Journal of Neuroscience. 2003; 23:7690. [PubMed:
+12930809]
+Barceló F, Suwazono S, Knight RT. Prefrontal modulation of visual processing in humans. Nat.
+Neurosci. 2000; 3:399–403. [PubMed: 10725931]
+Barlow, HB. Possible principles underlying the transformations of sensory messages. In: Rosenblith,
+WA., editor. Sensory Communication. MIT Press; Cambridge, MA: 1961. p. 217-234.
+Barone P, Batardiere A, Knoblauch K, Kennedy H. Laminar distribution of neurons in extrastriate
+areas projecting to visual areas V1 and V4 correlates with the hierarchical rank and indicates the
+operation of a distance rule. The Journal of Neuroscience. 2000; 20:3263. [PubMed: 10777791]
+Berkes P, Orban G, Lengyel M, Fiser J. Spontaneous Cortical Activity Reveals Hallmarks of an
+Optimal Internal Model of the Environment. Science. 2011; 331:83–87. [PubMed: 21212356]
+Bollimunta A, Mo J, Schroeder CE, Ding M. Neuronal Mechanisms and Attentional Modulation of
+Corticothalamic Alpha Oscillations. The Journal of Neuroscience. 2011; 31:4935. [PubMed:
+21451032]
+Bosman CA, Schoffelen J-M, Brunet N, Oostenveld R, Bastos AM, Womelsdorf T, Rubehn B,
+Stieglitz T, De Weerd P, Fries P. Attentional Stimulus Selection through Selective
+Synchronization between Monkey Visual Areas. Neuron. 2012; 75:875–888. [PubMed: 22958827]
+Briggs F, Callaway EM. Laminar patterns of local excitatory input to layer 5 neurons in macaque
+primary visual cortex. Cerebral Cortex. 2005; 15:479. [PubMed: 15319309]
+Briggs F, Usrey WM. Parallel processing in the corticogeniculate pathway of the macaque monkey.
+Neuron. 2009; 62:135–146. [PubMed: 19376073]
+Buchsbaum G, Gottschalk A. Trichromacy, opponent colours coding and optimum colour information
+transmission in the retina. Proc. R. Soc. Lond., B, Biol. Sci. 1983; 220:89–113. [PubMed:
+6140684]
+Buffalo EA, Fries P, Landman R, Buschman TJ, Desimone R. Laminar differences in gamma and
+alpha coherence in the ventral stream. Proceedings of the National Academy of Sciences. 2011;
+108:11262.
+Bullier J, Henry GH. Ordinal position and afferent input of neurons in monkey striate cortex. J. Comp.
+Neurol. 1980; 193:913–935. [PubMed: 6253535]
+Bullier J, Hupé JM, James A, Girard P. Functional interactions between areas V1 and V2 in the
+monkey. Journal of Physiology-Paris. 1996; 90:217–220.
+Buschman TJ, Miller EK. Top-Down Versus Bottom-Up Control of Attention in the Prefrontal and
+Posterior Parietal Cortices. Science. 2007; 315:1860–1862. [PubMed: 17395832]
+Buzsáki G. Neural Syntax: Cell Assemblies, Synapsembles, and Readers. Neuron. 2010; 68:362–385.
+[PubMed: 21040841]
+Callaway EM. Local circuits in primary visual cortex of the macaque monkey. Annual Review of
+Neuroscience. 1998; 21:47–74.
+Callaway EM, Wiser AK. Contributions of individual layer 2-5 spiny neurons to local circuits in
+macaque primary visual cortex. Vis. Neurosci. 1996; 13:907–922. [PubMed: 8903033]
+Canolty RT, Ganguly K, Kennerley SW, Cadieu CF, Koepsell K, Wallis JD, Carmena JM. Oscillatory
+phase coupling coordinates anatomically dispersed functional cell assemblies. Proceedings of the
+National Academy of Sciences. 2010; 107:17356.
+Chu Z, Galarreta M, Hestrin S. Synaptic interactions of late-spiking neocortical neurons in layer 1. The
+Journal of Neuroscience. 2003; 23:96. [PubMed: 12514205]
+Covic EN, Sherman SM. Synaptic properties of connections between the primary and secondary
+auditory cortices in mice. Cereb. Cortex. 2011; 21:2425–2441. [PubMed: 21385835]
+Crick F. Function of the thalamic reticular complex: the searchlight hypothesis. Proceedings of the
+National Academy of Sciences of the United States of America. 1984; 81:4586. [PubMed:
+6589612]
+Crick F, Koch C. Constraints on cortical and thalamic projections: the no-strong-loops hypothesis.
+Nature. 1998; 391:245–250. [PubMed: 9440687]
+Dan Y, Atick JJ, Reid RC. Efficient coding of natural scenes in the lateral geniculate nucleus:
+experimental test of a computational theory. The Journal of Neuroscience. 1996; 16:3351–3362.
+[PubMed: 8627371]
+David O, Kiebel SJ, Harrison LM, Mattout J, Kilner JM, Friston KJ. Dynamic causal modeling of
+evoked responses in EEG and MEG. NeuroImage. 2006; 30:1255–1272. [PubMed: 16473023]
+Dayan P, Hinton GE, Neal RM, Zemel RS. The Helmholtz machine. Neural Comput. 1995; 7:889–
+904. [PubMed: 7584891]
+Desimone R. Neural mechanisms for visual memory and their role in attention. Proc. Natl. Acad. Sci.
+U.S.A. 1996; 93:13494–13499. [PubMed: 8942962]
+Douglas RJ, Martin K. A functional microcircuit for cat visual cortex. The Journal of Physiology.
+1991; 440:735. [PubMed: 1666655]
+Douglas RJ, Martin KA, Whitteridge D. A canonical microcircuit for neocortex. Neural Computation.
+1989; 1:480–488.
+Douglas RJ, Martin KAC. Neuronal Circuits of the Neocortex. Annu. Rev. Neurosci. 2004; 27:419–
+451. [PubMed: 15217339]
+Engbert R, Mergenthaler K, Sinn P, Pikovsky A. PNAS Plus: An integrated model of fixational eye
+movements and microsaccades. Proceedings of the National Academy of Sciences. 2011;
+108:E765–E770.
+Engel, AK.; Friston, KJ.; Kelso, JA.; König, P.; Kovács, I.; MacDonald, A.; Miller, EK.; Phillips,
+WA.; Silverstein, SM.; Tallon-Baudry, C., et al. Coordination in Behavior and Cognition. In: von
+der Malsburg, C.; Phillips, WA.; Singer, W., editors. Dynamic Coordination in the Brain: From
+Neurons to Mind. MIT Press; Cambridge, MA: 2010. p. 267-299.
+Feldman H, Friston KJ. Attention, uncertainty, and free-energy. Front Hum Neurosci. 2010; 4:215.
+[PubMed: 21160551]
+Felleman DJ, Van Essen DC. Distributed hierarchical processing in the primate cerebral cortex. Cereb.
+Cortex. 1991; 1:1–47. [PubMed: 1822724]
+Fitzpatrick D, Usrey WM, Schofield BR, Einstein G. The sublaminar organization of corticogeniculate
+neurons in layer 6 of macaque striate cortex. Vis. Neurosci. 1994; 11:307–315. [PubMed:
+7516176]
+Fries P. A mechanism for cognitive dynamics: neuronal communication through neuronal coherence.
+Trends in Cognitive Sciences. 2005; 9:474–480. [PubMed: 16150631]
+Fries P, Reynolds JH, Rorie AE, Desimone R. Modulation of oscillatory neuronal synchronization by
+selective visual attention. Science. 2001; 291:1560–1563. [PubMed: 11222864]
+Friston K. Hierarchical Models in the Brain. PLoS Comput Biol. 2008; 4:e1000211. [PubMed:
+18989391]
+Friston K. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience. 2010;
+11:127–138.
+Fujisawa S, Buzsáki G. A 4 Hz Oscillation Adaptively Synchronizes Prefrontal, VTA, and
+Hippocampal Activities. Neuron. 2011; 72:153–165. [PubMed: 21982376]
+Garrido MI, Kilner JM, Kiebel SJ, Friston KJ. Evoked brain responses are generated by feedback
+loops. Proc. Natl. Acad. Sci. U.S.A. 2007; 104:20961–20966. [PubMed: 18087046]
+Garrido MI, Kilner JM, Stephan KE, Friston KJ. The mismatch negativity: a review of underlying
+mechanisms. Clinical Neurophysiology. 2009; 120:453–463. [PubMed: 19181570]
+Gentet LJ, Kremer Y, Taniguchi H, Huang ZJ, Staiger JF, Petersen CCH. Unique functional properties
+of somatostatin-expressing GABAergic neurons in mouse barrel cortex. Nature Neuroscience.
+2012; 15:607–612.
+George D, Hawkins J. Towards a mathematical theory of cortical micro-circuits. PLoS Computational
+Biology. 2009; 5:e1000532. [PubMed: 19816557]
+Gilbert CD, Wiesel TN. Functional organization of the visual cortex. Prog. Brain Res. 1983; 58:209–
+218. [PubMed: 6138809]
+Girard P, Bullier J. Visual activity in area V2 during reversible inactivation of area 17 in the macaque
+monkey. Journal of Neurophysiology. 1989; 62:1287. [PubMed: 2600626]
+Girard P, Salin PA, Bullier J. Visual activity in areas V3a and V3 during reversible inactivation of area
+V1 in the macaque monkey. Journal of Neurophysiology. 1991a; 66:1493. [PubMed: 1765790]
+Girard P, Salin PA, Bullier J. Visual activity in macaque area V4 depends on area 17 input.
+Neuroreport. 1991b; 2:81–84. [PubMed: 1883988]
+Girard P, Salin PA, Bullier J. Response selectivity of neurons in area MT of the macaque monkey
+during reversible inactivation of area V1. Journal of Neurophysiology. 1992; 67:1437. [PubMed:
+1629756]
+Gray CM, König P, Engel AK, Singer W. Oscillatory responses in cat visual cortex exhibit inter-
+columnar synchronization which reflects global stimulus properties. Nature. 1989; 338:334–337.
+[PubMed: 2922061]
+Gregoriou GG, Gotts SJ, Zhou H, Desimone R. High-Frequency, Long-Range Coupling Between
+Prefrontal and Visual Cortex During Attention. Science. 2009; 324:1207–1210. [PubMed:
+19478185]
+Gregory RL. Perceptual illusions and brain models. Proc. R. Soc. Lond., B. Biol. Sci. 1968; 171:279–
+296. [PubMed: 4387405]
+Gregory RL. Perceptions as hypotheses. Philos. Trans. R. Soc. Lond., B. Biol. Sci. 1980; 290:181–197.
+[PubMed: 6106237]
+Gwin JT, Ferris DP. Beta- and gamma-range human lower limb corticomuscular coherence. Front
+Hum Neurosci. 2012; 6:258. [PubMed: 22973219]
+Haeusler S, Maass W. A statistical analysis of information-processing properties of lamina-specific
+cortical microcircuit models. Cerebral Cortex. 2007; 17:149. [PubMed: 16481565]
+Harrison LM, Stephan KE, Rees G, Friston KJ. Extra-classical receptive field effects measured in
+striate cortex with fMRI. Neuroimage. 2007; 34:1199–1208. [PubMed: 17169579]
+Häusser M, Mel B. Dendrites: bug or feature? Current Opinion in Neurobiology. 2003; 13:372–383.
+[PubMed: 12850223]
+Hebb, DO. The Organization of Behavior: A Neuropsychological Theory. Wiley; New York: 1949.
+Helmholtz, H. Handbuch der Physiologischen Optik. English translation. Dover; New York: 1860.
+Hinton G, van Camp D. Keeping neural networks simple by minimizing the description length of
+weights. Proceedings of COLT-93. 1993:5–13.
+Hinton, GE.; Zemel, RS. Autoencoders, Minimum Description Length, and Helmholtz Free Energy.
+In: Cowan, JD.; Tesauro, G.; Alspector, J., editors. Advances in Neural Information Processing
+Systems 6. Morgan Kaufmann; San Mateo, CA: 1994.
+Hopfinger JB, Buonocore MH, Mangun GR. The neural mechanisms of top-down attentional control.
+Nature Neuroscience. 2000; 3:284–291.
+Horton JC, Adams DL. The cortical column: a structure without a function. Philosophical Transactions
+of the Royal Society B: Biological Sciences. 2005; 360:837–862.
+Hubel DH, Wiesel TN. Receptive fields, binocular interaction and functional architecture in the cat's
+visual cortex. The Journal of Physiology. 1962; 160:106. [PubMed: 14449617]
+Hubel DH, Wiesel TN. Laminar and columnar distribution of geniculo-cortical fibers in the macaque
+monkey. The Journal of Comparative Neurology. 1972; 146:421–450. [PubMed: 4117368]
+Hubel DH, Wiesel TN. Sequence regularity and geometry of orientation columns in the monkey striate
+cortex. J. Comp. Neurol. 1974; 158:267–293. [PubMed: 4436456]
+Hupé JM, James AC, Payne BR, Lomber SG, Girard P, Bullier J. Cortical feedback improves
+discrimination between figure and background by V1, V2 and V3 neurons. Nature. 1998;
+394:784–787. [PubMed: 9723617]
+Johnson RR, Burkhalter A. Microcircuitry of forward and feedback connections within rat visual
+cortex. J. Comp. Neurol. 1996; 368:383–398. [PubMed: 8725346]
+Johnson RR, Burkhalter A. A polysynaptic feedback circuit in rat visual cortex. The Journal of
+Neuroscience. 1997; 17:7129. [PubMed: 9278547]
+Kätzel D, Zemelman BV, Buetfering C, Wölfel M, Miesenböck G. The columnar and laminar
+organization of inhibitory connections to neocortical excitatory cells. Nature Neuroscience. 2010;
+14:100–107.
+Kay JW, Phillips WA. Coherent Infomax as a computational goal for neural systems. Bull. Math. Biol.
+2011; 73:344–372. [PubMed: 20821064]
+Klampfl S, Legenstein R, Maass W. Spiking neurons can learn to solve information bottleneck
+problems and extract independent components. Neural Computation. 2009; 21:911–959. [PubMed:
+19018708]
+Knight RT. Decreased response to novel stimuli after prefrontal lesions in man. Electroencephalogr
+Clin Neurophysiol. 1984; 59:9–20. [PubMed: 6198170]
+Knight RT, Scabini D, Woods DL. Prefrontal cortex gating of auditory transmission in humans. Brain
+Res. 1989; 504:338–342. [PubMed: 2598034]
+Knill DC, Pouget A. The Bayesian brain: the role of uncertainty in neural coding and computation.
+Trends in Neurosciences. 2004; 27:712–719. [PubMed: 15541511]
+de Kock CPJ, Bruno RM, Spors H, Sakmann B. Layer- and cell-type-specific suprathreshold stimulus
+representation in rat primary somatosensory cortex. J. Physiol. (Lond.). 2007; 581:139–154.
+[PubMed: 17317752]
+Kok P, Rahnev D, Jehee JFM, Lau HC, de Lange FP. Attention Reverses the Effect of Prediction in
+Silencing Sensory Signals. Cerebral Cortex. 2011; 22:2197–2206. [PubMed: 22047964]
+Lakatos P, Karmos G, Mehta AD, Ulbert I, Schroeder CE. Entrainment of neuronal oscillations as a
+mechanism of attentional selection. Science. 2008; 320:110–113. [PubMed: 18388295]
+Larkum ME, Nevian T, Sandler M, Polsky A, Schiller J. Synaptic Integration in Tuft Dendrites of
+Layer 5 Pyramidal Neurons: A New Unifying Principle. Science. 2009; 325:756–760. [PubMed:
+19661433]
+Linsker R. Perceptual neural organization: some approaches based on network models and information
+theory. Annual Review of Neuroscience. 1990; 13:257–281.
+Livingstone MS. Oscillatory firing and interneuronal correlations in squirrel monkey striate cortex.
+Journal of Neurophysiology. 1996; 75:2467–2485. [PubMed: 8793757]
+London M, Häusser M. Dendritic computation. Annu. Rev. Neurosci. 2005; 28:503–532. [PubMed:
+16033324]
+Lopes-dos-Santos V, Conde-Ocazionez S, Nicolelis MAL, Ribeiro ST, Tort ABL. Neuronal Assembly
+Detection and Cell Membership Specification by Principal Component Analysis. PLoS ONE.
+2011; 6:e20996. [PubMed: 21698248]
+MacKay, DM. Automata Studies. Shannon, CE.; McCarthy, J., editors. Princeton Univ. Press;
+Princeton, NJ: 1956. p. 235-251.
+Maier A, Adams GK, Aura C, Leopold DA. Distinct superficial and deep laminar domains of activity
+in the visual cortex during rest and stimulation. Frontiers in Systems Neuroscience. 2010; 4
+Markov NT, Misery P, Falchier A, Lamy C, Vezoli J, Quilodran R, Gariel MA, Giroud P, Ercsey-
+Ravasz M, Pilaz LJ, et al. Weight consistency specifies regularities of macaque cortical networks.
+Cerebral Cortex. 2011; 21:1254. [PubMed: 21045004]
+Meirovithz E, Ayzenshtat I, Werner-Reiss U, Shamir I, Slovin H. Spatiotemporal Effects of
+Microsaccades on Population Activity in the Visual Cortex of Monkeys during Fixation. Cerebral
+Cortex. 2011; 22:294–307. [PubMed: 21653284]
+Melzer S, Michael M, Caputi A, Eliava M, Fuchs EC, Whittington MA, Monyer H. Long-Range-
+Projecting GABAergic Neurons Modulate Inhibition in Hippocampus and Entorhinal Cortex.
+Science. 2012; 335:1506–1510. [PubMed: 22442486]
+Meyer HS, Schwarz D, Wimmer VC, Schmitt AC, Kerr JND, Sakmann B, Helmstaedter M. Inhibitory
+interneurons in a cortical column form hot zones of inhibition in layers 2 and 5A. Proceedings of
+the National Academy of Sciences. 2011; 108:16807–16812.
+Meyer T, Olson CR. Statistical learning of visual transitions in monkey inferotemporal cortex.
+Proceedings of the National Academy of Sciences. 2011; 108:19401–19406.
+Mignard M, Malpeli JG. Paths of information flow through visual cortex. Science. 1991; 251:1249.
+[PubMed: 1848727]
+Moran RJ, Stephan KE, Kiebel SJ, Rombach N, O'Connor WT, Murphy KJ, Reilly RB, Friston KJ.
+Bayesian estimation of synaptic physiology from the spectral responses of neural masses.
+Neuroimage. 2008; 42:272–284. [PubMed: 18515149]
+Moran RJ, Symmonds M, Stephan KE, Friston KJ, Dolan RJ. An in vivo assay of synaptic function
+mediating human cognition. Curr. Biol. 2011; 21:1320–1325. [PubMed: 21802302]
+Mountcastle VB. Modality and topographic properties of single neurons of cat's somatic sensory
+cortex. Journal of Neurophysiology. 1957; 20:408. [PubMed: 13439410]
+Mountcastle VB. The columnar organization of the neocortex. Brain. 1997; 120:701. [PubMed:
+9153131]
+Mumford D. On the computational architecture of the neocortex. II. The role of cortico-cortical loops.
+Biol Cybern. 1992; 66:241–251. [PubMed: 1540675]
+Murphy PC, Sillito AM. Corticofugal feedback influences the generation of length tuning in the visual
+pathway. Nature. 1987; 329:727–729. [PubMed: 3670375]
+Murray SO. Spatially Specific fMRI Repetition Effects in Human Visual Cortex. Journal of
+Neurophysiology. 2005; 95:2439–2445. [PubMed: 16394067]
+Murray SO, Kersten D, Olshausen BA, Schrater P, Woods DL. Shape perception reduces activity in
+human primary visual cortex. Proc. Natl. Acad. Sci. U.S.A. 2002; 99:15164–15169. [PubMed:
+12417754]
+Neisser, U. Cognitive psychology. Appleton-Century-Crofts; New York: 1967.
+de No, LR. Cerebral cortex: architecture, intracortical connections, motor projections. In: Fulton, JF.,
+editor. Physiology of the Nervous System. Oxford University Press; Oxford: 1949. p. 288-330.
+Olsen SR, Bortone DS, Adesnik H, Scanziani M. Gain control by layer six in cortical circuits of vision.
+Nature. 2012
+De Pasquale R, Sherman SM. Synaptic properties of corticocortical connections between the primary
+and secondary visual cortical areas in the mouse. J. Neurosci. 2011; 31:16494–16506. [PubMed:
+22090476]
+Raizada RDS, Grossberg S. Towards a theory of the laminar architecture of cerebral cortex:
+Computational clues from the visual system. Cerebral Cortex. 2003; 13:100–113. [PubMed:
+12466221]
+Rao RP, Ballard DH. Predictive coding in the visual cortex: a functional interpretation of some extra-
+classical receptive-field effects. Nature Neuroscience. 1999; 2:79–87.
+Roopun AK, Kramer MA, Carracedo LM, Kaiser M, Davies CH, Traub RD, Kopell NJ, Whittington
+MA. Period concatenation underlies interactions between gamma and beta rhythms in neocortex.
+Front Cell Neurosci. 2008; 2:1. [PubMed: 18946516]
+Roopun AK, Middleton SJ, Cunningham MO, LeBeau FE, Bibbig A, Whittington MA, Traub RD. A
+beta2-frequency (20–30 Hz) oscillation in nonsynaptic networks of somatosensory cortex.
+Proceedings of the National Academy of Sciences. 2006; 103:15646.
+Saalmann YB, Pigarev IN, Vidyasagar TR. Neural Mechanisms of Visual Attention: How Top-Down
+Feedback Highlights Relevant Locations. Science. 2007; 316:1612–1615. [PubMed: 17569863]
+Saalmann YB, Pinsk MA, Wang L, Li X, Kastner S. The Pulvinar Regulates Information Transmission
+Between Cortical Areas Based on Attention Demands. Science. 2012; 337:753–756. [PubMed:
+22879517]
+Sakata S, Harris KD. Laminar Structure of Spontaneous and Sensory-Evoked Population Activity in
+Auditory Cortex. Neuron. 2009; 64:404–418. [PubMed: 19914188]
+Salin PA, Bullier J. Corticocortical connections in the visual system: structure and function. Physiol.
+Rev. 1995; 75:107–154. [PubMed: 7831395]
+Sandell JH, Schiller PH. Effect of cooling area 18 on striate cortex cells in the squirrel monkey.
+Journal of Neurophysiology. 1982; 48:38. [PubMed: 6288886]
+Sherman SM, Guillery R. On the actions that one nerve cell can have on another: distinguishing
+“drivers” from “modulators.” Proceedings of the National Academy of Sciences of the United
+States of America. 1998; 95:7121. [PubMed: 9618549]
+Sherman SM, Guillery RW. Distinct functions for direct and transthalamic corticocortical connections.
+Journal of Neurophysiology. 2011; 106:1068–1077. [PubMed: 21676936]
+Shipp S. Structure and function of the cerebral cortex. Current Biology. 2007; 17:443–449.
+Shlosberg D, Amitai Y, Azouz R. Time-Dependent, Layer-Specific Modulation of Sensory Responses
+Mediated by Neocortical Layer 1. Journal of Neurophysiology. 2006; 96:3170–3182. [PubMed:
+17110738]
+Sillito AM, Cudeiro J, Murphy PC. Orientation sensitive elements in the corticofugal influence on
+centre-surround interactions in the dorsal lateral geniculate nucleus. Exp Brain Res. 1993; 93:6–
+16. [PubMed: 8467892]
+Sincich LC, Horton JC. The circuitry of V1 and V2: integration of color, form, and motion. Annu.
+Rev. Neurosci. 2005; 28:303–326. [PubMed: 16022598]
+Singer W. Neuronal synchrony: a versatile code for the definition of relations? Neuron. 1999; 24:49–
+65. 111–125. [PubMed: 10677026]
+Singer W, Engel AK, Kreiter AK, Munk MH, Neuenschwander S, Roelfsema PR. Neuronal
+assemblies: necessity, signature and detectability. Trends Cogn. Sci. (Regul. Ed.). 1997; 1:252–
+261. [PubMed: 21223920]
+Spratling MW. Reconciling predictive coding and biased competition models of cortical function.
+Front Comput Neurosci. 2008; 2:4. [PubMed: 18978957]
+Srinivasan MV, Laughlin SB, Dubs A. Predictive coding: a fresh view of inhibition in the retina. Proc.
+R. Soc. Lond., B, Biol. Sci. 1982; 216:427–459. [PubMed: 6129637]
+Summerfield C, Trittschuh EH, Monti JM, Mesulam M-M, Egner T. Neural repetition suppression
+reflects fulfilled perceptual expectations. Nature Neuroscience. 2008; 11:1004–1006.
+Summerfield C, Wyart V, Johnen VM, de Gardelle V. Human Scalp Electroencephalography Reveals
+that Repetition Suppression Varies with Expectation. Front Hum Neurosci. 2011; 5:67. [PubMed:
+21847378]
+Theyel BB, Llano DA, Sherman SM. The corticothalamocortical circuit drives higher-order cortex in
+the mouse. Nature Neuroscience. 2009; 13:84–88.
+Thomson AM, Bannister AP. Interlaminar connections in the neocortex. Cerebral Cortex. 2003; 13:5–
+14. [PubMed: 12466210]
+Todorovic A, van Ede F, Maris E, de Lange FP. Prior Expectation Mediates Neural Adaptation to
+Repeated Sounds in the Auditory Cortex: An MEG Study. Journal of Neuroscience. 2011;
+31:9118–9123. [PubMed: 21697363]
+Ullman S. Sequence seeking and counter streams: a computational model for bidirectional information
+flow in the visual cortex. Cereb. Cortex. 1995; 5:1–11. [PubMed: 7719126]
+Usrey WM, Fitzpatrick D. Specificity in the axonal connections of layer VI neurons in tree shrew
+striate cortex: evidence for distinct granular and supragranular systems. The Journal of
+Neuroscience. 1996; 16:1203. [PubMed: 8558249]
+Varela F, Lachaux JP, Rodriguez E, Martinerie J. The brainweb: phase synchronization and large-scale
+integration. Nature Reviews Neuroscience. 2001; 2:229–239.
+Vezoli J. Quantitative Analysis of Connectivity in the Visual Cortex: Extracting Function from
+Structure. The Neuroscientist. 2004; 10:476–482. [PubMed: 15359013]
+Vicente R, Gollo LL, Mirasso CR, Fischer I, Pipa G. Dynamical relaying can yield zero time lag
+neuronal synchrony despite long conduction delays. Proceedings of the National Academy of
+Sciences. 2008; 105:17157.
+Wacongne C, Labyt E, van Wassenhove V, Bekinschtein T, Naccache L, Dehaene S. Evidence for a
+hierarchy of predictions and prediction errors in human cortex. Proceedings of the National
+Academy of Sciences. 2011; 108:20754–20759.
+Wang X-J. Neurophysiological and computational principles of cortical rhythms in cognition. Physiol.
+Rev. 2010; 90:1195–1268. [PubMed: 20664082]
+Weiler N, Wood L, Yu J, Solla SA, Shepherd GMG. Top-down laminar organization of the excitatory
+network in motor cortex. Nat Neurosci. 2008; 11:360–366. [PubMed: 18246064]
+Wozny C, Williams SR. Specificity of Synaptic Connectivity between Layer 1 Inhibitory Interneurons
+and Layer 2/3 Pyramidal Neurons in the Rat Neocortex. Cerebral Cortex. 2011
+Wurtz RH, McAlonan K, Cavanaugh J, Berman RA. Thalamic pathways for active vision. Trends
+Cogn. Sci. (Regul. Ed.). 2011; 15:177–184. [PubMed: 21414835]
+Wyart V, Nobre AC, Summerfield C. Dissociable prior influences of signal probability and relevance
+on visual contrast sensitivity. Proceedings of the National Academy of Sciences. 2012;
+109:3593–3598.
+Yamaguchi S, Knight RT. Gating of somatosensory input by human prefrontal cortex. Brain Res.
+1990; 521:281–288. [PubMed: 2207666]
+Yoshimura Y, Callaway EM. Fine-scale specificity of cortical networks depends on inhibitory cell
+type and connectivity. Nat. Neurosci. 2005; 8:1552–1559. [PubMed: 16222228]
+Yuille A, Kersten D. Vision as Bayesian inference: analysis by synthesis? Trends Cogn. Sci. (Regul.
+Ed.). 2006; 10:301–308. [PubMed: 16784882]
+Zeki S, Shipp S. The functional logic of cortical connections. Nature. 1988; 335:311–317. [PubMed:
+3047584]
+Zeki SM. The cortical projections of foveal striate cortex in the rhesus monkey. J. Physiol. (Lond.).
+1978; 277:227–244. [PubMed: 418174]
+
+Figure 1.
+This is a schematic of the classical microcircuit adapted from Douglas and Martin (1991).
+This minimal circuitry comprises superficial (layers 2 and 3) and deep (layers 5 and 6)
+pyramidal cells and a population of smooth inhibitory cells. Feedforward inputs – from the
+thalamus – target all cell populations, but with an emphasis on inhibitory interneurons and
+superficial and granular layers. Note the symmetrical deployment of inhibitory and
+excitatory intrinsic connections that maintain a balance of excitation and inhibition.
+
+Figure 2.
+This is a simplified schematic of the key intrinsic connections among excitatory (E) and
+inhibitory (I) populations in granular (L4), supragranular (L1/2/3) and infragranular (L5/6)
+layers. The excitatory interlaminar connections are based largely on Gilbert and Wiesel
+(1983). Forward connections denote feedforward extrinsic corticocortical or thalamocortical
+afferents that are reciprocated by backward or feedback connections. Anatomical and
+functional data suggest that afferent input enters primarily into L4 and is conveyed to
+superficial layers L2/3 that are rich in pyramidal cells, which project forward to the next
+cortical area, forming a disynaptic route between thalamus and secondary cortical areas
+(Callaway, 1998). Information from L2/3 is then sent to L5 and L6, which sends (intrinsic)
+feedback projections back to L4 (Usrey and Fitzpatrick, 1996). L5 cells originate feedback
+connections to earlier cortical areas as well as to the pulvinar, superior colliculus, and brain
+stem. In summary, forward input is segregated by intrinsic connections into a superficial
+forward stream and a deep backward stream. In this schematic, we have juxtaposed densely
+interconnected excitatory and inhibitory populations within each layer.
+
+Figure 3.
+This schematic shows an example of a generative model. Generative models describe how
+(sensory) data are caused. In this figure, sensory states (blue circles on the periphery) are
+generated by hidden variables (in the centre). The left panel shows the model as a
+probabilistic graphical model, where unknown variables (hidden causes and states) are
+associated with the nodes of a dependency graph and conditional dependencies are indicated
+by arrows. Hidden states confer memory on the model by virtue of having dynamics, while
+hidden causes connect nodes. A graphical model describes the conditional dependencies
+among hidden variables generating data. These dependencies are typically modelled as
+(differential) equations with nonlinear mappings and random fluctuations 
+ with precision
+(inverse variance) Π(i) (see the equations in the insert on the left). This allows one to specify
+the precise form of the probabilistic generative model and leads to a simple and efficient
+inversion scheme (predictive coding; see next figure). Here 
+ denotes the set of hidden
+causes that constitute the parents of sensory s̃(i) or hidden x̃(i) states. The ~ indicates states in
+generalised coordinates of motion: x̃ = (x, x′, x″,...). An intuitive version of the model is
+shown on the right: here, we imagine that a singing bird is the cause of sensations, which –
+through a cascade of dynamical hidden states – produces modality-specific consequences
+(e.g., the auditory object of a bird song and the visual object of a song bird). These
+intermediate causes are themselves (hierarchically) unpacked to generate sensory signals.
+The generative model therefore maps from causes (e.g., concepts) to consequences (e.g.,
+sensations), while its inversion corresponds to mapping from sensations to concepts or
+representations. This inversion corresponds to perceptual synthesis – in which the generative
+model is used to generate predictions. Note that this inversion implicitly resolves the binding
+problem - by explaining multisensory cues with a single cause.
+
+Figure 4.
+This figure describes the predictive coding scheme associated with a simple hierarchical
+model shown on the left. In this model each node has a single parent. The ensuing inversion
+or generalised predictive coding scheme is shown on the right. The key quantities in this
+scheme are (conditional) expectations of the hidden states and causes and their associated
+prediction errors. The basic architecture – implied by the inversion of the graphical
+(hierarchical) model – suggests that prediction errors (caused by unpredicted fluctuations in
+hidden variables) are passed up the hierarchy to update conditional expectations. These
+conditional expectations now provide predictions that are passed down the hierarchy to form
+prediction errors. We presume that the forward and backward message passing between
+hierarchical levels is mediated by extrinsic (feedforward and feedback) connections.
+Neuronal populations encoding conditional expectations and prediction errors now have to
+be deployed in a canonical microcircuit to understand the computational logic of intrinsic
+connections – within each level of the hierarchy – as shown in the next figure.
+
+Figure 5.
+The left hand panel is the canonical microcircuit based on Haeusler and Maass (2007),
+where we have removed inhibitory cells from the deep layers – because they have very little
+interlaminar connectivity. The numbers denote connection strengths (mean amplitude of
+PSPs measured at soma in mV) and connection probabilities (in parentheses) according to
+Thomson et al. (2002). The right panel shows the proposed cortical microcircuit for
+predictive coding, where the quantities of the previous figure have been associated with
+various cell types. Here, prediction error populations are highlighted in pink. Inhibitory
+connections are shown in red, while excitatory connections are in black. The dotted lines
+refer to connections that are not present in the microcircuit on the left (but see Figure 2). In
+this scheme, expectations (about causes and states) are assigned to (excitatory and
+inhibitory) interneurons in the supragranular layers, which are passed to infragranular layers.
+The corresponding prediction errors occupy granular layers, while superficial pyramidal
+cells encode prediction errors that are sent forward to the next hierarchical level. Conditional
+expectations and prediction errors on hidden causes are associated with excitatory cell types,
+while the corresponding quantities for hidden states are assigned to inhibitory cells. Dark
+circles indicate pyramidal cells. Finally, we have placed the precision of the feedforward
+prediction errors against the superficial pyramidal cells. This quantity controls the
+postsynaptic sensitivity or gain to (intrinsic and top-down) pre-synaptic inputs. We have
+previously discussed this in terms of attentional modulation, which may be intimately linked
+to the synchronisation of pre-synaptic inputs and ensuing postsynaptic responses (Fries et al.,
+2001; Feldman and Friston, 2010).
+
+Figure 6.
+This schematic illustrates the functional asymmetry between the spectral activity of
+superficial and deep cells predicted theoretically. In this illustrative example, we have
+ignored the effects of influences on the expectations of hidden causes (encoded by deep
+pyramidal cells), other than the prediction error on causes (encoded by superficial pyramidal
+cells). The lower panel shows the spectral density of deep pyramidal cell activity, given the
+spectral density of superficial pyramidal cell activity in the upper panel. The equation
+expresses the spectral density of the deep cells as a function of the spectral density of the
+superficial cells; using Equation (2). This schematic is meant to illustrate how the relative
+amounts of low (beta) and high (gamma) frequency activity in superficial and deep cells can
+be explained by the evidence accumulation implicit in predictive coding.
+
+Table 1
+
+Electrophysiological and neuroimaging findings consistent with predictive coding.
+
+| Prediction violated | Area studied | Neuronal expression of prediction error | Study |
+|---|---|---|---|
+| Learned visual object pairings | Monkey inferotemporal cortex (IT) | Enhanced firing rate | Meyer and Olson, 2011 |
+| Natural image statistics | Monkey V1, V2, V3 | Enhanced firing rate | Hupé et al., 1998; Bullier et al., 1996; Bair et al., 2003 |
+| Repetitive auditory stream | Early human auditory cortex | Enhanced Event Related Potentials (ERPs), enhanced gamma-band power | Garrido et al., 2007, 2009; Todorovic et al., 2011 |
+| Coherence of visual form and motion | Human V1, V2, V3, V4, V5/MT | Enhanced BOLD response | Murray et al 2002; Murray et al 2005; Harrison et al., 2007 |
+| Audio-visual congruence of speech | Visual and auditory cortex | Gamma-band oscillatory activity | Arnal et al., 2011 |
+| Predictability of visual stimuli as a function of attention | Human V1, V2, V3 | Enhanced BOLD response when unattended, reduced BOLD when attended | Kok et al., 2011 |
+| Hierarchical expectations in auditory sequences | Human temporal cortex | Enhanced Event Related Potentials (ERPs) | Wacongne et al., 2011 |
+| Expected repetition (or alternation) of face stimuli | FFA in fMRI, parietal and central electrodes of EEG | Enhanced BOLD response, diminished repetition suppression of ERP | Summerfield et al., 2008, 2011 |
+| Apparent motion of visual stimulus | V1 | Enhanced BOLD response | Alink et al., 2010 |
+
+Table 2
+
+The functional (computational) correlates of the anatomy and physiology of cortical hierarchies and their
+extrinsic connections.
+
+| Anatomy and physiology | Functional correlates |
+|---|---|
+| Hierarchical organisation of cortical areas (Zeki and Shipp 1988; Felleman and Van Essen, 1991; Barone et al., 2000; Vezoli, 2004) | Encoding of conditional dependencies in terms of a graphical model (Mumford, 1992; Rao and Ballard, 1999; Friston 2008). |
+| Distinct (laminar-specific) neuronal responses (Douglas et al., 1989; Douglas and Martin, 1991) | Encoding expected states of the world (superficial pyramidal cells) and prediction errors (deep pyramidal cells) (Mumford, 1992; Friston 2008). |
+| Distinct (laminar-specific) extrinsic connections (Zeki and Shipp 1988; Felleman and Van Essen, 1991; Barone et al., 2000; Vezoli, 2004; Markov et al., 2011). | Forward connections convey prediction error (from superficial pyramidal cells) and backward connections convey predictions (from deep pyramidal cells) (Mumford, 1992; Friston 2008). |
+| Reciprocal extrinsic connectivity (Zeki and Shipp 1988; Felleman and Van Essen, 1991; Barone et al., 2000; Vezoli, 2004; Markov et al., 2011) | Recurrent dynamics are intrinsically stable because they are trying to suppress prediction error (Crick and Koch 1998; Friston 2008). |
+| Feedback extrinsic connections are (driving and) modulatory (Mignard and Malpeli 1991; Bullier et al., 1996; Sherman and Guillery 1998; Covic and Sherman, 2011; De Pasquale and Sherman, 2011). | Forwards (driving) and backwards (driving and modulatory) connections mediate the (linear) influence of prediction errors and the (linear and non-linear) construction of predictions (Friston 2008; 2010). |
+| Feedback extrinsic connections are inhibitory (Murphy and Sillito, 1987; Sillito et al., 1993; Chu et al., 2003; Olsen et al. 2012; Meyer et al., 2011; Wozny and Williams, 2011). | Top-down predictions suppress or counter prediction errors produced by bottom up inputs (Mumford, 1992; Rao and Ballard, 1999; Friston 2008). |
+| Differences in neuronal dynamics of superficial and deep layers (de Kock et al., 2007; Sakata and Harris, 2009; Maier et al., 2010; Bollimunta et al., 2011; Buffalo et al., 2011). | Principal cells elaborating predictions (deep pyramidal cells) may show distinct (low-pass) dynamics, relative to those encoding error (superficial pyramidal cells) (Friston 2008). |
+| Dense intrinsic and horizontal connectivity (Thomson and Bannister, 2003; Katzel et al., 2010). | Lateral predictions and prediction errors mediating winnerless competition and competitive lateral dependencies (Desimone, 1996; Friston 2010). |
+| Predominance of nonlinear synaptic (dendritic and neuromodulatory) infrastructure in superficial layers (Häusser and Mel, 2003; London and Häusser, 2005; Gentet et al., 2012). | Required to scale prediction errors, in proportion to their precision, affording a form of cortical bias or gain control that encodes uncertainty (Feldman and Friston 2010; Spratling, 2008) |
+
diff --git a/papers/project_paper_2_neuroscience/references/Bastos2012_Placeholder.md b/papers/project_paper_2_neuroscience/references/Bastos2012_Placeholder.md
deleted file mode 100644
index 3b1ac7e1..00000000
--- a/papers/project_paper_2_neuroscience/references/Bastos2012_Placeholder.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Canonical microcircuits for predictive coding (Bastos 2012)
-
-This reference defines the anatomical pathways of the cortical microcircuit (L2/3, L4, L5, L6) and how they implement active inference.
-Due to copyright and its format, the full PDF is not hosted in this repository.
-
-**Citation:**
-Bastos, A. M. et al. (2012). *Neuron* **76**, 695.
diff --git a/papers/project_paper_3_darwinism/references/Schlosshauer2007.pdf b/papers/project_paper_3_darwinism/references/Schlosshauer2007.pdf
new file mode 100644
index 00000000..6b0ad828
--- /dev/null
+++ b/papers/project_paper_3_darwinism/references/Schlosshauer2007.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:eecf513088343ae57d312f1892d2e7069214be969b54f31fc64e14339bafd4cd
+size 2028534
diff --git a/papers/project_paper_3_darwinism/references/Schlosshauer2007.txt b/papers/project_paper_3_darwinism/references/Schlosshauer2007.txt
new file mode 100644
index 00000000..7f1a1127
--- /dev/null
+++ b/papers/project_paper_3_darwinism/references/Schlosshauer2007.txt
@@ -0,0 +1,3798 @@
+The quantum-to-classical transition and decoherence
+
+Maximilian Schlosshauer
+Department of Physics, University of Portland, 5000 North Willamette Boulevard, Portland, Oregon 97203, USA
+
+I give a pedagogical overview of decoherence and its role in providing a dynamical account of the
+quantum-to-classical transition. The formalism and concepts of decoherence theory are reviewed,
+followed by a survey of master equations and decoherence models. I also discuss methods for
+mitigating decoherence in quantum information processing and describe selected experimental
+investigations of decoherence processes.
+Note: Please see arXiv:1911.06282 [quant-ph] (published as Phys. Rep. 831, 1–57, 2019) for
+a much more extensive and up-to-date review of decoherence.
+
+CONTENTS
+
+I. Introduction
+II. Basic formalism and concepts
+   A. Decoherence and interference damping
+   B. Environmental monitoring and information transfer
+   C. Environment-induced superselection and decoherence-free subspaces
+      1. Pointer states and the commutativity criterion
+      2. Decoherence-free subspaces
+   D. Proliferation of information and quantum Darwinism
+   E. Decoherence versus dissipation and noise
+III. Master equations
+   A. Born–Markov master equations
+   B. Lindblad master equations
+   C. Non-Markovian decoherence
+IV. Decoherence models
+   A. Collisional decoherence
+   B. Quantum Brownian motion
+   C. Spin–boson models
+   D. Spin-environment models
+V. Qubit decoherence, quantum error correction, and error avoidance
+   A. Correction of decoherence-induced quantum errors
+   B. Quantum computation on decoherence-free subspaces
+   C. Environment engineering and dynamical decoupling
+VI. Experimental studies of decoherence
+   A. Atoms in a cavity
+   B. Matter-wave interferometry
+   C. Superconducting systems
+VII. Decoherence and the foundations of quantum mechanics
+References
+
+I. INTRODUCTION
+
+Realistic quantum systems are never completely iso-
+lated from their environment. When a quantum system
+interacts with its environment, it will in general become
+entangled with a large number of environmental degrees
+of freedom. This entanglement influences what we can
+locally observe upon measuring the system. In partic-
+ular, quantum interference effects with respect to cer-
+tain physical quantities (most notably, “classical” quan-
+tities such as position) become effectively suppressed,
+making them prohibitively difficult to observe in most
+cases of practical interest. This is the process of deco-
+herence, sometimes also called dynamical decoherence or
+environment-induced decoherence [1–10]. Stated in gen-
+eral and interpretation-neutral terms, decoherence de-
+scribes how entangling interactions with the environment
+influence the statistics of results of future measurements
+on the system.
+Formally, decoherence can be viewed as a dynamical
+filter on the space of quantum states, singling out those
+states that, for a given system, can be stably prepared
+and maintained, while effectively excluding most other
+states, in particular, nonclassical superposition states of
+the kind popularized by Schrödinger’s cat. In this way,
+decoherence lies at the heart of the quantum-to-classical
+transition. It ensures consistency between quantum and
+classical predictions for systems observed to behave clas-
+sically. It provides a quantitative, dynamical account of
+the boundary between quantum and classical physics. In
+any concrete experimental situation, decoherence theory
+specifies the physical requirements, both qualitative and
+quantitative, for pushing the quantum–classical bound-
+ary toward the quantum realm. Decoherence is a pure
+quantum effect, to be distinguished from classical dissi-
+pation and stochastic fluctuations (noise).
+Decoherence processes are extremely efficient. Even
+when the environment does not, from a classical point
+of view, impart significant classical perturbations on the
+system, quantum-mechanically the system will in most
+circumstances become rapidly and strongly entangled
+with the environment. Furthermore, due to the many
+uncontrollable degrees of freedom of the environment, such
+entanglement is usually irreversible for all practical
+purposes. Increasingly realistic models of decoherence
+processes have been developed, progressing from toy models
+to complex models tailored to specific experiments (see
+Sec. IV). Advances in experimental techniques have made
+it possible to observe the gradual action of decoherence
+in experiments such as matter-wave interferometry [11],
+cavity QED [12], and superconducting systems [13] (see
+Sec. VI).
+
+arXiv:1404.2635v2 [quant-ph] 20 Nov 2019
+
+The superposition states necessary for quantum in-
+formation processing are typically also those most sus-
+ceptible to decoherence. Thus, decoherence is a major
+barrier to implementing devices for quantum informa-
+tion processing such as quantum computers (see Sec. V).
+Qubit systems must be engineered to minimize environ-
+mental interactions detrimental to the preparation and
+longevity of the desired superposition states. At the same
+time, they must remain sufficiently open to allow for their
+control. Quantum error correction can undo some of the
+decoherence-induced degradation of
+the superposition state and will be an integral part of
+quantum computers (see Sec. V A). Not only is deco-
+herence relevant to quantum information, but also vice
+versa. An information-centric view of quantum mechan-
+ics proves helpful in conveying the essence of the deco-
+herence process and is also used in recent explorations
+of the role of the environment as an information channel
+(see Sec. II B).
+It is a curious “historical accident” (Joos’s term [14,
+p. 13]) that the role of the environment in quantum me-
+chanics was appreciated only relatively late. While one
+can find—for example, in Heisenberg’s writings [15]—a
+few early anticipatory remarks about the role of environ-
+mental interactions in the quantum-mechanical descrip-
+tion of systems, it wasn’t until the 1970s that the ubiquity
+and implications of environmental entanglement were re-
+alized by Zeh [1, 16]. It took another decade for the for-
+malism of decoherence to be developed, chiefly by Zurek
+[2, 3], and for concrete models and numerical estimates
+of decoherence rates to be worked out [17, 18].
+Review papers on decoherence include Refs. [4–6, 10, 19].
+There are two books on decoherence: a volume
+by Joos et al. [8] (a collection of chapters written by
+different authors) and a monograph by this author [9].
+Ref. [20] also contains material on decoherence. Foun-
+dational implications of decoherence are discussed in
+Refs. [6, 7, 9, 21].
+
+II. BASIC FORMALISM AND CONCEPTS
+
+In the double-slit experiment, we cannot observe an in-
+terference pattern if we also measure which slit the parti-
+cle went through (that is, if we obtain perfect which-path
+information). In fact, there is a continuous tradeoff be-
+tween interference (phase information) and which-path
+information: the better we can distinguish the two pos-
+sible paths, the less visible the interference pattern be-
+comes [22]. What is more, for a decrease in interference
+visibility to occur it suffices that there are degrees of
+freedom somewhere in the world that, if they were mea-
+sured, would allow us to make, with a certain degree of
+confidence, a statement about the path of the particle
+through the slits. While we cannot say that prior to
+their measurement, these degrees of freedom have en-
+coded information about a particular, definitive path of
+the particle—instead, we have merely correlations involv-
+ing both possible paths—no actual measurement is re-
+quired to bring about the decrease in interference visibil-
+ity. It is enough that, in principle, we could make such
+a measurement to obtain which-path information.
+This is somewhat loose talk, and conceptual caveats
+lurk. But it captures quite well the essence of what is
+happening in decoherence, where those “degrees of free-
+dom somewhere in the world” are the degrees of freedom
+of the system’s environment interacting with the system,
+leading to the creation of quantum correlations (entan-
+glement) between system and environment. Decoherence
+can thus be thought of as a process arising from the con-
+tinuous monitoring of the system by the environment;
+effectively, the environment is performing nondemolition
+measurements on the system (see Sec. II B). We now give
+a formal quantum-mechanical account of what we have
+just tried to convey in words, and then flesh out the con-
+sequences and details.
+
+A. Decoherence and interference damping
+
+Consider again the double-slit experiment and denote
+the quantum states of the particle (call it S, for “sys-
+tem”) corresponding to passage through slit 1 and 2 by
+|s1⟩ and |s2⟩, respectively. Suppose that the particle in-
+teracts with another system E—for example, a detec-
+tor or an environment—such that if the quantum state
+of the particle before the interaction is |s1⟩, then the
+quantum state of E will become |E1⟩ (and similarly for
+|s2⟩), resulting in the final composite states |s1⟩ |E1⟩ and
+|s2⟩ |E2⟩, respectively. For an initial superposition state
+α |s1⟩+β |s2⟩, the final composite state will be entangled,
+
+\[ |\Psi\rangle = \alpha |s_1\rangle |E_1\rangle + \beta |s_2\rangle |E_2\rangle . \tag{1} \]
+
+The statistics of all possible local measurements on S
+are exhaustively encoded in the reduced density matrix
+ρS,
+
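+\[ \rho_S = \mathrm{Tr}_E(\rho_{SE}) = \mathrm{Tr}_E |\Psi\rangle\langle\Psi|
+   = |\alpha|^2 |s_1\rangle\langle s_1| + |\beta|^2 |s_2\rangle\langle s_2|
+   + \alpha\beta^* |s_1\rangle\langle s_2| \langle E_2|E_1\rangle + \alpha^*\beta |s_2\rangle\langle s_1| \langle E_1|E_2\rangle . \tag{2} \]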
+
+For example, suppose we measure the particle’s position by
+letting the particle impinge on a distant detection screen.
+Statistically, the resulting particle probability density
+p(x) will be given by
+
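+\[ p(x) = \mathrm{Tr}_S(\rho_S\, x)
+   = |\alpha|^2 |\psi_1(x)|^2 + |\beta|^2 |\psi_2(x)|^2
+   + 2\, \mathrm{Re} \left\{ \alpha\beta^* \psi_1(x) \psi_2^*(x) \langle E_2|E_1\rangle \right\} , \tag{3} \]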
+
+where ψi(x) ≡ ⟨x|si⟩. The last term represents the in-
+terference contribution. Thus, the visibility of the inter-
+ference pattern is quantified by the overlap ⟨E2|E1⟩, i.e.,
+by the distinguishability of |E1⟩ and |E2⟩. In the lim-
+iting case of perfect distinguishability, ⟨E2|E1⟩ = 0, no
+interference pattern will be observable and we obtain the
+classical prediction. Phase relations have become locally
+(i.e., with respect to S) inaccessible, and there is no mea-
+surement on S that can reveal coherence between |s1⟩ and
+|s2⟩. The coherence is now between the states |s1⟩ |E1⟩
+and |s2⟩ |E2⟩, requiring an appropriate global measure-
+ment (acting jointly on S and E) for it to be revealed.
+Conversely, if the interaction between S and E is such
+that E is completely unable to resolve the path of the
+particle, then |E1⟩ and |E2⟩ are indistinguishable and full
+coherence is retained at the level of S, as is also directly
+obvious from Eq. (1). In the intermediary regime where
+0 < |⟨E2|E1⟩| < 1, meaning that |E1⟩ and |E2⟩ can be
+distinguished in a one-shot measurement with nonzero
+probability p = 1 − |⟨E2|E1⟩|2 < 1, an interference pat-
+tern of reduced visibility is obtained. Equation (3) shows
+that the reduction in visibility increases as |E1⟩ and |E2⟩
+become more distinguishable.
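
The relation between the overlap ⟨E2|E1⟩ and the surviving coherence in Eq. (2) can be checked numerically. The following is a minimal sketch, not part of the original review: it assumes a single-qubit environment whose two states differ by a hypothetical angle `theta`, and the function names are illustrative. It builds the entangled state of Eq. (1), traces out E, and compares the off-diagonal element of ρS with the overlap.

```python
import math

def reduced_density_matrix(alpha, beta, e1, e2):
    """Reduced state of S for |Psi> = alpha|s1>|E1> + beta|s2>|E2>.

    e1 and e2 are lists of amplitudes for the normalized environment
    states |E1> and |E2>. Returns a 2x2 nested list.
    """
    # Joint amplitudes psi[s][k] = <s, k|Psi>, with k an environment basis index.
    psi = [[alpha * a for a in e1], [beta * b for b in e2]]
    # Partial trace over E: rho[s][t] = sum_k psi[s][k] * conj(psi[t][k]).
    return [[sum(psi[s][k] * psi[t][k].conjugate() for k in range(len(e1)))
             for t in range(2)] for s in range(2)]

def env_states(theta):
    """Illustrative assumption: two real qubit states with <E2|E1> = cos(theta)."""
    return [1.0, 0.0], [math.cos(theta), math.sin(theta)]

alpha = beta = 1 / math.sqrt(2)
for theta in (0.0, math.pi / 4, math.pi / 2):
    e1, e2 = env_states(theta)
    rho = reduced_density_matrix(alpha, beta, e1, e2)
    overlap = sum(b.conjugate() * a for a, b in zip(e1, e2))  # <E2|E1>
    print(f"overlap={overlap:.3f}  coherence={abs(rho[0][1]):.3f}")
```

In this sketch the off-diagonal element equals αβ* ⟨E2|E1⟩, so it vanishes when the two environment states are orthogonal and is maximal when they coincide, in line with the limiting cases discussed above.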
+Here is another way of putting the matter. Looking
+back at Eq. (1), we see that E encodes which-way infor-
+mation about S in the same “relative-state” sense [23] in
+which EPR correlations [24–26] may be said to encode
+“information.” That is, if ⟨E2|E1⟩ = 0 and we were to
+measure E and found it to be in state |E1⟩, we could, in
+EPR’s words [24, p. 777], “predict with certainty” that
+we will find S in |s1⟩.1 Whenever such a prediction is
+possible were we to measure E, no interference effects be-
+tween the components |s1⟩ and |s2⟩ can be measured at
+S, even if E is never actually measured. If |⟨E2|E1⟩| > 0,
+then E encodes only partial which-way information about
+S, in the sense that a measurement of E could not reliably
+distinguish between |E1⟩ and |E2⟩; instead, sometimes
+the measurement will result in an outcome compatible
+with both |E1⟩ and |E2⟩. Consequently, an interference
+experiment carried out on S would find reduced visibil-
+ity, representing diminished local coherence between the
+components |s1⟩ and |s2⟩.
+
+1 Of course, this must not be read as saying that S was already
+in |s1⟩ (i.e., “went through slit 1”) prior to the measurement
+of E. Nor does it mean that the result of a subsequent path
+measurement on S is necessarily determined, by virtue of the
+measurement on E, prior to this S-measurement’s actually being
+carried out. After all, as Peres has cautioned us [27], unperformed
+measurements have no outcomes. So while the picture
+of E as “encoding which-path information” about S is certainly
+suggestive and helpful, it should be used with an understanding
+of its conceptual pitfalls.
+
+As hinted above, the description developed so far describes
+the essence of the decoherence process if we identify
+the particle S more generally with an arbitrary quantum
+system and the second system E with the environment
+of S. Then an idealized account of the decoherence
+interaction has the form
+
+\[ \Bigl( \sum_i c_i |s_i\rangle \Bigr) |E_0\rangle \longrightarrow \sum_i c_i |s_i\rangle |E_i(t)\rangle . \tag{4} \]
+
+We have here introduced a time parameter t, where t = 0
+corresponds to the onset of the environmental interac-
+tion, with |Ei(t)⟩ ≡ |E0⟩ for all i; at t < 0 the system
+and environment are assumed to be uncorrelated (an as-
+sumption common to most decoherence models).
+A single environmental particle interacting with the
+system will typically only insufficiently resolve the com-
+ponents |si⟩ in the system’s superposition state. But be-
+cause of the large number of such particles (and, hence,
+degrees of freedom), the overlap between their different
+joint states |Ei(t)⟩ will rapidly decrease as a result of
+the build-up of many interaction events. Specifically, in
+many decoherence models an exponential decay of over-
+lap is found [3, 5, 9, 17, 20, 28–31],
+
+\[
+  \langle E_i(t) | E_j(t) \rangle \propto e^{-t/\tau_d}
+  \quad \text{for } i \neq j. \tag{5}
+\]
+
+Here τd is the characteristic decoherence timescale, which
+can be evaluated for particular choices of the parameters
+in each model (see Sec. IV).
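+
+To make the link between Eqs. (4) and (5) and the loss of interference
+explicit, the following worked step may help. It is a sketch in the
+notation used above (the coefficients c_i and states |E_i(t)⟩ are those
+of Eq. (4)); it simply traces out E from the right-hand side of Eq. (4).
+
+```latex
+\[
+  \rho_S(t)
+  = \mathrm{Tr}_E \Bigl[ \sum_{i,j} c_i c_j^{*}\,
+      |s_i\rangle\langle s_j| \otimes |E_i(t)\rangle\langle E_j(t)| \Bigr]
+  = \sum_{i,j} c_i c_j^{*}\, \langle E_j(t) | E_i(t) \rangle\,
+      |s_i\rangle\langle s_j| .
+\]
+```
+
+The diagonal terms (i = j) keep their weights |c_i|^2 because each
+|E_i(t)⟩ is normalized, whereas by Eq. (5) the magnitude of every
+off-diagonal term (i ≠ j) is suppressed by a factor proportional to
+e^{-t/τd}.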
+
+B. Environmental monitoring and information transfer
+
+We will now motivate, in a different and more rigorous
+way, the picture of decoherence as a process of environ-
+mental monitoring.
+First, we express the influence of the environment in a completely
+general way. We assume that at t = 0 there are no correlations between
+system S and environment E, ρSE(0) = ρS(0) ⊗ ρE(0). We write ρE(0) in
+its diagonal decomposition,
+$\rho_E(0) = \sum_i p_i |E_i\rangle\langle E_i|$, where
+$\sum_i p_i = 1$ and the states |Ei⟩ form an orthonormal basis of the
+Hilbert space of E. If H denotes the Hamiltonian (here assumed to be
+time-independent) of SE and $U(t) = e^{-iHt}$ represents the unitary
+time evolution operator, then the density matrix of S evolves
+according to
+\[
+  \rho_S(t)
+  = \mathrm{Tr}_E \Bigl\{ U(t) \Bigl[ \rho_S(0) \otimes
+      \Bigl( \sum_i p_i |E_i\rangle\langle E_i| \Bigr) \Bigr]
+      U^\dagger(t) \Bigr\}
+  = \sum_{ij} p_i\, \langle E_j| U(t) |E_i\rangle\, \rho_S(0)\,
+      \langle E_i| U^\dagger(t) |E_j\rangle . \tag{6}
+\]
+
+Introducing the Kraus operators [32] defined by
+$E_{ij} \equiv \sqrt{p_i}\, \langle E_j| U(t) |E_i\rangle$, we obtain
+\[
+  \rho_S(t) = \sum_{ij} E_{ij}\, \rho_S(0)\, E_{ij}^\dagger . \tag{7}
+\]
+
+It is customary to combine the two indices i and j into a
+single index and write the Kraus operators as
+\[
+  W_k \equiv \sqrt{p_{i_k}}\, \langle E_{j_k}| U(t) |E_{i_k}\rangle ,
+  \tag{8}
+\]
+such that
+\[
+  \rho_S(t) = \sum_k W_k\, \rho_S(0)\, W_k^\dagger . \tag{9}
+\]
+
+This Kraus-operator formalism (also called operator-sum
+formalism) represents the effect of the environment as
+a sequence of (in general nonunitary) transformations of
+ρS generated by the operators Wk. The Kraus operators
+exhaustively encode information about the initial state
+of the environment and about the dynamics of the joint
+SE system. Because the evolution of SE is unitary, the
+Kraus operators satisfy the completeness constraint
+\[
+  \sum_k W_k^\dagger W_k = I_S, \tag{10}
+\]
+where IS is the identity operator in the Hilbert space of
+S. Equations (9) and (10) together imply that the Wk are
+the generators of a completely positive map Φ : ρS(0) ↦
+ρS, also known as a quantum operation [32] or quantum
+channel.2
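+
+As a minimal illustration of the operator-sum form of Eq. (9), consider
+a single qubit subject to pure dephasing. This example is not taken from
+the text above: the phase-flip probability p and the choice of the two
+operators W_0 and W_1 are assumptions made here for illustration.
+
+```latex
+\[
+  W_0 = \sqrt{1-p}\; I, \qquad W_1 = \sqrt{p}\; \sigma_z ,
+  \qquad
+  W_0^\dagger W_0 + W_1^\dagger W_1 = (1-p)\, I + p\, \sigma_z^2 = I ,
+\]
+\[
+  \rho_S \;\mapsto\; (1-p)\, \rho_S + p\, \sigma_z \rho_S \sigma_z
+  = \begin{pmatrix}
+      \rho_{00} & (1-2p)\, \rho_{01} \\
+      (1-2p)\, \rho_{10} & \rho_{11}
+    \end{pmatrix} .
+\]
+```
+
+The populations are untouched while the coherences shrink by the factor
+(1 − 2p), which is the signature of decoherence in the σ_z eigenbasis.
+Both operators are Hermitian in this example, so the ordering of W and
+W† in the completeness sum is immaterial here.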
+
+We will now use Eq. (9) to formally motivate the view
+that decoherence corresponds to an indirect measurement
+of the system by the environment, and that it thus re-
+sults from a transfer of information from the system to
+the environment (see also Ref. [19]).
+In such an indi-
+rect measurement, we let the system S interact with a
+probe—here the environment E—followed by a projec-
+tive measurement on E. The probe is treated as a quan-
+tum system. This procedure aims to yield information
+about S without performing a projective (and thus de-
+structive) direct measurement on S. To model such an
+indirect measurement, consider again an initial compos-
+ite density operator ρSE(0) = ρS(0) ⊗ ρE(0) evolving
+under the action of U(t) = e−iHt, where H is the to-
+tal Hamiltonian. Consider a projective measurement M
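+on E with eigenvalues α and corresponding projectors
+$P_\alpha \equiv |\alpha\rangle\langle\alpha|$, with
+$P_\alpha^2 = P_\alpha^\dagger = P_\alpha$. The probability of
+obtaining outcome α in this measurement when S is described by the
+density operator ρS(t) is
+\[
+  \mathrm{Prob}\bigl(\alpha \mid \rho_S(t)\bigr)
+  = \mathrm{Tr}_E \bigl( P_\alpha\, \rho_E(t) \bigr)
+  = \mathrm{Tr}_E \Bigl\{ P_\alpha\, \mathrm{Tr}_S \bigl\{ U(t)
+      \bigl( \rho_S(0) \otimes \rho_E(0) \bigr) U^\dagger(t) \bigr\}
+    \Bigr\} . \tag{11}
+\]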
+
+Footnote 2: The Kraus formalism is of limited use in calculating
+decoherence dynamics for concrete situations of physical interest. This
+is so because finding the Kraus operators corresponds to diagonalizing
+the full Hamiltonian of SE, usually a prohibitively difficult task.
+Moreover, the Kraus operators contain all contributions to the evolution
+of the reduced density matrix, while for considerations of decoherence
+we are typically interested only in the nonunitary terms, and certain
+contributions, such as back-action effects from the system on the
+environment, can often be neglected. (This is where master equations
+come into play; see Sec. III.)
+
+The density matrix of S conditioned on the particular outcome α is
+\[
+  \rho_S^{(\alpha)}(t)
+  = \frac{\mathrm{Tr}_E \bigl\{ [I \otimes P_\alpha]\, \rho_{SE}(t)\,
+      [I \otimes P_\alpha] \bigr\}}
+         {\mathrm{Prob}\bigl(\alpha \mid \rho_S(t)\bigr)}
+  = \frac{\mathrm{Tr}_E \bigl\{ [I \otimes P_\alpha]\, U(t)\,
+      [\rho_S(0) \otimes \rho_E(0)]\, U^\dagger(t)\,
+      [I \otimes P_\alpha] \bigr\}}
+         {\mathrm{Prob}\bigl(\alpha \mid \rho_S(t)\bigr)} . \tag{12}
+\]
+
+Inserting the diagonal decomposition
+$\rho_E(0) = \sum_k p_k |E_k\rangle\langle E_k|$ and carrying out the
+trace gives [19]
+\[
+  \rho_S^{(\alpha)}(t)
+  = \frac{\sum_k M_{\alpha,k}\, \rho_S(0)\, M_{\alpha,k}^\dagger}
+         {\mathrm{Prob}\bigl(\alpha \mid \rho_S(t)\bigr)} . \tag{13}
+\]
+
+Here we have introduced the measurement operators
+\[
+  M_{\alpha,k} \equiv \sqrt{p_k}\, \langle \alpha| U(t) |E_k\rangle ,
+  \tag{14}
+\]
+
+which obey the completeness constraint
+$\sum_{\alpha,k} M_{\alpha,k}^\dagger M_{\alpha,k} = I_S$.
+Equation (12) describes the effect of the indirect measurement on the
+state of the system. If, however, we do not actually inquire about the
+result of this measurement, we must assign to the system a density
+operator that is a sum over all the possible conditional states
+$\rho_S^{(\alpha)}(t)$ weighted by their probabilities
+$\mathrm{Prob}(\alpha \mid \rho_S(t))$,
+\[
+  \rho_S(t)
+  = \sum_\alpha \mathrm{Prob}\bigl(\alpha \mid \rho_S(t)\bigr)\,
+      \rho_S^{(\alpha)}(t)
+  = \sum_{\alpha,k} M_{\alpha,k}\, \rho_S(0)\, M_{\alpha,k}^\dagger .
+  \tag{15}
+\]
+
+Note that this expression is formally analogous to the
+Kraus-operator expression of Eq. (9), which described
+the effect of a general environmental interaction on the
+state of the system. Recall, further, that the situation we
+encounter in decoherence is precisely one in which we do
+not actually read out the environment—or, in the present
+picture, in which we do not inquire about the result of the
+indirect measurement.
+This suggests that decoherence
+can indeed be understood as an indirect measurement—
+a monitoring—of the system by its environment.
+
+C. Environment-induced superselection and decoherence-free subspaces
+
+Decoherence can occur in any basis; which observable
+is monitored by the environment depends on the spe-
+cific form of the system–environment interaction. The
+preferred states (or preferred observables) of the system
+emerge dynamically as those states that are the most ro-
+bust to the interaction with the environment, in the sense
+that they become least entangled with the environment;
+thus, they are the states most immune to decoherence.
+This is the stability criterion for the selection of pre-
+ferred states, resulting in the dynamical selection of pre-
+ferred states (“environment-induced superselection”) [1–
+3, 16]. These environment-superselected preferred states
+(or observables) are sometimes also called pointer states
+(or pointer observables) [2], since they correspond to the
+physical quantities that are most easily “read off” at the
+level of the system, akin to the pointer on the dial of a
+measurement apparatus.
+
+1. Pointer states and the commutativity criterion
+
+To find the preferred states,
+we decompose the
+total system–environment Hamiltonian into the self-
+Hamiltonians of the system S and environment E rep-
+resenting the intrinsic dynamics, and a part Hint repre-
+senting the interaction between system and environment,
+
+\[
+  H = H_S + H_E + H_{\mathrm{int}} . \tag{16}
+\]
+
+In many cases of practical interest, Hint dominates
+the evolution of the system, H ≈ Hint (the quantum-
+measurement limit of decoherence). We look for system
+states |si⟩ such that the composite system–environment
+state, when starting from a product state |si⟩ |E0⟩ at
+t = 0, remains in the product form |si⟩ |Ei(t)⟩ for all
+t > 0 under the action of Hint (we shall assume here
+that Hint is not explicitly time-dependent). That is, we
+demand that (setting ℏ ≡ 1 from here on)
+
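+\[
+  e^{-iH_{\mathrm{int}}t}\, |s_i\rangle |E_0\rangle
+  = \lambda_i\, |s_i\rangle\, e^{-iH_{\mathrm{int}}t}\, |E_0\rangle
+  \equiv |s_i\rangle |E_i(t)\rangle . \tag{17}
+\]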
+Thus, the pointer states |si⟩ are the eigenstates of the
+part of the interaction Hamiltonian Hint pertaining to the
+Hilbert space of the system, with eigenvalues λi. These
+states will be stationary under Hint [2]. It follows that the
+pointer observable defined by
+$O_S = \sum_i o_i |s_i\rangle\langle s_i|$ commutes with Hint,
+\[
+  \bigl[ O_S, H_{\mathrm{int}} \bigr] = 0 . \tag{18}
+\]
+
+This commutativity criterion [2, 3] is particularly easy to
+apply when Hint takes the tensor-product form Hint =
+S ⊗ E, as is frequently the case. Then the environment-
+superselected observables will be those observables that
+commute with S.
+If S is Hermitian, it represents the physical quantity monitored by the
+environment. In general, any Hint can be written as a diagonal
+decomposition of (unitary but not necessarily Hermitian) system and
+environment operators Sα and Eα,
+$H_{\mathrm{int}} = \sum_\alpha S_\alpha \otimes E_\alpha$.
+If the Sα are Hermitian, such a Hamiltonian represents
+the simultaneous environmental monitoring of different
+observables Sα of the system. A sufficient condition for
+{|si⟩} to form a set of pointer states of the system is then
+given by the requirement that the |si⟩ be simultaneous
+eigenstates of the operators Sα,
+
+\[
+  S_\alpha |s_i\rangle = \lambda_i^{(\alpha)} |s_i\rangle
+  \quad \text{for all } \alpha \text{ and } i. \tag{19}
+\]
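+
+The commutativity criterion can be illustrated with the simplest
+tensor-product coupling. The following is a sketch under assumptions
+made here: a two-level system with Pauli operator σ_z (with
+σ_z|0⟩ = |0⟩ and σ_z|1⟩ = −|1⟩), an unspecified Hermitian environment
+operator E, and amplitudes a and b chosen for illustration.
+
+```latex
+\[
+  H_{\mathrm{int}} = \sigma_z \otimes E , \qquad
+  e^{-iH_{\mathrm{int}}t}\, |0\rangle |E_0\rangle
+    = |0\rangle\, e^{-iEt} |E_0\rangle , \qquad
+  e^{-iH_{\mathrm{int}}t}\, |1\rangle |E_0\rangle
+    = |1\rangle\, e^{+iEt} |E_0\rangle ,
+\]
+\[
+  e^{-iH_{\mathrm{int}}t}\, \bigl( a|0\rangle + b|1\rangle \bigr)
+    |E_0\rangle
+  = a\, |0\rangle\, e^{-iEt} |E_0\rangle
+    + b\, |1\rangle\, e^{+iEt} |E_0\rangle .
+\]
+```
+
+The eigenstates |0⟩ and |1⟩ of σ_z remain in product form and are thus
+pointer states in the sense of Eq. (17), with pointer observable σ_z
+satisfying Eq. (18). A superposition of them becomes entangled with the
+environment as soon as the two environment states
+e^{∓iEt}|E_0⟩ differ by more than a phase.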
+
+Interaction Hamiltonians frequently describe the scat-
+tering of surrounding particles (photons, air molecules,
+etc.), leading to collisional decoherence (see Sec. IV A).
+Since the force laws describing such processes typically
+depend on some power of distance, the interaction Hamil-
+tonian will then commute with the position operator.
+Thus, the pointer states will be approximate eigenstates
+of position (i.e., narrow position-space wave packets).
+This explains why superpositions of mesoscopically and
+macroscopically distinct positions are prohibitively diffi-
+cult to observe [2, 3, 17, 31, 33–39]. Collisional decoher-
+ence can also be dominant in microscopic systems when
+these systems occur in distinct spatial configurations that
+couple strongly to the surrounding medium. For exam-
+ple, chiral molecules such as sugar are always observed to
+be in chirality eigenstates (left-handed or right-handed),
+which are superpositions of different energy eigenstates.
+Any attempt to prepare such molecules in energy eigen-
+states leads to immediate decoherence into the environ-
+mentally stable chirality eigenstates [40, 41].
+The quantum limit of decoherence [42] arises when the
+modes of the environment are slow in comparison with
+the evolution of the system—that is, when the highest
+frequencies (i.e., energies) available in the environment
+are smaller than the separation between the energy eigen-
+states of the system. Then the environment will be able
+to monitor only quantities that are constants of motion.
+In the case of nondegeneracy, this quantity will be the en-
+ergy of the system, leading to the environment-induced
+superselection of energy eigenstates for the system [42].3
+
+Footnote 3: Textbooks on quantum mechanics usually attribute a special
+role to such energy eigenstates (for closed systems) since they are
+stationary under the action of the Hamiltonian. In this closed-system
+picture, however, arbitrary superpositions of energy eigenstates should
+nonetheless be perfectly legitimate. Thus, it is important to realize
+that the environment-induced superselection of energy eigenstates is not
+equivalent to a situation in which the presence of the environment could
+be neglected altogether; instead, the environment plays the crucial role
+of continuously monitoring the energy of the system, leading to a local
+suppression of coherence between energy eigenstates.
+
+In many realistic situations, the commutativity criterion, Eq. (18),
+can only be fulfilled approximately [43, 44]. In addition, the
+self-Hamiltonian of the system and the interaction Hamiltonian may
+contribute in roughly equal strengths (e.g., in models for quantum
+Brownian motion [4, 45]; see Sec. IV B), rendering neither the
+quantum-measurement limit of negligible intrinsic dynamics nor the
+quantum limit of decoherence of a slow environment appropriate. In such
+cases, more general methods for determining the preferred states are
+required. The predictability-sieve strategy [43, 44, 46] computes the
+time dependence of the amount of decoherence introduced into the system
+for a large set of initial states of the system evolving under the
+total system–environment Hamiltonian. Typically, this decoherence is
+measured using either the purity $\mathrm{Tr}(\rho_S^2)$ or the von
+Neumann entropy
+$S(\rho_S) = -\mathrm{Tr}(\rho_S \log_2 \rho_S)$ of the reduced density
+matrix ρS. The states most immune to decoherence will be those which
+lead to the smallest decrease in purity or the smallest increase in von
+Neumann entropy. Application of this method leads to a ranking of the
+possible preferred states with respect to their robustness to the
+interaction with the environment. For particular models it has been
+explicitly shown that the states picked out by the predictability sieve
+are robust to the particular choice of the measure of decoherence. For
+example, in the model for quantum Brownian motion, different measures
+lead to the same minimum-uncertainty wave packets in phase space
+[5, 8, 16, 44, 47, 48].
+
+2. Decoherence-free subspaces
+
+The pointer-state condition of Eq. (19) can be strengthened to the
+concept of pointer subspaces [3] or decoherence-free subspaces (DFS)
+[49–58]. These are subspaces of the Hilbert space of the system in which
+every state in the subspace is immune to decoherence; this is a
+nontrivial requirement, since in general superpositions of pointer
+states will not be pointer states themselves. One important condition
+for this to happen is that the preferred states |si⟩ defined by
+Eq. (19) form an orthonormal basis of the subspace, and that the
+eigenvalues $\lambda_i^{(\alpha)}$ in Eq. (19) are independent of i,
+i.e., that all |si⟩ are simultaneous degenerate eigenstates of each Sα,
+\[
+  S_\alpha |s_i\rangle = \lambda^{(\alpha)} |s_i\rangle
+  \quad \text{for all } \alpha \text{ and } i. \tag{20}
+\]
+
+This condition states that the action of a given Sα must
+be the same for all basis states |si⟩ of the DFS, and thus
+the existence of a DFS corresponds to a symmetry in the
+structure of the system–environment interaction, i.e., to
+a dynamical symmetry. A necessary condition for such a
+symmetry to obtain is the absence of terms in Hint that
+act jointly on system and environment in a nontrivial
+manner.
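+An arbitrary state |ψ⟩ in the DFS can then be written
+as $|\psi\rangle = \sum_i c_i |s_i\rangle$ and will evolve according to
+\[
+  e^{-iH_{\mathrm{int}}t}\, |\psi\rangle |E_0\rangle
+  = |\psi\rangle\,
+    e^{-i \left( \sum_\alpha \lambda^{(\alpha)} E_\alpha \right) t}\,
+    |E_0\rangle
+  \equiv |\psi\rangle |E_\psi(t)\rangle . \tag{21}
+\]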
+
+Thus, the state |ψ⟩ does not become entangled with the
+environment and is therefore immune to decoherence.
+When the self-Hamiltonian HS of the system cannot be
+neglected, one needs to additionally ensure that none of
+the basis states |si⟩ of the DFS will drift out of the sub-
+space under the evolution generated by HS. Otherwise
+an initially decoherence-free state would again become
+prone to decoherence. The concept of DFS can be gener-
+alized to the formalism of noiseless subsystems (or noise-
+less quantum codes) [58–60].
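+
+A standard two-qubit illustration of Eqs. (20) and (21) is collective
+dephasing. The model below is an assumption introduced here for
+illustration (it is not worked out in the text above): both qubits
+couple identically to one environment operator E, with the convention
+σ_z|0⟩ = |0⟩ and σ_z|1⟩ = −|1⟩.
+
+```latex
+\[
+  H_{\mathrm{int}} = S \otimes E , \qquad
+  S = \sigma_z^{(1)} + \sigma_z^{(2)} ,
+\]
+\[
+  S\,|00\rangle = 2\,|00\rangle , \qquad
+  S\,|11\rangle = -2\,|11\rangle , \qquad
+  S\,|01\rangle = 0 , \qquad
+  S\,|10\rangle = 0 ,
+\]
+\[
+  e^{-iH_{\mathrm{int}}t}\,
+    \bigl( a\,|01\rangle + b\,|10\rangle \bigr) |E_0\rangle
+  = \bigl( a\,|01\rangle + b\,|10\rangle \bigr) |E_0\rangle .
+\]
+```
+
+The states |01⟩ and |10⟩ share the same eigenvalue λ = 0 of S, so the
+subspace they span satisfies Eq. (20) and every superposition in it
+stays disentangled from the environment. By contrast, a superposition of
+|00⟩ and |11⟩ involves the distinct eigenvalues ±2 and is therefore
+exposed to decoherence.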
+
+D. Proliferation of information and quantum Darwinism
+
+Quantum Darwinism [61–69] builds on the ideas of de-
+coherence and environmental encoding of information, by
+broadening the role of the environment to that of a com-
+munication and amplification channel. Interactions be-
+tween the system and its environment lead to the redun-
+dant storage of selected information about the system in
+many fragments of the environment. By measuring some
+of these fragments, observers can indirectly obtain infor-
+mation about the system without appreciably disturbing
+the system itself. Indeed, this represents how we typi-
+cally observe objects. For example, we see an object not
+by directly interacting with it, but by intercepting scat-
+tered photons that encode information about the object’s
+spatial structure [67, 68].
+
+In this sense, quantum Darwinism provides a dynami-
+cal explanation for the robustness of states of (especially)
+macroscopic objects to observation. It was found that
+the observable of the system that can be imprinted most
+completely and redundantly in many distinct fragments
+of the environment coincides with the pointer observable
+selected by the system–environment interaction [62–65];
+conversely, most other states do not seem to be redun-
+dantly storable. Indeed, it has been shown that the re-
+dundant proliferation of information regarding pointer
+states is as inevitable as decoherence itself [70]. Quantum
+Darwinism has been studied in several concrete models,
+for example, in spin environments [64], quantum Brow-
+nian motion [71], and photon and photon-like environ-
+ments [67, 68, 70]. The efficiency of the amplification pro-
+cess described by quantum Darwinism can be expressed
+in terms of the quantum Chernoff information [70].
+
+The structure and amount of information that the
+environment encodes about the system can be quanti-
+fied using the measure of (classical [62, 63] or quantum
+[5, 64, 65]) mutual information. Classical mutual infor-
+mation is based on the choice of particular observables of
+the system S and the environment E and quantifies how
+well one can predict the outcome of a measurement of a
+given observable of S by measuring some observable on
+a fraction of E [62, 63]. Quantum mutual information is
+defined as S(ρS)+S(ρE)−S(ρ), where ρS, ρE, and ρ are
+the density matrices of S, E, and the composite system
+SE, respectively, and S(ρ) = −Tr (ρ log2 ρ) is the von
+Neumann entropy associated with ρ. Quantum mutual
+information quantifies the degree of quantum correlations
+between S and E. Classical and quantum mutual infor-
+mation give similar results [5, 62–65] because the differ-
+ence between the two measures, known as the quantum
+discord [72], disappears when decoherence is sufficiently
+effective to select a well-defined pointer basis [72].
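+
+The redundancy idea can be made concrete with a small worked example of
+the quantum mutual information defined above. The three-party state
+below, with a system qubit S and two environment fragments labeled F
+and R, is an idealized assumption chosen here for illustration; the
+fragment states are taken to be perfectly orthogonal.
+
+```latex
+\[
+  |\Psi\rangle = \tfrac{1}{\sqrt{2}}
+    \bigl( |0\rangle_S |0\rangle_F |0\rangle_R
+         + |1\rangle_S |1\rangle_F |1\rangle_R \bigr) ,
+  \qquad
+  \rho_{SF} = \tfrac{1}{2}
+    \bigl( |00\rangle\langle 00| + |11\rangle\langle 11| \bigr) ,
+\]
+\[
+  I(S\!:\!F) = S(\rho_S) + S(\rho_F) - S(\rho_{SF})
+             = 1 + 1 - 1 = 1 \text{ bit} ,
+\]
+\[
+  I(S\!:\!FR) = S(\rho_S) + S(\rho_{FR}) - S(\rho_{SFR})
+              = 1 + 1 - 0 = 2 \text{ bits} .
+\]
+```
+
+Each single fragment already carries S(ρS) = 1 bit about the pointer
+observable of S, which is the redundant, classically accessible record.
+Only the complete environment reveals the additional bit associated with
+the global quantum correlations.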
+
+E. Decoherence versus dissipation and noise
+
+While the presence of dissipation implies the pres-
+ence of decoherence, the converse is not necessarily true.
+When dissipation and decoherence are both present, they
+typically occur on vastly different timescales; the deco-
+herence timescale is typically many orders of magnitude
+shorter than the relaxation timescale. A rule-of-thumb
+estimate for the ratio of the relaxation timescale τr to the
+decoherence timescale τd for a massive object described
+by a superposition of two different positions a distance
+∆x apart is [18]
+
+\[
+  \frac{\tau_r}{\tau_d} \sim
+  \left( \frac{\Delta x}{\lambda_{\mathrm{dB}}} \right)^{2} , \tag{22}
+\]
+where $\lambda_{\mathrm{dB}} = (2mkT)^{-1/2}$ is the thermal de Broglie
+wavelength of the object. For an object of mass m = 1 g at room
+temperature in a coherent superposition of two locations a distance
+∆x = 1 cm apart, τr/τd is on the order of $10^{40}$. Thus, for
+macroscopic objects the dissipative influence of the environment is
+usually completely negligible on the timescale relevant to the
+decoherence induced by this environment.
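+
+The quoted order of magnitude can be checked by restoring ℏ in the
+thermal de Broglie wavelength. The numbers below are a rough worked
+estimate; the value T = 300 K for room temperature and the rounded
+constants are assumptions made here.
+
+```latex
+\[
+  \lambda_{\mathrm{dB}} = \frac{\hbar}{\sqrt{2 m k T}}
+  \approx \frac{1.05 \times 10^{-34}\ \mathrm{J\,s}}
+               {\sqrt{2 \times 10^{-3}\ \mathrm{kg}
+                 \times 1.38 \times 10^{-23}\ \mathrm{J/K}
+                 \times 300\ \mathrm{K}}}
+  \approx 3.7 \times 10^{-23}\ \mathrm{m} ,
+\]
+\[
+  \frac{\tau_r}{\tau_d} \sim
+  \left( \frac{10^{-2}\ \mathrm{m}}
+              {3.7 \times 10^{-23}\ \mathrm{m}} \right)^{2}
+  \approx \bigl( 2.7 \times 10^{20} \bigr)^{2}
+  \approx 7 \times 10^{40} .
+\]
+```
+
+This is consistent with the order of magnitude stated for Eq. (22).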
+Decoherence is a consequence of environmental entan-
+glement. In the literature on quantum computing, how-
+ever, the term “decoherence” is often used to refer to
+any process that affects the qubits, including perturba-
+tions due to classical fluctuations and imperfections. Ex-
+amples for sources of such classical noise in the context
+of quantum computing are the fluctuations in the inten-
+sity [73] and duration [74] of the laser beam incident on
+qubits in an ion trap, inhomogeneities in the magnetic
+fields in NMR quantum computing [75], and bias fluctu-
+ations in superconducting qubits [76]. The distinction be-
+tween classical noise and quantum decoherence has been
+further blurred in quantum error correction, since the
+error-correcting schemes are insensitive to the physical
+origin of the qubit errors (see Sec. V A).
+Phenomenologically and formally the influence of clas-
+sical noise processes may be described in a manner simi-
+lar to the effect of environmental entanglement, namely,
+in terms of a decay of the off-diagonal elements (in-
+terference terms) in the local density matrix (in the
+environment-superselected basis).
+But in the case of
+noise, the decay of the off-diagonal elements occurs be-
+cause the system’s density matrix is identified with an
+average over a physical ensemble of systems (or, put dif-
+ferently, over the different instances of particular noise
+processes), while in the case of decoherence the decay is
+due to an entanglement-induced delocalization of phase
+coherence for individual systems. The fundamental dif-
+ference between these physical processes is masked by the
+density-matrix description. Indeed, one can always find
+an experimental procedure that would, at least in princi-
+ple, distinguish between the different physical processes
+underlying formally similar density-matrix descriptions.
+In contrast with decoherence, noise does not create
+system–environment entanglement and can in principle
+always be undone using only local operations (witness,
+for example, the reversal of ensemble dephasing in NMR
+experiments using the spin-echo technique). In any indi-
+vidual realization of the noise process the dynamics of the
+system are completely unitary, and thus no coherence is
+lost from the system. By contrast, if the system becomes
+entangled with environmental degrees of freedom, at the
+very least we would need to perform a pair of measure-
+ments on the environment before and after the interac-
+tion with the system in order to gather enough informa-
+tion to reverse the effect of decoherence by application
+of an appropriate countertransformation. Moreover, as
+also seen experimentally [77], these measurements would
+not always constitute a sufficient procedure for “undo-
+ing” decoherence (see also Sec. IV.C of Ref. [5]).
+The loss of phase coherence due to environmental
+entanglement is sometimes simulated (with the above
+caveats) by classical fluctuations perturbing the system,
+i.e., by the addition of certain time-dependent terms to
+the self-Hamiltonian of the system. This strategy was
+implemented, for example, in theoretical [73, 78] and ex-
+perimental [77, 79] studies of the influence of fluctuating
+parameters in ion-trap quantum computers.
+
+III. MASTER EQUATIONS
+
+In the usual approach to modeling decoherence, the
+reduced density matrix ρS(t) is obtained from
+
+\[
+  \rho_S(t) = \mathrm{Tr}_E\, \rho_{SE}(t)
+  \equiv \mathrm{Tr}_E \bigl\{ U(t)\, \rho_{SE}(0)\, U^\dagger(t)
+    \bigr\} , \tag{23}
+\]
+
+where U(t) is the time-evolution operator for the compos-
+ite system SE. The task of calculating ρSE(t) is often
+computationally cumbersome or even intractable. It is
+also unnecessarily detailed, because we are usually only
+interested in the dynamics of the system. A master equa-
+tion allows us to calculate ρS(t) directly from an expres-
+sion of the form
+
+\[
+  \rho_S(t) = \mathcal{V}(t)\, \rho_S(0) , \tag{24}
+\]
+
+where the superoperator V(t) is the dynamical map gen-
+erating the evolution of ρS(t).
+If the master equation is exact, then we merely have the identity
+$\mathcal{V}(t)\rho_S(0) \equiv
+\mathrm{Tr}_E\{ U(t)\rho_{SE}(0)U^\dagger(t) \}$ and no computational
+advantage is gained. Therefore, master equations are typically
+based on simplifying approximations.
+In modeling decoherence, we focus on master equations
+that are first-order time-local differential equations of the
+form
+
+\[
+  \frac{d}{dt} \rho_S(t) = \mathcal{L}[\rho_S(t)]
+  \equiv -i \bigl[ H'_S, \rho_S(t) \bigr] + \mathcal{D}[\rho_S(t)] .
+  \tag{25}
+\]
+
+This equation is local in time in the sense that the change
+of ρS at time t depends only on ρS evaluated at t. The
+superoperator L acting on ρS(t) typically depends on the
+initial state of the environment and the different terms
+in the Hamiltonian.
+We have decomposed L into two parts to distinguish their physical
+interpretation. The first term, $-i[H'_S, \rho_S(t)]$, is unitary and
+given by the Liouville–von Neumann commutator with the “renormalized”
+Hamiltonian $H'_S$ of the system. (Because the environment typically
+leads to a renormalization of the energy levels of the system, this
+Hamiltonian does in general not coincide with the unperturbed free
+Hamiltonian HS of S that would generate the evolution of S in absence
+of the environment.) The second, nonunitary term D[ρS(t)] represents
+decoherence (and often also dissipation) due to the environment.
+
+A. Born–Markov master equations
+
+Born–Markov master equations allow for many deco-
+herence problems to be treated in a mathematically sim-
+ple, and often closed, form. They are based on the fol-
+lowing two approximations:
+
+1. The Born approximation. The system–environment coupling is
+sufficiently weak and the environment is reasonably large such that
+changes of the density operator ρE of the environment are negligible
+and the system–environment density operator remains approximately
+factorized at all times, ρSE(t) ≈ ρS(t) ⊗ ρE.
+
+2. The Markov approximation. Memory effects of the
+environment are negligible, in the sense that any
+self-correlations within the environment created by
+the coupling to the system decay rapidly compared
+to the characteristic relaxation timescale of the
+open quantum system.
+
+Comparisons between the predictions of models based on Born–Markov
+master equations and experimental data indicate that the Born and
+Markov assumptions are reasonable in many physical situations (but see
+Sec. III C below for exceptions and non-Markovian models). Assuming
+that both approximations hold and writing the interaction Hamiltonian as
+$H_{\mathrm{int}} = \sum_\alpha S_\alpha \otimes E_\alpha$, the
+Born–Markov master equation reads [9, 20]
+\[
+  \frac{d}{dt} \rho_S(t) = -i \bigl[ H_S, \rho_S(t) \bigr]
+  - \sum_\alpha \Bigl\{ \bigl[ S_\alpha, B_\alpha \rho_S(t) \bigr]
+    + \bigl[ \rho_S(t) C_\alpha, S_\alpha \bigr] \Bigr\} , \tag{26}
+\]
+
+where the system operators Bα and Cα are defined as
+
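+\[
+  B_\alpha \equiv \int_0^\infty d\tau \sum_\beta
+    c_{\alpha\beta}(\tau)\, S_\beta^{(I)}(-\tau) , \tag{27a}
+\]
+\[
+  C_\alpha \equiv \int_0^\infty d\tau \sum_\beta
+    c_{\beta\alpha}(-\tau)\, S_\beta^{(I)}(-\tau) . \tag{27b}
+\]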
+
+Here $S_\alpha^{(I)}(-\tau)$ denotes the operator Sα in the interaction
+picture. In the following, we will simplify notation by omitting the
+superscript “I”; instead we use the convention that all operators
+bearing explicit time arguments are to be understood as
+interaction-picture operators. (For density operators, however, we will
+maintain the superscript notation in order to distinguish them from
+Schrödinger-picture density operators, which also carry a time
+argument.) The quantities cαβ(τ) appearing in Eq. (27) are given by
+\[
+  c_{\alpha\beta}(\tau) \equiv
+  \langle E_\alpha(\tau)\, E_\beta \rangle_{\rho_E} . \tag{28}
+\]
+
+These environment self-correlation functions quantify
+how much information the environment retains over time
+about its interaction with the system. The Markov ap-
+proximation corresponds to the assumption of a rapid
+decay of the cαβ(τ) relative to the timescale set by the
+evolution of the system.
+In many situations of interest, the general form of the
+Born–Markov master equation, Eq. (26), simplifies con-
+siderably.
+For example, typically only a single system
+observable S is monitored by the environment, Hint =
+S ⊗E. Also, the time dependence of the operators Sα(τ)
+and Eα(τ) is often simple, facilitating the calculation of
+the quantities Bα and Cα.
+Examples are discussed in
+Sec. IV.
+
+B. Lindblad master equations
+
+Lindblad master equations constitute a particular, al-
+beit quite general, class of time-local Markovian mas-
+ter equations. They arise from the requirement that the
+evolution of the reduced density matrix generated by the
+master equation must ensure complete positivity [20, 80–
+85]. Complete positivity guarantees that the dynamical
+map ρS(0) ↦ ρS(t) = V(t)ρS(0) described by the master
+equation generates physically consistent dynamics even
+when S is initially entangled with another system. While
+complete positivity is automatically fulfilled if the evo-
+lution is exact, approximate master equations will not
+necessarily ensure complete positivity [20, 83–86]. The
+Lindblad master equation is a special case of the gen-
+eral Born–Markov master equation that ensures complete
+positivity and takes the general form [81, 82]
+
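+\[
+  \frac{d}{dt} \rho_S(t) = -i \bigl[ H'_S, \rho_S(t) \bigr]
+  + \frac{1}{2} \sum_{\alpha\beta} \gamma_{\alpha\beta}
+    \Bigl\{ \bigl[ S_\alpha, \rho_S(t) S_\beta^\dagger \bigr]
+      + \bigl[ S_\alpha \rho_S(t), S_\beta^\dagger \bigr] \Bigr\} ,
+  \tag{29}
+\]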
+
+where $H'_S$ is the renormalized Hamiltonian of the system. The
+coefficients γαβ are time-independent and exhaustively encapsulate
+information about the physical parameters of the decoherence processes
+(and possibly dissipation processes). One can show that the matrix
+Γ ≡ (γαβ) formed by the coefficients γαβ is positive, i.e., all its
+eigenvalues κµ are ≥ 0. Therefore, Eq. (29) can be simplified by
+diagonalizing Γ, which results in the diagonal form [82, 87]
+\[
+  \frac{d}{dt} \rho_S(t) = -i \bigl[ H'_S, \rho_S(t) \bigr]
+  - \frac{1}{2} \sum_\mu \kappa_\mu
+    \Bigl\{ L_\mu^\dagger L_\mu \rho_S(t)
+      + \rho_S(t) L_\mu^\dagger L_\mu
+      - 2 L_\mu \rho_S(t) L_\mu^\dagger \Bigr\} . \tag{30}
+\]
+
+The Lindblad operators Lµ are linear combinations of the
+original operators Sα, with coefficients determined by the
+diagonalization of Γ. The Lindblad structure of a mas-
+ter equation can also be motivated from the requirement
+that it gives rise to the most general form of generators
+of quantum dynamical semigroups [20, 81, 82, 84, 87–89].
+It is possible to bring any Born–Markov master equation into Lindblad
+form by imposing the rotating-wave approximation. This assumption,
+ubiquitous in quantum optics, is justified whenever the timescale set by
+the typical energy differences ℏ(ω − ω′) of the system Hamiltonian is
+short in comparison with the relaxation timescale of the system. (See
+Sec. 3.3.1 of Ref. [20] for details.)
+Because the Sα are not necessarily Hermitian, the
+Lindblad operators do not always correspond to physical
+observables. But when they do, we can rewrite Eq. (30)
+in compact double-commutator form,
+
+\[
+  \frac{d}{dt} \rho_S(t) = -i \bigl[ H'_S, \rho_S(t) \bigr]
+  - \frac{1}{2} \sum_\mu \kappa_\mu
+    \bigl[ L_\mu, \bigl[ L_\mu, \rho_S(t) \bigr] \bigr] . \tag{31}
+\]
+As an example, consider a situation in which the environment monitors
+the position of a system. With L = x and the “free”-particle
+Hamiltonian $H'_S = H_S = p^2/2m$, Eq. (31) becomes
+\[
+  \frac{d}{dt} \rho_S(t)
+  = -\frac{i}{2m} \bigl[ p^2, \rho_S(t) \bigr]
+    - \frac{1}{2} \kappa \bigl[ x, \bigl[ x, \rho_S(t) \bigr] \bigr] .
+  \tag{32}
+\]
+
+Expressing this master equation in the position represen-
+tation results in
+
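+\[
+  \frac{\partial \rho_S(x, x', t)}{\partial t}
+  = -\frac{i}{2m} \left( \frac{\partial^2}{\partial x'^2}
+      - \frac{\partial^2}{\partial x^2} \right) \rho_S(x, x', t)
+    - \frac{1}{2} \kappa\, (x - x')^2\, \rho_S(x, x', t) . \tag{33}
+\]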
+
+This is the classic equation of motion for decoherence due
+to environmental scattering first derived in Ref. [17].
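+
+The content of Eq. (33) is easiest to see when the decoherence term
+dominates. The following sketch assumes that the kinetic term can be
+neglected over the short times of interest; this approximation, and the
+definition of the separation ∆x = |x − x′| used for the timescale, are
+assumptions made here for illustration.
+
+```latex
+\[
+  \frac{\partial \rho_S(x, x', t)}{\partial t}
+  \approx -\frac{1}{2} \kappa\, (x - x')^2\, \rho_S(x, x', t)
+  \quad \Longrightarrow \quad
+  \rho_S(x, x', t) \approx \rho_S(x, x', 0)\,
+    \exp\!\left[ -\frac{\kappa}{2} (x - x')^2\, t \right] ,
+\]
+\[
+  \tau_d = \frac{2}{\kappa\, (\Delta x)^2} .
+\]
+```
+
+Spatial coherences between positions x and x′ thus decay exponentially
+at a rate that grows with the square of their separation, while the
+diagonal elements (x = x′) are unaffected by the decoherence term.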
+Lindblad master equations provide an intuitive and
+simple way of representing the environmental monitoring
+of an open quantum system. Most of the real physics be-
+hind this monitoring process is hidden in the coefficients
+κµ appearing in Eq. (30). If the Lindblad operators are
+chosen to be dimensionless, the κµ can be directly in-
+terpreted as decoherence rates, since they have units of
+inverse time.
+Equation (31) shows that the decoherence term van-
+ishes if
+
+\[
+  \bigl[ L_\mu, \rho_S(t) \bigr] = 0
+  \quad \text{for all } \mu, t. \tag{34}
+\]
+
+In this case, ρS(t) evolves unitarily. Since the Lµ are lin-
+ear combinations of the Sα, Eq. (34) typically means that
+[Sα, ρS(t)] = 0 for all α, t. This implies that simultane-
+ous eigenstates of all Sα will be immune to decoherence,
+which is precisely the pointer-state criterion of Eq. (19).
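+
+A one-qubit example shows how Eq. (31) singles out the pointer basis.
+The sketch below assumes a single Lindblad operator L = σ_z with rate κ
+and sets $H'_S = 0$; both choices are illustrative assumptions, not
+taken from the text above.
+
+```latex
+\[
+  \bigl[ \sigma_z, \bigl[ \sigma_z, |0\rangle\langle 1| \bigr] \bigr]
+    = 4\, |0\rangle\langle 1| , \qquad
+  \bigl[ \sigma_z, |0\rangle\langle 0| \bigr]
+    = \bigl[ \sigma_z, |1\rangle\langle 1| \bigr] = 0 ,
+\]
+\[
+  \frac{d}{dt} \rho_{01}(t) = -2\kappa\, \rho_{01}(t)
+  \;\Longrightarrow\;
+  \rho_{01}(t) = \rho_{01}(0)\, e^{-2\kappa t} , \qquad
+  \frac{d}{dt} \rho_{00}(t) = \frac{d}{dt} \rho_{11}(t) = 0 .
+\]
+```
+
+The eigenstates of σ_z satisfy Eq. (34) and are left untouched, whereas
+coherences between them decay on the timescale 1/(2κ).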
+In quantum-jump and quantum-trajectory approaches, the evolution of the
+reduced density matrix is conditioned on an explicitly observed sequence
+of measurement results in the environment. This allows for the (formal)
+description of a single realization of the system evolving
+stochastically, conditioned on a particular measurement record. The
+dynamics are then described by a master equation of the Lindblad type,
+Eq. (31), for the reduced density matrix $\rho_S^C$ conditioned on the
+measurement records of the Lindblad operators Lµ,
+\[
+  d\rho_S^C = -i \bigl[ H_S, \rho_S^C \bigr]\, dt
+  - \frac{1}{2} \sum_\mu \kappa_\mu
+    \bigl[ L_\mu, \bigl[ L_\mu, \rho_S^C \bigr] \bigr]\, dt
+  + \sum_\mu \sqrt{\kappa_\mu}\; \mathcal{W}[L_\mu]\, \rho_S^C\, dW_\mu .
+  \tag{35}
+\]
+Here, $\mathcal{W}[L]\rho \equiv L\rho + \rho L^\dagger
+- \rho\, \mathrm{Tr}\{ L\rho + \rho L^\dagger \}$, and the dWµ denote
+so-called Wiener increments. Equation (35) corresponds to a diffusive
+unraveling of the Lindblad equation into individual quantum
+trajectories, which can then be expressed by means of a stochastic
+Schrödinger equation [90–102].
+
+C.
+Non-Markovian decoherence
+
+The derivation of the Born–Markov master equation
+assumes that the coupling between system and environ-
+ment is weak and memory effects of the environment can
+be neglected.
+These conditions, however, are not met
+in certain situations of physical interest.
+An example
+would be a superconducting qubit strongly coupled to a
+low-temperature environment of other two-level systems
+[103, 104]. Also, a recent experiment [105] has measured
+strongly non-Ohmic spectral densities for the environ-
+ment of a quantum nanomechanical system; such densi-
+ties lead to non-Markovian evolution.
+In many cases, pronounced memory effects in the envi-
+ronment will cause strong dependencies of the evolution
+of the reduced density operator on the past history of the
+system–environment composite and therefore make it im-
+possible to describe the reduced dynamics by a differen-
+tial equation that is local in time. Surprisingly, however,
+one can show that even non-Markovian dynamics some-
+times can still be described by a time-local differential
+equation of the form
+
+\[
+\frac{d}{dt}\rho_S(t) = K(t)\rho_S(t), \tag{36}
+\]
+
+where the superoperator K(t) depends only on t.
+For
+example, a non-Markovian master equation for quantum
+Brownian motion (see Sec. IV B) can be obtained through
+a formal modification of the Born–Markov master equa-
+tion [4, 5]. In general, it is often possible to arrive at
+non-Markovian but time-local master equations via the
+so-called time-convolutionless projection operator tech-
+nique [106–109].
+
+IV.
+DECOHERENCE MODELS
+
+Many physical systems can be represented either by
+a qubit if the state space of the system is discrete and
+effectively two-dimensional, or by a particle described by
+continuous phase-space coordinates. Needless to say, in
+the case of quantum information processing the qubit
+representation is of particular relevance.
+Similarly, a wide range of environments can be modeled
+as a collection of quantum harmonic oscillators or qubits.
+Harmonic-oscillator environments are of great generality.
+At low energies, many systems interacting with an en-
+vironment can effectively be represented by one or two
+coordinates of the system linearly coupled to an environ-
+ment of harmonic oscillators; indeed, sufficiently weak in-
+teractions with an arbitrary environment can be mapped
+onto a system linearly coupled to a harmonic-oscillator
+environment [110, 111].
+Environments represented by qubits are often the appropriate
+model in the low-temperature regime, where decoherence
+is typically dominated by interactions with localized
+modes, such as paramagnetic spins, paramagnetic
+electronic impurities, tunneling charges, defects, and nu-
+clear spins [103, 104, 112]. Each of the localized modes is
+represented by a finite-dimensional Hilbert space with a
+finite energy cutoff. We can therefore model these modes
+as a set of discrete states. Typically, only two such states
+are relevant, and thus the localized modes can be mapped
+onto an environment of qubits. Since qubits can be formally
+represented by spin-$\tfrac{1}{2}$ particles, such models are
+known as spin-environment models.
+In the following, we will discuss four important stan-
+dard models, namely, collisional decoherence (Sec. IV A),
+quantum Brownian motion (Sec. IV B), the spin–boson
+model (Sec. IV C), and the spin–spin model (Sec. IV D).
+For details on these and other decoherence models, in-
+cluding derivations of the relevant master equations, see
+Secs. 3 and 5 of Ref. [9].
+
+A.
+Collisional decoherence
+
+Collisional decoherence arises from the scattering of en-
+vironmental particles by a massive free quantum particle.
+Models of collisional decoherence were first studied in the
+classic paper by Joos and Zeh [17]. A more rigorous and
+general treatment was later developed by Hornberger and
+collaborators [31, 36–39] (see also [34, 35, 113]), which,
+among other refinements, remedied a flaw in Joos and
+Zeh’s original derivation that had resulted in decoher-
+ence rates that were too large by a factor of 2π [31].
+
+If we assume that the central particle is much more
+massive than the environmental particles such that its
+center-of-mass state is not disturbed by the scattering
+events (no recoil), the time evolution of the reduced den-
+sity matrix is given by [9, 17, 31, 34, 35]
+
+\[
+\frac{\partial \rho_S(x, x', t)}{\partial t}
+  = -F(x - x')\,\rho_S(x, x', t). \tag{37}
+\]
+
+This master equation describes pure spatial decoherence
+without dissipation. The decoherence factor F(x − x′)
+plays the role of a localization rate.
+It represents the
+characteristic decoherence rate at which spatial coher-
+ences between two positions x and x′ become locally sup-
+pressed and is given by
+
+\[
+F(x - x') = \int_0^\infty dq\, \varrho(q)\, v(q)
+  \int \frac{d\hat{n}\, d\hat{n}'}{4\pi}
+  \left(1 - e^{iq(\hat{n} - \hat{n}')\cdot(x - x')}\right)
+  \left|f(q\hat{n}, q\hat{n}')\right|^2. \tag{38}
+\]
+
+Here ϱ(q) denotes the number density of incoming par-
+ticles with magnitude of momentum equal to q = |q|, ˆn
+and ˆn′ are unit vectors (with dˆn and dˆn′ representing the
+associated solid-angle differentials), and v(q) denotes the
+speed of particles with momentum q. For the scattering
+of massive environmental particles we have v(q) = q/m,
+where m is each particle’s mass, while for the scatter-
+ing of photons and other massless particles v(q) is equal
+to the speed of light. The quantity |f(qˆn, qˆn′)|2 is the
+differential cross section for the scattering of an environ-
+mental particle from initial momentum q = qˆn to final
+momentum q′ = qˆn′.
+Whenever the mass of the central particle becomes
+comparable to the mass of the environmental particles (as
+in the case of air molecules scattered by small molecules
+and free electrons [114]), the no-recoil assumption does
+not hold and more general models for collisional deco-
+herence have to be considered [35, 36].
+The resulting
+dynamics include dissipation, as well as decoherence in
+both position and momentum.
+To further evaluate the decoherence factor F(x − x′),
+Eq. (38), we distinguish two important limiting cases. In
+the short-wavelength limit, the typical wavelength of the
+scattered environmental particles is much shorter than
+the coherent separation ∆x = |x − x′| between the well-
+localized wave packets in the spatial superposition state
+of the system.
+Then a single scattering event will be
+able to fully resolve this separation and thus carry away
+complete which-path information, leading to maximum
+spatial decoherence per scattering event. In this limit,
+F(x − x′) turns out to be simply equal to the total scat-
+tering rate Γtot [9]. This implies the existence of an upper
+limit for the decoherence rate when increasing the sepa-
+ration ∆x, in contrast with decoherence rates obtained
+from linear models [compare Eqs. (22) and (54)]. Equa-
+tion (37) then shows that spatial interference terms will
+become exponentially suppressed at a rate set by Γtot,
+
+\[
+\rho_S(x, x', t) = \rho_S(x, x', 0)\, e^{-\Gamma_\text{tot} t}. \tag{39}
+\]
+
+TABLE I. Estimates of decoherence timescales (in seconds)
+for the suppression of spatial interferences over a distance $\Delta x$
+equal to the size $a$ of the object ($\Delta x = a = 10^{-3}$ cm for a
+dust grain and $\Delta x = a = 10^{-6}$ cm for a large molecule). See
+Ref. [9] for details.
+
+Environment                  | Dust grain  | Large molecule
+-----------------------------+-------------+---------------
+Cosmic background radiation  | $1$         | $10^{24}$
+Photons at room temperature  | $10^{-18}$  | $10^{6}$
+Best laboratory vacuum       | $10^{-14}$  | $10^{-2}$
+Air at normal pressure       | $10^{-31}$  | $10^{-19}$
+
+In the opposite long-wavelength limit, the environmen-
+tal wavelengths are much larger than the coherent sep-
+aration ∆x = |x − x′|, which implies that an individual
+scattering event will reveal only incomplete which-path
+information. For this case, one can show that spatial co-
+herences become exponentially suppressed at a rate that
+depends on the square of the separation ∆x [9],
+
+\[
+\rho_S(x, x', t) = \rho_S(x, x', 0)\, e^{-\Lambda (\Delta x)^2 t}, \tag{40}
+\]
+
+where Λ is a scattering constant that encapsulates the
+physical details of the interaction.
+Thus, the quantity
+Λ(∆x)2 plays the role of a decoherence rate.
+The de-
+pendence of this rate on ∆x is reasonable: if the envi-
+ronmental wavelengths are much larger than ∆x, it will
+require a large number of scattering events to encode
+an appreciable amount of which-path information in the
+environment, and this amount will increase, for a given
+number of scattering events, as ∆x becomes larger. Note
+that if ∆x is increased beyond the typical wavelength of
+the environment, the short-wavelength limit needs to be
+considered instead, for which the decoherence rate is in-
+dependent of ∆x and attains its maximum possible value.
+Numerical values of collisional decoherence rates ob-
+tained from Eqs. (39) and (40), with the physically rele-
+vant scattering parameters Γtot and Λ appropriately eval-
+uated, have shown the extreme efficiency of collisions in
+suppressing spatial interferences; Table I shows a few
+classic order-of-magnitude estimates [8, 9, 17].
+Excel-
+lent agreement between theory and experiment has been
+demonstrated for the decoherence of fullerenes due to col-
+lisions with background gas molecules in a Talbot–Lau
+interferometer [31, 115–118] (see Sec. VI B and Fig. 2),
+and for the decoherence of sodium atoms in a Mach–
+Zehnder interferometer due to the scattering of photons
+[119] and gas molecules [120].
+
+B.
+Quantum Brownian motion
+
+A classic and extensively studied model of decoherence
+and dissipation is the one-dimensional motion of a par-
+ticle weakly coupled to a thermal bath of noninteracting
+harmonic oscillators, a model known as quantum Brownian
+motion. The self-Hamiltonian $H_E$ of the environment
+is given by
+
+\[
+H_E = \sum_i \left( \frac{1}{2m_i} p_i^2
+  + \frac{1}{2} m_i \omega_i^2 q_i^2 \right), \tag{41}
+\]
+
+where $m_i$ and $\omega_i$ denote the mass and natural frequency
+of the $i$th oscillator, and $q_i$ and $p_i$ are the canonical
+position and momentum operators. The interaction Hamiltonian
+$H_\text{int}$ describes the bilinear coupling of the system’s
+position coordinate $x$ to the positions $q_i$ of the environmental
+oscillators, $H_\text{int} = x \otimes \sum_i c_i q_i$, where the $c_i$
+denote coupling strengths. This interaction represents the
+continuous environmental monitoring of the position
+coordinate of the system.
+The Born–Markov master equation describing the evo-
+lution of the density matrix ρS(t) of the system is given
+by [9, 45]
+
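+\[
+\begin{aligned}
+\frac{d}{dt}\rho_S(t) = {} & -i\left[H_S, \rho_S(t)\right]
+  - \int_0^\infty d\tau\, \Big\{ \nu(\tau)
+    \left[x, \left[x(-\tau), \rho_S(t)\right]\right] \\
+& - i\eta(\tau) \left[x, \left\{x(-\tau), \rho_S(t)\right\}\right] \Big\}.
+\end{aligned}
+\tag{42}
+\]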
+
+Here, $x(\tau)$ denotes the system’s position operator in the
+interaction picture, $x(\tau) = e^{iH_S\tau}\, x\, e^{-iH_S\tau}$. The curly
+brackets $\{\,\cdot\,, \cdot\,\}$ in the second line denote the
+anticommutator $\{A, B\} \equiv AB + BA$. The functions
+
+\[
+\nu(\tau) = \int_0^\infty d\omega\, J(\omega)
+  \coth\left(\frac{\omega}{2kT}\right) \cos(\omega\tau), \tag{43}
+\]
+
+\[
+\eta(\tau) = \int_0^\infty d\omega\, J(\omega) \sin(\omega\tau), \tag{44}
+\]
+
+are known as the noise kernel and dissipation kernel, re-
+spectively. The function J(ω), called the spectral density
+of the environment, is given by
+
+\[
+J(\omega) \equiv \sum_i \frac{c_i^2}{2 m_i \omega_i}\,
+  \delta(\omega - \omega_i). \tag{45}
+\]
+
+In general, spectral densities encapsulate the physi-
+cal properties of the environment.
+One frequently re-
+places the collection of individual environmental oscilla-
+tors by an (often phenomenologically motivated) contin-
+uous function J(ω) of the environmental frequencies ω.
+If we specialize to the important case of the system rep-
+resented by a harmonic oscillator with self-Hamiltonian
+
+\[
+H_S = \frac{1}{2M} p^2 + \frac{1}{2} M \Omega^2 x^2, \tag{46}
+\]
+
+the resulting Born–Markov master equation is
+
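+\[
+\begin{aligned}
+\frac{d}{dt}\rho_S(t) = {} &
+  -i\left[H_S + \tfrac{1}{2} M \tilde{\Omega}^2 x^2, \rho_S(t)\right]
+  - i\gamma \left[x, \left\{p, \rho_S(t)\right\}\right] \\
+& - D \left[x, \left[x, \rho_S(t)\right]\right]
+  - f \left[x, \left[p, \rho_S(t)\right]\right].
+\end{aligned}
+\tag{47}
+\]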
+
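+The coefficients $\tilde{\Omega}^2$, $\gamma$, $D$, and $f$ are defined as
+
+\begin{align}
+\tilde{\Omega}^2 &\equiv -\frac{2}{M} \int_0^\infty d\tau\,
+  \eta(\tau) \cos(\Omega\tau), \tag{48a} \\
+\gamma &\equiv \frac{1}{M\Omega} \int_0^\infty d\tau\,
+  \eta(\tau) \sin(\Omega\tau), \tag{48b} \\
+D &\equiv \int_0^\infty d\tau\, \nu(\tau) \cos(\Omega\tau), \tag{48c} \\
+f &\equiv -\frac{1}{M\Omega} \int_0^\infty d\tau\,
+  \nu(\tau) \sin(\Omega\tau). \tag{48d}
+\end{align}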
+
+The first term on the right-hand side of Eq. (47) represents
+the unitary dynamics of a harmonic oscillator whose
+natural frequency is shifted by $\tilde{\Omega}$. The second term
+describes momentum damping (dissipation) at a rate
+proportional to $\gamma$, which depends only on the spectral
+density but not the temperature of the environment. The
+third term is of the Lindblad double-commutator form
+[see Eq. (31)] and describes decoherence of spatial coherences
+over a distance $\Delta X$ at a rate $D(\Delta X)^2$. Note that
+$D$ depends on both the spectral density $J(\omega)$ and the
+temperature $T$ of the environment. The fourth term also
+represents decoherence, but its influence on the dynamics
+of the system is usually negligible, especially at higher
+temperatures. In the long-time limit $\gamma t \gg 1$, the master
+equation (47) describes dispersion in position space given
+by
+
+\[
+\Delta X^2(t) = \frac{D}{2 m^2 \gamma^2}\, t. \tag{49}
+\]
+
+That is, the width $\Delta X(t)$ of the ensemble in position
+space asymptotically scales as $\Delta X(t) \propto \sqrt{t}$, just as in
+classical Brownian motion; hence the term “quantum
+Brownian motion.”
+Figure 1 shows the time evolution of position-space and
+momentum-space superpositions of two Gaussian wave
+packets in the Wigner picture, as described by Eq. (47)
+[28]. Interference between the two wave packets is rep-
+resented by oscillations between the direct peaks. The
+interaction with the environment damps these oscilla-
+tions.
+The damping occurs on different timescales for
+the two initial conditions. While the momentum coordi-
+nate is not directly monitored by the environment, the
+intrinsic dynamics, through their creation of spatial su-
+perpositions from superpositions of momentum, result
+in decoherence in momentum space.
+This interplay of
+environmental monitoring and intrinsic dynamics leads
+to the emergence of pointer states that are minimum-
+uncertainty Gaussians (coherent states) well-localized in
+both position and momentum, thus approximating clas-
+sical points in phase space [5, 8, 16, 28, 44, 47, 48].
+Let us consider the important case of an ohmic spectral
+density J(ω) ∝ ω with a high-frequency cutoff Λ,
+
+\[
+J(\omega) = \frac{2M\gamma_0}{\pi}\, \omega\,
+  \frac{\Lambda^2}{\Lambda^2 + \omega^2}. \tag{50}
+\]
+
+FIG. 1. Evolution of superpositions of Gaussian wave packets
+in quantum Brownian motion as studied in Ref. [28], visualized
+in the Wigner representation. Time increases from top
+to bottom. In the left column, the initial wave packets are
+separated in position; in the right column, the separation is
+in momentum. (Axes: position $x$ and momentum $p$.)
+
+In the limit of a high-temperature environment ($kT \gg \Omega$
+and $kT \gg \Lambda$), we arrive at the Caldeira–Leggett master
+equation [121],
+
+\[
+\frac{d}{dt}\rho_S(t) = -i\left[H'_S, \rho_S(t)\right]
+  - i\gamma_0 \left[x, \left\{p, \rho_S(t)\right\}\right]
+  - 2M\gamma_0 kT \left[x, \left[x, \rho_S(t)\right]\right], \tag{51}
+\]
+
+where
+
+\[
+H'_S = H_S + \frac{1}{2} M \tilde{\Omega}^2 x^2
+  = \frac{1}{2M} p^2
+  + \frac{1}{2} M \left(\Omega^2 - 2\gamma_0\Lambda\right) x^2 \tag{52}
+\]
+
+is the frequency-shifted Hamiltonian $H'_S$ of the system.
+This equation has been widely and successfully used to
+model decoherence and dissipation processes, even in
+cases where the assumptions were not strictly fulfilled
+(for example, in quantum-optical settings, where often
+kT ≲ Λ [122]).
+In the position representation, the final term on the
+right-hand side of Eq. (51) can be written as
+
+\[
+-\gamma_0 \left(\frac{x - x'}{\lambda_\mathrm{dB}}\right)^2
+  \rho_S(x, x', t), \tag{53}
+\]
+
+where $\lambda_\mathrm{dB} = (2MkT)^{-1/2}$ is the thermal de Broglie
+wavelength. This term describes spatial localization with a
+decoherence rate $\tau^{-1}_{|x - x'|}$ given by [18]
+
+\[
+\tau^{-1}_{|x - x'|}
+  = \gamma_0 \left(\frac{x - x'}{\lambda_\mathrm{dB}}\right)^2. \tag{54}
+\]
+
+This is Eq. (22), and as discussed there, given that λdB is
+extremely small for macroscopic and even mesoscopic ob-
+jects, we see that superpositions of macroscopically sepa-
+rated center-of-mass positions will typically be decohered
+on timescales many orders of magnitude shorter than the
+dissipation (relaxation) timescale $\gamma_0^{-1}$. Over timescales
+on the order of the decoherence time, we may therefore
+often neglect the dissipative term in Eq. (51), leading to
+the pure-decoherence master equation
+
+\[
+\frac{d}{dt}\rho_S(t) = -i\left[H'_S, \rho_S(t)\right]
+  - 2M\gamma_0 kT \left[x, \left[x, \rho_S(t)\right]\right]. \tag{55}
+\]
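+
+As a rough numerical sketch of Eq. (54), the ratio of the decoherence
+timescale to the relaxation timescale is $(\lambda_\mathrm{dB}/\Delta x)^2$.
+The values used below ($M = 1$ g, $T = 300$ K, $\Delta x = 1$ cm) are
+illustrative assumptions and are not taken from the text, and $\hbar$ and
+$k_B$ are restored in SI units even though the text sets $\hbar = 1$:
+
+```latex
+\[
+\frac{\tau_{|x - x'|}}{\gamma_0^{-1}}
+  = \left(\frac{\lambda_\mathrm{dB}}{\Delta x}\right)^2,
+\qquad
+\lambda_\mathrm{dB} = \frac{\hbar}{\sqrt{2 M k_B T}},
+\]
+\[
+\lambda_\mathrm{dB} \approx
+  \frac{1.05 \times 10^{-34}\,\mathrm{J\,s}}
+       {\sqrt{2\,(10^{-3}\,\mathrm{kg})\,(4.14 \times 10^{-21}\,\mathrm{J})}}
+  \approx 3.7 \times 10^{-23}\,\mathrm{m},
+\qquad
+\left(\frac{\lambda_\mathrm{dB}}{10^{-2}\,\mathrm{m}}\right)^2
+  \approx 1.3 \times 10^{-41}.
+\]
+```
+
+For these assumed values the superposition decoheres roughly forty-one
+orders of magnitude faster than it relaxes, which illustrates why the
+dissipative term can be dropped on the decoherence timescale.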
+
+C.
+Spin–boson models
+
+In the spin–boson model, a qubit interacts with an
+environment of harmonic oscillators. The seminal review
+paper by Leggett et al. [29] discusses the dynamics of the
+spin–boson model in great detail.
+Let us first consider a simplified spin–boson model
+where the self-Hamiltonian of the system is taken to be
+$H_S = \frac{1}{2}\omega_0\sigma_z$, with eigenstates $|0\rangle$ and
+$|1\rangle$. In contrast with the more general case discussed below,
+this Hamiltonian does not include a tunneling term
+$-\frac{1}{2}\Delta_0\sigma_x$, and thus
+$H_S$ does not generate any nontrivial intrinsic dynamics.
+We employ the familiar self-Hamiltonian, Eq. (41), for
+an environment of harmonic oscillators, and choose the
+bilinear interaction Hamiltonian
+$H_\text{int} = \sigma_z \otimes \sum_i c_i q_i$. Using
+the raising and lowering operators $a^\dagger$ and $a$, we can
+recast the total Hamiltonian as
+
+\[
+H = \frac{1}{2}\omega_0\sigma_z + \sum_i \omega_i a_i^\dagger a_i
+  + \sigma_z \otimes \sum_i \left(g_i a_i^\dagger + g_i^* a_i\right). \tag{56}
+\]
+
+Note that since $[H, \sigma_z] = 0$, no transitions between $|0\rangle$
+and $|1\rangle$ can be induced by $H$. There is no energy exchange
+between the system and the environment, and we
+therefore deal with a model of decoherence without dis-
+sipation. Such a model is a good representation of rapid
+decoherence processes during which the amount of dissi-
+pation is negligible, as is often the case in physical appli-
+cations. The resulting evolution can be solved exactly [9].
+For an ohmic spectral density with a high-frequency cut-
+off, it is found that superpositions of the form α |0⟩+β |1⟩
+are exponentially decohered on a timescale set by the
+thermal correlation time (kT)−1 of the environment.
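+
+A sketch of the form the exact solution takes, written for the discrete
+couplings $g_i$ of Eq. (56) and assuming a thermal initial state of the
+environment that is uncorrelated with the system (both are assumptions of
+this sketch; the symbol $\Gamma(t)$ is introduced here for illustration).
+The populations stay constant, while the magnitude of the coherence
+$\rho_{01} = \langle 0|\rho_S|1\rangle$ acquires a decay factor:
+
+```latex
+\[
+|\rho_{01}(t)| = |\rho_{01}(0)|\, e^{-\Gamma(t)},
+\qquad
+\Gamma(t) = \sum_i \frac{4|g_i|^2}{\omega_i^2}
+  \coth\left(\frac{\omega_i}{2kT}\right)
+  \left(1 - \cos\omega_i t\right).
+\]
+\[
+\Gamma(t) \approx 2t^2 \sum_i |g_i|^2 \coth\left(\frac{\omega_i}{2kT}\right)
+\quad \text{for } \omega_i t \ll 1 \text{ for all } i.
+\]
+```
+
+Replacing the sum by an integral over a continuous spectral density (whose
+normalization depends on convention) gives the continuum form. The second
+line shows that, in this sketch, the decay always starts out Gaussian in
+time before any exponential regime sets in.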
+Inclusion of a tunneling term $-\frac{1}{2}\Delta_0\sigma_x$ yields the
+general spin–boson model defined by the Hamiltonian
+
+\[
+H = \frac{1}{2}\omega_0\sigma_z - \frac{1}{2}\Delta_0\sigma_x
+  + \sum_i \left( \frac{1}{2m_i} p_i^2
+  + \frac{1}{2} m_i \omega_i^2 q_i^2 \right)
+  + \sigma_z \otimes \sum_i c_i q_i. \tag{57}
+\]
+
+The rich non-Markovian dynamics of this model have
+been analyzed in Refs. [29, 123]. The particular dynamics
+strongly depend on the various parameters, such as the
+temperature of the environment, the form of the spec-
+tral density (subohmic, ohmic, or supraohmic), and the
+system–environment coupling strength. For each param-
+eter regime, a characteristic dynamical behavior emerges:
+localization, exponential or incoherent relaxation, expo-
+nential decay, and strongly or weakly damped coherent
+oscillations [29].
+In the weak-coupling limit, one can derive the Born–
+Markov master equation in much the same way as in
+the case of quantum Brownian motion (note the similar
+structure of the Hamiltonians). The result is (see Ref. [9]
+for details)
+
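+\[
+\begin{aligned}
+\frac{d}{dt}\rho_S(t) = {} &
+  -i\left(H'_S \rho_S(t) - \rho_S(t) H_S^{\prime\dagger}\right) \\
+& - \tilde{D} \left[\sigma_z, \left[\sigma_z, \rho_S(t)\right]\right]
+  + \zeta \sigma_z \rho_S(t) \sigma_y
+  + \zeta^* \sigma_y \rho_S(t) \sigma_z.
+\end{aligned}
+\tag{58}
+\]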
+
+The first term on the right-hand side of the master
+equation (58) represents the evolution under the
+environment-shifted self-Hamiltonian $H'_S$, the second term
+corresponds to decoherence in the $\sigma_z$ eigenbasis of the system
+at a rate given by $\tilde{D}$, and the last two terms describe
+the decay of the two-level system. $H'_S$ is the renormalized
+(and in general non-Hermitian) Hamiltonian of the
+system. The coefficients $\zeta^*$, $\tilde{D}$, $\tilde{f}$, and
+$\tilde{\gamma}$ are given by
+
+\begin{align}
+\zeta^* &= \tilde{f} - i\tilde{\gamma}, \tag{59a} \\
+\tilde{D} &= \int_0^\infty d\tau\, \nu(\tau) \cos(\Delta_0\tau), \tag{59b} \\
+\tilde{f} &= \int_0^\infty d\tau\, \nu(\tau) \sin(\Delta_0\tau), \tag{59c} \\
+\tilde{\gamma} &= \int_0^\infty d\tau\, \eta(\tau) \sin(\Delta_0\tau),
+  \tag{59d}
+\end{align}
+
+with the noise and the dissipation kernels ν(τ) and η(τ)
+taking the same form as in quantum Brownian motion
+[see Eqs. (43) and (44)].
+
+D.
+Spin-environment models
+
+A qubit linearly coupled to a collection of other
+qubits—known also as a spin–spin model—is often a good
+model of a single two-level system, such as a supercon-
+ducting qubit, strongly coupled to a low-temperature en-
+vironment [103, 104].
+The model of a harmonic oscil-
+lator interacting with a spin environment may be rele-
+vant to the description of decoherence and dissipation in
+quantum-nanomechanical systems and micron-scale ion
+traps [124]. For details on the theory of spin-environment
+models, see Refs. [104, 125–127].
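+A simple version of a spin–spin model is described by
+the total Hamiltonian
+
+\[
+\begin{aligned}
+H = H_S + H_\text{int} &= -\frac{1}{2}\Delta_0\sigma_x
+  + \frac{1}{2}\sigma_z \otimes \sum_{i=1}^{N} g_i \sigma_z^{(i)} \\
+&\equiv -\frac{1}{2}\Delta_0\sigma_x + \frac{1}{2}\sigma_z \otimes E.
+\end{aligned}
+\tag{60}
+\]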
+
+Here, HS represents the intrinsic dynamics given by a
+tunneling term, while Hint describes the environmental
+monitoring of the observable σz.
+The model can be solved exactly [128, 129], and
+the resulting dynamics illustrate the dependence of the
+preferred basis on the relative strengths of the self-
+Hamiltonian of the system and the interaction Hamil-
+tonian.
+The preferred basis emerges as the local ba-
+sis that is most robust under the total Hamiltonian.
+When the interaction Hamiltonian dominates over the
+self-Hamiltonian, the pointer states are found to be eigen-
+states of the interaction Hamiltonian, in agreement with
+the commutativity criterion, Eq. (18). Conversely, when
+the modes of the environment are slow and the self-
+Hamiltonian dominates the evolution of the system (the
+quantum limit of decoherence [42]), the pointer states are
+the eigenstates of the Hamiltonian of the system.
+In the weak-coupling limit, spin environments can be
+mapped onto oscillator environments [110, 130]. Specifi-
+cally, the reduced dynamics of a system weakly coupled
+to a spin environment can be described by the system
+coupled to an equivalent oscillator environment described
+by an explicitly temperature-dependent spectral density
+of the form
+
+\[
+J_\text{eff}(\omega, T) \equiv J(\omega)
+  \tanh\left(\frac{\omega}{2kT}\right), \tag{61}
+\]
+
+where J(ω) is the original spectral density of the spin
+environment. (See Sec. 5.4.2 of Ref. [9] for details and
+examples.)
+
+V.
+QUBIT DECOHERENCE, QUANTUM
+ERROR CORRECTION, AND ERROR
+AVOIDANCE
+
+Quantum computation and quantum information pro-
+cessing rely on coherent superpositions of mesoscopically
+or macroscopically distinct states that are highly suscep-
+tible to decoherence. Avoiding, controlling, and mitigat-
+ing decoherence is therefore of paramount importance.
+While the qubits need to be protected from detrimental
+environmental interactions, we also need to be able to
+control and measure them via a macroscopic apparatus.
+The formidable challenge of designing a quantum com-
+puter consists of meeting both demands in a balanced
+way. Even so, decoherence induced by interactions with
+the environment and the control apparatus, as well as
+noise due to faulty gate operations, will likely be too
+strong to allow for useful quantum computations to be
+carried out [74, 131]. What is also needed is a
+mitigation of the effects of decoherence through active
+quantum error correction [132–136].
+We may distinguish two limiting cases for modeling
+decoherence in qubits. The first limit is that of indepen-
+dent qubit decoherence. Here, each qubit couples indepen-
+dently to its own environment, without any interactions
+between these environments. For example, this may be
+the case if the qubits are spatially well-separated (rela-
+tive to the typical coherence length of the environment)
+and only couple to their immediate surroundings. Then
+the error processes affecting the qubits will be completely
+uncorrelated. Thus, if the probability of a particular er-
+ror to affect one qubit is p, the probability of this error
+to occur in $K$ qubits will be $p^K$. Many error-correcting
+schemes are only efficient in correcting such single-qubit
+errors, and thus the assumption of independent decoher-
+ence frequently underlies these schemes. This assump-
+tion, however, is unrealistic when the qubits are located
+spatially close to each other. In this case, all qubits ap-
+proximately feel the same environment, and it is likely
+that errors will become correlated among multiple qubits.
+The limiting case corresponding to this situation is that
+of collective qubit decoherence, in which all qubits couple
+to exactly the same environment.
+
+A.
+Correction of decoherence-induced quantum
+errors
+
+Consider a single qubit S, initially described by a pure
+state |ψ⟩ and interacting with an environment E. One
+can show that an arbitrary evolution of the combined
+qubit–environment state can always be written in the
+form
+
+\[
+|\psi\rangle |e_0\rangle \longrightarrow I |\psi\rangle |e_I\rangle
+  + \sum_{s = x, y, z} \left(\sigma_s |\psi\rangle\right) |e_s\rangle, \tag{62}
+\]
+
+where the Pauli operators σs act on the Hilbert space of
+S, and |eI⟩ and {|es⟩} are environmental states that are
+not necessarily orthogonal or normalized. Thus, any in-
+fluence of the environment on the qubit can be expressed
+simply in terms of a weighted sum of the Pauli operators
+and the identity operator acting on the original state of
+the qubit. The effects of σx and σz on the qubit state are
+often referred to as a bit-flip error and phase-flip error,
+respectively. If we restrict our attention to environmental
+entanglement and the resulting decoherence effects, then
+only phase-flip errors need to be taken into account.
+For N qubits, Eq. (62) generalizes to
+
+\[
+|\psi\rangle |e_0\rangle \longrightarrow
+  \sum_i \left(E_i |\psi\rangle\right) |e_i\rangle. \tag{63}
+\]
+
+Here |ψ⟩ is the initial N-qubit state, and the error op-
+erators Ei are tensor products of N operators involv-
+ing identity and Pauli operators. Equation (63) repre-
+sents a worst-case scenario.
+In many cases, simplified
+versions can be used. One important case is that of par-
+tial decoherence. Here, only a small number K < N of
+qubits become entangled with the environment between
+two successive applications of an error-correcting mech-
+anism. Then it will be sufficient to restrict our attention
+to the $2^K$ possible error operators made up of at most $K$
+operators σz and N − K identity operators. In the case
+of independent qubit decoherence, we only need to con-
+sider a collection of independent phase-flip errors acting
+on single qubits, represented by error operators of the
+form E = I ⊗ · · · ⊗ I ⊗ σz ⊗ I ⊗ · · · ⊗ I.
+Given the entangled state on the right-hand side of
+Eq. (63), the goal of quantum error correction is to re-
+store the initial (unknown) state |ψ⟩. We let an ancilla,
+described by an initial state |a0⟩, interact with the qubit
+system such that
+
+\[
+|a_0\rangle \left( \sum_i \left(E_i |\psi\rangle\right) |e_i\rangle \right)
+  \longrightarrow
+  \sum_i |a_i\rangle \left(E_i |\psi\rangle\right) |e_i\rangle. \tag{64}
+\]
+
+Let us assume that the ancilla states |ai⟩ are at least
+approximately mutually orthogonal, such that they can
+be distinguished by measurement. We now measure the
+observable $O_A = \sum_i a_i |a_i\rangle\langle a_i|$ on the ancilla, with
+$a_i \neq a_j$ for $i \neq j$. The projective measurement will yield a
+particular outcome, say, $a_k$, and lead to the reduction of
+the entangled state,
+
+\[
+\sum_i |a_i\rangle \left(E_i |\psi\rangle\right) |e_i\rangle
+  \longrightarrow
+  |a_k\rangle \left(E_k |\psi\rangle\right) |e_k\rangle. \tag{65}
+\]
+
+The outcome $a_k$ of the measurement tells us the
+countertransformation needed to restore the initial qubit state.
+Applying $E_k^{-1} = E_k^\dagger$ to the system gives
+
+\[
+|a_k\rangle \left(E_k |\psi\rangle\right) |e_k\rangle
+  \xrightarrow{\;E_k^{-1}\;}
+  |a_k\rangle |\psi\rangle |e_k\rangle. \tag{66}
+\]
+
+Note that, as required in order to avoid introducing ad-
+ditional decoherence in the computational basis of the
+qubit system, we have obtained no information whatso-
+ever about the state of the system.
+This account of quantum error correction has been
+highly idealized.
+Let us mention three complications.
+First, it is impossible to design an interaction between
+the computational qubits and the ancilla that would al-
+low us to distinguish, by measuring the ancilla, between
+all possible errors. Second, in realistic settings the error
+operators Ei may be very complex, and it remains to be
+seen whether and how the corresponding countertrans-
+formations can be applied without introducing signifi-
+cant additional decoherence.
+Third, the ancilla qubits
+are physically similar to the computational qubits and
+can therefore be expected to be equally prone to en-
+vironmental interactions (and thus decoherence) as the
+computational qubits themselves. Since the inclusion of
+ancilla qubits increases the total number of qubits in the
+quantum computer, and since decoherence rates typically
+scale exponentially with the size of the system, it will re-
+quire sophisticated experimental designs to ensure not
+only that quantum error correction works in practice,
+but also that it does not aggravate the problem of qubit
+decoherence.
+
+B.
+Quantum computation on decoherence-free
+subspaces
+
+We introduced the concept of decoherence-free sub-
+spaces (DFS) [49–58],
+or pointer subspaces [3],
+in
+Sec. II C 2. DFS allow us to encode quantum informa-
+tion in “quiet corners” of the Hilbert space to protect
+it from environmental effects. In contrast with quantum
+error correction, DFS prevent errors from happening in
+the first place and thus represent a strategy for intrinsic
+error avoidance.
+The two limiting cases of independent qubit decoher-
+ence and collective qubit decoherence delineate the lim-
+its on the size of a DFS. To illustrate this relation-
+ship, let us consider the case of collective decoherence
+of an N-qubit system interacting with an oscillator bath
+[49, 51, 53, 56, 137].
+The interaction Hamiltonian for
+this generalized spin–boson model is taken to be [com-
+pare Eq. (56)]
+
+\[
+H_\text{int} = \sum_{i=1}^{N} \sigma_z^{(i)} \otimes
+  \sum_j \left(g_{ij} a_j^\dagger + g_{ij}^* a_j\right)
+  \equiv \sum_{i=1}^{N} \sigma_z^{(i)} \otimes E_i. \tag{67}
+\]
+
+The assumption of collective decoherence implies that
+the couplings gij (and thus the environment operators
+Ei) must be independent of the index i. Then Eq. (67)
+becomes
+
+\[
+H_\text{int} = \left( \sum_i \sigma_z^{(i)} \right) \otimes E
+  \equiv S_z \otimes E. \tag{68}
+\]
+
+Recall that a DFS is spanned by a degenerate set of
+eigenstates of the system operators Sα of the interaction
+Hamiltonian [see Eq. (20)]. Thus, in our case the DFS
+will be spanned by degenerate eigenstates of the collec-
+tive spin operator Sz. Any N-qubit product state of the
+computational basis states |0⟩ and |1⟩ (the eigenstates of
+σz with eigenvalues +1 and −1, respectively) will be an
+eigenstate of $S_z$. There are $N + 1$ different possible integer
+eigenvalues $m$, ranging in steps of 2 from $m = -N$ (corresponding
+to the basis state |1 · · · 1⟩) to m = +N (corresponding
+to |0 · · · 0⟩). The largest number of mutually orthogonal
+computational-basis states with the same eigenvalue m
+of Sz is given by the set S0 of basis states with m = 0,
+i.e., those with N/2 qubits in the state |0⟩. There are
+$n_0 = \binom{N}{N/2}$ such states in this set, spanning a DFS of
+dimension $n_0$. For large values of $N$, we can approximate
+the binomial coefficient using Stirling’s formula,
+
+\[
+\log_2 \binom{N}{N/2} \approx N - \frac{1}{2}\log_2(\pi N/2)
+  \xrightarrow{\;N \gg 1\;} N. \tag{69}
+\]
+
+Therefore, in the limiting case of collective decoherence,
+the dimension of our DFS approaches the dimension of
+the original Hilbert space, and the encoding efficiency
+approaches unity. For example, for $N = 4$ qubits, the set
+
+\[
+S_0 = \{\, |0011\rangle, |0101\rangle, |0110\rangle,
+  |1001\rangle, |1010\rangle, |1100\rangle \,\} \tag{70}
+\]
+
+spans a maximum-size DFS of dimension six, to be compared
+with the dimension of the original Hilbert space,
+which is $2^4 = 16$. Thus, given the model for collective
+decoherence considered here, using four physical qubits we
+can encode up to two logical qubits in a DFS (since
+encoding three logical qubits would already require a DFS
+of dimension $2^3 = 8$).
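+
+The same counting can be repeated for a slightly larger register as a quick
+check of Eq. (69). The choice $N = 6$ below is an illustrative assumption,
+not an example from the text:
+
+```latex
+\[
+n_0 = \binom{6}{3} = 20,
+\qquad
+\log_2 20 \approx 4.32,
+\qquad
+6 - \tfrac{1}{2}\log_2(3\pi) \approx 4.38.
+\]
+```
+
+Since $2^4 = 16 \le 20 < 32 = 2^5$, six physical qubits would hold up to four
+logical qubits in this DFS, and the Stirling estimate is already within about
+0.06 of the exact value of $\log_2 n_0$.
+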
+As mentioned in Sec. II C 2, the existence of a DFS
+corresponds to a dynamical symmetry. Our model rep-
+resents a case of perfect dynamical symmetry, since
+the system–environment interaction, Eq. (68), is com-
+pletely symmetric with respect to any permutations of
+the qubits, thereby leading to a DFS of maximum size.
+What happens if the symmetry is broken by additional
+small independent coupling terms? It has been shown
+[50, 138] that, to first order in the perturbation strength,
+the storage of quantum information in DFS is stable to
+such perturbations to all orders in time, but that the pro-
+cessing of such quantum information encoded in DFS is
+robust only to first order in time.
+In the case of purely independent qubit decoherence,
+the environment operators Ei appearing in Eq. (67) will
+now differ from one another. To find a DFS, we follow
+the usual strategy [see Eq. (20)] of determining a set of
+orthonormal basis states {|si⟩} such that
+
+    (I^(1) ⊗ · · · ⊗ I^(j−1) ⊗ σz^(j) ⊗ I^(j+1) ⊗ · · · ⊗ I^(N)) |si⟩ = λ^(j) |si⟩    (71)
+
+for all i and 1 ≤ j ≤ N. The only state fulfilling this
+eigenvalue problem is |0 · · · 0⟩.
+Since we need at least
+a two-dimensional subspace to encode a single logical
+qubit, the case of independent decoherence in the spin–
+boson model does not allow for the existence of a DFS for
+quantum computation. In the language of pointer sub-
+spaces, there is only a single exact pointer state, and this
+environment-superselected preferred state of the system
+will be the ground state |0 · · · 0⟩.
+In realistic settings, neither the assumption of purely
+independent decoherence nor the limit of entirely collec-
+tive decoherence will be entirely appropriate. We can,
+however, use a DFS to protect the qubits from collective
+decoherence effects, and we can recover from single-qubit
+errors due to independent decoherence using active error-
+correction methods. These two approaches can be con-
+catenated [54] to enable universal fault-tolerant quantum
+computation even when the restriction to single-qubit er-
+rors is dropped [55, 139].
+
+C. Environment engineering and dynamical decoupling
+
+For reasonably large DFS to exist, the system–environment
+interaction must exhibit a sufficiently high degree of symmetry.
+Such symmetries are unlikely to arise naturally in typical
+experimental settings.
+
+One way of overcoming this limitation is based on envi-
+ronment engineering. Here, one tries to generate certain
+symmetries in the structure of the system–environment
+interactions. For example, an appropriately engineered
+symmetrization could make superposition states in Bose–
+Einstein condensates correspond to (approximate) de-
+generate eigenstates of the interaction Hamiltonian, in
+which case such states would lie within a DFS, thereby
+significantly enhancing their longevity [140].
+In ion
+traps, changing the parameters in the effective interac-
+tion Hamiltonian for the trapped ion allows one to se-
+lect different pointer subspaces and thereby control into
+which DFS the trapped ion is driven [77, 79, 141, 142].
+Another approach to the active creation of DFS is
+known as dynamical decoupling [143–148].
+Here time-
+dependent modifications are introduced into the Hamil-
+tonian of the system that counteract the influence of
+the environment. These modifications take the form of
+sequences of rapid projective measurements or strong
+control-field pulses acting on the system (“quantum
+bang-bang control” [143]).
+Even if the structure of
+the system–environment interaction Hamiltonian is not
+known, decoherence can be suppressed arbitrarily well
+in the limit of an infinitely fast rate of the decoupling
+control field, thus dynamically creating a DFS (which
+then represents a dynamically decoupled subspace). In
+the realistic case of a finite control rate, sufficient (albeit
+imperfect) protection from decoherence can be achieved
+via this decoupling technique, provided the control rate
+is larger than the fastest timescale set by the rate of for-
+mation of environmental entanglement.
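The dependence on the control rate can be illustrated with a toy simulation. This is a sketch under stated assumptions rather than a model from the text: the environment is replaced by classical Ornstein-Uhlenbeck frequency noise with an assumed amplitude `sigma` and correlation time `tau_c`, the control pulses are ideal instantaneous π pulses that flip the sign of the accumulated phase, and all parameter values are hypothetical.

```python
import cmath
import math
import random


def coherence(n_pulses, total_time=2.0, steps=200, sigma=2.0, tau_c=1.0,
              trajectories=400, seed=0):
    """Estimate |<exp(i*phase)>| for a dephasing qubit under equally spaced pi pulses.

    The noise model (Ornstein-Uhlenbeck detuning) and every parameter value
    are assumptions chosen for illustration only.
    """
    rng = random.Random(seed)
    dt = total_time / steps
    decay = math.exp(-dt / tau_c)                # exact OU update over one step
    kick = sigma * math.sqrt(1.0 - decay ** 2)   # keeps the variance at sigma^2
    acc = 0j
    for _ in range(trajectories):
        delta = rng.gauss(0.0, sigma)            # stationary initial detuning
        phase = 0.0
        for k in range(steps):
            # Each pi pulse reverses the sign with which the detuning accumulates.
            interval = k * (n_pulses + 1) // steps
            sign = 1.0 if interval % 2 == 0 else -1.0
            phase += sign * delta * dt
            delta = delta * decay + kick * rng.gauss(0.0, 1.0)
        acc += cmath.exp(1j * phase)
    return abs(acc / trajectories)


free = coherence(0)    # no control: free dephasing
echo = coherence(1)    # one pulse at the midpoint
fast = coherence(19)   # pulse spacing much shorter than tau_c
print(free, echo, fast)
```

In this toy model the remaining coherence grows as the pulse spacing drops below the noise correlation time, in line with the condition that the control rate must exceed the fastest environmental timescale.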
+
+VI. EXPERIMENTAL STUDIES OF DECOHERENCE
+
+Decoherence, of course, happens all around us, and
+in this sense its consequences are readily observed. But
+what we would like to do is to be able to experimen-
+tally study the gradual and controlled action of deco-
+herence. In this endeavor, several obstacles have to be
+overcome. We need to prepare the system in a superpo-
+sition of mesoscopically or even macroscopically distin-
+guishable states with a sufficiently long decoherence time
+such that the gradual action of decoherence can be re-
+solved. We must be able to monitor decoherence without
+introducing a significant amount of additional, unwanted
+decoherence. We would also like to have sufficient con-
+trol over the environment so we can tune the strength
+and form of its interaction with the system.
+Starting
+in the mid-1990s, several such experiments have been
+performed, for example, using cavity QED [12], meso-
+scopic molecules [149], and superconducting systems such
+as SQUIDs and Cooper-pair boxes [13]. Bose–Einstein
+condensates [150] and quantum nanomechanical systems
+[151, 152] are promising candidates for future experimen-
+tal tests of decoherence.
+These experiments are important for several reasons.
+They are impressive demonstrations of the possibility
+of generating nonclassical quantum states in mesoscopic
+and macroscopic systems. They show that the quantum–
+classical boundary is smooth and can be shifted by vary-
+ing the relevant experimental parameters.
+They allow
+us to test and improve decoherence models, and they
+help us design devices for quantum information process-
+ing that are good at evading the detrimental influence
+of the environment. Finally, such experiments may be
+used to test quantum mechanics itself [13]. Such tests re-
+quire sufficient shielding of the system from decoherence
+so that an observed (full or partial) collapse of the
+wavefunction could be unambiguously attributed to some novel
+nonunitary mechanism in nature, such as those proposed
+in dynamical reduction models [153–155]. This shielding,
+however, is difficult to implement in practice, because
+the large number of particles required for the reduction
+mechanism to become effective will also lead to strong
+decoherence [114, 156].
+The superpositions realized in
+current experiments are still not sufficiently macroscopic
+to rule out collapse theories, although it has been demon-
+strated [118] that matter-wave interferometry with large
+molecular clusters (in the mass range between 10^6 and
+10^8 amu) would be able to test the collapse theories
+proposed in Refs. [154, 155]; such experiments may soon
+become technologically feasible [11].
+
+A. Atoms in a cavity
+
+In 1996 Brune et al. generated a superposition of ra-
+diation fields with classically distinguishable phases in-
+volving several photons [12, 150, 157]. This experiment
+was the first to realize a mesoscopic Schrödinger-cat state
+and allowed for the controlled observation and manipu-
+lation of its decoherence. A rubidium atom is prepared
+in a superposition of energy eigenstates |g⟩ and |e⟩ cor-
+responding to two circular Rydberg states.
+The atom
+enters a cavity C that holds a radiation field of
+a few photons. If the atom is in the state |g⟩, the
+field remains unchanged, whereas if it is in the state
+|e⟩, the coherent state |α⟩ of the field undergoes a phase
+shift φ, |α⟩ → |e^(iφ) α⟩; the experiment achieved φ ≈ π.
+An initial superposition of the atom is therefore
+amplified into an entangled atom–field state of the form
+(1/√2) (|g⟩ |α⟩ + |e⟩ |−α⟩).
+The atom then passes through
+an additional cavity, further transforming the superposi-
+tion. Finally, the energy state of the atom is measured.
+This disentangles the atom and the field and leaves the
+latter in a superposition of the mesoscopically distinct
+states |α⟩ and |−α⟩.
+To monitor the decoherence of this superposition, a
+second rubidium atom is sent through the apparatus. Af-
+ter interacting with the field superposition state in the
+cavity C, the atom will always be found in the same en-
+ergy state as the first atom if the superposition has not
+been decohered. This correlation rapidly decays with in-
+creasing decoherence. Thus, by recording the measure-
+
+ment correlation as a function of the wait time τ between
+sending the first and second atom through the appara-
+tus, the decoherence of the field state can be monitored.
+Experimental results were in excellent agreement with
+theoretical predictions [158, 159]. It was found that
+decoherence became faster as the phase shift φ and the
+mean number n̄ = |α|^2 of photons in the cavity C were
+increased. Both results are expected, since an increase
+in φ and n̄ means that the components in the superposition
+become more distinguishable. Recent experiments
+have realized superposition states involving several tens
+of photons [160] and have monitored the gradual deco-
+herence of such states [161].
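The link between distinguishability and the parameters φ and n̄ can be made concrete with the standard overlap of two coherent states, ⟨α|β⟩ = exp(−|α|^2/2 − |β|^2/2 + α*β). The sketch below evaluates it for β = e^(iφ) α. It only quantifies how distinguishable the two field components are; it is not a model of the measured decoherence rate, and the function name is illustrative.

```python
import cmath
import math


def coherent_overlap(mean_photons, phi):
    """Return |<alpha | e^{i phi} alpha>| for a coherent state with |alpha|^2 = mean_photons."""
    inner = cmath.exp(-mean_photons * (1.0 - cmath.exp(1j * phi)))
    return abs(inner)


# A smaller overlap means more distinguishable field components.
for n_bar in (1.0, 3.0, 10.0):
    for phi in (0.5, 1.5, math.pi):
        print(n_bar, round(phi, 3), coherent_overlap(n_bar, phi))
```

The overlap magnitude equals exp(−n̄ (1 − cos φ)), so it falls as either n̄ or φ (up to π) is increased.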
+
+B. Matter-wave interferometry
+
+In these experiments (see Ref. [11] for a review), spatial
+interference patterns are demonstrated for mesoscopic
+molecules ranging from fullerenes [162] to molecular clus-
+ters involving hundreds of atoms, with a total size of
+up to 60 Å and masses of several thousand amu (see
+Fig. 2) [163, 164].
+Since the de Broglie wavelength of
+such molecules is on the order of picometers, standard
+double-slit interferometry is out of reach. Instead, the
+experiments make use of the Talbot effect, an interfer-
+ence phenomenon in which a plane wave incident on a
+diffraction grating creates an image of the grating at mul-
+tiples of a distance L behind the grating. In the experi-
+ment, the molecular density (at a macroscopic distance L
+from the grating) is scanned along the direction perpen-
+dicular to the molecular beam.
+An oscillatory density
+pattern (corresponding to the image of the slits in the
+grating) is observed, confirming the existence of coher-
+ence and interference between the different paths of each
+individual molecule passing through the grating. Recent
+experiments have used an improved version of the origi-
+nal Talbot–Lau setup [165], as well as optical ionization
+gratings [166].
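The length scales involved can be estimated with a short calculation. This sketch uses the de Broglie relation λ = h/(mv) and one common convention for the Talbot length, L = d^2/λ (some references include an extra factor of 2). The beam velocity and the grating period d below are hypothetical values chosen for illustration; they are not taken from the text.

```python
PLANCK_H = 6.62607015e-34   # Planck constant in J s (exact SI value)
AMU = 1.66053906660e-27     # atomic mass unit in kg


def de_broglie_wavelength(mass_amu, velocity):
    """Return lambda = h / (m v) in metres."""
    return PLANCK_H / (mass_amu * AMU * velocity)


def talbot_length(grating_period, wavelength):
    """Return d^2 / lambda in metres (one common convention for the Talbot length)."""
    return grating_period ** 2 / wavelength


velocity = 200.0            # m/s, assumed beam velocity
grating_period = 266e-9     # m, assumed grating period

wavelength_c60 = de_broglie_wavelength(720, velocity)    # C60, m = 720 amu
length_c60 = talbot_length(grating_period, wavelength_c60)
print(wavelength_c60, length_c60)
```

With these assumed inputs the wavelength comes out at a few picometres while the Talbot length is of order centimetres, which illustrates why the grating image appears at a macroscopic distance.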
+Decoherence is measured as a decrease of the visibil-
+ity of this pattern (Fig. 2). The controlled decoherence
+due to collisions with background gas particles [115, 116]
+and due to emission of thermal radiation from heated
+molecules [168] has been observed, showing a smooth de-
+cay of visibility in agreement with theoretical predictions
+[31, 117, 167]. These successes have led to speculations
+that one could perform similar experiments using even
+larger particles such as proteins and viruses [115, 169]
+or carbonaceous aerosols [170]. Such experiments will be
+limited by collisional and thermal decoherence and by
+noise due to inertial forces and vibrations [115, 169, 170].
+
+C. Superconducting systems
+
+Superconducting quantum interference devices (SQUIDs) and
+Cooper-pair boxes have important applications in quantum
+information processing. A
+
+[Figure 2: left panel, gallery of molecules (C60, PFNS8, PFNS10, TPP, TPPF84, TPPF152); right panel, visibility (%) versus pressure (in 10^−6 mbar).]
+
+FIG. 2. Left: Molecular clusters used in recent interference
+experiments, drawn to scale (the scale bar represents 10 Å).
+Figure from Ref. [163].
+(a) Fullerene C60 (m = 720 amu,
+60 atoms). (b) Perfluoroalkylated nanosphere PFNS8 (m =
+5672 amu, 356 atoms).
+(c) PFNS10 (m = 6910 amu, 430
+atoms). (d) Tetraphenylporphyrin TPP (m = 614 amu, 78
+atoms).
+(e) TPPF84 (m = 2814 amu, 202 atoms).
+(f)
+TPPF152 (m = 5310 amu, 430 atoms). Right: Visibility of
+interference fringes of C70 fullerenes as a function of the pres-
+sure of the background gas. Measured values (circles) agree
+well with the theoretical prediction (solid line) [31, 117, 167]
+describing an exponential decay of visibility with pressure.
+Figure adapted from Ref. [115].
+
+SQUID consists of a ring of superconducting material
+interrupted by thin insulating barriers, the Josephson
+junctions (Fig. 3a).
+At sufficiently low temperatures,
+electrons of opposite spin condense into bosonic Cooper
+pairs.
+Quantum-mechanical tunneling of Cooper pairs
+through the junctions leads to the flow of a resistance-
+free supercurrent around the loop (Josephson effect),
+which creates a magnetic flux threading the loop. The
+collective center-of-mass motion of a macroscopic number
+(∼ 10^9) of Cooper pairs can then be represented by
+a wave function labeled by a single macroscopic variable,
+namely, the total trapped flux Φ through the loop.
+The two possible directions of the supercurrent define
+a qubit with basis states {|⟳⟩ , |⟲⟩}.
+By adjusting an
+external magnetic field, the SQUID can be biased such
+
+[Figure 3: (a) SQUID schematic showing the superconducting ring, Josephson junctions, and supercurrent; (b) probability for ⟳ versus delay time τ (ns).]
+
+FIG. 3. (a) Schematic illustration of a SQUID. A supercon-
+ducting ring is interrupted by Josephson junctions, leading
+to a dissipationless supercurrent.
+(b) Decoherence of a su-
+perposition of clockwise and counterclockwise supercurrents
+in a superconducting qubit. The damping of the oscillation
+amplitude corresponds to the gradual loss of coherence from
+the system. Figure adapted from Ref. [173].
+
+that the two lowest-lying energy eigenstates |0⟩ and |1⟩
+are equal-weight superpositions of the persistent-current
+states |⟳⟩ and |⟲⟩.
+Such superposition states involving µA currents were
+first experimentally observed in 2000 using spectroscopic
+measurements [171, 172]. Their decoherence was subse-
+quently measured using Ramsey interferometry [173], as
+follows. Two consecutive microwave pulses are applied to
+the system. During the delay time τ between the pulses,
+the system evolves freely. After application of the second
+pulse, the system is left in a superposition of |⟳⟩ and |⟲⟩,
+with the relative amplitudes exhibiting an oscillatory de-
+pendence on τ. A series of measurements in the basis
+{|⟳⟩ , |⟲⟩} over a range of delay times τ then allows one
+to trace out an oscillation of the occupation probabilities
+for |⟳⟩ and |⟲⟩ as a function of τ (Fig. 3b). The envelope
+of the oscillation is damped as a consequence of decoher-
+ence acting on the system during the free evolution of
+duration τ. From the decay of the envelope we can infer
+the decoherence timescale; the original experiment gave
+20 ns [173], while subsequent experiments have achieved
+decoherence times of several µs [174].
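Inferring a decoherence timescale from the damped envelope can be sketched as follows. The functional form is an assumption made for illustration: the occupation probability is modelled as P(τ) = [1 + exp(−τ/T2) cos(2πfτ)]/2 with a purely exponential envelope, and the data are synthetic and noise-free. The names `t2` and `freq` and the oscillation frequency are hypothetical; only the 20 ns timescale echoes the value quoted in the text.

```python
import math


def ramsey_probability(tau, t2, freq):
    """Assumed model: oscillation at frequency freq with an exponential envelope."""
    return 0.5 * (1.0 + math.exp(-tau / t2) * math.cos(2.0 * math.pi * freq * tau))


def fit_t2_from_peaks(taus, probs):
    """Least-squares line through ln(2P - 1) versus tau at the peaks; slope = -1/T2."""
    ys = [math.log(2.0 * p - 1.0) for p in probs]
    n = len(taus)
    mean_t = sum(taus) / n
    mean_y = sum(ys) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(taus, ys))
             / sum((t - mean_t) ** 2 for t in taus))
    return -1.0 / slope


freq = 0.2                                         # oscillations per ns (assumed)
peak_times = [k / freq for k in range(1, 7)]       # oscillation maxima, in ns
probs = [ramsey_probability(t, 20.0, freq) for t in peak_times]
t2_estimate = fit_t2_from_peaks(peak_times, probs)
print(t2_estimate)
```

With real data the peaks would be noisy and the envelope need not be exponential, so a full nonlinear fit of the oscillation would usually be preferable to this peak-only estimate.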
+Superposition states and their decoherence have also
+been observed in superconducting devices whose key vari-
+able is charge (or phase), instead of the flux variable used
+in SQUIDs. Such Cooper-pair boxes consist of a small
+superconducting island onto which Cooper pairs can tun-
+nel from a reservoir through a Josephson junction. Two
+different charge states of the island, differing by at least
+one Cooper pair, define the basis states. Coherent os-
+cillations between such charge states were first observed
+
+sands of coherent oscillations with a decoherence time
+of 0.5 µs. Similar results have been obtained for phase
+qubits [177, 178], demonstrating decoherence times of
+several µs.
+
+VII. DECOHERENCE AND THE FOUNDATIONS OF QUANTUM MECHANICS
+
+Can decoherence address foundational problems? If so,
+which ones, and how? Addressing these subtle questions
+is beyond the scope of this review; a few brief remarks
+must suffice here.
+(See Refs. [6, 7, 9, 21] for in-depth
+discussions.) Decoherence, at its heart, is a technical re-
+sult concerning the dynamics and measurement statistics
+of open quantum systems. From this view, decoherence
+merely addresses a consistency problem, by explaining
+how and when the quantum probability distributions ap-
+proach the classically expected distributions. Since deco-
+herence follows directly from an application of the quan-
+tum formalism to interacting quantum systems, it is not
+tied to any particular interpretation of quantum mechan-
+ics, nor does it supply such an interpretation, nor does it
+amount to a theory that could make predictions beyond
+those of standard quantum mechanics.
+The predictively relevant part of decoherence theory
+relies on reduced density matrices, whose formalism and
+interpretation presume the collapse postulate and Born’s
+rule. If we understand the “quantum measurement problem”
+as the question of how to reconcile the linear,
+deterministic evolution described by the Schrödinger
+equation with the occurrence of random measurement
+outcomes, then decoherence has not solved this problem
+[6, 9].
+Decoherence does, however, address an aspect
+sometimes associated with the quantum measurement
+problem, namely the preferred-basis problem (at least
+in the sense described in Sec. II C). Further explorations
+of the role of the environment, such as in quantum Dar-
+winism (see Sec. II D), can help illuminate fundamental
+questions concerning information transfer and amplifica-
+tion in the quantum setting.
+Decoherence has been used to identify internal
+consistency issues in interpretations of quantum mechanics,
+and the picture associated with the decoherence process
+has sometimes been seen as suggestive of particular inter-
+pretations of quantum mechanics [6, 7]. Indeed, histori-
+cally decoherence theory arose in the context of Zeh’s [1]
+independent formulation of an Everett-style interpreta-
+tion (see Ref. [179] for a historical analysis). Ultimately,
+however, it seems that certain interpretations simply may
+be more in need of decoherence than others for defin-
+ing their structure; see Ref. [180] for the example of an
+Everett-style interpretation [23]. At the end of the day,
+any interpretation that does not involve entities, claims,
+or structures in contradiction with the predictions of de-
+coherence theory (which is to say, with the predictions of
+quantum mechanics) will arguably remain viable.
+
+[1] H.D. Zeh, Found. Phys. 1, 69 (1970)
+[2] W.H. Zurek, Phys. Rev. D 24, 1516 (1981)
+[3] W.H. Zurek, Phys. Rev. D 26, 1862 (1982).
+doi:
+10.1103/PhysRevD.26.1862
+
+[4] J.P. Paz, W.H. Zurek, in Coherent Atomic Matter
+Waves, Les Houches Session LXXII, Les Houches Summer
+School Series, vol. 72, ed. by R. Kaiser, C. Westbrook,
+F. David (Springer, Berlin, 2001), pp. 533–614
+
+[5] W.H. Zurek, Rev. Mod. Phys. 75, 715 (2003).
+doi:
+10.1103/RevModPhys.75.715
+
+[6] M. Schlosshauer, Rev. Mod. Phys. 76, 1267 (2004)
+[7] G. Bacciagaluppi, in The Stanford Encyclopedia
+of Philosophy, ed. by E.N. Zalta (2012). Online at
+http://plato.stanford.edu/archives/win2012/entries/qm-decoherence
+
+[8] E. Joos, H.D. Zeh, C. Kiefer, D. Giulini, J. Kupsch, I.O.
+Stamatescu, Decoherence and the Appearance of a Clas-
+sical World in Quantum Theory, 2nd edn. (Springer,
+New York, 2003)
+
+[9] M. Schlosshauer, Decoherence and the Quantum-to-Classical
+Transition (Springer, Berlin/Heidelberg, 2007)
+
+[10] M. Schlosshauer, Phys. Rep. 831, 1 (2019).
+doi:
+10.1016/j.physrep.2019.10.001
+
+[11] K. Hornberger, S. Gerlich, S. Nimmrichter, P. Haslinger,
+M. Arndt, Rev. Mod. Phys. 84, 157 (2012)
+
+[12] J.M. Raimond, M. Brune, S. Haroche, Rev. Mod. Phys.
+
+73, 565 (2001)
+
+[13] A.J. Leggett, J. Phys.:
+Condens. Matter 14, R415
+(2002)
+
+[14] E. Joos, in Decoherence: Theoretical, Experimental, and
+Conceptual Problems, ed. by P. Blanchard, D. Giulini,
+E. Joos, C. Kiefer, I.O. Stamatescu (Springer, Berlin,
+2000), pp. 1–17
+
+[15] M. Schlosshauer, K. Camilleri, AIP Conf. Proc. 1327,
+26 (2011)
+
+[16] O. Kübler, H.D. Zeh, Ann. Phys. (N.Y.) 76, 405 (1973)
+[17] E. Joos, H.D. Zeh, Z. Phys. B: Condens. Matter 59, 223
+(1985)
+
+[18] W.H. Zurek, in Frontiers of Nonequilibrium Statistical
+Mechanics, ed. by G.T. Moore, M.O. Scully (Plenum
+Press, New York, 1986), pp. 145–149. First published
+in 1984 as Los Alamos report LAUR 84-2750
+
+[19] K. Hornberger,
+in Entanglement and Decoherence:
+Foundations and Modern Trends,
+Lecture Notes in
+Physics, vol. 768, ed. by A. Buchleitner, C. Viviescas,
+M. Tiersch (Springer, Berlin, 2009), pp. 221–276
+
+[20] H.P. Breuer, F. Petruccione, The Theory of Open Quan-
+tum Systems (Oxford University Press, Oxford, 2002)
+
+[21] M. Schlosshauer, A. Fine, in Quantum Mechanics at the
+Crossroads: New Perspectives from History, Philosophy
+and Physics, ed. by J. Evans, A. Thorndike (Springer,
+Berlin, 2006), pp. 125–148
+
+[22] W.K. Wootters, W.H. Zurek, Phys. Rev. D 19, 473
+(1979)
+
+[23] H. Everett, Rev. Mod. Phys. 29, 454 (1957)
+[24] A. Einstein, B. Podolsky, N. Rosen, Phys. Rev. 47, 777
+(1935)
+
+[25] J.S. Bell, Physics 1, 195 (1964)
+[26] J.S. Bell, Rev. Mod. Phys. 38, 447 (1966)
+[27] A. Peres, Am. J. Phys. 46 (1978)
+[28] J.P. Paz, S. Habib, W.H. Zurek, Phys. Rev. D 47, 488
+(1993)
+
+[29] A.J. Leggett, S. Chakravarty, A.T. Dorsey, M.P.A.
+Fisher, A. Garg, Rev. Mod. Phys. 59, 1 (1987)
+
+[30] S.G. Mokarzel, A.N. Salgueiro, M.C. Nemes, Phys. Rev.
+A 65, 044101 (2002)
+
+[31] K. Hornberger, J.E. Sipe, Phys. Rev. A 68, 012105
+(2003)
+
+[32] K. Kraus, States, Effects, and Operations (Springer,
+Berlin, 1983)
+
+[33] W.H. Zurek, Phys. Today 44, 36 (1991). See also the
+updated version available as eprint quant-ph/0306072
+
+[34] M.R. Gallis, G.N. Fleming, Phys. Rev. A 42, 38 (1990)
+[35] L. Diósi, Europhys. Lett. 30, 63 (1995)
+[36] K. Hornberger, Phys. Rev. Lett. 97, 060601 (2006)
+[37] K. Hornberger, B. Vacchini, Phys. Rev. A 77, 022112
+(2008)
+
+[38] M. Busse, K. Hornberger, J. Phys. A: Math. Theor. 42,
+362001 (2009)
+
+[39] M. Busse, K. Hornberger, J. Phys. A: Math. Theor. 43,
+015303 (2010)
+
+[40] R.A. Harris, L. Stodolsky, J. Chem. Phys. 74, 2145
+(1981)
+
+[41] H.D. Zeh, in Decoherence:
+Theoretical, Experimen-
+tal, and Conceptual Problems, ed. by P. Blanchard,
+D. Giulini, E. Joos, C. Kiefer, I. Stamatescu, Lecture
+Notes in Physics No. 538 (Springer, Berlin, 2000), pp.
+19–42
+
+[42] J.P. Paz, W.H. Zurek, Phys. Rev. Lett. 82, 5181 (1999)
+[43] W.H. Zurek, S. Habib, J.P. Paz, Phys. Rev. Lett. 70,
+1187 (1993)
+
+[44] W.H. Zurek, Prog. Theor. Phys. 89, 281 (1993)
+[45] B.L. Hu, J.P. Paz, Y. Zhang, Phys. Rev. D 45, 2843
+(1992)
+
+[46] W.H. Zurek, Philos. Trans. R. Soc. London, Ser. A 356,
+1793 (1998)
+
+[47] L. Diósi, C. Kiefer, Phys. Rev. Lett. 85, 3552 (2000)
+[48] J. Eisert, Phys. Rev. Lett. 92, 210401 (2004)
+[49] G.M. Palma, K.A. Suominen, A.K. Ekert, Proc. R. Soc.
+Lond. A 452, 567 (1996)
+
+[50] D.A. Lidar, I.L. Chuang, K.B. Whaley, Phys. Rev. Lett.
+81, 2594 (1998)
+
+[51] P. Zanardi, M. Rasetti, Phys. Rev. Lett. 79, 3306 (1997)
+[52] P. Zanardi, M. Rasetti, Mod. Phys. Lett. B 11, 1085
+(1997)
+
+[53] P. Zanardi, Phys. Rev. A 57, 3276 (1998)
+[54] D.A. Lidar, D. Bacon, K.B. Whaley, Phys. Rev. Lett.
+82, 4556 (1999)
+
+[55] D. Bacon, J. Kempe, D.A. Lidar, K.B. Whaley, Phys.
+Rev. Lett. 85, 1758 (2000)
+
+[56] L.M. Duan, G.C. Guo, Phys. Rev. A 57, 737 (1998)
+[57] P. Zanardi, Phys. Rev. A 63, 012301 (2001)
+[58] E. Knill, R. Laflamme, L. Viola, Phys. Rev. Lett. 82,
+2525 (2000)
+
+[59] J. Kempe, D. Bacon, D.A. Lidar, K.B. Whaley, Phys.
+Rev. A 63, 042307 (2001)
+
+[60] D.A. Lidar, K.B. Whaley, in Irreversible Quantum Dy-
+namics, Springer Lecture Notes in Physics, vol. 622, ed.
+
+by F. Benatti, R. Floreanini (Springer, Berlin, 2003),
+pp. 83–120. Also available as eprint quant-ph/0301032
+
+[61] W.H. Zurek, in Science and Ultimate Reality, ed. by
+J.D. Barrow, P.C.W. Davies, C.H. Harper (Cambridge
+University Press, Cambridge, England, 2004), pp. 121–
+137
+
+[62] H. Ollivier, D. Poulin, W.H. Zurek, Phys. Rev. Lett. 93,
+220401 (2004)
+
+[63] H. Ollivier, D. Poulin, W.H. Zurek, Phys. Rev. A 72,
+042113 (2005)
+
+[64] R. Blume-Kohout, W.H. Zurek, Found. Phys. 35, 1857
+(2005)
+
+[65] R. Blume-Kohout, W.H. Zurek, Phys. Rev. A 73,
+062310 (2006)
+
+[66] W.H. Zurek, Nature Phys. 5, 181 (2009)
+[67] C.J. Riedel, W.H. Zurek, Phys. Rev. Lett. 105, 020404
+(2010)
+
+[68] C.J. Riedel, W.H. Zurek, New J. Phys. 13, 073038
+(2011)
+
+[69] C.J. Riedel, W.H. Zurek, M. Zwolak, New J. Phys. 14,
+083010 (2012)
+
+[70] M. Zwolak, C.J. Riedel, W.H. Zurek, Phys. Rev. Lett.
+112, 140406 (2014)
+
+[71] R. Blume-Kohout, W.H. Zurek, Phys. Rev. Lett. 101,
+240405 (2008)
+
+[72] H. Ollivier, W.H. Zurek, Phys. Rev. Lett. 88, 017901
+(2002)
+
+[73] S. Schneider, G.J. Milburn, Phys. Rev. A 57, 3748
+(1998)
+
+[74] C. Miquel, J.P. Paz, W.H. Zurek, Phys. Rev. Lett. 78,
+3971 (1997)
+
+[75] L.M.K. Vandersypen, I.L. Chuang, Rev. Mod. Phys. 76,
+1037 (2004)
+
+[76] J.M. Martinis, S. Nam, J. Aumentado, K.M. Lang,
+C. Urbina, Phys. Rev. B 67, 094510 (2003)
+
+[77] C.J. Myatt, B.E. King, Q.A. Turchette, C.A. Sackett,
+D. Kielpinski, W.M. Itano, C. Monroe, D.J. Wineland,
+Nature 403, 269 (2000)
+
+[78] S. Schneider, G.J. Milburn, Phys. Rev. A 59, 3766
+(1999)
+
+[79] Q.A. Turchette, C.J. Myatt, B.E. King, C.A. Sackett,
+D. Kielpinski, W.M. Itano, C. Monroe, D.J. Wineland,
+Phys. Rev. A 62, 053807 (2000)
+
+[80] K. Kraus, Ann. Phys. 64, 311 (1971)
+[81] V. Gorini, A. Kossakowski, E.C.G. Sudarshan, J. Math.
+Phys. 17, 821 (1976)
+
+[82] G. Lindblad, Commun. Math. Phys. 48, 119 (1976)
+[83] R. Alicki, M. Fannes, Quantum Dynamical Systems
+(Oxford University Press, Oxford, 2001)
+
+[84] R. Alicki, K. Lendi, Quantum Dynamical Semigroups
+and Applications, Lect. Notes Phys., vol. 717, 2nd edn.
+(Springer, Berlin/Heidelberg, 2007)
+
+[85] F. Benatti, R. Floreanini, Int. J. Mod. Phys. B 19, 3063
+(2005). doi:10.1142/S0217979205032097
+
+[86] R. Dümcke, H. Spohn, Z. Phys. B 34, 419 (1979)
+[87] V. Gorini, A. Frigerio, M. Verri, A. Kossakowski, E.C.G.
+Sudarshan, Rep. Math. Phys. 13, 149 (1978)
+
+[88] E.B. Davies, Commun. Math. Phys. 39, 91 (1974)
+[89] A. Kossakowski, Rep. Math. Phys. 3, 247 (1972)
+[90] A. Barchielli, V.P. Belavkin, J. Phys. A: Math. Gen. 24,
+1495 (1991)
+
+[91] V.P. Belavkin, in Lecture Notes in Control and Infor-
+mation Sciences, vol. 121 (Springer, Berlin, 1989), pp.
+245–265
+
+[92] V.P. Belavkin, J. Phys. A: Math. Gen. 22, L1109 (1989)
+[93] V.P. Belavkin, Phys. Lett. A 140, 355 (1989)
+[94] V.P. Belavkin, in Chaos: The Interplay Between
+Stochastic and Deterministic Behaviour, ed. by P. Garbaczewski,
+M. Wolf, A. Veron, Lecture Notes in Physics
+(Springer, 1995), pp. 21–41
+
+[95] L. Diósi, Phys. Lett. A 129, 419 (1988)
+[96] L. Diósi, Phys. Lett. A 132, 233 (1988)
+[97] L. Diósi, J. Phys. A 21, 2885 (1988)
+[98] N. Gisin, Phys. Rev. Lett. 52, 1657 (1984)
+[99] N. Gisin, Helv. Phys. Acta 62, 363 (1989)
+
+[100] H.M. Wiseman, Phys. Rev. A 49, 2133 (1994)
+[101] H.S. Goan, G.J. Milburn, H.M. Wiseman, H.B. Sun,
+Phys. Rev. B 63, 125326 (2001)
+
+[102] M.B. Plenio, P.L. Knight, Rev. Mod. Phys. 70, 101
+(1998)
+
+[103] N.V. Prokof’ev, P.C.E. Stamp, Rep. Prog. Phys. 63,
+669 (2000)
+
+[104] M. Dubé, P.C.E. Stamp, Chem. Phys. 268, 257 (2001)
+[105] S. Gröblacher, A. Trubarov, N. Prigge, M. Aspelmeyer,
+J. Eisert, Nature Comm. 6, 7606 (2015)
+
+[106] S. Chaturvedi, F. Shibata, Z. Phys. B 35, 297 (1979)
+[107] F. Shibata, T. Arimitsu, J. Phys. Soc. Jpn. 49, 891
+(1980)
+
+[108] A. Royer, Phys. Rev. A 6, 1741 (1972)
+[109] A. Royer, Phys. Lett. A 315, 335 (2003)
+[110] R. Feynman, F.L. Vernon, Ann. Phys. (N.Y.) 24, 118
+(1963)
+
+[111] A. Caldeira, A. Leggett, Ann. Phys. (N.Y.) 149, 374
+(1983)
+
+[112] O.V. Lounasmaa, Experimental Principles and Methods
+below 1 K (Academic Press, New York, 1974)
+
+[113] S.L. Adler, J. Phys. A: Math. Gen. 39, 14067 (2006)
+[114] M. Tegmark, Found. Phys. Lett. 6, 571 (1993)
+[115] L. Hackermüller, K. Hornberger, B. Brezger,
+A. Zeilinger, M. Arndt, Appl. Phys. B 77, 781 (2003)
+
+[116] K. Hornberger, S. Uttenthaler, B. Brezger, L. Hackermüller,
+M. Arndt, A. Zeilinger, Phys. Rev. Lett. 90,
+160401 (2003)
+
+[117] K. Hornberger, J.E. Sipe, M. Arndt, Phys. Rev. A 70,
+053608 (2004)
+
+[118] S. Nimmrichter, K. Hornberger, P. Haslinger, M. Arndt,
+Phys. Rev. A 83, 043621 (2011)
+
+[119] D.A. Kokorowski, A.D. Cronin, T.D. Roberts, D.E.
+Pritchard, Phys. Rev. Lett. 86, 2191 (2001)
+
+[120] H. Uys, J.D. Perreault, A.D. Cronin, Phys. Rev. Lett.
+95, 150403 (2005)
+
+[121] A.O. Caldeira, A.J. Leggett, Physica A 121, 587 (1983)
+[122] D.F. Walls, M.J. Collett, G.J. Milburn, Phys. Rev. D
+32, 3208 (1985)
+
+[123] U. Weiss, Quantum Dissipative Systems (World Scien-
+tific, Singapore, 1999)
+
+[124] M. Schlosshauer, A.P. Hines, G.J. Milburn, Phys. Rev.
+A 77, 022111 (2008)
+
+[125] P.C.E. Stamp, in Tunnelling in Complex Systems, ed.
+by S. Tomsovic (World Scientific, Singapore, 1998), pp.
+101–197
+
+[126] N.V. Prokof’ev, P.C.E. Stamp, (1995)
+[127] N.V. Prokof’ev, P.C.E. Stamp, J. Phys. Chem. Lett. 5,
+L663 (1993)
+
+[128] V.V. Dobrovitski, H.A. De Raedt, M.I. Katsnelson,
+B.N. Harmon, Phys. Rev. Lett. 90, 210401 (2003). doi:
+10.1103/PhysRevLett.90.210401
+
+[129] F.M. Cucchietti, J.P. Paz, W.H. Zurek, Phys. Rev. A
+72, 052113 (2005). doi:10.1103/PhysRevA.72.052113
+
+[130] A.O. Caldeira, A.H. Castro Neto, T.O. de Carvalho,
+Phys. Rev. B 48, 13974 (1993)
+
+[131] C. Miquel, J.P. Paz, R. Perazzo, Phys. Rev. A 54, 2605
+(1996)
+
+[132] A.M. Steane, Phys. Rev. Lett. 77, 793 (1996)
+[133] P.W. Shor, Phys. Rev. A 52, R2493 (1995)
+[134] A.M. Steane, in Decoherence and Its Implications in
+Quantum Computation and Information Transfer, ed.
+by P. Turchi, A. Gonis (IOS Press, Amsterdam, 2001),
+pp. 284–298. Also available as eprint quant-ph/0304016
+
+[135] E. Knill, R. Laflamme, A. Ashikhmin, H. Barnum,
+L. Viola, W. Zurek, LA Science 27, 188 (2002)
+
+[136] M.A. Nielsen, I.L. Chuang, Quantum Computation and
+Quantum Information (Cambridge University Press,
+Cambridge, 2000)
+
+[137] J.H. Reina, L. Quiroga, N.F. Johnson, Phys. Rev. A 65,
+032326 (2002)
+
+[138] D. Bacon, D.A. Lidar, K.B. Whaley, Phys. Rev. A 60,
+1944 (1999)
+
+[139] D.A. Lidar, D. Bacon, J. Kempe, K.B. Whaley, Phys.
+Rev. A 63, 022307 (2001)
+
+[140] D.A.R. Dalvit, J. Dziarmaga, W.H. Zurek, Phys. Rev.
+A 62, 013607 (2000)
+
+[141] J.F. Poyatos, J.I. Cirac, P. Zoller, Phys. Rev. Lett. 77,
+4728 (1996)
+
+[142] A.R.R. Carvalho, P. Milman, R.L. de Matos Filho,
+L. Davidovich, Phys. Rev. Lett. 86, 4988 (2001)
+
+[143] L. Viola, S. Lloyd, Phys. Rev. A 58, 2733 (1998)
+[144] L. Viola, E. Knill, S. Lloyd, Phys. Rev. Lett. 82, 2417
+(1999)
+
+[145] P. Zanardi, Phys. Lett. A 258, 77 (1999)
+[146] L. Viola, E. Knill, S. Lloyd, Phys. Rev. Lett. 85, 3520
+(2000)
+
+[147] L.A. Wu, D.A. Lidar, Phys. Rev. Lett. 88, 207902
+(2002)
+
+[148] L.A. Wu, M.S. Byrd, D.A. Lidar, Phys. Rev. Lett. 89,
+127901 (2002)
+
+[149] M. Arndt, K. Hornberger, A. Zeilinger, Phys. World 18,
+35 (2005)
+
+[150] R. Kaiser, C. Westbrook, F. David (eds.), Coherent
+Atomic Matter Waves, Les Houches Session LXXII, Les
+Houches Summer School Series (Springer, Berlin, 2001)
+
+[151] M. Blencowe, Phys. Rep. 395, 159 (2004)
+[152] M. Aspelmeyer, T.J. Kippenberg, F. Marquardt, Rev.
+Mod. Phys. 86, 1391 (2014)
+
+[153] A. Bassi, G.C. Ghirardi, Phys. Rep. 379, 257 (2003)
+[154] S.L. Adler, J. Phys. A 40, 2935 (2007)
+[155] A. Bassi, D.A. Deckert, L. Ferialdi, EPL 92, 50006
+(2010)
+
+[156] S. Nimmrichter, K. Hornberger, Phys. Rev. Lett. 110,
+160403 (2013)
+
+[157] M. Brune, E. Hagley, J. Dreyer, X. Maître, A. Maali,
+C. Wunderlich, J.M. Raimond, S. Haroche, Phys. Rev.
+Lett. 77, 4887 (1996)
+
+[158] L. Davidovich, M. Brune, J.M. Raimond, S. Haroche,
+Phys. Rev. A 53, 1295 (1996)
+
+[159] X. Maître, E. Hagley, J. Dreyer, A. Maali, C.W.M.
+Brune, J.M. Raimond, S. Haroche, J. Mod. Opt. 44,
+2023 (1997)
+
+[160] A. Auffeves, P. Maioli, T. Meunier, S. Gleyzes,
+G. Nogues, M. Brune, J.M. Raimond, S. Haroche, Phys.
+Rev. Lett. 91, 230405 (2003)
+
+[161] S. Deléglise, I. Dotsenko, C. Sayrin, J. Bernu, M. Brune,
+J.M. Raimond, S. Haroche, Nature 455, 510 (2008)
+
+[162] M. Arndt, O. Nairz, J. Vos-Andreae, C. Keller,
+G. van der Zouw, A. Zeilinger, Nature 401, 680 (1999)
+
+[163] S. Gerlich, S. Eibenberger, M. Tomandl, S. Nimmrichter,
+K. Hornberger, P.J. Fagan, J. Tüxen, M. Mayor,
+M. Arndt, Nature Comm. 2, 263 (2012)
+
+[164] S. Eibenberger, S. Gerlich, M. Arndt, M. Mayor,
+J. Tüxen, Phys. Chem. Chem. Phys. 15, 14696 (2013)
+
+[165] S. Gerlich, L. Hackermüller, K. Hornberger, A. Stibor,
+H. Ulbricht, F. Goldfarb, T. Savas, M. Müri, M. Mayor,
+M. Arndt, Nature Phys. 3, 711 (2007)
+
+[166] P. Haslinger, N. Dörre, P. Geyer, J. Rodewald, S. Nimmrichter,
+M. Arndt, Nature Phys. 9, 144 (2013)
+
+[167] K. Hornberger, L. Hackermüller, M. Arndt, Phys. Rev.
+A 71, 023601 (2005)
+
+[168] L. Hackermüller, K. Hornberger, B. Brezger,
+A. Zeilinger, M. Arndt, Nature 427, 711 (2004)
+
+[169] M. Arndt, O. Nairz, A. Zeilinger, in Quantum
+[Un]Speakables: From Bell to Quantum Information,
+ed. by R.A. Bertlmann, A. Zeilinger (Springer, Berlin,
+2002), pp. 333–351
+
+[170] K. Hornberger, Phys. Rev. A 73, 052102 (2006)
+[171] J.R. Friedman, V. Patel, W. Chen, S.K. Tolpygo, J.E.
+Lukens, Nature 406, 43 (2000)
+
+[172] C.H. van der Wal, A.C.J. ter Haar, F.K. Wilhelm, R.N.
+Schouten, C.J.P.M. Harmans, T.P. Orlando, S. Lloyd,
+J.E. Mooij, Science 290, 773 (2000)
+
+[173] I. Chiorescu, Y. Nakamura, C.J.P.M. Harmans, J.E.
+Mooij, Science 299, 1869 (2003)
+
+[174] P. Bertet, I. Chiorescu, G. Burkard, K. Semba, C.J.P.M.
+Harmans, D.P. DiVincenzo, J.E. Mooij, Phys. Rev.
+Lett. 95, 257002 (2005)
+
+[175] Y. Nakamura, Y.A. Pashkin, J.S. Tsai, Nature 398, 786
+(1999)
+
+[176] D. Vion, A. Aassime, A. Cottet, P. Joyez, H. Pothier,
+C. Urbina, D. Esteve, M.H. Devoret, Science 296, 886
+(2002)
+
+[177] Y. Yu, S. Han, X. Chu, S.I. Chu, Z. Wang, Science 296,
+889 (2002)
+
+[178] J.M. Martinis, S. Nam, J. Aumentado, C. Urbina, Phys.
+Rev. Lett. 89, 117901 (2002)
+
+[179] K. Camilleri, Stud. Hist. Phil. Mod. Phys. 40, 290
+(2009)
+
+[180] D. Wallace, in Many Worlds? Everett, Quantum Theory
+and Reality, ed. by S. Saunders, J. Barrett, A. Kent,
+D. Wallace (Oxford University Press, Oxford, 2010), pp.
+53–72
+
+
diff --git a/papers/project_paper_3_darwinism/references/Schlosshauer2007_Placeholder.md b/papers/project_paper_3_darwinism/references/Schlosshauer2007_Placeholder.md
deleted file mode 100644
index eb2c06da..00000000
--- a/papers/project_paper_3_darwinism/references/Schlosshauer2007_Placeholder.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Decoherence and the Quantum-to-Classical Transition (Schlosshauer 2007)
-
-This textbook provides the authoritative derivations of the Spin-Boson Hamiltonian, Ohmic spectral densities, and decoherence functions used to model system-environment interactions.
-Due to copyright and its format, the full PDF is not hosted in this repository.
-
-**Citation:**
-Schlosshauer, M. (2007). *Decoherence and the Quantum-to-Classical Transition*. Springer.
diff --git a/papers/project_paper_3_darwinism/references/Tegmark2000.pdf b/papers/project_paper_3_darwinism/references/Tegmark2000.pdf
new file mode 100644
index 00000000..08fe95ed
--- /dev/null
+++ b/papers/project_paper_3_darwinism/references/Tegmark2000.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e3f207a1170ff3edb0765236fce4fbcaf4c73a9745664e3f35bdc38dafc20aab
+size 365231
diff --git a/papers/project_paper_3_darwinism/references/Tegmark2000.txt b/papers/project_paper_3_darwinism/references/Tegmark2000.txt
new file mode 100644
index 00000000..049d0e7f
--- /dev/null
+++ b/papers/project_paper_3_darwinism/references/Tegmark2000.txt
@@ -0,0 +1,2024 @@
+arXiv:quant-ph/9907009v2  10 Nov 1999
+
+The Importance of Quantum Decoherence in Brain Processes
+
+Max Tegmark
+Institute for Advanced Study, Olden Lane, Princeton, NJ 08540; max@ias.edu
+Dept. of Physics, Univ. of Pennsylvania, Philadelphia, PA 19104
+(Submitted to Phys. Rev. E July 2 1999, accepted October 25)
+
+Based on a calculation of neural decoherence rates, we ar-
+gue that the degrees of freedom of the human brain that
+relate to cognitive processes should be thought of as a classical
+rather than quantum system, i.e., that there is nothing funda-
+mentally wrong with the current classical approach to neural
+network simulations. We find that the decoherence timescales
+($\sim 10^{-13}$ to $10^{-20}$ seconds) are typically much shorter than
+the relevant dynamical timescales ($\sim 10^{-3}$ to $10^{-1}$ seconds),
+both for regular neuron firing and for kink-like polarization
+excitations in microtubules. This conclusion disagrees with
+suggestions by Penrose and others that the brain acts as a
+quantum computer, and that quantum coherence is related
+to consciousness in a fundamental way.
+
+I. INTRODUCTION
+
+In most current mainstream biophysics research on
+cognitive processes, the brain is modeled as a neural net-
+work obeying classical physics. In contrast, Penrose [1,2],
+and others have argued that quantum mechanics may
+play an essential role, and that successful brain simula-
+tions can only be performed with a quantum computer.
+The main purpose of this paper is to address this issue
+with quantitative decoherence calculations.
+The field of artificial neural networks (for an introduc-
+tion, see, e.g., [4–6]) is currently booming, driven by a
+broad range of applications and improved computing re-
+sources. Although the popular neurological models come
+in various levels of abstraction, none involve effects of
+quantum coherence in any fundamental way. Encouraged
+by successes in modeling memory, learning, visual pro-
+cessing, etc. [7,8], many workers in the field have boldly
+conjectured that a sufficiently complex neural network
+could in principle perform all cognitive processes that we
+associate with consciousness.
+On the other hand, many authors have argued that
+consciousness can only be understood as a quantum effect.
+For instance, Wigner [9] suggested that consciousness
+was linked to the quantum measurement problem (see footnote 1),
+and this idea has been greatly elaborated by Stapp [3].
+There have been numerous suggestions that consciousness
+is a macroquantum effect, involving superconductivity [12],
+superfluidity [13], electromagnetic fields [14],
+Bose condensation [15,16], superfluorescence [17] or some
+other mechanism [18,19]. Perhaps the most concrete one
+is that of Penrose [2], proposing that this takes place
+in microtubules, the ubiquitous hollow cylinders that
+among other things help cells maintain their shapes. It
+has been argued that microtubules can process information
+like a cellular automaton [20], and Penrose suggests
+that they operate as a quantum computer. This idea has
+been further elaborated employing string theory methods
+[21–27].
+
+1 Interestingly, Wigner changed his mind and gave up this
+idea [10] after he became aware of the first paper on
+decoherence in 1970 [11].
+
+The make-or-break issue for all these quantum mod-
+els is whether the relevant degrees of freedom of the
+brain can be sufficiently isolated to retain their quan-
+tum coherence, and opinions are divided. For instance,
+Stapp has argued that interaction with the environment
+is probably small enough to be unimportant for cer-
+tain neural processes [28], whereas Zeh [29], Zurek [30],
+Scott [31], Hawking [32] and Hepp [33] have conjectured
+that environment-induced decoherence will rapidly destroy
+macrosuperpositions in the brain. It is therefore timely
+to try to settle the issue with detailed calculations of the
+relevant decoherence rates. This is the purpose of the
+present work.
+The rest of this paper is organized as follows. In Sec-
+tion II, we briefly review the open system quantum me-
+chanics necessary for our calculations, and introduce a
+decomposition into three subsystems to place the problem
+in its proper context. In Section III, we evaluate
+decoherence rates both for neuron firing and for the
+microtubule processes proposed by Penrose et al., relegating
+some technical details to the Appendix. We conclude in
+Section IV by discussing the implications of our results,
+both for modeling cognitive brain processes and for in-
+corporating them into a quantum-mechanical treatment
+of the rest of the world.
+
+II. SYSTEMS AND SUBSYSTEMS
+
+In this section, we review those aspects of quantum
+mechanics for open systems that are needed for our cal-
+culations, and introduce a classification scheme and a
+subsystem decomposition to place the problem at hand
+in its appropriate context.
+
+
+A. Notation
+
+Let us first briefly review the quantum mechanics of
+subsystems. The state of an arbitrary quantum system
+is described by its density matrix $\rho$, which, left in isolation,
+will evolve in time according to the Schrödinger equation
+
+\[ \dot{\rho} = -i[H, \rho]/\hbar. \tag{1} \]
+
+It is often useful to view a system as composed of two
+subsystems, so that some of the degrees of freedom
+correspond to the 1st and the rest to the 2nd. The state of
+subsystem $i$ is described by the reduced density matrix
+$\rho_i$ obtained by tracing (marginalizing) over the degrees
+of freedom of the other: $\rho_1 \equiv \mathrm{tr}_2\,\rho$,
+$\rho_2 \equiv \mathrm{tr}_1\,\rho$. Let us
+decompose the Hamiltonian as
+
+\[ H = H_1 + H_2 + H_{\mathrm{int}}, \tag{2} \]
+
+where the operator $H_1$ affects only the 1st subsystem
+and $H_2$ affects only the 2nd subsystem. The interaction
+Hamiltonian $H_{\mathrm{int}}$ is the remaining nonseparable part,
+defined as $H_{\mathrm{int}} \equiv H - H_1 - H_2$, so such a decomposition
+is always possible, although it is generally only useful if
+$H_{\mathrm{int}}$ is in some sense small.
+If $H_{\mathrm{int}} = 0$, i.e., if there is no interaction between
+the two subsystems, then it is easy to show that
+$\dot{\rho}_i = -i[H_i, \rho_i]/\hbar$, $i = 1, 2$, that is, we can treat each
+subsystem as if the rest of the Universe did not exist, ignoring
+any correlations with the other subsystem that may have
+been present in the full non-separable density matrix $\rho$.
+It is of course this property that makes density matrices
+so useful in the first place, and that led von Neumann
+to invent them [34]: the full system is assumed to obey
+equation (1) simply because its interactions with the rest
+of the Universe are negligible.
+
+B. Fluctuation, dissipation, communication and
+decoherence
+
+In practice, the interaction Hint between subsystems
+is usually not zero. This has a number of qualitatively
+different effects:
+
+1. Fluctuation
+
+2. Dissipation
+
+3. Communication
+
+4. Decoherence
+
+The first two involve transfer of energy between the sub-
+systems, whereas the last two involve exchange of infor-
+mation. The first three occur in classical physics as well
+- only the last one is a purely quantum-mechanical phe-
+nomenon.
+For example, consider a tiny colloid grain (subsystem
+1) in a jar of water (subsystem 2). Collisions with water
+molecules will cause fluctuations in the center-of-mass
+position of the colloid (Brownian motion). If its initial
+velocity is high, dissipation (friction) will slow it down to
+a mean speed corresponding to thermal equilibrium with
+the water. The dissipation timescale $\tau_{\mathrm{diss}}$, defined as the
+time it would take to lose half of the initial excess energy,
+will in this case be of order $\tau_{\mathrm{coll}} \times (M/m)$, where $\tau_{\mathrm{coll}}$ is
+the mean-free time between collisions, $M$ is the colloid mass
+and $m$ is the mass of a water molecule. We will define
+communication as exchange of information. The information
+that the two subsystems have about each other,
+measured in bits, is
+
+\[ I_{12} \equiv S_1 + S_2 - S, \tag{3} \]
+
+where $S_i \equiv -\mathrm{tr}_i\,\rho_i \log \rho_i$ is the entropy of the $i$th
+subsystem, $S \equiv -\mathrm{tr}\,\rho \log \rho$ is the entropy of the total system,
+and the logarithms are base 2. If this mutual information
+is zero, then the states of the two systems are
+uncorrelated and independent, with the density matrix of
+the separable form $\rho = \rho_1 \otimes \rho_2$. If the subsystems start
+out independent, any interaction will at least initially
+increase the subsystem entropies $S_i$, thereby increasing
+the mutual information, since the entropy $S$ of the total
+system always remains constant.
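+
+As a small illustration of equation (3) (an added example, not part of the
+original calculation), consider two two-state systems in three simple
+states, with all entropies in bits:
+
+\[ \begin{array}{llll}
+\text{state} & S_1 = S_2 & S & I_{12} \\
+|0\rangle|0\rangle \ \text{(product)} & 0 & 0 & 0 \\
+\tfrac{1}{2}\left(|00\rangle\langle 00| + |11\rangle\langle 11|\right) \ \text{(classical correlation)} & 1 & 1 & 1 \\
+\left(|00\rangle + |11\rangle\right)/\sqrt{2} \ \text{(entangled)} & 1 & 0 & 2
+\end{array} \]
+
+In the last case the total state is pure ($S = 0$) while each subsystem,
+if it started out in a product state, has gone from 0 to 1 bit of entropy,
+so all of the subsystem entropy appears as mutual information.
+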
+This apparent entropy increase of subsystems, which
+is related to the arrow of time and the 2nd law of
+thermodynamics [35], occurs also in classical physics.
+However, quantum mechanics produces a qualitatively new
+effect as well, known as decoherence [11,36,37],
+suppressing off-diagonal elements in the reduced density
+matrices $\rho_i$. This effect destroys the ability to observe
+long-range quantum superpositions within the subsystems,
+and is now rather well-understood and uncontroversial
+[30,38–42] – the interested reader is referred to [43] and
+a recent book on decoherence [44] for details. For
+instance, if our colloid was initially in a superposition of
+two locations separated by a centimeter, this
+macrosuperposition would for all practical purposes be destroyed
+by the first collision with a water molecule, i.e., on a
+timescale $\tau_{\mathrm{dec}}$ of order $\tau_{\mathrm{coll}}$, with the quantum
+superposition surviving only on scales below the de Broglie
+wavelength of the water molecules [45,46] (see footnote 2). This means
+that $\tau_{\mathrm{diss}}/\tau_{\mathrm{dec}} \sim M/m$ in our example, i.e., that
+decoherence is much faster than dissipation for macroscopic
+objects, and this qualitative result has been shown to hold
+quite generally as well (see [43] and references therein).
+Loosely speaking, this is because each microscopic particle
+that scatters off of the subsystem carries away only
+a tiny fraction $m/M$ of the total momentum, but essentially
+all of the necessary information.
+
+2 Decoherence picks out a preferred basis in the quantum-mechanical
+Hilbert space, termed the “pointer basis” by
+Zurek [36], in which superpositions are rapidly destroyed and
+classical behavior is approached. This normally includes the
+position basis, which is why we never experience superpositions
+of objects in macroscopically different positions.
+Decoherence is quite generic. Although it has been claimed that
+this preferred basis consists of the maximal set of commuting
+observables that also commute with $H_{\mathrm{int}}$ (the “microstable
+basis” of Omnès [43]), this is in fact merely a sufficient
+condition, not a necessary one. If $[H_{\mathrm{int}}, x] = 0$ for some observable
+$x$ but $[H_{\mathrm{int}}, p] \neq 0$ for its conjugate $p$, then the interaction
+will indeed cause decoherence for $x$ as advertised. But this
+will happen even if $[H_{\mathrm{int}}, x] \neq 0$: all that matters is that
+$[H_{\mathrm{int}}, p] \neq 0$, i.e., that the interaction Hamiltonian contains
+(“measures”) $x$.
+
+[Figure 1: diagram with axes "Dissipation time/Decoherence time" and
+"Dynamical time/Decoherence time" (ticks at 0.1, 1, 10, 100), divided into
+regions labeled "Quantum system", "Classical system", "Not independent
+system" and "Impossible".]
+
+FIG. 1. The qualitative behavior of a subsystem depends on
+the timescales for dynamics, dissipation and decoherence. This
+classification is by necessity quite crude, so the boundaries should
+not be thought of as sharp.
+
+C. Classification of systems
+
+Let us define the dynamical timescale $\tau_{\mathrm{dyn}}$ of a subsystem
+as that which is characteristic of its internal dynamics.
+For a planetary system or an atom, $\tau_{\mathrm{dyn}}$ would be
+the orbital period.
+The qualitative behavior of a system depends on the
+ratio of these timescales, as illustrated in Figure 1. If
+$\tau_{\mathrm{dyn}} \ll \tau_{\mathrm{dec}}$, we are dealing with a true quantum system,
+since its superpositions can persist long enough to
+be dynamically important. If $\tau_{\mathrm{dyn}} \gg \tau_{\mathrm{diss}}$, it is hardly
+meaningful to view it as an independent system at all,
+since its internal forces are so weak that they are dwarfed
+by the effects of the surroundings. In the intermediate
+case where $\tau_{\mathrm{dec}} \ll \tau_{\mathrm{dyn}} \lesssim \tau_{\mathrm{diss}}$, we have a familiar
+classical system.
+The relation between $\tau_{\mathrm{dec}}$ and $\tau_{\mathrm{diss}}$ depends only on
+the form of $H_{\mathrm{int}}$, whereas the question of whether $\tau_{\mathrm{dyn}}$
+falls between these values depends on the normalization
+of $H_{\mathrm{int}}$ in equation (2). Since $\tau_{\mathrm{dec}} \sim \tau_{\mathrm{diss}}$ for microscopic
+(atom-sized) systems and $\tau_{\mathrm{dec}} \ll \tau_{\mathrm{diss}}$ for macroscopic
+ones, Figure 1 shows that whereas macroscopic systems
+can behave quantum-mechanically, microscopic ones can
+never behave classically.
+
+D. Three systems: subject, object and environment
+
+Most discussions of quantum statistical mechanics split
+the Universe into two subsystems [47]: the object under
+consideration and everything else (referred to as the en-
+vironment). Since our purpose is to model the observer,
+we need to include a third subsystem as well, the subject.
+As illustrated in Figure 2, we therefore decompose the
+total system into three subsystems:
+
+• The subject consists of the degrees of freedom as-
+sociated with the subjective perceptions of the ob-
+server. This does not include any other degrees of
+freedom associated with the brain or other parts of
+the body.
+
+• The object consists of the degrees of freedom that
+the observer is interested in studying, e.g., the
+pointer position on a measurement apparatus.
+
+• The environment consists of everything else, i.e.,
+all the degrees of freedom that the observer is not
+paying attention to. By definition, these are the
+degrees of freedom that we always perform a partial
+trace over.
+
+[Figure 2: diagram of three subsystems, SUBJECT ($H_s$), OBJECT ($H_o$) and
+ENVIRONMENT ($H_e$), connected by three interactions: $H_{so}$ (measurement,
+observation, "wavefunction collapse", willful action), $H_{oe}$ (object
+decoherence) and $H_{se}$ (subject decoherence, finalizing decisions). The
+diagram also carries the annotations "(Always traced over)" and
+"(Always zero entropy)".]
+
+FIG. 2. An observer can always decompose the world into three
+subsystems: the degrees of freedom corresponding to her subjective
+perceptions (the subject), the degrees of freedom being studied (the
+object), and everything else (the environment). As indicated, the
+subsystem Hamiltonians $H_s$, $H_o$, $H_e$ and the interaction
+Hamiltonians $H_{so}$, $H_{oe}$, $H_{se}$ can cause qualitatively very different effects,
+which is why it is often useful to study them separately. This paper
+focuses on the interaction $H_{se}$.
+
+Note that the first two definitions are very restrictive.
+Whereas the subject would include the entire body of
+the observer in the common way of speaking, only very
+few degrees of freedom qualify as our subject or object.
+For instance, if a physicist is observing a Stern-Gerlach
+apparatus, the vast majority of the $\sim 10^{28}$ degrees of
+freedom in the observer and apparatus are counted
+as environment, not as subject or object.
+The term “perception” is used in a broad sense in the first
+definition, including thoughts, emotions and any other attributes
+of the subjectively perceived state of the observer.
+The practical usefulness of this decomposition lies in the fact
+that one can often neglect everything except the object
+and its internal dynamics (given by $H_o$) to first order,
+using simple prescriptions to correct for the interactions
+with the subject and the environment. The effects of
+both $H_{so}$ and $H_{oe}$ have been extensively studied in the
+literature. $H_{so}$ involves quantum measurement, and gives
+rise to the usual interpretation of the diagonal elements of
+the object density matrix as probabilities. $H_{oe}$ produces
+decoherence, selecting a preferred basis and making the
+object act classically if the conditions in Figure 1 are met.
+In contrast, $H_{se}$, which causes decoherence directly in
+the subject system, has received relatively little attention.
+It is the focus of the present paper, and the next
+section is devoted to quantitative calculations of decoherence
+in brain processes, aimed at determining whether
+the subject system should be classified as classical or
+quantum in the sense of Figure 1. We will return to
+Figure 2 and a more detailed discussion of its various
+subsystem interactions in Section IV.
+
+III. DECOHERENCE RATES
+
+In this section, we will make quantitative estimates
+of decoherence rates for neurological processes. We first
+analyze the process of neuron firing, widely assumed to be
+central to cognitive processes. We also analyze electrical
+excitations in microtubules, which Penrose and others
+have suggested may be relevant to conscious thought.
+
+A. Neuron firing
+
+Neurons (see Figure 3) are one of the key building
+blocks of the brain’s information processing system. It is
+widely believed that the complex network of $\sim 10^{11}$ neurons
+with their nonlinear synaptic couplings is in some
+way linked to our subjective perceptions, i.e., to the subject
+degrees of freedom. If this picture is correct, then if
+$H_s$ or $H_{so}$ puts the subject into a superposition of two
+distinct mental states, some neurons will be in a superposition
+of firing and not firing. How fast does such a
+superposition of a firing and non-firing neuron decohere?
+Let us consider this process in more detail. For
+introductory reviews of neuron dynamics, the reader is
+referred to, e.g., [48–50]. Like virtually all animal cells,
+neurons have ATP-driven pumps in their membranes
+which push sodium ions out of the cell into the surrounding
+fluids and potassium ions the other way. The former
+process is slightly more efficient, so the neuron contains a
+slight excess of negative charge in its “resting” state,
+corresponding to a potential difference $U_0 \approx -0.07$ V across
+the axon membrane (“axolemma”). There is an inherent
+instability in the system, however. If the potential
+becomes substantially less negative, then voltage-gated
+sodium channels in the axon membrane open up, allowing
+Na$^+$ ions to come gushing in. This makes the potential
+still less negative, causes still more opening, etc. This
+chain reaction, “firing”, propagates down the axon at a
+speed of up to 100 m/s, changing the potential difference
+to a value $U_1$ that is typically of order $+0.03$ V [49].
+The axon quickly recovers. After less than $\sim 1$ ms, the
+sodium channels close regardless of the voltage, and large
+potassium channels (also voltage gated, but with a time
+delay) open up allowing K$^+$ ions to flow out and restore
+the resting potential $U_0$. The ATP-driven pumps quickly
+restore the Na$^+$ and K$^+$ concentrations to their initial
+values, making the neuron ready to fire again if triggered.
+Fast neurons can fire over $10^3$ times per second.
+
+[Figure 3: schematic of a neuron with labels "dendrites", "cell body", "axon",
+"myelin insulation", "fraction f not insulated", "length L", "diameter d",
+"pulse direction", "axon membrane", "thickness h", "voltage sensitive gate",
+and Na$^+$ ions marked "Here if firing" and "Here if not firing".]
+
+FIG. 3. Schematic illustration of a neuron (left), a section of
+the myelinated axon (center) and a piece of its axon membrane
+(right). The axon is typically insulated (myelinated) with small
+bare patches every 0.5 mm or so (so-called Nodes of Ranvier) where
+the voltage-sensitive sodium and potassium gates are concentrated
+[51,52]. If the neuron is in a superposition of firing and not firing,
+then $N \sim 10^6$ Na$^+$ ions are in a superposition of being inside and
+outside the cell (right).
+
+Consider a small patch of the membrane, assumed to
+be roughly flat with uniform thickness $h$ as in Figure 3.
+If there is an excess surface density $\pm\sigma$ of charge near
+the inside/outside membrane surfaces, giving a voltage
+differential $U$ across the membrane, then application of
+Gauss’ law tells us that $\sigma = \epsilon_0 E$, where the electric field
+strength in the membrane is $E = U/h$ and $\epsilon_0$ is the vacuum
+permittivity. Consider an axon of length $L$ and
+diameter $d$, with a fraction $f$ of its surface area bare (not
+insulated with myelin). The total active surface area is
+thus $A = \pi d L f$, so the total number of Na$^+$ ions that
+migrate in during firing is
+
+\[ N = \frac{A\sigma}{q} = \frac{\pi d L f \epsilon_0 (U_1 - U_0)}{q h}, \tag{4} \]
+
+where $q$ is the ionic charge ($q = q_e$, the absolute value
+of the electron charge). Taking values typical for central
+nervous system axons [52,53], $h = 8$ nm, $d = 10$ µm,
+$L = 10$ cm, $f = 10^{-3}$, $U_0 = -0.07$ V and $U_1 = +0.03$ V
+gives $N \approx 10^6$ ions, and reasonable variations in our
+parameters can change this number by a few orders of
+magnitude.
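+
+As a rough numerical check of equation (4) (added here for illustration, not
+part of the original text), the quoted parameters can be inserted directly.
+The constants $\epsilon_0 \approx 8.85 \times 10^{-12}$ F/m and
+$q_e \approx 1.60 \times 10^{-19}$ C are standard values assumed for this sketch:
+
+\[ A = \pi d L f \approx \pi\,(10^{-5}\ \mathrm{m})(10^{-1}\ \mathrm{m})(10^{-3}) \approx 3.1 \times 10^{-9}\ \mathrm{m^2}, \]
+\[ \sigma = \frac{\epsilon_0 (U_1 - U_0)}{h} \approx \frac{(8.85 \times 10^{-12}\ \mathrm{F/m})(0.10\ \mathrm{V})}{8 \times 10^{-9}\ \mathrm{m}} \approx 1.1 \times 10^{-4}\ \mathrm{C/m^2}, \]
+\[ N = \frac{A\sigma}{q_e} \approx \frac{(3.1 \times 10^{-9})(1.1 \times 10^{-4})}{1.60 \times 10^{-19}} \approx 2 \times 10^{6}. \]
+
+This agrees with the order-of-magnitude value $N \approx 10^6$ quoted above.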
+
+B. Neuron decoherence mechanisms
+
+Above we saw that a quantum superposition of the
+neuron states “resting” and “firing” involves of order a
+million ions being in a spatial superposition of inside and
+outside the axon membrane, separated by a distance of
+order h ∼ 10 nm. In this subsection, we will compute the
+timescale on which decoherence destroys such a superpo-
+sition.
+
+In this analysis, the object is the neuron, and the su-
+perposition will be destroyed by any interaction with
+other (environment) degrees of freedom that is sensitive
+to where the ions are located. We will consider the fol-
+lowing three sources of decoherence for the ions:
+
+1. Collisions with other ions
+
+2. Collisions with water molecules
+
+3. Coulomb interactions with more distant ions
+
+There are many more decoherence mechanisms [44–46].
+Exotic candidates such as quantum gravity [54] and
+modified quantum mechanics [55,56] are generally much
+weaker [46]. A number of decoherence effects may be even
+stronger than those listed, e.g., interactions as the ions
+penetrate the membrane — the listed effects will turn out
+to be so strong that we can make our argument by sim-
+ply using them as lower limits on the actual decoherence
+rate.
+Let $\rho$ denote the density matrix for the position $r$ of a
+single Na$^+$ ion. As reviewed in the Appendix, all three
+of the listed processes cause $\rho$ to evolve as
+
+\[ \rho(x, x', t_0 + t) = \rho(x, x', t_0)\, f(x, x', t) \tag{5} \]
+
+for some function $f$ that is independent of the ion state
+$\rho$ and depends only on the interaction Hamiltonian $H_{\mathrm{int}}$.
+This assumes that we can neglect the motion of the ion
+itself on the decoherence timescale; we will see that
+this condition is met with a broad margin.
+
+1. Ion–ion collisions
+
+For scattering of environment particles (processes 1
+and 2) that have a typical de Broglie wavelength $\lambda$, we
+have [46]
+
+\[ f(x, x', t) = \exp\left[-\Lambda t \left(1 - e^{-|x'-x|^2/2\lambda^2}\right)\right]
+\approx \begin{cases}
+e^{-|x'-x|^2 \Lambda t/2\lambda^2} & \text{for } |x'-x| \ll \lambda, \\
+e^{-\Lambda t} & \text{for } |x'-x| \gg \lambda.
+\end{cases} \tag{6} \]
+
+Here $\Lambda$ is the scattering rate, given by $\Lambda \equiv n\langle\sigma v\rangle$, where
+$n$ is the density of scatterers, $\sigma$ is the scattering cross
+section and $v$ is the velocity. The product $\sigma v$ is averaged
+over the velocity distribution, which we take to
+be a thermal (Boltzmann) distribution corresponding
+to $T = 37^\circ$C $\approx 310$ K. The gist of equation (6) is
+that a single collision decoheres the ion down to the
+de Broglie wavelength of the scattering particle. The
+information $I_{12}$ communicated during the scattering is
+$I_{12} \sim \log_2(\Delta x/\lambda)$ bits, where $\Delta x$ is the initial spread in
+the position of our particle.
+Since the typical de Broglie wavelength of a Na$^+$ ion
+(mass $m \approx 23 m_p$) or H$_2$O molecule ($m \approx 18 m_p$) is
+
+\[ \lambda = \frac{2\pi\hbar}{\sqrt{3mkT}} \approx 0.03\ \mathrm{nm} \tag{7} \]
+
+at 310 K, far smaller than the membrane thickness
+$h \sim 10$ nm over which we need to maintain quantum
+coherence, we are clearly in the $|x'-x| \gg \lambda$ limit of
+equation (6). This means that the spatial superposition
+of an ion decays exponentially on a timescale $\Lambda^{-1}$, of order its mean
+free time between collisions. Since the superposition of
+the neuron states “resting” and “firing” involves $N$ such
+superposed ions, it thus gets destroyed on a timescale
+$\tau \equiv (N\Lambda)^{-1}$.
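+
+Before evaluating $\tau$, equation (7) can be checked numerically (an added
+illustration, not part of the original text). The constants
+$2\pi\hbar \approx 6.63 \times 10^{-34}$ J s, $k \approx 1.38 \times 10^{-23}$ J/K and
+$m_p \approx 1.67 \times 10^{-27}$ kg are standard values assumed for this sketch.
+For a Na$^+$ ion at $T = 310$ K:
+
+\[ m \approx 23 m_p \approx 3.85 \times 10^{-26}\ \mathrm{kg}, \qquad kT \approx 4.28 \times 10^{-21}\ \mathrm{J}, \]
+\[ \sqrt{3mkT} \approx 2.2 \times 10^{-23}\ \mathrm{kg\,m/s}, \qquad
+\lambda \approx \frac{6.63 \times 10^{-34}}{2.2 \times 10^{-23}}\ \mathrm{m} \approx 3.0 \times 10^{-11}\ \mathrm{m} = 0.030\ \mathrm{nm}. \]
+
+The same steps with $m \approx 18 m_p$ give about 0.034 nm for a water molecule,
+so in both cases the membrane thickness exceeds $\lambda$ by a factor of a few hundred.
+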
+Let us now evaluate $\tau$. Coulomb scattering between
+two ions of unit charge gives substantial deflection angles
+($\theta \sim 1$) with a cross section of order (see footnote 3)
+
+\[ \sigma \sim \left(\frac{g q^2}{m v^2}\right)^2, \tag{9} \]
+
+where $v$ is the relative velocity and $g \equiv 1/4\pi\epsilon_0$ is the
+Coulomb constant. In thermal equilibrium, the kinetic
+energy $mv^2/2$ is of order $kT$, so $v \sim \sqrt{kT/m}$. For the
+ion density, let us write $n = \eta\, n_{\mathrm{H_2O}}$, where the density
+of water molecules $n_{\mathrm{H_2O}}$ is about
+$(1\ \mathrm{g/cm^3})/(18 m_p) \sim 10^{23}/\mathrm{cm}^3$
+and $\eta$ is the relative concentration of ions (positive
+and negative combined). Typical ion concentrations
+during the resting state are [Na$^+$] = 9.2 (120) mmol/l and
+[K$^+$] = 140 (2.5) mmol/l inside (outside) the axon membrane
+[48], corresponding to total Na$^+$ + K$^+$ concentrations
+of $\eta \approx 0.00027$ (0.00022) inside (outside). To be
+conservative, we will simply use $\eta \approx 0.0002$ throughout.
+Ion–ion collisions therefore destroy the superposition on
+a timescale
+
+\[ \tau \sim \frac{1}{N n \sigma v} \sim \frac{\sqrt{m (kT)^3}}{N g^2 q_e^4 n} \sim 10^{-20}\ \mathrm{s}. \tag{10} \]
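+
+As an order-of-magnitude check of equation (10) (an added illustration, not
+part of the original text), one can insert the rounded values quoted above:
+$N = 10^6$, $m \approx 23 m_p$, $kT \approx 4.28 \times 10^{-21}$ J and
+$n = \eta\, n_{\mathrm{H_2O}} \approx 0.0002 \times 10^{23}\ \mathrm{cm^{-3}} = 2 \times 10^{25}\ \mathrm{m^{-3}}$.
+The constants $g \approx 8.99 \times 10^{9}$ N m$^2$/C$^2$ and
+$q_e \approx 1.60 \times 10^{-19}$ C are standard values assumed for this sketch:
+
+\[ \sqrt{m (kT)^3} \approx 5.5 \times 10^{-44}\ \mathrm{kg^{1/2}\,J^{3/2}}, \qquad
+g^2 q_e^4 \approx 5.3 \times 10^{-56}\ \mathrm{J^2\,m^2}, \]
+\[ \tau \sim \frac{5.5 \times 10^{-44}}{(10^{6})(5.3 \times 10^{-56})(2 \times 10^{25})}\ \mathrm{s} \approx 5 \times 10^{-20}\ \mathrm{s}. \]
+
+Using the unrounded $n_{\mathrm{H_2O}} \approx 3.3 \times 10^{22}\ \mathrm{cm^{-3}}$ instead
+raises this to roughly $10^{-19}$ s, so the estimate should be read as good
+to about an order of magnitude.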
+
+[Footnote 3] If the first ion starts at rest at r1 = (0, 0, 0) and the second
+is incident with r2 = (vt, b, 0), then a very weak scattering
+with deflection angle θ ≪ 1 will leave these trajectories
+roughly unchanged, the radial force F = gq^2/|r1 − r2|^2 merely
+causing a net transverse acceleration [57]
+
+\begin{equation}
+\Delta v_y = \int_{-\infty}^{\infty} \frac{\hat{y}\cdot F}{m}\,dt
+  = \int_{-\infty}^{\infty} \frac{g q^2 b\,dt}{m\,[b^2 + (vt)^2]^{3/2}}
+  = \frac{2 g q^2}{m v b}. \tag{8}
+\end{equation}
+
+The approximation breaks down as the deflection angle θ ≈
+∆vy/v approaches unity. This occurs for b ∼ gq^2/mv^2, giving
+σ = πb^2 as in equation (9).
+
+2. Ion–water collisions
+
+Since H2O molecules are electrically neutral, the cross
+section is dominated by their electric dipole moment
+p ≈ 1.85 Debye ≈ (0.0385 nm) × qe. We can model this
+dipole as two opposing unit charges separated by a distance
+y ≡ p/qe ≪ b, so summing the two corresponding
+contributions from equation (8) gives a deflection angle
+
+\begin{equation}
+\theta \approx \frac{2 g q_e p}{m v^2 b^2}. \tag{11}
+\end{equation}
+
+This gives a cross section
+
+\begin{equation}
+\sigma = \pi b^2 \sim \frac{g q_e p}{m v^2} \tag{12}
+\end{equation}
+
+for scattering with large (θ ∼ 1) deflections. Although σ
+is smaller than for the case of ion–ion collisions, n is larger
+because the concentration factor η drops out, giving a
+final result
+
+\begin{equation}
+\tau \sim \frac{1}{N n \sigma v} \sim \frac{\sqrt{mkT}}{N g q_e p n} \sim 10^{-20}\,\mathrm{s}. \tag{13}
+\end{equation}
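+
+A numerical check of equation (13), under the same assumptions as for the ion–ion case (N ∼ 10^6, Na+ mass, standard constants), now with n = nH2O ∼ 10^29 m^−3 because the factor η drops out:
+
+```latex
+\begin{align*}
+p &\approx (0.0385\,\mathrm{nm})\,q_e \approx 6.2\times10^{-30}\,\mathrm{C\,m}, \\
+g q_e p &\approx (8.99\times10^{9})(1.60\times10^{-19})(6.2\times10^{-30})\,\mathrm{J\,m^2}
+  \approx 8.9\times10^{-39}\,\mathrm{J\,m^2}, \\
+\sqrt{mkT} &\approx \sqrt{(3.8\times10^{-26})(4.3\times10^{-21})}\,\mathrm{kg\,m/s}
+  \approx 1.3\times10^{-23}\,\mathrm{kg\,m/s}, \\
+\tau &\sim \frac{1.3\times10^{-23}}{(10^{6})(8.9\times10^{-39})(10^{29})}\,\mathrm{s}
+  \approx 1.5\times10^{-20}\,\mathrm{s}.
+\end{align*}
+```
+
+The smaller cross section is compensated by the much larger density of scatterers, which is why the ion–water and ion–ion timescales come out comparable.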
+
+3. Interactions with distant ions
+
+As shown in the Appendix, long-range interaction with
+a distant (environment) particle gives
+
+\begin{equation}
+f(r, r', t) = \hat{p}_2\left[M(r' - r)\,t/\hbar\right], \tag{14}
+\end{equation}
+
+up to a phase factor that is irrelevant for decoherence.
+Here p̂2 is the Fourier transform of p2(r) ≡ ρ2(r, r), the
+probability distribution for the location of the environ-
+ment particle. M is the 3 × 3 Hessian matrix of second
+derivatives of the interaction potential of the two parti-
+cles at their mean separation. A slightly less general for-
+mula was derived in the seminal paper [45]. For roughly
+thermal states, ρ2 (and thus p) is likely to be well ap-
+proximated by a Gaussian [58,59]. This gives
+
+\begin{equation}
+f(r, r', t) = \exp\left[-\tfrac{1}{2}\,(r' - r)^t M^t \Sigma M (r' - r)\,t^2/\hbar^2\right], \tag{15}
+\end{equation}
+
+where Σ = ⟨r2 r2^t⟩ − ⟨r2⟩⟨r2^t⟩ is the covariance matrix of
+the location of the environment particle.
+Coherence
+is destroyed when the exponent becomes of order unity,
+i.e., on a timescale
+
+\begin{equation}
+\tau \equiv \left[(r' - r)^t M^t \Sigma M (r' - r)\right]^{-1/2} \hbar. \tag{16}
+\end{equation}
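+
+The step from equation (14) to equations (15) and (16) uses the characteristic function of a Gaussian. The sketch below assumes the Fourier convention p̂2(k) = ∫p2(x)e^{−ik·x}d^3x, which is not stated in this section, and introduces the shorthand d ≡ r′ − r and k ≡ Mdt/ħ for brevity.
+
+```latex
+\begin{align*}
+p_2(x) &= \frac{\exp\left[-\tfrac{1}{2}(x - \langle r_2\rangle)^t \Sigma^{-1} (x - \langle r_2\rangle)\right]}{\sqrt{(2\pi)^3 \det\Sigma}}
+  \;\Longrightarrow\;
+  \hat{p}_2(k) = e^{-ik\cdot\langle r_2\rangle}\,e^{-\frac{1}{2}k^t \Sigma k}, \\
+|f| &= \left|\hat{p}_2(M d\,t/\hbar)\right| = \exp\left[-\tfrac{1}{2}\,d^t M^t \Sigma M d\;t^2/\hbar^2\right], \\
+|f| &= e^{-1/2} \;\Longleftrightarrow\; t = \hbar\left[d^t M^t \Sigma M d\right]^{-1/2} = \tau.
+\end{align*}
+```
+
+The first factor of p̂2 is the pure phase referred to in the text; only the Gaussian factor suppresses the off-diagonal elements.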
+
+Assuming a Coulomb potential V = gq^2/|r2 − r1| gives
+M = (3ââ^t − I)gq^2/a^3 where a ≡ r2 − r1 = aâ, |â| =
+1. For thermal states, we have the isotropic case Σ =
+(∆x)^2 I, so equation (16) reduces to
+
+\begin{equation}
+\tau = \frac{\hbar a^3}{g q^2 |r' - r|\,\Delta x}\left(1 + 3\cos^2\theta\right)^{-1/2}, \tag{17}
+\end{equation}
+
+where cos θ ≡ â · (r′ − r)/|r′ − r|. To be conservative,
+we take ∆x to be as small as the uncertainty principle
+allows. With the thermal constraint (∆p)^2/m ≲ kT on
+the momentum uncertainty, this gives
+
+\begin{equation}
+\Delta x = \frac{\hbar}{2\Delta p} \sim \frac{\hbar}{\sqrt{mkT}}. \tag{18}
+\end{equation}
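+
+The reduction of equation (16) to equation (17) can be verified directly. In the sketch below, P ≡ ââ^t and d ≡ r′ − r are shorthand introduced here, and the numerical value of ∆x assumes a Na+ ion at 310 K with standard constants.
+
+```latex
+\begin{align*}
+M &= \frac{g q^2}{a^3}\,(3P - I), \qquad P^2 = P
+  \;\Longrightarrow\;
+  M^t M = \left(\frac{g q^2}{a^3}\right)^2 (9P - 6P + I) = \left(\frac{g q^2}{a^3}\right)^2 (3P + I), \\
+d^t M^t \Sigma M d &= (\Delta x)^2\, d^t M^t M d
+  = (\Delta x)^2 \left(\frac{g q^2}{a^3}\right)^2 |d|^2 \left(1 + 3\cos^2\theta\right), \\
+\tau &= \hbar\left[d^t M^t \Sigma M d\right]^{-1/2}
+  = \frac{\hbar a^3}{g q^2 |d|\,\Delta x}\left(1 + 3\cos^2\theta\right)^{-1/2}, \\
+\Delta x &\sim \frac{\hbar}{\sqrt{mkT}} \approx \frac{1.05\times10^{-34}}{1.3\times10^{-23}}\,\mathrm{m} \approx 8\times10^{-12}\,\mathrm{m}.
+\end{align*}
+```
+
+The angular factor (1 + 3cos^2θ)^{−1/2} lies between 1/2 and 1, so it is dropped in the order-of-magnitude estimates.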
+
+Substituting this into equation (17) and dividing by the
+number of ions N, we obtain the decoherence timescale
+
+\begin{equation}
+\tau \sim \frac{a^3 \sqrt{mkT}}{N g q^2 |r' - r|} \tag{19}
+\end{equation}
+
+caused by a single environment ion a distance a away.
+Each such ion will produce its own suppression factor f,
+so we need to sum the exponent in equation (15) over all
+ions. Since the tidal force M ∝ a^−3 causes the exponent
+to drop as a^−6, this sum will generally be dominated by
+the very closest ion, which will typically be a distance
+a ∼ n^−1/3 away. We are interested in decoherence for
+separations |r′ − r| = h, the membrane thickness, which
+gives
+
+\begin{equation}
+\tau \sim \frac{\sqrt{mkT}}{N g q_e^2 n h} \sim 10^{-19}\,\mathrm{s}. \tag{20}
+\end{equation}
+
+The relation between these different estimates is dis-
+cussed in more detail in the Appendix.
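+
+Equation (20) follows from equation (19) by setting a^3 ∼ 1/n and |r′ − r| = h. A numerical sketch, assuming N ∼ 10^6, the Na+ mass, n ≈ 2 × 10^25 m^−3 (η ≈ 0.0002) and h = 10 nm:
+
+```latex
+\begin{align*}
+a^3 &\sim n^{-1}, \quad |r' - r| = h
+  \;\Longrightarrow\;
+  \tau \sim \frac{a^3\sqrt{mkT}}{N g q_e^2 |r' - r|} = \frac{\sqrt{mkT}}{N g q_e^2 n h}, \\
+\tau &\sim \frac{1.3\times10^{-23}}{(10^{6})(2.3\times10^{-28})(2\times10^{25})(10^{-8})}\,\mathrm{s}
+  \approx 3\times10^{-19}\,\mathrm{s}.
+\end{align*}
+```
+
+This is consistent with the quoted 10^−19 s to within the order-of-magnitude accuracy of the estimate.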
+
+C. Microtubules
+
+Microtubules are a major component of the cytoskele-
+ton, the “scaffolding” that helps cells maintain their
+shapes.
+They are hollow cylinders of diameter D =
+24 nm made up of 13 filaments that are strung together
+out of proteins known as tubulin dimers. These dimers
+can make transitions between two states known as α
+and β, corresponding to different electric dipole moments
+along the axis of the tube. It has been argued that micro-
+tubules may have additional functions as well, serving as
+a means of energy and information transfer [20]. A model
+has been presented whereby the dipole-dipole interac-
+tions between nearby dimers can lead to long-range po-
+larization and kink-like excitations that may travel down
+the microtubules at speeds exceeding 1 m/s [60].
+Penrose has gone further and suggested that the dy-
+namics of such excitations can make a microtubule act
+like a quantum computer, and that microtubules are the
+site of human consciousness [2]. This idea has been further
+elaborated [21–24] employing methods from string
+theory, with the conclusion that quantum superpositions
+of coherent excitations can persist for as long as a second
+before being destroyed by decoherence. See also [61,62].
+This was hailed as a success for the model, the interpretation
+being that the quantum gravity effect on microtubules
+was identified with the human thought process on
+this same timescale.
+This decoherence timescale τ ∼ 1 s was computed assuming
+that quantum gravity is the main decoherence source.
+Since this quantum gravity model is somewhat contro-
+versial [32] and its effect has been found to be more than
+
+20 orders of magnitude weaker than other decoherence
+sources in some cases [46], it seems prudent to evalu-
+ate other decoherence sources for the microtubule case
+as well, to see whether they are in fact dominant. We
+will now do so.
+Using coordinates where the x-axis is along the tube
+axis, the above-mentioned models all focus on the time-
+evolution of p(x), the average x-component of the electric
+dipole moment of the tubulin dimers at each x. In terms
+of this polarization function p(x), the net charge per unit
+length of tube is −p′(x). The propagating kink-like exci-
+tations [60] are of the form
+
+\begin{equation}
+p(x) = \begin{cases} +p_0 & \text{for } x \ll x_0, \\ -p_0 & \text{for } x \gg x_0, \end{cases} \tag{21}
+\end{equation}
+
+where the kink location x0 propagates with constant
+speed and has a width of order a few tubulin dimers.
+The polarization strength p0 is such that the total charge
+around the kink is Q = −∫p′(x)dx = 2p0 ∼ 940qe, due
+to the presence of 18 Ca2+ ions on each of the 13 fila-
+ments contributing to p0 [60].
+Suppose that such a kink is in two different places
+in superposition, separated by some distance |r′ − r|.
+How rapidly will the superposition be destroyed by de-
+coherence?
+To be conservative, we will ignore colli-
+sions between polarized tubulin dimers and nearby water
+molecules, since it has been argued that these may be in
+some sense ordered and part of the quantum system [24]
+– although this argument is difficult to maintain for the
+water outside the microtubule, which permeates the en-
+tire cell volume. Let us instead apply equation (19), with
+N = Q/qe ∼ 10^3. The distance to the nearest ion will
+generally be less than a = D + n^−1/3 ∼ 26 nm, where the
+tubule diameter D = 24 nm dominates over the inter-ion
+separation n^−1/3 ∼ 2 nm in the fluid surrounding
+the microtubule. Superpositions spanning many tubulin
+dimers (|r′ − r| ≫ D) therefore decohere on a timescale
+
+\begin{equation}
+\tau \sim \frac{D^2 \sqrt{mkT}}{N g q_e^2} \sim 10^{-13}\,\mathrm{s} \tag{22}
+\end{equation}
+
+due to the nearest ion alone. This is quite a conservative
+estimate, since the other nD^3 ∼ 10^3 ions that are
+merely a small fraction further away will also contribute
+to the decoherence rate, but it is nonetheless 6-7 orders
+of magnitude shorter than the estimates of Mavromatos
+& Nanopoulos [25–27]. We will comment on screening
+effects below.
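+
+A numerical evaluation of equation (22), assuming that m is the mass of a Na+ environment ion, N ∼ 10^3, D = 24 nm and standard constants:
+
+```latex
+\begin{align*}
+\tau \sim \frac{D^2\sqrt{mkT}}{N g q_e^2}
+  \approx \frac{(2.4\times10^{-8})^2\,(1.3\times10^{-23})}{(10^{3})(2.3\times10^{-28})}\,\mathrm{s}
+  \approx 3\times10^{-14}\,\mathrm{s}.
+\end{align*}
+```
+
+With these inputs the estimate comes out at a few times 10^−14 s, which agrees with the quoted 10^−13 s only at the order-of-magnitude level that the calculation aims for.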
+
+1. Decoherence summary
+
+Our decoherence rates are summarized in Table 1. How
+accurate are they likely to be?
+In the calculations above, we generally tried to be con-
+servative, erring on the side of underestimating the deco-
+herence rate. For instance, we neglected that N potas-
+sium ions also end up in superposition once the neuron
+firing is quenched, we neglected the contribution of other
+abundant ions such as Cl− to η, and we ignored collisions
+with water molecules in the microtubule case.
+Since we were only interested in order-of-magnitude
+estimates, we made a number of crude approximations,
+e.g., for the cross sections. We neglected screening ef-
+fects because the decoherence rates were dominated by
+the particles closest to the system, i.e., the very same par-
+ticles that are responsible for screening the charge from
+more distant ones.
+
+Table 1. Decoherence timescales.
+
+Object        Environment     τdec
+Neuron        Colliding ion   10^−20 s
+Neuron        Colliding H2O   10^−20 s
+Neuron        Nearby ion      10^−19 s
+Microtubule   Distant ion     10^−13 s
+
+IV. DISCUSSION
+
+A. The classical nature of brain processes
+
+The calculations above enable us to address the ques-
+tion of whether cognitive processes in the brain consti-
+tute a classical or quantum system in the sense of Fig-
+ure 1. If we take the characteristic dynamical timescale
+for such processes to be τdyn ∼ 10^−2 s – 10^0 s (the apparent
+timescale of e.g., speech, thought and motor response),
+then a comparison of τdyn with τdec from Table 1
+shows that processes associated with either conventional
+neuron firing or with polarization excitations in micro-
+tubules fall squarely in the classical category, by a mar-
+gin exceeding ten orders of magnitude. Neuron firing it-
+self is also highly classical, since it occurs on a timescale
+τdyn ∼ 10^−3 – 10^−4 s [53]. Even a kink-like microtubule
+excitation is classical by many orders of magnitude, since
+it traverses a short tubule on a timescale τdyn ∼ 5×10^−7 s
+[60].
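+
+The margins quoted above can be read off as ratios of the timescales in the text and in Table 1. The sketch below uses the least favorable (shortest τdyn, longest τdec) combination in each case; the grouping into three rows is an illustrative choice made here.
+
+```latex
+\begin{align*}
+\text{cognitive processes vs. microtubule } \tau_{\rm dec}: \quad
+  & \frac{\tau_{\rm dyn}}{\tau_{\rm dec}} \gtrsim \frac{10^{-2}\,\mathrm{s}}{10^{-13}\,\mathrm{s}} = 10^{11}, \\
+\text{neuron firing}: \quad
+  & \frac{\tau_{\rm dyn}}{\tau_{\rm dec}} \gtrsim \frac{10^{-4}\,\mathrm{s}}{10^{-19}\,\mathrm{s}} = 10^{15}, \\
+\text{microtubule kink}: \quad
+  & \frac{\tau_{\rm dyn}}{\tau_{\rm dec}} \sim \frac{5\times10^{-7}\,\mathrm{s}}{10^{-13}\,\mathrm{s}} = 5\times10^{6}.
+\end{align*}
+```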
+What about other mechanisms?
+It is worth noting
+that if (as is commonly believed) different neuron fir-
+ing patterns correspond in some way to different con-
+scious perceptions, then consciousness itself cannot be
+of a quantum nature even if there is a yet undiscovered
+physical process in the brain with a very long decoherence
+time. As mentioned above, suggestions for such candidates
+have involved, e.g., superconductivity [12], superfluidity
+[13], electromagnetic fields [14], Bose condensation
+[15,16], superfluorescence [17] and other mechanisms
+[18,19]. The reason is that as soon as such a quantum
+subsystem communicates with the constantly decohering
+neurons to create conscious experience, everything deco-
+heres.
+How large a variation in the decoherence rates can
+we obtain by changing our model assumptions? Although
+the rates can be altered by a few orders of magnitude
+by pushing parameters such as the neuron dimensions,
+the myelination fraction or the microtubule kink charge
+
+to the limits of plausibility, it is clearly impossible to
+change the basic conclusion that τdec ≪ 10^−3 s, i.e., that
+we are dealing with a classical system in the sense of Figure 1.
+Even the tiniest neuron imaginable, with only a
+single ion (N = 1) traversing the cell wall during firing,
+would have τdec ∼ 10^−14 s.
+Likewise, reducing the ef-
+fective microtubule kink charge to a small fraction of qe
+would not help.
+How are we to understand the above-mentioned claims
+that brain subsystems can be sufficiently isolated to
+exhibit macroquantum behavior?
+It appears that the
+subtle distinction between dissipation and decoherence
+timescales has not always been appreciated.
+
+B. Implications for the subject-object-environment
+decomposition
+
+Let us now discuss the subsystem decomposition of
+Figure 2 in more detail in light of our results. As the
+figure indicates, the virtue of this decomposition into
+subject, object and environment is that the subsystem
+Hamiltonians Hs, Ho, He and the interaction Hamiltoni-
+ans Hso, Hoe, Hse can cause qualitatively very different
+effects. Let us now briefly discuss each of them in turn.
+Most of these processes are schematically illustrated
+in Figure 4 and Figure 5, where for purposes of illus-
+tration, we have shown the extremely simple case where
+both the subject and object have only a single degree of
+freedom that can take on only a few distinct values (3
+for the subject, 2 for the object). For definiteness, we
+denote the three subject states |¨- ⟩, | ¨⌣⟩ and | ¨⌢⟩, and in-
+terpret them as the observer feeling neutral, happy and
+sad, respectively. We denote the two object states |↑⟩
+and |↓⟩, and interpret them as the spin component (“up”
+or “down”) in the z-direction of a spin-1/2 system, say a
+silver atom. The joint system consisting of subject and
+object therefore has only 2 × 3 = 6 basis states: |¨- ↑⟩,
+|¨- ↓⟩, | ¨⌣↑⟩, | ¨⌣↓⟩, | ¨⌢↑⟩, | ¨⌢↓⟩. In Figures 4 and 5, we
+have therefore plotted ρ as a 6 × 6 matrix consisting of
+nine two-by-two blocks.
+
+[Figure 4: panels labeled “Object evolution” (Ho, entropy constant), “Object decoherence” (Hoe, entropy increases) and “Observation/Measurement” (Hso, entropy decreases).]
+
+FIG. 4. Time evolution of the 6×6 density matrix for the basis
+states |¨- ↑⟩, |¨- ↓⟩, | ¨⌣↑⟩, | ¨⌣↓⟩, | ¨⌢↑⟩, | ¨⌢↓⟩ as the object evolves in
+isolation, then decoheres, then gets observed by the subject. The
+final result is a statistical mixture of the states | ¨⌣↑⟩ and | ¨⌢↓⟩,
+simple zero-entropy states like the one we started with.
+
+1. Effect of Ho: constant entropy
+
+If the object were to evolve during a time interval t
+without interacting with the subject or the environment
+(Hso = Hoe = 0), then according to equation (1) its
+reduced density matrix ρo would evolve into UρoU† with
+the same entropy, since the time-evolution operator U ≡
+e^{−iH_o t} is unitary.
+Suppose the subject stays in the state |¨- ⟩ and the
+object starts out in the pure state |↑⟩. Let the object
+Hamiltonian Ho correspond to a magnetic field in the y-
+direction causing the spin to precess to the x-direction,
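+i.e., to the state (|↑⟩+|↓⟩)/√2. The object density matrix
+ρo then evolves into
+
+\begin{align}
+\rho_o = U|\uparrow\rangle\langle\uparrow|U^\dagger
+  &= \tfrac{1}{2}(|\uparrow\rangle + |\downarrow\rangle)(\langle\uparrow| + \langle\downarrow|) \notag \\
+  &= \tfrac{1}{2}(|\uparrow\rangle\langle\uparrow| + |\uparrow\rangle\langle\downarrow| + |\downarrow\rangle\langle\uparrow| + |\downarrow\rangle\langle\downarrow|), \tag{23}
+\end{align}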
+
+corresponding to the four entries of 1/2 in the second
+matrix of Figure 4.
+This is quite typical of pure quantum time evolution: a
+basis state eventually evolves into a superposition of ba-
+sis states, and the quantum nature of this superposition
+is manifested by off-diagonal elements in ρo. Another familiar
+example of this is the spreading out of the
+wave packet of a free particle.
+
+2. Effect of Hoe: increasing entropy
+
+This was the effect of Ho alone. In contrast, Hoe will
+generally cause decoherence and increase the entropy of
+the object. As discussed in detail in Section III and the
+Appendix, Hoe entangles the object with the environment, which
+suppresses the off-diagonal elements of the reduced density
+matrix of the object as illustrated in Figure 4. If Hoe
+couples to the z-component of the spin, this destroys the
+terms |↑⟩⟨↓| and |↓⟩⟨↑|. Complete decoherence therefore
+converts the final state of equation (23) into
+
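+\begin{equation}
+\rho_o = \tfrac{1}{2}(|\uparrow\rangle\langle\uparrow| + |\downarrow\rangle\langle\downarrow|), \tag{24}
+\end{equation}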
+
+corresponding to the two entries of 1/2 in the third ma-
+trix of Figure 4.
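+
+In matrix form, the change from equation (23) to equation (24) and the accompanying entropy increase can be made explicit. The sketch below assumes the von Neumann entropy measured in bits, S = −tr ρ log2 ρ, and writes the matrices in the basis (|↑⟩, |↓⟩); the superscripts (23) and (24) are labels introduced here.
+
+```latex
+\begin{align*}
+\rho_o^{(23)} &= \frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},
+  \quad \text{eigenvalues } \{1, 0\}, \quad S = 0 \text{ bits}, \\
+\rho_o^{(24)} &= \frac{1}{2}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},
+  \quad \text{eigenvalues } \{\tfrac{1}{2}, \tfrac{1}{2}\}, \quad
+  S = -2\cdot\tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}.
+\end{align*}
+```
+
+Decoherence leaves the diagonal entries untouched and removes only the off-diagonal ones, which is what raises the entropy from zero to one bit.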
+
+3. Effect of Hso: decreasing entropy
+
+Whereas Hoe typically causes the apparent entropy of
+the object to increase, Hso typically causes it to decrease.
+
+Figure 4 illustrates the case of an ideal measurement,
+where the subject starts out in the state |¨- ⟩ and Hso is of
+such a form that the subject gets perfectly correlated with the object.
+In the language of Section II, an ideal measurement is a
+type of communication where the mutual information I12
+between the subject and object systems is increased to its
+maximum possible value. Suppose that the measurement
+is caused by Hso becoming large during a time interval so
+brief that we can neglect the effects of Hs and Ho. The
+joint subject+object density matrix ρso then evolves as
+ρso ↦ UρsoU†, where U ≡ exp[−i∫Hso dt]. If observing
+|↑⟩ makes the subject happy and |↓⟩ makes the subject
+sad, then we have U|¨-↑⟩ = | ¨⌣↑⟩ and U|¨-↓⟩ = | ¨⌢↓⟩. The
+state given by equation (24) would therefore evolve into
+
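+\begin{align}
+\rho_{so} &= \tfrac{1}{2}\,U\left[(|\ddot{-}\rangle\langle\ddot{-}|) \otimes (|\uparrow\rangle\langle\uparrow| + |\downarrow\rangle\langle\downarrow|)\right]U^\dagger \tag{25} \\
+  &= \tfrac{1}{2}\left(U|\ddot{-}\uparrow\rangle\langle\ddot{-}\uparrow|U^\dagger + U|\ddot{-}\downarrow\rangle\langle\ddot{-}\downarrow|U^\dagger\right) \tag{26} \\
+  &= \tfrac{1}{2}\left(|\ddot{\smile}\uparrow\rangle\langle\ddot{\smile}\uparrow| + |\ddot{\frown}\downarrow\rangle\langle\ddot{\frown}\downarrow|\right), \tag{27}
+\end{align}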
+
+as illustrated in Figure 4.
+This final state contains a
+mixture of two subjects, corresponding to definite but
+opposite knowledge of the object state.
+According to
+both of them, the entropy of the object has decreased
+from one bit to zero bits.
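+
+The information gained in this ideal measurement can be quantified. The sketch below assumes the standard definition of mutual information, I12 = S1 + S2 − S, with all entropies in bits; Ss, So and Sso denote the entropies of the subject, the object and the joint system.
+
+```latex
+\begin{align*}
+\text{before measurement (eq. 24)}: \quad
+  & S_s = 0,\; S_o = 1,\; S_{so} = 1
+  \;\Longrightarrow\; I_{12} = S_s + S_o - S_{so} = 0 \text{ bits}, \\
+\text{after measurement (eq. 27)}: \quad
+  & S_s = 1,\; S_o = 1,\; S_{so} = 1
+  \;\Longrightarrow\; I_{12} = 1 \text{ bit}.
+\end{align*}
+```
+
+The joint entropy is unchanged because U is unitary; what changes is that the one bit of uncertainty about the object is now shared with the subject, so that conditional on either subject state the object is in a pure state.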
+In general, we see that the object decreases its entropy
+when it exchanges information with the subject
+and increases it when it exchanges information with the
+environment (see footnote 4). Loosely speaking, the entropy of an object
+decreases while you look at it and increases while
+you don’t (see footnote 5).
+
+[Footnote 4] If n bits of information are exchanged with the environment,
+then equation (3) shows that the object entropy will
+increase by this same amount if the environment is in ther-
+mal equilibrium (with maximal entropy) throughout. If we
+were to know the state of the environment initially (by our
+definition of environment, we do not), then both the object
+and environment entropy will typically increase by n/2 bits.
+[Footnote 5] Here and throughout, we are assuming that the total
+system, which is by definition isolated, evolves according to
+the Schrödinger equation (1). Although modifications of the
+Schrödinger equation have been suggested by some authors,
+either in a mathematically explicit form as in [55,56] or ver-
+bally as a so-called reduction postulate, there is so far no
+experimental evidence suggesting that modifications are nec-
+essary. The original motivations for such modifications were
+
+1. to be able to interpret the diagonal elements of the
+density matrix as probabilities and
+
+2. to suppress off-diagonal elements of the density matrix.
+
+The subsequent discovery by Everett [64] that the probability
+interpretation automatically appears to hold for almost all
+observers in the final superposition solved problem 1, and is
+discussed in more detail in, e.g., [29,66–74]. The still more
+recent discovery of decoherence [11,36,37] solved problem 2,
+as well as explaining so-called superselection rules for the first
+time (why for instance the position basis has a special status)
+[44].
+
+[Figure 5: panels labeled “Subject evolution” (Hs, snap decision) and “Subject decoherence” (Hse).]
+
+FIG. 5. Time evolution of the same 6 × 6 density matrix as in
+Figure 4 when the subject evolves in isolation, then decoheres. The
+object remains in the state |↑⟩ the whole time. The final result is
+a statistical mixture of the two states | ¨⌣↑⟩ and | ¨⌢↑⟩.
+
+4. Effect of Hs: the thought process
+
+So far, we have focused on the object and discussed
+effects of its internal dynamics (Ho) and its interactions
+with the environment (Hoe) and subject (Hso). Let us
+now turn to the subject and consider the role played by
+its internal dynamics (Hs) and interactions with the environment
+(Hse).
+In his seminal 1993 book, Stapp [3]
+presents an argument about brain dynamics that can be
+summarized as follows.
+
+1. Since the brain contains ∼ 10^11 synapses connected
+together by neurons in a highly nonlinear fashion,
+there must be a huge number of metastable reverberating
+patterns of pulses into which the brain can
+evolve.
+
+2. Neural network simulations have indicated that the
+metastable state into which a brain does in fact
+evolve depends sensitively on the initial conditions
+in small numbers of synapses.
+
+3. The latter depends on the locations of a small number
+of calcium atoms, which might be expected to
+be in quantum superpositions.
+
+4. Therefore, one would expect the brain to evolve
+into a quantum superposition of many such
+metastable configurations.
+
+5. Moreover, the fatigue characteristics of the synaptic
+junctions will cause any given metastable state
+to become, after a short time, unstable: the
+subject will then be forced to search for a new
+metastable configuration, and will therefore continue
+to evolve into a superposition of increasingly
+disparate states.
+
+If different states (perceptions) of the subject correspond
+to different metastable states of neuron firing patterns, a
+definite perception would eventually evolve into a super-
+position of several subjectively distinguishable percep-
+tions.
+We will follow Stapp in making this assumption about
+Hs. For illustrative purposes, let us assume that this can
+happen even at the level of a single thought or snap de-
+cision where the outcome feels unpredictable to us. Con-
+sider the following experiment: the subject starts out
+with a blank face and counts silently to three, then makes
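+a snap decision on whether to smile or frown. The time-evolution
+operator U ≡ exp[−i∫Hs dt] will then have
+the property that U|¨- ⟩ = (| ¨⌣⟩ + | ¨⌢⟩)/√2, so the subject
+density matrix ρs will evolve into
+
+\begin{align}
+\rho_s = U|\ddot{-}\rangle\langle\ddot{-}|U^\dagger
+  &= \tfrac{1}{2}(|\ddot{\smile}\rangle + |\ddot{\frown}\rangle)(\langle\ddot{\smile}| + \langle\ddot{\frown}|) \notag \\
+  &= \tfrac{1}{2}(|\ddot{\smile}\rangle\langle\ddot{\smile}| + |\ddot{\smile}\rangle\langle\ddot{\frown}| + |\ddot{\frown}\rangle\langle\ddot{\smile}| + |\ddot{\frown}\rangle\langle\ddot{\frown}|), \tag{28}
+\end{align}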
+
+corresponding to the four entries of 1/2 in the second
+matrix in Figure 5.
+
+5. Effect of Hse: subject decoherence
+
+Just as Hoe can decohere the object, Hse can decohere
+the subject. The difference is that whereas the object can
+be either a quantum system (with small Hoe) or a classi-
+cal system (with large Hoe), a human subject always has
+a large interaction with the environment. As we showed
+in Section III, τdec ≪ τdyn for the subject, i.e., the ef-
+fect of Hse is faster than that of Hs by many orders of
+magnitude. This means that we should strictly speaking
+not think of macrosuperpositions such as equation (28)
+as first forming and then decohering as in Figure 5 —
+rather, subject decoherence is so fast that such superpo-
+sitions decohere already during their process of forma-
+tion. Therefore we are never even close to being able to
+perceive superpositions of different perceptions. Reduc-
+ing object decoherence (from Hoe) during measurement
+would make no difference, since decoherence would take
+place in the brain long before the transmission of the ap-
+propriate sensory input through sensory nerves had been
+completed.
+
+C. He and Hsoe
+
+The environment is of course the most complicated sys-
+tem, since it contains the vast majority of the degrees of
+freedom in the total system. It is therefore very fortu-
+nate that we can so often ignore it, considering only those
+limited aspects of it that affect the subject and object.
+For the most general H, there can also be an ugly
+irreducible residual term Hsoe ≡ H − Hs − Ho − He −
+Hso − Hoe − Hse.
+
+D. Implications for modeling cognitive processes
+
+For the neural network community, the implication of
+our result is “business as usual”, i.e., there is no need
+to worry about the fact that current simulations do not
+incorporate effects of quantum coherence. The only rem-
+nant from quantum mechanics is the apparent random-
+ness that we subjectively perceive every time the subject
+system evolves into a superposition as in equation (28),
+but this can be simply modeled by including a random
+number generator in the simulation. In other words, the
+recipe used to prescribe when a given neuron should fire
+and how synaptic coupling strengths should be updated
+may have to involve some classical randomness to cor-
+rectly mimic the behavior of the brain.
+
+1. Hyper-classicality
+
+If a subject system is to be a good model of us, Hso
+and Hse need to meet certain criteria: decoherence and
+communication are necessary, but fluctuation and dissi-
+pation must be kept low enough that the subject does
+not lose its autonomy completely.
+In our study of neural processes, we concluded that the
+subject is not a quantum system, since τdec ≪ τdyn. How-
+ever, since the dissipation time τdiss for neuron firing is
+of the same order as its dynamical timescale, we see that
+in the sense of Figure 1, the subject is not a simple clas-
+sical system either. It is therefore somewhat misleading
+to think of it as simply some classical degrees of freedom
+evolving fairly undisturbed (only interacting enough to
+stay decohered and occasionally communicate with the
+outside world). Rather, the semi-autonomous degrees of
+freedom that constitute the subject are to be found at a
+higher level of complexity, perhaps as metastable global
+patterns of neuron firing.
+These degrees of freedom might be termed “hyper-classical”:
+although there is nothing quantum-mechanical about their
+equations of motion (except that they can be stochastic),
+they may bear little resemblance to the underlying
+classical equations from which they were derived.
+Energy conservation and other familiar
+concepts from Hamiltonian dynamics will be irrelevant
+for these more abstract equations, since neurons are en-
+ergy pumped and highly dissipative. Other examples of
+such hyper-classical systems include the time-evolution
+of the memory contents of a regular (highly dissipative)
+
+digital computer as well as the motion on the screen of
+objects in a computer game.
+
+2. Nature of the subject system
+
+In this paper, we have tacitly assumed that conscious-
+ness is synonymous with certain brain processes. This is
+what Lockwood terms the “identity theory” [66]. It dates
+back to Hobbes (∼1660) and has been espoused by, e.g.,
+Russell, Feigl, Smart, Armstrong, Churchland and Lock-
+wood himself. Let us briefly explore the more specific
+assumption that the subject degrees of freedom are our
+perceptions. In this picture, some of the subject degrees
+of freedom would have to constitute a “world model”,
+with the interaction Hso such that the resulting commu-
+nication keeps these degrees of freedom highly correlated
+with selected properties of the outside world (object +
+environment). Some such properties, i.e.,
+
+• the intensity of the electromagnetic field on the retina,
+averaged through three narrow-band filters (color
+vision) and one broad-band filter (black-and-white
+vision),
+
+• the spectrum of air pressure fluctuations in the ears
+(sound),
+
+• the chemical composition of gas in the nose (smell)
+and solutions in the mouth (taste),
+
+• heat and pressure at a variety of skin locations,
+
+• locations of body parts,
+
+are tracked rather continuously, with the corresponding
+mutual information I12 between subject and surround-
+ings remaining fairly constant.
+Persisting correlations
+with properties of the past state of the surroundings
+(memories) further contribute to the mutual information
+I12. Much of I12 is due to correlations with quite subtle
+aspects of the surroundings, e.g., the contents of books.
+The total mutual information I12 between a person and
+the external world is fairly low at birth, gradually grows
+through learning, and falls when we forget. In contrast,
+most inanimate objects have a very small mutual information
+with the rest of the world, books and diskettes being
+notable exceptions.
+The extremely limited selection of properties that the
+subject correlates with has presumably been determined
+by evolutionary utility, since it is known to differ between
+species: birds perceive four primary colors but cats only
+one, bees perceive light polarization, etc. In this picture,
+we should therefore not consider these particular (“classi-
+cal”) aspects of our surroundings to be more fundamental
+than the vast majority that the subject system is uncorrelated
+with. Moreover, our perception of e.g. space is as
+subjective as our perception of color, just as suggested
+by e.g. [50].
+
+3. The binding problem
+
+One of the motivations for models with quantum co-
+herence in the brain was the so-called binding problem.
+In the words of James [75,76], “the only realities are the
+separate molecules, or at most cells. Their aggregation
+into a ‘brain’ is a fiction of popular speech”. James’ con-
+cern, shared by many after him, was that consciousness
+did not seem to be spatially localized to any one small
+part of the brain, yet subjectively feels like a coherent
+entity. Because of this, Stapp [3] and many others have
+appealed to quantum coherence, arguing that this could
+make consciousness a holistic effect involving the brain
+as a whole.
+However, non-local degrees of freedom can be important
+even in classical physics. For instance, oscillations
+in a guitar string are local in Fourier space, not in real
+space, so in this case the “binding problem” can be solved
+by a simple change of variables. As Eddington remarked
+[77], when observing the ocean we perceive the moving
+waves as objects in their own right because they display a
+certain permanence, even though the water itself is only
+bobbing up and down. Similarly, thoughts are presum-
+ably highly non-local excitation patterns in the neural
+network of our brain, except of a non-linear and much
+more complex nature.
+In short, this author feels that
+there is no binding problem.
+
+4. Outlook
+
+In summary, our decoherence calculations have indicated
+that there is nothing fundamentally quantum-mechanical
+about cognitive processes in the brain, supporting
+Hepp’s conjecture [33]. Specifically, the computations
+in the brain appear to be of a classical rather
+than quantum nature, and the argument by Lisewski [78]
+that quantum corrections may be needed for accurate
+modeling of some details, e.g., non-Markovian noise in
+neurons, does of course not change this conclusion. This
+means that although the current state-of-the-art in neu-
+ral network hardware is clearly still very far from be-
+ing able to model and understand cognitive processes as
+complex as those in the brain, there are no quantum me-
+chanical reasons to doubt that this research is on the
+right track.
+
+Acknowledgements:
+The author wishes to thank
+the organizers of the Spaatind-98 and Gausdal-99 win-
+ter schools, where much of this work was done, and
+Mark Alford, Philippe Blanchard, Carlton Caves, Angel-
+ica de Oliveira-Costa, Matthew Donald, Andrei Gruzi-
+nov, Piet Hut, Nick Mavromatos, Henry Stapp, Hans-
+Dieter Zeh and Woitek Zurek for stimulating discussions
+and helpful comments. Support for this work was provided
+by the Sloan Foundation and by NASA through
+
+grant NAG5-6034 and Hubble Fellowship HF-01084.01-
+96A from STScI, operated by AURA, Inc. under NASA
+contract NAS5-26555.
+
+APPENDIX: DECOHERENCE FORMULAS
+
+The quantitative effect of decoherence from both short
+range interactions (scattering) and long-range interac-
+tions was first derived in a seminal paper by Joos & Zeh
+[45]. Since our application involved scattering between
+particles of comparable mass, we used a generalized ver-
+sion of these results that included the effect of recoil [46].
+In this Appendix, we derive a slightly generalized formula
+for long-range interactions, and briefly comment on the
+relation between these short-range and long-range limit-
+ing cases.
+
+1. Decoherence due to tidal forces
+
+Even if the dissipation and fluctuation caused by Hint
+is dynamically unimportant, H1 and H2 can be neglected
+in equation (2) when calculating the decoherence effect in
+the many cases where the interaction Hamiltonian deco-
+heres the object on a timescale far below the dynamical
+time. In this approximation, we consider two particles
+with an interaction H = Hint = V (r2 − r1) for some
+potential V . According to equation (1), the two-particle
+density matrix ρ therefore evolves as
+
+\begin{equation}
+\rho(r_1, r_1', r_2, r_2', t_0 + t) = \rho(r_1, r_1', r_2, r_2', t_0)\, e^{-i[V(r_2 - r_1) - V(r_2' - r_1')]\,t/\hbar}. \tag{A1}
+\end{equation}
+
+Following [45], we assume that the two particles are fairly
+localized near their initial average positions
+
+\begin{equation}
+r_i^0 \equiv \langle r_i \rangle_0 = \mathrm{tr}\,[r_i \rho_i(t_0)], \tag{A2}
+\end{equation}
+
+i = 1, 2, and approximate the potential by its second
+order Taylor expansion
+
+\begin{equation}
+V(r_2 - r_1) \approx V(a) - F\cdot(x_2 - x_1) + \tfrac{1}{2}(x_2 - x_1)^t M (x_2 - x_1). \tag{A3}
+\end{equation}
+
+Here F ≡ −∇V(a) is the average force, M is the Hessian
+matrix Mij ≡ ∂i∂jV(a) and a ≡ r2^0 − r1^0. We have introduced
+relative coordinates xi ≡ ri − ri^0. Assuming that
+the two particles are independent initially as in [45], i.e.,
+that ρ(t0) takes the separable form ρ(x1, x1′, x2, x2′, t0) =
+ρ1(x1, x1′, t0)ρ2(x2, x2′, t0), this gives
+
+\begin{equation}
+\rho_1(x_1, x_1', t_0 + t) = \mathrm{tr}_2\,\rho(t_0 + t) = \int \rho(x_1, x_1', x, x, t_0 + t)\,d^3x = \rho_1(x_1, x_1', t_0)\,f(x_1, x_1', t), \tag{A4}
+\end{equation}
+
+where
+
+12
+
+
+f(x1, x′
+1, t) ≈
+
+eiφ(x1,x′
+1,t)
+�
+ρ2(x2, x′
+2, t0)e−it(x′
+1−x1)tMx2/¯hd3x2 =
+
+eiφ(x1,x′
+1,t)�p2[M(x′
+1 − x1)t/¯h].
+(A5)
+
+Here the phase factor
+
+\begin{equation}
+e^{i\phi(x, x', t)} \equiv e^{\frac{i}{\hbar}[F \cdot (x' - x) + \frac{1}{2} {x'}^t M x' - \frac{1}{2} x^t M x]} \tag{A6}
+\end{equation}
+
+is of no importance for decoherence, since it does not
+suppress the magnitude $|\rho_1(x_1, x_1', t)|$ of the off-diagonal
+elements – it merely causes momentum transfer related
+to fluctuation and dissipation. It is the other term
+that causes decoherence. $\hat{p}_2$ is the Fourier transform of
+$p_2(x) \equiv \rho_2(x, x, t_0)$, the probability distribution for the
+location of the environment particle.
+
+2. Properties of the effect
+
+Let us briefly discuss some qualitative features of equation (A5).
+Since $\hat{p}_2(0) = \int p_2(x_2)\, d^3x_2 = \mathrm{tr}\,\rho_2 = 1$,
+$\rho_1(x, x')$ remains unchanged on the diagonal $x = x'$.
+This is because $H_{\rm int}$ is not changing the position of our
+object particle, merely its momentum. Since the
+mean position $\langle x_2 \rangle = \int p_2 x_2\, d^3x_2 = \mathrm{tr}\,[x_2 \rho_2] = 0$
+vanishes (using equation (A2)), we have $\nabla \hat{p}_2(0) = 0$. In
+fact, $|f|$ takes a maximum on the diagonal, and the
+Riemann-Lebesgue Lemma shows that $|f| = |\hat{p}_2| \le 1$
+whenever $x \neq x'$, with equality only for the unphysical
+case where $p_2$ is a delta function, i.e., where the
+location of the environment particle is perfectly known.
+$\partial_i \partial_j |f(0)| = -M \langle x_2 x_2^t \rangle M t^2 / 2\hbar^2$, so the larger $\langle x_2 x_2^t \rangle$
+is (i.e., the more spread out the environment particle is),
+the closer to the diagonal decoherence will suppress our
+density matrix.
+Since $M$ is the shear matrix of the force field $-\nabla V$, we
+see that it is tidal forces that are causing the decoherence;
+the average force $F$ simply contributes to the phase
+factor $e^{i\phi}$. Specifically, the rate at which our object degrees
+of freedom $r_1$ decohere grows with the tidal force
+that it exerts on the environment: if the environment
+particle is spread out with $\langle x_2 x_2^t \rangle$ large, experiencing a
+wide range of forces from the object, object decoherence
+is rapid. In the opposite situation, where the object is
+spread out and the environment is not, the object will
+experience strong classical tidal forces but no decoherence.
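+
+As a hedged illustration of equation (A5), suppose $p_2$ is a zero-mean Gaussian with covariance matrix $\Sigma \equiv \langle x_2 x_2^t \rangle$, and write $\Delta \equiv x_1' - x_1$. The Gaussian form is an assumption made here for concreteness; the derivation above does not require it. With the Fourier convention $\hat{p}_2(k) = \int p_2(x)\, e^{-ik \cdot x}\, d^3x$ and a symmetric $M$, one finds
+
+\begin{equation*}
+\hat{p}_2(k) = e^{-\frac{1}{2} k^t \Sigma k}
+\quad\Longrightarrow\quad
+|f(x_1, x_1', t)| = \exp\!\left[-\frac{t^2}{2\hbar^2}\, \Delta^t M \Sigma M \Delta\right].
+\end{equation*}
+
+In one dimension, with a scalar tidal gradient $M = \kappa$ and $\Sigma = \sigma^2$ (both symbols are illustrative), this reads $|f| = e^{-t^2/\tau^2}$ with $\tau = \sqrt{2}\,\hbar/(|\kappa|\,\sigma\,|\Delta|)$. The suppression is faster for a more spread-out environment particle (larger $\sigma$), a stronger tidal gradient (larger $|\kappa|$), or a larger separation $|\Delta|$ from the diagonal.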
+
+3. Relation between long-range and short-range
+decoherence
+
+Above we derived the effect of decoherence from long-
+range tidal forces. Another interesting case that has been
+solved analytically [45] is that of short-range interactions
+
+that can be modeled as scattering events. If the scatter-
+ing takes place during short enough a time interval that
+we can neglect the internal dynamics of the object, then
+its reduced density matrix changes as [46]
+
+\begin{equation}
+\rho_1(r, r') \mapsto \rho_1(r, r')\, \hat{p}\!\left(\frac{r' - r}{\hbar}\right), \tag{A7}
+\end{equation}
+
+where p(q) is the probability distribution for the momen-
+tum transfer q in the collision. This equation generalizes
+the scattering result of [45] by including the effect of re-
+coil. The larger the uncertainty in momentum transfer,
+the stronger the decoherence effect becomes, since widening
+$p$ narrows its Fourier transform $\hat{p}$. Changing the mean
+momentum transfer $\langle q \rangle$ does not affect the decoherence;
+it merely contributes a phase factor just as $F$ did above.
+Typically, the last factor in equation (A7) destroys coherence
+down to scales of order the de Broglie wavelength
+of the scatterer, with directional modulations from the
+angular dependence of the scattering cross section. Gen-
+eralization to a steady flux of scattering particles [46]
+gives equation (6).
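+As a hedged worked case of equation (A7), let the momentum transfer $q$ be Gaussian with mean $\bar{q}$ and isotropic variance $\sigma_q^2$ per component. This is an illustrative assumption, and $\bar{q}$ and $\sigma_q$ are symbols introduced here. With $\hat{p}(k) = \int p(q)\, e^{-ik \cdot q}\, d^3q$,
+
+\begin{equation*}
+\hat{p}\!\left(\frac{r' - r}{\hbar}\right) = e^{-i\bar{q} \cdot (r' - r)/\hbar}\, \exp\!\left[-\frac{\sigma_q^2\, |r' - r|^2}{2\hbar^2}\right],
+\end{equation*}
+
+so the mean transfer $\bar{q}$ enters only through a phase, while coherence between $r$ and $r'$ is suppressed once $|r' - r|$ exceeds roughly $\hbar/\sigma_q$.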
+Equation (A7) has striking similarities with the tidal
+force result of equation (A5): in both cases, the density
+matrix gets multiplied by the Fourier transform of a probability
+distribution. In fact, up to uninteresting phase
+factors, we can rewrite our equation (A5) in exactly the
+form of equation (A7) by redefining $p$ to be the probability
+distribution for momentum transfer $q = M(x_2 - x_1)t$
+due to tidal forces for a fixed $x_1$, i.e.,
+
+\begin{equation}
+p(q) \equiv p_2(x_2)\, \frac{d^3x_2}{d^3q} = \frac{p_2(x_1 + M^{-1}q/t)}{t^3 \det M}. \tag{A8}
+\end{equation}
+
+Fourier transforming this expression and substituting the
+result into equation (A7), we recover equation (A5) up
+to a phase factor.
+Perhaps the simplest way to understand all these re-
+sults is in terms of Wigner functions [79]. If W(x1, p1) is
+the Wigner phase space distribution for the object parti-
+cle, then any of the momentum-transferring interactions
+that we have considered will take the form
+
+\begin{equation}
+W(x_1, p_1) \mapsto \int W(x_1, p_1 - q)\, p(q, x_1)\, d^3q \tag{A9}
+\end{equation}
+
+for some probability distribution $p$ that may or may not
+depend on $x_1$. Since the density matrix
+
+\begin{equation}
+\rho_1(x_1, x_1') = \int W\!\left(\frac{x_1 + x_1'}{2}, p\right) e^{-i(x_1 - x_1') \cdot p}\, d^3p \tag{A10}
+\end{equation}
+
+is just the Wigner function Fourier transformed in the
+momentum direction (and rotated by $45^\circ$), the convolution
+with $p$ in equation (A9) reduces to a simple multiplication
+with $\hat{p}$ in equation (A7).
+
+[1] R. Penrose, The Emperor’s New Mind (Oxford, Oxford
+Univ. Press, 1989).
+[2] R. Penrose, in The Large, the Small and the Human
+Mind, ed. M. Longair (Cambridge, Cambridge Univ.
+Press, 1997).
+[3] H. P. Stapp, Mind, Matter and Quantum Mechanics
+(Berlin, Springer, 1993).
+[4] D. J. Amit, Modeling Brain Functions (Cambridge, Cam-
+bridge Univ. Press, 1989).
+[5] M. Mézard, G. Parisi, and M. Virasoro, Spin Glass Theory
+and Beyond (Singapore, World Scientific, 1993).
+[6] R. L. Harvey, Neural Network Principles (Englewood
+Cliffs, Prentice Hall, 1994).
+[7] F. H. Eeckman and J. M. Bower, Computation and Neu-
+ral Systems (Boston, Kluwer, 1993).
+[8] D. R. McMillen, G. M. T. D’Eleuterio, and J. R. P.
+Halperin, Phys. Rev. E 59, 6 (1999).
+[9] E. P. Wigner, in The Scientist Speculates: an Anthology
+of Partly-Baked Ideas, p284-302, ed. I. J. Good (London,
+Heinemann, 1962).
+[10] J. Mehra and A. S. Wightman, The Collected Works of
+E. P. Wigner, Vol. VI, p271 (Berlin, Springer, 1995).
+[11] H. D. Zeh, Found. Phys. 1, 69 (1970).
+[12] E. H. Walker, Mathematical Biosciences 7, 131 (1970).
+[13] L. H. Domash, in Scientific Research on TM, ed. D. W.
+Orme-Johnson and J. T. Farrow (Weggis, Switzerland,
+Maharishi Univ. Press, 1977).
+[14] H. P. Stapp, Phys. Rev. D 28, 1386 (1983).
+[15] I. N. Marshall, New Ideas in Psychology 7, 73 (1989).
+[16] D. Zohar, The Quantum Self (New York, William Mor-
+row, 1990).
+[17] H. Rosu, Metaphysical Review 3, 1, gr-qc/9409007
+(1997).
+[18] L. M. Ricciardi and H. Umezawa, Kibernetik 4, 44
+(1967).
+[19] A. Vitiello, Int. J. Mod. Phys. B9, 973-89 (1996).
+[20] S. R. Hameroff and R. C. Watt, Journal of Theoretical
+Biology 98, 549 (1982).
+S. R. Hameroff, Ultimate Computing: Biomolecular Consciousness
+and Nanotechnology (Amsterdam, North-Holland, 1987).
+[21] D. V. Nanopoulos 1995, hep-ph/9505374
+[22] N. Mavromatos and D. V. Nanopoulos 1995, hep-ph/9505401
+[23] N. Mavromatos and D. V. Nanopoulos 1995, quant-ph/9510003
+[24] N. Mavromatos and D. V. Nanopoulos 1995, quant-ph/9512021
+[25] N. Mavromatos and D. V. Nanopoulos, Int. J. Mod. Phys
+B 12, 517, quant-ph/9708003 (1998).
+[26] N. Mavromatos and D. V. Nanopoulos 1998, quant-ph/9802063
+[27] N. Mavromatos 1999, J. Bioelectrochemistry & Bioenergetics 48, 273
+[28] H. P. Stapp, Found. Phys. 21, 1451 (1991).
+[29] H. D. Zeh, quant-ph/9908084, Epistemological Letters of
+the Ferdinand-Gonseth Association 63:0 (Biel, Switzer-
+land, 1981)
+[30] W. H. Zurek, Phys. Today 44 (10), 36 (1991).
+[31] A. Scott, J. Consciousness Studies 6, 484 (1996).
+
+[32] S. Hawking, in The Large, the Small and the Human
+Mind, ed. M. Longair (Cambridge, Cambridge Univ.
+Press, 1997).
+[33] K. Hepp, in Quantum Future, ed. P. Blanchard and A.
+Jadczyk (Berlin, Springer, 1999).
+[34] J. von Neumann, Mathematische Grundlagen der
+Quantenmechanik (Springer, Berlin, 1932).
+[35] H. D. Zeh, The Arrow of Time, 3rd ed. (Springer, Berlin,
+1999).
+[36] W. H. Zurek, Phys. Rev. D 24, 1516 (1981).
+[37] W. H. Zurek, Phys. Rev. D 26, 1862 (1982).
+[38] W. H. Zurek, reprint LAUR 84-2750, in Non-Equilibrium
+Statistical Physics, ed. G. Moore and M. O. Scully (New
+York, Plenum, 1984).
+[39] E. Peres, Am. J. Phys. 54, 688 (1986).
+[40] P. Pearle, Phys. Rev. A 39, 2277 (1989).
+[41] M. R. Gallis and G. N. Fleming, Phys. Rev. A 42, 38
+(1989).
+[42] W. H. Unruh and W. H. Zurek, Phys. Rev. D 40, 1071
+(1989).
+[43] R. Omnès, Phys. Rev. A 56, 3383 (1997).
+[44] D. Giulini, E. Joos, C. Kiefer, J. Kupsch, I. O. Sta-
+matescu, and H. D. Zeh, Decoherence and the Appear-
+ance of a Classical World in Quantum Theory (Springer,
+Berlin, 1996).
+[45] E. Joos and H. D. Zeh, Z. Phys. B 59, 223 (1985).
+[46] M. Tegmark, Found. Phys. Lett. 6, 571 (1993).
+[47] R. P. Feynman, Statistical Mechanics (Reading, Ben-
+jamin, 1972).
+[48] B. Katz, Nerve, Muscle, and Synapse (New York,
+McGraw-Hill, 1966).
+[49] J. P. Schadé and D. H. Ford, Basic Neurology, 2nd ed.
+(Amsterdam, Elsevier, 1973).
+[50] A. G. Cairns-Smith, Evolving the Mind (Cambridge,
+Cambridge Univ. Press, 1996).
+[51] P. Morell and W. T. Norton, Sci. Am. 242, 74 (1980).
+[52] A. Hirano and J. A. Llena, in The Axon, ed. S. G. Wax-
+man, J. D. Kocsis, and P. K. Stys (New York, Oxford
+Univ. Press, 1995).
+[53] J. M. Ritchie, in The Axon, ed. S. G. Waxman, J. D.
+Kocsis, and P. K. Stys (New York, Oxford Univ. Press,
+1995).
+[54] J. Ellis, S. Mohanty, and D. V. Nanopoulos, Phys. Lett. B
+221, 113 (1989).
+[55] P. Pearle, Phys. Rev. D 13, 857 (1976).
+[56] G. C. Ghirardi, A. Rimini, and T. Weber, Phys. Rev. D
+34, 470 (1986).
+[57] J. D. Jackson, Classical Electrodynamics (New York, Wi-
+ley, 1975).
+[58] W. H. Zurek, S. Habib, and J. P. Paz, Phys. Rev. Lett.
+70, 1187 (1993).
+[59] M. Tegmark and H. S. Shapiro, Phys. Rev. E 50, 2538
+(1994).
+[60] M. V. Satarić, J. A. Tuszyński, and R. B. Žakula, Phys.
+Rev. E 48, 589 (1993).
+[61] H. C. Rosu, Phys. Rev. E 55, 2038 (1997).
+[62] H. C. Rosu, Nuovo Cimento D 20, 369 (1998).
+[63] H. P. Stapp 1999,
+Attention, Intention, and Mind in Quantum Physics and
+Quantum Ontology and Mind-Matter Synthesis, available
+at www-physics.lbl.gov/ stapp/stappfiles.html.
+[64] H. Everett III, Rev. Mod. Phys. 29, 454 (1957).
+H. Everett III, The Many-Worlds Interpretation of Quan-
+tum Mechanics, B. S. DeWitt and N. Graham (Prince-
+ton, Princeton Univ. Press, 1986).
+[65] J. A. Wheeler, Rev. Mod. Phys. 29, 463 (1957).
+L. M. Cooper and D. van Vechten, Am. J. Phys 37, 1212
+(1969).
+B. S. DeWitt, Phys. Today 23, 30 (1971).
+[66] M. Lockwood, Mind, Brain and the Quantum (Cam-
+bridge, Blackwell, 1989).
+[67] D. Deutsch The Fabric of Reality (Allen Lane, New York,
+1997).
+[68] D. N. Page A 1995, gr-qc/9507025
+[69] M. J. Donald 1997, quant-ph/9703008
+[70] M. J. Donald 1999, quant-ph/9904001
+[71] L. Vaidman 1996, quant-ph/9609006, Int. Stud. Phil.
+Sci., in press
+[72] T. Sakaguchi 1997, quant-ph/9704039
+[73] M. Tegmark, quant-ph/9709032, Fortschr. Phys. 46, 855
+(1997).
+[74] M. Tegmark, gr-qc/9704009, Annals of Physics 270, 1
+(1998).
+[75] W. James, The Principles of Psychology (New York, Holt,
+1890).
+[76] W. James 1904, in The Writings of William James,
+pp169-183, ed. J. J. McDermott (Chicago, Univ. Chicago
+Press, 1977).
+[77] A. Eddington, Space, Time & Gravitation (Cambridge,
+Cambridge Univ. Press, 1920).
+[78] A. M. Lisewski 1999, quant-ph/9907052
+[79] E. P. Wigner, Phys. Rev. 40, 749 (1932).
+M. Hillery, R. H. O’Connell, M. O. Scully, and E. P. Wigner,
+Phys. Rep. 106, 121 (1984).
+Y. S. Kim and M. E. Noz, Phase Space Picture of Quan-
+tum Mechanics: Group Theoretical Approach (Singapore,
+World Scientific, 1991).
+
diff --git a/papers/project_paper_3_darwinism/references/Tegmark2000_Placeholder.md b/papers/project_paper_3_darwinism/references/Tegmark2000_Placeholder.md
deleted file mode 100644
index a1eb5956..00000000
--- a/papers/project_paper_3_darwinism/references/Tegmark2000_Placeholder.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Importance of quantum decoherence in brain processes (Tegmark 2000)
-
-This paper proves that the decoherence timescale in the brain is ~10^-13 seconds, demonstrating the absolute physical limit of sustained quantum states at biological temperatures.
-Due to copyright and its format, the full PDF is not hosted in this repository.
-
-**Citation:**
-Tegmark, M. (2000). *Phys. Rev. E* **61**, 4194.
diff --git a/papers/project_paper_5_turing/references/Sparso2001.pdf b/papers/project_paper_5_turing/references/Sparso2001.pdf
new file mode 100644
index 00000000..0507b33e
--- /dev/null
+++ b/papers/project_paper_5_turing/references/Sparso2001.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:35bad8700412d0272a2d3b70216bb7581007c6ae945b927201e31b359965e589
+size 2602288
diff --git a/papers/project_paper_5_turing/references/Sparso2001.txt b/papers/project_paper_5_turing/references/Sparso2001.txt
new file mode 100644
index 00000000..a1dc2b4b
--- /dev/null
+++ b/papers/project_paper_5_turing/references/Sparso2001.txt
@@ -0,0 +1,24580 @@
+PRINCIPLES OF
+ASYNCHRONOUS CIRCUIT DESIGN
+– A Systems Perspective
+
+Edited by
+JENS SPARSØ
+Technical University of Denmark
+
+STEVE FURBER
+The University of Manchester, UK
+
+Kluwer Academic Publishers
+Boston/Dordrecht/London
+
+
+Contents
+
+Preface  xi
+
+Part I
+Asynchronous circuit design – A tutorial
+Author: Jens Sparsø
+
+1 Introduction  3
+1.1 Why consider asynchronous circuits?  3
+1.2 Aims and background  4
+1.3 Clocking versus handshaking  5
+1.4 Outline of Part I  8
+
+2 Fundamentals  9
+2.1 Handshake protocols  9
+2.1.1 Bundled-data protocols  9
+2.1.2 The 4-phase dual-rail protocol  11
+2.1.3 The 2-phase dual-rail protocol  13
+2.1.4 Other protocols  13
+2.2 The Muller C-element and the indication principle  14
+2.3 The Muller pipeline  16
+2.4 Circuit implementation styles  17
+2.4.1 4-phase bundled-data  18
+2.4.2 2-phase bundled data (Micropipelines)  19
+2.4.3 4-phase dual-rail  20
+2.5 Theory  23
+2.5.1 The basics of speed-independence  23
+2.5.2 Classification of asynchronous circuits  25
+2.5.3 Isochronic forks  26
+2.5.4 Relation to circuits  26
+2.6 Test  27
+2.7 Summary  28
+
+3 Static data-flow structures  29
+3.1 Introduction  29
+3.2 Pipelines and rings  30
+3.3 Building blocks  31
+3.4 A simple example  33
+3.5 Simple applications of rings  35
+3.5.1 Sequential circuits  35
+3.5.2 Iterative computations  35
+3.6 FOR, IF, and WHILE constructs  36
+3.7 A more complex example: GCD  38
+3.8 Pointers to additional examples  39
+3.8.1 A low-power filter bank  39
+3.8.2 An asynchronous microprocessor  39
+3.8.3 A fine-grain pipelined vector multiplier  40
+3.9 Summary  40
+
+4 Performance  41
+4.1 Introduction  41
+4.2 A qualitative view of performance  42
+4.2.1 Example 1: A FIFO used as a shift register  42
+4.2.2 Example 2: A shift register with parallel load  44
+4.3 Quantifying performance  47
+4.3.1 Latency, throughput and wavelength  47
+4.3.2 Cycle time of a ring  49
+4.3.3 Example 3: Performance of a 3-stage ring  51
+4.3.4 Final remarks  52
+4.4 Dependency graph analysis  52
+4.4.1 Example 4: Dependency graph for a pipeline  52
+4.4.2 Example 5: Dependency graph for a 3-stage ring  54
+4.5 Summary  56
+
+5 Handshake circuit implementations  57
+5.1 The latch  57
+5.2 Fork, join, and merge  58
+5.3 Function blocks – The basics  60
+5.3.1 Introduction  60
+5.3.2 Transparency to handshaking  61
+5.3.3 Review of ripple-carry addition  64
+5.4 Bundled-data function blocks  65
+5.4.1 Using matched delays  65
+5.4.2 Delay selection  66
+5.5 Dual-rail function blocks  67
+5.5.1 Delay insensitive minterm synthesis (DIMS)  67
+5.5.2 Null Convention Logic  69
+5.5.3 Transistor-level CMOS implementations  70
+5.5.4 Martin’s adder  71
+5.6 Hybrid function blocks  73
+5.7 MUX and DEMUX  75
+5.8 Mutual exclusion, arbitration and metastability  77
+5.8.1 Mutual exclusion  77
+5.8.2 Arbitration  79
+5.8.3 Probability of metastability  79
+5.9 Summary  80
+
+6 Speed-independent control circuits  81
+6.1 Introduction  81
+6.1.1 Asynchronous sequential circuits  81
+6.1.2 Hazards  82
+6.1.3 Delay models  83
+6.1.4 Fundamental mode and input-output mode  83
+6.1.5 Synthesis of fundamental mode circuits  84
+6.2 Signal transition graphs  86
+6.2.1 Petri nets and STGs  86
+6.2.2 Some frequently used STG fragments  88
+6.3 The basic synthesis procedure  91
+6.3.1 Example 1: a C-element  92
+6.3.2 Example 2: a circuit with choice  92
+6.3.3 Example 2: Hazards in the simple gate implementation  94
+6.4 Implementations using state-holding gates  96
+6.4.1 Introduction  96
+6.4.2 Excitation regions and quiescent regions  97
+6.4.3 Example 2: Using state-holding elements  98
+6.4.4 The monotonic cover constraint  98
+6.4.5 Circuit topologies using state-holding elements  99
+6.5 Initialization  101
+6.6 Summary of the synthesis process  101
+6.7 Petrify: A tool for synthesizing SI circuits from STGs  102
+6.8 Design examples using Petrify  104
+6.8.1 Example 2 revisited  104
+6.8.2 Control circuit for a 4-phase bundled-data latch  106
+6.8.3 Control circuit for a 4-phase bundled-data MUX  109
+6.9 Summary  113
+
+7 Advanced 4-phase bundled-data protocols and circuits  115
+7.1 Channels and protocols  115
+7.1.1 Channel types  115
+7.1.2 Data-validity schemes  116
+7.1.3 Discussion  116
+7.2 Static type checking  118
+7.3 More advanced latch control circuits  119
+7.4 Summary  121
+
+8 High-level languages and tools  123
+8.1 Introduction  123
+8.2 Concurrency and message passing in CSP  124
+8.3 Tangram: program examples  126
+8.3.1 A 2-place shift register  126
+8.3.2 A 2-place (ripple) FIFO  126
+8.3.3 GCD using while and if statements  127
+8.3.4 GCD using guarded commands  128
+8.4 Tangram: syntax-directed compilation  128
+8.4.1 The 2-place shift register  129
+8.4.2 The 2-place FIFO  130
+8.4.3 GCD using guarded repetition  131
+8.5 Martin’s translation process  133
+8.6 Using VHDL for asynchronous design  134
+8.6.1 Introduction  134
+8.6.2 VHDL versus CSP-type languages  135
+8.6.3 Channel communication and design flow  136
+8.6.4 The abstract channel package  138
+8.6.5 The real channel package  142
+8.6.6 Partitioning into control and data  144
+8.7 Summary  146
+Appendix: The VHDL channel packages  148
+A.1 The abstract channel package  148
+A.2 The real channel package  150
+
+Part II
+Balsa - An Asynchronous Hardware Synthesis System
+Author: Doug Edwards, Andrew Bardsley
+
+9 An introduction to Balsa  155
+9.1 Overview  155
+9.2 Basic concepts  156
+9.3 Tool set and design flow  159
+9.4 Getting started  159
+9.4.1 A single-place buffer  161
+9.4.2 Two-place buffers  163
+9.4.3 Parallel composition and module reuse  164
+9.4.4 Placing multiple structures  165
+9.5 Ancillary Balsa tools  166
+9.5.1 Makefile generation  166
+9.5.2 Estimating area cost  167
+9.5.3 Viewing the handshake circuit graph  168
+9.5.4 Simulation  168
+
+10 The Balsa language  173
+10.1 Data types  173
+10.2 Data typing issues  176
+10.3 Control flow and commands  178
+10.4 Binary/unary operators  181
+10.5 Program structure  181
+10.6 Example circuits  183
+10.7 Selecting channels  190
+
+11 Building library components  193
+11.1 Parameterised descriptions  193
+11.1.1 A variable width buffer definition  193
+11.1.2 Pipelines of variable width and depth  194
+11.2 Recursive definitions  195
+11.2.1 An n-way multiplexer  195
+11.2.2 A population counter  197
+11.2.3 A Balsa shifter  200
+11.2.4 An arbiter tree  202
+
+12 A simple DMA controller  205
+12.1 Global registers  205
+12.2 Channel registers  206
+12.3 DMA controller structure  207
+12.4 The Balsa description  211
+12.4.1 Arbiter tree  211
+12.4.2 Transfer engine  212
+12.4.3 Control unit  213
+
+Part III
+Large-Scale Asynchronous Designs
+
+13 Descale  221
+Joep Kessels & Ad Peeters, Torsten Kramer and Volker Timm
+13.1 Introduction  222
+13.2 VLSI programming of asynchronous circuits  223
+13.2.1 The Tangram toolset  223
+13.2.2 Handshake technology  225
+13.2.3 GCD algorithm  226
+13.3 Opportunities for asynchronous circuits  231
+13.4 Contactless smartcards  232
+13.5 The digital circuit  235
+13.5.1 The 80C51 microcontroller  236
+13.5.2 The prefetch unit  239
+13.5.3 The DES coprocessor  241
+13.6 Results  243
+13.7 Test  245
+13.8 The power supply unit  246
+13.9 Conclusions  247
+
+14 An Asynchronous Viterbi Decoder  249
+Linda E. M. Brackenbury
+14.1 Introduction  249
+14.2 The Viterbi decoder  250
+14.2.1 Convolution encoding  250
+14.2.2 Decoder principle  251
+14.3 System parameters  253
+14.4 System overview  254
+14.5 The Path Metric Unit (PMU)  256
+14.5.1 Node pair design in the PMU  256
+14.5.2 Branch metrics  259
+14.5.3 Slot timing  261
+14.5.4 Global winner identification  262
+14.6 The History Unit (HU)  264
+14.6.1 Principle of operation  264
+14.6.2 History Unit backtrace  264
+14.6.3 History Unit implementation  267
+14.7 Results and design evaluation  269
+14.8 Conclusions  271
+14.8.1 Acknowledgement  272
+14.8.2 Further reading  272
+
+15 Processors  273
+Jim D. Garside
+15.1 An introduction to the Amulet processors  274
+15.1.1 Amulet1 (1994)  274
+15.1.2 Amulet2e (1996)  275
+15.1.3 Amulet3i (2000)  275
+15.2 Some other asynchronous microprocessors  276
+15.3 Processors as design examples  278
+15.4 Processor implementation techniques  279
+15.4.1 Pipelining processors  279
+15.4.2 Asynchronous pipeline architectures  281
+15.4.3 Determinism and non-determinism  282
+15.4.4 Dependencies  288
+15.4.5 Exceptions  297
+15.5 Memory – a case study  302
+15.5.1 Sequential accesses  302
+15.5.2 The Amulet3i RAM  303
+15.5.3 Cache  307
+15.6 Larger asynchronous systems  310
+15.6.1 System-on-Chip (DRACO)  310
+15.6.2 Interconnection  310
+15.6.3 Balsa and the DMA controller  312
+15.6.4 Calibrated time delays  313
+15.6.5 Production test  314
+15.7 Summary  315
+
+Epilogue  317
+
+References  319
+
+Index  333
+
+
+Preface
+
+This book was compiled to address a perceived need for an introductory text
+on asynchronous design. There are several highly technical books on aspects of
+the subject, but no obvious starting point for a designer who wishes to become
+acquainted for the first time with asynchronous technology. We hope this book
+will serve as that starting point.
+The reader is assumed to have some background in digital design. We as-
+sume that concepts such as logic gates, flip-flops and Boolean logic are famil-
+iar. Some of the latter sections also assume familiarity with the higher levels of
+digital design such as microprocessor architectures and systems-on-chip, but
+readers unfamiliar with these topics should still find the majority of the book
+accessible.
+The intended audience for the book comprises the following groups:
+
+Industrial designers with a background in conventional (clocked) digital
+design who wish to gain an understanding of asynchronous design in
+order, for example, to establish whether or not it may be advantageous
+to use asynchronous techniques in their next design task.
+
+Students in Electronic and/or Computer Engineering who are taking a
+course that includes aspects of asynchronous design.
+
+The book is structured in three parts. Part I is a tutorial in asynchronous
+design. It addresses the most important issue for the beginner, which is how to
+think about asynchronous systems. The first big hurdle to be cleared is that of
+mindset – asynchronous design requires a different mental approach from that
+normally employed in clocked design. Attempts to take an existing clocked
+system, strip out the clock and simply replace it with asynchronous handshakes
+are doomed to disappoint. Another hurdle is that of circuit design methodol-
+ogy – the existing body of literature presents an apparent plethora of disparate
+approaches. The aim of the tutorial is to get behind this and to present a single
+unified and coherent perspective which emphasizes the common ground. In
+this way the tutorial should enable the reader to begin to understand the char-
+acteristics of asynchronous systems in a way that will enable them to ‘think
+outside the box’ of conventional clocked design and to create radical new de-
+sign solutions that fully exploit the potential of clockless systems.
+Once the asynchronous design mindset has been mastered, the second hur-
+dle is designer productivity. VLSI designers are used to working in a highly
+productive environment supported by powerful automatic tools. Asynchronous
+design lags in its tools environment, but things are improving. Part II of the
+book gives an introduction to Balsa, a high-level synthesis system for asyn-
+chronous circuits. It is written by Doug Edwards (who has managed the Balsa
+development at the University of Manchester since its inception) and Andrew
+Bardsley (who has written most of the software). Balsa is not the solution to all
+asynchronous design problems, but it is capable of synthesizing very complex
+systems (for example, the 32-channel DMA controller used on the DRACO
+chip described in Chapter 15) and it is a good way to develop an understanding
+of asynchronous design ‘in the large’.
+Knowing how to think about asynchronous design and having access to suit-
+able tools leaves one question: what can be built in this way? In Part III we
+offer a number of examples of complex asynchronous systems as illustrations
+of the answer to this question. In each of these examples the designers have
+been asked to provide descriptions that will provide the reader with insights
+into the design process. The examples include a commercial smart card chip
+designed at Philips and a Viterbi decoder designed at the University of Manch-
+ester. Part III closes with a discussion of the issues that come up in the design
+of advanced asynchronous microprocessors, focusing on the Amulet processor
+series, again developed at the University of Manchester.
+Although the book is a compilation of contributions from different authors,
+each of these has been specifically written with the goals of the book in mind –
+to provide answers to the sorts of questions that a newcomer to asynchronous
+design is likely to ask. In order to keep the book accessible and to avoid it
+becoming an intimidating size, much valuable work has had to be omitted. Our
+objective in introducing you to asynchronous design is that you might become
+acquainted with it. If your relationship develops further, perhaps even into the
+full-blown affair that has smitten a few, included among whose number are the
+contributors to this book, you will, of course, want to know more. The book
+includes an extensive bibliography that will provide food enough for even the
+most insatiable of appetites.
+
+JENS SPARSØ AND STEVE FURBER, SEPTEMBER 2001
+
+
+Acknowledgments
+
+Many people have helped significantly in the creation of this book. In addi-
+tion to writing their respective chapters, several of the authors have also read
+and commented on drafts of other parts of the book, and the quality of the work
+as a whole has been enhanced as a result.
+The editors are also grateful to Alan Williams, Russell Hobson and Steve
+Temple, for their careful reading of drafts of this book and their constructive
+suggestions for improvement.
+Part I of the book has been used as a course text and the quality and con-
+sistency of the content improved by feedback from the students on the spring
+2001 course “49425 Design of Asynchronous Circuits” at DTU.
+Any remaining errors or omissions are the responsibility of the editors.
+
+The writing of this book was initiated as a dissemination effort within the
+European Low-Power Initiative for Electronic System Design (ESD-LPD), and
+this book is part of the book series from this initiative. As will become clear,
+the book goes far beyond the dissemination of results from projects within
+the ESD-LPD cluster, and the editors would like to acknowledge the support
+of the working group on asynchronous circuit design, ACiD-WG, that has
+provided a fruitful forum for interaction and the exchange of ideas. The ACiD-
+WG has been funded by the European Commission since 1992 under several
+Framework Programmes: FP3 Basic Research (EP7225), FP4 Technologies
+for Components and Subsystems (EP21949), and FP5 Microelectronics (IST-
+1999-29119).
+
+
+
+Foreword
+
+This book is the third in a series on novel low-power design architectures,
+methods and design practices. It results from a large European project started
+in 1997, whose goal is to promote the further development and the faster and
+wider industrial use of advanced design methods for reducing the power con-
+sumption of electronic systems.
+Low-power design became crucial with the widespread use of portable in-
+formation and communication terminals, where a small battery has to last for a
+long period. High-performance electronics, in addition, suffers from a contin-
+uing increase in the dissipated power per square millimeter of silicon, due to
+increasing clock-rates, which causes cooling and reliability problems or other-
+wise limits performance.
+The European Union’s Information Technologies Programme ‘Esprit’ there-
+fore launched a ‘Pilot action for Low-Power Design’, which eventually grew
+to 19 R&D projects and one coordination project, with an overall budget of 14
+million EUROs. This action is known as the European Low-Power Initiative
+for Electronic System Design (ESD-LPD) and will be completed in the year
+2002. It aims to develop or demonstrate new design methods for power reduc-
+tion, while the coordination project ensures that the methods, experiences and
+results are properly documented and publicised.
+The initiative addresses low-power design at various levels. These include
+system and algorithmic level, instruction set processor level, custom proces-
+sor level, register transfer level, gate level, circuit level and layout level. It
+covers data-dominated, control-dominated and asynchronous architectures. 10
+projects deal mainly with digital circuits, 7 with analog and mixed-signal cir-
+cuits, and 2 with software-related aspects. The principal application areas are
+communication, medical equipment and e-commerce devices.
+
+The following list describes the objectives of the 20 projects. It is sorted by
+decreasing funding budget.
+
+CRAFT CMOS Radio Frequency Circuit Design for Wireless Application
+
+Advanced CMOS RF circuit design including blocks such as LNA, down con-
+verter mixers & phase shifters, oscillators and frequency synthesisers, integrated
+filters delta sigma conversion, power amplifiers
+
+Development of novel models for active and passive devices as well as fine-tuning
+and validation based on first silicon prototypes
+
+Analysis and specification of sophisticated architectures to meet, in particular,
+low-power single-chip implementation
+
+PAPRICA Power and Part Count Reduction Innovative Communication Architecture
+
+Feasibility assessment of DQIF, through physical design and characterisation of
+the core blocks
+
+Low-power RF design techniques in standard CMOS digital processes
+
+RF design tools and framework; PAPRICA Design Kit
+
+Demonstration of a practical implementation of a specific application
+
+MELOPAS Methodology for Low Power Asic design
+
+To develop a methodology to evaluate the power consumption of a complex ASIC
+early on in the design flow
+
+To develop a hardware/software co-simulation tool
+
+To quickly achieve a drastic reduction in the power consumption of electronic
+equipment
+
+TARDIS Technical Coordination and Dissemination
+
+To organise the communication between design experiments and to exploit their
+potential synergy
+
+To guide the capturing of methods and experiences gained in the design experi-
+ments
+
+To organise and promote the wider dissemination and use of the gathered design
+know-how and experience
+
+LUCS Low-Power Ultrasound Chip Set.
+
+Design methodology on low-power ADC, memory and circuit design
+
+Prototype demonstration of a hand-held medical ultrasound scanner
+
+
+
+ALPINS Analog Low-Power Design for Communications Systems
+
+Low-voltage voice band smoothing filters and analog-to-digital and digital-to-
+analog converters for an analog front-end circuit for a DECT system
+
+Highly linear transconductor-capacitor (gm-C) filter for GSM Analog Interface
+Circuit operating at supply voltages as low as 2.5 V
+
+Formal verification tools, which will be implemented in the industrial partner’s
+design environment. These tools support the complete design process from sys-
+tem level down to transistor level
+
+SALOMON System-level analog-digital trade-off analysis for low power
+
+A general top-down design flow for mixed-signal telecom ASICs
+
+High-level models of analog and digital blocks and power estimators for these
+blocks
+
+A prototype implementation of the design flow with particular software tools to
+demonstrate the general design flow
+
+DESCALE Design Experiment on a Smart Card Application for Low Energy
+
+The application of highly innovative handshake technology
+
+Aiming at some 3 to 5 times less power and some 10 times smaller peak currents
+compared to synchronously operated solutions
+
+SUPREGE A low-power SUPerREGEnerative transceiver for wireless data transmission at
+short distances
+
+Design trade-offs and optimisation of the micro power receiver/transmitter as a
+function of various parameters (power consumption, area, bandwidth, sensitivity,
+etc)
+
+Modulation/demodulation and interface with data transmission systems
+
+Realisation of the integrated micro power receiver/transmitter based on the super-
+regeneration principle
+
+PREST Power REduction for System Technologies
+
+Survey of contemporary Low-Power Design techniques and commercial power
+analysis software tools
+
+Investigation of architectural and algorithmic design techniques with a power
+consumption comparison
+
+Investigation of Asynchronous design techniques and Arithmetic styles
+
+Set-up and assessment of a low-power design flow
+
+Fabrication and characterisation of a Viterbi demonstrator to assess the most
+promising power reduction techniques
+
+
+
+DABLP Low-Power Exploration for Mapping DAB Applications to Multi-Processors
+
+A DAB channel decoder architecture with reduced power consumption
+
+Refined and extended ATOMIUM methodology and supporting tools
+
+COSAFE Low-Power Hardware-Software Co-Design for Safety-Critical Applications
+
+The development of strategies for power-efficient assignment of safety critical
+mechanisms to hardware or software
+
+The design and implementation of a low-power, safety-critical ASIP, which re-
+alises the control unit of a portable infusion pump system
+
+AMIED Asynchronous Low-Power Methodology and Implementation of an Encryption/De-
+cryption System
+
+Implementation of the IDEA encryption/decryption method with drastically re-
+duced power consumption
+
+Advanced low-power design flow with emphasis on algorithm and architecture
+optimisations
+
+Industrial demonstration of the asynchronous design methodology based on com-
+mercial tools
+
+LPGD A Low-Power Design Methodology/Flow and its Application to the Implementation of
+a DCS1800-GSM/DECT Modulator/Demodulator
+
+To complete the development of a top-down, low-power design methodology/flow
+for DSP applications
+
+To demonstrate the methods on the example of an integrated GFSK/GMSK Modu-
+lator-Demodulator (MODEM) for DCS1800-GSM/DECT applications
+
+SOFLOPO Low-Power Software Development for Embedded Applications
+
+Develop techniques and guidelines for mapping a specific algorithm code onto
+appropriate instruction subsets
+
+Integrate these techniques into software for the power-conscious ARM-RISC and
+DSP code optimisation
+
+I-MODE Low-Power RF to Base Band Interface for Multi-Mode Portable Phone
+
+To raise the level of integration in a DECT/DCS1800 transceiver, by implement-
+ing the necessary analog base band low-pass filters and data converters in CMOS
+technology using low-power techniques
+
+COOL-LOGOS Power Reduction through the Use of Local don’t Care Conditions and Global
+Gate Resizing Techniques: An Experimental Evaluation.
+
+To apply the developed low-power design techniques to an existing 24-bit DSP,
+which is already fabricated
+
+To assess the merit of the new techniques using experimental silicon through com-
+parisons of the projected power reduction (in simulation) and actually measured
+reduction of new DSP; assessment of the commercial impact
+
+
+
+LOVO Low Output VOltage DC/DC converters for low-power applications
+
+Development of technical solutions for the power supplies of advanced low-
+power systems
+
+New methods for synchronous rectification for very low output voltage power
+converters
+
+PCBIT Low-Power ISDN Interface for Portable PC’s
+
+Design of a PC-Card board that implements the PCBIT interface
+
+Integrate levels 1 and 2 of the communication protocol in a single ASIC
+
+Incorporate power management techniques in the ASIC design:
+
+– system level: shutdown of idle modules in the circuit
+– gate level: precomputation, gated-clock FSMs
+
+COLOPODS Design of a Cochlear Hearing Aid Low-Power DSP System
+
+Selection of a future oriented low-power technology enabling future power re-
+duction through integration of analog modules
+
+Design of a speech processor IC yielding a power reduction of 90% compared to
+the 3.3 Volt implementation
+
+The low power design projects have achieved the following results:
+
+Projects that have designed prototype chips can demonstrate power re-
+ductions of 10 to 30 percent.
+
+New low-power design libraries have been developed.
+
+New proven low-power RF architectures are now available.
+
+New smaller and lighter mobile equipment has been developed.
+
+Instead of running a number of Esprit projects at the same time indepen-
+dently of each other, during this pilot action the projects have collaborated
+strongly. This is achieved mostly by the novel feature of this action, which
+is the presence and role of the coordinator: DIMES - the Delft Institute of
+Microelectronics and Submicron-technology, located in Delft, the Netherlands
+(http://www.dimes.tudelft.nl). The task of the coordinator is to co-ordinate,
+facilitate, and organize:
+
+the information exchange between projects;
+
+the systematic documentation of methods and experiences;
+
+the publication and the wider dissemination to the public.
+
+
+
+The most important achievements, credited to the presence of the coordina-
+tor are:
+
+New personal contacts have been made, and the resulting synergy
+between partners has led to better and faster developments.
+
+The organization of low-power design workshops, special sessions at
+conferences, and a low-power design web site:
+
+http://www.esdlpd.dimes.tudelft.nl.
+
+At this site all of the public reports from the projects can be found, as
+can all kinds of information about the initiative itself.
+
+The design methodology, design methods and/or design experience are
+disclosed, are well-documented and available.
+
+Based on the work of the projects, and in cooperation with the projects,
+the publication of a low-power design book series is planned. Written
+by members of the projects, this series of books on low-power design
+will disseminate novel design methodologies and design experiences
+that were obtained during the run-time of the European Low Power Ini-
+tiative for Electronic System Design, to the general public.
+
+In conclusion, the major contribution of this project cluster is, in addition
+to the technical achievements already mentioned, the acceleration of the in-
+troduction of novel knowledge on low-power design methods into mainstream
+development processes.
+We would like to thank all project partners from all of the different compa-
+nies and organizations who make the Low-Power Initiative a success.
+
+Rene van Leuken, Reinder Nouta, Alexander de Graaf, Delft, June 2001
+
+
+I
+
+ASYNCHRONOUS CIRCUIT DESIGN
+– A TUTORIAL
+
+Author: Jens Sparsø
+Technical University of Denmark
+jsp@imm.dtu.dk
+
+Abstract
+Asynchronous circuits have characteristics that differ significantly from those
+of synchronous circuits and, as will be clear from some of the later chapters
+in this book, it is possible to exploit these characteristics to design circuits with
+very interesting performance parameters in terms of their power, performance,
+electromagnetic emissions (EMI), etc.
+Asynchronous design is not yet a well-established and widely-used design
+methodology. There are textbooks that provide comprehensive coverage of the
+underlying theories, but the field has not yet matured to a point where there is
+an established curriculum and university tradition for teaching courses on
+asynchronous circuit design to electrical engineering and computer engineering
+students.
+As this author sees the situation there is a gap between understanding the fun-
+damentals and being able to design useful circuits of some complexity. The aim
+of Part I of this book is to provide a tutorial on asynchronous circuit design that
+fills this gap.
+More specifically the aims are: (i) to introduce readers with background in syn-
+chronous digital circuit design to the fundamentals of asynchronous circuit de-
+sign such that they are able to read and understand the literature, and (ii) to
+provide readers with an understanding of the “nature” of asynchronous circuits
+such that they are able to design non-trivial circuits with interesting
+performance parameters.
+The material is based on experience from the design of several asynchronous
+chips, and it has evolved over the last decade from tutorials given at a number
+of European conferences and from a number of special topics courses taught
+at the Technical University of Denmark and elsewhere. In May 1999 I gave a
+one-week intensive course at Delft University of Technology and it was when
+preparing for this course I felt that the material was shaping up, and I set out
+to write the following text. Most of the material has recently been used and
+debugged in a course at the Technical University of Denmark in the spring of 2001.
+Supplemented by a few journal articles and a small design project, the text may
+be used for a one semester course on asynchronous design.
+
+Keywords:
+asynchronous circuits, tutorial
+
+
+
+Chapter 1
+
+INTRODUCTION
+
+1.1. Why consider asynchronous circuits?
+
+Most digital circuits designed and fabricated today are “synchronous”. In
+essence, they are based on two fundamental assumptions that greatly simplify
+their design: (1) all signals are binary, and (2) all components share a common
+and discrete notion of time, as defined by a clock signal distributed throughout
+the circuit.
+Asynchronous circuits are fundamentally different; they also assume bi-
+nary signals, but there is no common and discrete time. Instead the circuits
+use handshaking between their components in order to perform the necessary
+synchronization, communication, and sequencing of operations. Expressed in
+‘synchronous terms’ this results in a behaviour that is similar to systematic
+fine-grain clock gating and local clocks that are not in phase and whose period
+is determined by actual circuit delays – registers are only clocked where and
+when needed.
+This difference gives asynchronous circuits inherent properties that can be
+(and have been) exploited to advantage in the areas listed and motivated below.
+The interested reader may find further introduction to the mechanisms behind
+the advantages mentioned below in [140].
+
+Low power consumption, [136, 138, 42, 45, 99, 102]
+... due to fine-grain clock gating and zero standby power consumption.
+
+High operating speed, [156, 157, 88]
+... operating speed is determined by actual local latencies rather than
+global worst-case latency.
+
+Less emission of electro-magnetic noise, [136, 109]
+... the local clocks tend to tick at random points in time.
+
+Robustness towards variations in supply voltage, temperature, and
+fabrication process parameters, [87, 98, 100]
+... timing is based on matched delays (and can even be insensitive to
+circuit and wire delays).
+
+Better composability and modularity, [92, 80, 142, 128, 124]
+... because of the simple handshake interfaces and the local timing.
+
+No clock distribution and clock skew problems,
+... there is no global signal that needs to be distributed with minimal
+phase skew across the circuit.
+
+On the other hand there are also some drawbacks. The asynchronous con-
+trol logic that implements the handshaking normally represents an overhead
+in terms of silicon area, circuit speed, and power consumption. It is therefore
+pertinent to ask whether or not the investment pays off, i.e. whether the use of
+asynchronous techniques results in a substantial improvement in one or more
+of the above areas. Other obstacles are a lack of CAD tools and strategies and
+a lack of tools for testing and test vector generation.
+Research in asynchronous design goes back to the mid 1950s [93, 92], but
+it was not until the late 1990s that projects in academia and industry demon-
+strated that it is possible to design asynchronous circuits which exhibit signifi-
+cant benefits in nontrivial real-life examples, and that commercialization of the
+technology began to take place. Recent examples are presented in [106] and in
+Part III of this book.
+
+1.2. Aims and background
+
+There are already several excellent articles and book chapters that introduce
+asynchronous design [54, 33, 34, 35, 140, 69, 124] as well as several mono-
+graphs and textbooks devoted to asynchronous design including [106, 14, 25,
+18, 95] – why then write yet another introduction to asynchronous design?
+There are several reasons:
+
+My experience from designing several asynchronous chips [123, 103],
+and from teaching asynchronous design to students and engineers over
+the past 10 years, is that it takes more than knowledge of the basic prin-
+ciples and theories to design efficient asynchronous circuits. In my ex-
+perience there is a large gap between the introductory articles and book
+chapters mentioned above explaining the design methods and theories
+on the one side, and the papers describing actual designs and current re-
+search on the other side. It takes more than knowing the rules of a game
+to play and win the game. Bridging this gap involves experience and a
+good understanding of the nature of asynchronous circuits. An experi-
+ence that I share with many other researchers is that “just going asyn-
+chronous” results in larger, slower and more power consuming circuits.
+The crux is to use asynchronous techniques to exploit characteristics in
+the algorithm and architecture of the application in question. This further
+implies that asynchronous techniques may not always be the right
+solution to the problem.
+
+Another issue is that asynchronous design is a rather young discipline.
+Different researchers have proposed different circuit structures and design
+methods. At first glance they may seem different – an observation
+that is supported by different terminologies; but a closer look often re-
+veals that the underlying principles and the resulting circuits are rather
+similar.
+
+Finally, most of the above-mentioned introductory articles and book
+chapters are comprehensive in nature. While being appreciated by those
+already working in the field, the multitude of different theories and ap-
+proaches in existence represents an obstacle for the newcomer wishing
+to get started designing asynchronous circuits.
+
+Compared to the introductory texts mentioned above, the aims of this tu-
+torial are: (1) to provide an introduction to asynchronous design that is more
+selective, (2) to stress basic principles and similarities between the different ap-
+proaches, and (3) to take the introduction further towards designing practical
+and useful circuits.
+
+1.3. Clocking versus handshaking
+
+Figure 1.1(a) shows a synchronous circuit. For simplicity the figure shows a
+pipeline, but it is intended to represent any synchronous circuit. When design-
+ing ASICs using hardware description languages and synthesis tools, designers
+focus mostly on the data processing and assume the existence of a global clock.
+For example, a designer would express the fact that data clocked into register
+R3 is a function CL3 of the data clocked into R2 at the previous clock as the
+following assignment of variables: R3 := CL3(R2). Figure 1.1(a) represents
+this high-level view with a universal clock.
+When it comes to physical design, reality is different. Today's ASICs use a
+structure of clock buffers resulting in a large number of (possibly gated) clock
+signals as shown in figure 1.1(b). It is well known that it takes CAD tools
+and engineering effort to design the clock gating circuitry and to minimize
+and control the skew between the many different clock signals. Guaranteeing
+the two-sided timing constraints – the setup to hold time window around the
+clock edge – in a world that is dominated by wire delays is not an easy task.
+The buffer-insertion-and-resynthesis process that is used in current commercial
+CAD tools may not converge and, even if it does, it relies on delay models that
+are often of questionable accuracy.
+
+
+[Figure 1.1: (a) A synchronous circuit, (b) a synchronous circuit with clock drivers and
+clock gating, (c) an equivalent asynchronous circuit, and (d) an abstract data-flow view of
+the asynchronous circuit. (The figure shows a pipeline, but it is intended to represent any
+circuit topology.) Labels in the figure: registers R1 to R4, combinational logic CL3 and CL4,
+CLK, clock gate signal, CTL, Req, Ack, Data, and "Channel" or "Link".]
+
+Asynchronous design represents an alternative to this. In an asynchronous
+circuit the clock signal is replaced by some form of handshaking between
+neighbouring registers; for example the simple request-acknowledge based
+handshake protocol shown in figure 1.1(c). In the following chapter we look
+at alternative handshake protocols and data encodings, but before departing
+into these implementation details it is useful to take a more abstract view as
+illustrated in figure 1.1(d):
+
+think of the data and handshake signals connecting one register to the
+next in figure 1.1(c) as a “handshake channel” or “link,”
+
+think of the data stored in the registers as tokens tagged with data values
+(that may be changed along the way as tokens flow through combina-
+tional circuits), and
+
+think of the combinational circuits as being transparent to the handshaking
+between registers; a combinational circuit simply absorbs a token on
+each of its input links, performs its computation, and then emits a token
+on each of its output links (much like a transition in a Petri net, cf.
+section 6.2.1).
+
+Viewed this way, an asynchronous circuit is simply a static data-flow struc-
+ture [36]. Intuitively, correct operation requires that data tokens flowing in the
+circuit do not disappear, that one token does not overtake another, and that new
+tokens do not appear out of nowhere. A simple rule that can ensure this is the
+following:
+
+A register may input and store a new data token from its predecessor if its
+successor has input and stored the data token that the register was previ-
+ously holding. [The states of the predecessor and successor registers are
+signaled by the incoming request and acknowledge signals respectively.]
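+
+As a hedged aside, the rule can be written as a small invariant. The notation below is
+an assumption introduced here for illustration and is not the author's: registers are
+numbered along the pipeline, tokens are numbered in arrival order, and $t_i$ is the
+number of the token most recently stored in register $R_i$ (with $t_i = 0$ before any
+token has arrived).
+
+```latex
+% Assumed notation (not from the text): t_i = index of the token last stored in R_i.
+% Firing condition for R_i, i.e. when it may store a new token:
+\[
+  \underbrace{t_{i-1} = t_i + 1}_{\text{predecessor offers a new token (request)}}
+  \quad\wedge\quad
+  \underbrace{t_{i+1} = t_i}_{\text{successor has stored the current one (acknowledge)}}
+  \;\;\Longrightarrow\;\; t_i \leftarrow t_i + 1 .
+\]
+% Invariant preserved by every firing, starting from t_i = 0 for all i:
+\[
+  0 \;\le\; t_{i-1} - t_i \;\le\; 1 \qquad \text{for all } i .
+\]
+```
+
+Under these assumptions each register steps through the token numbers one at a time
+(no token is lost or invented), and a register never runs ahead of its predecessor
+(no overtaking).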
+
+Following this rule data is copied from one register to the next along the path
+through the circuit. In this process subsequent registers will often be holding
+copies of the same data value but the old duplicate data values will later be
+overwritten by new data values in a carefully ordered manner, and a handshake
+cycle on a link will always enclose the transfer of exactly one data-token. Un-
+derstanding this “token flow game” is crucial to the design of efficient circuits,
+and we will address these issues later, extending the token-flow view to cover
+structures other than pipelines. Our aim here is just to give the reader an intu-
+itive feel for the fundamentally different nature of asynchronous circuits.
+An important message is that the “handshake-channel and data-token view”
+represents a very useful abstraction that is equivalent to the register transfer
+level (RTL) used in the design of synchronous circuits. This data-flow ab-
+straction, as we will call it, separates the structure and function of the circuit
+from the implementation details of its components.
+
+
+
+Another important message is that it is the handshaking between the regis-
+ters that controls the flow of tokens, whereas the combinational circuit blocks
+must be fully transparent to this handshaking. Ensuring this transparency is not
+always trivial; it takes more than a traditional combinational circuit, so we will
+use the term ’function block’ to denote a combinational circuit whose input
+and output ports are handshake-channels or links.
+Finally, some more down-to-earth engineering comments may also be rele-
+vant. The synchronous circuit in figure 1.1(b) is “controlled” by clock pulses
+that are in phase with a periodic clock signal, whereas the asynchronous circuit
+in figure 1.1(c) is controlled by locally derived clock pulses that can occur at
+any time; the local handshaking ensures that clock pulses are generated where
+and when needed. This tends to randomize the clock pulses over time, and is
+likely to result in less electromagnetic emission and a smoother supply current
+without the large di/dt spikes that characterize a synchronous circuit.
+
+1.4. Outline of Part I
+
+Chapter 2 presents a number of fundamental concepts and circuits that are
+important for the understanding of the following material. Read through it but
+don’t get stuck; you may want to revisit relevant parts later.
+Chapters 3 and 4 address asynchronous design at the data-flow level: chap-
+ter 3 explains the operation of pipelines and rings, introduces a set of hand-
+shake components and explains how to design (larger) computing structures,
+and chapter 4 addresses performance analysis and optimization of such struc-
+tures, both qualitatively and quantitatively.
+Chapter 5 addresses the circuit level implementation of the handshake com-
+ponents introduced in chapter 3, and chapter 6 addresses the design of hazard-
+free sequential (control) circuits. The latter includes a general introduction to
+the topics and in-depth coverage of one specific method: the design of speed-
+independent control circuits from signal transition graph specifications. These
+techniques are illustrated by control circuits used in the implementation of
+some of the handshake components introduced in chapter 3.
+All of the above chapters 2–6 aim to explain the basic techniques and meth-
+ods in some depth. The last two chapters are briefer. Chapter 7 introduces
+more advanced topics related to the implementation of circuits using the 4-
+phase bundled-data protocol, and chapter 8 addresses hardware description
+languages and synthesis tools for asynchronous design. Chapter 8 is by no
+means comprehensive; it focuses on CSP-like languages and syntax-directed
+compilation, but also describes how asynchronous design can be supported by
+a standard language, VHDL.
+
+
+Chapter 2
+
+FUNDAMENTALS
+
+This chapter provides explanations of a number of topics and concepts that
+are of fundamental importance for understanding the following chapters and
+for appreciating the similarities between the different asynchronous design
+styles. The presentation style will be somewhat informal and the aim is to
+provide the reader with intuition and insight.
+
+2.1. Handshake protocols
+
+The previous chapter showed one particular handshake protocol known as a
+return-to-zero handshake protocol, figure 1.1(c). In the asynchronous commu-
+nity it is given a more informative name: the 4-phase bundled-data protocol.
+
+2.1.1 Bundled-data protocols
+
+The term bundled-data refers to a situation where the data signals use nor-
+mal Boolean levels to encode information, and where separate request and
+acknowledge wires are bundled with the data signals, figure 2.1(a). In the 4-
+phase protocol illustrated in figure 2.1(b) the request and acknowledge wires
+also use normal Boolean levels to encode information, and the term 4-phase
+refers to the number of communication actions: (1) the sender issues data and
+sets request high, (2) the receiver absorbs the data and sets acknowledge high,
+(3) the sender responds by taking request low (at which point data is no longer
+guaranteed to be valid) and (4) the receiver acknowledges this by taking ac-
+knowledge low. At this point the sender may initiate the next communication
+cycle.
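+
+The four phases can be summarised as an ordering of signal events. This is only a
+sketch: the arrows for rising and falling transitions and the use of $\prec$ for
+"happens before" are conventions assumed here.
+
+```latex
+% One 4-phase bundled-data handshake on a push channel; \prec means "happens before".
+\[
+  \mathit{Valid}(\mathit{Data}) \;\prec\; \mathit{Req}\!\uparrow
+  \;\prec\; \mathit{Ack}\!\uparrow
+  \;\prec\; \mathit{Req}\!\downarrow
+  \;\prec\; \mathit{Ack}\!\downarrow
+  \;\prec\; \text{(next cycle)} .
+\]
+% Data is only guaranteed valid from Req rising until Req falling.
+```
+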
+The 4-phase bundled data protocol is familiar to most digital designers, but
+it has a disadvantage in the superfluous return-to-zero transitions that cost un-
+necessary time and energy. The 2-phase bundled-data protocol shown in fig-
+ure 2.1(c) avoids this. The information on the request and acknowledge wires
+is now encoded as signal transitions on the wires and there is no difference
+between a 0 → 1 and a 1 → 0 transition; they both represent a “signal event”.
+Ideally the 2-phase bundled-data protocol should lead to faster circuits than
+the 4-phase bundled-data protocol, but often the implementation of circuits
+responding to events is complex and there is no general answer as to which
+protocol is best.
+
+[Figure 2.1: (a) A bundled-data channel. (b) A 4-phase bundled-data protocol. (c) A 2-phase
+bundled-data protocol. Labels in the figure: (push) channel, n-bit bundled Data, Req, Ack.]
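+
+A rough way to compare the two bundled-data protocols is to count transitions on the
+handshake wires. The symbol $k$ (number of transfers) and the names $T_{\text{4-phase}}$
+and $T_{\text{2-phase}}$ are assumptions made for this sketch.
+
+```latex
+% Transitions on Req and Ack needed for k data transfers (assumed notation).
+\[
+  T_{\text{4-phase}}(k) = 4k , \qquad T_{\text{2-phase}}(k) = 2k .
+\]
+% In the 2-phase protocol the wire levels after k transfers, starting from 0, are
+\[
+  \mathit{Req} = \mathit{Ack} = k \bmod 2 ,
+\]
+% so the level itself carries no meaning; only the event of changing does.
+```
+
+The halved transition count is why the 2-phase protocol looks faster on paper. The
+count says nothing about the cost of circuitry that must respond to events rather
+than levels.
+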
+At this point some discussion of terminology is appropriate. Instead of the
+term bundled-data that is used throughout this text, some texts use the term
+single-rail. The term ‘bundled-data’ hints at the timing relationship between
+the data signals and the handshake signals, whereas the term ‘single-rail’ hints
+at the use of one wire to carry one bit of data. Also, the term single-rail may
+be considered consistent with the dual-rail data representation discussed in the
+next section. Instead of the term 4-phase handshaking (or signaling) some texts
+use the terms return-to-zero (RTZ) signaling or level signaling, and instead of
+the term 2-phase handshaking (or signaling) some texts use the terms
+non-return-to-zero (NRZ) signaling or transition signaling. Consequently a
+return-to-zero single-rail protocol is the same as a 4-phase bundled-data protocol,
+etc.
+The protocols introduced above all assume that the sender is the active party
+that initiates the data transfer over the channel. This is known as a push chan-
+nel. The opposite, the receiver asking for new data, is also possible and is
+called a pull channel. In this case the directions of the request and acknowl-
+edge signals are reversed, and the validity of data is indicated in the acknowl-
+edge signal going from the sender to the receiver. In abstract circuit diagrams
+showing links/channels as one symbol we will often mark the active end of a
+channel with a dot, as illustrated in figure 2.1(a).
+To complete the picture we mention a number of variations: (1) a channel
+without data that can be used for synchronization, and (2) a channel where
+data is transmitted in both directions and where req and ack indicate validity
+of the data that is exchanged. The latter could be used to interface a read-
+only memory: the address would be bundled with req and the data would be
+bundled with ack. These alternatives are explained later in section 7.1.1. In the
+following sections we will restrict the discussion to push channels.
+All the bundled-data protocols rely on delay matching, such that the order
+of signal events at the sender’s end is preserved at the receiver’s end. On a
+push channel, data is valid before request is set high, expressed formally as
+Valid(Data) ≺ Req. This ordering should also be valid at the receiver’s end,
+and it requires some care when physically implementing such circuits. Possible
+solutions are:
+
+To control the placement and routing of the wires, possibly by routing
+all signals in a channel as a bundle. This is trivial in a tile-based datapath
+structure.
+
+To have a safety margin at the sender’s end.
+
+To insert and/or resize buffers after layout (much as is done in today’s
+synthesis and layout CAD tools).
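+
+The delay-matching requirement can be stated as a simple inequality. The symbols are
+assumptions introduced for this sketch: $t_D$ and $t_R$ are the times at which the
+sender makes data valid and raises request, and $\delta_D$ and $\delta_R$ are the
+wire delays of the slowest data wire and of the request wire.
+
+```latex
+% Sender side: t_D <= t_R.
+% The receiver sees data valid at t_D + \delta_D and request at t_R + \delta_R.
+\[
+  t_D + \delta_D \;\le\; t_R + \delta_R
+  \quad\Longleftrightarrow\quad
+  \delta_D - \delta_R \;\le\; t_R - t_D .
+\]
+```
+
+Read this way, routing the wires as a bundle keeps $\delta_D - \delta_R$ small, a
+safety margin at the sender enlarges $t_R - t_D$, and buffer insertion or resizing
+adjusts the two delays after layout.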
+
+An alternative is to use a more sophisticated protocol that is robust to wire
+delays. In the following sections we introduce a number of such protocols that
+are completely insensitive to delays.
+
+2.1.2 The 4-phase dual-rail protocol
+
+The 4-phase dual-rail protocol encodes the request signal into the data signals
+using two wires per bit of information that has to be communicated, figure 2.2.
+In essence it is a 4-phase protocol using two request wires per bit of
+information d; one wire d.t is used for signaling a logic 1 (or true), and
+another wire d.f is used for signaling logic 0 (or false). When observing a 1-bit
+channel one will see a sequence of 4-phase handshakes where the participating
+“request” signal in any handshake cycle can be either d.t or d.f. This protocol
+is very robust; two parties can communicate reliably regardless of delays in the
+wires connecting the two parties – the protocol is delay-insensitive.
+
+[Figure 2.2: A delay-insensitive channel using the 4-phase dual-rail protocol. The figure
+shows a dual-rail (push) channel with 2n wires carrying Data and Req plus one Ack wire, a
+4-phase waveform in which Data {d.t, d.f} alternates Empty, Valid, Empty, Valid, and the
+following encoding table:
+
+              d.t  d.f
+Empty ("E")    0    0
+Valid "0"      0    1
+Valid "1"      1    0
+Not used       1    1
+]
+
+Viewed together the {x.f, x.t} wire pair is a codeword; {x.f, x.t} = {1,0}
+and {x.f, x.t} = {0,1} represent “valid data” (logic 0 and logic 1 respectively)
+and {x.f, x.t} = {0,0} represents “no data” (or “spacer” or “empty value” or
+“NULL”). The codeword {x.f, x.t} = {1,1} is not used, and a transition from
+one valid codeword to another valid codeword is not allowed, as illustrated in
+figure 2.2.
+This leads to a more abstract view of 4-phase handshaking: (1) the sender
+issues a valid codeword, (2) the receiver absorbs the codeword and sets ac-
+knowledge high, (3) the sender responds by issuing the empty codeword, and
+(4) the receiver acknowledges this by taking acknowledge low. At this point
+the sender may initiate the next communication cycle. An even more abstract
+view of what is seen on a channel is a data stream of valid codewords separated
+by empty codewords.
+Let’s now extend this approach to bit-parallel channels. An N-bit data chan-
+nel is formed simply by concatenating N wire pairs, each using the encoding
+described above. A receiver is always able to detect when all bits are valid (to
+which it responds by taking acknowledge high), and when all bits are empty
+(to which it responds by taking acknowledge low). This is intuitive, but there
+is also some mathematical background – the dual-rail code is a particularly
+simple member of the family of delay-insensitive codes [147], and it has some
+nice properties:
+
+• any concatenation of dual-rail codewords is itself a dual-rail codeword.
+
+• for a given N (the number of bits to be communicated), the set of all
+  possible codewords can be disjointly divided into 3 sets:
+
+  – the empty codeword where all N wire pairs are {0,0}.
+
+  – the intermediate codewords where some wire-pairs assume the
+    empty state and some wire pairs assume valid data.
+
+  – the 2^N different valid codewords.
+
+Figure 2.3 illustrates the handshaking on an N-bit channel: a receiver will
+see the empty codeword, a sequence of intermediate codewords (as more and
+more bits/wire-pairs become valid) and eventually a valid codeword. After
+receiving and acknowledging the codeword, the receiver will see a sequence of
+intermediate codewords (as more and more bits become empty), and eventually
+the empty codeword to which the receiver responds by driving acknowledge
+low again.
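+
+The division of codewords into empty, intermediate and valid can be illustrated with a short Python sketch. This is only an illustration: the names classify_dual_rail, decode and count_classes are hypothetical and do not come from the text, and each wire pair is assumed to be given as a (d.t, d.f) tuple of 0/1 levels.
+

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]  # assumed representation: (d.t, d.f) levels of one wire pair


def classify_dual_rail(word: Sequence[Pair]) -> str:
    """Classify an N-bit dual-rail word as empty, intermediate, valid or illegal."""
    pairs = [tuple(p) for p in word]
    if any(p == (1, 1) for p in pairs):
        return "illegal"  # the {1,1} combination is not a codeword
    valid_bits = sum(1 for p in pairs if p != (0, 0))
    if valid_bits == 0:
        return "empty"
    if valid_bits == len(pairs):
        return "valid"
    return "intermediate"


def decode(word: Sequence[Pair]) -> List[int]:
    """Return the bit values of a valid codeword (d.t high means logic 1)."""
    if classify_dual_rail(word) != "valid":
        raise ValueError("word is not a valid codeword")
    return [t for t, _ in word]


def count_classes(n: int) -> Dict[str, int]:
    """Count how many legal N-bit words fall into each of the three sets."""
    counts = {"empty": 0, "intermediate": 0, "valid": 0}
    for word in product([(0, 0), (0, 1), (1, 0)], repeat=n):
        counts[classify_dual_rail(word)] += 1
    return counts
```

+
+A receiver built this way would raise acknowledge when the word classifies as "valid" and lower it again when the word classifies as "empty"; for N = 3 the sketch counts 1 empty, 18 intermediate and 2^3 = 8 valid codewords.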
+
+
+[Figure 2.3 content: a Data waveform alternating between “all empty” and “all valid”, and an Acknowledge waveform switching between 0 and 1, both plotted against time.]
+
+Figure 2.3.
+Illustration of the handshaking on a 4-phase dual-rail channel.
+
+2.1.3
+The 2-phase dual-rail protocol
+
+The 2-phase dual-rail protocol also uses 2 wires {d.t, d.f} per bit, but the
+information is encoded as transitions (events) as explained previously. On an
+N-bit channel a new codeword is received when exactly one wire in each of
+the N wire pairs has made a transition. There is no empty value; a valid mes-
+sage is acknowledged and followed by another message that is acknowledged.
+Figure 2.4 shows the signal waveforms on a 2-bit channel using the 2-phase
+dual-rail protocol.
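+
+The receive rule (exactly one wire in each pair has made a transition) can be sketched in Python as follows. The function name decode_2phase and the representation of a channel as a list of (d.t, d.f) wire levels are assumptions made for this illustration only.
+

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[int, int]  # assumed representation: (d.t, d.f) wire levels


def decode_2phase(prev: Sequence[Pair], curr: Sequence[Pair]) -> Optional[List[int]]:
    """Decode a 2-phase dual-rail word from the wire levels before and after.

    Returns the list of bits once exactly one wire in every pair has toggled,
    None while the word is still incomplete, and raises ValueError if both
    wires of a pair have toggled (which the protocol does not allow).
    """
    bits = []
    complete = True
    for (pt, pf), (ct, cf) in zip(prev, curr):
        t_toggled = pt != ct
        f_toggled = pf != cf
        if t_toggled and f_toggled:
            raise ValueError("both wires of a pair made a transition")
        if not (t_toggled or f_toggled):
            complete = False  # this bit has not arrived yet
            continue
        bits.append(1 if t_toggled else 0)  # event on d.t means 1, on d.f means 0
    return bits if complete else None
```

+
+Because information is carried by events rather than levels, the same wire levels can encode different values depending on the previous state, which is why the sketch needs both prev and curr.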
+
+[Figure 2.4 content: waveforms for d1.t, d1.f, d0.t, d0.f and Ack on a 2-bit channel; the successive (d1, d0) words are labelled 00, 01, 00, 11.]
+
+Figure 2.4.
+Illustration of the handshaking on a 2-phase dual-rail channel.
+
+2.1.4
+Other protocols
+
+The previous sections introduced the four most common channel protocols:
+the 4-phase bundled-data push channel, the 2-phase bundled-data push chan-
+nel, the 4-phase dual-rail push channel and the 2-phase dual-rail push channel;
+but there are many other possibilities. The two wires per bit used in the dual-
+rail protocol can be seen as a one-hot encoding of that bit and often it is useful
+to extend to 1-of-n encodings in control logic and higher-radix data encodings.
+If the focus is on communication rather than computation, m-of-n encodings
+may be of relevance. The solution space can be expressed as the cross product
+of a number of options including:
+
+{2-phase, 4-phase} × {bundled-data, dual-rail, 1-of-n, ...} × {push, pull}
+
+The choice of protocol affects the circuit implementation characteristics
+(area, speed, power, robustness, etc.). Before continuing with these imple-
+mentation issues it is necessary to introduce the concept of indication or ac-
+knowledgement, as well as a new component, the Muller C-element.
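+
+As a small illustration of the cross product mentioned above, the following Python sketch enumerates the combinations of the options that are named explicitly. The list names and the function protocol_space are hypothetical, and the encoding list is deliberately limited to the three encodings spelled out in the text even though the text treats that list as open-ended.
+

```python
from itertools import product
from typing import List

# Options named in the text; the encoding list is open-ended in the original.
HANDSHAKE = ["2-phase", "4-phase"]
ENCODING = ["bundled-data", "dual-rail", "1-of-n"]
DIRECTION = ["push", "pull"]


def protocol_space() -> List[str]:
    """Enumerate every handshake/encoding/direction combination."""
    return [" ".join(combo) for combo in product(HANDSHAKE, ENCODING, DIRECTION)]


if __name__ == "__main__":
    for name in protocol_space():
        print(name)
```

+
+With these three lists the sketch yields 2 × 3 × 2 = 12 candidate protocols, of which the four introduced in the previous sections are the push variants of bundled-data and dual-rail.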
+
+2.2.
+The Muller C-element and the indication principle
+
+In a synchronous circuit the role of the clock is to define points in time
+where signals are stable and valid. In between the clock-ticks, the signals
+may exhibit hazards and may make multiple transitions as the combinational
+circuits stabilize. This does not matter from a functional point of view. In
+asynchronous (control) circuits the situation is different. The absence of a
+clock means that, in many circumstances, signals are required to be valid all
+the time, that every signal transition has a meaning and, consequently, that
+hazards and races must be avoided.
+Intuitively a circuit is a collection of gates (normally including some feed-
+back loops), and when the output of a gate changes it is seen by other gates
+that in turn may decide to change their outputs accordingly. As an example fig-
+ure 2.5 shows one possible implementation of the CTL circuit in figure 1.1(c).
+The intention here is not to explain its function, just to give an impression of
+the type of circuit we are discussing. It is obvious that hazards on the Ro,
+Ai, and Lt signals would be disastrous if the circuit is used in the pipeline of
+figure 1.1(c).
+
+[Figure 2.5 content: gate-level schematic of the CTL circuit built from AND (&) and OR (+) gates, with signals Ri, Ao, Ro, Ai and Lt.]
+
+Figure 2.5.
+An example of an asynchronous control circuit. Lt is a “local” clock that is in-
+tended to control a latch.
+
+
+[Figure 2.6 content: OR gate symbol (inputs a, b; output y) and its truth table.]
+
+a  b  y
+0  0  0
+0  1  1
+1  0  1
+1  1  1
+
+Figure 2.6.
+A normal OR gate
+
+[Figure 2.7 content: C-element symbol (inputs a, b; output y) and a possible gate-level implementation.]
+
+Some specifications:
+
+1: if a = b then y := a
+
+2: a = b → y := a
+
+3: y = ab + y(a + b)
+
+4: a  b  y
+   0  0  0
+   0  1  no change
+   1  0  no change
+   1  1  1
+
+Figure 2.7.
+The Muller C-element: symbol, possible implementation, and some alternative
+specifications.
+
+The concept of indication or acknowledgement plays an important role in
+the design of such circuits. Consider the simple 2-input OR gate in figure 2.6.
+An observer seeing the output change from 1 to 0 may conclude that both
+inputs are now at 0. However, when seeing the output change from 0 to 1 the
+observer is not able to make conclusions about both inputs. The observer only
+knows that at least one input is 1, but it does not know which. We say that
+the OR gate only indicates or acknowledges when both inputs are 0. Through
+similar arguments it can be seen that an AND gate only indicates when both
+inputs are 1.
+Signal transitions that are not indicated or acknowledged in other signal
+transitions are the source of hazards and should be avoided. We will address
+this issue in greater detail later in section 2.5.1 and in chapter 6.
+A circuit that is better in this respect is the Muller C-element shown in fig-
+ure 2.7. It is a state-holding element much like an asynchronous set-reset latch.
+When both inputs are 0 the output is set to 0, and when both inputs are 1 the
+output is set to 1. For other input combinations the output does not change.
+Consequently, an observer seeing the output change from 0 to 1 may conclude
+that both inputs are now at 1; and similarly, an observer seeing the output
+change from 1 to 0 may conclude that both inputs are now 0.
+
+
+Combining this with the observation that all asynchronous circuits rely on
+handshaking that involves cyclic transitions between 0 and 1, it should be clear
+that the Muller C-element is indeed a fundamental component that is exten-
+sively used in asynchronous circuits.
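+
+The behaviour of the C-element and the contrast with the OR gate can be captured in a few lines of Python. This is a behavioural sketch only: the class name CElement and the function or_gate are hypothetical, and the update rule is the Boolean specification y = ab + y(a + b) from figure 2.7.
+

```python
class CElement:
    """Behavioural model of a 2-input Muller C-element (state-holding)."""

    def __init__(self, initial: int = 0) -> None:
        self.y = initial

    def update(self, a: int, b: int) -> int:
        # y = ab + y(a + b): follow the inputs when they agree, else hold.
        self.y = (a & b) | (self.y & (a | b))
        return self.y


def or_gate(a: int, b: int) -> int:
    """Ordinary OR gate: a 0 output indicates both inputs are 0, a 1 does not."""
    return a | b


if __name__ == "__main__":
    c = CElement()
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(a, b, "C:", c.update(a, b), "OR:", or_gate(a, b))
```

+
+In the run above the C-element output rises only once both inputs are 1 and falls only once both are 0, whereas the OR gate output already rises on the first input.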
+
+2.3.
+The Muller pipeline
+
+Figure 2.8 shows a circuit that is built from C-elements and inverters. The
+circuit is known as a Muller pipeline or a Muller distributor. Variations and ex-
+tensions of this circuit form the (control) backbone of almost all asynchronous
+circuits. It may not always be obvious at a first glance, but if one strips off the
+cluttering details, the Muller pipeline is always there as the crux of the matter.
+The circuit has a beautiful and symmetric behaviour, and once you understand
+its behaviour, you have a very good basis for understanding most asynchronous
+circuits.
+The Muller pipeline in figure 2.8 is a mechanism that relays handshakes.
+After all of the C-elements have been initialized to 0 the left environment
+may start handshaking. To understand what happens let’s consider the ith
+C-element, C[i]: It will propagate (i.e. input and store) a 1 from its
+predecessor, C[i-1], only if its successor, C[i+1], is 0. In a similar way it will propagate
+(i.e. input and store) a 0 from its predecessor if its successor is 1. It is often
+useful to think of the signals propagating in an asynchronous circuit as a se-
+quence of waves, as illustrated at the bottom of figure 2.8. Viewed this way, the
+role of a C-element stage in the pipeline is to propagate crests and troughs of
+waves in a carefully controlled way that maintains the integrity of each wave.
+On any interface between C-element pipeline stages an observer will see
+correct handshaking, but the timing may differ from the timing of the hand-
+shaking on the left hand environment; once a wave has been injected into the
+Muller pipeline it will propagate with a speed that is determined by actual de-
+lays in the circuit.
+Eventually the first handshake (request) injected by the left hand environ-
+ment will reach the right hand environment. If the right hand environment
+does not respond to the handshake, the pipeline will eventually fill. If this hap-
+pens the pipeline will stop handshaking with the left hand environment – the
+Muller pipeline behaves like a ripple-through FIFO!
+In addition to this elegant behaviour, the pipeline has a number of beautiful
+symmetries. Firstly, it does not matter if you use 2-phase or 4-phase handshak-
+ing. It is the same circuit. The difference is in how you interpret the signals
+and use the circuit. Secondly, the circuit operates equally well from right to
+left. You may reverse the definition of signal polarities, reverse the role of the
+request and acknowledge signals, and operate the circuit from right to left. It
+is analogous to electrons and holes in a semiconductor; when current flows in
+
+
+[Figure 2.8 content: a chain of C-elements and inverters with Req signals running from Left to Right and Ack signals running back; stage C[i] sits between C[i-1] and C[i+1] and obeys the rule: if C[i-1] ≠ C[i+1] then C[i] := C[i-1].]
+
+Figure 2.8.
+The Muller pipeline or Muller distributor.
+
+one direction it may be carried by electrons flowing in one direction or by holes
+flowing in the opposite direction.
+Finally, the circuit has the interesting property that it works correctly regard-
+less of delays in gates and wires – the Muller-pipeline is delay-insensitive.
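+
+The FIFO-like behaviour described in this section can be reproduced with a small Python simulation of the rule “if C[i-1] ≠ C[i+1] then C[i] := C[i-1]”. The sketch rests on assumptions that are not in the text: the left environment is modelled as an inverter that drives request to the complement of C[0], the right environment never responds (its acknowledge is held constant), and excited elements are fired one at a time in a fixed order. The function names are hypothetical.
+

```python
from typing import List, Union


def excited_gates(req: int, c: List[int], right_ack: int) -> List[Union[str, int]]:
    """Return the elements whose output is about to change ("env" or a stage index)."""
    n = len(c)
    excited: List[Union[str, int]] = []
    if req != 1 - c[0]:
        excited.append("env")  # assumed left environment: req := not C[0]
    for i in range(n):
        pred = req if i == 0 else c[i - 1]
        succ = right_ack if i == n - 1 else c[i + 1]
        if pred != succ and c[i] != pred:
            excited.append(i)  # rule: if C[i-1] != C[i+1] then C[i] := C[i-1]
    return excited


def fill_pipeline(n: int, right_ack: int = 0, max_steps: int = 10_000) -> List[int]:
    """Fire excited elements until nothing changes; return the C-element states."""
    req, c = 0, [0] * n
    for _ in range(max_steps):
        excited = excited_gates(req, c, right_ack)
        if not excited:
            return c
        gate = excited[0]
        if gate == "env":
            req = 1 - req
        else:
            c[gate] = 1 - c[gate]  # an excited stage copies its predecessor
    raise RuntimeError("pipeline did not settle")


if __name__ == "__main__":
    print(fill_pipeline(4))
```

+
+With a right-hand environment that never acknowledges, a 4-stage pipeline settles in the state [0, 1, 0, 1] and then stops handshaking with the left-hand environment, which is the “full” condition of the ripple-through FIFO.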
+
+2.4.
+Circuit implementation styles
+
+As mentioned previously, the choice of handshake protocol affects the cir-
+cuit implementation (area, speed, power, robustness, etc.). Most practical cir-
+cuits use one of the following protocols introduced in section 2.1:
+
+• 4-phase bundled-data – which most closely resembles the design of
+  synchronous circuits and which normally leads to the most efficient circuits,
+  due to the extensive use of timing assumptions.
+
+• 2-phase bundled-data – introduced under the name Micropipelines by Ivan
+  Sutherland in his 1988 Turing Award lecture.
+
+• 4-phase dual-rail – the classic approach rooted in David Muller’s pioneering
+  work in the 1950s.
+
+Common to all protocols is the fact that the corresponding circuit
+implementations all use variations of the Muller pipeline for controlling the
+storage elements. Below we explain the basics of pipelines built using simple
+transparent latches as storage elements. More optimized and elaborate circuit
+implementations and more complex circuit structures are the topics of later chapters.
+
+[Figure 2.9 content: (a) a row of C-elements whose outputs drive the EN inputs of a row of latches, with Req, Ack and Data signals between stages; (b) the same structure with combinational blocks (Comb. F) inserted between the latches.]
+
+Figure 2.9.
+A simple 4-phase bundled-data pipeline.
+
+2.4.1
+4-phase bundled-data
+
+A 4-phase bundled-data pipeline is particularly simple. A Muller pipeline
+is used to generate local clock pulses. The clock pulse generated in one stage
+overlaps with the pulses generated in the neighbouring stages in a carefully
+controlled interlocked manner. Figure 2.9(a) shows a FIFO, i.e. a pipeline
+without data processing, and figure 2.9(b) shows how combinational circuits
+(also called function blocks) can be added between the latches. To maintain
+correct behaviour matching delays have to be inserted in the request signal
+paths.
+You may view this circuit as a traditional “synchronous” data-path, con-
+sisting of latches and combinational circuits that are clocked by a distributed
+gated-clock driver, or you may view it as an asynchronous data-flow structure
+composed of two types of handshake components: latches and function blocks,
+as indicated by the dashed boxes.
+The pipeline implementation shown in figure 2.9 is particularly simple but
+it has some drawbacks: when it fills the state of the C-elements is (0, 1, 0,
+1, etc.), and as a consequence only every other latch is storing data. This
+
+
+[Figure 2.10 content: three C-elements forming a Muller pipeline, each controlling the C and P inputs of a capture-pass latch, with Req and Ack signals between stages and Data passing through the latches.]
+
+Figure 2.10.
+A simple 2-phase bundled-data pipeline.
+
+is no worse than in a synchronous circuit using master-slave flip-flops, but
+it is possible to design asynchronous pipelines and FIFOs that are better in
+this respect. Another disadvantage is speed. The throughput of a pipeline
+or FIFO depends on the time it takes to complete a handshake cycle and for
+the above implementation this involves communication with both neighbours.
+Chapter 7 addresses alternative implementations that are both faster and have
+better occupancy when full.
+
+2.4.2
+2-phase bundled data (Micropipelines)
+
+A 2-phase bundled-data pipeline also uses a Muller pipeline as the backbone
+control circuit, but the control signals are interpreted as events or transitions,
+figure 2.10. For this reason special capture-pass latches are needed: events
+on the C and P inputs alternate, causing the latch to alternate between cap-
+ture mode and pass mode. This calls for a special latch design as shown in
+figure 2.11 and explained below. The switch symbol in figure 2.11 is a multi-
+plexer, and the event controlled latch can be understood as two ordinary level
+sensitive latches (operating in an alternating fashion) followed by a multiplexer
+and a buffer.
+Figure 2.10 shows a pipeline without data processing. Combinational cir-
+cuits with matching delay elements can be inserted between latches in a similar
+way to the 4-phase bundled-data approach in figure 2.9.
+The 2-phase bundled-data approach was pioneered by Ivan Sutherland in the
+late 1980s and an excellent introduction is given in his 1988 Turing Award
+Lecture [128]. The title Micropipelines is often used synonymously with the use
+of the 2-phase bundled-data protocol, but it also refers to the use of a particular
+set of components that are based on event signalling. In addition to the latch in
+figure 2.11 these are: AND, OR, Select, Toggle, Call and Arbiter. The above
+figures 2.10 and 2.11 are similar to figures 15 and 12 in [128], but they
+emphasise more strongly the fact that the control structure is a Muller pipeline.
+Some alternative latch designs that are (significantly) smaller and
+(significantly) slower are also presented in [128].
+
+[Figure 2.11 content: implementation of the capture-pass latch (inputs In, C, P; output Out) and its four operating states.]
+
+t0: C=0, P=0  pass
+t1: C=1, P=0  capture
+t2: C=1, P=1  pass
+t3: C=0, P=1  capture
+
+Figure 2.11.
+Implementation and operation of a capture-pass event controlled latch. At time
+t0 the latch is transparent (i.e. in pass mode) and signals C and P are both low. An event on the
+C input turns the latch into capture mode, etc.
+At the conceptual level the 2-phase bundled-data approach is elegant and ef-
+ficient; compared to the 4-phase bundled-data approach it avoids the power and
+performance loss that is incurred by the return-to-zero part of the handshaking.
+However, as illustrated by the latch design, the implementation of components
+that respond to signal transitions is often more complex than the implemen-
+tation of components that respond to normal level signals. In addition to the
+storage elements explained above, conditional control logic that responds to
+signal transitions tends to be complex as well. This has been experienced by
+this author [123], by the University of Manchester [42, 45] and by many others.
+Having said this, the 2-phase bundled-data approach may be the preferred
+solution in systems with unconditional data-flows and very high speed require-
+ments. But as just mentioned, the higher speed comes at a price: larger silicon
+area and higher power consumption. In this respect asynchronous design is no
+different from synchronous design.
+
+2.4.3
+4-phase dual-rail
+
+A 4-phase dual-rail pipeline is also based on the Muller pipeline, but in a
+more elaborate way that has to do with the combined encoding of data and
+request. Figure 2.12 shows the implementation of a 1-bit wide and three stage
+deep pipeline without data processing. It can be understood as two Muller
+
+
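+[Figure 2.12 content: three pipeline stages, each with two C-elements (one for d.t, one for d.f) and an OR gate (+) that generates the stage's Ack; d.t, d.f and Ack appear at both ends of the pipeline.]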
+
+Figure 2.12.
+A simple 3-stage 1-bit wide 4-phase dual-rail pipeline.
+
+[Figure 2.13 content: a latch for bits di[0] to di[2] (wires di[k].t, di[k].f) with outputs do[0] to do[2] (wires do[k].t, do[k].f) and handshake signals ack_i and ack_o; a completion detector formed by OR gates (+) and a C-element; and an alternative completion detector with “All valid” and “All empty” signals, AND gates (&) and a C-element.]
+
+Figure 2.13.
+An N-bit latch with completion detection.
+
+pipelines connected in parallel, using a common acknowledge signal per stage
+to synchronize operation. The pair of C-elements in a pipeline stage can store
+the empty codeword {d.t, d.f} = {0,0}, causing the acknowledge signal out
+of that stage to be 0, or it can store one of the two valid codewords {0,1}
+and {1,0}, causing the acknowledge signal out of that stage to be logic 1.
+At this point, and referring back to section 2.2, the reader should notice that
+because the codeword {1,1} is illegal and does not occur, the acknowledge
+signal generated by the OR gate safely indicates the state of the pipeline stage
+as being “valid” or “empty.”
+An N-bit wide pipeline can be implemented by using a number of 1-bit
+pipelines in parallel. This does not guarantee to a receiver that all bits in a
+word arrive at the same time, but often the necessary synchronization is done
+
+
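+[Figure 2.14 content: symbol of a dual-rail AND gate (inputs a, b; output y) and its implementation with four C-elements, one per input minterm (00, 01, 10, 11) of the wires a.f, a.t, b.f, b.t; an OR gate (+) combines three C-element outputs into y.f and the fourth C-element drives y.t.]
+
+a  b  | y.t  y.f
+E  E  | 0    0
+F  F  | 0    1
+F  T  | 0    1
+T  F  | 0    1
+T  T  | 1    0
+(all other input combinations: NO CHANGE)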
+
+Figure 2.14.
+A 4-phase dual-rail AND gate: symbol, truth table, and implementation.
+
+in the function blocks. In [124, 125] we describe a design of this style using
+the DIMS combinational circuits explained below.
+If bit-parallel synchronization is needed, the individual acknowledge signals
+can be combined into one global acknowledge using a C-element. Figure 2.13
+shows an N-bit wide latch. The OR gates and the C-element in the dashed box
+form a completion detector that indicates whether the N-bit dual-rail codeword
+stored in the latch is empty or valid. The figure also shows an implementation
+of a completion detector using only a 2-input C-element.
+Let us now look at how combinational circuits for 4-phase dual-rail cir-
+cuits are implemented. As mentioned in chapter 1 combinational circuits must
+be transparent to the handshaking between latches. Therefore, all outputs of
+a combinational circuit must not become valid until after all inputs have be-
+come valid. Otherwise the receiving latch may prematurely set acknowledge
+high (before all signals from the sending latch have become valid). In a sim-
+ilar way all outputs of a combinational circuit must not become empty until
+after all inputs have become empty. Otherwise the receiving latch may pre-
+maturely set acknowledge low (before all signals from the sending latch have
+become empty). Consequently a combinational circuit for the 4-phase dual-
+rail approach involves state holding elements and it exhibits a hysteresis-like
+behaviour in the empty-to-valid and valid-to-empty transitions.
+A particularly simple approach, using only C-elements and OR gates, is
+illustrated in figure 2.14, which shows the implementation of a dual-rail AND
+gate. The circuit can be understood as a direct mapping from sum-of-minterms
+expressions for each of the two output wires into hardware. The circuit waits
+for all its inputs to become valid. When this happens exactly one of the four
+C-elements goes high. This again causes the relevant output wire to go high
+corresponding to the gate producing the desired valid output. When all inputs
+become empty the C-elements are all set low, and the output of the dual-rail
+AND gate becomes empty again. Note that the C-elements provide both the
+necessary ‘and’ operator and the hysteresis in the empty-to-valid and
+valid-to-empty transitions that is required for transparent handshaking. Note also that
+(again) the OR gate is never exposed to more than one input signal being high.
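+
+The structure of this dual-rail AND gate (one C-element per input minterm, with an OR gate collecting the minterms that produce a false output) can be modelled behaviourally in Python. The class name DualRailAnd and the (d.t, d.f) tuple representation are assumptions made for this sketch; it models logic levels only and ignores all delays.
+

```python
from typing import Dict, Tuple

Pair = Tuple[int, int]  # assumed representation: (d.t, d.f) wire levels


class DualRailAnd:
    """Behavioural model of a dual-rail AND gate built from four C-elements."""

    def __init__(self) -> None:
        # One state-holding C-element per input minterm "ab".
        self.c: Dict[str, int] = {"00": 0, "01": 0, "10": 0, "11": 0}

    @staticmethod
    def _c_element(a: int, b: int, y: int) -> int:
        return (a & b) | (y & (a | b))  # y = ab + y(a + b)

    def update(self, a: Pair, b: Pair) -> Pair:
        (a_t, a_f), (b_t, b_f) = a, b
        self.c["00"] = self._c_element(a_f, b_f, self.c["00"])
        self.c["01"] = self._c_element(a_f, b_t, self.c["01"])
        self.c["10"] = self._c_element(a_t, b_f, self.c["10"])
        self.c["11"] = self._c_element(a_t, b_t, self.c["11"])
        y_t = self.c["11"]                                # a AND b is true
        y_f = self.c["00"] | self.c["01"] | self.c["10"]  # OR of false minterms
        return (y_t, y_f)


if __name__ == "__main__":
    E, T, F = (0, 0), (1, 0), (0, 1)
    gate = DualRailAnd()
    for a, b in [(T, E), (T, F), (E, F), (E, E), (T, T)]:
        print(a, b, "->", gate.update(a, b))
```

+
+The run shows the hysteresis the text asks for: the output stays empty until both inputs are valid, and stays valid until both inputs have returned to empty.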
+Other dual-rail gates such as OR and EXOR can be implemented in a sim-
+ilar fashion, and a dual-rail inverter involves just a swap of the true and false
+wires. The transistor count in these basic dual-rail gates is obviously rather
+high, and in chapter 5 we explore more efficient circuit implementations. Here
+our interest is in the fundamental principles.
+Given a set of basic dual-rail gates one can construct dual-rail combinational
+circuits for arbitrary Boolean expressions using normal combinational circuit
+synthesis techniques. The transparency to handshaking that is a property of
+the basic gates is preserved when composing gates into larger combinational
+circuits.
+The fundamental ideas explained above all go back to David Muller’s work
+in the late 1950s and early 1960s [93, 92]. While [93] develops the fundamen-
+tal theory for the design of speed-independent circuits, [92] is a more practical
+introduction including a design example: a bit-serial multiplier using latches
+and gates as explained above.
+
+2.5.
+Theory
+
+Asynchronous circuits can be classified, as we will see below, as being self-
+timed, speed-independent or delay-insensitive depending on the delay assump-
+tions that are made. In this section we introduce some important theoretical
+concepts that relate to this classification. The goal is to communicate the basic
+ideas and provide some intuition on the problems and solutions, and a reader
+who wishes to dig deeper into the theory is referred to the literature. Some
+recent starting points are [95, 54, 69, 35, 18].
+
+2.5.1
+The basics of speed-independence
+
+We will start by reviewing the basics of David Muller’s model of a cir-
+cuit and the conditions for it being speed-independent [93]. A circuit is mod-
+eled along with its (dummy) environment as a closed network of gates, closed
+meaning that all inputs are connected to outputs and vice versa. Gates are
+modeled as Boolean operators with arbitrary non-zero delays, and wires are
+assumed to be ideal. In this context the circuit can be described as a set of
+concurrent Boolean functions, one for each gate output. The state of the circuit
+is the set of all gate outputs. Figure 2.15 illustrates this for a stage of a Muller
+pipeline with an inverter and a buffer mimicking the handshaking behaviour of
+the left and right hand environments.
+A gate whose output is consistent with its inputs is said to be stable; its
+“next output” is the same as its “current output”, z_i' = z_i. A gate whose inputs
+
+
+[Figure 2.15 content: a C-element stage with signals r_i, a_i, c_i, y_i, r_{i+1} and a_{i+1}, together with the gate equations below.]
+
+r_i'     = not(c_i)
+c_i'     = r_i y_i + (r_i + y_i) c_i
+y_i'     = not(a_{i+1})
+a_{i+1}' = c_i
+
+Figure 2.15.
+Muller model of a Muller pipeline stage with “dummy” gates modeling the envi-
+ronment behaviour.
+
+have changed in such a way that an output change is called for is said to be
+excited; its “next output” is different from its “current output”, i.e. z_i' ≠ z_i.
+After an arbitrary delay an excited gate may spontaneously change its output
+and become stable. We say that the gate fires, and as excited gates fire and
+become stable with new output values, other gates in turn become excited, etc.
+To illustrate this, suppose that the circuit in figure 2.15 is in state
+(r_i, y_i, c_i, a_{i+1}) = (0,1,0,0). In this state (the inverter) r_i is excited,
+corresponding to the left environment being about to take request high. After
+the firing of r_i↑ the circuit reaches state (r_i, y_i, c_i, a_{i+1}) = (1,1,0,0)
+and c_i now becomes excited.
+For synthesis and analysis purposes one can construct the complete state graph
+representing all possible sequences of gate firings. This is addressed in detail
+in chapter 6. Here we will restrict the discussion to an explanation of the
+fundamental ideas.
+In the general case it is possible that several gates are excited at the same
+time (i.e. in a given state). If one of these gates, say z_i, fires, the interesting
+thing is what happens to the other excited gates which may have z_i as one
+of their inputs: they may remain excited, or they may find themselves with a
+different set of input signals that no longer calls for an output change. A circuit
+is speed-independent if the latter never happens. The practical implication of
+an excited gate becoming stable without firing is a potential hazard. Since
+delays are unknown the gate may or may not have changed its output, or it
+may be in the middle of doing so when the ‘counter-order’ comes calling for
+the gate output to remain unchanged.
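+
+The speed-independence condition can be checked mechanically for the small model in figure 2.15 by exploring every reachable state. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the state is ordered (r_i, y_i, c_i, a_{i+1}) as in the text, and the check is simply that firing one excited gate never leaves another previously excited gate stable.
+

```python
from typing import Set, Tuple

State = Tuple[int, int, int, int]  # (r_i, y_i, c_i, a_{i+1})


def next_values(state: State) -> State:
    """Evaluate the four gate equations of figure 2.15 in the given state."""
    r, y, c, a = state
    return (
        1 - c,                    # r_i'     = not(c_i)
        1 - a,                    # y_i'     = not(a_{i+1})
        (r & y) | ((r | y) & c),  # c_i'     = r_i y_i + (r_i + y_i) c_i
        c,                        # a_{i+1}' = c_i
    )


def excited(state: State) -> Set[int]:
    """Indices of gates whose next output differs from their current output."""
    nxt = next_values(state)
    return {k for k in range(4) if nxt[k] != state[k]}


def fire(state: State, k: int) -> State:
    """Return the state reached when gate k changes its output."""
    s = list(state)
    s[k] = 1 - s[k]
    return (s[0], s[1], s[2], s[3])


def check_speed_independent(initial: State) -> Tuple[bool, Set[State]]:
    """Explore all firing sequences; fail if an excited gate is ever disabled."""
    seen, stack = {initial}, [initial]
    while stack:
        s = stack.pop()
        ex = excited(s)
        for k in ex:
            t = fire(s, k)
            if not (ex - {k}) <= excited(t):
                return False, seen  # some gate became stable without firing
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return True, seen


if __name__ == "__main__":
    ok, states = check_speed_independent((0, 1, 0, 0))
    print(ok, len(states))
```

+
+Starting from (0,1,0,0) the search visits 12 states and never finds an excited gate that is disabled by another gate firing, so this model passes the check. The same brute-force approach quickly becomes impractical for larger circuits, since the state space grows with one Boolean variable per gate.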
+Since the model involves a Boolean state variable for each gate (and for
+each wire segment in the case of delay-insensitive circuits) the state space be-
+comes very large even for very simple circuits. In chapter 6 we introduce signal
+transition graphs as a more abstract representation from which circuits can be
+synthesized.
+Now that we have a model for describing and reasoning about the behaviour
+of gate-level circuits let’s address the classification of asynchronous circuits.
+
+
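+[Figure 2.16 content: gate A (delay dA) drives a wire with delay d1 that forks into two branches with delays d2 and d3 leading to gates B (delay dB) and C (delay dC).]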
+
+Figure 2.16.
+A circuit fragment with gate and wire delays. The output of gate A forks to inputs
+of gates B and C.
+
+2.5.2
+Classification of asynchronous circuits
+
+At the gate level, asynchronous circuits can be classified as being self-timed,
+speed-independent or delay-insensitive depending on the delay assumptions
+that are made. Figure 2.16 serves to illustrate the following discussion. The
+figure shows three gates: A, B, and C, where the output signal from gate A is
+connected to inputs on gates B and C.
+A speed-independent (SI) circuit as introduced above is a circuit that oper-
+ates “correctly” assuming positive, bounded but unknown delays in gates and
+ideal zero-delay wires. Referring to figure 2.16 this means arbitrary dA, dB,
+and dC, but d1 = d2 = d3 = 0. Assuming ideal zero-delay wires is not very
+realistic in today’s semiconductor processes. By allowing arbitrary d1 and d2
+and by requiring d2 = d3 the wire delays can be lumped into the gates, and
+from a theoretical point of view the circuit is still speed-independent.
+A circuit that operates “correctly” with positive, bounded but unknown de-
+lays in wires as well as in gates is delay-insensitive (DI). Referring to fig-
+ure 2.16 this means arbitrary dA, dB, dC, d1, d2, and d3. Such circuits are obvi-
+ously extremely robust. One way to show that a circuit is delay-insensitive is to
+use a Muller model of the circuit where wire segments (after forks) are modeled
+as buffer components. If this equivalent circuit model is speed-independent,
+then the circuit is delay-insensitive.
+Unfortunately the class of delay-insensitive circuits is rather small. Only
+circuits composed of C-elements and inverters can be delay-insensitive [82],
+and the Muller pipeline in figures 2.5, 2.8, and 2.15 is one important example.
+Circuits that are delay-insensitive with the exception of some carefully
+identified wire forks where d2 = d3 are called quasi-delay-insensitive (QDI). Such
+wire forks, where signal transitions occur at the same time at all end-points, are
+called isochronic (and discussed in more detail in the next section). Typically
+these isochronic forks are found in gate-level implementations of basic build-
+ing blocks where the designer can control the wire delays. At the higher levels
+of abstraction the composition of building blocks would typically be delay-
+insensitive. After these comments it is obvious that a distinction between DI,
+QDI and SI makes good sense.
+
+
+Because the class of delay-insensitive circuits is so small, basically exclud-
+ing all circuits that compute, most circuits that are referred to in the literature
+as delay-insensitive are only quasi-delay-insensitive.
+Finally a word about self-timed circuits: speed-independence and delay-
+insensitivity as introduced above are (mathematically) well defined properties
+under the unbounded gate and wire delay model. Circuits whose correct opera-
+tion relies on more elaborate and/or engineering timing assumptions are simply
+called self-timed.
+
+2.5.3
+Isochronic forks
+
+From the above it is clear that the distinction between speed-independent
+circuits and delay-insensitive circuits relates to wire forks and, more specifi-
+cally, to whether the delays to all end-points of a forking wire are identical or
+not. If the delays are identical, the wire-fork is called isochronic.
+The need for isochronic forks is related to the concept of indication intro-
+duced in section 2.2. Consider a situation in figure 2.16 where gate A has
+changed its output. Eventually this change is observed on the inputs of gates
+B and C, and after some time gates B and C may respond to the new input by
+producing a new output. If this happens we say that the output change on gate
+A is indicated by output changes on gates B and C. If, on the other hand, only
+gate B responds to the new input, it is not possible to establish whether gate C
+has seen the input change as well. In this case it is necessary to strengthen the
+assumptions to d2 = d3 (i.e. that the fork is isochronic) and conclude that since
+the input signal change was indicated by the output of B, gate C has also seen
+the change.
+
+2.5.4
+Relation to circuits
+
+In the 2-phase and 4-phase bundled-data approaches the control circuits are
+normally speed-independent (or in some cases even delay-insensitive), but the
+data-path circuits with their matched delays are self-timed. Circuits designed
+following the 4-phase dual-rail approach are generally quasi-delay-insensitive.
+In the circuits shown in figures 2.12 and 2.14 the forks that connect to the inputs
+of several C-elements must be isochronic, whereas the forks that connect to the
+inputs of several OR gates are delay-insensitive.
+The different circuit classes, DI, QDI, SI and self-timed, are not mutually-
+exclusive ways to build complete systems, but useful abstractions that can be
+used at different levels of design. In most practical designs they are mixed.
+For example, in the Amulet processors [44, 43, 48] SI design is used for lo-
+cal asynchronous controllers, bundled-data for local data processing, and DI
+is used for high-level composition. Another example is the hearing-aid filter
+bank design presented in [103]. It uses the DI dual-rail 4-phase protocol inside
+RAM-modules and arithmetic circuits to provide robust completion indication,
+and 4-phase bundled-data with SI control at the top levels of design, i.e. some-
+what different from the Amulet designs. This emphasizes that the choice of
+handshake protocol and circuit implementation style is among the factors to
+consider when optimizing an asynchronous digital system.
+It is important to stress that speed-independence and delay-insensitivity are
+mathematical properties that can be verified for a given implementation. If an
+abstract component – such as a C-element or a complex And-Or-Invert gate
+– is replaced by its implementation using simple gates and possibly some
+wire-forks, then the circuit may no longer be speed-independent or delay-
+insensitive.
+As an illustrative example we mention that the simple Muller
+pipeline stage in figures 2.8 and 2.15 is no longer delay-insensitive if the C-
+element is replaced by the gate-level implementation shown in figure 2.5 that
+uses simple AND and OR gates. Furthermore, even simple gates are abstrac-
+tions; in CMOS the primitives are N and P transistors, and even the simplest
+gates include forks.
+In chapter 6 we will explore the design of SI control circuits in great detail
+(because theory and synthesis tools are well developed). As SI circuits ignore
+wire delays completely some care is needed when physically implementing
+these circuits. In general one might think that the zero wire-delay assumption
+is trivially satisfied in small circuits involving 10-20 gates, but this need not be
+the case: a normal place and route CAD tool might spread the gates of a small
+controller all over the chip. Even if the gates are placed next to each other
+they may have different logic thresholds on their inputs which in combination
+with slowly rising or falling signals can cause (and have caused!) circuits
+to malfunction. For static CMOS and for circuits operating with low supply
+voltages (e.g. VDD ≈ VtN + |VtP|) this is less of a problem, but for dynamic
+circuits using a larger VDD (e.g. 3.3 V or 5.0 V) the logic thresholds can be
+very different. This often overlooked problem is addressed in detail in [134].
+
+2.6.
+Test
+
+When it comes to the commercial exploitation of asynchronous circuits the
+problem of test comes to the fore. Test is a major topic in its own right, and
+it is beyond the scope of this tutorial to do anything more than mention a few
+issues and challenges. Although the following text is brief it assumes some
+knowledge of testing. The material does not constitute a foundation for the
+following chapters and it may be skipped.
+The previous discussion about Muller circuits (excited gates and the firing
+of gates), the principle of indication, and the discussion of isochronic forks
+tie in nicely with a discussion of testing for stuck-at faults. In the stuck-at
+fault model defects are modeled at the gate level as (individual) inputs and
+outputs being stuck-at-1 or stuck-at-0. The principle of indication says that all
+input signal transitions on a gate must be indicated by an output signal tran-
+sition on the gate. Furthermore, asynchronous circuits make extensive use of
+handshaking and this causes signals to exhibit cyclic transitions between 0 and
+1. In this scenario, the presence of a stuck-at fault is likely to cause the cir-
+cuit to halt; if one component stops handshaking the stall tends to “propagate”
+to neighbouring components, and eventually the entire circuit halts. Conse-
+quently, the development of a set of test patterns that exhaustively tests for all
+stuck-at faults is simply a matter of developing a set of test patterns that toggle
+all nodes, and this is generally a comparatively simple task.
+Since isochronic forks are forks where a signal transition in one or more
+branches is not indicated in the gates that take these signals as inputs, it follows
+that isochronic forks imply untestable stuck-at faults.
+Testing asynchronous circuits incurs additional problems. As we will see
+in the following chapters, asynchronous circuits tend to implement registers
+using latches rather than flip-flops. In combination with the absence of a global
+clock, this makes it less straightforward to connect registers into scan-paths.
+Another consequence of the distributed self-timed control (i.e. the lack of a
+global clock) is that it is less straightforward to single-step the circuit through
+a sequence of well-defined states. This makes it less straightforward to steer
+the circuit into particular quiescent states, which is necessary for IDDQ testing,
+– the technique that is used to test for shorts and opens which are faults that
+are typical in today’s CMOS processes.
+The extensive use of state-holding elements (such as the Muller C-element),
+together with the self-timed behaviour, makes it difficult to test the feed-back
+circuitry that implements the state holding behaviour. Delay-fault testing rep-
+resents yet another challenge.
+The above discussion may leave the impression that the problem of testing
+asynchronous circuits is largely unsolved. This is not correct. The truth is
+rather that the techniques for testing synchronous circuits are not directly ap-
+plicable. The situation is quite similar to the design of asynchronous circuits
+that we will address in detail in the following chapters. Here a mix of new
+and well-known techniques is also needed. A good starting point for reading
+about the testing of asynchronous circuits is [120]. Finally, we mention that
+testing is also touched upon in chapters 13 and 15.
+
+2.7.
+Summary
+
+This chapter introduced a number of fundamental concepts. We will now
+return to the main track of designing circuits. The reader will probably want to
+revisit some of the material in this chapter again while reading the following
+chapters.
+
+
+Chapter 3
+
+STATIC DATA-FLOW STRUCTURES
+
+In this chapter we will develop a high-level view of asynchronous design
+that is equivalent to RTL (register transfer level) in synchronous design. At
+this level the circuits may be viewed as static data-flow structures. The aim is
+to focus on the behaviour of the circuits, and to abstract away the details of the
+handshake signaling which can be considered an orthogonal implementation
+issue.
+
+3.1.
+Introduction
+
+The various handshake protocols and the associated circuit implementa-
+tion styles presented in the previous chapters are rather different. However,
+when looking at the circuits at a more abstract level – the data-flow handshake-
+channel level introduced in chapter 1 – these differences diminish, and it makes
+good sense to view the choice of handshake protocol and circuit implementa-
+tion style as low level implementation decisions that can be made largely in-
+dependently from the more abstract design decisions that establish the overall
+structure and operation of the circuit.
+Throughout this chapter we will assume a 4-phase protocol since this is
+most common. From a data-flow point of view this means that we will be
+dealing with data streams composed of alternating valid and empty values – in
+a two-phase protocol we would see only a sequence of valid values, but apart
+from that everything else would be the same. Furthermore we will be dealing
+with simple latches as storage elements. The latches are controlled according
+to the simple rule stated in chapter 1:
+
+A latch may input and store a new token (valid or empty) from its pre-
+decessor if its successor latch has input and stored the token that it was
+previously holding.
+
+Latches are the only components that initiate and take an active part in hand-
+shaking; all other components are “transparent” to the handshaking. To ease
+the distinction between latches and combinational circuits and to emphasize
+the token flow in circuit diagrams, we will use a box symbol with double verti-
+cal lines to represent latches throughout the rest of this tutorial (see figure 3.1).
+
+[Figure 3.1: five latches L0 to L4 holding the values E, V, V, E, E. L1 and L3 hold bubbles; L0, L2 and L4 hold tokens.]
+
+Figure 3.1.
+A possible state of a five stage pipeline.
+
+[Figure 3.2: (a) a three-latch ring holding a valid token, an empty token and a bubble at time t0; (b) the ring states at times t1, t2 and t3.]
+
+Figure 3.2.
+Ring: (a) a possible state; and (b) a sequence of data transfers.
+
+3.2.
+Pipelines and rings
+
+Figure 3.1 shows a snapshot of a pipeline composed of five latches. The
+“box arrows” represent channels or links consisting of request, acknowledge
+and data signals (as explained on page 5). The valid value in L1 has just
+been copied into L2 and the empty value in L3 has just been copied into
+L4. This means that L1 and L3 are now holding old duplicates of the val-
+ues now stored in L2 and L4. Such old duplicates are called “bubbles”, and the
+newest/rightmost valid and empty values are called “tokens”. To distinguish
+tokens from bubbles, tokens are represented with a circle around the value. In
+this way a latch may hold a valid token, an empty token or a bubble. Bubbles
+can be viewed as catalysts: a bubble allows a token to move forward, and in
+supporting this the bubble moves backwards one step.
+Any circuit should have one or more bubbles, otherwise it will be in a dead-
+lock state. This is a matter of initializing the circuit properly, and we will
+elaborate on this shortly. Furthermore, as we will see later, the number of
+bubbles also has a significant impact on performance.
+In a pipeline with at least three latches, it is possible to connect the output
+of the last stage to the input of the first, forming a ring in which data tokens
+can circulate autonomously. Assuming the ring is initialized as shown in fig-
+ure 3.2(a) at time t0 with a valid token, an empty token and a bubble, the first
+steps of the circulation process are shown in figure 3.2(b), at times t1, t2 and
+t3. Rings are the backbone structures of circuits that perform iterative compu-
+tations. The cycle time of the ring in figure 3.2 is 6 “steps” (the state at t6 will
+be identical to the state at t0). Both the valid token and the empty token have
+to make one round trip. A round trip involves 3 “steps” and as there is only
+one bubble to support this the cycle time is 6 “steps”. It is interesting to note
+that a 4-stage ring initialized to hold a valid token, an empty token and two
+bubbles can iterate in 4 “steps”. It is also interesting to note that the addition
+of one more latch does not re-time the circuit or alter its function (as would be
+the case in a synchronous circuit); it is still a ring in which a single data token
+is circulating.
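+
+The token and bubble rule can be tried out with a minimal sketch. It assumes a simplified model that is not part of the original text: a bubble is represented as None (the old duplicate value it holds is ignored), and in each "step" every token whose successor holds a bubble moves forward. The names step and cycle_time are hypothetical.

```python
from typing import List, Optional

# "V" or "E" denotes a valid or empty token; None denotes a bubble.
# Assumption: the stale value held by a bubble is not modelled.
Latch = Optional[str]


def step(ring: List[Latch]) -> List[Latch]:
    """Advance the ring one step: each token whose successor is a bubble moves."""
    n = len(ring)
    nxt = list(ring)
    for i, value in enumerate(ring):
        succ = (i + 1) % n
        if value is not None and ring[succ] is None:
            nxt[succ] = value  # the token moves forward ...
            nxt[i] = None      # ... and the bubble moves backwards
    return nxt


def cycle_time(ring: List[Latch]) -> int:
    """Number of steps after which the ring state repeats (ignoring any start-up transient)."""
    seen = {}
    state = tuple(ring)
    t = 0
    while state not in seen:
        seen[state] = t
        state = tuple(step(list(state)))
        t += 1
    return t - seen[state]


if __name__ == "__main__":
    print(cycle_time(["V", "E", None]))        # 3-stage ring, one bubble
    print(cycle_time(["V", None, "E", None]))  # 4-stage ring, two bubbles
```

+Under these assumptions the 3-stage ring repeats after 6 steps and the 4-stage ring with two bubbles after 4 steps, in line with the figures quoted above. A ring without bubbles never changes state, which is the deadlock situation mentioned earlier.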
+
+3.3.
+Building blocks
+
+Figure 3.3 shows a minimum set of components that is sufficient to im-
+plement asynchronous circuits (static data-flow structures with deterministic
+behaviour, i.e. without arbiters). The components can be grouped in four cat-
+egories as explained below. In the next section we will see examples of the
+token-flow behaviour in structures composed of these components. Compo-
+nents for mutual exclusion and arbitration are covered in section 5.8.
+
+Latches provide storage for variables and implement the handshaking that
+supports the token flow. In addition to the normal latch a number of
+degenerate latches are often needed: a latch with only an output channel
+is a source that produces tokens (with the same constant value), and a
+latch with only an input channel is a sink that consumes tokens. Fig-
+ure 2.9 shows the implementation of a 4-phase bundled-data latch, fig-
+ure 2.11 shows the implementation of a 2-phase bundled-data latch, and
+figures 2.12 – 2.13 show the implementation of a 4-phase dual-rail latch.
+
+Function blocks are the asynchronous equivalent of combinatorial circuits.
+They are transparent/passive from a handshaking point of view. A func-
+tion block will: (1) wait for tokens on its inputs (an implicit join), (2)
+perform the required combinatorial function, and (3) issue tokens on its
+outputs. Both empty and valid tokens are handled in this way. Some
+implementations assume that the inputs have been synchronized. In this
+case it may be necessary to use an explicit join component. The imple-
+mentation of function blocks is addressed in detail in chapter 5.
+
+[Figure 3.3: symbols for the latch, source, sink, function block, fork, join, merge, MUX and DEMUX components (with alternative symbols). A function block behaves like a join, combinational logic and a fork carried out in sequence.]
+
+Figure 3.3.
+A minimum and, for most cases, sufficient set of asynchronous components.
+
+Unconditional flow control: Fork and join components are used to handle
+parallel threads of computation. In engineering terms, forks are used
+when the output from one component is input to more components, and
+joins are used when data from several independent channels needs to
+be synchronized – typically because they are (independent) inputs to a
+circuit. In the following we will often omit joins and forks from circuit
+diagrams: the fan-out of a channel implies a fork, and the fan-in of
+several channels implies a join.
+
+A merge component has two or more input channels and one output
+channel. Handshakes on the input channels are assumed to be mutually
+exclusive and the merge relays input tokens/handshakes to the output.
+
+Conditional flow control: MUX and DEMUX components perform the usual
+functions of selecting among several inputs or steering the input to one
+of several outputs. The control input is a channel just like the data in-
+puts and outputs. A MUX will synchronize the control channel and the
+relevant input channel and send the input data to the data output. The
+other input channel is ignored. Similarly a DEMUX will synchronize
+the control and data input channels and steer the input to the selected
+output channel.
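+
+The behaviour of the conditional flow control components can be sketched on token streams. This is a behavioural illustration only, under assumptions that are not in the original text: channels are modelled as Python lists of valid tokens (empty tokens and handshake signals are not modelled), there are exactly two data channels, and the names mux and demux are hypothetical.

```python
from collections import deque
from typing import Iterable, List, Sequence, Tuple


def mux(control: Iterable[int], in0: Sequence, in1: Sequence) -> List:
    """For each control token, consume one token from the selected input channel."""
    inputs = (deque(in0), deque(in1))
    out = []
    for sel in control:
        # Only the selected channel takes part; the other input is left untouched.
        out.append(inputs[sel].popleft())
    return out


def demux(control: Iterable[int], data: Iterable) -> Tuple[List, List]:
    """Pair each data token with a control token and steer it to the selected output."""
    outs: Tuple[List, List] = ([], [])
    for sel, token in zip(control, data):
        outs[sel].append(token)
    return outs


if __name__ == "__main__":
    print(mux([0, 1, 1, 0], ["a", "b"], ["x", "y"]))
    print(demux([0, 1, 1, 0], [1, 2, 3, 4]))
```

+The sketch highlights the asymmetry described above: the MUX synchronizes the control channel with one input only, whereas the DEMUX always consumes one control token and one data token together.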
+
+As mentioned before the latches implement the handshaking and thereby the
+token flow in a circuit. All other components must be transparent to the
+handshaking. This has significant implications for the implementation of these
+components!
+
+3.4.
+A simple example
+
+Figure 3.4 shows an example of a circuit composed of latches, forks and
+joins that we will use to illustrate the token-flow behaviour of an asynchronous
+circuit. The structure can be described as pipeline segments and a ring con-
+nected into a larger structure using fork and join components.
+
+Figure 3.4.
+An example asynchronous circuit composed of latches, forks and joins.
+
+Assume that the circuit is initialized as shown in figure 3.5 at time t0: all
+latches are initialized to the empty value except for the bottom two latches in
+the ring that are initialized to contain a valid value and an empty value. Values
+enclosed in circles are tokens and the rest are bubbles. Assume further that
+the left and right hand environments (not shown) take part in the handshakes
+that the circuit is prepared to perform. Under these conditions the operation
+of the circuit (i.e. the flow of tokens) is as illustrated in the snapshots labeled
+t0 to t11. The left hand environment performs one handshake cycle inputting a
+valid value followed by an empty value. In a similar way the right environment
+takes part in one handshake cycle and consumes a valid value and an empty
+value.
+Because the flow of tokens is controlled by local handshaking the circuit
+could exhibit many other behaviours. For example, at time t5 the circuit is
+ready to accept a new valid value from its left environment. Notice also that
+if the initial state had no tokens in the ring, then the circuit would deadlock
+after a few steps. It is highly recommended that the reader tries to play the
+token-bubble data-flow game; perhaps using the same circuit but with different
+initial states.
+
+[Figure 3.5: snapshots of the circuit from figure 3.4 at successive times starting at t0, showing the valid (V) or empty (E) value held in each latch. Legend: circled values are valid or empty tokens; uncircled values are bubbles.]
+
+Figure 3.5.
+A possible operation sequence of the example circuit from figure 3.4.
+
+3.5.
+Simple applications of rings
+
+This section presents a few simple and obvious circuits based on a single
+ring.
+
+3.5.1
+Sequential circuits
+
+Figure 3.6 shows a straightforward implementation of a finite state machine.
+Its structure is similar to a synchronous finite state machine; it consists of a
+function block and a ring that holds the current state. The machine accepts an
+“input token” that is joined with the “current state token”. Then the function
+block computes the output and the next state, and finally the fork splits these
+into an “output token” and a “next state token.”
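+
+The join, compute and fork sequence of the state machine can be sketched behaviourally. The sketch assumes valid tokens only and a hypothetical example function block (a running parity); the names run_fsm and parity are illustrative and not taken from the original text.

```python
from typing import Callable, Iterable, List, Tuple


def run_fsm(f: Callable[[int, int], Tuple[int, int]],
            initial_state: int,
            inputs: Iterable[int]) -> List[int]:
    """Feed input tokens through function block f; the state circulates as in a ring."""
    state = initial_state  # the "current state token" initially held in the ring
    outputs = []
    for token in inputs:
        # Join input and state, compute, then fork into output and next state.
        output, state = f(token, state)
        outputs.append(output)
    return outputs


def parity(bit: int, state: int) -> Tuple[int, int]:
    """Hypothetical function block: output and next state are the running parity."""
    nxt = state ^ bit
    return nxt, nxt


if __name__ == "__main__":
    print(run_fsm(parity, 0, [1, 0, 1, 1]))
```

+Each loop iteration corresponds to one input token meeting the current state token; the next state token then travels around the ring to meet the following input.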
+
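+[Figure 3.6: a function block F with Input and Output channels; the next state is fed back to F as the current state through a ring of latches holding V, E, E.]
+
+Figure 3.6.
+Implementation of an asynchronous finite state machine using a ring.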
+
+3.5.2
+Iterative computations
+
+A ring can also be used to build circuits that implement iterative computa-
+tions. Figure 3.7 shows a template circuit. The idea is that the circuit will:
+(1) accept an operand, (2) sequence through the same operation a number of
+times until the computation terminates and (3) output the result. The necessary
+control is not shown. The figure shows one particular implementation. Possible
+variations involve locating the latches and the function block differently
+in the ring as well as decomposing the function block and putting these (simpler)
+function blocks between more latches. In [156] Ted Williams presents
+a circuit that performs division using a self-timed 5-stage ring. This design
+was later used in a floating point coprocessor in a commercial microprocessor
+[157].
+
+[Figure 3.7: a ring containing a function block F and latches initialized to empty (E), with an Operand(s) input channel and a Result output channel.]
+
+Figure 3.7.
+Implementation of an iterative computation using a ring.
+
+3.6.
+FOR, IF, and WHILE constructs
+
+Very often the desired function of a circuit is expressed using a program-
+ming language (C, C++, VHDL, Verilog, etc.). In this section we will show
+implementation templates for a number of typical conditional structures and
+loop structures. A reader who is familiar with control-data-flow graphs, per-
+haps from high-level synthesis, will recognize the great similarities between
+asynchronous circuits and control-data-flow graphs [36, 127].
+
+if <cond> then <body1> else <body2>
+An asynchronous circuit template
+for implementing an if statement is shown in figure 3.8(a). The data-type of
+the input and output channels to the if circuit is a record containing all vari-
+ables in the <cond> expression and the variables manipulated by <body1> and
+<body2>. The data-type of the output channel from the cond block is a Boolean
+that controls the DEMUX and MUX components. The FORK associated with
+this channel is not shown.
+Since the execution of <body1> and <body2> is mutually exclusive it is
+possible to replace the controlled MUX in the bottom of the circuit with a
+simpler MERGE as shown in figure 3.8(b). The circuit in figure 3.8 contains
+no feedback loops and no latches – it can be considered a large function block.
+The circuit can be pipelined for improved performance by inserting latches.
+
+[Figure 3.8: (a) the {variables} input is steered by a DEMUX to body1 or body2 and collected by a MUX, both controlled by the cond block; (b) the same circuit with the MUX replaced by a merge.]
+
+Figure 3.8.
+A template for implementing if statements.
+
+for <count> do <body>
+An asynchronous circuit template for implementing
+a for statement is shown in figure 3.9. The data-type of the input channel to
+the for circuit is a record containing all variables manipulated in the <body>
+and the loop count, <count>, that is assumed to be a non-negative integer. The
+data-type of the output channel is a record containing all variables manipulated
+in the <body>.
+
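+[Figure 3.9: a loop containing a MUX, the body block and a DEMUX; the input channel carries {variables}, count and the output channel carries {variables}. The Boolean output of the count block controls the MUX and DEMUX, and the two latches on the MUX control input hold the initial tokens 0 and E.]
+
+Figure 3.9.
+A template for implementing for statements.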
+
+The data-type of the output channel from the count block is a Boolean,
+and one handshake on the input channel of the count block encloses <count>
+handshakes on the output channel: <count> - 1 handshakes providing the
+Boolean value “1” and one (final) handshake providing the Boolean value “0”.
+Notice the two latches on the control input to the MUX. They must be initial-
+ized to contain a data token with the value “0” and an empty token in order to
+enable the for circuit to read the variables into the loop.
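+
+The Boolean sequence issued by the count block, as described above, can be sketched as a generator. The sketch assumes count is at least 1, since the description always ends with one final "0" handshake; the name count_block is hypothetical.

```python
from typing import Iterator


def count_block(count: int) -> Iterator[int]:
    """Yield the control tokens for one activation: count - 1 ones, then a final zero."""
    if count < 1:
        # Assumption of this sketch: at least one output handshake per activation.
        raise ValueError("this sketch assumes count >= 1")
    for _ in range(count - 1):
        yield 1  # keep the variables circulating in the loop
    yield 0      # final handshake: steer the variables to the output


if __name__ == "__main__":
    print(list(count_block(4)))
```

+For a loop count of 4 this gives three "1" tokens followed by one "0" token, i.e. four handshakes on the output channel for one handshake on the input channel.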
+After executing the for statement once, the last handshake of the count block
+will steer the variables in the loop onto the output channel and put a “0” token
+and an empty token into the two latches, thereby preparing the for circuit for
+a subsequent activation. The FORK in the input and the FORK on the output
+of the count block are not shown. Similarly a number of latches are omitted.
+Remember: (1) all rings must contain at least 3 latches and (2) for each latch
+initialized to hold a data token there must also be a latch initialized to hold an
+empty token (when using 4-phase handshaking).
+
+while <cond> do <body>
+An asynchronous circuit template for implement-
+ing a while statement is shown in figure 3.10. Inputs to (and outputs from) the
+circuit are the variables in the <cond> expression and the variables manipu-
+lated by <body>. As before in the for circuit, it is necessary to put two latches
+initialized to contain a data token with the value “0” and an empty token on
+the control input of the MUX. And as before a number of latches are omitted
+in the two rings that constitute the while circuit. When the while circuit termi-
+nates (after zero or more iterations) data is steered out of the loop and this also
+causes the latches on the MUX control input to become initialized properly for
+the subsequent activation of the circuit.
+
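+[Figure 3.10: two rings sharing a MUX and a DEMUX; the cond block evaluates {variables} and its Boolean output controls the MUX and DEMUX, which either route {variables} through the body block or steer them to the output. The two latches on the MUX control input hold the initial tokens 0 and E.]
+
+Figure 3.10.
+A template for implementing while statements.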
+
+3.7.
+A more complex example: GCD
+
+Using the templates just introduced we will now design a small example
+circuit, GCD, that computes the greatest common divisor of two integers. GCD
+is often used as an introductory example, and figure 3.11 shows a programming
+language specification of the algorithm.
+In addition to its role as a design example in the current context, GCD can
+also serve to illustrate the similarities and differences between different design
+techniques. In chapter 8 we will use the same example to illustrate the Tangram
+language and the associated syntax-directed compilation process (section 8.3.3
+on pages 127–128).
+The implementation of GCD is shown in figure 3.12. It consists of a while
+template whose body is an if template. Figure 3.12 shows the circuit including
+all the necessary latches (with their initial states). The implementation makes
+no attempt at sharing resources – it is a direct mapping following the imple-
+mentation templates presented in the previous section.
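+
+The composition "a while template whose body is an if template" can be mirrored in software. The sketch below is a behavioural model only, not a description of the circuit: the templates are modelled as higher-order functions on a record of variables, the inputs are assumed to be positive integers, and all names are hypothetical.

```python
from typing import Callable, Dict

Vars = Dict[str, int]             # the record of variables carried by a token
Block = Callable[[Vars], Vars]    # a block maps an input record to an output record
Cond = Callable[[Vars], bool]


def if_template(cond: Cond, body1: Block, body2: Block) -> Block:
    """Steer the record to body1 or body2 depending on cond (DEMUX, then MUX or merge)."""
    def block(v: Vars) -> Vars:
        return body1(v) if cond(v) else body2(v)
    return block


def while_template(cond: Cond, body: Block) -> Block:
    """Circulate the record through body for as long as cond holds."""
    def block(v: Vars) -> Vars:
        while cond(v):
            v = body(v)
        return v
    return block


gcd_block = while_template(
    cond=lambda v: v["a"] != v["b"],
    body=if_template(
        cond=lambda v: v["a"] > v["b"],
        body1=lambda v: {"a": v["a"] - v["b"], "b": v["b"]},
        body2=lambda v: {"a": v["a"], "b": v["b"] - v["a"]},
    ),
)


def gcd(a: int, b: int) -> int:
    """Greatest common divisor by repeated subtraction (positive inputs assumed)."""
    if a <= 0 or b <= 0:
        raise ValueError("this sketch assumes positive integers")
    return gcd_block({"a": a, "b": b})["a"]


if __name__ == "__main__":
    print(gcd(12, 18))
```

+As in the circuit, no resources are shared: each template instance is a separate block, and the record of variables plays the role of the token circulating in the loop.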
+
+input (a,b);
+while a ≠ b do
+    if a > b
+        then a := a - b;
+        else b := b - a;
+output (a);
+
+Figure 3.11.
+A programming language specification of GCD.
+
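+[Figure 3.12: a while template (condition block A==B) whose body is an if template (condition block A>B, function blocks A-B and B-A), shown with all necessary latches and their initial states. The input channel carries A,B and the output channel carries GCD(A,B).]
+
+Figure 3.12.
+An asynchronous circuit implementation of GCD.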
+
+3.8.
+Pointers to additional examples
+
+3.8.1
+A low-power filter bank
+
+In [103] we reported on the design of a low-power IFIR filter bank for a
+digital hearing aid. It is a circuit that was designed following the approach
+presented in this chapter. The paper also provides some insight into the design
+of low power circuits as well as the circuit level implementation of memory
+structures and datapath units.
+
+3.8.2
+An asynchronous microprocessor
+
+In [23] we reported on the design of a MIPS microprocessor, called ARISC.
+Although there are many details to be understood in a large-scale design like
+a microprocessor, the basic architecture shown in figure 3.13 can be under-
+stood as a simple data-flow structure. The solid-black rectangles represent
+latches, the box-arrows represent channels, and the text-boxes represent function
+blocks (combinatorial circuits).
+The processor is a simple pipelined design with instructions retiring in program
+order. It consists of a fetch-decode-issue ring with a fixed number of tokens.
+This ensures a fixed instruction prefetch depth. The issue stage forks decoded
+instructions into the execute pipeline and initiates the fetch of one more
+instruction. Register forwarding is avoided by a locking mechanism: when an
+instruction is issued for execution the destination register is locked until the
+write-back has taken place. If a subsequent instruction has a read-after-write
+data hazard this instruction is stalled until the register is unlocked. The tokens
+flowing in the design contain all operands and control signals related to the
+execution of an instruction, i.e. similar to what is stored in a pipeline stage in a
+synchronous processor. For further information the interested reader is referred
+to [23]. Other asynchronous microprocessors are based on similar principles.
+
+[Figure 3.13: block diagram whose blocks are labeled PC Read, PC ALU, Inst. Mem., Decode, Issue, Flush, REG Read, Lock, UnLock, Arith., Logic, Shift, CP0, On Bolt, Data Mem. and REG Write.]
+
+Figure 3.13.
+Architecture of the ARISC microprocessor.
+
+3.8.3
+A fine-grain pipelined vector multiplier
+
+The GCD circuit and the ARISC presented in the preceding sections use bit-
+parallel communication channels. An example of a static data-flow structure
+that uses 1-bit channels and fine grain pipelining is the serial-parallel vector
+multiplier design reported in [124, 125]. Here all necessary word-level syn-
+chronization is performed implicitly by the function blocks. The large number
+of interacting rings and pipeline segments in the static data-flow representa-
+tion of the design makes it rather complex. After reading the next chapter on
+performance analysis the interested reader may want to look at this design; it
+contains several interesting optimizations.
+
+3.9.
+Summary
+
+This chapter developed a high-level view of asynchronous design that is
+equivalent to RTL (register transfer level) in synchronous design – static data-flow
+structures. The next chapter addresses performance analysis at this level of
+abstraction.
+
+
+Chapter 4
+
+PERFORMANCE
+
+In this chapter we will address the performance analysis and optimization
+of asynchronous circuits. The material extends and builds upon the “static
+data-flow structures view” introduced in the previous chapter.
+
+4.1.
+Introduction
+
+In a synchronous circuit, performance analysis and optimization is a matter
+of finding the longest latency signal path between two registers; this determines
+the period of the clock signal. The global clock partitions the circuit into many
+combinatorial circuits that can be analyzed individually. This is known as static
+timing analysis and it is a rather simple task, even for a large circuit.
+For an asynchronous circuit, performance analysis and optimization is a
+global and therefore much more complex problem. The use of handshaking
+makes the timing in one component dependent on the timing of its neighbours,
+which again depends on the timing of their neighbours, etc. Furthermore, the
+performance of a circuit does not depend only on its structure, but also on
+how it is initialized and used by its environment. The performance of an asyn-
+chronous circuit can even exhibit transients and oscillations.
+We will first develop a qualitative understanding of the dynamics of the to-
+ken flow in asynchronous circuits. A good understanding of this is essential for
+designing circuits with good performance. We will then introduce some quan-
+titative performance parameters that characterize individual pipeline stages and
+pipelines and rings composed of identical pipeline stages. Using these param-
+eters one can make first-level design decisions. Finally we will address how
+more complex and irregular structures can be analyzed.
+The following text represents a major revision of material from [124] and
+it is based on original work by Ted Williams [153, 154, 155]. If consulting
+these references the reader should be aware of the exact definition of a token.
+Throughout this book a token is defined as a valid data value or an empty data
+value, whereas in the cited references (that deal exclusively with 4-phase hand-
+shaking) a token is a valid-empty data pair. The definition used here accentu-
+ates the similarity between a token in an asynchronous circuit and the token in
+
+a Petri net. Furthermore it provides some unification between 4-phase hand-
+shaking and 2-phase handshaking – 2-phase handshaking is the same game,
+but without empty-tokens.
+In the following we will assume 4-phase handshaking, and the examples we
+provide all use bundled-data circuits. It is left as an exercise for the reader
+to make the simple adaptations that are necessary for dealing with 2-phase
+handshaking.
+
+4.2.
+A qualitative view of performance
+
+4.2.1
+Example 1: A FIFO used as a shift register
+
+The fundamental concepts can be illustrated by a simple example: a FIFO
+composed of a number of latches in which there are N valid tokens separated
+by N empty tokens, and whose environment alternates between reading a token
+from the FIFO and writing a token into the FIFO (see figure 4.1(a)). In this way
+the number of tokens in the FIFO is invariant. This example is relevant because
+many designs use FIFOs in this way, and because it models the behaviour of
+shift registers as well as rings – structures in which the number of tokens is
+also invariant.
+A relevant performance figure is the throughput, which is the rate at which
+tokens are input to or output from the shift register. This figure is inversely
+proportional to the time it takes to shift the contents of the chain of latches one
+position to the right.
+Figure 4.1(b) illustrates the behaviour of an implementation in which there
+are 2 latches per valid token (2N in total) and figure 4.1(c) illustrates the behaviour of an
+implementation in which there are 3 latches per valid token (3N in total). In both examples
+the number of valid tokens in the FIFO is N = 3, and the only difference
+In figure 4.1(b) at time t1 the environment reads the valid token, D1, as
+indicated by the solid channel symbol. This introduces a bubble that enables
+data transfers to take place one at a time (t2–t5). At time t6 the environment
+inputs a valid token, D4, and at this point all elements have been shifted one
+position to the right. Hence, the time used to move all elements one place to
+the right is proportional to the number of tokens, in this case 2N = 6 time steps.
+Adding more latches increases the number of bubbles, which in turn increases
+the number of data transfers that can take place simultaneously, thereby
+improving the performance. In figure 4.1(c) the shift register has 3N stages and
+therefore one bubble per valid-empty token-pair. The effect of this is that N
+data transfers can occur simultaneously and the time used to move all elements
+one place to the right is constant: 2 time steps.
+
+[Figure 4.1. A FIFO and its environment: (a) a FIFO and its environment; (b) N data
+tokens and N empty tokens in 2N stages; (c) N data tokens and N empty tokens in 3N
+stages. The environment alternates between reading a token from the FIFO and
+writing a token into the FIFO.]
+
+If the number of latches were increased to 4N there would be one token per
+bubble, and the time to move all tokens one step to the right would be only
+1 time step. In this situation the pipeline is half full and the latches holding
+bubbles act as slave latches (relative to the latches holding tokens). Increasing
+the number of bubbles further would not increase the performance further. Fi-
+nally, it is interesting to notice that the addition of just one more latch holding
+a bubble to figure 4.1(b) would double the performance. The asynchronous
+designer has great freedom in trading more latches for performance.
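+The relationship between bubbles and shift time described above can be checked with a
+small simulation. The sketch below is an illustration only and rests on assumptions
+that are not part of the text: the FIFO and its environment are abstracted as a ring,
+the bubbles are spread evenly, a token advances in a time step only if the latch ahead
+of it held a bubble at the start of that step, and the helper name
+steps_to_shift_once is hypothetical.

```python
def steps_to_shift_once(tokens_per_bubble, groups):
    """Count time steps until every token in a ring has advanced at least once.

    The ring is `groups` repetitions of `tokens_per_bubble` tokens followed by
    one bubble. In each time step a token advances only if the latch ahead of
    it held a bubble at the start of that step.
    """
    ring = ([True] * tokens_per_bubble + [False]) * groups  # True = token
    size = len(ring)
    # Give each token an identity so that its first move can be recorded.
    cells = [i if has_token else None for i, has_token in enumerate(ring)]
    waiting = {c for c in cells if c is not None}
    steps = 0
    while waiting:
        nxt = list(cells)
        for i in range(size):
            j = (i + 1) % size
            if cells[i] is not None and cells[j] is None:
                nxt[j], nxt[i] = cells[i], None
                waiting.discard(cells[i])
        cells = nxt
        steps += 1
    return steps


N = 3  # valid tokens; there are also N empty tokens, so 2N tokens in total
print(steps_to_shift_once(2 * N, 1))  # one bubble, as in figure 4.1(b): 6
print(steps_to_shift_once(2, N))      # 3N stages, as in figure 4.1(c): 2
print(steps_to_shift_once(1, 2 * N))  # 4N stages: 1
print(steps_to_shift_once(N, 2))      # figure 4.1(b) plus one more bubble: 3
```

+Under these assumptions the step counts agree with the discussion: 2N steps with a
+single bubble, 2 steps with 3N stages, 1 step with 4N stages, and half of 2N when a
+second bubble is added to the configuration of figure 4.1(b).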
+As the number of bubbles in a design depends on the number of latches per
+token, the above analysis illustrates that performance optimization of a given
+circuit is primarily a task of structural modification – circuit level optimization
+like transistor sizing is of secondary importance.
+
+4.2.2
+Example 2: A shift register with parallel load
+
+In order to illustrate another point – that the distribution of tokens and bub-
+bles in a circuit can vary over time, depending on the dynamics of the circuit
+and its environment – we offer another example: a shift register with parallel
+load. Figure 4.2 shows an initial design of a 4-bit shift register. The circuit has
+a bit-parallel input channel, din[3:0], connecting it to a data producing envi-
+ronment. It also has a 1-bit data channel, do, and a 1-bit control channel, ctl,
+connecting it to a data consuming environment. Operation is controlled by the
+data consuming environment which may request the circuit to: (ctl = 0) perform
+a parallel load and to provide the least significant bit from the bit-parallel
+channel on the do channel, or (ctl = 1) to perform a right shift and provide
+the next bit on the do channel. In this way the data consuming environment
+always inputs a control token (valid or empty) to which the circuit always re-
+sponds by outputting a data token (valid or empty). During a parallel load, the
+previous content of the shift register is steered into the “dead end” sink-latches.
+During a right shift the constant 0 is shifted into the most significant position
+– corresponding to a logical right shift. The data consuming environment is
+not required to read all the input data bits, and it may continue reading zeros
+beyond the most significant input data bit.
+The initial design shown in figure 4.2 suffers from two performance-limiting
+shortcomings: firstly, it has the same problem as the shift register in
+figure 4.1(b) – there are too few bubbles, and the peak data rate on the
+bit-serial output is inversely proportional to the length of the shift register. Secondly,
+the control signal is forked to all of the MUXes and DEMUXes in the design.
+This implies a high fan-out of the request and data signals (which requires a
+couple of buffers) and synchronization of all the individual acknowledge sig-
+nals (which requires a C-element with many inputs, possibly implemented as
+a tree of C-elements). The first problem can be avoided by adding a 3rd latch
+to the datapath in each stage of the circuit corresponding to the situation in
+figure 4.1(c), but if the extra latches are added to the control path instead, as
+shown in figure 4.3(a) on page 46, they will solve both problems.
+
+[Figure 4.2. Initial design of the shift register with parallel load.]
+
+[Figure 4.3. Improved design of the shift register with parallel load: panels (a),
+(b) and (c).]
+
+This improved design exhibits an interesting and illustrative dynamic be-
+haviour: initially, the data latches are densely packed with tokens and all the
+control latches contain bubbles, figure 4.3(a). The first step of the parallel load
+cycle is shown in figure 4.3(b), and figure 4.3(c) shows a possible state after
+the data consuming environment has read a couple of bits. The most-significant
+stage is just about to perform its “parallel load” and the bubbles are now in the
+chain of data latches. If at this point the data consuming environment paused,
+the tokens in the control path would gradually disappear while tokens in the
+datapath would pack again. Note that at any time the total number of tokens in
+the circuit is constant!
+
+4.3.
+Quantifying performance
+
+4.3.1
+Latency, throughput and wavelength
+
+When the overall structure of a design is being decided, it is important to
+determine the optimal number of latches or pipeline stages in the rings and
+pipeline fragments from which the design is composed. In order to establish a
+basis for first order design decisions, this section will introduce some quantita-
+tive performance parameters. We will restrict the discussion to 4-phase hand-
+shaking and bundled-data circuit implementations and we will consider rings
+with only a single valid token. Subsection 4.3.4, which concludes this section
+on performance parameters, will comment on adapting to other protocols and
+implementation styles.
+The performance of a pipeline is usually characterized by two parameters:
+latency and throughput (or its inverse called period or cycle time). For an asyn-
+chronous pipeline a third parameter, the dynamic wavelength, is important as
+well. With reference to figure 4.4 and following [153, 154, 155] these param-
+eters are defined as follows:
+
+Latency: The latency is the delay from the input of a data item until the corre-
+sponding output data item is produced. When data flows in the forward
+direction, acknowledge signals propagate in the reverse direction. Con-
+sequently two parameters are defined:
+
+The forward latency, $L_f$, is the delay from new data on the input of
+a stage (Data[i-1] or Req[i-1]) to the production of the corresponding
+output (Data[i] or Req[i]) provided that the acknowledge
+signals are in place when data arrives. $L_{f,V}$ and $L_{f,E}$ denote the
+latencies for propagating a valid token and an empty token respectively.
+It is assumed that these latencies are constants, i.e. that
+they are independent of the value of the data. [As forward propagation
+of an empty token does not “compute” it may be desirable
+to minimize $L_{f,E}$. In the 4-phase bundled-data approach this can
+be achieved through the use of an asymmetric delay element.]
+
+The reverse latency, $L_r$, is the delay from receiving an acknowledge
+from the succeeding stage (Ack[i+1]) until the corresponding
+acknowledge is produced to the preceding stage (Ack[i]) provided
+that the request is in place when the acknowledge arrives. $L_{r\uparrow}$ and
+$L_{r\downarrow}$ denote the latencies of propagating Ack↑ and Ack↓ respectively.
+
+[Figure 4.4. Generic pipelines for definition of performance parameters: a
+bundled-data pipeline and a dual-rail pipeline, each stage consisting of a latch L[i]
+and a function block F[i].]
+
+Period: The period, P, is the delay between the input of a valid token (fol-
+lowed by its succeeding empty token) and the input of the next valid
+token, i.e. a complete handshake cycle. For a 4-phase protocol this in-
+volves: (1) forward propagation of a valid data value, (2) reverse propa-
+gation of acknowledge, (3) forward propagation of the empty data value,
+and (4) reverse propagation of acknowledge. Therefore a lower bound
+on the period is:
+$P \geq L_{f,V} + L_{r\uparrow} + L_{f,E} + L_{r\downarrow}$    (4.1)
+
+Many of the circuits we consider in this book are symmetric, i.e.
+$L_{f,V} = L_{f,E}$ and $L_{r\uparrow} = L_{r\downarrow}$, and for these circuits the period is simply:
+
+$P = 2L_f + 2L_r$    (4.2)
+
+We will also consider circuits where $L_{f,V} > L_{f,E}$ and, as we will see in
+section 4.4.1 and again in section 7.3, the actual implementation of the
+latches may lead to a period that is larger than the minimum possible
+given by equation 4.1. In section 4.4.1 we analyze a pipeline whose
+period is:
+
+$P = 2L_r + 2L_{f,V}$    (4.3)
+
+Throughput: The throughput, T, is the number of valid tokens that flow
+through a pipeline stage per unit time: $T = 1/P$.
+
+Dynamic wavelength: The dynamic wavelength, Wd, of a pipeline is the number
+of pipeline stages that a forward-propagating token passes through
+during P:
+
+$W_d = P / L_f$    (4.4)
+
+Explained differently: Wd is the distance – measured in pipeline stages
+– between successive valid or empty tokens, when they flow unimpeded
+down a pipeline. Think of a valid token as the crest of a wave and its
+associated empty token as the trough of the wave. If $L_{f,V} \neq L_{f,E}$ the
+average forward latency $L_f = \frac{1}{2}(L_{f,V} + L_{f,E})$ should be used in the above
+equation.
+
+Static spread: The static spread, S, is the distance – measured in pipeline
+stages – between successive valid (or empty) tokens in a pipeline that is
+full (i.e. contains no bubbles). Sometimes the term occupancy is used;
+this is the inverse of S.
+
+4.3.2
+Cycle time of a ring
+
+The parameters defined above are local performance parameters that char-
+acterize the implementation of individual pipeline stages. When a number of
+pipeline stages are connected to form a ring, the following parameter is rele-
+vant:
+
+Cycle time: The cycle time of a ring, TCycle, is the time it takes for a token
+(valid or empty) to make one round trip through all of the pipeline stages
+in the ring. To achieve maximum performance (i.e. minimum cycle
+time), the number of pipeline stages per valid token must match the dynamic
+wavelength, in which case TCycle = P. If the number of pipeline
+stages is smaller, the cycle time will be limited by the lack of bubbles,
+and if there are more pipeline stages the cycle time will be limited by
+the forward latency through the pipeline stages. In [153, 154, 155] these
+two modes of operation are called bubble limited and data limited, re-
+spectively.
+
+
+[Figure 4.5. Cycle time of a ring as a function of the number of pipeline stages in it.
+Axes: TCycle versus N. For N < Wd (bubble limited) TCycle = 2N/(N - 2) Lr; for N > Wd
+(data limited) TCycle = N Lf; the minimum TCycle = P is marked at N = Wd.]
+
+The cycle time of an N-stage ring in which there is one valid token,
+one empty token and N - 2 bubbles can be computed from one of the
+following two equations (illustrated in figure 4.5):
+
+When $N \geq W_d$ the cycle time is limited by the forward latency
+through the N stages:
+
+$T_{Cycle}(\text{DataLimited}) = N \cdot L_f$    (4.5)
+
+If $L_{f,V} \neq L_{f,E}$ use $L_f = \max\{L_{f,V}; L_{f,E}\}$.
+
+When $N \leq W_d$ the cycle time is limited by the reverse latency. With
+N pipeline stages, one valid token and one empty token, the ring
+contains N - 2 bubbles, and as a cycle involves 2N data transfers
+(N valid and N empty), the cycle time becomes:
+
+$T_{Cycle}(\text{BubbleLimited}) = \frac{2N}{N-2} L_r$    (4.6)
+
+If $L_{r\uparrow} \neq L_{r\downarrow}$ use $L_r = \frac{1}{2}(L_{r\uparrow} + L_{r\downarrow})$.
+
+For the sake of completeness it should be mentioned that a third possible
+mode of operation called control limited exists for some circuit config-
+urations [153, 154, 155]. This is, however, not relevant to the circuit
+implementation configurations presented in this book.
+
+The topic of performance analysis and optimization has been addressed in
+some more recent papers [31, 90, 91, 37] and in some of these the term “slack
+matching” is used (referring to the process of balancing the timing of forward
+flowing tokens and backward flowing bubbles).
+
+
+
+4.3.3
+Example 3: Performance of a 3-stage ring
+
+ 
+
+[Figure 4.6. A simple 4-phase bundled-data pipeline stage, and an illustration of its
+forward and reverse latency signal paths: (a) pipeline stage [i] with latch L,
+combinatorial circuit CL, C-element, inverter and delay element; (b) the Lf and Lr
+signal paths. Delays shown: tc = 2, ti = 1, td = 3.]
+
+Let us illustrate the above by a small example: a 3-stage ring composed of
+identical 4-phase bundled-data pipeline stages that are implemented as illus-
+trated in figure 4.6(a). The data path is composed of a latch and a combinatorial
+circuit, CL. The control part is composed of a C-element and an inverter that
+controls the latch and a delay element that matches the delay in the combinato-
+rial circuit. Without the combinatorial circuit and the delay element we have a
+simple FIFO stage. For illustrative purposes the components in the control part
+are assigned the following latencies: C-element: tc = 2 ns, inverter: ti = 1 ns,
+and delay element: td = 3 ns.
+Figure 4.6(b) shows the signal paths corresponding to the forward and re-
+verse latencies, and table 4.1 lists the expressions and the values of these pa-
+rameters. From these figures the period and the dynamic wavelength for the
+two circuit configurations are calculated. For the FIFO, Wd = 5.0 stages, and
+for the pipeline, Wd = 3.2. A ring can only contain an integer number of stages
+and if Wd is not an integer it is necessary to analyze rings with
+$\lfloor W_d \rfloor$ and $\lceil W_d \rceil$
+stages and determine which yields the smallest cycle time. Table 4.1 shows the
+results of the analysis including cycle times for rings with 3 to 6 stages.
+
+Table 4.1.
+Performance of different simple ring configurations.
+
+Parameter         | FIFO: Expression | FIFO: Value | Pipeline: Expression | Pipeline: Value
+Lr                | tc + ti          | 3 ns        | tc + ti              | 3 ns
+Lf                | tc               | 2 ns        | tc + td              | 5 ns
+P = 2Lf + 2Lr     | 4tc + 2ti        | 10 ns       | 4tc + 2ti + 2td      | 16 ns
+Wd                |                  | 5 stages    |                      | 3.2 stages
+TCycle (3 stages) | 6 Lr             | 18 ns       | 6 Lr                 | 18 ns
+TCycle (4 stages) | 4 Lr             | 12 ns       | 4 Lf                 | 20 ns
+TCycle (5 stages) | 3.3 Lr = 5 Lf    | 10 ns       | 5 Lf                 | 25 ns
+TCycle (6 stages) | 6 Lf             | 12 ns       | 6 Lf                 | 30 ns
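+
+The entries of table 4.1 follow directly from equations 4.2, 4.4, 4.5 and 4.6. The
+following sketch recomputes them; it assumes symmetric stages and a ring holding one
+valid and one empty token, and the function names are hypothetical, chosen only for
+this illustration.

```python
def period(lf, lr):
    """Period of a symmetric 4-phase stage, P = 2*Lf + 2*Lr (equation 4.2)."""
    return 2 * lf + 2 * lr


def dynamic_wavelength(lf, lr):
    """Dynamic wavelength Wd = P / Lf (equation 4.4)."""
    return period(lf, lr) / lf


def ring_cycle_time(n, lf, lr):
    """Cycle time of an n-stage ring holding one valid and one empty token."""
    if n < 3:
        raise ValueError("the ring needs at least one bubble, so n must be >= 3")
    if n >= dynamic_wavelength(lf, lr):
        return n * lf                # data limited, equation 4.5
    return 2 * n / (n - 2) * lr      # bubble limited, equation 4.6


# Latencies in ns, taken from table 4.1: (Lf, Lr) for the FIFO and the pipeline.
for name, lf, lr in (("FIFO", 2, 3), ("Pipeline", 5, 3)):
    times = [ring_cycle_time(n, lf, lr) for n in range(3, 7)]
    print(name, "P =", period(lf, lr), "Wd =", dynamic_wavelength(lf, lr), times)
```

+For a non-integer Wd, evaluating ring_cycle_time for the two neighbouring integer
+stage counts shows which of them gives the smaller cycle time.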
+
+4.3.4
+Final remarks
+
+The above presentation made a number of simplifying assumptions: (1)
+only rings and pipelines composed of identical pipeline stages were considered,
+(2) it assumed function blocks with symmetric delays (i.e. circuits where
+$L_{f,V} = L_{f,E}$), (3) it assumed function blocks with constant latencies (i.e.
+ignoring the important issue of data-dependent latencies and average-case
+performance), (4) it considered rings with only a single valid token, and (5) the
+analysis considered only 4-phase handshaking and bundled-data circuits.
+For 4-phase dual-rail implementations (where request is embedded in the
+data encoding) the performance parameter equations defined in the previous
+section apply without modification. For designs using a 2-phase protocol, some
+straightforward modifications are necessary: there are no empty tokens and
+hence there is only one value for the forward latency Lf and one value for the
+reverse latency Lr. It is also a simple matter to state expressions for the cycle
+time of rings with more tokens.
+It is more difficult to deal with data-dependent latencies in the function
+blocks and to deal with non-identical pipeline stages. Despite these deficien-
+cies the performance parameters introduced in the previous sections are very
+useful as a basis for first-order design decisions.
+
+4.4.
+Dependency graph analysis
+
+When the pipeline stages incorporate different function blocks, or function
+blocks with asymmetric delays, it is a more complex task to determine the crit-
+ical path. It is necessary to construct a graph that represents the dependencies
+between signal transitions in the circuit, and to analyze this graph and identify
+the critical path cycle [19, 153, 154, 155]. This can be done in a systematic or
+even mechanical way but the amount of detail makes it a complex task.
+The nodes in such a dependency graph represent rising or falling signal
+transitions, and the edges represent dependencies between the signal transitions.
+Formally, a dependency graph is a marked graph [28]. Let us look at a couple
+of examples.
+
+4.4.1
+Example 4: Dependency graph for a pipeline
+
+As a first example let us consider a (very long) pipeline composed of identical
+stages using a function block with asymmetric delays causing $L_{f,E} < L_{f,V}$.
+Figure 4.7(a) shows a 3-stage section of this pipeline. Each pipeline stage has
+the following latency parameters:
+
+$L_{f,V} = t_{d(0\to1)} + t_c = 5\,\text{ns} + 2\,\text{ns} = 7\,\text{ns}$
+$L_{f,E} = t_{d(1\to0)} + t_c = 1\,\text{ns} + 2\,\text{ns} = 3\,\text{ns}$
+$L_{r\uparrow} = L_{r\downarrow} = t_i + t_c = 3\,\text{ns}$
+
+There is a close relationship between the circuit diagram and the dependency
+graph. As signals alternate between rising transitions (↑) and falling transitions
+(↓) – or between valid and empty data values – the graph has two nodes per
+circuit element. Similarly the graph has two edges per wire in the circuit.
+Figure 4.7(b) shows the two graph fragments that correspond to a pipeline
+stage, and figure 4.7(c) shows the dependency graph that corresponds to the 3
+pipeline stages in figure 4.7(a).
+A label outside a node denotes the circuit delay associated with the signal
+transition. We use a particular style for the graphs that we find illustrative: the
+nodes corresponding to the forward flow of valid and empty data values are
+organized as two horizontal rows, and nodes representing the reverse flowing
+acknowledge signals appear as diagonal segments connecting the rows.
+The cycle time or period of the pipeline is the time from a signal transition
+until the same signal transition occurs again. The cycle time can therefore be
+determined by finding the longest simple cycle in the graph, i.e. the cycle with
+the largest accumulated circuit delay which does not contain a sub-cycle. The
+dotted cycle in figure 4.7(c) is the longest simple cycle. Starting at point A the
+corresponding period is:
+
+$P = \underbrace{t_{D(0\to1)} + t_C}_{L_{f,V}} + \underbrace{t_I + t_C}_{L_{r\uparrow}} + \underbrace{t_{D(0\to1)} + t_C}_{L_{f,V}} + \underbrace{t_I + t_C}_{L_{r\downarrow}} = 2L_r + 2L_{f,V} = 20\,\text{ns}$
+
+Note that this is the period given by equation 4.3 on page 49. An alternative
+cycle time candidate is the following:
+
+(R[i]↑; Req[i]↑; A[i-1]↓; Req[i-1]↓; R[i]↓; Req[i]↓; A[i-1]↑; Req[i-1]↑)
+
+where the four successive pairs of transitions contribute $L_{f,V}$, $L_{r\uparrow}$,
+$L_{f,E}$ and $L_{r\downarrow}$ respectively,
+
+and the corresponding period is:
+
+$P = 2L_r + L_{f,V} + L_{f,E} = 16\,\text{ns}$
+
+Note that this is the minimum possible period given by equation 4.1 on page 48.
+The period is determined by the longest cycle which is 20 ns. Thus, this ex-
+ample illustrates that for some (simple) latch implementations it may not be
+possible to reduce the cycle time by using function blocks with asymmetric
+delays ($L_{f,E} < L_{f,V}$).
+
+
+[Figure 4.7. Data dependency graph for a 3-stage section of a pipeline: (a) the circuit
+diagram, (b) the two graph fragments corresponding to a pipeline stage (the 0->1 and
+the 1->0 transition of Req[i]), and (c) the resulting data-dependency graph. Delays
+shown: tc = 2 ns, ti = 1 ns, td(0->1) = 5 ns, td(1->0) = 1 ns.]
+
+4.4.2
+Example 5: Dependency graph for a 3-stage ring
+
+As another example of dependency graph analysis let us consider a three
+stage 4-phase bundled-data ring composed of different pipeline stages,
+figure 4.8(a): stage 1 with a combinatorial circuit that is matched by a symmetric
+delay element, stage 2 without combinatorial logic, and stage 3 with a
+combinatorial circuit that is matched by an asymmetric delay element.
+
+[Figure 4.8. Data dependency graph for an example 3-stage ring: (a) the circuit diagram
+for the ring and (b) the resulting data-dependency graph. Delays shown: tc = 2 ns,
+ti = 1 ns, a symmetric 2 ns delay element, and an asymmetric delay element with
+td3(0->1) = 6 ns and td3(1->0) = 1 ns.]
+
+The dependency graph is similar to the dependency graph for the 3-stage
+pipeline from the previous section. The only difference is that the output port
+of stage 3 is connected to the input port of stage 1, forming a closed graph.
+There are several “longest simple cycle” candidates:
+1 A cycle corresponding to the forward flow of valid-tokens:
+
+(R1↑; Req1↑; R2↑; Req2↑; R3↑; Req3↑)
+
+For this cycle, the cycle time is TCycle = 14 ns.
+
+2 A cycle corresponding to the forward flow of empty-tokens:
+
+(R1↓; Req1↓; R2↓; Req2↓; R3↓; Req3↓)
+
+For this cycle, the cycle time is TCycle = 9 ns.
+
+3 A cycle corresponding to the backward flowing bubble:
+
+(A1↓; Req1↓; A3↑; Req3↑; A2↓; Req2↓; A1↑; Req1↑; A3↓; Req3↓;
+A2↑; Req2↑)
+
+For this cycle, the cycle time is TCycle = 6Lr = 18 ns.
+
+The 3-stage ring contains one valid-token, one empty-token and one bubble,
+and it is interesting to note that the single bubble is involved in six
+data transfers, and therefore makes two reverse round trips for each
+forward round trip of the valid-token.
+
+4 There is, however, another cycle with a slightly longer cycle time, as
+illustrated in figure 4.8(b). It is the cycle corresponding to the
+backward-flowing bubble where the sequence:
+
+(A1↓; Req1↓; A3↑) is replaced by (R3↑)
+
+The replaced sequence accounts for ti + tc + ti = 4 ns, whereas R3↑ accounts for
+td3(0->1) = 6 ns, so for this cycle the cycle time is
+TCycle = 18 ns - 4 ns + 6 ns = 20 ns.
+
+A dependency graph analysis of a 4-stage ring is very similar. The only
+difference is that there are two bubbles in the ring. In the dependency graph
+this corresponds to the existence of two “bubble cycles” that do not interfere
+with each other.
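+
+For a graph as small as the one in example 5 the longest simple cycle can also be
+found by exhaustive search. The sketch below is not the book's method; it is a
+brute-force illustration under stated assumptions: the graph structure (Req[i-1]
+drives R[i], the inverted acknowledge Req[i+1] drives A[i], and R[i] and A[i] drive
+Req[i]) is inferred from the cycles listed above, the 2 ns symmetric delay is placed
+on R1, stage 2 is given a zero delay, and all names are hypothetical.

```python
TC, TI = 2, 1                            # C-element and inverter delays (ns)
TD = {1: (2, 2), 2: (0, 0), 3: (6, 1)}   # delay element per stage: (0->1, 1->0) in ns


def build_graph():
    """Return (weight, succ): node delays and successor lists of the graph."""
    weight, succ = {}, {}
    for i in (1, 2, 3):
        prev_stage = 3 if i == 1 else i - 1
        next_stage = 1 if i == 3 else i + 1
        for d, opposite in (("up", "down"), ("down", "up")):
            weight[("R", i, d)] = TD[i][0 if d == "up" else 1]
            weight[("A", i, d)] = TI
            weight[("Req", i, d)] = TC
            # Req[i-1] passes through the delay element to give R[i].
            succ.setdefault(("Req", prev_stage, d), []).append(("R", i, d))
            # Ack from the next stage (same wire as its Req) is inverted to A[i].
            succ.setdefault(("Req", next_stage, opposite), []).append(("A", i, d))
            # The C-element output Req[i] follows R[i] and A[i].
            succ[("R", i, d)] = [("Req", i, d)]
            succ[("A", i, d)] = [("Req", i, d)]
    return weight, succ


def longest_simple_cycle(weight, succ):
    """Largest accumulated node delay over all simple cycles (brute force)."""
    nodes = sorted(succ)
    order = {n: k for k, n in enumerate(nodes)}
    best = 0

    def dfs(start, node, visited, total):
        nonlocal best
        for nb in succ[node]:
            if nb == start:
                best = max(best, total)          # closed a simple cycle
            elif order[nb] > order[start] and nb not in visited:
                visited.add(nb)
                dfs(start, nb, visited, total + weight[nb])
                visited.remove(nb)

    # Each cycle is enumerated once, from its lowest-ordered node.
    for s in nodes:
        dfs(s, s, {s}, weight[s])
    return best


print(longest_simple_cycle(*build_graph()))  # 20 (ns), matching cycle 4 above
```

+Exhaustive enumeration is only practical for very small graphs such as this one; it
+is shown here purely as a cross-check of the hand analysis.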
+The dependency graph approach presented above assumes a closed circuit
+that results in a closed dependency graph. If a component such as a pipeline
+fragment is to be analyzed it is necessary to include a (dummy) model of its
+environment as well – typically in the form of independent and eager token
+producers and token consumers, i.e. dummy circuits that simply respond to
+handshakes. Figure 2.15 on page 24 illustrated this for a single pipeline stage
+control circuit.
+Note that a dependency graph as introduced above is similar to a signal
+transition graph (STG) which we will introduce more carefully in chapter 6.
+
+4.5.
+Summary
+
+This chapter addressed the performance analysis of asynchronous circuits
+at several levels: firstly, by providing a qualitative understanding of perfor-
+mance based on the dynamics of tokens flowing in a circuit; secondly, by in-
+troducing quantitative performance parameters that characterize pipelines and
+rings composed of identical pipeline stages and, thirdly, by introducing de-
+pendency graphs that enable the analysis of pipelines and rings composed of
+non-identical stages.
+At this point we have covered the design and performance analysis of asyn-
+chronous circuits at the “static data-flow structures” level, and it is time to
+address low-level circuit design principles and techniques. This will be the
+topic of the next two chapters.
+
+
+Chapter 5
+
+HANDSHAKE CIRCUIT IMPLEMENTATIONS
+
+In this chapter we will address the implementation of handshake compo-
+nents. First, we will consider the basic set of components introduced in sec-
+tion 3.3 on page 32: (1) the latch, (2) the unconditional data-flow control el-
+ements join, fork and merge, (3) function blocks, and (4) the conditional flow
+control elements MUX and DEMUX. In addition to these basic components
+we will also consider the implementation of mutual exclusion elements and
+arbiters and touch upon the (unavoidable) problem of metastability. The major
+part of the chapter (sections 5.3–5.6) is devoted to the implementation of func-
+tion blocks and the material includes a number of fundamental concepts and
+circuit implementation styles.
+
+5.1.
+The latch
+
+As mentioned previously, the role of latches is: (1) to provide storage for
+valid and empty tokens, and (2) to support the flow of tokens via handshak-
+ing with neighbouring latches. Possible implementations of the handshake
+latch were shown in chapter 2: Figure 2.9 on page 18 shows how a 4-phase
+bundled-data handshake latch can be implemented using a conventional latch
+and a control circuit (the figure shows several such examples assembled into
+pipelines). In a similar way figure 2.11 on page 20 shows the implementation
+of a 2-phase bundled-data latch, and figures 2.12-2.13 on page 21 show the
+implementation of a 4-phase dual-rail latch.
+A handshake latch can be characterized in terms of the throughput, the dy-
+namic wavelength and the static spread of a FIFO that is composed of identical
+latches. Common to the two 4-phase latch designs mentioned above is that a
+FIFO will fill with every other latch holding a valid token and every other latch
+holding an empty token (as illustrated in figure 4.1(b) on page 43). Thus, the
+static spread for these FIFOs is S = 2.
+A 2-phase implementation does not involve empty tokens and consequently
+it may be possible to design a latch whose static spread is S = 1. Note,
+however, that the implementation of the 2-phase bundled-data handshake latch in
+figure 2.11 on page 20 involves several level-sensitive latches; the utilization
+of the level sensitive latches is no better.
+Ideally, one would want to pack a valid token into every level-sensitive latch,
+and in chapter 7 we will address the design of 4-phase bundled-data handshake
+latches that have a smaller static spread.
+
+5.2.
+Fork, join, and merge
+
+Possible 4-phase bundled-data and 4-phase dual-rail implementations of the
+fork, join, and merge components are shown in figure 5.1. For simplicity the
+figure shows a fork with two output channels only, and join and merge compo-
+nents with two input channels only. Furthermore, all channels are assumed to
+be 1-bit channels. It is, of course, possible to generalize to three or more inputs
+and outputs respectively, and to extend to n-bit channels. Based on the expla-
+nation given below this should be straightforward, and it is left as an exercise
+for the reader.
+
+4-phase fork and join
+A fork involves a C-element to combine the acknowl-
+edge signals on the output channels into a single acknowledge signal on the
+input channel. Similarly a 4-phase bundled-data join involves a C-element to
+combine the request signals on the input channels into a single request signal
+on the output channel. The 4-phase dual-rail join does not involve any active
+components as the request signal is encoded into the data.
+The particular fork in figure 5.1 duplicates the input data, and the join con-
+catenates the input data. This happens to be the way joins and forks are mostly
+used in our static data-flow structures, but there are many alternatives: for ex-
+ample, the fork could split the input data which would make it more symmetric
+to the join in figure 5.1. In any case the difference is only in how the input data
+is transferred to the output. From a control point of view the different alter-
+natives are identical: a join synchronizes several input channels and a fork
+synchronizes several output channels.
+
+4-phase merge
+The implementation of the merge is a little more elaborate.
+Handshakes on the input channels are mutually exclusive, and the merge sim-
+ply relays the active input handshake to the output channel.
+Let us consider the implementation of the 4-phase bundled-data merge first.
+It consists of an asynchronous control circuit and a multiplexer that is con-
+trolled by the input request. The control circuit is explained below.
+The request signals on the input channels are mutually exclusive and may
+simply be ORed together to produce the request signal on the output channel.
+For each input channel, a C-element produces an acknowledge signal in re-
+sponse to an acknowledge on the output channel provided that the input chan-
+nel has valid data. For example, the C-element driving the xack signal is set high
+
+
+Chapter 5: Handshake circuit implementations
+59
+
+Figure 5.1.
+4-phase bundled-data and 4-phase dual-rail implementations of the fork, join and
+merge components. [Figure not reproduced: a table with one row per component (Fork, Join, Merge) and one column per protocol (4-phase bundled-data, 4-phase dual-rail), drawn with C-elements, OR gates and a MUX.]
+
+when xreq and zack have both gone high, and it is reset when both signals have
+gone low again. As zack goes low in response to xreq going low, it will suffice to
+reset the C-element in response to zack going low. This optimization is possible
+if asymmetric C-elements are available (figure 5.2). A similar argument applies
+for the C-element that drives the yack signal. A more detailed introduction to
+generalized C-elements and related state-holding devices is given in chapter 6,
+sections 6.4.1 and 6.4.5.
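+
+As a rough behavioural illustration of the optimized control for input channel x, the following sketch sets the acknowledge when xreq and zack are both high and resets it as soon as zack goes low. The class name AsymmetricC is hypothetical and the model ignores gate delays.

```python
class AsymmetricC:
    """Asymmetric C-element: set by (x_req AND z_ack), reset by z_ack low alone."""

    def __init__(self):
        self.out = 0

    def update(self, x_req, z_ack):
        if x_req == 1 and z_ack == 1:
            self.out = 1          # set: input x is active and the output is acknowledged
        elif z_ack == 0:
            self.out = 0          # reset: depends only on z_ack returning to zero
        return self.out           # otherwise hold


def merge_request(x_req, y_req):
    """Requests are mutually exclusive, so the output request is a plain OR."""
    return x_req | y_req


# A full 4-phase handshake arriving on input channel x.
gate = AsymmetricC()
x_ack_trace = [gate.update(x_req, z_ack)
               for (x_req, z_ack) in [(1, 0), (1, 1), (0, 1), (0, 0)]]
print(x_ack_trace)  # [0, 1, 1, 0]
```

+If the handshake arrives on channel y instead, x_req stays low and this gate never sets, which is the behaviour required of the merge.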
+
+Figure 5.2.
+A possible implementation of the upper asymmetric C-element in the 4-phase
+bundled-data merge in figure 5.1. [Figure not reproduced: gate symbol with inputs x-req and z-ack and output x-ack, and a transistor-level circuit with set and reset paths.]
+
+
+60
+Part I: Asynchronous circuit design – A tutorial
+
+The implementation of the 4-phase dual-rail merge is fairly similar. As
+request is encoded into the data signals an OR gate is used for each of the
+two output signals z.t and z.f. Acknowledge on an input channel is produced
+in response to an acknowledge on the output channel provided that the input
+channel has valid data. Since the example assumes 1-bit wide channels, the
+latter is established using an OR gate (marked “1”), but for N-bit wide channels
+a completion detector (as shown in figure 2.13 on page 21) would be required.
+
+2-phase fork, join and merge
+Finally a word about 2-phase bundled-data
+implementations of the fork, join and merge components: the implementation
+of 2-phase bundled-data fork and join components is identical to the imple-
+mentation of the corresponding 4-phase bundled-data components (assuming
+that all signals are initially low).
+The implementation of a 2-phase bundled-data merge, on the other hand,
+is complex and rather different, and it provides a good illustration of why the
+implementation of some 2-phase bundled-data components is complex. When
+observing an individual request or acknowledge signal the transitions will obvi-
+ously alternate between rising and falling, but since nothing is known about the
+sequence of handshakes on the input channels there is no relationship between
+the polarity of a request signal transition on an input channel and the polarity
+of the corresponding request signal transition on the output channel. Similarly
+there is no relationship between the polarity of an acknowledge signal transi-
+tion on the output channel and the polarity of the corresponding acknowledge
+signal transition on the input channel. This calls for some kind of storage
+element on each request and acknowledge signal produced by the circuit.
+This brings complexity, as does the associated control logic.
+
+5.3.
+Function blocks – The basics
+
+This section will introduce the fundamental principles of function block de-
+sign, and subsequent sections will illustrate function block implementations
+for different handshake protocols. The running example will be an N-bit ripple
+carry adder.
+
+5.3.1
+Introduction
+
+A function block is the asynchronous equivalent of a combinatorial circuit:
+it computes one or more output signals from a set of input signals. The term
+“function block” is used to stress the fact that we are dealing with circuits with
+a purely functional behaviour.
+However, in addition to computing the desired function(s) of the input sig-
+nals, a function block must also be transparent to the handshaking that is im-
+plemented by its neighbouring latches. This transparency to handshaking is
+
+
+Figure 5.3.
+A function block whose operands and results are provided on separate channels
+requires a join of the inputs and a fork on the output. [Figure not reproduced: a function block ADD with input channels A, B and cin combined by a join, and output channels SUM and cout produced by a fork.]
+
+what makes function blocks different from combinatorial circuits and, as we
+will see, there are greater depths to this than is indicated by the word “trans-
+parent” – in particular for function blocks that implicitly indicate completion
+(which is the case for circuits using dual-rail signals).
+The most general scenario is where a function block receives its operands
+on separate channels and produces its results on separate channels, figure 5.3.
+The use of several independent input and output channels implies a join on the
+input side and a fork on the output side, as illustrated in the figure. These can
+be implemented separately, as explained in the previous section, or they can be
+integrated into the function block circuitry. In what follows we will restrict the
+discussion to a scenario where all operands are provided on a single channel
+and where all results are provided on a single channel.
+We will first address the issue of handshake transparency and then review
+the fundamentals of ripple carry addition, in order to provide the necessary
+background for discussing the different implementation examples that follow.
+A good paper on the design of function blocks is [97].
+
+5.3.2
+Transparency to handshaking
+
+The general concepts are best illustrated by considering a 4-phase dual-rail
+scenario – function blocks for bundled data protocols can be understood as
+a special case. Figure 5.4(a) shows two handshake latches connected directly
+and figure 5.4(b) shows the same situation with a function block added between
+the two latches. The function block must be transparent to the handshaking.
+Informally this means that if observing the signals on the ports of the latches,
+one should see the same sequence of handshake signal transitions; the only
+difference should be some slow-down caused by the latency of the function
+block.
+A function block is obviously not allowed to produce a request on its output
+before receiving a request on its input; put the other way round, a request on the
+output of the function block should indicate that all of the inputs are valid and
+that all (relevant) internal signals and all output signals have been computed.
+
+
+Figure 5.4.
+(a) Two latches connected directly by a handshake channel and (b) the same situation
+with a function block added between the latches. The handshaking as seen by the latches
+in the two situations should be the same, i.e. the function block must be designed such that it is
+transparent to the handshaking. [Figure not reproduced: (a) two latches connected by Data and Ack signals; (b) the same two latches with a function block F between them, with input data and output data.]
+
+(Here we are touching upon the principle of indication once again.) In 4-phase
+protocols a symmetric set of requirements applies for the return-to-zero part of
+the handshaking.
+Function blocks can be characterized as either strongly indicating or weakly
+indicating depending on how they behave with respect to this handshake trans-
+parency. The signalling that can be observed on the channel between the two
+
+[Signal traces not reproduced: over time, the input data and the output data alternate between "all valid" and "all empty", and the acknowledge signal switches between 0 and 1. The numbered events in the traces are ordered as follows.]
+
+(1) “All inputs become defined” ≼ “Some outputs become defined”
+(2) “All outputs become defined” ≼ “Some inputs become undefined”
+(3) “All inputs become undefined” ≼ “Some outputs become undefined”
+(4) “All outputs become undefined” ≼ “Some inputs become defined”
+
+Figure 5.5.
+Signal traces and event orderings for a strongly indicating function block.
+
+
+[Signal traces not reproduced: over time, the input data and the output data alternate between "all valid" and "all empty", and the acknowledge signal switches between 0 and 1. The numbered events in the traces are ordered as follows.]
+
+(1) “Some inputs become defined” ≼ “Some outputs become defined”
+(2) “All inputs become defined” ≼ “All outputs become defined”
+(3) “All outputs become defined” ≼ “Some inputs become undefined”
+(4) “Some inputs become undefined” ≼ “Some outputs become undefined”
+(5) “All inputs become undefined” ≼ “All outputs become undefined”
+(6) “All outputs become undefined” ≼ “Some inputs become defined”
+
+Figure 5.6.
+Signal traces and event orderings for a weakly indicating function block.
+
+latches in figure 5.4(a) was illustrated in figure 2.3 on page 13. We can illus-
+trate the handshaking for the situation in figure 5.4(b) in a similar way.
+
+A function block is strongly indicating, as illustrated in figure 5.5, if (1)
+it waits for all of its inputs to become valid before it starts to compute and
+produce valid outputs, and if (2) it waits for all of its inputs to become
+empty before it starts to produce empty outputs.
+
+A function block is weakly indicating, as illustrated in figure 5.6, if (1)
+it starts to compute and produce valid outputs as soon as possible, i.e.
+when some but not all input signals have become valid, and if (2) it
+starts to produce empty outputs as soon as possible, i.e. when some but
+not all input signals have become empty.
+
+For a weakly indicating function block to behave correctly, it is necessary
+to require that it never produces all valid outputs until after all inputs have be-
+come valid, and that it never produces all empty outputs until after all inputs
+have become empty. This behaviour is identical to Seitz’s weak conditions in
+[121]. In [121] Seitz further explains that it can be proved that if the individual
+components satisfy the weak conditions then any “valid combinatorial circuit
+structure” of function blocks also satisfies the weak conditions, i.e. that func-
+tion blocks may be combined to form larger function blocks. By “valid com-
+binatorial circuit structure” we mean a structure where no components have
+inputs or outputs left unconnected and where there are no feed-back signal
+paths. Strongly indicating function blocks have the same property – a “valid
+combinatorial circuit structure” of strongly indicating function blocks is itself
+a strongly indicating function block.
+Notice that both weakly and strongly indicating function blocks exhibit a
+hysteresis-like behaviour in the valid-to-empty and empty-to-valid transitions:
+(1) some/all outputs must remain valid until after some/all inputs have become
+empty, and (2) some/all outputs must remain empty until after some/all inputs
+have become valid. It is this hysteresis that ensures handshake transparency,
+and the implementation consequence is that one or more state holding circuits
+(normally in the form of C-elements) are needed.
+Finally, a word about the 4-phase bundled-data protocol. Since Req↑ is
+equivalent to “all data signals are valid” and since Req↓ is equivalent to “all
+data signals are empty,” a 4-phase bundled-data function block can be catego-
+rized as strongly indicating.
+As we will see in the following, strongly indicating function blocks have
+worst-case latency. To obtain actual case latency weakly indicating function
+blocks must be used. Before addressing possible function block implementa-
+tion styles for the different handshake protocols it is useful to review the basics
+of binary ripple-carry addition, the running example in the following sections.
+
+5.3.3
+Review of ripple-carry addition
+
+Figure 5.7 illustrates the implementation principle of a ripple-carry adder.
+A 1-bit full adder stage implements:
+
+$s = a \oplus b \oplus c$ (5.1)
+$d = ab + ac + bc$ (5.2)
+
+Figure 5.7.
+A ripple-carry adder. The carry output of one stage, $d_i$, is connected to the carry
+input of the next stage, $c_{i+1}$. [Figure not reproduced: a chain of full-adder stages 1, ..., i, ..., n, each with inputs ai and bi, carry input ci, carry output di and sum output si; the chain has carry input cin and carry output cout.]
+
+
+
+In many implementations inputs a and b are recoded as:
+
+$p = a \oplus b$ (“propagate” carry) (5.3)
+$g = ab$ (“generate” carry) (5.4)
+$k = \overline{a}\,\overline{b}$ (“kill” carry) (5.5)
+
+and the output signals are computed as follows:
+
+$s = p \oplus c$ (5.6)
+$d = g + pc$ or alternatively (5.7a)
+$\overline{d} = k + p\overline{c}$ (5.7b)
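+
+The recoding can be checked exhaustively against equations 5.1 and 5.2. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the kill signal is the product of the complemented inputs and that equation 5.7b gives the complemented carry, as the names “kill” and “generate” suggest.

```python
from itertools import product


def full_adder(a, b, c):
    """Direct form, equations 5.1 and 5.2."""
    s = a ^ b ^ c
    d = (a & b) | (a & c) | (b & c)
    return s, d


def full_adder_pgk(a, b, c):
    """Recoded form using propagate, generate and kill."""
    p = a ^ b                     # propagate (5.3)
    g = a & b                     # generate (5.4)
    k = (1 - a) & (1 - b)         # kill (5.5), assumed to be NOT a AND NOT b
    s = p ^ c                     # (5.6)
    d = g | (p & c)               # (5.7a)
    d_bar = k | (p & (1 - c))     # (5.7b), assumed to yield the complement of d
    assert d_bar == 1 - d
    return s, d


agree = all(full_adder(a, b, c) == full_adder_pgk(a, b, c)
            for a, b, c in product((0, 1), repeat=3))
print(agree)  # True
```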
+
+For a ripple-carry adder, the worst case critical path is a carry rippling across
+the entire adder. If the latency of a 1-bit full adder is $t_{add}$ the worst case latency
+of an N-bit adder is $N \cdot t_{add}$. This is a very rare situation and in general the
+longest carry ripple during a computation is much shorter. Assuming random
+and uncorrelated operands the average latency is $\log(N) \cdot t_{add}$ and, if numerically
+small operands occur more frequently, the average latency is even less.
+Using normal Boolean signals (as in the bundled-data protocols) there is no
+way to know when the computation has finished and the resulting performance
+is thus worst-case.
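+
+The gap between worst-case and average carry ripple can be explored with a small experiment. The sketch below is an illustration, not a measurement from the text: the function names are hypothetical, the carry input is assumed to be 0, and a carry chain is counted from the stage that generates the carry through every following stage that propagates it.

```python
import random


def longest_carry_ripple(a, b, n):
    """Longest carry chain, in stages, when adding two n-bit operands (carry-in 0)."""
    longest = chain = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai & bi:
            chain = 1             # generate: a new carry starts at this stage
        elif (ai ^ bi) and chain:
            chain += 1            # propagate: an incoming carry ripples one stage further
        else:
            chain = 0             # kill, or propagate with no incoming carry
        longest = max(longest, chain)
    return longest


def average_ripple(n, samples=2000, seed=1):
    """Average longest ripple over random, uncorrelated n-bit operands."""
    rng = random.Random(seed)
    total = sum(longest_carry_ripple(rng.getrandbits(n), rng.getrandbits(n), n)
                for _ in range(samples))
    return total / samples


print(longest_carry_ripple(0xFFFF, 0x0001, 16))  # 16, the worst case
print(average_ripple(32))                        # far below the worst case of 32
```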
+By using dual-rail carry signals $(d.t, d.f)$ it is possible to design circuits
+that indicate completion as part of the computation and thus achieve actual
+case latency. The crux is that a dual-rail carry signal, d, conveys one of the
+following 3 messages:
+
+$(d.t, d.f) = (0,0)$ = Empty: “The carry has not been computed yet” (possibly because it depends on c)
+$(d.t, d.f) = (1,0)$ = True: “The carry is 1”
+$(d.t, d.f) = (0,1)$ = False: “The carry is 0”
+
+Consequently it is possible for a 1-bit adder to output a valid carry without
+waiting for the incoming carry if its inputs make this possible ($a = b = 0$ or
+$a = b = 1$). This idea was first put forward in 1955 in a paper by Gilchrist [52].
+The same idea is explained in [62, pp. 75-78] and in [121].
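+
+A minimal sketch of the idea, assuming for simplicity that the operand bits a and b are ordinary Boolean values that are already valid and that only the carry is dual-rail; the names EMPTY, TRUE, FALSE and carry_out are hypothetical.

```python
# Dual-rail codewords as (x.t, x.f) pairs.
EMPTY, TRUE, FALSE = (0, 0), (1, 0), (0, 1)


def carry_out(a, b, c):
    """Return the dual-rail carry (d.t, d.f) of one full-adder stage.

    a and b are Boolean operand bits; c is the dual-rail incoming carry.
    """
    if a == 1 and b == 1:
        return TRUE               # generate: valid without looking at c
    if a == 0 and b == 0:
        return FALSE              # kill: valid without looking at c
    return c                      # propagate: d stays Empty until c is valid


print(carry_out(1, 1, EMPTY))  # (1, 0)
print(carry_out(1, 0, EMPTY))  # (0, 0)
```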
+
+5.4.
+Bundled-data function blocks
+
+5.4.1
+Using matched delays
+
+A bundled-data implementation of the adder in figure 5.7 is shown in fig-
+ure 5.8. It is composed of a traditional combinatorial circuit adder and a match-
+ing delay element. The delay element provides a constant delay that matches
+the worst case latency of the combinatorial adder. This includes the worst case
+
+
+Figure 5.8.
+A 4-phase bundled data implementation of the N-bit handshake adder from
+figure 5.7. [Figure not reproduced: a combinatorial circuit with inputs a[n:1], b[n:1] and c and outputs s[n:1] and d, in parallel with a matched delay from Req-in to Req-out; Ack-out connects to Ack-in.]
+
+critical path in the circuit – a carry rippling across the entire adder – as well as
+the worst case operating conditions. For reliable operation some safety margin
+is needed.
+In addition to the combinatorial circuit itself, the delay element represents
+a design challenge for the following reasons: to a first order the delay element
+will track delay variations that are due to the fabrication process spread as well
+as variations in temperature and supply voltage. On the other hand, wire de-
+lays can be significant and they are often beyond the designer’s control. Some
+design policy for matched delays is obviously needed. In a full custom de-
+sign environment one may use a dummy circuit with identical layout but with
+weaker transistors. In a standard cell automatic place and route environment
+one will have to accept a fairly large safety margin or do post-layout timing
+analysis and trimming of the delays. The latter sounds tedious but it is similar
+to the procedure used in synchronous design where setup and hold times are
+checked and delays trimmed after layout.
+In a 4-phase bundled-data design an asymmetric delay element may be
+preferable from a performance point of view, in order to perform the return-to-
+zero part of the handshaking as quickly as possible. Another issue is the power
+consumption of the delay element. In the ARISC processor design reported in
+[23] the delay elements consumed 10 % of the total power.
+
+5.4.2
+Delay selection
+
+In [105] Nowick proposed a scheme called “speculative completion”. The
+basic principle is illustrated in figure 5.9. In addition to the desired function
+some additional circuitry is added that selects among several matched delays.
+The estimate must be conservative, i.e. on the safe side. The estimation can
+be based on the input signals and/or on some internal signals in the circuit that
+implements the desired function.
+Figure 5.9.
+The basic idea of “speculative completion”. [Figure not reproduced: a function block between Inputs and Outputs, and an Estimate block that controls a MUX selecting among small, medium and large matched delays between Req_in and Req_out.]
+
+For an N-bit ripple-carry adder the propagate signals (cf. equation 5.3)
+that form the individual 1-bit full adders (cf. figure 5.7) may be used for the
+estimation. As an example of the idea consider a 16-bit adder. If $p_8 = 0$ the
+longest carry ripple can be no longer than 8 stages, and if $p_{12} + p_8 + p_4 = 0$
+the longest carry ripple can be no longer than 4 stages. Based on such simple
+estimates a sufficiently large matched delay is selected. Again, if a 4-phase
+protocol is used, asymmetric delay elements are preferable from a performance
+point of view.
+To the designer the trade-off is between an aggressive estimate with a large
+circuit overhead (area and power) or a less aggressive estimate with less over-
+head. For more details on the implementation and the attainable performance
+gains the reader is referred to [105, 107].
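+
+The 16-bit example can be written down as a small selection function. This is a sketch of the estimate described in the text, not of the circuits in [105]: the function name, the stage numbering from 1 to 16 and the unit delay t_add are assumptions made for illustration.

```python
def select_delay(a, b, t_add=1.0):
    """Pick a conservative matched delay for a 16-bit ripple-carry adder.

    The estimate inspects only the propagate signals p4, p8 and p12.
    """
    p = a ^ b                     # propagate bits, equation 5.3

    def p_at(stage):              # stages are numbered 1..16
        return (p >> (stage - 1)) & 1

    if p_at(4) == 0 and p_at(8) == 0 and p_at(12) == 0:
        stages = 4                # "small" delay
    elif p_at(8) == 0:
        stages = 8                # "medium" delay
    else:
        stages = 16               # "large" delay: full worst case
    return stages * t_add


print(select_delay(0x0000, 0x0000))  # 4.0
print(select_delay(0x0008, 0x0000))  # 8.0
print(select_delay(0x0080, 0x0000))  # 16.0
```

+A more aggressive estimate would inspect more propagate signals, at the cost of more estimation logic.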
+
+5.5.
+Dual-rail function blocks
+
+5.5.1
+Delay insensitive minterm synthesis (DIMS)
+
+In chapter 2 (page 22 and figure 2.14) we explained the implementation of
+an AND gate for dual-rail signals. Using the same basic topology it is possible
+to implement other simple gates such as OR, EXOR, etc. An inverter involves
+no active circuitry as it is just a swap of the two wires.
+Arbitrary functions can be implemented by combining gates in exactly the
+same way as when one designs combinatorial circuits for a synchronous cir-
+cuit. The handshaking is implicitly taken care of and can be ignored when
+composing gates and implementing Boolean functions. This has the important
+implication that existing logic synthesis techniques and tools may be used, the
+only difference is that the basic gates are implemented differently.
+The dual-rail AND gate in figure 2.14 is obviously rather inefficient: 4 C-
+elements and 1 OR gate totaling approximately 30 transistors – a factor five
+greater than a normal AND gate whose implementation requires only 6 tran-
+sistors. By implementing larger functions the overhead can be reduced. To
+illustrate this figure 5.10(b)-(c) shows the implementation of a 1-bit full adder.
+We will discuss the circuit in figure 5.10(d) shortly.
+
+
+[Figure 5.10 drawings not reproduced: (a) the ADD symbol with dual-rail inputs a, b, c and outputs s, d; (c) eight 3-input C-elements, one per minterm, feeding four OR gates that drive s.t, s.f, d.t and d.f; (d) the same structure with two additional C-elements for the “kill” and “generate” cases. The truth table (b) is:]
+
+a b c | s.t s.f d.t d.f
+E E E |  0   0   0   0
+intermediate codeword | NO CHANGE
+F F F |  0   1   0   1   (Kill)
+F F T |  1   0   0   1   (Kill)
+F T F |  1   0   0   1
+F T T |  0   1   1   0
+T F F |  1   0   0   1
+T F T |  0   1   1   0
+T T F |  0   1   1   0   (Generate)
+T T T |  1   0   1   0   (Generate)
+
+Figure 5.10.
+A 4-phase dual-rail full-adder: (a) Symbol, (b) truth table, (c) DIMS implementation
+and (d) an optimization that makes the full adder weakly indicating.
+
+The PLA-like structure of the circuit in figure 5.10(c) illustrates a general
+principle for implementing arbitrary Boolean functions. In [124] we called this
+approach DIMS – Delay-Insensitive Minterm Synthesis – because the circuits
+are delay-insensitive and because the C-elements in the circuits generate all
+minterms of the input variables. The truth tables have 3 groups of rows speci-
+fying the output when the input is: (1) the empty codeword to which the circuit
+responds by setting the output empty, (2) an intermediate codeword which does
+not affect the output, or (3) a valid codeword to which the circuit responds by
+setting the output to the proper valid value.
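+
+The three groups of truth-table rows can be reproduced with a behavioural model of the DIMS structure: one C-element per minterm followed by OR gates. The sketch below is illustrative only; the class name DimsFullAdder and the encoding of dual-rail signals as (x.t, x.f) tuples are assumptions.

```python
from itertools import product

EMPTY, TRUE, FALSE = (0, 0), (1, 0), (0, 1)   # dual-rail (x.t, x.f)


class DimsFullAdder:
    def __init__(self):
        # One 3-input C-element per minterm of (a, b, c), all initially low.
        self.minterm = {m: 0 for m in product((0, 1), repeat=3)}

    def update(self, a, b, c):
        for m in self.minterm:
            # Use the .t rail where the minterm needs a 1 and the .f rail where it needs a 0.
            rails = [x[0] if bit else x[1] for x, bit in zip((a, b, c), m)]
            if all(rails):
                self.minterm[m] = 1   # C-element sets
            elif not any(rails):
                self.minterm[m] = 0   # C-element resets
            # otherwise the C-element holds its value

        def any_of(pred):             # OR gate over the selected minterms
            return int(any(v for m, v in self.minterm.items() if pred(m)))

        s = (any_of(lambda m: sum(m) % 2 == 1), any_of(lambda m: sum(m) % 2 == 0))
        d = (any_of(lambda m: sum(m) >= 2), any_of(lambda m: sum(m) <= 1))
        return s, d


fa = DimsFullAdder()
print(fa.update(TRUE, FALSE, EMPTY))   # intermediate codeword: outputs stay empty
print(fa.update(TRUE, FALSE, TRUE))    # valid codeword: s = False, d = True
print(fa.update(EMPTY, EMPTY, TRUE))   # intermediate codeword: outputs unchanged
print(fa.update(EMPTY, EMPTY, EMPTY))  # empty codeword: outputs become empty
```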
+The fundamental ideas explained above all go back to David Muller’s work
+in the late 1950s and early 1960s [93, 92]. While [93] develops the funda-
+mental theorem for the design of speed-independent circuits, [92] is a more
+practical introduction including a design example: a bit-serial multiplier using
+latches and gates as explained above.
+Referring to section 5.3.2, the DIMS circuits as explained here can be cat-
+egorized as strongly indicating, and hence they exhibit worst case latency. In
+an N-bit ripple-carry adder the empty-to-valid and valid-to-empty transitions
+will ripple in strict sequence from the least significant full adder to the most
+significant one.
+If we change the full-adder design slightly as illustrated in figure 5.10(d) a
+valid d may be produced before the c input is valid (“kill” or “generate”), and
+an N-bit ripple-carry adder built from such full adders will exhibit actual-case
+latency – the circuits are weakly indicating function blocks.
+The designs in figure 5.10(c) and 5.10(d), and ripple-carry adders built from
+these full adders, are all symmetric in the sense that the latency of propagating
+an empty value is the same as the latency of propagating the preceding valid
+value. This may be undesirable. Later in section 5.5.4 we will introduce an
+elegant design that propagates empty values in constant time (with the latency
+of 2 full adder cells).
+
+5.5.2
+Null Convention Logic
+
+The C-elements and OR gates from the previous sections can be seen as n-
+of-n and 1-of-n threshold gates with hysteresis, figure 5.11. By using arbitrary
+m-of-n threshold gates with hysteresis – an idea proposed by Theseus Logic,
+Inc., [39] – it is possible to reduce the implementation complexity. An m-
+of-n threshold gate with hysteresis will set its output high when any m inputs
+have gone high and it will set its output low when all its inputs are low. This
+elegant circuit implementation idea is the key element in Theseus Logic’s Null
+Convention Logic. At the higher levels of design NCL is no different from
+the data-flow view presented in chapter 3 and NCL has great similarities to the
+circuit design styles presented in [92, 122, 124, 97]. Figure 5.11 shows that
+
+Figure 5.11.
+NCL gates: m-of-n threshold gates with hysteresis ($1 \le m \le n$). [Figure not reproduced: an array of gate symbols labelled with their thresholds, including the inverter, the OR gates (threshold 1) and the C-elements (threshold n).]
+
+Figure 5.12.
+A full adder using NCL gates. [Figure not reproduced: four threshold gates with thresholds 2, 2, 3 and 3, inputs a.t, a.f, b.t, b.f, c.t, c.f and outputs s.t, s.f, d.t, d.f.]
+
+
+OR gates and C-elements can be seen as special cases in the world of threshold
+gates. The digit inside a gate symbol is the threshold of the gate. Figure 5.12
+shows the implementation of a dual-rail full adder using NCL threshold gates.
+The circuit is weakly indicating.
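+
+The set and reset rule of an m-of-n threshold gate with hysteresis is easy to state in code. The sketch below is a behavioural model only; the class name ThresholdGate is hypothetical and the wiring of the full adder in figure 5.12 is not modelled.

```python
class ThresholdGate:
    """m-of-n threshold gate with hysteresis."""

    def __init__(self, m, n):
        if not 1 <= m <= n:
            raise ValueError("threshold must satisfy 1 <= m <= n")
        self.m, self.n, self.out = m, n, 0

    def update(self, *inputs):
        if len(inputs) != self.n:
            raise ValueError("expected %d inputs" % self.n)
        high = sum(inputs)
        if high >= self.m:
            self.out = 1          # set when any m inputs are high
        elif high == 0:
            self.out = 0          # reset only when all inputs are low
        return self.out           # otherwise hold (hysteresis)


gate = ThresholdGate(2, 3)
trace = [gate.update(*v) for v in [(1, 0, 0), (1, 1, 0), (1, 0, 0), (0, 0, 0)]]
print(trace)  # [0, 1, 1, 0]
```

+With m = n the gate behaves as a C-element, and with m = 1 it behaves as an OR gate.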
+
+5.5.3
+Transistor-level CMOS implementations
+
+The last two adder designs we will introduce are based on CMOS transistor-
+level implementations using dual-rail signals. Dual-rail signals are essentially
+what are produced by precharged differential logic circuits that are used in
+memory structures and in logic families like DCVSL, figure 5.13 [151, 55].
+In a bundled-data design the precharge signal can be the request signal on
+the input channel to the function block. In a dual-rail design the precharge
+p-type transistors may be replaced by transistor networks that detect when all
+
+Figure 5.13.
+A precharged differential CMOS combinatorial circuit. By adding the cross-coupled
+p-type transistors labeled “A” or the (weak) feedback-inverters labeled “B” the circuit
+becomes (pseudo)static. [Figure not reproduced: an N transistor network driven by the inputs, with Precharge transistors and outputs Out.t and Out.f.]
+
+Figure 5.14.
+Transistor-level implementation of the carry signal for the strongly indicating full
+adder from figure 5.10(c). [Figure not reproduced: pull-up and pull-down transistor networks on the rails of a, b and c driving d.t and d.f.]
+
+
+inputs are empty. Similarly the pull down n-type transistor signal paths should
+only conduct when the required input signals are valid.
+Transistor implementations of the DIMS and NCL gates introduced above
+are thus straightforward. Figure 5.14 shows a transistor-level implementation
+of a carry circuit for a strongly-indicating full adder. In the pull-down circuit
+each column of transistors corresponds to a minterm. In general when imple-
+menting DCVSL gates it is possible to share transistors in the two pull-down
+networks, but in this particular case it has not been done in order to illustrate
+better the relationship between the transistor implementation and the gate im-
+plementation in figure 5.10(c).
+The high stacks of p-type transistors are obviously undesirable. They may
+be replaced by a single transistor controlled by an “all empty” signal generated
+elsewhere. Finally, we mention that the weakly-indicating full adder design
+presented in the next section includes optimizations that minimize the p-type
+transistor stacks.
+
+5.5.4
+Martin’s adder
+
+In [85] Martin addresses the design of dual-rail function blocks in general
+and he illustrates the ideas using a very elegant dual-rail ripple-carry adder.
+The adder has a small transistor count, it exhibits actual case latency when
+adding valid data, and it propagates empty values in constant time – the adder
+represents the ultimate in the design of weakly indicating function blocks.
+Looking at the strongly-indicating transistor-level carry circuit in figure 5.14
+we see that d remains valid until a, b, and c are all empty. If we designed a
+similar sum circuit its output s would also remain valid until a, b, and c are all
+empty. The weak conditions in figure 5.6 only require that one output remains
+
+
+Figure 5.15.
+(a) A 3-stage ripple-carry adder and graphs illustrating how valid data (b) and
+empty data (c) propagate through the circuit (Martin [85]). [Figure not reproduced: panel (a) “Ripple-carry adder” shows three stages with inputs a1, b1 to a3, b3, carries c1 to c3 and d1 to d3, and sums s1 to s3; panels (b) “Validity indication” and (c) “Empty indication” are directed graphs over the same signals, with separate edge styles for Kill/Generate and Propagate.]
+
+valid until all inputs have become invalid. Hence it is allowed to split the
+indication of a, b and c being empty among the carry and the sum circuits.
+In [85] Martin uses some very illustrative directed graphs to express how
+the output signals indicate when input signals and internal signals are valid or
+empty. The nodes in the graphs are the signals in the circuit and the directed
+edges represent indication dependencies. Solid edges represent guaranteed de-
+pendencies and dashed edges represent possible dependencies. Figure 5.15(a)
+shows three full adder stages of a ripple-carry adder, and figures 5.15(b) and
+5.15(c) show how valid and empty inputs respectively propagate through the
+circuit.
+The propagation and indication of valid values is similar to what we dis-
+cussed above in the other adder designs, but the propagation and indication of
+empty values is different and exhibits constant latency. When the outputs d3,
+s3, s2, and s1 are all valid this indicates that all input signals and all inter-
+nal carry signals are valid. Similarly when the outputs d3, s3, s2, and s1 are
+all empty this indicates that all input signals and all internal carry signals are
+empty – the ripple-carry adder satisfies the weak conditions.
+
+
+Figure 5.16.
+The CMOS transistor implementation of Martin’s adder [85, Fig. 3]. [Figure not reproduced: transistor networks on the rails of a, b and c driving s.t, s.f, d.t and d.f.]
+
+The corresponding transistor implementation of a full adder is shown in
+figure 5.16. It uses 34 transistors, which is comparable to a traditional combi-
+natorial circuit implementation.
+The principles explained above apply to the design of function blocks in
+general. “Valid/empty indication (or acknowledgement), dependency graphs”
+as shown in figure 5.15 are a very useful technique for understanding and de-
+signing circuits with low latency and the weakest possible indication.
+
+5.6.
+Hybrid function blocks
+
+The final adder we will present has 4-phase bundled-data input and output
+channels and a dual-rail carry chain. The design exhibits characteristics sim-
+ilar to Martin’s dual-rail adder presented in the previous section: actual case
+latency when propagating valid data, constant latency when propagating empty
+data, and a moderate transistor count. The basic structure of this hybrid adder
+is shown in figure 5.17. Each full adder is composed of a carry circuit and
+a sum circuit. Figure 5.18(a)-(b) shows precharged CMOS implementations
+of the two circuits. The idea is that the circuits precharge when $Req_{in} = 0$,
+evaluate when $Req_{in} = 1$, detect when all carry signals are valid and use this
+information to indicate completion, i.e. $Req_{out}\uparrow$. If the latency of the completion
+detector does not exceed the latency in the sum circuit in a full adder then
+a matched delay element is needed as indicated in figure 5.17.
+The size and latency of the completion detector in figure 5.17 grows with the
+size of the adder, and in wide adders the latency of the completion detector may
+significantly exceed the latency of the sum circuit. An interesting optimization
+that reduces the completion detector overhead – possibly at the expense of
+a small increase in overall latency ($Req_{in}\uparrow$ to $Req_{out}\uparrow$) – is to use a mix of
+strongly and weakly indicating function blocks [101]. Following the naming
+convention established in figure 5.7 on page 64 we could make, for example,
+
+
+74
+Part I: Asynchronous circuit design – A tutorial
+
+Figure 5.17. Block diagram of a hybrid adder with 4-phase bundled-data input and output
+channels and with an internal dual-rail carry chain.
+
+Figure 5.18. The CMOS transistor implementation of a full adder for the hybrid adder in
+figure 5.17: (a) a weakly indicating carry circuit, (b) the sum circuit and (c) a strongly indicating
+carry circuit.
+
+
+
+adders 1, 4, 7, ... weakly indicating and all other adders strongly indicating. In
+this case only the carry signals out of stages 3, 6, 9, ... need to be checked to
+detect completion. For i = 3, 6, 9, ... d_i indicates the completion of d_{i-1} and
+d_{i-2} as well. Many other schemes for mixing strongly and weakly indicating
+full adders are possible. The particular scheme presented in [101] exploited the
+fact that typical-case operands (sampled audio) are numerically small values,
+and the design detects completion from a single carry signal.
+
+Summary – function block design
+
+The previous sections have explained the basics of how to implement func-
+tion blocks and have illustrated this using a variety of ripple-carry adders. The
+main points were “transparency to handshaking” and “actual case latency”
+through the use of weakly-indicating components.
+Finally, a word of warning to put things into the right perspective: to some
+extent the ripple-carry adders explained above over-sell the advantages of aver-
+age-case performance. It is easy to get carried away with elegant circuit de-
+signs but it may not be particularly relevant at the system level:
+
+In many systems the worst-case latency of a ripple-carry adder may sim-
+ply not be acceptable.
+
+In a system with many concurrently active components that synchronize
+and exchange data at high rates, the slowest component at any given time
+tends to dominate system performance; the average-case performance of
+a system may not be nearly as good as the average-case latency of its
+individual components.
+
+In many cases addition is only one part of a more complex compound
+arithmetic operation. For example, the final design of the asynchronous
+filter bank presented in [103] did not use the ideas presented above. In-
+stead we used entirely strongly-indicating full adders because this al-
+lowed an efficient two-dimensional precharged compound add-multiply-
+accumulate unit to be implemented.
+
+5.7.
+MUX and DEMUX
+
+Now that the principles of function block design have been covered we are
+ready to address the implementation of the MUX and DEMUX components,
+cf. figure 3.3 on page 32. Let’s recapitulate their function: a MUX will
+synchronize the control channel and relay the data and the handshaking of the
+selected input channel to the output data channel. The other input channel is
+ignored (and may have a request pending). Similarly a DEMUX will synchro-
+nize the control and the data input channels and steer the input to the selected
+output channel. The other output channel is passive and in the idle state.
+
+
+
+If we consider only the “active” channels then the MUX and the DEMUX
+can be understood and designed as function blocks – they must be transparent
+to the handshaking in the same way as function blocks. The control chan-
+nel and the (selected) input data channel are first joined and then an output is
+produced. Since no data transfer can take place without the control channel
+and the (selected) input data channel both being active, the implementations
+become strongly indicating function blocks.
+Let’s consider implementations using 4-phase protocols. The simplest and
+most intuitive designs use a dual-rail control channel. Figure 5.19 shows the
+implementation of the MUX and the DEMUX using the 4-phase bundled-data
+
+Figure 5.19. Implementation of MUX and DEMUX. The input and output data channels x,
+y, and z use the 4-phase bundled-data protocol and the control channel ctl uses the 4-phase
+dual-rail protocol (in order to simplify the design).
+
+
+
+protocol on the input and output data channels and the 4-phase dual-rail
+protocol on the control channel. In both circuits ctl.t and ctl.f can be understood as
+two mutually exclusive requests that select between the two alternative
+input-to-output data transfers, and in both cases ctl.t and ctl.f are joined with the
+relevant input requests (at the C-elements marked “Join”). The rest of the
+MUX implementation is then similar to the 4-phase bundled-data MERGE in
+figure 5.1 on page 59. The rest of the DEMUX should be self-explanatory; the
+handshaking on the two output ports is mutually exclusive and the
+acknowledge signals y_ack and z_ack are ORed to form x_ack = ctl_ack.
+All 4-phase dual-rail implementations of the MUX and DEMUX compo-
+nents are rather similar, and all 4-phase bundled-data implementations may be
+obtained by adding 4-phase bundled-data to 4-phase dual-rail protocol con-
+version circuits on the control input. At the end of chapter 6, an all 4-phase
+bundled-data MUX will be one of the examples we use to illustrate the design
+of speed-independent control circuits.
+
+5.8.
+Mutual exclusion, arbitration and metastability
+
+5.8.1
+Mutual exclusion
+
+Some handshake components (including MERGE) require that the commu-
+nication along several (input) channels is mutually exclusive. For the simple
+static data-flow circuit structures we have considered so far this has been the
+case, but in general one may encounter situations where a resource is shared
+between several independent parties/processes.
+The basic circuit needed to deal with such situations is a mutual exclusion
+element (MUTEX), figure 5.20 (we will explain the implementation shortly).
+The input signals R1 and R2 are two requests that originate from two inde-
+pendent sources, and the task of the MUTEX is to pass these inputs to the
+corresponding outputs G1 and G2 in such a way that at most one output is ac-
+tive at any given time. If only one input request arrives the operation is trivial.
+If one input request arrives well before the other, the latter request is blocked
+until the first request is de-asserted. The problem arises when both input sig-
+
+Figure 5.20. The mutual exclusion element: symbol and possible implementation.
+
+
+
+nals are asserted at the same time. Then the MUTEX is required to make an
+arbitrary decision, and this is where metastability enters the picture.
+The problem is exactly the same as when a synchronous circuit is exposed
+to an asynchronous input signal (one that does not satisfy set-up and hold time
+requirements). For a clocked flip-flop that is used to synchronize an asyn-
+chronous input signal, the question is whether the data signal made its tran-
+sition before or after the active edge of the clock. As with the MUTEX the
+question is again which signal transition occurred first, and as with the MUTEX
+a random decision is needed if the transition of the data signal coincides
+with the active edge of the clock signal.
+The fundamental problem in a MUTEX and in a synchronizer flip-flop is
+that we are dealing with a bi-stable circuit that receives requests to enter each
+of its two stable states at the same time. This will cause the circuit to enter a
+metastable state in which it may stay for an unbounded length of time before
+randomly settling in one of its stable states. The problem of synchronization
+is covered in most textbooks on digital design and VLSI, and the analysis of
+metastability that is presented in these textbooks applies to our MUTEX com-
+ponent as well. A selection of references is: [95, sect. 9.4] [53, sect. 5.4 and
+6.5] [151, sect. 5.5.7] [115, sect. 6.2.2 and 9.4-5] [150, sect. 8.9].
+For the synchronous designer the problem is that metastability may per-
+sist beyond the time interval that has been allocated to recover from potential
+metastability. It is simply not possible to obtain a decision within a bounded
+length of time. The asynchronous designer, on the other hand, will eventually
+obtain a decision, but there is no upper limit on the time he will have to wait
+for the answer. In [22] the terms “time safe” and “value safe” are introduced
+to denote and classify these two situations.
+A possible implementation of the MUTEX, as shown in figure 5.20, in-
+volves a pair of cross coupled NAND gates and a metastability filter. The cross
+coupled NAND gates enable one input to block the other. If both inputs are
+asserted at the same time, the circuit becomes metastable with both signals
+x1 and x2 halfway between supply and ground. The metastability filter pre-
+vents these undefined values from propagating to the outputs; G1 and G2 are
+both kept low until signals x1 and x2 differ by more than a transistor threshold
+voltage.
+The metastability filter in figure 5.20 is a CMOS transistor-level implemen-
+tation from [83]. An NMOS predecessor of this circuit appeared in [121].
+Gate-level implementations are also possible: the metastability filter can be
+implemented using two buffers whose logic thresholds have been made partic-
+ularly high (or low) by “trimming” the strengths of the pull-up and pull-down
+transistor paths ([151, section 2.3]). For example, a 4-input NAND gate with
+all its inputs tied together implements a buffer with a particularly high logic
+
+
+
+threshold. The use of this idea in the implementation of mutual exclusion ele-
+ments is described in [6, 139].
+
+5.8.2
+Arbitration
+
+The MUTEX can be used to build a handshake arbiter that controls access to a
+resource that is shared between several autonomous, independent parties.
+One possible implementation is shown in figure 5.21.
+
+Figure 5.21. The handshake arbiter: symbol and possible implementation.
+
+The MUTEX ensures that signals G1 and G2 at the a’–aa’ interface are
+mutually exclusive. Following the MUTEX are two AND gates whose purpose
+it is to ensure that handshakes on the (y1, A1) and (y2, A2) channels at the
+b’–bb’ interface are mutually exclusive: y2 can only go high if A1 is low and
+y1 can only go high if signal A2 is low. In this way, if handshaking is in
+progress along one channel, it blocks handshaking on the other channel. As
+handshaking along channels (y1, A1) and (y2, A2) is mutually exclusive, the
+rest of the arbiter is simply a MERGE, cf. figure 5.1 on page 59. If data needs
+to be passed to the shared resource a multiplexer is needed in exactly the same
+way as in the MERGE. The multiplexer may be controlled by signals y1 and/or
+y2.
+
+5.8.3
+Probability of metastability
+
+Let us finally take a quantitative look at metastability: if P(met_t) denotes
+the probability of the MUTEX being metastable for a period of time of t or
+longer (within an observation interval of one second), and if this situation is
+considered a failure, then we may calculate the mean time between failure as:
+
+MTBF = 1 / P(met_t)    (5.8)
+
+The probability P(met_t) may be calculated as:
+
+P(met_t) = P(met_t | met_{t=0}) · P(met_{t=0})    (5.9)
+
+
+
+where:
+
+P(met_t | met_{t=0}) is the probability that the MUTEX is still metastable at
+time t given that it was metastable at time t = 0.
+
+P(met_{t=0}) is the probability that the MUTEX will enter metastability
+within a given observation interval.
+
+The probability P(met_{t=0}) can be calculated as follows: the MUTEX will go
+metastable if its inputs R1 and R2 are exposed to transitions that occur almost
+simultaneously, i.e. within some small time window ∆. If we assume that
+the two input signals are uncorrelated and that they have average switching
+frequencies f_R1 and f_R2 respectively, then:
+
+P(met_{t=0}) = ∆ · f_R1 · f_R2    (5.10)
+
+which can be understood as follows: within an observation interval of one
+second the input signal R2 makes f_R2 attempts at hitting one of the f_R1
+time intervals of duration ∆ where the MUTEX is vulnerable to metastability.
+The probability P(met_t | met_{t=0}) is determined as:
+
+P(met_t | met_{t=0}) = e^(-t/τ)    (5.11)
+
+where τ expresses the ability of the MUTEX to exit the metastable state spon-
+taneously. This equation can be explained in two different ways and experi-
+mental results have confirmed its correctness. One explanation is that the cross
+coupled NAND gates have no memory of how long they have been metastable,
+and that the only probability distribution that is “memoryless” is an exponen-
+tial distribution. Another explanation is that a small-signal model of the cross-
+coupled NAND gates at the metastable point has a single dominating pole.
+Combining equations 5.8–5.11 we obtain:
+
+MTBF = e^(t/τ) / (∆ · f_R1 · f_R2)    (5.12)
+
+Experiments and simulations have shown that this equation is reasonably
+accurate provided that t is not very small, and experiments or simulations may
+be used to determine the two parameters ∆ and τ. Representative values for
+good circuit designs implemented in a 0.25 µm CMOS process are ∆ = 30 ps
+and τ = 25 ps.
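As a rough numerical illustration of equation 5.12, the following Python sketch evaluates the MTBF for the representative values ∆ = 30 ps and τ = 25 ps quoted above. The request frequencies of 100 MHz, the waiting times and the function name `mtbf` are assumptions made only for this example.

```python
import math


def mtbf(t, delta, tau, f_r1, f_r2):
    """Mean time between failures in seconds, following equation 5.12.

    t          -- time allowed for metastability to resolve (s)
    delta      -- window in which near-simultaneous inputs cause metastability (s)
    tau        -- time constant for leaving the metastable state (s)
    f_r1, f_r2 -- average switching frequencies of R1 and R2 (Hz)
    """
    return math.exp(t / tau) / (delta * f_r1 * f_r2)


DELTA = 30e-12        # representative value from the text
TAU = 25e-12          # representative value from the text
F_R1 = F_R2 = 100e6   # assumed request frequencies, not from the text

if __name__ == "__main__":
    for t in (0.25e-9, 0.5e-9, 1.0e-9):  # assumed waiting times
        value = mtbf(t, DELTA, TAU, F_R1, F_R2)
        print(f"t = {t * 1e9:.2f} ns  ->  MTBF = {value:.3e} s")
```

Because of the exponential term, every additional τ of waiting time multiplies the MTBF by e, which is why a modest increase in t has a very large effect.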
+
+5.9.
+Summary
+
+This chapter addressed the implementation of the various handshake
+components (latch, fork, join, merge, function blocks, mux, demux, mutex and
+arbiter). A significant part of the material addressed principles and techniques
+for implementing function blocks.
+
+
+Chapter 6
+
+SPEED-INDEPENDENT CONTROL CIRCUITS
+
+This chapter provides an introduction to the design of asynchronous sequen-
+tial circuits and explains in detail one well-developed specification and synthe-
+sis method: the synthesis of speed-independent control circuits from signal
+transition graph specifications.
+
+6.1.
+Introduction
+
+Over time many different formalisms and theories have been proposed for
+the design of asynchronous control circuits (e.g. sequential circuits or state
+machines). The multitude of approaches arises from the combination of: (a)
+different specification formalisms, (b) different assumptions about delay mod-
+els for gates and wires, and (c) different assumptions about the interaction
+between the circuit and its environment. Full coverage of the topic is far be-
+yond the scope of this book. Instead we will first present some of the basic
+assumptions and characteristics of the various design methods and give point-
+ers to relevant literature and then we will explain in detail one method: the
+design of speed-independent circuits from signal transition graphs – a method
+that is supported by a well-developed public domain tool, Petrify.
+A good starting point for further reading is a book by Myers [95]. It provides
+in-depth coverage of the various formalisms, methods, and theories for the
+design of asynchronous sequential circuits and it provides a comprehensive
+list of references.
+
+6.1.1
+Asynchronous sequential circuits
+
+To start the discussion figure 6.1 shows a generic synchronous sequential
+circuit and two alternative asynchronous control circuits: a Huffman style fun-
+damental mode circuit with buffers (delay elements) in the feedback signals,
+and a Muller style input-output mode circuit with wires in the feedback path.
+The synchronous circuit is composed of a set of registers holding the current
+state and a combinational logic circuit that computes the output signals and the
+next state signals. When the clock ticks the next state signals are copied into the
+registers thus becoming the current state. Reliable operation only requires that
+
+
+Figure 6.1. (a) A synchronous sequential circuit. (b) A Huffman style asynchronous sequential
+circuit with buffers in the feedback path, and (c) a Muller style asynchronous sequential circuit
+with wires in the feedback path.
+
+the next state output signals from the combinational logic circuit are stable in a
+time window around the rising edge of the clock, an interval that is defined by
+the setup and hold time parameters of the register. Between two clock ticks the
+combinational logic circuit is allowed to produce signals that exhibit hazards.
+The only thing that matters is that the signals are ready and stable when the
+clock ticks.
+In an asynchronous circuit there is no clock and all signals have to be valid
+at all times. This implies that at least the output signals that are seen by the
+environment must be free from all hazards. To achieve this, it is sometimes
+necessary to avoid hazards on internal signals as well. This is why the
+synthesis of asynchronous sequential circuits is difficult. Because it is difficult,
+researchers have proposed different methods that are based on different
+(simplifying) assumptions.
+
+6.1.2
+Hazards
+
+For the circuit designer a hazard is an unwanted glitch on a signal. Fig-
+ure 6.2 shows four possible hazards that may be observed. A circuit that is in
+a stable state does not spontaneously produce a hazard – hazards are related
+to the dynamic operation of a circuit. This again relates to the dynamics of
+the input signals as well as the delays in the gates and wires in the circuit. A
+discussion of hazards is therefore not possible without stating precisely which
+delay model is being used and what assumptions are made about the interaction
+between the circuit and its environment. There are greater theoretical depths
+in this area than one might think at a first glance.
+Gates are normally assumed to have delays. In section 2.5.3 we also dis-
+cussed wire delays, and in particular the implications of having different delays
+in different branches of a forking wire. In addition to gate and wire delays it is
+also necessary to specify which delay model is being used.
+
+
+
+[Figure panels: Static-1 hazard, Static-0 hazard, Dynamic-10 hazard, Dynamic-01 hazard; each compares the desired signal with the actual signal.]
+
+Figure 6.2. Possible hazards that may be observed on a signal.
+
+6.1.3
+Delay models
+
+A pure delay that simply shifts any signal waveform later in time is perhaps
+what first comes to mind. In the hardware description language VHDL this is
+called a transport delay. It is, however, not a very realistic model as it implies
+that the gates and wires have infinitely high bandwidth. A more realistic delay
+model is the inertial delay model. In addition to the time shifting of a signal
+waveform, an inertial delay suppresses short pulses. In the inertial delay model
+used in VHDL two parameters are specified, the delay time and the reject time,
+and pulses shorter than the reject time are filtered out. The inertial delay model
+is the default delay model used in VHDL.
+These two fundamental delay models come in several flavours depending on
+how the delay time parameter is specified. The simplest is a fixed delay where
+the delay is a constant. An alternative is a min-max delay where the delay is
+unknown but within a lower and upper bound: t_min ≤ t_delay ≤ t_max. A more
+pessimistic model is the unbounded delay where delays are positive (i.e. not
+zero), unknown and unbounded from above: 0 < t_delay < ∞. This is the delay
+model that is used for gates in speed-independent circuits.
+It is intuitive that the inertial delay model and the min-max delay model
+both have properties that help filter out some potential hazards.
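The difference between the two fundamental delay models can be illustrated with a small Python sketch. It is a simplified illustration and not a reproduction of the exact VHDL semantics: a binary signal is represented as a list of (time, new value) transitions, and the function names and the example waveform are assumptions made for this example.

```python
def transport_delay(events, delay):
    """Pure delay: shift every transition later in time, keep all of them."""
    return [(t + delay, v) for t, v in events]


def inertial_delay(events, delay, reject):
    """Simplified inertial delay for a binary signal.

    Two consecutive transitions closer together than `reject` form a short
    pulse; both edges of such a pulse are dropped before the shift is applied.
    """
    kept = []
    for t, v in events:
        if kept and t - kept[-1][0] < reject:
            kept.pop()            # drop the leading edge of the short pulse
        else:
            kept.append((t, v))   # trailing edge of a short pulse is skipped
    return [(t + delay, v) for t, v in kept]


if __name__ == "__main__":
    # Signal rises at t=0, has a 1-unit glitch to 0 at t=10, and falls at t=30.
    waveform = [(0, 1), (10, 0), (11, 1), (30, 0)]
    print(transport_delay(waveform, 2))    # glitch is preserved
    print(inertial_delay(waveform, 2, 3))  # glitch is filtered out
```

With a reject time of 3 time units the 1-unit glitch disappears under the inertial model, whereas the transport model passes it on unchanged.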
+
+6.1.4
+Fundamental mode and input-output mode
+
+In addition to the delays in the gates and wires, it is also necessary to for-
+malize the interaction between the circuit being designed and its environment.
+Again, strong assumptions may simplify the design of the circuit. The design
+methods that have been proposed over time all have their roots in one of the
+following assumptions:
+
+Fundamental mode: The circuit is assumed to be in a state where all input
+signals, internal signals, and output signals are stable. In such a sta-
+ble state the environment is allowed to change one input signal. After
+
+
+
+that, the environment is not allowed to change the input signals again
+until the entire circuit has stabilized. Since internal signals such as state
+variables are unknown to the environment, this implies that the longest
+delay in the circuit must be calculated and the environment is required
+to keep the input signals stable for at least this amount of time. For this
+to make sense, the delays in gates and wires in the circuit have to be
+bounded from above. The limitation on the environment is formulated
+as an absolute time requirement.
+
+The design of asynchronous sequential circuits based on fundamental
+mode operation was pioneered by David Huffman in the 1950s [59, 60].
+
+Input-output mode: Again the circuit is assumed to be in a stable state. Here
+the environment is allowed to change the inputs. When the circuit has
+produced the corresponding output (and it is allowable that there are no
+output changes), the environment is allowed to change the inputs again.
+There are no assumptions about the internal signals and it is therefore
+possible that the next input change occurs before the circuit has stabi-
+lized in response to the previous input signal change.
+
+The restrictions on the environment are formulated as causal relations
+between input signal transitions and output signal transitions. For this
+reason the circuits are often specified using trace based methods where
+the designer specifies all possible sequences of input and output signal
+transitions that can be observed on the interface of the circuit. Signal
+transition graphs, introduced later, are such a trace-based specification
+technique.
+
+The design of asynchronous sequential circuits based on the input-output
+mode of operation was pioneered by David Muller in the 1950s [93, 92].
+As mentioned in section 2.5.1, these circuits are speed-independent.
+
+6.1.5
+Synthesis of fundamental mode circuits
+
+In the classic work by Huffman the environment was only allowed to change
+one input signal at a time. In response to such an input signal change, the
+combinational logic will produce new outputs, of which some are fed back,
+figure 6.1(b). In the original work it was further required that only one feed-
+back signal changes (at a time) and that the delay in the feedback buffer is
+large enough to ensure that the entire combinational circuit has stabilized be-
+fore it sees the change of the feedback signal. This change may, in turn, cause
+the combinational logic to produce new outputs, etc. Eventually through a se-
+quence of single signal transitions the circuit will reach a stable state where
+the environment is again allowed to make a single input change. Another way
+of expressing this behaviour is to say that the circuit starts out in a stable state
+
+
+
+[Figure panels: Primitive flow table (inputs a,b; output c; states s0 to s5), Mealy type state diagram (edges labelled ab/c), Burst mode specification (s0 to s3 on a+ b+ / c+, s3 to s0 on a- b- / c-).]
+
+Figure 6.3. Some alternative specifications of a Muller C-element: a Mealy state diagram, a
+primitive flow table, and a burst-mode state diagram.
+
+(which is defined to be a state that will persist until an input signal changes). In
+response to an input signal change the circuit will step through a sequence of
+transient, unstable states, until it eventually settles in a new stable state. This
+sequence of states is such that from one state to the next only one variable
+changes.
+The interested reader is encouraged to consult [75], [133] or [95] and to
+specify and synthesize a C-element. The following gives a flavour of the design
+process and the steps involved:
+
+The design may start with a state graph specification that is very simi-
+lar to the specification of a synchronous sequential circuit. This is op-
+tional. Figure 6.3 shows a Mealy type state graph specification of the
+C-element.
+
+The classic design process involves the following steps:
+
+The intended sequential circuit is specified in the form of a primitive
+flow table (a state table with one row per stable state). Figure 6.3 shows
+the primitive flow table specification of a C-element.
+
+A minimum-row reduced flow table is obtained by merging compatible
+states in the primitive flow table.
+
+The states are encoded.
+
+Boolean equations for output variables and state variables are derived.
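To give an impression of what the last step produces, the sketch below models a C-element in Python using the commonly quoted next-state equation c' = ab + c(a + b). This equation is assumed here as the well-known result for a C-element; it is not derived from the flow table in this sketch, and the function names are ours. The loop in `settle` mimics fundamental mode operation: the inputs are held until the fed-back state variable has stopped changing.

```python
def c_next(a, b, c):
    """Next-state function of a Muller C-element: c' = a*b + c*(a + b)."""
    return (a & b) | (c & (a | b))


def settle(a, b, c):
    """Hold the inputs and iterate the feedback until a stable state is reached."""
    while True:
        nxt = c_next(a, b, c)
        if nxt == c:
            return c
        c = nxt


def run(input_sequence, c=0):
    """Apply a sequence of (a, b) input values and record the stable outputs."""
    outputs = []
    for a, b in input_sequence:
        c = settle(a, b, c)
        outputs.append(c)
    return outputs


if __name__ == "__main__":
    # One input changes at a time, as required in Huffman's classic setting.
    print(run([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]))
```

For the input sequence shown, the output only rises once both inputs are 1 and only falls once both inputs are 0, giving the stable outputs 0, 0, 1, 1, 0.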
+
+Later work has generalized the fundamental mode approach by allowing a
+restricted form of multiple-input and multiple-output changes. This approach
+
+
+
+is called burst mode [32, 27]. When in a stable state, a burst-mode circuit
+will wait for a set of input signals to change (in arbitrary order). After such
+an input burst has completed the machine computes a burst of output signals
+and new values of the internal variables. The environment is not allowed to
+produce a new input burst until the circuit has completely reacted to the pre-
+vious burst – fundamental mode is still assumed, but only between bursts of
+input changes. For comparison, figure 6.3 also shows a burst-mode specifica-
+tion of a C-element. Burst-mode circuits are specified using state graphs that
+are very similar to those used in the design of synchronous circuits. Several
+mature tools for synthesizing burst-mode controllers have been developed in
+academia [40, 160]. These tools are available in the public domain.
+
+6.2.
+Signal transition graphs
+
+The rest of this chapter will be devoted to the specification and synthesis
+of speed-independent control circuits. These circuits operate in input-output
+mode and they are naturally specified using signal transition graphs (STGs).
+An STG is a Petri net and it can be seen as a formalization of a timing
+diagram. The synthesis procedure that we will explain in the following consists
+of: (1) Capturing the behaviour of the intended circuit and its environment in
+an STG. (2) Generating the corresponding state graph, and adding state vari-
+ables if needed. (3) Deriving Boolean equations for the state variables and
+outputs.
+
+6.2.1
+Petri nets and STGs
+
+Briefly, a Petri net [3, 113, 94] is a graph composed of directed arcs and
+two types of nodes: transitions and places. Depending on the interpretation
+that is assigned to places, transitions and arcs, Petri nets can be used to model
+and analyze many different (concurrent) systems. Some places can be marked
+with tokens and the Petri net model can be “executed” by firing transitions. A
+transition is enabled to fire if there are tokens on all of its input places, and
+an enabled transition must eventually fire. When a transition fires, a token is
+removed from each input place and a token is added to each output place. We
+will show an example shortly. Petri nets offer a convenient way of expressing
+choice and concurrency.
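The firing rule described above can be sketched in a few lines of Python. The representation (a marking as a dictionary from place names to token counts, a transition as a pair of input and output place tuples) and the small fork/join net are assumptions chosen for this illustration.

```python
def enabled(marking, transition):
    """A transition is enabled if there is a token on each of its input places."""
    inputs, _outputs = transition
    return all(marking.get(place, 0) >= 1 for place in inputs)


def fire(marking, transition):
    """Fire an enabled transition and return the new marking.

    One token is removed from each input place and one token is added to
    each output place. The original marking is left unchanged.
    """
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    inputs, outputs = transition
    new_marking = dict(marking)
    for place in inputs:
        new_marking[place] -= 1
    for place in outputs:
        new_marking[place] = new_marking.get(place, 0) + 1
    return new_marking


if __name__ == "__main__":
    # Hypothetical net: a fork from p0 into p1 and p2, then a join into p3.
    fork = (("p0",), ("p1", "p2"))
    join = (("p1", "p2"), ("p3",))
    m0 = {"p0": 1}
    m1 = fire(m0, fork)
    m2 = fire(m1, join)
    print(m0, m1, m2, sep="\n")
```

In this small net the join transition only becomes enabled after the fork has placed tokens on both of its input places, which is how concurrency followed by synchronization is expressed.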
+It is important to stress that there are many variations of and extensions to
+Petri nets – Petri nets are a family of related models and not a single, unique
+and well defined model. Often certain restrictions are imposed in order to make
+the analysis for certain properties practical. The STGs we will consider in the
+following belong to such a restricted subclass: an STG is a 1-bounded Petri net
+in which only simple forms of input choice are allowed. The exact meaning of
+
+
+
+[Figure panels: C-element and dummy environment, Timing diagram (signals a, b, c), Petri net, STG (transitions a+, b+, c+, a-, b-, c-).]
+
+Figure 6.4. A C-element and its ‘well behaved’ dummy environment, its specification in the
+form of a timing diagram, a Petri net, and an STG formalization of the timing diagram.
+
+“1-bounded” and “simple forms of input choice” will be defined at the end of
+this section.
+In an STG the transitions are interpreted as signal transitions and the places
+and arcs capture the causal relations between the signal transitions. Figure 6.4
+shows a C-element and a ‘well behaved’ dummy environment that maintains
+the input signals until the C-element has changed its outputs. The intended be-
+haviour of the C-element could be expressed in the form of a timing diagram
+as shown in the figure. Figure 6.4 also shows the corresponding Petri net
+specification. The Petri net is marked with tokens on the input places to the a+ and
+b+ transitions, corresponding to state (a, b, c) = (0, 0, 0). The a+ and b+
+transitions may fire in any order, and when they have both fired the c+ transition
+becomes enabled to fire, etc. Often STGs are drawn in a simpler form where
+most places have been omitted. Every arc that connects two transitions is then
+thought of as containing a place. Figure 6.4 shows the STG specification of the
+C-element.
+A given marking of a Petri net corresponds to a possible state of the sys-
+tem being modeled, and by executing the Petri net and identifying all possible
+
+
+
+markings it is possible to derive the corresponding state graph of the system.
+The state graph is generally much more complex than the corresponding Petri
+net.
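The derivation of a state graph by executing the net can be sketched as follows for the C-element STG. The arc list is our reading of the behaviour described for figure 6.4 (a+ and b+ both precede c+, c+ precedes a- and b-, and so on) and is therefore an assumption, as are all names in the code. Each arc is treated as an implicit place, and because the net is assumed to be 1-bounded a marking is simply the set of arcs that hold a token.

```python
from collections import deque

# Assumed STG of the C-element and its dummy environment: (source, target) arcs.
ARCS = [("a+", "c+"), ("b+", "c+"), ("c+", "a-"), ("c+", "b-"),
        ("a-", "c-"), ("b-", "c-"), ("c-", "a+"), ("c-", "b+")]
INITIAL_MARKING = frozenset({("c-", "a+"), ("c-", "b+")})
INITIAL_CODE = (0, 0, 0)  # signal values (a, b, c)


def enabled(marking):
    """Return the transitions whose incoming arcs all hold a token."""
    transitions = sorted({dst for _src, dst in ARCS})
    return [t for t in transitions
            if all(arc in marking for arc in ARCS if arc[1] == t)]


def fire(marking, t):
    """Remove tokens from the arcs entering t and add tokens to the arcs leaving t."""
    consumed = {arc for arc in ARCS if arc[1] == t}
    produced = {arc for arc in ARCS if arc[0] == t}
    return frozenset((marking - consumed) | produced)


def state_graph():
    """Breadth-first exploration of all reachable (marking, signal code) pairs."""
    start = (INITIAL_MARKING, INITIAL_CODE)
    seen, edges, queue = {start}, [], deque([start])
    while queue:
        marking, code = queue.popleft()
        for t in enabled(marking):
            new_code = list(code)
            new_code["abc".index(t[0])] = 1 if t[1] == "+" else 0
            nxt = (fire(marking, t), tuple(new_code))
            edges.append((code, t, nxt[1]))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, edges


if __name__ == "__main__":
    states, edges = state_graph()
    print(len(states), "states,", len(edges), "edges")
    for src, t, dst in edges:
        print(src, t, dst)
```

Under these assumptions the six-transition STG expands into a state graph with eight states and ten edges, which illustrates why the state graph tends to be larger than the net it is derived from.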
+An STG describing a meaningful circuit enjoys certain properties, and for
+the synthesis algorithms used in tools like Petrify to work, additional properties
+and restrictions may be required. An STG is a Petri net with the following
+characteristics:
+
+1 Input free choice: The selection among alternatives must only be con-
+trolled by mutually exclusive inputs.
+
+2 1-bounded: There must never be more than one token in a place.
+
+3 Liveness: The STG must be free from deadlocks.
+
+An STG describing a meaningful speed-independent circuit has the following
+characteristics:
+
+4 Consistent state assignment: The transitions of a signal must strictly
+alternate between + and - in any execution of the STG.
+
+5 Persistency: If a signal transition is enabled it must take place, i.e. it
+must not be disabled by another signal transition. The STG specification
+of the circuit must guarantee persistency of internal signals (state vari-
+ables) and output signals, whereas it is up to the environment to guaran-
+tee persistency of the input signals.
+
+In order to be able to synthesize a circuit implementation the following char-
+acteristic is required:
+
+6 Complete state coding (CSC): Two or more different markings of the
+STG must not have the same signal values (i.e. correspond to the same
+state). If this is not the case, it is necessary to introduce extra state
+variables such that different markings correspond to different states. The
+synthesis tool Petrify will do this automatically.
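+
+Two of the properties above lend themselves to simple mechanical checks. The
+sketch below is illustrative only: the function names and the data layout (a
+trace as a list of strings, a dictionary from marking names to signal codes)
+are assumptions, and the second check follows the wording of property 6 as it
+is stated here.
+
```python
def strictly_alternates(trace, initial=None):
    """Property 4: every signal must alternate between + and - along a trace.

    trace   : list of transitions such as ["a+", "b+", "c+", "a-"]
    initial : optional dict of initial signal values, e.g. {"a": 0}
    """
    last = {}
    for sig, val in (initial or {}).items():
        # A signal that starts at 1 behaves as if its last transition was "+".
        last[sig] = "+" if val else "-"
    for event in trace:
        signal, direction = event[:-1], event[-1]
        if last.get(signal) == direction:
            return False
        last[signal] = direction
    return True


def shared_codes(codes):
    """Property 6: report signal codes that are shared by two or more markings."""
    groups = {}
    for marking, code in codes.items():
        groups.setdefault(code, []).append(marking)
    return {code: ms for code, ms in groups.items() if len(ms) > 1}


start = {"a": 0, "b": 0, "c": 0}
print(strictly_alternates(["a+", "b+", "c+", "a-", "b-", "c-"], start))
print(strictly_alternates(["a+", "b+", "a+"], start))
print(shared_codes({"m0": "000", "m1": "100", "m2": "000"}))
```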
+
+6.2.2
+Some frequently used STG fragments
+
+For the newcomer it may take a little practice to become familiar with spec-
+ifying and designing circuits using STGs. This section explains some of the
+most frequently used templates from which one can construct complete speci-
+fications.
+The basic constructs are: fork, join, choice and merge, see figure 6.5. The
+choice is restricted to what is called input free choice: the transitions follow-
+ing the choice place must represent mutually exclusive input signal transitions.
+This requirement is quite natural; we will only specify and design determin-
+istic circuits. Figure 6.6 shows an example Petri net that illustrates the use
+
+
+Chapter 6: Speed-independent control circuits
+89
+
+Figure 6.5.
+Petri net fragments for fork, join, free choice and merge constructs.
+
+Figure 6.6.
+An example Petri net that illustrates the use of fork, join, free choice and merge.
+
+of fork, join, free choice and merge. The example system will either perform
+transitions T6 and T7 in sequence, or it will perform T1 followed by the con-
+current execution of transitions T2, T3 and T4 (which may occur in any order),
+followed by T5.
+Towards the end of this chapter we will design a 4-phase bundled-data ver-
+sion of the MUX component from figure 3.3 on page 32. For this we will need
+some additional constructs: a controlled choice and a Petri net fragment for the
+input end of a bundled-data channel.
+Figure 6.7 shows a Petri net fragment where place P1 and transitions T3
+and T4 represent a controlled choice: a token in place P1 will engage in either
+transition T3 or transition T4. The choice is controlled by the presence of
+a token in either P2 or P3. It is crucial that there can never be a token in
+both these places at the same time, and in the example this is ensured by the
+mutually exclusive input signal transitions T1 and T2.
+
+
+Figure 6.7.
+A Petri net fragment including a controlled choice (P0: free choice; P1: controlled choice).
+
+Figure 6.8 shows a Petri net fragment for a component with a one-bit input
+channel using a 4-phase bundled-data protocol. It could be the control chan-
+nel used in the MUX and DEMUX components introduced in figure 3.3 on
+page 32. The two transitions dummy1 and dummy2 do not represent transitions
+on the three signals in the channel, they are dummy transitions that facilitate
+expressing the specification. These dummy transitions represent an extension
+to the basic class of STGs.
+Note also that the four arcs connecting:
+place P5 and transition Ctl+
+place P5 and transition Ctl-
+place P6 and transition dummy2
+place P7 and transition dummy1
+have arrows at both ends. This is a shorthand notation for an arc in each direction.
+Note also that there are several instances where a place is both an input
+place and an output place for a transition. Place P5 and transition Ctl+ is an
+example of this.
+The overall structure of the Petri net fragment can be understood as follows:
+at the top is a sequence of transitions and places that capture the handshaking
+on the Req and Ack signals. At the bottom is a loop composed of places P6
+and P7 and transitions Ctl+ and Ctl- that captures the control signal changing
+between high and low. The absence of a token in place P5 when Req is high
+expresses the fact that Ctl is stable in this period. When Req is low and a
+token is present in place P5, Ctl is allowed to make as many transitions as it
+desires. When Req+ fires, a token is put in place P4 (which is a controlled
+choice place). The Ctl signal is now stable, and depending on its value one of
+the two transitions dummy1 or dummy2 will become enabled and eventually
+
+
+Figure 6.8.
+A Petri net fragment for a component with a one-bit input channel using a 4-phase
+bundled-data protocol.
+
+fire. At this point the intended input-to-output operation that is not included in
+this example may take place, and finally the handshaking on the control port
+finishes (Ack+; Req-; Ack-).
+
+6.3.
+The basic synthesis procedure
+
+The starting point for the synthesis process is an STG that satisfies the re-
+quirements listed on page 88. From this STG the corresponding state graph is
+derived by identifying all of the possible markings of the STG that are reach-
+able given its initial marking. The last step of the synthesis process is to derive
+Boolean equations for the state variables and output variables.
+We will go through a number of examples by hand in order to illustrate the
+techniques used. Since the state of a circuit includes the values of all of the
+signals in the circuit, the computational complexity of the synthesis process
+can be large, even for small circuits. In practice one would always use one
+of the CAD tools that has been developed – for example Petrify that we will
+introduce later.
+
+
+
+6.3.1
+Example 1: a C-element
+
+Figure 6.9.
+State graph and Boolean equation for the C-element STG from figure 6.4.
+State graph (abc, excited signals marked *): 0*0*0, 10*0, 0*10, 110*, 1*1*1, 01*1, 1*01, 001*.
+Boolean equation: c = ab + ac + bc
+
+Figure 6.9 shows the state graph corresponding to the STG specification in
+figure 6.4 on page 87. Variables that are excited in a given state are marked
+with an asterisk. Also shown in figure 6.9 is the Karnaugh map for output
+signal c. The Boolean expression for c must cover states in which c = 1 and
+states where it is excited, c = 0* (changing to 1). In order to better distinguish
+excited variables from stable ones in the Karnaugh maps, we will use R (rising)
+instead of 0* and F (falling) instead of 1* throughout the rest of this book.
+It is comforting to see that we can successfully derive the implementation of
+a known circuit, but the C-element is really too simple to illustrate all aspects
+of the design process.
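+
+As a quick cross-check of the C-element result, the sketch below compares the
+equation c = ab + ac + bc with the next value of c read off the state graph (c
+rises in state 110, falls in state 001, and otherwise holds its value). The
+function names are illustrative.
+
```python
from itertools import product


def c_next(a, b, c):
    """Next value of c according to the state graph of the C-element."""
    if (a, b, c) == (1, 1, 0):   # c is excited (R)
        return 1
    if (a, b, c) == (0, 0, 1):   # c is excited (F)
        return 0
    return c                      # c is stable


def c_equation(a, b, c):
    """The derived Boolean equation c = ab + ac + bc."""
    return (a & b) | (a & c) | (b & c)


# All eight states are reachable, so there are no don't cares to exploit.
matches = all(c_next(*s) == c_equation(*s) for s in product((0, 1), repeat=3))
print("equation matches state graph:", matches)
```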
+
+6.3.2
+Example 2: a circuit with choice
+
+The following example provides a better illustration of the synthesis pro-
+cedure, and in a subsequent section we will come back to this example and
+explain more efficient implementations. The example is simple – the circuit
+has only 2 inputs and 2 outputs – and yet it brings forward all relevant issues.
+The example is due to Chris Myers of the University of Utah who presented it
+in his 1996 course EE 587 “Asynchronous VLSI System Design.” The example
+has roots in the papers [12, 13].
+Figure 6.10 shows a specification of the circuit. The circuit has two inputs
+a and b and two outputs c and d, and the circuit has two alternative behaviours
+as illustrated in the timing diagram. The corresponding STG specification is
+shown in figure 6.11 along with the state graph for the circuit. The STG
+includes only the free choice place P0 and the merge place P1. All arcs that
+directly connect two transitions are assumed to include a place. The states in
+the state diagram have been labeled with decimal numbers to ease filling out
+the Karnaugh maps.
+
+Figure 6.10.
+The example circuit from [12, 13].
+
+Figure 6.11.
+The STG specification and the corresponding state graph.
+State graph (abcd, excited signals marked *): 0: 0*0*00; 4: 010*0; 6: 01*10;
+2: 001*0; 8: 10*00; 12: 1100*; 13: 110*1; 15: 1111*; 14: 1*110.
+
+Figure 6.12.
+The Karnaugh map, the Boolean equation, and two alternative gate-level
+implementations of output signal c (an atomic complex gate, and simple gates
+with internal signals x1, x2 and x3).
+Boolean equation for c: c = d + a'b + bc (a' denotes the complement of a)
+The STG satisfies all of the properties 1-6 listed on page 88 and it is thus
+possible to proceed and derive Boolean equations for output signals c and d.
+[Note: In state 0 both inputs are marked to be excited, (a, b) = (0*, 0*), and
+in states 4 and 8 one of the signals is still 0 but no longer excited. This is a
+problem of notation only. In reality only one of the two variables is excited in
+state 0, but we don’t know which one. Furthermore, the STG is only required
+to be persistent with respect to the internal signals and the output signals. Per-
+sistency of the input signals must be guaranteed by the environment].
+For output c, figure 6.12 shows the Karnaugh map, the Boolean equation and
+two alternative gate implementations: one using a single atomic And-Or-Invert
+gate, and one using simple AND and OR gates. Note that there are states that
+are not reachable by the circuit. In the Karnaugh map these states correspond
+to don’t cares. The implementation of output signal d is left as an exercise for
+the reader (d = abc', where c' denotes the complement of c).
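+
+The two equations can be checked against the state graph of figure 6.11. In the
+sketch below the state labels are transcribed from that figure (signal order
+abcd, an asterisk marks an excited signal), and the complemented literals are
+an assumption based on the description of the cubes: c = d + a'b + bc and
+d = abc'. All names are illustrative.
+
```python
SIGNALS = "abcd"
# State number -> label (abcd, "*" after a value marks an excited signal).
STATE_GRAPH = {0: "0*0*00", 4: "010*0", 6: "01*10", 2: "001*0", 8: "10*00",
               12: "1100*", 13: "110*1", 15: "1111*", 14: "1*110"}


def parse(label):
    """Split a label into signal values and the set of excited signals."""
    bits, excited = [], set()
    for ch in label:
        if ch == "*":
            excited.add(SIGNALS[len(bits) - 1])
        else:
            bits.append(int(ch))
    return dict(zip(SIGNALS, bits)), excited


def target(label, signal):
    """Value the equation must produce: the current value, flipped if excited."""
    values, excited = parse(label)
    return values[signal] ^ int(signal in excited)


def c_equation(a, b, c, d):
    return d | ((1 - a) & b) | (b & c)


def d_equation(a, b, c, d):
    return a & b & (1 - c)


for number, label in sorted(STATE_GRAPH.items()):
    values, _ = parse(label)
    args = [values[s] for s in SIGNALS]
    ok_c = c_equation(*args) == target(label, "c")
    ok_d = d_equation(*args) == target(label, "d")
    print(number, label, "c ok" if ok_c else "c WRONG", "d ok" if ok_d else "d WRONG")
```
+
+Only the nine reachable states are checked; the remaining seven codes are the
+don't cares of the Karnaugh map.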
+
+6.3.3
+Example 2: Hazards in the simple gate
+implementation
+
+The STG in figure 6.10 satisfies all of the implementation conditions 1-6
+(including persistency), and consequently an implementation where each out-
+put signal is implemented by a single atomic complex gate is hazard free. In
+the case of c we need a complex And-Or gate with inversion of input signal a.
+In general such an atomic implementation is not feasible and it is necessary to
+decompose the implementation into a structure of simpler gates. Unfortunately
+this will introduce extra variables, and these extra variables may not satisfy the
+persistency requirement that an excited signal transition must eventually fire.
+Speed-independence preserving logic decomposition is therefore a very inter-
+esting and relevant topic [20, 76].
+The implementation of c using simple gates that is shown in figure 6.12 is
+not speed-independent; it may exhibit both static and dynamic hazards, and it
+provides a good illustration of the mechanisms behind hazards. The problem
+is that the signals x1, x2 and x3 are not included in the original STG and state
+graph. A detailed analysis that includes these signals would not satisfy the
+persistency requirement. Below we explain possible failure sequences that may
+cause a static-1 hazard and a dynamic-10 hazard on output signal c. Figure 6.13
+illustrates the discussion.
+
+A static-1 hazard may occur when the circuit goes through the following
+sequence of states: 12, 13, 15, 14. The transition from state 12 to state 13
+corresponds to d going high and the transition from state 15 to state 14
+corresponds to d going low again. In state 13 c is excited (R) and it is
+supposed to remain high throughout states 13, 15, 14, and 6. States 13
+and 15 are covered by the cube d, and state 14 is covered by cube bc that
+is supposed to “take over” and maintain c = 1 after d has gone low. If
+the AND gate with output signal x2 that corresponds to cube bc is slow
+we have the problem - the static-1 hazard.
+
+Figure 6.13.
+The Karnaugh maps for output signal c showing state sequences that may lead to
+hazards (panels: potential dynamic-10 hazard; potential static-1 hazard).
+
+A dynamic-10 hazard may occur when the circuit goes through the following
+sequence of states: 4, 6, 2, 0. This situation corresponds to the upper
+AND gate (with output signal x1) and the OR gate relaying b+ into
+c+ and b- into c-. However, after the c+ transition the lower AND
+gate, x2, becomes excited (R) as well, but the firing of this gate is not
+indicated by any other signal transition – the OR gate already has one
+input high. If the lower AND gate (x2) fires, it will later become excited
+(F) in response to c-. The net effect of this is that the lower AND gate
+(x2) may superimpose a 0-1-0 pulse onto the c output after the intended
+c- transition has occurred.
+
+In the above we did not consider the inverter with input signal a and output
+signal x3. Since a is not an input to any other gate, this decomposition is SI.
+In summary both types of hazard are related to the circuit going through a
+sequence of states that are covered by several cubes that are supposed to main-
+tain the signal at the same (stable) level. The cube that “takes over” represents
+a signal that may not be indicated by any other signal. In essence it is the same
+problem that we touched upon in section 2.2 on page 14 and in section 2.4.3
+on page 20 – an OR gate can only indicate when the first input goes high.
+
+
+
+6.4.
+Implementations using state-holding gates
+
+6.4.1
+Introduction
+
+During operation each variable in the circuit will go through a sequence of
+states where it is (stable) 0, followed by one or more states where it is excited
+(R), followed by a sequence of states where it is (stable) 1, followed by one or
+more states where it is excited (F), etc. In the above implementation we were
+covering all states where a variable, z, was high or excited to go high (z = 1
+and z = R = 0*).
+An alternative is to use a state-holding device such as a set-reset latch. The
+Boolean equations for the set and reset signals need only cover the z = R = 0*
+states and the z = F = 1* states respectively. This will lead to simpler equations
+and potentially simpler decompositions. Figure 6.14 shows the implementation
+template using a standard set-reset latch and an alternative solution based on a
+standard C-element. In the latter case the reset signal must be inverted. Later,
+in section 6.4.5, we will discuss alternative and more elaborate implementa-
+tions, but for the following discussion the basic topologies in figure 6.14 will
+suffice.
+
+Figure 6.14.
+Possible implementation templates using (simple) state holding elements: an SR
+flip-flop implementation and a standard C-element implementation.
+
+At this point it is relevant to mention that the equations for when to set
+and reset the state-holding element for signal z can be found by rewriting the
+original equation (that covers states in which z = R and z = 1) into the following
+form, where a prime denotes the complement:
+
+z = “Set” + z · (“Reset”)'     (6.1)
+
+For signal c in the above example (figure 6.12 on page 93) we would get the
+following set and reset functions: cset = d + a'b and creset = b' (which is identical
+to the result in figure 6.15 in section 6.4.3). Furthermore it is obvious that for
+all reachable states (only) the set and reset functions for a signal z must never
+be active at the same time:
+
+“Set” · “Reset” = 0
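+
+Both statements can be checked by enumeration for signal c. The sketch assumes
+the covers c = d + a'b + bc, cset = d + a'b and creset = b' (primes mark
+complements), and takes the reachable state numbers (abcd read as a binary
+number) from the state graph in figure 6.11; the function names are illustrative.
+
```python
from itertools import product

REACHABLE = [0, 4, 6, 2, 8, 12, 13, 15, 14]   # state numbers, abcd as binary


def c_set(a, b, c, d):
    return d | ((1 - a) & b)


def c_reset(a, b, c, d):
    return 1 - b


def c_original(a, b, c, d):
    return d | ((1 - a) & b) | (b & c)


def bits(state):
    return tuple(int(x) for x in format(state, "04b"))


# Equation (6.1): z = Set + z * Reset' reproduces the original cover everywhere.
same = all(c_original(*s) == (c_set(*s) | (s[2] & (1 - c_reset(*s))))
           for s in product((0, 1), repeat=4))

# Set and Reset are never active together in the reachable states ...
conflicts_reachable = [s for s in REACHABLE if c_set(*bits(s)) & c_reset(*bits(s))]
# ... although they may overlap in states the circuit never visits.
conflicts_all = [s for s in range(16) if c_set(*bits(s)) & c_reset(*bits(s))]

print(same, conflicts_reachable, conflicts_all)
```
+
+The overlap outside the reachable states is harmless, which is why the
+condition is stated for reachable states only.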
+
+
+
+The following sections will develop the idea of using state-holding elements
+and we will illustrate the techniques by re-implementing example 2 from the
+previous section.
+
+6.4.2
+Excitation regions and quiescent regions
+
+The above idea of using a state-holding device for each variable can be formal-
+ized as follows:
+
+An excitation region, ER, for a variable z is a maximally-connected set of
+states in which the variable is excited:
+
+ER(z+) denotes a region of states where z = R = 0*
+
+ER(z-) denotes a region of states where z = F = 1*
+
+A quiescent region, QR, for a variable z is a maximally-connected set of states
+in which the variable is not excited:
+
+QR(z+) denotes a region of states where z = 1
+
+QR(z-) denotes a region of states where z = 0
+
+For a given circuit the state space can be disjointly divided into one or more
+regions of each type.
+
+The set function (cover) for a variable z:
+
+must contain all states in the ER(z+) regions
+
+may contain states from the QR(z+) regions
+
+may contain states not reachable by the circuit
+
+The reset function (cover) for a variable z:
+
+must contain all states in the ER(z-) regions
+
+may contain states from the QR(z-) regions
+
+may contain states not reachable by the circuit
+
+In section 6.4.4 below we will add what is known as the monotonic cover
+constraint or the unique entry constraint in order to avoid hazards:
+
+A cube (product term) in the set or reset function of a variable must only
+be entered through a state where the variable is excited.
+
+Having mentioned this last constraint, we have above a complete recipe
+for the design of speed-independent circuits where each non-input signal is
+implemented by a state holding device. Let us continue with example 2.
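+
+The regions can be computed directly from a state graph. The sketch below does
+so for signal c of example 2; the state labels and edges are transcribed from
+the state graph of figure 6.11 (signal order abcd, an asterisk marks an excited
+signal), and the function names are illustrative.
+
```python
STATES = {0: "0*0*00", 4: "010*0", 6: "01*10", 2: "001*0", 8: "10*00",
          12: "1100*", 13: "110*1", 15: "1111*", 14: "1*110"}
EDGES = [(0, 4), (4, 6), (6, 2), (2, 0), (0, 8), (8, 12),
         (12, 13), (13, 15), (15, 14), (14, 6)]


def classify(label, index):
    """Return 'ER+', 'ER-', 'QR+' or 'QR-' for the signal at position index."""
    bits, excited = [], set()
    for ch in label:
        if ch == "*":
            excited.add(len(bits) - 1)
        else:
            bits.append(int(ch))
    if index in excited:
        return "ER+" if bits[index] == 0 else "ER-"
    return "QR+" if bits[index] == 1 else "QR-"


def regions(index):
    """Group the states into maximally connected sets of the same kind."""
    kind = {s: classify(label, index) for s, label in STATES.items()}
    result, seen = {}, set()
    for start in sorted(STATES):
        if start in seen:
            continue
        region, frontier = {start}, [start]
        while frontier:
            s = frontier.pop()
            for u, v in EDGES:
                for x, y in ((u, v), (v, u)):   # connectivity ignores direction
                    if x == s and y not in region and kind[y] == kind[start]:
                        region.add(y)
                        frontier.append(y)
        seen |= region
        result.setdefault(kind[start], []).append(sorted(region))
    return result


c_regions = regions("abcd".index("c"))
print(c_regions)
```
+
+For signal c this gives two excitation regions for c+ (states 4 and 13), one
+for c- (state 2), and one quiescent region of each kind.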
+
+
+
+6.4.3
+Example 2: Using state-holding elements
+
+Figure 6.15 illustrates the above procedure for example 2 from sections 6.3.2
+and 6.3.3. As before, the Boolean equations (for the set and reset functions)
+may need to be implemented using atomic complex gates in order to ensure
+that the resulting circuit is speed-independent.
+
+Figure 6.15.
+Excitation and quiescent regions in the state diagram for signal c in the example
+circuit from figure 6.10, and the corresponding derivation of equations for the set and reset
+functions: c-set = d + a'b and c-reset = b' (a prime denotes the complement).
+
+Figure 6.16.
+Implementation of c using a standard C-element and simple gates, along with the
+Karnaugh map from which the set and reset functions were derived.
+
+6.4.4
+The monotonic cover constraint
+
+A standard C-element based implementation of signal c from above, with
+the set and reset functions implemented using simple gates, is shown in fig-
+ure 6.16 along with the Karnaugh map from which the set and reset functions
+are derived. The set function involves two cubes d and a'b (a' is the complement
+of a) that are input signals to an OR gate. This implementation may exhibit a
+dynamic-10 hazard on the cset signal in a similar way to that discussed
+previously. The Karnaugh map in
+figure 6.16 shows the sequence of states that may lead to a malfunction: (8, 12,
+13, 15, 14, 6, 0). Signal d is low in state 12, high in states 13 and 15, and low
+again in state 14. This sequence of states corresponds to a pulse on d. Through
+the OR gate this will create a pulse on the cset signal that will cause c to go
+high. Later in state 2, c will go low again. This is the desired behaviour. The
+problem is that the internal signal x1 that corresponds to the other cube in the
+expression for cset becomes excited (x1 = R) in state 6. If this AND gate is
+slow this may produce an unintended pulse on the cset signal after c has been
+reset again.
+If the cube a'b (that covers states 4, 5, 7, and 6) is reduced to include only
+states 4 and 5, corresponding to cset = d + a'bc' (a prime denotes the
+complement), we would avoid the problem.
+The effect of this modification is that the OR gate is never exposed to more
+than one input signal being high, and when this is the case we do not have
+problems with the principle of indication (cf. the discussion of indication and
+dual-rail circuits in chapter 2). Another way of expressing this is that a cover
+cube must only be entered through states belonging to an excitation region.
+This requirement is known as:
+
+the monotonic cover constraint: only one product term in a sum-of-
+products implementation is allowed to be high at any given time. Obvi-
+ously the requirement need only be satisfied in the states that are reach-
+able by the circuit, or alternatively
+
+the unique entry constraint: cover cubes may only be entered through
+excitation region states.
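+
+The unique entry constraint can be checked mechanically on the state graph. The
+sketch below does this for the cubes of the set function of c, assuming the
+transitions of figure 6.11, the excitation regions ER(c+) = {4} and {13}, and
+complemented literals as described above (a'b and a'bc'); all names are
+illustrative.
+
```python
# Directed state-graph transitions (state numbers are abcd read as binary).
EDGES = [(0, 4), (4, 6), (6, 2), (2, 0), (0, 8), (8, 12),
         (12, 13), (13, 15), (15, 14), (14, 6)]
ER_C_PLUS = {4, 13}   # states in which c is excited to rise


def bits(state):
    return tuple(int(x) for x in format(state, "04b"))


def bad_entries(cube):
    """Transitions that enter the cube through a state outside ER(c+)."""
    return [(s, t) for s, t in EDGES
            if not cube(*bits(s)) and cube(*bits(t)) and t not in ER_C_PLUS]


def cube_d(a, b, c, d):
    return d


def cube_ab(a, b, c, d):        # a'b, covers states 4, 5, 7 and 6
    return (1 - a) & b


def cube_abc(a, b, c, d):       # a'bc', covers states 4 and 5 only
    return (1 - a) & b & (1 - c)


print("d    :", bad_entries(cube_d))
print("a'b  :", bad_entries(cube_ab))
print("a'bc':", bad_entries(cube_abc))
```
+
+The cube a'b is entered on the transition from state 14 to state 6, where c is
+already stable at 1, which is the situation behind the hazard; the reduced cube
+a'bc' is entered only through state 4.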
+
+6.4.5
+Circuit topologies using state-holding elements
+
+In addition to the set-reset flip-flop and the standard C-element based tem-
+plates presented above, there are a number of alternative solutions for imple-
+menting variables using a state-holding device.
+A popular approach is the generalized C-element that is available to the
+CMOS transistor-level designer. Here the state-holding mechanism and the set
+and reset functions are implemented in one (atomic) compound structure of
+n- and p-type transistors. Figure 6.17 shows a gate-level symbol for a circuit
+where zset = ab and zreset = bc along with dynamic and static CMOS
+implementations.
+An alternative implementation that may be attractive to a designer using a
+standard cell library that includes (complex) And-Or-Invert gates is shown in
+figure 6.18. The circuit has the interesting property that it produces both the
+desired signal z and its complement z' and during transitions it never produces
+(z, z') = (1, 1). Again, the example is a circuit where zset = ab and zreset = bc.
+
+
+Figure 6.17.
+A generalized C-element: gate-level symbol, and some CMOS transistor
+implementations (dynamic and pseudostatic CMOS; static CMOS), for z-set = a b
+and z-reset = b c.
+
+Figure 6.18.
+An SR implementation based on two complex And-Or-Invert gates.
+
+6.5.
+Initialization
+
+Initialization is an important aspect of practical circuit design, and unfortu-
+nately it has not been addressed in the above. The synthesis process assumes
+an initial state that corresponds to the initial marking of the STG, and the re-
+sulting synthesized circuit is a correct speed-independent implementation of
+the specification provided that the circuit starts out in the same initial state.
+Since the synthesized circuits generally use state-holding elements or circuitry
+with feedback loops it is necessary to actively force the circuit into the intended
+initial state.
+Consequently, the designer has to do a manual post-synthesis hack and ex-
+tend the circuit with an extra signal which, when active, sets all state-holding
+constructs into the desired state.
+Normally the circuits will not be speed-
+independent with respect to this initialization signal; it is assumed to be as-
+serted for long enough to cause the desired actions before it is de-asserted.
+For circuit implementations using state-holding elements such as set-reset
+latches and standard C-elements, initialization is trivial provided that these
+components have special clear/preset signals in addition to their normal inputs.
+In all other cases the designer has to add an initialization signal to the relevant
+Boolean equations explicitly. If the synthesis process is targeting a given cell
+library, the modified logic equations may need further logic decomposition,
+and as we have seen this may compromise speed-independence.
+The fact that initialization is not included in the synthesis process is ob-
+viously a drawback, but normally one would implement a library of control
+circuits and use these as building blocks when designing circuits at the more
+abstract “static data-flow structures” level as introduced in chapter 3.
+Initializing all control circuits as outlined above is a simple and robust ap-
+proach. However, initialization of asynchronous circuits based on handshake
+components may also be achieved by an implicit approach that exploits
+the function of the circuit to “propagate” initial signal values into the circuit.
+In Tangram (section 8.3, and chapter 13 in part III) this is called
+self-initialization [135].
+
+6.6.
+Summary of the synthesis process
+
+The previous sections have covered the basic theory for synthesizing SI con-
+trol circuits from STG specifications. The style of presentation has deliberately
+been chosen to be an informal one with emphasis on examples and the intuition
+behind the theory and the synthesis procedure.
+The theory has roots in work done by the following Universities and groups:
+University of Illinois [93], MIT [26, 24], Stanford [13], IMEC [145, 159], St.
+Petersburg Electrical Engineering Institute [146], and the multinational group
+of researchers who have developed the Petrify tool [29] that we will introduce
+in the next section. This author has attended several discussions from which
+it is clear that in some cases the concepts and theories have been developed
+independently by several groups, and I will refrain from attempting a precise
+history of the evolution. The reader who is interested in digging deeper into the
+subject is encouraged to consult the literature; in particular the book by Myers
+[95].
+In summary the synthesis process outlined in the previous sections involves
+the following steps:
+
+1 Specify the desired behaviour of the circuit and its (dummy) environ-
+ment using an STG.
+
+2 Check that the STG satisfies properties 1-5 on page 88: 1-bounded, con-
+sistent state assignment, liveness, only input free choice and controlled
+choice and persistency. An STG satisfying these conditions is a valid
+specification of an SI circuit.
+
+3 Check that the specification satisfies property 6 on page 88: complete
+state coding (CSC). If the specification does not satisfy CSC it is neces-
+sary to add one or more state variables or to change the specification
+(which is often possible in 4-phase control circuits where the down-
+going signal transitions can be shuffled around). Some tools (including
+Petrify) can insert state variables automatically, whereas re-shuffling of
+signals – which represents a modification of the specification – is a task
+for the designer.
+
+4 Select an implementation template and derive the Boolean equations for
+the variables themselves, or for the set and reset functions when state
+holding devices are used. Also decide if these equations can be imple-
+mented in atomic gates (typically complex AOI-gates) or if they are to
+be implemented by structures of simpler gates. These decisions may be
+set by switches in the synthesis tools.
+
+5 Derive the Boolean equations for the desired implementation template.
+
+6 Manually modify the implementation such that the circuit can be forced
+into the desired initial state by an explicit reset or initialization signal.
+
+7 Enter the design into a CAD tool and perform simulation and layout of
+the circuit (or the system in which the circuit is used as a component).
+
+6.7.
+Petrify: A tool for synthesizing SI circuits from STGs
+
+Petrify is a public domain tool for manipulating Petri nets and for syn-
+thesizing SI control circuits from STG specifications.
+It is available from
+http://www.lsi.upc.es/~jordic/petrify/petrify.html.
+
+
+
+Petrify is a typical UNIX program with many options and switches. As a
+circuit designer one would probably prefer a push-button synthesis tool that
+accepts a specification and produces a circuit. Petrify can be used this way but
+it is more than this. If you know how to play the game it is an interactive tool
+for specifying, checking, and manipulating Petri nets, STGs and state graphs.
+In the following section we will show some examples of how to design speed-
+independent control circuits.
+Input to Petrify is an STG described in a simple textual format. Using the
+program draw_astg that is part of the Petrify distribution (and that is based
+on the graph visualization package ‘dot’ developed at AT&T) it is possible to
+produce a drawing of the STGs and state graphs. The graphs are “nice” but the
+topological organization may be very different from how the designer thinks
+about the problem. Even the simple task of checking that an STG entered in
+textual form is indeed the intended STG may be difficult.
+To help ease this situation a graphical STG entry and simulation tool called
+VSTGL (Visual STG Lab) has been developed at the Technical University of
+Denmark. To help the designer obtain a correct specification VSTGL includes
+an interactive simulator that allows the designer to add tokens and to fire tran-
+sitions. It also carries out certain simple checks on the STG.
+VSTGL is available from http://vstgl.sourceforge.net/ and it is the
+result of a small student project done by two 4th year students. VSTGL is sta-
+ble and reliable, though naming of signal transitions may seem a bit awkward.
+Petrify can solve CSC violations by inserting state variables, and it can be
+controlled to target the implementation templates introduced in section 6.4:
+
+The -cg option will produce a complex-gate circuit (one where each non-
+input signal is implemented in a single complex gate).
+
+The -gc option will produce a generalized C-element circuit. The outputs
+from Petrify are the Boolean equations for the set and reset functions for
+each non-input signal.
+
+The -gcm option will produce a generalized C-element solution where
+the set and reset functions satisfy the monotonic cover requirement. Con-
+sequently the solution can also be mapped onto a standard C-element
+implementation where the set and reset functions are implemented us-
+ing simple AND and OR gates.
+
+The -tm option will cause Petrify to perform technology mapping onto
+a gate library that can be specified by the user. Technology mapping can
+obviously not be combined with the -cg and -gc options.
+
+Petrify comes with a manual and some examples. In the following section
+we will go through some examples drawn from the previous chapters of the
+book.
+
+
+104
+Part I: Asynchronous circuit design – A tutorial
+
+6.8.
+Design examples using Petrify
+
+In the following we will illustrate the use of Petrify by specifying and syn-
+thesizing: (a) example 2 – the circuit with choice, (b) a control circuit for the
+4-phase bundled-data implementation of the latch from figure 3.3 on page 32
+and (c) a control circuit for the 4-phase bundled-data implementation of the
+MUX from figure 3.3 on page 32. For all of the examples we will assume push
+channels only.
+
+6.8.1
+Example 2 revisited
+
+As a first example, we will synthesize the different versions of example 2
+that we have already designed manually. Figure 6.19 shows the STG as it is
+entered into VSTGL. The corresponding textual input to Petrify (the ex2.g file)
+and the STG as it may be visualized by Petrify are shown in figure 6.20. Note
+in figure 6.20 that an index is added when a signal transition appears more than
+once in order to facilitate the textual input.
+
+Figure 6.19.
+The STG of example 2 as it is entered into VSTGL.
+
+Using complex gates
+
+> petrify ex2.g -cg -eqn ex2-cg.eqn
+
+The STG has CSC.
+# File generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# from <ex2.g> on 6-Mar-01 at 8:30 AM
+....
+
+
+Chapter 6: Speed-independent control circuits
+105
+
+.model ex2
+.inputs a b
+.outputs c d
+.graph
+P0 a+ b+
+c+ P1
+b+ c+
+P1 b-
+c- P0
+b- c-
+a- P1
+d- a-
+c+/1 d-
+a+ b+/1
+b+/1 d+
+d+ c+/1
+.marking { P0 }
+.end
+
+INPUTS:   a,b
+OUTPUTS:  c,d
+
+Figure 6.20.
+The textual description of the STG for example 2 and the drawing of the STG
+that is produced by Petrify.
+
+# The original TS had (before/after minimization) 9/9 states
+# Original STG:
+2 places,
+10 transitions,
+13 arcs
+...
+# Current STG:
+4 places,
+9 transitions,
+18 arcs
+...
+# It is a Petri net with 1 self-loop places
+...
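+
+The figures that Petrify reports for the original STG (10 transitions and 9
+states) can be cross-checked by playing the token game on the ex2.g description.
+The following Python sketch is only an illustration and not part of Petrify: the
+helper names (build_net, reachable_markings) are made up here, and it assumes
+that each line of the .graph section lists a node followed by its successors,
+that an arc between two transitions stands for an implicit place, and that the
+net is safe (at most one token per place).
+
```python
from collections import deque

# The .graph section of ex2.g, copied from figure 6.20.
EX2_GRAPH = """\
P0 a+ b+
c+ P1
b+ c+
P1 b-
c- P0
b- c-
a- P1
d- a-
c+/1 d-
a+ b+/1
b+/1 d+
d+ c+/1
"""


def is_transition(name):
    # Assumption: signal transitions contain '+' or '-', place names do not.
    return "+" in name or "-" in name


def build_net(graph_text):
    """Return (pre, post): the input and output places of every transition."""
    pre, post = {}, {}

    def arc_in(place, transition):
        pre.setdefault(transition, set()).add(place)
        post.setdefault(transition, set())

    def arc_out(transition, place):
        post.setdefault(transition, set()).add(place)
        pre.setdefault(transition, set())

    for line in graph_text.splitlines():
        src, *dsts = line.split()
        for dst in dsts:
            if is_transition(src) and is_transition(dst):
                implicit = f"<{src},{dst}>"  # implicit place on the arc
                arc_out(src, implicit)
                arc_in(implicit, dst)
            elif is_transition(src):
                arc_out(src, dst)
            else:
                arc_in(src, dst)
    return pre, post


def reachable_markings(pre, post, initial):
    """Breadth-first exploration of the markings of a safe Petri net."""
    start = frozenset(initial)
    seen = {start}
    queue = deque([start])
    while queue:
        marking = queue.popleft()
        for transition, inputs in pre.items():
            if inputs <= marking:  # enabled: every input place holds a token
                nxt = frozenset((marking - inputs) | post[transition])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen


pre, post = build_net(EX2_GRAPH)
states = reachable_markings(pre, post, {"P0"})
print(len(pre), "transitions,", len(states), "reachable markings")
```
+
+For the ex2.g text above the sketch finds 10 transitions and 9 reachable
+markings, in agreement with the Petrify output.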
+
+> more ex2-cg.eqn
+
+# EQN file for model ex2
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 7.00
+
+INORDER = a b c d;
+OUTORDER = [c] [d];
+[c] = b (c + a’) + d;
+[d] = a b c’;
+
+Using generalized C-elements:
+
+> petrify ex2.g -gc -eqn ex2-gc.eqn
+
+...
+
+> more ex2-gc.eqn
+
+# EQN file for model ex2
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 12.00
+
+INORDER = a b c d;
+OUTORDER = [c] [d];
+[0] = a’ b + d;
+[1] = a b c’;
+[d] = d c’ + [1];
+# mappable onto gC
+[c] = c b + [0];
+# mappable onto gC
+
+The equations for the generalized C-elements should be “interpreted”
+according to equation 6.1 on page 96.
+
+Using standard C-elements and set/reset functions that satisfy the monotonic
+cover constraint:
+
+> petrify ex2.g -gcm -eqn ex2-gcm.eqn
+
+...
+
+> more ex2-gcm.eqn
+
+# EQN file for model ex2
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 10.00
+
+INORDER = a b c d;
+OUTORDER = [c] [d];
+[0] = a’ b c’ + d;
+[d] = a b c’;
+[c] = c b + [0];
+# mappable onto gC
+
+Again, the equations for the generalized C-element should be “interpreted”
+according to equation 6.1 on page 96.
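+
+As a sanity check, the three sets of equations can be evaluated side by side.
+The Python sketch below is an illustration only: the list of state codes was
+derived by hand from the STG of example 2 under the assumption that all four
+signals are initially 0, and the function names are invented for this example.
+Each function evaluates the printed equations literally and returns the next
+values of c and d.
+
```python
# (a, b, c, d) codes of the nine reachable states of example 2 (derived by
# hand from the STG, assuming that all signals start at 0).
REACHABLE = [
    (0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 0, 0),
    (1, 1, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0),
    (0, 1, 1, 0), (0, 0, 1, 0), (0, 1, 0, 0),
]


def complex_gate(a, b, c, d):
    """ex2-cg.eqn: [c] = b (c + a') + d; [d] = a b c'."""
    next_c = (b and (c or not a)) or d
    next_d = a and b and not c
    return bool(next_c), bool(next_d)


def generalized_c(a, b, c, d):
    """ex2-gc.eqn: [0] = a' b + d; [1] = a b c'."""
    eq0 = (not a and b) or d
    eq1 = a and b and not c
    next_d = (d and not c) or eq1   # [d] = d c' + [1]
    next_c = (c and b) or eq0       # [c] = c b + [0]
    return bool(next_c), bool(next_d)


def monotonic_cover(a, b, c, d):
    """ex2-gcm.eqn: [0] = a' b c' + d; [d] = a b c'."""
    eq0 = (not a and b and not c) or d
    next_d = a and b and not c
    next_c = (c and b) or eq0       # [c] = c b + [0]
    return bool(next_c), bool(next_d)


def implementations_agree():
    """True if all three solutions give the same (c, d) in every reachable state."""
    return all(
        complex_gate(*s) == generalized_c(*s) == monotonic_cover(*s)
        for s in REACHABLE
    )


print(implementations_agree())
```
+
+The solutions are only required to agree in reachable states. For example, the
+-gc and -cg equations for d differ for the code (0, 0, 0, 1), which does not
+appear in the list above.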
+
+6.8.2
+Control circuit for a 4-phase bundled-data latch
+
+Figure 6.21 shows an asynchronous handshake latch with a dummy environ-
+ment on its left and right side. The latch can be implemented using a normal
+N-bit wide transparent latch and the control circuit we are about to design.
+A driver may be needed for the latch control signal Lt. In order to make the
+latch controller robust and independent of the delay in this driver, we may feed
+the buffered signal (Lt) back such that the controller knows when the signal
+has been presented to the latch. Figure 6.21 also shows fragments of the STG
+specification – the handshaking of the left and right hand environments and
+ideas about the behaviour of the latch controller. Initially Lt is low and the
+latch is transparent, and when new input data arrives they will flow through
+the latch. In response to Rin+, and provided that the right hand environment
+is ready for another handshake (Aout = 0), the controller may generate Rout+
+right away. Furthermore the data should be latched, Lt+, and an acknowledge
+sent to the left hand environment, Ain+. A symmetric scenario is possible in
+response to Rin- when the latch is switched back into the transparent mode.
+
+Figure 6.21.
+A 4-phase bundled-data handshake latch and some STG fragments that capture
+ideas about its behaviour. (EN = 0: latch is transparent; EN = 1: latch holds data.)
+
+Figure 6.22.
+The resulting STG for the latch controller (as input to VSTGL).
+
+Combining these STG fragments results in the STG shown in figure 6.22.
+Running Petrify yields the following:
+
+> petrify lctl.g -cg -eqn lctl-cg.eqn
+
+The STG has CSC.
+# File generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# from <lctl.g> on 6-Mar-01 at 11:18 AM
+...
+# The original TS had (before/after minimization) 16/16 states
+# Original STG:
+0 places,
+10 transitions,
+14 arcs (
+0 pt + ...
+# Current STG:
+0 places,
+10 transitions,
+12 arcs (
+0 pt + ...
+# It is a Marked Graph
+.model lctl
+.inputs
+Aout Rin
+.outputs
+Lt Rout Ain
+.graph
+Rout+ Aout+ Lt+
+Lt+ Ain+
+Aout+ Rout-
+Rin+ Rout+
+Ain+ Rin-
+Rin- Rout-
+Ain- Rin+
+Rout- Lt- Aout-
+Aout- Rout+
+Lt- Ain-
+.marking { <Aout-,Rout+> <Ain-,Rin+> }
+.end
+
+> more lctl-cg.eqn
+
+# EQN file for model lctl
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 7.00
+
+INORDER = Aout Rin Lt Rout Ain;
+OUTORDER = [Lt] [Rout] [Ain];
+[Lt] = Rout;
+[Rout] = Rin (Rout + Aout’) + Aout’ Rout;
+[Ain] = Lt;
+
+The equation for [Rout] may be rewritten as:
+
+[Rout] = Rin Aout’ + Rout (Rin + Aout’)
+
+which can be recognized to be a C-element with inputs Rin and Aout’.
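+
+The rewriting can be confirmed by exhaustive enumeration. The following Python
+sketch (function names are chosen for this illustration only) compares the
+equation printed by Petrify, the rewritten form, and a behavioural model of a
+2-input C-element with inputs Rin and Aout’ for all eight input combinations.
+
```python
from itertools import product


def rout_petrify(rin, aout, rout):
    """[Rout] = Rin (Rout + Aout') + Aout' Rout, as printed in lctl-cg.eqn."""
    return (rin and (rout or not aout)) or ((not aout) and rout)


def rout_rewritten(rin, aout, rout):
    """[Rout] = Rin Aout' + Rout (Rin + Aout')."""
    return (rin and not aout) or (rout and (rin or not aout))


def c_element(a, b, state):
    """2-input C-element: follow the inputs when they agree, otherwise hold."""
    return a if a == b else state


def equations_match_c_element():
    """Check both forms against the C-element for every (Rin, Aout, Rout)."""
    for rin, aout, rout in product([False, True], repeat=3):
        expected = c_element(rin, not aout, rout)
        if rout_petrify(rin, aout, rout) != expected:
            return False
        if rout_rewritten(rin, aout, rout) != expected:
            return False
    return True


print(equations_match_c_element())
```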
+
+
+6.8.3
+Control circuit for a 4-phase bundled-data MUX
+
+After the above two examples, where we have worked out already well-
+known circuit implementations, let us now consider a more complex example
+that cannot (easily) be done by hand. Figure 6.23 shows the handshake multi-
+plexer from figure 3.3 on page 32. It also shows how the handshake MUX can
+be implemented by a “regular” combinational circuit multiplexer and a control
+circuit. Below we will design a speed-independent control circuit for a 4-phase
+bundled-data MUX.
+
+Figure 6.23.
+The handshake MUX and the structure of a 4-phase bundled-data implementation.
+
+The MUX has three input channels and we must assume they are connected
+to three independent dummy environments. The dots remind us that the chan-
+nels are push channels. When specifying the behaviour of the MUX control
+circuit and its (dummy) environment it is important to keep this in mind. A
+typical error when drawing STGs is to specify an environment with a more
+limited behaviour than the real environment. For each of the three input channels
+the STG has cycles involving (Req+; Ack+; Req-; Ack-; etc.), and each
+of these cycles is initialized to contain a token.
+As mentioned previously, it is sometimes easier to deal with control channels
+using dual-rail (or in general 1-of-N) data encodings since this implies
+dealing with one-hot (decoded) control signals. As a first step towards the STG
+for a MUX using entirely 4-phase bundled-data channels, figure 6.24 shows an
+STG for a MUX where the control channel uses dual-rail signals (Ctl.t, Ctl.f
+and CtlAck). This STG can then be combined with the STG-fragment for a
+4-phase bundled-data channel from figure 6.8 on page 91, resulting in the STG
+in figure 6.25. The “intermediate” STG in figure 6.24 emphasizes the fact that
+the MUX can be seen as a controlled join – the two mutually exclusive and
+structurally identical halves are basically the STGs of a join.
+
+Figure 6.24.
+The STG specification of the control circuit for a 4-phase bundled-data MUX
+using a 4-phase dual-rail control channel. Combined with the STG fragment for a bundled-data
+(control) channel the resulting STG for an all 4-phase bundled-data MUX is obtained (figure 6.25).
+
+Figure 6.25.
+The final STG specification of the control circuit for the 4-phase bundled-data
+MUX. All channels, including the control channel, are 4-phase bundled-data.
+
+
+Below is the result of running Petrify, this time with the -o option that writes
+the resulting STG (possibly with state signals added) in a file rather than to
+stdout.
+
+> petrify MUX4p.g -o MUX4p-csc.g -gcm -eqn MUX4p-gcm.eqn
+
+State coding conflicts for signal In1Ack
+State coding conflicts for signal In0Ack
+State coding conflicts for signal OutReq
+The STG has no CSC.
+Adding state signal: csc0
+The STG has CSC.
+
+> more MUX4p-gcm.eqn
+
+# EQN file for model MUX4p
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 29.00
+
+INORDER = In0Req OutAck In1Req Ctl CtlReq In1Ack In0Ack OutReq
+CtlAck csc0;
+OUTORDER = [In1Ack] [In0Ack] [OutReq] [CtlAck] [csc0];
+[In1Ack] = OutAck csc0’;
+[In0Ack] = OutAck csc0;
+[2] = CtlReq (In1Req csc0’ + In0Req Ctl’);
+[3] = CtlReq’ (In1Req’ csc0’ + In0Req’ csc0);
+[OutReq] = OutReq [3]’ + [2];
+# mappable onto gC
+[5] = OutAck’ csc0;
+[CtlAck] = CtlAck [5]’ + OutAck;
+# mappable onto gC
+[7] = OutAck’ CtlReq’;
+[8] = CtlReq Ctl;
+[csc0] = csc0 [8]’ + [7];
+# mappable onto gC
+
+As can be seen, the STG does not satisfy CSC (complete state coding) as
+several markings correspond to the same state vector, so Petrify adds an internal
+state-signal csc0. The intuition is that after CtlReq- the Boolean signal
+Ctl is no longer valid but the MUX control circuit has not yet finished its job.
+If the circuit can’t see what to continue doing from its input signals it needs
+an internal state variable in which to keep this information. The signal csc0 is
+an active-low signal: it is set low if Ctl = 0 when CtlReq+ and it is set back
+to high when OutAck and CtlReq are both low. The fact that the signal csc0 is
+high when all channels are idle (all handshake signals are low) should be kept
+in mind when dealing with reset, cf. section 6.5.
+The exact details of how the state variable is added can be seen from the
+STG that includes csc0 which is produced by Petrify before it synthesizes the
+logic expressions for the circuit.
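+
+The idea behind the CSC check can be sketched in a few lines of Python. This is
+not how Petrify is implemented. The function name and the small state list are
+hypothetical and serve only to illustrate the definition: a coding conflict
+exists when two markings share the same state vector but enable different
+non-input signal transitions.
+
```python
from collections import defaultdict


def csc_conflicts(states):
    """Return the state codes that violate complete state coding.

    `states` is an iterable of (marking, code, enabled_outputs) triples.
    A code is reported when markings sharing that code enable different
    sets of non-input signal transitions.
    """
    behaviours = defaultdict(set)
    for _marking, code, enabled_outputs in states:
        behaviours[code].add(frozenset(enabled_outputs))
    return sorted(code for code, seen in behaviours.items() if len(seen) > 1)


# Hypothetical example data (not taken from the MUX specification).
toy_states = [
    ("M1", "0100", {"x+"}),
    ("M2", "0100", set()),      # same code as M1, different behaviour
    ("M3", "1100", {"y+"}),
    ("M4", "1100", {"y+"}),     # same code and same behaviour: no conflict
]
print(csc_conflicts(toy_states))
```
+
+Inserting a state signal such as csc0 extends the state vector by one bit so
+that the conflicting markings receive different codes.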
+
+
+It is sometimes possible to avoid adding a state variable by re-shuffling sig-
+nal transitions. It is not always obvious what yields the best solution. In prin-
+ciple more concurrency should improve performance, but it also results in a
+larger state-space for the circuit and this often tends to result in larger and
+slower circuits. A discussion of performance also involves the interaction with
+the environment. There is plenty of room for exploring alternative solutions.
+
+Figure 6.26.
+The modified STG specification of the 4-phase bundled-data MUX control circuit.
+
+In figure 6.26 we have removed some concurrency from the MUX STG by
+ordering the transitions on In0Ack/In1Ack and CtlAck (In0Ack+ → CtlAck+,
+In1Ack+ → CtlAck+ etc.). This STG satisfies CSC and the resulting circuit is
+marginally smaller:
+
+> more MUX4p-gcm.eqn
+
+# EQN file for model MUX4pB
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 27.00
+
+INORDER = In0Req OutAck In1Req Ctl CtlReq In1Ack In0Ack OutReq CtlAck;
+OUTORDER = [In1Ack] [In0Ack] [OutReq] [CtlAck];
+[0] = Ctl CtlReq OutAck;
+[1] = Ctl’ CtlReq OutAck;
+[2] = CtlReq (Ctl’ In0Req + Ctl In1Req);
+[3] = CtlReq’ (In0Ack’ In1Req’ + In0Req’ In0Ack);
+[OutReq] = OutReq [3]’ + [2];
+# mappable onto gC
+[CtlAck] = In1Ack + In0Ack;
+[In1Ack] = In1Ack OutAck + [0];
+# mappable onto gC
+[In0Ack] = In0Ack OutAck + [1];
+# mappable onto gC
+
+6.9.
+Summary
+
+This chapter has provided an introduction to the design of asynchronous se-
+quential (control) circuits with the main focus on speed-independent circuits
+and specifications using STGs. The material was presented from a practical
+view in order to enable the reader to go ahead and design his or her own speed-
+independent control circuits. This, rather than comprehensiveness, has been
+our goal, and as mentioned in the introduction we have largely ignored impor-
+tant alternative approaches including burst-mode and fundamental-mode cir-
+cuits.
+
+
+
+Chapter 7
+
+ADVANCED 4-PHASE BUNDLED-DATA
+PROTOCOLS AND CIRCUITS
+
+The previous chapters have explained the basics of asynchronous circuit
+design. In this chapter we will address 4-phase bundled-data protocols and
+circuits in more detail. This will include: (1) a variety of channel types, (2)
+protocols with different data-validity schemes, and (3) a number of more so-
+phisticated latch control circuits. These latch controllers are interesting for
+two reasons: they are very useful in optimizing the circuits for area, power and
+speed, and they are nice examples of the types of control circuits that can be
+specified and synthesized using the STG-based techniques from the previous
+chapter.
+
+7.1.
+Channels and protocols
+
+7.1.1
+Channel types
+
+So far we have considered only push channels where the sender is the active
+party that initiates the communication of data, and where the receiver is the
+passive party. The opposite situation, where the receiver is the active party
+that initiates the communication of data, is also possible, and such a channel
+is called a pull channel. A channel that carries no data is called a nonput
+channel and is used for synchronization purposes. Finally, it is also possible
+to communicate data from a receiver to a sender along with the acknowledge
+signal. Such a channel is called a biput channel. In a 4-phase bundled-data
+implementation data from the receiver is bundled with the acknowledge, and in
+a 4-phase dual-rail protocol the passive party will acknowledge the reception
+of a codeword by returning a codeword rather than just an acknowledge
+signal. Figure 7.1 illustrates these four channel types (nonput, push, pull, and
+biput) assuming a bundled-data protocol. Each channel type may, of course,
+use any of the handshake protocols (2-phase or 4-phase) and data encodings
+(bundled-data, dual-rail, m-of-n, etc.) introduced previously.
+
+
+7.1.2
+Data-validity schemes
+
+For the bundled-data protocols it is also relevant to define the time interval
+in which data is valid, and figure 7.2 illustrates the different possibilities.
+For a push channel the request signal carries the message “here is new data
+for you” and the acknowledge signal carries the information “thank you, I have
+absorbed the data, and you may release the data wires.” Similarly, for a pull
+channel the request signal carries the message “please send new data” and the
+acknowledge signal carries the message “here is the data that you requested.”
+It is the signal transitions on the request and acknowledge wires that are in-
+terpreted in this way. A 4-phase handshake involves two transitions on each
+wire and, depending on whether it is the rising or the falling transitions on
+the request and acknowledge signals that are interpreted in this way, several
+data-validity schemes emerge: early, broad, late and extended early.
+Since 2-phase handshaking does not involve any redundant signal transitions
+there is only one data-validity scheme for each channel type (push or pull), as
+illustrated in figure 7.2.
+It is common to all of the data-validity schemes that the data is valid some
+time before the event that indicates the start of the interval, and that it remains
+stable until some time after the event that indicates the end of the interval.
+Furthermore, all of the data-validity schemes express the requirements of the
+party that receives the data. The fact that a receiver signals “thank you, I have
+absorbed the data, and you may go ahead and release the data wires,” does
+not mean that this actually happens – the sender may prolong the data-validity
+interval, and the receiver may even rely on this.
+A typical example of this is the extended-early data-validity schemes in figure
+7.2. On a push channel the data-validity interval begins some time before
+Req+ and ends some time after Req-.
+
+7.1.3
+Discussion
+
+The above classification of channel types and handshake protocols stems
+mostly from Peeters’ Ph.D. thesis [112]. The choice of channel type, hand-
+shake protocol and data-validity scheme obviously affects the implementation
+of the handshake components in terms of area, speed, and power. Just as a
+design may use a mix of different bundled-data and dual-rail protocols, it may
+also use a mix of channel types and data-validity schemes.
+For example, a 4-phase bundled-data push channel using a broad or an
+extended-early data-validity scheme is a very convenient input to a function
+block that is implemented using precharged CMOS circuitry: the request signal
+may directly control the precharge and evaluate transistors because the broad
+and the extended-early data-validity schemes guarantee that the input data is
+stable during the evaluate phase.
+
+
+Chapter 7: Advanced 4-phase bundled-data protocols and circuits
+117
+
+Figure 7.1.
+The four fundamental channel types: nonput, push, biput, and pull.
+
+Figure 7.2.
+Data-validity schemes for 2-phase and 4-phase bundled data.
+
+
+Another interesting option in a 4-phase bundled-data design is to use func-
+tion blocks that assume a broad data validity scheme on the input channel and
+that produce a late data validity scheme on the output channel. Under these
+assumptions it is possible to use a symmetric delay element that matches only
+half of the latency of the combinatorial circuit. The idea is that the sum of the
+delay of Req+ and Req- matches the latency of the combinatorial circuit, and
+that Req- indicates valid output data. In [112, p.46] this is referred to as true
+single phase because the return-to-zero part of the handshaking is no longer
+redundant. This approach also has implications for the implementation of the
+components that connect to the function block.
+It is beyond the scope of this text to enter into a discussion of where and
+when to use the different options. The interested reader is referred to [112, 77]
+for more details.
+
+7.2.
+Static type checking
+
+When designing circuits it is useful to think of the combination of channel
+type and data-validity scheme as being similar to a data type in a programming
+language, and to do some static type checking of the circuit being designed
+by asking questions like: “what types are allowed on the input ports of this
+handshake component?” and “what types are produced on the output ports
+of this handshake component?”. The latter may depend on the type that was
+provided on the input port. A similar form of type checking for synchronous
+circuits using two-phase non-overlapping clocks has been proposed in [104]
+and used in the Genesil silicon compiler [67].
+
+Figure 7.3.
+Hierarchical ordering of the four data-validity schemes for the 4-phase bundled-data
+protocol (nodes: “broad”, “extended early”, “late” and “early”).
+
+Figure 7.3 shows a hierarchical ordering of the four possible types (data
+validity schemes) for a 4-phase bundled-data push channel: “broad” is the
+strongest type and it can be used as input to circuits that require any of the
+weaker types. Similarly “extended early” may be used where only “early” is
+required. Circuits that are transparent to handshaking (function blocks, join,
+fork, merge, mux, demux) produce outputs whose type is at most as strong as
+their (weakest) input type. In general the input and output types are the same
+but there are examples where this is not the case. The only circuit that can
+produce outputs whose type is stronger than the input type is a latch. Let us
+look at some examples:
+
+A join that concatenates two inputs of type “extended early” produces
+outputs that are only “early.”
+
+From the STG fragments in figure 6.21 on page 107 it may be seen that
+the simple 4-phase bundled-data latch controller from the previous chap-
+ters (figure 2.9 on page 18) assumes “early” inputs and that it produces
+“extended-early” outputs.
+
+The 4-phase bundled-data MUX design in section 6.8.3 assumes “extended
+early” on its control input (the STG in figure 6.25 on page 110
+specifies stable input from CtlReq+ to CtlReq-).
+
+The reader is encouraged to continue this exercise and perhaps draw the as-
+sociated timing diagrams from which the types of the outputs may be deduced.
+The type checking explained here is a very useful technique for debugging
+circuits that exhibit erroneous behaviour.
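+
+The ordering of the four types can also be written down as a small check. The
+Python sketch below is an illustration under stated assumptions: it models each
+data-validity scheme of a push channel as an interval between two handshake
+events, using the usual definitions (early: Req+ to Ack+, extended early: Req+
+to Req-, late: Req- to Ack-, broad: Req+ to Ack-), and the function name is
+invented here. A provided type is then acceptable wherever its interval covers
+the interval of the required type.
+
```python
# Events of one 4-phase handshake on a push channel, in order of occurrence.
EVENTS = ["Req+", "Ack+", "Req-", "Ack-"]

# Assumed validity intervals (start event, end event) of the four schemes.
VALIDITY = {
    "early": ("Req+", "Ack+"),
    "extended early": ("Req+", "Req-"),
    "late": ("Req-", "Ack-"),
    "broad": ("Req+", "Ack-"),
}


def at_least_as_strong(provided, required):
    """True if data of type `provided` may drive an input that needs `required`."""
    p_start, p_end = (EVENTS.index(e) for e in VALIDITY[provided])
    r_start, r_end = (EVENTS.index(e) for e in VALIDITY[required])
    # The provided interval must cover the required interval.
    return p_start <= r_start and r_end <= p_end


for required in VALIDITY:
    print("broad ->", required, at_least_as_strong("broad", required))
print("early -> extended early", at_least_as_strong("early", "extended early"))
```
+
+Under these assumptions “broad” satisfies every requirement and “extended
+early” satisfies “early”, while “late” and “early” cannot replace each other.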
+
+7.3.
+More advanced latch control circuits
+
+In previous chapters we have only considered 4-phase bundled-data hand-
+shake latches using a latch controller consisting of a C-element and an inverter
+(figure 2.9 on page 18). In [41] this circuit is called a simple latch controller,
+and in [77] it is called an un-decoupled latch controller.
+
+Figure 7.4.
+(a) A FIFO based on handshake latches, and (b) its implementation using simple
+latch controllers and level-sensitive latches. The FIFO fills with valid data in every other latch.
+A latch is transparent when EN = 0 and it is opaque (holding data) when EN = 1.
+
+Figure 7.5.
+A FIFO where every level-sensitive latch holds valid data when the FIFO is full.
+The semi-decoupled and fully-decoupled latch controllers from [41] allow this behaviour.
+
+When a pipeline or FIFO that uses the simple latch controller fills, every
+second handshake latch will be holding a valid token and every other handshake
+latch will be holding an empty token as illustrated in figure 7.4(a) – the
+static spread of the pipeline is S = 2.
+This token picture is a bit misleading. The empty tokens correspond to
+the return-to-zero part of the handshaking and in reality the latches are not
+“holding empty tokens” – they are transparent, and this represents a waste of
+hardware resource.
+Ideally one would want to store a valid token in every level-sensitive latch
+as illustrated in figure 7.5 and just “add” the empty tokens to the data stream
+on the interfaces as part of the handshaking. In [41] Furber and Day explain
+the design of two such improved 4-phase bundled-data latch control circuits:
+a semi-decoupled and a fully-decoupled latch controller. In addition to these
+specific circuits the paper also provides a nice illustration of the use of STGs
+for designing control circuits following the approach explained in chapter 6.
+The three latch controllers have the following characteristics:
+
+The simple or un-decoupled latch controller has the problem that new
+input data can only be latched when the previous handshake on the output
+channel has completed, i.e., after Aout-. Furthermore, the handshakes
+on the input and output channels interact tightly: Rout+ → Ain+
+and Rout- → Ain-.
+
+The semi-decoupled latch controller relaxes these requirements somewhat:
+new inputs may be latched after Rout-, and the controller may
+produce Ain+ independently of the handshaking on the output channel –
+the interaction between the input and output channels has been relaxed
+to: Aout+ → Ain-.
+
+The fully-decoupled latch controller further relaxes these requirements:
+new inputs may be latched after Aout+ (i.e. as soon as the downstream
+latch has indicated that it has latched the current data), and the handshaking
+on the input channel may complete without any interaction with
+the output channel.
+
+Latch controller      Static spread, S    Period, P
+“Simple”              2                   2Lr + 2Lf.V
+“Semi-decoupled”      1                   2Lr + 2Lf.V
+“Fully-decoupled”     1                   2Lr + Lf.V + Lf.E
+
+Table 7.1.
+Summary of the characteristics of the latch controllers in [41].
+
+Another potential drawback of the simple latch controller is that it is unable
+to take advantage of function blocks with asymmetric delays as explained in
+section 4.4.1 on page 52. The fully-decoupled latch controller presented in
+[41] does not have this problem. Due to the decoupling of the input and out-
+put channels the dependency graph critical cycle that determines the period, P,
+only visits nodes related to two neighbouring pipeline stages and the period be-
+comes minimum (c.f. section 4.4.1). Table 7.1 summarizes the characteristics
+of the simple, semi-decoupled and fully-decoupled latch controllers.
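+
+The period expressions in table 7.1 can be evaluated for concrete latencies.
+The Python sketch below is only an illustration: it assumes that Lr is the
+reverse latency and that Lf.V and Lf.E are the forward latencies for valid and
+empty tokens (the notation of section 4.4.1), and both the function name and
+the numbers in the usage example are hypothetical.
+
```python
def pipeline_period(controller, l_r, l_fv, l_fe):
    """Period P of a pipeline stage according to table 7.1.

    l_r  : reverse latency (Lr)
    l_fv : forward latency for valid tokens (Lf.V)
    l_fe : forward latency for empty tokens (Lf.E)
    """
    if controller in ("simple", "semi-decoupled"):
        return 2 * l_r + 2 * l_fv
    if controller == "fully-decoupled":
        return 2 * l_r + l_fv + l_fe
    raise ValueError(f"unknown latch controller: {controller}")


# Hypothetical latencies (arbitrary time units) with an asymmetric function
# block: empty tokens propagate faster than valid tokens.
for name in ("simple", "semi-decoupled", "fully-decoupled"):
    print(name, pipeline_period(name, l_r=1, l_fv=5, l_fe=2))
```
+
+With these example numbers only the fully-decoupled controller benefits from
+the shorter forward latency of the empty tokens.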
+All of the above-mentioned latch controllers are “normally transparent”
+and this may lead to excessive power consumption because inputs that make
+multiple transitions before settling will propagate through several consecutive
+pipeline stages. By using “normally opaque” latch controllers every latch will
+act as a barrier. If a handshake latch that is holding a bubble is exposed to
+a token on its input, the latch controller switches the latch into the transpar-
+ent mode, and when the input data have propagated safely into the latch, it
+will switch the latch back into the opaque mode in which it will hold the data.
+In the design of the asynchronous MIPS processor reported in [23] we expe-
+rienced approximately a 50 % power reduction when using normally opaque
+latch controllers instead of normally transparent latch controllers.
+Figure 7.6 shows the STG specification and the circuit implementation of
+the normally opaque latch controller used in [23]. As seen from the STG there
+is quite a strong interaction between the input and output channels, but the
+dependency graph critical cycle that determines the period only visits nodes
+related to two neighbouring pipeline stages and the period is minimum. It may
+be necessary to add some delay into the Lt- to Rout+ path in order to ensure
+that input signals have propagated through the latch before Rout+. Furthermore
+the duration of the Lt = 0 pulse that causes the latch to be transparent is
+determined by gate delays in the latch controller itself, and the pulse must be
+long enough to ensure safe latching of the data. The latch controller assumes
+a broad data-validity scheme on its input channel and it provides a broad data-
+validity scheme on its output channel.
+
+Figure 7.6.
+The STG specification and the circuit implementation of the normally opaque
+fully-decoupled latch controller from [23]. (Lt = 0: latch is transparent;
+Lt = 1: latch is opaque, holding data.)
+
+7.4.
+Summary
+
+This chapter introduced a selection of channel types, data-validity schemes
+and a selection of latch controllers. The presentation was rather brief; the aim
+was just to present the basics and to introduce some of the many options and
+possibilities for optimizing the circuits. The interested reader is referred to the
+literature for more details.
+Finally a warning: the static data-flow view of asynchronous circuits pre-
+sented in chapter 3 (i.e. that valid and empty tokens are copied forward con-
+trolled by the handshaking between latch controllers) and the performance
+analysis presented in chapter 4 assume that all handshake latches use the sim-
+ple normally transparent latch controller. When using semi-decoupled or fully-
+decoupled latch controllers, it is necessary to modify the token flow view, and
+to rework the performance analysis. To a first order one might substitute each
+semi-decoupled or fully-decoupled latch controller with a pair of simple latch
+controllers. Furthermore a ring need only include two handshake latches if
+semi-decoupled or fully-decoupled latch controllers are used.
+
+
+Chapter 8
+
+HIGH-LEVEL LANGUAGES AND TOOLS
+
+This chapter addresses languages and CAD tools for the high-level modeling
+and synthesis of asynchronous circuits. The aim is briefly to introduce some
+basic concepts and a few representative and influential design methods. The
+interested reader will find more details elsewhere in this book (in Part II and
+chapter 13) as well as in the original papers that are cited in the text. In the last
+section we address the use of VHDL for the design of asynchronous circuits.
+
+8.1.
+Introduction
+
+Almost all work on the high-level modeling and synthesis of asynchronous
+circuits is based on the use of a language that belongs to the CSP family of
+languages, rather than one of the two industry-standard hardware description
+languages, VHDL and Verilog. Asynchronous circuits are highly concurrent
+and communication between modules is based on handshake channels. Con-
+sequently a hardware description language for asynchronous circuit design
+should provide efficient primitives supporting these two characteristics. The
+CSP language proposed by Hoare [57, 58] meets these requirements. CSP
+stands for “Communicating Sequential Processes” and its key characteristics
+are:
+
+Concurrent processes.
+
+Sequential and concurrent composition of statements within a process.
+
+Synchronous message passing over point-to-point channels (supported
+by the primitives send, receive and – possibly – probe).
+
+CSP is a member of a large family of languages for programming concurrent
+systems in general: OCCAM [68], LOTOS [108, 16], and CCS [89], as well as
+languages defined specifically for designing asynchronous circuits: Tangram
+[142, 135], CHP [81], and Balsa [9, 10]. Further details are presented else-
+where in this book on Tangram (in Part III, chapter 13) and Balsa (in Part II).
+In this chapter we first take a closer look at the CSP language constructs
+supporting communication and concurrency. This will include a few sample
+programs to give a flavour of this type of language. Following this we briefly
+explain two rather different design methods that both take a CSP-like program
+as the starting point for the design:
+
+At Philips Research Laboratories, van Berkel, Peeters, Kessels et al. have
+developed a proprietary language, Tangram, and an associated silicon
+compiler [142, 141, 135, 112]. Using a process called syntax-directed
+compilation, the synthesis tool maps a Tangram program into a structure
+of handshake components. Using these tools several significant asyn-
+chronous chips have been designed within Philips [137, 138, 144, 73,
+74]. The last of these is a smart-card circuit that is described in chap-
+ter 13 on page 221.
+
+At Caltech Martin has developed a language CHP – Communicating
+Hardware Processes – and a set of tools that supports a partly manual,
+partly automated design flow that targets highly optimized transistor-
+level implementations of QDI 4-phase dual-rail circuits [80, 83].
+
+CHP has a syntax that is similar to CSP (using various special symbols)
+whereas Tangram has a syntax that is more like a traditional programming
+language (using keywords); but in essence they are both very similar to CSP.
+In the last section of this chapter we will introduce a VHDL-package that
+provides CSP-like message passing and explain an associated VHDL-based
+design flow that supports a manual step-wise refinement design process.
+
+8.2.
+Concurrency and message passing in CSP
+
+The “sequential processes” part of the CSP acronym denotes that each pro-
+cess is described by a program whose statements are executed in sequence one
+by one. A semicolon is used to separate statements (as in many other program-
+ming languages). The semicolon can be seen as an operator that combines
+statements into programs. In this respect a process in CSP is very similar to a
+process in VHDL. However, CSP also allows the parallel composition of statements
+within a process. The symbol “∥” denotes parallel composition. This
+feature is not found in VHDL, whereas the fork-join construct in Verilog does
+allow statement-level concurrency within a process.
+The “communicating” part of the CSP acronym refers to synchronous mes-
+sage passing using point-to-point channels as illustrated in figure 8.1, where
+two processes P1 and P2 are connected by a channel named C. Using a send
+statement, C!x, process P1 sends (denoted by the ‘!’ symbol) the value of its
+variable x on channel C, and using a receive statement, C?y, process P2 re-
+ceives (denoted by the ‘?’ symbol) from channel C a value that is assigned
+to its variable y. The channel is memoryless and the transfer of the value of
+variable x in P1 into variable y in P2 is an atomic action. This has the effect
+of synchronizing processes P1 and P2. Whichever comes first will wait for
+the other party, and the send and receive statements complete at the same time.
+The term rendezvous is sometimes used for this type of synchronization.
+
+P1:                  P2:
+  var x ...            var y,z ...
+  ....                 ....
+  x := 17;             C?y;
+  ....                 z := y - 17;
+  C!x;                 ....
+  ....
+
+(P1 and P2 are connected by channel C.)
+
+Figure 8.1.
+Two processes P1 and P2 connected by a channel C. Process P1 sends the value of
+its variable x to the channel C, and process P2 receives the value and assigns it to its variable y.
+
+When a process executes a send (or receive) statement, it commits to the
+communication and suspends until the process at the other end of the channel
+performs its receive (or send) statement. This may not always be desirable, and
+Martin has extended CSP with a probe construct [79] which allows the process
+at the passive end of a channel to probe whether or not a communication is
+pending on the channel, without committing to any communication. The probe
+is a function which takes a channel name as its argument and returns a Boolean.
+The syntax for probing channel C is the channel name with an overbar, C̄.
+As an aside we mention that some languages for programming concurrent
+systems assume channels with (possibly unbounded) buffering capability. The
+implication of this is that the channel acts as a FIFO, and the communicating
+processes do not synchronize when they communicate. Consequently this form
+of communication is called asynchronous message passing.
+Going back to our synchronous message passing, it is obvious that the phys-
+ical implementation of a memoryless channel is simply a set of wires together
+with a protocol for synchronizing the communicating processes. It is also obvi-
+ous that any of the protocols that we have considered in the previous chapters
+may be used. Synchronous message passing is thus a very useful language
+construct that supports the high-level modeling of asynchronous circuits by
+abstracting away the exact details of the data encoding and handshake protocol
+used on the channel.
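
The rendezvous behaviour described above can be sketched in a few lines of Python. This is only an illustrative model and not part of CSP, Tangram or the tool flows discussed in this chapter: the class name `Channel`, its methods and the helper `run_figure_8_1` are assumptions made for this example, and threads stand in for the processes P1 and P2 of figure 8.1.

```python
import threading


class Channel:
    """Memoryless point-to-point channel (one sender, one receiver)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = False  # a sender has committed and is waiting
        self._done = False     # the receiver has taken the current value
        self._value = None

    def send(self, value):
        """C!x: commit to the communication and wait for the receiver."""
        with self._cond:
            self._value = value
            self._done = False
            self._pending = True
            self._cond.notify_all()
            self._cond.wait_for(lambda: self._done)

    def receive(self):
        """C?y: wait for a sender, then take its value."""
        with self._cond:
            self._cond.wait_for(lambda: self._pending)
            value = self._value
            self._pending = False
            self._done = True
            self._cond.notify_all()
            return value

    def probe(self):
        """Report whether a send is pending, without committing to it."""
        with self._cond:
            return self._pending


def run_figure_8_1():
    """Run P1 and P2 from figure 8.1 and return the value of z."""
    c = Channel()
    result = {}

    def p1():
        x = 17
        c.send(x)             # C!x

    def p2():
        y = c.receive()       # C?y
        result["z"] = y - 17  # z := y - 17

    threads = [threading.Thread(target=p2), threading.Thread(target=p1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["z"]
```

Whichever thread arrives first blocks until the other one arrives, which mirrors the text: no value is buffered, and `probe` lets the receiving side check for a pending send without blocking.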
+Unfortunately both VHDL and Verilog lack such primitives. It is possible
+to write low-level code that implements the handshaking, but it is highly unde-
+sirable to mix such low-level details into code whose purpose is to capture the
+high-level behaviour of the circuit.
+In the following section we will provide some small program examples to
+give a flavour of this type of language. The examples will be written in Tangram
+as they also serve the purpose of illustrating syntax-directed compilation
+in a subsequent section. The source code, handshake circuit figures, and frag-
+ments of the text have been kindly provided by Ad Peeters from Philips.
+Manchester University has recently developed a similar language and syn-
+thesis tool that is available in the public domain [10], and is introduced in Part
+II of this book. Other examples of related work are presented in [17] and [21].
+
+8.3.
+Tangram: program examples
+
+This section provides a few simple Tangram program examples: a 2-place
+shift register, a 2-place ripple FIFO, and a greatest common divisor function.
+
+8.3.1
+A 2-place shift register
+
+Figure 8.2 shows the code for a 2-place shift register named ShiftReg. It is
+a process with an input channel in and an output channel out, both carrying
+values of type [0..255]. There are two local variables x and y that are
+initialized to 0. The process performs an unbounded repetition of a sequence
+of three statements: out!y; y:=x; in?x.
+
+[Figure 8.2 diagram: process ShiftReg with input channel in, variables x and y, and output channel out.]
+
+T = type [0..255]
+& ShiftReg : main proc(in? chan T & out! chan T).
+begin
+  & var x,y: var T := 0
+  |
+  forever do
+    out!y ; y:=x ; in?x
+  od
+end
+
+Figure 8.2.
+A Tangram program for a 2-place shift register.
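
To make the ordering of the three statements concrete, the following Python sketch steps through the loop body of figure 8.2 on a finite input list. It is a behavioural illustration only: the function name `shift_reg` and the convention of stopping when the input runs out are assumptions of this example, since the Tangram process itself never terminates.

```python
def shift_reg(inputs):
    """Return the values sent on `out` until `in` has no more data."""
    x = y = 0                  # both variables are initialized to 0
    outputs = []
    source = iter(inputs)
    while True:                # forever do
        outputs.append(y)      #   out!y
        y = x                  #   y := x
        try:
            x = next(source)   #   in?x
        except StopIteration:
            return outputs     # the real process would block here
```

For the inputs `[1, 2, 3]` the sketch returns `[0, 0, 1, 2]`: the two initial zeros leave first, and each input value appears on the output two iterations after it was received.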
+
+8.3.2
+A 2-place (ripple) FIFO
+
+Figure 8.3 shows the Tangram program for a 2-place first-in first-out buffer
+named Fifo. It can be understood as two 1-place buffers that are operating in
+parallel and that are connected by a channel c. At first sight it appears very
+similar to the 2-place shift register presented above, but a closer examination
+will show that it is more flexible and exhibits greater concurrency.
+
+[Figure 8.3 diagram: process Fifo with input channel in, variable x, internal channel c, variable y, and output channel out.]
+
+T = type [0..255]
+& Fifo : main proc(in? chan T & out! chan T).
+begin
+  & x,y: var T
+  & c : chan T
+  |
+     forever do in?x ; c!x od
+  || forever do c?y ; out!y od
+end
+
+Figure 8.3.
+A Tangram program for a 2-place (ripple) FIFO.
+
+8.3.3
+GCD using while and if statements
+
+Figure 8.4 shows the code for a module that computes the greatest common
+divisor, the example from section 3.7. The “do x<>y then ... od” is a while
+statement and, apart from the syntactical differences, the code in figure 8.4 is
+identical to the code in figure 3.11 on page 39.
+The module has an input channel from which it receives the two operands,
+and an output channel on which it sends the result.
+
+int = type [0..255]
+& gcd_if : main proc (in?chan <<int,int>> & out!chan int).
+begin x,y:var int ff
+| forever do
+    in?<<x,y>>
+  ; do x<>y then
+      if x<y then y:=y-x
+             else x:=x-y
+      fi
+    od
+  ; out!x
+  od
+end
+
+Figure 8.4.
+A Tangram program for GCD using while and if statements.
+
+8.3.4
+GCD using guarded commands
+
+Figure 8.5 shows an alternative version of GCD. This time the module has
+separate input channels for the two operands and its body is based on the repe-
+tition of a guarded command. The guarded repetition can be seen as a general-
+ization of the while statement. The statement repeats until all guards are false.
+When at least one of the guards is true, exactly one command corresponding to
+such a true guard is selected (either deterministically or non-deterministically)
+and executed.
+
+int = type [0..255]
+& gcd_gc : main proc (in1,in2?chan int & out!chan int).
+begin x,y:var int ff
+| forever do
+    in1?x || in2?y
+  ; do x<y then y:=y-x
+    or y<x then x:=x-y
+    od
+  ; out!x
+  od
+end
+
+Figure 8.5.
+A Tangram program for GCD using guarded repetition.
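
The guarded repetition of figure 8.5 can be mirrored in Python as a loop that runs until both guards are false. This is a sketch of the semantics only: the function name `gcd_gc` is taken from the figure, while the check for positive operands is an assumption added here, because subtraction-based GCD does not terminate when exactly one operand is zero.

```python
def gcd_gc(x, y):
    """Greatest common divisor by repeated subtraction (guarded repetition)."""
    if x <= 0 or y <= 0:
        raise ValueError("operands must be positive")
    while True:
        if x < y:        # guard 1: x<y then y:=y-x
            y = y - x
        elif y < x:      # guard 2: y<x then x:=x-y
            x = x - y
        else:            # all guards false: the repetition terminates
            return x
```

The two guards are mutually exclusive here, so the choice between a deterministic and a non-deterministic selection makes no difference to the result.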
+
+8.4.
+Tangram: syntax-directed compilation
+
+Let us now address the synthesis process. The design flow uses an inter-
+mediate format based on handshake circuits. The front-end design activity
+is called VLSI programming and, using syntax-directed compilation, a Tan-
+gram program is mapped into a structure of handshake components. There is a
+one-to-one correspondence between the Tangram program and the handshake
+circuit as will be clear from the following examples. The compilation process
+is thus fully transparent to the designer, who works entirely at the Tangram
+program level.
+The back-end of the design flow involves a library of handshake circuits
+that the compiler targets as well as some tools for post-synthesis peephole
+optimization of the handshake circuits (i.e. replacing common structures of
+handshake components by more efficient equivalent ones). A number of hand-
+shake circuit libraries exist, allowing implementations using different hand-
+shake protocols (4-phase dual-rail, 4-phase bundled-data, etc.), and different
+implementation technologies (CMOS standard cells, FPGAs, etc.). The hand-
+shake components can be specified and designed: (i) manually, or (ii) using
+STGs and Petrify as explained in chapter 6, or (iii) using the lower steps in
+Martin’s transformation-based method that is presented in the next section.
+
+It is beyond the scope of this text to explain the details of the compilation
+process. We will restrict ourselves to providing a flavour of “syntax-directed
+compilation” by showing handshake circuits corresponding to the example
+Tangram programs from the previous section.
+
+8.4.1
+The 2-place shift register
+
+As a first example of syntax-directed compilation figure 8.6 shows the hand-
+shake circuit corresponding to the Tangram program in figure 8.2.
+
+[Figure 8.6 diagram: a repeater on top of a 3-way sequencer (outputs numbered 0, 1, 2) that activates three transferers connecting in, x, y and out.]
+
+Figure 8.6.
+The compiled handshake circuit for the 2-place shift register.
+
+Handshake components are represented by circular symbols, and the chan-
+nels that connect the components are represented by arcs. The small dots on
+the component symbols represent ports. An open dot denotes a passive port
+and a solid dot denotes an active port. The arrowhead represents the direction
+of the data transfer. A nonput channel does not involve the transfer of data and
+consequently it has no direction and no arrowhead. As can be seen in figure 8.6
+a handshake circuit uses a mix of push and pull channels.
+The structure of the program is a forever-do statement whose body consists
+of three statements that are executed sequentially (because they are separated
+by semicolons). Each of the three statements is a kind of assignment statement:
+the value of variable y is “assigned” to output channel out, the value of variable
+x is assigned to variable y, and the value received on input channel in is assigned
+to variable x. The structure of the handshake circuit is exactly the same:
+
+At the top is a repeater that implements the forever-do statement. A
+repeater waits for a request on its passive input port and then it performs
+an unbounded repetition of handshakes on its active output channel. The
+handshake on the input channel never completes.
+
+Below is a 3-way sequencer that implements the semicolons in the pro-
+gram text. The sequencer waits for a request on its passive input channel,
+then it performs in sequence a full handshake on each of its active output
+channels (in the order indicated by the numbers in the symbol) and
+finally it completes the handshaking on the passive input channel. In
+this way the sequencer activates in turn the handshake circuit constructs
+that correspond to the individual statements in the body of the forever-do
+statement.
+
+The bottom row of handshake components includes two variables, x and
+y, and three transferers, denoted by ‘→’. Note that variables have passive
+read and write ports. The transferers implement the three statements
+(out!y; y:=x; in?x) that form the body of the forever-do statement,
+each a form of assignment. A transferer waits for a request on its passive
+nonput channel and then initiates a handshake on its pull input channel.
+The handshake on the pull input channel is relayed to the push output
+channel. In this way the transferer pulls data from its input channel and
+pushes it onto its output channel. Finally, it completes the handshaking
+on the passive nonput channel.
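
The one-to-one mapping from program text to handshake components can be imitated in Python by giving each component an `activate` method that stands for one complete handshake on its passive port. The sketch below is an analogy and not the Tangram compiler's output: all class names, the `EndOfInput` exception used to stop the otherwise unbounded repeater, and the function `run_shift_reg` are assumptions of this example.

```python
class EndOfInput(Exception):
    """Raised by the test environment when `in` has no more data."""


class Variable:
    def __init__(self, value=0):
        self.value = value

    def read(self):            # passive read port
        return self.value

    def write(self, value):    # passive write port
        self.value = value


class Transferer:
    """Pulls a value from its input and pushes it onto its output."""

    def __init__(self, pull, push):
        self.pull, self.push = pull, push

    def activate(self):
        self.push(self.pull())


class Sequencer:
    """Activates its children one after the other (the semicolons)."""

    def __init__(self, *children):
        self.children = children

    def activate(self):
        for child in self.children:
            child.activate()


class Repeater:
    """Activates its child forever (the forever-do statement)."""

    def __init__(self, child):
        self.child = child

    def activate(self):
        while True:
            self.child.activate()


def run_shift_reg(inputs):
    x, y = Variable(0), Variable(0)
    source, outputs = iter(inputs), []

    def read_in():
        try:
            return next(source)
        except StopIteration:
            raise EndOfInput from None

    circuit = Repeater(Sequencer(
        Transferer(y.read, outputs.append),  # out!y
        Transferer(x.read, y.write),         # y := x
        Transferer(read_in, x.write),        # in?x
    ))
    try:
        circuit.activate()
    except EndOfInput:
        pass
    return outputs
```

The nesting of the constructor calls follows the structure described above: a repeater on top, a 3-way sequencer below it, and three transferers around the two variables.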
+
+8.4.2
+The 2-place FIFO
+
+Figure 8.7 shows the handshake circuit corresponding to the Tangram pro-
+gram in figure 8.3. The component labeled ‘psv’ in the handshake circuit of
+figure 8.7 is a so-called passivator. It relates to the internal channel c of the
+Fifo and implements the synchronization and communication between the ac-
+tive sender (c!x) and the active receiver (c?y).
+
+[Figure 8.7 diagram: handshake circuit with ports in and out, variables x and y, two 2-way sequencers (outputs numbered 0, 1) and the passivator psv.]
+
+Figure 8.7.
+Compiled handshake circuit for the FIFO program.
+
+An optimization of the handshake circuit for Fifo is shown in figure 8.8.
+The synchronization in the datapath using a passivator has been replaced by a
+synchronization in the control using a ‘join’ component. One may observe that
+the datapath of this handshake circuit for the FIFO design is the same as that
+of the shift register, shown in figure 8.6. The only difference is in the control
+part of the circuits.
+
+[Figure 8.8 diagram: handshake circuit with ports in and out, variables x and y, and two 2-way sequencers (outputs numbered 0, 1); the passivator has been replaced by a join in the control part.]
+
+Figure 8.8.
+Optimized handshake circuit for the FIFO program.
+
+8.4.3
+GCD using guarded repetition
+
+As a more complex example of syntax-directed compilation figure 8.9 shows
+the handshake circuit compiled from the Tangram program in figure 8.5. Com-
+pared with the previous handshake circuits, the handshake circuit for the GCD
+program introduces two new classes of components that are treated in more
+detail below.
+Firstly, the circuit contains a ‘bar’ and a ‘do’ component, both of which are
+data-dependent control components. Secondly, the handshake circuit contains
+components that do not directly correspond to language constructs, but rather
+implement sharing: the multiplexer (denoted by ‘mux’), the demultiplexer (de-
+noted by ‘dmx’), and the fork component (denoted by ‘�’).
+Warning: the Tangram fork is identical to the fork in figure 3.3 but the Tan-
+gram multiplexer and demultiplexer components are different. The Tangram
+multiplexer is identical to the merge in figure 3.3 and the Tangram demulti-
+plexer is a kind of “inverse merge.” Its output ports are passive and it requires
+the handshakes on the two outputs to be mutually exclusive.
+
+The ‘bar’ and the ‘do’ components:
+The do and bar components together
+implement the guarded command construct with two guards, in which the do
+component implements the iteration part (the do ... od part, including the
+evaluation of the disjunction of the two guards), and the bar component implements
+the choice part (the then ... or ... then part of the command).
+The do component, when activated through its passive port, first collects the
+disjunction of the value of all guards through a handshake on its active data
+port. When the value thus collected is true, it activates its active nonput port
+(to activate the selected command), and after completion starts a new evalua-
+tion cycle. When the value collected is false, the do component completes its
+operation by completing the handshake on the passive port.
+
+[Figure 8.9 diagram: handshake circuit with input ports in1 and in2, output port out, variables x and y (each with a mux and a dmx component), the bar and do components, and a 3-way sequencer (outputs numbered 0, 1, 2).]
+
+Figure 8.9.
+Compiled handshake circuit for the GCD program using guarded repetition.
+
+The bar component can be activated either through its passive data port, or
+through its passive control port. (The do component, for example, sequences
+these two activations.) When activated through the data port, it collects the
+value of two guards through a handshake on the active data ports, and then
+sends the disjunction of these values along the passive data port, thus complet-
+ing that handshake. When activated through the control port, the bar compo-
+nent activates an active control port of which the associated data port returned a
+‘true’ value in the most recent data cycle. (For simplicity, this selection is typ-
+ically implemented in a deterministic fashion, although this is not required at
+the level of the program.) One may observe that bar components can be combined
+in a tree or list to implement a guarded command list of arbitrary length.
+Furthermore, not every data cycle has to be followed by a control cycle.
+
+The ‘mux’, ‘dmx’, and ‘fork’ components
+The program for GCD in
+figure 8.5 has two occurrences of variable x in which a value is written into x,
+namely input action in1?x and assignment x:=x-y. In the handshake circuit
+of figure 8.9, these two write actions for Tangram variable x are merged by the
+multiplexer component so as to arrive at the write port of handshake variable
+x.
+Variable x occurs at five different locations in the program as an expres-
+sion, once in the output expression out!x, twice in the guard expressions x<y
+and y<x, and twice in the assignment expressions x-y and y-x. These five in-
+spections of variable x could be implemented as five distinct read ports on the
+handshake variable x, which is shown in the handshake circuit in [135, Fig. 2.7,
+p.34]. In figure 8.9, a different compilation is shown, in which handshake vari-
+able x has three read ports:
+
+A read port dedicated to the occurrence in the output action.
+
+A read port dedicated to the guard expressions. Their evaluation is mu-
+tually inclusive, and hence can be combined using a synchronizing fork
+component.
+
+A read port dedicated to the assignment expressions. Their evaluation is
+mutually exclusive, and hence can be combined using a demultiplexer.
+
+The GCD example is discussed in further detail in chapter 13.
+
+8.5.
+Martin’s translation process
+
+The work of Martin and his group at Caltech has made fundamental contri-
+butions to asynchronous design and it has influenced the work of many other
+researchers. The methods have been used at Caltech to design several sig-
+nificant chips, most recently and most notably an asynchronous MIPS R3000
+processor [88]. As the following presentation of the design flow hints, the de-
+sign process is elaborate and sophisticated and is probably only an option to a
+person who has spent time with the Caltech group.
+The mostly manual design process involves the following steps (semantics-
+preserving transformations):
+(1) Process decomposition where each process is refined into a collection
+of interacting simpler processes. This step is repeated until all processes are
+simple enough to be dealt with in the next step in the process.
+(2) Handshake expansion where each communication channel is replaced
+by explicit wires and where each communication action (e.g. send or receive)
+is replaced by the signal transitions required by the protocol that is being used.
+For example a receive statement such as:
+
+C?y
+
+is replaced by a sequence of simpler statements – for example:
+
+[Creq]; y := data; Cack↑; [¬Creq]; Cack↓
+
+which is read as: “wait for request to go high”, “read the data”, “drive ac-
+knowledge high”, “wait for request to go low”, and “drive acknowledge low”.
+At this level it may be necessary to add state variables and/or to reshuffle
+signal transitions in order to obtain a specification that satisfies a condition
+similar to the CSC condition in chapter 6.
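
The handshake expansion of the receive statement can be traced with a small Python simulation in which each process is a generator that yields while it waits for a wire. The receiver follows the five steps listed above. The sender side, the wire names `req`, `ack` and `data`, and the round-robin scheduler in `run_handshake` are assumptions of this sketch (a 4-phase push protocol is assumed), not Martin's tools or notation.

```python
def receiver(w, result):
    while not w["req"]:        # wait for request to go high
        yield
    result["y"] = w["data"]    # read the data
    w["ack"] = True            # drive acknowledge high
    while w["req"]:            # wait for request to go low
        yield
    w["ack"] = False           # drive acknowledge low


def sender(w, value):
    w["data"] = value          # assumed push protocol: data valid first
    w["req"] = True
    while not w["ack"]:
        yield
    w["req"] = False
    while w["ack"]:
        yield


def run_handshake(value):
    """Return the received value and the sequence of (req, ack) states."""
    w = {"req": False, "ack": False, "data": None}
    result, trace = {}, []
    procs = [sender(w, value), receiver(w, result)]
    while procs:
        for p in list(procs):
            try:
                next(p)        # run the process until it has to wait
            except StopIteration:
                procs.remove(p)
            state = (w["req"], w["ack"])
            if not trace or trace[-1] != state:
                trace.append(state)
    return result["y"], trace
```

The recorded trace visits the four phases in order: request high, acknowledge high, request low, acknowledge low.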
+(3) Production rule expansion where each handshaking expansion is re-
+placed by a set of production rules (or guarded commands), for example:
+
+a ∧ b → c↑    and    ¬b ∧ ¬c → c↓
+
+A production rule consists of a condition and an action, and the action is
+performed whenever the condition is true. As an aside we mention that the above
+two production rules express the same as the set and reset functions for the
+signal c on page 96. The production rules specify the behaviour of the internal
+signals and output signals of the process. The production rules are themselves
+simple concurrent processes and the guards must ensure that the signal tran-
+sitions occur in program order (i.e. that the semantics of the original CHP
+program are maintained). This may require strengthening the guards. Further-
+more, in order to obtain simpler circuit implementations, the guards may be
+modified and made symmetric.
+(4) Operator reduction where production rules are grouped into clusters and
+where each cluster is mapped onto a basic hardware component similar to a
+generalized C-element. The above two production rules would be mapped into
+the generalized C-element shown in figure 6.17 on page 100.
+
+8.6.
+Using VHDL for asynchronous design
+
+8.6.1
+Introduction
+
+In this section we will introduce a couple of VHDL packages that provide
+the designer with primitives for synchronous message passing between pro-
+cesses – similar to the constructs found in the CSP-family of languages (send,
+receive and probe).
+The material was developed in an M.Sc. project and used in the design of a
+32-bit floating-point ALU using the IEEE floating-point number representation
+[110], and it has subsequently been used in a course on asynchronous circuit
+design. Others, including [95, 118, 149, 78], have developed related VHDL
+packages and approaches.
+The channel packages introduced in the following support only one type
+of channel, using a 32-bit 4-phase bundled-data push protocol. However, as
+VHDL allows the overloading of procedures and functions, it is straightfor-
+ward to define channels with arbitrary data types. All it takes is a little cut-and-
+paste editing. Providing support for protocols other than the 4-phase bundled-
+data push protocol will require more significant extensions to the packages.
+
+8.6.2
+VHDL versus CSP-type languages
+
+The previous sections introduced several CSP-like hardware description lan-
+guages for asynchronous design. The advantages of these languages are their
+support of concurrency and synchronous message passing, as well as a limited
+and well-defined set of language constructs that makes syntax-directed compi-
+lation a relatively simple task.
+Having said this there is nothing that prevents a designer from using one
+of the industry standard languages VHDL (or Verilog) for the design of asyn-
+chronous circuits. In fact some of the fundamental concepts in these languages
+– concurrent processes and signal events – are “nice fits” with the modeling
+and design of asynchronous circuits. To illustrate this figure 8.10 shows how
+the Tangram program from figure 8.2 could be expressed in plain VHDL. In
+addition to demonstrating the feasibility, the figure also highlights the limi-
+tations of VHDL when it comes to modeling asynchronous circuits: most of
+the code expresses low-level handshaking details, and this greatly clutters the
+description of the function of the circuit.
+VHDL obviously lacks built-in primitives for synchronous message passing
+on channels similar to those found in CSP-like languages. Another feature of
+the CSP family of languages that VHDL lacks is statement-level concurrency
+within a process. On the other hand there are also some advantages of using an
+industry standard hardware description language such as VHDL:
+
+It is well supported by existing CAD tool frameworks that provide sim-
+ulators, pre-designed modules, mixed-mode simulation, and tools for
+synthesis, layout and the back annotation of timing information.
+
+The same simulator and test bench can be used throughout the entire de-
+sign process from the first high-level specification to the final implemen-
+tation in some target technology (for example a standard cell layout).
+
+It is possible to perform mixed-mode simulations where some entities
+are modeled using behavioural specifications and others are implemented
+using the components of the target technology.
+
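+-- Assumption: T is declared in a small package (the name shiftreg_types is
+-- illustrative) so that it is visible in the entity declaration.
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+package shiftreg_types is
+  subtype T is std_logic_vector(7 downto 0);
+end shiftreg_types;
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.shiftreg_types.all;
+
+entity ShiftReg is
+  port ( in_req   : in  std_logic;
+         in_ack   : out std_logic;
+         in_data  : in  T;
+         out_req  : out std_logic;
+         out_ack  : in  std_logic;
+         out_data : out T );
+end ShiftReg;
+
+architecture behav of ShiftReg is
+begin
+  process
+    variable x, y: T := (others => '0');   -- initialized to 0, as in figure 8.2
+  begin
+    loop
+      out_req  <= '1';                     -- out!y
+      out_data <= y;
+      wait until out_ack = '1';
+      out_req  <= '0';
+      wait until out_ack = '0';
+      y := x;                              -- y := x
+      wait until in_req = '1';             -- in?x
+      x := in_data;
+      in_ack <= '1';
+      wait until in_req = '0';
+      in_ack <= '0';
+    end loop;
+  end process;
+end behav;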
+
+Figure 8.10.
+VHDL description of the 2-place shift register from figure 8.2.
+
+Many real-world systems include both synchronous and asynchronous
+subsystems, and such hybrid systems can be modeled without any prob-
+lems in VHDL.
+
+8.6.3
+Channel communication and design flow
+
+The design flow presented in what follows is motivated by the advantages
+mentioned above. The goal is to augment VHDL with CSP-like channel com-
+munication primitives, i.e. the procedures send(<channel>, <variable>)
+and receive(<channel>,<variable>) and the function probe(<channel>).
+Another goal is to enable mixed-mode simulations where one end of a channel
+connects to an entity whose architecture body is a circuit implementation and
+the other end connects to an entity whose architecture body is a behavioural
+description using the above communication primitives, figure 8.11(b). In this way
+a manual top-down stepwise refinement design process is supported, where the
+same test bench is used throughout the entire design process from high-level
+specification to low-level circuit implementation, figure 8.11(a-c).
+
+[Figure 8.11 diagram: (a) High-level model: Entity 1 executes Send(<channel>,<var>) and Entity 2 executes Receive(<channel>,<var>). (b) Mixed-mode model: Entity 1 executes Send(<channel>,<var>) and Entity 2 is a circuit implementation (control, latches and combinational logic) connected through Req, Ack and Data. (c) Low-level model: both entities are circuit implementations.]
+
+Figure 8.11.
+The VHDL packages for channel communication support high-level, mixed-
+mode and gate-level/standard cell simulations.
+
+In VHDL all communication between processes takes place via signals.
+Channels therefore have to be declared as signals, preferably one signal per
+channel. Since (for a push channel) the sender drives the request and data part
+of a channel, and the receiver drives the acknowledge part, there are two drivers
+to one signal. This is allowed in VHDL if the signal is a resolved signal. Thus,
+it is possible to define a channel type as a record with a request, an acknowl-
+edge and a data field, and then define a resolution function for the channel type
+which will determine the resulting value of the channel. This type of channel,
+with separate request and acknowledge fields, will be called a real channel and
+is described in section 8.6.5. In simulations there will be three traces for each
+channel, showing the waveforms of request and acknowledge along with the
+data that is communicated.
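
The idea of a resolved channel signal with two drivers can be illustrated outside VHDL as well. In the Python sketch below each driver contributes a record in which the fields it does not drive are left as `None`, and a resolution function merges the contributions. The names `ChannelValue` and `resolve` and the use of `None` for "not driven" are assumptions of this example; the actual definitions are those of the real channel package listed in appendix 8.A.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChannelValue:
    req: Optional[bool] = None   # driven by the sender
    ack: Optional[bool] = None   # driven by the receiver
    data: Optional[int] = None   # driven by the sender


def resolve(drivers):
    """Merge the contributions of all drivers into one channel value."""
    merged = {}
    for name in ("req", "ack", "data"):
        driven = [getattr(d, name) for d in drivers
                  if getattr(d, name) is not None]
        if len(driven) > 1:
            raise ValueError(f"field {name!r} has more than one driver")
        merged[name] = driven[0] if driven else None
    return ChannelValue(**merged)
```

For a push channel the sender would contribute `ChannelValue(req=True, data=5)` and the receiver `ChannelValue(ack=False)`; the resolved value then carries all three fields, which corresponds to the three traces mentioned above.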
+A channel can also be defined with only two fields: one that describes the
+state of the handshaking (called the “handshake phase” or simply the “phase”)
+and one containing the data. The type of the phase field is an enumerated type,
+whose values can be the handshake phases a channel can assume, as well as
+the values with which the sender and receiver can drive the field. This type of
+channel will be called an abstract channel. In simulations there will be two
+traces for each channel, and it is easy to read the phases the channel assumes
+and the data values that are transferred.
+The procedures and definitions are organized into two VHDL-packages: one
+called “abstpack.vhd” that can be used for simulating high-level models and
+one called “realpack.vhd” that can be used at all levels of design. Full listings
+can be found in appendix 8.A at the end of this chapter. The design flow
+enabled by these packages is as follows:
+
+The circuit and its environment or test bench are first modelled and
+simulated using abstract channels. All it takes is the following statement in
+the top level design unit: “use work.abstpack.all;”.
+
+The circuit is then partitioned into simpler entities. The entities still
+communicate using channels and the simulation still uses the abstract
+channel package. This step may be repeated.
+
+At some point the designer changes to using the real channel package
+by changing to: “use work.realpack.all;” in the top-level
+design unit. Apart from this simple change, the VHDL source code is
+identical.
+
+It is now possible to partition entities into control circuitry (that can be
+designed as explained in chapter 6) and data circuitry (that consists of
+ordinary latches and combinational circuitry). Mixed mode simulations as
+illustrated in figure 8.11(b) are possible. Simulation models of the con-
+trol circuits may be their actual implementation in the target technology
+or simply an entity containing a set of concurrent signal assignments –
+for example the Boolean equations produced by Petrify.
+
+Eventually, when all entities have been partitioned into control and data,
+and when all leaf entities have been implemented using components of
+the target technology, the design is complete. Using standard technology
+mapping tools an implementation may be produced, and the circuit can
+be simulated with back annotated timing information.
+
+Note that the same simulation test bench can be used throughout the entire
+design process from the high-level specification to the low-level implementa-
+tion using components from the target technology.
+
+8.6.4
+The abstract channel package
+
+An abstract channel is defined in figure 8.12 with a data type called fp (a
+32-bit standard logic vector representing an IEEE floating-point number). The
+actual channel type is called channel_fp. It is necessary to define a channel
+for each data type used in the design. The data type can be an arbitrary type,
+including record types, but it is advisable to use data types that are built from
+std_logic because this is typically the type used by target component libraries
+(such as standard cell libraries) that are eventually used for the implementation.
+
+type handshake_phase is
+  (
+    u,      -- uninitialized
+    idle,   -- no communication
+    swait,  -- sender waiting
+    rwait,  -- receiver waiting
+    rcv,    -- receiving data
+    rec1,   -- recovery phase 1
+    rec2,   -- recovery phase 2
+    req,    -- request signal
+    ack,    -- acknowledge signal
+    error   -- protocol error
+  );
+
+subtype fp is std_logic_vector(31 downto 0);
+
+type uchannel_fp is
+  record
+    phase : handshake_phase;
+    data  : fp;
+  end record;
+
+type uchannel_fp_vector is array(natural range <>) of uchannel_fp;
+
+function resolved(s : uchannel_fp_vector) return uchannel_fp;
+
+subtype channel_fp is resolved uchannel_fp;
+
+Figure 8.12. Definition of an abstract channel.
+
+The meaning of each value of the type handshake_phase is described in
+detail below:
+
+u: Uninitialized channel. This is the default value of the drivers. As long as
+either the sender or receiver drive the channel with this value, the channel
+stays uninitialized.
+
+idle: No communication. Both the sender and receiver drive the channel with
+the idle value.
+
+swait: The sender is waiting to perform a communication. The sender is driv-
+ing the channel with the req value and the receiver drives with the idle
+value.
+
+rwait: The receiver is waiting to perform a communication. The sender is
+driving the channel with the idle value and the receiver drives with the
+rwait value. This value is used both as a driving value and as a resulting
+value for a channel, just like the idle and u values.
+
+rcv: Data is transferred. The sender is driving the channel with the req value
+and the receiver drives it with the rwait value. After a predefined
+amount of time (tpd at the top of the package, see later in this section)
+the receiver changes its driving value to ack, and the channel changes
+its phase to rec1. In a simulation it is only possible to see the transferred
+value during the rcv phase and the swait phase. At all other times the
+data field assumes a predefined default data value.
+
+rec1: Recovery phase. This phase is not seen in a simulation, since the channel
+changes to the rec2 phase with no time delay.
+
+rec2: Recovery phase. This phase is not seen in a simulation, since the channel
+changes to the idle phase with no time delay.
+
+req: The sender drives the channel with this value, when it wants to perform
+a communication. A channel can never assume this value.
+
+ack: The receiver drives the channel with this value to acknowledge that it
+has received the data. A channel can never assume this value.
+
+error: Protocol error. A channel assumes this value when the resolution func-
+tion detects an error. It is an error if there is more than one driver with
+an rwait, req or ack value. This could be the result if more than two
+drivers are connected to a channel, or if a send command is accidentally
+used instead of a receive command or vice versa.
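+
+The resolution rules described above can be checked against the package listed
+in appendix A.1. The following is a minimal sketch and not part of the original
+packages: the entity name resolve_check and the constant d are hypothetical, and
+the package abstract_channels is assumed to have been analysed into library
+work. Each call hands the driving values of a sender and a receiver to the
+resolution function and asserts the resulting phase of the channel.
+
+```vhdl
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.abstract_channels.all;
+
+-- Hypothetical self-checking entity; the name is an assumption.
+entity resolve_check is
+end resolve_check;
+
+architecture test of resolve_check is
+begin
+  process
+    constant d : fp := (others => '0');  -- arbitrary data value
+    variable r : uchannel_fp;
+  begin
+    -- Sender requests, receiver idle: the sender is waiting.
+    r := resolved(uchannel_fp_vector'((req, d), (idle, d)));
+    assert r.phase = swait report "expected swait" severity failure;
+
+    -- Sender requests, receiver waiting: data is transferred.
+    r := resolved(uchannel_fp_vector'((req, d), (rwait, d)));
+    assert r.phase = rcv report "expected rcv" severity failure;
+
+    -- Receiver acknowledges while the sender still requests.
+    r := resolved(uchannel_fp_vector'((req, d), (ack, d)));
+    assert r.phase = rec1 report "expected rec1" severity failure;
+
+    -- Sender has returned to idle, receiver still acknowledges.
+    r := resolved(uchannel_fp_vector'((idle, d), (ack, d)));
+    assert r.phase = rec2 report "expected rec2" severity failure;
+
+    -- Two senders on one channel: protocol error.
+    r := resolved(uchannel_fp_vector'((req, d), (req, d)));
+    assert r.phase = error report "expected error" severity failure;
+
+    wait;  -- stop after a single pass
+  end process;
+end test;
+```
+
+The qualified expression uchannel_fp_vector'(...) is used so that the call
+cannot be confused with the resolution function of the same name in
+std_logic_1164.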
+
+Figure 8.13 shows a graphical illustration of the protocol of the abstract
+channel. The values in large letters are the resulting values of the channel, and
+the values in smaller letters below them are the driving values of the sender
+and receiver respectively. Both the sender and receiver are allowed to initiate
+a communication. This makes it possible in a simulation to see if either the
+sender or receiver is waiting to communicate. It is the procedures send and
+receive that follow this protocol.
+
+Figure 8.13. The protocol for the abstract channel. The values in large letters
+are the resulting resolved values of the channel, and the values in smaller
+letters below them are the driving values of the sender and receiver respectively.
+[State diagram omitted. Resolved value (sender, receiver): U (U, -);
+IDLE (IDLE, IDLE) -> SWAIT (REQ, IDLE) or RWAIT (IDLE, RWAIT) ->
+RCV (REQ, RWAIT) -> REC1 (REQ, ACK) -> REC2 (IDLE, ACK) -> IDLE.]
+
+Because channels with different data types are defined as separate types,
+the procedures send, receive and probe have to be defined for each of these
+channel types. Fortunately VHDL allows overloading of procedure names, so
+it is possible to make these definitions. The only differences between the def-
+initions of the channels are the data types, the names of the channel types and
+the default values of the data fields in the channels. So it is very easy to copy
+the definitions of one channel to make a new channel type. It is not necessary
+to redefine the type handshake_phase. All these definitions are conveniently
+collected in a VHDL package. This package can then be referenced wherever
+needed. An example of such a package with only one channel type can be
+seen in appendix A.1. The procedures initialize_in and initialize_out
+are used to initialize the input and output ends of a channel. If a sender or
+receiver does not initialize a channel, no communications can take place on that
+channel.
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.abstract_channels.all;
+
+entity fp_latch is
+  generic (delay : time);
+  port ( d      : inout channel_fp;  -- input data channel
+         q      : inout channel_fp;  -- output data channel
+         resetn : in    std_logic
+       );
+end fp_latch;
+
+architecture behav of fp_latch is
+begin
+
+  process
+    variable data : fp;
+  begin
+    initialize_in(d);
+    initialize_out(q);
+    wait until resetn = '1';
+    loop
+      receive(d, data);  -- take a value from the input channel
+      wait for delay;    -- model the latency of the stage
+      send(q, data);     -- pass the value to the output channel
+    end loop;
+  end process;
+
+end behav;
+
+Figure 8.14. Description of a FIFO stage.
+
+Figure 8.15. A FIFO built using the latch defined in figure 8.14.
+[Diagram omitted: three fp_latch instances, FIFO_stage_1 to FIFO_stage_3, are
+connected in series from d to q and share resetn; ch_in is the channel into the
+middle stage and ch_out is the channel out of it.]
+
+Figure 8.16. Simulation of the FIFO using the abstract channel package.
+
+A simple example of a subcircuit is the FIFO stage fp_latch shown in
+figure 8.14. Notice that the channels in the entity have the mode inout, and
+the FIFO stage waits for the reset signal resetn after the initialization. In that
+way it waits for other subcircuits which may actually use this reset signal for
+initialization.
+The FIFO stage uses a generic parameter delay. This delay is inserted for
+experimental reasons in order to show the different phases of the channels.
+Three FIFO stages are connected in a pipeline (figure 8.15) and fed with data
+values. The middle section has a delay that is twice as long as the other two
+stages. This will result in a blocked channel just before the slow FIFO stage
+and a starved channel just after the slow FIFO stage.
+The result of this experiment can be seen in figure 8.16. The simulator
+used is the Synopsys VSS. It is seen that ch_in is predominantly in the swait
+phase, which characterizes a blocked channel, and ch_out is predominantly in
+the rwait phase, which characterizes a starved channel.
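+
+The experiment described above could be set up with a test bench along the
+following lines. This is a sketch under stated assumptions, not the test bench
+used for figure 8.16: the entity name fifo_tb, the channels ch_src and ch_snk,
+the source and sink processes and the delay values (10 ns and 20 ns) are all
+hypothetical. Only fp_latch, ch_in, ch_out and the procedures of the channel
+package are taken from the text.
+
+```vhdl
+library IEEE;
+use IEEE.std_logic_1164.all;
+use IEEE.numeric_std.all;
+use work.abstract_channels.all;  -- swap for the real channel package later on
+
+entity fifo_tb is
+end fifo_tb;
+
+architecture test of fifo_tb is
+
+  component fp_latch
+    generic (delay : time);
+    port ( d      : inout channel_fp;
+           q      : inout channel_fp;
+           resetn : in    std_logic );
+  end component;
+
+  -- ch_src and ch_snk are assumed names for the channels at the two ends.
+  signal ch_src, ch_in, ch_out, ch_snk : channel_fp;
+  signal resetn : std_logic := '0';
+
+begin
+
+  resetn <= '1' after 20 ns;  -- release reset after the channels are initialized
+
+  -- The middle stage is twice as slow as its neighbours.
+  stage1 : fp_latch generic map (delay => 10 ns) port map (ch_src, ch_in,  resetn);
+  stage2 : fp_latch generic map (delay => 20 ns) port map (ch_in,  ch_out, resetn);
+  stage3 : fp_latch generic map (delay => 10 ns) port map (ch_out, ch_snk, resetn);
+
+  -- Source: sends an incrementing 32-bit pattern as fast as the FIFO accepts it.
+  source : process
+    variable value : fp := (others => '0');
+  begin
+    initialize_out(ch_src);
+    wait until resetn = '1';
+    loop
+      send(ch_src, value);
+      value := std_logic_vector(unsigned(value) + 1);
+    end loop;
+  end process;
+
+  -- Sink: consumes values as soon as they arrive.
+  sink : process
+    variable value : fp;
+  begin
+    initialize_in(ch_snk);
+    wait until resetn = '1';
+    loop
+      receive(ch_snk, value);
+    end loop;
+  end process;
+
+end test;
+```
+
+Because the source and sink use only send, receive and the initialize
+procedures, the same test bench should carry over unchanged when the use clause
+names the real channel package instead.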
+
+8.6.5
+The real channel package
+
+At some point in the design process it is time to separate communicating
+entities into control and data entities. This is supported by the real channel
+types, in which the request and acknowledge signals are separate std_logic
+signals – the type used by the target component models. The data type is the
+same as the abstract channel type, but the handshaking is modeled differently.
+A real channel type is defined in figure 8.17.
+
+subtype fp is std_logic_vector(31 downto 0);
+
+type uchannel_fp is
+  record
+    req  : std_logic;
+    ack  : std_logic;
+    data : fp;
+  end record;
+
+type uchannel_fp_vector is array(natural range <>) of uchannel_fp;
+
+function resolved(s : uchannel_fp_vector) return uchannel_fp;
+
+subtype channel_fp is resolved uchannel_fp;
+
+Figure 8.17. Definition of a real channel.
+
+All definitions relating to the real channels are collected in a package (sim-
+ilar to the abstract channel package) and use the same names for the channel
+types, procedures and functions. For this reason it is very simple to switch
+to simulating using real channels. All it takes is to change the name of the
+package in the use statements in the top level design entity. Alternatively, one
+can use the same name for both packages, in which case it is the last analyzed
+package that is used in simulations.
+An example of a real channel package with only one channel type can be
+seen in appendix A.2. This package defines a 32-bit standard logic 4-phase
+bundled-data push channel. The constant tpd in this package is the delay from
+a transition on the request or acknowledge signal to the response to this tran-
+sition. “Synopsys compiler directives” are inserted in several places in the
+package. This is because Synopsys needs to know the channel types and the
+resolution functions belonging to them when it generates an EDIF netlist to the
+floor planner, but not the procedures in the package.
+Figure 8.18 shows the result of repeating the simulation experiment from the
+previous section, this time using the real channel package. Notice the sequence
+of four-phase handshakes.
+Note that the data value on a channel is, at all times, whatever value the
+sender is driving onto the channel. An alternative would be to make the resolu-
+tion function put out the default data value outside the data-validity period, but
+this may cause the setup and hold times of the latches to be violated. The proce-
+dure send provides a broad data-validity scheme, which means that it can com-
+municate with receivers that require early, broad or late data-validity schemes
+on the channel. The procedure receive requires an early data-validity scheme,
+which means that it can communicate with senders that provide early or broad
+data-validity schemes.
+
+Figure 8.18. Simulation of the FIFO using the real channel package.
+
+The resolution functions for the real channels (and the abstract channels)
+can detect protocol errors. Examples of errors are more than one sender or
+receiver on a channel, and using a send command or a receive command at
+the wrong end of a channel. In such cases the channel assumes the X value on
+the request or acknowledge signals.
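+
+The error detection described above can be exercised with a small check
+against the package in appendix A.2. As before this is only a sketch: the entity
+name real_resolve_check and the constant d are hypothetical, and the package
+real_channels is assumed to have been analysed into library work. The first
+call models a well-formed channel with one sender and one receiver; the second
+models two senders driving the request field at the same time.
+
+```vhdl
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.real_channels.all;
+
+-- Hypothetical self-checking entity; the name is an assumption.
+entity real_resolve_check is
+end real_resolve_check;
+
+architecture test of real_resolve_check is
+begin
+  process
+    constant d : fp := (others => '0');  -- arbitrary data value
+    variable r : uchannel_fp;
+  begin
+    -- One sender (drives req and data, leaves ack at 'U') and
+    -- one receiver (drives ack, leaves req at 'U').
+    r := resolved(uchannel_fp_vector'(('1', 'U', d), ('U', '0', d)));
+    assert r.req = '1' and r.ack = '0' and r.data = d
+      report "expected a valid request" severity failure;
+
+    -- Two senders both drive the request field: the result is 'X'.
+    r := resolved(uchannel_fp_vector'(('1', 'U', d), ('0', 'U', d)));
+    assert r.req = 'X'
+      report "expected X on req" severity failure;
+
+    wait;  -- stop after a single pass
+  end process;
+end test;
+```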
+
+8.6.6
+Partitioning into control and data
+
+This section describes how to separate an entity into control and data enti-
+ties. This is possible when the real channel package is used but, as explained
+below, this partitioning has to follow certain guidelines.
+To illustrate how the partitioning is carried out, the FIFO stage in figure 8.14
+in the preceding section will be separated into a latch control circuit called
+latch_ctrl and a latch called std_logic_latch. The VHDL code is shown
+in figure 8.19, and figure 8.20 is a graphical illustration of the partitioning that
+includes the unresolved signals ud and uq as explained below.
+In VHDL a driver that drives a compound resolved signal has to drive all
+fields in the signal. Therefore a control circuit cannot drive only the acknowl-
+edge field in a channel. To overcome this problem a signal of the corresponding
+unresolved channel type has to be declared inside the partitioned entity. This
+is the function of the signals ud and uq of type uchannel_fp in figure 8.17.
+The control circuit then drives only the acknowledge field in this signal; this
+is allowed since the signal is unresolved. The rest of the fields remain
+uninitialized. The unresolved signal then drives the channel; this is allowed
+since it drives all of the fields in the channel. The resolution function for
+the channel should ignore the uninitialized values that the channel is driven
+with. Components that use the send and receive procedures also drive those
+fields in the channel that they do not control with uninitialized values. For
+example, an output to a channel drives the acknowledge field in the channel with
+the U value. The fields in a channel that are used as inputs are connected
+directly from the channel to the circuits that have to read those fields.
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.real_channels.all;
+
+entity fp_latch is
+  port ( d      : inout channel_fp;  -- input data channel
+         q      : inout channel_fp;  -- output data channel
+         resetn : in    std_logic
+       );
+end fp_latch;
+
+architecture struct of fp_latch is
+
+  component latch_ctrl
+    port ( rin, aout, resetn : in  std_logic;
+           ain, rout, lt     : out std_logic
+         );
+  end component;
+
+  component std_logic_latch
+    generic (width : positive);
+    port ( lt : in  std_logic;
+           d  : in  std_logic_vector(width-1 downto 0);
+           q  : out std_logic_vector(width-1 downto 0)
+         );
+  end component;
+
+  signal lt     : std_logic;
+  signal ud, uq : uchannel_fp;
+
+begin
+
+  latch_ctrl1 : latch_ctrl
+    port map (d.req, q.ack, resetn, ud.ack, uq.req, lt);
+  std_logic_latch1 : std_logic_latch
+    generic map (width => 32)
+    port map (lt, d.data, uq.data);
+
+  d <= connect(ud);
+  q <= connect(uq);
+
+end struct;
+
+Figure 8.19. Separation of the FIFO stage into an ordinary data latch and a latch control circuit.
+
+Notice in the description that the signals ud and uq do not drive d and q
+directly but through a function called connect. This function simply returns
+its parameter. It may seem unnecessary, but it has proved to be necessary when
+some of the subcircuits are described with a standard cell implementation. In
+a simulation a special “gate-level simulation engine” is used to simulate the
+standard cells [129]. During initialization it will set some of the signals to the
+value X instead of to the value U as it should. It has not been possible to get
+the channel resolution function to ignore these X values, because the gate-level
+simulation engine sets some of the values in the channel. By introducing the
+connect function, which is a behavioural description, the normal simulator
+takes over and evaluates the channel by means of the corresponding resolution
+function. It should be emphasized that it is a bug in the gate-level simulation
+engine that necessitates the addition of the connect function.
+
+Figure 8.20. Separation of control and data.
+[Diagram omitted: each fp_latch FIFO stage contains a latch_ctrl block (rin,
+aout, resetn, ain, rout, Lt) and a std_logic_latch (d, q, Lt); the unresolved
+signals ud and uq connect them to the channels ch_in and ch_out.]
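+
+Section 8.6.3 notes that the simulation model of a control circuit may simply
+be an entity containing a set of concurrent signal assignments. A minimal
+sketch of such a model for latch_ctrl is given below. The port list matches the
+component declaration in figure 8.19, but everything else is an assumption: the
+equations are hand-written (they are not Petrify output) and describe a simple
+4-phase controller built around a single Muller C-element, the architecture
+name equations and the signal c are hypothetical, and the polarity of lt
+expected by std_logic_latch is not specified in the text.
+
+```vhdl
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+entity latch_ctrl is
+  port ( rin, aout, resetn : in  std_logic;
+         ain, rout, lt     : out std_logic
+       );
+end latch_ctrl;
+
+architecture equations of latch_ctrl is
+  signal c : std_logic;  -- state of the C-element (assumed internal name)
+begin
+
+  -- Muller C-element with inputs rin and (not aout):
+  -- c rises when rin = '1' and aout = '0', falls when rin = '0' and
+  -- aout = '1', and otherwise holds its value. It is cleared during reset.
+  c <= '0' when resetn = '0' else
+       (rin and not aout) or (c and (rin or not aout));
+
+  ain  <= c;  -- acknowledge to the previous stage
+  rout <= c;  -- request to the next stage
+  lt   <= c;  -- latch control (polarity is an assumption)
+
+end equations;
+```
+
+Whether this controller satisfies the setup and hold requirements of a
+particular latch is not checked here; that would have to be verified against
+the target library.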
+
+8.7.
+Summary
+
+This chapter addressed languages and CAD tools for high-level modeling
+and synthesis of asynchronous circuits. The text focused on a few
+representative and influential design methods that are based on languages that
+are similar to CSP. The reason for preferring these languages is that they
+support channel based communication between processes (synchronous message
+passing) as well as concurrency at both process and statement level – two
+features that are important for modeling asynchronous circuits. The text also
+illustrated a synthesis method known as syntax directed translation. Subsequent
+chapters in this book will elaborate much more on these issues.
+Finally the chapter illustrated how channel based communication can be
+implemented in VHDL, and we provided two packages containing all the
+necessary procedures and functions including: send, receive and probe. These
+packages support a manual top-down stepwise-refinement design flow where
+the same test bench can be used to simulate the design throughout the entire
+design process from high level specification to low level circuit
+implementation.
+This chapter on languages and CAD-tools for asynchronous design
+concludes the tutorial on asynchronous circuit design and it is time to wrap up:
+Chapter 2 presented the fundamental concepts and theories, and provided point-
+ers to the literature. Chapters 3 and 4 presented an RTL-like abstract view
+on asynchronous circuits (tokens flowing in static data-flow structures) that is
+very useful for understanding their operation and performance. This material
+is probably where this tutorial supplements the existing body of literature the
+most. Chapters 5 and 6 addressed the design of datapath operators and con-
+trol circuits. Focus in chapter 6 was on speed-independent circuits, but this
+is not the only approach. In recent years there has also been great progress
+in synthesizing multiple-input-change fundamental-mode circuits. Chapter 7
+discussed more advanced 4-phase bundled-data protocols and circuits. Finally
+chapter 8 addressed languages and tools for high-level modeling and synthesis
+of asynchronous circuits.
+The tutorial deliberately made no attempt to cover all corners of the
+field – the aim was to pave a road into “the world of asynchronous design”.
+Now you are here at the end of the road; hopefully with enough background
+to carry on digging deeper into the literature, and equally importantly, with
+sufficient understanding of the characteristics of asynchronous circuits, that
+you can start designing your own circuits. And finally: asynchronous circuits
+do not represent an alternative to synchronous circuits. They have advantages
+in some areas and disadvantages in other areas and they should be seen as a
+supplement, and as such they add new dimensions to the solution space that
+the digital designer explores. Even today, many circuits cannot be categorized
+as either synchronous or asynchronous; they contain elements of both.
+The following chapters will introduce some recent industrial scale asyn-
+chronous chips. Additional designs are presented in [106].
+
+Appendix: The VHDL channel packages
+
+A.1.
+The abstract channel package
+
+-- Abstract channel package: (4-phase bundled-data push channel, 32-bit data)
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+package abstract_channels is
+
+  constant tpd : time := 2 ns;
+
+  -- Type definition for abstract handshake protocol
+
+  type handshake_phase is
+    (
+      u,      -- uninitialized
+      idle,   -- no communication
+      swait,  -- sender waiting
+      rwait,  -- receiver waiting
+      rcv,    -- receiving data
+      rec1,   -- recovery phase 1
+      rec2,   -- recovery phase 2
+      req,    -- request signal
+      ack,    -- acknowledge signal
+      error   -- protocol error
+    );
+
+  -- Floating point channel definitions
+
+  subtype fp is std_logic_vector(31 downto 0);
+
+  type uchannel_fp is
+    record
+      phase : handshake_phase;
+      data  : fp;
+    end record;
+
+  type uchannel_fp_vector is array(natural range <>) of uchannel_fp;
+
+  function resolved(s : uchannel_fp_vector) return uchannel_fp;
+
+  subtype channel_fp is resolved uchannel_fp;
+
+  procedure initialize_in(signal ch : out channel_fp);
+
+  procedure initialize_out(signal ch : out channel_fp);
+
+  procedure send(signal ch : inout channel_fp; d : in fp);
+
+  procedure receive(signal ch : inout channel_fp; d : out fp);
+
+  function probe(signal ch : in channel_fp) return boolean;
+
+end abstract_channels;
+
+package body abstract_channels is
+
+  -- Resolution table for abstract handshake protocol
+
+  type table_type is array(handshake_phase, handshake_phase) of handshake_phase;
+
+  constant resolution_table : table_type := (
+  -------------------------------------------------------------------------------
+  -- 2. parameter:                                                    |         |
+  -- u  idle   swait  rwait  rcv    rec1   rec2   req    ack    error | 1. par: |
+  -------------------------------------------------------------------------------
+    (u, u,     u,     u,     u,     u,     u,     u,     u,     u    ), -- | u     |
+    (u, idle,  swait, rwait, rcv,   rec1,  rec2,  swait, rec2,  error), -- | idle  |
+    (u, swait, error, rcv,   error, error, rec1,  error, rec1,  error), -- | swait |
+    (u, rwait, rcv,   error, error, error, error, rcv,   error, error), -- | rwait |
+    (u, rcv,   error, error, error, error, error, error, error, error), -- | rcv   |
+    (u, rec1,  error, error, error, error, error, error, error, error), -- | rec1  |
+    (u, rec2,  rec1,  error, error, error, error, rec1,  error, error), -- | rec2  |
+    (u, error, error, error, error, error, error, error, error, error), -- | req   |
+    (u, error, error, error, error, error, error, error, error, error), -- | ack   |
+    (u, error, error, error, error, error, error, error, error, error));-- | error |
+
+  -- Fp channel
+
+  constant default_data_fp : fp := "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX";
+
+  function resolved(s : uchannel_fp_vector) return uchannel_fp is
+    variable result : uchannel_fp := (idle, default_data_fp);
+  begin
+    for i in s'range loop
+      result.phase := resolution_table(result.phase, s(i).phase);
+      if (s(i).phase = req) or (s(i).phase = swait) or
+         (s(i).phase = rcv) then
+        result.data := s(i).data;
+      end if;
+    end loop;
+    if not((result.phase = swait) or (result.phase = rcv)) then
+      result.data := default_data_fp;
+    end if;
+    return result;
+  end resolved;
+
+  procedure initialize_in(signal ch : out channel_fp) is
+  begin
+    ch.phase <= idle after tpd;
+  end initialize_in;
+
+  procedure initialize_out(signal ch : out channel_fp) is
+  begin
+    ch.phase <= idle after tpd;
+  end initialize_out;
+
+  procedure send(signal ch : inout channel_fp; d : in fp) is
+  begin
+    if not((ch.phase = idle) or (ch.phase = rwait)) then
+      wait until (ch.phase = idle) or (ch.phase = rwait);
+    end if;
+    ch <= (req, d);
+    wait until ch.phase = rec1;
+    ch.phase <= idle;
+  end send;
+
+  procedure receive(signal ch : inout channel_fp; d : out fp) is
+  begin
+    if not((ch.phase = idle) or (ch.phase = swait)) then
+      wait until (ch.phase = idle) or (ch.phase = swait);
+    end if;
+    ch.phase <= rwait;
+    wait until ch.phase = rcv;
+    wait for tpd;
+    d := ch.data;
+    ch.phase <= ack;
+    wait until ch.phase = rec2;
+    ch.phase <= idle;
+  end receive;
+
+  function probe(signal ch : in channel_fp) return boolean is
+  begin
+    return (ch.phase = swait);
+  end probe;
+
+end abstract_channels;
+
+A.2.
+The real channel package
+
+-- Low-level channel package (4-phase bundled-data push channel, 32-bit data)
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+package real_channels is
+
+  -- synopsys synthesis_off
+  constant tpd : time := 2 ns;
+  -- synopsys synthesis_on
+
+  -- Floating point channel definitions
+
+  subtype fp is std_logic_vector(31 downto 0);
+
+  type uchannel_fp is
+    record
+      req  : std_logic;
+      ack  : std_logic;
+      data : fp;
+    end record;
+
+  type uchannel_fp_vector is array(natural range <>) of uchannel_fp;
+
+  function resolved(s : uchannel_fp_vector) return uchannel_fp;
+
+  subtype channel_fp is resolved uchannel_fp;
+
+  -- synopsys synthesis_off
+  procedure initialize_in(signal ch : out channel_fp);
+
+  procedure initialize_out(signal ch : out channel_fp);
+
+  procedure send(signal ch : inout channel_fp; d : in fp);
+
+  procedure receive(signal ch : inout channel_fp; d : out fp);
+
+  function probe(signal ch : in uchannel_fp) return boolean;
+  -- synopsys synthesis_on
+
+  function connect(signal ch : in uchannel_fp) return channel_fp;
+
+end real_channels;
+
+package body real_channels is
+
+  -- Resolution table for 4-phase handshake protocol
+
+  -- synopsys synthesis_off
+  type stdlogic_table is array(std_logic, std_logic) of std_logic;
+
+  constant resolution_table : stdlogic_table := (
+  -- ---------------------------------------------------------------
+  -- | 2. parameter:                                     |         |
+  -- |  U    X    0    1    Z    W    L    H    -        | 1. par: |
+  -- ---------------------------------------------------------------
+    ( 'U', 'X', '0', '1', 'X', 'X', 'X', 'X', 'X' ),  -- | U |
+    ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- | X |
+    ( '0', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- | 0 |
+    ( '1', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- | 1 |
+    ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- | Z |
+    ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- | W |
+    ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- | L |
+    ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- | H |
+    ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' )); -- | - |
+  -- synopsys synthesis_on
+
+  -- Fp channel
+
+  -- synopsys synthesis_off
+  constant default_data_fp : fp := "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX";
+  -- synopsys synthesis_on
+
+  function resolved(s : uchannel_fp_vector) return uchannel_fp is
+    -- pragma resolution_method three_state
+    -- synopsys synthesis_off
+    variable result : uchannel_fp := ('U','U',default_data_fp);
+    -- synopsys synthesis_on
+  begin
+    -- synopsys synthesis_off
+    for i in s'range loop
+      result.req := resolution_table(result.req,s(i).req);
+      result.ack := resolution_table(result.ack,s(i).ack);
+      if (s(i).req = '1') or (s(i).req = '0') then
+        result.data := s(i).data;
+      end if;
+    end loop;
+    if not((result.req = '1') or (result.req = '0')) then
+      result.data := default_data_fp;
+    end if;
+    return result;
+    -- synopsys synthesis_on
+  end resolved;
+
+  -- synopsys synthesis_off
+  procedure initialize_in(signal ch : out channel_fp) is
+  begin
+    ch.ack <= '0' after tpd;
+  end initialize_in;
+
+  procedure initialize_out(signal ch : out channel_fp) is
+  begin
+    ch.req <= '0' after tpd;
+  end initialize_out;
+
+  procedure send(signal ch : inout channel_fp; d : in fp) is
+  begin
+    if ch.ack /= '0' then
+      wait until ch.ack = '0';
+    end if;
+    ch.req <= '1' after tpd;
+    ch.data <= d after tpd;
+    wait until ch.ack = '1';
+    ch.req <= '0' after tpd;
+  end send;
+
+  procedure receive(signal ch : inout channel_fp; d : out fp) is
+  begin
+    if ch.req /= '1' then
+      wait until ch.req = '1';
+    end if;
+    wait for tpd;
+    d := ch.data;
+    ch.ack <= '1';
+    wait until ch.req = '0';
+    ch.ack <= '0' after tpd;
+  end receive;
+
+  function probe(signal ch : in uchannel_fp) return boolean is
+  begin
+    return (ch.req = '1');
+  end probe;
+  -- synopsys synthesis_on
+
+  function connect(signal ch : in uchannel_fp) return channel_fp is
+  begin
+    return ch;
+  end connect;
+
+end real_channels;
+
+
+II
+
+BALSA - AN ASYNCHRONOUS HARDWARE
+SYNTHESIS SYSTEM
+
+Author: Doug Edwards, Andrew Bardsley
+Department of Computer Science
+The University of Manchester
+{doug,bardsley}@cs.man.ac.uk
+
+Abstract
+Balsa is a system for describing and synthesising asynchronous circuits based
+on syntax-directed compilation into communicating handshake circuits. In these
+chapters, the basic Balsa design flow is described and several simple circuit ex-
+amples are used to illustrate the Balsa language in an informal tutorial style. The
+section concludes with a walk-through of a major design exercise – a 4 channel
+DMA controller described entirely in Balsa.
+
+Keywords:
+asynchronous circuits, high-level synthesis
+
+
+
+Chapter 9
+
+AN INTRODUCTION TO BALSA
+
+9.1.
+Overview
+
+Balsa is both a framework for synthesising asynchronous hardware systems
+and a language for describing such systems. The approach adopted is that of
+syntax-directed compilation into communicating handshaking components and
+closely follows the Tangram system ([141, 135] and Chapter 13 on page 221)
+of Philips. The advantage of this approach is that the compilation is trans-
+parent: there is a one-to-one mapping between the language constructs in the
+specification and the intermediate handshake circuits that are produced. It is
+relatively easy for an experienced user to envisage the micro-architecture of the
+circuit that results from the original description. Incremental changes made at
+the language level result in predictable changes at the circuit implementation
+level. This is important if optimisations and design trade-offs are to be made
+easily and contrasts with synchronous VHDL synthesis in which small changes
+in the specification may make radical alterations to the resulting circuit.
+It is important to understand what Balsa offers the designer and what obli-
+gations are still placed upon the designer. The tight “edit description – syn-
+thesise – simulate – revise description” loop made possible by the fast com-
+pilation process makes it very easy for the design space of a system to be
+explored and prototypes rapidly evaluated. However, there is no substitute
+for creativity. Poor designs may be created as easily as elegant designs and
+some experience in designing asynchronous circuits is required before even a
+good designer of conventional clocked circuits will best be able to exploit the
+system. Be warned that although Balsa guarantees correct-by-construction cir-
+cuits, it does not guarantee correct systems. In particular, it is quite feasible,
+as in any asynchronous system, to describe an elegant circuit which will ex-
+hibit deadlock. Furthermore, post-layout simulation is still required in order
+to check that when the instantiated circuit has been placed and routed by con-
+ventional CAD tools, it meets basic timing requirements. On the other hand,
+a choice of implementation libraries is available allowing the designer to trade
+the greater process portability of a delay-insensitive implementation against,
+perhaps, smaller circuit area which may require a larger post-layout validation
+effort.
+Although Balsa has evolved from a research environment, it is not a toy sys-
+tem unsuited for large-scale designs; Balsa has been used to synthesise the 32
+channel DMA controller [11] for the Amulet3i asynchronous microprocessor
+macro-cell [48]. The controller has a complex specification and the resulting
+implementation occupies 2 mm² on a 0.35 µm 3-layer metal process. Balsa is at
+the time of writing being used to synthesise a complete Amulet core as part of
+the EU funded G3 smartcard project [46].
+As noted earlier, Balsa is very similar to Tangram. It is a less mature pack-
+age lacking some useful tools contained within the Tangram package such as
+the power performance analyser. However, Balsa is freely available whereas
+Tangram is not generally available outside Philips. As far as the expressiveness
+of the languages is concerned, Balsa adds powerful parameterisation using re-
+cursive expansion definition facilities whereas Tangram allows more flexibility
+in interacting with non delay-insensitive external interfaces. Balsa has delib-
+erately chosen not to add such features to ensure that its channels-only delay-
+insensitive model is not compromised.
+The reader should be aware that not all aspects of the Balsa language or its
+syntax are explored in the material that follows: a more detailed introduction
+is given in the Balsa User Guide, available from [7]. The Balsa system is
+freely available from the same site. The system is still evolving: the description
+here refers to Balsa release 3.1.0.
+
+9.2. Basic concepts
+
+A circuit described in Balsa is compiled into a communicating network com-
+posed from a small (about 40) set of handshake components. The components
+are connected by channels over which atomic communications or handshakes
+take place. Channels may have datapaths associated with them (in which case
+a handshake involves the transfer of data), or may be purely control (in which
+case the handshake acts as a synchronisation or rendezvous point).
+Each channel connects exactly one passive port of a handshake component
+to one active port of another handshake component. An active port is a port
+which initiates a communication. A passive port responds (when it is ready) to
+the request from the active port by an acknowledge signal.
+Data channels may be push channels or pull channels. In a push channel,
+the direction of the data flow is from the active port to the passive port. This
+is similar to the communication style of micropipelines. Data validity is sig-
+nalled by request and released on acknowledge. In a pull channel, the direction
+of data flow is from the passive port to the active port. The active port requests
+a transfer; data validity is signalled by an acknowledge from the passive port.
+An example of a circuit composed from handshake components is shown in
+figure 9.1. Active ports are denoted by filled bubbles on a handshake compo-
+nent and passive ports are denoted by open bubbles.
+
+[Figure 9.1. Two connected handshake components: a Transferrer (“→”) and a Case
+component (“@”, labelled "0;1"), with the request, acknowledge and bundled data
+wires drawn individually.]
+
+Here, a Fetch component, or Transferrer (denoted by “→”), and a Case
+component (denoted by “@”) are connected by an internal data-bearing channel.
+Circuit action is activated by a request to the Transferrer which in turn issues
+a request to the environment on its active pull input port (on the left of the
+diagram). The environment supplies the demanded data indicating its validity
+by the acknowledgement signal. The Transferrer then presents a handshake re-
+quest and data to the Case component on its active push output port which the
+Case component receives on a passive port. Depending on the data value, the
+Case component issues a handshake to its environment on either the top right
+or bottom right port. Finally, when the acknowledgement is received by the
+Case component, an acknowledgement is returned along the original channel,
+terminating this handshake. The circuit is ready to operate once more.
+Data follows the direction of the request in this example and the acknowl-
+edgement to that request flows in the opposite direction. In this figure, indi-
+vidual physical request, acknowledgement and data wires are explicitly shown.
+Data is carried on separate wires from the signalling (it is “bundled” with the
+control) although this is not necessarily true for other data/signalling encoding
+schemes.
+The bundled-data scheme illustrated in figure 9.1 is not the only imple-
+mentation possible. Methodologies exist to implement channel connections
+with delay-insensitive signalling where timing relationships between individ-
+ual wires of an implemented channel do not affect the functionality of the cir-
+cuit. Handshake circuits can be implemented using these methodologies which
+are robust to naïve realisations, process variations and interconnect delay
+properties. Future releases of Balsa will include several alternative back-ends. A
+more detailed discussion of handshake protocols can be found in section 2.1
+on page 9 and section 7.1 on page 115.
+
+Normally, handshake circuit diagrams are not shown at the level of detail
+of figure 9.1, a channel usually being shown as a single arc with the direc-
+tion of data being denoted by an arrow head on the arc. Similarly, control
+only channels, comprising only request/acknowledge wires, are indicated by
+an arc without an arrowhead. The circuit complexity of handshake circuits is
+often low: for example, a Transferrer may be implemented using only wires.
+An example of a handshake circuit for a modulo-10 counter (see page 185) is
+shown in figure 9.2. The corresponding gate-level implementation is shown in
+figure 9.3.
+
+[Figure 9.2. Handshake circuit of a modulo-10 counter, with ports activate, aclk
+and count, variables count_reg and tmp, and the operations x /= 9 and x + 1.]
+
+[Figure 9.3. Gate-level circuit of a modulo-10 counter, showing the control
+sequencing components (3 gates each), the Compare (/= 9?) and Incrementer
+blocks, and the latches.]
+
+9.3. Tool set and design flow
+
+An overview of the Balsa design flow is shown in figure 9.4. Behavioural
+simulation is provided by LARD [38], a language developed within the Amulet
+group for modelling asynchronous systems. However, the target CAD sys-
+tem can also be used to perform more accurate simulations and to validate
+the design.
+Most of the Balsa tools are concerned with manipulating the
+Breeze handshake intermediate files produced by compiling Balsa descrip-
+tions. Breeze files can be used by back-end tools to provide implementations
+for Balsa descriptions, but also contain procedure and type definitions passed
+on from Balsa source files allowing Breeze to be used as the package descrip-
+tion format for Balsa.
+The Balsa system comprises the following collection of tools:
+
+balsa-c: the compiler for the Balsa language. The compiler produces
+Breeze from Balsa descriptions.
+
+balsa-netlist: produces a netlist, currently EDIF, Compass or Verilog,
+from a Breeze description, performing technology mapping and hand-
+shake expansion.
+
+breeze2ps: a tool which produces a PostScript file of the handshake cir-
+cuit graph.
+
+breeze2lard: a translator that converts a Breeze file to a LARD behavioural
+model.
+
+breeze-cost: a tool which gives an area cost estimate of the circuit.
+
+balsa-md: a tool for generating Makefiles for make(1).
+
+balsa-mgr: a graphical front-end to balsa-md with project management
+facilities.
+
+The interfaces between the Balsa and target CAD systems are handled by
+the following scripts:
+
+balsa-pv: uses powerview tools to produce an EDIF file from a top-level
+powerview schematic which incorporates Balsa generated circuits.
+
+balsa-xi: produces a Xilinx download file from an EDIF description of
+a compiled circuit.
+
+balsa-ihdl: an interface to the Cadence Verilog-XL environment.
+
+9.4. Getting started
+
+In this section, simple buffer circuits are described in Balsa introducing the
+basic elements of a Balsa description.
+
+[Figure 9.4. Design flow. The diagram distinguishes Balsa tools, non-Balsa
+tools and file formats. It covers the Balsa, Breeze and cost-estimate stages,
+the LARD behavioural simulation path, and the EDIF (Powerview, Xilinx),
+Verilog (Cadence) and Compass netlist paths.]
+
+9.4.1 A single-place buffer
+
+This buffer circuit is the HDL equivalent of the “hello, world” program. Its
+Balsa description is:
+
+import [balsa.types.basic]
+
+-- a single line comment
+-- buffer1a: A single place buffer
+procedure buffer1 (input i : byte; output o : byte) is
+  variable x : byte
+begin
+  loop
+    i -> x    -- Input communication
+    ;         -- sequence the two communications
+    o <- x    -- Output communication
+  end
+end
+
+Commentary on the code
+
+This Balsa description builds a single-place buffer, 8-bits wide. The circuit
+requests a byte from the environment which, when ready, transfers the data
+to the register. The circuit signals to the environment on its output channel
+that data is available and the environment reads it when it chooses. This small
+program introduces:
+
+comments:
+Balsa supports both multi-line comments and single-line com-
+ments.
+
+modular compilation:
+Balsa supports modular compilation. The import
+statement in this example includes the definition of some standard data types
+such as byte, nibble, etc. The search path given in the import statement
+is a dot-separated directory path similar to that of Java (although multi-file
+packages are not implemented). The import statement may be used to include
+other precompiled Balsa programs thereby acting as a library mechanism. Any
+import statements must precede other declarations in the files.
+
+procedures:
+The procedure declaration introduces an object that looks sim-
+ilar to a procedure definition in a conventional programming language. A Balsa
+procedure is a process. The parameters of the procedure define the interface to
+the environment outside the circuit block. In this case, the module has an 8-bit
+input and an 8-bit output. The body of the procedure definition defines an al-
+gorithmic behaviour for the circuit; it also implies a structural implementation.
+In this example, a variable x (of type byte) is declared implying that an 8-bit
+wide storage element will appear in the synthesised circuit.
+The behaviour of the circuit is obvious from the code: 8-bit values are trans-
+ferred from the environment to the storage variable, x, and then sequentially
+output from the variable to the environment. This sequence of events is
+continually repeated (loop ... end).
+
+channel communication:
+the communication operators “->” and “<-” are
+channel assignments and imply a communication or handshake over the chan-
+nel. Because of the sequencing explicit in the description, the variable x will
+only accept a new value when it is ready; the value will only be passed out to
+the environment when requested. Note that the channel is always on the left-
+hand side of the operator and the corresponding variable or expression on the
+right-hand side.
+
+sequencing:
+The “;” operator separating the two assignments is not merely a
+syntactic statement separator, it explicitly denotes sequentiality. The contents
+of x are transferred to the output port after the input transfer has completed.
+Because a “;” connects two sequenced statements or blocks, it is an error to
+place a “;” after the last statement in a block.
+
+repetition:
+The loop ... end construct causes infinite repetition of the code
+contained within its body. Procedures without loop ... end are permitted and
+will terminate, allowing procedure calls to be sequenced if required.
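+
+As a minimal sketch of the sequencing and repetition points (the procedure
+names transfer1 and transfer2 are illustrative and are not part of the Balsa
+distribution), a procedure without loop ... end performs its body once and
+then terminates, so two calls to it can be sequenced with “;”:
+
+import [balsa.types.basic]
+
+-- transfer1: move exactly one byte from i to o, then terminate
+procedure transfer1 (input i : byte; output o : byte) is
+  variable x : byte
+begin
+  i -> x ;    -- input communication
+  o <- x      -- output communication; no ";" before end
+end
+
+-- transfer2: two sequenced calls, so two bytes are moved one after the other
+procedure transfer2 (input i : byte; output o : byte) is
+begin
+  transfer1 (i, o) ;
+  transfer1 (i, o)
+end
+
+Wrapping the body of transfer1 in loop ... end would instead give the
+non-terminating behaviour of buffer1, and the second call in transfer2 would
+then never be reached.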
+
+Compiling the circuit
+
+balsa-c buffer1a
+
+The compiler produces an output file buffer1a.breeze. This is a file in an in-
+termediate format which can be imported back into other Balsa source files
+(thereby providing a simple library mechanism). Breeze is a textual format file
+designed for ease of parsing and it is therefore somewhat opaque. A primitive
+graphical representation of the compiled circuit in terms of handshake compo-
+nents can be produced (as buffer1a.ps) by:
+
+breeze2ps buffer1a
+
+The synthesised circuit
+
+The resulting handshake circuit is shown in figure 9.5. This is not actually
+taken from the output of breeze2ps, but has been redrawn to make the diagram
+more readable. Although it is not necessary to understand the exact opera-
+tion of the compiled circuit, a knowledge of the structure is helpful to gain an
+understanding of how best to describe circuits which can be synthesised effi-
+ciently using Balsa. A brief description of the operation of the circuit is given
+below. The circuit has been annotated with the names of the various handshake
+components.
+
+[Figure 9.5. Handshake circuit for a single-place buffer, with ports i and o
+and the Loop (“#”), Sequence (“;”), Fetch (“→”) and Variable (x) components.]
+
+The port at the top, denoted by “>”, is an activation port generating a hand-
+shake enclosing the behaviour of the circuit. It can be thought of as a reset
+signal which, when de-asserted, initiates the operation of the circuit. All com-
+piled Balsa programs contain an activation port.
+The activation port starts the operation of the Repeater (“#”) which initi-
+ates a handshake with the Sequencer. The Repeater corresponds directly to the
+loop ... end construct, and the Sequencer to the “;” operator. The Sequencer
+first issues a handshake to the left-hand Fetch component, causing data to be
+moved to the storage element in the Variable element. The Sequencer then
+handshakes with the right-hand Fetch component, causing data to be read from
+the Variable element. When these operations are complete, the Sequencer com-
+pletes its handshake with the Repeater which starts the cycle again.
+
+9.4.2 Two-place buffers
+
+1st design
+
+Having built a single-place buffer, an obvious goal is a pipeline of single
+buffer stages. Initially consider a two-place buffer; there are a number of ways
+we might describe this. One choice is to define a circuit with two storage
+elements:
+
+-- buffer2a: Sequential 2-place buffer with assignment
+--           between variables
+import [balsa.types.basic]
+
+procedure buffer2 (input i : byte; output o : byte) is
+  variable x1, x2 : byte
+begin
+  loop
+    i -> x1;     -- Input communication
+    x2 := x1;    -- Implied communication
+    o <- x2      -- Output communication
+  end
+end
+
+In this example we explicitly introduce two storage elements, x1 and x2.
+The contents of the variable x1 are caused to be transferred to the variable x2
+by means of the assignment operator “:=”. However, transfer is still effected
+by means of a handshaking communication channel. This assignment operator
+is merely a way of concealing the channel for convenience.
+
+2nd design
+
+The implicit channel can be made explicit as shown in buffer2b.balsa:
+
+-- buffer2b: Sequential version with an explicit
+--           internal channel
+import [balsa.types.basic]
+
+procedure buffer2 (input i:byte; output o:byte) is
+  variable x1, x2 : byte
+  channel chan: byte
+begin
+  loop
+    i -> x1;                     -- Input communication
+    chan <- x1 || chan -> x2;    -- Transfer x1 to x2
+    o <- x2                      -- Output communication
+  end
+end
+
+The channel which, in the previous example, was concealed behind the use
+of the “:=” assignment operator, has been made explicit. The handshake circuit
+produced (after some simple optimisations) is identical to buffer2a. The “||”
+operator is explained in the next example.
+It is important to understand the significance of the operation of the circuits
+produced by buffer2a and buffer2b. Remember that “;” is more than a syntac-
+tic separator: it is an operator denoting sequence. Thus, first the input, i, is
+transferred to x1. When this operation is complete, x1 is transferred to x2 and
+finally the contents of x2 are written to the environment on port o. Only after
+this sequence of operations is complete can new data from the environment be
+read into x1 again.
+
+9.4.3 Parallel composition and module reuse
+
+The operation above is unnecessarily constrained: there is no reason why
+the circuit cannot be reading a new value into x1 at the same time that x2 is
+writing out its data to the environment. The program in buffer2c achieves this
+optimisation.
+
+-- buffer2c: a 2-place buffer using parallel composition
+import [buffer1a]
+
+procedure buffer2 (input i : byte; output o : byte) is
+  channel c : byte
+begin
+  buffer1 (i, c) ||
+  buffer1 (c, o)
+end
+
+Commentary on the code
+
+In the program above, a 2-place buffer is composed from 2 single-place
+buffers. The output of the first buffer is connected to the input of the second
+buffer by their respective output and input ports. However, apart from com-
+munications across the common channel, the operation of the two buffers is
+independent.
+The deceptively simple program above illustrates a number of new features
+of the Balsa language:
+
+modular compilation:
+The buffer1a circuit is included by the import mech-
+anism described earlier. The circuit must have been compiled previously. The
+Makefile generation program balsa-md (see page 166) can be used to generate
+a Makefile which will automatically take care of such dependencies.
+
+connectivity by naming:
+The output of the first buffer is connected to the
+input of the second buffer because of the common channel name, c, in the
+parameter list in the instantiation of the buffers.
+
+parallel composition:
+The “||” operator specifies that the two units which it
+connects should operate in parallel. This does not mean that the two units may
+operate totally independently: in this example, the output of one buffer writes
+to the input of the other buffer, creating a point of synchronisation. Note also
+that the parallelism referred to is a temporal parallelism. The two buffers are
+physically connected in series.
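+
+The same pattern extends to longer pipelines. As an illustrative sketch (the
+procedure name buffer3 and the channel names c1 and c2 are assumptions made
+here, not part of the original examples), a 3-place buffer needs one more
+internal channel and one more instance of buffer1:
+
+-- buffer3: a 3-place buffer using parallel composition
+import [buffer1a]
+
+procedure buffer3 (input i : byte; output o : byte) is
+  channel c1 : byte
+  channel c2 : byte
+begin
+  buffer1 (i, c1) ||     -- first stage
+  buffer1 (c1, c2) ||    -- middle stage
+  buffer1 (c2, o)        -- last stage
+end
+
+Each internal channel name appears in exactly two instantiations, once as an
+output and once as an input, which is what connects neighbouring stages.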
+
+9.4.4 Placing multiple structures
+
+If we wish to extend the number of places in the buffer, the previous tech-
+nique of explicitly enumerating every buffer becomes tedious. What is required
+is a means of parameterising the buffer length (though in any real hardware
+implementation the number of buffers cannot be variable and must be known
+beforehand). One technique, shown in buffer_n, is to use the for construct
+together with compile-time constants:
+
+-- buffer_n: an n-place parameterised buffer
+import [buffer1a]
+constant n = 8
+
+procedure buffer_n (input i:byte; output o:byte) is
+  array 1 .. n-1 of channel c : byte
+begin
+  buffer1 (i, c[1]) ||          -- First buffer
+  buffer1 (c[n-1], o) ||        -- Last buffer
+  for || i in 1 .. n-2 then     -- Buffer i
+    buffer1 (c[i], c[i+1])
+  end
+end
+
+Commentary on the Code
+
+constants:
+the value of an expression (of any type) may be bound to a name.
+The value of the expression is evaluated at compile time and the type of the
+name when used will be the same as the original expression in the constant
+declaration. Numbers can be given in decimal (starting with one of 1..9), hexa-
+decimal (0x prefix), octal (0 prefix) and binary (0b prefix).
+
+arrayed channels:
+procedure ports and locally declared channels may be
+arrayed. Each channel can be referred to by a numeric or enumerated index,
+but from the point of view of handshaking, each channel is distinct and no
+indexed channel has any relationship with any other such channel other than
+the name they share. Arraying is not part of a channel’s type.
+
+for loops:
+a for loop allows iteration over the instantiation of a subcircuit.
+The composition of the circuits may either be a parallel composition – as in the
+example above – or sequential. In the latter case, “;” should be substituted for
+“||” in the loop specifier. The iteration range of the loop must be resolvable at
+compile time.
+A more flexible approach uses parameterised procedures and is discussed
+later in chapter 11 on page 193.
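+
+For comparison with the parallel for loop in buffer_n, the following sketch
+uses the sequential form, with “;” in the loop specifier. The procedure name
+repeat3 and the loop variable k are illustrative assumptions. Each byte read
+from i is output three times, one output after another:
+
+import [balsa.types.basic]
+
+-- repeat3: read one byte, then output it three times in sequence
+procedure repeat3 (input i : byte; output o : byte) is
+  variable x : byte
+begin
+  loop
+    i -> x ;
+    for ; k in 1 .. 3 then    -- expands to three sequenced outputs
+      o <- x
+    end
+  end
+end
+
+Because the range 1 .. 3 is known at compile time, the loop is unrolled into
+fixed hardware rather than being evaluated while the circuit runs.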
+
+9.5. Ancillary Balsa tools
+
+9.5.1 Makefile generation
+
+Makefiles are commonly used in Unix by the utility make(1) to specify and
+control the processes by which complicated programs are compiled. Speci-
+fying the dependencies involved is often tedious and error prone. The Balsa
+system has a utility, balsa-md, to generate the Makefile for a given program
+automatically. The generated Makefile knows not only how to compile a Balsa
+module with multiple imports, but also how to generate and run test-harnesses
+for the simulation environment, LARD, used by Balsa. Balsa-mgr provides
+a convenient, intuitive, GUI front-end to balsa-md and considerably simpli-
+fies project management, in particular the handling of multiple test harnesses.
+However, since a textual description of any GUI is tedious, balsa-mgr will not
+be discussed further and only the facilities to which the underlying balsa-md
+provides a gateway will be described in the examples that follow. The interface
+presented by balsa-mgr is shown in figure 9.6.
+
+Figure 9.6. balsa-mgr IDE.
+
+9.5.2 Estimating area cost
+
+The area cost of a circuit may be estimated by executing the Makefile rule
+cost. For example, an extract of the output produced for the 2-place buffer is
+shown below:
+
+Part: buffer2
+(0 (component "$BrzFetch" (8) (10 2 9)))
+(0 (component "$BrzFetch" (8) (8 6 7)))
+(0 (component "$BrzFetch" (8) (5 4 3)))
+(20.75 (component "$BrzLoop" () (1 11)))
+(99.0 (component "$BrzSequence" (3) (11 (10 8 5))))
+(198.0 (component "$BrzVariable" (8 1 "x1[0..7]") (9 (6))))
+(198.0 (component "$BrzVariable" (8 1 "x2[0..7]") (7 (4))))
+
+Total cost: 515.75
+
+The exact format of the report produced is somewhat obscure. Each line
+corresponds to a handshake component. Its area cost is the first number on the
+line. The parameters after the component name correspond to the width of the
+various channels of that component and the internal channel names. The area
+reported is proportional to the cost of implementing the circuit in a particular
+silicon process and is of most use in comparing different circuit descriptions.
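+
+As a worked check of the extract above, the total is the sum of the first
+field of each line:
+
+0 + 0 + 0 + 20.75 + 99.0 + 198.0 + 198.0 = 515.75
+
+The three Fetch components are costed at 0, which is consistent with the
+earlier remark that a Transferrer may be implemented using only wires. The two
+8-bit Variable components contribute 396.0, roughly 77% of the total.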
+
+9.5.3 Viewing the handshake circuit graph
+
+A PostScript view of the handshake circuit graph can be produced by run-
+ning the rule make ps. A (flattened) view of the handshake circuit graph for
+the example buffer2c is shown in figure 9.7.
+The two single-place buffers from which the circuit is composed are recog-
+nisable in the circuit. Apart from minor differences in the labelling of the
+handshake component symbols, the circuit is identical to that shown in fig-
+ure 8.6 discussed in section 8.4 on page 128 and the same optimisations have
+been (automatically) applied.
+
+9.5.4 Simulation
+
+Ignoring the various simulation possibilities available once the design has
+been converted to a silicon layout, there are three strategies for evaluating and
+simulating the design from Balsa:
+
+1 Default LARD test harness.
+
+The command make sim will generate a LARD test harness and run it.
+The test harness reads data from a file for each input port of the module
+under test. Data sent to output channels appears on the standard output.
+This method needs no knowledge of LARD at all.
+
+2 Balsa test harness.
+
+If a more sophisticated test sequence is required, Balsa is a sufficiently
+flexible language in its own right to be able to specify most test se-
+quences. A default LARD test harness can then be generated for the
+Balsa test harness. Again no detailed knowledge of LARD is required.
+
+3 Custom LARD test harness.
+
+For some applications, it may be necessary to write a custom test harness
+in LARD. The Makefile-generated test harness may be used as a template.
+
+[Figure 9.7. Flattened view of buffer2c.]
+
+The default test harness exercises the target Balsa block by repeatedly hand-
+shaking on all external channels; input data channels receive the value 0 on
+each handshake, although it is possible to associate an input channel with a
+data file.
+
+Simulating buffer2c
+
+A simulation can be generated by invoking the appropriate simulation rule
+from the Makefile, producing the following output:
+
+0: chan ‘i’: writing 0
+6: chan ‘i’: writing 0
+15: chan ‘o’: reading 0
+19: chan ‘i’: writing 0
+28: chan ‘o’: reading 0
+32: chan ‘i’: writing 0
+41: chan ‘o’: reading 0
+45: chan ‘i’: writing 0
+54: chan ‘o’: reading 0
+58: chan ‘i’: writing 0
+67: chan ‘o’: reading 0
+71: chan ‘i’: writing 0
+80: chan ‘o’: reading 0
+
+The simulation runs forever unless terminated (by Ctrl-C). The numbers
+reported on the left hand side of each channel activity line are simulation times.
+LARD uses a unit delay model so these values should be treated with caution.
+
+Simulation data file
+
+This particular simulation stimulus is not very informative. A better strategy
+is to arrange for the data on the input channel i to be externally defined. In
+the next example, a file contains the following set of test data (in a variety of
+number representations):
+
+1
+0x10
+022
+0b011101
+5
+
+The Makefile can be forced to generate a rule for running a simulation from
+this stimulus file. If the simulation is now run, the following output is pro-
+duced:
+
+3: chan ‘i’: writing 1
+15: chan ‘o’: reading 1
+16: chan ‘i’: writing 16
+28: chan ‘o’: reading 16
+29: chan ‘i’: writing 18
+41: chan ‘o’: reading 18
+42: chan ‘i’: writing 29
+54: chan ‘o’: reading 29
+55: chan ‘i’: writing 5
+67: chan ‘o’: reading 5
+Program terminated
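+
+The values written to channel i are the decimal equivalents of the entries in
+the stimulus file, following the number formats described in section 9.4.4:
+
+1          decimal                           = 1
+0x10       hexadecimal: 1 x 16 + 0           = 16
+022        octal:       2 x 8 + 2            = 18
+0b011101   binary:      16 + 8 + 4 + 1       = 29
+5          decimal                           = 5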
+
+Channel viewer
+
+In the previous examples, the output of the simulation is textual appearing
+on the standard output.
+LARD has a graphical interface which displays the
+handshakes and data values associated with the internal and external channels.
+Assuming the building of a test harness rule has been specified to balsa-md,
+the channel viewer can be invoked causing two windows to appear on the
+screen: the LARD interpreter control window and the channel viewer window
+itself.
+Starting the simulation will cause a trace of the various channels in the de-
+sign to appear in the channel view window. For each channel the request and
+acknowledge signals and data values are displayed.
+
+Figure 9.8. Channel viewer window.
+
+
+
+Chapter 10
+
+THE BALSA LANGUAGE
+
+In this chapter, a tutorial overview of the language is given together with
+several small designs which illustrate various aspects of the language.
+
+10.1. Data types
+
+Balsa is strongly typed with data types based on bit vectors. The results
+of expressions must be guaranteed to fit within the range of the underlying bit
+vector representation. Types are either anonymous or named. Type equivalence
+for anonymous types is checked on the basis of the size and properties of the
+type, whereas type equivalence for named types is checked against the point of
+declaration.
+There are two classes of anonymous types: numeric types which are de-
+clared with the bits keyword, and arrays of other types. Numeric types can
+be either signed or unsigned. Signedness has an effect on expression operators
+and casting. Only numeric types and arrays of other types may be used without
+first binding a name to those types. Balsa has three separate name spaces: one
+for procedure and function names, a second for variable and channel names
+and a third for type declarations.
+
+Numeric types
+
+Numeric types support the number range [0, 2^n - 1] for n-bit unsigned
+numbers or [-2^(n-1), 2^(n-1) - 1] for n-bit signed numbers. Named numeric types are
+just aliases of the same range. An example of a numeric type declaration is:
+
+type word is 16 bits
+
+This defines a new type word which is unsigned (there is no unsigned
+keyword) covering the range [0, 2^16 - 1]. Alternatively, a signed type could have
+been declared as:
+
+type sword is 16 signed bits
+
+which defines a new type sword covering the range [-2^15, 2^15 - 1].
+
+The only predefined type is bit. However the standard Balsa distribution
+comes with a set of library declarations for such types as byte, nibble,
+boolean and cardinal as well as the constants true and false.
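+
+As a worked check of these ranges, the 16-bit unsigned type word covers 0 to
+65535 and the 16-bit signed type sword covers -32768 to 32767. Other widths
+follow the same pattern. The type names below are illustrative only and do not
+come from the Balsa libraries:
+
+type addr is 24 bits            -- unsigned: 0 to 16777215
+type offset is 12 signed bits   -- signed: -2048 to 2047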
+
+Enumerated types
+
+Enumerated types consist of named numeric values. The named values are
+given values starting at zero and incrementing by one from left to right. Ele-
+ments with explicit values reset the counter and many names can be given to
+the same value, for example:
+
+type Colour is enumeration
+Black, Brown, Red, Orange, Yellow, Green, Blue,
+Violet, Purple=Violet, Grey, Gray=Grey, White
+end
+
+The value of the Violet element of Colour is 7, as is Purple. Both Grey
+and Gray have value 8. The total number of elements is 10. An enumeration
+can be padded to a fixed size by use of the over keyword:
+
+type SillyExample is enumeration
+e1=1, e2 over 4 bits
+end
+
+Here 2 bits are sufficient to specify the 3 possible values of the enumeration
+(0 is not bound to a name, e1 has the value 1 and e2 has the value 2). The over
+keyword ensures that the representation of the enumerated type is actually 4
+bits. Enumerated types must be bound to names by a type declaration before
+use.
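+
+Spelling out the counter for the Colour example above:
+
+Black = 0, Brown = 1, Red = 2, Orange = 3, Yellow = 4, Green = 5, Blue = 6,
+Violet = 7, Purple = 7, Grey = 8, Gray = 8, White = 9
+
+Twelve names are declared, but Purple and Gray repeat existing values, so there
+are ten distinct values (0 to 9). Without an over clause, 4 bits are therefore
+enough to represent a Colour.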
+
+Constants
+
+Constant values can be defined in terms of an expression resolvable at com-
+pile time. Constants may be declared in terms of a predefined type otherwise
+they default to a numeric type. Examples are:
+
+constant minx = 5
+constant maxx = minx + 10
+constant hue = Red : Colour
+constant colour = Colour'Green    -- explicit enumeration element
+
+Record types
+
+Records are bit-wise compositions of named elements of possibly different
+(pre-declared) types with the first element occupying the least significant bit
+positions, e.g.:
+
+type Resistor is record
+  FirstBand, SecondBand, Multiplier : Colour;
+  Tolerance : ToleranceColour
+end
+
+Resistor has four elements: FirstBand, SecondBand, Multiplier of
+type Colour and Tolerance of type ToleranceColour (both types must have
+been declared previously). FirstBand is the first element and so represents the
+least significant portion of the bit-wise value of a type Resistor. Selection of
+elements within the record structure is accomplished with the usual dot nota-
+tion. Thus if R15 is a variable of type Resistor, the value of its SecondBand
+can be extracted by R15.SecondBand. As with enumerations, record types can be
+padded using the over notation.
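The bit-wise layout of a record can be illustrated with a short Python model (not Balsa). The function names are hypothetical, and the 2-bit width chosen for ToleranceColour is an assumption made only for this example; Colour is taken as 4 bits because its largest value is 9.

```python
def pack_record(fields):
    """Pack (value, width) pairs into one integer, first field least significant.

    Returns the packed word and its total width in bits.
    """
    word = 0
    shift = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("value does not fit in its field")
        word |= value << shift
        shift += width
    return word, shift


def unpack_field(word, index, widths):
    """Extract field number `index` (0 = first declared) from a packed word."""
    shift = sum(widths[:index])
    return (word >> shift) & ((1 << widths[index]) - 1)


# FirstBand=Red(2), SecondBand=Violet(7), Multiplier=Orange(3), Tolerance=1
WIDTHS = [4, 4, 4, 2]  # last width is an assumed size for ToleranceColour
r15, total_width = pack_record(list(zip([2, 7, 3, 1], WIDTHS)))
second_band = unpack_field(r15, 1, WIDTHS)  # model of R15.SecondBand
print(hex(r15), total_width, second_band)
```

FirstBand lands in bits 0..3 because it is declared first, which matches the least-significant-first packing described above.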
+
+Array types
+
+Arrays are numerically indexed compositions of same-typed values. An
+example of the declaration of an array type is:
+
+type RegBank_t is array 0..7 of byte
+
+This introduces a new type RegBank_t which is an array type of 8 elements
+indexed across the range [0, 7], each element being of type byte. The ordering
+of the range specifier is irrelevant; array 0..7 is equivalent to array 7..0.
+In general a single expression, expr, can be used to specify the array size: this
+is equivalent to a range of 0..expr-1. Anonymous array types are allowed in
+Balsa, so that variables can be declared as an array without first defining the
+array type:
+
+variable RegBank : array 0..7 of byte
+
+Arbitrary ranges of elements within an array can be accessed by an array
+slicing mechanism e.g. a[5..7] extracts elements a5, a6, and a7. As with all
+range specifiers, the ordering of the range is irrelevant. In general Balsa packs
+all composite typed structures in a least significant to most significant, left to
+right manner. Array slices always return values which are based at index 0.
+Arrays can be constructed by a tupling mechanism or by concatenation of
+other arrays of the same base type:
+
+variable a, b, c, d, e, f : byte
+variable z2 : array 2 of byte
+variable z4 : array 4 of byte
+variable z6 : array 6 of byte
+
+z4 := {a, b, c, d}            -- array construction
+z6 := z4 @ {e, f}             -- array concatenation
+z2 := (z4 @ {e, f}) [3..4]    -- element extraction by array slicing
+
+
+In the last example, the first element of z2 is set to d and the second element
+is set to e. Array slicing is useful for extracting arbitrary bitfields from other
+datatypes.
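The construction, concatenation and slicing rules can be mimicked with Python lists. This is a model rather than Balsa code, and array_slice is a hypothetical helper; it assumes only what is stated above, namely that the order of the range bounds is irrelevant and that a slice is re-based at index 0.

```python
def array_slice(values, first, last):
    """Model of values[first..last]; the order of the two bounds is irrelevant."""
    lo, hi = min(first, last), max(first, last)
    return values[lo:hi + 1]  # the new list is indexed from 0


a, b, c, d, e, f = "a", "b", "c", "d", "e", "f"

z4 = [a, b, c, d]                    # z4 := {a, b, c, d}
z6 = z4 + [e, f]                     # z6 := z4 @ {e, f}
z2 = array_slice(z4 + [e, f], 3, 4)  # z2 := (z4 @ {e, f}) [3..4]

print(z2, array_slice(z6, 4, 3))
```

Both slices select the elements d and e, in that order, which agrees with the description of z2 above.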
+
+Arrayed channels
+
+Channels may be arrayed, that is they may consist of several distinct chan-
+nels which can be referred to by a numeric or enumerated index. This is similar
+to the way in which variables can have an array type except that each channel
+is distinct for the purposes of handshaking and each indexed channel has no
+relationship to the other channels in the array other than the single name they
+share. The syntax for arrayed channels is different from that of array typed
+variables making it easier to disambiguate arrays from arrayed channels. As
+an example:
+
+array 4 of channel XYZ : array 4 of byte
+
+declares 4 channels, XYZ[0] to XYZ[3]; each channel is 32 bits wide and carries
+the type array 0..3 of byte. An example of the use of arrayed channels is shown
+in section 9.4.4 on page 165.
+
+10.2. Data typing issues
+
+As stated previously, Balsa is strongly typed: the left and right sides of as-
+signments are expected to have the same type. The only form of implicit type-
+casting is the promotion of numeric literals and constants to a wider numeric
+type. In particular, care must be taken to ensure that the result of an arithmetic
+operation will always be compatible with the declared result type. Consider
+the assignment statement x := x + 1. This is not a valid Balsa statement be-
+cause potentially the result is one bit wider than the width of the variable x. If
+the carry-out from the addition is to be ignored, the user must explicitly force
+the truncation by means of a cast.
+
+Casts
+
+If the variable x was declared as 32 bits, the correct form of the assignment
+above is:
+
+x := (x + 1 as 32 bits)
+
+The keyword as indicates the cast operation. The parentheses are a neces-
+sary part of the syntax. If the carry out of the addition of two 32-bit numbers
+is required, a record type can be used to hold the composite result:
+
+type AddResult is record
+  Result : 32 bits;
+  Carry : bit;
+end
+variable r : AddResult
+
+r := (a + b as AddResult)
+
+The expression r.Carry accesses the required carry bit, r.Result yields
+the 32-bit addition result.
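The two casts can be compared with a small Python model (not Balsa). The function names are hypothetical; the model assumes that the sum of two 32-bit values is 33 bits wide, that a cast to 32 bits drops the top bit, and that AddResult places Result in bits 0..31 and Carry in bit 32 because Result is declared first.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def add_truncated(x, y):
    """Model of (x + y as 32 bits): the carry-out is discarded."""
    return (x + y) & MASK


def add_with_carry(a, b):
    """Model of (a + b as AddResult): Result is bits 0..31, Carry is bit 32."""
    total = a + b  # up to 33 bits wide
    return {"Result": total & MASK, "Carry": (total >> WIDTH) & 1}


wrapped = add_truncated(0xFFFFFFFF, 1)
r = add_with_carry(0xFFFFFFFF, 1)
print(wrapped, r["Result"], r["Carry"])
```

In this model the truncating cast wraps to zero, while the record cast keeps the same low 32 bits and exposes the carry as a separate field.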
+Casts are required when extracting bit fields. Here is an example from the
+instruction decoder of a simple microprocessor. The bottom 5 bits of the 16-bit
+instruction word contain a 5-bit signed immediate. It is required to extract the
+immediate field and sign-extend it to 16 bits:
+
+type Word is 16 signed bits
+type Imm5 is 5 signed bits
+
+variable Instr : 16 bits -- bottom 5 bits contain an immediate
+variable Imm16 : Word
+
+Imm16 := (((Instr as array 16 of bit) [0..4] as Imm5) as Word)
+
+First, the instruction word, Instr, is cast into an array of bits from which
+an arbitrary sub-range can be extracted:
+
+(Instr as array 16 of bit)
+
+Next the bottom (least significant) 5 bits must be extracted:
+
+(Instr as array 16 of bit) [0..4]
+
+The extracted 5 bits must now be cast back into a 5-bit signed number:
+
+((Instr as array 16 of bit) [0..4] as Imm5)
+
+The 5-bit signed number is then sign extended to the 16-bit immediate
+value:
+
+(((Instr as array 16 of bit) [0..4] as Imm5) as Word)
+
+The double cast is required because a straightforward cast from 5 bits to
+the variable Imm16 of type Word would have merely zero filled the topmost bit
+positions even though Word is a signed type. However, a cast from a signed
+numeric type to another (wider) signed numeric type will sign extend the nar-
+rower value into the width of the wider target type.
+Extracting bits from a field is a fairly common operation in many hardware
+designs. In general, the original datatype has to be cast into an array first before
+bitfield extraction. The smash operator “#” provides a convenient shorthand for
+casting an object into an array of bits. Thus the sign extension example above
+is more simply written:
+
+((#Instr [0..4] as Imm5) as Word)
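The effect of the double cast can be followed in a Python model (not Balsa; the function names are hypothetical). It assumes two's-complement representation: the bottom five bits are first reinterpreted as a signed number and only then widened, so bit 4 is copied into bits 5..15. A direct widening of the unsigned field is shown for contrast.

```python
def to_signed(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    if value & (1 << (width - 1)):
        return value - (1 << width)
    return value


def extract_imm16(instr):
    """Model of ((#Instr [0..4] as Imm5) as Word), returned as a 16-bit pattern."""
    imm5 = to_signed(instr & 0b11111, 5)  # bits 0..4 reinterpreted as signed
    return imm5 & 0xFFFF                  # sign-extended 16-bit pattern


def zero_filled(instr):
    """Contrast case: widening the unsigned 5-bit field only zero fills."""
    return instr & 0b11111


print(hex(extract_imm16(0xABDF)), hex(zero_filled(0xABDF)))
```

For an instruction whose bottom five bits are all ones, the double cast gives the 16-bit pattern for -1, whereas zero filling would give 31.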
+
+
+Table 10.1. Balsa commands.
+
+Command                        Notes
+sync                           control only (dataless) handshake
+<-                             handshake data transfer from an expression to an output port
+->                             handshake data transfer to a variable from an input port
+:=                             assigns a value to a variable
+;                              sequence operator
+||                             parallel composition operator
+continue                       a no-op
+halt                           causes deadlock (useful in simulation)
+loop ... end                   repeat forever
+while ... else ... end         conditional loop
+for ... end                    structural (not temporal) iteration
+if ... then ... else ... end   conditional execution, may have multiple guarded commands
+case ... end                   conditional execution based on constant expressions
+select                         non-arbitrated choice operator
+arbitrate                      arbitrated choice operator
+print                          compile time printing of diagnostics
+
+Auto-assignment
+
+Statements of the form:
+
+x := f(x)
+
+are allowed in Balsa. However, the implementation generates an auxiliary vari-
+able which is then assigned back to the variable visible to the programmer –
+the variable is enclosed within a single handshake and cannot be read from
+and written to simultaneously. Since auto-assignment generates twice as many
+variables as might be suspected, it is probably better practice to avoid the auto-
+assignment, explicitly introduce the extra variable and then rewrite the program
+to hide the sequential update thereby avoiding any time penalty. An example
+of this approach is given in count16b on page 184.
+
+10.3.
+Control flow and commands
+
+Balsa’s command set is listed in table 10.1.
+
+Dataless handshakes
+
+sync <Channel> – awaits a handshake on the named channel. Circuit action
+does not proceed until the handshake is completed.
+
+
+Channel communications
+
+Data can be transferred between a variable and a channel, between channels
+or from a channel to a command code block as shown below:
+
+<Channel> <- <Variable> – transfers data from a variable to the named channel.
+This may either be an internal channel local to a procedure or an output port
+listed in the procedure declaration.
+
+<Channel> -> <Variable> – transfers data from the named channel to a
+variable. The channel may either be an internal channel local to a procedure or
+an input port listed in the procedure declaration.
+
+<Channel1> -> <Channel2> – transfers data between channels.
+
+<Channel> -> then <Command> – allows the data to be accessed throughout the
+command block. However, the handshake on the channel is not completed, and
+thus the data is not released, until the command block itself has terminated.
+
+Variable assignment
+
+<Variable> := <Expression> – transfers the result of an expression into a
+variable. The result type of the expression and that of the variable must agree.
+
+Sequential composition
+
+<Command1> ; <Command2> – the two commands execute sequentially. The first
+must terminate before the second commences.
+
+Parallel composition
+
+<Command1> || <Command2> – composes two commands such that they operate
+concurrently and independently. Both commands must complete before
+the circuit action proceeds. Beware of inadvertently introducing dependencies
+between the two commands so that neither can proceed until the other has
+completed. The “||” operator binds tighter than “;”. If that is not what is
+intended, then commands may be grouped in blocks as shown below:
+
+[ <Command1> ; <Command2> ] || <Command3>
+
+Note the use of square brackets to group commands rather than parentheses.
+Alternatively, the keywords begin ... end may be used and are mandatory if
+variables local to a block are to be declared.
+
+Continue and halt commands
+
+continue is effectively a no-op.
+The command halt causes a process
+thread to deadlock.
+
+
+Looping constructs
+
+The loop command causes an infinite repetition of a block of code. Finite
+loops may be constructed using the while construct. The simplest example of
+its use is:
+
+while <Condition> then <Command> end
+
+However, multiple guards are allowed, so a more general form of the con-
+struct is:
+
+while
+  <Condition1> then <Command1>
+| <Condition2> then <Command2>
+| <Condition3> then <Command3>
+else
+  <Command4>
+end
+
+The ability of the while construct to take an else clause is a minor conve-
+nience. The code sequence above could have been written, with only a small
+difference in the resultant handshake circuit implementation, as:
+
+while
+  <Condition1> then <Command1>
+| <Condition2> then <Command2>
+| <Condition3> then <Command3>
+end;
+<Command4>
+
+If more than one guard is satisfied, the particular command that is executed
+is unspecified.
+
+Structural iteration
+
+Balsa has a structural looping construct. In many programming languages
+it is a matter of convenience or style as to whether a loop is written in terms of
+a for loop or a while loop. This is not so in Balsa. The for loop is similar
+to VHDL’s for ... generate command and is used for iteratively laying out
+repetitive structures. An example of its use was given earlier in section 9.4.4 on
+page 165. An illustration of the inappropriate use of the for command is given
+in the example count10e on page 189. Structures may be iteratively instantiated
+to operate either sequentially or concurrently with one another depending on
+whether for ; or for || is employed.
+
+Conditional execution
+
+Balsa has two constructs to achieve conditional execution. Balsa’s case
+statement is similar to that found in conventional programming languages. A
+single guard may match more than one value of the guard expression.
+
+
+case x+y of
+  1 .. 4, 11 then o <- x
+| 5 .. 10 then o <- y
+else o <- z
+end
+
+An if ... then ... else statement allows conditional execution based on
+the evaluation of expressions at run-time. Its syntax is similar to that of the
+while loop. Note the sequencing implicit in nested if statements, such as that
+shown below:
+
+if <Condition1> then
+  <Command1>
+else
+  if <Condition2> then
+    <Command2>
+  end
+end
+
+The test for Condition2 is made after the test for Condition1. If it is
+known that the two conditions are mutually exclusive, the expression may be
+written:
+
+if <Condition1> then <Command1>
+| <Condition2> then <Command2>
+end
+
+The “|” separator causes Condition1 and Condition2 to be evaluated in
+parallel. The result is undefined if more than one guard (condition) is satisfied.
+
+10.4. Binary/unary operators
+
+Balsa’s binary/unary operators are shown in order of decreasing precedence
+in table 10.2.
+
+10.5. Program structure
+
+File structure
+
+A typical design will consist of several files containing procedure/type/cons-
+tant declarations which come together in a top-level procedure that constitutes
+the overall design. This top-level procedure would typically be at the end of
+a file which imports all the other relevant design files. This importing feature
+forms a simple but effective way of allowing component reuse and maps sim-
+ply onto the notion of the imported procedures being either pre-compiled hand-
+shake circuits or existing (possibly hand crafted) library components. Declara-
+tions have a syntactically defined order (left to right, top to bottom) with each
+declaration having its scope defined from the point of declaration to the end of
+the current (or importing) file. Thus Balsa has the same simple “declare before
+use” rule of C and Modula, though without any facility for prototypes.
+
+Table 10.2. Balsa binary/unary operators.
+
+Symbol                Operation         Valid types    Notes
+.                     record indexing   record
+#                     smash             any            takes value from any type and reduces it
+                                                       to an array of bits
+[]                    array indexing    array          non-constant index possible (can generate
+                                                       lots of hardware)
+not, log, - (unary)   unary operators   numeric        log only works on constants and returns
+                                                       the ceiling: e.g. log 15 returns 4;
+                                                       “-” returns a result 1 bit wider than the
+                                                       argument
+^                     exponentiation    numeric
+*, /, %               multiply, divide, numeric        only applicable to constants
+                      remainder
++, -                  add, subtract     numeric        results are 1 or 2 bits longer than
+                                                       the largest argument
+@                     concatenation     arrays
+<, >, <=, >=          inequalities      numeric,
+                                        enumerations
+=, /=                 equals,           all types      comparison is by sign extended
+                      not equals                       value for signed numeric types
+and                   bitwise AND       numeric        Balsa uses type 1 bits for if/while
+                                                       guards so bitwise and logical operators
+                                                       are the same.
+or, xor               bitwise OR, XOR   numeric
+
+Declarations
+
+Declarations introduce new type, constant or procedure names into the global
+name spaces from the point of declaration until the end of the enclosing block
+(or file in the case of top-level declarations). There are three disjoint name
+spaces: one for types, one for procedures and a third for all other declarations.
+At the top level, only constants are in this last category. However, variables and
+channels may be included in procedure local declarations. Where a declaration
+within an enclosed/inner block has the same name as one previously made in
+an outer/enclosing context, the local declaration will hide the outer declaration
+for the remainder of that inner block.
+
+
+Procedures
+
+Procedures form the bulk of a Balsa description. Each procedure has a name,
+a set of ports and an accompanying behavioural description. The sync key-
+word introduces dataless channels. Both dataless and data bearing channels can
+be members of “arrayed channels”. Arrayed channels allow numeric/enumer-
+ated indexing of otherwise functionally separate channels. Procedures can also
+carry a list of local declarations which may include other procedures, types and
+constants.
+
+Shared procedures
+
+Normally, each call to a procedure generates separate hardware to instan-
+tiate that procedure. A procedure may be shared, in which case calls to that
+procedure access common hardware thereby avoiding duplication of the cir-
+cuit at the cost of some multiplexing to allow sharing to occur. The use of
+shared procedures is discussed further on page 187.
+
+Functions
+
+In many programming languages, functions can be thought of as procedures
+without side effects that return a result. However, in Balsa there is a fundamen-
+tal difference between functions and procedures. Parameters to a procedure
+define handshaking channels that interface to the circuit block defined by the
+procedure. Function parameters, on the other hand, are just expression aliases
+returning values. An example of the use of function definitions can be found
+in the arbiter tree design on page 202.
+
+10.6. Example circuits
+
+In this section, various designs of counter are described in Balsa. In flavour,
+they resemble the specifications of conventional synchronous counters, since
+these designs are more familiar to newcomers to asynchronous systems. More
+sophisticated systolic counters, better suited to an asynchronous approach, are
+described by van Berkel [14].
+In this design, the role of the clock which updates the state of the counter is
+taken by a dataless sync channel, named aclk. The counter issues a handshake
+request over the sync channel, the environment responds with an acknowledge-
+ment completing the handshake and the counter state is updated.
+
+A modulo-16 counter
+
+-- count16a.balsa: modulo 16 counter
+import [balsa.types.basic]
+
+procedure count16 (sync aclk; output count : nibble) is
+  variable count_reg : nibble
+begin
+  loop
+    sync aclk ;
+    count <- count_reg ;
+    count_reg := (count_reg + 1 as nibble)
+  end
+end
+
+This counter interfaces to its environment by means of two channels: the
+dataless aclk channel and the output channel count which carries the current
+count value. The internal register which implements the variable count_reg
+and the output channel are of the predefined type nibble (4-bits wide). After
+count_reg is incremented, the result must be cast back to type nibble. For
+simplicity, issues of initialisation/reset have been ignored. A LARD simulation
+of this circuit will give a harmless warning when uninitialised variables are
+accessed.
+
+Removing auto-assignment
+
+The auto-assignment statement in the example above, although concise and
+expressive, hides the fact that, in most back-ends, an auxiliary variable is cre-
+ated so that the update can be carried out in a race-free manner. By making this
+auxiliary variable explicit, advantage may be taken of its visibility to overlap
+its update with other activity as shown in the example below.
+
+-- count16b.balsa: write-back overlaps output assignment
+import [balsa.types.basic]
+
+procedure count16 (sync aclk; output count : nibble) is
+  variable count_reg, tmp : nibble
+begin
+  loop
+    sync aclk;
+    tmp := (count_reg + 1 as nibble) ||
+    count <- count_reg;
+    count_reg := tmp
+  end
+end
+
+In this example, the transfer of the count register to the output channel is
+overlapped with the incrementing of the auxiliary shadow register. There is
+some slight area overhead involved in parallelisation and any potential speed-
+up may be minimal in this case, but the principle of making trade-offs at the
+level of the source code is illustrated.
+
+
+A modulo-10 counter
+
+The basic counter description above can easily be modified to produce a
+modulo-10 counter. A simple test is required to detect when the internal regis-
+ter reaches its maximum value and then to reset it to zero.
+
+-- count10a.balsa: an asynchronous decade counter
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 9
+
+procedure count10(sync aclk; output count: C_size) is
+  variable count_reg : C_size
+  variable tmp : C_size
+begin
+  loop
+    sync aclk;
+    if count_reg /= max_count then
+      tmp := (count_reg + 1 as C_size)
+    else
+      tmp := 0
+    end || count <- count_reg ;
+    count_reg := tmp
+  end -- loop
+end -- begin
+
+A loadable up/down decade counter
+
+This example describes a loadable up/down decade counter. It introduces
+many of the language features discussed earlier in the chapter. The counter
+requires two control bits, one to determine the direction of count, and the other
+to determine whether the counter should load or increment/decrement on the next
+operation. There are several valid design options; in this example, count10b
+below, the control bits and the data to be loaded are bundled together in a
+single channel, in_sigs.
+
+-- count10b.balsa: an asynchronous up/down decade counter
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 9
+
+type dir is enumeration down, up end
+type mode is enumeration load, count end
+type In_bundle is record
+  data : C_size ;
+  mode : mode;
+  dir : dir
+end
+
+procedure updown10 (
+  input in_sigs: In_bundle;
+  output count: C_size
+) is
+  variable count_reg : C_size
+  variable tmp : In_bundle
+begin
+  loop
+    in_sigs -> tmp; -- read control+data bundle
+    if tmp.mode = count then
+      case tmp.dir of
+        down then -- counting down
+          if count_reg /= 0 then
+            tmp.data := (count_reg - 1 as C_size)
+          else
+            tmp.data := max_count
+          end -- if
+      | up then -- counting up
+          if count_reg /= max_count then
+            tmp.data := (count_reg + 1 as C_size)
+          else
+            tmp.data := 0
+          end -- if
+      end -- case tmp.dir
+    end; -- if
+    count <- tmp.data || count_reg:= tmp.data
+  end -- loop
+end
+
+The example above illustrates the use of if ... then ... else and case
+control constructs as well as the use of record structures and enumerated types.
+The use of symbolic values within enumerated types makes the code more
+readable. Test harnesses which can be generated automatically by the Balsa
+system can also read the symbolic enumerated values. For example, here is a
+test file which initialises the counter to 8, counts up, testing that the counter
+wraps round to zero and then counts down allowing the user to check that the
+counter correctly wraps to 9.
+
+{8, load, up}        load counter with 8
+{0, count, up}       count to 9
+{0, count, up}       count & wrap to 0
+{0, count, up}       count to 1
+{0, count, down}     count down to 0
+{0, count, down}     count down to 9
+{1, load, down}      load counter with 1
+{0, count, down}     count down to 0
+{0, count, down}     count down & wrap to 9
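The expected behaviour of this test file can be reproduced with a behavioural model of updown10. The sketch below is Python, not Balsa, and does not model handshaking; it assumes the register starts at 0 (the Balsa description leaves it uninitialised) and represents each bundle as a hypothetical (data, mode, direction) tuple.

```python
MAX_COUNT = 9


def updown10(stimuli):
    """Yield the value sent on `count` for each (data, mode, direction) bundle."""
    count_reg = 0  # assumed start value; not specified by the Balsa description
    for data, mode, direction in stimuli:
        if mode == "count":
            if direction == "down":
                data = count_reg - 1 if count_reg != 0 else MAX_COUNT
            else:
                data = count_reg + 1 if count_reg != MAX_COUNT else 0
        # In load mode the data field of the bundle passes through unchanged.
        count_reg = data
        yield data


stimuli = [
    (8, "load", "up"),
    (0, "count", "up"),
    (0, "count", "up"),
    (0, "count", "up"),
    (0, "count", "down"),
    (0, "count", "down"),
    (1, "load", "down"),
    (0, "count", "down"),
    (0, "count", "down"),
]
outputs = list(updown10(stimuli))
print(outputs)
```

Under these assumptions the model produces 8, 9, 0, 1, 0, 9, 1, 0, 9, which agrees line by line with the annotations in the test file.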
+
+
+Sharing hardware
+
+In Balsa, every statement instantiates hardware in the resulting circuit. It
+is therefore worth examining descriptions to see if there are any repeated con-
+structs that could either be moved to a common point in the code or replaced by
+shared procedures. In count10b above, the description instantiates two adders:
+one used for incrementing and the other for decrementing. Since these two
+units are not used concurrently, area can be saved by sharing a single adder
+(which adds either +1 or -1 depending on the direction of count) described by
+a shared procedure. The code below illustrates how count10b can be rewritten
+to use a shared procedure. The shared procedure add_sub computes the next
+count value by adding the current count value to a variable, inc, which can
+take values of +1 or -1. Note that to accommodate these values, inc must be
+declared as 2 signed bits.
+The area advantage of the approach is shown by examining the cost of the
+circuit reported by breeze-cost: count10b has a cost of 2141 units, whereas the
+shared procedure version has a cost of only 1760. The relative advantage, of
+course, becomes more pronounced as the size of the counter increases.
+
+-- count10c.balsa: introducing shared procedures
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 9
+
+type dir is enumeration down, up end
+type mode is enumeration load, count end
+type inc is 2 signed bits
+
+type In_bundle is record
+  data : C_size ;
+  mode : mode;
+  dir : dir
+end
+
+procedure updown10 (
+  input in_sigs: In_bundle;
+  output count: C_size
+) is
+  variable count_reg : C_size
+  variable tmp : In_bundle
+  variable inc : inc
+
+  shared add_sub is
+  begin
+    tmp.data:= (count_reg + inc as C_size)
+  end
+
+begin
+  loop
+    in_sigs -> tmp; -- read control+data bundle
+    if tmp.mode = count then
+      case tmp.dir of
+        down then -- counting down
+          if count_reg /= 0 then
+            inc:= -1;
+            add_sub()
+          else
+            tmp.data := max_count
+          end -- if
+      | up then -- counting up
+          if count_reg /= max_count then
+            inc := +1;
+            add_sub()
+          else
+            tmp.data := 0
+          end -- if
+      end -- case tmp.dir
+    end; -- if
+    count <- tmp.data || count_reg:= tmp.data
+  end -- loop
+end
+
+In order to guarantee the correctness of implementations, there are a number
+of minor restrictions on the use of shared procedures:
+
+shared procedures cannot have any arguments;
+
+shared procedures cannot use local channels;
+
+shared procedures using elements of the channel referenced by a select
+statement (see section 10.7 on page 190) must be declared as local within
+the body of that select block.
+
+“while” loop description
+
+An alternative description of the modulo-10 counter employs the while con-
+struct:
+
+-- count10d.balsa: mod-10 counter alternative implementation
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 10
+
+procedure count10(sync aclk; output count : C_size) is
+  variable count_reg : C_size
+begin
+  loop
+    while count_reg < max_count then
+      sync aclk;
+      count <- count_reg;
+      count_reg:= (count_reg + 1 as C_size)
+    end; -- while
+    count_reg:= 0
+  end -- loop
+end
+
+Structural “for” loops
+
+for loops are a potential pitfall for beginners to Balsa. In many program-
+ming languages, while loops and for loops can be used interchangeably. This
+is not the case in Balsa: a for loop implements structural iteration, in other
+words, separate hardware is instantiated for each pass through the loop. The
+following description, which superficially appears very similar to the while
+loop example of count10d previously, appears to be correct: it compiles with-
+out problems and a LARD simulation appears to give the correct behaviour.
+However, examination of the cost reveals an area cost of 11577, a large in-
+crease. It is important to understand why this is the case. The for loop is un-
+rolled at compile time and 10 instances of the circuit to increment the counter
+are created. Each instance of the loop is activated sequentially. The PostScript
+plot of the handshake circuit graph is rather unreadable; setting max_count to
+3 produces a more readable plot.
+
+-- count10e.balsa: beware the "for" construct
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 10
+
+procedure count10(sync aclk; output count: C_size) is
+  variable count_reg : C_size
+begin
+  loop
+    for ; i in 1 .. max_count then
+      sync aclk;
+      count <- count_reg;
+      count_reg:= (count_reg + 1 as C_size)
+    end; -- for ; i
+    count_reg:= 0
+  end -- loop
+end -- begin
+
+If, instead of using the sequential for construct, the parallel for construct
+had been employed (for || ...), the compiler would give an error message
+complaining about read/write conflicts from parallel threads. In this case, all
+instances of the counter circuit would attempt to update the counter register
+at the same time, leading to possible conflicts. A reader who understands the
+resulting potential handshake circuit is well on the way to a good understanding
+of the methodology.
+
+10.7. Selecting channels
+
+The asynchronous circuit described below merges two input channels into
+a single output channel; it may be thought of as a self-selecting multiplexer.
+The select statement chooses between the two input channels a and b by
+waiting for data on either channel to arrive. When a handshake on either a or b
+commences, data is held valid on the input, and the handshake is not completed,
+until the end of the select ... end block.
+This circuit is an example of handshake enclosure and avoids the need for an
+internal latch to be created to store the data from the input channel; a possible
+disadvantage is that, because of the delayed completion of the handshake, the
+input is not released immediately to continue processing independently. In this
+example, data is transferred to the output channel and the input handshake will
+complete as soon as data has been removed from the output channel. An exam-
+ple of a more extended enclosure can be found in the code for the population
+counter on page 197.
+
+-- mux1.balsa: unbuffered Merge
+import [balsa.types.basic]
+
+procedure mux (input a, b :byte; output c :byte) is
+begin
+  loop
+    select a then c <- a    -- channel behaves like a variable
+    |      b then c <- b    -- ditto
+    end -- select
+  end -- loop
+end
+
+Because of the enclosed nature of the handshake associated with select,
+inputs a and b should be mutually exclusive for the duration of the block of
+code enclosed by the select. In many cases, this is not a difficult obligation
+to satisfy. However, if a and b are truly independent, select can be replaced
+by arbitrate which allows an arbitrated choice to be made. Arbiters are rel-
+atively expensive in terms of speed and may not be possible to implement in
+some technologies. Further, the loss of determinism in circuits with arbitra-
+tion can also introduce testing and design correctness verification problems.
+Designers should therefore not use arbiters unnecessarily.
+
+
+-- mux2.balsa: unbuffered arbitrated Merge.
+import [balsa.types.basic]
+
+procedure mux (input a, b :byte; output c :byte) is
+begin
+  loop
+    arbitrate a then c <- a    -- channel behaves like a variable
+    |         b then c <- b    -- ditto
+    end -- arbitrate
+  end -- loop
+end
+
+
+
+Chapter 11
+
+BUILDING LIBRARY COMPONENTS
+
+11.1. Parameterised descriptions
+
+Parameterised procedures allow designers to develop a library of commonly
+used components and then to instantiate those structures later with varying
+parameters. A simple example is the specification of a buffer as a library part
+without knowing the width of the buffer. Similarly, a pipeline of buffers can be
+defined in the library without requiring any knowledge of the depth chosen for
+the pipeline when it is instantiated.
+
+11.1.1 A variable width buffer definition
+
+The example pbuffer1 below defines a single place buffer with a parame-
+terised width.
+
+-- pbuffer1.balsa - parameterised buffer example
+import [balsa.types.basic]
+
+procedure Buffer (
+  parameter X : type ;
+  input i : X; output o : X
+) is
+  variable x : X
+begin
+  loop
+    i -> x ;
+    o <- x
+  end
+end
+
+-- now define a byte wide buffer
+procedure Buffer8 is Buffer (byte)
+
+-- now use the definition
+procedure test1 (input a : byte ; output b : byte) is
+begin
+  Buffer8 (a, b)
+end
+
+-- alternatively
+procedure test2 (input a : byte ; output b : byte) is
+begin
+  Buffer (byte, a, b)
+end
+
+The definition of the single place buffer given earlier on page 161 is modi-
+fied by the addition of the parameter declaration which defines X to be of type
+type. In other words X is identified as being a type to be refined later. Once a
+parameter type has been declared, it can be used in later declarations and state-
+ments: for example, input channel i is defined as being of type X. No hardware
+is generated for the parameterised procedure definition itself.
+Having defined the procedure, it can be used in other procedure definitions.
+Buffer8 defines a byte wide buffer that can be instantiated as required as
+shown, for example, in procedure test1. Alternatively, a concrete realisation
+of the parameterised procedure can be used directly as shown in procedure
+test2.
+
+11.1.2
+Pipelines of variable width and depth
+
+The next example illustrates how multiple parameters to a procedure may
+be specified. The parameterised buffer element is included in a pipeline whose
+depth is also parameterised.
+
+-- pbuffer2.balsa - parameterised pipeline example
+import [balsa.types.basic]
+import [pbuffer1]
+
+-- BufferN: an n-place parameterised, variable width buffer
+procedure BufferN (
+parameter n : cardinal ;
+parameter X : type ;
+input i : X ;
+output o : X
+) is
+begin
+if n = 1 then
+  -- single place pipeline
+  Buffer(X, i, o)
+| n >= 2 then
+  -- parallel evaluation
+  local array 1 .. n-1 of channel c : X
+  begin
+    Buffer(X, i, c[1]) ||       -- first buffer
+    Buffer(X, c[n-1], o) ||     -- last buffer
+    for || i in 1 .. n-2 then
+      Buffer(X, c[i], c[i+1])
+    end -- for || i
+  end
+else print error, "zero length pipeline specified"
+end -- if
+end
+
+-- Now define a 4 deep, byte-wide pipeline.
+procedure Buffer4 is BufferN(4, byte)
+
+Buffer is the single place parameterised width buffer of the previous exam-
+ple and this is reused by means of the library statement import[pbuffer1].
+In this code, BufferN is defined in a very similar manner to the example in sec-
+tion 9.4.4 on page 165 except that the number of stages in the pipeline, n, is
+not a constant but is a parameter to the definition of type cardinal. Note that
+this definition includes some error checking. If an attempt is made to build a
+zero length pipeline during a definition, an error message is printed.
+
+11.2.
+Recursive definitions
+
+Balsa allows a form of recursion in definitions (as long as the resulting struc-
+tures can be statically determined at compile time). Many structures can be
+described elegantly using this technique which forms a natural extension to
+the powerful parameterisation mechanism. The remainder of this chapter il-
+lustrates recursive parameterisation by means of some interesting (and useful)
+examples.
+
+11.2.1
+An n-way multiplexer
+
+An n-way multiplexer can be constructed from a tree of 2-way multiplexers.
+A recursive definition suggests itself as the natural specification technique: an
+n-way multiplexer can be split into two n/2-way multiplexers connected by
+internal channels to a 2-way multiplexer as shown in figure 11.1 on page 196.
+
+--- Pmux1.balsa: A recursive parameterised MUX definition
+import [balsa.types.basic]
+
+procedure PMux (
+parameter X : type;
+parameter n : cardinal;
+array n of input inp : X;
+-- note use of arrayed port
+output out : X
+) is
+begin
+if n = 0 then print error,"Parameter n should not be zero"
+|
+n = 1 then
+loop
+select inp[0] then
+out <- inp[0]
+end -- select
+end -- loop
+-- [Figure 11.1. Decomposition of an n-way multiplexer. Before decomposition:
+-- inp[0..n-1] feed a single multiplexer driving out. After decomposition:
+-- inp[0..n/2-1] and inp[n/2..n-1] feed two multiplexers whose outputs, out0
+-- and out1, feed a 2-way multiplexer driving out.]
+|
+n = 2 then
+loop
+select inp[0] then
+out <- inp[0]
+| inp[1] then
+out <- inp[1]
+end -- select
+end -- loop
+else
+local
+-- local block with local definitions
+channel out0, out1 : X
+constant mid = n/2
+begin
+PMux (X, mid,
+inp[0..mid-1], out0) ||
+PMux (X, n-mid, inp[mid..n-1], out1) ||
+PMux (X, 2, {out0,out1},out)
+end
+end -- if
+end
+
+-- Here is a 5-way multiplexer
+procedure PMux5Byte is PMux(byte, 5)
+
+Commentary on the code
+
+The multiplexer is parameterised in terms of the type of the inputs and the
+number of channels n. The code is straightforward. A multiplexer of size
+greater than 2 is decomposed into two multiplexers half the size connected by
+internal channels to a 2-to-1 multiplexer. Notice how the arrayed channels,
+out0 and out1 are specified as a tuple. The recursive decomposition stops
+when the number of inputs is 2 or 1 (specifying a multiplexer with zero inputs
+generates an error). A 1-input multiplexer makes no choice of inputs.
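+
+The recursive decomposition can be traced with a small Python model. This is
+a sketch, not part of the Balsa system: the function name pmux_shape is
+hypothetical, and it only counts the 1-input and 2-input multiplexers that
+the recursion in PMux would instantiate for a given n.
+
```python
def pmux_shape(n):
    """Return (one_way, two_way): leaf multiplexer counts for an n-way PMux."""
    if n == 0:
        raise ValueError("Parameter n should not be zero")
    if n == 1:
        return (1, 0)            # a 1-input multiplexer makes no choice
    if n == 2:
        return (0, 1)            # recursion stops at a 2-way multiplexer
    mid = n // 2                 # constant mid = n/2
    low_one, low_two = pmux_shape(mid)
    high_one, high_two = pmux_shape(n - mid)
    # The two halves are joined by one further 2-way multiplexer.
    return (low_one + high_one, low_two + high_two + 1)


if __name__ == "__main__":
    print(pmux_shape(5))
```
+
+For PMux5Byte the model gives one 1-input and four 2-input multiplexers:
+inputs 0..1 form one 2-way multiplexer, inputs 2..4 form a 3-way multiplexer
+(a 1-input stage, a 2-way stage and a joining 2-way stage), and a final 2-way
+multiplexer joins the two halves.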
+
+A Balsa test harness
+
+The code below illustrates how a simple Balsa program can be used as a test
+harness to generate test values for the multiplexer. The test program is actually
+rather naïve.
+
+-- test_pmux.balsa - A test-harness for Pmux1
+import [balsa.types.basic]
+import [pmux1]
+
+procedure test (output out : byte) is
+type ttype is sizeof byte + 1 bits
+array 5 of channel inp : byte
+variable i : ttype
+begin
+begin
+i:= 1;
+while i <= 0x80 then
+inp[0] <- (i as byte);
+inp[1] <- (i+1 as byte);
+inp[2] <- (i+2 as byte);
+inp[3] <- (i+3 as byte);
+inp[4] <- (i+4 as byte);
+i:= (i + i as ttype)
+end
+end ||
+PMux5Byte(inp, out)
+end
+
+11.2.2
+A population counter
+
+This next example counts the number of bits set in a word. It comes from
+the requirement in an Amulet processor to know the number of registers to be
+restored/saved during LDM/STM (Load/Store Multiple) instructions.
+The approach taken is to partition the problem into two parts. Initially, ad-
+jacent bits are added together to form an array of 2-bit channels representing
+the numbers of bits that are set in each of the adjacent pairs. The array of 2-
+bit numbers are then added in a recursively defined tree of adders (procedure
+AddTree). The structure of the bit-counter is shown in figure 11.2.
+
+-- popcount: count the number of bits set in a word
+import [balsa.types.basic]
+
+procedure AddTree (
+parameter inputCount : cardinal;
+parameter inputSize : cardinal;
+parameter outputSize : cardinal;
+array inputCount of input i : inputSize bits;
+output o : outputSize bits
+) is
+begin
+if inputCount = 1 then
+select i[0] then o <- (i[0] as outputSize bits) end
+|
+inputCount = 2 then
+select i[0], i[1] then
+o <- (i[0] + i[1] as outputSize bits)
+end -- select
+else
+local
+constant lowHalfInputCount = inputCount / 2
+constant highHalfInputCount = inputCount - lowHalfInputCount
+
+channel lowO, highO : outputSize - 1 bits
+begin
+AddTree (lowHalfInputCount, inputSize, outputSize - 1,
+i[0..lowHalfInputCount-1], lowO) ||
+AddTree (highHalfInputCount, inputSize, outputSize - 1,
+i[lowHalfInputCount..inputCount-1], highO) ||
+AddTree (2, outputSize - 1, outputSize, {lowO, highO}, o)
+end
+end -- if
+end
+
+procedure PopulationCount (
+parameter n : cardinal;
+input i : n bits;
+output o : log (n+1) bits
+) is
+begin
+if n % 2 = 1 then
+print error, "number of bits must be even"
+end; -- if
+loop
+select i then
+if n = 1 then
+o <- i
+|
+n = 2 then
+o <- (#i[0] + #i[1]) -- add bits 0 and 1
+else
+local
+constant pairCount = n - (n / 2)
+array pairCount of channel addedPairs : 2 bits
+begin
+for || c in 0..pairCount-1 then
+addedPairs[c] <- (#i[c*2] + #i[(c*2)+1])
+end ||
+AddTree (pairCount, 2, log (n+1), addedPairs, o)
+-- [Figure 11.2. Structure of a bit population counter, shown for
+-- PopulationCount (8): bits #i[0] to #i[7] are added in adjacent pairs and
+-- the sums feed AddTree (4,2,4), which is built from two AddTree (2,2,3)
+-- instances and one AddTree (2,3,4) driving o.]
+end
+end -- if
+end -- select
+end -- loop
+end
+
+procedure PopCount16 is PopulationCount (16)
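+
+The arithmetic performed by this structure can be checked with a behavioural
+model. The Python sketch below is an illustration only: it mirrors the
+pairing step and the recursive split of AddTree but ignores channel widths
+and handshaking, and the function names are hypothetical.
+
```python
def add_tree(values):
    """Sum a list by the same recursive halving that AddTree uses."""
    count = len(values)
    if count == 1:
        return values[0]
    if count == 2:
        return values[0] + values[1]
    low = count // 2                       # lowHalfInputCount
    return add_tree(values[:low]) + add_tree(values[low:])


def population_count(word, n):
    """Count the bits set in the n-bit value `word` (n must be even)."""
    if n % 2 == 1:
        raise ValueError("number of bits must be even")
    bits = [(word >> k) & 1 for k in range(n)]
    # Adjacent bits are added first, giving n/2 values in the range 0..2.
    added_pairs = [bits[2 * c] + bits[2 * c + 1] for c in range(n // 2)]
    return add_tree(added_pairs)


if __name__ == "__main__":
    print(population_count(0xF0F0, 16))
```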
+
+Commentary on the code
+
+parameterisation:
+Procedures AddTree and PopulationCount are param-
+eterised. PopulationCount can be used to count the number of bits set in any
+sized word. AddTree is parameterised to allow a recursively defined adder of
+any number of arbitrary width vectors.
+
+enclosed selection:
+The semantics of the enclosed handshake of select al-
+low the contents of the input i to be referred to several times in the body of the
+select block without the need for an internal latch.
+
+avoiding deadlock:
+Note that the formation of the sum of adjacent bits is
+specified by a parallel for loop.
+
+for || c in 0..pairCount-1 then
+addedPairs[c] <- (#i[c*2] + #i[(c*2)+1])
+end
+
+It might be thought that a serial for ; loop could be used at, perhaps, the
+expense of speed. This is not the case: the system will deadlock, illustrating
+why designing asynchronous circuits requires some real understanding of the
+methodology. In this case the adder to which the array of addedPairs is
+connected requires pairs of inputs to be ready before it can complete the
+addition and release its inputs. However, if the sum of adjacent bits is
+computed serially, the next pair will not be computed until the handshake for
+the previous pair has been completed, which is not possible because AddTree is
+awaiting all pairs to become valid. The result is deadlock!
+
+11.2.3
+A Balsa shifter
+
+General shifters are an essential element of all microprocessors including
+the Amulet processors. The following description forms the basis of such a
+shifter. It implements only a rotate right function, but it is easily extensible to
+other shift functions.
+The main work of the shifter is done by the local procedure rorBody, which
+recursively creates sub-shifters capable of optionally rotating by 1, 2, 4, 8
+etc. bits. The structure of the shifter is shown in figure 11.3.
+
+[Figure 11.3. Structure of a rotate right shifter: ror (8 bits) chains
+rorBody (4), rorBody (2) and rorBody (1) between i and o; each rorStage uses
+a multiplexer, selected by bit #d[log distance], to pass its input either
+unchanged or rotated.]
+
+import [balsa.types.basic]
+
+-- ror: rotate right shifter
+procedure ror (
+parameter X : type;
+input d : sizeof X bits;
+input i : X;
+output o : X
+) is
+begin
+loop
+select d then
+local
+constant typeWidth = sizeof X
+
+procedure rorBody (
+parameter distance : cardinal;
+input i : X;
+output o : X
+) is
+local
+procedure rorStage (
+output o : X
+) is
+begin
+select i then
+if #d[log distance] then
+o <- (#i[typeWidth-1..distance] @
+#i[distance-1..0] as X) -- shift
+else
+o <- i -- don’t shift
+end -- if
+end -- select
+end
+channel c : X
+begin
+if distance > 1 then
+rorStage (c) ||
+rorBody (distance/2, c, o)
+else
+rorStage (o)
+end -- if
+end
+begin
+rorBody (typeWidth/2, i, o)
+end
+end -- select
+end -- loop
+end
+
+procedure ror32 is ror (32 bits)
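+
+The staged rotation can be modelled in a few lines of Python. This is a
+behavioural sketch under stated assumptions: the word width is taken to be a
+power of two, the function names are hypothetical, and only the data path is
+modelled (no channels or handshakes). Each stage rotates by its distance when
+the corresponding bit of d is set, as rorStage does.
+
```python
def ror_stage(value, width, distance, d):
    """Rotate right by `distance` if bit log2(distance) of d is set."""
    select_bit = distance.bit_length() - 1        # log distance
    if (d >> select_bit) & 1:
        mask = (1 << width) - 1
        return ((value >> distance) | (value << (width - distance))) & mask
    return value                                  # don't shift


def ror(value, d, width=32):
    """Rotate `value` right by d bits using stages width/2, width/4, ..., 1."""
    if width < 2 or width & (width - 1):
        raise ValueError("this sketch assumes a power-of-two width")
    distance = width // 2                         # rorBody (typeWidth/2, ...)
    while distance >= 1:
        value = ror_stage(value, width, distance, d)
        distance //= 2                            # rorBody (distance/2, ...)
    return value


if __name__ == "__main__":
    print(hex(ror(0x12345678, 8)))
```
+
+Because the stage distances are distinct powers of two, any rotation from 0
+to width-1 is obtained by enabling the stages named by the set bits of d.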
+
+Testing the shifter
+
+This next code example builds another test routine in Balsa to exercise the
+shifter.
+
+import [balsa.types.basic]
+import [ror]
+
+--test ror32
+procedure test_ror32(output o : 32 bits)
+is
+variable i : 5 bits
+channel shiftchan : 32 bits
+channel distchan : 5 bits
+begin
+begin
+i:= 1;
+while i < 31 then
+shiftchan <- 7 || distchan <- i;
+i:= (i+1 as 5 bits)
+end -- while
+end || ror32(distchan, shiftchan, o)
+end
+
+11.2.4
+An arbiter tree
+
+The final example builds a parameterised arbiter. This circuit forms part
+of the DMA controller of chapter 12. The architecture of an 8-input arbiter
+is shown in figure 12.3 on page 212. ArbFunnel is a parameterisable tree
+composed of two elements: ArbHead and ArbTree. Pairs of incoming sync
+requests are arbitrated and combined into single bit decisions by ArbHead el-
+ements. These single bit channels are then arbitrated between by ArbTree
+elements. An ArbTree takes a number of decision bits from each of a number
+of inputs (on the i ports) and produces a rank of 2-input arbiters to reduce the
+problem to half as many inputs each with 1 extra decision bit. Recursive calls
+to ArbTree reduce the number of input channels to one (whose final decision
+value is returned on port o).
+
+-- ArbHead: 2 way arbcall with channel no. output
+import [balsa.types.basic]
+procedure ArbHead (
+sync i0, i1;
+output o : bit
+) is
+begin
+loop
+arbitrate i0 then o <- 0
+|
+i1 then o <- 1
+end -- arbitrate
+end -- loop
+end -- begin
+
+-- ArbTree: a tree arbcall which outputs a channel number
+-- prepended onto the input channel’s data. (invokes itself
+-- recursively to make the tree)
+
+procedure ArbTree (
+parameter inputCount : cardinal;
+parameter depth : cardinal;
+-- bits to carry from inputs
+array inputCount of input i : depth bits;
+output o : (log inputCount) + depth bits
+) is
+type BitArray is array 1 of bit
+type BitArray2 is array 2 of bit
+function AddTopBit (hd : bit; tl : depth bits) =
+(#tl @ {hd} as depth + 1 bits)
+function AddTopBit2 (hd : bit; tl : depth + 1 bits) =
+(#tl @ {hd} as depth + 2 bits)
+function AddTop2Bits (hd0 : bit; hd1 : bit; tl : depth bits) =
+(#tl @ {hd0,hd1} as depth + 2 bits)
+begin
+case inputCount of
+0, 1 then print error, "Can’t build an ArbTree with fewer than 2 inputs"
+|
+2 then loop
+arbitrate i[0] -> i0 then o <- AddTopBit (0, i0)
+|
+i[1] -> i1 then o <- AddTopBit (1, i1)
+end -- arbitrate
+end -- loop
+|
+3 then local channel lo : 1 + depth bits
+begin
+ArbTree (2, depth, i[0 .. 1], lo) ||
+loop
+arbitrate lo then o <- AddTopBit2 (0, lo)
+| i[2] -> i2 then o <- AddTop2Bits (1, 0, i2)
+end -- arbitrate
+end -- loop
+end
+else local
+constant halfCount = inputCount / 2
+constant halfBits = depth + log halfCount
+channel l, r : halfBits bits
+begin
+ArbTree (halfCount, depth, i[0 .. halfCount-1], l) ||
+ArbTree (inputCount - halfCount, depth,
+i[halfCount .. inputCount-1], r) ||
+ArbTree (2, halfBits, {l,r}, o)
+end -- begin
+end -- case inputCount
+end
+
+-- ArbFunnel: build a tree arbcall (balanced apart from the last
+-- channel which is faster than the rest) which produces a channel
+-- number from an array of sync inputs
+procedure ArbFunnel (
+parameter inputCount : cardinal;
+array inputCount of sync i;
+output o : log inputCount bits
+) is
+constant halfCount = inputCount / 2
+constant oddInputCount = inputCount % 2
+begin
+if inputCount < 2 then
+print error, "can’t build an ArbFunnel with fewer than 2 inputs"
+| inputCount = 2 then
+ArbHead (i[0], i[1], o)
+| inputCount > 2 then
+local
+array halfCount + 1 of channel li : bit
+begin
+for || j in 0 .. halfCount - 1 then
+ArbHead (i[j*2], i[j*2+1], li[j])
+end ||
+if oddInputCount then
+ArbTree (halfCount + 1, 1, li[0 .. halfCount], o) ||
+loop
+select i[inputCount - 1] then li[halfCount] <- 0
+end -- select
+end -- loop
+else
+ArbTree (halfCount, 1, li[0 .. halfCount-1], o)
+end -- if
+end
+end -- if
+end
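+
+How the decision bits accumulate into a channel number can be illustrated
+with a Python model. It is a sketch under stated assumptions: it covers only
+power-of-two input counts (the 2-input case and the general halving branch of
+ArbTree, not the 3-input or odd-count branches), it takes the winning input
+as an argument instead of modelling arbitration, and the function names are
+hypothetical.
+
```python
def arb_tree_code(winner, input_count, depth, data):
    """Value seen on o when input `winner` wins, carrying `depth` bits of data."""
    if input_count == 2:
        return data | (winner << depth)           # AddTopBit (winner, data)
    half_count = input_count // 2
    half_bits = depth + (half_count.bit_length() - 1)   # depth + log halfCount
    if winner < half_count:
        sub = arb_tree_code(winner, half_count, depth, data)
        top = 0
    else:
        sub = arb_tree_code(winner - half_count, input_count - half_count,
                            depth, data)
        top = 1
    return arb_tree_code(top, 2, half_bits, sub)  # ArbTree (2, halfBits, {l,r}, o)


def arb_funnel_code(client, input_count):
    """Channel number produced by ArbFunnel when sync input `client` wins."""
    if input_count < 2 or input_count & (input_count - 1):
        raise ValueError("this sketch assumes a power-of-two input count >= 2")
    if input_count == 2:
        return client                             # a single ArbHead
    head_bit = client % 2                         # ArbHead outputs 0 or 1
    return arb_tree_code(client // 2, input_count // 2, 1, head_bit)


if __name__ == "__main__":
    print([arb_funnel_code(c, 8) for c in range(8)])
```
+
+Under these assumptions the value delivered on o is simply the index of the
+winning input: each level of the tree contributes one more significant bit.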
+
+
+Chapter 12
+
+A SIMPLE DMA CONTROLLER
+
+A simple 4 channel DMA controller is presented as a practical example of a
+reasonably large-scale design. It is written entirely in Balsa and so can
+be compiled for any of the technologies which the Balsa back-end supports.
+Readers should note that this controller is not the same as the Amulet3i DMA
+controller referred to in chapter 15. A more detailed description of this
+controller and the motivation for its design can be found in [8]. A complete
+listing of the code for the controller can be downloaded from [7].
+The simplified controller provides:
+
+4 full address range channels each with independent source, destination
+and count registers.
+
+8 client DMA request inputs with matching acknowledgements.
+
+Peripheral to peripheral, memory to memory and peripheral to memory
+transfers. Each channel has both source and destination client requests
+so “true” peripheral to peripheral transfers can be performed by waiting
+for requests from both parties.
+
+Figure 12.1 shows the programmer’s view of the controller’s register mem-
+ory map. The register bank is split into two parts: the channel registers and the
+global registers.
+
+12.1.
+Global registers
+
+The global registers contain control state pertaining to the state of currently
+active channels and of interrupts signalled by the termination of transfer runs.
+There are 4 global registers:
+
+genCtrl: General control.
+In this controller, the general control register
+only contains one bit: the global enable – gEnable. The global enable is the
+only controller bit reset at power-up. All other controller state bits must be
+initialised before gEnable is set. Using a global enable bit in this way allows
+the initialisation part of the Balsa description to remain small and cheap.
+
+[Figure 12.1. DMA controller programmer’s model. Channel registers
+chan[0..3].{src, dst, count, ctrl} occupy offsets 000 to 03C; the global
+registers genCtrl, chanStatus, IRQMask and IRQReq sit at 200, 204, 208 and
+20C. Each chan[n].ctrl holds enable, srcInc, dstInc, countDec, srcDRQ and
+dstDRQ in bits 0 to 5 and the srcClientNo and dstClientNo fields in bits
+11:6; unused bits read as 0.]
+
+chanStatus: Channel end-of-run status.
+The chanStatus register contains
+4 bits, one per DMA channel. When set by the DMA controller, a bit in this
+register indicates that the corresponding channel has come to the end of its run
+of transfers.
+
+IRQMask, IRQReq: Interrupt mask and status.
+The IRQMask register
+contains one bit per channel (like chanStatus) with set bits specifying that
+an interrupt should be raised at the end of a transfer run of that channel (when
+the corresponding chanStatus bit becomes set). IRQReq contains the current
+interrupt status for each channel.
+The channel status, IRQ mask and IRQ status bits are kept in global registers
+in order to reduce the number of DMA register reads which must be performed
+by the CPU after receiving an interrupt in order to determine which channel to
+service.
+
+12.2.
+Channel registers
+
+Each channel has 4 registers associated with it in the same way as the
+Amulet3i DMA controller. The two address registers (channel[n].src and
+channel[n].dst) specify the 32-bit source and destination addresses for trans-
+fers. The count register (channel[n].count) is a 32-bit count of remaining
+transfers to perform; transfer runs terminate when the count register is decre-
+mented to zero. The control register (channel[n].ctrl) specifies the updates
+to be performed on the other three registers and the clients to which this chan-
+nel is connected. Writing to the control register has the effect of clearing inter-
+rupts and end-of-run indication on that channel. The control register contains
+8 fields:
+
+enable: Transfer enable.
+If the enable bit is set, this channel should be
+considered for transfers when a new DMA request arrives. Channel enables
+are not cleared on power-up. The genCtrl.gEnable bit can be used to pre-
+vent transfers from occurring whilst the channel enable bits are cleared during
+startup.
+
+srcInc, dstInc, countDec: Increment/decrement control.
+These bits are
+used to enable source, destination and count register update after a transfer.
+Source and destination registers are incremented by 4 after transfers (since
+only word transfers are supported in this version of the controller) if srcInc
+and dstInc (respectively) are set. Note that the bottom 2 bits of these ad-
+dresses are preserved. The count register is decremented by 1 after each trans-
+fer if countDec is set. Resetting either srcInc or dstInc results in the corre-
+sponding address remaining unchanged between transfers. This is useful for
+nominating peripheral (rather than memory) addresses. Resetting countDec
+results in “free-running” transfers.
+
+srcDRQ, dstDRQ: Initial DMA requests.
+Transfers can take place on a
+channel when a pair of DMA requests have been received, one for the source
+client and the other for the destination client (the requests-pending registers).
+The srcDRQ and dstDRQ bits specify the initial states for those two requests.
+Setting both of these bits indicates that the source and destination requests
+should be considered to have already arrived. Resetting one or both of the bits
+specifies that requests from the corresponding ⟨src,dst⟩ClientNo numbered
+client should trigger a transfer (both client requests are required when both
+control bits are reset).
+
+srcClientNo, dstClientNo: Client to channel mapping.
+These fields spec-
+ify the client numbers from which this channel receives source and destination
+DMA requests. These fields are only of use when either srcDRQ or dstDRQ (or
+both) are reset.
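+
+The register updates controlled by srcInc, dstInc and countDec can be
+summarised in a short Python model. This is a sketch rather than the
+controller's implementation: the Channel class and after_transfer function
+are hypothetical names, and wrapping the 32-bit registers modulo 2^32 is an
+assumption not stated in the text.
+
```python
from dataclasses import dataclass

WORD_MASK = 0xFFFFFFFF   # src, dst and count are 32-bit registers


@dataclass
class Channel:
    src: int
    dst: int
    count: int
    src_inc: bool = True     # srcInc
    dst_inc: bool = True     # dstInc
    count_dec: bool = True   # countDec


def after_transfer(ch):
    """Update `ch` after one word transfer; return True at the end of a run."""
    if ch.src_inc:
        ch.src = (ch.src + 4) & WORD_MASK      # word transfers only
    if ch.dst_inc:
        ch.dst = (ch.dst + 4) & WORD_MASK
    if ch.count_dec:
        ch.count = (ch.count - 1) & WORD_MASK  # wrap-around is an assumption
        return ch.count == 0                   # run ends when count reaches zero
    return False                               # "free-running" transfers


if __name__ == "__main__":
    # Memory source, fixed peripheral destination, two transfers.
    ch = Channel(src=0x1002, dst=0x8000, count=2, dst_inc=False)
    print(after_transfer(ch), after_transfer(ch), hex(ch.src), hex(ch.dst))
```
+
+Adding 4 leaves the bottom 2 address bits untouched, which is why those bits
+are preserved across transfers.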
+
+12.3.
+DMA controller structure
+
+The structure of the simplified DMA controller is shown in Figure 12.2. The
+simplified DMA controller is composed of 5 units:
+
+[Figure 12.2. DMA controller structure: DMA requests enter through the
+ARBITER and reach the control unit on DRQ; the control unit sends TECommand
+to the transfer engine and receives TEAck (transfer acknowledge); the MARBLE
+target interface carries busCommand and busResponse, the MARBLE initiator
+interface carries the transfer engine's address and control, and interrupts
+go to the CPU.]
+
+
+MARBLE target interface
+
+The controller is assumed to be attached to the MARBLE asynchronous
+bus which connects the subsystems in the Amulet3i system-on-chip (see chap-
+ter 15). It is relatively easy to provide an interface to other forms of on-chip
+system connect busses.
+The MARBLE target interface provides a connection to MARBLE through
+which the controller can be programmed. Accesses to the registers from this
+interface are arbitrated with incoming DMA requests and transfer acknowl-
+edgements from the transfer engine. This arbitration and the decoupling of
+transfer engine from control unit allow the DMA controller to avoid potential
+bus access deadlock situations.
+The MARBLE interface used here carries an 8-bit address (8-bit word ad-
+dress, 10-bit byte address) similar to that of the Amulet3i DMA controller.
+This allows the same address mapping of channel registers and the
+possibility of extending the number of channels to 32 without changing the
+global register addresses.
+
+MARBLE initiator interface
+
+The initiator interface is used by the DMA controller to perform its transfers.
+Only the address and control bits to this interface are connected to the Balsa
+synthesised controller hardware. The data to and from the initiator interface is
+handled by a latch (shown as the shaded box in Figure 12.2). Only word-wide
+transfers are supported and so this latch is all that is needed to hold data values
+between the read and write bus transactions of a transfer. Supporting different
+transfer widths is relatively easy but has not been implemented in this example
+in order to simplify the code.
+
+Control unit
+
+Each DMA channel has a pair of register bits, the requests-pending bits,
+which record the arrival of requests for that channel’s source and destination
+clients. After marking-up an incoming request, the control unit examines the
+requests-pending registers of each channel in turn to find a channel on which
+to perform a transfer. If a transfer is to be performed, the register contents for
+that channel are forwarded to the transfer engine and the register contents are
+updated to reflect the incremented addresses and decremented count. DMA
+requests are acknowledged straight away when no transfer engine command
+is issued, or just after the command is issued where a transfer command is
+issued to the transfer engine. The acknowledgement of DMA requests does
+not guarantee the timely completion of the related transfer; peripherals must
+observe bus accesses made to themselves for this purpose. The acknowledgement
+serves only to confirm the receipt of the DMA transfer request. A request
+must be removed after an acknowledgement is signalled so that other requests
+can be received through the request arbitration tree to mark-up potential trans-
+fers for other channels.
+
+Transfer engine
+
+The controller’s transfer engine takes commands from the control unit when
+a DMA transfer is due to be performed and performs no DMA request mapping
+or filtering of its own. The only reason for having the transfer engine is to
+prevent the potential bus deadlock situation if an access to the register bank
+is made across MARBLE while the DMA controller is trying to perform a
+transfer. In this situation, control of the bus belongs to the initiator (usually
+the CPU) trying to access the DMA controller. This initiator cannot proceed
+as the DMA controller is engaged in trying to gain the bus for itself. With a
+transfer engine, and the decoupling of DMA request/CPU access from transfer
+operations, the control unit is free to fulfil the initiator’s register request while
+the transfer engine is waiting for the bus to become available.
+After performing a transfer, the transfer engine will signal to the control
+unit to provide a new transfer command; it does this by a handshake on the
+transfer acknowledge channel (marked TEAck in Figure 12.2). This channel
+passes through the control unit’s command arbitration hardware and serves to
+inform the control unit that the transfer engine is free and that the request-
+pending register can be polled to find the next suitable transfer candidate. The
+acknowledgement not only provides the self-looping activation required to per-
+form memory to memory transfers but also allows the looping required to ser-
+vice requests for other types of transfer which are received during the period
+when the transfer engine was busy.
+A flag register, TEBusy, held in the control unit, is used to record the status
+of the transfer engine so that commands are not issued to it while a transfer
+is in progress. This flag is set each time a transfer command is issued to the
+transfer engine and cleared each time a transfer acknowledgement is received
+by the control unit. The request-pending registers are not re-examined (and a
+transfer command issued) if TEBusy is set.
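+
+The role of the TEBusy flag can be shown with a minimal Python model. It is
+only a sketch of the protocol described above: the class and method names
+are hypothetical, and request polling, arbitration and the bus are left out.
+
```python
class ControlUnitModel:
    """Tracks TEBusy so that commands are never issued during a transfer."""

    def __init__(self):
        self.te_busy = False
        self.issued = []          # commands handed to the transfer engine

    def try_to_issue(self, command):
        """Issue `command` unless a transfer is in progress; report success."""
        if self.te_busy:
            return False          # request-pending registers are not re-examined
        self.te_busy = True       # set each time a transfer command is issued
        self.issued.append(command)
        return True

    def transfer_ack(self):
        """Handshake on TEAck: the transfer engine is free again."""
        self.te_busy = False


if __name__ == "__main__":
    cu = ControlUnitModel()
    print(cu.try_to_issue(("src0", "dst0")))   # accepted
    print(cu.try_to_issue(("src1", "dst1")))   # refused while busy
    cu.transfer_ack()
    print(cu.try_to_issue(("src1", "dst1")))   # accepted after TEAck
```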
+
+Arbiter tree
+
+The DMA controller receives DMA requests on an array of 8 sync channels
+connected to the input of the ARBITER unit shown in Figure 12.2. This arbiter
+unit is a tree of 2-way arbiter cells that combines these 8 inputs into a single
+DMA request number which it provides to the control unit. DMA requests
+are acknowledged as soon as the control unit has recorded them. Only the
+successful transfer of data between peripherals should be used as an indication
+of the actual completion of a DMA operation. When a transfer is begun (i.e.
+passed from control unit to transfer engine), that transfer’s channel registers
+and requests-pending registers are updated before another arbitrated access to
+the control unit is accepted. As a consequence, a new request on a channel can
+arrive (and be correctly observed) as soon as any transfer-related bus activity
+occurs for that transfer.
+
+12.4.
+The Balsa description
+
+The Balsa description of the DMA controller is composed of 3 parts: the
+arbiter tree, the control unit and the transfer engine. The two MARBLE inter-
+faces sit outside the Balsa block and are controlled through the target (mta and
+mtd) ports (corresponding to command and response ports) and the initiator
+address/control (mia) port. The top level of the DMA controller is:
+
+procedure DMAArb is ArbFunnel (NoOfClients)
+
+procedure dma (
+input mta : MARBLE8bACommand;
+output mtd : MARBLEResponse;
+output mia : MARBLECommandNoData;
+output irq : bit;
+array NoOfClients of sync drq
+) is
+channel DRQClientNo : ClientNo
+channel TECommand : array 2 of Word
+-- srcAddr, dstAddr
+sync TEAck
+begin
+DMAArb (drq, DRQClientNo) ||
+DMAControl (mta, mtd, DRQClientNo, TECommand, TEAck, irq) ||
+DMATransferEngine (TECommand, TEAck, mia)
+end
+
+Interrupts are signalled by writing a 0 or 1 to the irq port. This interrupt
+value must then be caught by an external latch to generate a bare interrupt
+signal.
+
+12.4.1
+Arbiter tree
+
+DMA requests from the client peripherals arrive on the sync channels drq;
+these channels connect to the request arbiter DMAArb. The procedure
+declaration for DMAArb is given in the top level as a parameterised version of
+the procedure ArbFunnel, which was described in chapter 11 on page 202.
+Figure 12.3 shows the organisation of an 8-input ArbFunnel.
+
+[Figure 12.3. 8-input arbiter – ArbFunnel: inputs i[0] to i[7] enter four
+ArbHead cells inside an ArbFunnel over 8; their outputs feed an ArbTree over
+4,1, which contains an ArbTree over 2,2 driving o.]
+
+12.4.2
+Transfer engine
+
+The transfer engine is, like the arbiter unit, quite simple. It exists only as
+a buffer stage between the control unit and the MARBLE initiator interface.
+This function is reflected in the sequencing in the Balsa description and the
+latches used to store the outgoing addresses.
+
+procedure DMATransferEngine (
+input command : array 2 of Word;
+sync ack;
+output busCommand : MARBLECommandNoData
+) is
+variable commandV : array 2 of Word
+begin
+loop
+command -> commandV;
+busCommand <- {commandV[0],read,word};
+busCommand <- {commandV[1],write,word};
+sync ack
+end
+end
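+
+The sequencing of the transfer engine can be mirrored by a Python generator.
+This is a behavioural sketch only: the function name and the tuple encoding
+of bus commands are hypothetical, and the handshakes are reduced to an
+ordered list of events.
+
```python
def transfer_engine(commands):
    """Yield the events produced for each (src_addr, dst_addr) command."""
    for src_addr, dst_addr in commands:
        yield ("busCommand", src_addr, "read", "word")    # commandV[0]
        yield ("busCommand", dst_addr, "write", "word")   # commandV[1]
        yield ("ack",)                                    # sync ack


if __name__ == "__main__":
    for event in transfer_engine([(0x1000, 0x2000)]):
        print(event)
```
+
+Each command yields a read of the source address, then a write to the
+destination address, and only then the acknowledge to the control unit.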
+
+12.4.3
+Control unit
+
+The bulk of the controller is contained in the control unit. It contains all the
+channel register latch bits and register access multiplexers/demultiplexers. The
+reduced number of channels and single channel type makes this arrangement
+practical. There are in total 445 bits of programmer accessible state. The ports,
+local variables and local channels of the control unit’s Balsa description are:
+
+procedure DMAControl (
+input busCommand : MARBLE8bACommand;
+output busResponse : MARBLEResponse;
+input DRQ : ClientNo;
+output TECommand : array 2 of Word;
+sync TEAck;
+output IRQ : bit
+) is
+-- combined channel registers
+variable channelRegisters :
+array NoOfChannels of ChannelRegister
+variable channelR, channelW : ChannelRegister
+array over ChannelRegType of bit
+variable channelNo : ChannelNo
+variable clientNo : ClientNo
+
+variable TEBusy : bit
+
+variable gEnable : bit
+variable chanStatus : array NoOfChannels of bit
+variable IRQMask, IRQReq : array NoOfChannels of bit
+
+variable requestsPending :
+array NoOfChannels of RequestPair
+
+channel commandSourceC : DMACommandSource
+channel busCommandC : MARBLE8bACommand
+channel DRQC : ClientNo
+variable commandSource : DMACommandSource
+. . .
+
+The ChannelRegister is the combined source, destination, count and con-
+trol registers for one channel. The variable channelRegisters is accessed
+by reading or writing these 108-bit wide registers (32 + 32 + 32 + 12). The
+two registers, channelR and channelW, are used as read and write buffers
+to the channel registers. This allows the partial writes required for CPU ac-
+cess to individual 32-bit words to fragment only these two registers, not all
+of the channel registers. The variables channelNo and clientNo are used to
+hold channel and client numbers between operations. DMA request arrival
+
+
+214
+Part II: Balsa - An Asynchronous Hardware Synthesis System
+
+and request mark-up can modify clientNo and channel register accesses and
+ready-to-transfer polling can modify channelNo.
+The three channel declarations are used to communicate with RequestHandler,
+a sub-procedure of DMAControl which arbitrates between requests from the
+arbiter tree, the MARBLE target interface and the transfer engine acknowledge
+for service by the control unit. RequestHandler’s description is fairly
+uninteresting and so will not be discussed.
+The body of the control unit, with the less interesting portions removed, is
+as follows:
+
+begin
+Init ();
+-- RequestHandler is an ArbFunnel
+--   with accompanying data
+RequestHandler (busCommand, DRQ, TEAck, commandSourceC,
+busCommandC, DRQC) ||
+loop
+-- find source of service requests
+commandSourceC -> commandSource;
+case commandSource of
+DRQ then DRQC -> clientNo; MarkUpClientRequest ()
+| bus then
+select busCommandC then
+if (busCommandC.a as RegAddrType).globalNchannel
+then . . . -- global R/W from the CPU
+else -- channel regs
+channelNo :=
+(busCommandC.a as ChannelRegAddr).channelNo;
+ReadChannelRegisters ();
+case busCommandC.rNw of
+. . . -- most of CPU reg. access code omitted
+-- CPU ctrl register write
+| ctrl then channelW.ctrl :=
+(busCommandC.d as ControlRegister) ||
+requestsPending[channelNo] := {0,0} ||
+ClearChanStatus ()
+end;
+WriteChannelRegisters ()
+end
+end
+end
+else -- TEAck
+TEBusy := 0;
+if gEnable then AssessInterrupts () end
+end;
+if gEnable and not TEBusy then
+TryToIssueTransfer ()
+end
+end
+end
+
+A number of procedure calls are made by the control unit body, for example,
+AssessInterrupts (). These calls are to shared procedures whose definitions
+follow the local variables in DMAControl’s description. In Balsa, a local
+procedure which is declared to be “shared” is instantiated in only one place
+in the containing procedure’s handshake circuit, whereas a normal procedure
+call places a new copy of that procedure’s body at each call location. Calls
+to a shared procedure are combined using a Call component, which makes
+shared procedures cheaper to use than normal ones.
+
+DMA request handling – MarkUpClientRequest
+
+Incoming DMA requests are marked up in the request pending registers as
+previously described. The procedure MarkUpClientRequest performs this
+operation by testing all the channels’ srcClientNo and dstClientNo con-
+trol bits with clientNo (the client ID of the incoming request) in parallel.
+MarkUpClientRequest’s description is:
+
+shared MarkUpClientRequest is
+begin
+for || i in 0..NoOfChannels-1 then
+if channelRegisters[i].ctrl.srcClientNo = clientNo
+then requestsPending[i].src := 1
+end ||
+if channelRegisters[i].ctrl.dstClientNo = clientNo
+then requestsPending[i].dst := 1
+end
+end
+end
+
+The for || loop in this description performs parallel structural instantiation
+of NoOfChannels copies of the body’s if statements.
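+
+The mark-up step can be modelled in software as follows. This is only a
+minimal Python sketch, not Balsa: the names ChannelCtrl, RequestPair and
+mark_up_client_request are illustrative assumptions, and the loop visits the
+channels one after another where the hardware tests them all in parallel.

```python
from dataclasses import dataclass


@dataclass
class ChannelCtrl:
    """Hypothetical stand-in for the client-number fields of a control register."""
    src_client_no: int
    dst_client_no: int


@dataclass
class RequestPair:
    """Hypothetical stand-in for one entry of requestsPending."""
    src: int = 0
    dst: int = 0


def mark_up_client_request(ctrls, requests_pending, client_no):
    """Set the src/dst pending bit of every channel bound to client_no."""
    for ctrl, pending in zip(ctrls, requests_pending):
        if ctrl.src_client_no == client_no:
            pending.src = 1
        if ctrl.dst_client_no == client_no:
            pending.dst = 1


# Client 2 is the destination of channel 0 and the source of channel 1.
ctrls = [ChannelCtrl(1, 2), ChannelCtrl(2, 3)]
pending = [RequestPair(), RequestPair()]
mark_up_client_request(ctrls, pending, client_no=2)
print(pending)
```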
+
+Register Access – ReadChannelRegisters,
+WriteChannelRegisters
+
+The shared procedures used to access the channel registers are very short.
+They make the only variable-indexed accesses to the channel registers. The
+two procedures are:
+
+shared ReadChannelRegisters is begin
+channelR := channelRegisters[channelNo]
+end
+
+shared WriteChannelRegisters is begin
+channelRegisters[channelNo] := channelW
+end
+
+Notice that no individual word write enables are present and so, in order
+to modify a single word, a whole channel register must be read, modified,
+then written back. The ReadChannelRegisters followed by channelW :=
+channelR in the description of the CPU’s access to the channel registers uses
+this update method.
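+
+The read, modify, write-back sequence can be sketched in Python as below. The
+class and function names are assumptions made for illustration; only the
+three steps themselves come from the description above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ChannelRegister:
    """Hypothetical model of one combined channel register."""
    src: int = 0
    dst: int = 0
    count: int = 0
    ctrl: int = 0


def write_word(channel_registers, channel_no, field, value):
    """Update one word of a channel register without per-word write enables."""
    channel_r = channel_registers[channel_no]         # ReadChannelRegisters
    channel_w = replace(channel_r, **{field: value})  # channelW := channelR, then patch one word
    channel_registers[channel_no] = channel_w         # WriteChannelRegisters


regs = [ChannelRegister(src=1, dst=2, count=3, ctrl=4)]
write_word(regs, 0, "dst", 99)
print(regs[0])
```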
+
+Channel status and interrupts – ClearChanStatus,
+AssessInterrupts
+
+The interrupt output bit is asserted by AssessInterrupts. Interrupts are
+signalled when the IRQReq register is non-zero and are reassessed each time
+an action which could clear an interrupt occurs. ClearChanStatus is called
+during channel control register updates to clear interrupts and channel status
+(end-of-run) indications. Their descriptions are:
+
+shared AssessInterrupts is begin
+IRQ <- (IRQReq as NoOfChannels bits) /= 0
+end
+
+shared ClearChanStatus is begin
+chanStatus[channelNo] := 0 ||
+IRQReq[channelNo] := 0;
+AssessInterrupts ()
+end
+
+Ready-to-transfer polling – TryToIssueTransfer, IssueTransfer
+
+Whenever the DMA controller is stimulated by its command interfaces, it
+tries to perform a transfer. The request-pending, and ctrl[n].enable bits
+for each channel are examined in turn to determine if that channel is ready
+to transfer. Incrementing the channel number during this search is performed
+using a local channel to allow the incremented value to be accessed in parallel
+from two places. TryToIssueTransfer’s Balsa description is:
+
+shared TryToIssueTransfer is local
+variable foundChannel : bit
+variable newChannelNo : ChannelNo
+begin
+foundChannel := 0 || channelNo := 0;
+
+while not foundChannel then
+-- source and destination requests arrived?
+if requestsPending[channelNo] = {1,1}
+and channelRegisters[channelNo].ctrl.enable then
+ReadChannelRegisters ();
+requestsPending[channelNo] :=
+{channelR.ctrl.srcDRQ, channelR.ctrl.dstDRQ} ||
+foundChannel := 1 ||
+IssueTransfer () ||
+UpdateRegistersAfterTransfer ()
+else
+local
+channel incChanNo : array ChannelNoLen + 1 of bit
+begin
+incChanNo <- (channelNo + 1 as
+array ChannelNoLen + 1 of bit) ||
+select incChanNo then
+foundChannel := incChanNo[ChannelNoLen] ||
+newChannelNo := (incChanNo[0..ChannelNoLen-1]
+as ChannelNo)
+end;
+channelNo := newChannelNo
+end
+end
+end
+end
+
+Notice that if a transfer is taken, the requestsPending bits for that channel
+are re-initialised from the ctrl.{srcDRQ,dstDRQ} control bits for that
+channel. The procedure IssueTransfer actually passes the transfer on to the
+transfer engine. Its definition is:
+
+shared IssueTransfer is begin
+TEBusy := 1 ||
+TECommand <- {channelR.src, channelR.dst}
+end
+
+The interlock formed by checking TEBusy before attempting a transfer, and
+the setting/resetting of TEBusy by transfer initiation/completion ensures that
+no transfer is presented to the transfer engine (deadlocking the control unit)
+while it is occupied. The TEAck communication back to the control unit also
+provides stimuli for re-triggering the DMA controller to perform outstanding
+requests. This re-triggering, combined with the sequential polling of channels,
+allows outstanding requests (received while the transfer engine was busy) to be
+serviced correctly. Notice that a static prioritisation on pre-arrived requests is
+enforced by sequential channel polling.
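+
+The polling order and the carry-bit loop exit can be illustrated with a short
+Python sketch. It rests on several assumptions: CHANNEL_NO_LEN = 2 (four
+channels) is chosen only for the example, the TEBusy check is folded into the
+function although the Balsa description performs it in the control unit body,
+and the function returns a channel number instead of issuing a transfer.

```python
CHANNEL_NO_LEN = 2  # assumed width of ChannelNo, i.e. 4 channels


def try_to_issue_transfer(requests_pending, enabled, te_busy=False):
    """Return the lowest-numbered ready channel, or None if there is none."""
    if te_busy:                       # interlock: never offer a transfer to a busy engine
        return None
    channel_no, found, issued = 0, False, None
    while not found:
        if requests_pending[channel_no] == (1, 1) and enabled[channel_no]:
            issued, found = channel_no, True
        else:
            inc = channel_no + 1                  # ChannelNoLen + 1 bits wide
            found = bool(inc >> CHANNEL_NO_LEN)   # carry out: every channel was polled
            channel_no = inc & ((1 << CHANNEL_NO_LEN) - 1)
    return issued


# Channels 1 and 2 are both ready; sequential polling picks channel 1.
print(try_to_issue_transfer([(0, 1), (1, 1), (1, 1), (0, 0)], [True] * 4))
```

+Because polling always restarts at channel 0, a lower-numbered ready channel
+is always chosen first, which is the static prioritisation noted above.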
+
+Register increment/decrement –
+UpdateRegistersAfterTransfer
+
+After a transfer has been issued, the registers for that transfer’s channel must
+be updated. Procedure UpdateRegistersAfterTransfer performs this task:
+
+shared UpdateRegistersAfterTransfer is begin
+channelW.ctrl := channelR.ctrl ||
+if channelR.ctrl.srcInc then
+channelW.src := (channelR.src + 1 as Word)
+end ||
+if channelR.ctrl.dstInc then
+channelW.dst := (channelR.dst + 1 as Word)
+end ||
+if channelR.ctrl.countDec then
+channelW.count := (channelR.count - 1 as Word)
+end;
+if channelW.count = 0 then
+chanStatus[channelNo] := 1 ||
+if IRQMask[channelNo] then
+IRQReq[channelNo] := 1
+end ||
+channelW.ctrl.enable := 0
+end;
+WriteChannelRegisters ()
+end
+
+This procedure uses two incrementers and a decrementer to modify the
+channel’s source address, destination address and count respectively. If the
+channel’s transfer count is decremented to zero, the chanStatus bit, interrupt
+status and channel enable are all updated to indicate an end-of-run.
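+
+A rough software model of this update is given below. It is a simplified
+Python sketch under stated assumptions: the register is a plain dictionary
+with hypothetical key names, words are taken to be 32 bits wide, and fields
+that are not incremented or decremented are simply carried over.

```python
WORD_MASK = 0xFFFFFFFF  # assumed 32-bit Word


def update_after_transfer(reg, irq_mask):
    """Return (new_reg, chan_status, irq_req) after one issued transfer."""
    new = dict(reg)  # simplification: untouched fields are copied unchanged
    if reg["srcInc"]:
        new["src"] = (reg["src"] + 1) & WORD_MASK
    if reg["dstInc"]:
        new["dst"] = (reg["dst"] + 1) & WORD_MASK
    if reg["countDec"]:
        new["count"] = (reg["count"] - 1) & WORD_MASK
    end_of_run = new["count"] == 0
    if end_of_run:
        new["enable"] = 0  # channel disables itself at end-of-run
    # The status bit is raised at end-of-run; the interrupt only if unmasked.
    return new, int(end_of_run), int(end_of_run and bool(irq_mask))


reg = dict(src=0x100, dst=0x200, count=1,
           srcInc=1, dstInc=0, countDec=1, enable=1)
new_reg, chan_status, irq_req = update_after_transfer(reg, irq_mask=1)
print(new_reg, chan_status, irq_req)
```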
+This concludes the descriptions of the more illustrative aspects of the Balsa
+DMA controller description. For further details see [8] and, as was noted at
+the beginning of this chapter, a full code listing is available from [7].
+
+
+III
+
+LARGE-SCALE ASYNCHRONOUS DESIGNS
+
+Abstract
+In this final part of the book we describe some large-scale asynchronous VLSI
+designs to illustrate the capabilities of this technology. The first two of these
+designs – the contactless smart card chip developed at Philips and the Viterbi
+decoder developed at the University of Manchester – were designed within EU-
+funded projects in the low-power design initiative that is the sponsor of this book
+series. The third chapter describes aspects of the Amulet microprocessor series,
+again from the University of Manchester, developed in several other EU-funded
+projects which, although outside the low-power design initiative, nevertheless
+had low power as a significant objective.
+The chips described in this part of the book are some of the largest and most
+complex asynchronous designs ever developed. Fully detailed descriptions of them
+are far beyond the scope of this book, but they are included to demonstrate that
+asynchronous design is fully capable of supporting large-scale designs, and they
+show what can be done with skilled and experienced design teams. The descrip-
+tions presented here have been written to give insight into the thinking processes
+that a designer of state-of-the-art asynchronous systems might go through in de-
+veloping such designs.
+
+Keywords:
+asynchronous circuits, large-scale designs
+
+
+
+Chapter 13
+
+DESCALE:* a Design Experiment for a Smart Card Application
+consuming Low Energy
+
+Joep Kessels & Ad Peeters
+Philips Research, NL-5656AA Eindhoven, The Netherlands
+{Joep.Kessels, Ad.Peeters}@philips.com
+
+Torsten Kramer
+Kramer-Consulting, D-21079 Hamburg, Germany
+
+Kramer@kramer-consulting.de
+
+Volker Timm
+Philips Semiconductors, D-22529 Hamburg, Germany
+
+Volker.Timm@philips.com
+
+Abstract
+We have designed an asynchronous chip for contactless smart cards. Asyn-
+chronous circuits have two power properties that make them very suitable for
+contactless devices: low average power and small current peaks. The fact that
+asynchronous circuits operate over a wide range of the supply voltage, while
+automatically adapting their speed, has been used to obtain a circuit that is very
+resilient to voltage drops while giving maximum performance for the power be-
+ing received. The asynchronous circuit has been built, tested and evaluated.
+
+Keywords:
+low-power asynchronous circuits, smart cards, contactless devices, DES cryp-
+tography
+
+*Part of the work described in this paper was funded by the European Commission under Esprit TCS/ESD-
+LPD contract 25519 (Design Experiment DESCALE).
+
+
+222
+Part III: Large-Scale Asynchronous Designs
+
+13.1. Introduction
+
+Since their introduction in the eighties, smart cards have been used in a con-
+tinuously growing number of applications, such as banking, telephony (tele-
+phone and SIM cards), access control (Pay-TV), health-care, tickets for public
+transport, electronic signatures, and identification. Currently, most cards have
+contacts and, for that reason, need to be inserted into a reader. For applications
+in which the fast handling of transactions is important, contactless smart cards
+have been introduced requiring only close proximity to a reader (typically sev-
+eral centimeters). The chip on such a card must be extremely power efficient,
+since it is powered by electromagnetic radiation.
+Asynchronous CMOS circuits have the potential for very low power con-
+sumption, since they only dissipate when and where active. However, asyn-
+chronous circuits are difficult to design at the level of gates and registers.
+Therefore the high-level design language Tangram was defined [141] and a
+silicon compiler has been implemented that translates Tangram programs into
+asynchronous circuits.
+The Tangram compiler generates a special class of asynchronous circuits
+called handshake circuits [135, 112]. Handshake circuits are constructed from
+a set of about forty basic components that use handshake signalling for com-
+munication.
+Several chips have been designed in Tangram [136, 144] and if we compare
+these chips to existing clocked implementations, then the asynchronous ver-
+sions are generally about 20% larger in area and consume about 25% of the
+power.
+In order to find out what advantages asynchronous circuits have to offer in
+contactless smart cards, we have designed an asynchronous smart card chip. In
+this chapter, we indicate which properties of asynchronous circuits have been
+exploited and we present the results. The rest of the chapter is organized as fol-
+lows. Section 13.2 presents the Tangram method for designing asynchronous
+circuits. Section 13.3 summarizes the differences in power behaviour between
+synchronous and asynchronous circuits. Section 13.4 first provides some back-
+ground to contactless smart cards, then identifies the power characteristics in
+which contactless devices differ from battery-powered ones, and finally indi-
+cates why asynchronous circuits are very suited for contactless devices. Sec-
+tion 13.5 presents the digital circuit, Section 13.6 the results of the silicon, and
+Section 13.8 the power regulator, which also exploits an asynchronous oppor-
+tunity. We conclude with a summary of the merits of this asynchronous design.
+
+
+Chapter 13: Descale
+223
+
+13.2. VLSI programming of asynchronous circuits
+
+The design flow that is used in the design of the smart card IC reported here
+is based on the programming language Tangram and its associated compiler
+and tool set. An important aspect of this design approach is the transparency
+of the silicon compiler [142], which allows the designer to reason about de-
+sign characteristics such as area, power, performance, and testability, at the
+programming (Tangram) level.
+This section first introduces the Tangram tool set, then briefly describes the
+underlying handshake technology, and finally illustrates VLSI-programming
+techniques using the GCD algorithm presented in Chapter 8.
+
+13.2.1 The Tangram toolset
+
+Fig. 13.1 shows the Tangram toolset. The designer describes a design in
+Tangram, which is a conventional programming language, similar to C and
+Pascal, extended to include constructs for expressing concurrency and com-
+munication in a way similar to those in the language CSP [58]. In addition to
+this, there are language constructs for expressing hardware-specific issues, like
+sharing of blocks and waiting for clock-edges.
+A compiler translates Tangram programs into so-called handshake circuits,
+which are netlists composed from a library of some 40 handshake components.
+Each handshake component implements a language construct, such as sequenc-
+ing, repetition, communication, and sharing.
+The handshake circuit simulator and corresponding performance analyzer
+give feedback to the designer about aspects such as function, area, timing,
+power, and testability.
+The actual mapping onto a conventional (synchronous) standard-cell library
+is done in two steps. In the first step the component expander uses the com-
+ponent library to generate an abstract netlist consisting of combinational logic,
+registers, and asynchronous cells, such as Muller C-elements. This step also
+determines the data encoding and the handshake protocol; generally a four-
+phase single-rail implementation is generated. In the second step commercial
+synthesis tools and technology mappers are used to generate the standard-cell
+netlist. No dedicated (asynchronous) cells are required in this mapping, be-
+cause all asynchronous cells are decomposed into cells from the standard-cell
+library at hand.
+Similar language-based approaches using handshake circuits as intermedi-
+ate format are described in [17, 9]. Design approaches in which asynchronous
+details are not hidden from the designer have also proven successful [80, 21,
+83, 30, 64]. A general overview of design methods for asynchronous circuits
+is given in [69] and in Part I of this book.
+
+
+Figure 13.1. The Tangram toolset: boxes denote tools, ovals denote (design) representations.
+[Diagram not reproduced. Representations: Tangram Program, Handshake Circuit,
+Abstract Netlist, Mapped Netlist. Tools: Tangram Compiler, Component Expander,
+Technology Mapper, Handshake Circuit Simulator, Performance Analyzer.
+Libraries: Component Library, Asynchronous Library, Standard-cell Library.
+Other labels: Function, Area, Timing, Power, Test; Timed Traces; Fault Coverage.]
+
+
+Figure 13.2. Handshake channel: abstract figure (top) and implementation (bottom).
+[Diagram not reproduced: an Active and a Passive partner, connected in the
+implementation by a Req wire and an Ack wire.]
+
+13.2.2 Handshake technology
+
+The design of large-scale asynchronous ICs demands a timing discipline to
+replace the clock regime that is used in conventional VLSI design. We have
+chosen handshake signaling [121] as the asynchronous timing discipline, since
+it supports plug-and-play composition of components into systems, and is also
+easy to implement in VLSI. An alternative to handshaking would be to com-
+pose asynchronous finite-state machines that communicate using fundamental
+mode or burst-mode assumptions [27, 132].
+Fig. 13.2 shows a handshake channel, which is a point-to-point connection
+between an active and a passive partner. In the abstract figure, the solid circle
+indicates the channel’s active side and the open circle its passive side. The
+implementation shows that both partners are connected by two wires: a request
+(Req) and an acknowledge (Ack) wire. A handshake requires the cooperation of
+both partners. It is initiated by the active party, which starts by sending a signal
+via Req, and then waits until a signal arrives via Ack. The passive side waits
+until a request arrives, and then sends an acknowledge. Handshake channels
+can be used not only for synchronization, but also for communication. To that
+end, data can be encoded in the request, the acknowledge, or in both.
+The protocol used in most asynchronous VLSI circuits is a four-phase hand-
+shake, in which the channel starts in a state with both Req and Ack low. The
+active side starts a handshake by making Req high. When this is observed by
+the passive side, it pulls Ack high. After this a return-to-zero cycle follows,
+during which first Req and then Ack go low, thus returning to the initial state.
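+
+The four phases can be written out as a sequence of wire states. The Python
+sketch below is purely illustrative (the function name is an assumption); it
+lists the (Req, Ack) pairs visited during one handshake as described above.

```python
def four_phase_handshake():
    """Return the (Req, Ack) states visited during one four-phase handshake."""
    req, ack = 0, 0
    trace = [(req, ack)]      # initial state: both wires low
    req = 1
    trace.append((req, ack))  # phase 1: active side raises Req
    ack = 1
    trace.append((req, ack))  # phase 2: passive side raises Ack
    req = 0
    trace.append((req, ack))  # phase 3: return-to-zero, Req falls first
    ack = 0
    trace.append((req, ack))  # phase 4: Ack falls, back to the initial state
    return trace


print(four_phase_handshake())
```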
+Handshake components interact with their environment using handshake
+channels. One can build handshake components implementing language con-
+structs. Fig. 13.3 shows two examples: the sequencer and the parallel compo-
+nent.
+The sequencer, when activated via a, performs first a handshake via b and
+then via c. It is used to control the sequential execution of commands
+connected to b and c. After receiving a request along a, it sends a request
+along b, waits for the corresponding acknowledge, then sends a request along c,
+waits for the acknowledge on c, and finally signals completion of its operation
+by sending an acknowledge along channel a.
+The parallel component, when activated by a request along a, sends requests
+along channels b and c concurrently, waits until both acknowledges have
+arrived, and then sends an acknowledge along channel a.
+
+Figure 13.3. Handshake components: sequencer (left) and parallel (right).
+[Diagram not reproduced: a SEQ and a PAR component, each with channels a, b and c.]
+
+Figure 13.4. Handshake circuit for while loop.
+[Diagram not reproduced: a DO component connected to Guard and Command.]
+
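+The behaviour of the two components can be mimicked in Python. This is a
+loose analogy under assumptions, not a circuit model: calling a function
+stands for a request, returning from it stands for the acknowledge, and
+threads stand in for concurrent handshakes.

```python
import threading


def sequencer(b, c):
    """Activated via a: handshake on b, then on c; returning acknowledges a."""
    b()
    c()


def parallel(b, c):
    """Activated via a: start b and c together, wait for both, then acknowledge a."""
    workers = [threading.Thread(target=b), threading.Thread(target=c)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


seq_log, par_log = [], []
sequencer(lambda: seq_log.append("b"), lambda: seq_log.append("c"))
parallel(lambda: par_log.append("b"), lambda: par_log.append("c"))
print(seq_log, sorted(par_log))
```

+The sequencer fixes the order of b and c, whereas the parallel component
+guarantees only that both have completed before a is acknowledged.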
+Components for storage of data (variables) and operation on data (such as
+addition and bit-manipulation) can also be constructed. Tangram programs
+are compiled into handshake circuits (composition of handshake components)
+in a syntax-directed way (see also Chapter 8 on page 123). For instance, the
+compilation of a while loop in Tangram, which is written as
+
+do
+Guard
+then
+Command
+od
+
+results in the handshake circuit shown in Fig. 13.4. The do-component, when
+activated, collects the value of the guard through a handshake on its active data
+port. When this guard is false, it completes the handshake on its passive port,
+otherwise it activates the command through a handshake on its active control
+port, and after this has completed, re-evaluates the guard to start a new cycle.
+Details about handshake circuits, the compilation from Tangram into this
+intermediate architecture, the four-phase handshake protocols that are applied,
+and the gate-level implementation of handshake components can be found
+in [141, 135, 112].
+
+13.2.3 GCD algorithm
+
+When designing in Tangram, the emphasis for an initial design typically is
+on functional correctness. When this is the only point of attention, the result
+is generally too large, too slow, and too power hungry. The design of a suitable
+datapath and control architecture, targeting some optimization criteria or
+cost function, can be approached in a transformational way. One can use the
+Tangram tool set to evaluate and analyze the design, and based on that, decide
+which transformations to apply. The transparency of the silicon compiler
+(‘What you program is what you get’) helps in predicting the effect of these
+transformations.
+
+int = type [0..255]
+& gcd_gc : main proc (in?chan <<int,int>> & out!chan int).
+begin x,y:var int ff
+| forever do
+in?<<x,y>>
+; do x<>y then
+if x<y then y:=y-x
+else x:=x-y
+fi
+od
+; out!x
+od
+end
+
+Figure 13.5. GCD algorithm in Tangram.
+
+The GCD algorithm is used as an example to illustrate VLSI programming
+based on a transparent silicon compiler (see also the discussion in section 8.3.3
+on page 127). We start with the algorithm of Fig. 13.5, which is functionally
+correct but far from optimal when it comes to implementation cost in VLSI.
+The corresponding handshake circuit is shown in Fig. 13.6.
+The cost report for this GCD design, as presented by the Tangram toolset, is
+the following:
+
+celltype          #occ     area   gate eq. [eqv]     (%)
+--------------------------------------------------------------------
+Delay Matchers      19     2052        38.0         12.5
+Memory              16     3744        69.3         22.8
+C-elements          12     1242        23.0          7.5
+Logic               81     9414       174.3         57.2
+--------------------------------------------------------------------
+Total:             128    16452       304.7        100.0
+
+An important aspect of the design is that it contains four operators in the
+datapath. We can improve this by changing the Tangram description in such a
+way that only one subtractor is needed, instead of two. A way to achieve this
+is to change the timing behaviour of the algorithm, and use a higher number
+of iterations of the do-loop by either swapping or subtracting the two numbers
+x and y. This requires that x and y are stored in a single flip-flop variable.
+The new Tangram algorithm thus obtained is shown in Fig. 13.7.
+
+Figure 13.6. Compiled handshake circuit for initial GCD program.
+[Diagram not reproduced: labels include the variables x and y, mux and dmx
+components, a do component, a sequencer (;) and the ports in and out.]
+
+int = type [0..255]
+& gcd_gc : main proc (in?chan <<int,int>> & out!chan int).
+begin
+xy : var <<int,int>> ff
+& x = alias xy.0
+& y = alias xy.1
+| forever do
+in?xy
+; do x<>y then
+if x<y then xy:= <<x,y-x>>
+else xy:= <<y,x>>
+fi
+od
+; out!x
+od
+end
+
+Figure 13.7. GCD algorithm in Tangram with optimized control.
+
+Its associated gate-level cost report has improved from 305 to 274 gate equivalents.
+
+celltype          #occ     area   gate eq. [eqv]     (%)
+--------------------------------------------------------------------
+Delay Matchers      14     1512        28.0         10.2
+Memory              16     3744        69.3         25.3
+C-elements          10     1008        18.7          6.8
+Logic               86     8532       158.0         57.7
+--------------------------------------------------------------------
+Total:             126    14796       274.0        100.0
+
+An additional transformation is to compute x<y and y-x using only one op-
+erator, and to combine the two assignments to variable xy into one assignment,
+so as to further simplify the control, at the price of always requiring the worst-
+case computation time for the conditional expression, even if it involves just a
+swap of x and y. Furthermore, one can allow one additional step in the com-
+putation, so that the termination condition of the loop simplifies from x<>y to
+y<>0. This final implementation is shown in Fig. 13.8; its handshake circuit is
+shown in Fig. 13.9, in which the datapath operations have been represented in
+an abstract way, rather than as separate components.
+The handshake circuit of the optimized design has a simpler structure than
+that of the initial design. The number of logic blocks (and, in the single-rail
+implementation, the number of delay-matching gate chains) has been reduced
+from four to two. This improvement is also apparent in the area statistics given
+below.
+
+
+int = type [0..255]
+& gcd_gc : main proc (in?chan <<int,int>> & out!chan int).
+begin
+xy : var <<int,int>> ff
+& x = alias xy.0
+& y = alias xy.1
+& comp: func(). (y-x) cast <<int,bool>>
+| forever do
+in?xy
+; do y<>0 then
+xy:= if -comp.1 then <<x,comp.0>> else <<y,x>> fi
+od
+; out!x
+od
+end
+
+Figure 13.8.
+GCD algorithm in Tangram with optimized datapath.
+
+Figure 13.9. Compiled handshake circuit for optimized GCD program.
+[Diagram not reproduced: labels include the variable xy with its parts x and y,
+a mux, a function block f on x and y, a comparison with 0, a do component,
+a sequencer (;) and the ports in and out.]
+
+
+celltype          #occ     area   gate eq. [eqv]     (%)
+--------------------------------------------------------------------
+Delay Matchers      10     1080        20.0          9.4
+Memory              16     3744        69.3         32.7
+C-elements           6      576        10.7          5.0
+Logic               42     6048       112.0         52.8
+--------------------------------------------------------------------
+Total:              74    11448       212.0        100.0
+
+One may observe that area-wise, the design has improved in the control (the
+number of C-elements was reduced from 12 to 6), in the logic (the area for
+combinational logic was reduced from 174 to 112 gate equivalents), and in the
+total number of delay elements required (which was reduced from 19 to 10).
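+
+The trade between area and iteration count can be checked with a software
+model of the three versions. The Python sketch below makes two assumptions:
+inputs are strictly positive, and the test diff >= 0 plays the role of the
+negated flag comp.1 in Fig. 13.8, taken here to be the sign of y-x. The
+function names are illustrative only.

```python
import math


def gcd_initial(x, y):
    """Fig. 13.5: two subtractors, loop until x equals y."""
    steps = 0
    while x != y:
        if x < y:
            y -= x
        else:
            x -= y
        steps += 1
    return x, steps


def gcd_single_subtractor(x, y):
    """Fig. 13.7: either subtract (y-x) or swap, one subtractor only."""
    steps = 0
    while x != y:
        x, y = (x, y - x) if x < y else (y, x)
        steps += 1
    return x, steps


def gcd_single_operator(x, y):
    """Fig. 13.8: one y-x operator yields both difference and comparison."""
    steps = 0
    while y != 0:
        diff = y - x
        x, y = (x, diff) if diff >= 0 else (y, x)
        steps += 1
    return x, steps


for variant in (gcd_initial, gcd_single_subtractor, gcd_single_operator):
    print(variant.__name__, variant(12, 18))
```

+For the inputs 12 and 18 all three versions return 6, in 2, 3 and 4
+iterations respectively: the smaller circuits pay for their area saving with
+extra loop iterations.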
+
+13.3. Opportunities for asynchronous circuits
+
+When the asynchronous circuits generated by the Tangram compiler are
+compared to synchronous ones, three differences stand out, leading to four
+attractive properties of these asynchronous circuits.
+
+1 The subcircuits in a synchronous circuit are clock-driven, whereas they
+are demand-driven in an asynchronous one. This means that the subcir-
+cuits in an asynchronous circuit are only active when and where needed.
+Asynchronous circuits will therefore generally dissipate less power than
+synchronous ones.
+
+2 The operations in a synchronous circuit are synchronized by a central
+clock, whereas they are synchronized by distributed handshakes in an
+asynchronous circuit. Therefore
+
+a) a synchronous circuit shows large current peaks at the clock edges,
+whereas the power consumption of an asynchronous circuit is more
+uniformly distributed over time;
+
+b) the strict periodicity of the current peaks in a synchronous circuit
+leads to higher clock harmonics in the emission spectrum, which
+are absent in the spectrum of an asynchronous design.
+
+3 Synchronous circuits use an external time reference, whereas asynch-
+ronous circuits are self-timed. This means that asynchronous circuits
+operate over a wide range of the supply voltage (for instance, from 1 up
+to 3.3 V) while automatically adapting their speed. This property, called
+automatic performance adaptation, implies that asynchronous circuits
+are resilient to supply voltage variations. It can also be used to reduce
+the power consumption by adaptive voltage scaling, which means adapt-
+ing the supply voltage to the performance required [100]. Adaptive volt-
+age scaling techniques are also applied in synchronous circuits, but then
+special measures must be taken to adapt the clock frequency.
+
+Asynchronous circuits also have drawbacks. The most important one is that
+these circuits are unconventional: designers and mainstream tools and libraries
+are all oriented towards synchronous design methods. Additional drawbacks
+of asynchronous circuits come from the fact that they use gates to control reg-
+isters (latches and flip-flops), instead of the relatively straightforward clock-
+distribution network in synchronous circuits. Although this enables the low
+power consumption it also leads to circuits that are typically larger, slower,
+and harder to test. Testability (for fabrication defects) is probably the most
+fundamental issue of these. For a discussion on testing asynchronous circuits
+we refer to [61, 120].
+Property 2.b was the main reason for Philips Semiconductors to design a
+family of asynchronous pager chips [114]. However, as is addressed in the next
+section, it is the other three properties that can be used most advantageously in
+contactless smart card chips.
+
+13.4. Contactless smartcards
+
+Contactless smart cards have a number of advantages when compared to
+contacted ones: they are convenient and fast to use, insensitive to dirt and
+grease and, since their readers have no slots, they are less amenable to vandal-
+ism.
+The communication between a contactless smart card and reader is estab-
+lished through an electromagnetic field, emitted by the reader. The card has a
+coil through which it can retrieve energy from this field. The amount of energy
+that can be retrieved depends on the distance to the reader, the number of turns
+in the coil, and the orientation of the card in the field.
+Fig. 13.10 shows the functional parts of a contactless smartcard consisting
+of a VLSI circuit (in the dotted box) and an external coil. The tuned circuit
+formed by the coil and capacitor C0 is used for three purposes:
+
+receiving power;
+
+receiving the clock frequency (equal to the carrier frequency); and
+
+supporting the communication.
+
+The complete circuit consists of a power supply unit and a digital circuit with
+a buffer capacitor (C1) for storing energy.
+Several standards for contactless smart cards currently exist:
+
+ISO/IEC 10536, which specifies close coupling cards, operating at a distance
+of 1 cm from the reader.
+
+ISO/IEC 14443, for so-called proximity integrated circuit cards (PICCs),
+operating at a distance of up to 10 cm from the reader, typically using 5
+turns in the on-card coil. This standard defines two types, A and B, the
+main characteristics of which are given in Table 13.1.
+
+ISO/IEC 15693, which specifies vicinity integrated circuit cards (VICCs),
+operating at some 50 cm from the reader, and typically requiring a coil with
+a few hundred turns.
+
+Figure 13.10. Contactless smart card.
+[Diagram not reproduced: the coil with capacitor C0, the Power Supply Unit,
+the buffer capacitor C1 and the Digital Circuit.]
+
+Table 13.1. Main characteristics of ISO/IEC 14443 standard.
+
+ISO 14443 standard            A (Mifare)      B
+-------------------------------------------------------------
+Carrier frequency             13.56 MHz       13.56 MHz
+Throughput (up and down)      106 kbit/sec    106 kbit/sec
+Down link (reader to card)    ASK 100%        ASK 10%
+  encoding                    Miller          NRZ
+Up link (card to reader)      ASK             BPSK
+  frequency                   847.5 kHz       847.5 kHz
+  modulation                  Manchester      NRZ
+
+The Mifare [63] standard (ISO/IEC 14443 type A) has hundreds of millions
+of cards in operation today. Fig. 13.11 shows a Mifare card with both the chip
+and the coil visible. Mifare is a proximity card (it can be used up to 10 cm
+from the reader) supporting two-way communication. Performance is impor-
+tant, since the transaction time must be less than 200 msec. One of the first
+companies to deploy Mifare technology en masse was the Seoul Bus Asso-
+ciation, which has millions of such bus cards in use, generating hundreds of
+millions of transactions per month.
+This chapter describes an asynchronous Mifare smart card IC that was reported
+earlier in [73]. Both synchronous [116] and asynchronous [2] circuits for smart
+cards of the ISO/IEC 14443 type B standard have also been reported. Due to
+
+
+234
+Part III: Large-Scale Asynchronous Designs
+
+the 100% ASK modulation scheme in the type A standard, a Mifare IC is
+exposed to periods during which no power comes in at all, in contrast to the
+type B standard, which is based on 10% ASK modulation.
+Since on average (over time) contactless smart card chips receive only a few
+milliwatts of power, power efficiency is very important. Although low power
+is also important in battery-powered devices, there are two crucial differences
+between these two types of device.
+
+1 To maximize the battery life-time in battery-powered devices one should
+minimize the average power consumption. In contactless devices, how-
+ever, one should in addition minimize the peak power, since the peaks
+must be kept below a certain level, which depends on the incoming
+power and the buffer capacitor.
+
+2 The supply voltage is nearly constant in battery-powered devices whereas
+in contactless ones it may vary over time during a transaction due to fluc-
+tuations in both the incoming and the consumed power.
+
+Figure 13.11.
+Mifare card, showing IC (bottom left) and coil.
+
+In the bullets below, we give some facts about conventional synchronous
+chips for contactless smart cards, which, as we will see later, offer opportuni-
+ties for improvement by using asynchronous circuits.
+
+A synchronous circuit runs at a fixed speed dictated by the clock, de-
+spite the fact that both the incoming and the effectively consumed power
+vary over time. Synchronous circuits must, therefore, be designed so as
+to allow the most power-hungry operations to be performed when minimum
+power is coming in. Consequently, if too much power is being
+received, that superfluous power is thrown away. If, on the other hand,
+too little power is being received, the supply voltage drops making the
+circuit slower and, as soon as the circuit has become too slow to meet
+the time requirements set by the clock, the transaction must be canceled.
+For this reason contactless smart card chips contain subcircuits that detect
+when the voltage drops below a certain threshold and then abort the
+transaction.
+
+Figure 13.12. Global design of the smart-card circuit (blocks: 80C51 microcontroller, DES, RSA, UART, RAM, ROM, EEPROM).
+
+Currently, the performance of the microcontroller in a contactless smart
+card chip is usually not limited by the speed of the circuit, but by the
+RF-power being received.
+
+A synchronous circuit requires a buffer capacitor of several nanofarads
+and the area needed for such a capacitor is of the same order of magni-
+tude as the area needed for the microcontroller.
+
+The communication from the smart card to the reader is based on mod-
+ulating the load, which implies that normal functional load fluctuations
+may interfere with the communication.
+
+13.5.
+The digital circuit
+
+We have built the digital circuit shown in Fig. 13.12 that consists of:
+
+an 80C51 microcontroller;
+
+three kinds of low-power memory, the sizes and access times of which
+are given in Table 13.2 (64 bytes can be written simultaneously in one
+write access to the EEPROM);
+
+two encryption coprocessors:
+
+– an RSA converter [119] for public key conversions and
+
+– a triple DES converter [96] for private key conversions;
+
+a UART for the external communication.
+
+The EEPROM contains program parts as well as data such as encryption keys
+and e-money. Both the ROM and the RAM are equipped with matching delay
+lines and for the EEPROM we designed a similar function based on a counter.
+These delay lines have been used to provide all three memories with a hand-
+shake interface, which made it extremely easy to deal with the differences in
+access time as well as variations in both temperature and supply voltage. An
+additional advantage is that the controller automatically runs faster when exe-
+cuting code from ROM than when executing code from EEPROM.
+The circuit is meant to be used in a dual-interface card, which is a card with
+both a contacted and a contactless interface. Apart from the RSA converter,
+which will not be used in contactless operation, all circuits are asynchronous.
+In contactless operation, the average supply voltage will be about 2 V. The
+simulations, however, are done at 3.3 V, which is the voltage at which the
+library has been characterized.
+
+Table 13.2.
+Memory sizes and access times.
+
+Memory
+Size Access time [ns]
+type
+[kbyte]
+read
+write
+
+RAM
+2
+10
+10
+ROM
+38
+30
+EEPROM
+32
+180
+4,000
+
+13.5.1
+The 80C51 microcontroller
+
+The 80C51 microcontroller is a modified version of the one described in
+[144, 143]. The four most important modifications are described below.
+To deal with the slow memories a prefetch unit has been included in the
+80C51 architecture. At 3.3 V the average instruction execution time in free-
+running mode is about 100 ns provided it takes no time to fetch the code from
+memory. If, however, code is fetched from the EEPROM and the microcon-
+troller has to wait during the read accesses, the performance would be drasti-
+cally reduced, since most instructions are one or two bytes long, taking 180 or
+360 ns to fetch. To avoid this performance degradation a form of instruction
+prefetching has been introduced in which a process running concurrently to
+the 80C51 core is fetching code bytes as long as a two byte FIFO is not full.
+The prefetch unit gives an increase in performance of about 30%. A simplified
+version of the prefetch unit is described in Section 13.5.2.
+We also introduced early write completion, which means that the microcon-
+troller continues execution as soon as it has issued a write access. This has
+been introduced to prevent the microcontroller from waiting during the 4 msec
+it takes to do a write access to the EEPROM (for instance to change the e-
+money), but also to speed up the write accesses to the RAM. To exploit this
+feature when doing a write access to the EEPROM, the corresponding code
+must be in the ROM.
+The controller has been provided with an immediate halt input signal by
+which its execution can be halted within a short time. This provision is nec-
+essary to deal with the fact that the information, which the reader sends to the
+card, is coded by suppressing the carrier during periods of 3 µsec. Since the
+card does not receive any power during these periods, the controller has to be
+halted immediately (only some basic functions continue to operate). In the
+synchronous design this halting function came naturally, since the clock would
+stop during these periods.
+We have introduced a quasi synchronous mode, in which the microcontroller
+is, at instruction level, fully timing compatible with its synchronous counterpart.
+In this mode, the asynchronous microcontroller waits after each instruction
+until the number of clock ticks required by a synchronous version has
+elapsed. This mode is necessary when time-dependent functions are designed
+in software. Since this mode is under software control, the microcontroller can
+easily switch modes depending on the function it is executing. This feature
+was also of great help to demonstrate the guaranteed performance, which is
+the maximum clock rate at which each instruction terminates within the given
+number of clock ticks. For most programs, the free-running performance is
+about twice as high as the guaranteed performance.
+We have compared the asynchronous circuit as compiled from Tangram with
+a synchronous circuit that is functionally equivalent, and which has been com-
+piled from synthesizable VHDL to the same CMOS technology. These ICs
+have a comparable performance.
+The asynchronous microcontroller nicely
+demonstrates the three properties of asynchronous circuits that we want to ex-
+ploit in the design of the smart-card chip:
+
+The average power consumption of the asynchronous 80C51 is about
+three times lower than the power consumption of its synchronous coun-
+terpart when delivering the same performance at the same supply volt-
+age.
+
+Fig. 13.13 shows the current peaks of both the synchronous and the asyn-
+chronous 80C51 at 3.3 V, where the asynchronous version is running in
+quasi synchronous mode, giving a performance that is 2.5 times higher
+
+than the synchronous design (the synchronous version runs at 10 MHz
+and the asynchronous version at 25 MHz). Despite the fact that the figure
+does not give a fair impression of the average power being consumed,
+it clearly shows that the current peaks of the asynchronous 80C51 are
+about five times smaller than those of the synchronous equivalent.
+
+Figure 13.13. Current peaks of 80C51 microcontroller (I [mA] versus time [µsec] for the synchronous and the asynchronous version).
+
+The performance adaptation property of asynchronous circuits is demon-
+strated in Fig. 13.14, which shows the free-running performance of the
+microcontroller, when executing code from ROM, as a function of the
+supply voltage. As is expected, the performance depends linearly on the
+supply voltage. When the supply voltage goes up from 1.5 to 3.3 V, the
+performance increases from 3 to 8.7 MIPS (about a factor 3). Since the
+ROM containing the program does not function properly when the sup-
+ply voltage is below 1.5 V, we could not measure the performance for
+lower values. We observed, however, that the DES coprocessor, which
+does not need a memory, still functions correctly at a supply voltage
+level as low as 0.5 V.
+
+The figure also shows the supply current as a function of the supply
+voltage. Note that the current increases in this range from 0.7 to 6 mA
+(about a factor 9). Since in CMOS circuits the current is the product
+of the transition rate (performance) and the charge being switched per
+transition (both of which depend linearly on the supply voltage), the
+current increases with the square of the voltage. From this it follows that
+the power, being the product of the current and the voltage, goes up with
+the cube of the voltage.
+
+From this data one can compute the third curve showing the energy
+needed to execute an instruction, which increases with the square of the
+supply voltage from 0.35 to 2.25 nJ.
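The energy figures quoted above follow from the measured voltage, current and performance. The short Python sketch below redoes that arithmetic for the two end points given in the text; the function name and the layout of the data are illustrative choices, not part of the original design flow.

```python
# End points quoted in the text: (supply voltage [V], supply current [mA], performance [MIPS]).
OPERATING_POINTS = [
    (1.5, 0.7, 3.0),
    (3.3, 6.0, 8.7),
]


def energy_per_instruction_nj(voltage_v, current_ma, mips):
    """Energy per instruction in nJ: power in mW divided by the rate in MIPS."""
    power_mw = voltage_v * current_ma
    # mW / (10^6 instructions per second) = 10^-9 J per instruction = nJ
    return power_mw / mips


if __name__ == "__main__":
    for voltage_v, current_ma, mips in OPERATING_POINTS:
        energy = energy_per_instruction_nj(voltage_v, current_ma, mips)
        print(f"{voltage_v} V: {energy:.2f} nJ per instruction")
```

With the rounded inputs from the text this gives about 0.35 nJ at 1.5 V and about 2.3 nJ at 3.3 V, in line with the quoted range of 0.35 to 2.25 nJ.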
+
+Figure 13.14. Measured performance of the asynchronous 80C51 for various supply voltages.
+
+13.5.2
+The prefetch unit
+
+Fig. 13.15 gives the Tangram code of a simplified version of the prefetch
+unit. The prefetch unit communicates with the 80C51 core through two chan-
+nels: it receives the address from which to start fetching code bytes via channel
+StartAddress and then it sends these bytes through channel CodeByte. Since
+the prefetch unit plays the passive role in both communications, it can probe
+each channel to see whether the core has started a communication through that
+channel. The state of the prefetch unit consists of program counter, pc, and
+a two-place buffer, which is implemented by means of an array Buffer, an
+integer count, and two one-bit pointers getptr and putptr.
+
+forever do
+  sel probe(StartAddress)
+      then ( StartAddress?pc
+          || putptr := getptr
+          || count := 0
+          || AbortMemAcc()
+           )
+  or probe(CodeByte) and (count>0)
+      then CodeByte!Buffer[getptr]
+         ; ( getptr := next(getptr)
+          || count := count-1
+           )
+  or MemAck
+      then Buffer[putptr] := MemData
+         ; ( putptr := next(putptr)
+          || count := count+1
+          || pc := pc+1
+          || CompleteMemAcc()
+           )
+  les
+  ; if (count<2) and -MemReq
+    then MemReq := true
+    fi
+od
+
+Figure 13.15.
+Tangram code of a simplified version of the prefetch unit.
+
+The prefetch unit executes an infinite loop and in each step it first executes a
+selection command (denoted by sel ... les), in which it can select among three
+guarded commands, which are separated by the keyword or. Each guarded
+command is of the form
+
+Guard
+then
+Command,
+
+in which Guard is a Boolean condition, and Command a command that may
+be executed only if the guard holds. A command is said to be enabled if the
+corresponding guard holds. Executing a selection command implies waiting
+until at least one of the commands is enabled, then selecting such a command
+—in an arbitrated choice— and executing it.
+In the first guarded command, channel StartAddress is probed to find out
+whether the core is sending a new start address. In that case, program counter
+pc is set to the address received, the buffer is flushed, and a possible outstand-
+ing memory access is aborted (by resetting both MemReq and the delay counter).
+All four subcommands in this guarded command are executed in parallel (‘A
+|| B’ means execute commands A and B concurrently, whereas ‘A ; B’ means
+execute A and B sequentially).
+
+The second guarded command takes care of sending the next program byte
+via channel CodeByte to the core if the core is ready to receive that byte and
+the buffer is not empty. The third guarded command gets enabled if MemAck
+goes high indicating that the data signals in a read access are valid. In that
+case the value read from memory is put in the buffer after which the memory
+handshake is completed.
+After each event, if the buffer is not full and no memory access is being
+performed a next memory access is started. Since (count<2) ∨ -MemReq is
+an invariant of the loop, the last (conditional) command in the loop can be
+simplified to unconditional assignment MemReq:= (count<2).
+Note that the value (pc-count) is equal to the program counter in the core,
+since it is set to the destination address in the case of a jump, increased by
+1 if a code byte is transferred to the core, and kept invariant if a code byte
+is read from memory. Therefore the core does not need to hold the program
+counter and instead, when the information is needed for a relative branch, it can
+retrieve the counter value from the prefetch unit. Clearly, this feature requires
+an extension of the Tangram code shown in Fig. 13.15.
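The behaviour of the simplified prefetch unit can be mimicked in software. The Python sketch below is a sequential approximation under stated assumptions: each guarded command becomes a method that the caller invokes explicitly, the code memory is a plain list, and the names PrefetchUnit, start_address, code_byte, mem_ack and core_pc are hypothetical. It models neither the arbitration nor the handshake timing of the Tangram original.

```python
class PrefetchUnit:
    """Sequential sketch of the simplified prefetch unit with a two-place buffer."""

    def __init__(self, memory):
        self.memory = memory      # hypothetical code memory, indexed by address
        self.pc = 0               # address of the next byte to fetch
        self.buffer = [0, 0]      # two-place buffer
        self.getptr = 0           # one-bit read pointer
        self.putptr = 0           # one-bit write pointer
        self.count = 0            # number of valid bytes in the buffer
        self.mem_req = False
        self._update_request()

    def _update_request(self):
        # Simplified last command of the loop: MemReq := (count<2).
        self.mem_req = self.count < 2

    def start_address(self, address):
        # First guarded command: take the new address, flush the buffer
        # and abort a possibly outstanding memory access.
        self.pc = address
        self.putptr = self.getptr
        self.count = 0
        self.mem_req = False
        self._update_request()

    def code_byte(self):
        # Second guarded command: enabled only if the buffer is not empty.
        if self.count == 0:
            raise RuntimeError("buffer empty: the core would wait in the handshake")
        byte = self.buffer[self.getptr]
        self.getptr ^= 1          # next() on a one-bit pointer
        self.count -= 1
        self._update_request()
        return byte

    def mem_ack(self):
        # Third guarded command: store the byte read and complete the access.
        if not self.mem_req:
            return                # no access outstanding, nothing to acknowledge
        self.buffer[self.putptr] = self.memory[self.pc]
        self.putptr ^= 1
        self.count += 1
        self.pc += 1
        self.mem_req = False      # stands in for CompleteMemAcc()
        self._update_request()

    def core_pc(self):
        # The value (pc-count) equals the program counter seen by the core.
        return self.pc - self.count
```

For example, after start_address(4) and two calls to mem_ack() the buffer is full, core_pc() is still 4, and a call to code_byte() advances it to 5.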
+
+13.5.3
+The DES coprocessor
+
+A transaction may need up to ten single DES conversions, where each con-
+version takes about 10 ms if it is executed in software. Therefore a hardware
+solution is needed, since these conversions would otherwise consume about
+half of the transaction time.
+Fig. 13.16 shows the datapath of the DES coprocessor. The processor sup-
+ports both single- and triple-DES conversions and, for the latter type of con-
+version, contains two keys: a foreground and a background key. Single-DES
+conversions use the foreground key, whereas triple-DES conversions use the
+foreground key for the first and third conversion and the background key for
+the second conversion. The foreground key is stored in register CD0 consisting
+of 56 flipflops (the DES key size is 56 bits), whereas the background key re-
+sides in variable CD1 consisting of 56 latches. The text value resides in variable
+LR consisting of 64 latches (DES words contain 64 bits).
+A single-DES conversion consists of 16 steps and, in each step, the key is
+permuted and a new text value is computed from the old text value and the key.
+In order to have the key return to its original value at the end of a conversion,
+the key makes two basic permutations in 12 of the 16 steps and only one in
+the remaining 4, where 28 basic permutations are needed for a complete cycle.
+The permutations are performed in flipflop register CD0.
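The count of 28 basic permutations can be checked against the key schedule of the DES standard, in which each basic permutation is a one-bit left rotation of the two 28-bit key halves. The Python sketch below uses the standard per-round shift amounts (an assumption taken from the DES specification, not from this chapter) and confirms that both halves return to their original value after 16 steps; the function names are illustrative.

```python
# Left-rotation amounts per round as defined in the DES standard.
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]
HALF_BITS = 28
MASK = (1 << HALF_BITS) - 1


def rotate_left(half, amount):
    """Rotate a 28-bit key half left by `amount` bits."""
    return ((half << amount) | (half >> (HALF_BITS - amount))) & MASK


def run_key_schedule(c, d):
    """Apply all 16 rotation steps to the key halves C and D."""
    for amount in SHIFTS:
        c = rotate_left(c, amount)
        d = rotate_left(d, amount)
    return c, d


if __name__ == "__main__":
    print("double rotations:", SHIFTS.count(2))
    print("single rotations:", SHIFTS.count(1))
    print("total rotations:", sum(SHIFTS))
```

Twelve steps rotate twice and four rotate once, giving 28 one-bit rotations in total, which is a full cycle of a 28-bit half.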
+
+Figure 13.16. DES coprocessor architecture (datapath elements: CD0(56), CD1(56), cd(56), LR(64), lr(64), Data(8), DES).
+
+Most of the area is taken by the combinational circuit called DES. Since this
+circuit is also dominant in power dissipation, one should minimize the number
+of transitions at its inputs. For this purpose, we have introduced two latch
+registers: cd for the key and lr for the text. If two basic permutations are done
+in one step, cd hides the effect of the first one from combinational circuit DES.
+In addition, all inputs of combinational circuit DES change only once in each
+step by loading the two registers lr and cd simultaneously and then storing the
+result in register LR as described by the following piece of Tangram text.
+
+( lr:= LR || cd:= CD0 ) ; LR:= DES(lr,cd)
+
+Therefore, latch register lr also serves as a kind of slave register. Latch
+register cd also serves a functional purpose, since the two keys are swapped by
+executing the following three transfers.
+
+cd:= CD0 ; CD0:= CD1 ; CD1:= cd
+
+The size of the DES coprocessor is 3,250 gate equivalents, of which 57%
+is taken by the combinational logic and 35% by latches and flip-flops. Con-
+sequently, the overhead in area due to the asynchronous design style (delay
+matching and C-elements) is marginal at 8%. At 3.3 V, a single-DES conver-
+sion takes 1.25 µs and 12 nJ.
+
+Fig. 13.17 shows the simulated current of the DES coprocessor at 3.3 V
+(the microcontroller is active before and after the DES computation). The real
+current peaks will be much smaller due to a lower supply voltage (the DES
+processor functions properly at a supply voltage as low as 0.5 V) as well as the
+buffer capacitor (the resolution in the simulation is 1 ns).
+
+Figure 13.17. DES coprocessor current at 3.3 V (i [mA] versus t [µs]).
+
+The conversion time, of a few microseconds, is so small that we used the
+handshaking mechanism to obtain synchronization between the microcontroller
+and the coprocessor. After starting the coprocessor, the microcontroller can
+continue executing instructions, and only when reading the result will it be
+held up in a handshake until the result is available. Note that a synchronous
+design would require a form of busy-waiting.
+
+13.6.
+Results
+
+Fig. 13.18 shows the layout of the chip, which is in a five-layer metal,
+0.35 µm technology and has a size of 4.52 × 4.16 ≈ 18 mm2, including the
+bond pads. Many bond pads are only included for measurement and evalua-
+tion purposes. A production chip only needs about 10 bond pads.
+The two horizontal blocks on top form the buffer capacitor (in a production
+chip, the capacitor would only require about one quarter of the area). The
+memories are on the next row, from left to right: two RAMs, one ROM and
+the EEPROM, which is the large block to the right. The asynchronous circuit
+is located in the lower left quadrant, near the centre.
+Table 13.3a gives the area of the blocks constituting the contactless dig-
+ital circuit, which is the asynchronous circuit together with the memories.
+
+The other modules are either synchronous or analog circuits, where the syn-
+chronous modules are not used in contactless operation. From this table it
+follows that the asynchronous logic takes only 12% of the total contactless
+digital circuit.
+
+Figure 13.18. Layout of smart card chip.
+
+Table 13.3a. Areas of the contactless digital circuit blocks.
+
+Block          Area [mm2]
+RAM            1.2
+ROM            1.0
+EEPROM         5.6
+Async. circ.   1.1
+Total          8.9
+
+Table 13.3b. Areas of the asynchronous modules.
+
+Module         Area [GE]
+CPU            7,800
+Pref. Unit     700
+DES            3,250
+UART           2,040
+Interfaces     3,680
+Timer          1,080
+Total          18,550
+
+The sizes of the different asynchronous modules are given in Table 13.3b.
+In the standard cell library used, a gate equivalent (GE) is 54 µm2 with a typical
+layout density of 17,500 gates per mm2.
+
+Table 13.3c. Power of the contactless digital circuit.
+
+Block    Power
+Core     56%
+ROM      27%
+RAM      17%
+
+Table 13.3d. Effect of asynchronous design on power and area at different levels.
+
+Level            Power    Area
+Async. circ.     -70%     +18%
+Async. + Mem.    -60%     +2%
+
+Table 13.3c shows the power dissipation of the digital circuit blocks when
+the controller is executing code from ROM (being the normal situation).
+Table 13.3d shows the effects on power and area of an asynchronous design
+at two different levels. The asynchronous circuit gives a reduction in power
+dissipation of about 70% for 18% additional area. At the level of the contact-
+less digital circuit, however, we obtain a power reduction of 60% for only 2%
+additional area. Note that this analysis does not include the synchronous RSA
+converter and the analog circuits needed in a production chip, such as for in-
+stance the buffer capacitor and the power supply unit. Therefore at chip level
+the relative reduction in power dissipation will be about the same, whereas the
+overhead in area will be reduced even further.
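The step from the circuit-level figures to the figures for the whole contactless digital circuit can be retraced numerically. The Python sketch below does so under two assumptions that the text does not state explicitly: the shares in Table 13.3c describe the asynchronous chip, and a synchronous design would use the same memories with the same power and area. All names are illustrative.

```python
# Power shares from Table 13.3c, assumed to describe the asynchronous chip.
POWER_SHARE = {"core": 0.56, "rom": 0.27, "ram": 0.17}
# Block areas in mm2 from Table 13.3a.
AREA_MM2 = {"ram": 1.2, "rom": 1.0, "eeprom": 5.6, "async": 1.1}

CORE_POWER_REDUCTION = 0.70   # asynchronous logic versus synchronous logic
CORE_AREA_OVERHEAD = 0.18     # asynchronous logic versus synchronous logic


def system_power_reduction():
    """Power reduction at the level of logic plus memories."""
    memories = POWER_SHARE["rom"] + POWER_SHARE["ram"]
    sync_core = POWER_SHARE["core"] / (1.0 - CORE_POWER_REDUCTION)
    sync_total = sync_core + memories       # assumed synchronous equivalent
    async_total = POWER_SHARE["core"] + memories
    return 1.0 - async_total / sync_total


def system_area_overhead():
    """Area overhead at the level of logic plus memories."""
    async_total = sum(AREA_MM2.values())
    sync_logic = AREA_MM2["async"] / (1.0 + CORE_AREA_OVERHEAD)
    sync_total = async_total - AREA_MM2["async"] + sync_logic
    return async_total / sync_total - 1.0


if __name__ == "__main__":
    print(f"power reduction: {system_power_reduction():.0%}")
    print(f"area overhead: {system_area_overhead():.1%}")
```

Under these assumptions the power reduction comes out at a little under 60% and the area overhead at roughly 2%, consistent with Table 13.3d.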
+
+13.7.
+Test
+
+The testing of asynchronous circuits for manufacturing defects is known to
+be a difficult problem [61, 120]. The main problem is that asynchronous cir-
+cuits have many feedback loops that cannot be controlled by an external signal.
+This makes the introduction of scan testing expensive, and forces the designer
+to become involved in the development of a functional test, either in producing
+the patterns directly or in implementing the design-for-test measures.
+A functional test approach was chosen for the chip described in this chapter.
+During test, the microcontroller is connected to an external ROM that contains
+a test program. This program computes a signature, and a circuit is said to be
+defective if the signature is not correct. In addition, current measurements are
+performed to increase the test coverage.
+The functional tests were developed using the test for the 80C51 microcon-
+troller [144] as a starting point. Both this test and its extension were developed
+using the test-evaluation tool described in [152]. This tool evaluates the struc-
+tural test coverage (controllability and observability) of functional traces.
+
+For the datapath logic, the tool can be used to achieve a 100% coverage for
+the stuck-at-input fault model, even though actually achieving this may be a
+real challenge to the test engineer. For the 80C51 microcontroller, however,
+with the inherent controllability and observability of its registers and buses,
+this appears to be feasible.
+In the absence of full-scan, it is known that a 100% coverage for the stuck-
+at-input fault model can only be achieved for the asynchronous control logic
+by using a combination of functional patterns and current measurements in a
+circuit that has been modified so as to be pausable in the middle of carefully
+selected handshakes [120]. Since this modification has not been implemented
+in the smartcard circuit reported here, the fault coverage can never be 100%.
+The exact fault coverage of the traces that were used here is not known,
+as this would demand unrealistic levels of compute power using a verification
+tool, but is estimated to be around 90% for the asynchronous control and data-
+path subsystem.
+
+13.8.
+The power supply unit
+
+Fig. 13.19 shows the power supply unit consisting of a rectifier and a power
+regulator, which are both completely analog circuits. The design of the recti-
+fier is conventional, and of the regulator we discuss only those aspects of the
+behaviour that are relevant to the design of the digital circuit without going
+into the details of its design.
+
+Figure 13.19. Power supply unit (blocks: rectifier and power regulator; signals: i0, v0, i1, v1).
+
+To avoid interference with the communication, a power regulator has been
+designed that shows an almost constant load at its input. Fig. 13.20 shows
+Spice-level simulation results of such a power regulator when the input volt-
+age V0 is fixed at 5V. On the horizontal axis we have the activity (number of
+transitions per second) of the digital circuit. The input load is almost constant,
+since input current i0 is almost constant over the whole range.
+When the activity is low, output voltage V1 is constant at about 3V. In this
+range, too much power is coming in and the regulator functions as a voltage
+source with output current i1 increasing when the activity increases. The su-
+perfluous power is shunted to ground. However, i1 reaches a saturation point
+when it reaches i0. From this point on, no more power is shunted to ground and
+the regulator starts to function as a current source with output voltage V1 de-
+creasing when the activity increases. The regulator delivers maximum power
+in the middle of the range where both the outgoing voltage and the outgoing
+current are high. Note that these simulation results assume constant incoming
+RF power. The variations in the incoming RF power during a transaction, how-
+ever, are an additional source of fluctuations in V1, since these variations result
+in shifts of the saturation point.
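The two operating regions of the regulator can be illustrated with an idealized model. The Python sketch below is not derived from the Spice simulations: it assumes that the charge switched per transition is proportional to the supply voltage, that the input current is exactly constant, and it uses hypothetical values for the input current and the switched capacitance. Only the 3 V output level is taken from the text.

```python
V_MAX = 3.0            # output voltage in the voltage-source region [V], from the text
I0 = 1.0e-3            # hypothetical constant input current [A]
C_SWITCHED = 1.0e-13   # hypothetical capacitance switched per transition [F]


def saturation_activity():
    """Activity [transitions/s] at which the output current reaches the input current."""
    return I0 / (C_SWITCHED * V_MAX)


def regulator_output(activity):
    """Return (v1 [V], i1 [A]) of the idealized regulator for a given activity."""
    demand = activity * C_SWITCHED * V_MAX
    if demand <= I0:
        # Voltage-source region: surplus input current is shunted to ground.
        return V_MAX, demand
    # Current-source region: all input current goes to the load, the voltage drops.
    return I0 / (activity * C_SWITCHED), I0


if __name__ == "__main__":
    a_sat = saturation_activity()
    for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
        v1, i1 = regulator_output(factor * a_sat)
        print(f"{factor:4.2f} x a_sat: V1 = {v1:.2f} V, i1 = {i1 * 1e3:.2f} mA, "
              f"P = {v1 * i1 * 1e3:.2f} mW")
```

In this model the delivered power peaks at the saturation point, which matches the observation that the regulator delivers maximum power in the middle of the range.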
+
+Figure 13.20. Power regulator behaviour (i0 and i1 [mA] and V1 [V] versus activity).
+
+A power source with these characteristics burdens the designer of a syn-
+chronous circuit with the problem of trading off between performance and ro-
+bustness. Going for maximum performance means assuming a supply voltage
+of 3 V in which case a transaction must be aborted if the voltage drops below
+2.5 V, for example. On the other hand, if the designer opts for a more robust
+design by choosing 2 V as the operating voltage, performance is lost when
+the regulator delivers 3 V. Such trade-offs are not needed for an asynchronous
+circuit, since it always automatically gives the maximum performance for the
+power received.
+
+13.9.
+Conclusions
+
+We have designed, built and evaluated an asynchronous chip for contactless
+smart cards in which we have exploited the fact that asynchronous circuits:
+
+use little average power,
+
+show small current peaks, and
+
+operate over a wide range of the supply voltage.
+
+Measurements and simulations showed the following advantages of this de-
+sign when compared to a conventional synchronous one:
+
+The asynchronous circuit gives the maximum performance for the power
+received. This comes mainly from the fact that the asynchronous de-
+sign needs less of what is the main limiting factor for the performance,
+namely power. Compared to a synchronous design, the asynchronous
+circuit needs about 40% of the power for less than 2% additional area.
+In addition, the automatic speed adaptation property of asynchronous
+circuits saves the designer from trading off between performance and
+robustness. Due to this property the asynchronous circuit will give free-
+running instead of guaranteed performance, where the difference be-
+tween the two is about a factor two.
+
+The asynchronous design is more resilient to voltage drops, since it still
+operates correctly for voltages down to 1.5 V.
+
+The current peaks of an asynchronous circuit are less pronounced, mak-
+ing the requirements with respect to the buffer capacitor more modest.
+
+The combination of the power regulator with the asynchronous circuit
+gives little communication interference. In this case, the smaller current
+peaks and the self-adaptation property are of importance.
+
+Acknowledgements
+
+We gratefully acknowledge the other members of the Tangram team: Kees
+van Berkel, Marc Verra and Erwin Woutersen, and we thank Klaus Ully for
+helping us to get the DES converter right.
+
+
+Chapter 14
+
+AN ASYNCHRONOUS VITERBI
+DECODER*
+
+
+Linda E. M. Brackenbury
+Department of Computer Science, The University of Manchester
+
+lbrackenbury@cs.man.ac.uk
+
+Abstract
+Viterbi decoders are used for decoding data encoded using convolutional for-
+ward error correcting codes. Such codes are used in a large proportion of digital
+transmission and digital recording systems because, even when the transmitted
+signal is subjected to significant noise, the decoder is still able efficiently to de-
+termine the most likely transmitted data.
+This chapter describes a novel Viterbi decoder aimed at being power efficient
+through adopting an asynchronous approach. The new design is based upon se-
+rial unary arithmetic for the computation and storage of the metrics required;
+this arithmetic replaces the add-compare-select parallel arithmetic performed by
+conventional synchronous systems. Like all Viterbi decoders, a history of com-
+putational results is built up over many data bits to determine the data most likely
+to have been transmitted at an earlier time. The identification of a starting point
+to this tracing operation allows the storage requirement to be greatly reduced
+compared with that in conventional decoders where the starting point is random.
+Furthermore, asynchronous operation in the system described enables multiple,
+independent, concurrent tracing operations to be performed which are decoupled
+from the placing of new data in the history memory.
+
+Keywords:
+low-power asynchronous circuits, Viterbi, convolution decoder
+
+14.1.
+Introduction
+
+The PREST (Power REduction for Systems Technology) [1] project was a
+collaborative project where each partner designed a low power alternative to
+an industry standard Viterbi decoder. The Manchester University team’s aim
+was to obtain a power-efficient design through the use of asynchronous timing.
+The Viterbi decoder function [148] was chosen as it is a key function in
+digital TV and mobile communications. It detects and corrects transmission
+errors, outputting the data stream which, according to its calculations, is the
+most likely to have been transmitted. Viterbi coding is popular because it can
+deal with a continual data stream and requires no framing information such as
+a block start or finish. Furthermore, the way in which the output stream is
+constructed means that even if output data is incorrectly indicated, this situation
+will correct itself in time.
+
+*The Viterbi design work was supported by the EPSRC/MoD PowerPack project GR/L27930 and the EU
+PREST project EP25242, and this support is gratefully acknowledged.
+
+14.2.
+The Viterbi decoder
+
+14.2.1
+Convolution encoding
+
+In order to understand the functions performed by the decoder, the way the
+data is encoded needs to be described. This is illustrated in figure 14.1. The
+input data stream enters a shift register which is initially set to zero. The 2-
+bit register stores the previous two input bits and these are combined with the
+current bit in two modulo-2 adders to provide the two binary digits which form
+the encoded output for the current input bit.
+
+Figure 14.1.
+Four-state encoder.
+(The input i(t) passes through two 1-bit delays giving i(t−1) and i(t−2); two modulo-2 adders form the outputs X and Y.)
+
+For example, suppose that the input stream comprises 011 with the current
+input bit (‘1’) on the right and the oldest bit (‘0’) on the left, then the encoded
+X-output, which adds all three bits, is 0 and the encoded Y-output,
+which adds the current and oldest bit, is 1; so ‘01’ would be transmitted in this
+case. As each input bit causes two encoded bits to be transmitted, the code is
+referred to as a 1/2 rate code.
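The encoding rule above can be sketched in a few lines of Python. This is only an illustration of the four-state encoder as described in the text; the function name `encode_four_state` is an assumption made for this sketch and does not come from the original design.

```python
def encode_four_state(bits):
    """Rate-1/2 encoder sketch: X = i(t)+i(t-1)+i(t-2), Y = i(t)+i(t-2), modulo 2."""
    prev1 = prev2 = 0              # the 2-bit shift register starts at zero
    out = []
    for b in bits:
        x = b ^ prev1 ^ prev2      # X adds all three bits
        y = b ^ prev2              # Y adds the current and the oldest bit
        out.append((x, y))
        prev1, prev2 = b, prev1    # shift: current bit becomes i(t-1)
    return out


if __name__ == "__main__":
    # Input stream 011 (oldest bit first); the last pair is the output for the current bit.
    print(encode_four_state([0, 1, 1])[-1])
```

Feeding the stream 0, 1, 1 gives X = 0 and Y = 1 for the final bit, matching the ‘01’ of the worked example.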
+
+
+Chapter 14: An Asynchronous Viterbi Decoder
+
+The use of three input bits (called the constraint length k) means that the
+encoder has $2^{k-1} = 4$ possible states, ranging from state S0 when both previous
+bits are zero to S3 when both these bits are ones. So, a current state n will
+become state 2n modulo 4 if the current input bit is a zero and (2n+1) modulo
+4 if it is a one; for example, if the current state is 2, the next state will be 0 or
+1. That is, each state leads to two known states. This state transition informa-
+tion versus time is normally drawn in the form of a trellis diagram as shown in
+figure 14.2 for a four-state system. States are also referred to as nodes, and the
+node-to-node connection network is referred to as a butterfly connection due
+to its predictability, regularity and shape.
+
+Figure 14.2.
+Four-node trellis.
+(States S0 to S3 are plotted against time 0 to 4; each branch is labelled with its encoder output, and the path for the data 1, 1, 0, 1 is drawn as a thicker line.)
+
+By knowing the starting state of the trellis at any time and the subsequent
+input stream, the route taken by the encoder and the state it reaches can be
+traced. For example, starting at state 0 at time 0, an input pattern of 1 then 1
+then 0 followed by 1 will cause the trellis path to travel from state S0 → S1 →
+S3 → S2 → S1; this route is indicated by the thicker black line on figure 14.2.
+This figure also shows the encoder outputs associated with each path or
+branch on the trellis. For example, moving from state S3 to S2, corresponding
+to a current input of ‘0’, the input stream must be 110 (with the oldest bit
+leftmost), so the encoded X and Y outputs are 0 and 1; this path is labelled
+with 01 to indicate the encoder output in moving from S3 to S2. The other
+paths are calculated in a similar way and labelled appropriately.
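As a rough illustration of the trellis just described, the sketch below computes next states and branch labels for the four-state code. The helper names (`next_state`, `branch_label`, `trace`) are hypothetical, and the sketch assumes that the least significant state bit holds the most recent input bit, which is consistent with the 2n and 2n+1 transition rule.

```python
def next_state(n, bit, states=4):
    """State n moves to 2n modulo 4 on a 0 input and (2n+1) modulo 4 on a 1."""
    return (2 * n + bit) % states


def branch_label(n, bit):
    """Encoder output (X, Y) on the branch leaving state n for the given input bit."""
    prev1, prev2 = n & 1, (n >> 1) & 1   # assumed: LSB is the most recent input
    return (bit ^ prev1 ^ prev2, bit ^ prev2)


def trace(bits, start=0):
    """List of states visited for an input sequence, beginning at `start`."""
    path = [start]
    for b in bits:
        path.append(next_state(path[-1], b))
    return path


if __name__ == "__main__":
    print(trace([1, 1, 0, 1]))       # states visited for data 1, 1, 0, 1
    print(branch_label(3, 0))        # label on the S3 to S2 branch
```

The traced path visits S0, S1, S3, S2, S1, and the S3 to S2 branch carries the label 01, as in figure 14.2.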
+
+14.2.2
+Decoder principle
+
+The decoder attempts to reconstruct the most likely path taken by the en-
+coder through the trellis. It does this by constructing a trellis and attaching
+weights to the nodes and each possible path at each timeslot; these indicate
+how likely it is that the node and path were the route taken by the encoder.
+Consider again the four-node example above. The encoded output for transmission
+resulting from the same input sequence of 1 then 1 then 0 then 1 would
+be 11 followed by 01, 01 and finally 00 (from an initial encoder state of all-
+zeros). Suppose that instead of receiving this sequence, the decoder receives
+corrupted data of 11 followed by 00, 01 and 00. Also assume that node 0 has
+an initialised weight of zero and that the other nodes have a higher initialised
+weight, say 2, at time 0; this corresponds to the encoder starting in state S0.
+
+Figure 14.3.
+Decoding trellis.
+(For the received symbols 11, 00, 01 and 00 in timeslots 1 to 4, the figure shows the branch weights and the resulting node weights, starting from node weights of 0 for S0 and 2 for S1, S2 and S3 at time 0.)
+
+For each branch, the distance between each received bit and the bit expected
+by the branch gives the weight of the branch. Where X and Y are encoded as
+single bits, referred to as hard coding, the number of bits difference between
+the received bits and the ideal encoded state for the branch gives the branch
+weight. The weights allocated in each timeslot for the received data are shown
+in figure 14.3. The received bits in the first timeslot are 11. Branches rep-
+resenting an encoded output of 11 are given a weight 0 while the branches
+representing an encoded output of 00 are given a weight of 2, and branches
+with ideal encoded outputs of 01 or 10 are given a weight of 1; these branch
+weights represent the distance between the branch and the received input. The
+branch weights are then added to the node weights to give an overall weight.
+Thus for example, state S2 has a node weight of 2 at time 0 and its branches
+have weights of 0 and 2 giving an overall weight of 2 and 4 going into the
+receiving node at the end of the timeslot. The overall weight indicates how
+likely it is that the route through the encoder corresponded to this sequence;
+the lower the overall weight the more likely this is.
+From the trellis diagram, it can be seen that each state can be reached from
+two other states and that each state will therefore have two incoming overall
+weights. The weight that is chosen is the lower of these two since this rep-
+resents the most likely branch that the encoder would traverse in reaching a
+particular state. So for example, state S1 can be entered from either state S0
+or S2 and at time 1, the overall node plus branch weight from S0 is 0 + 0 = 0
+while from S2 it is 2 + 2 = 4. The weight arising from S0 is chosen as this is
+the lower and therefore more likely route into S1; the weight arising from the
+S2 branch is discarded and the new weight for node S1 at the end of this first
+timeslot is that from S0, i.e. 0.
+This process continues in a similar way for each timeslot with weights given
+to each branch according to the difference between the encoded branch pattern
+and the received bits. This difference is then added to the node weight to give
+an overall weight with the next weight for a node equal to the lower incoming
+weight. This results in the weights given in figure 14.3.
+To form the output from the decoder, a path is traced through its trellis using
+the accumulated history of node weight information. The weights at time 4
+indicate that the encoder is most likely to be in state S1 as this has the lowest
+node weight. Furthermore, this state was most likely to have been reached
+from S2. Continuing tracing backwards from S2 taking the lower overall node
+count, the most likely path taken from S2 is S3, S1 and S0 (initialisation). Note
+that despite having received corrupted data, this is exactly the sequence taken
+by the encoder in sending this data!
+The output data can be ‘rescued’ from the decoder by noting that to reach
+states S0 and S2 the current data input to the encoder is a ‘0’ while to reach
+states S1 and S3, the current encoder input is a ‘1’; that is, the least significant
+state bit indicates the state of the current data. Thus since the optimum states
+the decoder reaches at times 1, 2, 3 and 4 are S1, S3, S2 and S1 respectively, the
+decoder would output a data stream of 1101 as being the most likely input to
+the encoder.
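The worked example can be reproduced with a small hard-decision decoder. The sketch below is a plain software rendering of the add, compare and trace-back steps described above, not the asynchronous hardware; the function names are hypothetical, and ties between equal incoming weights are resolved in favour of the lower-numbered predecessor, which is an assumption (the text does not say how ties are broken).

```python
def hamming(a, b):
    """Number of differing bits between two 2-bit symbols."""
    return (a[0] ^ b[0]) + (a[1] ^ b[1])


def label(n, bit):
    """Ideal encoder output on the branch leaving state n for input `bit`."""
    p1, p2 = n & 1, (n >> 1) & 1
    return (bit ^ p1 ^ p2, bit ^ p2)


def viterbi_four_state(received, init=(0, 2, 2, 2)):
    weights = list(init)
    history = []                                  # winning predecessor per state, per slot
    for sym in received:
        new_w, winners = [], []
        for s in range(4):
            bit = s & 1                           # input bit that leads into state s
            preds = (s >> 1, (s >> 1) + 2)        # the two states n with (2n+bit) % 4 == s
            cands = [weights[p] + hamming(label(p, bit), sym) for p in preds]
            k = 0 if cands[0] <= cands[1] else 1  # assumed tie-break: lower predecessor
            new_w.append(cands[k])
            winners.append(preds[k])
        weights = new_w
        history.append(winners)
    state = weights.index(min(weights))           # start from the lowest node weight
    path = [state]
    for winners in reversed(history):             # trace back through the stored winners
        state = winners[state]
        path.append(state)
    path.reverse()
    return weights, path, [s & 1 for s in path[1:]]


if __name__ == "__main__":
    print(viterbi_four_state([(1, 1), (0, 0), (0, 1), (0, 0)]))
```

For the corrupted sequence 11, 00, 01, 00 the sketch ends with S1 holding the lowest weight and recovers the path S0, S1, S3, S2, S1, that is, the data 1101.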
+
+14.3.
+System parameters
+
+In practice the decoder designed is larger and more complicated than indi-
+cated by the simple example given. The encoder uses the current bit and the
+six previous bits in various combinations to provide the two encoded output
+streams; this spreads each input bit out over a longer transmission period, in-
+creasing the probability of error-free reception in the presence of noise. If the
+current bit is bit 0 with the oldest bit in the shift register being bit 6, then the
+encoded X-output is obtained by adding bits 0, 1, 2, 3 and 6 and the encoded
+Y-output from adding bits 0, 2, 3, 5 and 6. Since the constraint length is 7, the
+system has 64 nodes and therefore 128 paths or branches. Thus state n, where
+n is an integer from 0 to 63, leads to states 2n modulo 64 and (2n+1) modulo
+64.
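A minimal sketch of the constraint-length-7 encoder follows, using the tap positions given in the text (bit 0 is the current bit, bit 6 the oldest). The names `X_TAPS`, `Y_TAPS` and `encode_k7` are assumptions for this illustration. With bit 0 written as the most significant digit, these tap sets correspond to the generator pair commonly written as 171 and 133 in octal.

```python
X_TAPS = (0, 1, 2, 3, 6)   # bits added (modulo 2) to form the X output
Y_TAPS = (0, 2, 3, 5, 6)   # bits added (modulo 2) to form the Y output


def encode_k7(bits):
    """Rate-1/2, constraint-length-7 encoder sketch; the register starts at zero."""
    window = [0] * 7                      # window[0] is the current bit, window[6] the oldest
    out = []
    for b in bits:
        window = [b] + window[:-1]        # shift the new bit in
        x = sum(window[t] for t in X_TAPS) % 2
        y = sum(window[t] for t in Y_TAPS) % 2
        out.append((x, y))
    return out


if __name__ == "__main__":
    # A single 1 followed by zeros shows the tap pattern on each output.
    print(encode_k7([1, 0, 0, 0, 0, 0, 0]))
```

Driving the encoder with a single one makes each output stream replay its own tap pattern, which is a quick way to check the tap lists.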
+Furthermore, the received bits are not hard coded (just straight ones and
+zeros) but soft coded. Three bits are used to represent values, with 100 used
+for a strong zero value and 011 for a strong one value. Noise in transmission
+means that a received character can be indicated by any 3-bit value from 0 to
+7 and in interpreting the received value, it is helpful to regard the number as
+signed. Thus 011 indicates a strong one while 000 denotes received data that
+weakly indicates a one. Similarly, 100 implies a strong zero while 111 denotes
+a probable but weak zero. To make the text easier to follow hereafter, unsigned
+values will be used with the code for a 3-bit perfect zero taken as 000 (i.e.
+value 0) while a 3-bit perfect one is taken as 111 (i.e. value 7).
+The interface to the asynchronous decoder is synchronous, with the validity of
+the encoded characters on a particular positive clock edge indicated by a Block-
+Valid signal; encoded data is only present when this signal is high. Code rates
+other than the 1/2, primarily described in this chapter, are also possible. These
+are achieved by using the Block-Valid with a Puncture and a Puncture-X-nY
+signal; if Block-Valid is active (high) then a high Puncture signal indicates that
+only one of the X or Y symbols is valid and Puncture-X-nY indicates which.
+Both encoded characters are present if Block-Valid is high and Puncture is low.
+All the code rates originate from the 1/2 rate encoder with data for the other
+code rates then obtained by omitting to send some of these encoded characters.
+In this way, in addition to the 1/2 code rate, the system receives and decodes
+rates of 2/3 (two input bits result in the transmission of three encoded charac-
+ters), 3/4, 5/6, 6/7 and 7/8 (7 input bits defined by 8 encoded characters with
+the remaining 6 not transmitted). As the code rate progressively increases, less
+redundancy is included in the transmitted data resulting in an increased error
+rate from the decoder; a rate of 1/2 will yield the most error-free output.
+The system is expected to be operated from a 90 MHz clock. Since this
+clock is used by all code rates, the code rate not only defines the ratio of the
+number of input bits to the number of transmitted encoded characters but also
+specifies the ratio of the number of clocks containing encoded information (of
+1 or 2 valid characters) to the number of clock cycles. For example, a 3/4 rate
+signifies that three input bits result in four transmitted characters and also that
+three clocks out of four contain some encoded information. Thus in a 3/4 code,
+every fourth clock contains no encoded information (Block-Valid is low). Any
+clock for which Block-Valid is high is said to contain a symbol. Thus with
+a 90 MHz clock rate, a 1/2 rate code which has valid data on alternate clock
+cycles yields a data rate of 45 Msymbols/sec and a 7/8 code rate is equivalent
+to 90 × 7/8 Msymbols/sec = 78.75 Msymbols/sec.
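The rate arithmetic above amounts to multiplying the clock frequency by the code rate. A short sketch, with `CLOCK_MHZ` and `symbol_rate` as hypothetical names:

```python
from fractions import Fraction

CLOCK_MHZ = 90   # expected clock frequency from the text


def symbol_rate(code_rate):
    """Msymbols/sec: the fraction of clock cycles for which Block-Valid is high."""
    return float(CLOCK_MHZ * code_rate)


if __name__ == "__main__":
    for rate in (Fraction(1, 2), Fraction(2, 3), Fraction(3, 4),
                 Fraction(5, 6), Fraction(6, 7), Fraction(7, 8)):
        print(rate, symbol_rate(rate))
```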
+
+14.4.
+System overview
+
+To perform the decoder computation previously outlined, the Viterbi de-
+coder comprises three units as shown in figure 14.4. The Branch Metric Unit
+(BMU) receives the transmitted data and computes the distance between the
+ideal branch pattern symbols of (0,0), (0,7), (7,0) or (7,7) and the received
+data; these weights are then passed to the Path Metric Unit (PMU). The PMU
+holds the node weights and performs the node plus branch weight calculations,
+selecting the lower overall weight as the next weight for a node in a particular
+timeslot. The computed node weights are then fed back (within the PMU) and
+become the node weights for the next timeslot.
+
+Figure 14.4.
+Decoder units.
+(The Branch Metric Unit takes the input from the receiver together with the Clock and passes branch metrics to the Path Metric Unit; the Path Metric Unit passes the local winners and the global winner to the History Unit, which produces the decoded output. Each interface carries Req and Ack signals.)
+
+As well as computing the next slot node weights, the PMU remembers
+whether the winner at a node came from the upper or lower branch leading
+in to it; this is only a single bit per node. On each timeslot, this local winner
+information is passed to the History Unit (HU), called the Survivor Memory
+Unit in synchronous systems. This information enables the HU to keep a his-
+tory of the routes taken through the trellis by each node. In addition to this
+local node information, in our asynchronous design the state with the lowest
+node weight is identified in the PMU and its identity passed to the HU. Locat-
+ing this global winner gives a known starting point in searching back through
+the trellis history to find the best data to output.
+Conventional synchronous designs do not undertake an overall node winner
+identification and consequently start the search at a random point. They need
+to store a relatively large number of timeslots in order to ensure that there is
+sufficient history to make it likely that the correct route through the trellis is
+eventually found. In the asynchronous design, the identification of the overall
+node winner in the PMU was relatively easy to perform and it seemed the
+natural way to proceed. It has had the desirable effect of enabling the amount
+of timeslot history kept in the HU to be reduced and it also reduces the activity
+in the HU, saving power.
+The HU uses both the overall winning node information (the global winner)
+and the local node winners in order to reconstruct the trellis and trace back the
+path from the current overall winner to find the node indicated by the oldest
+timeslot stored; the bit for this node is then output. The HU can be visualised
+as a circular buffer. Once data is output from the oldest slot, this information
+is overwritten with the current (newest) winner data so that the next oldest data
+becomes the oldest data in the next timeslot.
+Figure 14.4 shows the bundled-data interface used between units; four-phase
+signalling is used for the Request and Acknowledge handshake signals. The
+Clock signal is required because the external system to the decoder is syn-
+
+
+256
+Part III: Large-Scale Asynchronous Designs
+
+chronous. Input data is supplied and output data removed on the positive clock
+edge provided that the bits on that clock edge are valid as indicated by Block-
+Valid. In practice, all units contain buffering at their interfaces in order to
+decouple the operation of the units, to cater for local variations in operating
+times, and to satisfy latency requirements imposed by the external system.
+
+14.5.
+The Path Metric Unit (PMU)
+
+14.5.1
+Node pair design in the PMU
+
+The PMU performs the core of the computation in the decoder and is the
+starting point for the design. Conventionally, the computation performed is
+that of Add (node to branch weight), Compare (upper and lower weights to
+a node) and Select (lower weight as next weight for a node). Because of the
+butterfly connection, the branch weights associated with nodes j and j+32 and
+their connections lead to nodes 2j and 2j+1, as shown in figure 14.5 where
+BMa and BMb represent the branch weights; it should be noted that since a
+branch represents ideal convolved characters of (0,0), (0,7), (7,0) or (7,7), it is
+only necessary to compute a total of four weights in any system representing
+their distance from the received soft-coded characters.
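The butterfly property can be checked directly from the transition rule. In the sketch below, `successors` and `node_pair` are hypothetical helper names; the only fact used is that state n leads to 2n modulo 64 and (2n+1) modulo 64.

```python
STATES = 64


def successors(n):
    """The two states reachable from state n."""
    return ((2 * n) % STATES, (2 * n + 1) % STATES)


def node_pair(j):
    """Source nodes j and j+32 together with the destination nodes they share."""
    return (j, j + STATES // 2), (2 * j, 2 * j + 1)


if __name__ == "__main__":
    # Every pair (j, j+32) feeds exactly the pair (2j, 2j+1), so 32 node pairs cover the trellis.
    print(all(successors(j) == successors(j + 32) == node_pair(j)[1] for j in range(32)))
```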
+
+Figure 14.5.
+Node pair computation.
+(Previous node metrics for nodes j and j+32 connect through branch weights BMa and BMb to the next node metrics for nodes 2j and 2j+1.)
+
+As the logic for this pair of nodes is self-contained and all the logic can be
+similarly partitioned into self-contained pairs of nodes, the basic unit of logic
+design in the PMU is the node pair; this is then replicated the required number
+of times (32 in this system). Furthermore, since 8-bit parallel arithmetic is nor-
+mally used, in a 64-node system this leads to 512 data signals in the butterfly
+connection and 1024 interconnections within the PMU.
+
+Table 14.1.
+Serial unary arithmetic.
+
+number    transition representation
+zero   =  000000
+one    =  111111
+two    =  000001
+three  =  111101
+four   =  000101
+five   =  110101
+six    =  010101
+
+Figure 14.6.
+Six-bit 2-phase dataless FIFO.
+(Six Muller C-elements, stages 1 to 6, are chained between the Rin/Ain and Rout/Aout handshake signals, with internal request and acknowledge signals R1 to R5 and A1 to A5.)
+
+In an effort to simplify this routing problem and to achieve an associated
+power reduction from this simplification, serial arithmetic was proposed for
+the asynchronous design; in principle, this would reduce the butterfly to just
+64 wires. The adoption of serial arithmetic significantly impinges upon the
+node pair design and the way weights are stored. A conventional binary rep-
+resentation of values where serial arithmetic was performed on them was not
+a practical option as it would lead to a system throughput appreciably below
+that specified. This led to the idea of indicating values by the number of entries
+occupied in a FIFO buffer, so for example a count of five would require that the
+bottom five stages of the buffer showed that they were full; note that the buffer
+stages don’t need to store any data but merely require a full/empty indication.
+The speed and simplicity of this full/empty dataless FIFO scheme is further
+enhanced by adopting serial unary arithmetic for representing the data in the
+buffer (rather than a ‘1’, say, to represent full and a ‘0’ for empty). This is
+essentially a 2-phase representation for values, so that the number of transi-
+tions considering the full/empty bits as a whole represents the count. This is
+illustrated in table 14.1 for a 6-stage FIFO where the input enters on the left
+hand side (and exits from the right hand side).
+The FIFOs used to hold the path and state metrics are Muller pipelines as
+shown in figure 14.6 (see also figure 2.8 on page 17). The encoding of a metric,
+M, is simply the state of an initially empty Muller pipeline after it has been
+exposed to M 2-phase handshakes on its input. Since a Muller C-gate in the
+technology used has a propagation delay of around 1 nsec the FIFO can transfer
+data in and out at a rate of around 500 MHz.
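The encoding in table 14.1 can be reproduced with an idealised software model of a six-stage Muller pipeline. The sketch below makes several assumptions that are not stated in the text: each stage is a C-element whose inputs are the stage to its left and the inverse of the stage to its right, the output side never acknowledges (its acknowledge is held at 0), and all delays are ignored. The names `c_element`, `settle` and `encode_metric` are hypothetical.

```python
def c_element(a, b, prev):
    """Muller C-element: the output follows the inputs when they agree, otherwise it holds."""
    return a if a == b else prev


def settle(stages, req_in, ack_out=0):
    """Update stages in place until stable; stages[0] is the input (left-hand) stage."""
    n = len(stages)
    changed = True
    while changed:
        changed = False
        for i in range(n):
            left = req_in if i == 0 else stages[i - 1]
            right = ack_out if i == n - 1 else stages[i + 1]
            new = c_element(left, 1 - right, stages[i])
            if new != stages[i]:
                stages[i] = new
                changed = True
    return stages


def encode_metric(m, depth=6):
    """State of an initially empty pipeline after m 2-phase handshakes on its input."""
    stages, req = [0] * depth, 0
    for _ in range(m):
        req ^= 1                      # one 2-phase handshake is a single transition
        settle(stages, req)
    return "".join(map(str, stages))


if __name__ == "__main__":
    for m in range(7):
        print(m, encode_metric(m))
```

Under these assumptions the model yields the seven patterns listed in table 14.1 for counts zero to six.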
+
+
+258
+Part III: Large-Scale Asynchronous Designs
+
+Using serial unary arithmetic, the major design component in a node pair
+for adding the node and branch weights and transferring the smaller to be the
+new node weight is an increment-decrement unit. The basic scheme for a node
+pair is illustrated in figure 14.7.
+
+Figure 14.7.
+Node pair logic.
+(The figure shows node pair n/2: state metrics from node pairs n/4 and n/4+16 arrive over the butterfly network, are combined with the branch metrics in the path metric FIFOs, and pass through two select elements to the state metric FIFOs of nodes n and n+1, which feed node pairs n and n+1; the local and global winner signals go to the History Unit.)
+
+The new weights for each state are stored in the State Metric FIFOs on the
+right hand side of figure 14.7. When the global and local winners of these have
+been sent to the HU and acknowledged, the next timeslot commences with the
+parallel loading of the branch weights into the Path Metric FIFOs on the left
+in figure 14.7; these overwrite any existing content in these FIFOs. Parallel
+loading here, rather than serial entry, was selected on the grounds of speed and
+the need to clear the Path Metric FIFOs of any existing count.
+The branch weights loaded are those computed by the BMU. The BMU first
+computes the conventional branch weights based on the difference between the
+two ideal 3-bit characters expected on the trellis branch and the two received
+values. The BMU then translates this to a transition pattern. This is made
+more complicated by the fact that the external environment to the Path Metric
+FIFOs is sometimes in the ‘1’ state, necessitating a 2-phase inversion of the
+computed pattern.
+Once the branch weights are loaded in, the node weights are then added to
+them. The node weights are transferred serially from the State Metric FIFOs
+across the butterfly connection into the Path Metric FIFOs. For each event
+transferred, two Path Metric FIFOs are incremented by one and the State Met-
+ric FIFO decremented by one. The transfer is complete when the feeding State
+Metric FIFO is empty. The Path Metric FIFOs for the node pair can commence
+the comparison and selection of the lower count as the node weight for the re-
+ceiving State Metric FIFO once the receiving State Metric FIFO is empty.
+The simplest way of performing this comparison and selection of the lower
+Path Metric FIFO count is to pair transitions (events) in the upper and lower
+Path Metric FIFOs connected to a receiving State Metric FIFO. Each paired
+event decrements each Path Metric FIFO by one and produces a transition
+which is used to increment the State Metric FIFO. The observant reader will
+note that the pairing of events to produce an output transition is exactly the
+action performed by a two-input Muller C-gate and this in principle is all that
+is required for the Select element shown in figure 14.7.
+The pairing action ceases when the lower count in the two Path Metric FI-
+FOs has decremented to zero. At this point, this Path Metric FIFO is the local
+path winner and the new node weight in the receiving State Metric FIFO is
+complete. The identity of the winning Path Metric FIFO (upper or lower) is
+needed to reconstruct the trellis; this information is buffered in the PMU and
+sent to the HU when all local winners are known and the overall winning node
+has been identified. This completes the actions required in the current timeslot
+and the PMU is then free to commence the next timeslot.
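The serial add and the event-pairing selection can be modelled with plain integer counters standing in for the dataless FIFOs. This is only a behavioural sketch: the function names are hypothetical, capping of counts is left out, and a tie (both counters reaching zero together) is reported as an upper-branch win, which is an assumption rather than something the text specifies.

```python
def add_state_to_paths(state_metric, branch_a, branch_b):
    """Serially transfer a state metric into two path metric counters."""
    path_a, path_b = branch_a, branch_b       # counters preloaded with branch weights
    while state_metric > 0:
        state_metric -= 1                     # one event leaves the State Metric FIFO
        path_a += 1                           # and increments both Path Metric FIFOs
        path_b += 1
    return path_a, path_b


def pair_select(upper, lower):
    """Pair events from two path counters, as a two-input C-gate would."""
    new_state = 0
    while upper > 0 and lower > 0:
        upper -= 1
        lower -= 1
        new_state += 1                        # each paired event yields one output transition
    winner = "upper" if upper == 0 else "lower"   # assumed: ties go to the upper branch
    return new_state, winner


if __name__ == "__main__":
    print(add_state_to_paths(2, 0, 3))
    print(pair_select(3, 5))
```

The pairing loop stops when the smaller count is exhausted, so the new state metric equals the lower of the two path metrics.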
+
+14.5.2
+Branch metrics
+
+The proposed scheme is only viable if the numbers that need to be trans-
+ferred between the FIFOs can be kept small. A simulator was written in order
+to establish the minimum values that were consistent with meeting the bit error
+rates specified for the industrial standard device.
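+
+Figure 14.8.
+Computing the branch metric.
+(The received point lies inside the square whose corners are the ideal points 0,0, 0,7, 7,0 and 7,7; a and b are its offsets along one axis, c and d its offsets along the other, and d00, d01, d10 and d11 are its distances to the four corners.)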
+
+Table 14.2.
+Branch metric weight generation.
+
+received 3-bit character:   0  1  2  3  4  5  6  7
+Weight referenced to 0:     0  0  0  0  1  3  5  7
+Weight referenced to 7:     7  5  3  1  0  0  0  0
+
+In the BMU, the distance of the incoming data from the ideal branch
+representations of (0,0), (0,7), (7,0) and (7,7) needs to be computed. This
+calculation is depicted in figure 14.8. The incoming data is assumed to have
+a value of (a,c) which does not correspond to any of the ideal points. The
+squares of the distances d00, d01, d10 and d11 are $a^2+c^2$, $a^2+d^2$, $b^2+c^2$ and
+$b^2+d^2$ respectively. Only the relative values of these quantities are of interest.
+Substituting $(7-a)$ for b and $(7-c)$ for d in the quadratic expressions gives
+squared distances of $a^2+c^2$, $a^2+(7-c)^2$, $(7-a)^2+c^2$ and $(7-a)^2+(7-c)^2$.
+Expanding out and subtracting $a^2+c^2$ gives distance values of 0, $49-14c$,
+$49-14a$ and $98-14a-14c$. Dividing by 7, adding $a+c$ and then substituting
+back b for $(7-a)$ and d for $(7-c)$ yields the linear metrics $a+c$, $a+d$,
+$b+c$ and $b+d$.
+Thus in this particular system, the Euclidean distance squared is equivalent
+to the Manhattan distance, which is a somewhat surprising and unexpected re-
+sult. It indicates that to use squared distances offers no advantage over using
+the much simpler linear distances; indeed, using squared distances followed
+by scaling to reduce the number size (which is adopted in some systems) in-
+troduces unnecessary circuit complexity and some inaccuracy.
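The equivalence can be checked exhaustively over all 64 received points. The sketch below follows the steps of the derivation (subtract $a^2+c^2$, divide by 7, add $a+c$); the function names are hypothetical.

```python
def squared_distances(a, c):
    """Squared distances to (0,0), (0,7), (7,0) and (7,7)."""
    b, d = 7 - a, 7 - c
    return [a * a + c * c, a * a + d * d, b * b + c * c, b * b + d * d]


def manhattan(a, c):
    """Linear metrics a+c, a+d, b+c and b+d."""
    b, d = 7 - a, 7 - c
    return [a + c, a + d, b + c, b + d]


def linearised(a, c):
    """Apply the derivation's steps to the squared distances."""
    base = a * a + c * c
    # Each difference is a multiple of 7, so integer division is exact.
    return [(s - base) // 7 + a + c for s in squared_distances(a, c)]


if __name__ == "__main__":
    print(all(linearised(a, c) == manhattan(a, c)
              for a in range(8) for c in range(8)))
```

Because the same subtract, divide and add steps are applied to all four values, the ordering of the four branches is unchanged, which is all the decoder needs.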
+The linear weights are further minimised by subtracting the x and y distance
+to the nearest ideal point from the branch weights, so the smallest linear metric
+is always made zero. For example, if the incoming soft codes are exactly 7,7
+in figure 14.8 then linear metrics of 14, 7, 7 and 0 are generated. However, if
+the incoming bits are at co-ordinate 5,6 (due to noise) then the metrics become
+11, 6, 8 and 3 which by subtracting 3 from all values (2 for x value and 1
+for y value) become branch metrics of 8, 3, 5 and 0. With the distance to the
+nearest point decremented to zero in this way, the weights generated for each
+3-bit character are reduced to those shown in table 14.2.
+The maximum metric is now 14, and this will always arise when the incom-
+ing data coincides with one of the ideal input combinations. This is still too
+large for a system which needs to operate serially at close to 90 MHz. There-
+fore the metric is further scaled non-linearly and, to preserve the relative value
+of the weights, this is performed separately on each of the two incoming 3-bit
+soft values.
+Referring to table 14.2, weights of zero remain zero while weights of 1, 3, 5
+and 7 are scaled to 1, 2, 3 and 4. Clearly, the weights for both 3-bit soft codes
+need to be added to obtain the overall branch metric weight. Thus for example,
+the incoming soft-codes of 7,7 generate weights of 4+4, 4+0, 0+4 and 0+0, i.e. 8, 4,
+4 and 0, while incoming soft-codes of 5,6 generate weights of 2+3, 2+0, 0+3
+and 0+0, i.e. 5, 2, 3 and 0. Although the scaling here is non-linear and will therefore
+introduce some inaccuracy, simulation showed that this was not significant in
+relation to the results obtained with and without the scaling.
+Furthermore, simulation also reveals that, unsurprisingly, paths with the
+largest weights are rarely involved in the most likely path to be chosen dur-
+ing the path reconstruction to find the output. Therefore, weights generated
+can be limited or capped and a limit of 6 is used in the BMU. Thus the weights
+actually generated and loaded by the BMU for a 7,7 soft code input are 6, 4, 4
+and 0; however, the weights for a 5,6 input code remain as above.
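Putting table 14.2, the non-linear scaling and the cap together gives a compact lookup. The sketch below is one possible software rendering; the names `WEIGHT_TO_0`, `WEIGHT_TO_7`, `SCALE`, `CAP` and `branch_metrics` are assumptions made for illustration.

```python
WEIGHT_TO_0 = [0, 0, 0, 0, 1, 3, 5, 7]     # table 14.2, indexed by the received 3-bit value
WEIGHT_TO_7 = [7, 5, 3, 1, 0, 0, 0, 0]
SCALE = {0: 0, 1: 1, 3: 2, 5: 3, 7: 4}     # non-linear scaling applied per soft value
CAP = 6                                    # limit used in the BMU


def branch_metrics(x, y):
    """Weights for the ideal points (0,0), (0,7), (7,0) and (7,7), in that order."""
    x0, x7 = SCALE[WEIGHT_TO_0[x]], SCALE[WEIGHT_TO_7[x]]
    y0, y7 = SCALE[WEIGHT_TO_0[y]], SCALE[WEIGHT_TO_7[y]]
    return [min(w, CAP) for w in (x0 + y0, x0 + y7, x7 + y0, x7 + y7)]


if __name__ == "__main__":
    print(branch_metrics(7, 7))
    print(branch_metrics(5, 6))
```

The two calls correspond to the 7,7 and 5,6 inputs discussed in the text.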
+Six-bit FIFOs are used throughout the PMU and again numbers in the PMU
+are capped at 6. To deal with cases where the serial addition of the node metric
+to a branch weight would exceed this number, logic referred to as the Overflow
+Unit is placed at the input of each Path Metric FIFO. In such cases it receives the
+incoming request handshake but does not pass it to the FIFO; instead it returns it to
+the sending State Metric FIFO as an acknowledge signal.
+
+14.5.3
+Slot timing
+
+The overall or global winner from the PMU in a particular timeslot is the
+node having the lowest state metric count. In the same way that the BMU
+values can be adjusted so that the lowest weight is zero, the state metric values
+can also be decremented so that their minimum value is zero. As a result,
+numbers in the BMU and PMU are guaranteed to range only between zero and
+six. Furthermore if the soft bits contain no noise, which is the situation most of
+the time, then one (and only one) State Metric FIFO will contain a zero count
+indicating the best path through the trellis. This means that in the majority of
+timeslots, there is no need to perform any subtraction on the state metrics.
+Detecting that a count is zero is in itself easy since it is indicated by an all-
+zeros state in the FIFO. This, as well as establishing the local winner, is done
+locally within a node and is timed from the control signals applied to the node;
+each node has a control section which generates the timing signals required by
+the node, and the timing within a node is independent of the timing of all other
+nodes.
+A slot commences with the loading of the BMU branch weights. The node
+timing then passes to the next stage where the state-to-path metric transfer is
+performed. Following this, detecting that the sending State Metric FIFOs and
+the State Metric FIFO to receive the lower path metric count are empty causes
+the generation of a state-to-path metric done signal. The node timing then
+moves on to the next phase which generates a path-to-state metric enable. If
+at the time this signal is activated one of the Path Metric FIFOs for the node
+is empty, then a flip-flop is set indicating that this node is a global winner
+candidate; in this case no transfer to the State Metric FIFO is required and
+the path-to-state metric done signal is generated; this signal is used to clock
+the upper/lower branch (local) winner into a flip-flop and also to set a ‘local
+winner found’ flip-flop.
+If neither Path Metric FIFO is empty then the path-to-state enable signal
+allows the transfer to its State Metric FIFO until one of the Path Metric FIFOs
+becomes empty; at this point, the path-to-state metric done signal is generated,
+setting the ‘local winner’ and the ‘local winner found’ flip-flops only.
+The ‘local winner found’ and ‘global winner found’ signals now progress
+up the PMU logic hierarchy to the top level because all information needed
+to be passed to the HU has to be present before the request out signal to the
+HU is generated by the PMU. Furthermore, when the local and global winner
+data have been assembled for the HU, all nodes need to be informed that the
+slot has ended and the timing can be shifted to the start of the next slot. It
+should therefore be noted that while the timing within the nodes is local, the
+communication of the winner information to the HU and the subsequent release
+of the nodes has to be global.
+
+14.5.4
+Global winner identification
+
+The formation of the ‘all local winners found’ and the global winner identi-
+fication is partitioned across the various levels of the PMU logic hierarchy. At
+the level of a pair of node pairs, the four global winner candidate signals are
+input to a priority function which produces a 2-bit node address and a ‘global
+winner found’ signal if one of the inputs was a candidate. This logic is repeated
+with four such pairs of node pairs so that using the ‘global winner found’ sig-
+nal generated by each pair of node pairs, the two bit address obtained indicates
+which pair of node pairs contains the global winner; these are then combined
+with the two-bit address generated by that pair of node pairs to form a 4-bit
+node address. Finally, this logic is repeated at the top level where there are
+four sets of four pairs of node pairs. Again the ‘global winner found’ signal
+from each set is used in a priority logic function to produce the two-bit address
+of the winning set and this is combined with the four address bits identifying
+the winning node within the winning set; this is the 6-bit node identification
+that is sent to the HU as the global winner.
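The three-level address formation can be sketched as nested 4-input priority functions. Two assumptions are made here that the text does not state: the lowest-numbered asserted input has priority, and consecutive node numbers are grouped together at each level. The names `priority4` and `global_winner` are hypothetical.

```python
def priority4(flags):
    """Return (found, 2-bit index) for four flags; lowest index wins (assumed)."""
    for i, f in enumerate(flags):
        if f:
            return True, i
    return False, 0


def global_winner(candidates):
    """candidates: 64 booleans, one per node. Returns (found, 6-bit node address)."""
    # Level 1: sixteen pairs of node pairs, four candidate signals each.
    level1 = [priority4(candidates[i:i + 4]) for i in range(0, 64, 4)]
    # Level 2: four sets of four pairs of node pairs, giving a 4-bit address.
    level2 = []
    for g in range(0, 16, 4):
        found, idx = priority4([f for f, _ in level1[g:g + 4]])
        level2.append((found, (idx << 2) | level1[g + idx][1]))
    # Top level: choose the winning set and prepend its 2-bit address.
    found, idx = priority4([f for f, _ in level2])
    return found, (idx << 4) | level2[idx][1]


if __name__ == "__main__":
    flags = [False] * 64
    flags[37] = True
    print(global_winner(flags))
```

With a single candidate the address returned is simply that node's number; with several candidates the result depends on the assumed priority order.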
+The ‘local winner found’ signals only require combining in NAND or NOR
+gates and this is done at the node pair, four pairs of node pairs and top levels.
+At the top level, all the ‘local winner found’ signals must be true in order to
+generate the request out signal to the HU. Since the global winner identification
+is generated from the node whose local winner is the first to be indicated, the
+global winner is guaranteed to be identified prior to the last local winner being
+found. The acknowledge signal from the HU, in response to the PMU request
+out, causes a reset signal to all nodes which resets the global candidate and the
+‘local winner found’ flip-flops and moves the timing on.
+The case where no State Metric FIFO contains zero is detected explicitly;
+it indicates that noisy data has been received. Here, the global winner can be
+identified by performing a subtraction such that one or more State Metric FI-
+FOs with the smallest count become zero. This could be time consuming and
+a decision was taken not to perform the decrement in the current timeslot. In-
+stead the local winners are sent to the HU in the normal way but a Not-Valid
+signal accompanies the request handshake indicating that while the local win-
+ner information is genuine the global winner identification should be ignored.
+The decrementing of all the state metric weights is performed in the next
+cycle by all the Overflow Units (which precede the Path Metric FIFOs). A
+signal to this unit indicates that decrementing is required. This results in the
+first incoming request from a State Metric FIFO to transfer its count to the
+Path Metric FIFO being ignored. The Overflow Unit sends an acknowledge
+back to its sending State Metric FIFO but does not pass the request on to its
+Path Metric FIFO. This effectively decrements the State Metric FIFOs by one
+by discarding the first item sent by them to the Path Metric FIFOs.
+Only a count of one is decremented in this way on any timeslot. This may
+still leave all state metrics with a non-zero count in them but simulation re-
+vealed that this was highly unlikely. Furthermore, if the Overflow Units were
+used to decrement the state metrics by the smallest count then either consider-
+able logic to determine the size of this count would be needed, or time consum-
+ing logic which decremented all state metrics by one and then tested to see if all
+these metrics were still non-zero (repeating these steps if necessary) would be
+required. Instead the much simpler approach of detecting a zero-valued state
+metric and identifying when all state metric counts are non-zero is used.
+In retrospect, it would have been better to have decremented the State Met-
+ric down to zero in the current timeslot. The decrementing has to occur at some
+point and postponing to the next timeslot merely shifts when the operation is
+performed. More importantly, the failure to identify the global winner in the
+case of all State Metric FIFOs holding a non-zero count means that information
+which is in the PMU is not passed to the HU and therefore the HU has less
+information on which to base its decisions as to the data output.
+
+
+264
+Part III: Large-Scale Asynchronous Designs
+
+14.6.
+The History Unit (HU)
+
+The global and local winner information from the PMU to the HU is accom-
+panied by a Request handshake signal in the normal way. Having specified the
+interface to the PMU, the design of the asynchronous HU can be decoupled
+from the design of the rest of the system. As previously mentioned, the iden-
+tification of a global winner means that the number of timeslots of local and
+global winner history that need be kept by the HU can be reduced compared
+with systems that need to start the tracing back through the trellis information
+from a random point. A rule of thumb for the minimum number of timeslots
+that need to be stored for determining the correct output is around 5 times the
+constraint length. With a length of seven for the system described, a minimum
+history of 35 timeslots is required and this was confirmed by simulation. On
+this basis, a 65-slot HU was developed.
+
+14.6.1
+Principle of operation
+
+Figure 14.9 illustrates the principle of operation of the HU which, for sim-
+plicity, is shown as having only four states and storing only five time slots
+indicated by the rectangular outline. At each time step, indicated by T1, T2,
+etc., the PMU supplies the HU with the local winner information (an arrow
+points back from each state to the upper/lower winner state in the previous
+time step) and a global winner indicated by a solid circle.
+Consider the situation when the latest data to have been supplied is at T6.
+The global winner at T6 is S3, and following the arrows back, the global winner
+at T2 is S0. The next data output bit is therefore 0 (the least significant bit
+of S0’s state number), and this is output as the buffer window slides forward
+to time step T7. At T7 the received data has been corrupted by noise, and
+the global winner is (erroneously) S3. Following the local winners back, the
+backtrace modifies the global winners in time steps T6 and T5, but the path
+converges with the old path at T4. The next data output (from the global winner
+at T3) is 1 and is still correct. Moving on to T8, the global winner is S0 and,
+tracing back, the global winners at T5, T6 and T7 are changed, with T5 and T6
+resuming their previous correct values. The noise that corrupted the path at T7
+has no lasting effect and the output data stream is error-free.
+
+14.6.2
+History Unit backtrace
+
+The data structure used in the HU is illustrated in table 14.3. Each of the
+65 slots contains 64 bits of local winner information and a 6-bit global winner
+identifier. There is also a ‘global winner valid’ flag which indicates whether or
+not the global winner has been computed. The 65 slots form a circular buffer
+with the start (and end) of the buffer stepping around the slots in sequence.
+
+
+
+[Figure: trellis of states S0 to S3 (rows) against time steps T1 to T8
+(columns), with a data_out row above the trellis; extracted data_out values:
+0 0 1 0 1 1 1 0.]
+
+Figure 14.9.
+History Unit backtrace example.
+
+Table 14.3.
+History Unit data structure.
+
+slot number   local winner (64 bits)   global winner (6 bits)   valid (1 bit)
+0             L00[63..0]               G00[5..0]                V00
+1             L01[63..0]               G01[5..0]                V01
+...           ...                      ...                      ...
+18            L18[63..0]               G18[5..0]                V18
+19            L19[63..0]               G19[5..0]                V19
+20            L20[63..0]               G20[5..0]                V20      <- head
+21            L21[63..0]               G21[5..0]                V21
+22            L22[63..0]               G22[5..0]                V22
+...           ...                      ...                      ...
+64            L64[63..0]               G64[5..0]                V64
+
+At each step the next output bit is issued from the least significant bit of the
+current head-slot global winner identifier. Then the new local and global win-
+ner information is written into the head slot and the head pointer moves to the
+next slot. The new local and global winner information is used to initiate a
+backtrace, which updates the current optimum path held in the global winner
+memories.
+The trellis arrangement produces a simple arithmetic relationship between
+one state and the next state so that, given a global winner identity in one slot,
+the previous global winner identity is readily computed. The parent identity
+can be derived from the child identity by shifting the child state one place to
+the right and inserting the relevant local winner bit into the most significant bit
+position. For example, if the global winner is node 23 in a slot, then the global
+
+
+
+winner in the previous slot will be node 11 (if the current slot local winner for
+node 23 is 0) or node 11+32 = node 43 (if the local winner for node 23 is 1).
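+
+The parent calculation can be checked with a few lines of Python. This is a
+minimal sketch: the function names parent_state and child_states are
+hypothetical, and the child relationship is derived here by inverting the
+shift described in the text rather than taken from the design itself.

```python
STATE_BITS = 6  # 64 trellis states, as in the decoder described here


def parent_state(child, local_winner_bit):
    """Shift the child right and insert the local winner bit as the new MSB."""
    return (local_winner_bit << (STATE_BITS - 1)) | (child >> 1)


def child_states(parent):
    """Inverse relation: shift left, drop the old MSB, append a 0 or a 1."""
    base = (parent << 1) & ((1 << STATE_BITS) - 1)
    return base, base | 1
```

+With these definitions, parent_state(23, 0) gives 11 and parent_state(23, 1)
+gives 43, matching the example in the text, and node 23 appears among the
+children of both node 11 and node 43.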
+Where the current global winner is the child of the previous global winner,
+the current winner continues the good path already stored in the global winner
+memories. This makes it unnecessary to search back through the local winner
+information in order to reconstruct the trellis and hence saves power. There-
+fore, when data is received from the PMU, if the incoming global winner is the
+child of the last winner, then it is only necessary to output data from the oldest
+global winner entry and then to overwrite this memory line with the incoming
+local and global winner information.
+However, if sufficient noise is present (or noise has been present and the
+data now switches back to a good stream), then there may be a discontinu-
+ity between the incoming and previous global winner; this is recognised by
+the current global winner not being the child of the previous winner. In this
+case, the global winner memories do not hold a good path and this path is re-
+constructed using the local winner information. Here, the output data is read
+out and the winner information is written in as before. In addition, starting
+from the current global winner, this node identification is used to select its up-
+per/lower branch winner from the current local winner information. The parent
+identity is then constructed as described above. This computed parent identity
+is compared with the global winner identity for the previous slot. If they are
+the same then the backtrace has now converged onto the good path kept in the
+global winner memories and no further action is required. If, however, they are
+not the same then the computed parent identity needs to overwrite the global
+winner in the previous timeslot. The backtrace process now repeats in order to
+construct a good path to the next previous timeslot, and this process continues
+until the computed parent identity does match the stored global winner.
+Backtracing slot by slot thus proceeds until the computed path converges
+with the stored path. The algorithm is shown in a Balsa-like pseudo-code in
+figure 14.10. (Note, however, that Balsa does not have the ‘<<’ and ‘>>’
+shift operators; they are borrowed here from C to improve clarity.) In practice,
+simulation shows that path convergence usually occurs within eight or fewer
+slots. So, although the most recent items may be over-written, the oldest items
+tend to be static and the output data from the oldest slots does not change.
+Overwriting the entire path is a rare occurrence and, in this circumstance, the
+data output from the system is almost certainly erroneous.
+No backtrace is commenced in any slot where the global winner is invalid;
+the global winner entry is marked as invalid but the local winner information is
+written in the normal way. Any subsequent backtrace that encounters an invalid
+global winner will force a mismatch with the incoming computed global
+winner at that slot, so that the computed global winner replaces the invalid
+stored value and the entry is then marked as valid.
+
+
+
+loop
+  c := head;                          -- child starts at head
+  data_out <- Gc[0];                  -- output lsb of Gc
+  Lc := In.local_winners;             -- update head local winners,
+  Gc := In.global_winner;             --   global winner, and
+  Vc := In.global_winner_valid;       --   global winner valid bit
+  head := (head + 1) % 65;            -- step head pointer to next slot
+  if Vc then                          -- backtrace only from valid head
+    p := (c-1) % 65;                  -- parent slot number
+    while (c /= head                  -- detect buffer wrap-around
+           and (not Vp                -- over-write invalid parent
+                or Gp /= (Lc[Gc] << 5) + (Gc >> 1)))  -- not converged
+    then
+      Gp := (Lc[Gc] << 5) + (Gc >> 1);  -- update parent
+      Vp := TRUE;                     -- mark as valid
+      c := p;                         -- next slot
+      p := (c-1) % 65                 -- next parent slot number
+    end -- while
+  end -- if
+end -- loop
+
+Figure 14.10.
+History Unit backtrace sequential algorithm.
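+
+For readers who wish to experiment with the sequential algorithm, it can be
+transcribed into Python as below. This is a sketch under stated assumptions
+and not the fabricated design: the class name HistoryUnit is hypothetical,
+the 64 local winner bits of a slot are assumed to be packed into one integer
+with bit i belonging to node i, and the head pointer is assumed to wrap
+modulo 65.

```python
SLOTS = 65  # history depth used in the chapter


def parent(child, local_winners):
    """Parent node: child shifted right, its local winner bit inserted as MSB."""
    return (((local_winners >> child) & 1) << 5) | (child >> 1)


class HistoryUnit:
    """Software model of the sequential backtrace (illustrative only)."""

    def __init__(self):
        self.L = [0] * SLOTS       # local winner words (assumed bit i = node i)
        self.G = [0] * SLOTS       # 6-bit global winner identifiers
        self.V = [False] * SLOTS   # 'global winner valid' flags
        self.head = 0

    def step(self, local_winners, global_winner, valid):
        c = self.head
        data_out = self.G[c] & 1                  # lsb of the oldest winner
        self.L[c], self.G[c], self.V[c] = local_winners, global_winner, valid
        self.head = (self.head + 1) % SLOTS       # assumed modulo-65 wrap
        if valid:                                 # backtrace only if valid
            p = (c - 1) % SLOTS
            while c != self.head:                 # stop on buffer wrap-around
                expected = parent(self.G[c], self.L[c])
                if self.V[p] and self.G[p] == expected:
                    break                         # converged with stored path
                self.G[p], self.V[p] = expected, True
                c, p = p, (p - 1) % SLOTS
        return data_out
```

+Feeding a global winner that is a child of the previous one terminates the
+loop on its first comparison, which corresponds to the power-saving case
+described above where no path reconstruction is needed.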
+
+14.6.3
+History Unit implementation
+
+The type of memory used in the HU is the dominant factor in determining
+its implementation. Initially, RAM elements were considered for this storage
+as single and dual-port read elements were present in the available cell library.
+However, their use makes it difficult to keep track of incomplete backtraces
+when a new backtrace needs to be started. In addition, the global and local
+winner memories need to be separate entities but this introduces some ineffi-
+ciency in the address decoding. Furthermore, there are difficulties in providing
+the many specific timed signals required to drive the memory. The RAM tim-
+ings are equivalent to many simple gate delays. Such gates would be used to
+form the reference timing signals, and it is not clear that the gate propagation
+delays due to supply voltage changes vary in the same way as the RAM delays.
+For these reasons, the memory was implemented with flip-flop storage and
+the system is shown in figure 14.11. It comprises 65 lines made up of 64
+slots of replicated storage and one further slot which, on reset, becomes the
+slot holding the head token; the head slot receives the new local and global
+winner information from the PMU. The control block holds the global winner
+identification plus the token handling and backtrace logic.
+The concurrent asynchronous algorithm is illustrated in Balsa-like pseudo-
+code in figure 14.12, which represents a single stage in the History Unit. The
+complete HU comprises 65 such stages, one of which is initialised to be the
+
+
+
+[Figure: block diagram of the History Unit showing the control blocks and the
+local winner memories of adjacent slots; signal labels: evaluate, addr, token,
+Rin, Ain, data_out, Aout, local winners, global winner, addr, data, strobe.]
+
+Figure 14.11.
+History Unit implementation.
+
+head of the circular buffer. This can be compared with the sequential algo-
+rithm shown earlier in figure 14.10. The transformation from the sequential
+to the concurrent algorithm is illustrative of high-level asynchronous design
+methodologies even when the design is being carried out manually (as was this
+Viterbi decoder) and not in a high-level language such as Balsa.
+The head slot contains the oldest winner information and determines the 1-
+bit data output from the system. Remembering that odd states signify a ‘1’ in-
+put and even states a ‘0’ input, the head slot outputs the least significant global
+winner bit on the data-out line. This data enters a buffer and the acknowledge
+Aout signifies its acceptance. The head is then free to write the current winner
+information to its memory. The Token signal then passes (leftwards) to the
+adjacent slot which now becomes the new head.
+The parent node of the current global winner is computed as described and
+this is passed (rightwards) to the adjacent slot with an Evaluate signal. The
+computed parent is compared with the stored winner in the previous stage.
+Equivalence results in no further backtracing and the backtrace is said to be
+retired. A mismatch causes overwriting of the global winner and this winner,
+accompanied by Strobe, is used to address (Addr) the local memory. The
+data bit returned on Data is used to compute the parent of this winner which is
+then passed rightwards to the preceding timeslot with an Evaluate signal. This
+process repeats until the backtrace converges with the existing global winners
+
+
+
+loop
+  arbitrate In then
+    if head then                              -- head...
+      data_out <- G[0];                       --   output next data bit
+      L := In.local_winners ||                --   update local values
+      G := In.global_winner ||
+      V := In.global_winner_valid;
+      if V then                               --   if global winner is valid
+        addrOut <- (L[G] << 5) + (G >> 1)     --   start backtrace
+      end; -- if V
+      sync tokenOut ||                        --   pass on head token
+      head := false                           --   and clear head Boolean
+    end -- if head
+  | addrIn then                               -- backtrace input?
+    if not head then
+      if addrIn /= G or not V then            -- path converged? If not...
+        G := addrIn ||                        --   update global winner
+        V := true;
+        addrOut <- (L[G] << 5) + (G >> 1)     --   propagate backtrace
+      end -- if ..
+    end -- if not head
+  | tokenIn then                              -- head token arrived
+    head := true                              --   set head Boolean
+  end -- arbitrate
+end -- loop
+
+Figure 14.12.
+History Unit backtrace stage.
+
+and can be retired. A backtrace has to be forcibly retired if it is in danger of
+running into the head slot; an arbiter is used to test for this in the control and it
+is the only place where arbiters need to be used in the system. Fortunately, the
+meeting of the head and backtrace processing is a rare occurrence.
+It should be noted that, unlike a conventional system, path reconstruction is
+only undertaken if necessary and then for only as long as required; both strate-
+gies save power. Furthermore, the use of asynchronous techniques within the
+HU enables the writing of winner information from the PMU to be indepen-
+dent of and run concurrently with any path reconstruction activity. The use
+of flip-flop storage rather than RAM has resulted in a simpler and more flexi-
+ble design. It also has the distinct advantage of enabling multiple backtraces,
+whose frontiers are all at different slots, to be run concurrently.
+
+14.7.
+Results and design evaluation
+
+The asynchronous Viterbi decoder system was implemented as described
+using the industrial partner’s cell library which was designed to operate from
+3.3V. Non-standard elements such as the Muller C-gate were constructed from
+
+
+
+the cell library; the only full-custom element which had to be designed was an
+arbiter.
+Following simulation, the decoder was fabricated on a 0.35 micron CMOS
+process. Results for a 1/2 code rate, random input bit stream with no errors
+show an overall power dissipation of 1333 mW at 45 Msymbols/sec. Of this,
+the dissipation in the PMU dominates at 1233 mW while the HU takes only
+37 mW. The difference (about 60 mW) between these figures and the
+overall consumption is accounted for by the dissipation in the BMU and in
+the small amount of ‘glue’ logic prior to the BMU.
+Errors in the input data to the decoder result in only small variations in the
+dissipation with the overall dissipation falling slightly with an increasing input
+error rate; internally, this decrease comprises a small reduction in the PMU
+dissipation and a smaller rise in the HU dissipation. The results for other code
+rates are a scaled version of those obtained for the 1/2 code rate as might be
+expected. For example, a 3/4 code rate which receives 3 symbols for every 4
+clocks exhibits 1.5 times the dissipation of the 1/2 rate code with its 2 symbols
+every 4 clocks.
+The asynchronous PMU performs approximately the same amount of work
+regardless of the number of errors in the data stream. This results from capping
+numbers at 6. This means that for a good data stream, all nodes have a weight
+of six except the one node on the good (correct) path. Thus the PMU is almost
+permanently ‘saturated’ and practically all work performed relates to paths
+which will never be selected! Errors cause some spread in the node weights
+with the higher weights (4, 5 and 6) predominating, and the slightly smaller
+counts on some nodes account for the slight drop in dissipation in the PMU
+under error conditions.
+The asynchronous PMU dissipation is very high and does not compare well
+with synchronous PMUs using conventional parallel arithmetic to perform the
+add, compare and select operation [15]. In order to understand why this occurs,
+it is necessary to examine the operation and logic used in the asynchronous
+PMU in more detail. With good (i.e. no error) data, 63 nodes have weights
+of 6 and one node has a weight of 0. This translates to PMU activity on each
+timeslot where all State Metric FIFOs but one contain counts of 6 which are
+then transferred to Path Metric FIFOs, whose counts of 6 are in turn removed
+from the Path Metric FIFOs, paired and transferred to the receiving State Met-
+ric FIFOs. Entering or removing a count of 6 from a FIFO involves 21 changes
+of state in the stages. Furthermore, the number of transitions actually involved
+is higher due to (around 5) internal transitions on the elements making up the
+C-gates forming each FIFO stage. Thus, each of 63 nodes experiences around
+650 transitions just on the data path per timeslot. The control and other over-
+heads on the data path can be expected to form (say) an additional 30% of
+logic. This indicates a node activity of around 850 transitions/timeslot and
+
+
+
+overall the PMU can be expected to make a maximum of 54,400 transitions on
+each timeslot.
+Unfortunately, the design of the FIFOs, particularly the Path Metric FIFOs,
+has led to high capacitive loading on the true and inverse C-gate outputs at
+each stage. A dissipation of 1233 mW with 54,400 transitions per slot and an
+energy cost of 5.45 × C joules per transition, where C is the average track plus
+gate loading capacitance (in Farads), indicates an average loading of 92 fF and
+this is confirmed by measurements.
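+
+The arithmetic behind these figures can be reproduced as follows. The sketch
+assumes that the energy per transition is C·V²/2 at the 3.3 V supply (which
+gives the 5.45 × C coefficient) and that there is one timeslot per symbol
+period at 45 Msymbols/sec; the constant and function names are hypothetical.

```python
SUPPLY_V = 3.3                  # cell library supply voltage (from the text)
SLOT_RATE_HZ = 45e6             # assumption: one timeslot per symbol period
PMU_POWER_W = 1.233             # measured PMU dissipation (from the text)
DATAPATH_TRANSITIONS = 650      # per node per timeslot (from the text)
CONTROL_OVERHEAD = 0.30         # the "(say) additional 30%" of logic
NODES = 64


def transitions_per_node():
    """650 data path transitions plus 30% overhead, before rounding to 850."""
    return DATAPATH_TRANSITIONS * (1 + CONTROL_OVERHEAD)


def transitions_per_slot(per_node=850):
    """The chapter rounds to 850 per node; 850 x 64 gives its 54,400 figure."""
    return per_node * NODES


def average_load_farads():
    """Solve P = N * f * (C * V^2 / 2) for the average load capacitance C."""
    energy_per_transition = PMU_POWER_W / (transitions_per_slot() * SLOT_RATE_HZ)
    return energy_per_transition / (0.5 * SUPPLY_V ** 2)
```

+Under these assumptions average_load_farads() evaluates to roughly 92 to 93
+fF, in line with the average loading quoted above.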
+By contrast, the HU power efficiency is excellent and is the real success
+story in this design. Its dissipation is low and is far smaller than that in any
+other system the designers are aware of. The HU dissipation demonstrates that
+by keeping a good path, very little computing is required to output the data
+stream when there are no errors. Furthermore, when lots of noise is present,
+so that the backtrace process is active with many good paths in the process of
+being reconstructed concurrently, the dissipation in the HU only rises slightly;
+this indicates that accessing the local winner memory in flip-flops and over-
+writing the global winner information is not a power-costly operation.
+The HU dissipation also compares favourably with a (synchronous) system
+built along similar principles to the HU described here but using RAM ele-
+ments from the library instead of flip-flops. Due to the limitations of the RAM
+devices previously mentioned, these introduce additional complexity because
+only one backtrace is performed at any time; it is therefore necessary to keep
+track of the depth reached by incomplete backtraces which are abandoned for a
+new backtrace leaving a partially complete global winner path reconstruction.
+The difference in dissipation between the asynchronous HU using flip-flops
+and the other using RAMs reflects the power cost of accessing the local win-
+ner RAM and the associated significant additional computation involved in the
+backtrace. This points to the power efficiency of storing the HU information
+in a manner best suited to the task.
+
+14.8.
+Conclusions
+
+As in many asynchronous designs, the system design has had to be ap-
+proached from first principles and has caused a complete rethink about how
+to implement the Viterbi algorithm. This has resulted in a number of novel
+features being incorporated in the PMU and HU units. In the PMU, the deci-
+sion to use serial unary arithmetic has enabled the conventional parallel add,
+compare and select logic to be dispensed with and replaced by dataless FIFOs
+which perform the arithmetic serially.
+While the PMU is an interesting and different design from that convention-
+ally used, its power consumption is not good. Its design illustrates that power
+efficiency at the algorithmic and architecture levels needs to be combined with
+
+
+
+efficient design at the logic, circuit and layout levels to realise the true po-
+tential of a system. This is demonstrated by a synchronous PMU constructed
+along similar architectural principles to those described but implemented using
+a low-power logic family and full custom layout which dissipates only 70 mW
+at 45 Msymbols/sec [15]. It is clear that while a full custom design of the asyn-
+chronous PMU datapath would reduce the current power levels significantly, a
+major revision of the PMU logic for the datapath, paying particular attention to
+loading, is required for a design which has better power efficiency than other
+systems.
+The identification of a global winner is probably the most important advance
+in the PMU design. This has meant that both a good path and a local winner
+history can be kept by the HU, leading to a greatly reduced amount of overall
+storage required to deduce the output data. The use of flip-flop storage has also
+greatly contributed to the power efficiency of this unit and it does demonstrate
+the power advantages of optimising design at all levels in the design hierarchy
+down to and including the logic.
+The HU also illustrates the advantages of asynchronous design in that the
+placing of current information is decoupled from any backtracing operations of
+which there may be many running concurrently. Furthermore, the speed of the
+backtracing is only dependent on the logic required to perform this operation
+and not on any other internal or external system timing. Such a decoupled,
+multiple backtracing activity would clearly be more difficult to organise in the
+context of a synchronous timing environment.
+
+14.8.1
+Acknowledgement
+
+As in any large project, a number of people have been engaged in the design
+and implementation of the Viterbi decoder described in this chapter. It is there-
+fore a pleasure to acknowledge the other colleagues in the Amulet group in the
+Computer Science Department at Manchester University involved at all stages
+in this project, namely Mike Cumpstey, Steve Furber and Peter Riocreux. I am
+also grateful to them for comments on the draft of this chapter.
+
+14.8.2
+Further reading
+
+Further information on the Viterbi algorithm may be found in [148], [71]
+and [70]. Further information on the PREST project is in [1].
+
+
+Chapter 15
+
+PROCESSORS
+
+*
+
+Jim D. Garside
+Department of Computer Science, The University of Manchester
+
+jgarside@cs.man.ac.uk
+
+Abstract
+Computer design becomes ever more complex. Small asynchronous systems
+may be intriguing and even elegant but unless asynchronous logic can not only be
+competitive with ‘conventional’ logic but can show some significant advantages
+it cannot be taken seriously in the commercial world.
+There can be no better way to demonstrate the feasibility of something than
+by doing it. To this end several research groups around the world have been
+putting together real, large, asynchronous systems. These have taken several
+forms, but many groups have chosen to start with microprocessors; a processor
+is a good demonstrator because it is well defined, self-contained and forces a de-
+signer to solve problems which are already well understood. If an asynchronous
+implementation of a microprocessor can compare favourably with a synchronous
+device performing an identical function then the case is proven.
+This chapter describes a number of processors that have been fabricated and
+discusses in some detail some of the solutions employed. The primary source of
+the material is the Amulet series of ARM implementations – because these are
+the most familiar to the author – but other devices are included as appropriate.
+The later parts of the chapter widen the descriptions to include memory systems,
+cacheing and on-chip interconnect, illustrating how a complete asynchronous
+System on Chip (SoC) can be produced.
+
+Keywords:
+low-power asynchronous circuits, processor architecture
+
+*The majority of the work described in this chapter has been made possible by grants from the European
+Union Open Microprocessor systems Initiative (OMI). The primary sources of funding have been OMI-
+MAP (Amulet1), OMI/DE-ARM (Amulet2e) and OMI/ATOM (Amulet3). Without this funding none of
+these devices would have been made and this support is gratefully acknowledged.
+
+
+
+
+15.1.
+An introduction to the Amulet processors
+
+Most of the examples in this chapter are based around the Amulet series
+of microprocessors, developed at the University of Manchester. All of these
+have been asynchronous implementations of the ARM architecture [65] and,
+as such, allow direct comparisons with their synchronous contemporaries. It
+should be noted that the primary intention, as in other ARM designs, was to
+produce power-efficient rather than high-performance processors.
+Brief descriptions of the three fabricated Amulet processors and some other
+notable examples are given below.
+
+15.1.1
+Amulet1 (1994)
+
+Figure 15.1.
+Amulet1.
+
+Amulet1 [158] (figure 15.1) was a feasibility study in asynchronous design,
+using techniques based extensively on Sutherland’s Micropipelines [128].
+Although two-phase signalling was used for communication, standard transparent
+latches were used internally rather than Sutherland’s capture-pass latch (see
+figure 2.11 on page 20). The external two-phase interface proved difficult to
+interface with external commodity parts.
+Amulet1 comprised 60,000 transistors in a 1.0 µm, 2-layer metal process and
+ran the ARM6 instruction set (with the exception of the multiply-accumulate
+operation).
+It achieved about half the instruction throughput of an ARM6
+manufactured on the same process with roughly the same energy efficiency
+(MIPS/W).
+
+
+Chapter 15: Processors
+275
+
+Figure 15.2.
+Amulet2e.
+
+15.1.2
+Amulet2e (1996)
+
+Amulet2e [44] (figure 15.2) was an ARM7 compatible device with complete
+instruction set compliance. In addition to the CPU it included an asynchronous
+4 KByte cache memory and a flexible external interface making it much easier
+to integrate with commodity parts. A few other optimisations such as (limited)
+result forwarding and branch prediction were added.
+Internally this device used four-phase rather than two-phase handshake pro-
+tocols. It occupied 450,000 transistors (mostly in the cache memory) in a
+0.5 µm 3-layer metal process; although about three times faster than Amulet1 it
+was still only half the performance of a contemporary synchronous chip.
+
+15.1.3
+Amulet3i (2000)
+
+Amulet3i [48] was intended as a macrocell for supporting System on Chip
+(SoC) applications rather than a stand-alone device. It is an ARM9 compatible
+device comprising around 800,000 transistors in a 0.35 µm 3-layer metal
+process. It comprises an Amulet3 CPU, 8 KBytes of pseudo-dual port RAM,
+16 KBytes of ROM, a powerful DMA controller and an external memory/test
+interface, all based around a MARBLE [4] asynchronous on-chip bus.
+
+
+
+Figure 15.3.
+DRACO.
+
+Amulet3i achieves roughly the same performance as a contemporary, syn-
+chronous ARM with an equal or marginally better energy efficiency. It was
+integrated with a number of synchronous peripheral devices, designed by Ha-
+genuk GmbH, to form the DRACO (DECT Radio Communications Controller)
+device (figure 15.3).
+
+15.2.
+Some other asynchronous microprocessors
+
+Several other groups around the world have also been developing asyn-
+chronous microprocessors over the past decade or so. In this section a (non-
+exhaustive) selection of these are briefly described.
+Caltech has produced two asynchronous processors: the ‘Caltech Asyn-
+chronous Microprocessor’ (1989) [86] was a locally-designed 16-bit RISC
+which was the first single chip asynchronous processor; the ‘MiniMIPS’ [88]
+was an implementation of the R3000 architecture [72]. Both of these proces-
+sors were custom designed using delay-insensitive coding rather than the ‘bun-
+dled data’ used in the Amulets. This, together with a design philosophy aimed
+at speed rather than low-power consumption results in high-performance, high-
+power devices.
+Another MIPS-style microprocessor is the University of Tokyo’s ‘TITAC-2’
+(1997) [130] (figure 15.4) which is an R2000. This was developed in another
+
+
+
+Figure 15.4.
+TITAC-2. (Picture courtesy of the University of Tokyo.)
+
+Figure 15.5.
+ASPRO-216.
+(Picture courtesy of the TIMA Laboratory, CIS Group, IMAG
+(Grenoble).)
+
+
+
+different design style (quasi-delay insensitive). As may be apparent from the
+figure TITAC-2 employed considerable manual layout.
+‘ASPRO-216’ (1998) [117] from IMAG in Grenoble is slightly different in
+that it is a 16-bit signal processor rather than a general-purpose microprocessor.
+More significantly its design was largely automated and synthesised from a
+CHP (Communicating Hardware Processes) [84, 118] description. This shows
+in the more ‘amorphous’ appearance of the processor, although the chip is
+dominated by its memories (figure 15.5).
+All the devices mentioned above have been research prototypes. Commer-
+cial take up of asynchronous processor technology has been slower; neverthe-
+less there are some significant examples.
+Philips Research Laboratories in Eindhoven have been developing the ‘Tan-
+gram’ [135] circuit synthesis system, primarily aimed at low-performance,
+very low power systems. This has been used to produce an asynchronous im-
+plementation of the 80C51 (1998) [144] which has been deployed in commer-
+cial pager devices where its low power and low EMI properties are particularly
+attractive. It is also intended for use in smartcard applications (see chapter 13
+on page 221).
+Although not strictly a microprocessor the Sharp DDMP (Data-Driven Me-
+dia Processor) (1997) [131] merits inclusion here. Intended for multimedia
+applications this provides a number of parallel processing elements which are
+employed or left idle according to the demand at any time. Asynchronous
+technology was attractive here because of the ease of power management.
+Finally the DRACO device (figure 15.3) was designed specifically for com-
+mercial use although not (yet) marketed due to company reorganisation. As
+a processor in a radio ‘base station’ the low EMI properties of asynchronous
+logic were the reasons for adoption of this technology.
+
+15.3.
+Processors as design examples
+
+Why build an asynchronous microprocessor? Part of the answer must be the
+various advantages of low power, low EMI etc. claimed for any asynchronous
+design and demonstrated in the commercial devices mentioned above, but why
+is a processor a good demonstration of these features?
+In many ways it is not. A better demonstrator of asynchronous advantages
+may well be an application with a regular structure which is amenable to very
+fine grain pipelining: some signal processing or network switching applica-
+tions have these characteristics. However there is a great deal of appeal in con-
+structing a microprocessor. Firstly, it is a well-defined and self-contained prob-
+lem; it is easy to define what a microprocessor should do and to demonstrate
+that it fulfils that specification. Secondly, it forces an asynchronous designer to
+confront and solve a number of implementation problems which might not occur
+if a ‘tailor made’ demonstration was chosen. Lastly, it is often possible to
+compare the result with contemporary, synchronous devices in order to quan-
+tify and assess the results of the work. Of course, microprocessors are also an
+intensely fast-moving and competitive market in which it is hard to compete,
+even in a familiar technology!
+
+15.4.
+Processor implementation techniques
+
+15.4.1
+Pipelining processors
+
+Pipelining [56] the operation of a device such as a microprocessor is an effi-
+cient way to improve performance. At its simplest, pipelining can subdivide a
+time-consuming operation into a number of faster operations whose execution
+can be overlapped. If done ‘perfectly’ a five-stage pipeline can speed up an
+operation by (almost) five times with only a small hardware cost in pipeline
+latches.
+A typical synchronous pipeline should divide the logic into equally timed
+slices. If there is an imbalance in the partitioning the slowest pipeline stage
+sets the clock period for the whole system; faster logic is slowed down by the
+clock resulting in some time wastage (figure 15.6). This is more pronounced
+if the timing of a particular pipeline stage is, in some way, variable or ‘data
+dependent’. Data dependencies can be quite common in a microprocessor: a
+simple example is an ALU operation, where a ‘move’ operation is faster than
+an addition because the former requires no carry propagation. A
+more ‘exaggerated’ example is a memory access where a cache hit is much
+faster than a cache miss.
+
+[Figure labels: Clock period; Useful work; stages Fetch, Decode, Evaluate, Transfer, Write.]
+
+Figure 15.6.
+Synchronous pipeline usage.
+
+In most such cases the clock must allow for the slowest possible
+cycle. This either slows down the clock or causes the designer to invest
+considerable hardware in speeding up the worst-case operations, for example
+adding fast carry propagation mechanisms. Whilst the latter is a good trade
+if a high proportion of the operations are slow, it is poor economics if the
+worst-case operations are rare.
+Another possible solution open to the synchronous designer is to allow cer-
+tain slow operations to occupy more than one clock cycle; this is clearly expe-
+dient when a cache miss occurs and the processor must idle until an off-chip
+memory can be read. However multi-cycle operations introduce the need for
+system-wide clock control to stall other parts of the system; even then the tim-
+ing resolution is still limited to whole clock cycles.
+Asynchronous pipelining is conceptually much easier. Not only is all control
+localised, but it is implicitly adaptable to pipeline stages with variable
+delays. This means that rare, worst-case operations can be accommodated,
+without either excessive hardware investment or a significant impact on the
+overall processing time, by altering the delay of a pipeline stage dynamically. This,
+combined with the fact that operations can flow at their own rate rather than
+being stacked one to a stage, gives the pipeline considerable ‘elasticity’. In
+such a pipeline a cache miss still occupies a single cycle, although that cycle
+will be a particularly slow one!
+Another clear example of this is visible in the ARM instruction set [65]
+where data processing operations and address calculations may specify an
+operand shift before the ALU operation. In early ARM implementations this
+was contained in a single pipeline stage and hidden by the slower memory ac-
+cess time. To avoid a performance penalty more modern synchronous designs
+have the options of:
+
+an additional shifter pipeline stage (increases latency);
+
+stalling for an extra clock when a shift is required (increases complex-
+ity).
+
+An asynchronous pipeline can simply stretch the execution cycle when re-
+quired. As the additional time is unlikely to be as long as the ALU stage delay
+this is more flexible than either of the synchronous options, and the overall
+impact is small because shift/operate operations are quite rare.
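+
+The contrast between the two pipeline styles described in this section can be
+sketched numerically. The model below is an illustration only, not a
+description of any Amulet implementation: the function names, the unit stage
+delays and the single slow operation are all assumptions, latch and handshake
+overheads are ignored, and an instruction is assumed always to be able to leave
+a stage once it has finished (i.e. unbounded buffering, so the asynchronous
+figure is optimistic).
+
```python
def synchronous_time(delays):
    """Total time when every stage advances on one common clock.

    delays[i][s] is the time instruction i needs in stage s.
    The clock period must allow for the slowest possible cycle.
    """
    n_instr, n_stages = len(delays), len(delays[0])
    clock_period = max(max(row) for row in delays)
    return (n_instr + n_stages - 1) * clock_period


def asynchronous_time(delays):
    """Total time when a stage starts as soon as its input and the stage are free.

    Assumption: no back-pressure and no handshake overhead are modelled.
    """
    n_stages = len(delays[0])
    stage_free = [0.0] * n_stages  # when each stage finished its previous job
    for row in delays:
        ready = 0.0  # when this instruction left the previous stage
        for s, d in enumerate(row):
            start = max(ready, stage_free[s])
            stage_free[s] = start + d
            ready = stage_free[s]
    return stage_free[-1]


def example_delays(n_instr=100, n_stages=5, slow_instr=50, slow_stage=2, slow=5.0):
    """Unit delays everywhere except one rare slow operation (hypothetical values)."""
    delays = [[1.0] * n_stages for _ in range(n_instr)]
    delays[slow_instr][slow_stage] = slow
    return delays


if __name__ == "__main__":
    d = example_delays()
    print("synchronous :", synchronous_time(d))
    print("asynchronous:", asynchronous_time(d))
```
+
+With perfectly balanced stages the two functions return the same total; the
+difference appears only when a rare slow operation forces the synchronous
+clock period up for every cycle, whereas the asynchronous model stretches just
+the one slow cycle.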
+
+‘Average case’ performance.
+The elasticity of an asynchronous pipeline has
+led to the myth that an asynchronous pipeline can perform its processing in an
+‘average’ rather than worst case time for each pipeline stage. This is true only
+if the unit under consideration is kept constantly busy. This will not be true in
+the general case: on some occasions the unit will be ready before its operands
+and at other times it will have completed before subsequent stages can accept
+its output; in each case an interlude of idleness is enforced. This effect can be
+reduced by providing fast buffers between processing elements to allow some
+‘play’ in the timing, but true average case behaviour can only be achieved with
+buffers of infinite size. Any buffering increases the pipeline latency and should
+therefore be used with circumspection.
+In practice an asynchronous pipeline should be balanced in a similar fashion
+to a synchronous pipeline. The difference is that occasional, time consuming
+operations can be accommodated without either pipeline disruption or signif-
+icant extra hardware. An added bonus is that the pipeline latency, especially
+when filling an ‘empty’ pipeline, can be reduced because the first instruction
+is not retarded by the clock at each stage (figure 15.6). The problem then de-
+volves into partitioning the system and solving the resulting problem of internal
+operand dependences.
+
+15.4.2
+Asynchronous pipeline architectures
+
+Once an asynchronous pipeline has been developed it is very easy to add
+pipeline stages to a design. However, pipelining indiscriminately can be a Bad
+Thing, at least in a general-purpose processor. The major reason for this is
+the need to resolve dependencies [56], where one operation must gather its
+operands from the results of previous instructions; if many operations are in
+the pipeline simultaneously it is quite likely that any new instructions will be
+forced to wait for some of these to complete. Resolving dependencies in an
+asynchronous environment can be a relatively complex task and is discussed in
+a later section.
+A less obvious consequence is the increase in latency in the pipeline, not
+only due to added latch delays but because, in some circumstances, the pipeline
+needs to drain and refill. Consider a processor pipeline with a fast FIFO buffer
+acting as a prefetch buffer; this initially seems like a good idea as it can help
+reduce stalls due to (for example) occasional slow cycles in the prefetch (e.g.
+cache miss) and execution (e.g. multiplication) stages. However, at least in
+a general purpose processor, this benefit is masked because a single stall can
+fill up the prefetch buffer and, typically shortly thereafter, a branch requires it
+to be drained. This was an architectural error evidenced in Amulet1, which
+suffered a noticeable performance penalty due to its four-stage prefetch buffer.
+Of course in other applications, where pipeline flushes are rare, the ease of
+adding buffering can provide significant gains; however experience has shown
+that for a general purpose CPU a more conventional approach producing a
+reasonably balanced pipeline based around a known cycle time (such as the
+cache read-hit cycle time) is the ‘best’ approach.
+One definite advantage of an asynchronous pipeline is that the pipeline flow
+can be controlled locally. Consider that bane of the RISC architecture the
+multi-cycle instruction; in a synchronous environment such an operation must
+be able to suspend large parts or all of the processing pipeline, necessitating a
+widespread control network. In an asynchronous environment the system need
+not be aware of operations in other parts of the system; instead a multi-cycle
+operation simply appears as a longer delay, possibly causing a stall if other
+processing elements require interaction with the busy unit(s).
+In the ARM architecture there is one case where this ability is very useful;
+the multiple register load and store operations (LDM/STM) can transfer an
+arbitrary subset of the sixteen current registers to or from memory. Amulet3
+implements this by generating several local ‘instruction’ packets for the exe-
+cution stages for a single input handshake. At this point it is likely that the
+prefetch will fill up the intervening latches and stall, but this is a natural con-
+sequence of the pipeline’s operation.
+There are other examples of this behaviour in Amulet3 (figure 15.7), notably
+the Thumb decoder which ingests 32-bit packets which can contain either one
+ARM instruction or two of the ‘compressed’, 16-bit Thumb instructions. In
+the latter case two output handshakes are generated for each input. This pro-
+vides an advantage over earlier (synchronous) Thumb implementations, which
+fetch instructions sixteen bits at a time, because it reduces the number of in-
+struction fetch memory cycles and, with a slow memory, uses the full available
+bandwidth. The power consumption in the memory system is also reduced
+commensurately.
+Local control is also possible in instructions such as ‘CMP’ (compare) which
+do not need to traverse the entire pipeline length; it is just as easy to remove a
+packet from the pipeline by generating no output handshakes as it is to generate
+extra packets. In Amulet3 comparison operations only affect the processor’s
+flags which reside in the ‘execute’ stage and therefore cause no activity further
+down the pipeline.
+A final benefit of the localised control is that the pipeline operation can
+be regulated by any active stage. Both Amulet2 and Amulet3 have retrofitted a
+‘halt’ (until interrupted) instruction into the ARM instruction set (implemented
+transparently from an instruction which branches to itself). This can be de-
+tected and ‘executed’ anywhere within the pipeline with the same effect of
+causing the processor to stall. Indeed Amulet2 instigates halts in its execution
+unit whereas Amulet3 halts by suspending instruction prefetch, but the over-
+all effect is the same. Halting an asynchronous processor (or part thereof) is
+equivalent to stopping the clock in a synchronous processor and, in a CMOS
+process, can drop the power consumption to near zero. This facility therefore
+makes power management particularly easy and – in many cases – near auto-
+matic.
+
+15.4.3
+Determinism and non-determinism
+
+Before examining specific architectural techniques which can be employed
+in an asynchronous processor it is worth considering something of the design
+philosophies employed. The most significant is probably that of
+non-determinism because, unlike a synchronous processor, an asynchronous
+processor can behave non-deterministically and yet still function correctly.
+
+[Figure labels: Prefetch; Instr Memory; Thumb; Decode & Reg. Rd.; Execute; Data Interface; Data Memory; FIFO; Reorder Buffer; Register Write; Latch (between stages); FIQ; IRQ; skip memory; may generate additional packets; branch addresses; forwarding; indirect PC load; store data; load data; addr.]
+
+Figure 15.7.
+Amulet3 core organisation.
+
+An advantage in the analysis and design of a synchronous system is that the
+state in the next cycle can be determined entirely from the current state. This
+may also be true in an asynchronous system, but the timing freedom means
+that this is not the only choice of action. Within a small asynchronous state
+machine it is possible to achieve the same behaviour with internal transitions
+ordered differently (e.g. the inputs to a Muller C-element can change in any
+order) and this is also true on a macroscopic level. The first example used here
+is a processor’s prefetch behaviour, chosen because different philosophies have
+been adopted in different projects.
+All the Amulet processors have had a non-deterministic prefetch depth. This
+is achieved by allowing the prefetch unit to run freely, normally only con-
+strained by the rate at which the instruction memory is able to accept instruc-
+tions. In order to branch the prefetch process is ‘interrupted’ and a new pro-
+gramme counter value inserted; this is an asynchronous process which requires
+arbitration and therefore happens at a non-deterministic point.
+An alternative approach, used for example in the ASPRO-216 processor
+[117], is to prefetch a fixed number of instructions. This can be done by
+prompting the prefetch unit to begin a new fetch for each instruction which
+completes. If a branch is required this can be signalled; if not, this too is
+signalled. In effect the processing pipeline becomes a ring in which a number of
+tokens are circulated and reused. (See also section 3.8.2 on page 39.)
+Having a deterministic prefetch simplifies certain tasks, notably dealing
+with speculative prefetches and branch delay slots. As it is possible to say
+exactly how many instructions will be prefetched following a branch these can
+be processed or discarded with a simple counting process. However keeping
+tokens flowing around a ring places an extra constraint on the elasticity of the
+pipeline which – in some circumstances – may sacrifice performance.
+With a non-deterministic prefetch depth it is possible to have fetched zero
+or more instructions speculatively, although there will be an upper bound set
+when the pipeline ‘backs up’. In an architecture without delay slots (such
+as ARM) the lower limit is not a problem, but some means other than in-
+struction counting must be provided to indicate that the prefetch stream has
+changed. The Amulet processors do this by ‘colouring’ the prefetch streams.
+To illustrate: imagine the processor prefetches a stream of ‘red’ instructions.
+Eventually, as these are executed, a branch is encountered which causes the
+execution unit to request prefetches starting at a new address and in a different
+colour (‘green’, say). Subsequent red instructions must then be discarded, the
+first green instruction being the next operation completed. A subsequent green
+branch will cause another colour change; because all the former red operations
+must have been flushed at this point it is possible to switch back to red, thus
+only two colours (i.e. a single colour bit) are required.
+
+(a) Deadlock
+(b) Deadlock avoided with extra latch
+
+Figure 15.8.
+Branch deadlock and its prevention.
+
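+The colouring scheme can be sketched in a few lines. This is a minimal
+illustration under stated assumptions, not the Amulet logic itself: the packet
+format (colour bit, name, taken-branch flag) and the function name are
+hypothetical, and the prefetch unit is assumed to have already tagged each
+fetched instruction with the colour that was current when it was fetched.
+
```python
def execute(stream):
    """Return the names of the instructions that actually complete.

    stream: iterable of (colour, name, is_taken_branch) packets in fetch order.
    Packets whose colour differs from the expected colour are stale
    speculative prefetches and are discarded.
    """
    expected = 0  # a single colour bit: 0 = 'red', 1 = 'green'
    executed = []
    for colour, name, is_taken_branch in stream:
        if colour != expected:
            continue  # fetched before the branch took effect: discard
        executed.append(name)
        if is_taken_branch:
            expected ^= 1  # the branch target stream arrives in the other colour
    return executed


# Hypothetical fetch stream: 'c' and 'd' were prefetched past the red branch 'b',
# 'z' was prefetched past the green branch 'y'.
STREAM = [
    (0, "a", False),
    (0, "b", True),
    (0, "c", False),
    (0, "d", False),
    (1, "x", False),
    (1, "y", True),
    (1, "z", False),
    (0, "p", False),
]

if __name__ == "__main__":
    print(execute(STREAM))
```
+
+Note that the number of discarded packets is not fixed in advance, which is
+exactly why a counting scheme cannot be used with a non-deterministic prefetch
+depth.
+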
+Before leaving this issue there is one other, less obvious, problem with a
+non-deterministic prefetch depth which can cause deadlock if not considered
+in the design. In this architecture the act of branching uses an arbiter to insert
+a token into the processor’s pipeline. If the pipeline is already full – and there
+is nothing to limit this – then the arbiter cannot acknowledge the operation
+and thus the pipeline deadlocks (figure 15.8(a)). Because a branch could occur
+when the pipeline is not full the deadlock is not inevitable, but it could happen
+each time a branch is attempted.
+This problem is easily solved; an extra latch which is normally empty can
+decouple the branch operation from the main pipeline flow (figure 15.8(b)),
+leaving ‘normal’ operation to continue. As a second, valid branch cannot be
+taken until after the first has been accepted and its target fetched and decoded,
+a single latch will always prevent deadlocks here.
+This class of problem is generic. A manifestation of a similar problem be-
+came evident early in the design of Amulet1 where the instruction fetch com-
+peted non-deterministically for the bus with data loads and stores. In a situation
+where the processing pipeline is full it is possible that an instruction fetch can
+occupy the memory bus but be unable to complete because there is no latch
+ready for the result. If a data transfer is pending then it is blocked, resulting
+in the pipeline remaining full (figure 15.9(a)). This can be rectified if the in-
+struction fetch is throttled so that it cannot gain the bus until it is known that
+it can relinquish it again (figure 15.9(b)). The converse of the problem does
+not occur because the data transfer, once started, cannot be stalled, so only
+instruction prefetch requires throttling.
+
+(a) Deadlock
+(b) Deadlock avoided by throttling prefetch
+
+Figure 15.9.
+Bus contention deadlock and its prevention.
+
+Amulet3, with its separate instruction and data buses, does not exhibit this
+problem within the processor, although the memory system can still suffer an
+analogous deadlock. This is described further in the section on memory.
+If a deterministic, asynchronous solution is preferred this could be imple-
+mented by adding a data transfer phase to every instruction. There is still an
+asynchronous advantage here in that an ‘unwanted’ data access would be very
+fast but there is a price in limiting the adaptability of the processor’s pipeline.
+
+Counterflow Pipeline Processor.
+Although most asynchronous micropro-
+cessor designs have a ‘conventional’ architecture (other than lacking a clock!)
+it may be practicable to implement radically different processor structures, and
+several have been studied. One interesting – and highly non-deterministic –
+idea is the Counterflow Pipeline Processor (CFPP) [126] in which instructions
+flow freely along a pipeline containing processing units towards the register
+bank while operands flow equally freely towards them. This is intended to re-
+duce stalls in waiting for operand dependences between instructions; as well
+as evaluating and carrying its result to completion an instruction can cast the
+result backwards to be picked up as required by subsequent operations.
+Whilst expensive in the number of buses required, the CFPP allows
+considerable flexibility in the number and arrangement of functional units
+(figure 15.10). Functional units may be ALUs of full or limited function,
+memory access units, multipliers etc. The only rule is that the operation must be
+attempted by one of the available units, so stalls are only needed if an
+uncompleted operation reaches the last appropriate functional unit still without
+all its operands.
+
+[Figure labels: Instructions Fetch; three ALU/... stages, each passing Instructions and Operands; Registers; Branch; Results.]
+
+Figure 15.10.
+Principle of the Counterflow Pipeline Processor.
+
+To ensure that an instruction does not miss any of its operands, passing
+in the other direction, a degree of synchronisation is needed at each pipeline
+stage. Because there is no ‘favoured’ direction in the pipeline an arbiter is
+required at each stage to ensure that two packets do not ‘cross over’ without
+checking each other’s contents. Because every movement of both instructions
+and data requires an arbitration, fast, efficient arbiters are essential for such a
+performance architecture.
+Of course deepening the pipeline to accommodate many functional units
+increases the penalty due to branches – which need to propagate ‘backwards’
+to the fetch stage – so good branch prediction is important.
+
+Arbitration and deadlock.
+It is possible to build an asynchronous processor
+which is as deterministic in its operation sequences as its synchronous coun-
+terpart (i.e. deterministic except for such events as interrupts). Alternatively it
+is possible to make a highly non-deterministic processor.
+Each scheme has both advantages and disadvantages. Enforcing synchroni-
+sation could lead to a reduction in performance; for example the memory in-
+terface may not begin a prefetch until told that an instruction does not require
+the memory for a data transfer, with a consequent reduction in available band-
+width. On the other hand the predictability of the system’s behaviour can make
+testing significantly easier by reducing the reachable state space of the system.
+The decision as to what (if anything) should be allowed to be non-deterministic
+is a decision for the designer which must be reviewed in the particular circum-
+stances. However it must be remembered that every non-determinism has an
+associated arbiter (which, theoretically, can require an infinite resolution time)
+and is likely to introduce a potential deadlock which must be identified and
+prevented.
+In the general case a good rule for avoiding deadlock is to examine carefully
+any instances of arbitration for a ‘shared’ resource (such as the bus) and ensure
+that no unit can be granted access until it is certain that it can relinquish it
+again regardless of the behaviour of other parts of the system. Each arbiter
+increases the number of states reachable by the processor and makes the design
+problem harder, but it increases the system’s flexibility. Non-determinism can
+be beneficial if used with caution.
+
+15.4.4
+Dependencies
+
+When a processor is pipelined beyond a certain depth it is necessary to in-
+sert interlocks to ensure that dependences between instructions are satisfied
+and programmes execute correctly. Even a device such as the MIPS R3000
+[72] – which was a ‘Microprocessor without Interlocked Pipeline Stages’ – is
+interlocked in the sense that the programmer/compiler could use a clock
+cycle count to ensure correct operation, an expedient which is ruled out in an
+asynchronous environment. Similar constraints are applied in the ARM
+architecture.
+
+PC pipeline.
+ARM does not use the clock explicitly in the way MIPS does,
+but there is one aspect of the architecture which is similar. The Programme
+Counter (PC) is available to the programmer in the general-purpose register
+set and when it is read the value obtained is the address of the current in-
+struction plus two instructions. This is a historical consequence of the early
+ARM implementations where there were two clock cycles between generating
+the instruction’s address and executing the operation. Compatibility with this
+must be maintained, even in an asynchronous processor where the prefetch and
+execution are autonomous.
+
+[Figure labels: Fetch; Instruction memory; Decode; Reg. Read; PC pipeline.]
+
+Figure 15.11.
+PC pipeline.
+
+Because the generation and subsequent, possible use of the PC are unsyn-
+chronised in an Amulet processor a method of transmitting the value must be
+found. To do this all Amulet processors have maintained a copy of the PC with
+each fetched instruction (figure 15.11). These flow down the pipeline with the
+instruction and can be read whenever the PC may be needed. The PC may be
+required explicitly as an operand or address, implicitly in branch instructions,
+or to find a return address in the case of exceptions such as interrupts or mem-
+ory aborts. Different Amulet cores have varied the exact value (e.g. PC+8)
+held with this ‘PC pipeline’ in attempts to minimise the later calculation over-
+heads, but in Amulet3 the PC is held without any premodification which allows
+any of the required values PC+2, PC+4, PC+8 to be calculated with a simple
+incrementer.
+It is worth noting that the PC values need not be bundled with the instruction
+directly. The PC pipeline can be a separate, asynchronous pipeline from the
+instruction fetch which can have a different depth, providing that a ‘one in,
+one out’ correspondence is maintained. This is a feature which is exploited in
+Amulet3 to throttle the prefetch unit and prevent instruction fetches causing a
+deadlock; this mechanism is described more fully in section 15.5.2.
+
+Register dependencies.
+The greatest register dependency problems are read-
+after-write dependencies. One of these occurs in the case of a fragment of code
+such as:
+
+    LDR   R1, [R0]        ; Load ...
+    ADD   R2, R1, R3      ; ... and read
+
+In this example it is essential that the value is (at least apparently) loaded
+into R1 before the subsequent instruction uses it. As soon as the execution
+path is pipelined there is a risk that this will not be assured and this uncertainty
+is increased in an asynchronous device where the load could take an arbitrary
+time.
+Three solutions for ensuring register bank dependencies are satisfied are
+given below.
+
+Don’t pipeline.
+
+Lock.
+
+Forward.
+
+The first solution was the approach taken in the earliest, synchronous ARM
+implementations. This involves reading register operands, performing an oper-
+ation and writing the result back in a single cycle so that a subsequent operation
+always has a ‘clean’ register view. This is simple but makes the evaluation cy-
+cle very long and is unacceptable in a high-performance processor.
+A locking approach allows selective pipelining of the instruction execution
+by retarding instructions whose operand registers are not yet valid. This
+involves setting a ‘lock’ flag when an instruction is decoded which wishes to
+modify a register and clearing the flag again when the value finally arrives. A
+subsequent instruction can test the locks on its operands and be delayed until
+they are all clear. This mechanism is eminently suited to an asynchronous im-
+plementation because a stalled instruction is simply caused to wait, which can
+be done without recourse to arbitration.
+In practice it is convenient to allow more than one result to be outstanding
+for a single register. Partly this is a consequence of the ARM’s extensive use
+of conditional instructions, such as:
+
+    CMP   R1, R2          ; Set flags
+    MOVNE R0, #-1         ; If R1 ≠ R2
+    MOVEQ R0, #1          ; If R1 = R2
+
+In such a case a single lock flag (on R0 in this instance) is inadequate and
+some form of semaphore is needed. It turns out that the operation of such
+a semaphore is fairly simple to implement in an asynchronous environment
+provided that testing and incrementing are mutually exclusive. At issue time
+an instruction therefore:
+
+attempts to read its operands and waits until they are all unlocked, then
+
+locks its register destination(s) by incrementing the semaphore.
+
+The semaphore is decremented again when the result is returned; this action
+may take the semaphore to zero and, possibly, free up a waiting instruction.
+This can happen at any time.
+The example above illustrates another potential problem in an asynchronous
+system: the two ‘MOV’ operations are mutually exclusive and so only one will
+be executed. As this is not known at issue time both have incremented the
+semaphore and so both must decrement it, otherwise R0 will be permanently
+locked. In general if a speculative operation is begun it must complete – the
+‘write’ operation therefore always takes place, although sometimes the register
+contents are not changed and only the unlocking is performed.
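+
+The issue and writeback rules above can be sketched as a per-register counting
+semaphore. This is a sequential model written for illustration: the class and
+method names are hypothetical, register numbers stand in for R0 to R15, and the
+requirement that testing and incrementing be mutually exclusive is satisfied
+here simply because the model runs one call at a time.
+
```python
class RegisterLocks:
    """One counting semaphore per register (illustrative model only)."""

    def __init__(self, n_regs=16):
        self.count = [0] * n_regs

    def issue(self, operands, dests):
        """Try to issue: read-then-lock. Returns False if the instruction must wait."""
        if any(self.count[r] != 0 for r in operands):
            return False  # at least one operand still has a result outstanding
        for r in dests:
            self.count[r] += 1  # lock each destination
        return True

    def writeback(self, dest):
        """Unlock once per issued result, even if the register is left unchanged."""
        if self.count[dest] == 0:
            raise ValueError("writeback without a matching issue")
        self.count[dest] -= 1


if __name__ == "__main__":
    locks = RegisterLocks()
    locks.issue([1, 2], [])   # CMP   R1, R2   (flags only, no destination)
    locks.issue([], [0])      # MOVNE R0, #-1
    locks.issue([], [0])      # MOVEQ R0, #1
    print(locks.count[0])     # two results outstanding on R0
    locks.writeback(0)        # the MOV whose condition failed still unlocks
    locks.writeback(0)
    print(locks.issue([0], [1]))  # a reader of R0 can now issue
```
+
+If either conditional move skipped its writeback the count on R0 would never
+return to zero, which is the permanent lock described above.
+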
+The principle of semaphores was designed and implemented as the ‘lock
+FIFO’ in Amulet1 and Amulet2. In these processors the semaphore also held
+the destination register addresses to avoid carrying them with the instruction.
+As the instructions flowed (largely) in order, results and destinations could be
+paired at write time.
+The lock FIFO was implemented as an asynchronous pipeline as shown in
+figure 15.12. Because the cells in each latch (horizontal) are transparent latches
+the entries are copied from one to the next, thus ensuring the outputs of the OR
+gates are glitch free. These can then be used to stall any read operations on
+the particular register until the write is complete. The only hazard would be
+if a register was actively locked whilst a read was being attempted; this is
+prevented by a sequencing of read-then-lock by the instruction decoder.
+
+[Figure labels: Lock indicator; FIFO Control; FIFO cells holding 0/1 lock bits.]
+
+Figure 15.12.
+Lock FIFO.
+
+Unlocking can happen safely at any time relative to both the reading and the
+locking of other registers. The destination address, in its decoded form, is
+already available at the bottom of the data FIFO.
+
+Reorder buffer.
+Although the lock FIFO works successfully it can intro-
+duce inefficiency in that it enforces in-order completion on the instructions
+and stalls each instruction until its operands are available. Therefore it is an
+effective, cheap method to guarantee functionality but is less than ideal for
+high-performance architectures.
+In Amulet3 register dependencies are resolved using an asynchronous re-
+order buffer [66, 51]. Whilst the major incentive for this was to facilitate a less
+intrusive page fault mechanism (see below) this allows instructions to complete
+in an arbitrary order and results can be forwarded at any time. It is therefore a
+significant step towards a complete out-of-order asynchronous processor.
+
+[Figure labels: Look-up; Allocate; Read; Registers; ALU; Memory; Reorder; Arrival; Forward; Writeback.]
+
+Figure 15.13.
+Reorder buffer position in Amulet3.
+
+Table 15.1.
+Reorder buffer allocation.
+
+Instruction            Slot0  Slot1  Slot2  Slot3
+LDR   R0, [R2]         *R0*   -      -      -
+MOV   R4, #17          R0     *R4*   -      -
+LDR   R7, [R0+4]!      R0     R4     *R0*   *R7*
+ADD   R7, R7, R4       *R7*   R4     R0     R7
+CMP   R3, R4           R7     R4     R0     R7
+ADDNE R7, R7, R6       R7     *R7*   R0     R7
+SUB   R1, R7, R0       R7     R7     *R1*   R7
+
+
+The reorder buffer is positioned between the various data processing units
+and the register bank (figure 15.13) and crosses several timing domains. An
+instruction first encounters the reorder buffer control at decode time when its
+operands may be discovered and forwarded; any results are allocated space
+at this time. Subsequently an instruction may be subdivided (and, in princi-
+ple, issued at any time) and results can arrive separately and independently.
+Operands can be forwarded any number of times between when they arrive
+and when they are overwritten. Finally an ordered writeback occurs where the
+results are retired and the reorder buffer slots freed for reallocation.
+In the decode phase the operand register numbers are compared with a small
+content addressable memory (CAM) to determine which reorder buffer slots
+may contain appropriate values for forwarding. This list always terminates
+with the register itself, so a value is always available from somewhere. Once
+this map is known the reorder buffer slots can safely be reassigned.
+The assignment process cyclically associates each reorder buffer entry with
+a particular register address and the instruction packet carries forward just the
+reorder buffer ‘slot’ number. Consider the following code fragment:
+
+    LDR   R0, [R2]       ;
+    MOV   R4, #17        ;
+    LDR   R7, [R0+4]!    ; base writeback: two destinations
+    ADD   R7, R7, R4     ;
+    CMP   R3, R4         ; no register destination
+    ADDNE R7, R7, R6     ; conditional: may not produce a valid result
+    SUB   R1, R7, R0     ;
+
+Assuming that the reorder buffer has four entries and the next free entry
+is (arbitrarily) 0, the reorder buffer assignment will proceed as shown in ta-
+ble 15.1. In each case the italicized entry is the latest one to have been as-
+signed.
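+
+The assignment in table 15.1 can be reproduced with a short Python sketch. This
+is an illustrative model only, not the Amulet3 implementation: the names
+`allocate` and `PROGRAM` and the use of '?' for an unassigned slot are
+assumptions made here, and the model ignores any wait for a free slot.
+
```python
def allocate(instructions, n_slots=4, next_free=0):
    """Cyclically assign reorder buffer slots to destination registers.

    Returns the slot-to-register map after each instruction.
    """
    slots = ["?"] * n_slots              # '?' marks a slot not yet assigned
    history = []
    for _text, destinations in instructions:
        for register in destinations:    # slots are assigned serially
            slots[next_free] = register
            next_free = (next_free + 1) % n_slots
        history.append(list(slots))
    return history


# Each entry: (instruction text, destination registers in assignment order).
PROGRAM = [
    ("LDR R0, [R2]", ["R0"]),
    ("MOV R4, #17", ["R4"]),
    ("LDR R7, [R0+4]!", ["R0", "R7"]),   # base writeback: two destinations
    ("ADD R7, R7, R4", ["R7"]),
    ("CMP R3, R4", []),                  # no register destination
    ("ADDNE R7, R7, R6", ["R7"]),        # slot assigned even if conditional
    ("SUB R1, R7, R0", ["R1"]),
]

if __name__ == "__main__":
    for (text, _), row in zip(PROGRAM, allocate(PROGRAM)):
        print(f"{text:18s} {row}")
```
+
+Each printed row corresponds to one row of table 15.1.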
+
+
+Note:
+
+the second load (LDR) operation uses the ARM’s base writeback mode
+and therefore requires two destinations;
+
+the comparison has no register destinations;
+
+the same register address can appear multiple times in the reorder buffer;
+
+a slot is assigned even if the instruction is conditional (ADDNE) and
+may not produce a valid result.
+
+The instruction decoder still retains the reorder buffer map prior to the start
+of the instruction. In parallel with the reassignment it proceeds as follows,
+using the final instruction as an example:
+
+The locations of the appropriate registers are examined; these may be in
+the process of being reassigned but cannot be overwritten yet because
+the instruction execution has not begun.
+
+For R7 try: slot 1, slot 0, slot 3, register bank.
+
+For R0 try: slot 2, register bank.
+
+Try to read from each location in the list until a value is obtained.
+
+Note that a list of possibilities is required because the assigned slots need
+not contain a valid value. The most obvious cause of invalidation in an ARM
+is an instruction which fails its condition code check (e.g. the value of R7 in
+slot1), but other conditions – such as a preceding branch – can also result in an
+instruction being abandoned.
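+
+The search order given above for the final instruction can be sketched in
+Python. This is a simplified model under stated assumptions: the function names
+are hypothetical, an invalid result is represented by `None`, and results that
+have not yet arrived (for which the real circuit would wait) are not modelled.
+
```python
def forwarding_candidates(slot_map, next_free, register):
    """List the places to try for an operand, most recent assignment first."""
    n = len(slot_map)
    candidates = []
    for age in range(1, n + 1):
        slot = (next_free - age) % n     # walk back from the newest slot
        if slot_map[slot] == register:
            candidates.append(("slot", slot))
    candidates.append(("register bank", register))  # list always ends here
    return candidates


def read_operand(candidates, slot_values, register_bank):
    """Try each location in turn until a valid value is obtained."""
    for kind, index in candidates:
        if kind == "register bank":
            return register_bank[index]
        value = slot_values[index]
        if value is not None:            # None models an invalidated result
            return value


# Map held by the decoder before SUB R1, R7, R0 is assigned (table 15.1);
# slot 2 is the next one to be reassigned.
MAP_BEFORE_SUB = ["R7", "R7", "R0", "R7"]
NEXT_FREE = 2

if __name__ == "__main__":
    print(forwarding_candidates(MAP_BEFORE_SUB, NEXT_FREE, "R7"))
    print(forwarding_candidates(MAP_BEFORE_SUB, NEXT_FREE, "R0"))
```
+
+If the ADDNE fails its condition check, slot 1 holds no valid value and the
+read falls through to slot 0.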
+
+Figure 15.14. Register processes in the Amulet3 decode unit (labels: Read register, Read CAM, Forward, Assign slot).
+
+The flow of control through the decode phase is summarised in figure 15.14.
+Note that the forwarding time can vary depending on external factors while the
+slot assignment time depends on the number of slots which are required (from
+zero to two) and (occasionally) may have to wait for a slot to be available. Re-
+order buffer slots are assigned serially, even within a single instruction which
+simplifies the asynchronous implementation. There is a small performance
+impact on more ‘complex’ instructions, but these are relatively rare.
+
+
+Subsequent to assignment the instruction packets proceed to further stages
+carrying their reorder buffer slot number. Although Amulet3 issues only sin-
+gle instructions in order the ARM instruction set effectively allows two semi-
+independent operations via the internal execution unit and via the external data
+interface (figure 15.15). Each of these may produce a result at any time so each
+has its own port into the reorder buffer. Whilst these inputs are asynchronous
+they are guaranteed to target different slots and can therefore be independent.
+
+Figure 15.15. Sub-instruction routing (cases: Internal operation; Data Load; LDM - first cycle; LDM - subsequent cycles; stages: Decode, Execute, Data Int.).
+
+The writeback process simply copies results back to the register bank. On its
+arrival each result ‘fills’ a reorder buffer slot. The ‘writeback’ process therefore
+waits until a particular slot is ready, copies it out, and moves to the next one.
+The fact that the slots may become ready in an arbitrary order is not a concern.
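+
+A minimal Python sketch of this behaviour follows. It is a sequential model of
+what is really a concurrent circuit; the function name and the single `token`
+index are assumptions made for illustration, and it assumes no slot is refilled
+before it has been emptied.
+
```python
def writeback_order(arrival_order, n_slots=4):
    """Return the order in which slots are copied to the register bank.

    Results may 'fill' slots in any order; writeback waits at the slot
    holding the token, copies it out when full, then moves to the next.
    """
    full = [False] * n_slots
    token = 0                            # slot whose turn it is to copy back
    retired = []
    for slot in arrival_order:
        full[slot] = True                # arrival of a result fills the slot
        while full[token]:               # copy out every slot that is ready
            retired.append(token)
            full[token] = False
            token = (token + 1) % n_slots
    return retired


if __name__ == "__main__":
    print(writeback_order([2, 0, 1, 3]))
```
+
+Whatever the arrival order, the slots are retired in cyclic order.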
+Superimposed on this process is a forwarding mechanism which waits until
+a result is ready and copies it back into the decode stage. This process can hap-
+pen before, during or after the writeback process; in fact the processes are asyn-
+chronous and concurrent. The key is that both processes use non-destructive
+copying and therefore leave the data available for the other as required. A
+result in the reorder buffer remains until it is overwritten.
+Either of the above processes may be required to wait if they need a result
+which has not yet arrived. In order to control this in a non-interacting way there
+are two separate ‘flags’ which indicate the presence of data in a slot. One flag
+is raised to indicate that the slot is ‘full’; this is cleared when the writeback
+process has copied the result to the register bank. This is at the heart of the
+control circuit shown in figure 15.16 (which will be described shortly). The
+
+
+Chapter 15: Processors
+295
+
+WPa_n
+WPr_n
+
+Aout
+
+T_n+1
+T_n
+
+Rout_n
+
+C
+C
+
+Match
+
+Try
+
+Fcol
+
+C
+
+Result_ack
+
+Full_n
+
+Token_out
+
+Result_req
+
+Token_in
+
+Write back
+
+Figure 15.16.
+Reorder buffer copy-back circuit.
+
+Table 15.2.
+‘Fcol’ state as results arrive in a reorder buffer.
+
+Result
+0
+1
+2
+3
+Number
+
+0
+0
+� 1
+0
+0
+0
+1
+1
+0
+� 1
+0
+0
+2
+1
+1
+0
+� 1
+0
+3
+1
+1
+1
+0
+� 1
+4
+1
+� 0
+1
+1
+1
+5
+0
+1
+� 0
+1
+1
+6
+0
+0
+1
+� 0
+1
+7
+0
+0
+0
+1
+� 0
+8
+0
+� 1
+0
+0
+0
+
+second flag (‘Fcol’) is merely changed to indicate the arrival of a result. As the
+state of this flag alternates on each pass through the cyclic reorder buffer it is
+possible for a forwarding request to test to see if a result has arrived without
+changing the flag’s state (table 15.2). This is essential because a value may be
+forwarded zero or more times, the number not being known at issue time.
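+
+The alternating colour can be sketched in Python as follows. The sketch assumes,
+as table 15.2 does, that results arrive in slot order; the function names and
+the rule that a requester expects colour 1 on even passes and 0 on odd passes
+are assumptions made here for illustration.
+
```python
N_SLOTS = 4


def fcol_trace(n_results, n_slots=N_SLOTS):
    """Fcol flags of every slot after each successive result arrives."""
    fcol = [0] * n_slots
    trace = []
    for result in range(n_results):
        fcol[result % n_slots] ^= 1      # arrival merely toggles the flag
        trace.append(list(fcol))
    return trace


def expected_colour(pass_number):
    """Colour a slot shows once its result for this pass has arrived."""
    return (pass_number + 1) % 2


def has_arrived(fcol, slot, pass_number):
    """Non-destructive test: reading the flag does not change it."""
    return fcol[slot] == expected_colour(pass_number)


if __name__ == "__main__":
    for number, row in enumerate(fcol_trace(9)):
        print(number, row)
```
+
+Because the test only compares the flag with an expected colour, it can be
+repeated any number of times without disturbing the slot.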
+
+
+As a final embellishment a result returned to the reorder buffer may be in-
+valid (due, for example, to a condition code failure). Each slot is therefore
+accompanied by a flag which can prevent the data being written to the register
+bank (the ‘full’ flag is still cleared) or being forwarded. In the latter case fur-
+ther forwarding may be attempted from a less recent result, culminating in the
+use of the default register value.
+The writeback circuit (figure 15.16) is a good example of the working of
+such asynchronous control circuits. It operates as follows.
+
+The arrival of a result (top left) causes the ‘Full’ bit to be set and ac-
+knowledges itself (top right). The request comes from one of a number
+of mutually exclusive sources.
+
+This event also toggles the forwarding colour (‘Fcol’) which allows any
+instructions issued subsequent to this one to use the result.
+
+Note that the input can be on any of the mutually exclusive input chan-
+nels; two are shown here but more are possible.
+
+The input request can be removed at any time leaving the slot marked as
+‘Full’.
+
+When it becomes this slot’s turn to copy back a token is received (bot-
+tom left) which initiates an output request (bottom centre). If the token
+is received first the circuit waits for the result to arrive. The token is
+acknowledged.
+
+This process also allows the ‘Full’ bit to be reset, waiting for the input
+request to fall if it has not already done so.
+
+When the copy back is complete the (broadcast) acknowledge is picked
+up by the circuit to complete the token input handshake and pass the
+token to the next, similar circuit to the right. The next slot cannot attempt
+to output until the four-phase output acknowledge is complete.
+
+These slots are connected in a ring and reset so that there is a single token
+present for the first stage to output after which it runs freely, emptying each
+slot in turn when it contains a result.
+The experimental (and often unsystematic!) way in which Amulet3 was de-
+signed meant that the original state machine was defined and refined to a ver-
+sion close to that shown in figure 15.16 as a schematic and only subsequently
+subjected to a more formal analysis. The analysis was carried out using Petrify
+(which was introduced in section 6.7 on page 102).
+At first this caused problems due to the choices within the system. The
+output channel ‘calls’ the register bank and it was expedient to broadcast the
+output acknowledge to all the copy back circuits and use the (unique) active
+request to sensitize the relevant location. This proved hard to model in a
+single subcircuit. However, on a suggestion from one of the Petrify developers
+(Alex Yakovlev), it proved easier to model the whole system than a single part.
+The resultant signal transition graph (STG) (see figure 15.17) clearly shows the
+four implemented subcircuits which pass control around cyclically. The rest of
+the processor is abstracted in the four small rings which, when an output
+handshake has been completed, can reset the circuit to ‘Full’ again at an
+arbitrary time.
+The analysis with Petrify both verified the circuit’s operation and removed
+a redundant transistor in each subcircuit.
+There are two hazards in the asynchronous processes as described here. The
+first is that there is no local way of preventing a slot being overwritten before
+it has been emptied. In Amulet3 this is guaranteed elsewhere by ensuring that
+a slot is never assigned until it has been freed by the copy back process. In
+effect a count of free slots is maintained which is decremented when a slot is
+assigned and incremented when it is released again. Because assignment and
+release are both cyclic and in-order it is not necessary to pass individual slot
+identification, the presence of a token is adequate. This ‘throttle’ is imple-
+mented as a simple, dataless FIFO which also acts as an asynchronous buffer
+between the unsynchronized processes. This is shown in figure 15.18 for a
+system with four reorder buffer slots; in the state shown one result has been
+produced but not yet committed to the register bank, two more are being gen-
+erated and the decoder could issue another instruction which generated a single
+result.
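+
+The throttle can be pictured as a small pool of dataless tokens. The Python
+sketch below is an illustration under assumptions (the class name
+`SlotThrottle` and its methods are hypothetical); it models only the count of
+free slots, not the asynchronous handshakes.
+
```python
from collections import deque


class SlotThrottle:
    """Dataless token FIFO: one token per free reorder buffer slot."""

    def __init__(self, n_slots=4):
        self.free = deque(["token"] * n_slots)

    def can_assign(self, needed=1):
        """True if the decoder may assign `needed` slots without waiting."""
        return len(self.free) >= needed

    def assign(self):
        """Decoder takes a token when it assigns the next slot."""
        if not self.free:
            raise RuntimeError("decoder must wait: no free slot")
        self.free.popleft()

    def release(self):
        """Copy back returns a token when it has emptied a slot."""
        self.free.append("token")


if __name__ == "__main__":
    throttle = SlotThrottle(4)
    for _ in range(3):                   # three results assigned, none retired
        throttle.assign()
    print(throttle.can_assign(1), throttle.can_assign(2))
```
+
+With three slots assigned, as in the state described for figure 15.18, an
+instruction producing one result may issue but one producing two must wait.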
+The other hazard arises because the forwarding and writeback processes are
+not synchronised; the register value which is read (as a default) could
+therefore be changed during the read process, resulting in ‘random’ data being
+read. However this can only happen if there is a valid value for that register
+in the reorder buffer and therefore it is certain that this value will be
+forwarded in preference to the register contents. Providing that the
+implementor can ensure that the ‘random’ data does not cause problems within
+the circuit, the mechanism is secure against this asynchronous interaction.
+Studies of ARM code indicated that for the Amulet3 architecture a reorder
+buffer with five or more entries was unlikely ever to fill [50]. Amulet3 therefore
+implements a four entry buffer; however the mechanism described is extensible
+to any chosen size.
+
+Figure 15.17. An STG for all four reorder buffer token-passing circuits (transitions on Rout1 to Rout4, Full1 to Full4, T1 to T4, WPr1 to WPr4, WPa1 to WPa4 and Aout).
+
+Figure 15.18. Token passing throttle on reorder buffer (labels: Decode, Execute, Memory, Reorder).
+
+15.4.5 Exceptions
+
+An ‘exception’ is an unexpected event which occurs in the course of running
+a programme. The Amulet processors are compatible with the ARM instruction
+set and, therefore, the types of exceptions and the behaviour when they occur
+are predefined. Ignoring reset, ARM has six types of exception:
+
+Prefetch abort - a memory fault (e.g. page fault) during an instruction
+prefetch;
+
+Data abort - a memory fault during a read or write transfer;
+
+Illegal instruction - an emulator trap;
+
+Software interrupt - a system call (not really an exception, but has similar
+behaviour);
+
+Interrupt - a normal, level sensitive interrupt;
+
+Fast interrupt - similar to normal interrupts, but higher priority.
+
+Of these the majority are quite easy to deal with: software interrupts and
+illegal instructions can be detected at instruction decode time, as can prefetch
+aborts (when there is no instruction to decode); interrupts are imprecise and
+therefore can be inserted anywhere; only data aborts have the ability to cause
+serious problems because they become evident only after the instruction has
+been decoded and has started to execute.
+
+Interrupts.
+In any processor an interrupt is an asynchronous event. In one
+sense the arrival of an interrupt can be thought of as the insertion of a ‘call’ in-
+struction in the normal sequence of instruction fetches. At first glance it would
+seem that simply arbitrating an extra ‘instruction’ packet into the instruction
+stream would suffice; however this simplistic view can cause problems.
+The chief problem is that there is some interaction between the interrupt
+‘instruction’ and prefetch stream; the interrupt needs a PC value to synthesise
+its return address, and the interrupting device cannot know what this is. Fur-
+thermore the return address must be valid; if an interrupt is accepted just after
+
+a branch has been prefetched it could be inserted into code which should not
+be run.
+Amulet3 implements interrupts by ‘hijacking’ rather than inserting instruc-
+tions, the whole operation being performed in the prefetch unit. Although the
+interrupt signals (in this case they are level-sensitive) change asynchronously
+with respect to the prefetch unit the mutual exclusion element is better thought
+of as a synchroniser than as a typical asynchronous arbiter. Figure 15.19 shows
+a method of implementing this. Here any change in the interrupt input will re-
+tard the normal request flow until the synchronised state has been latched; the
+synchronised interrupt signal only changes when done is low.
+
+Figure 15.19. Interrupt synchroniser (labels: Request, Interrupt, MUTEX, Latch, Synchronised Interrupt, Done).
+
+When it is known that an interrupt signal has become active the current PC
+value effectively becomes the address of the interrupt ‘instruction’ and can be
+used to form the return address. This can be sent as an instruction but can save
+time by bypassing the memory. The interrupt can then be disabled to prevent
+further acknowledgement.
+Because this action takes place in the prefetch unit, Amulet3 can treat the
+interrupt entry as a predicted branch and jump directly to the appropriate
+interrupt service routine; in an ARM these routines are at fixed addresses.
+The problem still arises that the interrupt entry may be speculative. If a
+branch is pending the return address sent to the execution stage may be invalid
+– in any case it will be wrongly coloured and therefore discarded! However
+the act of branching updates both the PC and any associated information, in-
+cluding the interrupt enables. As the interrupt has not been serviced the (level-
+sensitive) request will still be active and another attempt to enter the service
+routine will be made. This time the branch target address will be saved and
+there can be no further impediments.
+
+Data aborts.
+Although it solves the register dependency problems, the re-
+order buffer was originally introduced to simplify the implementation of data
+aborts. The ARM architecture specifies that, if an abort occurs, no effects
+from following instructions will be retained. Earlier Amulet processors did not
+speculate on memory operations, relying on a fast ‘go/no go’ decision from
+any memory management unit. Amulet3 allows for more sophisticated (i.e.
+slower!) memory management by outputting memory transfer requests and
+only checking for aborts at the end of the operation. To be of any worth this
+must allow other, speculative operations to take place in parallel, but these
+operations cannot be ‘retired’ until the outcome of the load is known.
+The reorder buffer provides a place for speculative results to be held – and
+forwarded for reuse if necessary – until they can be retired into registers. In the
+(rare) case of a data abort the speculative results can be discarded, leaving the
+register bank intact. The discard can be achieved either by using a colouring
+scheme, tested by the register writeback process, or by marking speculative
+results as invalid using the same flag as an operation which has failed for other
+reasons. For certain reasons of implementational expediency Amulet3 uses
+the latter method, although the asynchronous hazard of invalidating a result
+whilst a forwarding operation is being attempted must be avoided. (This is
+achieved by implementing two validity bits and nullifying only one of them;
+the copy back process uses an AND of these whereas forwarding uses an OR.
+Forwarding is therefore not disturbed, although the result will be discarded
+later.)
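+
+The two-bit validity scheme can be sketched in Python. The class and method
+names are hypothetical, and the treatment of a condition-code failure (both
+bits cleared) is an assumption made here; the text states only that an abort
+nullifies one of the two bits.
+
```python
class SlotValidity:
    """Two validity bits per slot: AND gates copy back, OR gates forwarding."""

    def __init__(self):
        self.valid_a = True
        self.valid_b = True

    def fail_condition(self):
        """Assumed: a result invalid for other reasons clears both bits."""
        self.valid_a = False
        self.valid_b = False

    def nullify_on_abort(self):
        """A data abort nullifies only one of the two bits."""
        self.valid_a = False

    def may_copy_back(self):
        return self.valid_a and self.valid_b

    def may_forward(self):
        return self.valid_a or self.valid_b


if __name__ == "__main__":
    slot = SlotValidity()
    slot.nullify_on_abort()
    print(slot.may_forward(), slot.may_copy_back())
```
+
+After an abort the slot can still satisfy a forwarding operation already in
+progress, but its result is never written to the register bank.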
+The reorder buffer accounts only for the register state however; the ARM
+holds other state bits which also require preservation. Two separate mecha-
+nisms are used for these, depending on the frequency of changes.
+The first is the current programme status register (CPSR) which holds the
+processor’s flags, operating mode et alia, the whole of which can be repre-
+sented in ten bits. The flags clearly change frequently during execution and
+there are many dependences on these bits due, for example, to compare-branch
+sequences. When attempting a memory operation Amulet3 simply copies the
+current CPSR state into a history buffer [66] FIFO; successful completion of
+the transaction discards this entry, but an abort can restore the CPSR state to
+that at the start of the failed operation.
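+
+A history buffer of this kind can be sketched in Python. This is an
+illustrative model, not the Amulet3 design: the class `CpsrHistory`, the
+in-order completion of memory operations and the discarding of younger entries
+on an abort are assumptions made here.
+
```python
from collections import deque

CPSR_MASK = 0x3FF                        # the state fits in ten bits


class CpsrHistory:
    """FIFO of CPSR snapshots, one per outstanding memory operation."""

    def __init__(self, cpsr=0):
        self.cpsr = cpsr & CPSR_MASK
        self.history = deque()

    def start_memory_op(self):
        self.history.append(self.cpsr)   # copy the current state

    def complete(self):
        self.history.popleft()           # success: discard the oldest entry

    def abort(self):
        # Restore the state at the start of the failed (oldest) operation;
        # snapshots of later operations are dropped with it (assumption).
        self.cpsr = self.history.popleft()
        self.history.clear()


if __name__ == "__main__":
    state = CpsrHistory(cpsr=0b0000000001)
    state.start_memory_op()
    state.cpsr = 0b0000000011            # flags change speculatively
    state.abort()
    print(bin(state.cpsr))
```
+
+Successful transactions cost only the discard of one entry, while an abort
+returns the flags and mode to their values when the load was issued.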
+The other non-register state is a set of five saved programme status regis-
+ters (SPSRs) which act as a temporary store for the CPSR in various exception
+entries. These change very rarely and it is uneconomic to enlarge the history
+buffer to encompass them, although – in theory – they could be changed be-
+tween a load being issued and an abort being signalled. The solution here was
+simply to use a semaphore to lock the SPSRs whilst any memory operations
+are outstanding. This delivers the required functionality very cheaply and the
+performance penalty is tiny because SPSRs change so rarely.
+
+
+15.5. Memory – a case study
+
+It seems reasonable that an asynchronous processor should interact with an
+asynchronous memory system. This implies the need for handshake interfaces
+on a range of memory systems, including RAM, ROM and caches. This is the
+subject of the following sections.
+An individual memory array is a very regular structure and – under steady
+voltage, temperature etc. conditions – will produce data in a constant time.
+At first glance this may suggest that there is not much scope for asynchronous
+design within a memory system. However each part of the memory will have
+its own characteristic timing; in some cases even a simple memory will have a
+variable cycle time. An example is a RAM which will typically take longer to
+read from than it will to write to.
+In fact the memory system is one part of a computer which has extremely
+variable timing; even a clocked computer will take different times to service a
+cache hit and a cache miss. An asynchronous system will accommodate such
+cycle time variation quite naturally and is able to exploit many more subtle
+variations which would be padded to fill a clock cycle in a synchronous ma-
+chine.
+
+15.5.1 Sequential accesses
+
+A static RAM (SRAM) stores data in small flip-flops which have only a
+very weak output drive. To accelerate read access it is normal to use ‘sense
+amplifiers’ to detect small (sub-digital) swings in voltage and produce an early
+digital output. Sense amplifiers, being analogue parts, are quite power-hungry.
+Sense amplification is only useful when there has been enough voltage
+swing for the read bits to be discriminated; it is also only required until the
+bits can be latched. As this period is certainly less than even half a clock cycle
+this is an ideal application for a self-timed system. A delay can be inserted to
+keep the sense amplifiers ‘switched off’ when the read cycle commences and
+only switch them on when they may be useful. An extra (known) bit in the
+RAM may then be discriminated and, when it has been read, used to latch the
+entire RAM line and disable the sense amplifiers. The same signal can be used
+to indicate the completion of the read cycle, possibly returning the RAM array
+to the precharge state in preparation for another cycle (figure 15.20).
+When designing such a circuit the RAM completion is easy to detect but the
+delay before enabling the sense amplifiers is harder to judge. The designer can
+choose this to be short – to ensure maximum speed – or somewhat longer –
+to ensure the memory is ready before the amplifiers are enabled thus ensuring
+minimum power wastage. If the designer errs then either speed or power may
+be compromised slightly; functionality, however, is retained.
+
+Figure 15.20. Self-timed sense amplifiers (labels: Delay, RAM array, Latch).
+
+A typical SRAM array is organised to be roughly square; a 1 Kbyte RAM
+might therefore be organised as (say) 64 × 128 rather than 256 × 32 even
+though the processor requires only 32 bits on a given cycle. This presents
+the RAM designer with two choices:
+
+multiplex 32 sense amplifiers to the required word;
+
+amplify all 128 bits and ignore the unwanted ones.
+
+The first choice appears better when a read is considered in isolation but
+cycles are rarely so arranged; typical access patterns (especially code fetches)
+exhibit considerable sequentiality and this can be exploited in the hardware
+design.
+When using the first option it is possible to delay the RAM precharge and
+provide a subsequent read operation with a shorter read delay. The Amulet2e
+cache [49] uses this technique and is therefore able to provide subsequent ac-
+cesses within a RAM ‘line’ faster than the first such access. This variation
+in access time is much less than a whole ‘cycle’ and therefore would be of
+no interest to a synchronous designer, but it is exploited automatically in an
+asynchronous system.
+The second option given above can latch the entire RAM line after amplifi-
+cation. It can then service subsequent requests from this latch. This frees the
+RAM array to be precharged and – possibly – used for other purposes. This
+technique is exploited in the Amulet3i RAM which is described below.
+
+15.5.2 The Amulet3i RAM
+
+As shown in figure 15.7 on page 283, the Amulet3 processor has a Harvard
+architecture with separate instruction and data buses. However in the Amulet3i
+SoC the memory model is unified; this implies that the buses must ‘merge’
+somewhere outside the processor core.
+In practice the local buses are merged in two places: once to get onto the
+on-chip MARBLE bus (see below) and once for access to the local, high-speed
+RAM (figure 15.21). In Amulet3i the local RAM is memory-mapped rather
+than forming a cache, although there is no reason why a cache could not have
+been implemented here; cache design is discussed later.
+
+Figure 15.21. Amulet3i local memory (blocks: AMULET3 Core, InstDec Logic, DataDec/Arbiter, 8Kbyte RAM with Inst Port and Data Port, MARBLE Initiator and Target; signals: AmuInstAdr, InstBus, AmuDataAdr, AmuWriteData, DataBus, RIA, RRI, RDA, RRD, RWD, MIA, MID, MDA, MRD, MWD, SDA, SWD, IAI, DAI).
+
+The local RAM (8 Kbytes) is divided into 1 Kbyte blocks; both buses run
+to each block (figure 15.22). The blocks are interleaved so that – most of the
+time – there need be no interaction between instruction and data fetches. Only
+if there is a conflict with both buses requiring data from the same RAM block
+is there any need for arbitration.
+In the case of a conflict there is no clock on which to control an adjudica-
+tion; access to the block is restricted by a mutual exclusion element (‘mutex’),
+within the Arbiter blocks in figure 15.22, on a ‘first-come, first served’ basis.
+Note that, in general, data and instruction accesses are not synchronised and
+therefore the average wait will be about half the typical RAM access time.
+Collisions are further minimised by using the latch at the output of the sense
+amplifiers (see preceding section) as a form of cache. Here separate latches
+are provided for instruction and data reads, so sequential accesses rarely need
+to compete for the arbitrated resource. In practice this gives performance
+approaching that of a dual-port RAM despite being implemented with standard
+SRAM.
+
+Figure 15.22. Memory block organisation (labels: AMULET3 microprocessor, Local Instruction bus, Local Data bus, 1Kbyte RAM blocks each with Arbiter, Ibuffer and Dbuffer, MARBLE Initiator, MARBLE Initiator/Target).
+
+The local RAM architecture thus provides memory cycles with two differ-
+ent delays (‘random’ and ‘sequential’) with the potential of an added (variable)
+delay in the rare case of a collision. In Amulet3i this is further complicated
+because the two local buses cycle in different times; the instruction bus is sim-
+plified as a read-only bus and runs noticeably faster than the full-function local
+data bus, which also permits external bus mastery to allow DMA and test ac-
+cess (fig. 20). The implications of these various timings are absorbed by the
+asynchronous nature of the system – for instance it is not necessary to slow the
+instruction fetches down by around 25% to fit a clock period set by the data
+bus.
+The inclusion of arbiters within the memory blocks implies that the access
+patterns are non-deterministic. Care must therefore be taken to ensure that the
+system cannot reach a deadlock state. The only possible deadlock that could
+occur in the memory would occur as follows:
+
+1 A (non-sequential) data transfer needs access to a particular RAM block.
+
+2 This is prevented because an instruction fetch is already using the RAM
+array.
+
+3 The instruction fetch cannot complete because the instruction decoder is
+still busy.
+
+4 The processor pipeline is full and is blocked by the data fetch.
+
+5 Deadlock!
+
+To avoid this it is important not to gain access to the shared resource (the
+RAM array) until it is known that the operation will be able to release it again.
+A data transfer can always do this but provision has to be made restricting the
+generation of instruction fetches until they can be guaranteed to release the
+RAM. In practice the latch following the sense amplifiers forms a convenient,
+dedicated buffer in which to hold the instruction and allow data accesses to
+proceed. In Amulet3 the processor throttles its requests so only a single in-
+struction fetch is outstanding at any given time and this must be removed from
+the ‘I buffer’ before the next address can be sent (figure 15.23).
+
+Figure 15.23. Memory block arbitration and throttling (labels: Instructions, Data transfers, Throttling (within processor), Arbiter, RAM block, Latch, I tag, I buffer, D tag, D buffer).
+
+The need for arbitration is rare and thus the possibility of discovering a
+deadlock by random simulation even rarer. It is therefore essential to analyse
+such a non-deterministic system thoroughly to ensure that the opportunities for
+deadlock are removed.
+
+
+15.5.3 Cache
+
+An asynchronous cache is very similar to an asynchronous RAM; most of the
+design is a combination of the preceding description of asynchronous RAM
+and standard cache design techniques. However, in order to be efficient there
+are certain problems, not present in a synchronous cache, which require
+solution.
+The most significant problems are in managing any conflicts between the
+processor/cache interactions and cache/bus interactions. The first of these to
+address is the issue of line fetch.
+A line fetch generally occurs when a cache miss results from an attempted
+access to a cacheable location. The required word and a small number of
+adjacent words (a cache line) are copied from memory into the cache. The
+simplest solution to this is to halt the processor, fetch the entire cache
+line, and allow the processor to proceed as the access is now a cache hit.
+However this requires a processor stall which is considerably longer than is
+strictly necessary.
+A more efficient scheme is to begin fetching the cache line, forward the re-
+quired word to the processor as soon as it arrives (it can often be arranged to be
+the first word fetched) and then allow the processor and line fetch to continue
+independently. Performance is further enhanced by allowing the processor to
+use other parts of the cache whilst the line fetch is proceeding (‘hit-under-
+miss’) and to use the incoming words as soon as they arrive. Unfortunately
+in an asynchronous environment this is difficult because the fetched words are
+arriving with no fixed timing relationship with the processor’s cycles.
+Initial thoughts may suggest arbitration for the cache. However it is possi-
+ble to solve this problem without arbitration while maintaining all the desired
+functions by including a dedicated latch for holding the last fetched line. This
+latch is called the Line Fetch Latch.
+
+Figure 15.24. Control circuit request steering (labels: Req., Read, Read Ack., Hit, LFL Hit, Select (T/F), SYNC., LF LATCH, MAIN CACHE ARRAY, LF ENGINE, ADDR, DATA).
+
+
+The line fetch latch (LFL) (figure 15.24) is actually a set of latches residing
+just outside the true RAM array. It normally holds the last-fetched cache line.
+It has its own tag and comparator which allow it to function much like the other
+cache lines. Note that the LFL holds the only copy of this data. (Incidentally,
+because the LFL is static and requires no sense amplification when it is read it
+can provide faster access in an asynchronous system.)
+When, as a result of a cache miss, a fetch is needed, a line from the RAM is
+selected for rejection. For the moment assume that the cache is write-through
+and therefore the RAM can simply be overwritten. The LFL contents, together
+with its tag, are then copied into the chosen line and the LFL is marked as
+‘empty’. This can happen in parallel with the start of the external access.
+The processor is then assumed to have a cache hit from within the LFL and
+attempts to read the appropriate word; this causes a stall at the synchronisation
+point because the word is empty and – unless the external memory is excep-
+tionally fast – will not have been refilled yet.
+As words arrive they are stored in the LFL and individually flagged as ‘full’.
+As soon as the processor can read a word from the LFL (typically after the
+completion of the first fetch cycle) it can continue. From this time the processor
+can continue in parallel with the remaining words being fetched.
+A subsequent cache cycle could be:
+
+a cache hit: this can proceed independently and without interaction with
+the LFL;
+
+a cache miss: this will cause a stall until the line fetch process is com-
+plete and the fetch process can be repeated;
+
+a LFL hit: this attempts to read the LFL whilst it is being filled. The
+possibilities are that the required word is already present (the processor
+continues) or the word is still pending (the processor must stall until it
+arrives).
+
+Only in the last case is there any interaction between the asynchronous pro-
+cesses. However this interaction is merely a wait which can be implemented
+with a flip-flop and an AND gate (figure 15.25). The potential wait begins
+when the LFL is first emptied (caused by, and therefore synchronised with,
+the processor’s action). The wait can end at an arbitrary time but this merely
+delays a transition; it cannot abort or change an action and can therefore be
+implemented without arbitration or the risk of metastability. This mechanism
+was first implemented in the Amulet2e cache system [49] and was also used in
+TITAC-2 [130].
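+The per-word 'full' flags and the resulting wait can be modelled in software.
+The following is a minimal behavioural sketch, not the Amulet2e circuit: the
+class name, method names and the four-word line length are assumptions made
+for illustration. A flag only ever changes from empty to full during a fetch,
+which is why the processor side needs nothing more than a wait.

```python
from typing import List, Optional, Tuple


class LineFetchLatch:
    """Toy model of a line fetch latch with one 'full' flag per word."""

    def __init__(self, words_per_line: int = 4) -> None:
        self.tag: Optional[int] = None
        self.data: List[Optional[int]] = [None] * words_per_line
        self.full: List[bool] = [False] * words_per_line

    def start_fetch(self, tag: int) -> None:
        # Emptying the latch is caused by the processor's own miss, so this
        # step is synchronised with the processor side.
        self.tag = tag
        self.data = [None] * len(self.data)
        self.full = [False] * len(self.full)

    def word_arrived(self, index: int, value: int) -> None:
        # Memory side: store the word, then flag it as full.
        self.data[index] = value
        self.full[index] = True

    def read(self, tag: int, index: int) -> Tuple[str, Optional[int]]:
        # Processor side: a word that is still pending means "stall".
        if tag != self.tag:
            return ("not_lfl_hit", None)
        if not self.full[index]:
            return ("stall", None)
        return ("ok", self.data[index])
```

+In this model a read of a pending word reports "stall"; repeating the same
+read after word_arrived() has been called for that word returns the data.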
+Both Amulet2e and TITAC-2 used a write-through cache for simplicity. For
+higher performance a copy-back mechanism is needed. This too can be pro-
+vided using an extension of the LFL mechanism. In this case the process of
+
+
+Chapter 15: Processors
+309
+
+Figure 15.25. Line-fetch latch read synchronisation. [Diagram not reproduced; S-R flip-flops sit between the memory line fetch interface and the processor read interface; labelled signals: LF_data0 to LF_data3, LF_req, LF_ack, LF_complete, word address, Data in, En.]
+
+line fetching is complicated by the need first to copy the victim line from the
+cache array before overwriting it. The victim line can be placed in a separate
+write buffer together with its address as supplied by its tag field. Note that the
+line fetch is caused by a cache miss, so the rejected line can never be the same
+as the line being fetched (this becomes more important later). The LFL is then
+emptied into the RAM array as before and the refilling begins. The writing
+of the rejected line can be delayed because it is less urgent than satisfying the
+cache miss.
+Each cache line (and the write buffer) also contains a ‘dirty’ flag which is set
+if the cache line has been modified. This can be checked and used to determine
+if the write buffer should be written out (i.e. ‘dirty’ is true) or is already coher-
+ent with the memory; in the latter case the copy-back process can be bypassed.
+This process reduces the write traffic but, with a single entry write buffer,
+does not greatly assist with reducing fetch latency because the write buffer
+must be emptied before a subsequent fetch. However the write buffer can be
+extended, albeit at the cost of introducing an arbiter and the potential pitfalls
+therefrom.
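+The dirty-flag check on an evicted line can be sketched as follows. This is an
+illustrative model only: CacheLine, SingleEntryWriteBuffer and the dictionary
+standing in for main memory are hypothetical names, not part of any Amulet
+design.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CacheLine:
    tag: int
    words: List[int]
    dirty: bool = False


class SingleEntryWriteBuffer:
    """Holds at most one victim line awaiting copy-back."""

    def __init__(self) -> None:
        self.entry: Optional[CacheLine] = None

    def accept(self, victim: CacheLine) -> bool:
        """Queue the victim if it is dirty; return True if a write is pending."""
        if self.entry is not None:
            # With a single entry, the buffer must drain before the next fetch.
            raise RuntimeError("write buffer must be emptied first")
        if not victim.dirty:
            return False  # already coherent with memory: copy-back bypassed
        self.entry = victim
        return True

    def drain(self, memory: Dict[int, List[int]]) -> None:
        """Perform the delayed copy-back, if one is pending."""
        if self.entry is not None:
            memory[self.entry.tag] = list(self.entry.words)
            self.entry = None
```

+A clean victim never reaches memory, which is the reduction in write traffic
+described above; the RuntimeError marks the point at which a single-entry
+buffer forces the next fetch to wait.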
+If a second line fetch is needed before an earlier fetch is complete it is the-
+oretically possible for this to overtake any pending write operations. This is
+also desirable. As the write operation will already be pending – merely wait-
+ing for the bus to become free – it is necessary to determine whether the new
+fetch request arrived before the write could begin, which requires arbitration in an
+asynchronous environment. However, once the decision is made the operations
+can proceed as before. Two problems remain however:
+
+if the write buffer becomes full there will be nowhere to evict a line to
+and the system can deadlock;
+
+
+if the required line is one which has recently been evicted the fetch could
+overtake a pending write and thus lose cache coherency.
+
+The first problem is relatively easy to solve; a simple counter can ensure
+that only a certain amount of overtaking is allowed and that one space in the
+write buffer always remains free (i.e. when the last entry is filled the next bus
+operation must be a write – unless, of course, it is a ‘clean’ line where the write
+can be assumed and bypassed). This can be implemented, for example, as a
+semaphore.
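+One way to express the counter rule is as a small decision function. This is a
+sketch under stated assumptions: the function name and its arguments are
+hypothetical, and the bypass for 'clean' lines is left out for brevity.

```python
def next_bus_operation(pending_writes: int, capacity: int,
                       fetch_waiting: bool) -> str:
    """Choose the next bus operation for a multi-entry write buffer.

    A fetch may overtake pending writes only while at least one buffer
    entry is free, because the fetch itself may evict a line.
    """
    if not 0 <= pending_writes <= capacity:
        raise ValueError("pending_writes out of range")
    if pending_writes == capacity:
        return "write"   # last entry filled: a write must go next
    if fetch_waiting:
        return "fetch"   # fetches are more urgent than copy-backs
    if pending_writes > 0:
        return "write"   # bus otherwise idle: drain the buffer
    return "idle"
```

+With a four-entry buffer, for example, a waiting fetch overtakes three pending
+writes, but once all four entries are occupied the next operation is a write.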
+The second problem is harder, but can be solved by forwarding in a similar
+manner to the forwarding from a processor’s reorder buffer. A line fetch checks
+the addresses of the entries in the write buffer and, if it finds a match, satisfies
+itself from there instead of requesting the memory bus. In such a case the
+‘fetch’ can be performed much more rapidly, with no bus latency, as it is an internal
+operation and can copy an entire cache line in a single operation. This is, of
+course, irrelevant to the functioning of other parts of the system as the whole
+is self-timed anyway. Such forwarding does not affect the write process (a re-
+fetched ‘dirty’ line is returned clean) and can take place regardless of whether
+the copy-back is pending, in progress, complete or not needed. The write buffer
+acts, in effect, as a write-through victim cache [56].
+There is one caveat to this process; the line fetch occurs in parallel with
+the eviction of a cache line. It is therefore possible that one entry in the write
+buffer could be changing during the comparison process. As noted above this
+entry is reserved for the freshly evicted line and can safely be excluded from
+the comparison, thus averting any possibilities of a false positive due to signals
+changing.
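+The forwarding check, including the exclusion of the reserved entry, might be
+sketched as below. The data layout (a list of optional address/data pairs and
+a dictionary for main memory) and all names are assumptions made for
+illustration.

```python
from typing import Dict, List, Optional, Tuple

# (line address, words), or None when the entry is free
Entry = Optional[Tuple[int, List[int]]]


def fetch_line(address: int, write_buffer: List[Entry], reserved_index: int,
               memory: Dict[int, List[int]]) -> Tuple[str, List[int]]:
    """Satisfy a line fetch from the write buffer if possible, else memory."""
    for i, entry in enumerate(write_buffer):
        if i == reserved_index:
            # This entry may be changing while the victim is copied in,
            # so it is left out of the comparison.
            continue
        if entry is not None and entry[0] == address:
            return ("forwarded", list(entry[1]))
    return ("memory", list(memory[address]))
```

+Skipping the reserved entry is safe because, as noted earlier, the line being
+evicted can never be the line being fetched.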
+
+15.6. Larger asynchronous systems
+
+15.6.1 System-on-Chip (DRACO)
+
+DRACO (DECT Radio Communications Controller) (figure 15.26) is a sys-
+tem on chip based around the Amulet3 processor. In terms of area about half
+of the (7 mm square) device is an asynchronous ‘island’ – hence Amulet3i –
+and the other half comprises synchronous peripherals compiled from VHDL.
+The asynchronous subsystem (figure 15.27) is a computer in itself and was
+developed both with commercial intent and with a view to investigating some
+new techniques. The processor and RAM have already been discussed. Some
+other novel asynchronous features are outlined in this section.
+
+15.6.2 Interconnection
+
+Ideally an asynchronous system should be based around an asynchronous
+bus. Indeed it is arguable that large, fast synchronous systems should also use
+
+
+Figure 15.26.
+DRACO layout.
+
+Figure 15.27. Amulet3i asynchronous subsystem. [Diagram not reproduced; labelled blocks: AMULET3, RAM, ROM (size labels 8 Kbyte and 16 Kbyte), DMA controller, test interface controller, memory interface, delay, MARBLE, MARBLE/synchronous bridge, synchronous peripherals; the drawing is divided into asynchronous and synchronous regions.]
+
+
+asynchronous interconnection between their synchronous subsystems to alle-
+viate problems with high-speed clock distribution and clock skew. This model
+is sometimes referred to as “GALS” (Globally Asynchronous, Locally Syn-
+chronous) and may represent an early commercial opportunity for the inclusion
+of asynchronous circuits in ‘conventional’ systems.
+
+MARBLE.
+As a step in developing such an interconnection standard,
+Amulet3i contains the first implementation of MARBLE [5], a 32-bit, multi-
+master, on-chip bus which communicates by using handshakes rather than a
+clock. Apart from this the signal definitions, with 32-bit address and data,
+look very similar to a conventional bus. MARBLE separates address and data
+communications, allowing pipelining and interleaving of operations in order to
+increase the available bandwidth when several devices require global access.
+MARBLE is supported by ‘initiator’ and ‘target’ interfaces which can be
+attached to any asynchronous component. These, their address, and the bus
+wiring provide all that is needed for communication between the various com-
+ponents. In Amulet3i there are four initiators and seven targets. For example
+the processor’s two local buses each terminate in a MARBLE initiator and the
+local data bus is also a MARBLE target which allows DMA and test data in
+and out of the RAM from other initiators.
+
+Chain.
+Chain (‘Chip area interconnect’) is currently under development as
+a possible replacement for a conventional bus for on-chip communications.
+Chain is based around narrow, high-speed, point-to-point links forming a net-
+work rather than a bus. The idea is to exploit the potential for fast symbol
+transmission within an asynchronous system while reducing the number of
+long distance wires.
+By using a delay-insensitive coding scheme Chain relieves the chip designer
+of the need to ensure timing closure across the whole chip; it also provides
+tolerance of potential problems such as induced crosstalk on the long inter-
+connection wires. Again the user need only communicate with ‘conventional’
+parallel interfaces.
+
+15.6.3 Balsa and the DMA controller
+
+The DMA controller is a complex, multi-channel unit which was evolved
+according to an external specification. Whilst the function of the unit is rela-
+tively straightforward, even in the asynchronous domain, the unit is notable for
+being the first real application of Balsa synthesis [11].
+The DMA controller comprises a large set of control registers for the many
+DMA channels and a control unit which chooses an active request and services
+it. The registers were designed as blocks of custom VLSI to optimise their area.
+The control logic was written in Balsa, and modified several times as the
+specification changed. The modifications proved remarkably easy to accommodate
+in this environment.
+Such synthesis is not (yet) suitable for high-performance units, but proved
+extremely useful in such an application where the performance was limited by
+other constraints (such as the bus speed, here) and development time predom-
+inates. Of course, in an asynchronous environment, it is easy to accommodate
+devices in any performance range without affecting the overall system func-
+tionality.
+Part II of this book is an introduction to Balsa and includes a complete
+source listing of a simpler 4-channel DMA controller.
+
+15.6.4 Calibrated time delays
+
+To be useful in real systems an asynchronous processor must be able to
+interface with currently available commodity parts. Whilst it is possible – and
+perhaps even desirable – to have memory devices with handshake interfaces,
+these are not available ‘off the shelf’. Instead commodity memories rely on
+absolute timing assumptions to guarantee their correct operation and thus a
+system using them must have an accurate timing reference. This is the one
+thing lacking in an asynchronous design.
+Introducing a clock purely to provide memory timing would negate many of
+the advantages of the asynchronous system; it is therefore preferable to retain
+the idea of providing data ‘on demand’, providing an adequately precise delay
+can be provided. This delay need not be a clock; a monostable which can be
+triggered on demand is preferable.
+Amulet1 and Amulet2e relied on an externally supplied delay line to pro-
+vide a bus timing reference. This is a flexible solution in that a short delay
+can be used repeatedly to provide longer timing intervals, providing a flexible,
+programmable interface. For example an area of address space could be set
+up for DRAM devices with (say) 1 delay address set up time, 2 delays RAS
+address hold, etc. The bus interface can then count delays as appropriate.
+This is a reasonable solution but suffers from certain drawbacks:
+
+the on-chip delay for each delay cycle is not factored;
+
+delay lines are not particularly precise;
+
+driving an off-chip delay is power hungry.
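+The delay-counting scheme described before the list of drawbacks can be
+illustrated with a small table of per-region delay counts. Only the DRAM
+counts (1 delay for address set up, 2 delays for RAS address hold) come from
+the text; the table layout, the names and the 10 ns unit delay used in the
+usage note are hypothetical.

```python
# Delay counts per bus phase for each region of the address space.
REGION_TIMINGS = {
    "dram": {"address_setup": 1, "ras_address_hold": 2},
}


def phase_time_ns(region: str, phase: str, unit_delay_ns: float) -> float:
    """Time spent in a bus phase: the programmed count times the unit delay."""
    return REGION_TIMINGS[region][phase] * unit_delay_ns
```

+With an assumed unit delay of 10 ns, phase_time_ns("dram", "ras_address_hold",
+10.0) gives 20.0 ns; changing the delay line rescales every interval at once.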
+
+A much better solution would be to use an on-chip delay. The chief problem
+with this is that a fixed delay will be imprecise, varying from chip to chip and
+also changing with temperature and supply voltage fluctuations. Any on-board
+delay must therefore be calibratable against an external reference.
+Amulet3i uses such a delay. This comprises a chain of delay elements (fig-
+ure 15.28) which can be ‘shorted’ at a determined point. This can be calibrated
+
+
+Figure 15.28. Controllable delay circuit. [Diagram not reproduced; a chain of delay elements between In and Out, selected by 0/1 control inputs.]
+
+by counting the number of cycles completed within a known interval and ad-
+justed accordingly using the control wires shown. Once calibrated the delay
+will only change slowly (e.g. with temperature drift) unless the external condi-
+tions change; calibration can therefore be repeated at infrequent intervals under
+software control. The external timing reference can be a long delay such as a
+period of a 32 kHz ‘watch’ oscillator – hardly power expensive!
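+The calibration loop can be sketched as a simulation. Everything below is an
+assumption made for illustration: the per-element delay stands in for the
+unknown silicon behaviour, the reference interval is one period of a
+32.768 kHz crystal rounded to picoseconds, and the search simply picks the
+shortest chain whose measured delay reaches the target.

```python
REFERENCE_PERIOD_PS = 30_517_578  # about 1/32768 s, in picoseconds (assumed)


def count_cycles(taps: int, element_delay_ps: int,
                 interval_ps: int = REFERENCE_PERIOD_PS) -> int:
    """Simulated measurement: complete delay cycles within the interval."""
    return interval_ps // (taps * element_delay_ps)


def calibrate(target_delay_ps: int, element_delay_ps: int,
              max_taps: int) -> int:
    """Return the smallest tap setting whose measured delay meets the target."""
    for taps in range(1, max_taps + 1):
        cycles = count_cycles(taps, element_delay_ps)
        if cycles == 0:
            break  # the chain is already longer than the reference interval
        measured_ps = REFERENCE_PERIOD_PS / cycles
        if measured_ps >= target_delay_ps:
            return taps
    raise ValueError("target delay not reachable with this chain")
```

+For a 10 ns target, a simulated chip with 1.3 ns elements settles on 8 taps
+and one with 2 ns elements on 5 taps, which is the chip-to-chip variation the
+calibration is meant to absorb.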
+
+15.6.5 Production test
+
+Figure 15.27 on page 311 includes a ‘test interface controller’ block. The
+DRACO chip was designed as a commercial product, and must therefore be
+testable in production. The test approach adopted in the design of DRACO is
+based upon exploiting the MARBLE bus to access the various on-chip mod-
+ules and their test features. In normal use the external memory interface is a
+MARBLE target, but for test purposes it can be reconfigured as a bus initiator,
+enabling an external tester to control the bus and thereby read and write the
+on-chip memory. Circuits in the test interface controller make this efficient,
+with support for automatic sequential address generation, and so on.
+All of the production tests for the asynchronous subsystem use the exter-
+nal tester to load a test program into the on-chip RAM and then instruct the
+Amulet3 processor to execute the program. The test runs without intervention
+from the tester, which must simply wait a given time before inspecting an on-
+chip memory location to read a pass/fail code. Inevitably the tests run at full
+speed; there is no external timing input to control them.
+Certain on-chip modules are very difficult to test in purely functional mode,
+so additional access routes (via test registers accessible from MARBLE) are
+provided to make the tests more efficient. The calibrated time delay is one
+such circuit, and the processor branch predictor is another. In the latter case,
+the branch predictor is taken off line so that the processor (of which it is a part)
+
+
+can manipulate its internal state to run optimised tests on its CAM and RAM
+components.
+Although DRACO is not, at the time of writing, in full-scale production, the
+tests ran without difficulty on prototype sample parts.
+
+15.7. Summary
+
+Devices such as DRACO have demonstrated that it is feasible to build large,
+functional asynchronous systems. Like any prototype system the chip has its
+problems. There are two (known) bugs in the asynchronous half of the system:
+an electrical drive strength problem within the multiplier (which did not show
+up during simulation) and a logic oversight in the prefetch unit which
+falsely indicates a cycle is ‘sequential’ under certain specific conditions (a
+problem if running code from off-chip DRAM). Neither of these is attributable
+to the asynchronous nature of the device (indeed, there are slightly more bugs
+in the synchronous part of the device!) and both are readily fixable. The pro-
+cessor is comparable with an ARM9 manufactured on the same process in
+terms of physical size, performance and energy efficiency; preliminary
+measurements suggest that it is significantly ‘quieter’ in terms of EMI.
+This chapter has presented possible solutions (though certainly not the only
+ones!) to many of the problems facing the designer of complex asynchronous
+processing and memory systems. The majority of the designs described at the
+beginning of the chapter have been produced by academic groups and could be
+classified as “research”; however the complexity of a modern system on chip is
+such that these designs stretch the capability of even a large university group. It
+has been demonstrated that large, functional asynchronous designs are not only
+possible, but can be competitive and have some unique advantages in terms of
+power management and EMI. Asynchronous interconnection may be the only
+solution for large devices, even those with local clocks. Asynchronous chip
+design is ready to move to “development”.
+
+
+
+Epilogue
+
+Asynchronous technology has existed since the first days of digital elec-
+tronics – many of the earliest computers did not employ a central clock signal.
+However, with the development of integrated circuits the need for a straightfor-
+ward design discipline that could scale up rapidly with the available transistor
+resource was pressing, and clocked design became the dominant approach.
+Today, most practising digital designers know very little about asynchronous
+techniques, and what they do know tends to discourage them from venturing
+into the territory. But clocked design is beginning to show signs of stress – its
+ability to scale is waning, and it brings with it growing problems of excessive
+power dissipation and electromagnetic interference.
+During the reign of the clock, a few designers have remained convinced that
+asynchronous techniques have merit, and new techniques have been developed
+that are far better suited to the VLSI era than were the approaches employed on
+early machines. In this book we have tried to illuminate these new techniques
+in a way that is accessible to any practising digital circuit designer, whether or
+not they have had prior exposure to asynchronous circuits.
+In this account of asynchronous design techniques we have had to be selec-
+tive in order not to obscure the principal goal with arcane detail. Much work of
+considerable quality and merit has been omitted, and the reader whose interest
+has been ignited by this book will find that there is a great deal of published
+material available that exposes aspects of asynchronous design that have not
+been touched upon here.
+Although there are commercial examples of VLSI devices based on asyn-
+chronous techniques (a couple of which have been described in this book),
+these are exceptions – most asynchronous development is still taking place in
+research laboratories. If this is to change in the future, where will this change
+first manifest itself?
+The impending demise of clocked design has been forecast for many years
+and still has not happened. If it does happen, it will be for some compelling
+reason, since designers will not lightly cast aside their years of experience in
+one design style in favour of another style that is less proven and less well
+supported by automated tools.
+
+
+318
+PRINCIPLES OF ASYNCHRONOUS DESIGN
+
+There are many possible reasons for considering asynchronous design, but
+no single ‘killer application’ that makes its use obligatory. Several of the argu-
+ments for adopting asynchronous techniques mentioned at the start of this book
+– low power, low electromagnetic interference, modularity, etc. – are applica-
+ble in their own niches, but only the modularity argument has the potential to
+gain universal adoption. Here a promising approach that will support
+heterogeneous timing environments is GALS (Globally Asynchronous Locally
+Synchronous) system design. An asynchronous on-chip interconnect – a ‘chip area
+network’ such as Chain (described on page 312) – is used to connect clocked
+modules. The modules themselves can be kept small enough for clock skew to
+be well-contained so that straightforward synchronous design techniques work
+well, and different modules can employ different clocks or the same clock with
+different phases. Once this framework is in place, it is then clearly straightfor-
+ward to make individual modules asynchronous on a case-by-case basis.
+Here, perhaps unsurprisingly, we see the need to merge asynchronous tech-
+nology with established synchronous design techniques, so most of the func-
+tional design can be performed using well-understood tools and approaches.
+This evolutionary approach contrasts with the revolutionary attacks described
+in Part III of this book, and represents the most likely scenario for the wide-
+spread adoption of the techniques described in this book in the medium-term
+future.
+In the shorter term, however, the application niches that can benefit from
+asynchronous technology are important and viable. It is our hope in writing
+this book that more designers will come to understand the principles of asyn-
+chronous design and its potential to offer new solutions to old and new prob-
+lems. Clocks are useful but they can become straitjackets. Don’t be afraid to
+think outside the box!
+
+For further information on asynchronous design see the bibliography at the
+end of this book, the Asynchronous Bibliography on the Internet [111], and
+the general information on asynchronous design available at the Asynchronous
+Logic Homepage, also on the Internet [47].
+
+
+References
+
+[1] G. Abouyannis et al. Project PREST EP25242. European Low Power
+Initiative for Electronic System Design (ESDLPD) Third International
+Workshop, pages 5–49, 2000.
+
+[2] A. Abrial, J. Bouvier, M. Renaudin, P. Senn, and P. Vivet. A new con-
+tactless smartcard IC using an on-chip antenna and an asynchronous
+micro-controller.
+IEEE Journal of Solid-State Circuits, 36(7):1101–
+1107, July 2001.
+
+[3] T. Agerwala. Putting Petri nets to work. IEEE Computer, 12(12):85–94,
+December 1979.
+
+[4] W.J. Bainbridge and S.B. Furber. Asynchronous macrocell intercon-
+nect using MARBLE. In Proc. International Symposium on Advanced
+Research in Asynchronous Circuits and Systems (Async’98), pages 122–
+132. IEEE Computer Society Press, April 1998.
+
+[5] W.J. Bainbridge and S.B. Furber.
+MARBLE: An asynchronous on-
+chip macrocell bus. Microprocessors and Microsystems, 24(4):213–222,
+April 2000.
+
+[6] T.S. Balraj and M.J. Foster. Miss Manners: A specialized silicon com-
+piler for synchronizers. In Charles E. Leiserson, editor, Advanced Re-
+search in VLSI, pages 3–20. MIT Press, April 1986.
+
+[7] A. Bardsley. The Balsa web pages.
+http://www.cs.man.ac.uk/amulet/balsa/projects/balsa.
+
+[8] A. Bardsley. Implementing Balsa Handshake Circuits. PhD thesis, De-
+partment of Computer Science, University of Manchester, 2000.
+
+[9] A. Bardsley and D.A. Edwards. Compiling the language Balsa to delay-
+insensitive hardware. In C. D. Kloos and E. Cerny, editors, Hardware
+Description Languages and their Applications (CHDL), pages 89–91,
+April 1997.
+
+[10] A. Bardsley and D.A. Edwards. The Balsa asynchronous circuit synthe-
+sis system. In Forum on Design Languages, September 2000.
+
+[11] A. Bardsley and D.A. Edwards. Synthesising an asynchronous DMA
+controller with Balsa. Journal of Systems Architecture, 46:1309–1319,
+2000.
+
+[12] P.A. Beerel, C.J. Myers, and T.H.-Y. Meng.
+Automatic synthesis of
+gate-level speed-independent circuits. Technical Report CSL-TR-94-
+648, Stanford University, November 1994.
+
+[13] P.A. Beerel, C.J. Myers, and T.H.-Y. Meng. Covering conditions and
+algorithms for the synthesis of speed-independent circuits. IEEE Trans-
+actions on Computer-Aided Design, March 1998.
+
+[14] G. Birtwistle and A. Davis, editors. Proceedings of the Banff VIII Work-
+shop: Asynchronous Digital Circuit Design, Banff, Alberta, Canada,
+August 28–September 3, 1993. Springer Verlag, Workshops in Com-
+puting Science, 1995.
+Contributions from: S.B. Furber, “Comput-
+ing without Clocks: Micropipelining the ARM Processor,” A. Davis,
+“Practical Asynchronous Circuit Design: Methods and Tools,” C.H. van
+Berkel, “VLSI Programming of Asynchronous Circuits for Low Power,”
+J. Ebergen, “Parallel Program and Asynchronous Circuit Design,” A.
+Davis and S. Nowick, “Introductory Survey”.
+
+[15] I. Bogdan, M. Munteau, P.A. Ivey, N.L. Seed, and N. Powell. Power
+reduction techniques for a Viterbi decoder implementation. European
+Low Power Initiative for Electronic System Design (ESDLPD) Third In-
+ternational Workshop, pages 28–48, 2000.
+
+[16] E. Brinksma and T. Bolognesi. Introduction to the ISO specification
+language LOTOS. Computer Networks and ISDN Systems, 14(1), 1987.
+
+[17] E. Brunvand and R.F. Sproull. Translating concurrent programs into
+delay-insensitive circuits.
+In Proc. International Conf. Computer-
+Aided Design (ICCAD), pages 262–265. IEEE Computer Society Press,
+November 1989.
+
+[18] J.A. Brzozowski and C.-J.H. Seger. Asynchronous Circuits. Springer
+Verlag, Monographs in Computer Science, 1994. ISBN: 0-387-94420-6.
+
+[19] S.M. Burns. Performance Analysis and Optimization of Asynchronous
+Circuits. PhD thesis, Computer Science Department, California Institute
+of Technology, 1991. Caltech-CS-TR-91-01.
+
+[20] S.M. Burns. General condition for the decomposition of state holding
+elements. In Proc. International Symposium on Advanced Research in
+Asynchronous Circuits and Systems. IEEE Computer Society Press,
+March 1996.
+
+[21] S.M. Burns and A.J. Martin. Syntax-directed translation of concurrent
+programs into self-timed circuits. In J. Allen and F. Leighton, editors,
+Advanced Research in VLSI, pages 35–50. MIT Press, 1988.
+
+[22] D.M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems.
+PhD thesis, Stanford University, October 1984.
+
+[23] K.T. Christensen, P. Jensen, P. Korger, and J. Sparsø. The design of an
+asynchronous TinyRISC TR4101 microprocessor core. In Proc. Inter-
+national Symposium on Advanced Research in Asynchronous Circuits
+and Systems, pages 108–119. IEEE Computer Society Press, 1998.
+
+[24] T.-A. Chu. Synthesis of Self-Timed VLSI Circuits from Graph-Theoretic
+Specifications. PhD thesis, MIT Laboratory for Computer Science, June
+1987.
+
+[25] T.-A. Chu and R.K. Roy (editors). Special issue on asynchronous cir-
+cuits and systems. IEEE Design & Test, 11(2), 1994.
+
+[26] T.-A. Chu and L.A. Glasser. Synthesis of self-timed control circuits
+from graphs: An example. In Proc. International Conf. Computer De-
+sign (ICCD), pages 565–571. IEEE Computer Society Press, 1986.
+
+[27] B. Coates, A. Davis, and K. Stevens.
+The Post Office experience:
+Designing a large asynchronous chip. Integration, the VLSI journal,
+15(3):341–366, October 1993.
+
+[28] F. Commoner, A.W. Holt, S. Even, and A. Pnueli.
+Marked directed
+graphs. J. Comput. System Sci., 5(1):511–523, October 1971.
+
+[29] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and
+A. Yakovlev. Petrify: a tool for manipulating concurrent specifications
+and synthesis of asynchronous controllers. In XI Conference on Design
+of Integrated Circuits and Systems, Barcelona, November 1996.
+
+[30] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and
+A. Yakovlev. Petrify: a tool for manipulating concurrent specifications
+and synthesis of asynchronous controllers. IEICE Transactions on In-
+formation and Systems, E80-D(3):315–325, March 1997.
+
+[31] U. Cummings, A. Lines, and A. Martin. An asynchronous pipelined lat-
+tice structure filter. In Proc. International Symposium on Advanced Re-
+search in Asynchronous Circuits and Systems, pages 126–133, Novem-
+ber 1994.
+
+
+[32] A. Davis. A data-driven machine architecture suitable for VLSI imple-
+mentation. In Proceedings of the First Caltech Conference on VLSI,
+pages 479–494, Pasadena, CA, January 1979.
+
+[33] A. Davis and S.M. Nowick. Asynchronous circuit design: Motivation,
+background, and methods. In G. Birtwistle and A. Davis, editors, Asyn-
+chronous Digital Circuit Design, Workshops in Computing, pages 1–49.
+Springer-Verlag, 1995.
+
+[34] A. Davis and S.M. Nowick. An introduction to asynchronous circuit
+design. Technical Report UUCS-97-013, Department of Computer Sci-
+ence, University of Utah, September 1997.
+
+[35] A. Davis and S.M. Nowick. An introduction to asynchronous circuit
+design. In A. Kent and J. G. Williams, editors, The Encyclopedia of
+Computer Science and Technology, volume 38. Marcel Dekker, New
+York, February 1998.
+
+[36] J.B. Dennis. Data Flow Computation. In Control Flow and Data Flow
+— Concepts of Distributed Programming, International Summer School,
+pages 343–398, Marktoberdorf, West Germany, July 31 – August 12,
+1984. Springer, Berlin.
+
+[37] J.C. Ebergen and R. Berks. Response time properties of linear asyn-
+chronous pipelines. Proceedings of the IEEE, 87(2):308–318, February
+1999.
+
+[38] P.B. Endecott and S.B. Furber.
+Modelling and simulation of asyn-
+chronous systems using the LARD hardware description language. In
+Proceedings of the 12th European Simulation Multiconference, Manch-
+ester, Society for Computer Simulation International, pages 39–43, June
+1994. ISBN 1-56555-148-6.
+
+[39] K.M. Fant and S.A. Brandt. Null Conventional Logic: A complete and
+consistent logic for asynchronous digital circuit synthesis. In Interna-
+tional Conference on Application-specific Systems, Architectures, and
+Processors, pages 261–273, 1996.
+
+[40] R.M. Fuhrer, S.M. Nowick, M. Theobald, N.K. Jha, B. Lin, and
+L. Plana.
+Minimalist: An environment for the synthesis, verifica-
+tion and testability of burst-mode asynchronous machines.
+Techni-
+cal Report TR CUCS-020-99, Columbia University, NY, July 1999.
+http://www.cs.columbia.edu/~nowick/minimalist.pdf.
+
+[41] S.B. Furber and P. Day. Four-phase micropipeline latch control circuits.
+IEEE Transactions on VLSI Systems, 4(2):247–253, June 1996.
+
+
+[42] S.B. Furber, P. Day, J.D. Garside, N.C. Paver, S. Temple, and J.V.
+Woods. The design and evaluation of an asynchronous microproces-
+sor. In Proc. Int’l. Conf. Computer Design, pages 217–220, October
+1994.
+
+[43] S.B. Furber, D.A. Edwards, and J.D. Garside. AMULET3: a 100 MIPS
+asynchronous embedded processor. In Proc. International Conf. Com-
+puter Design (ICCD), September 2000.
+
+[44] S.B. Furber, J.D. Garside, P. Riocreux, S. Temple, P. Day, J. Liu, and
+N.C. Paver. AMULET2e: An asynchronous embedded controller. Pro-
+ceedings of the IEEE, 87(2):243–256, February 1999.
+
+[45] S.B. Furber, J.D. Garside, S. Temple, J. Liu, P. Day, and N.C. Paver.
+AMULET2e: An asynchronous embedded controller. In Proc. Interna-
+tional Symposium on Advanced Research in Asynchronous Circuits and
+Systems, pages 290–299. IEEE Computer Society Press, 1997.
+
+[46] EU IST-1999-13515, G3Card – generation 3 smartcard, January 2000.
+
+[47] J.D. Garside. The Asynchronous Logic Homepages.
+http://www.cs.man.ac.uk/async/.
+
+[48] J.D. Garside, W.J. Bainbridge, A. Bardsley, D.A. Edwards, S.B. Furber,
+J. Liu, D.W. Lloyd, S. Mohammadi, J.S. Pepper, O. Petlin, S. Temple,
+and J.V. Woods. AMULET3i – an asynchronous system-on-chip. In
+Proc. International Symposium on Advanced Research in Asynchronous
+Circuits and Systems, pages 162–175. IEEE Computer Society Press,
+April 2000.
+
+[49] J.D. Garside, S. Temple, and R. Mehra. The AMULET2e cache sys-
+tem. In Proc. International Symposium on Advanced Research in Asyn-
+chronous Circuits and Systems (Async’96), pages 208–217. IEEE Com-
+puter Society Press, March 1996.
+
+[50] D.A. Gilbert. Dependency and Exception Handling in an Asynchronous
+Microprocessor. PhD thesis, Department of Computer Science, Univer-
+sity of Manchester, 1997.
+
+[51] D.A. Gilbert and J.D. Garside.
+A result forwarding mechanism for
+asynchronous pipelined systems. In Proc. International Symposium on
+Advanced Research in Asynchronous Circuits and Systems (Async’97),
+pages 2–11. IEEE Computer Society Press, April 1997.
+
+[52] B. Gilchrist, J.H. Pomerene, and S.Y. Wong. Fast carry logic for digital
+computers. IRE Transactions on Electronic Computers, EC-4(4):133–
+136, December 1955.
+
+
+324
+PRINCIPLES OF ASYNCHRONOUS DESIGN
+
+[53] L.A. Glasser and D.W. Dobberpuhl. The Design and Analysis of VLSI
+Circuits. Addison-Wesley, 1985.
+
+[54] S. Hauck. Asynchronous design methodologies: An overview. Proceed-
+ings of the IEEE, 83(1):69–93, January 1995.
+
+[55] L.G. Heller, W.R. Griffin, J.W. Davis, and N.G. Thoma. Cascode voltage
+switch logic: A differential CMOS logic family.
+Proc. International
+Solid State Circuits Conference, pages 16–17, February 1984.
+
+[56] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantita-
+tive Approach (2nd edition). Morgan Kaufmann, 1996.
+
+[57] C.A.R. Hoare. Communicating sequential processes. Communications
+of the ACM, 21(8):666–677, August 1978.
+
+[58] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall,
+Englewood Cliffs, 1985.
+
+[59] D.A. Huffman.
+The synthesis of sequential switching circuits.
+J.
+Franklin Inst., pages 161–190, 275–303, March/April 1954.
+
+[60] D.A. Huffman. The synthesis of sequential switching circuits. In E. F.
+Moore, editor, Sequential Machines: Selected Papers. Addison-Wesley,
+1964.
+
+[61] H. Hulgaard, S.M. Burns, and G. Borriello. Testing asynchronous cir-
+cuits: A survey. Integration, the VLSI journal, 19(3):111–131, Novem-
+ber 1995.
+
+[62] K. Hwang. Computer Arithmetic: Principles, Architecture, and Design.
+John Wiley & Sons, 1979.
+
+[63] ISO/IEC. Mifare identification cards - contactless integrated circuit(s)
+cards - proximity cards. Standard ISO/IEC Standard 14443 Type A.
+
+[64] H. Jacobson, E. Brunvand, G. Gopalakrishnan, and P. Kudva. High-level
+asynchronous system design using the ACK framework. In Proc. Inter-
+national Symposium on Advanced Research in Asynchronous Circuits
+and Systems, pages 93–103. IEEE Computer Society Press, April 2000.
+
+[65] D. Jaggar. Advanced RISC Machines Architecture Reference Manual.
+Prentice Hall, 1996.
+
+[66] M. Johnson. Superscalar Microprocessor Design. Series in Innovative
+Technology. Prentice Hall, 1991.
+
+
+REFERENCES
+325
+
+[67] S.C. Johnson and S. Mazor. Silicon compiler lets system makers design
+their own VLSI chips. Electronic Design, 32(20):167–181, 1984.
+
+[68] G. Jones. Programming in OCCAM. Prentice-Hall International, 1987.
+
+[69] M.B. Josephs, S.M. Nowick, and C.H. van Berkel. Modeling and de-
+sign of asynchronous circuits. Proceedings of the IEEE, 87(2):234–242,
+February 1999.
+
+[70] G.C. Clark Jr. and J.B. Cain. Error correcting coding for digital com-
+munication. Plenum, 1981.
+
+[71] G.D. Forney Jr. The Viterbi algorithm. Proc. IEEE, 13(3):268–278,
+1973.
+
+[72] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice Hall, 1992.
+
+[73] J. Kessels, T. Kramer, G. den Besten, A. Peeters, and V. Timm. Apply-
+ing asynchronous circuits in contactless smart cards. In Proc. Interna-
+tional Symposium on Advanced Research in Asynchronous Circuits and
+Systems, pages 36–44. IEEE Computer Society Press, April 2000.
+
+[74] J. Kessels, T. Kramer, A. Peeters, and V. Timm. DESCALE: a design ex-
+periment for a smart card application consuming low energy. In R. van
+Leuken, R. Nouta, and A. de Graaf, editors, European Low Power Ini-
+tiative for Electronic System Design, pages 247–262. Delft Institute of
+Microelectronics and Submicron Technology, July 2000.
+
+[75] Z. Kohavi. Switching and Finite Automata Theory. McGraw-Hill, 1978.
+
+[76] A. Kondratyev, J. Cortadella, M. Kishinevsky, L. Lavagno, and
+A. Yakovlev. Logic decomposition of speed-independent circuits. Pro-
+ceedings of the IEEE, 87(2):347–362, February 1999.
+
+[77] J. Liu. Arithmetic and control components for an asynchronous micro-
+processor. PhD thesis, Department of Computer Science, University of
+Manchester, 1997.
+
+[78] D.W. Lloyd. VHDL models of asynchronous handshaking. (Personal
+communication, August 1998).
+
+[79] A.J. Martin.
+The probe: An addition to communication primitives.
+Information Processing Letters, 20(3):125–130, 1985.
+Erratum: IPL
+21(2):107, 1985.
+
+[80] A.J. Martin. Compiling communicating processes into delay-insensitive
+VLSI circuits. Distributed Computing, 1(4):226–234, 1986.
+
+
+326
+PRINCIPLES OF ASYNCHRONOUS DESIGN
+
+[81] A.J. Martin. Formal program transformations for VLSI circuit synthesis.
+In E.W. Dijkstra, editor, Formal Development of Programs and Proofs,
+UT Year of Programming Series, pages 59–80. Addison-Wesley, 1989.
+
+[82] A.J. Martin. The limitations to delay-insensitivity in asynchronous cir-
+cuits. In W.J. Dally, editor, Advanced Research in VLSI: Proceedings of
+the Sixth MIT Conference, pages 263–278. MIT Press, 1990.
+
+[83] A.J. Martin. Programming in VLSI: From communicating processes
+to delay-insensitive circuits.
+In C.A.R. Hoare, editor, Developments
+in Concurrency and Communication, UT Year of Programming Series,
+pages 1–64. Addison-Wesley, 1990.
+
+[84] A.J. Martin. Synthesis of asynchronous VLSI circuits, 1991.
+
+[85] A.J. Martin. Asynchronous datapaths and the design of an asynchronous
+adder. Formal Methods in System Design, 1(1):119–137, July 1992.
+
+[86] A.J. Martin, S.M. Burns, T. K. Lee, D. Borkovic, and P.J. Hazewindus.
+The design of an asynchronous microprocessor.
+In Charles L. Seitz,
+editor, Advanced Research in VLSI, pages 351–373. MIT Press, 1989.
+
+[87] A.J. Martin, S.M. Burns, T.K. Lee, D. Borkovic, and P.J. Hazewindus.
+The first asynchronous microprocessor: The test results. Computer Ar-
+chitecture News, 17(4):95–98, 1989.
+
+[88] A.J. Martin, A. Lines, R. Manohar, M. Nyström, P. Penzes,
+R. Southworth, U.V. Cummings, and T.-K. Lee. The design of an asynchronous
+MIPS R3000. In Proceedings of the 17th Conference on Advanced Re-
+search in VLSI, pages 164–181. MIT Press, September 1997.
+
+[89] R. Milner. Communication and Concurrency. Prentice-Hall, 1989.
+
+[90] C.E. Molnar, I.W. Jones, W.S. Coates, and J.K. Lexau. A FIFO ring
+oscillator performance experiment. In Proc. International Symposium
+on Advanced Research in Asynchronous Circuits and Systems, pages
+279–289. IEEE Computer Society Press, April 1997.
+
+[91] C.E. Molnar, I.W. Jones, W.S. Coates, J.K. Lexau, S.M. Fairbanks, and
+I.E. Sutherland. Two FIFO ring performance experiments. Proceedings
+of the IEEE, 87(2):297–307, February 1999.
+
+[92] D.E. Muller. Asynchronous logics and application to information pro-
+cessing. In H. Aiken and W. F. Main, editors, Proc. Symp. on Applica-
+tion of Switching Theory in Space Technology, pages 289–297. Stanford
+University Press, 1963.
+
+
+REFERENCES
+327
+
+[93] D.E. Muller and W.S. Bartky. A theory of asynchronous circuits. In
+Proceedings of an International Symposium on the Theory of Switch-
+ing, Cambridge, April 1957, Part I, pages 204–243. Harvard University
+Press, 1959. The annals of the computation laboratory of Harvard Uni-
+versity, Volume XXIX.
+
+[94] T. Murata. Petri Nets: Properties, Analysis and Applications. Proceed-
+ings of the IEEE, 77(4):541–580, April 1989.
+
+[95] C.J. Myers. Asynchronous Circuit Design. John Wiley & Sons, July
+2001. ISBN: 0-471-41543-X.
+
+[96] National Bureau of Standards. Data encryption standard, January 1997.
+Federal Information Processing Standards Publication 46.
+
+[97] C.D. Nielsen. Evaluation of function blocks for asynchronous design.
+In Proc. European Design Automation Conference (EURO-DAC), pages
+454–459. IEEE Computer Society Press, September 1994.
+
+[98] C.D. Nielsen, J. Staunstrup, and S.R. Jones. Potential performance ad-
+vantages of delay-insensitivity. In M. Sami and J. Calzadilla-Daguerre,
+editors, Proceedings of IFIP workshop on Silicon Architectures for
+Neural Nets, StPaul-de-Vence, France, November 1990. North-Holland,
+Amsterdam, 1991.
+
+[99] L.S. Nielsen. Low-power Asynchronous VLSI Design. PhD thesis, De-
+partment of Information Technology, Technical University of Denmark,
+1997. IT-TR:1997-12.
+
+[100] L.S. Nielsen, C. Niessen, J. Sparsø, and C.H. van Berkel. Low-power
+operation using self-timed circuits and adaptive scaling of the supply
+voltage. IEEE Transactions on VLSI Systems, 2(4):391–397, 1994.
+
+[101] L.S. Nielsen and J. Sparsø. A low-power asynchronous data-path for a
+FIR filter bank. In Proc. International Symposium on Advanced Re-
+search in Asynchronous Circuits and Systems, pages 197–207. IEEE
+Computer Society Press, 1996.
+
+[102] L.S. Nielsen and J. Sparsø.
+An 85 µW asynchronous filter-bank for
+a digital hearing aid. In Proc. IEEE International Solid State circuits
+Conference, pages 108–109, 1998.
+
+[103] L.S. Nielsen and J. Sparsø. Designing asynchronous circuits for low
+power: An IFIR filter bank for a digital hearing aid. Proceedings of the
+IEEE, 87(2):268–281, February 1999. Special issue on “Asynchronous
+Circuits and Systems” (Invited Paper).
+
+
+328
+PRINCIPLES OF ASYNCHRONOUS DESIGN
+
+[104] D.C. Noice.
+A Two-Phase Clocking Discipline for Digital Integrated
+Circuits. PhD thesis, Department of Electrical Engineering, Stanford
+University, February 1983.
+
+[105] S.M. Nowick. Design of a low-latency asynchronous adder using specu-
+lative completion. IEE Proceedings, Computers and Digital Techniques,
+143(5):301–307, September 1996.
+
+[106] S.M. Nowick, M.B. Josephs, and C.H. van Berkel (editors).
+Special
+issue on asynchronous circuits and systems. Proceedings of the IEEE,
+87(2), February 1999.
+
+[107] S.M. Nowick, K.Y. Yun, and P.A. Beerel. Speculative completion for
+the design of high-performance asynchronous dynamic adders. In Proc.
+International Symposium on Advanced Research in Asynchronous Cir-
+cuits and Systems, pages 210–223. IEEE Computer Society Press, April
+1997.
+
+[108] International Standards Organization. LOTOS — a formal description
+technique based on the temporal ordering of observational behaviour.
+ISO IS 8807, 1989.
+
+[109] N.C. Paver, P. Day, C. Farnsworth, D.L. Jackson, W.A. Lien, and J. Liu.
+A low-power, low-noise configurable self-timed DSP. In Proc. Interna-
+tional Symposium on Advanced Research in Asynchronous Circuits and
+Systems, pages 32–42, 1998.
+
+[110] M. Pedersen.
+Design of asynchronous circuits using standard CAD
+tools. Technical Report IT-E 774, Technical University of Denmark,
+Dept. of Information Technology, 1998. (In Danish).
+
+[111] A.M.G. Peeters. The ‘Asynchronous’ Bibliography.
+http://www.win.tue.nl/~wsinap/async.html.
+Corresponding e-mail address: async-bib@win.tue.nl.
+
+[112] A.M.G. Peeters. Single-Rail Handshake Circuits. PhD thesis, Eind-
+hoven University of Technology, June 1996.
+http://www.win.tue.nl/~wsinap/pdf/Peeters96.pdf.
+
+[113] J.L. Peterson. Petri nets. Computing Surveys, 9(3):223–252, September
+1977.
+
+[114] Philips Semiconductors. PCA5007 handshake-technology pager IC data
+sheet. http://www.semiconductors.philips.com/pip/PCA5007H.
+
+[115] J. Rabaey. Digital Integrated Circuits: A Design Perspective. Prentice-
+Hall, 1996.
+
+
+REFERENCES
+329
+
+[116] P. Rakers, L. Connell, T. Collins, and D. Russell. Secure contactless
+smartcard ASIC with DPA protection. IEEE Journal of Solid-State Cir-
+cuits, 36(3):559–565, March 2001.
+
+[117] M. Renaudin, P. Vivet, and F. Robin. ASPRO-216: A standard-cell QDI
+16-bit RISC asynchronous microprocessor. In Proc. International Sym-
+posium on Advanced Research in Asynchronous Circuits and Systems
+(Async’98), pages 22–31. IEEE Computer Society Press, April 1998.
+
+[118] M. Renaudin, P. Vivet, and F. Robin. A design framework for asyn-
+chronous/synchronous circuits based on CHP to HDL translation. In
+Proc. International Symposium on Advanced Research in Asynchronous
+Circuits and Systems, pages 135–144, April 1999.
+
+[119] R. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital
+signatures and public-key crypto systems, June 1978.
+
+[120] M. Roncken. Defect-oriented testability for asynchronous ICs. Pro-
+ceedings of the IEEE, 87(2):363–375, February 1999.
+
+[121] C.L. Seitz. System timing. In C.A. Mead and L.A. Conway, editors,
+Introduction to VLSI Systems, chapter 7. Addison-Wesley, 1980.
+
+[122] N.P. Singh. A design methodology for self-timed systems. Master’s
+thesis, Laboratory for Computer Science, MIT, 1981. MIT/LCS/TR-
+258.
+
+[123] J. Sparsø, C.D. Nielsen, L.S. Nielsen, and J. Staunstrup. Design of self-
+timed multipliers: A comparison. In S. Furber and M. Edwards, editors,
+Asynchronous Design Methodologies, volume A-28 of IFIP Transac-
+tions, pages 165–180. Elsevier Science Publishers, 1993.
+
+[124] J. Sparsø and J. Staunstrup. Delay-insensitive multi-ring structures. IN-
+TEGRATION, the VLSI Journal, 15(3):313–340, October 1993.
+
+[125] J. Sparsø, J. Staunstrup, and M. Dantzer-Sørensen. Design of delay in-
+sensitive circuits using multi-ring structures. In G. Musgrave, editor,
+Proc. of EURO-DAC ’92, European Design Automation Conference,
+Hamburg, Germany, September 7-10, 1992, pages 15–20. IEEE Com-
+puter Society Press, 1992.
+
+[126] R.F. Sproull, I.E. Sutherland, and C.E. Molnar. The counterflow pipeline
+processor architecture. IEEE Design & Test of Computers, 11(3):48–59,
+1994.
+
+[127] L. Stok. Architectural Synthesis and Optimization of Digital Systems.
+PhD thesis, Eindhoven University of Technology, 1991.
+
+
+330
+PRINCIPLES OF ASYNCHRONOUS DESIGN
+
+[128] I.E. Sutherland.
+Micropipelines.
+Communications of the ACM,
+32(6):720–738, June 1989.
+
+[129] Synopsys, Inc. Synopsys VSS Family Core Programs Manual, 1997.
+
+[130] A. Takamura, M. Kuwako, M. Imai, T. Fujii, M. Ozawa, I. Fukasaku,
+Y. Ueno, and T. Nanya. TITAC-2: An asynchronous 32-bit micropro-
+cessor based on scalable-delay-insensitive model. In Proc. International
+Conf. Computer Design (ICCD’97), pages 288–294. MIT Press, Octo-
+ber 1997.
+
+[131] H. Terada, S. Miyata, and M. Iwata.
+DDMPs: Self-timed super-
+pipelined data-driven multimedia processors. Proceedings of the IEEE,
+87(2):282–296, February 1999.
+
+[132] M. Theobald and S.M. Nowick. Transformations for the synthesis and
+optimization of asynchronous distributed control. In Proc. ACM/IEEE
+Design Automation Conference, 2001.
+
+[133] S.H. Unger.
+Asynchronous Sequential Switching Circuits.
+Wiley-
+Interscience, John Wiley & Sons, Inc., New York, 1969.
+
+[134] C.H. van Berkel. Beware the isochronic fork. INTEGRATION, the VLSI
+journal, 13(3):103–128, 1992.
+
+[135] C.H. van Berkel. Handshake Circuits: an Asynchronous Architecture
+for VLSI Programming, volume 5 of International Series on Parallel
+Computation. Cambridge University Press, 1993.
+
+[136] C.H. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, and
+F. Schalij. Asynchronous circuits for low power: a DCC error corrector.
+IEEE Design & Test, 11(2):22–32, 1994.
+
+[137] C.H. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, and
+F. Schalij. A fully asynchronous low-power error corrector for the DCC
+player. In ISSCC 1994 Digest of Technical Papers, volume 37, pages
+88–89. IEEE, 1994. ISSN 0193-6530.
+
+[138] C.H. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken,
+F. Schalij, and R. van de Viel.
+A single-rail re-implementation of a
+DCC error detector using a generic standard-cell library. In 2nd Work-
+ing Conference on Asynchronous Design Methodologies, London, May
+30-31, 1995, pages 72–79, 1995.
+
+[139] C.H. van Berkel, F. Huberts, and A. Peeters.
+Stretching quasi delay
+insensitivity by means of extended isochronic forks. In Asynchronous
+
+
+REFERENCES
+331
+
+Design Methodologies, pages 99–106. IEEE Computer Society Press,
+May 1995.
+
+[140] C.H. van Berkel, M.B. Josephs, and S.M. Nowick. Scanning the tech-
+nology: Applications of asynchronous circuits.
+Proceedings of the
+IEEE, 87(2):223–233, February 1999.
+
+[141] C.H. van Berkel, J. Kessels, M. Roncken, R. Saeijs, and F. Schalij. The
+VLSI-programming language Tangram and its translation into hand-
+shake circuits. In Proc. European Conference on Design Automation
+(EDAC), pages 384–389, 1991.
+
+[142] C.H. van Berkel, C. Niessen, M. Rem, and R. Saeijs. VLSI program-
+ming and silicon compilation. In Proc. International Conf. Computer
+Design (ICCD), pages 150–166, Rye Brook, New York, 1988. IEEE
+Computer Society Press.
+
+[143] H. van Gageldonk.
+An Asynchronous Low-Power 80C51 Microcon-
+troller. PhD thesis, Dept. of Math. and C.S., Eindhoven Univ. of Tech-
+nology, September 1998.
+
+[144] H. van Gageldonk, D. Baumann, C.H. van Berkel, D. Gloor, A. Peeters,
+and G. Stegmann. An asynchronous low-power 80c51 microcontroller.
+In Proc. International Symposium on Advanced Research in Asyn-
+chronous Circuits and Systems, pages 96–107. IEEE Computer Society
+Press, April 1998.
+
+[145] P. Vanbekbergen.
+Synthesis of Asynchronous Control Circuits from
+Graph-Theoretic Specifications. PhD thesis, Catholic University of Leu-
+ven, September 1993.
+
+[146] V.I. Varshavsky, M.A. Kishinevsky, V.B. Marakhovsky, V.A. Peschan-
+sky, L.Y. Rosenblum, A.R. Taubin, and B.S. Tzirlin. Self-timed Con-
+trol of Concurrent Processes.
+Kluwer Academic Publisher, 1990.
+V.I.Varshavsky Ed., (Russian edition: 1986).
+
+[147] T. Verhoeff. Delay-insensitive codes - an overview. Distributed Com-
+puting, 3(1):1–8, 1988.
+
+[148] A.J. Viterbi. Error bounds for convolutional codes and an asymptotically
+optimum decoding algorithm.
+In IEEE Transactions on Information
+Theory, volume 13, pages 260–269, 1967.
+
+[149] P. Vivet and M. Renaudin. CHP2VHDL, a CHP to VHDL translator
+- towards asynchronous-design simulation.
+In L. Lavagno and M.B.
+
+
+332
+PRINCIPLES OF ASYNCHRONOUS DESIGN
+
+Josephs, editors, Handouts from the ACiD-WG Workshop on Specification
+models and languages and technology effects of asynchronous design.
+Dipartimento di Elettronica, Politecnico di Torino, Italy, January
+1998.
+
+[150] J.F. Wakerly. Digital Design: Principles and Practices, 3/e. Prentice-
+Hall, 2001.
+
+[151] N. Weste and K. Eshraghian. Principles of CMOS VLSI Design – A
+Systems Perspective, 2nd edition. Addison-Wesley, 1993.
+
+[152] Rik van de Wiel. High-level test evaluation of asynchronous circuits.
+In Asynchronous Design Methodologies, pages 63–71. IEEE Computer
+Society Press, May 1995.
+
+[153] T.E. Williams. Self-Timed Rings and their Application to Division. PhD
+thesis, Department of Electrical Engineering and Computer Science,
+Stanford University, 1991. CSL-TR-91-482.
+
+[154] T.E. Williams. Analyzing and improving latency and throughput in self-
+timed rings and pipelines. In Tau-92: 1992 Workshop on Timing Issues
+in the Specification and Synthesis of Digital Systems. ACM/SIGDA,
+March 1992.
+
+[155] T.E. Williams. Performance of iterative computation in self-timed rings.
+Journal of VLSI Signal Processing, 6(3), October 1993.
+
+[156] T.E. Williams and M.A. Horowitz.
+A zero-overhead self-timed 160
+ns. 54 bit CMOS divider.
+IEEE Journal of Solid State Circuits,
+26(11):1651–1661, 1991.
+
+[157] T.E. Williams, N. Patkar, and G. Shen. SPARC64: A 64-b 64-active-
+instruction out-of-order-execution MCM processor.
+IEEE Journal of
+Solid-State Circuits, 30(11):1215–1226, November 1995.
+
+[158] J.V. Woods, P. Day, S.B. Furber, J.D. Garside, N.C. Paver, and S. Tem-
+ple. AMULET1: An asynchronous ARM processor. IEEE Transactions
+on Computers, 46(4):385–398, April 1997.
+
+[159] C. Ykman-Couvreur, B. Lin, and H. de Man. Assassin: A synthesis sys-
+tem for asynchronous control circuits. Technical report, IMEC, Septem-
+ber 1994. User and Tutorial manual.
+
+[160] K.Y. Yun and D.L. Dill. Automatic synthesis of extended burst-mode
+circuits: Part II (automatic synthesis). IEEE Transactions on Computer-
+Aided Design, 18(2):118–132, February 1999.
+
+
+Index
+
+Acknowledgement (or indication), 15
+Activation port, 163
+Active port, 156
+Actual case latency, 65
+Adaptive voltage scaling, 231
+Addition (ripple-carry), 64
+Amulet microprocessors, 274
+Amulet1, 274, 281, 285, 290
+Amulet2, 290
+Amulet2e, 275, 303
+Amulet3, 282, 286, 291, 297, 300
+Amulet3i, 156, 205, 275, 303, 313
+DRACO, 278, 310
+And-or-invert (AOI) gates, 102
+Arbitration, 79, 202, 210–211, 269, 287, 304
+ARM, 274, 280, 297
+ASPRO-216, 278, 284
+Asymmetric delay, 48, 53
+Asynchronous advantages, 3, 231
+Asynchronous disadvantages, 232
+Asynchronous synthesis, 155
+Atomic complex gate, 94, 103
+Automatic performance adaptation, 231
+Average power consumption, 237
+Balsa, 123, 155, 312
+communications, 179
+area cost, 167
+array types, 175
+arrayed channels, 166, 176
+auto-assignment, 178, 184
+channel viewer, 171
+conditional execution, 180
+constants, 166, 174
+data types, 173
+design flow, 159
+DMA controller, 211
+enumerated types, 174
+for loops, 166
+hardware sharing, 187
+looping constructs, 180
+modular compilation, 165
+numeric types, 173
+operators, 181
+parallel composition, 165
+
+parameterised descriptions, 193
+program structure, 181
+record types, 174
+recursive definitions, 195
+simulation, 168
+structural iteration, 180, 189
+test harness, 168, 197
+tools, 159
+Branch colour, 284, 300
+Breeze, 159, 162
+Bubble limited, 49
+Bubble, 30
+Bundled-data, 9, 157, 255
+Burst mode, 86
+input burst, 86
+output burst, 86
+C-element, 14, 58, 92, 257
+asymmetric, 100, 59
+generalized, 100, 103, 105
+implementation, 15
+specification, 15, 92
+Cache, 275, 303–304, 307
+Calibrated time delays, 313
+Caltech, 133, 276
+Capture-pass latch, 19
+Cast, 176
+CCS (calculus of communicating systems), 123
+Channel (or link), 7, 30, 156
+communication in Balsa, 162
+Channel type
+biput, 115
+nonput, 115
+pull, 10, 115
+push, 10, 115
+Chip area interconnect (Chain), 312
+CHP (communicating hardware processes),
+123–124, 278
+Circuit templates:
+for statement, 37
+if statement, 36
+while statement, 38
+Classification
+delay-insensitive (DI), 25
+quasi delay-insensitive (QDI), 25
+
+333
+
+
+334
+PRINCIPLES OF ASYNCHRONOUS DESIGN
+
+self-timed, 26
+speed-independent (SI), 25
+Closed circuit, 23
+Codeword (dual-rail), 12
+empty, 12
+intermediate, 12
+valid, 12
+Compatible states, 85
+Complete state coding (CSC), 88
+Completion indication, 65
+Completion
+detection, 21–22, 302
+indication, 62
+strong, 62
+weak, 62
+Complex gates, 104
+Concurrent processes, 123
+Concurrent statements, 123
+Consistent state assignment, 88
+Control limited, 50
+Control logic for transition signaling, 20
+Control-data-flow graphs, 36
+Convolution encoding, 250
+Counterflow Pipeline Processor (CFPP), 286
+CSP (communicating sequential processes), 123,
+223
+Cycle time of a ring, 49
+Data dependency, 279
+Data encoding
+bundled-data, 9
+dual-rail, 11
+m-of-n, 14
+one-hot (or 1-of-n), 13
+single-rail, 10
+Data limited, 49
+Data types, 173
+Data validity scheme (4-phase bundled-data)
+broad, 116
+early, 116
+extended early, 116
+late, 116
+Data validity, 156
+Data-flow abstraction, 7
+DCVSL, 70
+Deadlock, 30, 155, 199, 285, 287, 305
+Delay assumptions, 23
+Delay insensitive minterm synthesis (DIMS), 67
+Delay matching, 11, 236
+Delay model
+fixed delay, 83
+inertial delay, 83
+delay time, 83
+reject time, 83
+min-max delay, 83
+transport delay, 83
+unbounded delay, 83
+Delay selection, 66, 305
+
+Delay-insensitive (DI), 12, 17, 25, 156
+codes, 12, 312
+Demultiplexer (DEMUX), 32
+Dependency graph, 52
+DES coprocessor, 241
+Design for test, 314
+Determinism, 282
+Differential logic, 70
+DIMS, 67–68
+DMA controller, 202, 205
+Balsa description, 211
+control unit, 209, 213
+on DRACO, 312
+transfer engine, 210, 212
+DRACO, 276, 278, 310, 315
+Dual-rail carry signals, 65
+Dual-rail encoding, 11
+Dummy environment, 87
+Dynamic wavelength, 49
+Electromagnetic interference (EMI), 278, 315
+emission spectrum, 231
+Electromagnetic radiation (as power source), 222
+Empty word, 12, 29–30
+Environment, 83
+Event, 9
+Exceptions, 297
+Excitation region, 97
+Excited gate/variable, 24
+FIFO, 16, 257
+Finite state machine (using a ring), 35
+Firing (of a gate), 24
+For statement, 37, 166
+Fork, 31
+Forward latency, 47
+Four-phase handshake, 225
+Function block, 31, 60–61
+bundled-data (“speculative completion”), 66
+bundled-data, 18, 65
+dual-rail (DIMS), 67
+dual-rail (Martin’s adder), 71
+dual-rail (null convention logic), 69
+dual-rail (transistor level CMOS), 70
+dual-rail, 22
+hybrid, 73
+strongly indicating, 62
+weakly indicating, 62
+Fundamental mode, 81, 83–84
+Generalized C-element, 103, 105
+Generate (carry), 65
+Globally Asynchronous, Locally Synchronous
+(GALS), 312
+Greatest common divisor (GCD), 38, 131, 226
+Guarded command, 128, 240
+Guarded repetition, 128
+Halt, 237, 282
+Handshake channel, 115, 156, 225
+biput, 115
+
+
+INDEX
+335
+
+nonput, 115, 129
+pull, 10, 115, 129, 156
+push, 10, 115, 129, 156
+Handshake circuit, 128, 162, 223
+2-place ripple FIFO, 130–131
+2-place shift register, 129
+greatest common divisor (GCD), 132, 226
+Handshake component, 156, 225
+arbiter, 79
+bar, 131
+case, 157
+demultiplexer, 32, 76, 131
+do, 131, 226
+fetch, 157, 163
+fork, 31, 58, 131, 133
+join, 31, 58, 130
+latch, 29, 31, 57
+2-phase bundled-data, 19
+4-phase bundled-data, 18, 106
+4-phase dual-rail, 21
+merge, 32, 58
+multiplexer, 32, 76, 109, 131
+parallel, 225–226
+passivator, 130
+repeater, 129, 163
+sequencer, 129, 163, 225
+transferer, 130
+variable, 130, 163
+Handshake expansion, 133
+Handshake protocol, 7, 9
+2-phase bundled-data, 9, 274–275
+2-phase dual-rail, 13
+4-phase bundled-data, 9, 117, 255
+4-phase dual-rail, 11
+non-return-to-zero (NRZ), 10
+return-to-zero (RTZ), 10
+Handshaking, 7, 155
+Hazard, 297
+dynamic-01, 83
+dynamic-10, 83, 95
+static-0, 83
+static-1, 83, 94
+Huffman, D. A., 84
+Hysteresis, 22, 64
+If statement, 36, 181
+IFIR filter bank, 39
+Indication (or acknowledgement), 15
+of completion, 65
+dependency graphs, 73
+distribution of valid/empty indication, 72
+strong, 62
+weak, 62
+Initial state, 101
+Initialization, 101, 30
+Input free choice, 88
+Input-output mode, 81, 84
+Instruction prefetching, 236
+
+Intermediate codeword, 12
+Interrupts, 299
+Isochronic fork, 26
+Iterative computation (using a ring), 35
+Join, 31
+Kill (carry), 65
+LARD, 159
+Latch (see also: handshake comp.), 18
+Latch controller, 106
+fully-decoupled, 120
+normally opaque, 121
+normally transparent, 121
+semi-decoupled, 120
+simple/un-decoupled, 119
+Latency, 47
+actual case, 65
+Line fetch latch (LFL), 308
+Link (or channel), 7, 30
+Liveness, 88
+Lock FIFO, 290
+Logic decomposition, 94
+Logic thresholds, 27
+LOTOS, 123
+M-of-n threshold gates with hysteresis, 69
+Makefile, 165–166
+MARBLE bus, 209, 304, 312
+Matched delay, 11, 65
+Memory, 302
+Merge, 32
+Metastability, 78
+filter, 78
+mean time between failure, 79
+probability of, 79
+Micropipelines, 19, 156, 274
+Microprocessors
+80C51, 236
+Amulet series, 274
+ASPRO-216, 278, 284
+asynchronous MIPS R3000, 133
+asynchronous MIPS, 39
+CFPP, 286
+MiniMIPS, 276
+TITAC-2, 276
+Minterm, 22, 67
+Modulo-10 counter, 158, 185
+Modulo-16 counter, 183
+Monotonic cover constraint, 97, 99, 103
+Muller C-element, 15
+Muller model of a closed circuit, 23
+Muller pipeline/distributor, 16, 257
+Muller, D., 84
+Multi-cycle instruction, 281
+Multiplexer (MUX), 32, 109
+Mutual exclusion, 58, 77, 300, 304
+mutual exclusion element (MUTEX), 77
+N-way multiplexer, 195
+NCL adder, 70
+
+
+336
+PRINCIPLES OF ASYNCHRONOUS DESIGN
+
+Non-determinism, 282, 305
+Non-return-to-zero (NRZ), 10
+Null Convention Logic (NCL), 69
+NULL, 12
+OCCAM, 123
+Occupancy (or static spread), 49
+One-hot encoding, 13
+Operator reduction, 134
+Optimization, 227
+Parallel composition, 164
+Passive port, 156
+Performance parameters:
+cycle time of a ring, 49
+dynamic wavelength, 49
+forward latency, 47
+latency, 47
+period, 48
+reverse latency, 48
+throughput, 49
+Performance
+analysis and optimization, 41
+average case, 280
+Period, 48
+Persistency, 88
+Petri net, 86
+merge, 88
+1-bounded, 88
+controlled choice, 89
+firing, 86
+fork, 88
+input free choice, 88
+join, 88
+liveness, 88
+places, 86
+token, 86
+transition, 86
+Petrify, 102, 296
+Pipeline, 5, 30, 279
+2-phase bundled-data, 19
+4-phase bundled-data, 18
+4-phase dual-rail, 20
+balance, 281
+Place, 86
+Power consumption, 231, 234
+Power efficiency, 250
+Power supply, 246
+Precharged CMOS circuitry, 116
+Prefetch unit (80C51), 239
+Primitive flow table, 85
+Probe, 123, 125
+Process decomposition, 133
+Processors, 274
+Production rule expansion, 134
+Propagate (carry), 65
+Pull channel, 10, 115, 156
+Push channel, 10, 115, 156
+Quasi delay-insensitive (QDI), 25
+
+Quiescent region, 97
+Re-shuffling signal transitions, 102, 112
+Read-after-write data hazard, 40
+Receive, 123, 125
+Reduced flow table, 85
+Register
+dependency, 289
+locking, 40, 289
+Rendezvous, 125
+Reorder buffer, 291
+Reset function, 97
+Reset signal, 163
+Return-to-zero (RTZ), 9–10
+Reverse latency, 48
+Ring, 30, 296
+finite state machine, 35
+iterative computation, 35
+Ripple FIFO, 16
+Self-timed, 26
+Semantics-preserving transformations, 133
+Send, 123, 125
+Sequencer, 225
+Sequencing, 162
+Serial unary arithmetic, 257
+Set function, 97
+Set-Reset implementation, 96
+Shared resource, 77
+Sharing, 187, 223
+Sharp DDMP, 278
+Shift register
+with parallel load, 44
+Signal transition graph (STG), 86, 297
+Signal transition, 9
+Silicon compiler, 124, 223
+Simulation, 168
+Single input change, 84
+Single-place buffer, 161
+Single-rail, 10
+Smart cards, 222, 232
+Spacer, 12
+Speculative completion, 66
+Speed adaptation, 248
+Speed-independent (SI), 23–25, 83
+Stable gate/variable, 23
+Standard C-element, 106
+implementation, 96
+State graph, 85
+Static data-flow structure, 7, 29
+Static data-flow structure
+examples:
+greatest common divisor (GCD), 38
+IFIR filter bank, 39
+MIPS microprocessor, 39
+simple example, 33
+vector multiplier, 40
+Static spread (or occupancy), 49, 120
+Static type checking, 118
+
+
+INDEX
+337
+
+Stuck-at fault model, 27
+Supply voltage variations, 231, 236
+Synchronizer flip-flop, 78
+Synchronous message passing, 123
+Syntax-directed compilation, 128, 155
+Tangram examples:
+2-place ripple FIFO, 127
+2-place shift register, 126
+GCD using guarded repetition, 128
+GCD using while and if statements, 127
+Tangram, 123, 155, 222–223, 278
+Technology mapping, 103, 224
+Test, 27, 245, 314
+IDDQ testing, 28
+halting of circuit, 28, 246
+isochronic forks, 28
+short and open faults, 28
+stuck-at faults, 27
+toggle test, 28
+untestable stuck-at faults, 28
+Throughput, 42, 49
+Thumb decoder, 282
+Time safe, 78
+TITAC-2, 276, 308
+Token, 7, 30, 86
+Transition, 86
+
+Transparent to handshaking, 7, 23, 33, 61
+Two-place buffer, 163
+Unique entry constraint, 97, 99
+Up/down decade counter, 185
+Valid codeword, 12
+Valid data, 12, 29
+Valid token, 30
+Value safe, 78
+Vector multiplier, 40
+Verilog, 124
+VHDL, 124, 155, 237
+Viterbi decoder, 249
+backtrace, 264
+branch metric, 260
+constraint length, 251
+global winner, 262
+History Unit, 264
+Path Metric Unit (PMU), 256
+soft codes, 253
+trellis diagram, 251
+VLSI programming, 128, 223
+VSTGL (Visual STG Lab), 103
+Wave, 16
+crest, 16
+trough, 16
+While statement, 38
+Write-back, 40
+
+
diff --git a/papers/references/Amari2016.pdf b/papers/references/Amari2016.pdf
new file mode 100644
index 00000000..eab69df0
--- /dev/null
+++ b/papers/references/Amari2016.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d54c73f85233ad188ddd594aa061cd6ed671c77dd371ef2525e3448098558190
+size 15842558
diff --git a/papers/references/Amari2016.txt b/papers/references/Amari2016.txt
new file mode 100644
index 00000000..6c6084bb
--- /dev/null
+++ b/papers/references/Amari2016.txt
@@ -0,0 +1,37050 @@
+Information 
+Geometry
+
+Geert Verdoolaege
+
+www.mdpi.com/journal/entropy
+
+Edited by
+
+Printed Edition of the Special Issue Published in Entropy
+
+
+Information Geometry
+
+
+
+Information Geometry
+
+Special Issue Editor
+
+Geert Verdoolaege
+
+MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade
+
+
+Special Issue Editor
+
+Geert Verdoolaege
+
+Ghent University
+
+Belgium
+
+Editorial Office
+
+MDPI
+
+St. Alban-Anlage 66
+
+4052 Basel, Switzerland
+
+This is a reprint of articles from the Special Issue published online in the open access journal Entropy
+
+(ISSN 1099-4300) in 2014 (available at: https://www.mdpi.com/journal/entropy/special_issues/information-geometry)
+
+For citation purposes, cite each article independently as indicated on the article page online and as
+
+indicated below:
+
+LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. Journal Name Year, Article Number,
+
+Page Range.
+
+ISBN 978-3-03897-632-5 (Pbk)
+
+ISBN 978-3-03897-633-2 (PDF)
+
+Cover image courtesy of Geert Verdoolaege.
+
+© 2019 by the authors. Articles in this book are Open Access and distributed under the Creative
+
+Commons Attribution (CC BY) license, which allows users to download, copy and build upon
+
+published articles, as long as the author and publisher are properly credited, which ensures maximum
+
+dissemination and a wider impact of our publications.
+
+The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons
+
+license CC BY-NC-ND.
+
+
+Contents
+
+About the Special Issue Editor
+
+Preface to “Information Geometry”
+
+Shun-ichi Amari
+Information Geometry of Positive Measures and Positive-Definite Matrices:
+Decomposable Dually Flat Structure
+Reprinted from: Entropy 2014, 16, 2131–2145, doi:10.3390/e16042131
+
+Harsha K. V. and Subrahamanian Moosath K S
+F-Geometry and Amari’s α-Geometry on a Statistical Manifold
+Reprinted from: Entropy 2014, 16, 2472–2487, doi:10.3390/e16052472
+
+Frank Critchley and Paul Marriott
+Computational Information Geometry in Statistics: Theory and Practice
+Reprinted from: Entropy 2014, 16, 2454–2471, doi:10.3390/e16052454
+
+Paul Vos and Karim Anaya-Izquierdo
+Using Geometry to Select One Dimensional Exponential Families That Are Monotone
+Likelihood Ratio in the Sample Space, Are Weakly Unimodal and Can Be Parametrized by a
+Measure of Central Tendency
+Reprinted from: Entropy 2014, 16, 4088–4100, doi:10.3390/e16074088
+
+Guido Montúfar, Johannes Rauh and Nihat Ay
+On the Fisher Metric of Conditional Probability Polytopes
+Reprinted from: Entropy 2014, 16, 3207–3233, doi:10.3390/e16063207
+
+André Klein
+Matrix Algebraic Properties of the Fisher Information Matrix of Stationary Processes
+Reprinted from: Entropy 2014, 16, 2023–2055, doi:10.3390/e16042023
+
+Keisuke Yano and Fumiyasu Komaki
+Asymptotically Constant-Risk Predictive Densities When the Distributions of Data and Target
+Variables Are Different
+Reprinted from: Entropy 2014, 16, 3026–3048, doi:10.3390/e16063026
+
+Samuel Livingstone and Mark Girolami
+Information-Geometric Markov Chain Monte Carlo Methods Using Diffusions
+Reprinted from: Entropy 2014, 16, 3074–3102, doi:10.3390/e16063074
+
+Hui Zhao and Paul Marriott
+Variational Bayes for Regime-Switching Log-Normal Models
+Reprinted from: Entropy 2014, 16, 3832–3847, doi:10.3390/e16073832
+
+Frank Nielsen, Richard Nock and Shun-ichi Amari
+On Clustering Histograms with k-Means by Using Mixed α-Divergences
+Reprinted from: Entropy 2014, 16, 3273–3301, doi:10.3390/e16063273
+
+Salem Said, Lionel Bombrun and Yannick Berthoumieu
+New Riemannian Priors on the Univariate Normal Model
+Reprinted from: Entropy 2014, 16, 4015–4031, doi:10.3390/e16074015
+
+Luigi Malagò and Giovanni Pistone
+Combinatorial Optimization with Information Geometry: The Newton Method
+Reprinted from: Entropy 2014, 16, 4260–4289, doi:10.3390/e16084260
+
+Domenico Felice, Carlo Cafaro and Stefano Mancini
+Information Geometric Complexity of a Trivariate Gaussian Statistical Model
+Reprinted from: Entropy 2014, 16, 2944–2958, doi:10.3390/e16062944
+
+Alexandre Levada
+Learning from Complex Systems: On the Roles of Entropy and Fisher Information in Pairwise
+Isotropic Gaussian Markov Random Fields
+Reprinted from: Entropy 2014, 16, 1002–1036, doi:10.3390/e16021002
+
+Masatoshi Funabashi
+Network Decomposition and Complexity Measures: An Information Geometrical Approach
+Reprinted from: Entropy 2014, 16, 4132–4167, doi:10.3390/e16074132
+
+Roger Balian
+The Entropy-Based Quantum Metric
+Reprinted from: Entropy 2014, 16, 3878–3888, doi:10.3390/e16073878
+
+Xiaozhao Zhao, Yuexian Hou, Dawei Song and Wenjie Li
+Extending the Extreme Physical Information to Universal Cognitive Models via a Confident
+Information First Principle
+Reprinted from: Entropy 2014, 16, 3670–3688, doi:10.3390/e16073670
+
+
+About the Special Issue Editor
+
+Geert Verdoolaege obtained an M.Sc.
+degree in Theoretical Physics in 1999 and the Ph.D. in
+
+Engineering Physics in 2006, both at Ghent University (UGent, Belgium). His Ph.D. work concerned
+
+applications of Bayesian probability theory to plasma spectroscopy in fusion devices.
+He was a
+
+postdoctoral researcher in the field of computer vision at the University of Antwerp (2007–2008),
+
+working on probabilistic modeling of image textures using information geometry. From 2008 to 2010,
+
+he was with the Department of Data Analysis at UGent, where he worked on modeling and
+
+estimation of brain activity, based on functional magnetic resonance imaging. In 2010, he returned
+
+to the Department of Applied Physics at UGent, first as a postdoctoral assistant and from 2014
+
+onwards, as a part-time assistant professor.
+Since 2013, he has held a cross-appointment as a
+
+researcher at the Laboratory for Plasma Physics of the Royal Military Academy (LPP-ERM/KMS)
+
+in Brussels. His research activities comprise development of data analysis techniques using methods
+
+from probability theory, machine learning and information geometry, and their application to nuclear
+
+fusion experiments. He also teaches a Master course on Continuum Mechanics at Ghent University.
+
+He serves on the editorial board of the multidisciplinary journal Entropy and is a member of the
+
+scientific committees of several conferences (IAEA Technical Meeting on Fusion Data Processing,
+
+Validation and Analysis; International Workshop on Bayesian Inference and Maximum Entropy
+
+Methods in Science and Engineering; Conference on Geometric Science of Information). In addition,
+
+he is a consulting expert in the International Tokamak Physics Activity (ITPA) Transport and
+
+Confinement Topical Group and member of the General Assembly of the European Fusion Education
+
+Network (FuseNet).
+
+
+
+
+Preface to “Information Geometry”
+
+The mathematical field of information geometry originated from the observation that the Fisher
+
+information can be used to define a Riemannian metric on manifolds of probability distributions.
+
+This led to a geometrical description of probability theory and statistics, which over the years has
+
+developed into a rich mathematical field with a broad range of applications in the data sciences.
+
+Moreover, similar to the concept of entropy, there are various connections to and applications of
+
+information geometry in statistical mechanics, quantum mechanics, and neuroscience.
+
+It has been a pleasure to act as a guest editor for this first Special Issue on information geometry
+
+in the journal Entropy. For me, as a physicist working on the development and application of
+
+data science techniques in the context of nuclear fusion experiments, the interdisciplinary character
+
+of information geometry has always been one of the main reasons for its appeal.
+There are, of
+
+course, many other domains in physics where geometrical notions play a key role, including classical
+
+mechanics, continuum mechanics (which I have been teaching at Ghent University for several years
+
+now), general relativity, and much of modern physics. This interplay between the beautiful and
+
+elegant formalism of differential geometry on the one hand and physics and data science on the
+
+other hand is both fascinating and inspiring. The variety of topics covered by this Special Issue is a
+
+reflection of this cross-fertilization between disciplines.
+
+“Information Geometry I” has been a great success, and although the papers were already published
+
+several years ago, it was decided that it was worthwhile to reprint the collection of papers
+
+in book form. Indeed, even though all papers present original research, many have a strong tutorial
+
+character, and we were honored to receive multiple contributions by authorities in the field. The
+
+papers have been structured according to their main subject area, or field of application, and we
+
+briefly discuss each of them in the following.
+
+We start with two papers related to the foundations of information geometry. We were very
+
+pleased to receive a contribution by one of the founders of the field of information geometry, Prof.
+
+Shun-ichi Amari. In his paper, the dually flat structure of the manifold of positive measures is
+
+discussed, derived from a class of Bregman divergences. These so-called (ρ,τ)-divergences, originally
+
+proposed by J. Zhang, are defined in terms of two monotone, scalar functions (ρ and τ) and form a
+
+unique class of dually flat, decomposable divergences. This is extended to the set of positive-definite
+
+matrices, additionally requiring invariance of the divergence under matrix transformations. It is well
+
+known that such dually flat manifolds have computationally desirable properties in applications to
+
+classification and information retrieval.
+
+Harsha K. V. and Subrahamanian Moosath K. S. introduce F-geometry as a generalization
+
+of α-geometry, based on a representation of a probability density function through a function F.
+
+They then combine this with another function G to define a weighted expectation, from which
+
+an (F,G)-metric and connection are deduced. A condition for two of such structures to lead
+
+to dual connections is also derived. However, it was shown by Zhang (J. Zhang, Entropy 17,
+
+pp. 4485–4499, 2015) that this framework is equivalent to the (ρ,τ)-geometry introduced earlier by
+
+him. Although the present paper is slightly different in perspective, it should be read with this
+
+equivalence in mind.
+
+The next four papers deal with applications of information geometry in statistics. The paper
+
+by Frank Critchley and Paul Marriott presents an important research program aimed at rendering
+
+some of the most useful results of information geometry more accessible to statisticians in
+
+practical applications. Indeed, the formalism of differential geometry and tensor algebra can
+
+appear daunting at first sight and may present an obstacle to adoption of many useful results
+
+by practitioners. The paper describes a computational framework that facilitates implementation
+
+of results from information geometry, based on an embedding of various important statistical
+
+models in a (sufficiently large) simplex. Challenges related to extension of the framework to the
+
+infinite–dimensional case are touched upon as well.
+
+In the paper by Paul Vos and Karim Anaya-Izquierdo, the goal is to identify one-dimensional
+
+exponential families enjoying a number of properties that are convenient for statistical modeling,
+
+i.e., parametrization by a measure of central tendency, unimodality, and monotone likelihood ratio.
+
+The basis for the framework is the multinomial distribution, modeled geometrically by the simplex.
+
+The selection of exponential families with desirable properties is then based on a partitioning of the
+
+natural parameter space of the family of multinomial distributions by means of convex cones.
+
+Guido Montúfar and co-workers consider various possibilities to define natural Riemannian
+
+metrics on polytopes of stochastic matrices, which describe the conditional probability distribution
+
+of two categorical random variables. Inspired by the classical result regarding the uniqueness of the
+
+Fisher metric by requiring invariance under Markov morphisms, they define metrics derived from
+
+a natural class of stochastic maps between such polytopes, or, alternatively, through embeddings in
+
+various possible model spaces. They provide recommendations as to which metric to use, depending
+
+on the application.
+
+André Klein, in his article, provides a survey of several matrix algebraic properties of the Fisher
+
+information matrix corresponding to weakly stationary time series. The link with various structured
+
+matrices arising from a number of time series models is demonstrated. A statistical distance measure
+
+is built using the Fisher information matrix in the context of classical and quantum information.
+
+Finally, conditions are obtained for the Fisher information of a stationary process to obey certain
+
+forms of the Stein equation.
+
+We continue with three papers concerning applications of information geometry in Bayesian
+
+inference and simulation. Keisuke Yano and Fumiyasu Komaki, in their paper, construct constant-risk
+
+Bayesian predictive densities using the Kullback-Leibler loss function when the distributions of the
+
+data and the target variable to be predicted are different but have a common unknown parameter.
+
+Specifically, the issue of prior selection is investigated, and several applications are given.
+
+Samuel Livingstone and Mark Girolami provide an introduction to recent enhancements of
+
+Markov chain Monte Carlo simulation techniques inspired by information geometry. They apply
+
+this to the Metropolis–Hastings algorithm driven by Langevin diffusion, gradually transforming the
+
+ingredients to the setting of a Riemannian manifold equipped with a metric similar to the Fisher
+
+information metric. Pointers to various applications are given. The paper is written in a way that also
+
+makes it accessible to practitioners with little background in stochastic processes and geometry.
+
+The paper by Hui Zhao and Paul Marriott concerns Bayesian inference making use of variational
+
+methods for approximating the posterior distribution. In the context of inference for time series
+
+models that switch between different regimes, variational Bayes is shown to be a computationally
+
+attractive alternative to Markov chain Monte Carlo simulations. The geometry related to the
+
+projection of the posterior onto a computationally tractable family of distributions is elucidated by
+
+means of a simple example. This is followed by an application wherein it is shown that variational
+
+Bayes is successful in estimating the regime-switching model, including the number of regimes.
+
+Applications of information geometry in machine learning are represented by the following
+
+three papers. The article by Frank Nielsen and colleagues considers k-means histogram clustering,
+
+with applications to, e.g., information retrieval. Based on the α-divergences as similarity measures,
+
+clustering is performed using either the sided (asymmetric) or symmetrized divergence, or by means
+
+of the interesting notion of a mixed divergence. An important computational advantage is that the
+
+centroids based on the sided and mixed divergences have a closed-form expression. Next, the scheme
+
+is extended to algorithms with optimized initialization of cluster centroids, as well as soft clustering.
+
+Salem Said and co-workers present a class of distributions on the manifold of the univariate
+
+normal model equipped with the Fisher information metric. Expressed in terms of the Fisher-Rao
+
+distance, the distributions are used as priors for modeling the classes in Bayesian classification
+
+problems of normal distributions. Characteristics of this “Gaussian” distribution on the manifold are
+
+discussed, as well as estimation of its parameters and the posterior using the Laplace approximation.
+
+In an application to classification of image textures, the improved performance of these priors over
+
+conjugate priors is demonstrated.
+
+Luigi Malagò and Giovanni Pistone address optimization on manifolds of exponential
+
+distributions on a discrete state space using Newton’s method, which is based on second-order
+
+calculus. In particular, the goal is to find maxima of the expectation of a function with respect to the
+
+distribution (stochastic relaxation). Details of the computation are provided, including calculation
+
+of the Riemannian Hessian. A nonparametric formalism is used, with a view to extension to the
+
+infinite–dimensional case.
+
+The next three papers are related to the role of information geometry in complex systems
+
+research. Domenico Felice and colleagues consider the time-averaged volume explored by geodesics
+
+on a statistical manifold as an indicator of complexity of the entropic dynamics of a system.
+
+The parameters of the model play the role of macrovariables conveying information on the
+
+system’s microstate. Examples are given for the case of univariate, bivariate, and trivariate normal
+
+distributions, providing interesting results depending on correlations between the microvariables.
+
+Alexandre Levada investigates the role of entropy and Fisher information in pairwise isotropic
+
+Gaussian Markov random fields, acting as models for complex systems. Expressions for these
+
+quantities are derived, and the evolution of the Fisher information and entropy is studied as
+
+a function of the inverse temperature of the system. An interesting interpretation is given of
+
+asymmetries between these curves in terms of the arrow of time.
+
+Masatoshi Funabashi presents a framework for measuring statistical dependence between
+
+subsystems of a stochastic model, based on the model’s graph representation.
+A description in
+
+terms of the mixed coordinates of the system is used to quantify the complexity loss incurred by
+
+cutting an edge of the graph. In addition, a complexity measure is defined as a geometric mean of
+
+Kullback–Leibler divergences between decompositions of the system in terms of subsystems with
+
+fewer statistical dependencies. This quantifies the degree to which the system can be decomposed.
+
+The following paper concerns an application to physics, specifically quantum mechanics. Roger
+
+Balian gives an overview of a geometrical framework for measuring information loss in quantum
+
+systems resulting from the mixing of states. A Riemannian metric is defined, based on the von
+
+Neumann entropy, generating a mapping between states and observables. The metric is compared
+
+to other quantum metrics, as well as the Fisher–Rao metric, and various geometrical properties are
+
+derived. Applications are given to quantum information, as well as equilibrium and non-equilibrium
+
+quantum statistical mechanics.
+
+The final paper in the Special Issue is situated at the interface between physics and neuroscience.
+
+Xiaozhao Zhao and colleagues consider the principle of extreme physical information based on the
+
+Fisher information, which has been used before in an attempt to establish an information-theoretical
+
+basis for physical laws. They extend the idea to cognitive systems and aim at narrowing the gap
+
+between the information bound and the data information for such complex systems, by transforming
+
+the model to a simpler one. This is done by means of a dimensionality reduction technique, also based
+
+on the Fisher information. The approach is applied to derive the model for single-layer Boltzmann
+
+machines and interpret their learning algorithms.
+
+We are convinced that the varied collection of papers in this Special Issue will be useful for
+
+scientists who are new to the field, while providing an excellent reference for the more seasoned
+
+researcher. Furthermore, it is worth mentioning that the second Entropy Special Issue in this series,
+
+“Information Geometry II”, will also be published as a book, and that a third Special Issue is being
+
+prepared. We hope that the reader will enjoy browsing and reading through this collection of papers
+
+as much as we enjoyed guest editing this Special Issue “Information Geometry I”.
+
+Finally, I would like to thank the Editor-in-Chief of Entropy, Prof. Dr. Kevin H. Knuth, for
+
+suggesting the opportunity to guest-edit a Special Issue on information geometry.
+Furthermore,
+
+I wish to thank the editorial staff at MDPI for their great help with contacting authors, organizing
+
+paper reviews, and editing the original Special Issue in Entropy, as well as the reprinted version in the
+
+present book.
+
+Geert Verdoolaege
+
+Special Issue Editor
+
+
+
+
+
+entropy
+
+Article
+Information Geometry of Positive Measures and
+Positive-Definite Matrices: Decomposable Dually
+Flat Structure
+
+Shun-ichi Amari
+
+RIKEN Brain Science Institute, Hirosawa 2-1, Wako-shi, Saitama 351-0198, Japan;
+E-Mail: amari@brain.riken.jp; Tel.: +81-48-467-9669; Fax: +81-48-467-9687
+
+Received: 14 February 2014; in revised form: 9 April 2014 / Accepted: 10 April 2014 /
+Published: 14 April 2014
+
+Abstract: Information geometry studies the dually flat structure of a manifold, highlighted by
+the generalized Pythagorean theorem. The present paper studies a class of Bregman divergences
+called the (ρ, τ)-divergence. A (ρ, τ)-divergence generates a dually flat structure in the manifold of
+positive measures, as well as in the manifold of positive-definite matrices. The class is composed of
+decomposable divergences, which are written as a sum of componentwise divergences. Conversely,
+a decomposable dually flat divergence is shown to be a (ρ, τ)-divergence. A (ρ, τ)-divergence is
+determined from two monotone scalar functions, ρ and τ. The class includes the KL-divergence, α-,
+β- and (α, β)-divergences as special cases. The transformation between an affine parameter and its
+dual is easily calculated in the case of a decomposable divergence. Therefore, such a divergence
+is useful for obtaining the center for a cluster of points, which will be applied to classification and
+information retrieval in vision. For the manifold of positive-definite matrices, in addition to the dually
+flatness and decomposability, we require the invariance under linear transformations, in particular
+under orthogonal transformations. This opens a way to define a new class of divergences, called the
+(ρ, τ)-structure in the manifold of positive-definite matrices.
+
+Keywords: information geometry; dually flat structure; decomposable divergence; (ρ, τ)-structure
+
+1. Introduction
+
+Information geometry, which originated from the invariant structure of a manifold of probability
+distributions, consists of a Riemannian metric and dually coupled affine connections with respect to
+the metric [1]. A manifold having a dually flat structure is particularly interesting and important. In
+such a manifold, there are two dually coupled affine coordinate systems and a canonical divergence,
+which is a Bregman divergence. The highlight is given by the generalized Pythagorean theorem and
+projection theorem. Information geometry is useful not only for statistical inference, but also for
+machine learning, pattern recognition, optimization and even for neural networks. It is also related to
+the statistical physics of Tsallis q-entropy [2–4].
+The present paper studies a general and unique class of decomposable divergence functions
+in $\mathbb{R}^n_+$, the manifold of n-dimensional positive measures. These are the (ρ, τ)-divergences, introduced
+by Zhang [5,6], from the point of view of “representation duality”. They are Bregman divergences
+generating a dually flat structure. The class includes the well-known Kullback-Leibler divergence,
+α-divergence, β-divergence and (α, β)-divergence [1,7–9] as special cases. The merit of a decomposable
+Bregman divergence is that the θ-η Legendre transformation is computationally tractable, where θ
+and η are two affine coordinate systems coupled by the Legendre transformation. When one uses
+a dually flat divergence to define the center of a cluster of elements, the center is easily given by
+the arithmetic mean of the dual coordinates of the elements [10,11]. However, we need to calculate
+its primal coordinates. This is the θ-η transformation. Hence, our new type of divergences has an
+advantage of calculating θ-coordinates for clustering and related pattern matching problems. The most
+general class of dually flat divergences, not necessarily decomposable, is further given in $\mathbb{R}^n_+$. They are
+the (ρ, τ) divergence.
+
+Entropy 2014, 16, 2131–2145; doi:10.3390/e16042131
+www.mdpi.com/journal/entropy
+
+Positive-definite (PD) matrices appear in many engineering problems, such as convex
+programming, diffusion tensor analysis and multivariate statistical analysis [12–20]. The manifold,
+PDn, of n × n PD matrices forms a cone, and its geometry is by itself an important subject of research.
+If we consider the submanifold consisting of only diagonal matrices, it is equivalent to the manifold
+of positive measures. Hence, PD matrices can be regarded as a generalization of positive measures.
+There are many studies on geometry and divergences of the manifold of positive-definite matrices. We
+introduce a general class of dually flat divergences, the (ρ, τ)-divergence. We analyze the cases when
+a (ρ, τ)-divergence is invariant under the general linear transformations, Gl(n), and invariant under
+the orthogonal transformations, O(n). They not only include many well-known divergences of PD
+matrices, but also give new important divergences.
+The present paper is organized as follows. Section 2 is preliminary, giving a short introduction
+to a dually flat manifold and the Bregman divergence. It also defines the cluster center due to a
+divergence. Section 3 defines the (ρ, τ)-structure in $\mathbb{R}^n_+$. It gives dually flat decomposable affine
+coordinates and a related canonical divergence (Bregman divergence). Section 4 is devoted to the
+(ρ, τ)-structure of the manifold, PDn, of PD matrices. We first study the class of divergences that are
+invariant under O(n). We further study a decomposable divergence that is invariant under Gl(n).
+It coincides with the invariant divergence derived from zero-mean Gaussian distributions with PD
+covariance matrices. They include not only various known divergences, but also remarkable new ones.
+Section 5 discusses a general class of non-decomposable flat divergences and miscellaneous topics.
+Section 6 presents the conclusions.
+
+2. Preliminaries to Information Geometry of Divergence
+
+2.1. Dually Flat Manifold
+
+A manifold is said to have the dually flat Riemannian structure, when it has two affine coordinate
+systems $\theta = (\theta^1, \dots, \theta^n)$ and $\eta = (\eta_1, \dots, \eta_n)$ (with respect to two flat affine connections) together
+with two convex functions, ψ(θ) and ϕ(η), such that the two coordinates are connected by the Legendre
+transformations:
+\[
+\eta = \nabla\psi(\theta), \qquad \theta = \nabla\phi(\eta), \tag{1}
+\]
+
+where ∇ is the gradient operator. The Riemannian metric is given by:
+
+\[
+\bigl(g_{ij}(\theta)\bigr) = \nabla\nabla\psi(\theta), \qquad \bigl(g^{ij}(\eta)\bigr) = \nabla\nabla\phi(\eta) \tag{2}
+\]
+
+in the respective coordinate systems. A curve that is linear in the θ-coordinates is called a θ-geodesic,
+and a curve linear in the η-coordinates is called an η-geodesic.
+A dually flat manifold has a unique canonical divergence, which is the Bregman divergence
+defined by the convex functions,
+
+\[
+D[P : Q] = \psi(\theta_P) + \phi(\eta_Q) - \theta_P \cdot \eta_Q, \tag{3}
+\]
+
+where $\theta_P$ are the θ-coordinates of P, $\eta_Q$ are the η-coordinates of Q and $\theta_P \cdot \eta_Q = \sum_i \theta_P^i \, \eta_{Q,i}$, where $\theta_P^i$
+and $\eta_{Q,i}$ are components of $\theta_P$ and $\eta_Q$, respectively. The Pythagorean and projection theorems hold in
+a dually flat manifold:
+Pythagorean Theorem. Given three points, P, Q, R, when the η-geodesic connecting P and Q is
+orthogonal to the θ-geodesic connecting Q and R with respect to the Riemannian metric,
+
+\[
+D[P : Q] + D[Q : R] = D[P : R]. \tag{4}
+\]
+
+Projection Theorem. Given a smooth submanifold, S, let $P_S$ be the minimizer of divergence
+from P to S,
+\[
+P_S = \operatorname*{arg\,min}_{Q \in S} D[P : Q]. \tag{5}
+\]
+
+Then, $P_S$ is the η-geodesic projection of P to S, that is, the η-geodesic connecting P and $P_S$ is orthogonal
+to S.
+We have the dual of the above theorems where θ- and η-geodesics are exchanged and D[P : Q] is
+replaced by its dual D[Q : P].
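
The definitions in Equations (1) and (3) can be checked numerically on a small example. The sketch below is only an illustration: it assumes the convex function ψ(θ) = Σ exp(θ^i), whose Legendre dual is ϕ(η) = Σ (η_i log η_i − η_i). This particular choice, and all function names, are assumptions made for the example and are not taken from the paper.

```python
import math

def psi(theta):
    """Convex potential psi(theta) = sum_i exp(theta^i) (illustrative choice)."""
    return sum(math.exp(t) for t in theta)

def grad_psi(theta):
    """Legendre map eta = grad psi(theta), as in Equation (1)."""
    return [math.exp(t) for t in theta]

def phi(eta):
    """Legendre dual of psi: phi(eta) = sum_i (eta_i log eta_i - eta_i)."""
    return sum(e * math.log(e) - e for e in eta)

def grad_phi(eta):
    """Inverse Legendre map theta = grad phi(eta), as in Equation (1)."""
    return [math.log(e) for e in eta]

def bregman(theta_p, eta_q):
    """Canonical divergence D[P : Q] = psi(theta_P) + phi(eta_Q) - theta_P . eta_Q."""
    dot = sum(t * e for t, e in zip(theta_p, eta_q))
    return psi(theta_p) + phi(eta_q) - dot

# Two points given in theta-coordinates (arbitrary test values).
theta_p = [0.2, -1.0, 0.5]
theta_q = [0.0, 0.3, -0.7]
eta_p = grad_psi(theta_p)
eta_q = grad_psi(theta_q)

d_pq = bregman(theta_p, eta_q)   # divergence between distinct points
d_pp = bregman(theta_p, eta_p)   # divergence of a point from itself
roundtrip = grad_phi(eta_p)      # theta recovered from eta
```

For this choice of ψ, the divergence of a point from itself vanishes up to rounding, the divergence between distinct points is positive, and applying the two gradient maps in succession recovers the original θ-coordinates.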
+
+2.2. Decomposable Divergence
+
+A divergence, D[P : Q], is said to be decomposable, when it is written as a sum of
+component-wise divergences,
+
+\[
+D[P : Q] = \sum_{i=1}^{n} d\bigl(\theta_P^i, \theta_Q^i\bigr), \tag{6}
+\]
+
+where $\theta_P^i$ and $\theta_Q^i$ are the components of $\theta_P$ and $\theta_Q$ and $d(\theta_P^i, \theta_Q^i)$ is a scalar divergence function.
+An f-divergence:
+
+\[
+D_f[P : Q] = \sum_i p_i \, f\!\left(\frac{q_i}{p_i}\right) \tag{7}
+\]
+
+is a typical example of decomposable divergence in the manifold of probability distributions, where
+P = (p) and Q = (q) are two probability vectors with $\sum p_i = \sum q_i = 1$. A convex function, ψ(θ), is
+said to be decomposable, when it is written as:
+
+\[
+\psi(\theta) = \sum_{i=1}^{n} \tilde{\psi}\bigl(\theta^i\bigr) \tag{8}
+\]
+
+by using a scalar convex function, $\tilde{\psi}(\theta)$. The Bregman divergence derived from a decomposable convex
+function is decomposable.
+When ψ(θ) is a decomposable convex function, its Legendre dual is also decomposable. The
+Legendre transformation is given componentwise as:
+
+\[
+\eta_i = \tilde{\psi}'\bigl(\theta^i\bigr), \tag{9}
+\]
+
+where ′ denotes differentiation of a function, so that it is computationally tractable. Its inverse
+transformation is also componentwise,
+\[
+\theta^i = \tilde{\phi}'(\eta_i), \tag{10}
+\]
+
+where $\tilde{\phi}$ is the Legendre dual of $\tilde{\psi}$.
+
+2.3. Cluster Center
+
+Consider a cluster of points $P_1, \dots, P_m$ whose θ-coordinates are $\theta_1, \dots, \theta_m$ and whose η-coordinates
+are $\eta_1, \dots, \eta_m$. The center, R, of the cluster with respect to the divergence, D[P : Q], is defined by:
+
+\[
+R = \operatorname*{arg\,min}_{Q} \sum_{i=1}^{m} D[Q : P_i]. \tag{11}
+\]
+
+By differentiating $\sum_i D[Q : P_i]$ with respect to θ (the θ-coordinates of Q), we have:
+
+\[
+\nabla\psi(\theta_R) = \frac{1}{m} \sum_{i=1}^{m} \eta_i. \tag{12}
+\]
+
+Hence, the cluster-center theorem due to Banerjee et al. [10] follows; see also [11]:
+
+Cluster-Center Theorem
+The η-coordinates ηR of the cluster center are given by the arithmetic
+average of the η-coordinates of the points in the cluster:
+
+ηR = 1
+
+m
+
+m
+∑
+i=1
+ηi.
+(13)
+
+When we need the θ-coordinates of the cluster center, they are given by the θ-η transformation
+from $\eta_R$,
+
+\[ \theta_R = \nabla\phi\left(\eta_R\right). \tag{14} \]
+
+However, in many cases, the transformation is computationally heavy and intractable when the
+dimension of the manifold is large. The transformation is easy in the case of a decomposable divergence.
+This is the motivation for considering a general class of decomposable Bregman divergences.
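+
+The cluster-center theorem can be checked by hand in a simple decomposable case. The sketch below
+assumes the scalar convex function $\tilde{\psi}(\theta) = e^{\theta}$, so that $\eta = e^{\theta}$ and
+$\tilde{\phi}(\eta) = \eta\log\eta - \eta$. This choice and the symbols $q_j$ (the η-coordinates of Q) and $p_{i,j}$
+(the η-coordinates of $P_i$) are illustrative assumptions, not notation from the text.
+
+```latex
+\begin{align*}
+D[Q : P_i] &= \sum_{j=1}^{n} \left\{ q_j - p_{i,j} + p_{i,j}\log\frac{p_{i,j}}{q_j} \right\}, \\
+\frac{\partial}{\partial q_j} \sum_{i=1}^{m} D[Q : P_i]
+  &= m - \frac{1}{q_j}\sum_{i=1}^{m} p_{i,j} = 0
+  \quad\Longrightarrow\quad
+  q_j = \frac{1}{m}\sum_{i=1}^{m} p_{i,j}, \\
+\theta_R^{\,j} &= \log\left( \frac{1}{m}\sum_{i=1}^{m} p_{i,j} \right).
+\end{align*}
+```
+
+The minimizer is the arithmetic mean in the η-coordinates, as Equation (13) states, and the return to
+θ-coordinates is a componentwise logarithm, so Equation (14) is cheap to evaluate here.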
+
+3. (ρ, τ) Dually Flat Structure in $\mathbb{R}^n_+$
+
+3.1. (ρ, τ)-Coordinates of $\mathbb{R}^n_+$
+
+Let $\mathbb{R}^n_+$ be the manifold of positive measures over n elements $x_1, \cdots, x_n$. A measure (or a weight)
+of $x_i$ is given by:
+\[ \xi_i = m\left(x_i\right) > 0 \tag{15} \]
+
+and $\xi = (\xi_1, \cdots, \xi_n)$ is a distribution of measures. When $\sum \xi_i = 1$ is satisfied, it is a probability
+measure. We write:
+\[ \mathbb{R}^n_+ = \left\{ \xi \mid \xi_i > 0 \right\} \tag{16} \]
+
+and ξ forms a coordinate system of $\mathbb{R}^n_+$.
+Let ρ(ξ) and τ(ξ) be two monotonically increasing differentiable functions. We call:
+
+\[ \theta = \rho(\xi), \qquad \eta = \tau(\xi) \tag{17} \]
+
+the ρ- and τ-representations of positive measure ξ. This is a generalization of the ±α representations [1]
+and was introduced in [5] for a manifold of probability distributions. See also [6].
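+By using these functions, we construct new coordinate systems θ and η of $\mathbb{R}^n_+$. They are given, for
+$\theta = \left(\theta^i\right)$ and $\eta = \left(\eta_i\right)$, by componentwise relations,
+
+\[ \theta^i = \rho\left(\xi_i\right), \qquad \eta_i = \tau\left(\xi_i\right). \tag{18} \]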
+
+They are called the ρ- and τ-representations of ξ ∈ Rn+, respectively. We search for convex functions,
+ψρ,τ(θ) and ϕρ,τ(η), which are Legendre duals to each other, such that θ and η are two dually coupled
+affine coordinate systems.
+
+3.2. Convex Functions
+
+We introduce two scalar functions of θ and η by:
+
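+\[ \tilde{\psi}_{\rho,\tau}(\theta) = \int_0^{\rho^{-1}(\theta)} \tau(\xi)\rho'(\xi)\,d\xi, \tag{19} \]
+
+\[ \tilde{\phi}_{\rho,\tau}(\eta) = \int_0^{\tau^{-1}(\eta)} \rho(\xi)\tau'(\xi)\,d\xi. \tag{20} \]
+
+Then, the first and second derivatives of $\tilde{\psi}_{\rho,\tau}$ are:
+
+\[ \tilde{\psi}'_{\rho,\tau}(\theta) = \tau(\xi), \tag{21} \]
+
+\[ \tilde{\psi}''_{\rho,\tau}(\theta) = \frac{\tau'(\xi)}{\rho'(\xi)}. \tag{22} \]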
+
+Since ρ′(ξ) > 0, τ′(ξ) > 0, we see that ˜ψρ,τ(θ) is a convex function. So is ˜ϕρ,τ(η). Moreover, they are
+Legendre duals, because:
+
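+\[ \tilde{\psi}_{\rho,\tau}(\theta) + \tilde{\phi}_{\rho,\tau}(\eta) - \theta\eta
+   = \int_0^{\xi} \tau(\xi)\rho'(\xi)\,d\xi + \int_0^{\xi} \rho(\xi)\tau'(\xi)\,d\xi - \rho(\xi)\tau(\xi) \tag{23} \]
+
+\[ = 0. \tag{24} \]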
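+
+The step from Equation (23) to Equation (24) is an application of the product rule. The sketch below
+writes it out with a dummy integration variable s (introduced here for clarity). It assumes the boundary
+term $\rho(0)\tau(0)$ vanishes, which the text does not state explicitly.
+
+```latex
+\begin{align*}
+\int_0^{\xi} \tau(s)\rho'(s)\,ds + \int_0^{\xi} \rho(s)\tau'(s)\,ds
+  &= \int_0^{\xi} \frac{d}{ds}\bigl[\rho(s)\tau(s)\bigr]\,ds \\
+  &= \rho(\xi)\tau(\xi) - \rho(0)\tau(0).
+\end{align*}
+```
+
+Under the assumption $\rho(0)\tau(0) = 0$, the right-hand side of Equation (23) is therefore zero. If the
+boundary term does not vanish, the two functions differ from a Legendre pair only by an additive constant.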
+
+We then define two decomposable convex functions of θ and η by:
+
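+\[ \psi_{\rho,\tau}(\theta) = \sum_i \tilde{\psi}_{\rho,\tau}\left(\theta^i\right), \tag{25} \]
+
+\[ \phi_{\rho,\tau}(\eta) = \sum_i \tilde{\phi}_{\rho,\tau}\left(\eta_i\right). \tag{26} \]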
+
+They are Legendre duals to each other.
+
+3.3. (ρ, τ)-Divergence
+
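+The (ρ, τ)-divergence between two points, $\xi, \xi' \in \mathbb{R}^n_+$, is defined by:
+
+\[ D_{\rho,\tau}\left[\xi : \xi'\right] = \psi_{\rho,\tau}(\theta) + \phi_{\rho,\tau}\left(\eta'\right) - \theta \cdot \eta' \tag{27} \]
+
+\[ = \sum_{i=1}^{n} \left\{ \int_0^{\xi_i} \tau(\xi)\rho'(\xi)\,d\xi + \int_0^{\xi'_i} \rho(\xi)\tau'(\xi)\,d\xi
+   - \rho\left(\xi_i\right)\tau\left(\xi'_i\right) \right\}, \tag{28} \]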
+
+where θ and η′ are ρ- and τ-representations of ξ and ξ′, respectively.
+The (ρ, τ)-divergence gives a dually flat structure having θ and η as affine and dual affine
+coordinate systems. This is originally due to Zhang [5] and a generalization of our previous results
+concerning the q and deformed exponential families [4]. The transformation between θ and η is simple
+in the (ρ, τ)-structure, because it can be done componentwise,
+
+\[ \theta^i = \rho\left\{\tau^{-1}\left(\eta_i\right)\right\}, \tag{29} \]
+
+\[ \eta_i = \tau\left\{\rho^{-1}\left(\theta^i\right)\right\}. \tag{30} \]
+
+The Riemannian metric is:
+
+\[ g_{ij}(\xi) = \frac{\tau'\left(\xi_i\right)}{\rho'\left(\xi_i\right)}\,\delta_{ij}, \tag{31} \]
+
+and hence Euclidean, because the Riemann-Christoffel curvature due to the Levi-Civita connection
+vanishes, too.
+The following theorem is new, characterizing the (ρ, τ)-divergence.
+
+Theorem 1. The (ρ, τ)-divergences form a unique class of divergences in Rn+ that are dually flat and
+decomposable.
+
+3.4. Biduality: α-(ρ, τ) Divergence
+
+We have dually flat connections, $\left(\nabla_{\rho,\tau}, \nabla^{*}_{\rho,\tau}\right)$, represented in terms of covariant derivatives,
+which are derived from $D_{\rho,\tau}$. This is called the representation duality by Zhang [5]. We further have
+the α-(ρ, τ) connections defined by:
+
+\[ \nabla^{(\alpha)}_{\rho,\tau} = \frac{1+\alpha}{2}\,\nabla_{\rho,\tau} + \frac{1-\alpha}{2}\,\nabla^{*}_{\rho,\tau}. \tag{32} \]
+
+The α-(−α) duality is called the reference duality [5]. Therefore, $\nabla^{(\alpha)}_{\rho,\tau}$ possesses the biduality, one
+concerning α and (−α), and the other with respect to ρ and τ.
+
+The Riemann-Christoffel curvature of $\nabla^{(\alpha)}_{\rho,\tau}$ is:
+
+\[ R^{(\alpha)}_{\rho,\tau} = \frac{1-\alpha^2}{4}\,R^{(0)}_{\rho,\tau} = 0 \tag{33} \]
+
+for any α. Hence, there exist a unique canonical divergence $D^{(\alpha)}_{\rho,\tau}$ and α-(ρ, τ) affine coordinate systems.
+It is an interesting future problem to obtain their explicit forms.
+
+3.5. Various Examples
+
+As a special case of the (ρ, τ)-divergence, we have the (α, β)-divergence obtained from the
+following power functions,
+
+\[ \rho(\xi) = \frac{1}{\alpha}\,\xi^{\alpha}, \qquad \tau(\xi) = \frac{1}{\beta}\,\xi^{\beta}. \tag{34} \]
+
+This was introduced by Cichocki, Cruces and Amari in [7,8].
+The affine and dual affine coordinates are:
+
+\[ \theta^i = \frac{1}{\alpha}\left(\xi_i\right)^{\alpha}, \qquad \eta_i = \frac{1}{\beta}\left(\xi_i\right)^{\beta} \tag{35} \]
+
+and the convex functions are:
+
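+\[ \psi(\theta) = c_{\alpha,\beta} \sum_i \left(\theta^i\right)^{\frac{\alpha+\beta}{\alpha}}, \qquad
+   \phi(\eta) = c_{\beta,\alpha} \sum_i \left(\eta_i\right)^{\frac{\alpha+\beta}{\beta}}, \tag{36} \]
+
+where:
+\[ c_{\alpha,\beta} = \frac{1}{\beta(\alpha+\beta)}\,\alpha^{\frac{\alpha+\beta}{\alpha}}. \tag{37} \]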
+
+The induced (α, β)-divergence has a simple form,
+
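+\[ D_{\alpha,\beta}\left[\xi : \xi'\right] = \frac{1}{\alpha\beta(\alpha+\beta)} \sum_i
+   \left\{ \alpha\,\xi_i^{\alpha+\beta} + \beta\left(\xi'_i\right)^{\alpha+\beta}
+   - (\alpha+\beta)\,\xi_i^{\alpha}\left(\xi'_i\right)^{\beta} \right\}, \tag{38} \]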
+
+for ξ, ξ′ ∈ Rn+. It is defined similarly in the manifold, Sn, of probability distributions, but it is not
+a Bregman divergence in Sn. This is because the total mass constraint ∑ ξi = 1 is not linear in θ- or
+η-coordinates in general.
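+
+As a check on Equation (38), the three terms of the (ρ, τ)-divergence in Equation (28) can be evaluated
+directly for the power functions of Equation (34). The sketch below does this for a single component,
+assuming α, β > 0 so that the integrals converge at the lower limit. That range restriction and the dummy
+variable s are assumptions added here.
+
+```latex
+\begin{align*}
+\int_0^{\xi} \tau(s)\rho'(s)\,ds &= \int_0^{\xi} \frac{s^{\beta}}{\beta}\,s^{\alpha-1}\,ds
+   = \frac{\xi^{\alpha+\beta}}{\beta(\alpha+\beta)}, \\
+\int_0^{\xi'} \rho(s)\tau'(s)\,ds &= \int_0^{\xi'} \frac{s^{\alpha}}{\alpha}\,s^{\beta-1}\,ds
+   = \frac{\left(\xi'\right)^{\alpha+\beta}}{\alpha(\alpha+\beta)}, \\
+\rho(\xi)\,\tau\left(\xi'\right) &= \frac{\xi^{\alpha}\left(\xi'\right)^{\beta}}{\alpha\beta}, \\
+\text{sum of the first two minus the third} &=
+   \frac{\alpha\,\xi^{\alpha+\beta} + \beta\left(\xi'\right)^{\alpha+\beta}
+   - (\alpha+\beta)\,\xi^{\alpha}\left(\xi'\right)^{\beta}}{\alpha\beta(\alpha+\beta)} .
+\end{align*}
+```
+
+Summing over the components reproduces Equation (38), and substituting $\xi = (\alpha\theta)^{1/\alpha}$ into the first
+integral gives the constant $c_{\alpha,\beta}$ of Equation (37).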
+The α-divergence is a special case of the (α, β)-divergence, so that it is a (ρ, τ)-divergence. By
+putting:
+
+\[ \rho(\xi) = \frac{2}{1-\alpha}\,\xi^{\frac{1-\alpha}{2}}, \qquad
+   \tau(\xi) = \frac{2}{1+\alpha}\,\xi^{\frac{1+\alpha}{2}}, \tag{39} \]
+
+we have:
+
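+\[ D_{\alpha}\left[\xi : \xi'\right] = \frac{4}{1-\alpha^2} \sum_i
+   \left\{ \frac{1-\alpha}{2}\,\xi_i + \frac{1+\alpha}{2}\,\xi'_i
+   - \xi_i^{\frac{1-\alpha}{2}} \left(\xi'_i\right)^{\frac{1+\alpha}{2}} \right\}. \tag{40} \]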
+
+The β-divergence [19] is obtained from:
+
+\[ \rho(\xi) = \xi, \qquad \tau(\xi) = \frac{1}{\beta}\,\xi^{1+\beta}. \tag{41} \]
+
+It is written as:
+
+\[ D_{\beta}\left[\xi : \xi'\right] = \frac{1}{\beta(\beta+1)} \sum_i
+   \left\{ \xi_i^{\beta+1} + (\beta+1)\,\xi'_i - \left(\xi'_i\right)^{\beta+1}
+   - (\beta+1)\,\xi_i \left(\xi'_i\right)^{\beta} \right\}. \tag{42} \]
+
+The β-divergence is special in the sense that it gives a dually flat structure, even in Sn. This is because
+ρ(ξ) is linear in ξ.
+
+The classes of α-divergences and β-divergences intersect at the KL-divergence, and their duals are
+different in general. They are the only intersecting points of the two classes.
+When ρ(ξ) = ξ and τ(ξ) = U′(ξ) where U is a convex function, (ρ, τ)-divergence is Eguchi’s
+U-divergence [21].
+Zhang already introduced the (α, β)-divergence in [5], which is not a (ρ, τ)-divergence in Rn+ and
+is different from ours. We regret the confusion caused by our naming of the (α, β)-divergence.
+
+4. Invariant, Flat Decomposable Divergences in the Manifold of Positive-Definite Matrices
+
+4.1. Invariant and Decomposable Convex Function
+
+Let P be a positive-definite matrix and ψ(P) be a convex function. Then, a Bregman divergence is
+defined between two positive definite matrices, P and Q, by:
+
+\[ D[P : Q] = \psi(P) - \psi(Q) - \nabla\psi(Q) \cdot (P - Q), \tag{43} \]
+
+where ∇ is the gradient operator with respect to matrix $P = \left(P_{ij}\right)$, so that ∇ψ(P) is a matrix and the
+inner product of two matrices is defined by:
+
+\[ \nabla\psi(Q) \cdot P = \operatorname{tr}\left\{\nabla\psi(Q)\,P\right\}, \tag{44} \]
+
+where tr is the trace of a matrix.
+It induces a dually flat structure to the manifold of positive-definite matrices, where the affine
+coordinate system (θ-coordinates) is Θ = P and the dual affine coordinate system (η-coordinates) is:
+
+\[ H = \nabla\psi(P). \tag{45} \]
+
+A convex function, ψ(P), is said to be invariant under the orthogonal group O(n), when:
+
+\[ \psi(P) = \psi\left(O^{T} P O\right) \tag{46} \]
+
+holds for any orthogonal transformation O, where $O^{T}$ is the transpose of O. An invariant function is
+written as a symmetric function of n eigenvalues $\lambda_1, \cdots, \lambda_n$ of P. See Dhillon and Tropp [12]. When
+an invariant convex function of P is written, by using a convex function, f, of one variable, in the
+additive form:
+\[ \psi(P) = \sum_i f\left(\lambda_i\right), \tag{47} \]
+
+it is said to be decomposable. We have:
+
+\[ \psi(P) = \operatorname{tr} f(P). \tag{48} \]
+
+4.2. Invariant, Flat and Decomposable Divergence
+
+A divergence D[P : Q] is said to be invariant under O(n), when it satisfies:
+
+\[ D[P : Q] = D\left[O^{T} P O : O^{T} Q O\right]. \tag{49} \]
+
+When it is derived from a decomposable convex function, ψ(P), it is invariant, flat and decomposable.
+We give well-known examples of decomposable convex functions and the divergences derived
+from them:
+
+(1) For $f(\lambda) = (1/2)\lambda^2$, we have:
+
+\[ \psi(P) = \frac{1}{2} \sum_i \lambda_i^2, \tag{50} \]
+
+\[ D[P : Q] = \frac{1}{2}\,\|P - Q\|^2, \tag{51} \]
+
+where $\|P\|^2$ is the squared Frobenius norm:
+\[ \|P\|^2 = \sum_{i,j} P_{ij}^2. \tag{52} \]
+
+(2) For $f(\lambda) = -\log\lambda$,
+
+\[ \psi(P) = -\log\left(\det|P|\right), \tag{53} \]
+
+\[ D[P : Q] = \operatorname{tr}\left(PQ^{-1}\right) - \log\left(\det\left|PQ^{-1}\right|\right) - n. \tag{54} \]
+
+The affine coordinate system is P, and the dual coordinate system is $P^{-1}$. The derived geometry is
+the same as that of multivariate Gaussian probability distributions with mean zero and covariance
+matrix P.
+(3) For $f(\lambda) = \lambda\log\lambda - \lambda$,
+
+\[ \psi(P) = \operatorname{tr}\left(P\log P - P\right), \tag{55} \]
+
+\[ D[P : Q] = \operatorname{tr}\left(P\log P - P\log Q - P + Q\right). \tag{56} \]
+
+This divergence is used in quantum information theory. The affine coordinate system is P, and
+the dual affine coordinate system is log P; ψ(P) is called the negative von Neumann entropy.
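+
+To illustrate how these examples follow from the definition, the sketch below derives the divergence of
+example (2). It assumes the Bregman divergence in its standard form,
+$D[P : Q] = \psi(P) - \psi(Q) - \nabla\psi(Q)\cdot(P - Q)$, together with the standard matrix-calculus identity
+$\nabla \log\det Q = Q^{-1}$ for symmetric positive-definite Q. Both are stated here as assumptions.
+
+```latex
+\begin{align*}
+\psi(P) &= -\log\det P, \qquad \nabla\psi(Q) = -Q^{-1}, \\
+D[P : Q] &= -\log\det P + \log\det Q + \operatorname{tr}\left\{ Q^{-1}(P - Q) \right\} \\
+         &= \operatorname{tr}\left(PQ^{-1}\right) - \log\det\left(PQ^{-1}\right) - n ,
+\end{align*}
+```
+
+where the last line uses $\operatorname{tr}(Q^{-1}P) = \operatorname{tr}(PQ^{-1})$, $\operatorname{tr}(Q^{-1}Q) = n$ and
+$\det P / \det Q = \det(PQ^{-1})$. Examples (1) and (3) follow in the same way from
+$\nabla\psi(Q) = Q$ and $\nabla\psi(Q) = \log Q$, respectively.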
+
+4.3. (ρ, τ)-Structure in Positive Definite Matrices
+
+We extend the (ρ, τ)-structure in the previous section to the matrix case and introduce a general
+dually flat invariant decomposable divergence in the manifold of positive-definite matrices. Let:
+
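+\[ \Theta = \rho(P), \qquad H = \tau(P) \tag{57} \]
+
+be ρ- and τ-representations of matrices. We use two functions, $\tilde{\psi}_{\rho,\tau}(\theta)$ and $\tilde{\phi}_{\rho,\tau}(\eta)$, defined
+in Equations (19) and (20), for defining a pair of dually coupled invariant and decomposable
+convex functions,
+
+\[ \psi(\Theta) = \operatorname{tr}\,\tilde{\psi}_{\rho,\tau}\{\Theta\}, \tag{58} \]
+
+\[ \phi(H) = \operatorname{tr}\,\tilde{\phi}_{\rho,\tau}\{H\}. \tag{59} \]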
+
+They are not convex with respect to P, but are convex with respect to Θ and H, respectively. The
+derived Bregman divergence is:
+
+D[P : Q] = ψ {Θ(P)} + ϕ {H(Q)} − Θ(P) · H(Q).
+(60)
+
+Theorem 2. The (ρ, τ)-divergences form a unique class of invariant, decomposable and dually flat
+divergences in the manifold of positive matrices.
+
+The Euclidean, Gaussian and von Neumann divergences given in Equations (51), (54) and (56) are
+special examples of (ρ, τ)-divergences. They are given, respectively, by:
+
+\[ \text{(1)} \quad \rho(\xi) = \tau(\xi) = \xi, \tag{61} \]
+
+\[ \text{(2)} \quad \rho(\xi) = \xi, \qquad \tau(\xi) = -\frac{1}{\xi}, \tag{62} \]
+
+\[ \text{(3)} \quad \rho(\xi) = \xi, \qquad \tau(\xi) = \log\xi. \tag{63} \]
+
+When ρ and τ are power functions, we have the (α, β)-structure in the manifold of positive-definite
+matrices.
+
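+(4) (α, β)-divergence.
+
+By using the (α, β) power functions given by Equation (34), we have:
+
+\[ \psi(\Theta) = \frac{\alpha}{\alpha+\beta}\operatorname{tr}\Theta^{\frac{\alpha+\beta}{\alpha}}
+   = \frac{\alpha}{\alpha+\beta}\operatorname{tr}P^{\alpha+\beta}, \tag{64} \]
+
+\[ \phi(H) = \frac{\beta}{\alpha+\beta}\operatorname{tr}H^{\frac{\alpha+\beta}{\beta}}
+   = \frac{\beta}{\alpha+\beta}\operatorname{tr}P^{\alpha+\beta} \tag{65} \]
+
+so that the (α, β)-divergence of matrices is:
+
+\[ D[P : Q] = \operatorname{tr}\left\{ \frac{\alpha}{\alpha+\beta}P^{\alpha+\beta}
+   + \frac{\beta}{\alpha+\beta}Q^{\alpha+\beta} - P^{\alpha}Q^{\beta} \right\}. \tag{66} \]
+
+This is a Bregman divergence, where the affine coordinate system is $\Theta = P^{\alpha}$ and its dual is
+$H = P^{\beta}$.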
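+(5) The α-divergence is derived as:
+
+\[ \Theta(P) = \frac{2}{1-\alpha}\,P^{\frac{1-\alpha}{2}}, \tag{67} \]
+
+\[ \psi(\Theta) = \frac{2}{1+\alpha}\operatorname{tr}P, \tag{68} \]
+
+\[ D_{\alpha}[P : Q] = \frac{4}{1-\alpha^2}\operatorname{tr}\left\{ -P^{\frac{1-\alpha}{2}}Q^{\frac{1+\alpha}{2}}
+   + \frac{1-\alpha}{2}P + \frac{1+\alpha}{2}Q \right\}. \tag{69} \]
+
+The affine coordinate system is $\frac{2}{1-\alpha}P^{\frac{1-\alpha}{2}}$, and its dual is $\frac{2}{1+\alpha}P^{\frac{1+\alpha}{2}}$.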
+(6) The β-divergence is derived from Equation (41) as:
+
+\[ D_{\beta}[P : Q] = \frac{1}{\beta(\beta+1)}\operatorname{tr}\left\{ P^{\beta+1} + (\beta+1)Q - Q^{\beta+1}
+   - (\beta+1)PQ^{\beta} \right\}. \tag{70} \]
+
+4.4. Invariance Under Gl(n)
+
+We extend the concept of invariance under the orthogonal group to that under the general linear
+group, Gl(n), that is, the set of invertible matrices, L, det |L| ≠ 0. This is a stronger condition. A
+divergence is said to be invariant under Gl(n), when:
+
+\[ D[P : Q] = D\left[L^{T} P L : L^{T} Q L\right] \tag{71} \]
+
+holds for any L ∈ Gl(n).
+We identify matrix P with the zero-mean Gaussian distribution:
+
+\[ p(x, P) = \exp\left\{ -\frac{1}{2}x^{T}P^{-1}x - \frac{1}{2}\log\det|P| - c \right\}, \tag{72} \]
+
+where c is a constant. We know that an invariant divergence belongs to the class of f-divergences in
+the case of a manifold of probability distributions, where the invariance means the geometry does
+not change under a one-to-one mapping of x to y. Moreover, the only invariant flat divergence is the
+KL-divergence [22]. These facts suggest the following conjecture.
+
+Proposition. The invariant, flat and decomposable divergence under Gl(n) is the KL-divergence
+given by:
+
+\[ D_{KL}[P : Q] = \operatorname{tr}\left(PQ^{-1}\right) - \log\left(\det\left|PQ^{-1}\right|\right) - n. \tag{73} \]
+
+5. Non-Decomposable Divergence
+
+We have focused on flat and decomposable divergences. There are many interesting
+non-decomposable divergences. We first discuss a general class of flat divergences in Rn+ and then
+touch upon interesting flat and non-flat divergences in the manifold of positive-definite matrices.
+
+5.1. General Class of Flat Divergences in $\mathbb{R}^n_+$
+
+We can describe a general class of flat divergences in Rn+, which are not necessarily decomposable.
+This is introduced in [23], which studies the conformal structure of general total Bregman divergences
+([11,13]). When Rn+ is endowed with a dually flat structure, it has a θ-coordinate system given by:
+
+\[ \theta = \rho(\xi) \tag{74} \]
+
+which is not necessarily a componentwise function. Any pair of invertible θ = ρ(ξ) and convex
+function ψ(θ) defines a dually flat structure and, hence, a Bregman divergence in Rn+.
+The dual coordinates η = τ(ξ) are given by:
+
+\[ \eta = \nabla\psi(\theta) \tag{75} \]
+
+so that we have:
+\[ \eta = \tau(\xi) = \nabla\psi\left\{\rho(\xi)\right\}. \tag{76} \]
+
+This implies that a pair (ρ, τ) of coordinate systems can define dually coupled affine coordinates and,
+hence, a dually flat structure, when and only when $\eta = \tau\left\{\rho^{-1}(\theta)\right\}$ is a gradient of a convex function.
+This is different from the case of decomposable divergence, where any monotone pair of ρ(ξ) and
+τ(ξ) gives a dually flat structure.
+
+5.2. Non-Decomposable Flat Divergence in PDn
+
+Ohara and Eguchi [15,16] introduced the following function:
+
+\[ \psi_V(P) = V\left(\det|P|\right), \tag{77} \]
+
+where V(ξ) is a monotonically decreasing scalar function. $\psi_V$ is convex when and only when:
+
+\[ 1 + \frac{V''(\xi)\,\xi^2}{V'(\xi)} < \frac{1}{n}. \tag{78} \]
+
+In such a case, we can introduce dually flat structure to PDn, where P is an affine coordinate system
+with convex $\psi_V(P)$, and the dual affine coordinate system is:
+
+\[ H = V'\left(\det|P|\right)P^{-1}. \tag{79} \]
+
+The derived divergence is:
+
+\[ D_V[P : Q] = V\left(\det|P|\right) - V\left(\det|Q|\right) \tag{80} \]
+
+\[ \qquad\qquad + V'\left(\det|Q|\right)\operatorname{tr}\left\{ Q^{-1}(Q - P) \right\}. \tag{81} \]
+
+When V(ξ) = − log ξ, it reduces to the case of Equation (54), which is invariant under Gl(n) and
+decomposable. However, the divergence DV[P : Q] is not decomposable. It is invariant under O(n)
+and more strongly so under SGl(n) ⊂ Gl(n), defined by det |L| = ±1.
+
+5.3. Flat Structure Derived from q-Escort Distribution
+
+A dually flat structure is introduced in the manifold of probability distributions [4] as:
+
+\[ \tilde{D}_{\alpha}[p : q] = \frac{1}{1-q}\,\frac{1}{H_q(p)} \left\{ 1 - \sum_i p_i^{1-q}\,q_i^{q} \right\}, \tag{82} \]
+
+where:
+
+\[ H_q(p) = \sum_i p_i^{q}, \tag{83} \]
+
+\[ q = \frac{1+\alpha}{2}. \tag{84} \]
+
+The dual affine coordinates are the q-escort distribution [4]:
+
+\[ \eta_i = \frac{1}{H_q(p)}\,p_i^{q}. \tag{85} \]
+
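+The divergence, $\tilde{D}_q$, is flat, but not decomposable.
+We can generalize it to the case of PDn,
+
+\[ \tilde{D}_q[P : Q] = \frac{1}{1-q}\,\frac{1}{\operatorname{tr}P^{q}} \left\{ (1-q)\operatorname{tr}(P) + q\operatorname{tr}(Q)
+   - \operatorname{tr}\left(P^{1-q}Q^{q}\right) \right\}. \tag{86} \]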
+
+This is flat, but not decomposable.
+
+5.4. γ-Divergence in PDn
+
+The γ-divergence is introduced by Fujisawa and Eguchi [24]. It gives a super-robust estimator. It
+is interesting to generalize it to PDn,
+
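+\[ D_{\gamma}[P : Q] = \frac{1}{\gamma(\gamma-1)} \left\{ \log\operatorname{tr}P^{\gamma} - (\gamma-1)\log\operatorname{tr}Q^{\gamma-1}
+   - \gamma\log\operatorname{tr}PQ^{\gamma-1} \right\}. \tag{87} \]
+
+This is neither flat nor decomposable. This is a projective divergence in the sense that, for any c, c′ > 0,
+
+\[ D_{\gamma}\left[cP : c'Q\right] = D_{\gamma}[P : Q]. \tag{88} \]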
+
+Therefore, it can be defined in the submanifold of tr P = 1.
+
+6. Concluding Remarks
+
+We have shown that the (ρ, τ)-divergence introduced by Zhang [5] is a general dually flat
+decomposable structure of the manifold of positive measures. We then extended it to the manifold
+of positive-definite matrices, where the criterion of invariance under linear transformations (in
+particular, under orthogonal transformations) was added. The decomposability is useful from the
+computational point of view, because the θ-η transformation is tractable. This is the motivation for
+studying decomposable flat divergences.
+When we treat the manifold of probability distributions, it is a submanifold of the manifold of
+positive measures, where the total sum of measures is restricted to one. This is a nonlinear constraint
+in the θ or η coordinates, so that the manifold is not flat, but curved in general. Hence, our arguments
+hold in this case only when at least one of the ρ and τ functions is linear. The U-divergence [21] and
+β-divergence [19] are such cases. However, for clustering, we can take the average of the η-coordinates
+of member probability distributions in the larger manifold of positive measures and then project it
+to the manifold of probability distributions. This is called the exterior average, and the projection is
+simply a normalization of the result. Therefore, the (ρ, τ)-structure is useful in the case of probability
+distributions. The same situation holds in the case of positive-definite matrices.
+Quantum information theory deals with positive-definite Hermitian matrices of trace one [25,26].
+We need to extend our discussions to the case of complex matrices. The trace one constraint is not
+linear with respect to θ- or η-coordinates, as is the same in the case of probability distributions. Many
+interesting divergence functions have been introduced in the manifold of positive-definite Hermitian
+matrices. It is an interesting future problem to apply our theory to quantum information theory.
+
+Conflicts of Interest: The author declares no conflicts of interest.
+
+References
+
+1. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society and Oxford University Press: Rhode Island, RI, USA, 2000.
+2. Tsallis, C. Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World; Springer: Berlin/Heidelberg, Germany, 2009.
+3. Naudts, J. Generalized Thermostatistics; Springer: Berlin/Heidelberg, Germany, 2011.
+4. Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually-flat and conformal geometries. Physica A 2012, 391, 4308–4319.
+5. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195.
+6. Zhang, J. Nonparametric information geometry: From divergence function to referential-representational biduality on statistical manifolds. Entropy 2013, 15, 5384–5418.
+7. Cichocki, A.; Amari, S. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568.
+8. Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170.
+9. Minami, M.; Eguchi, S. Robust blind source separation by beta-divergence. Neural Comput. 2002, 14, 1859–1886.
+10. Banerjee, A.; Merugu, S.; Dhillon, I.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
+11. Liu, M.; Vemuri, B.C.; Amari, S.; Nielsen, F. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Learn. 2012, 24, 3192–3212.
+12. Dhillon, I.S.; Tropp, J.A. Matrix nearness problems with Bregman divergences. SIAM J. Matrix Anal. Appl. 2007, 29, 1120–1146.
+13. Vemuri, B.C.; Liu, M.; Amari, S.; Nielsen, F. Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med. Imaging 2011, 30, 475–483.
+14. Ohara, A.; Suda, N.; Amari, S. Dualistic differential geometry of positive definite matrices and its applications to related problems. Linear Algebra Appl. 1996, 247, 31–53.
+15. Ohara, A.; Eguchi, S. Group invariance of information geometry on q-Gaussian distributions induced by beta-divergence. Entropy 2013, 15, 4732–4747.
+16. Ohara, A.; Eguchi, S. Geometry on positive definite matrices induced from V-potential functions. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 621–629.
+17. Chebbi, Z.; Moakher, M. Means of Hermitian positive-definite matrices based on the log-determinant alpha-divergence function. Linear Algebra Appl. 2012, 436, 1872–1889.
+18. Tsuda, K.; Ratsch, G.; Warmuth, M.K. Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res. 2005, 6, 995–1018.
+19. Nock, R.; Magdalou, B.; Briys, E.; Nielsen, F. Mining matrix data with Bregman matrix divergences for portfolio selection. In Matrix Information Geometry; Nielsen, F., Bhatia, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; Chapter 15, pp. 373–402.
+20. Nielsen, F., Bhatia, R., Eds. Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013.
+21. Eguchi, S. Information geometry and statistical pattern recognition. Sugaku Expo. 2006, 19, 197–216.
+22. Amari, S. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931.
+23. Nock, R.; Nielsen, F.; Amari, S. On conformal divergences and their population minimizers. IEEE Trans. Inf. Theory 2014, submitted for publication.
+24. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081.
+25. Petz, D. Monotone metrics on matrix spaces. Linear Algebra Appl. 1996, 244, 81–96.
+26. Hasegawa, H. α-divergence of the non-commutative information geometry. Rep. Math. Phys. 1993, 33, 87–93.
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+F-Geometry and Amari’s α−Geometry on a
+Statistical Manifold
+
+Harsha K. V. * and Subrahamanian Moosath K S *
+
+Indian Institute of Space Science and Technology, Department of Space, Government of India, Valiamala P.O,
+Thiruvananthapuram-695547, Kerala, India
+*
+E-Mails: harsha.11@iist.ac.in (K.V.H.); smoosath@iist.ac.in (K.S.S.M.); Tel.: +91-95-6736-0425 (K.V.H.);
++91-94-9574-3148 (K.S.S.M.).
+
+Received: 13 December 2013; in revised form: 21 April 2014 / Accepted: 25 April 2014 /
+Published: 6 May 2014
+
+Abstract: In this paper, we introduce a geometry called F-geometry on a statistical manifold S using
+an embedding F of S into the space RX of random variables. Amari’s α−geometry is a special
+case of F−geometry. Then using the embedding F and a positive smooth function G, we introduce
+(F, G)−metric and (F, G)−connections that enable one to consider weighted Fisher information
+metric and weighted connections. The necessary and sufficient condition for two (F, G)−connections
+to be dual with respect to the (F, G)−metric is obtained. Then we show that Amari’s 0−connection is
+the only self dual F−connection with respect to the Fisher information metric. Invariance properties
+of the geometric structures are discussed, which proved that Amari’s α−connections are the only
+F−connections that are invariant under smooth one-to-one transformations of the random variables.
+
+Keywords: embedding; Amari’s α−connections; F−metric; F−connections; (F, G)−metric;
+(F, G)−connections; invariance
+
+1. Introduction
+
+The geometric study of statistical estimation has opened up an interesting new area called
+information geometry. Information geometry has achieved remarkable progress through the works of
+Amari [1,2] and his colleagues [3,4]. In the last few years, many authors have contributed considerably
+to this area [5–9]. Information geometry has a wide variety of applications in other areas of engineering
+and science, such as neural networks, machine learning, biology, mathematical finance, control system
+theory, quantum systems, statistical mechanics, etc.
+A statistical manifold of probability distributions is equipped with a Riemannian metric and a pair
+of dual affine connections [2,4,9]. It was Rao [10] who introduced the idea of using Fisher information
+as a Riemannian metric in the manifold of probability distributions. Chentsov [11] introduced a family
+of affine connections on a statistical manifold defined on finite sets. Amari [2] introduced a family of
+affine connections called α−connections using a one parameter family of functions, the α−embeddings.
+These α−connections are equivalent to those defined by Chentsov. The Fisher information metric and
+these affine connections are characterized by invariance with respect to the sufficient statistic [4,12] and
+play a vital role in the theory of statistical estimation. Zhang [13] generalized Amari’s α−representation
+and using this general representation together with a convex function he defined a family of divergence
+functions from the point of view of representational and referential duality. The Riemannian metric
+and dual connections are defined using these divergence functions.
+In this paper, Amari’s idea of using α−embeddings to define geometric structures is extended to
+a general embedding. This paper is organized as follows. In Section 2, we define an affine connection
+called F−connection and a Riemannian metric called F−metric using a general embedding F of
+a statistical manifold S into the space of random variables. We show that F−metric is the Fisher
+
+Entropy 2014, 16, 2472–2487; doi:10.3390/e16052472
+www.mdpi.com/journal/entropy
+information metric and Amari’s α−geometry is a special case of F−geometry. Further, we introduce
+(F, G)−metric and (F, G)−connections using the embedding F and a positive smooth function G.
+In Section 3, a necessary and sufficient condition for two (F, G)−connections to be dual with
+respect to the (F, G)−metric is derived and we prove that Amari’s 0−connection is the only self
+dual F−connection with respect to the Fisher information metric. Then we prove that the set of all
+positive finite measures on X, for a finite X, has an F−affine manifold structure for any embedding F.
+In Section 4, invariance properties of the geometric structures are discussed. We prove that the
+Fisher information metric and Amari’s α−connections are invariant under both the transformation
+of the parameter and the transformation of the random variable. Further we show that Amari’s
+α−connections are the only F−connections that are invariant under both the transformation of the
+parameter and the transformation of the random variable.
+Let (X, B) be a measurable space, where X is a non-empty subset of R and B is the σ-field of
+subsets of X. Let RX be the space of all real valued measurable functions defined on (X, B). Consider
+an n−dimensional statistical manifold S = {p(x; ξ) / ξ = [ξ1, ..., ξn] ∈ E ⊆ Rn}, with coordinates
+ξ = [ξ1, ..., ξn], defined on X. S is a subset of P(X), the set of all probability measures on X given by
+
+\[ P(X) := \left\{ p : X \longrightarrow \mathbb{R} \;/\; p(x) > 0 \ (\forall\, x \in X);\ \int_X p(x)\,dx = 1 \right\}. \tag{1} \]
+
+The tangent space to S at a point pξ is given by
+
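+\[ T_{\xi}(S) = \left\{ \sum_{i=1}^{n} \alpha^i \partial_i \;/\; \alpha^i \in \mathbb{R} \right\}
+   \quad \text{where } \partial_i = \frac{\partial}{\partial\xi^i}. \tag{2} \]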
+
+Define ℓ(x; ξ) = log p(x; ξ) and consider the partial derivatives $\{\frac{\partial\ell}{\partial\xi^i} = \partial_i\ell \,;\ i = 1, \ldots, n\}$, which are
+called scores. For the statistical manifold S, the $\partial_i\ell$ are linearly independent functions in x for a fixed ξ.
+Let $T^1_{\xi}(S)$ be the n-dimensional vector space spanned by the n functions $\{\partial_i\ell \,;\ i = 1, \ldots, n\}$ in x. So
+
+\[ T^1_{\xi}(S) = \left\{ \sum_{i=1}^{n} A^i \partial_i\ell \;/\; A^i \in \mathbb{R} \right\}. \tag{3} \]
+
+Then there is a natural isomorphism between these two vector spaces $T_{\xi}(S)$ and $T^1_{\xi}(S)$ given by
+
+\[ \partial_i \in T_{\xi}(S) \longleftrightarrow \partial_i\ell(x;\xi) \in T^1_{\xi}(S). \tag{4} \]
+
+Obviously, a tangent vector $A = \sum_{i=1}^{n} A^i\partial_i \in T_{\xi}(S)$ corresponds to a random variable
+$A(x) = \sum_{i=1}^{n} A^i\partial_i\ell(x;\xi) \in T^1_{\xi}(S)$ having the same components $A^i$. Note that $T_{\xi}(S)$ is the differentiation
+operator representation of the tangent space, while $T^1_{\xi}(S)$ is the random variable representation of the
+same tangent space. The space $T^1_{\xi}(S)$ is called the 1-representation of the tangent space.
+Let A and B be two tangent vectors in $T_{\xi}(S)$ and A(x) and B(x) be the 1−representations of A and B
+respectively. We can define an inner product on each tangent space $T_{\xi}(S)$ by
+
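+\[ g_{\xi}(A, B) = \langle A, B \rangle_{\xi} = E_{\xi}[A(x)B(x)] = \int A(x)B(x)\,p(x;\xi)\,dx. \tag{5} \]
+
+In particular, the inner product of the basis vectors $\partial_i$ and $\partial_j$ is
+
+\[ g_{ij}(\xi) = \langle \partial_i, \partial_j \rangle_{\xi} = E_{\xi}[\partial_i\ell\,\partial_j\ell]
+   = \int \partial_i\ell(x;\xi)\,\partial_j\ell(x;\xi)\,p(x;\xi)\,dx. \tag{6} \]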
+
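+Note that $g = \langle\,,\rangle$ defines a Riemannian metric on S called the Fisher information metric.
+On the Riemannian manifold $(S, g = \langle\,,\rangle)$, define $n^3$ functions $\Gamma_{ijk}$ by
+
+\[ \Gamma_{ijk}(\xi) = E_{\xi}\left[ (\partial_i\partial_j\ell(x;\xi))\,(\partial_k\ell(x;\xi)) \right]. \tag{7} \]
+
+These functions $\Gamma_{ijk}$ uniquely determine an affine connection ∇ on S by
+
+\[ \Gamma_{ijk}(\xi) = \langle \nabla_{\partial_i}\partial_j, \partial_k \rangle_{\xi}. \tag{8} \]
+
+∇ is called the 1−connection or the exponential connection.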
+Amari [2] defined a one parameter family of functions called the α−embeddings given by
+
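+\[ L_{\alpha}(p) = \begin{cases} \dfrac{2}{1-\alpha}\,p^{\frac{1-\alpha}{2}} & \alpha \neq 1 \\[1ex] \log p & \alpha = 1 \end{cases} \tag{9} \]
+
+Using these, we can define $n^3$ functions $\Gamma^{\alpha}_{ijk}$ by
+
+\[ \Gamma^{\alpha}_{ijk} = \int \partial_i\partial_j L_{\alpha}(p(x;\xi))\,\partial_k L_{-\alpha}(p(x;\xi))\,dx \tag{10} \]
+
+These $\Gamma^{\alpha}_{ijk}$ uniquely determine affine connections $\nabla^{\alpha}$ on the statistical manifold S by
+
+\[ \Gamma^{\alpha}_{ijk} = \langle \nabla^{\alpha}_{\partial_i}\partial_j, \partial_k \rangle \tag{11} \]
+
+which are called α−connections.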
+
+2. F−Geometry of a Statistical Manifold
+
+On a statistical manifold S, the Fisher information metric and exponential connection are defined
+using the log embedding. In a similar way, α−connections are defined using a one parameter family of
+functions, the α−embeddings. In general, we can give other geometric structures on S using different
+embeddings of the manifold S into the space of random variables RX.
+Let F : (0, ∞) −→ R be an injective function that is at least twice differentiable. Thus we have
+F′(u) ≠ 0, ∀ u ∈ (0, ∞). F is an embedding of S into $\mathbb{R}^X$ that takes each $p(x;\xi) \mapsto F(p(x;\xi))$.
+Denote F(p(x; ξ)) by F(x; ξ); then $\partial_i F$ can be written as
+
+\[ \partial_i F(x;\xi) = p(x;\xi)\,F'(p(x;\xi))\,\partial_i\ell(x;\xi). \tag{12} \]
+
+It is clear that $\partial_i F(x;\xi)$; i = 1, ..., n are linearly independent functions in x for fixed ξ since
+$\partial_i\ell(x;\xi)$; i = 1, ..., n are linearly independent. Let $T_{F(p_\xi)}F(S)$ be the n-dimensional vector space
+spanned by the n functions $\partial_i F$; i = 1, ..., n in x for fixed ξ. So
+
+\[ T_{F(p_\xi)}F(S) = \left\{ \sum_{i=1}^{n} A^i\partial_i F \;/\; A^i \in \mathbb{R} \right\} \tag{13} \]
+
+Let the tangent space $T_{F(p_\xi)}(F(S))$ to F(S) at the point $F(p_\xi)$ be denoted by $T^F_{\xi}(S)$. There is a natural
+isomorphism between the two vector spaces $T_{\xi}(S)$ and $T^F_{\xi}(S)$ given by
+
+\[ \partial_i \in T_{\xi}(S) \longleftrightarrow \partial_i F(x;\xi) \in T^F_{\xi}(S). \tag{14} \]
+
+$T^F_{\xi}(S)$ is called the F−representation of the tangent space $T_{\xi}(S)$.
+
+For any $A = \sum_{i=1}^{n} A^i\partial_i \in T_{\xi}(S)$, the corresponding $A(x) = \sum_{i=1}^{n} A^i\partial_i F \in T^F_{\xi}(S)$ is called the
+F−representation of the tangent vector A and is denoted by $A^F(x)$. Note that $T^F_{\xi}(S) \subseteq T_{F(p_\xi)}(\mathbb{R}^X)$.
+Since $\mathbb{R}^X$ is a vector space, its tangent space $T_{F(p_\xi)}(\mathbb{R}^X)$ can be identified with $\mathbb{R}^X$. So $T^F_{\xi}(S) \subseteq \mathbb{R}^X$.
+
+Definition 1. F−expectation of a random variable f with respect to the distribution p(x; ξ) is
+defined as
+
+\[ E^F_{\xi}(f) = \int f(x)\,\frac{1}{p\,(F'(p))^2}\,dx. \tag{15} \]
+
+We can use this F−expectation to define an inner product in RX by
+
+\[ \langle f, g \rangle^F_{\xi} = E^F_{\xi}[f(x)g(x)], \tag{16} \]
+
+which induces an inner product on $T_{\xi}(S)$ by
+
+\[ \langle A, B \rangle^F_{\xi} = E^F_{\xi}[A^F(x)B^F(x)]\ ; \quad A, B \in T_{\xi}(S). \tag{17} \]
+
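+Proposition 1. The induced metric $\langle\,,\rangle^F$ on S is the Fisher information metric $g = \langle\,,\rangle$ on S.
+
+Proof. For any basis vectors $\partial_i, \partial_j \in T_{\xi}(S)$
+
+\begin{align*}
+\langle \partial_i, \partial_j \rangle^F_{\xi}
+  &= E^F_{\xi}[\partial_i F\,\partial_j F] \\
+  &= \int \partial_i F\,\partial_j F\,\frac{1}{p\,(F'(p))^2}\,dx \\
+  &= \int (p\,F'(p)\,\partial_i\ell)\,(p\,F'(p)\,\partial_j\ell)\,\frac{1}{p\,(F'(p))^2}\,dx \tag{18} \\
+  &= \int \partial_i\ell\,\partial_j\ell\,p(x;\xi)\,dx \\
+  &= E_{\xi}[\partial_i\ell\,\partial_j\ell] \\
+  &= g_{ij}(\xi) \\
+  &= \langle \partial_i, \partial_j \rangle_{\xi} .
+\end{align*}
+
+So the metric $\langle\,,\rangle^F$ on S induced by the embedding F of S into $\mathbb{R}^X$ is the Fisher information metric
+$g = \langle\,,\rangle$ on S.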
+
+We can induce a connection on S using the embedding F.
+Let $\pi^F_{|p_\xi} : \mathbb{R}^X \longrightarrow T^F_{\xi}(S)$ be the projection map.
+
+Definition 2. The connection induced by the embedding F on S, the F−connection, is defined as
+
+\begin{align*}
+\nabla^F_{\partial_i}\partial_j &= \pi^F_{|p_\xi}(\partial_i\partial_j F) \\
+  &= \sum_n \sum_m g^{mn}\,\langle \partial_i\partial_j F, \partial_m F \rangle^F_{\xi}\,\partial_n, \tag{19}
+\end{align*}
+
+where $[g^{mn}(\xi)]$ is the inverse of the Fisher information matrix $G(\xi) = [g_{mn}(\xi)]$. Note that the F−connections
+are symmetric.
+
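+Lemma 1. The $F$-connection and its components can be written in terms of scores as
+
+\[ \nabla^F_{\partial_i} \partial_j = \sum_n \sum_m g^{mn}\, E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big)(\partial_m \ell) \Big]\, \partial_n \tag{20} \]
+
+and
+
+\[ \Gamma^F_{ijk}(\xi) = E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big)(\partial_k \ell) \Big]. \tag{21} \]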
+
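+Proof. From Equation (12), we have
+
+\[ \partial_i \partial_j F = pF'(p)\, \partial_i \partial_j \ell + [pF'(p) + p^2 F''(p)]\, \partial_i \ell\, \partial_j \ell. \tag{22} \]
+
+Therefore
+
+\[
+\begin{aligned}
+\langle \partial_i \partial_j F, \partial_m F \rangle^F_\xi &= \int \partial_i \partial_j F\; \partial_m F\, \frac{1}{p\,(F'(p))^2}\, dx \\
+&= \int \Big\{ pF'(p)\, \partial_i \partial_j \ell + [pF'(p) + p^2 F''(p)]\, \partial_i \ell\, \partial_j \ell \Big\} \frac{\partial_m \ell}{F'(p)}\, dx \\
+&= \int \Big\{ \partial_i \partial_j \ell\, \partial_m \ell + \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell\, \partial_m \ell \Big\}\, p\, dx \\
+&= E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big)(\partial_m \ell) \Big].
+\end{aligned}
+\tag{23}
+\]
+
+Hence we can write
+
+\[ \nabla^F_{\partial_i} \partial_j = \pi^F|_{p_\xi}(\partial_i \partial_j F) = \sum_n \sum_m g^{mn}\, E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big)(\partial_m \ell) \Big]\, \partial_n. \tag{24} \]
+
+Then we have the Christoffel symbols of the $F$-connection
+
+\[ \Gamma^n_{ij} = \sum_m g^{mn}\, E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big)(\partial_m \ell) \Big] \tag{25} \]
+
+and components of the $F$-connection are given by
+
+\[ \Gamma^F_{ijk}(\xi) = \langle \nabla^F_{\partial_i} \partial_j, \partial_k \rangle_\xi = E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big)(\partial_k \ell) \Big]. \tag{26} \]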
+
+Theorem 1. Amari’s α−geometry is a special case of the F−geometry.
+
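+Proof. Let $F(p) = L_\alpha(p)$, where $L_\alpha(p)$ is the $\alpha$-embedding of Amari.
+The components $\Gamma^\alpha_{ijk}$ of the $\alpha$-connection are given by
+
+\[ \Gamma^\alpha_{ijk}(\xi) = \langle \nabla^\alpha_{\partial_i} \partial_j, \partial_k \rangle_\xi = E_\xi\Big[ \Big( \partial_i \partial_j \ell + \frac{1 - \alpha}{2}\, \partial_i \ell\, \partial_j \ell \Big)(\partial_k \ell) \Big]. \tag{27} \]
+
+From Equation (26), when $F(p) = L_\alpha(p)$ we have
+
+\[ F'(p) = L'_\alpha(p) = p^{-\left(\frac{1+\alpha}{2}\right)}, \tag{28} \]
+
+\[ F''(p) = L''_\alpha(p) = -\frac{1+\alpha}{2}\, p^{-\left(\frac{3+\alpha}{2}\right)}. \tag{29} \]
+
+Then we get
+
+\[ 1 + \frac{pF''(p)}{F'(p)} = 1 + \frac{pL''_\alpha(p)}{L'_\alpha(p)} = \frac{1 - \alpha}{2}. \tag{30} \]
+
+Hence
+
+\[
+\begin{aligned}
+\Gamma^F_{ijk}(\xi) = \langle \nabla^F_{\partial_i} \partial_j, \partial_k \rangle_\xi &= E_\xi\Big[ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big)(\partial_k \ell) \Big] \\
+&= E_\xi\Big[ \Big( \partial_i \partial_j \ell + \frac{1 - \alpha}{2}\, \partial_i \ell\, \partial_j \ell \Big)(\partial_k \ell) \Big] \\
+&= \Gamma^\alpha_{ijk}(\xi),
+\end{aligned}
+\tag{31}
+\]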
+
+which are the components of the α−connection. Hence F−connection reduces to α−connection.
+Thus we obtain that α−geometry is a special case of F−geometry.
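+
+To make the reduction in Theorem 1 concrete, the coefficient $1 + pF''(p)/F'(p)$ can be evaluated by hand at the two endpoint embeddings. The cases $F(p) = \log p$ ($\alpha = 1$) and $F(p) = p$ ($\alpha = -1$) are chosen here purely as an illustration:
+
+\[
+\begin{aligned}
+F(p) = \log p:&\quad F'(p) = \frac{1}{p},\quad F''(p) = -\frac{1}{p^2},\quad 1 + \frac{pF''(p)}{F'(p)} = 1 - 1 = 0 = \frac{1 - 1}{2}, \\
+F(p) = p:&\quad F'(p) = 1,\quad F''(p) = 0,\quad 1 + \frac{pF''(p)}{F'(p)} = 1 = \frac{1 - (-1)}{2}.
+\end{aligned}
+\]
+
+Substituting into Equation (26) gives $\Gamma^F_{ijk} = E_\xi[\partial_i \partial_j \ell\, \partial_k \ell]$ in the first case and $\Gamma^F_{ijk} = E_\xi[(\partial_i \partial_j \ell + \partial_i \ell\, \partial_j \ell)\, \partial_k \ell]$ in the second, which agree with Equation (27) at $\alpha = 1$ and $\alpha = -1$ respectively.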
+
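+Remark 1. Burbea [14] introduced the concept of weighted Fisher information metric using a positive
+continuous function. We use this idea to define weighted $F$-metric and weighted $F$-connections. Let
+$G : (0, \infty) \to \mathbb{R}$ be a positive smooth function and $F$ be an embedding, and define the $(F, G)$-expectation of a
+random variable with respect to the distribution $p_\xi$ as
+
+\[ E^{F,G}_\xi(f) = \int f(x)\, \frac{G(p)}{p\,(F'(p))^2}\, dx. \tag{32} \]
+
+Define the $(F, G)$-metric $\langle\, ,\, \rangle^{F,G}_\xi$ in $T_{p_\xi}(S)$ by
+
+\[
+\begin{aligned}
+\langle \partial_i, \partial_j \rangle^{F,G}_\xi &= E^{F,G}_\xi[\partial_i F\, \partial_j F] \\
+&= \int \partial_i F\, \partial_j F\, \frac{G(p)}{p\,(F'(p))^2}\, dx \\
+&= \int \partial_i \ell\, \partial_j \ell\; G(p)\, p\, dx \\
+&= E_\xi[G(p)\, \partial_i \ell\, \partial_j \ell].
+\end{aligned}
+\tag{33}
+\]
+
+Define the $(F, G)$-connection as
+
+\[ \Gamma^{F,G}_{ijk} = \langle \nabla^{F,G}_{\partial_i} \partial_j, \partial_k \rangle_\xi = E_\xi\Big[ \Big\{ \Big( \partial_i \partial_j \ell + \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell \Big)(\partial_k \ell) \Big\}\, G(p) \Big]. \tag{34} \]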
+
+When G(p) = 1, (F, G)−connection reduces to the F−connection and the metric <, >F,G reduces to the Fisher
+information metric. This is a more general way of defining Riemannian metrics and affine connections on a
+statistical manifold.
+
+3. Dual Affine Connections
+
+Definition 3. Let M be a Riemannian manifold with a Riemannian metric g. Two affine connections, ∇ and
+∇∗ on the tangent bundle are said to be dual connections with respect to the metric g if
+
+\[ Z\, g(X, Y) = g(\nabla_Z X, Y) + g(X, \nabla^*_Z Y) \tag{35} \]
+
+holds for any vector fields X, Y, Z on M.
+
+Theorem 2. Let $F$, $H$ be two embeddings of a statistical manifold $S$ into the space $\mathbb{R}^X$ of random variables. Let $G$
+be a positive smooth function on $(0, \infty)$. Then the $(F, G)$-connection $\nabla^{F,G}$ and the $(H, G)$-connection $\nabla^{H,G}$
+are dual connections with respect to the $(F, G)$-metric iff the functions $F$ and $H$ satisfy
+
+\[ H'(p) = \frac{G(p)}{pF'(p)}. \tag{36} \]
+
+We call such an embedding $H$ a $G$-dual embedding of $F$.
+The components of the dual connection $\nabla^{H,G}$ can be written as
+
+\[ \Gamma^{H,G}_{ijk} = \int \Big\{ \partial_i \partial_j \ell + \Big( \frac{pG'(p)}{G(p)} - \frac{pF''(p)}{F'(p)} \Big) \partial_i \ell\, \partial_j \ell \Big\}\, \partial_k \ell\; G(p)\, p\, dx. \tag{37} \]
+
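+Proof. That $\nabla^{F,G}$ and $\nabla^{H,G}$ are dual connections with respect to the $(F, G)$-metric means that
+
+\[ \partial_k \langle \partial_i, \partial_j \rangle^{F,G} = \langle \nabla^{F,G}_{\partial_k} \partial_i, \partial_j \rangle^{F,G} + \langle \partial_i, \nabla^{H,G}_{\partial_k} \partial_j \rangle^{F,G} \tag{38} \]
+
+for any basis vectors $\partial_i, \partial_j, \partial_k \in T_\xi(S)$. The left-hand side is
+
+\[
+\begin{aligned}
+\partial_k \langle \partial_i, \partial_j \rangle^{F,G} &= \int \partial_k \partial_j \ell\; \partial_i \ell\; p\, G(p)\, dx + \int \partial_k \partial_i \ell\; \partial_j \ell\; p\, G(p)\, dx \\
+&\quad + \int \Big(1 + \frac{pG'(p)}{G(p)}\Big) \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\, G(p)\, dx,
+\end{aligned}
+\tag{39}
+\]
+
+and the right-hand side is
+
+\[
+\begin{aligned}
+\langle \nabla^{F,G}_{\partial_k} \partial_i, \partial_j \rangle^{F,G} + \langle \partial_i, \nabla^{H,G}_{\partial_k} \partial_j \rangle^{F,G} &= \int \partial_k \partial_i \ell\; \partial_j \ell\; p\, G(p)\, dx \\
+&\quad + \int \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\, G(p)\, dx \\
+&\quad + \int \Big(1 + \frac{pH''(p)}{H'(p)}\Big) \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\, G(p)\, dx \\
+&\quad + \int \partial_k \partial_j \ell\; \partial_i \ell\; p\, G(p)\, dx.
+\end{aligned}
+\tag{40}
+\]
+
+Then the condition (38) holds iff
+
+\[ \int \Big[ 2 + \frac{pF''(p)}{F'(p)} + \frac{pH''(p)}{H'(p)} \Big] \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\, G(p)\, dx = \int \Big[ 1 + \frac{pG'(p)}{G(p)} \Big] \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\, G(p)\, dx \tag{41} \]
+
+\[ \iff\; 2 + \frac{pF''(p)}{F'(p)} + \frac{pH''(p)}{H'(p)} = 1 + \frac{pG'(p)}{G(p)} \tag{42} \]
+
+\[ \iff\; 1 + \frac{pH''(p)}{H'(p)} = \frac{pG'(p)}{G(p)} - \frac{pF''(p)}{F'(p)} \tag{43} \]
+
+\[ \iff\; \frac{H''(p)}{H'(p)} = \frac{G'(p)}{G(p)} - \frac{F''(p)}{F'(p)} - \frac{1}{p} \;\iff\; H'(p) = \frac{G(p)}{pF'(p)}. \tag{44} \]
+
+Hence $\nabla^{F,G}$ and $\nabla^{H,G}$ are dual connections with respect to the $(F, G)$-metric iff Equation (36) holds.
+From Equation (43), we can rewrite the components of the dual connection $\nabla^{H,G}$ as
+
+\[ \Gamma^{H,G}_{ijk} = \int \Big\{ \partial_i \partial_j \ell + \Big( \frac{pG'(p)}{G(p)} - \frac{pF''(p)}{F'(p)} \Big) \partial_i \ell\, \partial_j \ell \Big\}\, \partial_k \ell\; G(p)\, p\, dx. \tag{45} \]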
+
+Corollary 1. Amari’s 0−connection is the only self dual F−connection with respect to the Fisher information
+metric.
+
+Proof. From Theorem 2, for $G(p) = 1$ the $F$-connection $\nabla^F$ and the $H$-connection $\nabla^H$ are dual
+connections with respect to the Fisher information metric iff the functions $F$ and $H$ satisfy
+
+\[ H'(p) = \frac{1}{pF'(p)}. \tag{46} \]
+
+Thus the $F$-connection $\nabla^F$ is self dual iff the embedding $F$ satisfies the condition
+
+\[ F'(p) = \frac{1}{pF'(p)} \;\iff\; F'(p) = p^{-\frac{1}{2}} \;\iff\; F(p) = 2p^{\frac{1}{2}} = L_0(p). \tag{47} \]
+
+That is, Amari’s 0−connection is the only self dual F−connection with respect to the Fisher
+information metric.
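+
+The duality condition of Theorem 2 can be illustrated on Amari's own family. The following is a sketch under the assumption $G(p) = 1$ and $F(p) = L_\alpha(p)$, using $L'_\alpha(p) = p^{-(1+\alpha)/2}$ from Equation (28):
+
+\[ H'(p) = \frac{1}{pF'(p)} = \frac{1}{p \cdot p^{-\frac{1+\alpha}{2}}} = p^{-\frac{1-\alpha}{2}} = L'_{-\alpha}(p). \]
+
+So, up to an additive constant, $H = L_{-\alpha}$, and the dual of the $\alpha$-connection is the $(-\alpha)$-connection. Equation (43) gives the same answer: with $G' = 0$ it reads $1 + pH''(p)/H'(p) = -pF''(p)/F'(p) = (1+\alpha)/2 = (1-(-\alpha))/2$. The only value fixed by $\alpha \mapsto -\alpha$ is $\alpha = 0$, in agreement with Corollary 1.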
+
+So far, we have considered the statistical manifold $S$ as a subset of $\mathcal{P}(X)$, the set of all probability
+measures on $X$. Now we relax the condition $\int p(x)\, dx = 1$, and consider $S$ as a subset of $\tilde{\mathcal{P}}(X)$, which
+is defined by
+
+\[ \tilde{\mathcal{P}}(X) := \Big\{ p : X \to \mathbb{R} \;\Big|\; p(x) > 0 \ (\forall\, x \in X);\ \int_X p(x)\, dx < \infty \Big\}. \tag{48} \]
+
+Definition 4. Let M be a Riemannian manifold with a Riemannian metric g. Let ∇ be an affine connection on
+M. If there exists a coordinate system [θi] of M such that ∇∂i∂j = 0 then we say that ∇ is flat, or alternatively
+M is flat with respect to ∇, and we call such a coordinate system [θi] an affine coordinate system for ∇.
+
+Definition 5. Let S = {p(x; ξ) / ξ = [ξ1, ..., ξn] ∈ E ⊆ Rn} be an n−dimensional statistical manifold. If
+for some coordinate system [θi]; i = 1, ..., n
+
+∂i∂jF(p(x; θ)) = 0
+(49)
+
+then we can see from Equation (19) that [θi] is an F−affine coordinate system and that S = {pθ} is F−flat. We
+call such S as an F−affine manifold.
+The condition (49) is equivalent to the existence of functions $C, F_1, \dots, F_n$ on $X$ such that
+
+\[ F(p(x;\theta)) = C(x) + \sum_{i=1}^{n} \theta^i F_i(x). \tag{50} \]
+
+Theorem 3. For any embedding F, ˜P(X) is an F−affine manifold for finite X.
+
+Proof. Let X = {x1, ...., xn} be a finite set constituted by n elements. Let Fi : X −→ R be the functions
+defined by Fi(xj) = δij for i, j = 1, .., n. Let us define n coordinates [θi] by
+
+θi = F(p(xi))
+(51)
+
+Then we get $F(p(x)) = \sum_{i=1}^{n} \theta^i F_i(x)$. Therefore $\tilde{\mathcal{P}}(X)$ is an $F$-affine manifold for any
+embedding $F(p)$.
+
+Remark 2. Zhang [13] introduced ρ-representation, which is a generalization of α-representation of Amari.
+Zhang’s geometry is defined using this ρ-representation together with a convex function. Zhang also defined the
+ρ-affine family of density functions and discussed its dually flat structure. The F−geometry defined using a
+general F-representation is different from Zhang’s geometry. The metric defined in the F-embedding approach
+is the Fisher information metric and the Riemannian metric defined using the ρ-representation is different from
+the Fisher information metric. The F-connections defined are not in general dually flat and are different from the
+dual connections defined by Zhang.
+
+Remark 3. On a statistical manifold S, we introduced a dualistic structure (g, ∇F, ∇H), where g is the Fisher
+information metric and ∇F, ∇H are the dual connections with respect to the Fisher information metric. Since
+F-connections are symmetric, the manifold S is flat with respect to ∇F iff S is flat with respect to ∇H. Thus if
+S is flat with respect to ∇F, then (S, g, ∇F, ∇H) is a dually flat space. The dually flat spaces are important in
+statistical estimation [4].
+
+4. Invariance of the Geometric Structures
+
+For the statistical manifold S = {p(x; ξ) | ξ ∈ E ⊆ Rn}, the parameters are merely labels attached
+to each point p ∈ S, hence the intrinsic geometric properties should be independent of these labels.
+Consequently, it is natural to consider the invariance properties of the geometric structures under
+suitable transformations of the variables in a statistical manifold. Here we can consider two kinds of
+invariance of the geometric structures; covariance under re-parametrization of the parameter of the
+manifold and invariance under the transformations of the random variable [15]. Now let us investigate
+the invariance properties of the F-geometric structures defined in Section 2.
+
+4.1. Covariance under Re-Parametrization
+
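+Let $[\theta^i]$ and $[\eta^j]$ be two coordinate systems on $S$, which are related by an invertible transformation
+$\eta = \eta(\theta)$. Let us denote $\partial_i = \frac{\partial}{\partial \theta^i}$ and $\tilde{\partial}_j = \frac{\partial}{\partial \eta^j}$. Let the coordinate expressions of the metric $g$ be given
+by $g_{ij} = \langle \partial_i, \partial_j \rangle$ and $\tilde{g}_{ij} = \langle \tilde{\partial}_i, \tilde{\partial}_j \rangle$. Let the components of the connection $\nabla$ with respect to the
+coordinates $[\theta^i]$ and $[\eta^j]$ be given by $\Gamma_{ijk}$, $\tilde{\Gamma}_{ijk}$ respectively.
+Then the covariance of the metric $g$ and the connection $\nabla$ under the re-parametrization means that
+
+\[ \tilde{g}_{ij} = \sum_m \sum_n \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\; g_{mn}, \tag{52} \]
+
+\[ \tilde{\Gamma}_{ijk} = \sum_{m,n,h} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k}\; \Gamma_{mnh} + \sum_{m,h} \frac{\partial \theta^h}{\partial \eta^k}\, \frac{\partial^2 \theta^m}{\partial \eta^i\, \partial \eta^j}\; g_{mh}. \tag{53} \]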
+
+Lemma 2. The Fisher information metric g is covariant under re-parametrization.
+
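+Proof. The components of the Fisher information metric with respect to the coordinate system $[\theta^i]$ are
+given by
+
+\[ g_{ij}(\theta) = \langle \partial_i, \partial_j \rangle_\theta = \int \partial_i p(x;\theta)\, \partial_j p(x;\theta)\, \frac{1}{p(x;\theta)}\, dx. \tag{54} \]
+
+Let $\tilde{p}(x;\eta) = p(x;\theta(\eta))$. Then the components of the Fisher information metric with respect to the
+coordinate system $[\eta^j]$ are given by
+
+\[ \tilde{g}_{ij}(\eta) = \langle \partial_i, \partial_j \rangle_\eta = \int \partial_i \tilde{p}(x;\eta)\, \partial_j \tilde{p}(x;\eta)\, \frac{1}{\tilde{p}(x;\eta)}\, dx. \tag{55} \]
+
+Since
+
+\[ \partial_i \tilde{p}(x;\eta) = \sum_m \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial p(x;\theta(\eta))}{\partial \theta^m}, \tag{56} \]
+
+we can write
+
+\[
+\begin{aligned}
+\tilde{g}_{ij}(\eta) &= \int \partial_i \tilde{p}(x;\eta)\, \partial_j \tilde{p}(x;\eta)\, \frac{1}{\tilde{p}(x;\eta)}\, dx \\
+&= \int \sum_m \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial p(x;\theta)}{\partial \theta^m}\; \sum_n \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial p(x;\theta)}{\partial \theta^n}\; \frac{1}{p(x;\theta)}\, dx \\
+&= \sum_m \sum_n \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j} \int \partial_m p(x;\theta)\, \partial_n p(x;\theta)\, \frac{1}{p(x;\theta)}\, dx \\
+&= \Big[ \sum_m \sum_n \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\; g_{mn}(\theta) \Big]_{\theta = \theta(\eta)}.
+\end{aligned}
+\tag{57}
+\]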
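+
+The transformation rule (52) can be checked by hand on a one-parameter model. The exponential family $p(x;\theta) = \theta e^{-\theta x}$, $x > 0$, and the re-parametrization $\eta = 1/\theta$ are assumptions made only for this illustration; they do not appear in the text above.
+
+\[
+\begin{aligned}
+\ell(x;\theta) &= \log \theta - \theta x, & \partial_\theta \ell &= \frac{1}{\theta} - x, & g(\theta) &= E_\theta\Big[\Big(x - \frac{1}{\theta}\Big)^2\Big] = \frac{1}{\theta^2}, \\
+\tilde{\ell}(x;\eta) &= -\log \eta - \frac{x}{\eta}, & \partial_\eta \tilde{\ell} &= \frac{x - \eta}{\eta^2}, & \tilde{g}(\eta) &= \frac{E_\eta[(x - \eta)^2]}{\eta^4} = \frac{\eta^2}{\eta^4} = \frac{1}{\eta^2}.
+\end{aligned}
+\]
+
+On the other hand, $\theta = 1/\eta$ gives $d\theta/d\eta = -1/\eta^2$, so the right-hand side of (52) is
+
+\[ \Big(\frac{d\theta}{d\eta}\Big)^2 g(\theta(\eta)) = \frac{1}{\eta^4} \cdot \eta^2 = \frac{1}{\eta^2}, \]
+
+which matches the directly computed $\tilde{g}(\eta)$.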
+
+Lemma 3. The F−connection ∇F is covariant under re-parametrization.
+
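+Proof. Let the components of $\nabla^F$ with respect to the coordinates $[\theta^i]$ and $[\eta^j]$ be given by $\Gamma_{ijk}$,
+$\tilde{\Gamma}_{ijk}$ respectively.
+Let $\tilde{p}(x;\eta) = p(x;\theta(\eta))$. Let us denote $\log p(x;\theta)$ by $\ell(x;\theta)$ and $\log \tilde{p}(x;\eta)$ by $\tilde{\ell}(x;\eta)$.
+The components of the $F$-connection $\nabla^F$ with respect to the coordinate system $[\theta^i]$ are given by
+
+\[ \Gamma_{ijk} = \int \Big\{ \partial_i \partial_j \ell(x;\theta) + \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \partial_i \ell(x;\theta)\, \partial_j \ell(x;\theta) \Big\}\, \partial_k \ell(x;\theta)\, p(x;\theta)\, dx. \tag{58} \]
+
+The components of $\nabla^F$ with respect to the coordinate system $[\eta^j]$ are given by
+
+\[ \tilde{\Gamma}_{ijk} = \int \Big\{ \partial_i \partial_j \tilde{\ell}(x;\eta) + \Big(1 + \frac{\tilde{p}F''(\tilde{p})}{F'(\tilde{p})}\Big) \partial_i \tilde{\ell}(x;\eta)\, \partial_j \tilde{\ell}(x;\eta) \Big\}\, \partial_k \tilde{\ell}(x;\eta)\, \tilde{p}(x;\eta)\, dx. \tag{59} \]
+
+We can write
+
+\[ \partial_i \tilde{\ell}(x;\eta) = \sum_m \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \ell(x;\theta(\eta))}{\partial \theta^m}. \tag{60} \]
+
+Then
+
+\[ \partial_i \partial_j \tilde{\ell}(x;\eta) = \sum_{m,n} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial^2 \ell(x;\theta(\eta))}{\partial \theta^m\, \partial \theta^n} + \sum_m \frac{\partial^2 \theta^m}{\partial \eta^i\, \partial \eta^j}\, \frac{\partial \ell(x;\theta(\eta))}{\partial \theta^m}, \tag{61} \]
+
+\[ \partial_i \tilde{\ell}(x;\eta)\, \partial_j \tilde{\ell}(x;\eta) = \sum_{m,n} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \ell(x;\theta(\eta))}{\partial \theta^m}\, \frac{\partial \ell(x;\theta(\eta))}{\partial \theta^n}, \tag{62} \]
+
+\[ \partial_k \tilde{\ell}(x;\eta) = \sum_h \frac{\partial \theta^h}{\partial \eta^k}\, \frac{\partial \ell(x;\theta(\eta))}{\partial \theta^h}. \tag{63} \]
+
+Hence, writing $\ell$ for $\ell(x;\theta(\eta))$ and $p$ for $p(x;\theta(\eta))$ inside the integrals, we get
+
+\[
+\begin{aligned}
+\tilde{\Gamma}_{ijk} &= \int \sum_{m,n,h} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k}\, \frac{\partial^2 \ell}{\partial \theta^m\, \partial \theta^n}\, \frac{\partial \ell}{\partial \theta^h}\; p\, dx \\
+&\quad + \int \sum_{m,h} \frac{\partial^2 \theta^m}{\partial \eta^i\, \partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k}\, \frac{\partial \ell}{\partial \theta^m}\, \frac{\partial \ell}{\partial \theta^h}\; p\, dx \\
+&\quad + \int \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \sum_{m,n,h} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k}\, \frac{\partial \ell}{\partial \theta^m}\, \frac{\partial \ell}{\partial \theta^n}\, \frac{\partial \ell}{\partial \theta^h}\; p\, dx \\
+&= \sum_{m,n,h} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k} \int \frac{\partial^2 \ell}{\partial \theta^m\, \partial \theta^n}\, \frac{\partial \ell}{\partial \theta^h}\; p\, dx \\
+&\quad + \sum_{m,h} \frac{\partial^2 \theta^m}{\partial \eta^i\, \partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k} \int \frac{\partial \ell}{\partial \theta^m}\, \frac{\partial \ell}{\partial \theta^h}\; p\, dx \\
+&\quad + \sum_{m,n,h} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k} \int \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \frac{\partial \ell}{\partial \theta^m}\, \frac{\partial \ell}{\partial \theta^n}\, \frac{\partial \ell}{\partial \theta^h}\; p\, dx \\
+&= \sum_{m,n,h} \frac{\partial \theta^m}{\partial \eta^i}\, \frac{\partial \theta^n}{\partial \eta^j}\, \frac{\partial \theta^h}{\partial \eta^k}\; \Gamma_{mnh} + \sum_{m,h} \frac{\partial \theta^h}{\partial \eta^k}\, \frac{\partial^2 \theta^m}{\partial \eta^i\, \partial \eta^j}\; g_{mh}.
+\end{aligned}
+\tag{64}
+\]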
+
+Hence we showed that F−connections are covariant under re-parametrization of the parameter.
+The covariance under re-parametrization actually means that the metric and connections are coordinate
+independent. Hence we obtained that the F−geometry is coordinate independent.
+
+4.2. Invariance Under the Transformation of the Random Variable
+
+Amari and Nagaoka [4] defined the invariance of Riemannian metric and connections on a
+statistical manifold under a transformation of the random variable as follows,
+
+Definition 6. Let $S = \{p(x;\xi) \mid \xi \in E \subseteq \mathbb{R}^n\}$ be a statistical manifold defined on a sample space $X$.
+Let $x$, $y$ be random variables defined on sample spaces $X$, $Y$ respectively and $\phi$ be a transformation
+of $x$ to $y$.
+Assume that this transformation induces a model $S' = \{q(y;\xi) \mid \xi \in E \subseteq \mathbb{R}^n\}$ on $Y$.
+Let $\lambda : S \to S'$ be a diffeomorphism defined as
+
+\[ \lambda(p_\xi) = q_\xi. \tag{65} \]
+
+Let $g = \langle\, ,\, \rangle$, $g' = \langle\, ,\, \rangle'$ be two Riemannian metrics defined on $S$ and $S'$ respectively. Let $\nabla$, $\nabla'$ be two affine
+connections on $S$ and $S'$ respectively. Then the invariance properties are given by
+
+\[ \langle X, Y \rangle_p = \langle \lambda_*(X), \lambda_*(Y) \rangle'_{\lambda(p)} \qquad \forall\, X, Y \in T_p(S), \tag{66} \]
+
+\[ \lambda_*(\nabla_X Y) = \nabla'_{\lambda_*(X)} \lambda_*(Y), \tag{67} \]
+
+where $\lambda_*$ is the push forward map associated with the map $\lambda$, which is defined by
+
+\[ \lambda_*(X)_{\lambda(p)} = (d\lambda)_p(X). \tag{68} \]
+
+Now we discuss the invariance properties of the F−geometry under suitable transformations
+of the random variable. Let us restrict ourselves to the case of smooth one-to-one transformations
+of the random variable that are in fact statistically interesting. Amari and Nagaoka [4] mentioned a
+transformation, the sufficient statistic of the parameter of the statistical model, which is widely used in
+statistical estimation. In fact the one-to-one transformations of the random variable are trivial examples
+of sufficient statistic.
+Consider a statistical manifold S = {p(x; ξ) | ξ ∈ E ⊆ Rn} defined on a sample space X. Let φ be
+a smooth one-to-one transformation of the random variable x to y. Then the density function q(y; ξ) of
+the induced model S′ takes the form
+
+\[ q(y;\xi) = p(w(y);\xi)\, w'(y), \tag{69} \]
+
+where $w$ is a function such that $x = w(y)$ and $\phi'(x) = \frac{1}{w'(\phi(x))}$.
+Let us denote $\log q(y;\xi)$ by $\ell(q_y)$ and $\log p(x;\xi)$ by $\ell(p_x)$.
+
+Lemma 4. The Fisher information metric and Amari’s α-connections are invariant under smooth one-to-one
+transformations of the random variable.
+
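+Proof. Let $\phi$ be a smooth one-to-one transformation of the random variable $x$ to $y$.
+From Equation (69),
+
+\[ p(x;\xi) = q(\phi(x);\xi)\, \phi'(x), \tag{70} \]
+
+\[ \partial_i \ell(q_y) = \partial_i \ell(p_{w(y)}), \tag{71} \]
+
+\[ \partial_i \ell(q_{\phi(x)}) = \partial_i \ell(p_x). \tag{72} \]
+
+The Fisher information metric $g'$ on the induced manifold $S'$ is given by
+
+\[
+\begin{aligned}
+g'_{ij}(q_\xi) &= \int_Y \partial_i \ell(q_y)\, \partial_j \ell(q_y)\, q(y;\xi)\, dy \\
+&= \int_X \partial_i \ell(q_{\phi(x)})\, \partial_j \ell(q_{\phi(x)})\, q(\phi(x);\xi)\, \phi'(x)\, dx \\
+&= \int_X \partial_i \ell(p_x)\, \partial_j \ell(p_x)\, p(x;\xi)\, dx \\
+&= g_{ij}(p_\xi),
+\end{aligned}
+\tag{73}
+\]
+
+which is the Fisher information metric on $S$.
+The components of Amari’s $\alpha$-connections on the induced manifold $S'$ are given by
+
+\[
+\begin{aligned}
+{\Gamma'}^{\alpha}_{ijk}(q_\xi) &= \int_Y \partial_i \partial_j \ell(q_y)\, \partial_k \ell(q_y)\, q(y;\xi)\, dy + \int_Y \frac{1 - \alpha}{2}\, \partial_i \ell(q_y)\, \partial_j \ell(q_y)\, \partial_k \ell(q_y)\, q(y;\xi)\, dy \\
+&= \int_X \partial_i \partial_j \ell(q_{\phi(x)})\, \partial_k \ell(q_{\phi(x)})\, q(\phi(x);\xi)\, \phi'(x)\, dx \\
+&\quad + \int_X \frac{1 - \alpha}{2}\, \partial_i \ell(q_{\phi(x)})\, \partial_j \ell(q_{\phi(x)})\, \partial_k \ell(q_{\phi(x)})\, q(\phi(x);\xi)\, \phi'(x)\, dx \\
+&= \int_X \partial_i \partial_j \ell(p_x)\, \partial_k \ell(p_x)\, p(x;\xi)\, dx + \int_X \frac{1 - \alpha}{2}\, \partial_i \ell(p_x)\, \partial_j \ell(p_x)\, \partial_k \ell(p_x)\, p(x;\xi)\, dx \\
+&= \Gamma^{\alpha}_{ijk}(p_\xi),
+\end{aligned}
+\tag{74}
+\]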
+
+which are the components of Amari’s α−connections on the manifold S. Thus we obtained that
+the Fisher information metric and Amari’s α-connections are invariant under smooth one-to-one
+transformations of the random variable.
+
+Now we prove that α-connections are the only F−connections that are invariant under smooth
+one-to-one transformations of the random variable.
+
+Theorem 4. Amari’s α-connections are the only F−connections that are invariant under smooth one-to-one
+transformations of the random variable.
+
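+Proof. Let $\phi$ be a smooth one-to-one transformation of the random variable $x$ to $y$.
+The components of the $F$-connection of the induced manifold $S'$ are
+
+\[
+\begin{aligned}
+{\Gamma'}^{F}_{ijk}(q_\xi) &= \int_Y \Big\{ \partial_i \partial_j \ell(q_y) + \Big(1 + \frac{qF''(q)}{F'(q)}\Big) \partial_i \ell(q_y)\, \partial_j \ell(q_y) \Big\}\, \partial_k \ell(q_y)\, q(y;\xi)\, dy \\
+&= \int_X \partial_i \partial_j \ell(p_x)\, \partial_k \ell(p_x)\, p(x;\xi)\, dx \\
+&\quad + \int_X \Big(1 + \frac{q(\phi(x);\xi)\, F''(q(\phi(x);\xi))}{F'(q(\phi(x);\xi))}\Big) \partial_i \ell(p_x)\, \partial_j \ell(p_x)\, \partial_k \ell(p_x)\, p(x;\xi)\, dx,
+\end{aligned}
+\tag{75}
+\]
+
+and the components of the $F$-connection of the manifold $S$ are
+
+\[
+\begin{aligned}
+\Gamma^{F}_{ijk}(p_\xi) &= \int_X \partial_i \partial_j \ell(p_x)\, \partial_k \ell(p_x)\, p(x;\xi)\, dx \\
+&\quad + \int_X \Big(1 + \frac{p(x;\xi)\, F''(p(x;\xi))}{F'(p(x;\xi))}\Big) \partial_i \ell(p_x)\, \partial_j \ell(p_x)\, \partial_k \ell(p_x)\, p(x;\xi)\, dx.
+\end{aligned}
+\tag{76}
+\]
+
+Then by equating the components ${\Gamma'}^{F}_{ijk}(q_\xi)$, $\Gamma^{F}_{ijk}(p_\xi)$ of the $F$-connection, we get
+
+\[
+\begin{aligned}
+&\int \frac{q(\phi(x);\xi)\, F''(q(\phi(x);\xi))}{F'(q(\phi(x);\xi))}\, \partial_i \ell(p_x)\, \partial_j \ell(p_x)\, \partial_k \ell(p_x)\, p(x;\xi)\, dx \\
+&\qquad = \int \frac{p(x;\xi)\, F''(p(x;\xi))}{F'(p(x;\xi))}\, \partial_i \ell(p_x)\, \partial_j \ell(p_x)\, \partial_k \ell(p_x)\, p(x;\xi)\, dx.
+\end{aligned}
+\tag{77}
+\]
+
+Then it follows that the condition for the $F$-connection to be invariant under the transformation $\phi$ is
+given by
+
+\[ \frac{pF''(p)}{F'(p)} = k, \tag{78} \]
+
+where $k$ is a real constant.
+Hence it follows from Euler’s homogeneous function theorem that the function $F'$ is a positive
+homogeneous function in $p$ of degree $k$. So
+
+\[ F'(\lambda p) = \lambda^k F'(p) \quad \text{for } \lambda > 0. \tag{79} \]
+
+Since $F'$ is a positive homogeneous function in the single variable $p$, without loss of generality
+we can take
+
+\[ F'(p) = p^k. \tag{80} \]
+
+Therefore
+
+\[ F(p) = \begin{cases} \dfrac{p^{k+1}}{k+1}, & k \neq -1, \\[1ex] \log p, & k = -1. \end{cases} \tag{81} \]
+
+Letting
+
+\[ k = -\frac{1 + \alpha}{2}, \qquad \alpha \in \mathbb{R}, \tag{82} \]
+
+we get
+
+\[ F(p) = \begin{cases} \dfrac{2}{1-\alpha}\, p^{\frac{1-\alpha}{2}}, & \alpha \neq 1, \\[1ex] \log p, & \alpha = 1, \end{cases} \tag{83} \]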
+
+which is nothing but Amari’s α−embeddings Lα(p). Hence we obtain that Amari’s α−connections
+are the only F−connections that are invariant under smooth one-to-one transformations of the
+random variable.
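+
+Condition (78) is easy to test on candidate embeddings. The sketch below compares Amari's $L_\alpha$ with a hypothetical embedding $F(p) = e^p$, which is introduced here only as a counterexample and is not discussed in the source; the scaling $\phi(x) = 2x$ is likewise an assumed test transformation.
+
+\[
+\begin{aligned}
+F(p) = L_\alpha(p):&\quad \frac{pF''(p)}{F'(p)} = \frac{p \cdot \big(-\frac{1+\alpha}{2}\big)\, p^{-\frac{3+\alpha}{2}}}{p^{-\frac{1+\alpha}{2}}} = -\frac{1+\alpha}{2} \quad \text{(constant in } p\text{)}, \\
+F(p) = e^p:&\quad \frac{pF''(p)}{F'(p)} = \frac{p\, e^p}{e^p} = p \quad \text{(not constant in } p\text{)}.
+\end{aligned}
+\]
+
+For $\phi(x) = 2x$ we have $w(y) = y/2$ and, by Equation (69), $q(\phi(x);\xi) = \tfrac{1}{2}\, p(x;\xi)$. With $F(p) = e^p$ the two sides of Equation (77) then carry the factors $\tfrac{1}{2}\, p(x;\xi)$ and $p(x;\xi)$ respectively, so they agree only if $\int p^2\, \partial_i \ell\, \partial_j \ell\, \partial_k \ell\, dx = 0$, which does not hold for a general model. For $L_\alpha$ both sides carry the same constant $-(1+\alpha)/2$ and Equation (77) holds identically.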
+
+Remark 4. In Section 2, we defined (F, G)-connections using a general embedding function F and a positive
+smooth function G. We can show that (F, G)-connection is invariant under smooth one-to-one transformation
+of the random variable when G(p) = c, where c is a real constant and F(p) = Lα(p) (proof is similar to that of
+Theorem 4). The notion of (F, G)−metric and (F, G)−connection provides a more general way of introducing
+geometric structures on a manifold. We were able to show that the Fisher information metric (up to a constant)
+and Amari’s α−connections are the only metric and connections belonging to this class that are invariant under
+both the transformation of the parameter and the one-to-one transformation of the random variable.
+
+5. Conclusions
+
+The Fisher information metric and Amari’s α−connections are widely used in the theory
+of information geometry and have an important role in the theory of statistical estimation.
+Amari’s α−connections are defined using a one parameter family of functions, the α−embeddings.
+We generalized this idea to introduce geometric structures on a statistical manifold S. We considered
+a general embedding function F of S into RX and obtained a geometric structure on S called the
+F−geometry. Amari’s α−geometry is a special case of F−geometry. A more general way of defining
+Riemannian metrics and affine connections on a statistical manifold S is given using a positive
+continuous function G and the embedding F.
+Amari’s α−geometry is the only F−geometry that is invariant under both the transformation of
+the parameter and the random variable or equivalently under the sufficient statistic. We can relax the
+condition of invariance under the sufficient statistic and can consider other statistically significant
+transformations as well, which then gives an F−geometry other than α−geometry that is invariant
+under these statistically significant transformations. We believe that the idea of F−geometry can be
+used in the further development of the geometric theory of q-exponential families. We look forward to
+studying these problems in detail later.
+
+Acknowledgments: We are extremely thankful to Shun-ichi Amari for reading this article and encouraging our
+learning process. We would like to thank the reviewer who mentioned the references [13,16] that are of great
+importance in our future work.
+
+Author Contributions: The authors contributed equally to the presented mathematical framework and the writing
+of the paper.
+
+Conflicts of Interest: The authors declare no conflicts of interest.
+
+References
+
+1. Amari, S. Differential geometry of curved exponential families-curvature and information loss. Ann. Statist. 1982, 10, 357–385.
+2. Amari, S. Differential-Geometrical Methods in Statistics; Lecture Notes in Statistics, Volume 28; Springer-Verlag: New York, NY, USA, 1985.
+3. Amari, S.; Kumon, M. Differential geometry of Edgeworth expansions in curved exponential family. Ann. Inst. Statist. Math. 1983, 35, 1–24.
+4. Amari, S.; Nagaoka, H. Methods of Information Geometry, Translations of Mathematical Monographs; Oxford University Press: Oxford, UK, 2000.
+5. Barndorff-Nielsen, O.E.; Cox, D.R.; Reid, N. The role of differential geometry in statistical theory. Internat. Statist. Rev. 1986, 54, 83–96.
+6. Dawid, A.P. A Discussion to Efron’s paper. Ann. Statist. 1975, 3, 1231–1234.
+7. Efron, B. Defining the curvature of a statistical problem (with applications to second order efficiency). Ann. Statist. 1975, 3, 1189–1242.
+8. Efron, B. The geometry of exponential families. Ann. Statist. 1978, 6, 362–376.
+9. Murray, M.K.; Rice, R.W. Differential Geometry and Statistics; Chapman & Hall: London, UK, 1995.
+10. Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta. Math. Soc. 1945, 37, 81–91.
+11. Chentsov, N.N. Statistical Decision Rules and Optimal Inference; Translated in English, Translation of the Mathematical Monographs; American Mathematical Society: Providence, RI, USA, 1982.
+12. Corcuera, J.M.; Giummole, F. A characterization of monotone and regular divergences. Ann. Inst. Statist. Math. 1998, 50, 433–450.
+13. Zhang, J. Divergence function, duality and convex analysis. Neur. Comput. 2004, 16, 159–195.
+14. Burbea, J. Informative geometry of probability spaces. Expo Math. 1986, 4, 347–378.
+15. Wagenaar, D.A. Information Geometry for Neural Networks. Available online: http://www.danielwagenaar.net/res/papers/98-Wage2.pdf (accessed on 13 December 2013).
+16. Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually flat and conformal geometries. Physica A 2012, 391, 4308–4319.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+Article
+Computational Information Geometry in Statistics:
+Theory and Practice
+
+Frank Critchley 1 and Paul Marriott 2,*
+
+1 Department of Mathematics and Statistics, The Open University, Walton Hall, Milton Keynes,
+Buckinghamshire MK7 6AA, UK; E-Mail: f.critchley@open.ac.uk
+2 Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo,
+ON N2L 3G1, Canada
+* E-Mail: pmarriot@uwaterloo.ca; Tel.: +1-519-888-4567.
+
+Received: 27 March 2014; in revised form: 25 April 2014 / Accepted: 29 April 2014 /
+Published: 2 May 2014
+
+Abstract: A broad view of the nature and potential of computational information geometry in
+statistics is offered.
+This new area suitably extends the manifold-based approach of classical
+information geometry to a simplicial setting, in order to obtain an operational universal model space.
+Additional underlying theory and illustrative real examples are presented. In the infinite-dimensional
+case, challenges inherent in this ambitious overall agenda are highlighted and promising new
+methodologies indicated.
+
+Keywords: information geometry; computational geometry; statistical foundations
+
+1. Introduction
+
+The application of geometry to statistical theory and practice has seen a number of different
+approaches developed. One of the most important can be defined as starting with Efron’s seminal
+paper [? ] on statistical curvature and subsequent landmark references, including the book by Kass
+and Vos [? ]. This approach, a major part of which has been called information geometry, continues
+today, a primary focus being invariant higher-order asymptotic expansions obtained through the use
+of differential geometry. A somewhat representative example of the type of result it generates is taken
+from [? ], where the notation is defined:
+
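+Example 1. The bias correction of a first-order efficient estimator, $\hat{\beta}$, is defined by:
+
+\[ b^a(\beta) = -\frac{1}{2n}\, g^{aa'} \Big\{ g^{bc}\, \Gamma^{(-1)}_{a'bc} + g^{\kappa\lambda}\, h^{(-1)}_{\kappa\lambda a'} \Big\}, \]
+
+and has the property that if $\hat{\beta}^* := \hat{\beta} - b(\beta)$ then:
+
+\[ E_\beta(\hat{\beta}^* - \beta) = O(n^{-3/2}). \]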
+
+The strengths usually claimed of such a result are that, for a worker fluent in the language of
+information geometry, it is explicit, insightful as to the underlying structure and of clear utility in
+statistical practice. We agree entirely. However, the overwhelming evidence of the literature is that,
+while the benefits of such inferential improvements are widely acknowledged in principle, in practice,
+the overhead of first becoming fluent in information geometry prevents their routine use. As a result, a
+great number of powerful results of practical importance lie severely underused, locked away behind
+notational and conceptual bars.
+
+Entropy 2014, 16, 2454–2471; doi:10.3390/e16052454
+www.mdpi.com/journal/entropy
+
+This paper proposes that this problem can be addressed computationally by the development
+of what we call computational information geometry. This gives a mathematical and numerical
+computational framework in which the results of information geometry can be encoded as “black-box”
+numerical algorithms, allowing direct access to their power. Essentially, this works by exploiting the
+structural properties of information geometry, which are such that all formulae can be expressed in
+terms of four fundamental building blocks: defined and detailed in Amari [? ], these are the +1 and −1
+geometries, the way that these are connected via the Fisher information and the foundational duality
+theorem. Additionally, computational information geometry enables a range of methodologies and
+insights impossible without it; notably, those deriving from the operational, universal model space,
+which it affords; see, for example, [? ? ? ].
+The paper is structured as follows. Section 2 looks at the case of distributions on a finite number
+of categories where the extended multinomial family provides an exhaustive model underlying the
+corresponding information geometry. Since the aim is to produce a computational theory, a finite
+representation is the ultimate aim, making the results of this section of central importance. The
+paper also emphasises how the simplicial structures introduced here are foundational to a theory of
+computational information geometry. Being intrinsically constructive, a simplicial approach is useful
+both theoretically and computationally. Section 3 looks at how simplicial structures, defined for finite
+dimensions, can be extended to the infinite dimensional case.
+
+2. Finite Discrete Case
+
+2.1. Introduction
+
+This section shows how the results of classical information geometry can be applied in a purely
+computational way. We emphasise that the framework developed here can be implemented in
+a purely algorithmic way, allowing direct access to a powerful information geometric theory of
+practical importance.
+The key tool, as explained in [? ], is the simplex:
+
+\[ \Delta^k := \Big\{ \pi = (\pi_0, \pi_1, \dots, \pi_k)^\top : \pi_i \ge 0,\ \sum_{i=0}^{k} \pi_i = 1 \Big\}, \tag{1} \]
+
+with a label associated with each vertex. Here, k is chosen to be sufficiently large, so that any statistical
+model—by which we mean a sample space, a set of probability distributions and selected inference
+problem—can be embedded. The embedding is done in such a way that all the building blocks
+of information geometry (i.e., manifold, affine connections and metric tensor) can be numerically
+computed explicitly. Within such a simplex, we can embed a large class of regular exponential families;
+see [? ] for details. This class includes exponential family random graph models, logistic regression,
+log-linear and other models for categorical data analysis. Furthermore, the multinomial family on k + 1
+categories is naturally identified with the relative interior of this space, int(Δk), while the extended
+family, Equation (??), is a union of distributions with different support sets.
+This paper builds on the theory of information geometry following that introduced by [? ] via
+the affine space construction introduced by [? ] and extended by [? ]. Since this paper concentrates
+on categorical random variables, the following definitions are appropriate. Consider a finite set of
+disjoint categories or bins B = {Bi}i∈A. Any distribution over this finite set of categories is defined
+by a set, {πi}i∈A, which defines the corresponding probabilities. With “mix” connoting mixtures of
+distributions, we have:
+
+Definition 1. The $-1$-affine space structure over distributions on $B := \{B_i\}_{i \in A}$ is $(X_{mix}, V_{mix}, +)$ where:
+
+\[ X_{mix} = \Big\{ \{x_i\}_{i \in A} \;\Big|\; \sum_{i \in A} x_i = 1 \Big\}, \qquad V_{mix} = \Big\{ \{v_i\}_{i \in A} \;\Big|\; \sum_{i \in A} v_i = 0 \Big\}, \]
+
+and the addition operator, $+$, is the usual addition of sequences.
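+
+A small numerical illustration of Definition 1 may help; the three-bin probabilities below are made-up values chosen only for illustration. Take $A = \{0, 1, 2\}$ and
+
+\[ x = (0.2,\ 0.3,\ 0.5), \qquad x' = (0.5,\ 0.25,\ 0.25), \qquad v = x' - x = (0.3,\ -0.05,\ -0.25). \]
+
+Both $x$ and $x'$ lie in $X_{mix}$, and $v \in V_{mix}$ since $0.3 - 0.05 - 0.25 = 0$. The affine line through $x$ in direction $v$ is the mixture
+
+\[ x + \rho\, v = (1 - \rho)\, x + \rho\, x', \]
+
+whose components sum to one for every real $\rho$. They are all non-negative, and hence define a distribution, only for $-2/3 \le \rho \le 2$, which is why the set of distributions is a convex subset of the affine space rather than the whole space.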
+
+In Definition ??, the space of (discretised) distributions is a −1-convex subspace of the affine
+space, (Xmix, Vmix, +). A similar affine structure for the +1-geometry, once the support has been fixed,
+can be derived from the definitions in [? ].
+
+2.2. Examples
+
+Examples ?? and ?? are used for illustration. The second of these is a moderately high dimensional
+family, where the way that the boundaries of the simplex are attached to the model is of great
+importance for the behaviour of the likelihood and of the maximum likelihood estimate. In general,
+working in a simplex, boundary effects mean that standard first order asymptotic results can fail,
+while the much more flexible higher order methods can be very effective. The other example is a
+continuous curved exponential family, where both higher order asymptotic sampling theory results
+and geometrically-based dimension reduction are described.
+
+Example 2. The paper [? ] models survival times for leukaemia patients. These times, recorded in days, start
+at the time of diagnosis, and there are 43 observations; see [? ] for details. We further assume that the data is
+censored at a fixed value. It was observed that a censored exponential distribution gives a reasonable, but not
+exact, fit. As discussed in [? ], this gives a one-dimensional curved exponential family inside a two-dimensional
+regular exponential family of the form:
+
+\[ \exp\left[ \lambda_1 x + \lambda_2 y - \log\left( \frac{1}{\lambda_2}\left(e^{\lambda_2 t} - 1\right) + e^{\lambda_1 + \lambda_2 t} \right) \right], \qquad (2) \]
+
+where y = min(z, t) and x = I(z ≥ t), and the embedding map is given by (λ1(θ), λ2(θ)) = (− log θ, −θ).
+As shown in [? ], the loss due to discretisation can be made arbitrarily small for all information geometry
+objects. Thus, for example, using this computational approach, it is straightforward to compute the bias
+correction described in Example ??. Each of the terms in the asymptotic bias, i.e., the metric, $g_{ij}$, its inverse, $g^{ij}$,
+the Christoffel symbols, $\Gamma^{(-1)}_{ijk}$, and curvature term, $h^{(-1)}$, can be directly numerically coded as appropriate finite
+difference approximations to derivatives. Thus, “black-box” code can directly calculate the numerical value of
+the asymptotic bias, and this numerical value can then be used by those who are not familiar with information
+geometry. For example this calculation establishes the fact that, with this particular data set, the sample size is
+such that the bias is inferentially unimportant.
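+
+As a consistency check on Equation (2), the embedding map can be substituted into the log-normalising term. This is an illustrative calculation rather than part of the original argument, and the symbol $\psi$ for that term is introduced here only for convenience.
+
+\[ \psi(\lambda_1, \lambda_2) := \log\left( \frac{1}{\lambda_2}\left(e^{\lambda_2 t} - 1\right) + e^{\lambda_1 + \lambda_2 t} \right) \]
+\[ (\lambda_1, \lambda_2) = (-\log\theta, -\theta) \;\Rightarrow\; \frac{1}{\lambda_2}\left(e^{\lambda_2 t} - 1\right) = \frac{1 - e^{-\theta t}}{\theta}, \qquad e^{\lambda_1 + \lambda_2 t} = \frac{e^{-\theta t}}{\theta} \]
+\[ \psi = \log\frac{1}{\theta} = -\log\theta, \qquad \exp\left[\lambda_1 x + \lambda_2 y - \psi\right] = \theta^{1-x} e^{-\theta y} \]
+
+So the density is $\theta e^{-\theta z}$ for an uncensored observation ($x = 0$, $y = z < t$) and $e^{-\theta t}$ for a censored one ($x = 1$, $y = t$), which is the censored exponential model.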
+
+Figure 1. Undirected graphical model showing the cyclic graph of order four.
+
+Example 3. The paper [? ] discusses an undirected graphical model based on the cyclic graph of order four,
+shown in Figure ??, with binary random variables at each node. Without any constraints, there are 16 possible
+values for the graph, so model space can be thought of as a 15-dimensional simplex, including the relative
+boundary. However, the conditional independence relations encoded by the graph impose linear constraints in the
+natural parameters of the exponential family. Thus, the resultant model is a lower dimensional full exponential
+family and its closure.
+As described in [? ], the four cycle model is a seven dimensional exponential family, which is a +1-affine
+subspace of the +1-affine structure of the 15-dimensional simplex. The model can be written in the form:
+
+\[ \left( \frac{\pi_i \exp\left\{ \sum_{h=1}^{8} \eta_h v_{hi} \right\}}{\sum_{j=0}^{15} \pi_j \exp\left\{ \sum_{h=1}^{8} \eta_h v_{hj} \right\}} \right)_{i=0}^{15} \qquad (3) \]
+
+for a given set of linearly independent vectors $\{v_h\}_{h=1}^{8}$. The existence of the maximum likelihood estimate for
+η = (ηh) will depend on how the limit points of Model (??) meet the observed face of Δ15; that is, the span of the
+vertices (bins) having positive counts. Thus, a key computational task is to learn how a full exponential family,
+defined by a representation of the form of (??), is attached to boundary sub-simplices of the high-dimensional
+embedding simplex.
+In order to visualise the geometric aspects of this problem, consider a lower dimensional version. Define
+a two-dimensional full exponential family by the vectors v1 = (1, 2, 3, 4), v2 = (1, 4, 9, −1) and the uniform
+distribution base point, πi, embedded in the three-dimensional simplex. The two-dimensional family is defined
+by the +1-affine space through (0.25, 0.25, 0.25, 0.25) spanned by the space of vectors of the form:
+
+α(1, 2, 3, 4) + β(1, 4, 9, −1) = (α + β, 2α + 4β, 3α + 9β, 4α − β).
+
+Consider directions from the origin obtained by writing α = θβ, giving, for each θ, a one-dimensional, full
+exponential family parameterized by β in the direction β(θ + 1, 2θ + 4, 3θ + 9, 4θ − 1). The aspect of this vector,
+which determines the connection to the boundary, is the rank order of its elements. For example, suppose the first
+component was the maximum and the last the minimum. Then, as β → ±∞, this one-dimensional family will
+be connected to the first and fourth vertex of the embedding four simplex, respectively. Note that changing the
+value of θ changes the rank structure, as illustrated in Figure ??. This plot shows the four element-wise linear
+functions of θ (dashed lines) and the salient overall feature of their rank order; that is, their upper and lower
+envelopes (solid lines). From this analysis of the envelopes of a set of linear functions, it can be seen that the
+function 2θ + 4 is redundant. The consequence of this is shown in Figure ??, which shows a direct computation
+of the two-dimensional family. It is clear that, indeed, only three of the four vertexes have been connected by
+the model.
+In general, the problem of finding the limit points in full exponential families inside simplex models is a
+problem of finding redundant linear constraints. As shown in [? ], this can be converted, via convex duality, into
+the problem of finding extremal points in a finite dimensional affine space. In the four-cycle model, this technique
+can construct all sub-simplices containing limit points of the four-cycle model. For example, it can be shown
+that all of the 16 vertices are part of the boundary. Once the boundary points have been identified as necessary
+and sufficient, conditions for the existence of the maximum likelihood in the +1-parameters can easily be found
+computationally [? ].
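+
+The redundancy of 2θ + 4 claimed in Example 3 can be checked by hand. The following is an illustrative calculation, not part of the original argument; the labels $f_1, \dots, f_4$ for the four component functions are introduced here.
+
+\[ f_1 = \theta + 1, \quad f_2 = 2\theta + 4, \quad f_3 = 3\theta + 9, \quad f_4 = 4\theta - 1 \]
+\[ f_2 \ge f_1 \iff \theta \ge -3, \qquad f_2 \ge f_3 \iff \theta \le -5 \quad \Rightarrow \quad f_2 \text{ is never the maximum} \]
+\[ f_2 \le f_1 \iff \theta \le -3, \qquad f_2 \le f_4 \iff \theta \ge 5/2 \quad \Rightarrow \quad f_2 \text{ is never the minimum} \]
+\[ \max_j f_j = \begin{cases} f_1, & \theta \le -4 \\ f_3, & -4 \le \theta \le 10 \\ f_4, & \theta \ge 10 \end{cases} \qquad \min_j f_j = \begin{cases} f_4, & \theta \le 2/3 \\ f_1, & \theta \ge 2/3 \end{cases} \]
+
+Only the first, third and fourth components ever attain an envelope, which agrees with the statement that only three of the four vertices are connected by the model.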
+
+Figure 2. The envelope of a set of linear functions. Functions, dashed lines; envelope, solid lines.
+
+Figure 3. Attaching a two-dimensional example to the boundary of the simplex.
+
+2.3. Tensor Analysis and Numerical Stability
+
+One of the most powerful sets of results from classical information geometry is the way that
+geometrically-based tensor analysis is perfect for use in multi-dimensional higher order asymptotic
+analysis; see [? ] or [? ]. The tensorial formulation does, however, present a couple of problems in
+practice. For many, its very tight and efficient notational aspects can obscure rather than enlighten,
+while the resulting formulae tend to have a very large number of terms, making them rather
+cumbersome to work with explicitly. These are not problems at all for the computational approach
+described in this paper. Rather, the clarity of the tensorial approach is ideal for coding, where large
+numbers of additive terms, of course, are easy to deal with.
+Two more fundamental issues, which the global geometric approach of this paper highlights,
+concern numerical stability. The ability to invert the Fisher information matrix is vital in most tensorial
+formulae, and so understanding its spectrum, discussed in Section ??, is vital. Secondly, numerical
+underflow and overflow near boundaries require careful analysis, and so, understanding the way
+that models are attached to the boundaries of the extended multinomial models is equally important.
+The four-cycle model, to which we now return, illustrates computational information geometry doing
+this effectively.
+
+Example 4. The multivariate Edgeworth approximation to the sampling distribution of part of the sufficient
+statistic for the four-cycle model is shown in Figure ??. Using the techniques described above, a point near the
+boundary of the 15-simplex has been selected as the data generation process. For illustration, we focus on the
+marginal distribution of two components of the sufficient statistic, though any number could have been chosen.
+The boundary forces constraints on the range of the sufficient statistics, shown by the dashed line in the plot.
+The points, jittered for clarity, show the distribution computed by simulation. It is typical that such boundary
+constraints prevent standard first order methods from performing well, but the greater flexibility of higher
+order methods can be seen to work well here. As discussed above, methods, such as the multivariate Edgeworth
+expansion, can be strongly exploited in a computational framework, such as ours. Note, the discretization that
+can be observed in the figure is extensively discussed in [? ].
+
+Figure 4. Using the Edgeworth expansion near the boundary of four-cycle model.
+
+2.4. Spectrum of Fisher Information
+
+We focus now on the second numerical issue identified above. In any multinomial, the Fisher
+information matrix and its inverse are explicit. Indeed, the 0-geodesics and the corresponding geodesic
+distance are also explicit; see [? ] or [? ]. However, since the simplex glues together multinomial
+structures with different supports and the computational theory is in high dimensions, it is a fact
+that the Fisher information matrix can be arbitrarily close to being singular. It is therefore of central
+interest that the spectral decomposition of the Fisher information itself has a very nice structure, as
+shown below.
+
+Example 5. Consider a multinomial distribution based on 81 equal width categories on [−5, 5], where the
+probability associated to a bin is proportional to that of the standard normal distribution for that bin. The Fisher
+information for this model is an 80 × 80 matrix, whose spectrum is shown in Figure ??. By inspection, it can
+be seen that there are exponentially small eigenvalues, so that while the matrix is positive definite, it is also
+arbitrarily close to being singular. Furthermore, it can be seen that the spectrum has the shape of a half-normal
+density function and that the eigenvalues seem to come in pairs. These facts are direct consequences of the general
+results below.
+
+With $\pi_{-0}$ denoting the vector of all bin probabilities, except $\pi_0$, we can write the Fisher
+information matrix (in the +1 form) as N times:
+
+\[ I(\pi) := \mathrm{diag}(\pi_{-0}) - \pi_{-0}\pi_{-0}^{T}. \]
+
+This has an explicit spectral decomposition, which can be computed by using interlacing
+eigenvalue results (see, for example, [? ], Chapter 4). In particular, if the diagonal matrices,
+$\mathrm{diag}(\pi_1, \dots, \pi_k)$ and $\mathrm{diag}(\lambda_1 I_{m_1} | \cdots | \lambda_g I_{m_g})$, agree up to a row-and-column permutation, where $g > 1$
+and $\lambda_1 > \cdots > \lambda_g > 0$, then $I(\pi)$ has ordered spectrum:
+
+\[ \lambda_1 > \tilde\lambda_1 > \cdots > \lambda_g > \tilde\lambda_g \ge 0, \qquad (4) \]
+
+with $\tilde\lambda_g > 0 \iff \pi_0 > 0$, each $\lambda_i$ having multiplicity $m_i - 1$, while each $\tilde\lambda_i$ is simple.
+
+Figure 5. Spectrum of the Fisher information matrix of a discretised normal distribution (eigenvalues plotted against rank).
+
+We give a complete account of the spectral decomposition (SpD) of I(π). There are four cases to
+consider, the last having the generic spectrum of (??). Without loss, after permutation, assume now
+π1 ≥ · · · ≥ πk. The four cases are:
+
+Case 1 For some $l < k$, the last $k - l$ elements of $\pi_{-0}$ vanish: the sub-case $l = 0 \iff \pi_0 = 1 \iff I(\pi) = 0$
+is trivial. Otherwise, writing $\pi_+ = (\pi_1, \dots, \pi_l)^T$ and $\Pi_+ = \mathrm{diag}(\pi_+)$, the SpD of:
+
+\[ I(\pi) = \begin{pmatrix} \Pi_+ - \pi_+\pi_+^T & 0 \\ 0 & 0 \end{pmatrix} \]
+
+follows at once from that of $\Pi_+ - \pi_+\pi_+^T$, given below.
+Case 2 k = 1: this case is trivial.
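+Case 3 $k > 1$, $\pi = \lambda 1_k$, $\lambda > 0$: the SpD of $I(\pi)$ is:
+
+\[ \lambda C_k + \lambda(1 - k\lambda) J_k \]
+
+where $C_k = I_k - J_k$ and $J_k = k^{-1} 1_k 1_k^T$. Here, $\lambda$ has multiplicity $k - 1$ and eigenspace $[\mathrm{Span}(1_k)]^{\perp}$,
+while $\tilde\lambda := \lambda(1 - k\lambda)$ has multiplicity one and eigenspace $\mathrm{Span}(1_k)$. In particular, since
+$1 - \pi_0 = k\lambda$, it follows that:
+
+\[ I(\pi) \text{ is singular} \iff \pi_0 = 0. \]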
+
+Case 4 $\pi_{-0} = (\lambda_1 1_{m_1}^T | \dots | \lambda_g 1_{m_g}^T)^T$, $g > 1$ and $\lambda_1 > \cdots > \lambda_g > 0$:
+
+This is the generic case, having the spectrum of (??) above. Denoting by $O_m$ the zero matrix of
+order $m \times m$ and by $P(\nu)$ the rank one orthogonal projector onto $\mathrm{Span}(\nu)$, ($\nu \ne 0$), the SpD is:
+
+\[ \sum_{i=1,\, m_i>1}^{g} \lambda_i \,\mathrm{diag}(O_{m_{i-}}, C_{m_i}, O_{m_{i+}}) + \sum_{i=1}^{g} \tilde\lambda_i \, P\left( \left( \frac{\lambda_1}{\tilde\lambda_i - \lambda_1} 1_{m_1}^T, \dots, \frac{\lambda_g}{\tilde\lambda_i - \lambda_g} 1_{m_g}^T \right)^T \right), \]
+
+where: $m_{i-} = \sum\{m_j \mid j < i\}$, $m_{i+} = \sum\{m_j \mid j > i\}$ and the $\tilde\lambda_i$ are the zeros of:
+
+\[ h(\tilde\lambda) := 1 + \sum_{i=1}^{g} \frac{m_i \lambda_i^2}{\tilde\lambda - \lambda_i} = \Big(1 - \sum_{i=1}^{g} m_i \lambda_i\Big) + \tilde\lambda \left( \sum_{i=1}^{g} \frac{m_i \lambda_i}{\tilde\lambda - \lambda_i} \right). \]
+
+In particular, $\{\tilde\lambda_i : i = 1, \dots, g\}$ are simple eigenvalues satisfying (??) while, whenever $m_i > 1$,
+$\lambda_i$ is also an eigenvalue having multiplicity $m_i - 1$. Further, expanding $\det(I(\pi))$, we again find:
+
+\[ I(\pi) \text{ is singular} \iff \pi_0 = 0, \]
+
+so that $\tilde\lambda_g > 0 \iff \pi_0 > 0$, as claimed. Finally, we note that each $\tilde\lambda_i$ ($i < g$) is
+typically (much) closer to $\lambda_i$ than to $\lambda_{i+1}$. For, considering the graph of $x \to 1/x$,
+$h\left((\lambda_i + \lambda_{i+1})/2 + \delta(\lambda_i - \lambda_{i+1})/2\right)$ ($-1 < \delta < +1$) is well-approximated by:
+
+\[ 1 - \frac{2 m_i \lambda_i^2}{(\lambda_i - \lambda_{i+1})(1 - \delta)} + \frac{2 m_{i+1} \lambda_{i+1}^2}{(\lambda_i - \lambda_{i+1})(1 + \delta)} \]
+
+whose unique zero $\delta^*$ over $(-1, 1)$ is positive whenever, as will typically be the case, $m_i = m_{i+1}$
+(both will usually be one), while $(m_i \lambda_i + m_{i+1} \lambda_{i+1}) < 1/2$. Indeed, a straightforward analysis
+shows that, for any $m_i$ and $m_{i+1}$, $\delta^* = 1 + O(\lambda_i)$ as $\lambda_i \to 0$.
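+
+Returning to Case 3, the stated decomposition can be verified in one line. The calculation below is an illustrative sketch; the numerical values $k = 3$, $\lambda = 0.2$ are chosen here purely as an example and do not come from the paper.
+
+\[ I(\pi) = \lambda I_k - \lambda^2 1_k 1_k^T = \lambda (C_k + J_k) - k\lambda^2 J_k = \lambda C_k + \lambda(1 - k\lambda) J_k \]
+
+Since $C_k$ and $J_k$ are complementary orthogonal projectors of ranks $k - 1$ and $1$, the eigenvalues can be read off directly. For $k = 3$ and $\lambda = 0.2$ (so that $\pi_0 = 1 - k\lambda = 0.4$):
+
+\[ I(\pi) 1_3 = (0.2 - 3 \times 0.04)\, 1_3 = 0.08 \cdot 1_3, \qquad I(\pi) v = 0.2\, v \ \text{ for } v \perp 1_3, \]
+
+giving eigenvalue $0.2$ with multiplicity two and the simple eigenvalue $\tilde\lambda = 0.2 \times 0.4 = 0.08$.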
+
+2.5. Total Positivity and Local Mixing
+
+Mixture modelling is an exemplar of a major area of statistics in which computational information
+geometry enables distinctive methodological progress. The −1-convex hull of an exponential family is
+of great interest, mixture models being widely used in many areas of statistical science. In particular,
+they are explored further in [? ]. Here, we simply state the main result, a simple consequence of the
+total positivity of exponential families [? ], that, generically, convex hulls are of maximal dimension. In
+this result, “generic” means that the +1 tangent vector, which defines the exponential family, has
+components that are all distinct.
+
+Theorem 1. The −1-convex hull of an open subset of a generic one-dimensional exponential family is of full
+dimension.
+
+Proof. For any $(\pi_i) \in \Delta^k$ with each $\pi_i > 0$, $\theta_0 < \cdots < \theta_k$ and $s_0 < \cdots < s_k$, let $B = (\pi(\theta_0), \dots, \pi(\theta_k))$
+have general element:
+
+\[ \pi_i(\theta_j) := \pi_i \exp[s_i \theta_j - \psi(\theta_j)]. \]
+
+Further, let $\tilde B = B - \pi(\theta_0) 1_{k+1}^T$, whose general column is $\pi(\theta_j) - \pi(\theta_0)$. Then, it suffices to show that
+$\tilde B$ has rank $k$. However, using [? ] (p. 33), $\mathrm{Rank}(\tilde B) = \mathrm{Rank}(B) - 1$, so that:
+
+\[ \mathrm{Rank}(\tilde B) = k \iff B \text{ is nonsingular} \iff B^* \text{ is nonsingular}, \]
+
+where $B^* = (\exp[s_i \theta_j])$. It suffices, then, to recall [? ] that $K(x, y) = \exp(xy)$ is strictly total positive (of
+order ∞), so that $\det B^* > 0$.
+
+3. Infinite Dimensional Structure
+
+This section will start to explore the question of whether the simplex structure, which describes
+the finite dimensional space of distributions, can extend to the infinite dimensional case. We examine
+some of the differences with the finite dimensional case, illustrating them with clear, commonly
+occurring examples.
+
+3.1. Infinite Dimensional Information Geometry: A Review
+
+In the previous sections, the underlying computational space is always finite dimensional. This
+section looks at issues related to an infinite dimensional extension of the theory in that paper. There
+is a great deal of literature concerning infinite dimensional statistical models. The discussion here
+concentrates on information geometric, parametrisation and boundary issues.
+The information geometry theory of Amari [? ] has a geometric foundation, where statistical
+models (typically full and curved exponential families) have a finite dimensional manifold structure.
+When considering the extension to infinite dimensional cases, Amari notes the problem of finding an
+“adequate topology” [? ] (p. 93). There has been very interesting work following up this topological
+challenge. By concentrating on distributions with a common support, the paper [? ] uses the geometry
+of a Banach manifold, where local patches on the manifold are modelled by Banach spaces, via
+the concept of an Orlicz space. This gives a structure that is analogous to an infinite dimensional
+exponential family, with mean and natural parameters and including the ability to define mixed
+parametrisations. One drawback of this Banach structure, as pointed out in [? ], is that the likelihood
+function with finite samples is not continuous on the manifold. Fukumizu uses a reproducing kernel
+Hilbert space structure rather than a Banach manifold, which is a stronger topology. There are strong
+connections between the approach taken in [? ] and the material in Section ??; we note two issues here:
+(1) a focus on the finite nature of the data; and (2) using a Hilbert structure defined by a cumulant
+generating function. The approaches differ in that [? ] uses a manifold approach rather than the
+simplicial complex as the fundamental geometric object. There is also other work that explicitly used
+infinite dimensional Hilbert spaces in statistics, a good reference being [? ].
+In this paper, in contrast to previous authors, a simplicial, rather than a manifold-based, approach
+is taken. This allows distributions with varying support, as well as closures of statistical families to be
+included in the geometry. Another difference in approach is the way in which geometric structures
+are induced by infinite dimensional affine spaces rather than by using an intrinsic geometry. This
+approach was introduced by [? ] and extended by [? ]. Spaces of distributions are convex subsets of
+the affine spaces, and their closure within the affine space is key to the geometry.
+In exponential families, the −1-affine structure is often called the mean parametrisation, and
+using moments as parameters is one very important part of modelling. In the infinite dimensional
+case, the use of moments as a parameter system is related to the classical moment problem—when
+does there exist a (unique) distribution whose moments agree with a given sequence?—which has
+generated a vast literature in its own right; see [? ? ? ]. In general terms, the existence of a solution
+to the moment problem is connected to positivity conditions on moment matrices. Such conditions
+have been used in connection to the infinite dimensional geometry of mixture models [? ]. Uniqueness,
+however, is a much more subtle problem: sufficient conditions can be formulated in terms of the rate
+of growth of the moments [? ]. Counter examples to general uniqueness results include the log-normal
+distribution [? ].
+The geometry of the Fisher information is also much more complex in general spaces of
+distributions than in exponential families. Simple mixture models, including two-component mixtures
+of exponential distributions [? ], can have “infinite” expected Fisher information, which gives rise to
+non-standard inference issues. Similar results on infinitely small (and large) eigenvalues of covariance
+operators are also noted in [? ]. Since the Fisher information is a covariance, the fact that it does not
+exist for certain distributions or that its spectrum can be unbounded above or arbitrarily close to zero
+is not a surprise. However, these observations do need to be taken into account when considering the
+information geometry of infinite dimensional spaces.
+The rest of this section looks at the topology and geometry of the infinite dimensional simplex
+and gives some illustrative examples, which, in particular, show the need for specific Hilbert space
+structures, discussed in the final section.
+
+3.2. Topology
+
+For simplicity and concreteness, in this section, we will be looking at models for real valued
+random variables. In this paper, we restrict attention to the cases where the sample space is R+ or R
+and has been discretised to a countably infinite set of bins, Bi, with i ∈ N or Z, respectively. In the
+finite case, the basic object is the standard simplex, Δk, with k + 1 bins. We generalise this to countable
+unions of such objects. Of these, one is of central importance, denoted by Δemp or simply Δ, because it
+is the smallest object that contains all possible empirical distributions.
+
+Definition 2. For any finite subset of bins, indexed by I ⊂ N or Z, denote
+
+\[ \Delta_I = \Big\{ x = (x_i)_{i \in I} : x_i \ge 0, \ \sum_{i \in I} x_i = 1 \Big\}. \]
+
+We take the union of all such sets $\bigcup_{|I| < \infty} \Delta_I$, where $|I|$ denotes the number of elements of the index set. This
+can always be written as:
+
+\[ \Delta = \Big\{ x = (x_i)_{i \in Z} : \sum_{i \in Z} x_i = 1, \ x_i \ge 0 \text{ and only finitely many } x_i > 0 \Big\}. \]
+
+In what follows, it is important to note that for any given statistical inference problem, the sample
+size, n, is always finite, even if we frequently use asymptotic approximations, where n → ∞. Thus, the
+data, as represented by the empirical distribution, naturally lie in the space, Δ. However, many models,
+used in the given inference problem, will have support over all bins, so the models most naturally
+lie in the “boundary” constructed using the closures of the set. These objects are subsets of sequence
+spaces, and the corresponding topologies can be constructed from the Banach spaces, ℓp, p ∈ [1, ∞].
+The following results follow directly from explicit calculations, where we note that in this section, since
+all terms are non-negative, convergence always means absolute convergence. In particular, arbitrary
+rearrangements of series do not affect the existence of limits or their values.
+
+Example 6. Consider the sequence of “uniform distributions” $x^{(n)} = (\frac{1}{n}, \dots, \frac{1}{n}, 0, \dots)$ as elements of Δ. This
+has an ℓp limit of the zero sequence for p ∈ (1, ∞].
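+
+The limit in Example 6 follows from a direct norm calculation, sketched here for illustration on the assumption that $x^{(n)}$ has exactly $n$ non-zero entries, each equal to $1/n$:
+
+\[ \|x^{(n)}\|_p = \left( n \cdot n^{-p} \right)^{1/p} = n^{\frac{1}{p} - 1} \to 0 \quad (1 < p < \infty), \qquad \|x^{(n)}\|_\infty = \frac{1}{n} \to 0, \qquad \|x^{(n)}\|_1 = 1 \ \text{ for all } n. \]
+
+The last identity shows why the case $p = 1$ is excluded: the zero sequence is not an ℓ1 limit of $x^{(n)}$.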
+
+Proposition 1. The ℓp extreme points of Δ, for p ∈ (1, ∞], are the zero sequence and the sequences, $\delta_i$ (i ∈ Z),
+with one as the i-th element and zero elsewhere.
+
+For p ∈ [1, ∞], let Δp ⊂ ℓp denote the ℓp closure of Δ.
+
+Theorem 2. (a) Δ1 = {x = (xi)i∈Z : xi ≥ 0, ∑i∈I xi = 1} .
+(b) Δ∞ = {x = (xi)i∈Z : xi ≥ 0, ∑i∈I xi ≤ 1} .
+(c) For p ∈ (1, ∞), Δp = Δ∞ = {x = (xi)i∈Z : xi ≥ 0, ∑i∈I xi ≤ 1} .
+
+Proof. (a) It is immediate that $\{x = (x_i)_{i \in Z} : x_i \ge 0, \sum_{i \in Z} x_i = 1\} \subseteq \Delta_1$. Conversely, if $\bar x$ is a limit
+point, then all its elements must be non-negative. Finally, if $\sum_{i=1}^{\infty} \bar x_i$ is not bounded above by one,
+then there exists $N$, such that $\sum_{i=1}^{N} \bar x_i > 1 + \epsilon$ for some $\epsilon > 0$. Hence,
+
+\[ \sum_{i=1}^{\infty} |\bar x_i - x_i^{(n)}| \ge \sum_{i=1}^{N} |\bar x_i - x_i^{(n)}| \ge \sum_{i=1}^{N} \bar x_i - \sum_{i=1}^{N} x_i^{(n)} > \epsilon \]
+
+for all $n$, which contradicts convergence. If $\sum_{i=1}^{\infty} \bar x_i < 1 - \epsilon$, then
+
+\[ \sum_{i=1}^{\infty} |\bar x_i - x_i^{(n)}| \ge \sum_{i=1}^{\infty} x_i^{(n)} - \sum_{i=1}^{\infty} \bar x_i > \epsilon, \]
+
+which again contradicts convergence.
+(b) It is again immediate that $\{x = (x_i)_{i \in Z} : x_i \ge 0, \sum_{i \in Z} x_i = 1\} \subseteq \Delta_\infty$. However, by Example ??,
+the zero sequence is also in $\Delta_\infty$, so that $\{x = (x_i)_{i \in Z} : x_i \ge 0, \sum_{i \in Z} x_i \le 1\} \subseteq \Delta_\infty$.
+Conversely, by contradiction, it is easy to see that all elements of the closure must have non-negative
+elements. Finally, for any $\bar x \in \Delta_\infty$, if $\sum_{i=1}^{\infty} \bar x_i$ is not bounded above by one, there exists $N$, such
+that $\sum_{i=1}^{N} \bar x_i > 1 + \epsilon$ for some $\epsilon > 0$. For any sequence of points, $x^{(n)}$ in Δ, we have that $\sum_{i=1}^{N} x_i^{(n)} \le 1$,
+so that, for $i = 1, \dots, N$, the maximum value of $|x_i^{(n)} - \bar x_i| > \epsilon/N$. Hence, for all sequences, $x^{(n)}$, we
+have $\|x^{(n)} - \bar x\|_\infty > \epsilon/N$, which contradicts $\bar x$ being in the closure.
+(c) This follows essentially the same argument as (b) by noting in the case where $\sum_{i=1}^{\infty} \bar x_i$ is not
+bounded above by one, we have:
+
+\[ \|x^{(n)} - \bar x\|_p^p \ge \sum_{i=1}^{N} |\bar x_i - x_i^{(n)}|^p \ge N \max_{i=1,\dots,N} |x_i^{(n)} - \bar x_i|^p > N^{1-p} \epsilon^p \]
+
+for any sequences, $x^{(n)}$, which contradicts $\bar x$ being in the closure.
+
+It is immediate that the spaces, Δ and Δ1, are convex subsets of ℓ1 and that Δ∞ is a convex set in
+ℓ∞.
+
+3.3. Geometry
+
+In the same way as for the finite case, the −1-geometry can be defined using an affine space
+structure using the following definition.
+
+Definition 3. Let I be a countable index set which is a subset of Z. The −1-affine space structure over
+distributions is (Xmix, Vmix, +), where:
+
+\[ X_{mix} = \Big\{ x = (x_i) \;\Big|\; \sum_I x_i = 1, \ \sum_I |x_i| < \infty \Big\}, \qquad V_{mix} = \Big\{ v = (v_i) \;\Big|\; \sum_I v_i = 0, \ \sum_I |v_i| < \infty \Big\}, \]
+
+and $x + v = (x_i + v_i)$.
+
+In order to define the +1-geometric structure, we also follow the approach used in the finite case.
+Initially, to understand the +1- structure, consider the case where all distributions have a common
+support, i.e., assume πi > 0 for all i. We follow here the approach of [? ].
+
+Definition 4. Consider the set of non-negative measures on N or Z and the equivalence relation defined by:
+
+{ai} ∼ {bi} ⇐⇒ ∃λ > 0 s.t. ∀i ai = λbi.
+
+The equivalence classes of this relation are the points in the +1 geometry.
+These points can be further partitioned into sets with the same support, i.e., supp(⟨a⟩) = {i : ai > 0},
+where this is clearly well-defined.
+
+On sets of +1-points with the same support, we can define the +1-geometry in the same way as
+in the finite case. With “exp” connoting an exponential family distribution, we have:
+
+Definition 5. For a given index set, I, define Xexp to be all +1-points whose support equals I, and define the
+vector space Vexp = {vi, i ∈ I}. Then Xexp, together with the operation, ⊕, defined by:
+
+\[ \langle x_i \rangle \oplus v_i = \langle x_i \exp(v_i) \rangle, \]
+
+is an affine space. The +1-affine structure is then defined by (Xexp, Vexp, ⊕).
+
+Theorem 3. If a and b lie in Δ (or Δ1) and have the same support, then $C(\rho) = \sum_i a_i^{\rho} b_i^{(1-\rho)} < \infty$ for
+$\rho \in [0, 1]$. Hence, $a_i^{\rho} b_i^{(1-\rho)} / C(\rho) \in \Delta$ (or $\Delta_1$).
+
+Proof. Since a, b are absolutely convergent, the sequence, max(ai, bi), is also. Since we have:
+
+\[ 0 \le \min(a_i, b_i) \le a_i^{\rho} b_i^{1-\rho} \le \max(a_i, b_i) \]
+
+it follows that C(ρ) < ∞, and we have the result.
+
+This result shows that sets in Δ1 with the same support are +1-convex, just as the faces in the
+finite case are.
+
+3.4. Examples
+
+In order to get a sense of how the +1-geometry works, let us consider a few illustrative examples.
+
+Example 7. If we denote the discretised standard normal density by a and the discretised Cauchy density by b
+and consider the path:
+
+\[ \frac{a_i^{\rho} b_i^{(1-\rho)}}{C(\rho)}, \]
+
+the normalising constant is shown in Figure ??. We see that at ρ = 0 (the Cauchy distribution), we have that
+the derivative of the normalising constant (i.e., the mean of the sufficient statistic) is tending to infinity. At the
+other end (ρ = 1), the model can be extended in the sense that the distribution exists for values greater than one.
+
+Figure 6. Normalising constant for normal-Cauchy exponential mixing example.
+
+Thus, in this example, the path joining the two distributions is an extended, rather than natural,
+exponential family, since we have to include the boundary point where the mean is unbounded.
+
+Example 8. Let us return to Example ??, but now without the censoring. Thus, now, there is a countably
+infinite set of bins, and so, we can investigate its embedding in the infinite simplex. As discussed in [? ], we shall
+discretise the continuous distribution by computing the probabilities associated to bins $[c_i, c_{i+1}]$, i = 1, 2, · · · .
+For the exponential model, Exp(θ), the bin probabilities are simply:
+
+\[ \pi_i(\theta) = \exp(-\theta c_i) - \exp(-\theta c_{i+1}). \]
+
+Using this, the model will lie in the infinite simplex on the positive half line with the index set I = N.
+First, consider the case where we have a uniform choice of discretisation, where $c_n = n \times \epsilon$ for some fixed,
+ϵ > 0. In this case, the bin probabilities can be written as an exponential family:
+
+\[ \pi_n(\theta) = \exp\left\{ -\theta\epsilon n + \log(1 - e^{-\theta\epsilon}) \right\} \]
+
+for θ > 0. This gives a +1-geodesic through $\{\pi_i(\theta_0)\}$ in the direction {ϵ × n} of the form:
+
+\[ \pi_n(\theta_0) \exp\left\{ -\lambda\epsilon n + \log\left( \frac{1 - e^{-(\lambda + \theta_0)\epsilon}}{1 - e^{-\theta_0\epsilon}} \right) \right\} \qquad (5) \]
+
+for λ > −θ0. In the case where λ → −θ0, the limiting distribution is the zero measure in Δ∞, and at the
+other extreme, where λ → ∞, the limiting distribution is the atomic distribution in the first bin, a distribution
+with a different support than πi(θ0). However, unlike the finite case, there is no guarantee that, for a given
+“direction”, {ti}, there exists a +1-geodesic starting at {πi(θ0)}, since we require the convergence of the
+normalising constant:
+
+\[ \sum_{i=0}^{\infty} \pi_i(\theta_0) \exp(\lambda t_i) < \infty. \]
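+
+The two displayed forms in Example 8 can be tied together by a short calculation. This is an illustrative sketch which assumes the uniform grid $c_n = n\epsilon$ with bins indexed by $n = 0, 1, 2, \dots$:
+
+\[ \pi_n(\theta) = e^{-\theta\epsilon n} - e^{-\theta\epsilon(n+1)} = e^{-\theta\epsilon n}\left(1 - e^{-\theta\epsilon}\right), \qquad \sum_{n=0}^{\infty} \pi_n(\theta) = \frac{1 - e^{-\theta\epsilon}}{1 - e^{-\theta\epsilon}} = 1 \]
+\[ \frac{\pi_n(\theta_0 + \lambda)}{\pi_n(\theta_0)} = e^{-\lambda\epsilon n} \, \frac{1 - e^{-(\lambda + \theta_0)\epsilon}}{1 - e^{-\theta_0\epsilon}} \]
+
+The right-hand side of the second line is exactly the exponential factor in (5), so the geodesic (5) is the discretised exponential model evaluated at $\theta_0 + \lambda$, which is a probability vector precisely when $\lambda > -\theta_0$.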
+
+From this example, we see that the limit points of exponential families can lie in the space, Δ∞,
+but not in Δ1. The next example shows that limits do not have to exist at all.
+
+Example 9. Consider the family whose bin probabilities, πi ∈ Δ∞, are proportional to a discretised standard
+normal with bins of constant width. The exponential family, which is proportional to πi exp(θi), does not have
+an ℓ∞ limit, as it is discretised normal with mean θ. The natural parameter space here is (−∞, ∞).
+
+The last illustrative example is from [? ] and shows that even for simple models, the Fisher
+information for the parameters of interest need not be finite.
+
+Example 10. Let us consider a simple example of a two-component mixture of (discretised) exponential distributions:
+
+\[ (1 - \rho)\pi_n(\theta_0 + \lambda) + \rho\pi_n(\theta_0); \qquad (6) \]
+
+the tangent vector in the ρ-direction is:
+
+\[ \pi_n(\theta_0) - \pi_n(\theta_0 + \lambda) = \pi_n(\theta_0)\left(1 - e^{-\lambda\epsilon n} C\right) \]
+
+for a positive constant, C. The corresponding squared length, with respect to the Fisher information, is:
+
+\[ \sum_{n=0}^{\infty} \left(1 - e^{-\lambda\epsilon n} C\right)^2 \pi_n(\theta_0). \]
+
+As an example, consider θ0 = 1; then, this term will be infinite for λ ≤ −0.5.
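+
+The threshold in Example 10 can be recovered from the growth rate of the summand. The following is an illustrative sketch; it assumes the uniform discretisation of Example 8, so that $\pi_n(\theta_0) = (1 - e^{-\theta_0\epsilon})\, e^{-\theta_0\epsilon n}$, and takes $-\theta_0 < \lambda < 0$:
+
+\[ \left(1 - C e^{-\lambda\epsilon n}\right)^2 \pi_n(\theta_0) \;\sim\; C^2 \left(1 - e^{-\theta_0\epsilon}\right) e^{-(\theta_0 + 2\lambda)\epsilon n} \quad \text{as } n \to \infty \]
+\[ \sum_{n=0}^{\infty} e^{-(\theta_0 + 2\lambda)\epsilon n} < \infty \iff \theta_0 + 2\lambda > 0 \iff \lambda > -\theta_0 / 2 \]
+
+With $\theta_0 = 1$, the squared length is therefore infinite for $\lambda \le -0.5$, as stated.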
+
+3.5. Hilbert Space Structures
+
+Following these examples, we can consider the Hilbert space structure of exponential families
+inside the infinite simplex with the following results.
+
+Definition 6. Define the function, S(·), by $S(\{v_i\}, \pi) = \sup_\theta \{\theta \mid \sum_I \pi_i \exp(\theta v_i) < \infty\}$, the function being
+set to ∞ when the set is unbounded. Furthermore, define for a given $\{\pi_i\} \in \bar\Delta_\infty$, the set:
+
+\[ V(\pi) = \{\{v_i\} \mid S(\{v_i\}, \pi) > 0\}, \quad \text{and} \quad V_c(\pi) = \{\{v_i\} \mid \pm\{v_i\} \in V(\pi)\}. \]
+
+The spaces, $V_c(\pi)$, correspond to the directions in which the +1-geodesic and, so, the
+corresponding exponential families are well-defined and have particularly “nice” geometric structures.
+
+Theorem 4. For $\pi$, define a Hilbert space by:
+
+\[ H(\pi) := \Big\{ \{v_i\} \;\Big|\; \sum v_i^2 \pi_i < \infty \Big\} \]
+
+with inner product:
+
+\[ \langle \{v_i\}, \{w_i\} \rangle_\pi = \sum v_i w_i \pi_i, \]
+
+and corresponding norm $\|\cdot\|_\pi$. Under these conditions:
+(i) $V_c(\pi)$ is a subspace of $H(\pi)$, and
+(ii) the set $V(\pi)$ is a convex cone.
+
+Proof. (i) First, if $\{v_i\} \in V_c(\pi)$, then by definition, the moment generating function:
+
+\[ \sum \exp(\theta v_i) \pi_i, \]
+
+is finite for θ in an open set containing θ = 0. Hence, we have both:
+
+\[ \sum v_i \pi_i < \infty, \quad \text{and} \quad \sum v_i^2 \pi_i < \infty. \]
+
+Thus, $\{v_i\} \in H(\pi)$. The fact that it is a subspace follows from (ii) below.
+(ii) It is immediate that $V(\pi)$ is a cone.
+Convexity follows from the Cauchy–Schwarz inequality, since for all $\{v_i\}, \{v_i^*\} \in V(\pi)$ and
+λ ∈ [0, 1], it follows that:
+
+\[ \Big( \sum \pi_i e^{\frac{\theta}{2}(\lambda v_i + (1-\lambda) v_i^*)} \Big)^2 = \Big( \sum \big(\sqrt{\pi_i}\, e^{\frac{\theta}{2}\lambda v_i}\big) \big(\sqrt{\pi_i}\, e^{\frac{\theta}{2}(1-\lambda) v_i^*}\big) \Big)^2 \le \Big( \sum \pi_i e^{\theta\lambda v_i} \Big) \Big( \sum \pi_i e^{\theta(1-\lambda) v_i^*} \Big), \]
+
+and, so, is finite for a strictly positive value of θ, hence
+$\{\lambda v_i + (1 - \lambda) v_i^*\} \in V(\pi)$.
+
+Hence, this result illustrates the point above regarding the existence of “nice” geometric structure
+in the sense of Amari’s information geometry developed for finite dimensional exponential families.
+Infinite dimensional families have a richer structure; for example, they include the possibility of having
+an infinite Fisher information; see Examples ?? and ??.
+
+Acknowledgments: The authors would like to thank Karim Anaya-Izquierdo and Paul Vos for many helpful
+discussions and the UK’s Engineering and Physical Sciences Research Council (EPSRC) for the support of grant
+number EP/E017878/.
+
+Author Contributions: All authors contributed to the conception and design of the study, the collection and
+analysis of the data and the discussion of the results. All authors read and approved the final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+Using Geometry to Select One Dimensional
+Exponential Families That Are Monotone Likelihood
+Ratio in the Sample Space, Are Weakly Unimodal and
+Can Be Parametrized by a Measure of
+Central Tendency
+
+Paul Vos 1 and Karim Anaya-Izquierdo 2,*
+
+1 Department of Biostatistics, East Carolina University, Greenville, NC 27858, USA; E-Mail: vosp@ecu.edu
+2 Department of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK
+* E-Mail: kai21@bath.ac.uk; Tel: +44-1225-384644
+
+Received: 30 April 2014; in revised form: 30 June 2014 / Accepted: 14 July 2014 /
+Published: 18 July 2014
+
+Abstract: One dimensional exponential families on finite sample spaces are studied using the
+geometry of the simplex $\Delta^\circ_{n-1}$ and that of a transformation $V_{n-1}$ of its interior. This transformation is
+the natural parameter space associated with the family of multinomial distributions. The space $V_{n-1}$
+is partitioned into cones that are used to find one dimensional families with desirable properties for
+modeling and inference. These properties include the availability of uniformly most powerful tests
+and estimators that exhibit optimal properties in terms of variability and unbiasedness.
+
+Keywords: simplex; cone; exponential family; monotone likelihood ratio; unimodal; duality
+
+1. Introduction
+
+The motivation for the constructions in this paper begins with a sample from a one dimensional
+space that is discrete. We allow for a continuous sample space but assume that this has been suitably
+discretized into n bins. The simplest underlying structure for the probability assigned to these bins is
+given by the multinomial distribution. The collection of all multinomial distributions can be identified
+with the n − 1 simplex $\Delta_{n-1}$. We use the geometry of the simplex along with a transformation of its
+interior $\Delta^\circ_{n-1}$ to search for one dimensional subspaces that have good properties for modeling and for
+inference. In particular, we want families that can be parameterized by the mean, have only unimodal
+distributions, have desirable test characteristics (such as providing uniformly most powerful unbiased
+tests) and estimation properties (such as unbiasedness and small variability).
+The boundary of the (n − 1) dimensional simplex Δn−1 can be written as the union of simplexes
+of dimension (n − 2). This process can be repeated on the simplexes of lower dimension until the
+boundary consists of the vertices of the original simplex. This construction has statistical relevance
+to the possible supports for the probability distributions considered on the n bins. We obtain a dual
+decomposition for a transformation $V_{n-1}$ (defined in Equation (5) in Section 5) of $\Delta^\circ_{n-1}$; it is dual in
+that the result can be obtained by replacing simplexes with cones. The statistical relevance of the
+conical decomposition is to the possible modes for all the distributions on the n bins. Since $V_{n-1}$ is
+the natural parameter space for the distributions in $\Delta^\circ_{n-1}$, one dimensional exponential families are
+lines in $V_{n-1}$ and these can be related to the cones that partition $V_{n-1}$. One result is that the limiting
+distribution for any one dimensional exponential family in $\Delta^\circ_{n-1}$ is the uniform distribution whose
+support is determined by the cone that contains the limiting values of the line corresponding to the
+exponential family.
+
+Entropy 2014, 16, 4088–4100; doi:10.3390/e16074088
+www.mdpi.com/journal/entropy
+
+While one parameter exponential families can be defined quite generally by choosing a sufficient
+statistic, it can be useful to start with the sufficient statistics from well-known families such as the
+binomial, Poisson, negative binomial, normal, inverse Gaussian, and Gamma distribution. These
+exponential families have good modeling and inferential properties that we try to maintain by limiting
+the extent to which the sufficient statistic is modified. These restrictions lead to considering vectors in
+Vn−1 that lie in a cone. Examples of how to construct these cones are given.
+
+2. Motivating Examples
+
+One dimensional exponential families such as the binomial or Poisson are the workhorse of
+parametric inference because of their excellent statistical properties. However, being one dimensional
+means they do not always fit data very well so an extension to a two (or higher) dimensional
+exponential family can be pursued in order to preserve the nice inferential structure. An issue
+with such an extension is that, for each extra natural parameter added, we need to choose a new sufficient
+statistic, and this choice can substantially change the shape of the corresponding density functions. For
+example, densities can pass from being unimodal to having multiple modes for some parameter values.
+To see this, consider the following examples.
+
+Example 1. Altham [1] considered the so-called multiplicative generalization of the binomial distribution with
+corresponding density
+
+\[ f(x; p, \varphi) = \binom{n}{x} p^x (1-p)^{n-x} \varphi^{x(x-n)} / C(p, \varphi) \tag{1} \]
+
+where C is the normalizing constant and where clearly the binomial is recovered when φ = 1.
+By reparametrizing using $\theta_1 = \log(p/(1-p))$ and $\theta_2 = \log(\varphi)$ this density can be expressed in
+exponential form as
+\[ f(x; \theta_1, \theta_2) = h(x) \exp(\theta_1 x + \theta_2 T(x) - K(\theta_1, \theta_2)) \tag{2} \]
+
+where $T(x) = x(x-n)$ is the added sufficient statistic and $h(x) = \binom{n}{x}$, where dependence on n has been ignored.
+Note that the same family is obtained if $T(x) = x^2$ is added as a sufficient statistic instead of $x(x-n)$.
+If n = 127 and $(\theta_1, \theta_2) = (-0.0122, 0.018)$ then density (2) is bimodal as shown in the left panel of Figure 1.
+The mean μ of this distribution is 50. Also plotted is the corresponding binomial density with the same mean
+or equivalently with $\theta_1 = \log(50/(127-50)) = -0.4318$ and $\theta_2 = 0$.
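+
+As a check on the remark that $T(x) = x^2$ generates the same family, the following worked step may help. The symbol $\theta_1'$ is introduced here only for illustration and is not part of the original notation.
+
+```latex
+% Expand the exponent of density (2) with T(x) = x(x - n):
+\begin{align*}
+\theta_1 x + \theta_2\, x(x-n) &= (\theta_1 - n\theta_2)\, x + \theta_2 x^2 \\
+                               &= \theta_1' x + \theta_2 x^2,
+                               \qquad \theta_1' = \theta_1 - n\theta_2 .
+\end{align*}
+% For n = 127 and (\theta_1, \theta_2) = (-0.0122, 0.018):
+%   \theta_1' = -0.0122 - 127 \times 0.018 = -2.2982
+```
+
+Under this relabelling the pair $(x, x^2)$ spans the same two dimensional family as $(x, x(x-n))$; only the first natural parameter shifts.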
+
+Figure 1. Binomial density (thick in both panels). Multiplicative binomial density (left panel and thin)
+and double binomial density (right panel and thin). All densities have the same mean μ = 50 and
+n = 127. Variance of the multiplicative and double binomial densities is equal.
+
+
+As explained by Lovison [2], this distribution has the feature of being under- or over-dispersed with
+respect to the binomial depending on θ2 being negative or positive, respectively. Furthermore, using the mixed
+parametrization (μ, θ2) (see [3] for details) it is easy to see that this distribution can be parametrized so that one
+parameter controls dispersion independently of the mean. In fact, for a fixed mean μ, as θ2 → ∞ f (x; θ1, θ2)
+tends to a two point distribution (with support points at the extremes x = 0 and x = n), and as θ2 → −∞ it
+tends to a degenerate distribution on x = μ.
+
+Example 2. Double exponential families [4] are two parameter exponential families that extend standard
+unidimensional exponential families such as the binomial and the Poisson. Similar to the multiplicative binomial
+in Example 1, the extra parameter involved in double exponential families controls the variance independently of
+the mean. The density for the so-called double binomial family can be written in the form (2) with
+
+\[ T(x) = x \log\Bigl(\frac{x}{n}\Bigr) + (n-x) \log\Bigl(1 - \frac{x}{n}\Bigr), \]
+
+$h(x) = \binom{n}{x}$ and with the particular restriction that $\theta_2 < 1$ (see [4] for details). The range $\theta_2 < 0$ generates
+underdispersion and θ2 ∈ [0, 1) generates overdispersion with respect to the binomial. As shown on the right
+panel of Figure 1, the double binomial density can also be multimodal where the double binomial density shown
+has the same mean and variance as the multiplicative binomial shown in the left panel.
+
+These examples show that while extending exponential families can lead to useful modeling
+properties such as overdispersion, the extension can also result in distributions that are not suitable
+for modeling. We are interested in the relationship between geometric properties of one dimensional
+families and the modeling properties of their distributions.
+
+3. Sample Space and Distribution-valued Random Variables
+
+We consider first the general case where the sample space for a single observation X1 consists of
+n bins
+Sn = {B1, B2, . . . , Bn−1, Bn} .
+
+We consider the space of all probability distributions P on this sample space Sn. Each probability
+distribution in P is defined by the n-tuple p whose ith component is
+
+pi = Pr(Bi)
+
+so that P can be identified with the n − 1 simplex
+
+Δn−1 = {p ∈ Rn : pi ≥ 0 ∀i, 1′p = 1}
+
+where 1 in 1′p is the vector 1 ∈ Rn each of whose components is 1. We will slightly abuse the notation
+by using p to name a point in Δn−1, and hence in Rn, as well as the corresponding distribution in P.
+The sample space for a random sample of size N from a distribution $p_0 \in \Delta_{n-1}$ is
+
+\[ \mathcal{X}_n^N = \{x : x \text{ is an } n \text{ vector of nonnegative integers that sum to } N\}. \]
+
+There is a simple relationship between $\mathcal{X}_n^N$ and the simplex that we obtain by dividing each component
+of x by N. Although the sample space $\mathcal{X}_n^N$ can be viewed as formed by compositional data, we will
+follow a different approach to handle this kind of data compared with the classical approach described
+by Aitchison [5] because the data we consider have additional structure.
+In Figure 2 the sample space for the sample of size N = 10 is displayed using open circles. The
+vertices correspond to the case where all 10 values fall in a single bin. The other points correspond
+to the less extreme cases. Let $p_0$ be any point in $\Delta_{n-1}$. By mapping the multinomial random variable
+of counts X to $\Delta_{n-1}$, we obtain the random distribution $\hat{P} = X/N$ whose values are multinomial
+distributions each having number of cases N and probability vector X/N. Identifying $\mathcal{X}_n^N$-valued
+random variables with distribution-valued random variables provides a natural means for comparing
+data with probability models using the Kullback–Leibler (KL) divergence.
+We can compare distributions in $\Delta_{n-1}$ using the KL divergence $D : P \times P \to \mathbb{R}$
+
+\[ D(p_1, p_2) = \sum p_1 \log(p_1/p_2) = H(p_1, p_2) - H(p_1) \]
+
+where $H(p_1, p_2) = -\sum p_1 \log(p_2)$ and $H(p_1) = H(p_1, p_1)$ is the entropy of $p_1$. Note that the arguments
+to D and H are distributions while the logarithm and ratios are defined on points in $\mathbb{R}^n$. Following Wu
+and Vos [6], the variance of the random distribution $\hat{P}$ is defined to be
+
+\[ \mathrm{Var}_{p_0}(\hat{P}) = \min_{p \in \Delta_{n-1}} E_{p_0} D(\hat{P}, p) \]
+
+and its mean is defined to be
+
+\[ E_{p_0}(\hat{P}) = \arg\min_{p \in \Delta_{n-1}} E_{p_0} D(\hat{P}, p). \]
+
+Note that the expectations on the right hand side of the equations above are for real-valued random
+variables while the expectation on the left hand side of the second equation is for a distribution-valued
+random variable.
+
+Figure 2. Simplex for n = 3 bins and sample space for N = 10 observations.
+
+It is not difficult to show that $E_{p_0} \hat{P} = p_0$ so that $\hat{P}$ can be considered an unbiased estimator for
+$p_0$. Details are in [6], which also shows that the KL risk can be decomposed into bias-squared and
+variance terms:
+\[ E_{p_0} D(\hat{P}, q) = D(p_0, q) + \mathrm{Var}_{p_0}(\hat{P}). \]
+
+The distributional variance is related to the entropy
+
+\[ \mathrm{Var}_{p_0}(\hat{P}) = E_{p_0} D(\hat{P}, p_0) = H(p_0) - E_{p_0} H(\hat{P}). \]
+
+Note that for N = 1, $H(\hat{P}) = 0$ so that for a single observation the random distribution $\hat{P}$ taking values
+on the vertices of $\Delta_{n-1}$ has variance equal to the entropy of $p_0$.
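+
+A short worked case may make the single-observation remark concrete. It assumes N = 1 and writes $e_i$ (a label used only here) for the vertex of the simplex that puts all mass on bin i, with $p_{0,i}$ the ith component of $p_0$.
+
+```latex
+% With N = 1 the random distribution equals the vertex e_i with probability p_{0,i},
+% and D(e_i, p_0) = \log(1 / p_{0,i}), so the distributional variance is
+\[
+  E_{p_0} D(\hat{P}, p_0) \;=\; \sum_{i=1}^{n} p_{0,i} \log\frac{1}{p_{0,i}} \;=\; H(p_0).
+\]
+% Example with n = 2 and p_0 = (1/2, 1/2):
+\[
+  \mathrm{Var}_{p_0}(\hat{P}) \;=\; \tfrac12 \log 2 + \tfrac12 \log 2 \;=\; \log 2 .
+\]
+```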
+For inference, $p_0$ is unknown but we specify a subspace $M \subset \Delta_{n-1}$ that contains $p_0$, or at
+least has distributions that are not too different from $p_0$. Estimates can be obtained by choosing a
+parameterization for M, say θ, and then considering real-valued functions $\hat{\theta}$ and evaluating these in
+terms of bias and variance. Bias and variance are useful descriptions when θ describes a feature of
+the distribution that is of inherent interest. However, if θ is simply a parameterization, or if there are
+other features that are also of interest, then these quantities are less useful. For inference regarding the
+distribution $p_0$ we can use a distribution-valued estimator $\hat{P}_M$ where the subscript indicates that the
+estimator is defined to account for the fact that $p_0 \in M$.
+We will not pursue the details of distribution-valued estimators here; we mention these only
+because all the subspaces we consider will be exponential families and in this case the maximum
+likelihood estimator has important properties in terms of distribution variance and distribution bias:
+when M is an exponential family, the maximum likelihood estimator is distribution unbiased, and it
+uniquely minimizes the distribution variance among the class of all distribution unbiased estimators.
+Furthermore, when $p_0 \notin M$ then the maximum likelihood estimator is the unique unbiased minimum
+distribution variance estimator of the distribution in M that is closest (in terms of KL) to p0. Extensions
+of one dimensional exponential families that do not result in exponential families will not enjoy these
+properties of maximum likelihood estimation. Details of these results that hold for sample spaces more
+general than Sn are in [7].
+
+4. Simplices Δs
+
+One dimensional exponential families on Sn are curves in Δn−1 whose properties will depend on
+their location within various subspaces of Δn−1. An important collection of subspaces will be indexed
+by the subsets of Sn. For notational convenience we take Bi to be the integer i. Using integers is suggestive
+of an ordering and a scale structure but at this point these are only being used to indicate distinct bins.
+For each s ⊂ Sn,
+
+\[ \Delta_s = \{p \in \mathbb{R}^n : p_i \ge 0 \ \forall i \in s, \ p_i = 0 \ \forall i \in s^c, \ 1'p = 1\} \]
+
+where $s^c = \{i \in S_n : i \notin s\}$. Note that $\Delta_{S_n} = \Delta_{n-1}$. The interior of $\Delta_s$ is
+
+\[ \Delta^\circ_s = \{p \in \Delta_s : p_i > 0 \ \forall i \in s\}. \]
+
+As probability distributions in P, $\Delta^\circ_s$ corresponds to the set of all distributions having support s. There
+is a simple and obvious relationship between the dimension of $\Delta_s$, $|\Delta_s|$, and the cardinality of s, $|s|$,
+which holds for all nonempty $s \subset S_n$:
+\[ |\Delta_s| + 1 = |s|. \]
+
+The boundary of $\Delta_s$ is defined as
+
+\[ \partial\Delta_s = \{p \in \Delta_s : p \notin \Delta^\circ_s\} \]
+
+so that
+\[ \Delta_s = \Delta^\circ_s \uplus \partial\Delta_s \]
+
+where ⊎ indicates the sets in the union are disjoint. The boundary $\partial\Delta_s$ can be written as the union of
+all simplices of dimension one less than that of $\Delta_s$
+
+\[ \partial\Delta_s = \bigcup_{s' : s' \subset s,\ |s'| = |s| - 1} \Delta_{s'} \tag{3} \]
+
+This boundary property for $\Delta_s$ holds because the simplex $S_n$ consists of all possible subsets. Each
+nonempty $s \subset S_n$ specifies one of the possible supports for distribution $P \in P_n$
+
+\[ \Delta_s = \bigcup_{s' : s' \subset s} \Delta^\circ_{s'} \tag{4} \]
+
+where we set $\Delta_\emptyset = \emptyset$.
+
+5. Cones Λs
+
+The set of all nonempty subsets of the sample space provides a partition of $\Delta_{n-1}$ based on the
+support of the distributions in P. The elements in the partition are simplices whose dimension is one
+less than the cardinality of the indexing set. In most cases we will consider models having support
+$S_n$, that is, models corresponding to $\Delta^\circ_{n-1}$. If we use subsets s to define the mode rather than support,
+we obtain a partition of $P^\circ$, the distributions in P having support $S_n$. This partition can be expressed
+using convex cones in an n − 1 dimensional plane $V_{n-1}$. The dimension of each cone is n minus the
+cardinality of its indexing set and the relationship between interiors of cones and their boundaries is
+analogous to that for simplices expressed in Equations (3) and (4).
+Let
+\[ V_{n-1} = \{v \in \mathbb{R}^n : 1'v = 0\} \tag{5} \]
+
+be the subspace of $\mathbb{R}^n$ of dimension n − 1 of all vectors that sum to zero. For each nonempty $s \subset S_n$
+define
+\[ \Lambda_s = \{v \in V_{n-1} : v_i \ge v_j \ \forall i \in s, \ \forall j \in S_n\}. \]
+
+It is easily checked that Λs is a convex cone
+
+v1, v2 ∈ Λs =⇒ a1v1 + a2v2 ∈ Λs ∀a1, a2 ∈ [0, ∞) .
+
+The dimension of $\Lambda_s$ is $|\Lambda_s| = n - |s|$ since each point $j \in s^c$ provides a basis vector $b_j$ whose ith
+component is 1 if $i \in s$ or $i = j$ and is zero otherwise and $|s^c| = n - |s|$. The interior of $\Lambda_s$ is
+
+\[ \Lambda^\circ_s = \{v \in \Lambda_s : v_i > v_j \ \forall i \in s, \ \forall j \in s^c\}, \]
+
+the boundary is
+\[ \partial\Lambda_s = \{v \in \Lambda_s : v \notin \Lambda^\circ_s\}, \]
+
+so that
+\[ \Lambda_s = \Lambda^\circ_s \uplus \partial\Lambda_s \]
+
+by definition. Note $\Lambda_{S_n} = \Lambda^\circ_{S_n} = 0 \in V_{n-1} \subset \mathbb{R}^n$ where the first equality holds because the conditions
+in the definition of $\Lambda^\circ_s$ hold vacuously since $j \in S_n^c = \emptyset$ adds no restriction. Likewise, we can extend
+the definition of $\Lambda_s$ to include $s = \emptyset$ and since $i \in \emptyset$ adds no restriction
+
+\[ \Lambda_\emptyset = \Lambda^\circ_\emptyset = V_{n-1}. \]
+
+Note that Λ∅ depends on the cardinality of the set Sn. Since we are considering n fixed, we will not
+show this dependence in the notation.
+Corresponding to Equation (3) we have for all nonempty s that the boundary of the cone $\Lambda_s$ is the
+union of all cones having dimension one less than the dimension of $\Lambda_s$
+
+\[ \partial\Lambda_s = \bigcup_{s' : s \subset s',\ |s'| = |s| + 1} \Lambda_{s'}. \tag{6} \]
+
+Corresponding to Equation (4) we have
+
+\[ \Lambda_s = \bigcup_{s' : s \subset s'} \Lambda^\circ_{s'} \tag{7} \]
+The relationship between the simplices Δ and cones Λ is more easily seen if we suppress the
+sets that index these objects. Let Δ and $\Delta_*$ be any two simplices and let Λ and $\Lambda_*$ be any two convex
+cones. We only consider cones and simplices that correspond to a nonempty subset of $S_n$. Then the
+Equations (6) and (7) for the convex cones are obtained by simply replacing Δ in Equations (3) and (4)
+with Λ:
+\[ \partial\Delta = \bigcup_{\Delta_* : |\Delta_*| = |\Delta| - 1} \Delta_*, \qquad \partial\Lambda = \bigcup_{\Lambda_* : |\Lambda_*| = |\Lambda| - 1} \Lambda_* \tag{8} \]
+
+\[ \Delta = \bigcup_{\Delta_* \subset \Delta} \Delta^\circ_*, \qquad \Lambda = \bigcup_{\Lambda_* \subset \Lambda} \Lambda^\circ_* \tag{9} \]
+
+Equation (9) also holds for the empty set since $\Delta_\emptyset = \emptyset$ and $\Lambda_\emptyset = V_{n-1}$.
+
+6. Vn−1 and P◦
+
+There is a natural bijection φ between $V_{n-1}$ and $\Delta^\circ_{n-1}$ defined by
+
+\[ \varphi(p) = \log(p) - m(p) 1 \]
+
+where log(p) is the vector with ith component $\log(p_i)$ and m(p) is defined so that $1'\varphi(p) = 0$. The
+inverse is
+\[ \phi(v) = k^{-1}(v) \exp(v) \]
+
+where exp(v) is the vector with ith component $\exp(v_i)$ and k(v) is defined so that $1'\phi(v) = 1$.
+Each cone $\Lambda^\circ_s$ in the partition
+\[ V_{n-1} = \bigcup_{s} \Lambda^\circ_s \]
+
+corresponds to one of the $2^n - 1$ possible modes for any distribution having support $S_n$ since $v_i > v_j$ if
+and only if $\phi_i(v) > \phi_j(v)$.
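+
+The two maps can be followed on a small case. The sketch below assumes n = 3 and natural logarithms, and uses the facts (implied by the normalizations above) that m(p) is the average of the components of log(p) and that k(v) is the sum of the components of exp(v); the particular point p is chosen only for illustration.
+
+```latex
+% Forward map: p = (1/2, 1/4, 1/4)
+\begin{align*}
+\log p &= (-\log 2,\; -2\log 2,\; -2\log 2), \qquad
+m(p) = \tfrac13\, 1'\log p = -\tfrac53 \log 2, \\
+v = \varphi(p) &= \bigl(\tfrac23 \log 2,\; -\tfrac13 \log 2,\; -\tfrac13 \log 2\bigr),
+\qquad 1'v = 0 .
+\end{align*}
+% Inverse map applied to v:
+\begin{align*}
+\exp(v) &= \bigl(2^{2/3},\; 2^{-1/3},\; 2^{-1/3}\bigr), \qquad
+k(v) = 1'\exp(v) = 4 \cdot 2^{-1/3}, \\
+\phi(v) &= k^{-1}(v)\exp(v) = \bigl(\tfrac12,\; \tfrac14,\; \tfrac14\bigr) = p .
+\end{align*}
+% Since v_1 > v_2 = v_3, the point v lies in the open cone indexed by s = {1},
+% matching the single mode of p at bin 1.
+```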
+
+7. Vn−1 and Exponential Families in P◦
+
+We define a line by a pair of vectors v0, v1 ∈ Vn−1 with v1 ̸= 0
+
+ℓ = ℓ(t) = {v ∈ Vn−1 : v = v0 + tv1, t ∈ R}
+
+Note that $v_0$ and $v_1$ are not unique. Applying the inverse transformation ϕ to points in ℓ gives
+probability densities
+
+\[ \phi(v_0 + t v_1) = \frac{\exp(v_0 + t v_1)}{1' \exp(v_0 + t v_1)} \tag{10} \]
+
+which have the exponential family form with t playing the role of the natural parameter. Therefore,
+the space $V_{n-1}$ is easily recognized as the natural parameter space for the distributions $\Delta^\circ_{n-1}$ so that
+each line ℓ in $V_{n-1}$ corresponds to a one dimensional exponential family.
+For each line ℓ(t) there is a value $t_{\max}$ such that $\{\ell(t) : t \ge t_{\max}\}$ is contained in one of the cones
+$\Lambda^\circ_s$ where s is the subset of $S_n$ with the property that $v_1^i \ge v_1^j$ for all $i \in s$ for vectors $v_1 \in \Lambda^\circ_x$. For each
+line ℓ(t) there is a value $t_{\min}$ such that $\{\ell(t) : t \le t_{\min}\}$ is contained in one of the cones $\Lambda^\circ_{s'}$ where s′ is
+the subset of $S_n$ with the property that $v_1^i \le v_1^j$ for all $i \in s'$ for vectors $v_1 \in \Lambda^\circ_x$. The cones $\Lambda^\circ_s$ and $\Lambda^\circ_{s'}$
+are disjoint and will be called the extremal cones for ℓ. There is at least one other cone $\Lambda^\circ_{s''}$ such that
+$\ell \cap \Lambda^\circ_{s''} \ne \emptyset$.
+Any one dimensional exponential family ℓ(t) can be described by an ordered sequence of
+disjoint cones
+\[ \bigl( \Lambda^\circ_{s_1}, \Lambda^\circ_{s_2}, \ldots, \Lambda^\circ_{s_k} \bigr) \]
+
+where k = k(ℓ) will depend on the family. These are simply the cones that are traversed by ℓ(t)
+between its extremal cones. We take $\Lambda^\circ_{s_k}$ to be the cone that contains ℓ(t) for all sufficiently large t.
+Equation (6) for cones means that
+
+∂Λsi ⊂ Λsj for j = i + 1 or j = i − 1
+
+The ordered sequence of cones provides an ordered sequence of unique subsets of Sn
+
+(s1, s2, . . . , sk)
+
+that we call the modal profile for ℓ as these are the modes realized by the exponential family ℓ(t) between
+its extremal cones that have modes s1 and sk.
+Each point on a line ℓ(t) in Vn−1 corresponds to a distribution having support Sn. As t goes to −∞
+(+∞) ϕ(ℓ(t)) goes to a distribution having support s1 (sk). In fact, these are the uniform distribution
+on these supports. For every s ⊂ Sn other than ∅ and Sn, the uniform distribution on s is a limiting
+distribution for some one dimensional exponential family in P◦.
+Figure 3 shows Vn−1 for the two dimensional simplex shown in Figure 2. The three rays are the
+one dimensional cones and the spaces between these cones are the two dimensional cones. The origin
+is the zero dimensional cone. The sample values on the boundary of Δ2 are not in V2. Note that the
+one dimensional cones are line segments in Δ2.
+
+Figure 3. V2 for n = 3 bins and sample space for N = 10 observations that are in the interior of Δ2.
+
+8. Ordered Bins and the Monotone Likelihood Ratio Property
+
+Let the bins be ordered and assign the first n integers to the bins to reflect this ordering. We seek
+to define exponential families that have a modal profile of the form
+
+({1} , {1, 2} , {2} , {2, 3} , . . . , {n − 1, n} , {n})
+(11)
+
+or a contiguous sub-collection of this profile. Extensions to three or more contiguous modes are clearly
+possible but not discussed here.
+From the definition of modal profile, it follows that a family with modal profile (11) will have the
+property that the mode is a non-decreasing function of t. In addition to this property for the mode, we
+want the likelihood ratio for any two members of the family to provide the same ordering structure
+as that of the bins. A family that satisfies this condition is said to have the monotone likelihood ratio
+property with respect to x where x takes the values of the bin labels: 1, 2, . . . , n. Let $p_{\theta_1}$ and $p_{\theta_2}$ be
+two distributions in a one dimensional family parameterized by θ and let $p_{\theta_2}/p_{\theta_1}$ be the n-vector with
+components $p_{\theta_2}^j / p_{\theta_1}^j$ for $1 \le j \le n$. This family has monotone likelihood ratio if for all $\theta_1 < \theta_2$ and
+$j < j'$
+
+\[ \frac{p_{\theta_2}^{j}}{p_{\theta_1}^{j}} < \frac{p_{\theta_2}^{j'}}{p_{\theta_1}^{j'}}. \]
+A family with this property avoids the problem situation where in general the data in the higher
+numbered bins are evidence for pθ2 but in going from a particular bin, say j0 to j0 + 1, the likelihood
+ratio actually decreases. Exponential families such as the binomial and Poisson have this monotone
+likelihood ratio property for the bin labels. The monotone likelihood ratio property can be extended
+to allow for likelihood ratios that are monotone in some function of x. An important advantage of
+families with the monotone likelihood ratio property is the existence of uniformly most powerful tests.
+To ensure that our exponential families have the monotone likelihood ratio property we consider
+vectors in the cone $\Lambda_\uparrow \subset \Lambda_n$
+\[ \Lambda_\uparrow = \{v : v_i < v_j, \ i < j\}. \]
+
+From Equation (10), the exponential family indexed by θ is $k(\theta) \exp(v_0 + \theta v_1)$, which gives
+
+\[ \frac{p_{\theta_2}^{j}}{p_{\theta_1}^{j}} = \frac{k(\theta_2)}{k(\theta_1)} \exp\bigl( (\theta_2 - \theta_1)\, v_1^j \bigr) \]
+
+so that the likelihood ratio is monotone in j if $v_1 \in \Lambda_\uparrow$.
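+
+The last step can be spelled out by comparing the likelihood ratio at two bins. The following is a sketch of that comparison, writing $v_1^j$ for the jth component of $v_1$ as above; it uses only the form of the family stated just before.
+
+```latex
+% Ratio of the likelihood ratios at bins j' and j, with j < j' and \theta_1 < \theta_2.
+% The normalizing factors k(\theta_2)/k(\theta_1) and the v_0 terms cancel:
+\[
+  \frac{p_{\theta_2}^{j'} / p_{\theta_1}^{j'}}{p_{\theta_2}^{j} / p_{\theta_1}^{j}}
+  \;=\; \exp\bigl( (\theta_2 - \theta_1)\,(v_1^{j'} - v_1^{j}) \bigr) \;>\; 1
+  \quad\Longleftrightarrow\quad v_1^{j'} > v_1^{j}.
+\]
+% Hence the strict inequality defining the monotone likelihood ratio property
+% holds for every pair j < j' exactly when the components of v_1 are strictly increasing.
+```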
+
+9. Selecting Vectors in Λ↑
+
+In order to choose n-dimensional vectors $v \in \Lambda_\uparrow$ we will consider a set of infinite dimensional
+vectors f. Let $\bar{f} : \mathbb{R} \to \mathbb{R}$ and consider $f = \bar{f}|_{\mathbb{Z}}$ where $\mathbb{Z}$ is the set of integers. The function f is
+represented by a doubly infinite sequence
+
+\[ f = \ldots, f_{j-1}, f_j, f_{j+1}, \ldots \]
+
+and we denote the set of all such functions as
+
+\[ F = \{f : f_j \in \mathbb{R} \ \forall j \in \mathbb{Z}\}. \]
+
+While it is not necessary to consider functions $\bar{f}$ to define f, these functions are useful to describe
+properties of f, which can be thought of as a discretized version of $\bar{f}$.
+Define the gradient of f as the function ∇f whose jth component is
+
+\[ (\nabla f)_j = f_j - f_{j-1}. \]
+
+The simplest functions in F are the constant functions
+
+\[ F_0 = \{f \in F : f_j = f_{j'} \ \forall j, j' \in \mathbb{Z}\}. \]
+
+The next simplest functions are those whose gradient is constant. We call these first order functions
+and denote the set of these as
+\[ F_1 = \{f \in F : \nabla f \in F_0\}. \]
+
+
+Functions in $F_1$ are such that the change from one bin to the next is the same for all bins. That is,
+these functions describe constant change. We can write the functions in $F_1$ explicitly as
+
+\[ F_1 = \{f \in F : f_j = aj + b, \ a, b \in \mathbb{R}\} \]
+
+which shows that each $f \in F_1$ is the discretized version of a function $\bar{f}$ whose graph is a line in $\mathbb{R} \times \mathbb{R}$.
+We obtain a vector v from f by defining the jth component of v as
+
+\[ v_j = f_j - \frac{1}{n} \sum_{i=1}^{n} f_i. \]
+
+From this definition we see that the intercept b of f does not affect v and that the slope is a scaling
+factor so that the restriction to first order functions results in a single direction in $\Lambda_\uparrow$. This direction
+defines the one dimensional cone defined by the vector with $v_j = j - (n+1)/2$.
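+
+The claim about intercept and slope can be verified directly. The sketch below assumes that v is obtained by subtracting the average of $f_1, \ldots, f_n$ from each component, which is what the zero-sum constraint on vectors in $V_{n-1}$ and the stated result $v_j = j - (n+1)/2$ require.
+
+```latex
+% First order function f_j = a j + b on bins j = 1, ..., n:
+\begin{align*}
+\frac{1}{n}\sum_{i=1}^{n} f_i &= a\,\frac{n+1}{2} + b, \\
+v_j = f_j - \frac{1}{n}\sum_{i=1}^{n} f_i &= a\Bigl(j - \frac{n+1}{2}\Bigr).
+\end{align*}
+% The intercept b cancels and the slope a only rescales v.
+% Example: n = 3, a = 1 gives v = (-1, 0, 1), which sums to zero
+% and has strictly increasing components.
+```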
+Additional directions can be obtained from the second order functions
+
+F2 = { f ∈ F : ∇ f ∈ F1} .
+
+If $f \in F_2$ then $(\nabla^2 f)_j = a$ for some $a \in \mathbb{R}$ and for all $j \in \mathbb{Z}$. Using the fact that
+
+\[ (\nabla^2 f)_j = (\nabla(\nabla f))_j = (f_j - f_{j-1}) - (f_{j-1} - f_{j-2}) = f_j + f_{j-2} - 2 f_{j-1} \]
+
+the second order functions can be written explicitly as
+
+\[ F_2 = \Bigl\{ f \in F : f_j = \frac{a}{2}\, j(j+1) + bj + c, \ a, b, c \in \mathbb{R} \Bigr\}. \]
+In order for the vector v obtained from f ∈ F2 to be in Λ↑ we need (∇ f )j ≥ 0 for j = 1, 2, . . . , n.
+With f j = (a/2)j(j + 1) + bj + c we have (∇ f )j = aj + b so that for a > 0 we require b ≥ −a and for
+a < 0 we require b ≥ −an. Since we are concerned with the direction rather than the magnitude we
+can take a = ±1 and the value of c is chosen so the sum of the components is zero.
+The second order vectors in $\Lambda_\uparrow$ consist of the cone defined by the vectors $v_{20}$ and $v_{21}$ having
+components defined by
+
+\[ (n-1)(v_{20})_j = \frac{1}{2}\, j(j+1) - j - c_{20} \]
+
+\[ (n-1)(v_{21})_j = -\frac{1}{2}\, j(j+1) + nj - c_{21} \]
+Notice that this cone contains v1 since v1 is proportional to v20 + v21. Many discrete one dimensional
+exponential families (e.g., binomial, negative binomial, and Poisson) use the vector v1. Furthermore,
+many continuous one dimensional exponential families use the continuous function f used to define
+v1: normal with σ known, and the gamma and inverse Gaussian distributions with known shape
+parameter (the shape parameter is the non-scale parameter). The cone defined by v20 and v21 allows us
+to perturb the v1 direction to obtain related exponential families that we would expect to have similar
+properties. Figure 4 shows v20 and v21 as well as v1 = 0.5v20 + 0.5v21.
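+
+The relation between the two second order vectors and the first order direction can be checked by adding their components. This is a sketch using the component formulas given above; the constants $c_{20}$ and $c_{21}$ are computed here from the requirement that each vector sums to zero.
+
+```latex
+% Gradient of a second order function f_j = (a/2) j (j+1) + b j + c:
+\[ (\nabla f)_j = \tfrac{a}{2}\bigl( j(j+1) - (j-1)j \bigr) + b = aj + b . \]
+% Zero-sum constants (averages over j = 1, ..., n):
+\[ c_{20} = \frac{n^2-1}{6}, \qquad c_{21} = \frac{n^2-1}{3}, \qquad
+   c_{20} + c_{21} = \frac{(n-1)(n+1)}{2} . \]
+% Adding the two component formulas, the quadratic terms cancel:
+\begin{align*}
+(n-1)\bigl[(v_{20})_j + (v_{21})_j\bigr] &= (n-1)\,j - (c_{20} + c_{21}), \\
+(v_{20})_j + (v_{21})_j &= j - \frac{n+1}{2} .
+\end{align*}
+% So v_{20} + v_{21} is the first order vector with components j - (n+1)/2, and any
+% positive multiple of it, including the midpoint 0.5 v_{20} + 0.5 v_{21}, points in the same direction.
+```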
+Other vectors can be used to define cones around $v_1$. Looking at common exponential families we
+see that log(x) and $x^{-1}$ are sufficient statistics so that these suggest taking $\bar{f}(x) = \log(x)$ or $\bar{f}(x) = 1/x$.
+These can be further generalized to $\bar{f}(x; \lambda)$, which can be the power family or some other family of
+transformations. The vectors $v_{f0}$ and $v_{f1}$ are defined using the discretized f with the constraints that
+$v_{f0}, v_{f1} \in \Lambda_\uparrow$ and $0.5 v_{f0} + 0.5 v_{f1} = v_1$.
+
+
+An exponential family with sufficient statistic x can be modified by choosing a function $\bar{f}(x)$ and
+$0 \le \alpha \le 1$ where α = 0.5 corresponds to the original exponential family and other values perturb this
+direction. We denote this vector as $v_{f\alpha}$ so that $v_0 + t v_{f\alpha}$ is the natural parameter of the modified family.
+Figure 4 shows the components of the vectors $v_{20}$ and $v_{21}$.
+
+Figure 4. Components of the vectors v20 and v21 for n = 128 bins.
+
+Since $v_0$ is common to each exponential family with natural parameter $\ell(t) = v_0 + t v_{f\alpha}$, the
+monotone likelihood ratio property will hold even if $v_0 \notin \Lambda_\uparrow$. Initial choices for $v_0$ are suggested by
+the Poisson, binomial, and negative binomial distributions:
+
+\begin{align*}
+(v_{\text{Poisson}})_j &= -\log\Gamma(j) + c \;\notin \Lambda_\uparrow \\
+(v_{\text{binomial}})_j &= \log\Gamma(n) - \log\Gamma(j) - \log\Gamma(n-j) + c \;\notin \Lambda_\uparrow \\
+(v_{\text{neg.bin.}})_j &= \log\Gamma(j+r) - \log\Gamma(j) + c \;\in \Lambda_\uparrow
+\end{align*}
+
+where c is a constant chosen so that the components sum to zero, n is the number of bins, and r is a
+positive real constant.
+
+Author Contributions: This paper was initiated by the first author but all sections reflect a collaborative effort.
+Both authors have read and approved the final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1. Altham, P.M.E. Two Generalizations of the Binomial Distribution. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1978, 27, 162–167.
+2. Lovison, G. An alternative representation of Altham’s multiplicative-binomial distribution. Stat. Probab. Lett. 1998, 36, 415–420.
+3. Brown, L. Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory; IMS Lecture Notes; Institute of Mathematical Statistics: Hayward, CA, USA, 1986.
+4. Efron, B. Double Exponential Families and Their Use in Generalized Linear Regression. J. Am. Stat. Assoc. 1986, 81, 709–721.
+5. Aitchison, J. The Statistical Analysis of Compositional Data; Chapman and Hall: London, UK, 1986.
+6. Wu, Q.; Vos, P. Decomposition of Kullback–Leibler risk and unbiasedness for parameter-free estimators. J. Stat. Plan. Inference 2012, 142, 1525–1536.
+7. Vos, P.; Wu, Q. Maximum Likelihood Estimators Uniformly Minimize Distribution Variance among Distribution Unbiased Estimators in Exponential Families. Bernoulli 2014, submitted.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+entropy
+
+Article
+On the Fisher Metric of Conditional Probability Polytopes
+
+Guido Montúfar 1,*, Johannes Rauh 1 and Nihat Ay 1,2,3
+
+1 Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig 04103, Germany; E-Mails:
+jrauh@mis.mpg.de (J.R.); nay@mis.mpg.de (N.A.)
+2 Department of Mathematics and Computer Science, Leipzig University, PF 10 09 20, Leipzig 04009, Germany
+3 Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
+* E-Mail: montufar@mis.mpg.de; Tel.: +49-341-9959-521.
+
+Received: 31 March 2014; in revised form: 18 May 2014 / Accepted: 29 May 2014 /
+Published: 6 June 2014
+
+Abstract: We consider three different approaches to define natural Riemannian metrics on polytopes
+of stochastic matrices. First, we define a natural class of stochastic maps between these polytopes and
+give a metric characterization of Chentsov type in terms of invariance with respect to these maps.
+Second, we consider the Fisher metric defined on arbitrary polytopes through their embeddings as
+exponential families in the probability simplex. We show that these metrics can also be characterized
+by an invariance principle with respect to morphisms of exponential families. Third, we consider
+the Fisher metric resulting from embedding the polytope of stochastic matrices in a simplex of joint
+distributions by specifying a marginal distribution. All three approaches result in slight variations
+of products of Fisher metrics. This is consistent with the nature of polytopes of stochastic matrices,
+which are Cartesian products of probability simplices. The first approach yields a scaled product of
+Fisher metrics; the second, a product of Fisher metrics; and the third, a product of Fisher metrics
+scaled by the marginal distribution.
+
+Keywords: Fisher information metric; information geometry; convex support polytope; conditional
+model; Markov morphism; isometric embedding; natural gradient
+
+1. Introduction
+
+The Riemannian structure of a function’s domain has a crucial impact on the performance of
+gradient optimization methods, especially in the presence of plateaus and local maxima. The natural
+gradient [1] gives the steepest increase direction of functions on a Riemannian space. For example,
+artificial neural networks can often be trained by following some function’s gradient on a space of
+probabilities. In this context, it has been observed that following the natural gradient with respect to
+the Fisher information metric, instead of the Euclidean metric, can significantly alleviate the plateau
+problem [1,2]. The Fisher information metric, which is also called the Shahshahani metric [3] in biological
+contexts, is broadly recognized as the natural metric of probability spaces. An important argument
+was given by Chentsov [4], who showed that the Fisher information metric is the only metric on
+probability spaces for which certain natural statistical embeddings, called Markov morphisms, are
+isometries. More generally, Chentsov’s Theorem characterizes the Fisher metric and α-connections of
+statistical manifolds uniquely (up to a multiplicative constant) by requiring invariance with respect
+to Markov morphisms. Campbell [5] gave another proof that characterizes invariant metrics on the
+set of non-normalized positive measures, which restrict to the Fisher metric in the case of probability
+measures (up to a multiplicative constant). In this paper, we explore ways of defining distinguished
+Riemannian metrics on spaces of stochastic matrices.
+
+Entropy 2014, 16, 3207–3233; doi:10.3390/e16063207
+www.mdpi.com/journal/entropy
+
+In learning theory, when modeling the policy of a system, it is often preferred to consider
+stochastic matrices instead of joint probability distributions. For example, in robotics applications,
+policies are optimized over a parametric set of stochastic matrices by following the gradient of a
+reward function [6,7]. The set of stochastic matrices can be parametrized in many ways, e.g., in terms
+of feedforward neural networks, Boltzmann machines [8] or projections of exponential families [9].
+The information geometry of policy models plays an important role in these applications and has
+been studied by Kakade [2], Peters and co-workers [10–12], and Bagnell and Schneider [13], among
+others. A stochastic matrix is a tuple of probability distributions, and therefore, the space of stochastic
+matrices is a Cartesian product of probability simplices. Accordingly, in applications, usually a product
+metric is considered, with the usual Fisher metric on each factor. On the other hand, Lebanon [14]
+takes an axiomatic approach, following the ideas of Chentsov and Campbell, and characterizes a class
+of invariant metrics of positive matrices that restricts to the product of Fisher metrics in the case of
+stochastic matrices. We will consider three different approaches discussed in the following.
+In the first part, we take another look at Lebanon’s approach for characterizing a distinguished
+metric on polytopes of stochastic matrices. However, since the maps considered by Lebanon do not
+map stochastic matrices to stochastic matrices, we will use different maps. We show that the product
+of Fisher metrics can be characterized by an invariance principle with respect to natural maps between
+stochastic matrices.
+In the second part, we consider an approach that allows us to define Riemannian structures
+on arbitrary polytopes. Any polytope can be identified with an exponential family by using the
+coordinates of the polytope vertices as observables. The inverse of the moment map then defines
+an embedding of the polytope in a probability simplex. This embedding can be used to pull back
+geometric structures from the probability simplex to the polytope, including Riemannian metrics,
+affine connections, divergences, etc. This approach has been considered in [9] as a way to define
+low-dimensional families of conditional probability distributions. More general embeddings can be
+defined by identifying each exponential family with a point configuration, B, together with a weight
+function, ν. Given B and ν, the corresponding exponential family defines geometric structures on the
+set (conv B)◦, which is the relative interior of the convex support of the exponential family. Moreover,
+we can define natural morphisms between weighted point configurations as surjective maps between
+the point sets, which are compatible with the weight functions. As it turns out, the Fisher metric on
+(conv B)◦ can be characterized by invariance under these maps.
+In the third part, we return to stochastic matrices. We study natural embeddings of conditional
+distributions in probability simplices as joint distributions with a fixed marginal. These embeddings
+define a Fisher metric equal to a weighted product of Fisher metrics. This result corresponds to the
+definitions commonly used in robotics applications.
+All three approaches give very similar results. In all cases, the identified metric is a product
+metric. This is a sensible result, since the set of k × m stochastic matrices is a Cartesian product of
+probability simplices $\Delta_{m-1} \times \cdots \times \Delta_{m-1} = \Delta^k_{m-1}$, which suggests using the product metric of the
+Fisher metrics defined on the factor simplices, $\Delta_{m-1}$. Indeed, this is the result obtained from our second
+approach. The first approach yields the same result with an additional scaling factor of 1/k. The two
+approaches differ only when stochastic matrices of different sizes are compared. The third approach
+yields a product of Fisher metrics scaled by the marginal distribution that defines the embedding.
+Which metric to use depends on the concrete problem and whether a natural marginal distribution
+is defined and known. In Section 7, we do a case study using a reward function that is given as an
+expectation value over a joint distribution. In this simple example, the weighted product metric gives
+the best asymptotic rate of convergence, under the assumption that the weights are optimally chosen.
+In Section 8, we sum up our findings.
+The paper is organized as follows. Section 2 contains basic definitions concerning the
+Fisher metric and concepts of differential geometry. In Section 3, we discuss the theorems of Chentsov,
+Campbell and Lebanon, which characterize natural geometric structures on the probability simplex,
+on the set of positive measures and on the cone of positive matrices, respectively. In Section 4, we
+study metrics on polytopes of stochastic matrices, which are invariant under natural embeddings. In
+Section 5, we define a Riemannian structure for polytopes, which generalizes the Fisher information
+metric of probability simplices and conditional models in a natural way. In Section 6, we study a class of
+weighted product metrics. In Section 7, we study the gradient flow with respect to an expectation value.
+Section 8 contains concluding remarks. In Appendix A, we investigate restrictions on the parameters
+of the metrics characterized in Sections 3 and 4 that make them positive definite. Appendix B contains
+the proofs of the results from Section 4.
+
+2. Preliminaries
+
+We will consider the simplex of probability distributions on $[m] := \{1, \ldots, m\}$, $m \ge 2$, which is
+given by $\Delta_{m-1} := \{(p_i)_i \in \mathbb{R}^m : p_i \ge 0, \sum_i p_i = 1\}$. The relative interior of $\Delta_{m-1}$ consists of all strictly
+positive probability distributions on $[m]$, and will be denoted $\Delta^\circ_{m-1}$. This is a subset of $\mathbb{R}^m_+$, the cone
+of strictly positive vectors. The set of $k \times m$ row-stochastic matrices is given by
+$\Delta^k_{m-1} := \{(K_{ij})_{ij} \in \mathbb{R}^{k \times m} : (K_{ij})_j \in \Delta_{m-1} \text{ for all } i \in [k]\}$ and is equal to the Cartesian product $\times_{i \in [k]} \Delta_{m-1}$. The relative
+interior $(\Delta^k_{m-1})^\circ$ is a subset of $\mathbb{R}^{k \times m}_+$, the cone of strictly positive matrices.
+Given two random variables $X$ and $Y$ taking values in the finite sets $[k]$ and $[m]$, respectively, the
+conditional probability distribution of $Y$ given $X$ is the stochastic matrix $K = (P(y|x))_{x \in [k], y \in [m]}$ with
+rows $(P(y|x))_{y \in [m]} \in \Delta_{m-1}$ for all $x \in [k]$. Therefore, the polytope of stochastic matrices $\Delta^k_{m-1}$ is called
+a conditional polytope.
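+The tangent space of $\mathbb{R}^n_+$ at a point $p \in \mathbb{R}^n_+$, denoted by $T_p\mathbb{R}^n_+$, is the real vector space spanned
+by the vectors $\partial_1, \ldots, \partial_n$ of partial derivatives with respect to the $n$ components. The tangent space of
+$\Delta^\circ_{n-1}$ at a point $p \in \Delta^\circ_{n-1} \subset \mathbb{R}^n_+$ is the subspace $T_p\Delta^\circ_{n-1} \subset T_p\mathbb{R}^n_+$ consisting of the vectors:
+
+\[ u = \sum_i u_i \partial_i \in T_p\mathbb{R}^n_+ \quad \text{with} \quad \sum_i u_i = 0. \tag{1} \]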
+
+The Fisher metric on the positive probability simplex $\Delta^\circ_{n-1}$ is the Riemannian metric given by:
+
+\[ g^{(n)}_p(u, v) = \sum_{i=1}^n \frac{u_i v_i}{p_i}, \quad \text{for all } u, v \in T_p\Delta^\circ_{n-1}. \tag{2} \]
+
+The same formula (2) also defines a Riemannian metric on $\mathbb{R}^n_+$, which we will denote by the same
+symbol. This, however, is not the only way in which the Fisher metric can be extended from $\Delta^\circ_{n-1}$
+to $\mathbb{R}^n_+$. We will discuss other extensions in the next section (see Campbell’s Theorem, Theorem 2).
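+For concreteness, here is a small worked instance of formula (2). The point $p$ and the tangent vectors $u$, $v$ are illustrative values chosen here, not taken from the text:
+
+```latex
+\[
+\begin{aligned}
+p &= \left(\tfrac12, \tfrac14, \tfrac14\right) \in \Delta^\circ_2, \qquad
+u = (1, -1, 0), \quad v = (1, 0, -1) \quad \text{(both sum to zero)}, \\
+g^{(3)}_p(u, v) &= \frac{1 \cdot 1}{1/2} + \frac{(-1) \cdot 0}{1/4} + \frac{0 \cdot (-1)}{1/4} = 2 .
+\end{aligned}
+\]
+```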
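+Consider a smoothly parametrized family of probability distributions $M = \{(p(x; \theta))_{x \in [n]} : \theta \in \Omega\} \subseteq \Delta^\circ_{n-1}$,
+where $\Omega \subseteq \mathbb{R}^d$ is open. Then, $g^{(n)}$ induces a Riemannian metric on $M$. Denote by
+$\partial_{\theta_i} = \frac{\partial}{\partial \theta_i}$ the tangent vector corresponding to the partial derivative with respect to $\theta_i$, for all $i \in [d]$.
+Then, the Fisher matrix has coordinates:
+
+\[ g^{M}_\theta(\partial_{\theta_i}, \partial_{\theta_j}) = \sum_{x \in [n]} p(x; \theta) \frac{\partial \log p(x; \theta)}{\partial \theta_i} \frac{\partial \log p(x; \theta)}{\partial \theta_j}, \quad \text{for all } i, j \in [d], \text{ for all } \theta \in \Omega. \tag{3} \]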
+
+Here, it is not necessary to assume that the parameters $\theta_i$ are independent. In particular, the dimension
+of $M$ may be smaller than $d$, in which case the matrix is not positive definite. If the map $\Omega \to M$, $\theta \mapsto p(\cdot; \theta)$
+is an embedding (i.e., a smooth injective map that is a diffeomorphism onto its image), then $g^{M}_\theta$
+defines a Riemannian metric on $\Omega$, which corresponds to the pull-back of $g^{(n)}$.
+Consider an embedding $f : E \to E'$. The pull-back of a metric $g'$ on $E'$ through $f$ is defined as:
+
+\[ (f^* g')_p(u, v) := g'_{f(p)}(f_* u, f_* v), \quad \text{for all } u, v \in T_pE, \tag{4} \]
+
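+where $f_*$ denotes the push-forward of $T_pE$ through $f$, which in coordinates is given by:
+
+\[ f_* : T_pE \to T_{f(p)}E'; \quad \sum_i u_i \partial_{\theta_i} \mapsto \sum_j \sum_i u_i \frac{\partial f_j(p)}{\partial \theta_i} \partial_{\theta'_j}, \tag{5} \]
+
+where $\{\partial_{\theta_i}\}_i$ spans $T_pE$ and $\{\partial_{\theta'_j}\}_j$ spans $T_{f(p)}E'$.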
+
+An embedding $f : E \to E'$ of two Riemannian manifolds $(E, g)$ and $(E', g')$ is an isometry iff:
+
+\[ g_p(u, v) = (f^* g')_p(u, v), \quad \text{for all } p \in E \text{ and } u, v \in T_pE. \tag{6} \]
+
+In this case, we say that the metric g is invariant with respect to f (and g′).
+
+3. The Results of Campbell and Lebanon
+
+One of the theoretical motivations for using the Fisher metric is provided by Chentsov’s
+characterization [4], which states that the Fisher metric is uniquely specified, up to a multiplicative
+constant, by an invariance principle under a class of stochastic maps, called Markov morphisms. Later,
+Campbell [5] considered the characterization problem on the space $\mathbb{R}^n_+$ instead of $\Delta^\circ_{n-1}$. This simplifies
+the computations, since $\mathbb{R}^n_+$ has a more symmetric parametrization.
+
+Definition 1. Let $2 \le m \le n$. A (row) stochastic partition matrix (or just row-partition matrix) is a matrix
+$Q \in \mathbb{R}^{m \times n}$ of non-negative entries, which satisfies $\sum_{j \in A_{i'}} Q_{ij} = \delta_{ii'}$ for an $m$ block partition $\{A_1, \ldots, A_m\}$ of
+$[n]$. The linear map defined by:
+
+\[ \mathbb{R}^m_+ \to \mathbb{R}^n_+; \quad p \mapsto p \cdot Q \tag{7} \]
+
+is called a congruent embedding by a Markov mapping of $\mathbb{R}^m_+$ to $\mathbb{R}^n_+$ or just a Markov map, for short.
+
+An example of a 3 × 5 row-partition matrix is:
+
+\[ Q = \begin{pmatrix} 1/2 & 0 & 1/2 & 0 & 0 \\ 0 & 1/3 & 0 & 2/3 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \tag{8} \]
+
+Markov maps preserve the 1-norm and restrict to embeddings $\Delta^\circ_{m-1} \to \Delta^\circ_{n-1}$.
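+
+To see the 1-norm preservation concretely, one can apply the Markov map of the 3 × 5 example above, whose rows are (1/2, 0, 1/2, 0, 0), (0, 1/3, 0, 2/3, 0) and (0, 0, 0, 0, 1), to a generic positive vector $p = (p_1, p_2, p_3)$. This is only an illustrative computation:
+
+```latex
+\[
+\begin{aligned}
+p \cdot Q &= \left(\tfrac{p_1}{2}, \tfrac{p_2}{3}, \tfrac{p_1}{2}, \tfrac{2 p_2}{3}, p_3\right), \\
+|p \cdot Q| &= \tfrac{p_1}{2} + \tfrac{p_2}{3} + \tfrac{p_1}{2} + \tfrac{2 p_2}{3} + p_3 = p_1 + p_2 + p_3 = |p| .
+\end{aligned}
+\]
+```
+
+In particular, $|p| = 1$ implies $|p \cdot Q| = 1$, so positive probability vectors on three states are mapped to positive probability vectors on five states.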
+
+Theorem 1 (Chentsov’s Theorem.).
+
+• Let $g^{(m)}$ be a Riemannian metric on $\Delta^\circ_{m-1}$ for $m \in \{2, 3, \ldots\}$. Let this sequence of metrics have the
+property that every congruent embedding by a Markov mapping is an isometry. Then, there is a constant
+$C > 0$ that satisfies:
+
+\[ g^{(m)}_p(u, v) = C \sum_i \frac{u_i v_i}{p_i}. \tag{9} \]
+
+• Conversely, for any $C > 0$, the metrics given by Equation (9) define a sequence of Riemannian metrics
+under which every congruent embedding by a Markov mapping is an isometry.
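+
+The isometry property in Theorem 1 can be checked by hand for a single Markov map. The sketch below assumes the 3 × 5 row-partition matrix $Q$ with rows (1/2, 0, 1/2, 0, 0), (0, 1/3, 0, 2/3, 0), (0, 0, 0, 0, 1) and takes $C = 1$. Since the map $p \mapsto p \cdot Q$ is linear, tangent vectors are pushed forward as $u \mapsto u \cdot Q$:
+
+```latex
+\[
+\begin{aligned}
+g^{(5)}_{pQ}(uQ, vQ)
+&= \frac{(u_1/2)(v_1/2)}{p_1/2} + \frac{(u_2/3)(v_2/3)}{p_2/3} + \frac{(u_1/2)(v_1/2)}{p_1/2}
+ + \frac{(2u_2/3)(2v_2/3)}{2p_2/3} + \frac{u_3 v_3}{p_3} \\
+&= \frac{u_1 v_1}{2 p_1} + \frac{u_2 v_2}{3 p_2} + \frac{u_1 v_1}{2 p_1} + \frac{2 u_2 v_2}{3 p_2} + \frac{u_3 v_3}{p_3} \\
+&= \frac{u_1 v_1}{p_1} + \frac{u_2 v_2}{p_2} + \frac{u_3 v_3}{p_3} = g^{(3)}_p(u, v) .
+\end{aligned}
+\]
+```
+
+The weights within each block of the partition sum to one, which is exactly what makes the split terms recombine.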
+
+The main result in Campbell’s work [5] is the following variant of Chentsov’s Theorem.
+
+Theorem 2 (Campbell’s Theorem.).
+
+• Let $g^{(m)}$ be a Riemannian metric on $\mathbb{R}^m_+$ for $m \in \{2, 3, \ldots\}$. Let this sequence of metrics have the
+property that every embedding by a Markov mapping is an isometry. Then:
+
+\[ g^{(m)}_p(\partial_i, \partial_j) = A(|p|) + \delta_{ij} C(|p|) \frac{|p|}{p_i}, \tag{10} \]
+
+where $|p| = \sum_{i=1}^m p_i$, $\delta_{ij}$ is the Kronecker delta, and $A$ and $C$ are $C^\infty$ functions on $\mathbb{R}_+$ satisfying
+$C(\alpha) > 0$ and $A(\alpha) + C(\alpha) > 0$ for all $\alpha > 0$.
+• Conversely, if $A$ and $C$ are $C^\infty$ functions on $\mathbb{R}_+$ satisfying $C(\alpha) > 0$, $A(\alpha) + C(\alpha) > 0$ for all $\alpha > 0$,
+then Equation (10) defines a sequence of Riemannian metrics under which every embedding by a Markov
+mapping is an isometry.
+
+The metrics from Campbell’s Theorem also define metrics on the probability simplices $\Delta^\circ_{m-1}$ for
+$m = 2, 3, \ldots$. Since the tangent vectors $v = \sum_i v_i \partial_i \in T_p\Delta^\circ_{m-1}$ satisfy $\sum_i v_i = 0$, for any two vectors
+$u, v \in T_p\Delta^\circ_{m-1}$, also $\sum_i \sum_j A u_i v_j = 0$ for any $A$. In this case, the choice of $A$ is immaterial, and the
+metric becomes Chentsov’s metric.
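+
+The cancellation of $A$ can be written out in one line. The following is a direct expansion of formula (10) for tangent vectors of the simplex, where $|p| = 1$:
+
+```latex
+\[
+\begin{aligned}
+g^{(m)}_p(u, v) &= \sum_{i,j} u_i v_j \left( A(|p|) + \delta_{ij} C(|p|) \frac{|p|}{p_i} \right)
+= A(1) \Bigl( \sum_i u_i \Bigr) \Bigl( \sum_j v_j \Bigr) + C(1) \sum_i \frac{u_i v_i}{p_i} \\
+&= C(1) \sum_i \frac{u_i v_i}{p_i},
+\qquad \text{since } \sum_i u_i = \sum_j v_j = 0 .
+\end{aligned}
+\]
+```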
+
+Remark 1. Observe that Chentsov’s Theorem is not a direct implication of Campbell’s Theorem. However, it
+can be deduced from it by the following arguments. Suppose that we have a family of Riemannian simplices
+$(\Delta^\circ_{m-1}, g^{(m)})$ for $m \in \{2, 3, \ldots\}$, and suppose that they are isometric with respect to Markov maps. If we can
+extend every $g^{(m)}$ to a Riemannian metric $\tilde g^{(m)}$ on $\mathbb{R}^m_+$ in such a way that the resulting spaces $(\mathbb{R}^m_+, \tilde g^{(m)})$ are
+still isometric with respect to Markov maps, then Campbell’s Theorem implies that $g^{(m)}$ is a multiple of the
+Fisher metric. Such metric extensions can be defined as follows. Consider the diffeomorphism:
+
+\[ \Delta^\circ_{m-1} \times \mathbb{R}_+ \cong \mathbb{R}^m_+, \quad (p, r) \mapsto r \cdot p. \tag{11} \]
+
+Any tangent vector $u \in T_{(p,r)}\mathbb{R}^m_+$ can be written uniquely as $u = u_p + u_r \partial_r$, where $u_p$ is tangent to $r\Delta^\circ_{m-1}$.
+Since each Markov map $f$ preserves the one-norm $|\cdot|$, its push-forward $f_*$ maps the tangent vector $\partial_r \in T_{(p,r)}\mathbb{R}^m_+$
+to the corresponding tangent vector $\partial_r \in T_{f(p,r)}\mathbb{R}^m_+$; that is, $f_* u = f_* u_p + u_r \partial_r$. Therefore,
+
+\[ \tilde g^{(m)}_{(p,r)}(u, v) := g^{(m)}_p(u_p, v_p) + u_r v_r \tag{12} \]
+
+is a metric on $\mathbb{R}^m_+$ that is invariant under $f$.
+
+In what follows, we will focus on positive matrices. In order to define a natural Riemannian
+metric, we can use the identification $\mathbb{R}^{k \times m}_+ \cong \mathbb{R}^{km}_+$ and apply Campbell’s Theorem. This leads to
+metrics of the form:
+
+\[ g^{(k,m)}_M(\partial_{ij}, \partial_{kl}) = A(|M|) + \delta_{ik} \delta_{jl} \frac{C(|M|)}{M_{ij}}, \tag{13} \]
+
+where $\partial_{ij} = \frac{\partial}{\partial M_{ij}}$ and $|M| = \sum_{ij} M_{ij}$. However, a disadvantage of this approach is that the action of
+general Markov maps on $\mathbb{R}^{km}_+$ has no natural interpretation in terms of the matrix structure. Therefore,
+Lebanon [14] considered a special class of Markov maps defined as follows.
+
+Definition 2. Consider a $k \times l$ row-partition matrix $R$ and a collection of $m \times n$ row-partition matrices
+$Q = \{Q^{(1)}, \ldots, Q^{(k)}\}$. The map:
+
+\[ \mathbb{R}^{k \times m}_+ \to \mathbb{R}^{l \times n}_+; \quad M \mapsto R^\top (M \otimes Q) \tag{14} \]
+
+is called a congruent embedding by a Markov morphism of $\mathbb{R}^{k \times m}_+$ to $\mathbb{R}^{l \times n}_+$ in [15]. We will refer to such an
+embedding as a Lebanon map. Here, the row product $M \otimes Q$ is defined by:
+
+\[ (M \otimes Q)_{ab} = (M \cdot Q^{(a)})_{ab}, \quad \text{for all } a \in [k], b \in [n]; \tag{15} \]
+
+that is, the $a$-th row of $M$ is multiplied by the matrix $Q^{(a)}$.
+
+In a Lebanon map, each row of the input matrix $M$ is mapped by an individual Markov mapping
+$Q^{(i)}$, and each resulting row is copied and scaled by an entry of $R$. This kind of map preserves the
+sum of all matrix entries. Therefore, with the identification $\mathbb{R}^{k \times m}_+ \cong \mathbb{R}^{km}_+$, each Lebanon map restricts
+to a map $\Delta^\circ_{mk-1} \to \Delta^\circ_{nl-1}$. The set $\Delta^\circ_{mk-1}$ can be identified with the set of joint distributions of two
+random variables. Lebanon maps can be regarded as special Markov maps that incorporate the product
+structure present in the set of joint probability distributions of a pair of random variables. In Section 4,
+we will give an interpretation of these maps.
+Contrary to what is stated in [15], a Lebanon map does not map $(\Delta^k_{m-1})^\circ$ to $(\Delta^l_{n-1})^\circ$, unless $k = l$.
+Therefore, later, we will provide a characterization for the metrics on $(\Delta^k_{m-1})^\circ$ in terms of invariance
+under other maps (which are neither Markov nor Lebanon maps).
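+
+The following small instance illustrates why a Lebanon map need not preserve row-stochasticity. The matrices $R$, $Q^{(a)}$ and $M$ are illustrative choices (with $k = 2$, $l = 3$, $m = n = 2$), not taken from [15]:
+
+```latex
+\[
+\begin{aligned}
+R &= \begin{pmatrix} 1/2 & 1/2 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
+Q^{(1)} = Q^{(2)} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
+M = \begin{pmatrix} a & 1 - a \\ b & 1 - b \end{pmatrix}, \quad 0 < a, b < 1, \\
+R^\top (M \otimes Q) &= R^\top M = \begin{pmatrix} a/2 & (1 - a)/2 \\ a/2 & (1 - a)/2 \\ b & 1 - b \end{pmatrix}.
+\end{aligned}
+\]
+```
+
+The entries still sum to $|M| = 2$, but the rows of the image sum to 1/2, 1/2 and 1, so the image is not a stochastic matrix.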
+The main result in Lebanon’s work [15, Theorems 1 and 2] is the following.
+
+Theorem 3 (Lebanon’s Theorem.).
+
+• For each $k \ge 1$, $m \ge 2$, let $g^{(k,m)}$ be a Riemannian metric on $\mathbb{R}^{k \times m}_+$ in such a way that every Lebanon
+map is an isometry. Then:
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = A(|M|) + \delta_{ac} \left( \frac{B(|M|)}{|M_a|} + \delta_{bd} \frac{C(|M|)}{M_{ab}} \right) \tag{16} \]
+
+for some differentiable functions $A, B, C \in C^\infty(\mathbb{R}_+)$.
+• Conversely, let $\{(\mathbb{R}^{k \times m}_+, g^{(k,m)})\}$ be a sequence of Riemannian manifolds, with metrics $g^{(k,m)}$ of the
+form (16) for some $A, B, C \in C^\infty(\mathbb{R}_+)$. Then, every Lebanon map is an isometry.
+
+Lebanon does not study the question under which assumptions on A, B, C ∈ C∞(R+) the
+formula (16) does indeed define a Riemannian metric. This question has the following simple answer,
+which we will prove in Appendix A:
+
+Proposition 1. The matrix (16) is positive definite if and only if C(|M|) > 0, B(|M|) + C(|M|) > 0 and
+A(|M|) + B(|M|) + C(|M|) > 0.
+
+The class of metrics (16) is larger than the class of metrics (13) derived in Campbell’s Theorem.
+The reason is that Campbell’s metrics are invariant with respect to a larger class of embeddings.
+The special case with $A(|M|) = 0$, $B(|M|) = 0$ and $C(|M|) = 1$ is called the product Fisher metric,
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \delta_{ac} \delta_{bd} \frac{1}{M_{ab}}. \tag{17} \]
+
+Furthermore, if we restrict to $(\Delta^k_{m-1})^\circ$, the functions $A$ and $B$ do not play any role. In this case $|M| = k$,
+and we obtain the scaled product Fisher metric:
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \delta_{ac} \delta_{bd} \frac{C(k)}{M_{ab}}, \tag{18} \]
+
+where C(k) : N → R+ is a positive function. As mentioned before, Lebanon’s Theorem does not give a
+characterization of invariant metrics of stochastic matrices, since Lebanon maps do not preserve the
+stochasticity of the matrices. However, Lebanon maps are natural maps on the set $\Delta^\circ_{mk-1}$ of positive
+joint distributions. In the same way as Chentsov’s Theorem can be derived from Campbell’s Theorem
+(see Remark 1), we obtain the following Corollary:
+
+Corollary 1.
+
+• Let $\{(\Delta^\circ_{km-1}, g^{(k,m)}) : k \ge 1, m \ge 2\}$ be a double sequence of Riemannian manifolds with the property
+that every Lebanon map is an isometry. Then:
+
+\[ g^{(k,m)}_P(u, v) = B \sum_a \sum_{b,c} \frac{u_{ab} v_{ac}}{|P_a|} + C \sum_a \sum_b \frac{u_{ab} v_{ab}}{P_{ab}}, \quad \text{for each } P \in \Delta^\circ_{km-1}, \tag{19} \]
+
+for some constants $B, C \in \mathbb{R}$ with $C > 0$ and $B + C > 0$, where $|P_a| = \sum_b P_{ab}$.
+• Conversely, let $\{(\Delta^\circ_{km-1}, g^{(k,m)})\}$ be a sequence of Riemannian manifolds with metrics $g^{(k,m)}$ of the form
+of Equation (19) for some $B, C \in \mathbb{R}$. Then, every Lebanon map is an isometry.
+
+Observe that these metrics agree with (a multiple of) the Fisher metric only if B = 0. The case B = 0
+can also be characterized; note that Lebanon maps do not treat the two random variables symmetrically.
+Switching the two random variables corresponds to transposing the joint distribution matrix P. When
+exchanging the role of the two random variables, the Lebanon map becomes $P \mapsto (P^\top \otimes Q)^\top R$. We
+call such a map a dual Lebanon map. If we require invariance under both Lebanon maps and their
+duals in Theorem 3 or Corollary 1, the statements remain true with the additional restriction that B = 0
+(as a function or constant, respectively).
+
+4. Invariance Metric Characterizations for Conditional Polytopes
+
+According to Chentsov’s Theorem (Theorem 1), a natural metric on the probability simplex can
+be characterized by requiring the isometry of natural embeddings. Lebanon follows this axiomatic
+approach to characterize metrics on products of positive measures (Theorem 3). However, the maps
+considered by Lebanon dissolve the row-normalization of conditional distributions. In general, they
+do not map conditional polytopes to conditional polytopes. Therefore, we will consider a slight
+modification of Lebanon maps, in order to obtain maps between conditional polytopes.
+
+4.1. Stochastic Embeddings of Conditional Polytopes
+
+A matrix of conditional distributions $P(Y|X)$ in $\Delta^k_{m-1}$ can be regarded as the equivalence class
+of all joint probability distributions $P(X, Y) \in \Delta_{km-1}$ with conditional distribution $P(Y|X)$. Which
+Markov maps of probability simplices are compatible with this equivalence relation? The most obvious
+examples are permutations (relabelings) of the state spaces of X and Y.
+In information theory, stochastic matrices are also viewed as channels. For any distribution of X,
+the stochastic matrix gives us a joint distribution of the pair (X, Y) and, hence, a marginal distribution
+of Y. If we input a distribution of X into the channel, the stochastic matrix determines what the
+distribution of the output Y will be.
+Channels can be combined, provided the cardinalities of the state spaces fit together. If we
+take the output Y of the first channel P(Y|X) and feed it into another channel P(Y′|Y) then we
+obtain a combined channel P(Y′|X). The composition of channels corresponds to ordinary matrix
+multiplication. If the first channel is described by the stochastic matrix K and the second channel by Q,
+then the combined channel is described by K · Q. Observe that in this case, the joint distribution P
+(considered as a normalized matrix P ∈ Δkm−1) is transformed similarly; that is, the joint distribution
+of the pair (X, Y′) is given by P · Q.
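+As a numerical illustration of channel composition, consider the following two channels. The matrices $K$ and $Q$ are illustrative values chosen here, and $P$ is the joint distribution obtained from $K$ under the (assumed) input distribution (1/2, 1/2):
+
+```latex
+\[
+\begin{aligned}
+K &= \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}, \qquad
+Q = \begin{pmatrix} 0.5 & 0.5 \\ 0 & 1 \end{pmatrix}, \qquad
+K \cdot Q = \begin{pmatrix} 0.45 & 0.55 \\ 0.1 & 0.9 \end{pmatrix}, \\
+P &= \begin{pmatrix} 0.45 & 0.05 \\ 0.1 & 0.4 \end{pmatrix}, \qquad
+P \cdot Q = \begin{pmatrix} 0.225 & 0.275 \\ 0.05 & 0.45 \end{pmatrix}
+= \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix} (K \cdot Q) .
+\end{aligned}
+\]
+```
+
+The rows of $K \cdot Q$ again sum to one, and $P \cdot Q$ is the joint distribution that the combined channel produces from the same input distribution.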
+More general maps result from compositions where the choice of the second channel depends on
+the input of the first channel. In other words, we have a first channel that takes as input X and gives
+as output Y, and we have another channel that takes as input (X, Y) and gives as output Y′; we are
+interested in the resulting channel from X to Y′. The second channel can be described by a collection
+of stochastic matrices Q = {Q(i)}i. If K describes the first channel, then the combined channel is
+described by the row product K ⊗ Q (see Definition 2). Again, the joint distribution of (X, Y′) arises in
+a similar way as P ⊗ Q.
+
+We can also consider transformations of the first random variable $X$. Suppose that we use $X$ as the
+input to a channel described by a stochastic matrix $R$. In this case, the joint distribution of the output
+$X'$ of the channel and $Y$ is described by $R^\top P$. However, in general, there is not much that we can say
+about the conditional distribution of $Y$ given $X'$. The result depends in an essential way on the original
+distribution of $X$. However, this is not true in the special case that the channel is “not mixing”, that is,
+in the case that $R$ is a stochastic partition matrix. In this case, the conditional distribution $P(Y|X')$ is
+described by $\bar R^\top K$, where $\bar R$ is the corresponding partition indicator matrix, in which all non-zero entries
+of $R$ are replaced by one. In other words, each state of $X$ corresponds to several states of $X'$, and the
+corresponding row of $K$ is copied a corresponding number of times.
+To sum up, if we combine the transformations due to $Q$ and $R$, then the joint probability
+distribution transforms as $P \mapsto R^\top (P \otimes Q)$ and the conditional transforms as $K \mapsto \bar R^\top (K \otimes Q)$.
+In particular, for the joint distribution, we obtain the definition of a Lebanon map. Figure 1 illustrates
+the situation.
+
+[Figure 1: diagram relating the variables X, Y, X′ and Y′ through the channels K, K′, R and Q.
+Joint distributions: $P' = R^\top (P \otimes Q)$; conditional distributions: $K' = \bar R^\top (K \otimes Q)$.]
+
+Figure 1. An interpretation for Lebanon maps and conditional embeddings. The variable X′ is
+computed from X by R, and Y′ is computed from X and Y by Q.
+
+Finally, we will also consider the special case where the partition of $R$ (and $\bar R$) is homogeneous,
+i.e., such that all blocks have the same size. For example, this describes the case where there is a third
+random variable Z that is independent of Y given X. In this case, the conditional distribution satisfies
+P(Y|X) = P(Y|X, Z), and R describes the conditional distribution of (X, Z) given X.
+
+Definition 3. A (row) partition indicator matrix is a matrix $\bar R \in \{0, 1\}^{k \times l}$ that satisfies:
+
+\[ \bar R_{ij} = \begin{cases} 1, & \text{if } j \in A_i, \\ 0, & \text{else}, \end{cases} \tag{20} \]
+
+for a $k$ block partition $\{A_1, \ldots, A_k\}$ of $[l]$.
+
+For example, the 3 × 5 partition indicator matrix corresponding to Equation (8) is:
+
+\[ \bar R = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \tag{21} \]
+
+Definition 4. Consider a $k \times l$ partition indicator matrix $\bar R$ and a collection of $m \times n$ stochastic partition
+matrices $Q = \{Q^{(i)}\}_{i=1}^k$. We call the map:
+
+\[ f : \mathbb{R}^{k \times m}_+ \to \mathbb{R}^{l \times n}_+; \quad M \mapsto \bar R^\top (M \otimes Q) \tag{22} \]
+
+a conditional embedding of $\mathbb{R}^{k \times m}_+$ in $\mathbb{R}^{l \times n}_+$. We denote the set of all such maps by $\hat F^{l,n}_{k,m}$. If $\bar R$ is the partition
+indicator matrix of a homogeneous partition (with partition blocks of equal cardinality), then we call $f$ a
+homogeneous conditional embedding. We denote the set of all such homogeneous conditional embeddings by
+$F^{l,n}_{k,m}$ and assume that $l$ is a multiple of $k$.
+
+Conditional embeddings preserve the 1-norm of the matrix rows; that is, the elements of $\hat F^{l,n}_{k,m}$
+map $(\Delta^k_{m-1})^\circ$ to $(\Delta^l_{n-1})^\circ$. On the other hand, they do not preserve the 1-norm of the entire matrix.
+Conditional embeddings are Markov maps only when $k = l$, in which case they are also Lebanon
+maps.
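+
+A short check of these statements, assuming the 3 × 5 partition indicator matrix $\bar R$ with rows (1, 0, 1, 0, 0), (0, 1, 0, 1, 0), (0, 0, 0, 0, 1) and, for simplicity, $Q^{(i)}$ equal to the identity for all $i$ (an illustrative choice with $m = n$). For a stochastic matrix $K$ with rows $K_1, K_2, K_3$:
+
+```latex
+\[
+\bar R^\top (K \otimes Q) = \bar R^\top K =
+\begin{pmatrix} K_1 \\ K_2 \\ K_1 \\ K_2 \\ K_3 \end{pmatrix},
+\qquad
+|\bar R^\top K| = 5 \neq 3 = |K| .
+\]
+```
+
+Each row of the image is a row of $K$ and hence sums to one, while the total mass changes from 3 to 5. Since the blocks have sizes 2, 2 and 1, this particular embedding is not homogeneous.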
+
+4.2. Invariance Characterization
+
+Considering the conditional embeddings discussed in the previous section, we obtain the
+following metric characterization.
+
+Theorem 4.
+
+• Let $g^{(k,m)}$ denote a metric on $\mathbb{R}^{k \times m}_+$ for each $k \ge 1$ and $m \ge 2$. If every homogeneous conditional
+embedding $f \in F^{l,n}_{k,m}$ is an isometry with respect to these metrics, then:
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2} + \delta_{ac} \left( k \frac{B}{k^2} + \delta_{bd} \frac{|M|}{M_{ab}} \frac{C}{k^2} \right), \quad \text{for all } M \in \mathbb{R}^{k \times m}_+, \tag{23} \]
+
+for some constants $A, B, C \in \mathbb{R}$, where $\partial_{ab} = \frac{\partial}{\partial M_{ab}}$ and $|M| = \sum_{ab} M_{ab}$.
+• Conversely, given the metrics defined by Equation (23) for any non-degenerate choice of constants
+$A, B, C \in \mathbb{R}$, each homogeneous conditional embedding $f \in F^{l,n}_{k,m}$, $k \le l$, $m \le n$, is an isometry.
+• Moreover, the tensors $g^{(k,m)}$ from Equation (23) are positive-definite for all $k \ge 1$ and $m \ge 2$ if and only
+if $C > 0$, $B + C > 0$ and $A + B + C > 0$.
+
+The proof of Theorem 4 is similar to the proof of the Theorems of Chentsov, Campbell and
+Lebanon. Due to its technical nature, we defer it to Appendix B.
+Now, for the restriction of the metric $g^{(k,m)}$ to $(\Delta^k_{m-1})^\circ$, we have the following. In this case,
+$|M| = k$. Since tangent vectors $v = \sum_{ab} v_{ab} \partial_{ab} \in T_M(\Delta^k_{m-1})^\circ$ satisfy $\sum_b v_{ab} = 0$ for all $a$, the constants
+$A$ and $B$ become immaterial, and the metric can be written as:
+
+\[ g^{(k,m)}_M(u, v) = \sum_{ab} \frac{|M| u_{ab} v_{ab}}{M_{ab}} \frac{C}{k^2} = \sum_{ab} \frac{u_{ab} v_{ab}}{M_{ab}} \frac{C}{k}, \quad \text{for all } u, v \in T_M(\Delta^k_{m-1})^\circ. \tag{24} \]
+
+This metric is a specialization of the metric (18) derived by Lebanon (Theorem 3).
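+To see the role of the factor $1/k$ in (24), one can check the isometry property for the simplest homogeneous conditional embedding. The choice $k = 1$, $l = 2$, $m = n = 2$, partition indicator matrix $(1 \; 1)$ and $Q^{(1)}$ equal to the identity is an illustrative assumption:
+
+```latex
+\[
+\begin{aligned}
+M &= (p_1, p_2), \quad u = (u_1, u_2), \qquad
+f(M) = \begin{pmatrix} p_1 & p_2 \\ p_1 & p_2 \end{pmatrix}, \qquad
+f_* u = \begin{pmatrix} u_1 & u_2 \\ u_1 & u_2 \end{pmatrix}, \\
+g^{(2,2)}_{f(M)}(f_* u, f_* v) &= \frac{C}{2} \sum_{a=1}^{2} \sum_{b=1}^{2} \frac{u_b v_b}{p_b}
+= C \sum_{b=1}^{2} \frac{u_b v_b}{p_b} = g^{(1,2)}_M(u, v) .
+\end{aligned}
+\]
+```
+
+The row is copied twice, and the factor $C/2$ on the target compensates exactly for the duplication.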
+The statement of Theorem 4 becomes false if we consider general conditional embeddings instead
+of homogeneous ones:
+
+Theorem 5. There is no family of metrics $g^{(k,m)}$ on $\mathbb{R}^{k \times m}_+$ (or on $(\Delta^k_{m-1})^\circ$) for each $k \ge 1$ and $m \ge 2$, for
+which every conditional embedding $f \in \hat F^{l,n}_{k,m}$ is an isometry.
+
+This negative result will become clearer from the perspective of Section 6: as we will show in
+Theorem 7, although there are no metrics that are invariant under all conditional embeddings, there are
+families of metrics (depending on a parameter, ρ) that transform covariantly (that is, in a well-defined
+manner) with respect to the conditional embeddings. We defer the proof of Theorem 5 to Appendix B.
+
+5. The Fisher Metric on Polytopes and Point Configurations
+
+In the previous section, we obtained distinguished Riemannian metrics on $\mathbb{R}^{k \times m}_+$ and $(\Delta^k_{m-1})^\circ$ by
+postulating invariance under natural maps. In this section, we take another viewpoint based on general
+considerations about Riemannian metrics on arbitrary polytopes. This is achieved by embedding each
+polytope in a probability simplex as an exponential family. We first recall the necessary background.
+In Section 5.2, we then present our general results, and in Section 5.3, we discuss the special case of
+conditional polytopes.
+
+5.1. Exponential Families and Polytopes
+
+Let X be a finite set and A ∈ Rd×X a matrix with columns ax indexed by x ∈ X . It will be
+convenient to consider the rows Ai, i ∈ [d] of A as functions Ai : X → R. Finally, let ν: X → R+. The
+exponential family EA,ν is the set of probability distributions on X given by:
+
+\[
+p(x; \theta) = \exp\bigl(\theta^\top a_x + \log \nu(x) - \log Z(\theta)\bigr),
+\quad \text{for all } x \in \mathcal{X},\ \theta \in \mathbb{R}^d, \tag{25}
+\]
+
+with the normalization function Z(θ) = ∑x′∈X exp(θ⊤ax′ + log(ν(x′))). The functions Ai are called
+the observables and ν the reference measure of the exponential family. When the reference measure ν
+is constant, ν(x) = 1 for all x ∈ X , we omit the subscript and write EA.
+A direct calculation shows that the Fisher information matrix of EA,ν at a point θ ∈ Rd has
+coordinates:
+
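+\[
+g^{\mathcal{E}_{A,\nu}}_\theta(\partial_{\theta_i}, \partial_{\theta_j}) = \operatorname{cov}_\theta(A_i, A_j),
+\quad \text{for all } i, j \in [d]. \tag{26}
+\]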
+
+Here, covθ denotes the covariance computed with respect to the probability distribution p(·; θ).
+The convex support of EA,ν is defined as:
+
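+\[
+\operatorname{conv} A := \operatorname{conv}\{a_x : x \in \mathcal{X}\}
+= \bigl\{\mathbb{E}_p[A] : p \in \Delta_{|\mathcal{X}|-1}\bigr\}
+= \bigl\{\mathbb{E}_p[A] : p \in \overline{\mathcal{E}_{A,\nu}}\bigr\}, \tag{27}
+\]
+
+where $\operatorname{conv} S$ is the set of all convex combinations of points in $S$. The moment map $\mu : p \in \Delta_{n-1} \mapsto
+A \cdot p \in \mathbb{R}^d$ (with $n = |\mathcal{X}|$) restricts to a homeomorphism $\overline{\mathcal{E}_{A,\nu}} \to \operatorname{conv} A$; see [16]. Here, $\overline{\mathcal{E}_{A,\nu}}$ denotes the Euclidean
+closure of $\mathcal{E}_{A,\nu}$. The inverse of $\mu$ will be denoted by $\mu^{-1} : \operatorname{conv} A \to \overline{\mathcal{E}_{A,\nu}} \subseteq \Delta_{n-1}$. This gives a natural
+embedding of the polytope $\operatorname{conv} A$ in the probability simplex $\Delta_{|\mathcal{X}|-1}$. Note that the convex support is
+independent of the reference measure $\nu$. See [17] for more details.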
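+
+As a small illustration of Equations (25)-(27), consider the following toy case. The specific choice of $\mathcal{X}$, $A$ and $\nu$ is an assumption made here for illustration only and does not come from the text.
+
+```latex
+% Assumed toy example: X = {1, 2, 3}, d = 1, A = (0, 1, 2), nu = 1.
+\begin{align*}
+p(x;\theta) &= \frac{e^{\theta a_x}}{1 + e^{\theta} + e^{2\theta}},
+  \qquad a_1 = 0,\ a_2 = 1,\ a_3 = 2, \\
+\operatorname{conv} A &= [0, 2], \\
+% At theta = 0 the distribution is uniform, so E[A] = 1 and E[A^2] = 5/3.
+g_{\theta=0}(\partial_\theta, \partial_\theta) &= \operatorname{var}_{\theta=0}(A)
+  = \tfrac{0 + 1 + 4}{3} - 1^2 = \tfrac{2}{3}.
+\end{align*}
+```
+
+At $\theta = 0$ the moment map sends the uniform distribution to the midpoint $1$ of $[0, 2]$. As $\theta \to +\infty$ or $\theta \to -\infty$, the mean approaches the endpoints $2$ and $0$, which are attained only in the closure of the family.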
+
+5.2. Invariance Fisher Metric Characterizations for Polytopes
+
+Let $P \subseteq \mathbb{R}^d$ be a polytope with $n$ vertices $a_1, \dots, a_n$. Let $A = (a_1, \dots, a_n)$ be the matrix with
+columns $a_i \in \mathbb{R}^d$ for all $i \in [n]$. Then, $\mathcal{E}_A \subseteq \Delta^\circ_{n-1}$ is an exponential family with convex support $P$. We
+will also denote this exponential family by $\mathcal{E}_P$. We can use the inverse of the moment map, $\mu^{-1}$, to pull
+back geometric structures on $\Delta^\circ_{n-1}$ to the relative interior $P^\circ$ of $P$.
+
+Definition 5. The Fisher metric on $P^\circ$ is the pull-back of the Fisher metric on $\mathcal{E}_A \subseteq \Delta^\circ_{n-1}$ by $\mu^{-1}$.
+
+Some obvious questions are: Why is this a natural construction? Which maps between polytopes
+are isometries between their Fisher metrics? Can we find a characterization of Chentsov type for
+this metric?
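+Affine maps are natural maps between polytopes. However, in order to obtain isometries, we
+need to impose some additional constraints. Consider two polytopes $P \subseteq \mathbb{R}^d$, $P' \subseteq \mathbb{R}^{d'}$ and an affine
+map $\varphi : \mathbb{R}^d \to \mathbb{R}^{d'}$ that satisfies $\varphi(P) \subseteq P'$. A natural condition in the context of exponential families is
+that $\varphi$ restricts to a bijection between the set $\operatorname{vert}(P)$ of vertices of $P$ and the set $\operatorname{vert}(P')$ of vertices of $P'$.
+In this case, $\mathcal{E}_{P'} \subseteq \mathcal{E}_P \subseteq \Delta^\circ_{n-1}$. Moreover, the moment map $\mu'$ of $P'$ factorizes through the moment
+map $\mu$ of $P$: $\mu' = \varphi \circ \mu$. Let $\varphi^{-1} = \mu \circ \mu'^{-1}$. Then, the following diagram commutes:
+
+\[
+\begin{array}{ccccc}
+P^\circ & \xrightarrow{\ \mu^{-1}\ } & \mathcal{E}_P & \subseteq & \Delta^\circ_{n-1} \\
+\Big\uparrow{\scriptstyle \varphi^{-1}} & & \Big\uparrow{\scriptstyle \subseteq} & & \\
+P'^\circ & \xrightarrow{\ \mu'^{-1}\ } & \mathcal{E}_{P'} & &
+\end{array}
+\tag{28}
+\]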
+
+
+It follows that φ−1 is an isometry from P′◦ to its image in P◦. Observe that the inverse moment map
+itself arises in this way: In the diagram (28), if P is equal to Δn−1, then the upper moment map μ−1 is
+the identity map, and φ−1 equals the inverse moment map μ′−1 of P′.
+The constraint of mapping vertices to vertices bijectively is very restrictive. In order to consider
+a larger class of affine maps, we need to generalize our construction from polytopes to weighted
+point configurations.
+
+Definition 6. A weighted point configuration is a pair $(A, \nu)$ consisting of a matrix $A \in \mathbb{R}^{d\times n}$ with columns
+$a_1, \dots, a_n$ and a positive weight function $\nu : \{1, \dots, n\} \to \mathbb{R}_+$ assigning a weight to each column $a_i$. The pair
+$(A, \nu)$ defines the exponential family $\mathcal{E}_{A,\nu}$.
+The $(A, \nu)$-Fisher metric on $(\operatorname{conv} A)^\circ$ is the pull-back of the Fisher metric on $\Delta^\circ_{n-1}$ through the inverse
+of the moment map.
+
+We recover Definition 5 as follows. For a polytope P, let A be the point configuration consisting
+of the vertices of P. Moreover, let ν be a constant function. Then, EP = EA,ν, and the two Definitions of
+the Fisher metric on P◦ coincide.
+The following are natural maps between weighted point configurations:
+
+Definition 7. Let $(A, \nu)$, $(A', \nu')$ be two weighted point configurations with $A = (a_i)_i \in \mathbb{R}^{d\times n}$ and
+$A' = (a'_j)_j \in \mathbb{R}^{d'\times n'}$. A morphism $(A, \nu) \to (A', \nu')$ is a pair $(\varphi, \sigma)$ consisting of an affine map $\varphi : \mathbb{R}^d \to \mathbb{R}^{d'}$ and a
+surjective map $\sigma : \{1, \dots, n\} \to \{1, \dots, n'\}$ with $\varphi(a_i) = a'_{\sigma(i)}$ and $\nu'(a'_j) = \alpha \sum_{i : \sigma(i) = j} \nu(a_i)$, where $\alpha > 0$ is
+a constant that does not depend on $j$.
+
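+Consider a morphism $(\varphi, \sigma) : (A, \nu) \to (A', \nu')$. For each $j \in [n']$, let $A_j = \{i : \varphi(a_i) = a'_j\}$. Then,
+$(A_1, \dots, A_{n'})$ is a partition of $[n]$. Define a matrix $Q \in \mathbb{R}^{n'\times n}$ by:
+
+\[
+Q_{ji} =
+\begin{cases}
+\dfrac{\nu(i)}{\sum_{i' \in A_j} \nu(i')}, & \text{if } i \in A_j, \\[2ex]
+0, & \text{else.}
+\end{cases}
+\tag{29}
+\]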
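+
+To make Equation (29) concrete, here is a small worked instance. The point configuration, weights and maps below are hypothetical choices made for illustration; they are not taken from the text.
+
+```latex
+% Assumed toy morphism: n = 3, n' = 2, alpha = 1.
+% Points a_1 = (0,0), a_2 = (1,0), a_3 = (1,1) in R^2; phi(x, y) = x; a'_1 = 0, a'_2 = 1.
+% sigma(1) = 1, sigma(2) = sigma(3) = 2; weights nu = (1, 1, 3), hence nu' = (1, 4).
+\[
+A_1 = \{1\}, \quad A_2 = \{2, 3\}, \qquad
+Q = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{4} & \tfrac{3}{4} \end{pmatrix},
+\qquad
+(p'_1, p'_2) \cdot Q = \bigl(p'_1,\ \tfrac{1}{4} p'_2,\ \tfrac{3}{4} p'_2\bigr).
+\]
+```
+
+Each row of $Q$ is a probability vector supported on its own block $A_j$, so multiplying by $Q$ splits the mass of each point $a'_j$ among its preimages in proportion to their weights.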
+
+Then, Q is a Markov mapping, and the following diagram commutes:
+
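+\[
+\begin{array}{ccccc}
+(\operatorname{conv} A)^\circ & \xrightarrow{\ \mu^{-1}\ } & \mathcal{E}_{A,\nu} & \subseteq & \Delta^\circ_{n-1} \\
+\Big\uparrow{\scriptstyle \varphi^{-1}} & & & & \Big\uparrow{\scriptstyle Q} \\
+(\operatorname{conv} A')^\circ & \xrightarrow{\ \mu'^{-1}\ } & \mathcal{E}_{A',\nu'} & \subseteq & \Delta^\circ_{n'-1}
+\end{array}
+\tag{30}
+\]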
+
+By Chentsov’s Theorem (Theorem 1), Q is an isometric embedding. It follows that φ−1 also induces an
+isometric embedding. This shows the first part of the following Theorem:
+
+Theorem 6.
+
+• Let $(\varphi, \sigma) : (A, \nu) \to (A', \nu')$ be a morphism of weighted point configurations. Then, $\varphi^{-1} :
+(\operatorname{conv} A')^\circ \to (\operatorname{conv} A)^\circ$ is an isometric embedding with respect to the Fisher metrics on $(\operatorname{conv} A)^\circ$
+and $(\operatorname{conv} A')^\circ$.
+• Let $g_{A,\nu}$ be a Riemannian metric on $(\operatorname{conv} A)^\circ$ for each weighted point configuration $(A, \nu)$. If every
+morphism $(\varphi, \sigma) : (A, \nu) \to (A', \nu')$ of weighted point configurations induces an isometric embedding
+$\varphi^{-1} : (\operatorname{conv} A')^\circ \to (\operatorname{conv} A)^\circ$, then there exists a constant $\alpha \in \mathbb{R}_+$ such that $g_{A,\nu}$ is equal to $\alpha$ times
+the $(A, \nu)$-Fisher metric.
+
+
+Proof. The first statement follows from the discussion before the Theorem. For the second statement,
+we show that under the given assumptions, all Markov maps are isometric embeddings. By Chentsov’s
+Theorem (Theorem 1), this implies that the metrics gP agree with the Fisher metric whenever P is
+a simplex. The statement then follows from the two facts that the metric on P◦ or (conv A)◦ is the
+pull-back of the Fisher metric through the inverse of the moment map and that μ−1 is itself a morphism.
+Observe that $\Delta_{n-1} = \operatorname{conv} I_n = \operatorname{conv}\{e_1, \dots, e_n\}$ is a polytope, and $\Delta^\circ_{n-1}$ is the corresponding
+exponential family. Consider a Markov embedding $Q : \Delta^\circ_{n'-1} \to \Delta^\circ_{n-1}$, $p \mapsto p \cdot Q$. Let $\nu(i) = \sum_j Q_{ji}$
+be the value of the unique non-zero entry of $Q$ in the $i$-th column. This defines a morphism and an
+embedding as follows:
+Let $A$ be the matrix that arises from $Q$ by replacing each non-zero entry by one. We define $\varphi$
+as the linear map represented by the matrix $A$, and define $\sigma : [n] \to [n']$ by $\sigma(j) = i$ if and only
+if $a_j = e_i$, that is, $\sigma(j)$ indicates the row $i$ in which the $j$-th column of $A$ is non-zero. Then, $(\varphi, \sigma)$
+is a morphism $(I_n, \nu) \to (I_{n'}, 1)$, and by assumption, the inverse $\varphi^{-1}$ is an isometric embedding
+$\Delta^\circ_{n'-1} \to \Delta^\circ_{n-1}$. However, $\varphi^{-1}$ is equal to the Markov map $Q$. This shows that all Markov maps are
+isometric embeddings, and so, by Chentsov’s Theorem, the statement holds true on the simplices.
+
+Theorem 6 defines a natural metric on $(\Delta^k_{m-1})^\circ$ that we want to discuss in more detail next.
+
+5.3. Independence Models and Conditional Polytopes
+
+Consider $k$ random variables with finite state spaces $[n_1], \dots, [n_k]$. The independence model
+consists of all joint distributions $p \in \Delta_{\prod_{i\in[k]} n_i - 1}$ of these variables that factorize as:
+
+\[
+p(x_1, \dots, x_k) = \prod_{i \in [k]} p_i(x_i),
+\quad \text{for all } x_1 \in [n_1], \dots, x_k \in [n_k], \tag{31}
+\]
+
+where pi ∈ Δni−1 for all i ∈ [k]. Assuming fixed n1, . . . , nk, we denote the independence model by Ek.
+It is the Euclidean closure of an exponential family (with observables of the form δiyi). The convex
+support of Ek is equal to the product of simplices Pk := Δn1−1 × · · · × Δnk−1. The parametrization (31)
+corresponds to the inverse of the moment map.
+We can write any tangent vector $u \in T_{(p_1,\dots,p_k)}P_k^\circ$ of this open product of simplices as a linear
+combination $u = \sum_{i\in[k]} \sum_{x_i\in[n_i]} u_{ix_i}\partial_{i,x_i}$, where $\sum_{x_i\in[n_i]} u_{ix_i} = 0$ for all $i \in [k]$. Given two such tangent
+vectors $u$ and $v$, the Fisher metric is given by:
+
+\[
+g^{P_k}_{(p_1,\dots,p_k)}(u, v) = \sum_{i\in[k]} \sum_{x_i\in[n_i]} \frac{u_{ix_i} v_{ix_i}}{p_i(x_i)}. \tag{32}
+\]
+
+Just as the convex support of the independence model is the Cartesian product of probability
+simplices, the Fisher metric on the independence model is the product metric of the Fisher metrics
+on the probability simplices of the individual variables. If $n_1 = \cdots = n_k =: n$, then $P_k = \Delta^k_{n-1}$ can be
+identified with the set of $k \times n$ stochastic matrices.
+The Fisher metric on the product of simplices is equal to the product of the Fisher metrics on the
+factors. More generally, if $P = Q_1 \times Q_2$ is a Cartesian product, then the Fisher metric on $P^\circ$ is equal
+to the product of the Fisher metrics on $Q_1^\circ$ and $Q_2^\circ$. In fact, in this case, the inverse of the moment
+map of $P$ can be expressed in terms of the two moment map inverses $\mu_1^{-1} : Q_1 \to \mathcal{E}_{Q_1} \subseteq \Delta_{m_1-1}$ and
+$\mu_2^{-1} : Q_2 \to \mathcal{E}_{Q_2} \subseteq \Delta_{m_2-1}$ and the moment map $\tilde\mu$ of the independence model $\Delta_{m_1-1} \times \Delta_{m_2-1}$, by:
+
+\[
+\mu^{-1}(q_1, q_2) = \tilde\mu^{-1}\bigl(\mu_1^{-1}(q_1), \mu_2^{-1}(q_2)\bigr). \tag{33}
+\]
+
+Therefore, the pull-back by μ−1 factorizes through the pull-back by ˜μ−1, and since the independence
+model carries a product metric, the product of polytopes also carries a product metric.
+
+
+Let us compare the metric $g^{(k,m)}_K$ from Equation (24) with the Fisher metric $g^{P_k}_{(K_1,\dots,K_k)}$ from
+Equation (32) on the product of simplices $P^\circ = (\Delta^k_{m-1})^\circ$. In both cases, the metric is a product
+metric; that is, it has the form:
+
+\[
+g = g_1 + \cdots + g_k, \tag{34}
+\]
+
+where $g_i$ is a metric on the $i$-th factor $\Delta^\circ_{m-1}$. For $g^{\Delta^k_{m-1}}_K$, $g_i$ is equal to the Fisher metric on $\Delta^\circ_{m-1}$. However,
+for $g^{(k,m)}_K$, $g_i$ is equal to $1/k$ times the Fisher metric on $\Delta^\circ_{m-1}$. Since this factor only depends on $k$, it
+only plays a role if stochastic matrices of different sizes are compared. The additional factor of $1/k$ can
+be interpreted as the uniform distribution on $k$ elements. This is related to another more general class
+of Riemannian metrics that are used in applications; namely, given a function $K \in \Delta^k_{m-1} \mapsto \rho_K \in \mathbb{R}^k_+$,
+it is common to use product metrics with $g_i$ equal to $\rho_K(i)$ times the Fisher metric on $\Delta^\circ_{m-1}$. When $K$
+has the interpretation of a channel or when $K$ describes the policy by which a system reacts to some
+sensor values, a natural possibility is to let $\rho_K$ be the stationary distribution of the channel input or of
+the sensor values, respectively. We will discuss this approach in Section 6.
+
+6. Weighted Product Metrics for Conditional Models
+
+In this section, we consider metrics on spaces of stochastic matrices defined as weighted sums
+of the Fisher metrics on the spaces of the matrix rows, similar to Equation (34). This kind of metric
+was used initially by Amari [1] in order to define a natural gradient in the supervised learning context.
+Later, in the context of reinforcement learning, Kakade [2] defined a natural policy gradient based on
+this kind of metric, which has been further developed by Peters et al. [10]. Related applications within
+unsupervised learning have been pursued by Zahedi et al. [18].
+Consider the following weighted product Fisher metric:
+
+\[
+g^{\rho,m}_K = \sum_a \rho_K(a)\, g^{(m),a}_{K_a},
+\quad \text{for all } K \in (\Delta^k_{m-1})^\circ, \tag{35}
+\]
+
+where $g^{(m),a}_{K_a}$ denotes the Fisher metric of $\Delta^\circ_{m-1}$ at the $a$-th row of $K$ and $\rho_K \in \Delta^\circ_{k-1}$ is a probability
+distribution over $a$ associated with each $K \in (\Delta^k_{m-1})^\circ$. For example, the distribution $\rho_K$ could be the
+stationary distribution of sensor values observed by an agent when operating under a policy described
+by $K$.
+In the following, we will try to illuminate the properties of polytope embeddings that yield the
+metric (35) as the pull-back of the Fisher information metric on a probability simplex. We will focus on
+the case that ρK = ρ is independent of K.
+There are two direct ways of embedding $\Delta^k_{m-1}$ in a probability simplex. In Section 5, we used
+the inverse of the moment map of an exponential family, possibly with some reference measure. This
+embedding is illustrated in the left panel of Figure 2. If we are given a fixed probability distribution
+$\rho \in \Delta^\circ_{k-1}$, there is a second natural embedding $\psi_\rho : \Delta^k_{m-1} \to \Delta_{k\cdot m-1}$ defined as follows:
+
+\[
+\psi_\rho(K)(x, y) = \rho(x) K_{x,y},
+\quad \text{for all } x \in [k],\ y \in [m]. \tag{36}
+\]
+
+If $\rho$ is the distribution of a random variable $X$ and $K \in \Delta^k_{m-1}$ is the stochastic matrix describing the
+conditional distribution of another variable $Y$ given $X$, then $\psi_\rho(K)$ is the joint distribution of $X$ and $Y$.
+Note that $\psi_\rho$ is an affine embedding. See the right panel of Figure 2 for an illustration.
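+
+A short numerical instance of the embedding in Equation (36) may help. The values of $\rho$ and $K$ below are assumptions chosen for illustration.
+
+```latex
+% Assumed toy values: k = m = 2, rho = (1/4, 3/4).
+\[
+K = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\[0.5ex] \tfrac{1}{3} & \tfrac{2}{3} \end{pmatrix},
+\qquad
+\psi_\rho(K) = \begin{pmatrix} \tfrac{1}{4}\cdot\tfrac{1}{2} & \tfrac{1}{4}\cdot\tfrac{1}{2} \\[0.5ex]
+                                \tfrac{3}{4}\cdot\tfrac{1}{3} & \tfrac{3}{4}\cdot\tfrac{2}{3} \end{pmatrix}
+             = \begin{pmatrix} \tfrac{1}{8} & \tfrac{1}{8} \\[0.5ex] \tfrac{1}{4} & \tfrac{1}{2} \end{pmatrix}.
+\]
+% The entries sum to 1/8 + 1/8 + 1/4 + 1/2 = 1, and the row sums recover rho = (1/4, 3/4).
+```
+
+Because each row of $K$ is only rescaled by the fixed number $\rho(x)$, convex combinations of stochastic matrices are mapped to the same convex combinations of joint distributions, which is the affineness noted above.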
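+The pull-back of the Fisher metric on $\Delta^\circ_{km-1}$ through $\psi_\rho$ is given by:
+
+\[
+\begin{aligned}
+g^{(km)}_{\psi_\rho(K)}(\psi_{\rho*}u, \psi_{\rho*}v)
+&= \sum_{a,b} \sum_{c,d} \sum_{i,j} \rho(i) K_{ij}\, u_{ab} \frac{\partial \log \rho(i)K_{ij}}{\partial K_{ab}}\, v_{cd} \frac{\partial \log \rho(i)K_{ij}}{\partial K_{cd}} \\
+&= \sum_i \rho(i) \sum_j \frac{u_{ij} v_{ij}}{K_{ij}}
+ = \sum_i \rho(i)\, g^{i}_{K_i}(u_i, v_i) = g^{\rho,m}_K(u, v).
+\end{aligned}
+\tag{37}
+\]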
+
+
+This recovers the weighted sum of Fisher metrics from Equation (35).
+
+[Figure 2: two panels. Left panel labels: $\Delta^k_{m-1}$, $\Delta_{m^k-1}$, $\mathcal{E}_k$. Right panel labels: $\Delta_{k\cdot m-1}$, $\psi_{(1/4,\,3/4)}\Delta^k_{m-1}$, $\psi_{(1/2,\,1/2)}\Delta^k_{m-1}$.]
+
+Figure 2. An illustration of different embeddings of the conditional polytope $\Delta^k_{m-1}$ in a probability
+simplex. The left panel shows an embedding in $\Delta_{m^k-1}$ by the inverse of the moment map $\mu$ of the
+independence model. The right panel shows an affine embedding in $\Delta_{k\cdot m-1}$ as a set of joint probability
+distributions for two different specifications of marginals.
+
+Are there natural maps that leave the metrics $g^{\rho,m}$ invariant? Let us reconsider the stochastic
+embeddings from Definition 4. Let $R$ be a $k \times l$ indicator partition matrix and $\overline{R}$ a stochastic partition
+matrix with the same block structure as $R$. Observe that to each indicator partition matrix $R$ there are
+many compatible stochastic partition matrices $\overline{R}$, but the indicator partition matrix $R$ for any stochastic
+partition matrix $\overline{R}$ is unique. Furthermore, let $Q = \{Q^{(a)}\}_{a\in[k]}$ be a collection of stochastic partition
+matrices. The corresponding conditional embedding $f$ maps $K \in \Delta^k_{m-1}$ to $f(K) := R^\top(K \otimes Q) \in \Delta^l_{n-1}$.
+Let $\rho \in \Delta^\circ_{k-1}$. Suppose that $K$ describes the conditional distribution of $Y$ given $X$ and that $P = \psi_\rho(K)$
+describes the joint distribution of $Y$ and $X$. As explained in Section 4.1, the matrix $\overline{f}(P) := \overline{R}^\top(P \otimes Q)$
+describes the joint distribution of a pair of random variables $(X', Y')$, and the conditional distribution
+of $Y'$ given $X'$ is given by $f(K)$. In this situation, the marginal distribution of $X'$ is given by $\rho' = \rho\overline{R}$.
+Therefore, the following diagram commutes:
+
+\[
+\begin{array}{ccc}
+(\Delta^k_{m-1})^\circ & \xrightarrow{\ \psi_\rho\ } & \Delta^\circ_{mk-1} \\
+\Big\downarrow{\scriptstyle f} & & \Big\downarrow{\scriptstyle \overline{f}} \\
+(\Delta^l_{n-1})^\circ & \xrightarrow{\ \psi_{\rho'}\ } & \Delta^\circ_{nl-1}
+\end{array}
+\tag{38}
+\]
+
+The preceding discussion implies the first statement of the following result:
+
+
+Theorem 7.
+
+• For any $k \ge 1$ and $m \ge 2$ and any $\rho \in \Delta^\circ_{k-1}$, the Riemannian metric $g^{\rho,m}$ on $(\Delta^k_{m-1})^\circ$ satisfies:
+
+\[
+g^{\rho,m} = f^{*}(g^{\rho',n}), \quad \text{for } \rho' = \rho\overline{R}, \tag{39}
+\]
+
+for any conditional embedding $f : K \mapsto R^\top(K \otimes Q)$, where $\overline{R}$ denotes a stochastic partition matrix with the same block structure as the indicator partition matrix $R$.
+• Conversely, suppose that for any $k \ge 1$ and $m \ge 2$ and any $\rho \in \Delta^\circ_{k-1}$, there is a Riemannian metric
+$g^{(\rho,m)}$ on $(\Delta^k_{m-1})^\circ$, such that Equation (39) holds for all conditional embeddings, and suppose that $g^{(\rho,m)}$
+depends continuously on $\rho$. Then, there is a constant $A > 0$ that satisfies $g^{(\rho,m)} = A g^{\rho,m}$.
+
+Proof. The first statement follows from the commutative diagram (38). For the second statement,
+denote by $\rho^k$ the uniform distribution on a set of $k$ elements. If $f : K \mapsto R^\top(K \otimes Q)$ is a homogeneous
+conditional embedding of $\Delta^k_{m-1}$ in $\Delta^l_{n-1}$, then $\overline{R} = \frac{k}{l} R$ is a stochastic partition matrix corresponding
+to the partition indicator matrix $R$. Observe that $\rho^l = \rho^k \overline{R}$. Therefore, the family of Riemannian
+metrics $g^{(\rho^k,m)}$ on $\Delta^k_{m-1}$ satisfies the assumptions of Theorem 4. Therefore, there is a constant $A > 0$
+for which $g^{(\rho^k,m)}$ equals $A/k$ times the product Fisher metric. This proves the statement for uniform
+distributions $\rho$.
+A general distribution $\rho \in \Delta^\circ_{k-1}$ can be approximated by a distribution with rational probabilities.
+Since $g^{(\rho,m)}$ is assumed to be continuous, it suffices to prove the statement for rational $\rho$. In this case,
+there exists a stochastic partition matrix $\overline{R}$ for which $\rho' := \rho\overline{R}$ is a uniform distribution, and so, $g^{(\rho',n)}$
+is of the desired form. Equation (39) shows that $g^{(\rho,m)}$ is also of the desired form.
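+
+The last step of the proof, refining a rational distribution to a uniform one, can be illustrated on a minimal case. The numbers below are assumptions chosen for illustration, and $\overline{R}$ is written for the stochastic partition matrix.
+
+```latex
+% Assumed toy case: k = 2, l = 3, rho = (1/3, 2/3).
+% Row 1 keeps its mass on one column; row 2 splits its mass evenly over two columns.
+\[
+\overline{R} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix},
+\qquad
+\rho\,\overline{R} = \bigl(\tfrac{1}{3},\ \tfrac{2}{3}\cdot\tfrac{1}{2},\ \tfrac{2}{3}\cdot\tfrac{1}{2}\bigr)
+ = \bigl(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\bigr).
+\]
+% In general, for rho_a = c_a / l with positive integers c_a summing to l,
+% give block a exactly c_a columns, each with entry 1 / c_a.
+```
+
+The same recipe works for any rational $\rho$ once its entries are written over a common denominator $l$.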
+
+7. Gradient Fields and Replicator Equations
+
+In this section, we use gradient fields in order to compare Riemannian metrics on the space
+$(\Delta^k_{n-1})^\circ$.
+
+7.1. Replicator Equations
+
+We start with gradient fields on the simplex $\Delta^\circ_{n-1}$. A Riemannian metric $g$ on $\Delta^\circ_{n-1}$ allows us to
+consider gradient fields of differentiable functions $F : \Delta^\circ_{n-1} \to \mathbb{R}$. To be more precise, consider the
+differential $d_pF : T_p\Delta^\circ_{n-1} \to \mathbb{R}$ of $F$ in $p$. It is a linear form on $T_p\Delta^\circ_{n-1}$, which maps each tangent vector
+$u$ to $d_pF(u) = \frac{\partial F}{\partial u}(p) \in \mathbb{R}$. Using the map $u \mapsto g_p(u, \cdot)$, this linear form can be identified with a tangent
+vector in $T_p\Delta^\circ_{n-1}$, which we denote by $\operatorname{grad}_p F$. If we choose the Fisher metric $g^{(n)}$ as the Riemannian
+metric, we obtain the gradient in the following way. First consider a differentiable extension of $F$ to the
+positive cone $\mathbb{R}^n_+$, which we will denote by the same symbol $F$. With the partial derivatives $\partial_i F$ of $F$,
+the Fisher gradient of $F$ on the simplex $\Delta^\circ_{n-1}$ is given as:
+
+\[
+(\operatorname{grad}_p F)_i = p_i \Bigl( \partial_i F(p) - \sum_{j=1}^n p_j\, \partial_j F(p) \Bigr),
+\quad i \in [n]. \tag{40}
+\]
+
+Note that the expression on the right-hand side of Equation (40) does not depend on the particular
+differentiable extension of $F$ to $\mathbb{R}^n_+$. The corresponding differential equation is well known in theoretical
+biology as the replicator equation; see [19,20]:
+
+\[
+\dot p_i = p_i \Bigl( \partial_i F(p) - \sum_{j=1}^n p_j\, \partial_j F(p) \Bigr),
+\quad i \in [n]. \tag{41}
+\]
+
+
+We now apply this gradient formula to functions that have the structure of an expectation value.
+Given real numbers $F_i$, $i \in [n]$, referred to as fitness values, we consider the mean fitness:
+
+\[
+\bar F(p) := \sum_{i=1}^n p_i F_i. \tag{42}
+\]
+
+Replacing the $p_i$ by any positive real numbers leads to a differentiable extension of $\bar F$, also denoted
+by $\bar F$. Obviously, we have $\partial_i \bar F = F_i$, which leads to the following replicator equation:
+
+\[
+\dot p_i = p_i \bigl(F_i - \bar F(p)\bigr), \quad i \in [n]. \tag{43}
+\]
+
+This equation has the solution:
+
+\[
+p_i(t) = \frac{p_i(0)\, e^{tF_i}}{\sum_{j=1}^n p_j(0)\, e^{tF_j}}, \quad i \in [n]. \tag{44}
+\]
+
+Clearly, the mean fitness will increase along this solution of the gradient field. The rate of increase can
+be easily calculated:
+
+\[
+\frac{d}{dt} \bar F\bigl(p(t)\bigr) = \sum_{i=1}^n \dot p_i(t)\, F_i = \sum_{i=1}^n p_i \bigl(F_i - \bar F(p)\bigr) F_i = \sum_{i=1}^n p_i \bigl(F_i - \bar F(p)\bigr)^2 = \operatorname{var}_p(F) \ge 0, \tag{45}
+\]
+
+with strict inequality unless all fitness values $F_i$ coincide. As limit points of this solution, we obtain:
+
+\[
+\lim_{t\to-\infty} p_i(t) =
+\begin{cases}
+\dfrac{p_i(0)}{\sum_{j \in \operatorname{argmin} F} p_j(0)}, & \text{if } i \in \operatorname{argmin} F, \\[2ex]
+0, & \text{otherwise,}
+\end{cases}
+\quad i \in [n], \tag{46}
+\]
+
+and:
+
+\[
+\lim_{t\to+\infty} p_i(t) =
+\begin{cases}
+\dfrac{p_i(0)}{\sum_{j \in \operatorname{argmax} F} p_j(0)}, & \text{if } i \in \operatorname{argmax} F, \\[2ex]
+0, & \text{otherwise,}
+\end{cases}
+\quad i \in [n]. \tag{47}
+\]
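+
+The formulas (43)-(47) can be checked by hand in the smallest non-trivial case. The fitness values and the initial point below are assumptions chosen for illustration.
+
+```latex
+% Assumed toy case: n = 2, F = (F_1, F_2) = (0, 1), p(0) = (1/2, 1/2).
+\begin{align*}
+p_1(t) &= \frac{1}{1 + e^{t}}, \qquad p_2(t) = \frac{e^{t}}{1 + e^{t}}
+  && \text{(Equation (44))} \\
+\bar F\bigl(p(t)\bigr) &= p_2(t), \qquad
+\frac{d}{dt}\bar F\bigl(p(t)\bigr) = p_2(t)\bigl(1 - p_2(t)\bigr) = p_1(t)\,p_2(t) = \operatorname{var}_{p(t)}(F)
+  && \text{(Equation (45))} \\
+\lim_{t\to-\infty} p(t) &= (1, 0), \qquad \lim_{t\to+\infty} p(t) = (0, 1)
+  && \text{(Equations (46), (47))}
+\end{align*}
+```
+
+Here $p_2$ follows the logistic curve, so the mean fitness rises fastest at $t = 0$, where the variance $p_1 p_2$ takes its maximum value $1/4$.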
+
+7.2. Extension of the Replicator Equations to Stochastic Matrices
+
+Now, we come to the corresponding considerations of gradient fields in the context of stochastic
+matrices $K \in (\Delta^k_{n-1})^\circ$. We consider a function:
+
+\[
+K \mapsto F(K) = F(K_{11}, \dots, K_{1n}; K_{21}, \dots, K_{2n}; \dots; K_{k1}, \dots, K_{kn}). \tag{48}
+\]
+
+One way to deal with this is to consider for each $i \in [k]$ the corresponding replicator equation:
+
+\[
+\dot K_{ij} = K_{ij} \Bigl( \partial_{ij} F(K) - \sum_{j'=1}^n K_{ij'}\, \partial_{ij'} F(K) \Bigr),
+\quad j \in [n]. \tag{49}
+\]
+
+Obviously, this is the gradient field that one obtains by using the product Fisher metric on $(\Delta^k_{n-1})^\circ$
+(Equation (17)):
+
+\[
+g^{(k,m)}_K(u, v) = \sum_{ij} \frac{1}{K_{ij}}\, u_{ij} v_{ij}. \tag{50}
+\]
+
+If we replace the metric by the weighted product Fisher metric considered by Kakade (Equation (35)),
+
+\[
+g^{\rho,m}_K(u, v) = \sum_{ij} \frac{\rho_i}{K_{ij}}\, u_{ij} v_{ij}, \tag{51}
+\]
+
+
+then we obtain
+
+\[
+\dot K_{ij} = \frac{K_{ij}}{\rho_i} \Bigl( \partial_{ij} F(K) - \sum_{j'=1}^n K_{ij'}\, \partial_{ij'} F(K) \Bigr),
+\quad j \in [n]. \tag{52}
+\]
+
+7.3. The Example of Mean Fitness
+
+Next, we want to study how the gradient flows with respect to different metrics compare. We
+restrict to the class of metrics $g^{\rho,m}$ (Equation (35)), where $\rho \in \Delta^\circ_{k-1}$ is a probability distribution. In
+principle, one could drop the normalization condition $\sum_i \rho_i = 1$ and allow arbitrary coefficients $\rho_i$.
+However, it is clear that the rate of convergence can always be increased by scaling all values $\rho_i$ with a
+common positive factor. Therefore, some normalization condition is needed for $\rho$.
+With a probability distribution $p \in \Delta^\circ_{k-1}$ and fitness values $F_{ij}$, let us consider again the example
+of an expectation value function:
+
+\[
+\bar F(K) = \sum_{i=1}^k p_i \sum_{j=1}^n K_{ij} F_{ij}. \tag{53}
+\]
+
+With $\partial_{ij} \bar F(K) = p_i F_{ij}$, this leads to:
+
+\[
+\dot K_{ij} = \frac{p_i}{\rho_i} K_{ij} \Bigl( F_{ij} - \sum_{j'=1}^n K_{ij'} F_{ij'} \Bigr),
+\quad j \in [n]. \tag{54}
+\]
+
+The corresponding solutions are given by:
+
+\[
+K_{ij}(t) = \frac{K_{ij}(0)\, e^{t \frac{p_i}{\rho_i} F_{ij}}}{\sum_{j'=1}^n K_{ij'}(0)\, e^{t \frac{p_i}{\rho_i} F_{ij'}}},
+\quad i \in [k],\ j \in [n]. \tag{55}
+\]
+
+Since $\operatorname{argmax}(\frac{p_i}{\rho_i} F_{i\cdot})$ and $\operatorname{argmin}(\frac{p_i}{\rho_i} F_{i\cdot})$ are independent of $\rho_i > 0$, the limit points are given
+independently of the chosen $\rho$ as:
+
+\[
+\lim_{t\to-\infty} K_{ij}(t) =
+\begin{cases}
+\dfrac{K_{ij}(0)}{\sum_{j' \in \operatorname{argmin} F_{i\cdot}} K_{ij'}(0)}, & \text{if } j \in \operatorname{argmin} F_{i\cdot}, \\[2ex]
+0, & \text{otherwise,}
+\end{cases}
+\quad i \in [k], \tag{56}
+\]
+
+and:
+
+\[
+\lim_{t\to+\infty} K_{ij}(t) =
+\begin{cases}
+\dfrac{K_{ij}(0)}{\sum_{j' \in \operatorname{argmax} F_{i\cdot}} K_{ij'}(0)}, & \text{if } j \in \operatorname{argmax} F_{i\cdot}, \\[2ex]
+0, & \text{otherwise,}
+\end{cases}
+\quad i \in [k]. \tag{57}
+\]
+
+This is consistent with the fact that the critical points of gradient fields are independent of the chosen
+Riemannian metric. However, the speed of convergence does depend on the metric:
+For each $i$, let $G_i = \max_j F_{ij}$ and $g_i = \max_{j \notin \operatorname{argmax}(F_{i\cdot})} F_{ij}$ be the largest and second-largest values
+in the $i$-th row of $F_{ij}$, respectively. Then, as $t \to \infty$,
+
+\[
+K_{ij}(t) =
+\begin{cases}
+1 - O\bigl(\exp(-\tfrac{p_i}{\rho_i}(G_i - g_i)t)\bigr), & \text{if } j \in \operatorname{argmax} F_{i\cdot}, \\[1ex]
+O\bigl(\exp(-\tfrac{p_i}{\rho_i}(G_i - F_{ij})t)\bigr), & \text{otherwise.}
+\end{cases}
+\tag{58}
+\]
+
+
+Therefore,
+
+\[
+\begin{aligned}
+\bar F(K(t)) &= \sum_i p_i \sum_{j \in \operatorname{argmax} F_{i\cdot}} F_{ij} + O\Bigl(\exp\bigl(-\tfrac{p_i}{\rho_i}(G_i - g_i)t\bigr)\Bigr) \\
+&= \sum_i p_i \sum_{j \in \operatorname{argmax} F_{i\cdot}} F_{ij} + O\Bigl(\exp\bigl(-\inf_i \bigl\{\tfrac{p_i}{\rho_i}(G_i - g_i)\bigr\}\, t\bigr)\Bigr).
+\end{aligned}
+\tag{59}
+\]
+
+Thus, in the long run, the rate of convergence is given by $\inf_i \{\frac{p_i}{\rho_i}(G_i - g_i)\}$, which depends on the
+parameter $\rho$ of the metric. As a result, in this case study, the optimal choice of $\rho_i$, i.e., with the largest
+convergence rate, can be computed if the numbers $G_i$ and $g_i$ are known.
+Consider, for example, the case that the differences $G_i - g_i$ are of comparable sizes for all $i$. Then,
+we need to find the choice of $\rho$ that maximizes $\inf_i \{\frac{p_i}{\rho_i}\}$. Clearly, $\inf_i \{\frac{p_i}{\rho_i}\} \le 1$ (since there is always an
+index $i$ with $p_i \le \rho_i$). Equality is attained for the choice $\rho_i = p_i$. Thus, we recover the choice of Kakade.
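+
+The remark that the best $\rho$ can be computed from the gaps $G_i - g_i$ can be sketched as follows. This derivation is an added illustration, assuming that all gaps $d_i := G_i - g_i$ are strictly positive; the numerical values are hypothetical.
+
+```latex
+% Assumption: d_i := G_i - g_i > 0 for all i. Maximize r(rho) := min_i p_i d_i / rho_i over rho in the open simplex.
+% If r(rho) = r, then rho_i <= p_i d_i / r for all i; summing over i gives 1 <= (sum_i p_i d_i) / r.
+\[
+r(\rho) \le \sum_i p_i d_i,
+\qquad \text{with equality for} \qquad
+\rho^*_i = \frac{p_i d_i}{\sum_{j} p_j d_j}.
+\]
+% Assumed toy values: k = 2, p = (1/4, 3/4), d = (2, 1).
+\[
+\sum_i p_i d_i = \tfrac{1}{2} + \tfrac{3}{4} = \tfrac{5}{4},
+\qquad
+\rho^* = \bigl(\tfrac{2}{5}, \tfrac{3}{5}\bigr),
+\qquad
+r(\rho^*) = \tfrac{5}{4} > r(p) = \min(2, 1) = 1.
+\]
+```
+
+When all gaps $d_i$ are equal, $\rho^*$ reduces to $\rho_i = p_i$, in agreement with the choice of Kakade recovered above.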
+
+8. Conclusions
+
+So, which Riemannian metric should one use in practice on the set of stochastic matrices, $(\Delta^k_{m-1})^\circ$?
+The results provided in this manuscript give different answers, depending on the approach. In all
+cases, the characterized Riemannian metrics are products of Fisher metrics with suitable factor weights.
+Theorem 4 suggests using a factor weight proportional to $1/k$, and Theorem 6 suggests using a
+constant weight independent of $k$. In many cases, it is possible to work within a single conditional
+polytope $(\Delta^k_{m-1})^\circ$ and a fixed $k$, and then, these two results are basically equivalent. On the other
+hand, Theorem 7 gives an answer that allows arbitrary factor weights $\rho$.
+Which metric performs best obviously depends on the concrete application. The first observation
+is that in order to use the metric gρ,m of Theorem 7, it is necessary to know ρ. If the problem at
+hand suggests a natural marginal distribution ρ, then it is natural to make use of it and choose the
+metric gρ,m. Even if ρ is not known at the beginning, a learning system might try to learn it to improve
+its performance.
+On the other hand, there may be situations where there is no natural choice of the weights ρ.
+Observe that ρ breaks the symmetry of permuting the rows of a stochastic matrix. This is also expressed
+by the structural difference between Theorems 4 and 6 on the one side and Theorem 7 on the other.
+While the first two Theorems provide an invariance metric characterization, Theorem 7 provides a
+“covariance” classification; that is, the metrics gρ,m are not invariant under conditional embeddings,
+but they transform in a controlled manner. This again illustrates that the choice of a metric should
+depend on which mappings are natural to consider, e.g., which mappings describe the symmetries of a
+given problem.
+For example, consider a utility function of the form F = ∑i ρi ∑j KijFij. Row permutations do not
+leave gρ,m invariant (for a general ρ), but they are not symmetries of the utility function F, either, and
+hence, they are not very natural mappings to consider. However, row permutations transform the
+metric gρ,m and the utility function in a controlled manner; in such a way that the two transformations
+match. Therefore, in this case, it is natural to use gρ,m. On the other hand, when studying problems
+that are symmetric under all row permutations, it is more natural to use the invariant metric g(k,m).
+
+Appendix A Conditions for Positive Definiteness
+
+Equation (16) in Lebanon’s Theorem 3 defines a Riemannian metric whenever it defines a
+positive-definite quadratic form. The next Proposition gives necessary and sufficient conditions for
+this to be the case.
+
+
+Proposition A1. For each pair $k \ge 1$ and $m \ge 2$, consider the tensor on $\mathbb{R}^{k\times m}_+$ defined by:
+
+\[
+g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = A(|M|) + \delta_{ac} \Bigl( \frac{|M|}{|M_a|} B(|M|) + \delta_{bd} \frac{|M|}{M_{ab}} C(|M|) \Bigr) \tag{A1}
+\]
+
+for some differentiable functions $A, B, C \in C^\infty(\mathbb{R}_+)$. The tensor $g^{(k,m)}$ defines a Riemannian metric for all $k$
+and $m$ if and only if $C(\alpha) > 0$, $B(\alpha) + C(\alpha) > 0$ and $A(\alpha) + B(\alpha) + C(\alpha) > 0$ for all $\alpha \in \mathbb{R}_+$.
+
+Proof. The tensors are Riemannian metrics when:
+
+\[
+g^{(k,m)}_M(V) = A(|M|) \Bigl(\sum_{ab} V_{ab}\Bigr)^2 + B(|M|) \sum_a \frac{|M|}{|M_a|} \Bigl(\sum_b V_{ab}\Bigr)^2 + C(|M|) \sum_{ab} \frac{|M|}{M_{ab}} V_{ab}^2 \tag{A2}
+\]
+
+is strictly positive for all non-zero $V \in \mathbb{R}^{k\times m}$, for all $M \in \mathbb{R}^{k\times m}_+$.
+We can derive necessary conditions on the functions $A, B, C$ from some basic observations.
+Choosing $V = \partial_{ab}$ in Equation (A2) shows that $A(|M|) + \frac{|M|}{|M_a|} B(|M|) + \frac{|M|}{M_{ab}} C(|M|)$ has to be positive
+for all $a \in [k]$, $b \in [m]$, for all $M \in \mathbb{R}^{k\times m}_+$. Since $M_{ab}$ can be arbitrarily small for fixed $|M|$ and $|M_a|$, we
+see that $C$ has to be non-negative. Since we can choose $|M_a| \approx M_{ab} \ll |M|$ for a fixed $|M|$, we find
+that $B + C$ has to be non-negative. Further, since we can choose $M_{ab} \approx |M_a| \approx |M|$ for a given $|M|$,
+we find that $A + B + C$ has to be non-negative. This shows that the quadratic form is positive definite
+only if $C \ge 0$, $B + C \ge 0$, $A + B + C \ge 0$. Since the cone of positive definite matrices is open, these
+inequalities have to be strictly satisfied. In the following, we study sufficient conditions.
+For any given $M \in \mathbb{R}^{k\times m}_+$, we can write Equation (A2) as a product $V^\top G V$, for all $V \in \mathbb{R}^{km}$, where
+$G = G_A + G_B + G_C \in \mathbb{R}^{km\times km}$ is the sum of a matrix $G_A$ with all entries equal to $A(|M|)$, a block
+diagonal matrix $G_B$ whose $a$-th block has all entries equal to $\frac{|M|}{|M_a|} B(|M|)$, and a diagonal matrix $G_C$ with
+diagonal entries equal to $\frac{|M|}{M_{ab}} C(|M|)$. The matrix $G$ is obviously symmetric, and by Sylvester’s criterion,
+it is positive definite iff all its leading principal minors are positive. We can evaluate the minors using
+Sylvester’s determinant Theorem. That Theorem states that for any invertible $m \times m$ matrix $X$, an
+$m \times n$ matrix $Y$ and an $n \times m$ matrix $Z$, one has the equality $\det(X + YZ) = \det(X) \det(I_n + ZX^{-1}Y)$.
+Let us consider a leading square block $G'$, consisting of all entries $G_{ab,cd}$ of $G$ with row-index pairs
+$(a, b)$ satisfying $b \in [m]$ for all $a < a'$ and $b \le b'$ for $a = a'$ for some $a' \le k$ and $b' \le m$; and the same
+restriction for the column index pairs. The corresponding block $G'_A + G'_B$ can be written as the rank-$a'$
+matrix $YZ$, with $Y$ consisting of columns $\mathbb{1}_a$ for all $a \le a'$ and $Z$ consisting of rows $A + \mathbb{1}_a \frac{|M|}{|M_a|} B$ for all
+$a \le a'$. Hence, the determinant of $G'$ is equal to:
+
+\[
+\det(G') = \det(G'_C) \cdot \det(I_{a'} + Z G'^{-1}_C Y). \tag{A3}
+\]
+
+Since $G'_C$ is diagonal, the first term is just:
+
+\[
+\det(G'_C) = \Bigl( \prod_{a<a'} \prod_b \frac{|M|}{M_{ab}} C \Bigr) \Bigl( \prod_{b \le b'} \frac{|M|}{M_{a'b}} C \Bigr). \tag{A4}
+\]
+
+The matrix in the second term of Equation (A3) is given by:
+
+\[
+I_{a'} + Z G'^{-1}_C Y = \frac{1}{C}
+\begin{pmatrix}
+C + B & & & \\
+ & \ddots & & \\
+ & & C + B & \\
+ & & & C + \frac{\sum_{b \le b'} M_{a'b}}{|M_{a'}|} B
+\end{pmatrix}
++ \frac{1}{C}
+\begin{pmatrix}
+\frac{|M_1|}{|M|} A & \cdots & \frac{|M_{a'-1}|}{|M|} A & \frac{\sum_{b \le b'} M_{a'b}}{|M|} A \\
+\vdots & & \vdots & \vdots \\
+\frac{|M_1|}{|M|} A & \cdots & \frac{|M_{a'-1}|}{|M|} A & \frac{\sum_{b \le b'} M_{a'b}}{|M|} A
+\end{pmatrix}.
+\tag{A5}
+\]
+
+
+By Sylvester’s determinant Theorem, we have:
+
+\[
+\begin{aligned}
+\det(I_{a'} + Z G'^{-1}_C Y)
+&= C^{-a'} (C + B)^{a'-1} \Bigl( C + \frac{\sum_{b \le b'} M_{a'b}}{|M_{a'}|} B \Bigr)
+   \Biggl( 1 + \sum_{a<a'} \frac{\frac{|M_a|}{|M|} A}{C + B} + \frac{\frac{\sum_{b \le b'} M_{a'b}}{|M|} A}{C + \frac{\sum_{b \le b'} M_{a'b}}{|M_{a'}|} B} \Biggr) \\
+&= \Bigl( \prod_{a} \frac{C + B_a}{C} \Bigr) \Bigl( 1 + \sum_{a} \frac{A_a}{C + B_a} \Bigr),
+\end{aligned}
+\tag{A6}
+\]
+
+where $A_a = \frac{|M_a|}{|M|} A$ for $a < a'$ and $A_{a'} = \frac{\sum_{b \le b'} M_{a'b}}{|M|} A$, and $B_a = B$ for $a < a'$ and $B_{a'} = \frac{\sum_{b \le b'} M_{a'b}}{|M_{a'}|} B$.
+This shows that the matrix $G$ is positive definite for all $M$ if and only if $C > 0$, $C + B > 0$ and
+$\bigl( 1 + \sum_{a \le a'} \frac{A_a}{C + B_a} \bigr) > 0$ for all $a'$ and $b'$. The latter inequality is satisfied whenever $A + B + C > 0$. This
+completes the proof.
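+
+Sylvester’s determinant Theorem, as used in the proof above, can be checked on a small diagonal-plus-rank-one matrix of the same shape as $G'_C + YZ$. The matrices below are assumptions chosen for illustration.
+
+```latex
+% Assumed toy matrices: X = diag(2, 3) (2 x 2), Y = (1, 1)^T (2 x 1), Z = (1, 1) (1 x 2).
+\begin{align*}
+\det(X + YZ) &= \det \begin{pmatrix} 3 & 1 \\ 1 & 4 \end{pmatrix} = 3 \cdot 4 - 1 \cdot 1 = 11, \\
+\det(X)\, \det(I_1 + Z X^{-1} Y) &= 6 \cdot \bigl( 1 + \tfrac{1}{2} + \tfrac{1}{3} \bigr) = 6 \cdot \tfrac{11}{6} = 11.
+\end{align*}
+```
+
+The right-hand side only requires inverting the diagonal part, which is why the determinant of each leading block in the proof reduces to a product over the diagonal entries times a single scalar or small-matrix factor.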
+
+Appendix B Proofs of the Invariance Characterization
+
+The following Lemma follows directly from the Definition and contains all the technical details
+we need for the proofs.
+
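+Lemma A1. The push-forward $f_* : T_M\mathbb{R}^{k\times m}_+ \to T_{f(M)}\mathbb{R}^{l\times n}_+$ of a map $f \in \hat{\mathcal{F}}^{l,n}_{k,m}$ is given by:
+
+\[
+f_*(\partial_{ab}) = \sum_{i=1}^l \sum_{j=1}^n R_{ai} Q^{(a)}_{bj}\, \partial'_{ij}, \tag{A7}
+\]
+
+and the pull-back of a metric $g^{(l,n)}$ on $\mathbb{R}^{l\times n}_+$ through $f$ is given by:
+
+\[
+(f^* g^{(l,n)})_M(\partial_{ab}, \partial_{cd}) = g^{(l,n)}_{f(M)}(f_*\partial_{ab}, f_*\partial_{cd})
+= \sum_{i=1}^l \sum_{j=1}^n \sum_{s=1}^l \sum_{t=1}^n R_{ai} R_{cs} Q^{(a)}_{bj} Q^{(c)}_{dt}\, g^{(l,n)}_{f(M)}(\partial'_{ij}, \partial'_{st}). \tag{A8}
+\]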
+
+Proof of Theorem 4. We follow the strategy of [5,14]. The idea is to consider subclasses of maps from
+the class $\mathcal{F}^{l,n}_{k,m}$ and to evaluate their push-forward and pull-back maps together with the isometry
+requirement. This yields restrictions on the possible metrics, eventually fully characterizing them.
+
+First. Consider the maps $h_{\pi,\sigma} \in \mathcal{F}^{k,m}_{k,m}$, resulting from permutation matrices $Q^{(a)} = P_{\pi_a}$, $\pi_a : [m] \to [m]$
+for all $a \in [k]$, and $R = P_\sigma$, $\sigma : [k] \to [k]$. Requiring isometry yields:
+
+\[
+(h_{\pi,\sigma})_*(\partial_{ab}) = \partial'_{\sigma(a)\, \pi_a(b)}, \tag{A9}
+\]
+
+\[
+g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = g^{(k,m)}_{h_{\pi,\sigma}(M)}(\partial_{\sigma(a)\, \pi_a(b)}, \partial_{\sigma(c)\, \pi_c(d)}). \tag{A10}
+\]
+
+Second. Consider the maps $r_{zw} \in \mathcal{F}^{kz,mw}_{k,m}$ defined by $Q^{(1)} = \cdots = Q^{(k)} \in \mathbb{R}^{m\times mw}$ and $R \in \mathbb{R}^{k\times kz}$
+being uniform. In this case, for some permutations $\pi$ and $\sigma$,
+
+\[
+(r_{zw})_*(\partial_{ab}) = \frac{1}{w} \sum_{i=1}^z \sum_{j=1}^w \partial'_{\sigma^{(a)}(i)\, \pi^{(b)}(j)}, \tag{A11}
+\]
+
+\[
+(r_{zw}^* g^{(kz,mw)})_M(\partial_{ab}, \partial_{cd}) = \frac{1}{w^2} \sum_{i=1}^z \sum_{j=1}^w \sum_{s=1}^z \sum_{t=1}^w g^{(kz,mw)}_{r_{zw}(M)}(\partial'_{\sigma^{(a)}(i)\, \pi^{(b)}(j)}, \partial'_{\sigma^{(c)}(s)\, \pi^{(d)}(t)}). \tag{A12}
+\]
+
+
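+Third. For a rational matrix $M = \frac{1}{Z}\tilde M$ with $\tilde M \in \mathbb{N}^{k\times m}$ and row-sum $|\tilde M_a| = N \in \mathbb{N}$ for all $a \in [k]$,
+consider the map $v_M \in \mathcal{F}^{zk,N}_{k,m}$ that maps $M$ to a constant matrix. In this case, $R \in \mathbb{R}^{k\times kz}$ and $Q^{(a)}$ has
+the $b$-th row with $\tilde M_{ab}$ entries with value $\frac{1}{\tilde M_{ab}}$ at positions $\pi^{(ab)}([\tilde M_{ab}]) \subseteq [N]$, and:
+
+\[
+(v_M)_*(\partial_{ab}) = \frac{1}{\tilde M_{ab}} \sum_{i=1}^z \sum_{j=1}^{\tilde M_{ab}} \partial'_{\sigma^{(a)}(i)\, \pi^{(ab)}(j)}, \tag{A13}
+\]
+
+\[
+(v_M^* g^{(kz,N)})_M(\partial_{ab}, \partial_{cd}) = \frac{1}{\tilde M_{ab}} \frac{1}{\tilde M_{cd}} \sum_{i=1}^z \sum_{j=1}^{\tilde M_{ab}} \sum_{s=1}^z \sum_{t=1}^{\tilde M_{cd}} g^{(kz,N)}_{v_M(M)}(\partial'_{\sigma^{(a)}(i)\, \pi^{(ab)}(j)}, \partial'_{\sigma^{(c)}(s)\, \pi^{(cd)}(t)}). \tag{A14}
+\]
+
+Step 1: $a \ne c$. Consider a constant matrix $M = U$. Then:
+
+\[
+g^{(k,m)}_U(\partial_{a_1b_1}, \partial_{c_1d_1}) = g^{(k,m)}_{h_{\pi,\sigma}(U)}(\partial_{a_2b_2}, \partial_{c_2d_2}) = g^{(k,m)}_U(\partial_{a_2b_2}, \partial_{c_2d_2}). \tag{A15}
+\]
+
+This implies that $g^{(k,m)}_U(\partial_{ab}, \partial_{cd}) = \hat A(k, m)$ when $a \ne c$.
+Using the second type of map, we get:
+
+\[
+\hat A(k, m) = \frac{z^2 w^2}{w^2} \hat A(kz, mw), \tag{A16}
+\]
+
+which implies $g^{(k,m)}_U(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2}$, when $a \ne c$. Considering a rational matrix $M$ and the map $v_M$
+yields:
+
+\[
+g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2}. \tag{A17}
+\]
+
+Step 2: $b \ne d$. By similar arguments as in Step 1, $g^{(k,m)}_U(\partial_{ab}, \partial_{ad}) = \hat B(k, m)$. Evaluating the map
+$r_{zw}$ yields:
+
+\[
+\hat B(k, m) = \frac{z w^2}{w^2} \hat B(kz, mw) + \frac{z(z-1) w^2}{w^2} \frac{A}{(kz)^2}
+= z \hat B(kz, mw) + \frac{z-1}{z} \frac{A}{k^2}, \tag{A18}
+\]
+
+and therefore,
+
+\[
+\frac{1}{z} \Bigl( \hat B(k, m) - \frac{A}{k^2} \Bigr) = \hat B(kz, mw) - \frac{A}{(kz)^2}, \tag{A19}
+\]
+
+which implies that $\bigl( \hat B(k, m) - \frac{A}{k^2} \bigr)$ is independent of $m$ and scales with the inverse of $k$, such that it
+can be written as $\frac{B}{k}$. Rearranging the terms yields $g^{(k,m)}_U(\partial_{ab}, \partial_{ad}) = \frac{A}{k^2} + \frac{B}{k}$, for $b \ne d$.
+For a rational matrix $M$, the pull-back through $v_M$ shows then:
+
+\[
+g^{(k,m)}_M(\partial_{ab}, \partial_{ad}) = z \frac{\tilde M_{ab} \tilde M_{ad}}{\tilde M_{ab} \tilde M_{ad}} \Bigl( \frac{A}{(kz)^2} + \frac{B}{kz} \Bigr) + z(z-1) \frac{\tilde M_{ab} \tilde M_{ad}}{\tilde M_{ab} \tilde M_{ad}} \frac{A}{(kz)^2} = \frac{A}{k^2} + \frac{B}{k}. \tag{A20}
+\]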
+
+
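+Step 3: $a = c$ and $b = d$. In this case, $g^{(k,m)}_U(\partial_{a_1 b_1}, \partial_{a_1 b_1}) = g^{(k,m)}_U(\partial_{a_2 b_2}, \partial_{a_2 b_2}) = \hat{C}(k, m)$, and:
+
+\[ \begin{aligned}
+\hat{C}(k, m) &= \frac{1}{w^2} zw\, \hat{C}(kz, mw) + \frac{1}{w^2} zw(w-1) \left( \frac{A}{(kz)^2} + \frac{B}{kz} \right) + \frac{1}{w^2} w^2 z(z-1) \frac{A}{(kz)^2} \\
+&= \frac{z}{w} \hat{C}(kz, mw) + \left( 1 - \frac{1}{zw} \right) \frac{A}{k^2} + \left( 1 - \frac{z}{zw} \right) \frac{B}{k},
+\end{aligned} \tag{A21} \]
+
+which implies:
+
+\[ \frac{k}{m} \left( \hat{C}(k, m) - \frac{A}{k^2} - \frac{B}{k} \right) = \frac{kz}{mw} \left( \hat{C}(kz, mw) - \frac{A}{(kz)^2} - \frac{B}{kz} \right), \tag{A22} \]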
+
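+such that the left-hand side is a constant $C$, and $g^{(k,m)}_U(\partial_{ab}, \partial_{ab}) = \frac{A}{k^2} + \frac{B}{k} + \frac{m}{k} C$. Now, for a rational
+matrix $M$, pulling back through $v_M$ gives:
+
+\[ \begin{aligned}
+g^{(k,m)}_M(\partial_{ab}, \partial_{ab}) &= \frac{1}{\tilde{M}_{ab}^2} \tilde{M}_{ab} \left( \frac{A}{k^2} + \frac{B}{k} + \frac{|\tilde{M}_a|}{k} C \right) + \frac{1}{\tilde{M}_{ab}^2} \tilde{M}_{ab} (\tilde{M}_{ab} - 1) \left( \frac{A}{k^2} + \frac{B}{k} \right) + 0 \\
+&= \frac{A}{k^2} + \frac{B}{k} + \frac{|\tilde{M}_a|}{\tilde{M}_{ab}\, k} C \\
+&= \frac{A}{k^2} + k \frac{B}{k^2} + \frac{|M|}{M_{ab}} \frac{C}{k^2}.
+\end{aligned} \tag{A23} \]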
+
+Summarizing, we found:
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2} + \delta_{ac} \left( k \frac{B}{k^2} + \delta_{bd} \frac{|M|}{M_{ab}} \frac{C}{k^2} \right), \tag{A24} \]
+
+which proves the first statement. The second statement follows by plugging Equation (23) into
+Equation (A8). Finally, the statement about the positive-definiteness is a direct consequence of
+Proposition 1.
+
+Proof of Theorem 5. Suppose, contrary to the claim, that a family of metrics $g^{(k,m)}_M$ exists, which is
+invariant with respect to any conditional embedding. By Theorem 4, these metrics are of the form
+of Equation (23). To prove the claim, we only need to show that A, B and C vanish. In the following,
+we study conditional embeddings where Q consists of identity matrices and evaluate the isometry
+requirement $(f^{*} g^{(l,n)})_M(\partial_{ab}, \partial_{cd}) = g^{(k,m)}_M(\partial_{ab}, \partial_{cd})$.
+Step 1: In the case $a \neq c$, we obtain from the invariance requirement and Equation (A8), that:
+
+\[ \frac{A}{k^2} = |R_a| |R_c| \frac{A}{l^2}. \tag{A25} \]
+
+Observe that:
+
+\[ \frac{1}{k} \sum_{i=1}^{k} |R_i| = \frac{1}{k} |R| = \frac{l}{k}. \tag{A26} \]
+(A26)
+
+In fact, |Ri| is the cardinality of the i-th block of the partition belonging to R. Therefore, if we choose R
+to be the partition indicator matrix of a partition that is not homogeneous and in which |Ra| > l/k
+and |Rc| > l/k, then Equation (A25) implies that A = 0.
+Step 2: In the case $a = c$ and $b \neq d$, we obtain from invariance and Equation (A8), that:
+
+\[ \frac{B}{k} = \sum_{i=1}^{l} \sum_{s=1}^{l} R_{ai} R_{as} \delta_{is} \frac{B}{l} = |R_a| \frac{B}{l}. \tag{A27} \]
+
+Again, we may choose $R_a$ in such a way that $|R_a| \neq \frac{l}{k}$ and find that $B = 0$.
+
+
+Step 3: Finally, in the case $a = c$ and $b = d$, we obtain from invariance and Equation (A8), that:
+
+\[ \frac{C |M|}{k^2 M_{ab}} = \sum_{i=1}^{l} \sum_{s=1}^{l} R_{ai} R_{as} \delta_{i,s} \frac{C |R^\top M|}{l^2 (R^\top M)_{ib}} = |R_a| \frac{C |R^\top M|}{l^2 M_{ab}}. \tag{A28} \]
+
+If we choose $R_a$, such that $|R_a| \neq \frac{|M|}{|R^\top M|}$, then we see that $C = 0$. Therefore, $g^{(k,m)}$ is the zero-tensor,
+which is not a metric.
+
+Acknowledgments
+
+The authors are grateful to Keyan Zahedi for discussions related to policy gradient methods in
+robotics applications. Guido Montúfar thanks the Santa Fe Institute for hosting him during the initial
+work on this article. Johannes Rauh acknowledges support by the VW Foundation. This work was
+supported in part by the DFG Priority Program, Autonomous Learning (DFG-SPP 1527).
+
+Author Contributions
+
+All authors contributed to the design of the research. The research was carried out by all authors,
+with main contributions by Guido Montúfar and Johannes Rauh. The manuscript was written by
+Guido Montúfar, Johannes Rauh and Nihat Ay. All authors read and approved the final manuscript.
+
+Conflicts of Interest
+
+The authors declare no conflict of interests.
+
+References
+
+1.
+Amari, S. Natural gradient works efficiently in learning. Neur. Comput. 1998, 10, 251–276.
+2.
+Kakade, S. A Natural Policy Gradient. Advances in Neural Information Processing Systems 14; MIT Press:
+Cambridge, MA, USA, 2001; pp. 1531–1538.
+3.
+Shahshahani, S. A New Mathematical Framework for the Study of Linkage and Selection; American Mathematical
+Society: Providence, RI, USA, 1979.
+4.
+Chentsov, N. Statistical Decision Rules and Optimal Inference; American Mathematical Society: Providence, RI,
+USA, 1982.
+5.
+Campbell, L. An extended ˇCencov characterization of the information metric. Proc. Am. Math. Soc. 1986,
+98, 135–141.
+6.
+Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with
+Function Approximation. In Advances in Neural Information Processing Systems 12; MIT Press: Cambridge,
+MA, USA, 2000; pp. 1057–1063.
+7.
+Marbach, P.; Tsitsiklis, J.
+Simulation-based optimization of Markov reward processes.
+IEEE Trans.
+Autom. Control 2001, 46, 191–209.
+8.
+Montúfar, G.; Ay, N.; Zahedi, K. Expressive power of conditional restricted boltzmann machines for
+sensorimotor control. 2014, arXiv:1402.3346.
+9.
+Ay, N.; Montúfar, G.; Rauh, J. Selection Criteria for Neuromanifolds of Stochastic Dynamics. In Advances
+in Cognitive Neurodynamics (III); Yamaguchi, Y., Ed.; Springer-Verlag: Dordrecht, The Netherlands 2013;
+pp. 147–154.
+10.
+Peters, J.; Schaal, S. Natural Actor-Critic. Neurocomputing 2008, 71, 1180–1190.
+11.
+Peters, J.; Schaal, S. Policy Gradient Methods for Robotics. In Proceedings of the IEEE International
+Conference on Intelligent Robotics Systems (IROS 2006), Beijing, China, 9–15 October 2006.
+
+
+12.
+Peters, J.; Vijayakumar, S.; Schaal, S. Reinforcement learning for humanoid robotics. In Proceedings of the
+third IEEE-RAS international conference on humanoid robots, Karlsruhe, Germany, 29–30 September 2003;
+pp. 1–20.
+13.
+Bagnell, J.A.; Schneider, J.
+Covariant policy search.
+In Proceedings of the 18th International Joint
+Conference on Artificial Intelligence, Acapulco, Mexico, August 9–15, 2003; Morgan Kaufmann Publishers
+Inc.: San Francisco, CA, USA, 2003; pp. 1019–1024.
+14.
+Lebanon, G. Axiomatic geometry of conditional models. IEEE Trans. Inform. Theor. 2005, 51, 1283–1294.
+15.
+Lebanon, G.
+An Extended ˇCencov-Campbell Characterization of Conditional Information Geometry.
+In Proceedings of the 20th Conference in Uncertainty in Artificial Intelligence (UAI 04), Banff, AL, Canada,
+7–11 July 2004; Chickering, D.M., Halpern, J.Y., Eds.; AUAI Press: Arlington, VA, USA, 2004; pp. 341–345.
+16.
+Barndorff-Nielsen, O. Information and Exponential Families: In statistical Theory; John Wiley & Sons, Inc.:
+Hoboken, NJ, USA, 1978.
+17.
+Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory;
+Institute of Mathematical Statistics: Hayward, CA, USA, 1986.
+18.
+Zahedi, K.; Ay, N.; Der, R. Higher coordination with less control - A result of information maximization in the
+sensorimotor loop. Adapt. Behav. 2010, 18.
+19.
+Hofbauer, J.; Sigmund, K.
+Evolutionary Games and Population Dynamics; Cambridge University Press:
+Cambridge, United Kingdom, 1998.
+20.
+Ay, N.; Erb, I. On a notion of linear replicator equations. J. Dyn. Differ. Equ. 2005, 17, 427–451.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+Matrix Algebraic Properties of the Fisher Information
+Matrix of Stationary Processes
+
+André Klein
+
+Rothschild Blv. 123 Apt.7, 65271 Tel Aviv, Israel; A.A.B.Klein@uva.nl or klein@contact.uva.nl; Tel.: 972.5.25594723
+
+Received: 12 February 2014; in revised form: 11 March 2014 / Accepted: 24 March 2014 /
+Published: 8 April 2014
+
+Abstract: This survey paper presents a summary of results found in a series of papers.
+The subject of interest is the matrix algebraic properties of the Fisher
+information matrix (FIM) of stationary processes. The FIM is an ingredient of the Cramér-Rao
+inequality, and belongs to the basics of asymptotic estimation theory in mathematical statistics. The
+FIM is interconnected with the Sylvester, Bezout and tensor Sylvester matrices. Through these
+interconnections it is shown that the FIM of scalar and multiple stationary processes fulfills the
+resultant matrix property. A statistical distance measure involving entries of the FIM is presented.
+In quantum information, a different statistical distance measure is set forth. It is related to the
+Fisher information, but considers the information about one parameter in a particular measurement
+procedure. The FIM of scalar stationary processes is also interconnected to the solutions
+of appropriate Stein equations; conditions for the FIM to verify certain Stein equations are formulated.
+The presence of Vandermonde matrices is also emphasized.
+
+MSC Classification: 15A23, 15A24, 15B99, 60G10, 62B10, 62M20.
+
+Keywords: Bezout matrix; Sylvester matrix; tensor Sylvester matrix; Stein equation; Vandermonde
+matrix; stationary process; matrix resultant; Fisher information matrix
+
+1. Introduction
+
+In this survey paper, a summary of results derived and described in a series of papers, is presented.
+It concerns some matrix algebraic properties of the Fisher information matrix (abbreviated as FIM) of
+stationary processes. An essential property emphasized in this paper concerns the matrix resultant
+property of the FIM of stationary processes. To be more explicit, consider the coefficients of two monic
+polynomials p(z) and q(z) of finite degree, as the entries of a matrix such that the matrix becomes
+singular if and only if the polynomials p(z) and q(z) have at least one common root. Such a matrix is
+called a resultant matrix and its determinant is called the resultant. The Sylvester, Bezout and tensor
+Sylvester matrices have such a property and are extensively studied in the literature, see e.g., [1,3]. The
+FIM associated with various stationary processes will be expressed by these matrices. The derived
+interconnections are obtained by developing the necessary factorizations of the FIM in terms of the
+Sylvester, Bezout and tensor Sylvester matrices. These factored forms of the FIM enable us to show that
+the FIM of scalar and multiple stationary processes fulfills the resultant matrix property. Consequently,
+the singularity conditions of the appropriate Fisher information matrices and Sylvester, Bezout and
+tensor Sylvester matrices coincide; these results are described in [4,6].
+A statistical distance measure involving entries of the FIM is presented and is based on [7]. In
+quantum information, a statistical distance measure is set forth, see [8,10], and is related to the Fisher
+information, but considers the information about one parameter in a particular measurement procedure.
+This leads to a challenging question: can the existing distance
+measure in quantum information be developed at the matrix level?
+
+Entropy 2014, 16, 2023–2055; doi:10.3390/e16042023
+www.mdpi.com/journal/entropy
+
+The matrix Stein equation, see e.g., [11], is associated with the Fisher information matrices of
+scalar stationary processes through the solutions of the appropriate Stein equations. Conditions for the
+Fisher information matrices or associated matrices to verify certain Stein equations are formulated
+and proved in this paper. The presence of Vandermonde matrices is also emphasized. The general
+and more detailed results are set forth in [12] and [13]. In this survey paper it is shown that the FIM of
+linear stationary processes form a class of structured matrices. Note that in [14], the authors emphasize
+that statistical problems related to stationary processes have been treated successfully with the aid
+of Toeplitz forms. This paper is organized as follows. The various stationary processes, considered
+in this paper, are presented in Section 2, the Fisher information matrices of the stationary processes
+are displayed in Section 3. Section 3 sets forth the interconnections between the Fisher information
+matrices and the Sylvester, Bezout, tensor Sylvester matrices, and solutions to Stein equations. A
+statistical distance measure is expressed in terms of entries of a FIM.
+
+2. The Linear Stationary Processes
+
+In this section we display the class of linear stationary processes whose corresponding Fisher
+information matrix shall be investigated in a matrix algebraic context. But first some basic definitions
+are set forth, see e.g., [15].
+
+If a random variable X is indexed to time, usually denoted by t, the observations {Xt, t ∈ T } are
+called a time series, where T is a time index set (for example, T = Z, the integer set).
+
+2.1. Definition 2.1
+
+A stochastic process is a family of random variables {Xt, t ∈ T } defined on a probability space {Ω, F, ℘}.
+
+2.2. Definition 2.2
+
+The Autocovariance function. If {Xt, t ∈ T } is a process such that Var(Xt) < ∞ (variance) for each t, then
+the autocovariance function γX(·, ·) of {Xt} is defined by γX(r, s) = Cov(Xr, Xs) = E[(Xr − E Xr)(Xs − E Xs)],
+r, s ∈ Z, where E represents the expected value.
+
+2.3. Definition 2.3
+
+Stationarity. The time series {Xt, t ∈ Z}, with the index set Z = {0, ±1, ±2, . . .}, is said to be stationary if
+
+(i) $E|X_t|^2 < \infty$
+(ii) $E(X_t) = m$ for all t ∈ Z, where m is the constant average or mean
+(iii) $\gamma_X(r, s) = \gamma_X(r + t, s + t)$ for all r, s, t ∈ Z.
+
+From Definition 2.3 it can be concluded that the joint probability distributions of the random variables
+$\{X_{t_1}, X_{t_2}, \ldots, X_{t_n}\}$ and $\{X_{t_1+k}, X_{t_2+k}, \ldots, X_{t_n+k}\}$ are the same for arbitrary times $t_1, t_2, \ldots, t_n$ for all n and
+all lags or leads k = 0, ±1, ±2, . . .. The probability distribution of observations of a stationary process
+is invariant with respect to shifts in time. In the next section the linear stationary processes that will be
+considered throughout this paper are presented.
+
+2.4. The Vector ARMAX or VARMAX Process
+
+We display one of the most general linear stationary processes, the multivariate autoregressive,
+moving average and exogenous process, called the VARMAX process. To be more specific, consider the
+vector difference equation representation of a linear system {y(t), t ∈ Z}, of order (p, r, q),
+
+\[ \sum_{j=0}^{p} A_j\, y(t-j) = \sum_{j=0}^{r} C_j\, x(t-j) + \sum_{j=0}^{q} B_j\, \epsilon(t-j), \quad t \in \mathbb{Z} \tag{1} \]
+
+
+where y(t) are the observable outputs, x(t) the observable inputs and ϵ(t) the unobservable errors; all
+are n-dimensional. The acronym VARMAX stands for vector autoregressive-moving average with
+exogenous variables. The left side of (1) is the autoregressive part, the second term on the right
+is the moving average part and x(t) is exogenous. If x(t) does not occur, the system is said to be
+(V)ARMA. Besides exogenous, the input x(t) is also named the control variable, depending on the field
+of application: in econometrics and time series analysis, e.g., [15], and in signal processing and control,
+e.g., [16,17]. The matrix coefficients $A_j \in \mathbb{R}^{n \times n}$, $C_j \in \mathbb{R}^{n \times n}$ and $B_j \in \mathbb{R}^{n \times n}$ are the associated parameter
+matrices. We have the property $A_0 \equiv B_0 \equiv C_0 \equiv I_n$.
+Equation (1) can compactly be written as
+
+\[ A(z)\, y(t) = C(z)\, x(t) + B(z)\, \epsilon(t) \tag{2} \]
+
+where
+
+\[ A(z) = \sum_{j=0}^{p} A_j z^j; \quad C(z) = \sum_{j=0}^{r} C_j z^j; \quad B(z) = \sum_{j=0}^{q} B_j z^j \]
+
+We use z to denote the backward shift operator, for example $z\, x_t = x_{t-1}$. The matrix polynomials
+A(z), B(z) and C(z) are the autoregressive, moving average and exogenous matrix polynomials,
+of order p, q and r respectively. Hence the process described
+by (2) is denoted as a VARMAX(p, r, q) process. Here z ∈ C, with a dual use of z as an operator and
+as a complex variable, which is usual in the signal processing and time series literature, e.g., [15,16,18].
+The assumptions Det(A(z)) ≠ 0 for |z| ≤ 1 and Det(B(z)) ≠ 0 for |z| < 1, z ∈ C,
+are imposed so that the VARMAX(p, r, q) process (2) has exactly one stationary solution; the
+condition Det(B(z)) ≠ 0 implies the invertibility condition, see e.g., [15] for more details. Under these
+assumptions, the eigenvalues of the matrix polynomials A(z) and B(z) lie outside the unit circle. The
+eigenvalues of a matrix polynomial Y(z) are the roots of the equation Det(Y(z)) = 0, where Det(X) is the
+determinant of X. The VARMAX(p, r, q) stationary process (2) is thoroughly discussed in [15,18,19].
+The error {ϵ(t), t ∈ Z} is a collection of uncorrelated zero mean n-dimensional random variables
+each having positive definite covariance matrix Σ and we assume, for all s, t, Eϑ{x(s) ϵ⊤(t)} = 0, where
+X⊤ denotes the transposition of matrix X and Eϑ represents the expected value under the parameter
+ϑ. The matrix ϑ represents all the VARMAX(p, r, q) parameters, with the total number of parameters
+being $n^2(p + q + r)$. For different purposes which will be specified in the next sections, two choices of
+the parameter structure are considered. First, the parameter vector $\vartheta \in \mathbb{R}^{n^2(p+q+r) \times 1}$ is defined by
+
+ϑ = vec {A1, A2, . . . , Ap, C1, C2, . . . , Cr, B1, B2, . . . , Bq}
+(3)
+
+The vec operator transforms a matrix into a vector by stacking the columns of the matrix one
+underneath the other according to $\operatorname{vec} X = \operatorname{col}(\operatorname{col}(X_{ij})_{i=1}^{n})_{j=1}^{n}$, see e.g., [2,20]. A different choice
+is set forth, when the parameter matrix $\vartheta \in \mathbb{R}^{n \times n(p+q+r)}$ is of the form
+
+\[ \vartheta = (\vartheta_1\ \vartheta_2\ \ldots\ \vartheta_p\ \vartheta_{p+1}\ \vartheta_{p+2}\ \ldots\ \vartheta_{p+r}\ \vartheta_{p+r+1}\ \vartheta_{p+r+2}\ \ldots\ \vartheta_{p+r+q}) \tag{4} \]
+
+\[ \phantom{\vartheta} = (A_1\ A_2\ \ldots\ A_p\ C_1\ C_2\ \ldots\ C_r\ B_1\ B_2\ \ldots\ B_q) \tag{5} \]
+
+Representation (5) of the parameter matrix has been used in [21]. The estimation of the matrices $A_1$,
+$A_2, \ldots, A_p$, $C_1, C_2, \ldots, C_r$, $B_1, B_2, \ldots, B_q$ and Σ has received considerable attention in the time series
+and statistical signal processing literature, see e.g., [15,17,19]. In [19], the authors study the asymptotic
+properties of maximum likelihood estimates of the coefficients of VARMAX(p, r, q) processes, stored in
+a (ℓ × 1) vector ϑ, where $\ell = n^2(p + q + r)$.
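+
+As a small illustration of the vec operator used in (3), consider the case n = 2, which is assumed here only for concreteness and is not taken from the survey. A single coefficient matrix is stacked column by column:
+
+\[ A_1 = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \quad \Longrightarrow \quad \operatorname{vec} A_1 = (a_{11}, a_{21}, a_{12}, a_{22})^\top \]
+
+Under the same assumption, a VARMAX(1, 1, 1) process has $\vartheta = \operatorname{vec}\{A_1, C_1, B_1\} \in \mathbb{R}^{12 \times 1}$, in agreement with the parameter count $n^2(p + q + r) = 4 \cdot 3 = 12$.
+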
+Before describing the control-exogenous variable x(t) used in this survey paper, we shall present
+the different special cases of the model described in (1) and (2).
+
+
+2.5. The Vector ARMA or VARMA Process
+
+When the process (2) does not contain the control process x(t) it yields
+
+\[ A(z)\, y(t) = B(z)\, \epsilon(t) \tag{6} \]
+
+which is a vector autoregressive and moving average process, VARMA(p, q) process, see e.g., [15].
+The matrix ϑ now represents all the VARMA parameters, with the total number of parameters being
+$n^2(p + q)$. The VARMA(p, q) version of the parameter vector ϑ defined in (3) is then given by
+
+\[ \vartheta = \operatorname{vec}\{A_1, A_2, \ldots, A_p, B_1, B_2, \ldots, B_q\} \tag{7} \]
+
+The VARMA equivalent of the parameter matrix (4) is then the n × n(p + q) parameter matrix
+
+\[ \vartheta = (\vartheta_1\ \vartheta_2\ \ldots\ \vartheta_p\ \vartheta_{p+1}\ \vartheta_{p+2}\ \ldots\ \vartheta_{p+q}) = (A_1\ A_2\ \ldots\ A_p\ B_1\ B_2\ \ldots\ B_q) \tag{8} \]
+
+A description of the input variable x(t) in (2) follows. Generally, one can assume either that x(t) is
+non-stochastic or that x(t) is stochastic. In the latter case, we assume Eϑ{x(s) ϵ⊤(t)} = 0, for all s, t, and
+that statistical inference is performed conditionally on the values taken by x(t). In this case it can
+be interpreted as constant, see [22] for a detailed exposition. However, in the papers referred to in this
+survey, like [21] and [23], the observed input variable x(t) is assumed to be a stationary VARMA
+process, of the form
+
+α(z)x(t) = β(z)η(t)
+(9)
+
+where α(z) and β(z) are the autoregressive and moving average polynomials of appropriate degree
+and {η(t), t ∈ Z} is a collection of uncorrelated zero mean n-dimensional random variables each having
+positive definite covariance matrix Ω. The spectral density of the VARMA process x(t) is $R_x(\cdot)/2\pi$
+(for a definition, see e.g., [15,16]), where
+
+\[ R_x(e^{i\omega}) = \alpha^{-1}(e^{i\omega})\, \beta(e^{i\omega})\, \Omega\, \beta^{*}(e^{i\omega})\, \alpha^{-*}(e^{i\omega}), \quad \omega \in [-\pi, \pi] \tag{10} \]
+
+where i is the imaginary unit with the property $i^2 = -1$, ω is the frequency, the spectral density $R_x(e^{i\omega})$
+is Hermitian, and we further have $R_x(e^{i\omega}) \geq 0$ and $\int_{-\pi}^{\pi} R_x(e^{i\omega})\, d\omega < \infty$. As mentioned above, the
+basic assumption that x(t) and ϵ(t) are independent or at least uncorrelated processes, which corresponds
+geometrically to orthogonal processes, holds, and X* is the complex conjugate transpose of matrix X.
+
+2.6. The ARMAX and ARMA Processes
+
+The scalar equivalents of the VARMAX(p, r, q) and VARMA(p, q) processes, given by (2) and (6)
+respectively, shall now be displayed. For the ARMAX(p, r, q) process we obtain
+
+\[ a(z)\, y(t) = c(z)\, x(t) + b(z)\, \epsilon(t) \tag{11} \]
+
+and for the ARMA(p, q) process
+
+\[ a(z)\, y(t) = b(z)\, \epsilon(t) \tag{12} \]
+
+popularized in, among others, the Box-Jenkins type of time series analysis, see e.g., [15]. Here a(z),
+b(z) and c(z) are respectively the scalar autoregressive, moving average and exogenous
+polynomials, with corresponding scalar coefficients $a_j$, $b_j$ and $c_j$,
+
+\[ a(z) = \sum_{j=0}^{p} a_j z^j; \quad c(z) = \sum_{j=0}^{r} c_j z^j; \quad b(z) = \sum_{j=0}^{q} b_j z^j \tag{13} \]
+
+
+Note that, as in the multiple case, $a_0 = b_0 = 1$. The parameter vector ϑ for the processes (11) and (12)
+is then
+
+ϑ = {a1, a2, . . . , ap, c1, c2, . . . , cr, b1, b2, . . . , bq}
+(14)
+
+and
+
+ϑ = {a1, a2, . . . , ap, b1, b2, . . . , bq}
+(15)
+
+respectively.
+In the next section the matrix algebraic properties of the Fisher information matrix of the stationary
+processes (2), (6), (11) and (12) will be verified. Interconnections with various known structured
+matrices like the Sylvester resultant matrix, the Bezout matrix and Vandermonde matrix are set forth.
+The Fisher information matrix of the various stationary processes is also expressed in terms of the
+unique solutions to the appropriate Stein equations.
+
+3. Structured Matrix Properties of the Asymptotic Fisher Information Matrix of
+Stationary Processes
+
+The Fisher information is an ingredient of the Cramér-Rao inequality, also called by some
+the Cauchy-Schwarz inequality in mathematical statistics, and belongs to the basics of asymptotic
+estimation theory in mathematical statistics. The Cramér-Rao theorem [24] is therefore considered.
+When assuming that the estimators of ϑ, defined in the previous sections, are asymptotically unbiased,
+the inverse of the asymptotic information matrix yields the Cramér-Rao bound, and provided that the
+estimators are asymptotically efficient, the asymptotic covariance matrix then verifies the inequality
+
+\[ \operatorname{Cov}(\hat{\vartheta}) \succeq \mathcal{I}^{-1}(\hat{\vartheta}) \]
+
+Here $\mathcal{I}(\hat{\vartheta})$ is the FIM and $\operatorname{Cov}(\hat{\vartheta})$ is the covariance of $\hat{\vartheta}$, the unbiased estimator of ϑ; for a detailed
+fundamental statistical analysis, see [25,26]. The inverse of the FIM equals the Cramér-Rao lower bound, and the
+subject of the FIM is also of interest in the control theory and signal processing literature, see e.g., [27].
+Its quantum analog was introduced immediately after the foundation of mathematical quantum
+estimation theory in the 1960s, see [28,29] for a rigorous exposition of the subject. More specifically, the
+Fisher information is also emphasized in the context of quantum information theory, see e.g., [30,31].
+The Cramér-Rao inequality attracts considerable attention because it is located on the
+boundary of statistics, information and quantum theory and, more recently, matrix theory. In
+the next sections, the Fisher information matrices of linear stationary processes will be presented and
+their role as a new class of structured matrices will be the subject of study.
+When time series models are the subject, using (2) for all t ∈ Z to determine the residual ϵ(t), written
+ϵt(ϑ) to emphasize the dependency on the parameter vector ϑ, and assuming that x(t) is stochastic and
+that (y(t), x(t)) is a Gaussian stationary process, the asymptotic FIM F(ϑ) is defined by the following
+(ℓ × ℓ) matrix which does not depend on t
+
+\[ F(\vartheta) = E\left\{ \left( \frac{\partial \epsilon_t(\vartheta)}{\partial \vartheta^\top} \right)^{\top} \Sigma^{-1} \left( \frac{\partial \epsilon_t(\vartheta)}{\partial \vartheta^\top} \right) \right\} \tag{16} \]
+
+where ∂(·)/∂ϑ⊤ denotes the (v × ℓ) matrix of derivatives with respect to ϑ⊤ of any (v × 1) column vector
+(·), and ℓ is the total number of parameters. The derivative with respect to ϑ⊤ is used for obtaining
+the appropriate dimensions. Equality (16) is used for computing the FIM of the various time series
+processes presented in the previous sections and appropriate definitions of the derivatives are used,
+especially for the multivariate processes (2) and (6), see [21,22].
+
+
+3.1. The Fisher Information Matrix of an ARMA(p, q) Process
+
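+In this section, the focus is on the FIM of the ARMA process (12). When ϑ is given in (15), the
+derivatives in (16) are at the scalar level
+
+\[ \frac{\partial \epsilon_t(\vartheta)}{\partial a_j} = \frac{1}{a(z)}\, \epsilon_{t-j} \ \text{ for } j = 1, \ldots, p \quad \text{and} \quad \frac{\partial \epsilon_t(\vartheta)}{\partial b_k} = -\frac{1}{b(z)}\, \epsilon_{t-k} \ \text{ for } k = 1, \ldots, q \]
+
+When combined for all j and k, the FIM of the ARMA process (12), with the variance of the noise process
+ϵt(ϑ) equal to one, yields the block decomposition, see [32]
+
+\[ F(\vartheta) = \begin{pmatrix} F_{aa}(\vartheta) & F_{ab}(\vartheta) \\ F_{ba}(\vartheta) & F_{bb}(\vartheta) \end{pmatrix} \tag{17} \]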
+
+The expressions of the different blocks of the matrix F(ϑ) are
+
+\[ F_{aa}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, u_p^\top(z^{-1})}{a(z)\, a(z^{-1})} \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, v_p^\top(z)}{a(z)\, \hat{a}(z)}\, dz \tag{18} \]
+
+\[ F_{ab}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, u_q^\top(z^{-1})}{a(z)\, b(z^{-1})} \frac{dz}{z} = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, v_q^\top(z)}{a(z)\, \hat{b}(z)}\, dz \tag{19} \]
+
+\[ F_{ba}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, u_p^\top(z^{-1})}{a(z^{-1})\, b(z)} \frac{dz}{z} = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, v_p^\top(z)}{\hat{a}(z)\, b(z)}\, dz \tag{20} \]
+
+\[ F_{bb}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, u_q^\top(z^{-1})}{b(z)\, b(z^{-1})} \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, v_q^\top(z)}{b(z)\, \hat{b}(z)}\, dz \tag{21} \]
+
+where the integration above and everywhere below is counterclockwise around the unit circle. The
+reciprocal monic polynomials â(z) and b̂(z) are defined as $\hat{a}(z) = z^p a(z^{-1})$ and $\hat{b}(z) = z^q b(z^{-1})$, and
+$\vartheta = (a_1, \ldots, a_p, b_1, \ldots, b_q)^\top$ is introduced in (15). For each positive integer k we have
+$u_k(z) = (1, z, z^2, \ldots, z^{k-1})^\top$ and $v_k(z) = z^{k-1} u_k(z^{-1})$. The stability condition of the ARMA(p, q) process
+implies that all the roots of the monic polynomials a(z) and b(z) lie outside the unit circle. Consequently,
+the roots of the polynomials â(z) and b̂(z) lie within the unit circle and will be used as the poles for
+computing the integrals (18)–(21) when Cauchy's residue theorem is applied. Notice that the FIM F(ϑ)
+is symmetric block Toeplitz so that $F_{ab}(\vartheta) = F_{ba}^\top(\vartheta)$ and the integrands in (18)–(21) are Hermitian.
+The computation of the integral expressions (18)–(21) is easily implementable by using the standard
+residue theorem. The algorithms displayed in [33] and [22] are suited for numerical computations of,
+among others, the FIM of an ARMA(p, q) process.
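+
+To make (18)–(21) concrete, consider as an illustrative case (assumed here, not taken from the survey) the ARMA(1, 1) process with $a(z) = 1 + a_1 z$ and $b(z) = 1 + b_1 z$, where $|a_1| < 1$ and $|b_1| < 1$. Then $u_1(z) = v_1(z) = 1$, $\hat{a}(z) = z + a_1$ and $\hat{b}(z) = z + b_1$, so the only pole of each integrand inside the unit circle is the root of the reciprocal polynomial, and the residue theorem gives:
+
+\[ F_{aa}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{dz}{(1 + a_1 z)(z + a_1)} = \frac{1}{1 - a_1^2}, \qquad F_{bb}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{dz}{(1 + b_1 z)(z + b_1)} = \frac{1}{1 - b_1^2} \]
+
+\[ F_{ab}(\vartheta) = F_{ba}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{dz}{(1 + a_1 z)(z + b_1)} = -\frac{1}{1 - a_1 b_1} \]
+
+In this sketch the residue of the first integrand at $z = -a_1$ is $1/(1 + a_1 \cdot (-a_1))$, and the cross term is evaluated in the same way at $z = -b_1$; the pole of $1/(1 + a_1 z)$ at $z = -1/a_1$ lies outside the unit circle and does not contribute.
+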
+
+3.2. The Sylvester Resultant Matrix - The Fisher Information Matrix
+
+The resultant property of a matrix is now considered. Showing that the FIM F(ϑ) has the matrix
+resultant property amounts to showing that the matrix F(ϑ) becomes singular if and only if the appropriate
+scalar monic polynomials â(z) and b̂(z) have at least one common zero. To illustrate the subject, the
+following known property of two polynomials is set forth. The greatest common divisor (frequently
+abbreviated as GCD) of two polynomials is a polynomial, of the highest possible degree, that is a factor
+of both original polynomials; the roots of the GCD of two polynomials are the common roots of
+the two polynomials. Consider the coefficients of two monic polynomials p(z) and q(z) of finite degree,
+as the entries of a matrix such that the matrix becomes singular if and only if the polynomials p(z) and
+q(z) have at least one common root. Such a matrix is called a resultant matrix and its determinant is
+called the resultant. Therefore we present the known (p + q) × (p + q) Sylvester resultant matrix of the
+polynomials a and b, see e.g., [2], to obtain
+
+\[ S(a, b) = \begin{pmatrix}
+1 & a_1 & \cdots & a_p & 0 & \cdots & 0 \\
+0 & \ddots & \ddots & & \ddots & \ddots & \vdots \\
+\vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\
+0 & \cdots & 0 & 1 & a_1 & \cdots & a_p \\
+1 & b_1 & \cdots & b_q & 0 & \cdots & 0 \\
+0 & \ddots & \ddots & & \ddots & \ddots & \vdots \\
+\vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\
+0 & \cdots & 0 & 1 & b_1 & \cdots & b_q
+\end{pmatrix} \tag{22} \]
+Consider the p × (p + q) and q × (p + q) upper and lower submatrices $S_p(b)$ and $S_q(-a)$ of the Sylvester
+resultant matrix S(b, −a) such that
+
+\[ S(b, -a) = \begin{pmatrix} S_p(b) \\ -S_q(a) \end{pmatrix} \tag{23} \]
+
+The matrix S(a, b) becomes singular in the presence of one or more common zeros of the monic polynomials â(z)
+and b̂(z); this property is assessed by the following equalities
+
+\[ R(a, b) = \prod_{\substack{i = 1, \ldots, p \\ j = 1, \ldots, q}} (\alpha_i - \beta_j), \qquad R(b, a) = (-1)^{pq} \prod_{\substack{i = 1, \ldots, p \\ j = 1, \ldots, q}} (\alpha_i - \beta_j) \tag{24} \]
+
+and
+
+\[ R(b, -a) = (-1)^{q} \prod_{\substack{i = 1, \ldots, p \\ j = 1, \ldots, q}} (\beta_j - \alpha_i), \quad \text{and} \quad R(-b, a) = (-1)^{p} \prod_{\substack{i = 1, \ldots, p \\ j = 1, \ldots, q}} (\beta_j - \alpha_i) \tag{25} \]
+
+where R(a, b) is the resultant of â(z) and b̂(z), and is equal to Det S(a, b). The string of equalities in (24) and (25)
+holds since $R(b, a) = (-1)^{pq} R(a, b)$, $R(b, -a) = (-1)^{q} R(b, a)$, and $R(-b, a) = (-1)^{p} R(b, a)$, see [34]. The
+zeros of the scalar monic polynomials â(z) and b̂(z) are $\alpha_i$ and $\beta_j$ respectively and are assumed to be
+distinct. By this is meant, when we have $(z - \alpha_i)^{n_{\alpha_i}}$ and $(z - \beta_j)^{n_{\beta_j}}$ with the powers $n_{\alpha_i}$ and $n_{\beta_j}$ both
+greater than one, that only the distinct roots will be considered free from the corresponding powers.
+
+The key property of the classical Sylvester resultant matrix S(a, b) is that its null space provides a
+complete description of the common zeros of the polynomials involved. In particular, in the scalar
+case the polynomials â(z) and b̂(z) are coprime if and only if S(a, b) is non-singular. The following
+key property of the classical Sylvester resultant matrix S(a, b) is given by the well known theorem on
+resultants, to obtain
+
+\[ \dim \operatorname{Ker} S(a, b) = \nu(a, b) \tag{26} \]
+
+where ν(a, b) is the number of common roots of the polynomials â(z) and b̂(z), counting
+multiplicities, see e.g., [3]. The dimension of a subspace V is represented by dim(V), and Ker(X)
+is the null space or kernel of the matrix X, denoted by Null or Ker. The null space of an n × n matrix A
+with coefficients in a field K (typically the field of the real numbers or of the complex numbers) is the
+set Ker A = {x ∈ $K^n$: Ax = 0}, see e.g., [1,2,20].
+
+In order to prove that the FIM F(ϑ) fulfills the resultant matrix property, the following
+factorization is derived, Lemma 2.1 in [5],
+
+\[ F(\vartheta) = S(b, -a)\, \wp(\vartheta)\, S^\top(b, -a) \tag{27} \]
+
+where the matrix $\wp(\vartheta) \in \mathbb{R}^{(p+q) \times (p+q)}$ admits the form
+
+\[ \wp(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+q}(z)\, u_{p+q}^\top(z^{-1})}{a(z)\, b(z)\, a(z^{-1})\, b(z^{-1})} \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+q}(z)\, v_{p+q}^\top(z)}{a(z)\, b(z)\, \hat{a}(z)\, \hat{b}(z)}\, dz \tag{28} \]
+
+It is proved in [5] that the symmetric matrix ℘(ϑ) fulfills the property ℘(ϑ) ≻ O. The factorization (27)
+allows us to show the matrix resultant property of the FIM. Corollary 2.2 in [5] states:
+The FIM of an ARMA(p, q) process with polynomials a(z) and b(z) of order p, q respectively
+becomes singular if and only if the polynomials â(z) and b̂(z) have at least one common root.
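+
+As a quick check of this corollary in the simplest case, take the illustrative ARMA(1, 1) process (assumed here, not taken from [5]) with $a(z) = 1 + a_1 z$, $b(z) = 1 + b_1 z$ and $|a_1|, |b_1| < 1$. Evaluating (18)–(21) by residues gives $F_{aa} = 1/(1 - a_1^2)$, $F_{ab} = F_{ba} = -1/(1 - a_1 b_1)$ and $F_{bb} = 1/(1 - b_1^2)$, so that
+
+\[ \operatorname{Det} F(\vartheta) = \frac{1}{(1 - a_1^2)(1 - b_1^2)} - \frac{1}{(1 - a_1 b_1)^2} = \frac{(a_1 - b_1)^2}{(1 - a_1^2)(1 - b_1^2)(1 - a_1 b_1)^2} \]
+
+The determinant vanishes exactly when $a_1 = b_1$, that is, when $\hat{a}(z) = z + a_1$ and $\hat{b}(z) = z + b_1$ share the root $z = -a_1$. For the same case, (22) reduces to
+
+\[ S(a, b) = \begin{pmatrix} 1 & a_1 \\ 1 & b_1 \end{pmatrix}, \qquad \operatorname{Det} S(a, b) = b_1 - a_1 \]
+
+which vanishes under the same condition.
+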
+From Corollary 2.2 in [5] can be concluded, the FIM of an ARMA(p, q) process and the Sylvester
+resultant matrix
+
+S
+
+(−b, a) have the same singularity property. By virtue of (26) and (27) we will specify the dimension of
+
+the null space of the FIM F (ϑ), this is set forth in the following lemma.
+
+3.2.1. Lemma 3.1
+
+Assume that the polynomials â(z) and b̂(z) have ν(a, b) common roots, counting multiplicities.
+The factorization (27) of the FIM and the property (26) enable us to prove the equality
+
+dim (Ker F(ϑ)) = dim (Ker S(b, −a)) = ν(a, b)    (29)
+
+Proof
+
+The matrix ℘(ϑ) ∈ R^{(p+q)×(p+q)}, appearing in (27), is positive definite, as
+proved in [5]. This implies that a Cholesky decomposition can be applied to ℘(ϑ), see [35] for more
+details, to obtain ℘(ϑ) = L⊤(ϑ)L(ϑ), where L(ϑ) is a (p + q) × (p + q) upper triangular matrix that is unique if
+its diagonal elements are all positive. The eigenvalues of L(ϑ), which are its diagonal elements, are then
+all positive, so that L(ϑ) is non-singular. The factorization (27) now admits the representation
+
+F(ϑ) = S(b, −a)L⊤(ϑ)L(ϑ)S⊤(b, −a)    (30)
+
+Taking into account the property that Ker (A) = Ker (A⊤A) for any m × n matrix A, applied to (30)
+with A = L(ϑ)S⊤(b, −a), yields
+
+Ker F(ϑ) = Ker S(b, −a)L⊤(ϑ)L(ϑ)S⊤(b, −a) = Ker L(ϑ)S⊤(b, −a)
+
+Let u ∈ Ker L(ϑ)S⊤(b, −a), so that L(ϑ)S⊤(b, −a)u = 0, and set v = S⊤(b, −a)u. Then L(ϑ)v = 0 and,
+since L(ϑ) is non-singular, v = 0. This implies S⊤(b, −a)u = 0, that is, u ∈ Ker S⊤(b, −a). Conversely,
+every u ∈ Ker S⊤(b, −a) satisfies L(ϑ)S⊤(b, −a)u = 0.
+Consequently,
+
+Ker F(ϑ) = Ker S⊤(b, −a)
+(31)
+
+We will now consider the Rank-Nullity Theorem, see e.g., [1], if A is an m × n matrix, then
+
+dim (Ker A) + dim (Im A) = n
+
+and the property dim (Im A) = dim (Im AT). When applied to the (p + q) × (p + q) matrix S (b, −a),
+it yields
+
+dim (Ker S(b, −a)) = dim (Ker S⊤(b, −a)) ⇒ dim (Ker F(ϑ)) = dim (Ker S(b, −a))
+
+which completes the proof.
+
+Notice that the dimension of the null space of a matrix A is called the nullity of A and the dimension
+of the image of A, dim (Im A), is termed the rank of A. An alternative proof to the one
+developed in Corollary 2.2 in [5] is given in a corollary to Lemma 3.1, reconfirming the resultant
+matrix property of the FIM F(ϑ).
+
+3.2.2. Corollary 3.2
+
+The FIM F(ϑ) of an ARMA(p, q) process becomes singular if and only if the autoregressive and moving
+average polynomials â(z) and b̂(z) have at least one common root.
+
+Proof
+
+Combining the equality (31) with the property Det S⊤(b, −a) = Det S(b, −a) and the matrix resultant
+property of the Sylvester matrix S(b, −a) yields Det S⊤(b, −a) = 0 ⇔ Ker S⊤(b, −a) ≠ {0} if and only if
+the ARMA(p, q) polynomials â(z) and b̂(z) have at least one common root. Equivalently,
+Det S⊤(b, −a) ≠ 0 ⇔ Ker S⊤(b, −a) = {0} if and only if the ARMA(p, q) polynomials â(z) and b̂(z)
+have no common roots. Consequently, by virtue of the equality Ker F(ϑ) = Ker S⊤(b, −a), it can be
+concluded that the FIM F(ϑ) becomes singular if and only if the ARMA(p, q) polynomials â(z) and b̂(z)
+have at least one common root. This completes the proof.
+
+3.3. The Statistical Distance Measure and the Fisher Information Matrix
+
+In [7] statistical distance measures are studied. Most multivariate statistical techniques are based
+upon the concept of distance. For that purpose a statistical distance measure is considered that is
+a normalized Euclidean distance measure with entries of the FIM as weighting coefficients. The
+measurements x1, x2, . . . , xn are subject to random fluctuations of different magnitudes and
+therefore have different variabilities. It is then important to consider a distance that takes the variability
+of these measurements into account when determining the distance from a fixed point. A
+rotation of the coordinate system through a chosen angle, while keeping the scatter of points given
+by the data fixed, is also applied, see [7] for more details. It is shown that when the FIM is positive
+definite, the appropriate statistical distance measure is a metric. In the case of a singular FIM of an ARMA
+stationary process, the metric property depends on the rotation angle. This statistical distance measure
+is based on m parameters, unlike a statistical distance measure introduced in quantum information, see
+e.g., [8,9], which is also related to the Fisher information but considers the information about one
+parameter in a particular measurement procedure.
+
+The straight-line or Euclidean distance between the stochastic vector x = (x1 x2 . . . xn)⊤
+and the fixed vector y = (y1 y2 . . . yn)⊤, where x, y ∈ R^n, is given by
+
+\[
+d(x, y) = \|x - y\| = \Bigl( \sum_{j=1}^{n} (x_j - y_j)^2 \Bigr)^{1/2} \tag{32}
+\]
+
+where the metric d(x, y) := ∥x − y∥ is induced by the standard Euclidean norm ∥ · ∥ on R^n, see
+e.g., [2] for the metric conditions.
+
+The observations x1, x2, . . . , xn are used to compute maximum likelihood estimates of the
+parameters ϑ1, ϑ2, . . . , ϑm, where m < n. These estimated parameters are random variables, see
+e.g., [15]. The distance of the estimated vector ϑ ∈ R^m, given in (15), is studied. Entries of the FIM are
+inserted in the distance measure as weighting coefficients. The linear transformation
+
+ϑ̃ = Li(ϕ)ϑ    (33)
+
+is applied, where Li(ϕ) ∈ R^{m×m} is the Givens rotation matrix with rotation angle ϕ, with 0 ≤ ϕ ≤ 2π
+and i ∈ {1, . . . , m − 1}, see e.g., [36], and is given by
+
+\[
+L_i(\phi) =
+\begin{pmatrix}
+I_{i-1} & 0 & 0 & 0\\
+0 & (\cos(\phi))_{i,i} & (-\sin(\phi))_{i,i+1} & 0\\
+0 & (\sin(\phi))_{i+1,i} & (\cos(\phi))_{i+1,i+1} & 0\\
+0 & 0 & 0 & I_{m-i-1}
+\end{pmatrix},
+\qquad 0 \le \phi \le 2\pi \tag{34}
+\]
+
+The following matrix decomposition is applied in order to obtain a transformed FIM
+
+Fϕ(ϑ) = Li(ϕ)F(ϑ)Li⊤(ϕ)    (35)
+
+where Fϕ(ϑ) and F(ϑ) are respectively the transformed and untransformed Fisher information
+matrices. By virtue of (35), the transformed and untransformed Fisher information matrices Fϕ(ϑ)
+and F(ϑ) are similar, since the rotation matrix Li(ϕ) is orthogonal.
+Two matrices A and B are similar if there exists an invertible matrix X such that the equality AX = XB
+holds. As can be seen, the Givens matrix Li(ϕ) involves only two coordinates that are affected by the
+rotation angle ϕ, whereas the other directions, which correspond to eigenvalues of one, are unaffected
+by the rotation matrix.
+
+By virtue of (35) it can be concluded that a positive definite FIM, F(ϑ) ≻ 0, implies a positive
+definite transformed FIM, Fϕ(ϑ) ≻ 0. Consequently, the elements on the main diagonal of F(ϑ), f1,1,
+f2,2, . . . , fm,m, as well as the elements on the main diagonal of Fϕ(ϑ), f̃1,1, f̃2,2, . . . , f̃m,m, are all
+positive. However, the elements on the main diagonal of a singular FIM of a stationary ARMA process
+are also positive.
+
+As developed in [7], combining (33) and (35) yields the distance measure of the estimated
+parameters ϑ1, ϑ2, . . . , ϑm, to obtain
+
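+\[
+d^2_{F_\phi}(\vartheta) = \sum_{j=1,\ j\neq i,\,i+1}^{m} \frac{\vartheta_j^2}{f_{j,j}}
++ \frac{\{\vartheta_i \cos(\phi) - \vartheta_{i+1} \sin(\phi)\}^2}{\tilde f_{i,i}(\phi)}
++ \frac{\{\vartheta_{i+1} \cos(\phi) + \vartheta_i \sin(\phi)\}^2}{\tilde f_{i+1,i+1}(\phi)} \tag{36}
+\]
+
+where
+
+\[
+\tilde f_{i,i}(\phi) = f_{i,i} \cos^2(\phi) - f_{i,i+1} \sin(2\phi) + f_{i+1,i+1} \sin^2(\phi) \tag{37}
+\]
+
+\[
+\tilde f_{i+1,i+1}(\phi) = f_{i+1,i+1} \cos^2(\phi) + f_{i,i+1} \sin(2\phi) + f_{i,i} \sin^2(\phi) \tag{38}
+\]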
+
+and fj,l are entries of the FIM F(ϑ), whereas f̃i,i(ϕ) and f̃i+1,i+1(ϕ) are the transformed components,
+since the rotation affects only the entries i and i+1, as can be seen in the matrix Li(ϕ). In [7], the
+following inequalities are proved
+
+f̃i,i(ϕ) > 0    and    f̃i+1,i+1(ϕ) > 0
+
+which guarantees the metric property of (36).
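+
+A small numerical instance may make the transformed diagonal entries concrete. The sketch below is an illustration only: the 2 × 2 matrix F and the angle ϕ = π/4 are assumed values chosen for easy arithmetic (with m = 2 and i = 1), not data from [7].
+
+```latex
+% Assumed example: m = 2, i = 1, rotation angle \phi = \pi/4.
+\begin{align*}
+F &= \begin{pmatrix} 2 & 1\\ 1 & 2 \end{pmatrix}, \qquad
+L_1(\pi/4) = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 & -1\\ 1 & 1 \end{pmatrix},\\
+% Transformed diagonal entries from the two formulas for \tilde f_{i,i} and \tilde f_{i+1,i+1}.
+\tilde f_{1,1} &= 2\cos^2(\tfrac{\pi}{4}) - 1\cdot\sin(\tfrac{\pi}{2}) + 2\sin^2(\tfrac{\pi}{4}) = 1 - 1 + 1 = 1 > 0,\\
+\tilde f_{2,2} &= 2\cos^2(\tfrac{\pi}{4}) + 1\cdot\sin(\tfrac{\pi}{2}) + 2\sin^2(\tfrac{\pi}{4}) = 1 + 1 + 1 = 3 > 0,\\
+% Direct check against the congruence F_\phi = L F L^T.
+L_1 F L_1^{\top} &= \begin{pmatrix} 1 & 0\\ 0 & 3 \end{pmatrix}.
+\end{align*}
+```
+
+In this instance the rotation happens to diagonalize F, so the two transformed diagonal entries are exactly its eigenvalues, and both are positive as required.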
+For the FIM of an ARMA(p, q) process, combining (27) and (35) for the ARMA(p, q) parameters
+given in (15) yields for the transformed FIM
+
+Fϕ(ϑ) = Sϕ(−b, a)℘(ϑ)Sϕ⊤(−b, a)    (39)
+
+where ℘(ϑ) is given by (28) and the transformed Sylvester resultant matrix is of the form
+
+Sϕ(−b, a) = Li(ϕ)S(−b, a)    (40)
+
+Proposition 3.5 in [7] proves that the transformed FIM Fϕ(ϑ) and the transformed Sylvester matrix
+Sϕ(−b, a) fulfill the resultant matrix property by using the equalities (40) and (39). The following
+property is then set forth.
+
+3.3.1. Proposition 3.3
+
+The properties
+
+Ker Fϕ(ϑ) = Ker Sϕ⊤(−b, a)    and    Ker Sϕ(−b, a) = Ker S(−b, a)
+
+hold true.
+
+Proof
+
+By virtue of the equalities (39) and (40) and the orthogonality of the rotation matrix Li(ϕ),
+which implies that Ker Li(ϕ) = {0}, the same approach as in Lemma 3.1 completes
+the proof.
+
+A straightforward conclusion from Proposition 3.3 is then
+
+dim Ker Fϕ(ϑ) = dim Ker Sϕ(−b, a),    dim Ker Sϕ(−b, a) = dim Ker S(−b, a)
+
+In the next section a distance measure introduced in quantum information is discussed.
+
+Statistical Distance Measure - Fisher Information and Quantum Information
+
+In quantum information, the Fisher information, the information about a parameter θ in a
+particular measurement procedure, is expressed in terms of the statistical distance s, see [8,10]. The
+statistical distance used is defined as a measure to distinguish two probability distributions on the basis
+of measurement outcomes, see [37]. The Fisher information and the statistical distance are statistical
+quantities, and generally refer to many measurements as it is the case in this survey. However, in
+the quantum information theory and quantum statistics context, the problem set up is presented as
+follows. There may or may not be a small phase change θ, and the question is whether it is there. In
+that case you can design quantum experiments that will tell you the answer unambiguously in a single
+measurement. The equality derived is of the form
+
+F(θ) = (ds/dθ)²    (41)
+
+that is, the Fisher information is the square of the derivative of the statistical distance s with respect to θ.
+This is in contrast to (36), where the square of the statistical distance measure is expressed in terms of
+entries of a FIM F(ϑ) which is based on information about m parameters estimated from n measurements,
+for m < n. A challenging question could therefore be formulated as follows: can a generalization of
+equality (41) be developed in a quantum information context but at the matrix level? To be more
+specific, many observations or measurements would lead to more than one parameter, such that the
+corresponding Fisher information matrix is connected to an appropriate statistical distance matrix,
+a matrix whose entries are scalar distance measures. This question could equally be a challenge to
+algebraic matrix theory and to quantum information.
+
+3.4. The Bezoutian - The Fisher Information Matrix
+
+In this section an additional resultant matrix is presented, namely the Bezout matrix or
+Bezoutian. The notation of Lancaster and Tismenetsky [2] shall be used and the results presented are
+extracted from [38]. Assume the polynomials a and b are given by a(z) = ∑_{j=0}^{n} aj z^j and b(z) = ∑_{j=0}^{n} bj z^j,
+cf. (13) but where p = q = n, and we further assume a0 = b0 = 1. The Bezout matrix B(a, b) of the
+polynomials a and b is defined by the relation
+
+a(z)b(w) − a(w)b(z) = (z − w) un⊤(z) B(a, b) un(w)
+
+This matrix is often referred to as the Bezoutian. We will display a decomposition of the Bezout matrix
+B(a, b) developed in [38]. For that purpose the matrix Uϕ and its inverse Tϕ are presented, where ϕ is a
+given complex number, to obtain
+
+\[
+U_\phi =
+\begin{pmatrix}
+1 & 0 & \cdots & \cdots & 0\\
+-\phi & 1 & \cdots & \cdots & 0\\
+0 & \ddots & \ddots & & \vdots\\
+\vdots & \ddots & \ddots & \ddots & \vdots\\
+0 & \cdots & 0 & -\phi & 1
+\end{pmatrix},
+\qquad
+T_\phi =
+\begin{pmatrix}
+1 & 0 & \cdots & \cdots & 0\\
+\phi & 1 & \cdots & \cdots & 0\\
+\phi^2 & \ddots & \ddots & & \vdots\\
+\vdots & \ddots & \ddots & \ddots & \vdots\\
+\phi^{n-1} & \cdots & \phi^2 & \phi & 1
+\end{pmatrix}
+\]
+
+Let (1 − α1z) and (1 − β1z) be factors of a(z) and b(z) respectively, where α1 and β1 are zeros of â(z)
+and b̂(z). Consider the factored form of the nth order polynomials a(z) and b(z), a(z) = (1 − α1z)a_{−1}(z)
+and b(z) = (1 − β1z)b_{−1}(z) respectively. Proceeding this way for α2, . . . , αn yields the
+recursion a_{−(k−1)}(z) = (1 − αkz)a_{−k}(z), and likewise for the polynomials b_{−k}(z), with a_0(z) = a(z) and
+b_0(z) = b(z). Proposition 3.1 in [38] is now presented.
+
+With the notation above, the following non-symmetric decomposition of the Bezoutian is derived
+
+\[
+B(a, b) = U_{\alpha_1}
+\begin{pmatrix} B(a_{-1}, b_{-1}) & 0\\ 0 & 0 \end{pmatrix}
+U_{\beta_1}^{\top} + (\beta_1 - \alpha_1)\, b_{\beta_1} a_{\alpha_1}^{\top} \tag{42}
+\]
+
+with a_{α1} such that a_{α1}⊤ un(z) = a_{−1}(z), and similarly for b_{β1}. Iteration gives the following expansion for the
+Bezout matrix
+
+\[
+B(a, b) = \sum_{k=1}^{n} (\beta_k - \alpha_k)\,
+U_{\alpha_1} \cdots U_{\alpha_{k-1}} U_{\beta_{k+1}} \cdots U_{\beta_n}\,
+e_1^n (e_1^n)^{\top}\,
+U_{\beta_1}^{\top} \cdots U_{\beta_{k-1}}^{\top} U_{\alpha_{k+1}}^{\top} \cdots U_{\alpha_n}^{\top}
+\]
+
+where e_1^n is the first standard basis column vector in R^n; by ej we denote the jth coordinate vector,
+ej = (0, . . . , 1, . . . , 0)⊤, with all its components equal to 0 except the jth component, which equals 1.
+The following corollaries to Proposition 3.1 in [38] are now presented.
+
+Corollary 3.2 in [38] states: let ϕ be a common zero of the polynomials â(z) and b̂(z). Then a(z) =
+(1 − ϕz)a_{−1}(z) and b(z) = (1 − ϕz)b_{−1}(z) and
+
+\[
+B(a, b) = U_\phi
+\begin{pmatrix} B(a_{-1}, b_{-1}) & 0\\ 0 & 0 \end{pmatrix}
+U_\phi^{\top}
+\]
+
+This is a direct consequence of (42), from which it can be concluded that the Bezoutian B(a, b) is
+non-singular if and only if the polynomials a(z) and b(z) have no common factors. A similar conclusion
+is drawn for the FIM in (27), so that the matrices F(ϑ) and B(a, b) have the same singularity property.
+
+Related to Corollary 3.2 in [38], we now give a description of the kernel or null space of
+the Bezout matrix.
+Corollary 3.3 in [38] is now presented. Let ϕ1, . . . , ϕm be all the common zeros of the polynomials
+â(z) and b̂(z), with multiplicities n1, . . . , nm. Let ℓ be the last standard basis column vector in R^n
+and put
+
+\[
+w_k^j = \bigl( T_{\phi_k}^{\,j} J^{\,j-1} \bigr)^{\top} \ell
+\]
+
+for k = 1, . . . , m and j = 1, . . . , nk, where J denotes the forward n × n shift matrix, Jij = 1 if i = j + 1
+and Jij = 0 otherwise. Then the subspace Ker B(a, b) is the linear span of the vectors w_k^j.
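+
+The kernel description can be traced on the smallest non-trivial case, n = 2. The following sketch assumes u_2(z) = (1, z)⊤ and the defining relation a(z)b(w) − a(w)b(z) = (z − w) u_2⊤(z) B(a, b) u_2(w); the numerical polynomials are illustrative choices and do not come from [38].
+
+```latex
+% General 2 x 2 Bezoutian for a(z) = 1 + a_1 z + a_2 z^2, b(z) = 1 + b_1 z + b_2 z^2 (assumes u_2(z) = (1, z)^T).
+\begin{align*}
+a(z)b(w) - a(w)b(z) &= (z-w)\bigl[(a_1-b_1) + (a_2-b_2)(z+w) + (a_2 b_1 - a_1 b_2)\,zw\bigr],\\
+B(a,b) &= \begin{pmatrix} a_1-b_1 & a_2-b_2\\ a_2-b_2 & a_2 b_1 - a_1 b_2 \end{pmatrix},\\
+% Assumed example with common factor (1 - z/2), i.e. common zero \phi = 1/2 of \hat a and \hat b.
+a(z) &= (1-\tfrac{z}{2})(1-\tfrac{z}{3}) = 1 - \tfrac{5}{6}z + \tfrac{1}{6}z^2, \qquad
+b(z) = (1-\tfrac{z}{2})(1-\tfrac{z}{4}) = 1 - \tfrac{3}{4}z + \tfrac{1}{8}z^2,\\
+B(a,b) &= -\tfrac{1}{48}\begin{pmatrix} 4 & -2\\ -2 & 1 \end{pmatrix}, \qquad \operatorname{Det} B(a,b) = 0,\\
+% Kernel vector predicted by the corollary: w = T_\phi^T \ell = (\phi, 1)^T.
+B(a,b)\begin{pmatrix} \tfrac12\\ 1 \end{pmatrix} &= 0 .
+\end{align*}
+```
+
+Here Ker B(a, b) is one-dimensional and spanned by (ϕ, 1)⊤ with ϕ = 1/2, in agreement with the single common zero.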
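+An alternative representation to (27), involving the Bezoutian B(b, a) and derived in
+Proposition 5.1 in [38], is of the form
+
+\[
+F(\vartheta) = M^{-1}(b, a)\, H(\vartheta)\, M^{-\top}(b, a) \tag{43}
+\]
+
+where
+
+\[
+H(\vartheta) =
+\begin{pmatrix} I & 0\\ 0 & B(b, a) \end{pmatrix}
+Q(\vartheta)
+\begin{pmatrix} I & 0\\ 0 & B(b, a) \end{pmatrix}
+\quad\text{and}\quad
+M(b, a) =
+\begin{pmatrix} P & 0\\ P S(\hat a) P & P S(\hat b) P \end{pmatrix} \tag{44}
+\]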
+
+and
+
+\[
+P =
+\begin{pmatrix}
+0 & \cdots & 0 & 1\\
+\vdots & & 1 & 0\\
+0 & \iddots & & \vdots\\
+1 & 0 & \cdots & 0
+\end{pmatrix},
+\qquad
+S(\hat a) =
+\begin{pmatrix}
+a_{n-1} & a_{n-2} & \cdots & a_0\\
+a_{n-2} & \iddots & a_0 & 0\\
+\vdots & \iddots & \iddots & \vdots\\
+a_0 & 0 & \cdots & 0
+\end{pmatrix}
+\qquad\text{and}\qquad Q(\vartheta) \succ 0
+\]
+
+The matrix S(â) is the symmetrizer of the polynomial â(z) (in this paper a0 = 1), see [2], and P is a
+permutation matrix. In [38] it is shown that the matrix Q(ϑ) is the unique solution to an appropriate
+Stein equation and is strictly positive definite. In the next section an explicit form of the Stein
+solution Q(ϑ) is developed. Some comments concerning the property summarized in Corollary 5.2
+in [38] follow.
+
+The matrix H(ϑ) is non-singular if and only if the polynomials a(z) and b(z) have no common
+factors. The proof is straightforward: since the matrix Q(ϑ) is non-singular, the
+matrix H(ϑ) is non-singular only when the Bezoutian B(b, a) is non-singular, and this is fulfilled if and
+only if the polynomials a(z) and b(z) have no common factors.
+
+The matrix M(b, a) is non-singular if a0 ≠ 0 and b0 ≠ 0, which is the case since we have a0 =
+b0 = 1. From (43) it can be concluded that the FIM F(ϑ) is non-singular only when the matrix H(ϑ)
+is non-singular or, by virtue of (44), when the Bezoutian B(b, a) is non-singular. Consequently, the
+singularity conditions of the Bezoutian B(b, a), the FIM F(ϑ) and the Sylvester resultant matrix
+S(b, −a) are equivalent. By virtue of (29), proved in Lemma 3.1, and the
+equality dim (Ker S(a, b)) = dim (Ker B(a, b)), proved in Theorem 21.11 in [1], it can be concluded that
+
+dim (Ker S(b, −a)) = dim (Ker F(ϑ)) = dim (Ker B(b, a)) = ν(a, b)
+
+3.5. The Stein Equation - The Fisher Information Matrix of an ARMA(p, q) Process
+
+In [12], a link between the FIM of an ARMA process and an appropriate solution of a Stein
+equation is set forth. In this survey paper we shall present some of these results and compare them with
+results displayed in the previous sections. In addition, alternative proofs will be given for some results
+obtained in [12,38].
+
+The Stein matrix equation is now set forth. Let A ∈ C^{m×m}, B ∈ C^{n×n} and Γ ∈ C^{n×m} and consider
+the Stein equation
+
+S − BSA⊤ = Γ    (45)
+
+It has a unique solution if and only if λμ ≠ 1 for any λ ∈ σ(A) and μ ∈ σ(B), where the spectrum of
+an m × m matrix D is σ(D) = {λ ∈ C : det(λIm − D) = 0}, the set of eigenvalues of D. The unique
+solution will be given in the next theorem [11].
+
+3.5.1. Theorem 3.4
+
+Let A and B be such that there is a single closed contour C with σ(B) inside C and, for each non-zero w ∈
+σ(A), w^{−1} outside C. Then for an arbitrary Γ the Stein equation (45) has a unique solution S
+
+\[
+S = \frac{1}{2\pi i} \oint_{C} (\lambda I_n - B)^{-1}\, \Gamma\, (I_m - \lambda A)^{-\top}\, d\lambda \tag{46}
+\]
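+
+The contour formula can be sanity-checked in the scalar case. The sketch below assumes m = n = 1 with scalars A = a, B = b, Γ = γ and ab ≠ 1, and a contour C that encloses b while leaving 1/a outside; these symbols are introduced here for illustration only.
+
+```latex
+% Scalar case (assumed notation): the Stein equation reduces to a linear equation in S.
+\begin{align*}
+S - b\,S\,a &= \gamma \quad\Longrightarrow\quad S = \frac{\gamma}{1 - ab},\\
+% Contour formula: the integrand has simple poles at \lambda = b (inside C) and \lambda = 1/a (outside C).
+\frac{1}{2\pi i}\oint_{C} \frac{\gamma}{(\lambda - b)(1 - \lambda a)}\, d\lambda
+&= \operatorname{Res}_{\lambda = b} \frac{\gamma}{(\lambda - b)(1 - \lambda a)} = \frac{\gamma}{1 - ab}.
+\end{align*}
+```
+
+Both routes give the same value, and the requirement that 1/a lie outside C is what keeps the second pole from contributing.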
+
+In this section a connection between the representation (27) of the FIM F(ϑ) and an appropriate
+solution to a Stein equation of the form (45), as developed in [12], is set forth. The distinct roots of
+the polynomials â(z) and b̂(z) are denoted by α1, α2, . . . , αp and β1, β2, . . . , βq respectively, such
+that the non-singularity of the FIM F(ϑ) is guaranteed. The following representation of the integral
+expression (28) is obtained when Cauchy's residue theorem is applied, equation (4.8) in [12]
+
+℘(ϑ) = U(ϑ)D(ϑ)Û(ϑ)    (47)
+
+where
+
+U(ϑ) = {up+q(α1), up+q(α2), . . . , up+q(αp), up+q(β1), up+q(β2), . . . , up+q(βq)}
+
+\[
+D(\vartheta) = \operatorname{diag}\Bigl\{
+\frac{1}{\hat a(z;\alpha_i)\,\hat b(\alpha_i)\,a(\alpha_i)\,b(\alpha_i)},\
+\frac{1}{\hat a(\beta_j)\,\hat b(z;\beta_j)\,a(\beta_j)\,b(\beta_j)}
+\Bigr\}, \qquad i = 1, \dots, p \ \text{and}\ j = 1, \dots, q
+\]
+
+and
+
+Û(ϑ) = {vp+q(α1), vp+q(α2), . . . , vp+q(αp), vp+q(β1), vp+q(β2), . . . , vp+q(βq)}⊤
+
+The polynomial p(·; β) is defined by p(z; β) = p(z)/(z − β), and D(ϑ) is a (p + q) × (p + q) diagonal
+matrix. The matrices U(ϑ) and Û(ϑ) in (47) are the (p + q) × (p + q) Vandermonde matrices Vαβ and
+V̂αβ respectively, given by
+
+\[
+V_{\alpha\beta} =
+\begin{pmatrix}
+1 & \alpha_1 & \alpha_1^2 & \cdots & \alpha_1^{p+q-1}\\
+1 & \alpha_2 & \alpha_2^2 & \cdots & \alpha_2^{p+q-1}\\
+\vdots & \vdots & \vdots & & \vdots\\
+1 & \alpha_p & \alpha_p^2 & \cdots & \alpha_p^{p+q-1}\\
+1 & \beta_1 & \beta_1^2 & \cdots & \beta_1^{p+q-1}\\
+1 & \beta_2 & \beta_2^2 & \cdots & \beta_2^{p+q-1}\\
+\vdots & \vdots & \vdots & & \vdots\\
+1 & \beta_q & \beta_q^2 & \cdots & \beta_q^{p+q-1}
+\end{pmatrix}
+\quad\text{and}\quad
+\hat V_{\alpha\beta} =
+\begin{pmatrix}
+\alpha_1^{p+q-1} & \alpha_1^{p+q-2} & \cdots & \alpha_1 & 1\\
+\alpha_2^{p+q-1} & \alpha_2^{p+q-2} & \cdots & \alpha_2 & 1\\
+\vdots & \vdots & & \vdots & \vdots\\
+\alpha_p^{p+q-1} & \alpha_p^{p+q-2} & \cdots & \alpha_p & 1\\
+\beta_1^{p+q-1} & \beta_1^{p+q-2} & \cdots & \beta_1 & 1\\
+\beta_2^{p+q-1} & \beta_2^{p+q-2} & \cdots & \beta_2 & 1\\
+\vdots & \vdots & & \vdots & \vdots\\
+\beta_q^{p+q-1} & \beta_q^{p+q-2} & \cdots & \beta_q & 1
+\end{pmatrix}
+\]
+
+
+It is clear that the (p + q) × (p + q) Vandermonde matrices Vαβ and V̂αβ are non-singular when αi ≠ αj,
+βk ≠ βh and αi ≠ βk for all i ≠ j with i, j = 1, . . . , p and all k ≠ h with k, h = 1, . . . , q. A systematic evaluation of the
+Vandermonde determinants Det Vαβ and Det V̂αβ yields
+
+\[
+\operatorname{Det} V_{\alpha\beta} = (-1)^{(p+q)(p+q-1)/2}\, \Phi(\alpha_i, \beta_k)
+\]
+
+where
+
+\[
+\Phi(\alpha_i, \beta_k) = \prod_{1 \le i < j \le p} (\alpha_i - \alpha_j)
+\prod_{1 \le k < h \le q} (\beta_k - \beta_h)
+\prod_{\substack{m = 1, \dots, p\\ n = 1, \dots, q}} (\alpha_m - \beta_n)
+\]
+
+Since Vαβ = P V̂αβ⊤ and given the configuration of the permutation matrix P, this leads to the equalities
+Det V̂αβ⊤ = Det P Det Vαβ and Det P = (−1)^{(p+q)(p+q−1)/2}, so that
+
+\[
+\operatorname{Det} \hat V_{\alpha\beta} = (-1)^{(p+q)(p+q-1)}\, \Phi(\alpha_i, \beta_k)
+\ \Rightarrow\ |\operatorname{Det} V_{\alpha\beta}| = |\operatorname{Det} \hat V_{\alpha\beta}|
+\]
+
+We shall now introduce an appropriate Stein equation of the form (45) such that a connection with
+℘(ϑ) in (47) can be verified. Therefore the following (p + q) × (p + q) companion matrix is introduced,
+
+\[
+C_g =
+\begin{pmatrix}
+0 & 1 & \cdots & 0\\
+\vdots & \ddots & \ddots & \vdots\\
+0 & \cdots & 0 & 1\\
+-g_{p+q} & -g_{p+q-1} & \cdots & -g_1
+\end{pmatrix} \tag{48}
+\]
+
+where the entries gi are given by z^{p+q} + ∑_{i=1}^{p+q} gi(ϑ) z^{p+q−i} = â(z)b̂(z) = ĝ(z, ϑ) and ĝ(ϑ) is the vector
+ĝ(ϑ) = (gp+q(ϑ), gp+q−1(ϑ), . . . , g1(ϑ))⊤. Likewise, g(z, ϑ) = a(z)b(z) and g(ϑ) = (g1(ϑ), g2(ϑ),
+. . . , gp+q(ϑ))⊤; for the properties of a companion matrix see e.g., [36], [2]. Since all
+the roots of the polynomials â(z) and b̂(z) are distinct and lie within the unit circle, the
+products αiβj ≠ 1, αiαj ≠ 1 and βiβj ≠ 1 hold for all i = 1, 2, . . . , p and j = 1, 2, . . . , q. Consequently,
+the uniqueness condition of the solution of an appropriate Stein equation is verified. The following
+Stein equation and its solution, according to (45) and (46), are now presented
+
+\[
+S - C_g S C_g^{\top} = \Gamma
+\quad\text{and}\quad
+S = \frac{1}{2\pi i} \oint_{|z|=1} (zI_{p+q} - C_g)^{-1}\, \Gamma\, (I_{p+q} - zC_g)^{-\top}\, dz
+\]
+
+where the closed contour is now the unit circle |z| = 1 and the matrix Γ is of size (p + q) × (p + q). A
+more explicit expression of the solution S is of the form
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1}
+\frac{\operatorname{adj}(zI_{p+q} - C_g)\, \Gamma\, \operatorname{adj}(I_{p+q} - zC_g)^{\top}}
+{a(z)\, b(z)\, \hat a(z)\, \hat b(z)}\, dz \tag{49}
+\]
+where adj(X) = X^{−1} Det(X) is the adjugate (classical adjoint) of the matrix X. When Cauchy's residue theorem is applied to the
+solution S in (49), the following factored form of S is derived, equation (4.9) in [12]
+
+S = (C1, C2) (Ip+q ⊗ Γ) (D(ϑ) ⊗ Ip+q) (C3, C4)⊤    (50)
+
+where
+
+C1 = adj(α1Ip+q − Cg), adj(α2Ip+q − Cg), . . . , adj(αpIp+q − Cg)
+C2 = adj(β1Ip+q − Cg), adj(β2Ip+q − Cg), . . . , adj(βqIp+q − Cg)
+C3 = adj(Ip+q − α1Cg), adj(Ip+q − α2Cg), . . . , adj(Ip+q − αpCg)
+C4 = adj(Ip+q − β1Cg), adj(Ip+q − β2Cg), . . . , adj(Ip+q − βqCg)
+
+and D(ϑ) is given in (47). The following matrix rule is applied
+
+(A ⊗ B) (C ⊗ D) = AC ⊗ BD
+
+where the operator ⊗ is the tensor (Kronecker) product of two matrices, see e.g., [2], [20].
+
+Combining (47) and (50) and taking the assumption αi ≠ αj, βk ≠ βh and αi ≠ βk into account
+implies that the inverses of the (p + q) × (p + q) Vandermonde matrices Vαβ and V̂αβ exist, as Lemma
+4.2 in [12] states.
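+The following equality holds true
+
+\[
+S = (C_1, C_2) \bigl( V_{\alpha\beta}^{-1}\, \mathcal{P}(\vartheta)\, \hat V_{\alpha\beta}^{-1} \otimes \Gamma \bigr) (C_3, C_4)^{\top}
+\]
+
+or
+
+\[
+S = (C_1, C_2) \bigl( V_{\alpha\beta}^{-1}\, S^{-1}(b, -a)\, F(\vartheta)\, S^{-\top}(b, -a)\, \hat V_{\alpha\beta}^{-1} \otimes \Gamma \bigr) (C_3, C_4)^{\top} \tag{51}
+\]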
+
+Consequently, under the condition αi ≠ αj, βk ≠ βh and αi ≠ βk, and by virtue of (27) and (51),
+a connection involving the FIM F(ϑ), a solution S to an appropriate Stein equation, the
+Sylvester matrix S(b, −a) and the Vandermonde matrices Vαβ and V̂αβ is established. It is clear that by using the
+expression (43), the Bezoutian B(a, b) can be inserted in equality (51).
+
+We will formulate a Stein equation when the matrix Γ = e_{p+q} e_{p+q}⊤,
+
+S − Cg S Cg⊤ = e_{p+q} e_{p+q}⊤    (52)
+
+where e_{p+q} is the last standard basis column vector in R^{p+q}; e_i^m denotes the i-th standard basis
+column vector in R^m, with all its components equal to 0 except the i-th component, which equals 1. The
+next lemma is formulated.
+
+3.5.2. Lemma 3.5
+
+The symmetric matrix ℘(ϑ) defined in (28) fulfills the Stein equation (52).
+
+Proof
+
+The unique solution of (52) is, according to (46),
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1} (zI_{p+q} - C_g)^{-1}\, e_{p+q} e_{p+q}^{\top}\, (I_{p+q} - zC_g)^{-\top}\, dz
+\]
+
+or, written more explicitly,
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1}
+\frac{\operatorname{adj}(zI_{p+q} - C_g)\, e_{p+q} e_{p+q}^{\top}\, \operatorname{adj}(I_{p+q} - zC_g)^{\top}}
+{a(z)\, b(z)\, \hat a(z)\, \hat b(z)}\, dz
+\]
+
+Using the property of the companion matrix Cg, standard computation shows that the last column
+of adj(zIp+q − Cg) is the basic vector up+q(z) and consequently the last column of adj(Ip+q − zCg)
+is the basic vector vp+q(z) = z^{p+q−1} up+q(z^{−1}). This implies that adj(zIp+q − Cg)ep+q = up+q(z) and
+ep+q⊤ adj(Ip+q − zCg)⊤ = vp+q⊤(z), or
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1}
+\frac{u_{p+q}(z)\, v_{p+q}^{\top}(z)}{a(z)\, b(z)\, \hat a(z)\, \hat b(z)}\, dz = \mathcal{P}(\vartheta)
+\]
+
+Consequently, the solution S to the Stein equation (52) coincides with the matrix ℘(ϑ) defined in (28).
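+
+The lemma can be traced in the simplest case. The sketch below assumes a pure AR(1) model, that is p = 1 and q = 0, so that b(z) = b̂(z) = 1, u_1(z) = v_1(z) = 1 and |a_1| < 1; allowing q = 0 here is an assumption made for illustration.
+
+```latex
+% Assumed AR(1) case: a(z) = 1 + a_1 z, \hat a(z) = z + a_1, C_g = (-a_1), e_1 e_1^T = 1.
+\begin{align*}
+% Stein equation with \Gamma = 1.
+S - a_1^2\, S &= 1 \quad\Longrightarrow\quad S = \frac{1}{1 - a_1^2},\\
+% Contour integral defining \mathcal{P}(\vartheta): only the pole z = -a_1 lies inside |z| = 1.
+\mathcal{P}(\vartheta) &= \frac{1}{2\pi i}\oint_{|z|=1} \frac{dz}{(1 + a_1 z)(z + a_1)}
+= \frac{1}{1 + a_1(-a_1)} = \frac{1}{1 - a_1^2}.
+\end{align*}
+```
+
+The two values agree, and 1/(1 − a_1²) is the familiar Fisher information of an AR(1) parameter.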
+
+The Stein equation that is verified by the FIM F(ϑ) will now be considered. For that purpose we
+display the following p × p and q × q companion matrices Ca and Cb of the form
+
+\[
+C_a =
+\begin{pmatrix}
+-a_1 & -a_2 & \cdots & \cdots & -a_p\\
+1 & 0 & \cdots & \cdots & 0\\
+0 & \ddots & \ddots & & \vdots\\
+\vdots & \ddots & \ddots & \ddots & \vdots\\
+0 & \cdots & 0 & 1 & 0
+\end{pmatrix},
+\qquad
+C_b =
+\begin{pmatrix}
+-b_1 & -b_2 & \cdots & \cdots & -b_q\\
+1 & 0 & \cdots & \cdots & 0\\
+0 & \ddots & \ddots & & \vdots\\
+\vdots & \ddots & \ddots & \ddots & \vdots\\
+0 & \cdots & 0 & 1 & 0
+\end{pmatrix}
+\]
+
+respectively. Introduce the (p + q) × (p + q) matrix K(ϑ) and the (p + q) × 1 vector B,
+
+\[
+K(\vartheta) = \begin{pmatrix} C_a & O\\ O & C_b \end{pmatrix},
+\qquad
+B = \begin{pmatrix} e_p^1\\ -e_q^1 \end{pmatrix}
+\]
+
+where e_p^1 and e_q^1 are the first standard basis column vectors in R^p and R^q respectively.
+
+Consider the Stein equation
+
+S − K(ϑ)SK⊤(ϑ) = BB⊤    (53)
+
+followed by the theorem.
+
+3.5.3. Theorem 3.6
+
+The Fisher information matrix F(ϑ) in (17) coincides with the solution to the Stein equation (53).
+
+Proof
+
+The eigenvalues of the companion matrices Ca and Cb are respectively the zeros of the
+polynomials â(z) and b̂(z), which are in absolute value smaller than one. This implies that the unique
+solution of the Stein equation (53) exists and is given by
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1} (zI_{p+q} - K(\vartheta))^{-1}\, BB^{\top}\, (I_{p+q} - zK(\vartheta))^{-\top}\, dz
+\]
+
+Developing this integral expression in a more explicit form yields
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1}
+\begin{pmatrix}
+\dfrac{\operatorname{adj}(zI_p - C_a)}{\hat a(z)} & O\\
+O & \dfrac{\operatorname{adj}(zI_q - C_b)}{\hat b(z)}
+\end{pmatrix}
+\begin{pmatrix} e_p^1\\ -e_q^1 \end{pmatrix}
+\left\{
+\begin{pmatrix}
+\dfrac{\operatorname{adj}(I_p - zC_a)}{a(z)} & O\\
+O & \dfrac{\operatorname{adj}(I_q - zC_b)}{b(z)}
+\end{pmatrix}
+\begin{pmatrix} e_p^1\\ -e_q^1 \end{pmatrix}
+\right\}^{\top} dz
+\]
+
+
+Considering the form of the companion matrices Ca and Cb leads through straightforward
+computation to the conclusion that the first column of adj(zIp − Ca) is the basic vector vp(z) and
+consequently the first column of adj(Ip − zCa) is the basic vector up(z). The same holds for the companion
+matrix Cb, and this yields
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1}
+\begin{pmatrix} \dfrac{v_p(z)}{\hat a(z)}\\[2mm] -\dfrac{v_q(z)}{\hat b(z)} \end{pmatrix}
+\begin{pmatrix} \dfrac{u_p^{\top}(z)}{a(z)} & -\dfrac{u_q^{\top}(z)}{b(z)} \end{pmatrix}
+dz \tag{54}
+\]
+To obtain from (54) a representation equivalent to the FIM F(ϑ) in (17),
+the transpose of the solution to the Stein equation (53) is required, which gives
+
+\[
+S^{\top} = \frac{1}{2\pi i} \oint_{|z|=1}
+\begin{pmatrix}
+\dfrac{u_p(z)\, v_p^{\top}(z)}{a(z)\, \hat a(z)} & -\dfrac{u_p(z)\, v_q^{\top}(z)}{a(z)\, \hat b(z)}\\[3mm]
+-\dfrac{u_q(z)\, v_p^{\top}(z)}{\hat a(z)\, b(z)} & \dfrac{u_q(z)\, v_q^{\top}(z)}{b(z)\, \hat b(z)}
+\end{pmatrix}
+dz = F(\vartheta) \tag{55}
+\]
+
+or
+
+\[
+S^{\top} = \frac{1}{2\pi i} \oint_{|z|=1} (I_{p+q} - zK(\vartheta))^{-1}\, BB^{\top}\, (zI_{p+q} - K(\vartheta))^{-\top}\, dz = F(\vartheta)
+\]
+
+The symmetry property of the FIM F(ϑ) leads to S = F(ϑ). From the representation (55) it can be
+concluded that the solution S of the Stein equation (53) coincides with the symmetric block Toeplitz FIM F(ϑ)
+given in (17). This completes the proof.
+
+It is straightforward to verify that the submatrix (1,2) in (55) is the complex conjugate transpose
+of the submatrix (2,1), whereas each submatrix on the main diagonal is Hermitian; consequently,
+the integrand is Hermitian. This implies that when the standard residue theorem is applied, it yields
+F(ϑ) = F⊤(ϑ).
+An Illustrative Example of Theorem 3.6
+
+To illustrate Theorem 3.6, the case of an ARMA(2, 2) process is considered. We will use the
+representation (17) for computing the FIM F(ϑ) of an ARMA(2, 2) process. The autoregressive and
+moving average polynomials are of degree two, that is p = q = 2, and the ARMA(2, 2) process is described by
+
+a(z)y(t) = b(z)ϵ(t)    (56)
+
+where y(t) is the stationary process driven by white noise ϵ(t), a(z) = 1 + a1z + a2z² and b(z) = 1 + b1z +
+b2z², and the parameter vector is ϑ = (a1, a2, b1, b2)⊤. The condition that the zeros of the polynomials
+
+â(z) = z²a(z^{−1}) = z² + a1z + a2 and b̂(z) = z²b(z^{−1}) = z² + b1z + b2
+
+are in absolute value smaller than one is imposed. The FIM F(ϑ) of the ARMA(2, 2) process (56) is of
+the form
+
+\[
+F(\vartheta) =
+\begin{pmatrix}
+F_{aa}(\vartheta) & F_{ab}(\vartheta)\\
+F_{ab}^{\top}(\vartheta) & F_{bb}(\vartheta)
+\end{pmatrix} \tag{57}
+\]
+
+where
+
+\[
+F_{aa}(\vartheta) = \frac{1}{(1 - a_2)\bigl[(1 + a_2)^2 - a_1^2\bigr]}
+\begin{pmatrix} 1 + a_2 & -a_1\\ -a_1 & 1 + a_2 \end{pmatrix}
+\]
+
+\[
+F_{bb}(\vartheta) = \frac{1}{(1 - b_2)\bigl[(1 + b_2)^2 - b_1^2\bigr]}
+\begin{pmatrix} 1 + b_2 & -b_1\\ -b_1 & 1 + b_2 \end{pmatrix}
+\]
+
+\[
+F_{ab}(\vartheta) = \frac{1}{(a_2 b_2 - 1)^2 + (a_2 b_1 - a_1)(b_1 - a_1 b_2)}
+\begin{pmatrix} a_2 b_2 - 1 & a_1 - a_2 b_1\\ b_1 - a_1 b_2 & a_2 b_2 - 1 \end{pmatrix}
+\]
+
+The submatrices Faa(ϑ) and Fbb(ϑ) are symmetric and Toeplitz, whereas Fab(ϑ) is Toeplitz. One can
+assert that, without any loss of generality, the symmetric block Toeplitz property holds for the class
+of Fisher information matrices of stationary ARMA(p, q) processes, where p and q are arbitrary, finite
+integers that represent the degrees of the autoregressive and moving average polynomials, respectively.
+
+The appropriate companion matrices Ca, Cb and the 4 × 4 matrices K(ϑ) and BB⊤ are
+
+\[
+C_a = \begin{pmatrix} -a_1 & -a_2\\ 1 & 0 \end{pmatrix},
+\quad
+C_b = \begin{pmatrix} -b_1 & -b_2\\ 1 & 0 \end{pmatrix},
+\quad
+K(\vartheta) =
+\begin{pmatrix}
+-a_1 & -a_2 & 0 & 0\\
+1 & 0 & 0 & 0\\
+0 & 0 & -b_1 & -b_2\\
+0 & 0 & 1 & 0
+\end{pmatrix}
+\quad\text{and}\quad
+BB^{\top} =
+\begin{pmatrix}
+1 & 0 & -1 & 0\\
+0 & 0 & 0 & 0\\
+-1 & 0 & 1 & 0\\
+0 & 0 & 0 & 0
+\end{pmatrix} \tag{58}
+\]
+
+where B = (1 0 −1 0)⊤. It can be verified that the Stein equation
+
+F(ϑ) − K(ϑ)F(ϑ)K⊤(ϑ) = BB⊤
+
+holds true when F(ϑ) is of the form (57) and the matrices K(ϑ) and BB⊤ are given in (58).
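+
+The same verification is short enough to write out in full for first-order polynomials. The sketch below assumes an ARMA(1, 1) process, p = q = 1 with |a_1| < 1 and |b_1| < 1, so that the companion matrices reduce to the scalars C_a = −a_1 and C_b = −b_1; this smaller case is chosen for illustration and is not worked in the source.
+
+```latex
+% Assumed ARMA(1,1) case: K = diag(-a_1, -b_1), B = (1, -1)^T.
+\begin{align*}
+K(\vartheta) &= \begin{pmatrix} -a_1 & 0\\ 0 & -b_1 \end{pmatrix}, \qquad
+BB^{\top} = \begin{pmatrix} 1 & -1\\ -1 & 1 \end{pmatrix},\\
+% Entry-wise Stein equation S - K S K^T = B B^T.
+s_{11}(1 - a_1^2) &= 1, \qquad s_{12}(1 - a_1 b_1) = -1, \qquad s_{22}(1 - b_1^2) = 1,\\
+F(\vartheta) = S &= \begin{pmatrix} \dfrac{1}{1 - a_1^2} & -\dfrac{1}{1 - a_1 b_1}\\[3mm] -\dfrac{1}{1 - a_1 b_1} & \dfrac{1}{1 - b_1^2} \end{pmatrix},\\
+% Singularity check: the determinant vanishes exactly when a_1 = b_1, i.e. when the polynomials share their root.
+\operatorname{Det} F(\vartheta) &= \frac{(a_1 - b_1)^2}{(1 - a_1^2)(1 - b_1^2)(1 - a_1 b_1)^2}.
+\end{align*}
+```
+
+The determinant makes the resultant property visible: F(ϑ) is singular if and only if a_1 = b_1, which is precisely the case of a common root of â(z) and b̂(z).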
+
+3.5.4. Some Additional Results
+
+Proposition 5.1 in [38] shows that the matrix Q(ϑ) in (44) fulfills the Stein equation (59) and proves the
+property Q(ϑ) ≻ 0. It states that, with e_P = (e_1⊤P, 0)⊤ = (e_n⊤, 0_n⊤)⊤ ∈ R^{2n}, where e_1 is the first standard basis
+column vector in R^n and e_n is the last or n-th standard basis column vector in R^n, the following
+Stein equation admits the form
+
+\[
+Q(\vartheta) = F_N(\vartheta)\, Q(\vartheta)\, F_N^{\top}(\vartheta) + e_P e_P^{\top} \tag{59}
+\]
+
+where
+
+\[
+F_N(\vartheta) =
+\begin{pmatrix} \hat C_a & 0\\ e_1 e_1^{\top} & C_b \end{pmatrix},
+\qquad
+\hat C_a =
+\begin{pmatrix}
+0 & 1 & 0 & \cdots & 0\\
+0 & 0 & 1 & \cdots & 0\\
+\vdots & \vdots & \ddots & \ddots & \vdots\\
+0 & \cdots & \cdots & 0 & 1\\
+-a_p & -a_{p-1} & \cdots & \cdots & -a_1
+\end{pmatrix}
+\]
+
+A corollary to Proposition 5.1 in [38] will be set forth, in which the involvement of various Vandermonde
+matrices in the explicit solution to (59) is confirmed. For that purpose the following Vandermonde matrices
+are displayed,
+
+\[
+V_\alpha =
+\begin{pmatrix}
+1 & 1 & \cdots & 1\\
+\alpha_1 & \alpha_2 & \cdots & \alpha_n\\
+\alpha_1^2 & \alpha_2^2 & \cdots & \alpha_n^2\\
+\vdots & \vdots & & \vdots\\
+\alpha_1^{n-1} & \alpha_2^{n-1} & \cdots & \alpha_n^{n-1}
+\end{pmatrix},
+\quad
+\hat V_\alpha =
+\begin{pmatrix}
+\alpha_1^{n-1} & \alpha_1^{n-2} & \cdots & 1\\
+\alpha_2^{n-1} & \alpha_2^{n-2} & \cdots & 1\\
+\alpha_3^{n-1} & \alpha_3^{n-2} & \cdots & 1\\
+\vdots & \vdots & & \vdots\\
+\alpha_n^{n-1} & \alpha_n^{n-2} & \cdots & 1
+\end{pmatrix},
+\quad
+\hat V_{\alpha\beta} = \begin{pmatrix} \hat V_\alpha\\ \hat V_\beta \end{pmatrix}
+\quad\text{and}\quad
+V_{\alpha\beta} = \begin{pmatrix} V_\alpha & V_\beta \end{pmatrix} \tag{60}
+\]
+
+where V̂β and Vβ have the same configuration as V̂α and Vα respectively. A corollary to Proposition
+5.1 in [38] is now formulated.
+
+3.5.5. Corollary 3.7
+
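+An explicit expression of the solution to the Stein equation (59) is of the form
+
+\[
+Q(\vartheta) =
+\begin{pmatrix}
+V_\alpha D_{11}(\vartheta) \hat V_\alpha & V_\alpha D_{12}(\vartheta) V_\alpha^{\top}\\
+\hat V_{\alpha\beta}^{\top} D_{21}(\vartheta) \hat V_{\alpha\beta} & \hat V_{\alpha\beta}^{\top} D_{22}(\vartheta) V_{\alpha\beta}^{\top}
+\end{pmatrix} \tag{61}
+\]
+
+where the n × n and 2n × 2n diagonal matrices Dkl(ϑ) shall be specified in the proof.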
+
+Proof
+
+The condition of a unique solution of the Stein 59 is guaranteed since the eigenvalues of the
+
+companions matrices �Ca and Cb given respectively by the zeros of the polynomials â (z) and �b (z)
+are in absolute value smaller than one. Consequently, the unique solution to the Stein 59 exists and is
+given by
+
+Q(ϑ) =
+1
+
+2πi
+
+�
+
+|z|=1
+(zI2n − FN(ϑ))−1ePe⊤
+P (I2n − zFN(ϑ))−⊤dz
+(62)
+
+in order to proceed successfully, the following matrix property is displayed, to obtain
+
+�
+A
+O
+B
+C
+
+�−1
+=
+
+�
+A−1
+O
+−C−1BA−1
+C−1
+
+�
+
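+When applied to (62), it yields
+
+\[ Q(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1}
+\begin{pmatrix} \dfrac{\operatorname{adj}(zI_p - \hat{C}_a)}{\hat{a}(z)} & O \\[2ex] \dfrac{\operatorname{adj}(zI_q - C_b)\, e_1 e_1^{\top} \operatorname{adj}(zI_p - \hat{C}_a)}{\hat{a}(z)\hat{b}(z)} & \dfrac{\operatorname{adj}(zI_q - C_b)}{\hat{b}(z)} \end{pmatrix}
+\begin{pmatrix} e_n \\ 0 \end{pmatrix} \]
+\[ \qquad \times \left\{
+\begin{pmatrix} \dfrac{\operatorname{adj}(I_n - z\hat{C}_a)}{a(z)} & O \\[2ex] \dfrac{\operatorname{adj}(I_n - zC_b)\, e_1 e_1^{\top} \operatorname{adj}(I_p - z\hat{C}_a)}{a(z)b(z)} & \dfrac{\operatorname{adj}(I_n - zC_b)}{b(z)} \end{pmatrix}
+\begin{pmatrix} e_n \\ 0 \end{pmatrix} \right\}^{\top} dz \]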
+
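+Since the last column vectors of the matrices $\operatorname{adj}(zI_p - \hat{C}_a)$ and $\operatorname{adj}(I_n - z\hat{C}_a)$ are the
+vectors $u_n(z)$ and $v_n(z)$, respectively, this yields
+
+\[ Q(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1}
+\begin{pmatrix} \dfrac{u_n(z)}{\hat{a}(z)} \\[2ex] \dfrac{v_n(z)}{\hat{a}(z)\hat{b}(z)} \end{pmatrix}
+\begin{pmatrix} \dfrac{v_n^{\top}(z)}{a(z)} & \dfrac{z^{n-1} u_n^{\top}(z)}{a(z)b(z)} \end{pmatrix} dz \]
+
+\[ \phantom{Q(\vartheta)} = \frac{1}{2\pi i} \oint_{|z|=1}
+\begin{pmatrix} \dfrac{u_n(z) v_n^{\top}(z)}{a(z)\hat{a}(z)} & \dfrac{z^{n-1} u_n(z) u_n^{\top}(z)}{\hat{a}(z) a(z) b(z)} \\[2ex] \dfrac{v_n(z) v_n^{\top}(z)}{\hat{a}(z)\hat{b}(z) a(z)} & \dfrac{z^{n-1} v_n(z) u_n^{\top}(z)}{\hat{a}(z)\hat{b}(z) a(z) b(z)} \end{pmatrix} dz
+= \begin{pmatrix} Q_{11}(\vartheta) & Q_{12}(\vartheta) \\ Q_{21}(\vartheta) & Q_{22}(\vartheta) \end{pmatrix} \]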
+
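+Applying the standard residue theorem yields, for the respective submatrices,
+
+\[ Q_{11}(\vartheta) = \{u_n(\alpha_1), \ldots, u_n(\alpha_n)\}\, D_{11}(\vartheta)\, \{v_n(\alpha_1), \ldots, v_n(\alpha_n)\}^{\top} \]
+\[ Q_{12}(\vartheta) = \{u_n(\alpha_1), \ldots, u_n(\alpha_n)\}\, D_{12}(\vartheta)\, \{u_n(\alpha_1), \ldots, u_n(\alpha_n)\}^{\top} \]
+\[ Q_{21}(\vartheta) = \{v_n(\alpha_1), \ldots, v_n(\alpha_n), v_n(\beta_1), \ldots, v_n(\beta_n)\}\, D_{21}(\vartheta)\, \{v_n(\alpha_1), \ldots, v_n(\alpha_n), v_n(\beta_1), \ldots, v_n(\beta_n)\}^{\top} \]
+\[ Q_{22}(\vartheta) = \{v_n(\alpha_1), \ldots, v_n(\alpha_n), v_n(\beta_1), \ldots, v_n(\beta_n)\}\, D_{22}(\vartheta)\, \{u_n(\alpha_1), \ldots, u_n(\alpha_n), u_n(\beta_1), \ldots, u_n(\beta_n)\}^{\top} \]
+
+where the $n \times n$ diagonal matrices are
+
+\[ D_{11}(\vartheta) = \operatorname{diag}\left\{ \frac{1}{a(\alpha_i)\,\hat{a}(z;\alpha_i)} \right\}, \qquad
+D_{12}(\vartheta) = \operatorname{diag}\left\{ \frac{\alpha_i^{n-1}}{a(\alpha_i)\, b(\alpha_i)\, \hat{a}(z;\alpha_i)} \right\}, \qquad i = 1, \ldots, n, \]
+
+and the $2n \times 2n$ diagonal matrices are
+
+\[ D_{21}(\vartheta) = \operatorname{diag}\left\{ \frac{1}{a(\alpha_i)\,\hat{b}(\alpha_i)\,\hat{a}(z;\alpha_i)},\; \frac{1}{\hat{a}(\beta_j)\, a(\beta_j)\,\hat{b}(z;\beta_j)} \right\}, \qquad i, j = 1, \ldots, n, \]
+
+\[ D_{22}(\vartheta) = \operatorname{diag}\left\{ \frac{\alpha_i^{n-1}}{a(\alpha_i)\, b(\alpha_i)\,\hat{b}(\alpha_i)\,\hat{a}(z;\alpha_i)},\; \frac{\beta_j^{n-1}}{\hat{a}(\beta_j)\, a(\beta_j)\, b(\beta_j)\,\hat{b}(z;\beta_j)} \right\}, \qquad i, j = 1, \ldots, n. \]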
+
+It is clear that the first and third matrices in Q11(ϑ), Q12(ϑ), Q21(ϑ) and Q22(ϑ) are the appropriate
+Vandermonde matrices displayed in (60), so the representation (61) is verified.
+This completes the proof.
+
+In this section an explicit form of the solution Q(ϑ), expressed in terms of various Vandermonde
+matrices, is displayed. Also, an interconnection between the Fisher information F(ϑ) and appropriate
+solutions to Stein equations and related matrices is presented. Proofs are given that the Stein
+equations are verified by the FIM F(ϑ) and the associated matrix ℘(ϑ). These are alternatives to the
+proofs developed in [38]. The presence of various forms of Vandermonde matrices is also emphasized.
+
+In the next section some matrix properties of the FIM G(ϑ) of an ARMAX process are presented.
+
+3.6. The Fisher Information Matrix of an ARMAX(p, r, q) Process
+
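+The FIM of the ARMAX process (11) is set forth according to [4].
+The derivatives in the corresponding representation (16) are
+
+\[ \frac{\partial e_t(\vartheta)}{\partial a_j} = \frac{c(z)}{a(z)b(z)}\, x(t-j) + \frac{1}{a(z)}\, e(t-j), \qquad
+\frac{\partial e_t(\vartheta)}{\partial c_l} = -\frac{1}{b(z)}\, x(t-l) \qquad \text{and} \qquad
+\frac{\partial e_t(\vartheta)}{\partial b_k} = -\frac{1}{b(z)}\, e(t-k) \]
+
+where $j = 1, \ldots, p$, $l = 1, \ldots, r$ and $k = 1, \ldots, q$. Combining all $j$, $l$ and $k$ yields the $(p + r + q) \times (p + r + q)$ FIM
+
+\[ G(\vartheta) = \begin{pmatrix} G_{aa}(\vartheta) & G_{ac}(\vartheta) & G_{ab}(\vartheta) \\ G_{ac}^{\top}(\vartheta) & G_{cc}(\vartheta) & G_{cb}(\vartheta) \\ G_{ab}^{\top}(\vartheta) & G_{cb}^{\top}(\vartheta) & G_{bb}(\vartheta) \end{pmatrix} \tag{63} \]
+
+where the submatrices of G(ϑ) are given by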
+
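+\[ G_{aa}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_p(z) u_p^{\top}(z^{-1})\, c(z) c(z^{-1})}{a(z) a(z^{-1}) b(z) b(z^{-1})}\, \frac{dz}{z} + \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z) u_p^{\top}(z^{-1})}{a(z) a(z^{-1})}\, \frac{dz}{z} \]
+\[ \phantom{G_{aa}(\vartheta)} = \frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_p(z) v_p^{\top}(z)\, c(z)\hat{c}(z)}{a(z)\hat{a}(z) b(z)\hat{b}(z)\, z^{r-q}}\, dz + \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z) v_p^{\top}(z)}{a(z)\hat{a}(z)}\, dz \]
+
+\[ G_{ab}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z) u_q^{\top}(z^{-1})}{a(z) b(z^{-1})}\, \frac{dz}{z} = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z) v_q^{\top}(z)}{a(z)\hat{b}(z)}\, dz \]
+
+\[ G_{ac}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_p(z) u_r^{\top}(z^{-1})\, c(z)}{a(z) b(z) b(z^{-1})}\, \frac{dz}{z} = -\frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_p(z) v_r^{\top}(z)\, c(z)}{a(z) b(z)\hat{b}(z)\, z^{r-q}}\, dz \]
+
+\[ G_{cc}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_r(z) u_r^{\top}(z^{-1})}{b(z) b(z^{-1})}\, \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_r(z) v_r^{\top}(z)}{b(z)\hat{b}(z)\, z^{r-q}}\, dz \]
+
+\[ G_{bb}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z) u_q^{\top}(z^{-1})}{b(z) b(z^{-1})}\, \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z) v_q^{\top}(z)}{b(z)\hat{b}(z)}\, dz, \qquad \text{and} \qquad G_{cb}(\vartheta) = O \]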
+
+where $R_x(z)$ is the spectral density of the process x(t), defined in (10).
+Let $K(z) = a(z) a(z^{-1}) b(z) b(z^{-1})$. Combining all the expressions in (63) leads to the following representation of
+G(ϑ) as the sum of two matrices:
+
+\[ \frac{1}{2\pi i} \oint_{|z|=1} \frac{R_x(z)}{K(z)}
+\begin{pmatrix} c(z) u_p(z) \\ -a(z) u_r(z) \\ O \end{pmatrix}
+\begin{pmatrix} c(z) u_p(z) \\ -a(z) u_r(z) \\ O \end{pmatrix}^{*} \frac{dz}{z}
++ \frac{1}{2\pi i} \oint_{|z|=1} \frac{1}{K(z)}
+\begin{pmatrix} b(z) u_p(z) \\ O \\ -a(z) u_q(z) \end{pmatrix}
+\begin{pmatrix} b(z) u_p(z) \\ O \\ -a(z) u_q(z) \end{pmatrix}^{*} \frac{dz}{z} \tag{64} \]
+
+where $(X)^{*}$ is the complex conjugate transpose of the matrix $X \in \mathbb{C}^{m \times n}$. As in (23), we set forth
+
+\[ S(-c, a) = \begin{pmatrix} -S_p(c) \\ S_r(a) \end{pmatrix} \]
+
+where the block $-S_p(c)$ is formed by the top $p$ rows of $S(-c, a)$. In a similar way we decompose
+
+\[ S(-b, a) = \begin{pmatrix} -S_p(b) \\ S_q(a) \end{pmatrix} \]
+
+The representation (64) can be expressed by the appropriate block representations of the Sylvester
+resultant matrices, to obtain
+
+\[ G(\vartheta) = \begin{pmatrix} -S_p(c) \\ S_r(a) \\ O \end{pmatrix} W(\vartheta) \begin{pmatrix} -S_p(c) \\ S_r(a) \\ O \end{pmatrix}^{\top}
++ \begin{pmatrix} -S_p(b) \\ O \\ S_q(a) \end{pmatrix} \wp(\vartheta) \begin{pmatrix} -S_p(b) \\ O \\ S_q(a) \end{pmatrix}^{\top} \tag{65} \]
+
+where the matrix ℘(ϑ) is given in (28) and the matrix $W(\vartheta) \in \mathbb{R}^{(p+r) \times (p+r)}$ is of the form
+
+\[ W(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_{p+r}(z) u_{p+r}^{\top}(z^{-1})}{a(z) a(z^{-1}) b(z) b(z^{-1})}\, \frac{dz}{z}
+= \frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_{p+r}(z) v_{p+r}^{\top}(z)}{a(z) b(z)\hat{a}(z)\hat{b}(z)}\, dz \tag{66} \]
+
+It is shown in [4] that W(ϑ) ≻ O. As can be seen in (65), the ARMAX part is explained by the first
+term, whereas the ARMA part is described by the second term; the combination of both terms is a
+summary of the Fisher information of an ARMAX(p, r, q) process. The FIM G(ϑ) in the form (65) allows
+us to prove the following property, Theorem 3.1 in [4]: the FIM G(ϑ) of the ARMAX(p, r, q) process
+with polynomials a(z), c(z) and b(z) of order p, r and q, respectively, becomes singular if and only if these
+polynomials have at least one common root. Consequently, the class of resultant matrices is extended
+by the FIM G(ϑ).
+
+3.7. The Stein Equation - The Fisher Information Matrix of an ARMAX(p, r, q) Process
+
+In Lemma 3.5 it is proved that the matrix ℘(ϑ) in (28) fulfills the Stein equation (52). We will now consider
+the conditions under which the matrix W(ϑ) in (66) verifies an appropriate Stein equation. For that
+purpose we consider the spectral density to be of the form $R_x(z) = 1/(h(z) h(z^{-1}))$. The degree of the
+polynomial h(z) is ℓ and we assume that the distinct roots of h(z) lie outside the unit
+circle; consequently, the roots of the polynomial $\hat{h}(z)$ lie within the unit circle. We therefore rewrite
+W(ϑ) accordingly:
+
+\[ W(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+r}(z) u_{p+r}^{\top}(z^{-1})}{h(z) h(z^{-1}) a(z) a(z^{-1}) b(z) b(z^{-1})}\, \frac{dz}{z} \]
+
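+We consider a companion matrix of the form (48) of size p + q + ℓ, denoted by $C_f$, whose
+entries $f_i$ are given by
+
+\[ z^{p+q+\ell} + \sum_{i=1}^{p+q+\ell} f_i(\vartheta)\, z^{p+q+\ell-i} = \hat{a}(z)\hat{b}(z)\hat{h}(z) = \hat{f}(z, \vartheta), \]
+
+and $\hat{f}(\vartheta)$ is the vector $\hat{f}(\vartheta) = (f_{p+q+\ell}(\vartheta), f_{p+q+\ell-1}(\vartheta), \ldots, f_1(\vartheta))^{\top}$.
+Likewise, for the polynomial $f(z, \vartheta) = a(z) b(z) h(z)$ we have the vector $f(\vartheta) = (f_1(\vartheta), f_2(\vartheta), \ldots, f_{p+q+\ell}(\vartheta))^{\top}$.
+The properties $\operatorname{Det}(zI_{p+q+\ell} - C_f) = \hat{a}(z)\hat{b}(z)\hat{h}(z)$ and $\operatorname{Det}(I_{p+q+\ell} - zC_f) = a(z) b(z) h(z)$ hold, and we assume
+
+\[ r = q + \ell, \quad \text{or equivalently} \quad p + q + \ell = p + r, \quad \text{and} \quad r > q. \tag{67} \]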
+
+W(ϑ) is then of the form
+
+\[ W(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+r}(z) v_{p+r}^{\top}(z)}{h(z)\hat{h}(z) a(z)\hat{a}(z) b(z)\hat{b}(z)}\, dz \tag{68} \]
+
+We will formulate a Stein equation with the matrix $\Gamma = e_{p+r} e_{p+r}^{\top}$; it is of the form
+
+\[ S - C_f S C_f^{\top} = e_{p+r} e_{p+r}^{\top} \tag{69} \]
+
+where $e_{p+r}$ is the last standard basis column vector in $\mathbb{R}^{p+r}$. The next lemma is formulated.
+
+3.7.1. Lemma 3.8
+
+The matrix W(ϑ) given in (68) fulfills the Stein equation (69).
+
+Proof
+
+The unique solution of (69) is assured since every product of two eigenvalues of $C_f$ differs
+from one. The solution is of the form
+
+\[ S = \frac{1}{2\pi i} \oint_{|z|=1} (zI_{p+r} - C_f)^{-1}\, e_{p+r} e_{p+r}^{\top}\, (I_{p+r} - zC_f)^{-\top}\, dz \]
+
+or
+
+\[ S = \frac{1}{2\pi i} \oint_{|z|=1} \frac{\operatorname{adj}(zI_{p+r} - C_f)\, e_{p+r} e_{p+r}^{\top}\, \operatorname{adj}(I_{p+r} - zC_f)^{\top}}{\hat{a}(z)\hat{b}(z)\hat{h}(z)\, a(z) b(z) h(z)}\, dz \]
+
+Taking the property of the companion matrix $C_f$ into account implies that the last column vector of
+$\operatorname{adj}(zI_{p+r} - C_f)$ is the basic vector $u_{p+r}(z)$; consequently, the last column of $\operatorname{adj}(I_{p+r} - zC_f)$ is the basic
+vector $v_{p+r}(z)$. This yields
+
+\[ S = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+r}(z) v_{p+r}^{\top}(z)}{\hat{a}(z)\hat{b}(z)\hat{h}(z)\, a(z) b(z) h(z)}\, dz = W(\vartheta) \]
+
+Consequently, the matrix W(ϑ) defined in (68) verifies the Stein equation (69). This completes the proof.
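+
+As an illustration only, the contour-integral solution can be checked in the scalar case. The sketch below assumes a $1 \times 1$ companion matrix $C_f = c$ with $0 < |c| < 1$ (a hypothetical choice, not taken from the text), so that the right-hand side of the Stein equation is simply $1$:
+
+```latex
+\[ S - c\,S\,c = 1 \quad\Longrightarrow\quad S = \frac{1}{1 - c^{2}}, \]
+\[ \frac{1}{2\pi i} \oint_{|z|=1} \frac{dz}{(z - c)(1 - cz)}
+   = \operatorname*{Res}_{z = c} \frac{1}{(z - c)(1 - cz)}
+   = \frac{1}{1 - c^{2}}. \]
+```
+
+Only the pole at $z = c$ lies inside the unit circle (the pole at $z = 1/c$ lies outside), so the residue theorem reproduces the algebraic solution, in line with the matrix argument of the proof.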
+
+The matrices ℘(ϑ) and W(ϑ) in (65) verify, under specific conditions, appropriate Stein equations,
+as has been shown in Lemma 3.5 and Lemma 3.8, respectively. We will now confirm the presence of
+Vandermonde matrices by applying the standard residue theorem to W(ϑ) in (68), to obtain
+
+\[ W(\vartheta) = V_{\alpha\beta\xi}\, R(\vartheta)\, \hat{V}_{\alpha\beta\xi} \tag{70} \]
+
+The $(p + r) \times (p + r)$ diagonal matrix R(ϑ) is of the form
+
+\[ R(\vartheta) = \operatorname{diag}\left\{ \frac{1}{\hat{a}(z;\alpha_i)\,\hat{b}(\alpha_i)\,\hat{h}(\alpha_i)\,\varphi(\alpha_i)},\;
+\frac{1}{\hat{a}(\beta_j)\,\hat{b}(z;\beta_j)\,\hat{h}(\beta_j)\,\varphi(\beta_j)},\;
+\frac{1}{\hat{a}(\xi_k)\,\hat{b}(\xi_k)\,\hat{h}(z;\xi_k)\,\varphi(\xi_k)} \right\} \]
+
+where $\varphi(z) = a(z) b(z) h(z)$ and $i = 1, \ldots, p$, $j = 1, \ldots, q$ and $k = 1, \ldots, \ell$. The $(p + r) \times (p + r)$
+matrices $V_{\alpha\beta\xi}$ and $\hat{V}_{\alpha\beta\xi}$ are of the form
+
+\[ V_{\alpha\beta\xi} = \begin{pmatrix}
+1 & \alpha_1 & \alpha_1^{2} & \cdots & \alpha_1^{p+r-1} \\
+\vdots & \vdots & \vdots & & \vdots \\
+1 & \alpha_p & \alpha_p^{2} & \cdots & \alpha_p^{p+r-1} \\
+1 & \beta_1 & \beta_1^{2} & \cdots & \beta_1^{p+r-1} \\
+\vdots & \vdots & \vdots & & \vdots \\
+1 & \beta_q & \beta_q^{2} & \cdots & \beta_q^{p+r-1} \\
+1 & \xi_1 & \xi_1^{2} & \cdots & \xi_1^{p+r-1} \\
+\vdots & \vdots & \vdots & & \vdots \\
+1 & \xi_\ell & \xi_\ell^{2} & \cdots & \xi_\ell^{p+r-1}
+\end{pmatrix}^{\top}, \qquad
+\hat{V}_{\alpha\beta\xi} = \begin{pmatrix}
+\alpha_1^{p+r-1} & \alpha_1^{p+r-2} & \cdots & \alpha_1 & 1 \\
+\vdots & \vdots & & \vdots & \vdots \\
+\alpha_p^{p+r-1} & \alpha_p^{p+r-2} & \cdots & \alpha_p & 1 \\
+\beta_1^{p+r-1} & \beta_1^{p+r-2} & \cdots & \beta_1 & 1 \\
+\vdots & \vdots & & \vdots & \vdots \\
+\beta_q^{p+r-1} & \beta_q^{p+r-2} & \cdots & \beta_q & 1 \\
+\xi_1^{p+r-1} & \xi_1^{p+r-2} & \cdots & \xi_1 & 1 \\
+\vdots & \vdots & & \vdots & \vdots \\
+\xi_\ell^{p+r-1} & \xi_\ell^{p+r-2} & \cdots & \xi_\ell & 1
+\end{pmatrix} \]
+
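+The $(p + r) \times (p + r)$ Vandermonde matrices $V_{\alpha\beta\xi}$ and $\hat{V}_{\alpha\beta\xi}$ are nonsingular when $\alpha_i \neq \alpha_j$ ($i \neq j$), $\beta_k \neq \beta_h$ ($k \neq h$),
+$\xi_m \neq \xi_n$ ($m \neq n$), $\alpha_i \neq \beta_k$, $\alpha_i \neq \xi_m$ and $\beta_k \neq \xi_m$ for all $i, j = 1, \ldots, p$, $k, h = 1, \ldots, q$ and $m, n = 1, \ldots, \ell$. The
+Vandermonde determinants $\operatorname{Det} V_{\alpha\beta\xi}$ and $\operatorname{Det} \hat{V}_{\alpha\beta\xi}$ are
+
+\[ \operatorname{Det} V_{\alpha\beta\xi} = (-1)^{(p+r)(p+r-1)/2}\, \Psi(\alpha_i, \beta_k, \xi_m) \]
+
+where
+
+\[ \Psi(\alpha_i, \beta_k, \xi_m) = \prod_{1 \le i < j \le p} (\alpha_i - \alpha_j) \prod_{1 \le k < h \le q} (\beta_k - \beta_h) \prod_{1 \le m < n \le \ell} (\xi_m - \xi_n)
+\prod_{\substack{r = 1, \ldots, p \\ s = 1, \ldots, q}} (\alpha_r - \beta_s)
+\prod_{\substack{r = 1, \ldots, p \\ w = 1, \ldots, \ell}} (\alpha_r - \xi_w)
+\prod_{\substack{s = 1, \ldots, q \\ w = 1, \ldots, \ell}} (\beta_s - \xi_w) \]
+
+As for the Vandermonde matrices $V_{\alpha\beta}$ and $\hat{V}_{\alpha\beta}^{\top}$,
+
+\[ \operatorname{Det} \hat{V}_{\alpha\beta\xi} = (-1)^{(p+r)(p+r-1)}\, \Psi(\alpha_i, \beta_k, \xi_m) \;\Rightarrow\; |\operatorname{Det} V_{\alpha\beta\xi}| = |\operatorname{Det} \hat{V}_{\alpha\beta\xi}| \]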
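+
+For intuition, the sign factors can be checked on the smallest nontrivial case. The sketch below uses two generic distinct nodes $x_1, x_2$ (illustrative symbols standing in for any two of the roots) and matrices built in the same pattern as the two Vandermonde matrices above, with size $N = 2$ in place of $p + r$:
+
+```latex
+\[ V = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \end{pmatrix}^{\top}
+     = \begin{pmatrix} 1 & 1 \\ x_1 & x_2 \end{pmatrix}, \qquad
+   \det V = x_2 - x_1 = (-1)^{N(N-1)/2}\,(x_1 - x_2), \]
+\[ \hat{V} = \begin{pmatrix} x_1 & 1 \\ x_2 & 1 \end{pmatrix}, \qquad
+   \det \hat{V} = x_1 - x_2 = (-1)^{N(N-1)}\,(x_1 - x_2). \]
+```
+
+Both determinants vanish exactly when $x_1 = x_2$ and they agree in absolute value, which mirrors the general statement.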
+
+Equation (70) is the ARMAX equivalent of (47). A combination of both equations generates a new representation
+of the FIM G(ϑ); this is set forth in the following lemma.
+
+3.7.2. Lemma 3.9
+
+Assume that the conditions (67) hold and consider the representations of ℘(ϑ) and W(ϑ) in (47) and (70),
+respectively. This leads to an alternative form of (65), given by
+
+\[ G(\vartheta) = \begin{pmatrix} -S_p(c) \\ S_r(a) \\ O \end{pmatrix} V_{\alpha\beta\xi}\, R(\vartheta)\, \hat{V}_{\alpha\beta\xi} \begin{pmatrix} -S_p(c) \\ S_r(a) \\ O \end{pmatrix}^{\top}
++ \begin{pmatrix} -S_p(b) \\ O \\ S_q(a) \end{pmatrix} V_{\alpha\beta}\, D(\vartheta)\, \hat{V}_{\alpha\beta} \begin{pmatrix} -S_p(b) \\ O \\ S_q(a) \end{pmatrix}^{\top} \]
+
+In Lemma 3.9, the FIM G(ϑ) is expressed by submatrices of two Sylvester matrices and various
+Vandermonde matrices; both types of matrices become singular if and only if the appropriate
+polynomials have at least one common root.
+
+3.8. The Fisher Information Matrix of a Vector ARMA(p, q) Process
+
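+The process (5) is summarized as
+
+\[ A(z)\, y(t) = B(z)\, \epsilon(t) \]
+
+and we assume that {y(t), t ∈ N} is a zero-mean Gaussian time series and {ϵ(t), t ∈ N} is an n-dimensional
+vector random variable such that $E_\vartheta\{\epsilon(t)\} = 0$ and $E_\vartheta\{\epsilon(t)\epsilon^{\top}(t)\} = \Sigma$; the parameter vector ϑ is of the form (7). In [6] it is shown that
+representation (16) for the $n^2(p+q) \times n^2(p+q)$ asymptotic FIM of the VARMA process (6) is
+
+\[ F(\vartheta) = E_\vartheta \left\{ \left( \frac{\partial \epsilon}{\partial \vartheta^{\top}} \right)^{\top} \Sigma^{-1} \left( \frac{\partial \epsilon}{\partial \vartheta^{\top}} \right) \right\} \tag{71} \]
+
+where $\partial\epsilon/\partial\vartheta^{\top}$ is of size $n \times n^2(p+q)$ and, for convenience, t is omitted from ϵ(t). Using the differential
+rules outlined in [6] yields
+
+\[ \frac{\partial \epsilon}{\partial \vartheta^{\top}} = \left\{ (A^{-1}(z) B(z) \epsilon)^{\top} \otimes B^{-1}(z) \right\} \frac{\partial \operatorname{vec} A(z)}{\partial \vartheta^{\top}}
+- \left( \epsilon^{\top} \otimes B^{-1}(z) \right) \frac{\partial \operatorname{vec} B(z)}{\partial \vartheta^{\top}} \tag{72} \]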
+
+The substitution of representation (72) of ∂ϵ/∂ϑ⊤ in (71) yields the FIM of a VARMA process. The
+purpose is to construct a factorization of the FIM F(ϑ) that is a multiple variant of the
+factorization (27), so that a multiple resultant matrix property can be proved for F(ϑ). As illustrated
+in [6], the multiple version of the Sylvester resultant matrix (22) does not fulfill the multiple resultant
+matrix property: even when the matrix polynomials A(z) and B(z) have a common zero or
+a common eigenvalue, the multiple Sylvester matrix is not necessarily singular. This has also been
+illustrated in [3]. In order to consider a multiple equivalent to the resultant matrix S(−b, a), Gohberg
+and Lerer set forth the n²(p + q) × n²(p + q) tensor Sylvester matrix
+
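+\[ S^{\otimes}(-B, A) := \begin{pmatrix}
+(-I_n) \otimes I_n & (-B_1) \otimes I_n & \cdots & (-B_q) \otimes I_n & O_{n^2 \times n^2} & \cdots & O_{n^2 \times n^2} \\
+O_{n^2 \times n^2} & \ddots & \ddots & & \ddots & \ddots & \vdots \\
+\vdots & \ddots & \ddots & \ddots & & \ddots & O_{n^2 \times n^2} \\
+O_{n^2 \times n^2} & \cdots & O_{n^2 \times n^2} & (-I_n) \otimes I_n & (-B_1) \otimes I_n & \cdots & (-B_q) \otimes I_n \\
+I_n \otimes I_n & I_n \otimes A_1 & \cdots & I_n \otimes A_p & O_{n^2 \times n^2} & \cdots & O_{n^2 \times n^2} \\
+O_{n^2 \times n^2} & \ddots & \ddots & & \ddots & \ddots & \vdots \\
+\vdots & \ddots & \ddots & \ddots & & \ddots & O_{n^2 \times n^2} \\
+O_{n^2 \times n^2} & \cdots & O_{n^2 \times n^2} & I_n \otimes I_n & I_n \otimes A_1 & \cdots & I_n \otimes A_p
+\end{pmatrix} \tag{73} \]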
+
+In [3], the authors prove that the tensor Sylvester matrix $S^{\otimes}(-B, A)$ fulfills the multiple resultant
+property: it becomes singular if and only if the matrix polynomials A(z) and B(z) have at
+least one common zero. In Proposition 2.2 in [6], the following factorized form of the Fisher information
+F(ϑ) is developed:
+
+\[ F(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \Phi(z)\, \Theta(z)\, \Phi^{*}(z)\, \frac{dz}{z} \tag{74} \]
+
+where
+
+\[ \Phi(z) = \begin{pmatrix} I_p \otimes A^{-1}(z) \otimes I_n & O_{pn^2 \times qn^2} \\ O_{qn^2 \times pn^2} & I_q \otimes I_n \otimes A^{-1}(z) \end{pmatrix} S^{\otimes}(-B, A)\, (u_{p+q}(z) \otimes I_{n^2}) \]
+
+and
+
+\[ \Theta(z) = \Sigma \otimes \sigma(z), \qquad \sigma(z) = B^{-\top}(z)\, \Sigma^{-1} B^{-1}(z^{-1}) \tag{75} \]
+
+In order to obtain a multiple variant of (27), the following matrix is introduced:
+
+\[ M(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \Lambda(z)\, J(z)\, \Lambda^{*}(z)\, \frac{dz}{z} = S^{\otimes}(-B, A)\, P(\vartheta)\, (S^{\otimes}(-B, A))^{\top} \tag{76} \]
+
+where
+
+\[ J(z) = \Phi(z)\, \Theta(z)\, \Phi^{*}(z) \qquad \text{and} \qquad \Lambda(z) = \begin{pmatrix} I_p \otimes A(z) \otimes I_n & O_{pn^2 \times qn^2} \\ O_{qn^2 \times pn^2} & I_q \otimes I_n \otimes A(z) \end{pmatrix} \]
+
+and the matrix P(ϑ) is a multiple variant of the matrix ℘(ϑ) in (28); it is of the form
+
+\[ P(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} (u_{p+q}(z) \otimes I_{n^2})\, \Theta(z)\, (u_{p+q}(z) \otimes I_{n^2})^{*}\, \frac{dz}{z} \tag{77} \]
+
+In Lemma 2.3 in [6], it is proved that the matrix M(ϑ) in (76) becomes singular if and only if the matrix
+polynomials A(z) and B(z) have at least one common eigenvalue-zero. The proof is a multiple equivalent
+of the proof of Corollary 2.2 in [5], since the equality (76) is a multiple version of (27). Consequently,
+the matrix M(ϑ), like the tensor Sylvester matrix S⊗(−B, A), fulfills the multiple resultant matrix
+property. Since the matrix M(ϑ) is derived from the FIM F(ϑ), this enables us to prove that the matrix
+F(ϑ) fulfills the multiple resultant matrix property by showing that it becomes singular if and only if
+the matrix M(ϑ) is singular; this is done in Proposition 2.4 in [6]. Consequently, it can be concluded
+from [6] that the FIM of a VARMA process F(ϑ) and the tensor Sylvester matrix S⊗(−B, A) have the
+same singularity conditions. The FIM of a VARMA process F(ϑ) can therefore be added to the class of
+multiple resultant matrices.
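+
+As a minimal sanity check of the resultant property, consider the scalar case $n = 1$ with $p = q = 1$. The sketch assumes the polynomial convention $A(z) = 1 + A_1 z$ and $B(z) = 1 + B_1 z$ with nonzero $A_1$, $B_1$ (an assumption made for illustration, since the definition in (5) is not restated here). All Kronecker products then reduce to ordinary products and (73) becomes a $2 \times 2$ matrix:
+
+```latex
+\[ S^{\otimes}(-B, A) = \begin{pmatrix} -1 & -B_1 \\ 1 & A_1 \end{pmatrix}, \qquad
+   \det S^{\otimes}(-B, A) = (-1)\,A_1 - (-B_1)\cdot 1 = B_1 - A_1. \]
+```
+
+Under this convention the zeros of $A(z)$ and $B(z)$ are $z = -1/A_1$ and $z = -1/B_1$, which coincide exactly when $A_1 = B_1$, that is, exactly when the determinant vanishes.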
+
+A brief summary of the contribution of [6] follows. In order to show that the FIM of a VARMA
+process F(ϑ) is a multiple resultant matrix, two new representations of the FIM are derived. To
+construct such representations, appropriate matrix differential rules are applied. The newly obtained
+representations are expressed in terms of the multiple Sylvester matrix and the tensor Sylvester matrix.
+The representation of the FIM expressed by the tensor Sylvester matrix is used to prove that the FIM
+becomes singular if and only if the autoregressive and moving average matrix polynomials have
+at least one common eigenvalue. It then follows that the FIM and the tensor Sylvester matrix have
+equivalent singularity conditions. In a numerical example it is shown, however, that the FIM fails to
+detect common eigenvalues due to some kind of numerical instability. The tensor Sylvester matrix
+reveals it clearly, proving the usefulness of the results derived in [6].
+
+3.9. The Fisher Information Matrix of a Vector ARMAX(p, r, q) Process
+
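+The $n^2(p + q + r) \times n^2(p + q + r)$ asymptotic FIM of the VARMAX(p, r, q) process (2)
+
+\[ A(z)\, y(t) = C(z)\, x(t) + B(z)\, \epsilon(t) \]
+
+is displayed according to [23] and is an extension of the FIM of the VARMA(p, q) process (6).
+Representation (16) of the FIM of the VARMAX(p, r, q) process is then
+
+\[ G(\vartheta) = E_\vartheta \left\{ \left( \frac{\partial \epsilon}{\partial \vartheta^{\top}} \right)^{\top} \Sigma^{-1} \left( \frac{\partial \epsilon}{\partial \vartheta^{\top}} \right) \right\} \]
+
+where
+
+\[ \frac{\partial \epsilon}{\partial \vartheta^{\top}} = \left\{ (A^{-1}(z) C(z) x)^{\top} \otimes B^{-1}(z) \right\} \frac{\partial \operatorname{vec} A(z)}{\partial \vartheta^{\top}}
++ \left\{ (A^{-1}(z) B(z) \epsilon)^{\top} \otimes B^{-1}(z) \right\} \frac{\partial \operatorname{vec} A(z)}{\partial \vartheta^{\top}}
+- \left\{ x^{\top} \otimes B^{-1}(z) \right\} \frac{\partial \operatorname{vec} C(z)}{\partial \vartheta^{\top}}
+- \left( \epsilon^{\top} \otimes B^{-1}(z) \right) \frac{\partial \operatorname{vec} B(z)}{\partial \vartheta^{\top}} \tag{78} \]
+
+To obtain the term $\partial\epsilon/\partial\vartheta^{\top}$, of size $n \times n^2(p + q + r)$, the same differential rules are applied as for the
+VARMA(p, q) process. In Proposition 2.3 in [23], the representation of the FIM of a VARMAX process
+is expressed in terms of tensor Sylvester matrices; this is obtained when $\partial\epsilon/\partial\vartheta^{\top}$ in (78) is substituted
+in (16), to obtain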
+
+\[ G(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \Phi_x(z)\, \Theta(z)\, \Phi_x^{*}(z)\, \frac{dz}{z} + \frac{1}{2\pi i} \oint_{|z|=1} \Lambda_x(z)\, \Psi(z)\, \Lambda_x^{*}(z)\, \frac{dz}{z} \tag{79} \]
+
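+The matrices in (79) are of the form
+
+\[ \Phi_x(z) = \begin{pmatrix} I_p \otimes A^{-1}(z) \otimes I_n & O_{pn^2 \times rn^2} & O_{pn^2 \times qn^2} \\ O_{rn^2 \times pn^2} & O_{rn^2 \times rn^2} & O_{rn^2 \times qn^2} \\ O_{qn^2 \times pn^2} & O_{qn^2 \times rn^2} & I_q \otimes I_n \otimes A^{-1}(z) \end{pmatrix}
+\begin{pmatrix} -S_p^{\otimes}(B) \\ O_{rn^2 \times n^2(p+q)} \\ S_q^{\otimes}(A) \end{pmatrix} (u_{p+q}(z) \otimes I_{n^2}) \]
+
+\[ \Lambda_x(z) = \begin{pmatrix} I_p \otimes A^{-1}(z) \otimes I_n & O_{pn^2 \times rn^2} & O_{pn^2 \times qn^2} \\ O_{rn^2 \times pn^2} & I_r \otimes I_n \otimes A^{-1}(z) & O_{rn^2 \times qn^2} \\ O_{qn^2 \times pn^2} & O_{qn^2 \times rn^2} & O_{qn^2 \times qn^2} \end{pmatrix}
+\begin{pmatrix} -S_p^{\otimes}(C) \\ S_r^{\otimes}(A) \\ O_{qn^2 \times n^2(p+r)} \end{pmatrix} (u_{p+r}(z) \otimes I_{n^2}) \]
+
+\[ S_{p,q}^{\otimes}(-B, A) = \begin{pmatrix} -S_p^{\otimes}(B) \\ S_q^{\otimes}(A) \end{pmatrix}, \qquad
+S_{p,r}^{\otimes}(-C, A) = \begin{pmatrix} -S_p^{\otimes}(C) \\ S_r^{\otimes}(A) \end{pmatrix} \tag{80} \]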
+
+Additionally, we have $\Psi(z) = R_x(z) \otimes \sigma(z)$, where the Hermitian spectral density matrix $R_x(z)$ is defined
+in (10), whereas the matrix polynomials Θ(z) and σ(z) are presented in (75). In (80), we have the $pn^2 \times (p+q)n^2$
+and $qn^2 \times (p+q)n^2$ submatrices $S_p^{\otimes}(-B)$ and $S_q^{\otimes}(A)$ of the tensor Sylvester resultant
+matrix $S_{p,q}^{\otimes}(-B, A)$, whereas the matrices $S_p^{\otimes}(-C)$ and $S_r^{\otimes}(A)$ are the upper and lower blocks of the
+$(p+r)n^2 \times (p+r)n^2$ tensor Sylvester resultant matrix $S_{p,r}^{\otimes}(-C, A)$. As for the FIM of the VARMA(p, q)
+process, the objective is to construct a multiple version of (65); this is done in [23], to obtain
+
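+\[ M_x(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} L(z)\, \mathcal{A}(z)\, L^{*}(z)\, \frac{dz}{z} + \frac{1}{2\pi i} \oint_{|z|=1} W(z)\, \mathcal{B}(z)\, W^{*}(z)\, \frac{dz}{z} \]
+
+\[ \phantom{M_x(\vartheta)} = \begin{pmatrix} -S_p^{\otimes}(B) \\ O_{rn^2 \times n^2(p+q)} \\ S_q^{\otimes}(A) \end{pmatrix} P(\vartheta) \begin{pmatrix} -S_p^{\otimes}(B) \\ O_{rn^2 \times n^2(p+q)} \\ S_q^{\otimes}(A) \end{pmatrix}^{\top}
++ \begin{pmatrix} -S_p^{\otimes}(C) \\ S_r^{\otimes}(A) \\ O_{qn^2 \times n^2(p+r)} \end{pmatrix} T(\vartheta) \begin{pmatrix} -S_p^{\otimes}(C) \\ S_r^{\otimes}(A) \\ O_{qn^2 \times n^2(p+r)} \end{pmatrix}^{\top} \tag{81} \]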
+
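+The matrices involved are of the form
+
+\[ L(z) = \begin{pmatrix} I_p \otimes A(z) \otimes I_n & O_{pn^2 \times rn^2} & O_{pn^2 \times qn^2} \\ O_{rn^2 \times pn^2} & O_{rn^2 \times rn^2} & O_{rn^2 \times qn^2} \\ O_{qn^2 \times pn^2} & O_{qn^2 \times rn^2} & I_q \otimes I_n \otimes A(z) \end{pmatrix}
+\qquad \text{and} \qquad \mathcal{A}(z) := \Phi_x(z)\, \Theta(z)\, \Phi_x^{*}(z) \]
+
+\[ W(z) = \begin{pmatrix} I_p \otimes A(z) \otimes I_n & O_{pn^2 \times rn^2} & O_{pn^2 \times qn^2} \\ O_{rn^2 \times pn^2} & I_r \otimes I_n \otimes A(z) & O_{rn^2 \times qn^2} \\ O_{qn^2 \times pn^2} & O_{qn^2 \times rn^2} & O_{qn^2 \times qn^2} \end{pmatrix}
+\qquad \text{and} \qquad \mathcal{B}(z) := \Lambda_x(z)\, \Psi(z)\, \Lambda_x^{*}(z) \]
+
+\[ T(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} (u_{p+r}(z) \otimes I_{n^2})\, \Psi(z)\, (u_{p+r}(z) \otimes I_{n^2})^{*}\, \frac{dz}{z} \]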
+
+and P(ϑ) is given in (77). Note that the matrices Φx(z), Λx(z), L(z) and W(z) are the corrected versions of
+the corresponding matrices in [23].
+A parallel between the scalar and multiple structures is straightforward. This is best illustrated
+by comparing the representations (27) and (28) with (76) and (77), respectively, that is, by comparing the FIM
+for scalar and vector ARMA(p, q) processes. The FIM of the scalar ARMAX(p, r, q) process contains
+an ARMA(p, q) part; this is confirmed by (65), through the presence of the matrix ℘(ϑ), which is
+originally displayed in (28). The multiple resultant matrices M(ϑ) and Mx(ϑ) derived from the FIM
+of the VARMA(p, q) and VARMAX(p, r, q) processes, respectively, both contain P(ϑ), whereas the first
+matrix terms of the matrices Φ(z) and Φx(z), which are of different size, consist of the same nonzero
+submatrices. To summarize, in [23] compact forms of the FIM of a VARMAX process expressed in
+terms of multiple and tensor Sylvester matrices are developed. The tensor Sylvester matrices allow
+us to investigate the multiple resultant matrix property of the FIM of VARMAX(p, r, q) processes.
+However, since no proof of the multiple resultant matrix property of the FIM G(ϑ) has been given yet,
+the consideration of a conjecture is justified. The conjecture states that the FIM G(ϑ) of a VARMAX(p, r, q)
+process becomes singular if and only if the matrix polynomials A(z), B(z) and C(z) have at least one
+common eigenvalue. A multiple equivalent of Theorem 3.1 in [4], combined with Proposition 2.4
+in [6] but based on the representations (79) and (81), can be envisaged to formulate a proof, which will
+be a subject for future study.
+
+4. Conclusions
+
+In this survey paper, matrix algebraic properties of the FIM of stationary processes are discussed.
+The presented material is a summary of papers where several matrix structural aspects of the FIM
+are investigated. The FIM of scalar and multiple processes like the (V)ARMA(X) are set forth with
+appropriate factorized forms involving (tensor) Sylvester matrices. These representations enable us to
+prove the resultant matrix property of the corresponding FIM. This has been done for (V)ARMA(p,
+q) and ARMAX(p, r, q) processes in the papers [4,6]. The development of the stages that lead to the
+appropriate factorized form of the FIM G(ϑ) (79) is set forth in [23]. However, there is no proof
+yet that confirms the multiple resultant matrix property of the FIM G(ϑ) of a VARMAX(p, r, q) process.
+This justifies the consideration of the conjecture formulated in the previous section, which can be a
+subject for future study.
+The statistical distance measure derived in [7] involves entries of the FIM. This distance measure
+can be a challenge to its quantum information counterpart (41), because (36) involves information
+about m parameters estimated from n measurements, whereas in quantum information, as in
+e.g., [8,10], the information about one parameter in a particular measurement procedure is considered
+for establishing an interconnection with the appropriate statistical distance measure. A possible
+approach, combining matrix algebra and quantum information, for developing a statistical distance
+measure in quantum information or quantum statistics at the matrix level can be a subject of
+future research. Some results concerning interconnections between the FIM of ARMA(X) models
+and appropriate solutions to Stein matrix equations are discussed; the material is extracted from the
+papers [12] and [13]. However, in this paper, some alternative and new proofs that emphasize the
+conditions under which the FIM fulfills appropriate Stein equations are set forth. The presence of
+various types of Vandermonde matrices is also emphasized when an explicit expansion of the FIM is
+computed. These Vandermonde matrices appear in interconnections with appropriate solutions to
+Stein equations. This explains why, when the matrix algebraic structures of the FIM of stationary processes
+are investigated, the involvement of structured matrices like the (tensor) Sylvester, Bezoutian and
+Vandermonde matrices is essential.
+
+Acknowledgments: The author thanks a perceptive reviewer for his comments which significantly improved the
+quality and presentation of the paper.
+
+Conflicts of Interest: The authors have declared no conflict of interest.
+
+References
+
+1.
+Dym, H. Linear Algebra in Action; American Mathematical Society: Providence, RI, USA, 2006; Volume 78.
+2.
+Lancaster, P.; Tismenetsky, M. The Theory of Matrices with Applications, 2nd ed; Academic Press: Orlando, FL,
+USA, 1985.
+3.
+Gohberg, I.; Lerer, L. Resultants of matrix polynomials. Bull. Am. Math. Soc 1976, 82, 565–567.
+4.
+Klein, A.; Spreij, P. On Fisher’s information matrix of an ARMAX process and Sylvester’s resultant matrices.
+Linear Algebra Appl 1996, 237/238, 579–590.
+5.
+Klein, A.; Spreij, P. On Fisher’s information matrix of an ARMA process. In Stochastic Differential and Difference
+Equations; Csiszar, I., Michaletzky, Gy., Eds.; Birkhäuser: Boston, MA, USA, 1997; Progress in Systems and
+Control Theory; Volume 23, pp. 273–284.
+6.
+Klein, A.; Mélard, G.; Spreij, P. On the Resultant Property of the Fisher Information Matrix of a Vector ARMA
+process. Linear Algebra Appl 2005, 403, 291–313.
+7.
+Klein, A.; Spreij, P. Transformed Statistical Distance Measures and the Fisher Information Matrix.
+Linear Algebra Appl 2012, 437, 692–712.
+8.
+Braunstein, S.L.; Caves, C.M. Statistical Distance and the Geometry of Quantum States. Phys. Rev. Lett 1994,
+72, 3439–3443.
+9.
+Jones, P.J.; Kok, P. Geometric derivation of the quantum speed limit. Phys. Rev. A 2010, 82, 022107.
+10.
+Kok, P. Tutorial: Statistical distance and Fisher information; Oxford: UK, 2006.
+11.
+Lancaster, P.; Rodman, L. Algebraic Riccati Equations; Clarendon Press: Oxford, UK, 1995.
+12.
+Klein, A.; Spreij, P. On Stein’s equation, Vandermonde matrices and Fisher’s information matrix of time
+series processes. Part I: The autoregressive moving average process. Linear Algebra Appl 2001, 329, 9–47.
+13.
+Klein, A.; Spreij, P. On the solution of Stein’s equation and Fisher’s information matrix of an ARMAX process.
+Linear Algebra Appl 2005, 396, 1–34.
+14.
+Grenander, U.; Szegő, G.P. Toeplitz Forms and Their Applications; University of California Press: New York, NY,
+USA, 1958.
+15.
+Brockwell, P.J.; Davis, R.A. Time Series: Theory and Methods, 2nd ed; Springer Verlag: Berlin, Germany;
+New York, NY, USA, 1991.
+16.
+Caines, P. Linear Stochastic Systems; John Wiley and Sons: New York, NY, USA, 1988.
+17.
+Ljung, L.; Söderström, T. Theory and Practice of Recursive Identification; M.I.T. Press: Cambridge, MA, USA, 1983.
+18.
+Hannan, E.J.; Deistler, M. The Statistical Theory of Linear Systems; John Wiley and Sons: New York, NY, USA, 1988.
+19.
+Hannan, E.J.; Dunsmuir, W.T.M.; Deistler, M. Estimation of vector Armax models. J. Multivar. Anal 1980, 10,
+275–295.
+20.
+Horn, R.A.; Johnson, C.R. Topics in Matrix Analysis; Cambridge University Press: New York, NY, USA, 1995.
+21.
+Klein, A.; Spreij, P. Matrix differential calculus applied to multiple stationary time series and an extended
+Whittle formula for information matrices. Linear Algebra Appl 2009, 430, 674–691.
+
+22.
+Klein, A.; Mélard, G. An algorithm for the exact Fisher information matrix of vector ARMAX time series.
+Linear Algebra Its Appl 2014, 446, 1–24.
+23.
+Klein, A.; Spreij, P. Tensor Sylvester matrices and the Fisher information matrix of VARMAX processes.
+Linear Algebra Appl 2010, 432, 1975–1989.
+24.
+Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta
+Math. Soc 1945, 37, 81–91.
+25.
+Ibragimov, I.A.; Has’minskiĭ, R.Z. Statistical Estimation: Asymptotic Theory; Springer-Verlag: New York,
+NY, USA, 1981.
+26.
+Lehmann, E.L. Theory of Point Estimation; Wiley: New York, NY, USA, 1983.
+27.
+Friedlander, B. On the computation of the Cramér-Rao bound for ARMA parameter estimation. IEEE Trans.
+Acoust. Speech Signal Process 1984, 32, 721–727.
+28.
+Holevo, A.S. Probabilistic and Statistical Aspects of Quantum Theory, 2nd ed; Edizioni Della Normale, SNS Pisa:
+Pisa, Italy, 2011.
+29.
+Petz, D. Quantum Information Theory and Quantum Statistics; Springer-Verlag: Berlin Heidelberg, Germany,
+2008.
+30.
+Barndorff-Nielsen, O.E.; Gill, R.D. Fisher Information in quantum statistics. J. Phys. A 2000, 30, 4481–4490.
+31.
+Luo, S. Wigner-Yanase skew information vs. quantum Fisher information. Proc. Amer. Math. Soc 2004, 132,
+885–890.
+32.
+Klein, A.; Mélard, G. On algorithms for computing the covariance matrix of estimates in autoregressive
+moving average processes. Comput. Stat. Q 1989, 5, 1–9.
+33.
+Klein, A.; Mélard, G. An algorithm for computing the asymptotic Fisher information matrix for seasonal
+SISO models. J. Time Ser. Anal 2004, 25, 627–648.
+34.
+Bistritz, Y.; Lifshitz, A. Bounds for resultants of univariate and bivariate polynomials. Linear Algebra Appl
+2010, 432, 1995–2005.
+35.
+Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: New York, NY, USA, 1996.
+36.
+Golub, G.H.; van Loan, C.F. Matrix Computations, 3rd ed; Johns Hopkins University Press: Baltimore, MD, USA, 1996.
+37.
+Kullback, S. Information Theory and Statistics; John Wiley and Sons: New York, NY, USA, 1959.
+38.
+Klein, A.; Spreij, P. The Bezoutian, state space realizations and Fisher’s information matrix of an ARMA
+process. Linear Algebra Appl 2006, 416, 160–174.
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+
+entropy
+
+Article
+Asymptotically Constant-Risk Predictive Densities
+When the Distributions of Data and Target Variables
+Are Different
+
+Keisuke Yano 1,* and Fumiyasu Komaki 1,2
+
+1 Department of Mathematical Informatics, Graduate School of Information Science and Technology,
+The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan; E-Mail:
+komaki@mist.i.u-tokyo.ac.jp
+2 RIKEN Brain Science Institute, 2-1 Hirosawa, Wako City, Saitama 351-0198, Japan
+*
+E-Mail: Keisuke_Yano@mist.i.u-tokyo.ac.jp; Tel.: +81-3-5841-6909.
+
+Received: 28 March 2014; in revised form: 9 May 2014 / Accepted: 22 May 2014 /
+Published: 28 May 2014
+
+Abstract: We investigate the asymptotic construction of constant-risk Bayesian predictive densities
+under the Kullback–Leibler risk when the distributions of data and target variables are different and
+have a common unknown parameter. It is known that the Kullback–Leibler risk is asymptotically
+equal to a trace of the product of two matrices: the inverse of the Fisher information matrix for the
+data and the Fisher information matrix for the target variables. We assume that the trace has a unique
+maximum point with respect to the parameter. We construct asymptotically constant-risk Bayesian
+predictive densities using a prior depending on the sample size. Further, we apply the theory to the
+subminimax estimator problem and the prediction based on the binary regression model.
+
+Keywords:
+Bayesian prediction; Fisher information; Kullback–Leibler divergence; minimax;
+predictive metric; subminimax estimator
+
+1. Introduction
+
+Let x(N) = (x1, · · · , xN) be N independent observations distributed according to a probability density,
+p(x|θ), that belongs to a d-dimensional parametric model, {p(x|θ) : θ ∈ Θ}, where θ = (θ1, · · · , θd)
+is an unknown d-dimensional parameter and Θ is the parameter space. Let y be a target variable
+distributed according to a probability density, q(y|θ), that belongs to a d-dimensional parametric
+model, {q(y|θ) : θ ∈ Θ}, with the same parameter, θ. Here, we assume that the distributions of the
+data and the target variables, p(x|θ) and q(y|θ), are different. For simplicity, we assume that the data
+and the target variables are independent given θ.
+We construct predictive densities for target variables based on the data. We measure
+the performance of the predictive density, ˆq(y; x(N)), by the Kullback–Leibler divergence,
+D(q(·|θ), ˆq(·; x(N))), from the true density, q(y|θ), to the predictive density, ˆq(y; x(N)):
+
+\[ D(q(\cdot|\theta), \hat q(\cdot; x^{(N)})) = \int q(y|\theta) \log \frac{q(y|\theta)}{\hat q(y; x^{(N)})} \, dy. \]
+
+Then, the risk function, R(θ, ˆq(y; x(N))), of the predictive density, ˆq(y; x(N)), is given by:
+
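+\begin{align*}
+R(\theta, \hat q(y; x^{(N)})) &= \int p(x^{(N)}|\theta)\, D(q(\cdot|\theta), \hat q(\cdot; x^{(N)})) \, dx^{(N)} \\
+&= \int p(x^{(N)}|\theta) \int q(y|\theta) \log \frac{q(y|\theta)}{\hat q(y; x^{(N)})} \, dy \, dx^{(N)}.
+\end{align*}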
+
+Entropy 2014, 16, 3026–3048; doi:10.3390/e16063026
+www.mdpi.com/journal/entropy
+
+For the construction of predictive densities, we consider the Bayesian predictive density defined
+by:
+
+\[ \hat q_\pi(y|x^{(N)}) = \frac{\int q(y|\theta)\, p(x^{(N)}|\theta)\, \pi(\theta; N) \, d\theta}{\int p(x^{(N)}|\theta)\, \pi(\theta; N) \, d\theta}, \]
+
+where π(θ; N) is a prior density for θ, possibly depending on the sample size, N. Aitchison [1]
+showed that, for a given prior density, π(θ; N), the Bayesian predictive density, ˆqπ(y|x(N)), is a Bayes
+solution under the Kullback–Leibler risk. Based on the asymptotics as the sample size goes to infinity,
+Komaki [2] and Hartigan [3] showed its superiority over any plug-in predictive density, q(y| ˆθ), with
+any estimator, ˆθ. However, there remains a problem of prior selection for constructing better Bayesian
+predictive densities. Thus, a prior, π(θ; N), must be chosen based on an optimality criterion for actual
+applications.
+Among various criteria, we focus on a criterion of constructing minimax predictive densities
+under the Kullback–Leibler risk. For simplicity, we refer to the priors generating minimax predictive
+densities as minimax priors. Minimax priors have been previously studied in various predictive
+settings; see [4–8]. When the simultaneous distributions of the target variables and the data belong to
+the submodel of the multinomial distributions, Komaki [7] shows that minimax priors are given as
+latent information priors maximizing the conditional mutual information between target variables and
+the parameter given the data. However, the explicit forms of latent information priors are difficult to
+obtain, and we need asymptotic methods, because they require the maximization on the space of the
+probability measures on Θ.
+Except for [7], these studies on minimax priors are based on the assumption that the distributions,
+p(x|θ) and q(y|θ), are identical. Let us consider the prediction based on the logistic regression model
+where the covariates of the data and the target variables are not identical. In this predictive setting, the
+assumption that the distributions, p(x|θ) and q(y|θ), are identical is no longer valid.
+We focus on the minimax priors in predictions where the distributions, p(x|θ) and q(y|θ), are
+different and have a common unknown parameter. Such a predictive setting has traditionally been
+considered in statistical prediction and experiment design. It has recently been studied in statistical
+learning theory; for example, see [9]. Predictive densities where the distributions, p(x|θ) and q(y|θ),
+are different and have a common unknown parameter are studied by [10–13].
+Let $g^X_{ij}(\theta)$ be the (i, j)-component of the Fisher information matrix of the distribution, p(x|θ), and
+let $g^Y_{ij}(\theta)$ be the (i, j)-component of the Fisher information matrix of the distribution, q(y|θ). Let $g^{X,ij}(\theta)$
+and $g^{Y,ij}(\theta)$ denote the (i, j)-components of their inverse matrices. We adopt Einstein’s summation
+convention: if the same indices appear twice in any one term, it implies summation over that index
+from one to d. For the asymptotics below, we assume that the prior densities, π(θ; N), are smooth.
+On the asymptotics as the sample size N goes to infinity, we construct the asymptotically
+constant-risk prior, π(θ; N), in the sense that the asymptotic risk:
+
+\[ R(\theta, \hat q_\pi(y|x^{(N)})) = \frac{1}{N} R_1(\theta, \hat q_\pi(y|x^{(N)})) + \frac{1}{N\sqrt{N}} R_2(\theta, \hat q_\pi(y|x^{(N)})) + O(N^{-2}) \]
+
+is constant up to O(N−2). Since the proper prior with the constant risk is a minimax prior for any finite
+sample size, the asymptotically constant-risk prior relates to the minimax prior; in Section 4, we verify
+that the asymptotically constant-risk prior agrees with the exact minimax prior in binomial examples.
+When we use the prior, π(θ), independent of the sample size, N, it is known that the N−1-order
+term, R1(θ, ˆqπ(y|x(N))), of the Kullback–Leibler risk is equal to the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$. If the trace
+does not depend on the parameter, θ, the construction of the asymptotically constant-risk prior is
+parallel to [6]; see also [13].
+However, we consider the settings where there exists a unique maximum point of the trace,
+$g^{X,ij}(\theta) g^Y_{ij}(\theta)$; for example, these settings appear in predictions based on the binary regression model,
+where the covariates of the data and the target variables are not identical. In the settings, there do
+not exist asymptotically constant-risk priors among the priors independent of the sample size, N.
+The reason is as follows: we consider the prior, π(θ), independent of the sample size, N. Then, the
+Kullback–Leibler risk of the Bayesian predictive density is expanded as:
+
+\[ R(\theta, \hat q_\pi(y|x^{(N)})) = \frac{1}{2N}\, g^Y_{ij}(\theta)\, g^{X,ij}(\theta) + O(N^{-2}). \]
+
+Since, in our settings, the first-order term, $g^Y_{ij}(\theta) g^{X,ij}(\theta)$, is not constant, the prior independent of the
+sample size, N, is not an asymptotically constant-risk prior.
+When there exists a unique maximum point of the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$, we construct the
+asymptotically constant-risk prior, π(θ; N), up to O(N−2), by making the prior dependent on the
+sample size, N, as:
+
+\[ \frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \{f(\theta)\}^{\sqrt{N}}\, h(\theta), \]
+
+where f (θ) and h(θ) are the scalar functions of θ independent of N and |gX(θ)| denotes the determinant
+of the Fisher information matrix, gX(θ).
+The key idea is that, if a particular parameter point carries more risk than the other parameter
+points, then more prior weight should be concentrated on that point.
+Further, we clarify the subminimax estimator problem based on the mean squared error from
+the viewpoint of the prediction where the distributions of data and target variables are different and
+have a common unknown parameter. We obtain the improvement achieved by the minimax estimator
+over the subminimax estimators up to O(N−2). The subminimax estimator problem [14,15] is the
+problem that, at first glance, there seems to exist asymptotically dominating estimators of the minimax
+estimator. However, the relationship between such subminimax estimator problems and predictions
+has not been investigated; further, in general, the improvement achieved by the minimax estimator over
+the subminimax estimators has not been investigated.
+
+2. Information Geometrical Notations
+
+In this section, we prepare the information geometrical notations; see [16] for details. We
+abbreviate ∂/∂θi to ∂i, where the indices, i, j, k, . . ., run from one to d. Similarly, we abbreviate ∂2/∂θi∂θj,
+∂3/∂θi∂θj∂θk and ∂4/∂θi∂θj∂θk∂θl to ∂ij, ∂ijk and ∂ijkl, respectively. We denote the expectations of the
+random variables, X, Y and X(N), by EX[·], EY[·] and EX(N)[·], respectively. We denote their probability
+densities by p(x|θ), q(y|θ) and p(x(N)|θ), respectively.
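+We define the predictive metric proposed by Komaki [13] as:
+
+\[ \mathring{g}_{ij}(\theta) = g^X_{ik}(\theta)\, g^{Y,kl}(\theta)\, g^X_{lj}(\theta). \]
+
+Its inverse has the components $\mathring{g}^{ij}(\theta) = g^{X,ik}(\theta)\, g^Y_{kl}(\theta)\, g^{X,lj}(\theta)$.
+When the parameter is one-dimensional, $g_{\theta\theta}(\theta)$ denotes the Fisher information and $g^{\theta\theta}(\theta)$ denotes its
+inverse. Let $\overset{e}{\Gamma}{}^X_{ij,k}(\theta)$ and $\overset{m}{\Gamma}{}^X_{ij,k}(\theta)$ be the quantities given by:
+
+\[ \overset{e}{\Gamma}{}^X_{ij,k}(\theta) := E_X[\partial_{ij} \log p(x|\theta)\, \partial_k \log p(x|\theta)] \]
+
+and:
+
+\[ \overset{m}{\Gamma}{}^X_{ij,k}(\theta) := \int \frac{1}{p(x|\theta)} [\partial_{ij} p(x|\theta)\, \partial_k p(x|\theta)] \, dx. \]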
+
+
+Using these quantities, the e-connection and m-connection coefficients with respect to the parameter, θ,
+for the model, {p(x|θ) : θ ∈ Θ}, are given by:
+
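+\[ \overset{e}{\Gamma}{}^{X,k}_{ij}(\theta) := g^{X,kl}(\theta)\, \overset{e}{\Gamma}{}^X_{ij,l}(\theta) \]
+
+and:
+
+\[ \overset{m}{\Gamma}{}^{X,k}_{ij}(\theta) := g^{X,kl}(\theta)\, \overset{m}{\Gamma}{}^X_{ij,l}(\theta), \]
+
+respectively.
+The (0, 3)-tensor, $T^X_{ijk}(\theta)$, is defined by:
+
+\[ T^X_{ijk}(\theta) := E_X[\partial_i \log p(x|\theta)\, \partial_j \log p(x|\theta)\, \partial_k \log p(x|\theta)]. \]
+
+The tensor, $T^X_{ijk}(\theta)$, also produces a (0, 1)-tensor:
+
+\[ T^X_i(\theta) := T^X_{ijk}(\theta)\, g^{X,jk}(\theta). \]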
+
+In the same manner, the information geometrical quantities, $\overset{e}{\Gamma}{}^Y_{ij,k}(\theta)$, $\overset{m}{\Gamma}{}^Y_{ij,k}(\theta)$ and $T^Y_{ijk}(\theta)$, are
+defined for the model, {q(y|θ) : θ ∈ Θ}.
+Let $M^k_{ij}(\theta)$ be a (1, 2)-tensor defined by:
+
+\[ M^k_{ij}(\theta) := \overset{m}{\Gamma}{}^{Y,k}_{ij}(\theta) - \overset{m}{\Gamma}{}^{X,k}_{ij}(\theta). \]
+
+For a derivative, (∂1v(θ), · · · , ∂dv(θ)), of the scalar function, v(θ), the e-covariant derivative is
+given by:
+
+\[ \overset{e}{\nabla}_i \partial_j v(\theta) := \partial_{ij} v(\theta) - \overset{e}{\Gamma}{}^{X,k}_{ij}(\theta)\, \partial_k v(\theta). \]
+
+3. Asymptotically Constant-Risk Priors When the Distributions of Data and Target Variables
+Are Different
+
+In this section, we consider the settings where the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$, has a unique maximum
+point. We construct the asymptotically constant-risk prior under the Kullback–Leibler risk in the sense
+that the asymptotic risk up to O(N−2) is constant. We find asymptotically constant-risk priors up to
+O(N−2) in two steps: first, expand the Kullback–Leibler risks of Bayesian predictive densities; second,
+find the prior having an asymptotically constant risk using this expansion.
+From now on, we assume the following two conditions for the prior, π(θ; N):
+
+(C1) The prior, π(θ; N), has the form:
+
+\[ \frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log f(\theta) + \log h(\theta)\}, \]
+
+where f (θ) and h(θ) are smooth scalar functions of θ independent of N.
+(C2) The unique maximum point of the scalar function, f (θ), is equal to the unique maximum point
+of the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$.
+
+
+Based on Conditions (C1) and (C2), we expand the Kullback–Leibler risk of a Bayesian predictive
+density up to O(N−2).
+
+Theorem 1. The Kullback–Leibler risk of a Bayesian predictive density based on the prior, π(θ; N), satisfying
+Condition (C1), is expanded as:
+
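+\begin{align*}
+R(\theta, \hat q_\pi(y|x^{(N)}))
+&= \frac{1}{2N}\, g^Y_{ij}(\theta)\, g^{X,ij}(\theta) + \frac{1}{2N}\, \mathring{g}^{ij}(\theta)\, \partial_i \log f(\theta)\, \partial_j \log f(\theta) - \frac{1}{N\sqrt{N}}\, T^Y_{ijk}(\theta)\, g^{X,ij}(\theta)\, g^{X,kl}(\theta)\, \partial_l \log f(\theta) \\
+&\quad + \frac{1}{N\sqrt{N}}\, \mathring{g}^{ij}(\theta)\, \overset{e}{\nabla}_i \partial_j \log f(\theta) + \frac{1}{N\sqrt{N}}\, \mathring{g}^{ij}(\theta)\, g^{X,kl}(\theta) \left( \overset{e}{\nabla}_i \partial_k \log f(\theta) \right) \partial_j \log f(\theta)\, \partial_l \log f(\theta) \\
+&\quad - \frac{1}{3N\sqrt{N}}\, T^Y_{ijk}(\theta)\, g^{X,is}(\theta)\, g^{X,jt}(\theta)\, g^{X,ku}(\theta)\, \partial_s \log f(\theta)\, \partial_t \log f(\theta)\, \partial_u \log f(\theta) \\
+&\quad + \frac{1}{2N\sqrt{N}}\, g^Y_{kl}(\theta)\, M^l_{ij}(\theta)\, g^{X,is}(\theta)\, g^{X,jt}(\theta)\, g^{X,ku}(\theta)\, \partial_s \log f(\theta)\, \partial_t \log f(\theta)\, \partial_u \log f(\theta) \\
+&\quad + \frac{1}{2N\sqrt{N}}\, g^{X,ij}(\theta)\, g^Y_{kl}(\theta)\, g^{X,kl}(\theta)\, M^m_{ij}(\theta)\, \partial_m \log f(\theta) + \frac{1}{N\sqrt{N}}\, \mathring{g}^{ij}(\theta)\, M^k_{ij}(\theta)\, \partial_k \log f(\theta) \\
+&\quad + \frac{1}{2N\sqrt{N}}\, \mathring{g}^{ij}(\theta)\, T^X_i(\theta)\, \partial_j \log f(\theta) + \frac{1}{2N\sqrt{N}}\, g^{X,im}(\theta)\, g^Y_{ij}(\theta)\, g^{X,kl}(\theta)\, M^j_{kl}(\theta)\, \partial_m \log f(\theta) \\
+&\quad + \frac{1}{N\sqrt{N}}\, \mathring{g}^{ij}(\theta)\, \partial_i \log f(\theta)\, \partial_j \log h(\theta) + O(N^{-2}). \tag{1}
+\end{align*}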
+
+The proof is given in the Appendix. The first term in (1) represents that the precision of the
+estimation is determined by the geometric quantity of the data, $g^{X,ij}(\theta)$, and the metric of the parameter
+is determined by the geometric quantity of the target variables, $g^Y_{ij}(\theta)$. Note that each term in (1) is
+invariant under the reparametrization.
+
+Remark 1. For the subsequent theorem, it is important that at the point, θ f , maximizing the scalar function,
+log f (θ), R(θ f , ˆqπ(y|x(N))) is given by:
+
+\[ R(\theta_f, \hat q_\pi(y|x^{(N)})) = \frac{1}{2N} \sup_{\theta \in \Theta} \{ g^{X,ij}(\theta)\, g^Y_{ij}(\theta) \} + \frac{1}{N\sqrt{N}}\, \mathring{g}^{ij}(\theta_f)\, \partial_{ij} \log f(\theta_f) + O(N^{-2}). \tag{2} \]
+
+The N−3/2-order term of this risk is common whenever we use the same scalar function, log f (θ). This term is
+negative because of the definition of the point, θ f . Under Condition (C2), θ f is equal to the unique maximum
+point, θmax, of the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$.
+
+Based on (1) and (2), we construct asymptotically constant-risk priors using the solutions of the
+partial differential equations.
+
+Theorem 2. Suppose that the scalar functions, log ˜f (θ) and log ˜h(θ), satisfy the following conditions:
+
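+(A1) log ˜f (θ) is the solution of the Eikonal equation given by:
+
+\[ \mathring{g}^{ij}(\theta)\, \partial_i \log \tilde f(\theta)\, \partial_j \log \tilde f(\theta) = g^{X,ij}(\theta_{\max})\, g^Y_{ij}(\theta_{\max}) - g^{X,ij}(\theta)\, g^Y_{ij}(\theta), \tag{3} \]
+
+where θmax is the unique maximum point of the scalar function, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$.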
+
+
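+(A2) log ˜h(θ) is the solution of the first-order linear partial differential equation given by:
+
+\begin{align*}
+\mathring{g}^{ij}(\theta)\, \partial_i \log \tilde f(\theta)\, \partial_j \log \tilde h(\theta)
+&= - \mathring{g}^{ij}(\theta)\, \overset{e}{\nabla}_i \partial_j \log \tilde f(\theta) \\
+&\quad - \mathring{g}^{ij}(\theta)\, g^{X,kl}(\theta) \left( \overset{e}{\nabla}_i \partial_k \log \tilde f(\theta) \right) \partial_j \log \tilde f(\theta)\, \partial_l \log \tilde f(\theta) \\
+&\quad + T^Y_{ijk}(\theta)\, g^{X,ij}(\theta)\, g^{X,kl}(\theta)\, \partial_l \log \tilde f(\theta) \\
+&\quad + \frac{1}{3}\, T^Y_{ijk}(\theta)\, g^{X,is}(\theta)\, g^{X,jt}(\theta)\, g^{X,ku}(\theta)\, \partial_s \log \tilde f(\theta)\, \partial_t \log \tilde f(\theta)\, \partial_u \log \tilde f(\theta) \\
+&\quad - \frac{1}{2}\, g^Y_{kl}(\theta)\, M^l_{ij}(\theta)\, g^{X,is}(\theta)\, g^{X,jt}(\theta)\, g^{X,ku}(\theta)\, \partial_s \log \tilde f(\theta)\, \partial_t \log \tilde f(\theta)\, \partial_u \log \tilde f(\theta) \\
+&\quad - \frac{1}{2}\, g^{X,ij}(\theta)\, g^Y_{kl}(\theta)\, g^{X,kl}(\theta)\, M^m_{ij}(\theta)\, \partial_m \log \tilde f(\theta) - \mathring{g}^{ij}(\theta)\, M^k_{ij}(\theta)\, \partial_k \log \tilde f(\theta) \\
+&\quad - \frac{1}{2}\, \mathring{g}^{ij}(\theta)\, T^X_i(\theta)\, \partial_j \log \tilde f(\theta) - \frac{1}{2}\, g^{X,im}(\theta)\, g^Y_{ij}(\theta)\, g^{X,kl}(\theta)\, M^j_{kl}(\theta)\, \partial_m \log \tilde f(\theta) \\
+&\quad + \mathring{g}^{ij}(\theta_{\max})\, \partial_{ij} \log \tilde f(\theta_{\max}). \tag{4}
+\end{align*}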
+
+Let π(θ; N) be the prior that is constructed as:
+
+\[ \frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log \tilde f(\theta) + \log \tilde h(\theta)\}. \]
+
+Further, suppose that log ˜f (θ) satisfies Condition (C2).
+Then, the Bayesian predictive density based on the prior, π(θ; N), has the asymptotically smallest constant
+risk up to O(N−2) among all priors with the form (C1).
+
+Proof. First, we consider the prior, φ(θ; N), constructed as:
+
+\[ \frac{\varphi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log \tilde f(\theta)\}. \]
+
+From Theorem 1, the Kullback–Leibler risk, R(θ, ˆqφ(y|x(N))), based on the prior, φ(θ; N), is given by:
+
+\[ R(\theta, \hat q_\varphi(y|x^{(N)})) = \frac{1}{2N}\, g^{X,ij}(\theta_{\max})\, g^Y_{ij}(\theta_{\max}) + o(N^{-1}). \tag{5} \]
+
+This is constant up to o(N−1).
+Suppose that there exists another prior, ϕ(θ; N), constructed as:
+
+\[ \frac{\phi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log f(\theta)\}, \]
+
+and the Bayesian predictive density based on the prior, ϕ(θ; N), has the asymptotically constant risk:
+
+\[ R(\theta, \hat q_\phi(y|x^{(N)})) = \frac{k}{2N} + o(N^{-1}). \]
+
+From Theorem 1, the prior ϕ(θ; N) must satisfy the equation:
+
+\[ \mathring{g}^{ij}(\theta)\, \partial_i \log f(\theta)\, \partial_j \log f(\theta) = k - g^{X,ij}(\theta)\, g^Y_{ij}(\theta). \]
+
+The left-hand side of the above equation is non-negative, because the matrix, $\mathring{g}^{ij}(\theta)$, is positive-definite.
+Hence, the infimum of the constant, k, is equal to $g^{X,ij}(\theta_{\max}) g^Y_{ij}(\theta_{\max})$. From (5), the N−1-order term
+of the risk based on the prior, φ(θ; N), achieves the infimum, $g^{X,ij}(\theta_{\max}) g^Y_{ij}(\theta_{\max})$. Thus, the Bayesian
+predictive density based on the prior, φ(θ; N), has the asymptotically smallest constant risk up to
+o(N−1).
+Second, we consider the prior, π(θ; N), constructed as:
+
+\[ \frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log \tilde f(\theta) + \log \tilde h(\theta)\}. \]
+
+The above argument ensures that the prior, π(θ; N), has the asymptotically smallest constant risk up
+to o(N−1). Thus, we only have to check if the N−3/2-order term of the risk is the smallest constant.
+From (2), the N−3/2-order term of the risk at the point, θmax, is unchanged by the choice of the
+scalar function, log h(θ). In other words, the constant N−3/2-order term must agree with the quantity,
+˚gij(θmax)∂ij log ˜f (θmax). From Theorem 1, if we choose the prior, π(θ; N), the N−3/2-order term of the
+risk is the smallest constant, and it agrees with the quantity, ˚gij(θmax)∂ij log ˜f (θmax). Thus, the prior,
+π(θ; N), has the asymptotically smallest constant risk up to O(N−2).
+
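+Remark 2. In Theorem 2, we choose log ˜f (θ), satisfying Condition (C2) among the solutions of (A1). We
+consider the model with a one-dimensional parameter, θ. There are four possibilities for the solutions of (A1):
+
+\[ \sqrt{\mathring{g}^{\theta\theta}(\theta)}\; \partial_\theta \log \tilde f(\theta) = \begin{cases} \pm \sqrt{g^{X,\theta\theta}(\theta_{\max})\, g^Y_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta)\, g^Y_{\theta\theta}(\theta)} & \text{if } \theta \le \theta_{\max}, \\ \pm \sqrt{g^{X,\theta\theta}(\theta_{\max})\, g^Y_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta)\, g^Y_{\theta\theta}(\theta)} & \text{if } \theta \ge \theta_{\max}, \end{cases} \]
+
+where the double-sign corresponds. From the concavity around θmax as suggested by (C2), we choose log ˜f (θ)
+as the solution of the following equation:
+
+\[ \sqrt{\mathring{g}^{\theta\theta}(\theta)}\; \partial_\theta \log \tilde f(\theta) = \begin{cases} \sqrt{g^{X,\theta\theta}(\theta_{\max})\, g^Y_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta)\, g^Y_{\theta\theta}(\theta)} & \text{if } \theta \le \theta_{\max}, \\ - \sqrt{g^{X,\theta\theta}(\theta_{\max})\, g^Y_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta)\, g^Y_{\theta\theta}(\theta)} & \text{if } \theta \ge \theta_{\max}. \end{cases} \tag{6} \]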
+
+Integrating both sides of Equation (6), the unique function, log ˜f (θ), is obtained.
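+
+The integration step in Remark 2 can be written out explicitly. The following is a sketch only; the shorthand Δ(t) for the right-hand side of the Eikonal equation is introduced here for illustration and is not notation from the original text. The additive constant log ˜f (θmax) is arbitrary, since it only rescales the prior by a constant factor.
+
+```latex
+% Assumed shorthand (not in the original): \Delta(t) is the right-hand side of (3) in one dimension.
+\[ \Delta(t) := g^{X,\theta\theta}(\theta_{\max})\, g^Y_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(t)\, g^Y_{\theta\theta}(t) \ge 0 . \]
+% Integrating (6) from \theta_{\max} to \theta on either side of \theta_{\max}:
+\[ \log \tilde f(\theta) = \log \tilde f(\theta_{\max}) - \left| \int_{\theta_{\max}}^{\theta} \sqrt{\frac{\Delta(t)}{\mathring{g}^{\theta\theta}(t)}} \; dt \right| . \]
+% Hence \log \tilde f is non-increasing away from \theta_{\max}, as Condition (C2) requires.
+```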
+
+Remark 3. Compare the Kullback–Leibler risk based on the asymptotically constant-risk prior, π(θ; N), with
+that based on the prior, λ(θ), independent of the sample size, N. From Theorem 1 and Theorem 2, the
+Kullback–Leibler risk based on the asymptotically constant-risk prior, π(θ; N), is given as:
+
+\[ R(\theta, \hat q_\pi(y|x^{(N)})) = \frac{1}{2N}\, g^{X,ij}(\theta_{\max})\, g^Y_{ij}(\theta_{\max}) + \frac{1}{N\sqrt{N}}\, \mathring{g}^{ij}(\theta_{\max})\, \partial_{ij} \log \tilde f(\theta_{\max}) + O(N^{-2}). \tag{7} \]
+
+In contrast, the Kullback–Leibler risk based on the prior, λ(θ), is given as:
+
+\[ R(\theta, \hat q_\lambda(y|x^{(N)})) = \frac{1}{2N}\, g^{X,ij}(\theta)\, g^Y_{ij}(\theta) + O(N^{-2}). \tag{8} \]
+
+The N−1-order term in (8) does not exceed the N−1-order term in (7); however, while (8) has no
+N−3/2-order term, the N−3/2-order term in (7) is negative. Thus, the maximum of the risk based on the asymptotically
+constant-risk prior, π(θ; N), is smaller than that of the risk based on the prior, λ(θ). This result is consistent
+with the minimaxity of selecting the prior that constructs the predictive density with the smallest maximum of
+the risk.
+
+4. Subminimax Estimator Problem Based on the Mean Squared Error
+
+In this section, we refer to the subminimax estimator problem based on the mean squared error,
+from the viewpoint of the prediction where the distributions of data and target variables are different
+and have a common unknown parameter. First, we give a brief review of the subminimax estimator
+problem through the binomial example.
+
+Example 1. Let us consider the binomial estimation based on the mean squared error, RMSE(θ, ˆθ). For any
+finite sample size, N, the Bayes estimator, ˆθπ, based on the Beta prior, $\pi(\theta; N) \propto \theta^{\sqrt{N}/2 - 1} (1 - \theta)^{\sqrt{N}/2 - 1}$, is
+minimax under the mean squared error. The mean squared error of the minimax Bayes estimator, ˆθπ, is given by:
+
+\[ R_{\mathrm{MSE}}(\theta, \hat\theta_\pi) = \frac{N}{4(\sqrt{N} + N)^2} = \frac{1}{4N} - \frac{1}{2N\sqrt{N}} + O(N^{-2}). \tag{9} \]
+
+In contrast, the mean squared error of the maximum likelihood estimator, ˆθMLE, is given by:
+
+\[ R_{\mathrm{MSE}}(\theta, \hat\theta_{\mathrm{MLE}}) = \frac{\theta(1 - \theta)}{N}. \]
+
+We compare the two estimators, ˆθπ and ˆθMLE. In the comparison of the N−1-order terms of the mean
+squared errors, it seems that the maximum likelihood estimator, ˆθMLE, dominates the minimax Bayes estimator,
+ˆθπ. In other words, the N−1-order term of RMSE(θ, ˆθMLE) is not greater than that of RMSE(θ, ˆθπ) for every
+θ ∈ Θ, and the equality holds when θ = 1/2. This seeming paradox is known as the subminimax estimator
+problem; see [14,17,18] for details. See also [15] for the conditions that such problems do not occur in estimation.
+However, this paradox does not mean the inferiority of the minimax Bayes estimator. This is because,
+although the mean squared error of the minimax Bayes estimator, ˆθπ, has the negative N−3/2-order term, the
+mean squared error of the maximum likelihood estimator, ˆθMLE, does not have the N−3/2-order term. Hence, in
+comparison to the mean squared errors up to O(N−2), the maximum of the mean squared error, RMSE(θ, ˆθπ), is
+below the maximum of the mean squared error, RMSE(θ, ˆθMLE).
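+
+The constant risk in (9) and its expansion can be checked directly. The sketch below assumes that x denotes the number of successes among the N Bernoulli trials and uses the standard posterior mean under a Beta prior; these conventions are assumptions made here for illustration.
+
+```latex
+% Assumption: x ~ Bin(N, \theta) is the number of successes; prior Beta(\sqrt{N}/2, \sqrt{N}/2).
+\begin{align*}
+\hat\theta_\pi &= \frac{x + \sqrt{N}/2}{N + \sqrt{N}}, \\
+R_{\mathrm{MSE}}(\theta, \hat\theta_\pi)
+ &= \underbrace{\frac{N\theta(1-\theta)}{(N+\sqrt{N})^2}}_{\text{variance}}
+  + \underbrace{\frac{N(1/2-\theta)^2}{(N+\sqrt{N})^2}}_{\text{squared bias}}
+  = \frac{N}{4(N+\sqrt{N})^2}, \\
+\frac{N}{4(N+\sqrt{N})^2}
+ &= \frac{1}{4N}\left(1 + N^{-1/2}\right)^{-2}
+  = \frac{1}{4N} - \frac{1}{2N\sqrt{N}} + \frac{3}{4N^2} + O(N^{-5/2}), \\
+\sup_{\theta} R_{\mathrm{MSE}}(\theta, \hat\theta_{\mathrm{MLE}}) - R_{\mathrm{MSE}}(\theta, \hat\theta_\pi)
+ &= \frac{1}{4N} - \frac{N}{4(N+\sqrt{N})^2}
+  = \frac{1}{2N\sqrt{N}} + O(N^{-2}).
+\end{align*}
+```
+
+The last line makes the size of the improvement explicit: the two maxima share the N−1-order term, and the gap first appears at order N−3/2.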
+
+Next, we construct the asymptotically constant-risk prior in the estimation based on the mean
+squared error when the subminimax estimator problem occurs, from the viewpoint of the prediction.
+We consider the priors, π(θ; N), satisfying (C1). From Lemma A3 in the Appendix, the mean squared
+error of the Bayes estimator, ˆθπ, is equal to the Kullback–Leibler risk of the ˆθπ-plugin predictive density,
+q(y| ˆθπ), by assuming that the target variable, y, is a d-dimensional Gaussian random variable with the
+
+mean vector, θ, and unit variance. Note that $g^Y_{ij}(\theta) = \delta_{ij}$, $\overset{m}{\Gamma}{}^Y_{ij,k} = 0$ and $\overset{e}{\Gamma}{}^Y_{ij,k} = 0$ for i, j, k = 1, · · · , d.
+Thus, if $g^Y_{ij}(\theta) g^{X,ij}(\theta) = \sum_{i=1}^d g^{X,ii}(\theta)$ has a unique maximum point, we obtain the asymptotically
+constant-risk prior, π(θ; N), up to O(N−2) from Lemma A2 in the Appendix and Theorem 2.
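+
+The link between the mean squared error and the Kullback–Leibler risk of a plug-in density can be made concrete for the Gaussian target. The computation below is a sketch under the stated assumption that y is d-dimensional Gaussian with mean θ and identity covariance; the exact normalization used in Lemma A3 is not reproduced here, so the factor 1/2 should be read with that caveat.
+
+```latex
+% Assumption: q(y|\theta) is the N_d(\theta, I) density and \hat\theta is any estimator.
+\begin{align*}
+\log \frac{q(y|\theta)}{q(y|\hat\theta)}
+ &= -\tfrac{1}{2}\|y-\theta\|^2 + \tfrac{1}{2}\|y-\hat\theta\|^2
+  = (y-\theta)^{\top}(\theta-\hat\theta) + \tfrac{1}{2}\|\theta-\hat\theta\|^2, \\
+D(q(\cdot|\theta), q(\cdot|\hat\theta))
+ &= E_Y\!\left[\log \frac{q(y|\theta)}{q(y|\hat\theta)}\right]
+  = \tfrac{1}{2}\|\hat\theta-\theta\|^2, \\
+E_{X^{(N)}}\!\left[ D(q(\cdot|\theta), q(\cdot|\hat\theta)) \right]
+ &= \tfrac{1}{2} \sum_{i=1}^{d} E_{X^{(N)}}\!\left[(\hat\theta^i-\theta^i)^2\right].
+\end{align*}
+```
+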
+Finally, we compare the mean squared error of the asymptotically constant-risk Bayes estimator,
+ˆθπ, with that of the maximum likelihood estimator, ˆθMLE. The mean squared error of the asymptotically
+constant-risk Bayes estimator, ˆθπ, is given as:
+
+\[ R_{\mathrm{MSE}}(\theta, \hat\theta_\pi) = \frac{1}{N} \sum_{i=1}^{d} g^{X,ii}(\theta_{\max}) + \frac{2}{N\sqrt{N}} \sum_{k=1}^{d} g^{X,ik}(\theta_{\max})\, g^{X,jk}(\theta_{\max})\, \partial_{ij} \log \tilde f(\theta_{\max}) + O(N^{-2}). \]
+
+In contrast, the mean squared error of the maximum likelihood estimator, ˆθMLE, is given as:
+
+\[ R_{\mathrm{MSE}}(\theta, \hat\theta_{\mathrm{MLE}}) = \frac{1}{N} \sum_{i=1}^{d} g^{X,ii}(\theta) + O(N^{-2}). \]
+
+See [16,19].
+Thus, the maximum of the mean squared error of the asymptotically constant-risk Bayes estimator
+is smaller than that of the maximum likelihood estimator; the improvement is of order N−3/2 and is proportional to the Hessian
+of the scalar function, log ˜f (θ), at θmax. In the prediction where the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$, has a unique
+maximum point, the same improvement holds (Remark 3).
+
+
+Example 2. Using the above results, we consider the binomial estimation based on the mean squared error from
+the viewpoint of the prediction. The geometrical quantities to be used are given by:
+
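+\[ g^X_{\theta\theta}(\theta) = \frac{1}{\theta(1-\theta)}, \quad g^Y_{\theta\theta}(\theta) = 1, \quad \overset{m}{\Gamma}{}^X_{\theta\theta,\theta}(\theta) = 0, \quad \overset{m}{\Gamma}{}^Y_{\theta\theta,\theta}(\theta) = 0, \]
+\[ \overset{e}{\Gamma}{}^X_{\theta\theta,\theta}(\theta) = -\frac{1-2\theta}{\theta^2(1-\theta)^2}, \quad \overset{e}{\Gamma}{}^Y_{\theta\theta,\theta}(\theta) = 0, \quad T^X_{\theta\theta\theta}(\theta) = \frac{1-2\theta}{\theta^2(1-\theta)^2}, \quad \text{and} \quad T^Y_{\theta\theta\theta}(\theta) = 0, \]
+
+respectively. Since $\overset{m}{\Gamma}{}^{X,\theta}_{\theta\theta}$, $\overset{m}{\Gamma}{}^{Y,\theta}_{\theta\theta}$ and $T^Y_{\theta\theta\theta}$ vanish, the asymptotically constant-risk prior in the estimation is
+identical to the asymptotically constant-risk prior in the prediction; compare Theorem 1 with the expansion of
+$g^Y_{ij}(\theta)\, E_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)]$ in Lemma A2 in the Appendix.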
+In this example, Equation (3) is given by:
+
+\[ \theta^2 (1-\theta)^2 \{\partial_\theta \log \tilde f(\theta)\}^2 = \frac{1}{4} - \theta(1-\theta), \]
+
+and the solution, log ˜f (θ), is (1/2) log{θ(1 − θ)}. Here, the second-order derivative of the function, log ˜f (θ),
+is given by:
+
+\[ \partial_{\theta\theta} \log \tilde f(\theta) = -\frac{1 - 2\theta + 2\theta^2}{2\theta^2(1-\theta)^2}. \]
+
+From this, Equation (4) is given by:
+
+\[ \frac{1}{2}\, \theta(1-\theta)(1-2\theta)\, \partial_\theta \log \tilde h(\theta) + \theta^2 - \theta = -\frac{1}{4}, \]
+
+and the solution, log ˜h(θ), is −(1/2) log{θ(1 − θ)}. Hence, the asymptotically constant-risk prior, π(θ; N), is a
+Beta prior with the parameters, $\alpha = \sqrt{N}/2$ and $\beta = \sqrt{N}/2$. Note that the asymptotically constant-risk prior
+coincides with the exact minimax prior. Since $g^{X,\theta\theta}(\theta_{\max}) = 1/4$ and $g^{X,\theta\theta}(\theta_{\max})\, \partial_{\theta\theta} \log \tilde f(\theta_{\max}) = -1$, the
+mean squared error of the asymptotically constant-risk Bayes estimator, ˆθπ, agrees with (9) up to O(N−2).
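+
+The steps of Example 2 can be verified by direct substitution. The following is a sketch, using only the Bernoulli quantities listed in the example and the forms of Equations (3) and (4); the normalizing constants of ˜f and ˜h are left unspecified.
+
+```latex
+\begin{align*}
+% Eikonal equation (3):
+\partial_\theta \log \tilde f(\theta) &= \frac{1-2\theta}{2\theta(1-\theta)}
+ \;\Longrightarrow\;
+ \theta^2(1-\theta)^2 \{\partial_\theta \log \tilde f(\theta)\}^2 = \frac{(1-2\theta)^2}{4} = \frac{1}{4} - \theta(1-\theta), \\
+% Equation (4), solved for the derivative of \log \tilde h:
+\frac{1}{2}\,\theta(1-\theta)(1-2\theta)\, \partial_\theta \log \tilde h(\theta) &= \theta(1-\theta) - \frac{1}{4} = -\frac{(1-2\theta)^2}{4}
+ \;\Longrightarrow\;
+ \partial_\theta \log \tilde h(\theta) = -\frac{1-2\theta}{2\theta(1-\theta)}, \\
+% Resulting prior, with |g^X(\theta)|^{1/2} = \{\theta(1-\theta)\}^{-1/2}:
+\pi(\theta; N) &\propto \{\theta(1-\theta)\}^{-1/2}\, \{\theta(1-\theta)\}^{\sqrt{N}/2}\, \{\theta(1-\theta)\}^{-1/2}
+ = \{\theta(1-\theta)\}^{\sqrt{N}/2 - 1}, \\
+% Risk constants at \theta_{\max} = 1/2:
+g^{X,\theta\theta}(1/2) &= \frac{1}{4}, \quad \partial_{\theta\theta} \log \tilde f(1/2) = -4
+ \;\Longrightarrow\;
+ \frac{1}{N}\cdot\frac{1}{4} + \frac{2}{N\sqrt{N}}\cdot\frac{1}{16}\cdot(-4) = \frac{1}{4N} - \frac{1}{2N\sqrt{N}}.
+\end{align*}
+```
+
+The third line is the Beta density with both parameters equal to √N/2, and the last line reproduces the first two terms of (9).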
+
+5. Application to the Prediction of the Binary Regression Model under the Covariate Shift
+
+In this section, we construct asymptotically constant-risk priors in the prediction based on the
+binary regression model under the covariate shift; see [10].
+We consider that we predict a binary response variable, y, based on the binary response variables,
+x(N). We assume that the target variable, y, and the data, x(N), follow the logistic regression models
+with the same parameter, β, given by:
+
+\[ \log \frac{\Pi_x}{1 - \Pi_x} = \alpha + z\beta \]
+
+and:
+
+\[ \log \frac{\Pi_y}{1 - \Pi_y} = \tilde\alpha + \tilde z \beta, \]
+
+
+where Πx is the success probability of the data and Πy is the success probability of the target variable.
+Let α and ˜α denote known constant terms, and let β denote the common unknown parameter. Further,
+we assume that the covariates, z and ˜z, are different.
+Using the parameter θ = Πx, we convert this predictive setting to binomial prediction where the
+data, x, and the target variable, y, are distributed according to:
+
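+\[ p(x|\theta) := \begin{cases} \theta & \text{if } x = 1, \\ 1 - \theta & \text{if } x = 0, \end{cases} \]
+
+and:
+
+\[ q(y|\theta) := \begin{cases} \dfrac{e^{\tilde\alpha - \tilde z z^{-1} \alpha}\, \theta^{\tilde z z^{-1}}}{(1-\theta)^{\tilde z z^{-1}} + e^{\tilde\alpha - \tilde z z^{-1}\alpha}\, \theta^{\tilde z z^{-1}}} & \text{if } y = 1, \\[2ex] \dfrac{(1-\theta)^{\tilde z z^{-1}}}{(1-\theta)^{\tilde z z^{-1}} + e^{\tilde\alpha - \tilde z z^{-1}\alpha}\, \theta^{\tilde z z^{-1}}} & \text{if } y = 0, \end{cases} \]
+
+respectively. We obtain the Fisher information for x and for y as:
+
+\[ g^X_{\theta\theta}(\theta) = \frac{1}{\theta(1-\theta)} \]
+
+and:
+
+\[ g^Y_{\theta\theta}(\theta) = \left( \frac{\tilde z}{z} \right)^2 \frac{e^{-\tilde\alpha + \tilde z z^{-1}\alpha}\, (1-\theta)^{\tilde z z^{-1} - 2}\, \theta^{\tilde z z^{-1} - 2}}{\left\{ \theta^{\tilde z z^{-1}} + e^{-\tilde\alpha + \tilde z z^{-1}\alpha}\, (1-\theta)^{\tilde z z^{-1}} \right\}^2}, \]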
+
+respectively.
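+
+The expression for the Fisher information of y can be recovered with the chain rule. The sketch below uses the shorthand r and c, which are introduced here only for illustration and do not appear in the original text.
+
+```latex
+% Assumed shorthand (not in the original): r := \tilde z z^{-1}, c := \tilde\alpha - r\alpha.
+\begin{align*}
+\log \frac{\Pi_y}{1-\Pi_y} &= c + r \log \frac{\theta}{1-\theta},
+ \qquad
+ \frac{d}{d\theta} \log \frac{\Pi_y}{1-\Pi_y} = \frac{r}{\theta(1-\theta)}, \\
+\Pi_y (1-\Pi_y) &= \frac{e^{c}\, \theta^{r} (1-\theta)^{r}}{\{(1-\theta)^{r} + e^{c}\theta^{r}\}^2}, \\
+g^Y_{\theta\theta}(\theta) &= \Pi_y(1-\Pi_y) \left( \frac{r}{\theta(1-\theta)} \right)^2
+ = \frac{r^2 e^{c}\, \theta^{r-2} (1-\theta)^{r-2}}{\{(1-\theta)^{r} + e^{c}\theta^{r}\}^2}
+ = \frac{r^2 e^{-c}\, \theta^{r-2} (1-\theta)^{r-2}}{\{\theta^{r} + e^{-c}(1-\theta)^{r}\}^2}.
+\end{align*}
+```
+
+The last equality multiplies the numerator and the denominator by $e^{-2c}$, which gives the form displayed above.
+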
+For simplicity, we consider the setting where z = 1, ˜z = 2 and α = ˜α = 0. The geometrical
+quantities for the model, {p(x|θ) : θ ∈ Θ}, are given by:
+
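+\[ g^X_{\theta\theta}(\theta) = \frac{1}{\theta(1-\theta)}, \quad \overset{m}{\Gamma}{}^X_{\theta\theta,\theta}(\theta) = 0, \quad \overset{e}{\Gamma}{}^X_{\theta\theta,\theta}(\theta) = -\frac{1-2\theta}{\theta^2(1-\theta)^2}, \quad \text{and} \quad T^X_{\theta\theta\theta}(\theta) = \frac{1-2\theta}{\theta^2(1-\theta)^2}, \]
+
+respectively. In the same manner, the geometrical quantities for the model, {q(y|θ) : θ ∈ Θ}, are
+given by:
+
+\[ g^Y_{\theta\theta}(\theta) = \frac{4}{\{(1-\theta)^2 + \theta^2\}^2}, \quad \overset{m}{\Gamma}{}^Y_{\theta\theta,\theta}(\theta) = \frac{4(1-2\theta)(1 + 2\theta - 2\theta^2)}{\theta(1-\theta)\{(1-\theta)^2 + \theta^2\}^3}, \]
+\[ \overset{e}{\Gamma}{}^Y_{\theta\theta,\theta}(\theta) = -\frac{4(1-2\theta)}{\theta(1-\theta)\{(1-\theta)^2 + \theta^2\}^2}, \quad \text{and} \quad T^Y_{\theta\theta\theta}(\theta) = \frac{8(1-2\theta)}{\theta(1-\theta)\{(1-\theta)^2 + \theta^2\}^3}, \]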
+respectively.
+Using these quantities, Equation (3) is given by:
+
+\[ \frac{4\theta^2(1-\theta)^2}{\{\theta^2 + (1-\theta)^2\}^2}\, (\partial_\theta \log \tilde f(\theta))^2 = 4 - \frac{4\theta(1-\theta)}{\{\theta^2 + (1-\theta)^2\}^2}. \]
+
+
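+By noting that the maximum point of $g^{X,\theta\theta}(\theta) g^Y_{\theta\theta}(\theta)$ is 1/2, the solution, log ˜f (θ), of this equation is
+given by:
+
+\begin{align*}
+\log \tilde f(\theta) &= 2\sqrt{1 - \theta + \theta^2} + \log\{\theta(1-\theta)\} \\
+&\quad - \log\left(2 - \theta + 2\sqrt{1 - \theta + \theta^2}\right) - \log\left(1 + \theta + 2\sqrt{1 - \theta + \theta^2}\right).
+\end{align*}
+
+Using this solution, we obtain the solution of Equation (4) given by:
+
+\begin{align*}
+\log \tilde h(\theta) = \frac{1}{6} \Big[ & -\frac{1}{1-\theta} - \frac{1}{\theta} - 12\theta(1-\theta) - 12\sqrt{3}\sqrt{1 - \theta + \theta^2} \\
+& + (3 - 6\sqrt{3})\{\log\theta + \log(1-\theta)\} - 3\log(1 - \theta + \theta^2) + 10\log\{(1-\theta)^2 + \theta^2\} \\
+& - 6\log\left(\sqrt{3} + 2\sqrt{1 - \theta + \theta^2}\right) + 6\sqrt{3}\log\left\{1 + (1-\theta) + 2\sqrt{1 - \theta + \theta^2}\right\} \\
+& + 6\sqrt{3}\log\left\{1 + \theta + 2\sqrt{1 - \theta + \theta^2}\right\} \Big].
+\end{align*}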
+
+The asymptotically constant-risk priors for the different sample sizes are shown in Figure 1. The prior
+weight is found to be more concentrated to 1/2 as the sample size, N, grows.
+In this example, we obtain the Kullback–Leibler risk of the Bayesian predictive density based on
+the asymptotically constant-risk prior, π(θ; N), as:
+
+\[ R(\theta, \hat q_\pi(y|x^{(N)})) = \frac{2}{N} - \frac{4\sqrt{3}}{N\sqrt{N}} + O(N^{-2}). \]
+
+We compare this value with the Bayes risk calculated using the Monte Carlo simulation; see Figure 2.
+As the sample size, N, grows, the difference appears negligible. Further, we compare this value with
+the risk itself calculated by the Monte Carlo simulation; see Figure 3. As the sample size, N, grows, the
+risk becomes more constant.
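+
+The two constants in this asymptotic risk can be traced back to Equation (7). The sketch below assumes the setting z = 1, ˜z = 2 and α = ˜α = 0 used in this section; the shorthand u is introduced here for illustration only.
+
+```latex
+% Assumed shorthand (not in the original): u := \theta(1-\theta), so that \theta^2 + (1-\theta)^2 = 1 - 2u.
+\begin{align*}
+g^{X,\theta\theta}(\theta)\, g^Y_{\theta\theta}(\theta) &= \frac{4u}{(1-2u)^2},
+ \quad \text{increasing in } u \in (0, 1/4]
+ \;\Longrightarrow\; \theta_{\max} = \tfrac{1}{2}, \quad \frac{1}{2N}\cdot 4 = \frac{2}{N}, \\
+\{\partial_\theta \log \tilde f(\theta)\}^2 &= \frac{(1-2u)^2 - u}{u^2} = \frac{(1-u)(1-4u)}{u^2}
+ \;\Longrightarrow\;
+ \partial_\theta \log \tilde f(\theta) = \frac{(1-2\theta)\sqrt{1-\theta+\theta^2}}{\theta(1-\theta)}, \\
+\partial_{\theta\theta} \log \tilde f(1/2) &= -2 \cdot \frac{\sqrt{3}/2}{1/4} = -4\sqrt{3},
+ \qquad
+ \mathring{g}^{\theta\theta}(1/2) = \left(\tfrac{1}{4}\right)^2 \cdot \frac{4}{(1/2)^2} = 1, \\
+\frac{1}{N\sqrt{N}}\, \mathring{g}^{\theta\theta}(1/2)\, \partial_{\theta\theta} \log \tilde f(1/2) &= -\frac{4\sqrt{3}}{N\sqrt{N}}.
+\end{align*}
+```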
+
+Figure 1. Asymptotically constant-risk prior in the prediction where the data are distributed according
+to the binomial distribution, Bin(N, θ), and the target variable is distributed according to the binomial
+distribution, Bin(1, θ2/(θ2 + (1 − θ)2)).
+
+
+Figure 2. Bayes risk based on the asymptotically constant-risk prior in the prediction where the data
+are distributed according to the binomial distribution, Bin(N, θ), and the target variable is distributed
+according to the binomial distribution, Bin(1, θ2/(θ2 + (1 − θ)2)).
+
+Figure 3. Comparison of the Kullback–Leibler risk calculated using the Monte Carlo simulations and
+the asymptotic risk, $2/N - 4\sqrt{3}/(N\sqrt{N})$, in the prediction where the data are distributed according
+to the binomial distribution, Bin(N, θ), and the target variable is distributed according to the binomial
+distribution, Bin(1, θ2/(θ2 + (1 − θ)2)).
+
+6. Discussion and Conclusions
+
+We have considered the setting where the quantity, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$ (the trace of the product of the
+inverse Fisher information matrix, $g^{X,ij}(\theta)$, and the Fisher information matrix, $g^Y_{ij}(\theta)$), has a unique
+maximum point, and we have investigated the asymptotically constant-risk prior in the sense that the
+asymptotic risk is constant up to O(N−2).
+In Section 3, we have considered the prior depending on the sample size, N, and constructed
+the asymptotically constant-risk prior using Equations (3) and (4). In Section 4, we have clarified the
+relationship between the subminimax estimator problem based on the mean squared error and the
+prediction where the distributions of data and target variables are different. In Section 5, we have
+constructed the asymptotically constant-risk prior in the prediction based on the logistic regression
+model under the covariate shift.
+We have assumed that the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$, is finite. However, the trace may diverge in the
+non-compact parameter space; for example, it diverges under the predictive setting, where the
+distribution, q(y|θ), of the target variable is the Poisson distribution and the data distribution, p(x|θ),
+is the exponential distribution, with Θ equivalent to R. Therefore, for our future work, in such a
+setting, we should adopt criteria other than minimaxity.
+
+Acknowledgments: The authors thank the referees for their helpful comments. This research was partially
+supported by a Grant-in-Aid for Scientific Research (23650144, 26280005).
+
+Author Contributions: Both authors contributed to the research and writing of this paper. Both authors read and
+approved the final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+Appendix A
+
+We prove Theorem 1. First, we introduce some lemmas for the proof. For the expansion, we
+follow six steps (the first five steps are arranged in the form of lemmas): the first is to
+expand the MAP estimator; the second is to calculate its bias and mean squared error; the third
+is to expand the Kullback–Leibler risk using the ˆθπ-plugin predictive density, q(y| ˆθπ); the fourth is to
+expand the Bayesian predictive density based on the prior π(θ; N); the fifth is to expand the Bayesian
+estimator minimizing the Bayes risk; and the last is to prove Theorem 1 using these lemmas.
+We use some additional notations for the expansion. Let ˆθπ be the maximum point of the
+scalar function log p(x(N)|θ) + log{π(θ; N)/|gX(θ)|1/2}. Let l(θ|x(N)) denote the log likelihood of
+the data, x(N). Let $l_{ij}(\theta|x^{(N)})$, $l_{ijk}(\theta|x^{(N)})$ and $l_{ijkl}(\theta|x^{(N)})$ be the derivatives of order 2, 3 and 4 of the
+log likelihood, l(θ|x(N)). Let $H_{ij}(\theta|x^{(N)})$ denote the quantity, $l_{ij}(\theta|x^{(N)}) + N g^X_{ij}(\theta)$. Let $\tilde l_i(\theta|x^{(N)})$ and
+$\tilde H_{ij}(\theta|x^{(N)})$ denote $(1/\sqrt{N})\, l_i(\theta|x^{(N)})$ and $(1/\sqrt{N})\, H_{ij}(\theta|x^{(N)})$, respectively. In addition, the brackets ( )
+denote the symmetrization: for any two tensors, $a_{ij}$ and $b_{ij}$, $a_{i(j} b_{k)l}$ denotes $(a_{ij} b_{kl} + a_{ik} b_{jl})/2$.
+
+Lemma A1. Let ˆθπ be the maximum point of log p(x(N)|θ) + log{π(θ; N)/|gX(θ)|1/2}. Then, the i-th
+component of this estimator ˆθπ is expanded as follows:
+
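+\begin{align*}
+\hat\theta^i_\pi &= \theta^i + \frac{1}{\sqrt{N}}\, g^{X,ik}(\theta)\, \tilde l_k(\theta|x^{(N)}) + \frac{1}{\sqrt{N}}\, g^{X,ik}(\theta)\, \partial_k \log f(\theta) \\
+&\quad + \frac{1}{N}\, g^{X,ik}(\theta)\, \tilde H_{km}(\theta|x^{(N)})\, g^{X,mr}(\theta)\, \tilde l_r(\theta|x^{(N)}) \\
+&\quad + \frac{1}{2N}\, g^{X,ik}(\theta)\, L^X_{kmr}(\theta)\, g^{X,mq}(\theta)\, g^{X,rs}(\theta)\, \tilde l_q(\theta|x^{(N)})\, \tilde l_s(\theta|x^{(N)}) \\
+&\quad + \frac{1}{N}\, g^{X,ik}(\theta)\, \tilde H_{km}(\theta|x^{(N)})\, g^{X,mr}(\theta)\, \partial_r \log f(\theta) \\
+&\quad + \frac{1}{N}\, g^{X,ik}(\theta)\, L^X_{kmr}(\theta)\, g^{X,mq}(\theta)\, g^{X,rs}(\theta)\, \tilde l_q(\theta|x^{(N)})\, \partial_s \log f(\theta) \\
+&\quad + \frac{1}{2N}\, g^{X,ik}(\theta)\, L^X_{kmr}(\theta)\, g^{X,mq}(\theta)\, g^{X,rs}(\theta)\, \partial_q \log f(\theta)\, \partial_s \log f(\theta) \\
+&\quad + \frac{1}{N}\, g^{X,ik}(\theta)\, g^{X,mq}(\theta)\, \partial_{km} \log f(\theta)\, \tilde l_q(\theta|x^{(N)}) \\
+&\quad + \frac{1}{N}\, g^{X,ik}(\theta)\, g^{X,mq}(\theta)\, \partial_{km} \log f(\theta)\, \partial_q \log f(\theta) \\
+&\quad + \frac{1}{N}\, g^{X,ik}(\theta)\, \partial_k \log h(\theta) + O_P(N^{-3/2}). \tag{A1}
+\end{align*}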
+
+Proof. By the definition of ˆθπ, we get the equation given by:
+
+∂i log p(x(N)| ˆθπ) + ∂i log
+π( ˆθπ; N)
+
+|gX( ˆθπ)|1/2
+=
+0.
+
+122
+
+
+Entropy 2014, 16, 3026–3048
+
+From our assumption that prior π(θ; N) has the form given by:
+
+π(θ; N)
+
+|gX(θ)|1/2
+∝
+exp{
+√
+
+N log f (θ) + log h(θ)},
+
+we rewrite this equation as:
+
+\[ \partial_i \log p(x^{(N)}|\hat\theta_\pi) + \sqrt{N}\,\partial_i \log f(\hat\theta_\pi) + \partial_i \log h(\hat\theta_\pi) = 0. \]
+
+By applying Taylor expansion around θ to this new equation, we derive the following expansion:
+
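+\begin{align*}
+& \partial_i \log p(x^{(N)}|\theta) + \{\partial_{ij} \log p(x^{(N)}|\theta)\}(\hat\theta_\pi^j - \theta^j) \\
+& \quad + \frac{1}{2}\{\partial_{ijk} \log p(x^{(N)}|\theta)\}(\hat\theta_\pi^j - \theta^j)(\hat\theta_\pi^k - \theta^k) + \sqrt{N}\,\partial_i \log f(\theta) \\
+& \quad + \sqrt{N}\{\partial_{ij} \log f(\theta)\}(\hat\theta_\pi^j - \theta^j) + \partial_i \log h(\theta) + o_P(1) = 0.
+\end{align*}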
+
+From the law of large numbers and the central limit theorem, we rewrite the above expansion as:
+
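+\begin{align*}
+N g^X_{ij}(\theta)(\hat\theta_\pi^j - \theta^j) = {} & \partial_i \log p(x^{(N)}|\theta) + \sqrt{N}\,\partial_i \log f(\theta) + H_{ij}(\theta|x^{(N)})(\hat\theta_\pi^j - \theta^j) \\
+& + \frac{N}{2}\, L_{ijk}(\theta)(\hat\theta_\pi^j - \theta^j)(\hat\theta_\pi^k - \theta^k) + \sqrt{N}\,\partial_{ij} \log f(\theta)(\hat\theta_\pi^j - \theta^j) \\
+& + \partial_i \log h(\theta) + o_P(1). \tag{A2}
+\end{align*}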
+
+By substituting the deviation, ˆθπ − θ, recursively into Expansion (A2), we obtain Expansion (A1).
+
+Lemma A2. Let ˆθπ be the maximum point of log p(x(N)|θ) + log{π(θ; N)/|gX(θ)|1/2}. Then, the i-th
+component of the bias of the estimator, ˆθπ, is given by:
+
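+\begin{align*}
+\mathrm{E}_{X^{(N)}}[\hat\theta_\pi^i] = {} & \theta^i + \frac{1}{\sqrt{N}}\, g^{X,ik}(\theta)\,\partial_k \log f(\theta) \\
+& - \frac{1}{2N}\,\overset{m}{\Gamma}{}^{X,i}(\theta) + \frac{1}{2N}\, g^{X,ik}(\theta)\, g^{X,mq}(\theta)\, g^{X,rs}(\theta)\, L^X_{kmr}(\theta)\,\partial_q \log f(\theta)\,\partial_s \log f(\theta) \\
+& + \frac{1}{N}\, g^{X,ik}(\theta)\, g^{X,mq}(\theta)\,\partial_{km} \log f(\theta)\,\partial_q \log f(\theta) \\
+& + \frac{1}{N}\, g^{X,ik}(\theta)\,\partial_k \log h(\theta) + O(N^{-3/2}). \tag{A3}
+\end{align*}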
+
+
+The (i, j)-component of the mean squared error of ˆθπ is given by:
+
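+\begin{align*}
+\mathrm{E}_{X^{(N)}}[(\hat\theta_\pi^i - \theta^i)(\hat\theta_\pi^j - \theta^j)]
+= {} & \frac{1}{N}\, g^{X,ij}(\theta) + \frac{1}{N}\, g^{X,ik}(\theta)\, g^{X,jl}(\theta)\,\partial_k \log f(\theta)\,\partial_l \log f(\theta) \\
+& - \frac{1}{N\sqrt{N}}\, g^{X,k(i}(\theta)\,\overset{m}{\Gamma}{}^{X,j)}(\theta)\,\partial_k \log f(\theta) + \frac{2}{N\sqrt{N}}\, g^{X,k(i}(\theta)\, g^{X,j)l}(\theta)\,\partial_{kl} \log f(\theta) \\
+& + \frac{2}{N\sqrt{N}}\, g^{X,k(i}(\theta)\,\partial_k g^{X,j)l}(\theta)\,\partial_l \log f(\theta) \\
+& + \frac{1}{N\sqrt{N}}\, g^{X,k(i}(\theta)\, g^{X,j)l}(\theta)\, g^{X,nr}(\theta)\, g^{X,pt}(\theta)\, L^X_{lrt}(\theta)\,\partial_k \log f(\theta)\,\partial_n \log f(\theta)\,\partial_p \log f(\theta) \\
+& + \frac{2}{N\sqrt{N}}\, g^{X,k(i}(\theta)\, g^{X,j)l}(\theta)\, g^{X,nr}(\theta)\,\partial_{ln} \log f(\theta)\,\partial_r \log f(\theta)\,\partial_k \log f(\theta) \\
+& + \frac{2}{N\sqrt{N}}\, g^{X,k(i}(\theta)\, g^{X,j)l}(\theta)\,\partial_k \log f(\theta)\,\partial_l \log h(\theta) \\
+& + O(N^{-2}), \tag{A4}
+\end{align*}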
+
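+where $g^{X,k(i}(\theta)\,\overset{m}{\Gamma}{}^{X,j)}(\theta)$ denotes $(1/2)\{g^{X,ki}(\theta)\,\overset{m}{\Gamma}{}^{X,j}(\theta) + g^{X,kj}(\theta)\,\overset{m}{\Gamma}{}^{X,i}(\theta)\}$ and $g^{X,k(i}(\theta)\,\partial_k g^{X,j)l}(\theta)$
+denotes $(1/2)\{g^{X,ki}(\theta)\,\partial_k g^{X,jl}(\theta) + g^{X,kj}(\theta)\,\partial_k g^{X,il}(\theta)\}$. The (i, j, k)-component of the mean of the third
+power of the deviation, $\hat\theta_\pi - \theta$, is given by: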
+
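+\begin{align*}
+\mathrm{E}_{X^{(N)}}[(\hat\theta_\pi^i - \theta^i)(\hat\theta_\pi^j - \theta^j)(\hat\theta_\pi^k - \theta^k)]
+= {} & \frac{1}{N\sqrt{N}}\, g^{X,is}(\theta)\, g^{X,jt}(\theta)\, g^{X,ku}(\theta)\,\partial_s \log f(\theta)\,\partial_t \log f(\theta)\,\partial_u \log f(\theta) \\
+& + \frac{3}{N\sqrt{N}}\, g^{X,(ij}(\theta)\, g^{X,k)l}(\theta)\,\partial_l \log f(\theta) + O(N^{-2}). \tag{A5}
+\end{align*}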
+
+Proof. First, using Lemma A1, we determine the i-th component of the bias of ˆθπ given by:
+
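+\begin{align*}
+\mathrm{E}_{X^{(N)}}[\hat\theta_\pi^i - \theta^i] = {} & \frac{1}{\sqrt{N}}\, g^{X,ik}(\theta)\,\partial_k \log f(\theta) \\
+& - \frac{1}{2N}\,\overset{m}{\Gamma}{}^{X,i}(\theta) + \frac{1}{2N}\, g^{X,ik}(\theta)\, g^{X,mq}(\theta)\, g^{X,rs}(\theta)\, L^X_{kmr}(\theta)\,\partial_q \log f(\theta)\,\partial_s \log f(\theta) \\
+& + \frac{1}{N}\, g^{X,ik}(\theta)\, g^{X,mq}(\theta)\,\partial_{km} \log f(\theta)\,\partial_q \log f(\theta) + \frac{1}{N}\, g^{X,ik}(\theta)\,\partial_k \log h(\theta) + O(N^{-3/2}).
+\end{align*}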
+
+Second, consider the following relationship:
+
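+\begin{align*}
+& \mathrm{E}_{X^{(N)}}\Big[ \Big\{ \hat\theta_\pi^i - \theta^i - \frac{1}{\sqrt{N}}\, g^{X,ik}(\theta)\,\tilde l_k(\theta|x^{(N)}) - \frac{1}{\sqrt{N}}\, g^{X,ik}(\theta)\,\partial_k \log f(\theta) \Big\} \\
+& \qquad \times \Big\{ \hat\theta_\pi^j - \theta^j - \frac{1}{\sqrt{N}}\, g^{X,jl}(\theta)\,\tilde l_l(\theta|x^{(N)}) - \frac{1}{\sqrt{N}}\, g^{X,jl}(\theta)\,\partial_l \log f(\theta) \Big\} \Big] \\
+& = \mathrm{E}_{X^{(N)}}[(\hat\theta_\pi^i - \theta^i)(\hat\theta_\pi^j - \theta^j)] + \frac{1}{N}\, g^{X,ij}(\theta) + \frac{1}{N}\, g^{X,ik}(\theta)\, g^{X,jl}(\theta)\,\partial_k \log f(\theta)\,\partial_l \log f(\theta) \\
+& \quad - \frac{1}{\sqrt{N}}\, g^{X,ki}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta_\pi^j - \theta^j)\,\tilde l_k(\theta|x^{(N)})] - \frac{1}{\sqrt{N}}\, g^{X,kj}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta_\pi^i - \theta^i)\,\tilde l_k(\theta|x^{(N)})] \\
+& \quad - \frac{1}{\sqrt{N}}\, g^{X,ki}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta_\pi^j - \theta^j)\,\partial_k \log f(\theta)] - \frac{1}{\sqrt{N}}\, g^{X,kj}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta_\pi^i - \theta^i)\,\partial_k \log f(\theta)]. \tag{A6}
+\end{align*}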
+
+
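+By differentiating the j-th component of the bias, $\mathrm{E}_{X^{(N)}}[\hat\theta_\pi^j - \theta^j]$, we obtain the equation given by:
+
+\[ \frac{1}{N}\,\partial_k \mathrm{E}_{X^{(N)}}[\hat\theta_\pi^j - \theta^j] = -\frac{1}{N}\,\delta^j_k + \frac{1}{\sqrt{N}}\,\mathrm{E}_{X^{(N)}}[(\hat\theta_\pi^j - \theta^j)\,\tilde l_k(\theta|x^{(N)})], \tag{A7} \]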
+
+where $\delta^i_j$ denotes the Kronecker delta: if the upper and the lower indices agree, then the value of
+this function is one and otherwise zero. Equation (A7) has been used by [2,16,19]. By substituting
+Equations (A7) and (A3) into Relationship (A6), we obtain the (i, j)-component of the mean squared
+error of ˆθπ given by:
+
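+\begin{align*}
+\mathrm{E}_{X^{(N)}}[(\hat\theta_\pi^i - \theta^i)(\hat\theta_\pi^j - \theta^j)]
+= {} & \frac{1}{N}\, g^{X,ij}(\theta) + \frac{1}{N}\, g^{X,ik}(\theta)\, g^{X,jl}(\theta)\,\partial_k \log f(\theta)\,\partial_l \log f(\theta) \\
+& - \frac{1}{N\sqrt{N}}\, g^{X,k(i}(\theta)\,\overset{m}{\Gamma}{}^{X,j)}(\theta)\,\partial_k \log f(\theta) + \frac{2}{N\sqrt{N}}\, g^{X,k(i}(\theta)\, g^{X,j)l}(\theta)\,\partial_{kl} \log f(\theta) \\
+& + \frac{2}{N\sqrt{N}}\, g^{X,k(i}(\theta)\,\partial_k g^{X,j)l}(\theta)\,\partial_l \log f(\theta) \\
+& + \frac{1}{N\sqrt{N}}\, g^{X,k(i}(\theta)\, g^{X,j)l}(\theta)\, g^{X,nr}(\theta)\, g^{X,pt}(\theta)\, L^X_{lrt}(\theta)\,\partial_k \log f(\theta)\,\partial_n \log f(\theta)\,\partial_p \log f(\theta) \\
+& + \frac{2}{N\sqrt{N}}\, g^{X,k(i}(\theta)\, g^{X,j)l}(\theta)\, g^{X,nr}(\theta)\,\partial_{ln} \log f(\theta)\,\partial_r \log f(\theta)\,\partial_k \log f(\theta) \\
+& + \frac{2}{N\sqrt{N}}\, g^{X,k(i}(\theta)\, g^{X,j)l}(\theta)\,\partial_k \log f(\theta)\,\partial_l \log h(\theta) + O(N^{-2}).
+\end{align*}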
+
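+Finally, by taking the expectation of the third power of the deviation, $\hat\theta_\pi^i - \theta^i$, we obtain the
+following expansion:
+
+\begin{align*}
+\mathrm{E}_{X^{(N)}}[(\hat\theta_\pi^i - \theta^i)(\hat\theta_\pi^j - \theta^j)(\hat\theta_\pi^k - \theta^k)]
+= {} & \frac{1}{N\sqrt{N}}\, g^{X,is}(\theta)\, g^{X,jt}(\theta)\, g^{X,ku}(\theta)\,\partial_s \log f(\theta)\,\partial_t \log f(\theta)\,\partial_u \log f(\theta) \\
+& + \frac{3}{N\sqrt{N}}\, g^{X,(ij}(\theta)\, g^{X,k)l}(\theta)\,\partial_l \log f(\theta) + O(N^{-2}).
+\end{align*}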
+
+
+Lemma A3. Let ˆθπ be the maximum point of log p(x(N)|θ) + log{π(θ; N)/|gX(θ)|1/2}.
+The Kullback–Leibler risk of the plug-in predictive density, q(y(N)| ˆθπ), with the estimator, ˆθπ, is
+expanded as follows:
+
+R(θ, q(y| ˆθπ))
+
+=
+1
+2N gY
+ij(θ)gX,ij(θ) + 1
+
+2N ˚gij(θ)∂i log f (θ)∂j log f (θ) +
+1
+
+N
+√
+
+N
+˚gij(θ)
+� e
+∇i∂j log f (θ)
+�
+
++
+1
+
+N
+√
+
+N
+˚gij(θ)gX,kl(θ)
+� e
+∇i∂k log f (θ)
+�
+∂j log f (θ)∂l log f (θ)
+
+−
+1
+
+3N
+√
+
+N
+TY
+ijk(θ)gX,is(θ)gX,jt(θ)gX,ku(θ)∂s log f (θ)∂t log f (θ)∂u log f (θ)
+
++
+1
+
+2N
+√
+
+N
+gY
+kl(θ)Ml
+ijgX,is(θ)gX,jt(θ)gX,ku(θ)∂s log f (θ)∂t log f (θ)∂u log f (θ)
+
++
+1
+
+2N
+√
+
+N
+gX,ij(θ)gY
+kl(θ)gX,kl(θ)Mm
+ij ∂m log f (θ) −
+1
+
+N
+√
+
+N
+TY
+ijk(θ)gX,ij(θ)gX,kl(θ)∂l log f (θ)
+
++
+1
+
+N
+√
+
+N
+˚gij(θ)Mk
+ij∂k log f (θ) +
+1
+
+N
+√
+
+N
+˚gij(θ)∂i log f (θ)∂j log h(θ) + O(N−2).
+(A8)
+
+Proof. By applying the Taylor expansion, the Kullback–Leibler risk, R(θ, q(y| ˆθπ)), is expanded as:
+
+Ex(N)[D(q(·|θ), q(·| ˆθπ))]
+
+= EX(N)
+
+��
+q(y|θ)
+�
+−li(θ|y) ˜θi
+π − 1
+
+2lij(θ|y)( ˆθi
+π − θi)( ˆθj
+π − θj)
+
+−1
+
+6lijk(θ|y)( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk) + OP(N−2)
+�
+dy
+�
+
+= 1
+
+2 gY
+ij(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)] − 1
+
+6 LY
+ijk(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk)] + O(N−2)
+
+= 1
+
+2 gY
+ij(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)]
+
++
+
+⎧
+⎪
+⎨
+
+⎪
+⎩
+
+3
+2
+
+m
+Γ Y
+(ij,k)(θ) − 1
+
+3TY
+ijk(θ)
+
+⎫
+⎪
+⎬
+
+⎪
+⎭
+EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk)] + O(N−2)
+
+= 1
+
+2 gY
+ij(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)] − 1
+
+3TY
+ijk(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk)]
+
++1
+
+2
+
+⎧
+⎪
+⎨
+
+⎪
+⎩
+gY
+kl(θ)
+
+m
+Γ Y,l
+ij (θ) − gY
+kl(θ)
+
+m
+Γ X,l
+ij
+(θ)
+
+⎫
+⎪
+⎬
+
+⎪
+⎭
+EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk)]
+
++1
+
+2 gY
+kl(θ)
+
+m
+Γ X,l
+ij
+(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk] + O(N−2),
+(A9)
+
+where $\overset{e}{\Gamma}{}^{Y}_{(ij,k)}$ denotes $(1/3)\{\overset{e}{\Gamma}{}^{Y}_{ij,k} + \overset{e}{\Gamma}{}^{Y}_{jk,i} + \overset{e}{\Gamma}{}^{Y}_{ki,j}\}$.
+
+
+By the definition of the predictive metric, ˚gij(θ) = gX
+ik(θ)gY,kl(θ)gX
+lj (θ), by Expansions (A4) and
+
+(A5) and by the relationship LX
+ijk(θ) = −
+
+e
+Γ X
+ij,k(θ) −
+e X
+jk,i(θ) −
+e X
+ki,j(θ) − TX
+ijk(θ), the last two terms of the
+above expansion (A9) are expanded as:
+
+1
+2 gY
+ij(θ)EX(N)[ ˆ(θ
+i
+π − θi)( ˆθj
+π − θj)] + 1
+
+2 gY
+kl(θ)
+
+m
+Γ X,l
+ij
+(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk)]
+
+=
+1
+2N gY
+ij(θ)gX,ij(θ) + 1
+
+2N ˚gij(θ)∂i log f (θ)∂j log f (θ)
+
++
+1
+
+N
+√
+
+N
+˚gij(θ)
+
+⎧
+⎪
+⎨
+
+⎪
+⎩
+∂ij log f (θ) −
+
+e
+Γ X,k
+ij
+(θ)∂k log f (θ)
+
+⎫
+⎪
+⎬
+
+⎪
+⎭
+
++
+1
+
+N
+√
+
+N
+˚gij(θ)gX,kl(θ)
+�
+∂ik log f (θ) −
+e X,m
+ik
+∂m log f (θ)
+�
+∂j log f (θ)∂l log f (θ)
+
++
+1
+
+N
+√
+
+N
+˚gij(θ)∂i log f (θ)∂j log h(θ) + O(N−2).
+(A10)
+
+By substituting Expansion (A10) into Expansion (A9), Expansion (A8) is obtained.
+
+Note that Expansion (A8) is invariant up to O(N−2) under the reparametrization, so that each
+term of this expansion is a scalar function of θ.
+
+Lemma A4. Let ˆθπ be the maximum point of log p(x(N)|θ) + log{π(θ; N)/|gX(θ)|1/2}. The Bayesian
+predictive density based on the prior, π(θ; N), is expanded as:
+
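+\begin{align*}
+\hat q_\pi(y|x^{(N)}) = {} & q(y|\hat\theta_\pi) + \frac{1}{N}\, g^{X,ij}(\hat\theta_\pi) \Big\{ \partial_i \log |g^X(\hat\theta_\pi)|^{1/2} - \overset{e}{\Gamma}{}^{X,k}_{ik}(\hat\theta_\pi) \Big\}\,\partial_j q(y|\hat\theta_\pi) \\
+& + \frac{1}{2N}\, g^{X,ij}(\hat\theta_\pi) \Big\{ \partial_{ij} q(y|\hat\theta_\pi) - \overset{m}{\Gamma}{}^{X,k}_{ij}(\hat\theta_\pi)\,\partial_k q(y|\hat\theta_\pi) \Big\} + O_P(N^{-3/2}). \tag{A11}
+\end{align*}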
+
+Proof. Let ˜θπ denote ˆθπ − θ. First, using a Taylor expansion twice, we expand the posterior density,
+π(θ|x(N)), as:
+
+π(θ|x(N))
+=
+|gX( ˆθπ)|
+1
+2
+π( ˆθπ)
+
+|gX( ˆθπ)|
+1
+2
+p(x(N)| ˆθπ) exp
+�
+−1
+
+2{−lij( ˆθπ|x(N))} ˜θi
+π ˜θj
+π
+
+�
+
+×
+
+�
+
+1 − {∂i log |gX( ˆθπ)|
+1
+2 } ˜θi
+π + 1
+
+2
+
+�
+∂ij|gX( ˆθπ)|
+1
+2
+
+|gX( ˆθπ)|
+1
+2
+
+�
+˜θi
+π ˜θj
+π + OP(N−3/2)
+
+�
+
+×
+�
+1 + 1
+
+2{
+√
+
+N∂ij log f ( ˆθπ)} ˜θi
+π ˜θj
+π − 1
+
+6{lijk( ˆθπ|x(N))} ˜θi
+π ˜θj
+π ˜θk
+π + 1
+
+2{log h( ˆθπ)} ˜θi
+π ˜θj
+π
+
+−1
+
+6{
+√
+
+N∂ijk log f ( ˆθπ)} ˜θi
+π ˜θj
+π ˜θk
+π + 1
+
+24lijkl( ˆθπ|x(N)) ˜θi
+π ˜θj
+π ˜θk
+π ˜θl
+π
+
++1
+
+2
+
+�1
+
+2{
+√
+
+N∂ij log f ( ˆθπ)} ˜θi
+π ˜θj
+π − 1
+
+6lijk( ˆθπ|xN) ˜θi
+π ˜θj
+π ˜θk
+π
+
+�
+
+×
+�1
+
+2{
+√
+
+N∂ij log f ( ˆθπ)} ˜θi
+π ˜θj
+π − 1
+
+6lijk( ˆθπ|x(N)) ˜θi
+π ˜θj
+π ˜θk
+π
+
+�
++ OP(N−3/2)
+�
+
+×
+
+��
+p(x(N)|θ) π(θ; N)
+
+|gX(θ)|
+1
+2
+|gX(θ)|
+1
+2 dθ
+
+�−1
+.
+
+
+We denote the N−1/2-order,
+N−1-order and N−3/2-order terms by (N−1/2)a0( ˜θπ; ˆθπ),
+(N−1)a1( ˜θπ; ˆθπ) and (N−3/2)a2( ˜θπ; ˆθπ), respectively. Then, this expansion is rewritten as:
+
+π(θ|x(N))
+=
+|gX( ˆθπ)|
+1
+2
+π( ˆθπ)
+
+|gX( ˆθπ)|
+1
+2
+p(x(N)| ˆθπ) exp
+�
+−1
+
+2{−lij( ˆθπ|x(N))} ˜θi
+π ˜θj
+π
+
+�
+
+×
+�
+1 +
+1
+√
+
+N
+a0( ˜θπ; ˆθπ)
+
++ 1
+
+N a1( ˜θπ; ˆθπ) +
+1
+
+N
+√
+
+N
+a2( ˜θπ; ˆθπ) + OP(N−2)
+�
+
+×
+
+��
+p(x(N)|θ) π(θ; N)
+
+|gX(θ)|
+1
+2
+|gX(θ)|
+1
+2 dθ
+
+�−1
+.
+
+To make the expansion easier to see, the following notations are used. Let φ(η; −lij( ˆθπ|x(N))) be
+the probability density function of the d-dimensional normal distribution with the precision matrix
+whose (i, j)-component is −lij( ˆθπ|x(N)). Let η = (η1, · · · , ηd) be a d-dimensional random vector
+distributed according to the normal density, φ(η; −lij( ˆθπ|x(N))). The notations, ¯a0( ˆθπ), ¯a1( ˆθπ), ¯a2( ˆθπ)
+and ˆωij( ˆθπ), denote the expectations of a0(η; ˆθπ), a1(η; ˆθπ), a2(η; ˆθπ) and ηiηj, respectively.
+Using the above notations, we get the following posterior expansion:
+
+π(θ|x(N))
+=
+φ( ˆθπ; −lij( ˆθπ|x(N)))
+
+×
+�
+1 +
+1
+√
+
+N
+{a0( ˜θπ; ˆθπ) − ¯a0( ˆθπ)} + 1
+
+N {a1( ˜θπ; ˆθπ) − ¯a1( ˆθπ)}
+
+− 1
+
+N ¯a0( ˆθπ){a0( ˜θπ; ˆθπ) − ¯a0( ˆθπ)} +
+1
+
+N
+√
+
+N
+{a2( ˜θπ; ˆθπ) − ¯a2( ˆθπ)}
+
+−
+1
+
+N
+√
+
+N
+¯a0( ˆθπ){a1( ˜θπ; ˆθπ) − ¯a1( ˆθπ)} −
+1
+
+N
+√
+
+N
+¯a1( ˆθπ){a0( ˜θπ; ˆθπ) − ¯a0( ˆθπ)}
+
++
+1
+
+N
+√
+
+N
+¯a2
+0( ˆθπ){a1( ˜θπ; ˆθπ) − ¯a1( ˆθπ)} + OP(N−2)
+�
+.
+(A12)
+
+Second, using (A12), the Bayesian predictive density, ˆqπ(y|x(N)), based on the prior, π(θ; N), is
+expanded as:
+
+ˆqπ(y|x(N))
+
+=
+�
+q(y| ˆθπ)
+
+�
+
+1 − {∂i log q(y| ˆθπ)} ˜θi
+π + 1
+
+2
+∂ijq(y| ˆθπ)
+
+q(y| ˆθπ)
+˜θi
+π ˜θj
+π + oP(N−1)
+
+�
+
+π(θ|xN)dθ
+
+=
+�
+q(y| ˆθπ)
+�
+1 + {∂i log |gX( ˆθπ)|
+1
+2 }{∂j log q(y| ˆθπ)} ˜θi
+π ˜θj
+π
+
++1
+
+6{∂ijk log p(x(N)| ˆθπ) +
+√
+
+N∂ijk log f ( ˆθπ)}{∂l log q(y| ˆθπ)} ˜θi
+π ˜θj
+π ˜θk
+π ˜θl
+π
+
++1
+
+2
+∂ijq(y| ˆθπ)
+
+q(y| ˆθπ)
+˜θi
+π ˜θj
+π + oP(N−1)
+
+�
+
+φ( ˜θπ; −lij( ˆθπ|xN))d ˜θπ
+
+= q(y| ˆθπ) + ˆωij( ˆθπ){∂i log |gX( ˆθπ)|
+1
+2 }∂jq(y| ˆθπ) + 1
+
+2 ˆωik( ˆθπ) ˆωjl( ˆθπ)lijk( ˆθπ|xN)∂lq(y| ˆθπ)
+
++1
+
+2 ˆωij( ˆθπ)∂ijq(y| ˆθπ) + OP(N−3/2).
+(A13)
+
+Here, the following two equations hold:
+
+\[ -l_{ij}(\hat\theta_\pi|x^{(N)}) = N g^X_{ij}(\hat\theta_\pi) - \sqrt{N}\,\tilde H_{ij}(\hat\theta_\pi|x^{(N)}) + O_P(1), \tag{A14} \]
+
+
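+\[ l_{ijk}(\hat\theta_\pi|x^{(N)}) = -2N\,\overset{e}{\Gamma}{}^{X}_{ij,k}(\hat\theta_\pi) - N\,\overset{m}{\Gamma}{}^{X}_{ik,j}(\hat\theta_\pi) + \sqrt{N}\,\tilde H_{ijk}(\hat\theta|x^{(N)}). \tag{A15} \]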
+
+By combining Equation (A14) with the Sherman–Morrison–Woodbury formula, the following
+expansion is obtained:
+
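+\[ \hat\omega^{ij}(\hat\theta_\pi) = \frac{1}{N}\, g^{X,ij}(\hat\theta_\pi) + \frac{1}{N\sqrt{N}}\, g^{X,ik}(\hat\theta_\pi)\, g^{X,jl}(\hat\theta_\pi)\, H_{kl}(\hat\theta_\pi|x^{(N)}) + O_P(N^{-2}). \tag{A16} \]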
+
+By substituting Equations (A14), (A15) and (A16) into Expansion (A13), Expansion (A11) is
+obtained.
+
+Note that the integration of Expansion (A11) is one up to OP(N−2). Further, Expansion (A11) is
+similar to the expansion in [2]. However, the estimator that is the center of the expansion is different,
+because of the dependence of the prior on the sample size.
+
+Lemma A5. The Bayesian estimator, ˆθopt, minimizing the Bayes risk, ∫ R(θ, q(y| ˆθ))dπ(θ; N), among
+plug-in predictive densities is given by:
+
+\begin{align*}
+\hat\theta_{\mathrm{opt}}^i = {} & \hat\theta_\pi^i + \frac{1}{2N}\, g^{X,ij}(\hat\theta_\pi)\, T^X_j(\hat\theta_\pi) \\
+& + \frac{1}{2N}\, g^{X,jk}(\hat\theta_\pi) \Big\{ \overset{m}{\Gamma}{}^{Y,i}_{jk}(\hat\theta_\pi) - \overset{m}{\Gamma}{}^{X,i}_{jk}(\hat\theta_\pi) \Big\} + O_P(N^{-3/2}). \tag{A17}
+\end{align*}
+
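+Proof. The Bayes risk, ∫ R(θ, q(y| ˆθ))dπ(θ; N), is decomposed as:
+
+\begin{align*}
+\int R(\theta, q(y|\hat\theta))\,d\pi(\theta; N) = {} & \int \pi(\theta; N) \int p(x^{(N)}|\theta) \int q(y|\theta) \log \frac{q(y|\theta)}{\hat q_\pi(y|x^{(N)})}\,dy\,dx^{(N)}\,d\theta \\
+& + \int \pi(\theta; N) \int p(x^{(N)}|\theta) \int q(y|\theta) \log \frac{\hat q_\pi(y|x^{(N)})}{q(y|\hat\theta)}\,dy\,dx^{(N)}\,d\theta.
+\end{align*}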
+
+The first term of this decomposition is not dependent on ˆθ. From Fubini’s theorem and Lemma A4, the
+proof is completed.
+
+Using these lemmas, we prove Theorem 1. First, we find that the Kullback–Leibler risk of the
+plug-in predictive density with the estimator, ˆθopt, defined in Lemma A5, is given by:
+
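+\begin{align*}
+R(\theta, q(y|\hat\theta_{\mathrm{opt}})) = {} & R(\theta, q(y|\hat\theta_\pi)) + \frac{1}{2N\sqrt{N}}\,\mathring{g}^{ij}(\theta)\, T^X_i(\theta)\,\partial_j \log f(\theta) \\
+& + \frac{1}{2N\sqrt{N}}\, g^{X,im}(\theta)\, g^Y_{ij}(\theta)\, g^{X,kl}(\theta) \\
+& \quad \times \Big\{ \overset{m}{\Gamma}{}^{Y,j}_{kl}(\theta) - \overset{m}{\Gamma}{}^{X,j}_{kl}(\theta) \Big\}\,\partial_m \log f(\theta). \tag{A18}
+\end{align*}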
+
+Using Expansion (A18) and Lemma A3, we expand the Kullback–Leibler risk, R(θ, ˆqπ(y|x(N))).
+Here, the risk, R(θ, ˆqπ(y|x(N))), is equal to the risk, R(θ, q(y| ˆθopt)), up to O(N−2), because we expand
+the Bayesian predictive density, ˆqπ(y|x(N)) as:
+
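+\[ \hat q_\pi(y|x^{(N)}) = q(y|\hat\theta_{\mathrm{opt}}) + \frac{1}{2N}\, g^{X,ij}(\hat\theta_\pi) \Big\{ \partial_{ij} q(y|\hat\theta_\pi) - \overset{m}{\Gamma}{}^{Y,k}_{ij}(\hat\theta_\pi)\,\partial_k q(y|\hat\theta_\pi) \Big\} + O_P(N^{-3/2}). \tag{A19} \]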
+
+
+Thus, we obtain Expansion (1).
+
+References
+
+1. Aitchison, J. Goodness of prediction fit. Biometrika 1975, 62, 547–554.
+2. Komaki, F. On asymptotic properties of predictive distributions. Biometrika 1996, 83, 299–313.
+3. Hartigan, J. The maximum likelihood prior. Ann. Stat. 1998, 26, 2083–2103.
+4. Bernardo, J. Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. B 1979, 41, 113–147.
+5. Clarke, B.; Barron, A. Jeffreys prior is asymptotically least favorable under entropy risk. J. Stat. Plan.
+Inference 1994, 41, 37–60.
+6. Aslan, M. Asymptotically minimax Bayes predictive densities. Ann. Stat. 2006, 34, 2921–2938.
+7. Komaki, F. Bayesian predictive densities based on latent information priors. J. Stat. Plan. Inference 2011, 141,
+3705–3715.
+8. Komaki, F. Asymptotically minimax Bayesian predictive densities for multinomial models. Electron. J. Stat.
+2012, 6, 934–957.
+9. Kanamori, T.; Shimodaira, H. Active learning algorithm using the maximum weighted log-likelihood
+estimator. J. Stat. Plan. Inference 2003, 116, 149–162.
+10. Shimodaira, H. Improving predictive inference under covariate shift by weighting the
+log-likelihood function. J. Stat. Plan. Inference 2000, 90, 227–244.
+11. Fushiki, T.; Komaki, F.; Aihara, K. On parametric bootstrapping and Bayesian prediction. Scand. J. Stat. 2004,
+31, 403–416.
+12. Suzuki, T.; Komaki, F. On prior selection and covariate shift of β-Bayesian prediction under α-divergence
+risk. Commun. Stat. Theory 2010, 39, 1655–1673.
+13. Komaki, F. Asymptotic properties of Bayesian predictive densities when the distributions of data and target
+variables are different. Bayesian Anal. 2014, submitted for publication.
+14. Hodges, J.L.; Lehmann, E.L. Some problems in minimax point estimation. Ann. Math. Stat. 1950, 21, 182–197.
+15. Ghosh, M.N. Uniform approximation of minimax point estimates. Ann. Math. Stat. 1964, 35, 1031–1047.
+16. Amari, S. Differential-Geometrical Methods in Statistics; Springer: New York, NY, USA, 1985.
+17. Robbins, H. Asymptotically Subminimax solutions of Compound Statistical Decision Problems.
+In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA,
+USA, 31 July–12 August 1950; University of California Press: Oakland, CA, USA, 1950; pp. 131–148.
+18. Frank, P.; Kiefer, J. Almost subminimax and biased minimax procedures. Ann. Math. Stat. 1951, 22, 465–468.
+19. Efron, B. Defining curvature of a statistical problem (with applications to second order efficiency). Ann. Stat.
+1975, 3, 1189–1242.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+
+entropy
+
+Article
+Information-Geometric Markov Chain Monte Carlo
+Methods Using Diffusions
+
+Samuel Livingstone 1,* and Mark Girolami 2
+
+1 Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, UK
+2 Department of Statistics, University of Warwick, Coventry CV4 7AL, UK; E-Mail:m.girolami@warwick.ac.uk
+*
+E-Mail: samuel.livingstone@ucl.ac.uk; Tel.: +44-20-7679-1872.
+
+Received: 29 March 2014; in revised form: 23 May 2014 / Accepted: 28 May 2014 /
+Published: 3 June 2014
+
+Abstract: Recent work incorporating geometric ideas in Markov chain Monte Carlo is reviewed
+in order to highlight these advances and their possible application in a range of domains beyond
+statistics. A full exposition of Markov chains and their use in Monte Carlo simulation for statistical
+inference and molecular dynamics is provided, with particular emphasis on methods based on
+Langevin diffusions. After this, geometric concepts in Markov chain Monte Carlo are introduced.
+A full derivation of the Langevin diffusion on a Riemannian manifold is given, together with
+a discussion of the appropriate Riemannian metric choice for different problems. A survey of
+applications is provided, and some open questions are discussed.
+
+Keywords: information geometry; Markov chain Monte Carlo; Bayesian inference; computational
+statistics; machine learning; statistical mechanics; diffusions
+
+1. Introduction
+
+There are three objectives to this article. The first is to introduce geometric concepts that have
+recently been employed in Monte Carlo methods based on Markov chains [1] to a wider audience.
+The second is to clarify what a “diffusion on a manifold” is, and how this relates to a diffusion
+defined on Euclidean space. Finally, we review the state-of-the-art in the field and suggest avenues for
+further research.
+The connections between some Monte Carlo methods commonly used in statistics, physics
+and application domains, such as econometrics, and ideas from both Riemannian and information
+geometry [2,3] were highlighted by Girolami and Calderhead [1] and the potential benefits
+demonstrated empirically. Two Markov chain Monte Carlo methods were introduced, the manifold
+Metropolis-adjusted Langevin algorithm and Riemannian manifold Hamiltonian Monte Carlo. Here,
+we focus on the former for two reasons. First, the intuition for why geometric ideas can improve
+standard algorithms is the same in both cases. Second, the foundations of the methods are quite
+different, and since the focus of the article is on using geometric ideas to improve performance,
+we considered a detailed description of both to be unnecessary. It should be noted, however, that
+impressive empirical evidence exists for using Hamiltonian methods in some scenarios (e.g., [4]). We
+refer interested readers to [5,6].
+We take an expository approach, providing a review of some necessary preliminaries from
+Markov chain Monte Carlo, diffusion processes and Riemannian geometry. We assume only a minimal
+familiarity with measure-theoretic probability. More informed readers may prefer to skip these
+sections. We then provide a full derivation of the Langevin diffusion on a Riemannian manifold and
+offer some intuition for how to think about such a process. We conclude Section 4 by presenting the
+Metropolis-adjusted Langevin algorithm on a Riemannian manifold.
+A key challenge in the geometric approach is which manifold to choose. We discuss this in
+Section 4.4 and review some candidates that have been suggested in the literature, along with the
+
+Entropy 2014, 16, 3074–3102; doi:10.3390/e16063074
+www.mdpi.com/journal/entropy
+
+reasoning for each. Rather than provide a simulation study here, we instead reference studies where
+the methods we describe have been applied in Section 5. In Section 6, we discuss several open
+questions, which we feel could be interesting areas of further research and of interest to both theorists
+and practitioners.
+Throughout, π(·) will refer to an n-dimensional probability distribution and π(x) its density with
+respect to the Lebesgue measure.
+
+2. Markov Chain Monte Carlo
+
+Markov chain Monte Carlo (MCMC) is a set of methods for drawing samples from a distribution,
+π(·), defined on a measurable space (X , B), whose density is only known up to some proportionality
+constant. Although the i-th sample is dependent on the (i − 1)-th, the Ergodic Theorem ensures that
+for an appropriately constructed Markov chain with invariant distribution π(·), long-run averages are
+consistent estimators for expectations under π(·). As a result, MCMC methods have proven useful
+in Bayesian statistical inference, where often, the posterior density π(x|y) ∝ f (y|x)π0(x) for some
+parameter, x (where f (y|x) denotes the likelihood for data y and π0(x) the prior density), is only
+known up to a constant [7]. Here, we briefly introduce some concepts from general state space Markov
+chain theory together with a short overview of MCMC methods. The exposition follows [8].
+
+2.1. Markov Chain Preliminaries
+
+A time-homogeneous Markov chain, {Xm}m∈N, is a collection of random variables, Xm, each of
+which is defined on a measurable space (X , B), such that:
+P[Xm ∈ A|X0 = x0, ..., Xm−1 = xm−1] = P[Xm ∈ A|Xm−1 = xm−1],
+(1)
+
+for any A ∈ B. We define the transition kernel P(xm−1, A) = P[Xm ∈ A|Xm−1 = xm−1] for the chain to
+be a map for which P(x, ·) defines a distribution over (X , B) for any x ∈ X , and P(·, A) is measurable
+for any A ∈ B. Intuitively, P defines a map from points to distributions in X . Similarly, we define the
+m-step transition kernel to be:
+
+\[ P^m(x_0, A) = \mathrm{P}[X_m \in A \mid X_0 = x_0]. \tag{2} \]
+
+We call a distribution π(·) invariant for {Xm}m∈N if:
+
+\[ \pi(A) = \int_{\mathcal{X}} P(x, A)\,\pi(dx) \tag{3} \]
+
+for all A ∈ B. If P(x, ·) admits a density, p(x′|x), this can be equivalently written:
+
+\[ \pi(x') = \int_{\mathcal{X}} \pi(x)\,p(x'|x)\,dx. \tag{4} \]
+
+The connotation of Equations (3) and (4) is that if Xm ∼ π(·), then Xm+s ∼ π(·) for any s ∈ N. In this
+instance, we say the chain is “at stationarity”. Of interest to us will be Markov chains for which there
+is a unique invariant distribution, which is also the limiting distribution for the chain, meaning that for
+any x0 ∈ X for which π(x0) > 0:
+\[ \lim_{m \to \infty} P^m(x_0, A) = \pi(A) \tag{5} \]
+
+for any A ∈ B. Certain conditions are required for Equation (5) to hold, but for all Markov chains
+presented here, these are satisfied (though, see [8]).
+A useful condition, which is sufficient (though not necessary) for π(·) to be an invariant
+distribution, is reversibility, which can be shown by the relation:
+
+π(x)p(x′|x) = π(x′)p(x|x′).
+(6)
+
+
+Integrating over both sides with respect to x, we recover Equation (4). In other words, a chain is
+reversible if, at stationarity, the probability that xi ∈ A and xi+1 ∈ B is equal to the probability that
+xi+1 ∈ A and xi ∈ B. The relation (6) will be the primary tool used to construct Markov chains with a
+desired invariant distribution in the next section.
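+
+As an illustration only, relation (6) can be checked numerically on a small finite state space, where the
+density p(x′|x) becomes a transition matrix. The sketch below is not part of the original exposition: the
+function names and the three-state target are hypothetical, and the kernel is built with a uniform proposal
+accepted with probability min(1, π(j)/π(i)), which is an assumption made for the example.

```python
def metropolis_kernel(pi):
    """Build a transition matrix on {0, ..., n-1} that is reversible for pi.

    Assumption for illustration: propose one of the other states uniformly,
    then accept with probability min(1, pi[j] / pi[i]).
    """
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                P[i][j] = min(1.0, pi[j] / pi[i]) / (n - 1)
        P[i][i] = 1.0 - sum(P[i])  # rejected proposals leave the chain at i
    return P


def is_reversible(pi, P, tol=1e-12):
    """Check pi(x) p(x'|x) == pi(x') p(x|x') for every pair of states."""
    n = len(pi)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
               for i in range(n) for j in range(n))


def is_invariant(pi, P, tol=1e-12):
    """Check the discrete analogue of Equation (4): sum_x pi(x) p(x'|x) == pi(x')."""
    n = len(pi)
    return all(abs(sum(pi[i] * P[i][j] for i in range(n)) - pi[j]) <= tol
               for j in range(n))


pi = [0.2, 0.3, 0.5]
P = metropolis_kernel(pi)
print(is_reversible(pi, P), is_invariant(pi, P))

# A deterministic cycle keeps the uniform distribution invariant without
# being reversible, so reversibility is sufficient but not necessary.
cycle = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
uniform = [1.0 / 3.0] * 3
print(is_reversible(uniform, cycle), is_invariant(uniform, cycle))
```

+The second check uses a deterministic three-state cycle, chosen here as a simple case where invariance
+holds but reversibility fails.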
+
+2.1.1. Monte Carlo Estimates from Markov Chains
+
+Of most interest here are estimators constructed from a Markov chain. The Ergodic Theorem
+states that for any chain, {Xm}m∈N, satisfying Equation (5) and any g ∈ L1(π), we have that:
+
+\[ \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} g(X_i) = \mathrm{E}_\pi[g(X)] \tag{7} \]
+
+with probability one [7]. This is a Markov chain analogue to the Law of large numbers.
+The efficiency of estimators of the form ˆtm = ∑i g(Xi)/m can be assessed through the
+autocorrelation between elements in the chain. We will assess the efficiency of ˆtm relative to estimators
+¯tm = ∑i g(Zi)/m, where {Zi}m∈N is a sequence of independent random variables, each having
+distribution π(·). Provided Varπ[g(Zi)] < ∞, then Var[¯tm] = Varπ[g(Zi)]/m. We now seek a similar
+result for estimators of the form, ˆtm.
+It follows directly from the Kipnis–Varadhan Theorem [9] that an estimator, ˆtm, from a reversible
+Markov chain for which X0 ∼ π(·) satisfies:
+
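+\[ \lim_{m \to \infty} \frac{\mathrm{Var}[\hat t_m]}{\mathrm{Var}[\bar t_m]} = 1 + 2 \sum_{i=1}^{\infty} \rho^{(0,i)} = \tau, \tag{8} \]
+
+provided that $\sum_{i=1}^{\infty} i\,|\rho^{(0,i)}| < \infty$, where $\rho^{(0,i)} = \mathrm{Corr}_\pi[g(X_0), g(X_i)]$. We will refer to the constant, τ, as
+the autocorrelation time for the chain.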
+Equation (8) implies that for large enough m, Var[ˆtm] ≈ τVar[¯tm]. In practical applications, the
+sum in Equation (8) is truncated to the first p − 1 realisations of the chain, where p is the first instance
+at which |ρ(0,p)| < ϵ for some ϵ > 0. For example, in the Convergence Diagnosis and Output Analysis
+for MCMC (CODA) package within the R statistical software ϵ = 0.05 [10,11].
+Another commonly used measure of efficiency is the effective sample size m_eff = m/τ, which
+gives the number of independent samples from π(·) needed to give an equally efficient estimate for
+Eπ[g(X)]. Clearly, minimising τ is equivalent to maximising m_eff.
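+
+The truncated estimate of τ and the resulting effective sample size can be sketched as follows. This is a
+minimal illustration rather than the CODA implementation: the function names are hypothetical, the
+threshold eps = 0.05 follows the value quoted above, and the AR(1) process used to generate a correlated
+sequence is an assumption made purely for testing.

```python
import random


def autocorrelation(xs, lag):
    """Sample autocorrelation of the sequence xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0.0:
        return 0.0
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var


def autocorrelation_time(xs, eps=0.05):
    """Estimate tau = 1 + 2 * sum_i rho(0, i), stopping at the first lag with |rho| < eps."""
    tau = 1.0
    for lag in range(1, len(xs)):
        rho = autocorrelation(xs, lag)
        if abs(rho) < eps:
            break
        tau += 2.0 * rho
    return tau


def effective_sample_size(xs, eps=0.05):
    """m_eff = m / tau."""
    return len(xs) / autocorrelation_time(xs, eps)


def ar1_chain(phi, m, seed=0):
    """Hypothetical test sequence: AR(1) process x_t = phi * x_{t-1} + noise."""
    rng = random.Random(seed)
    x = 0.0
    out = []
    for _ in range(m):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


correlated = ar1_chain(0.9, 5000)
independent = ar1_chain(0.0, 5000)
print(effective_sample_size(correlated), effective_sample_size(independent))
```

+A strongly correlated sequence should report a much smaller effective sample size than an independent
+one of the same length.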
+The measures arising from Equation (8) give some intuition for what sort of Markov chain gives
+rise to efficient estimators. However, in practice, the chain will never be at stationarity. Therefore, we
+also assess Markov chains according to how far away they are from this point. For this, we need to
+measure how close Pm(x0, ·) is from π(·), which requires a notion of distance between probability
+distributions.
+Although there are several appropriate choices [12], a common option in the Markov chain
+literature is the total variation distance:
+
+\[ \|\mu(\cdot) - \nu(\cdot)\|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|, \tag{9} \]
+
+which informally gives the largest possible difference between the probabilities of a single event in
+B according to μ(·) and ν(·). If both distributions admit densities, Equation (9) can be written (see
+Appendix A):
+
+\[ \|\mu(\cdot) - \nu(\cdot)\|_{TV} = \frac{1}{2} \int_{\mathcal{X}} |\mu(x) - \nu(x)|\,dx, \tag{10} \]
+
+which is proportional to the L1 distance between μ(x) and ν(x). The metric takes values in [0, 1], with
+∥ · ∥TV = 1 for distributions with disjoint supports, and ∥μ(·) − ν(·)∥TV = 0 implies μ(·) ≡ ν(·).
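+
+For discrete distributions, the agreement between the supremum form (9) and the half-L1 form (10) can be
+verified directly. The sketch below is illustrative only: the function names are hypothetical, distributions
+are represented as dictionaries from outcomes to probabilities, and the brute-force supremum enumerates
+every event, so it is only practical for very small supports.

```python
from itertools import combinations


def total_variation(mu, nu):
    """Half the L1 distance between two discrete distributions (dict: outcome -> probability)."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)


def total_variation_sup(mu, nu):
    """Supremum of |mu(A) - nu(A)| over all events A, by exhaustive enumeration."""
    support = sorted(set(mu) | set(nu))
    best = 0.0
    for size in range(len(support) + 1):
        for event in combinations(support, size):
            mu_a = sum(mu.get(x, 0.0) for x in event)
            nu_a = sum(nu.get(x, 0.0) for x in event)
            best = max(best, abs(mu_a - nu_a))
    return best


mu = {"a": 0.5, "b": 0.5}
nu = {"a": 0.2, "b": 0.3, "c": 0.5}
print(total_variation(mu, nu), total_variation_sup(mu, nu))
```

+Both functions should return the same value for any pair of discrete distributions, with the value one
+reached when the supports are disjoint.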
+
+
+Typically, for an unbounded X , the distance ∥Pm(x0, ·) − π(·)∥TV will depend on x0 for any finite
+m. Therefore, bounds on the distance are often sought via some inequality of the form:
+
+\[ \|P^m(x_0, \cdot) - \pi(\cdot)\|_{TV} \le M\,V(x_0)\,f(m), \tag{11} \]
+
+for some M < ∞, where V : X → [1, ∞) depends on x0 and is called a drift function, and f : N → [0, ∞)
+depends on the number of iterations, m (and is often defined, such that f (0) = 1).
+A Markov chain is called geometrically ergodic if f(m) = r^m in Equation (11) for some 0 < r < 1.
+If in addition to this, V is bounded above, the chain is called uniformly ergodic. Intuitively, if either
+condition holds, then the distribution of Xm will converge to π(·) geometrically quickly as m grows,
+and in the uniform case, this rate is independent of x0. As well as providing some (often qualitative
+if M and r are unknown) bounds on the convergence rate of a Markov chain, geometric ergodicity
+implies that a central limit theorem exists for estimators of the form, ˆtm. For more detail on this,
+see [13,14].
+In practice several approximate methods also exist to assess whether a chain is close enough to
+stationarity for long-run averages to provide suitable estimators (e.g., [15]). The MCMC practitioner
+also uses a variety of visual aids to judge whether an estimate from the chain will be appropriate for
+his or her needs.
+
+2.2. Markov Chain Monte Carlo
+
+Now that we have introduced Markov chains, we turn to simulating them. The objective here
+is to devise a method for generating a Markov chain, which has a desired limiting distribution, π(·).
+In addition, we would strive for the convergence rate to be as fast as possible and the effective sample
+size to be suitably large relative to the number of iterations. Of course, the computational cost of
+performing an iteration is also an important practical consideration. Ideally, any method would
+also require limited problem-specific alterations, so that practitioners are able to use it with as little
+knowledge of the inner workings as is practical.
+Although other methods exist for constructing chains with a desired limiting distribution, a
+popular choice is the Metropolis–Hastings algorithm [7]. At iteration i, a sample is drawn from some
+candidate transition kernel, Q(xi−1, ·), and then either accepted or rejected (in which case, the state of
+the chain remains xi−1). We focus here on the case where Q(xi−1, ·) admits a density, q(x′|xi−1), for all
+xi−1 ∈ X (though, see [8]). In this case, a single step is shown below (the wedge notation a ∧ b denotes
+the minimum of a and b). The “acceptance rate”, α(xi−1, x′), governs the behaviour of the chain, so
+that, when it is close to one, then many proposed moves are accepted, and the current value in the
+chain is constantly changing. If it is on average close to zero, then many proposals are rejected, so
+that the chain will remain in the same place for many iterations. However, α ≈ 1 is typically not ideal,
+often resulting in a large autocorrelation time (see below). The challenge in practice is to find the right
+acceptance rate to balance these two extremes.
+
+Algorithm 1 Metropolis–Hastings, single iteration.
+
+Require: xi−1
+Draw X′ ∼ Q(xi−1, ·)
+Draw Z ∼ U[0, 1]
+Set α(xi−1, x′) ← 1 ∧ [π(x′)q(xi−1|x′)] / [π(xi−1)q(x′|xi−1)]
+if z < α(xi−1, x′) then
+Set xi ← x′
+else
+Set xi ← xi−1
+end if
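+
+A single iteration of Algorithm 1 can be sketched in Python as below. This is a hedged illustration, not
+the authors' code: the function and argument names are hypothetical, the densities are handled on the log
+scale for numerical stability (a design choice of this sketch), and the current state is assumed to have
+positive target density. The usage example targets a standard normal with a Gaussian random walk proposal.

```python
import math
import random


def metropolis_hastings_step(x_prev, log_target, propose, log_q, rng):
    """One Metropolis-Hastings iteration; returns (new state, accepted flag).

    log_target(x): log of the (possibly unnormalised) target density pi(x).
    propose(x, rng): draws a candidate x' from Q(x, .).
    log_q(x_to, x_from): log proposal density q(x_to | x_from), up to a constant.
    """
    x_new = propose(x_prev, rng)
    log_ratio = (log_target(x_new) + log_q(x_prev, x_new)
                 - log_target(x_prev) - log_q(x_new, x_prev))
    alpha = math.exp(min(0.0, log_ratio))  # min(1, ratio) on the log scale
    z = rng.random()
    if z < alpha:
        return x_new, True
    return x_prev, False


def run_chain(m, step=2.4, seed=0):
    """Usage example (assumed setup): N(0, 1) target, N(x, step^2) proposal."""
    rng = random.Random(seed)
    log_target = lambda x: -0.5 * x * x
    propose = lambda x, r: x + step * r.gauss(0.0, 1.0)
    log_q = lambda x_to, x_from: -0.5 * ((x_to - x_from) / step) ** 2
    x, chain = 0.0, []
    for _ in range(m):
        x, _ = metropolis_hastings_step(x, log_target, propose, log_q, rng)
        chain.append(x)
    return chain


chain = run_chain(20000)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
print(mean, var)
```

+Because only ratios of π and q enter the acceptance step, normalising constants of both densities can be
+dropped, which is why the log densities above omit them.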
+
+Combining the “proposal” and “acceptance” steps, the transition kernel for the resulting Markov
+chain is:
+
+
+\[ P(x, A) = r(x)\,\delta_x(A) + \int_A \alpha(x, x')\,q(x'|x)\,dx', \tag{12} \]
+
+for any A ∈ B, where:
+
+\[ r(x) = 1 - \int_{\mathcal{X}} \alpha(x, x')\,q(x'|x)\,dx' \]
+
+is the average probability that a draw from Q(x, ·) will be rejected, and δx(A) = 1 if x ∈ A and zero,
+otherwise. A Markov chain defined in this way will have π(·) as an invariant distribution, since the
+chain is reversible for π(·). We note here that:
+
+\[ \pi(x_{i-1})\,q(x_i|x_{i-1})\,\alpha(x_{i-1}, x_i) = \pi(x_{i-1})\,q(x_i|x_{i-1}) \wedge \pi(x_i)\,q(x_{i-1}|x_i) = \alpha(x_i, x_{i-1})\,q(x_{i-1}|x_i)\,\pi(x_i) \]
+
+in the case that the proposed move is accepted and that if the proposed move is rejected, then xi = xi−1;
+so the chain is reversible for π(·). It can be shown that π(·) is also the limiting distribution for the
+chain [7].
+The convergence rate and autocorrelation time of a chain produced by the algorithm are dependent
+on both the choice of proposal, Q(xi−1, ·), and the target distribution, π(·). For simple forms of the
+latter, less consideration is required when choosing the former. A broad objective among researchers
+in the field is to find classes of proposal kernels that produce chains that converge and mix quickly for
+a large class of target distributions. We first review a simple choice before discussing one that is more
+sophisticated, and that will be the focus of the rest of the article.
+
+2.3. Random Walk Proposals
+
+An extremely simple choice for Q(x, ·) is one for which:
+
+q(x′|x) = q(∥x′ − x∥)
+(13)
+
+where ∥ · ∥ denotes some appropriate norm on X , meaning the proposal is symmetric. In this case, the
+acceptance rate reduces to:
+
+\[ \alpha(x, x') = 1 \wedge \frac{\pi(x')}{\pi(x)}. \tag{14} \]
+
+In addition to simplifying calculations, Equation (14) strengthens the intuition for the method, since
+proposed moves with higher density under π(·) will always be accepted. A typical choice for Q(x, ·)
+is N (x, λ²Σ), where the matrix, Σ, is often chosen in an attempt to match the correlation structure
+of π(·) or simply taken as the identity [16]. The tuning parameter, λ, is the only other user-specified
+input required.
+Much research has been conducted into properties of the random walk Metropolis algorithm
+(RWM). It has been shown that the optimal acceptance rate for proposals tends to 0.234 as the
+dimension, n, of the state space, X , tends to ∞ for a wide class of targets (e.g., [17,18]). The intuition
+for an optimal acceptance rate is to find the right balance between the distance of proposed moves
+and the chances of acceptance. Increasing the former will reduce the autocorrelation in the chain if the
+proposal is accepted, but if it is rejected, the chain will not move at all, so autocorrelation will be high.
+Random walk proposals are sometimes referred to as blind (e.g., [19]), as no information about π(·)
+is used when generating proposals, so typically, very large moves will result in a very low chance of
+acceptance, while small moves will be accepted, but result in very high autocorrelation for the chain.
+Figure 1 demonstrates this in the simple case where π(·) is a one-dimensional N (0, 1²) distribution.
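+
+The trade-off described above can be reproduced with a short sketch. The function name rwm_chain, the three step sizes and the chain length are illustrative assumptions, not values taken from the article or from Figure 1.
+

```python
import math
import random


def rwm_chain(log_pi, x0, lam, n_steps, seed=0):
    """Random walk Metropolis on the real line with N(x, lam^2) proposals."""
    rng = random.Random(seed)
    x, accepted, chain = x0, 0, []
    for _ in range(n_steps):
        x_new = x + lam * rng.gauss(0.0, 1.0)
        # Symmetric proposal: accept with probability 1 ∧ pi(x') / pi(x).
        log_u = math.log(1.0 - rng.random())  # log of a U(0, 1] draw
        if log_u <= log_pi(x_new) - log_pi(x):
            x, accepted = x_new, accepted + 1
        chain.append(x)
    return chain, accepted / n_steps


def log_pi(x):
    return -0.5 * x * x  # N(0, 1) target, up to an additive constant


# Empirical acceptance rate for a small, a moderate and a very large step size.
rates = {lam: rwm_chain(log_pi, 0.0, lam, 5000)[1] for lam in (0.1, 2.4, 50.0)}
```

+
+In such a run, the smallest step size should accept nearly every proposal while barely moving, and the largest should reject most proposals, so both extremes give highly autocorrelated chains.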
+
+Figure 1. These traceplots show the evolution of three RWM Markov chains for which π(·) is an N (0, 1²)
+distribution, with different choices for λ.
+
+Several authors have also shown that for certain classes of π(·), the tuning parameter, λ, should
+be chosen such that λ² ∝ n⁻¹, so that α ↛ 0 as n → ∞ [20]. Because of this, we say that algorithm
+efficiency “scales” as O(n⁻¹) as the dimension n of π(·) increases.
+Ergodicity results for a Markov chain constructed using the RWM algorithm also exist [21–23].
+At least exponentially light tails are a necessity for π(x) for geometric ergodicity, which means that
+π(x)/e^(−∥x∥) → c as ∥x∥ → ∞, for some constant, c. For super-exponential tails (where π(x) → 0 at
+a faster-than-exponential rate), additional conditions are required [21,23]. We demonstrate with
+a simple example why heavy-tailed forms of π(x) pose difficulties here (where π(x) → 0 at a rate
+slower than e^(−∥x∥)).
+
+Example: Take π(x) ∝ 1/(1 + x²), so that π(·) is a Cauchy distribution. Then, if X′ ∼ N (x, λ²), the ratio
+π(x′)/π(x) = (1 + x²)/(1 + (x′)²) → 1 as |x| → ∞. Therefore, if x₀ is far away from zero, the Markov
+chain will dissolve into a random walk, with almost every proposal being accepted.
+
+It should be noted that starting the chain at or near zero can also cause problems in the
+above example, as the tails of the distribution may not be explored. See [7] for more detail here.
+Ergodicity results for the RWM also exist for specific classes of the statistical model. Conditions for
+geometric ergodicity in the case of generalised linear mixed models are given in [24], while spherically
+constrained target densities are discussed in [25]. In [26], the authors provide necessary conditions
+for the geometric convergence of RWM algorithms, which are related to the existence of exponential
+moments for π(·) and P(x, ·). Weaker forms of ergodicity and corresponding conditions are also
+discussed in the paper.
+In the remainder of the article, we will primarily discuss another approach to choosing Q, which
+has been shown empirically [1] and, in some cases, theoretically [20] to be superior to the RWM
+algorithm, though it should be noted that random walk proposals are still widely used in practice and
+are often sufficient for more straightforward problems [16].
+
+3. Diffusions
+
+In MCMC, we are concerned with discrete time processes. However, often, there are benefits
+to first considering a continuous time process with the properties we desire. For example, some
+continuous time processes can be specified via a form of differential equation. In this section, we
+derive a choice for a Metropolis–Hastings proposal kernel based on approximations to diffusions,
+those continuous-time n-dimensional Markov processes (Xt)t≥0 for which any sample path t ↦ Xt(ω)
+is a continuous function with probability one. For any fixed t, we assume Xt is a random variable
+taking values on the measurable space (X , B) as before. The motivation for this section is to define a
+class of diffusions for which π(·) is the invariant distribution. First, we provide some preliminaries,
+followed by an introduction to our main object of study, the Langevin diffusion.
+
+3.1. Preliminaries
+
+We focus on the class of time-homogeneous Itô diffusions, whose dynamics are governed by a
+stochastic differential equation of the form:
+
+dXt = b(Xt)dt + σ(Xt)dBt, X0 = x0,
+(15)
+
+where (Bt)t≥0 is a standard Brownian motion and the drift vector, b, and volatility matrix, σ, are
+Lipschitz continuous [27]. Since $E[B_{t+\triangle t} - B_t \mid B_t = b_t] = 0$ for any $\triangle t \geq 0$, informally, we can see that:
+
+\[ E[X_{t+\triangle t} - X_t \mid X_t = x_t] = b(x_t)\,\triangle t + o(\triangle t), \qquad (16) \]
+
+implying that the drift dictates how the mean of the process changes over a small time interval, and if
+we define the process (Mt)t≥0 through the relation:
+
+\[ M_t = X_t - \int_0^t b(X_s)\,ds \qquad (17) \]
+
+then we have:
+
+\[ E[(M_{t+\triangle t} - M_t)(M_{t+\triangle t} - M_t)^T \mid M_t = m_t, X_t = x_t] = \sigma(x_t)\sigma(x_t)^T\,\triangle t + o(\triangle t), \qquad (18) \]
+
+giving the stochastic part of the relationship between Xt+△t and Xt for small enough △t; see, e.g., [28].
+While Equation (15) is often a suitable description of an Itô diffusion, it can also be characterised
+through an infinitesimal generator, A, which describes how functions of the process are expected to
+evolve. We define this partial differential operator through its action on a function, f ∈ C0(X ), as:
+
+\[ \mathcal{A} f(X_t) = \lim_{\triangle t \to 0} \frac{E[f(X_{t+\triangle t}) \mid X_t = x_t] - f(x_t)}{\triangle t}, \qquad (19) \]
+though A can be associated with the drift and volatility of (Xt)t≥0 by the relation:
+
+\[ \mathcal{A} f(x) = \sum_i b_i(x)\,\frac{\partial f}{\partial x_i}(x) + \frac{1}{2}\sum_{i,j} V_{ij}(x)\,\frac{\partial^2 f}{\partial x_i \partial x_j}(x), \qquad (20) \]
+
+where Vij(x) denotes the component in row i and column j of σ(x)σ(x)T [27].
+As in the discrete case, we can describe the transition kernel of a continuous time Markov process,
+Pt(x0, ·). In the case of an Itô diffusion, Pt(x0, ·) admits a density, pt(x|x0), which, in fact, varies
+smoothly as a function of t. The Fokker–Planck equation describes this variation in terms of the drift
+and volatility and is given by:
+
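+\[ \frac{\partial}{\partial t}\,p_t(x|x_0) = -\sum_i \frac{\partial}{\partial x_i}\left[b_i(x)\,p_t(x|x_0)\right] + \frac{1}{2}\sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\left[V_{ij}(x)\,p_t(x|x_0)\right]. \qquad (21) \]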
+
+Although, typically, the form of Pt(x0, ·) is unknown, the expectation and variance of Xt ∼ Pt(x0, ·)
+are given by the integral equations:
+
+\[ E[X_t \mid X_0 = x_0] = x_0 + E\left[\int_0^t b(X_s)\,ds\right], \]
+
+\[ E[(X_t - E[X_t])(X_t - E[X_t])^T \mid X_0 = x_0] = E\left[\int_0^t \sigma(X_s)\sigma(X_s)^T\,ds\right], \]
+
+where the second of these is a result of the Itô isometry [27]. Continuing the analogy, a natural question
+is whether a diffusion process has an invariant distribution, π(·), and whether:
+
+\[ \lim_{t \to \infty} P_t(x_0, A) = \pi(A) \qquad (22) \]
+
+for any A ∈ B and any x0 ∈ X , in some sense. For a large class of diffusions (which we confine
+ourselves to), this is, in fact, the case. Specifically, in the case of positive Harris recurrent diffusions
+with invariant distribution π(·), all compact sets must be small for some skeleton chain; see [29] for
+details. In addition, Equation (21) provides a means of finding π(·), given b and σ. Setting the left-hand
+side of Equation (21) to zero gives:
+
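+\[ \sum_i \frac{\partial}{\partial x_i}\left[b_i(x)\,\pi(x)\right] = \frac{1}{2}\sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\left[V_{ij}(x)\,\pi(x)\right], \qquad (23) \]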
+
+which can be solved to find π(·).
+
+3.2. Langevin Diffusions
+
+Given Equation (23), our goal becomes clearer: find drift and volatility terms, so that the resulting
+dynamics describe a diffusion, which converges to some user-defined invariant distribution, π(·). This
+process can then be used as a basis for choosing Q in a Metropolis–Hastings algorithm. The Langevin
+diffusion, first used to describe the dynamics of molecular systems [30], is such a process, given by the
+solution to the stochastic differential equation:
+
+\[ dX_t = \frac{1}{2}\nabla \log \pi(X_t)\,dt + dB_t, \quad X_0 = x_0. \qquad (24) \]
+
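+Since $V_{ij}(x) = \mathbf{1}_{\{i=j\}}$, it is clear that
+
+\[ \frac{1}{2}\,\frac{\partial}{\partial x_i}\left[\log \pi(x)\right]\pi(x) = \frac{1}{2}\,\frac{\partial}{\partial x_i}\pi(x), \quad \forall i, \qquad (25) \]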
+
+which is a sufficient condition for Equation (23) to hold. Therefore, for any case in which π(x) is
+suitably regular, so that ∇ log π(x) is well-defined and the derivatives in Equation (23) exist, we can
+use (24) to construct a diffusion, which has invariant distribution, π(·).
+Roberts and Tweedie [31] give sufficient conditions on π(·) under which a diffusion, (Xt)t≥0,
+with dynamics given by Equation (24), will be ergodic, meaning:
+
+∥Pt(x0, ·) − π(·)∥TV → 0
+(26)
+
+as t → ∞, for any x0 ∈ X .
+
+3.3. Metropolis-Adjusted Langevin Algorithm
+
+We can use Langevin diffusions as a basis for MCMC in many ways, but a popular
+variant is known as the Metropolis-adjusted Langevin algorithm (MALA), whereby Q(x, ·) is
+constructed through an Euler–Maruyama discretisation of (24) and used as a candidate kernel in
+a Metropolis–Hastings algorithm. The resulting Q is:
+
+\[ Q(x, \cdot) \equiv N\left(x + \frac{\lambda^2}{2}\nabla \log \pi(x),\; \lambda^2 I\right), \qquad (27) \]
+
+where λ is again a tuning parameter.
+Before we discuss the theoretical properties of the approach, we first offer an intuition for the
+dynamics. From Equation (27), it can be seen that Langevin-type proposals comprise a deterministic
+shift towards a local mode of π(x), combined with some random additive Gaussian noise, with
+variance λ² for each component. The relative weights of the deterministic and random parts are fixed,
+given as they are by the parameter, λ. Typically, if λ^(1/2) ≫ λ, then the random part of the proposal will
+dominate and vice versa in the opposite case, though this also depends on the form of ∇ log π(x) [31].
+Again, since this is a Metropolis–Hastings method, choosing λ is a balance between proposing
+large enough jumps and ensuring that a reasonable proportion are accepted. It has been shown that in
+the limit, as n → ∞, the optimal acceptance rate for the algorithm is 0.574 for forms of π(·) that
+either have independent and identically distributed components or whose components only differ
+by some scaling factor [20]. In these cases, as n → ∞, the parameter, λ², must be ∝ n^(−1/3), so we say
+the algorithm efficiency scales as O(n^(−1/3)). Note that these results compare favourably with the O(n⁻¹)
+scaling of the random walk algorithm.
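+
+A minimal one-dimensional sketch of a MALA iteration based on Equation (27) is given below. The function name mala_step and its arguments are assumptions made for illustration; because the proposal is not symmetric, both proposal densities appear in the acceptance ratio.
+

```python
import math
import random


def mala_step(x, log_pi, grad_log_pi, lam, rng):
    """One MALA iteration on the real line using the proposal in Equation (27)."""

    def mean(u):
        # Deterministic shift: u + (lam^2 / 2) * d/du log pi(u)
        return u + 0.5 * lam ** 2 * grad_log_pi(u)

    def log_q(a, b):
        # log N(a; mean(b), lam^2), up to a constant that cancels in the ratio
        return -0.5 * ((a - mean(b)) / lam) ** 2

    x_new = mean(x) + lam * rng.gauss(0.0, 1.0)
    log_ratio = (log_pi(x_new) + log_q(x, x_new)
                 - log_pi(x) - log_q(x_new, x))
    log_u = math.log(1.0 - rng.random())  # log of a U(0, 1] draw
    if log_u <= log_ratio:
        return x_new, True
    return x, False
```

+
+For a standard Gaussian target, one would pass log_pi(u) = −u²/2 and grad_log_pi(u) = −u; the Metropolis–Hastings correction then removes the bias that the discretisation alone would introduce.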
+Convergence properties of the method have also been established. Roberts and Tweedie [31]
+highlight some cases in which MALA is either geometrically ergodic or not. Typically, results are
+based on the tail behaviour of π(x). If these tails are heavier than exponential, then the method is
+typically not geometrically ergodic and similarly if the tails are lighter than Gaussian. However, in the
+in-between case, the converse is true. We again offer two simple examples for intuition here.
+
+Example: Take π(x) ∝ 1/(1 + x²) as in the previous example. Then, ∇ log π(x) = −2x/(1 + x²) → 0
+as |x| → ∞. Therefore, if x₀ is far away from zero, then the MALA will be approximately equal to the RWM
+algorithm and, so, will also dissolve into a random walk.
+
+Example: Take π(x) ∝ e^(−x⁴). Then, ∇ log π(x) = −4x³ and, by Equation (27), X′ ∼ N (x − 2λ²x³, λ²). Therefore, for any fixed
+λ, there exists c > 0, such that, for |x₀| > c, we have |2λ²x³| ≫ |x| and |x − 2λ²x³| ≫ λ, suggesting that
+MALA proposals will quickly spiral further and further away from any neighbourhood of zero, and hence, nearly
+all will be rejected.
+
+For cases where there is a strong correlation between elements of x or each element has a different
+marginal variance, the MALA can also be “pre-conditioned” in a similar way to the RWM, so that the
+covariance structure of proposals more accurately reflects that of π(x) [32]. In this case, proposals take
+the form:
+
+\[ Q(x, \cdot) \equiv N\left(x + \frac{\lambda^2}{2}\Sigma\nabla \log \pi(x),\; \lambda^2\Sigma\right), \qquad (28) \]
+
+where λ is again a tuning parameter. It can be shown that provided Σ is a constant matrix, π(x) is still
+the invariant distribution for the diffusion on which Equation (28) is based [33].
+
+4. Geometric Concepts in Markov Chain Monte Carlo
+
+Ideas from information geometry have been successfully applied to statistics from as early as [34].
+More widely, other geometric ideas have also been applied, offering new insight into common problems
+(e.g., [35,36]). A survey is given in [37]. In this section, we suggest why some ideas from differential
+geometry may be beneficial for sampling methods based on Markov chains. We then review what is
+meant by a “diffusion on a manifold”, before turning to the specific case of Equation (24). After this,
+we discuss what can be learned from work in information geometry in this context.
+
+4.1. Manifolds and Markov Chains
+
+We often make assumptions in MCMC about the properties of the space, X , in which our
+Markov chains evolve. Often X = Rn or a simple re-parametrisation would make it so. However,
+here, Rn = {(a1, ..., an) : ai ∈ (−∞, ∞) ∀i}. The additional assumption that is often made is that Rn is
+Euclidean, an inner product space with the induced distance metric:
+
+\[ d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}. \qquad (29) \]
+
+For sampling methods based on Markov chains that explore the space locally, like the RWM and
+MALA, it may be advantageous to instead impose a different metric structure on the space, X , so that
+some points are drawn closer together and others pushed further apart. Intuitively, one can picture
+distances in the space being defined, such that if the current position in the chain is far from an area
+of X , which is “likely to occur” under π(·), then the distance to such a typical set could be reduced.
+Similarly, once this region is reached, the space could be “stretched” or “warped”, so that it is explored
+as efficiently as possible.
+While the idea is attractive, it is far from a constructive definition. We only have the pre-requisite
+that (X , d) must be a metric space. However, as Langevin dynamics use gradient information, we will
+require (X , d) to be a space on which we can do differential calculus. Riemannian manifolds are an
+appropriate choice, therefore, as the rules of differentiation are well understood for functions defined
+on them [38,39], while we are still free to define a more local notion of distance than Euclidean. In this
+section, we write Rn to denote the Euclidean vector space.
+
+4.2. Preliminaries
+
+We do not provide a full overview of Riemannian geometry here [38–40]. We simply note that
+for our purposes, we can consider an n-dimensional Riemannian manifold (henceforth, manifold)
+to be an n-dimensional metric space, in which distances are defined in a specific way. We also only
+consider manifolds for which a global coordinate chart exists, meaning that a mapping r : Rn → M
+exists, which is both differentiable and invertible and for which the inverse is also differentiable (a
+diffeomorphism). Although this restricts the class of manifolds available (the sphere, for example, is
+not in this class), it is again suitable for our needs and avoids the practical challenges of switching
+between coordinate patches. The connection with Rn defined through r is crucial for making sense of
+differentiability in M. We say a function f : M → R is “differentiable” if ( f ◦ r) : Rn → R is [39].
+As has been stated, Equation (29) can be induced via a Euclidean inner product, which we denote
+⟨·, ·⟩. However, it will aid intuition to think of distances in Rn via curves:
+
+γ : [0, 1] → Rn.
+(30)
+
+We could think of the distance between two points x, y ∈ Rn as the minimum length among all
+curves that pass through x and y. If γ(0) = x and γ(1) = y, the length is defined as:
+
+\[ L(\gamma) = \int_0^1 \sqrt{\langle \gamma'(t), \gamma'(t) \rangle}\,dt, \qquad (31) \]
+
+giving the metric:
+d(x, y) = inf {L(γ) : γ(0) = x, γ(1) = y} .
+(32)
+
+In Rn, the curve with a minimum length will be a straight line, so that Equation (32) agrees with
+Equation (29). More generally, we call a solution to Equation (32) a geodesic [38].
+
+In a vector space, metric properties can always be induced through an inner product (which also
+gives a notion of orthogonality). Such a space can be thought of as “flat”, since for any two points, y
+and z, the straight line ay + (1 − a)z, a ∈ [0, 1] is also contained in the space. In general, manifolds do
+not have vector space structure globally, but do so at the infinitesimal level. As such, we can think
+of them as “curved”. We cannot always define an inner product, but we can still define distances
+through (32). We define a curve on a manifold, M, as γM : [0, 1] → M. At each point γM(t) = p ∈ M,
+the velocity vector, γ′M(t), lies in an n-dimensional vector space, which touches M at p. These are
+known as tangent spaces, denoted TpM, which can be thought of as local linear approximations to M.
+We can define an inner product on each as gp : TpM × TpM → R, which allows us to define a generalisation
+of (31) as:
+
+\[ L(\gamma_M) = \int_0^1 \sqrt{g_p(\gamma_M'(t), \gamma_M'(t))}\,dt, \qquad (33) \]
+
+and provides a means to define a distance metric on the manifold as d(x, y) =
+inf {L(γM) : γM(0) = x, γM(1) = y}. We emphasise the difference between this distance metric on M
+and gp, which is called a Riemannian metric or metric tensor and which defines an inner product on
+TpM.
+
+Embeddings and Local Coordinates
+
+So far, we have introduced manifolds as abstract objects. In fact, they can also be considered as
+objects that are embedded in some higher-dimensional Euclidean space. A simple example is any
+two-dimensional surface, such as the unit sphere, lying in R3. If a manifold is embedded in this way,
+then metric properties can be induced from the ambient Euclidean space.
+We seek to make these ideas more concrete through an example, the graph of a function, f (x1, x2),
+of two variables, x1 and x2. The resulting map, r, is:
+
+r : R2 → M
+(34)
+
+r(x1, x2) = (x1, x2, f (x1, x2)).
+(35)
+
+We can see that M is embedded in R3, but that any point can be identified using only two coordinates,
+x1 and x2. In this case, each TpM is a plane, and therefore, a two-dimensional subspace of R3, so: (i) it
+inherits the Euclidean inner product, ⟨·, ·⟩; and (ii) any vector, v ∈ TpM, can be expressed as a linear
+combination of any two linearly independent basis vectors (a canonical choice is the partial derivatives
+∂r/∂x1 := r1 and r2, evaluated at x = r−1(p) ∈ R2). The resulting inner product, gp(v, w), between
+two vectors, v, w ∈ TpM, can be induced from the Euclidean inner product as:
+
+⟨v, w⟩ = ⟨v1r1(x) + v2r2(x), w1r1(x) + w2r2(x)⟩,
+
+= v1w1⟨r1(x), r1(x)⟩ + v1w2⟨r1(x), r2(x)⟩ + v2w1⟨r2(x), r1(x)⟩ + v2w2⟨r2(x), r2(x)⟩,
+
+= vTG(x)w,
+
+where:
+
+\[ G(x) = \begin{pmatrix} \langle r_1(x), r_1(x) \rangle & \langle r_1(x), r_2(x) \rangle \\ \langle r_1(x), r_2(x) \rangle & \langle r_2(x), r_2(x) \rangle \end{pmatrix} \qquad (36) \]
+
+and we use vi, wi to denote the components of v and w. To write (31) using this notation, we define the
+curve, x(t) ∈ R², corresponding to γM(t) ∈ M as x = (r⁻¹ ◦ γM) : [0, 1] → R². Equation (31) can then
+be written:
+
+\[ L(\gamma_M) = \int_0^1 \sqrt{x'(t)^T G(x(t))\,x'(t)}\,dt, \qquad (37) \]
+
+which can be used in (32) as before.
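+
+To make Equation (37) concrete, the sketch below approximates the length of a curve using only local coordinates and G(x). The surface f(x1, x2) = sin(x1) + 1, the diagonal curve x(t) = (t, t) and the midpoint rule are assumptions chosen for illustration; for this surface r1 = (1, 0, cos x1) and r2 = (0, 1, 0), which gives the matrix coded in metric.
+

```python
import math


def metric(x1, x2):
    """G(x) for the graph r(x1, x2) = (x1, x2, sin(x1) + 1)."""
    c = math.cos(x1)
    return [[1.0 + c * c, 0.0],
            [0.0, 1.0]]


def curve_length(x, dx, n=2000):
    """Midpoint-rule approximation of Equation (37) over t in [0, 1].

    x(t)  : local coordinates of the curve at time t.
    dx(t) : derivative of those coordinates with respect to t.
    """
    total = 0.0
    for k in range(n):
        t = (k + 0.5) / n
        G, v = metric(*x(t)), dx(t)
        quad = sum(v[i] * G[i][j] * v[j] for i in range(2) for j in range(2))
        total += math.sqrt(quad) / n
    return total


# Length of the curve with local coordinates x(t) = (t, t).
length = curve_length(lambda t: (t, t), lambda t: (1.0, 1.0))
```

+
+The result should agree with the length measured directly in the ambient space R³, which is a useful sanity check on any hand-derived G(x).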
+
+The key point is that, although we have started with an object embedded in R3, we can compute
+the Riemannian metric, gp(v, w) (and, hence, distances in M), using only the two-dimensional “local”
+coordinates (x1, x2). We also need not have explicit knowledge of the mapping, r, only the components
+of the positive definite matrix, G(x). The Nash embedding theorem [41] in essence enables us to define
+manifolds by the reverse process: simply choose the matrix, G(x), so that we define a metric space
+with suitable distance properties, and some object embedded in some higher-dimensional Euclidean
+space will exist for which these metric properties can be induced as above. Therefore, to define our new
+space, we simply choose an appropriate matrix-valued map, G(x) (we discuss this choice in Section
+4.4). If G(x) does not depend on x, then M has a vector space structure and can be thought of as “flat”.
+Trivially, G(x) = I gives Euclidean n-space.
+We can also define volumes on a Riemannian manifold in local coordinates. Following standard
+coordinate transformation rules, we can see that for the above example, the area element, dx, in R²
+will change according to a Jacobian J = |(Dr)ᵀ(Dr)|^(1/2), where Dr = ∂(p1, p2, p3)/∂(x1, x2). This
+reduces to J = |G(x)|^(1/2), which is also the case for more general manifolds [38]. We therefore define
+the Riemannian volume measure on a manifold, M, in local coordinates as:
+
+\[ \mathrm{Vol}_M(dx) = |G(x)|^{1/2}\,dx. \qquad (38) \]
+
+If G(x) = I, then this reduces to the Lebesgue measure.
+
+4.3. Diffusions on Manifolds
+
+By a “diffusion on a manifold” in local coordinates, we actually mean a diffusion defined on
+Euclidean space. For example, a realisation of Brownian motion on the surface, S ⊂ R3, defined
+in Figure 2 through r(x1, x2) = (x1, x2, sin(x1) + 1) will be a sample path, which is defined on S
+and “looks locally” like Brownian motion in a neighbourhood of any point, p ∈ S. However, the
+pre-image of this sample path (through r−1) will not be a realisation of a Brownian motion defined on
+R2, owing to the nonlinearity of the mapping. Therefore, to define “Brownian motion on S”, we define
+some diffusion (Xt)t≥0 that takes values in R2, for which the process (r(Xt))t≥0 “looks locally” like a
+Brownian motion (and lies on S). See [42] for more intuition here.
+Our goal, therefore, is to define a diffusion on Euclidean space, which, when mapped onto a
+manifold through r, becomes the Langevin diffusion described in (24) by the above procedure. Such a
+diffusion takes the form:
+
+\[ dX_t = \frac{1}{2}\tilde{\nabla} \log \tilde{\pi}(X_t)\,dt + d\tilde{B}_t, \qquad (39) \]
+
+where those objects marked with a tilde must be defined appropriately. The next few paragraphs
+are technical, and readers aiming to simply grasp the key points may wish to skip to the end of
+this Subsection.
+We turn first to $(\tilde{B}_t)_{t \geq 0}$, which we use to denote Brownian motion on a manifold. Intuitively,
+we may think of a construction based on embedded manifolds, by setting $\tilde{B}_0 = p \in M$, and for
+each increment sampling some random vector in the tangent space TpM, and then moving along the
+manifold in the prescribed direction for an infinitesimal period of time before re-sampling another
+velocity vector from the next tangent space [42]. In fact, we can define such a construction using
+Stratonovich calculus and show that the infinitesimal generator can be written using only local
+coordinates [28]. Here, we instead take the approach of generalising the generator directly from
+Euclidean space to the local coordinates of a manifold, arriving at the same result. We then deduce the
+stochastic differential equation describing $(\tilde{B}_t)_{t \geq 0}$ in Itô form using (20).
+For a standard Brownian motion on Rn, A = △/2, where △ denotes the Laplace operator:
+
+\[ \triangle f = \sum_i \frac{\partial^2 f}{\partial x_i^2} = \mathrm{div}(\nabla f). \qquad (40) \]
+
+Substituting A = △/2 into (20) trivially gives bi(x) = 0 ∀i, Vij(x) = 1{i=j}, as required. The Laplacian,
+△ f (x), is the divergence of the gradient vector field of some function, f ∈ C²(Rn), and its value at
+x ∈ Rn can be thought of as measuring how the average value of f over a small neighbourhood of x differs from f (x) [43].
+
+Figure 2. A two-dimensional manifold (surface) embedded in R3 through r(x1, x2) = (x1, x2, sin(x1) +
+1), parametrised by the local coordinates, x1 and x2. The distance between points A and B is given by
+the length of the curve γ(t) = (t, t, sin(t) + 1).
+
+To define a Brownian motion on any manifold, the gradient and divergence must be generalised.
+We provide a full derivation in Appendix B, which shows that the gradient operator on a manifold can
+be written in local coordinates as ∇M = G−1(x)∇. Combining with the operator, divM, we can define
+a generalisation of the Laplace operator, known as the Laplace–Beltrami operator (e.g., [44,45]), as:
+
+\[ \triangle_{LB} f = \mathrm{div}_M(\nabla_M f) = |G(x)|^{-1/2} \sum_{i=1}^n \frac{\partial}{\partial x_i}\left( |G(x)|^{1/2} \sum_{j=1}^n \{G^{-1}(x)\}_{ij}\,\frac{\partial f}{\partial x_j} \right), \qquad (41) \]
+
+for some $f \in C_0^2(M)$.
+The generator of a Brownian motion on M is △LB/2 [44]. Using (20), the resulting diffusion has
+dynamics given by:
+
+\[ d\tilde{B}_t = \Omega(X_t)\,dt + \sqrt{G^{-1}(X_t)}\,dB_t, \]
+
+\[ \Omega_i(X_t) = \frac{1}{2}|G(X_t)|^{-1/2} \sum_{j=1}^n \frac{\partial}{\partial x_j}\left( |G(X_t)|^{1/2}\,\{G^{-1}(X_t)\}_{ij} \right). \]
+
+Those familiar with the Itô formula will not be surprised by the additional drift term, Ω(Xt). As Itô
+integrals do not follow the chain rule of ordinary calculus, non-linear mappings of martingales, such
+as (Bt)t≥0, typically result in drift terms being added to the dynamics (e.g., [27]).
+To define $\tilde{\nabla}$, we simply note that this is again the gradient operator on a general manifold, so
+$\tilde{\nabla} = G^{-1}(x)\nabla$. For the density, $\tilde{\pi}(x)$, we note that this density will now implicitly be defined with
+respect to the volume measure, $|G(x)|^{1/2}dx$, on the manifold. Therefore, to ensure the diffusion (39) has
+the correct invariant density with respect to the Lebesgue measure, we define:
+
+\[ \tilde{\pi}(x) = \pi(x)\,|G(x)|^{-1/2}. \qquad (42) \]
+
+Putting these three elements together, Equation (39) becomes:
+
+\[ dX_t = \frac{1}{2}G^{-1}(X_t)\nabla \log\left( \pi(X_t)\,|G(X_t)|^{-1/2} \right)dt + \Omega(X_t)\,dt + \sqrt{G^{-1}(X_t)}\,dB_t, \]
+
+which, upon simplification, becomes:
+
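+\[ dX_t = \frac{1}{2}G^{-1}(X_t)\nabla \log \pi(X_t)\,dt + \Lambda(X_t)\,dt + \sqrt{G^{-1}(X_t)}\,dB_t, \qquad (43) \]
+
+\[ \Lambda_i(X_t) = \frac{1}{2}\sum_j \frac{\partial}{\partial x_j}\{G^{-1}(X_t)\}_{ij}. \qquad (44) \]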
+
+It can be shown that this diffusion has invariant Lebesgue density π(x), as required [33]. Intuitively,
+when a set is mapped onto the manifold, distances are changed by a factor, $\sqrt{G(x)}$. Therefore, to end
+up with the initial distances, they must first be changed by a factor of $\sqrt{G^{-1}(x)}$ before the mapping,
+which explains the volatility term in Equation (43).
+The resulting Metropolis–Hastings proposal kernel for this “MALA on a manifold” was clarified
+in [33] and is given by:
+
+\[ Q(x, \cdot) \equiv N\left( x + \frac{\lambda^2}{2}G^{-1}(x)\nabla \log \pi(x) + \lambda^2\Lambda(x),\; \lambda^2 G^{-1}(x) \right), \qquad (45) \]
+
+where λ² is a tuning parameter. The nonlinear drift term here is slightly different to that reported
+in [1,32], for reasons discussed in [33].
+
+4.4. Choosing a Metric
+
+We now turn to the question of which metric structure to put on the manifold, or equivalently,
+how to choose G(x). In this section, we sometimes switch notation slightly, denoting the target density,
+π(x|y), as some of the discussion is directed towards Bayesian inference, where π(·) is the posterior
+distribution for some parameter, x, after observing some data, y. The problem statement is: what is an
+appropriate choice of distance between points in the sample space of a given probability distribution?
+A related (but distinct) question is how to define a distance between two probability distributions
+from the same parametric family, but with different parameters. This has been a key theme in
+information geometry, explored by Rao [46] and others [2] for many years. Although generic
+measures of distance between distributions (such as total variation) are often appropriate, based on
+information-theoretic principles, one can deduce that for a given parametric family, {px(y) : x ∈ X },
+it is in some sense natural to consider this “space of distributions” to be a manifold, where the Fisher
+information is the matrix, G(x) (with the α = 0 connection employed; see [2] for details).
+Because of this, Girolami and Calderhead [1] proposed a variant of the Fisher metric for geometric
+Markov chain Monte Carlo, as:
+
+\[ G(x) = E_{y|x}\left[ -\frac{\partial^2}{\partial x_i \partial x_j} \log f(y|x) \right] - \frac{\partial^2}{\partial x_i \partial x_j} \log \pi_0(x), \qquad (46) \]
+
+where π(x|y) ∝ f (y|x)π0(x) is the target density, f denotes the likelihood and π0 the prior. The metric
+is tailored to Bayesian problems, which are a common use for MCMC, so the Fisher information is
+combined with the negative Hessian of the log-prior. One can also view this metric as the expected
+negative Hessian of the log target, since this naturally reduces to (46).
+The motivation for a Hessian-style metric can also be understood from studying MCMC proposals.
+From (45) and by the same logic as for general pre-conditioning methods [32], the objective is to choose
+G−1(x) to match the covariance structure of π(x|y) locally. If the target density were Gaussian with
+covariance matrix, Σ, then:
+
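+\[ -\frac{\partial^2}{\partial x_i \partial x_j} \log \pi(x|y) = \{\Sigma^{-1}\}_{ij}, \qquad (47) \]
+
+so that taking G(x) to be the negative Hessian gives G⁻¹(x) = Σ exactly.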
+
+In the non-Gaussian case, the negative Hessian is no longer constant, but we can imagine that it
+matches the correlation structure of π(x|y) locally at least. Such ideas have been discussed in the
+geostatistics literature previously [47]. One problem with simply using (47) to define a metric is that
+unless π(x|y) is log-concave, the negative Hessian will not be globally positive-definite, although
+Petra et al. [48] conjecture that it may be appropriate for use in some realistic scenarios and suggest
+some computationally efficient approximation procedures.
+
+Example: Take π(x) ∝ 1/(1 + x²), and set G(x) = −∂² log π(x)/∂x². Then, G⁻¹(x) = (1 + x²)²/(2 −
+2x²), which is negative if x² > 1, so unusable as a proposal variance.
+
+Girolami and Calderhead [1] use the Fisher metric in part to counteract this problem. Taking
+expectations over the data ensures that the likelihood contribution to G(x) in (46) will be positive
+(semi-)definite globally (e.g., [49]); so, provided a log-concave prior is chosen, then (46) should
+be a suitable choice for G(x). Indeed, Girolami and Calderhead [1] provide several examples in
+which geometric MCMC methods using this Fisher metric perform better than their “non-geometric”
+counterparts.
+Betancourt [50] also starts from the viewpoint that the Hessian (47) is an appropriate choice for
+G(x) and defines a mapping from the set of n × n matrices to the set of positive-definite n × n matrices
+by taking a “smooth” absolute value of the eigenvalues of the Hessian. This is done in a way such that
+derivatives of G(x) are still computable, inspiring the name SoftAbs metric. For a fixed
+value of x, the negative Hessian, H(x), is first computed and, then, decomposed into UᵀDU, where
+D is the diagonal matrix of eigenvalues. Each diagonal element of D is then altered by the mapping
+tα : R → R, given by:
+\[ t_\alpha(\lambda_i) = \lambda_i \coth(\alpha\lambda_i), \qquad (48) \]
+
+where α is a tuning parameter (typically chosen to be as large as possible while eigenvalues remain
+numerically non-zero). The function, tα, acts as an absolute value function, but also uplifts eigenvalues
+that are close to zero to ≈ 1/α. It should be noted that while the Fisher metric is only defined for
+models in which a likelihood is present and for which the expectation is tractable, the SoftAbs metric
+can be found for any target distribution, π(·).
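+
+The scalar map in Equation (48) can be sketched as below. The function name softabs and the small-argument cut-off of 1e-8 are assumptions for illustration; in a matrix setting it would be applied to each eigenvalue of the negative Hessian after an eigendecomposition, which is not shown here.
+

```python
import math


def softabs(lam, alpha):
    """SoftAbs map of Equation (48): lam * coth(alpha * lam)."""
    if abs(alpha * lam) < 1e-8:
        # Limiting value as lam -> 0, avoiding a division by zero.
        return 1.0 / alpha
    return lam / math.tanh(alpha * lam)
```

+
+For eigenvalues that are large in magnitude relative to 1/α, the map is numerically indistinguishable from the absolute value, while eigenvalues near zero are lifted to roughly 1/α.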
+Many authors (e.g., [1,48]) have noted that for many problems, the terms involving derivatives
+of G(x) are often small, and so, it is not always worth the computational effort of evaluating them.
+Girolami and Calderhead [1] propose the simplified manifold MALA, in which proposals are of
+the form:
+
+\[ Q(x, \cdot) \equiv N\left( x + \frac{\lambda^2}{2}G^{-1}(x)\nabla \log \pi(x),\; \lambda^2 G^{-1}(x) \right). \qquad (49) \]
+
+Using this method means derivatives of G(x) are no longer needed, so more pragmatic ways of
+regularising the Hessian are possible. One simple approach would be to take the absolute value
+of each eigenvalue, giving G(x) = U^T |D| U, where H(x) = U^T D U is the negative Hessian and |D|
+is a diagonal matrix with {|D|}_ii = |λ_i| (this approach may run into difficulties if eigenvalues are
+numerically zero). Another would be to choose G(x) as the “nearest” positive-definite matrix to the
+negative Hessian, according to some distance metric on the set of n × n matrices. This problem has, in
+fact, been well-studied in mathematical finance, in the context of finding correlations using incomplete
+data sets [51], and tackled using distances induced by the Frobenius norm. Approximate solution
+algorithms are discussed in Higham [51]. It is not clear to us at present whether small changes to
+the Hessian would result in large changes to the corresponding positive-definite matrix under a
+given distance or, indeed, whether, given a distance metric on the space of matrices, there is always a
+well-defined unique “nearest” positive-definite matrix. Below, we provide two simple examples
+showing how a “Hessian-style metric” can alleviate some of the difficulties associated with both heavy-
+and light-tailed target densities.
+
+Entropy 2014, 16, 3074–3102
+
+Example: Take π(x) ∝ 1/(1 + x^2), and set G(x) = |−∂^2 log π(x)/∂x^2|. Then, G^{-1}(x)∇ log π(x) =
+−x(1 + x^2)/|1 − x^2|, which no longer tends to zero as |x| → ∞, suggesting a manifold variant of MALA with
+a Hessian-style metric may avoid some of the pitfalls of the standard algorithm. Note that the drift may become
+very large if |x| ≈ 1, but since the event |x| = 1 occurs with probability zero, we do not see it as a major cause for
+concern.
+
+Example: Take π(x) ∝ e^{−x^4}, and set G(x) = |−∂^2 log π(x)/∂x^2|. Then, G^{-1}(x)∇ log π(x) = −x/3,
+which is O(x), so alleviating the problem of spiralling proposals for light-tailed targets demonstrated by MALA
+in an earlier example.
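+Both closed-form drifts can be checked numerically. The sketch below hard-codes the first and second
+derivatives of log π for the two targets; the function names are hypothetical and chosen only for this
+illustration.
+
```python
def hessian_drift_cauchy(x):
    """G^{-1}(x) * grad log pi(x) for pi(x) proportional to 1 / (1 + x^2)."""
    grad = -2.0 * x / (1.0 + x * x)
    neg_hess = 2.0 * (1.0 - x * x) / (1.0 + x * x) ** 2
    return grad / abs(neg_hess)  # undefined at |x| = 1, where the Hessian vanishes


def hessian_drift_quartic(x):
    """G^{-1}(x) * grad log pi(x) for pi(x) proportional to exp(-x^4)."""
    grad = -4.0 * x ** 3
    neg_hess = 12.0 * x * x
    return grad / abs(neg_hess)  # undefined at x = 0, where the Hessian vanishes
```
+
+At x = 3, the first function returns −3(1 + 9)/|1 − 9| = −3.75 and the second returns −3/3 = −1,
+matching the expressions in the two examples.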
+
+Other choices for G(x) have been proposed, which are not based on the Hessian. These have the
+advantage that gradients need not be computed (either analytically or using computational methods).
+Sejdinovic et al. [52] propose a Metropolis–Hastings method, which can be viewed as a geometric
+variant of the RWM, where the choice for G(x) is based on mapping samples to an appropriate
+feature space, and performing principal component analysis on the resulting features to choose a local
+covariance structure for proposals.
+If we consider the RWM with Gaussian proposals to be an Euler–Maruyama discretisation of
+Brownian motion on a manifold, then proposals will take the form Q(x, ·) ≡ N(x + λ^2 Ω(x), λ^2 G^{-1}(x)).
+If we assume (as in the simplified manifold MALA) that Ω(x) ≈ 0, then we have proposals centred
+at the current point in the Markov chain with a local covariance structure (the full Hastings acceptance
+rate must now be used, as q(x′|x) ≠ q(x|x′) in general).
+As no gradient information is needed, the Sejdinovic et al. metric can be used in conjunction with
+the pseudo-marginal MCMC algorithm, so that π(x|y) need not be known exactly. Examples from the
+article demonstrate the power of the approach [52].
+An important property of any Riemannian metric is how it transforms under coordinate change
+(e.g., [2]). The Fisher information metric commonly studied in information geometry is an example of a
+“coordinate invariant” choice for G(x). If we consider two parametrisations for a statistical model given
+by x and z = t(x), computing the Fisher information under x and then transforming this matrix using
+the Jacobian for the mapping, t, will give the same result as computing the Fisher information under z.
+It should be noted that because of either the prior contribution in (46) or the nonlinear transformations
+applied in other cases, none of the metrics we have reviewed here have this property, which means
+that we have no principled way of understanding how G(x) will relate to G(z). It is intuitive, however,
+that using information from all of π(x), rather than only the likelihood contribution, f(y|x), would
+seem sensible when trying to sample from π(·).
+
+5. Survey of Applications
+
+Rather than conduct our own simulation study, we instead highlight some cases in the literature
+where geometric MCMC methods have been used with success.
+Martin et al. [53] consider Bayesian inference for a statistical inverse problem, in which a surface
+explosion causes seismic waves to travel down into the ground (the subsurface medium). Often,
+the properties of the subsurface vary with distance from ground level or because of obstacles in the
+medium, in which case, a fraction of the waves will scatter off these boundaries and be reflected
+back up to ground level at later times. The observations here are the initial explosion and the waves
+that return to the surface, together with their return times. The challenge is to infer the properties of the
+subsurface medium from this data. The authors construct a likelihood based on the wave equation for
+the data and perform Bayesian inference using a variant of the manifold MALA. Figures are provided
+showing the local correlations present in the posterior and, therefore, highlighting the need for an
+algorithm that can navigate the high density region efficiently. Several methods are compared in the
+paper, but the variant of MALA that incorporates a local correlation structure is shown to be the most
+efficient, particularly as the dimension of the problem increases [53].
+
+Calderhead and Girolami [54] dealt with two models for biological phenomena based on nonlinear
+dynamical systems. A model of circadian control in the Arabidopsis thaliana plant comprised a system
+of six nonlinear differential equations, with twenty-two parameters to be inferred. Another model
+for cell signalling consisted of a system of six nonlinear differential equations with eight parameters,
+with inference complicated by the fact that observations of the model are not recorded directly [54].
+The resulting inference was performed using RWM, MALA and geometric methods, with the results
+highlighting the benefits of taking the latter approach. The simplified variant of MALA on a manifold
+is reported to have produced the most efficient inferences overall, in terms of effective sample size per
+unit of computational time.
+Stathopoulos and Girolami [55] considered the problem of inferring parameters in Markov jump
+processes. In the paper, a linear noise approximation is shown, which can make inference in such
+models more straightforward, enabling an approximate likelihood to be computed. Models based on
+chemical reaction dynamics are considered; one such from chemical kinetics contained four unknown
+parameters; another from gene expression consisted of seven. Inference was performed using the
+RWM, the simplified manifold MALA and Hamiltonian methods, with the MALA reported as most
+efficient according to the chosen diagnostics. The authors note that the simplified manifold method is
+both conceptually simple and able to account for local correlations, making it an attractive choice for
+inference [55].
+Konukoglu et al. [56] designed a method for personalising a generic model for a physiological
+process to a specific patient, using clinical data. The personalisation took the form of patient-specific
+parameter inference. The authors highlight some of the difficulties of this task in general, including the
+complexity of the models and the relative sparsity of the datasets, which often result in a parameter
+identifiability issue [56]. The example discussed in the paper is the Eikonal-diffusion model describing
+electrical activity in cardiac tissue, which results in a likelihood for the data based on a nonlinear partial
+differential equation, combined with observation noise [56]. A method for inference was developed by
+first approximating the likelihood using a spectral representation and then using geometric MCMC
+methods on the resulting approximate posterior. The method was first evaluated on synthetic data
+and then repeated on clinical data taken from a study for ventricular tachycardia radio-frequency
+ablation [56].
+
+6. Discussion
+
+The geometric viewpoint is not necessary to understand manifold variants of the MALA. Indeed,
+several authors [32,33] have discussed these algorithms without considering them to be “geometric”,
+rather simply Metropolis–Hastings methods in which proposal kernels have a position-dependent
+covariance structure. We do not claim that the geometric view is the only one that should be taken.
+Our goal is merely to point out that such position-dependent methods can often be viewed as methods
+defined on a manifold and that studying the structure of the manifold itself may lead to new insights on
+the methods. For example, taking the geometric viewpoint and noting the connection with information
+geometry enabled Girolami and Calderhead to adopt the Fisher metric for calculations [1]. We list here
+a few open questions on which the geometric viewpoint may help shed some insight.
+Computationally-minded readers will have noted that using position-dependent covariance
+matrices adds a significant computational overhead in practice, with additional O(n^3) matrix inversions
+required at each step of the corresponding Metropolis–Hastings algorithms. Clearly, there will be
+many problems for which the matrix, G(x), does not change very much, and therefore, choosing
+a constant covariance G^{-1}(x) = Σ may result in a more efficient algorithm overall. Geometrically,
+this would correspond to a manifold with scalar curvature close to zero everywhere. It may be that
+geometric ideas could be used to understand whether the manifold is flat enough that a constant choice
+of G(x) is sufficient. Making this precise would require a relationship between curvature, an
+inherently local property, and more global statements about the manifold. Many results in differential
+geometry, beginning with the celebrated Gauss–Bonnet theorem, have previously related global and
+local properties in this way [57]. It is unknown to the authors whether results exist relating the
+curvature of a manifold to some global property, but this is an interesting avenue for further research.
+A related question is when to choose the simplified manifold MALA over the full method.
+Problems in which the term, ∥Λ(x)∥, is sufficiently large to warrant calculation correspond to those for
+which the manifold has very high curvature in many places; so again, making some global statement
+related to curvature could help here.
+Although there is a reasonably intuitive argument for why the Hessian is an appropriate starting
+point for G(x), the lack of positive-definiteness may be seen as a cause for concern by some. After
+all, it could be argued that if the curvature is not positive-definite in a region, then how can it be a
+reasonable approximation to the local covariance structure? Many statistical models used to describe
+natural phenomena are characterised by distributions with heavy tails or multiple modes, for which
+this is the case. In addition, for target densities of the form π(x) ∝ e^{−|x|}, the Hessian is everywhere
+equal to zero! The attempts to force positive-definiteness we have described will typically result in
+small moves being proposed in such regions of the sample space, which may not be an optimal strategy.
+Much work in information geometry has centred on the geometry of Hessian structures [58], and some
+insights from this field may help to better understand the question of choosing an appropriate metric.
+In addition, the field of pseudo-Riemannian geometry deals with forms of G(x), which need not be
+positive-definite [39]; so again, understanding could be gained from here.
+Some recent work in high-dimensional inference has centred on defining MCMC methods for
+which efficiency scales O(1) with respect to the dimension, n, of π(·) [19,59]. In the case where X
+takes values in some infinite-dimensional function space, this can be done provided a Gaussian prior
+measure is defined for X. A striking result from infinite-dimensional probability spaces is that two
+different probability measures defined over some infinite-dimensional space tend
+to have disjoint supports [60]. The key challenge for MCMC is to define transition kernels for which
+proposed moves are inside the support for π(·). A straightforward approach is to define proposals for
+which the prior is invariant, since the likelihood contribution to the posterior typically will not alter
+its support from that of the prior [19]. However, the posterior may still look very different from the
+prior, as noted in [61], so this proposal mechanism, though O(1), can still result in slow exploration.
+Understanding the geometry of the support and defining methods that incorporate the likelihood term,
+but also respect this geometry, so as to ensure proposals remain in the support of π(·), is an intriguing
+research proposition.
+The methods reviewed in this paper are based on first order Langevin diffusions. Algorithms
+have also been developed that are based on second order Langevin diffusions, in which a stochastic
+differential equation governs the behaviour of the velocity of a process [62,63]. A natural extension to
+the work of Girolami and Calderhead [1] and Xifara et al. [33] would be to map such diffusions onto
+a manifold and derive Metropolis–Hastings proposal kernels based on the resulting dynamics. The
+resulting scheme would be a generalisation of [63], though the most appropriate discretisation scheme
+for a second order process to facilitate sampling is unclear and perhaps a question worthy of further
+exploration.
+We have focused primarily here on the sample space X = R^n and on defining an appropriate
+manifold on which to construct Markov chains. In some inference problems, however, the sample
+space is a pre-defined manifold, for example the set of n × n rotation matrices, commonly found in the
+field of directional statistics [64]. Such manifolds are often not globally mappable to Euclidean n-space.
+Methods have been devised for sampling from such spaces [65,66]. In order to use the methods
+described here for such problems, an appropriate approach for switching between coordinate patches
+at the relevant time would need to be devised, which could be an interesting area of further study.
+Alongside these geometric problems, we can also discuss geometric MCMC methods from a
+statistical perspective. The last example given in the previous section hinted that the manifold MALA
+may cope better with target distributions with heavy tails. In fact, Latuszynski et al. [67] have shown
+that, in one dimension, the manifold MALA is geometrically ergodic for a class of targets of the
+form π(x) ∝ exp(−|x|^β) for any choice of β ≠ 1. This incorporates cases where tails are heavier
+than exponential and lighter than Gaussian, two scenarios under which geometric ergodicity fails for
+the MALA.
+Finding optimal acceptance rates and scaling of λ with dimension are two other related challenges.
+In this case, the picture is more complex. Traditional results have been shown for Metropolis–Hastings
+methods in the case where the components of the target distribution are independent and identically distributed, or where
+π(·) has some other suitable symmetry and regularity in its shape. Manifold methods are, however,
+specifically tailored to scenarios in which this is not the case, scenarios in which there is a high
+correlation between components of x, which changes depending on the value of x. It is less clear how
+to proceed with finding relevant results that can serve as guidelines to practitioners here. Indeed,
+Sherlock [18] notes that a requirement for optimal acceptance rate results for the RWM to be appropriate
+is that the curvature of π(x) does not change too much, yet this is the very scenario in which we would
+want to use a manifold method.
+
+Acknowledgments: We thank the two reviewers for helpful comments and suggestions. Samuel Livingstone is
+funded by a PhD Scholarship from Xerox Research Centre Europe. Mark Girolami is funded by an Engineering
+and Physical Sciences Research Council Established Career Research Fellowship, EP/J016934/1, and a Royal
+Society Wolfson Research Merit Award.
+
+Author Contributions: The article was written by Samuel Livingstone under the guidance of Mark Girolami. All authors
+have read and approved the final manuscript.
+
+Appendix
+
+Appendix A. Total Variation Distance
+
+We show how to obtain (10) from (9). Denoting two probability distributions, μ(·) and ν(·), and
+associated densities, μ(x) and ν(x), we have:
+
+\[ \|\mu(\cdot) - \nu(\cdot)\|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|. \]
+
+Define the set B = {x ∈ X : μ(x) > ν(x)}. To see that B ∈ ℬ, note that B = ∪_{q∈ℚ} ({x ∈ X : μ(x) >
+q} ∩ {x ∈ X : ν(x) < q}), and the result follows from properties of ℬ (e.g., [68]). Now, for any A ∈ ℬ:
+
+\[ \mu(A) - \nu(A) \leq \mu(A \cap B) - \nu(A \cap B) \leq \mu(B) - \nu(B), \]
+
+and similarly:
+
+\[ \nu(A) - \mu(A) \leq \nu(B^c) - \mu(B^c), \]
+
+so the supremum will be attained either at B or B^c. However, since μ(X) = ν(X) = 1:
+
+\[ [\mu(B) - \nu(B)] - [\nu(B^c) - \mu(B^c)] = 0, \]
+
+so that
+
+\[ |\mu(B) - \nu(B)| = |\mu(B^c) - \nu(B^c)|. \]
+
+Using these facts gives an alternative characterisation of the total variation distance as:
+
+\[ \|\mu(\cdot) - \nu(\cdot)\|_{TV} = \frac{1}{2} \left( |\mu(B) - \nu(B)| + |\mu(B^c) - \nu(B^c)| \right) = \frac{1}{2} \int_{\mathcal{X}} |\mu(x) - \nu(x)| \, dx \]
+
+as required.
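+On a finite sample space the argument can be verified by brute force. The sketch below assumes two
+probability vectors over the same finite set (an assumption made purely for illustration; the function
+names are hypothetical) and compares the supremum over all events, the value attained on the set B, and
+half the L1 distance.
+
```python
from itertools import combinations


def tv_half_l1(mu, nu):
    """Half the L1 distance between two probability vectors."""
    return 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))


def tv_sup_over_events(mu, nu):
    """Supremum of |mu(A) - nu(A)| over every subset A of the index set."""
    indices = range(len(mu))
    best = 0.0
    for size in range(len(mu) + 1):
        for event in combinations(indices, size):
            gap = abs(sum(mu[i] for i in event) - sum(nu[i] for i in event))
            best = max(best, gap)
    return best


def tv_on_set_b(mu, nu):
    """mu(B) - nu(B) on the set B = {i : mu[i] > nu[i]}."""
    return sum(m - n for m, n in zip(mu, nu) if m > n)
```
+
+For μ = (0.5, 0.3, 0.2) and ν = (0.2, 0.3, 0.5), all three functions give 0.3 up to floating-point
+rounding, with the supremum attained at B = {0}.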
+
+Appendix B. Gradient and Divergence Operators on a Riemannian Manifold
+
+The gradient of a function on R^n is the unique vector field such that, for any unit vector, u:
+
+\[ \langle \nabla f(x), u \rangle = D_u[f(x)] = \lim_{h \to 0} \left\{ \frac{f(x + hu) - f(x)}{h} \right\}, \tag{A1} \]
+
+the directional derivative of f along u at x ∈ R^n.
+On a manifold, the gradient operator, ∇_M, can still be defined, such that the inner product
+g_p(∇_M f(x), u) = D_u[f(x)]. Setting ∇_M = G^{-1}(x)∇ gives:
+
+\[ g_p(\nabla_M f(x), u) = (G^{-1}(x) \nabla f(x))^T G(x) u = \langle \nabla f(x), u \rangle, \]
+
+which is equal to the directional derivative along u as required.
+The divergence of some vector field, v, at a point, x ∈ R^n, is the net outward flow generated by
+v through some small neighbourhood of x. Mathematically, the divergence of v(x) ∈ R^3 is given by
+∑_i ∂v_i/∂x_i. On a more general manifold, the divergence is also a sum of derivatives, but here, they
+are covariant derivatives. A short introduction is provided in Appendix C. Here, we simply state
+that the covariant derivative of a vector field, v, at a point p ∈ M is the orthogonal projection of the
+directional derivative onto the tangent space, T_pM. Intuitively, a vector field on a manifold is a field
+of vectors, each of which lies in the tangent space to a point, p ∈ M. It only makes sense, therefore, to
+discuss how vector fields change along the manifold or in the direction of vectors that also lie in the
+tangent space. Although the idea seems simple, the covariant derivative has some attractive geometric
+properties; notably, it can be completely written in local coordinates and so does not depend on
+knowledge of an embedding in some ambient space.
+The divergence of a vector field, v, defined on a manifold, M, at the point, p ∈ M, is defined as:
+
+\[ \mathrm{div}_M(v) = \sum_{i=1}^{n} D^c_{e_i}[v_i], \]
+
+where e_i denotes the i-th basis vector for the tangent space, T_pM, at p ∈ M, and v_i denotes the i-th
+coefficient. This can be written in local coordinates (see Appendix C) as:
+
+\[ \mathrm{div}_M(v) = |G(x)|^{-\frac{1}{2}} \sum_{i=1}^{n} \frac{\partial}{\partial x_i} \left( |G(x)|^{\frac{1}{2}} v_i \right), \]
+
+and can be combined with ∇_M to form the Laplace–Beltrami operator (41).
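+
+As a sanity check of the local-coordinate formula, consider the punctured plane in polar coordinates,
+(x_1, x_2) = (r, θ). This worked example is an illustration added under that assumed chart, and the
+radial field used is chosen only for convenience.
+
```latex
% Polar coordinates on R^2 \ {0}: G(r, \theta) = diag(1, r^2), so |G|^{1/2} = r.
\begin{align*}
\mathrm{div}_M(v)
  &= \frac{1}{r} \left[ \frac{\partial}{\partial r} \left( r\, v_1 \right)
     + \frac{\partial}{\partial \theta} \left( r\, v_2 \right) \right]
   = \frac{\partial v_1}{\partial r} + \frac{v_1}{r} + \frac{\partial v_2}{\partial \theta}, \\
% Radial field (v_1, v_2) = (r, 0), which is (x, y) in Cartesian coordinates:
\mathrm{div}_M(v)
  &= \frac{1}{r} \frac{\partial}{\partial r} \left( r^2 \right) = 2 .
\end{align*}
```
+
+The value 2 agrees with the Cartesian computation ∂x/∂x + ∂y/∂y = 2 for the same field.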
+
+Appendix C. Vector Fields and the Covariant Derivative
+
+Here, we provide a short introduction to vector fields and differentiation on a smooth manifold;
+see [38,39]. The following geometric notation is used here: (i) vector components are indexed with a
+superscript, e.g., v = (v^1, ..., v^n); and (ii) repeated subscripts and superscripts are summed over, e.g.,
+v^i e_i = ∑_i v^i e_i (known as the Einstein summation convention).
+For any smooth manifold, M, the set of all tangent vectors to points on M is known as the tangent
+bundle and denoted TM.
+A C^r vector field defined on M is a mapping that assigns to each point, p ∈ M, a tangent vector,
+v(p) ∈ T_pM. In addition, the components of v(p) in any basis for T_pM must also be C^r [38]. We
+will denote the set of all vector fields on M as Γ(TM). For some vector field, v ∈ Γ(TM), at any
+point, p ∈ M, the vector, v(p) ∈ T_pM, can be written as a linear combination of some n basis vectors
+{e_1, ..., e_n} as v = v^i e_i. To understand how v will change in a particular direction along M, it only
+makes sense, therefore, to consider derivatives along vectors in T_pM. Two other things must be
+considered when defining a derivative along a manifold: (i) how the components, v^i, along each basis
+vector will change; and (ii) how each basis vector, e_i, itself will change. For the usual directional
+derivative on R^n, the basis vectors do not change, as the tangent space is the same at each point, but
+for a more general manifold, this is no longer the case: the e_i’s are referred to as a “local” basis for each
+T_pM.
+The covariant derivative, D^c, is defined so as to account for these shortcomings. When considering
+differentiation along a vector, u* ∉ T_pM, u* is simply projected onto the tangent space. The derivative
+with respect to any u ∈ T_pM can now be decomposed into a linear combination of derivatives of basis
+vectors and vector components:
+
+\[ D^c_u[v] = D^c_{u^i e_i}[v^j e_j], \tag{A2} \]
+
+where the argument, p, has been dropped, but is implied for both components and local basis vectors.
+The operator, D^c_u[v], is defined to be linear in both u and v and to satisfy the product rule [38]; so,
+Equation (A2) can be decomposed into:
+
+\[ D^c_u[v] = u^i \left( D^c_{e_i}[v^j] e_j + v^j D^c_{e_i}[e_j] \right). \tag{A3} \]
+
+The operator, D^c, need, therefore, only be defined along the direction of basis vectors e_i and for vector
+component v^i and basis vector e_i arguments.
+For components v^i, D^c_{e_j}[v^i] is defined as simply the partial derivative ∂_j v^i := ∂v^i/∂x^j. The
+directional derivative of some basis vector e_i along some e_j is best understood through the example of
+a regular surface Σ ⊂ R^3. Here, D_{e_j}[e_i] will be a vector, w ∈ R^3. Taking the basis for this space at the
+point, p, as {e_1, e_2, n̂}, where n̂ denotes the unit normal to T_pΣ, we can write w = αe_1 + βe_2 + κn̂. The
+covariant derivative, D^c_{e_j}[e_i], is simply the projection of w onto T_pΣ, given by w* = αe_1 + βe_2. More
+generally, at some point, p, in a smooth manifold, M, the covariant derivative D^c_{e_j}[e_i] = Γ^k_{ji} e_k (with
+upper and lower indices summed over). The coefficients, Γ^k_{ji}, are known as the Christoffel symbols: Γ^k_{ji}
+denotes the coefficient of the k-th basis vector when taking the derivative of the i-th with respect to the
+j-th. If a Riemannian metric, g, is chosen for M, then they can be expressed completely as a function
+of g (or in local coordinates as a function of the matrix, G). Using these definitions, Equation (A3) can
+be re-written as:
+
+\[ D^c_u[v] = u^i \left( \partial_i v^k + v^j \Gamma^k_{ij} \right) e_k. \tag{A4} \]
+
+The divergence of a vector field, v ∈ Γ(TM), at the point, p ∈ M, is given by:
+
+\[ \mathrm{div}_M(v) = D^c_{e_i}[v^i], \tag{A5} \]
+
+where, again, repeated indices are summed over. If M = R^n, this reduces to the usual sum of partial
+derivatives, ∂_i v^i. On a more general manifold, M, the equivalent expression is:
+
+\[ D^c_{e_i}[v^i] = \partial_i v^i + v^i \Gamma^j_{ij}, \tag{A6} \]
+
+where, again, repeated indices are summed. As has been previously stated, if a metric, g, and coordinate
+chart are chosen for M, the Christoffel symbols can be written in terms of the matrix, G(x). In this
+case [69]:
+
+\[ \Gamma^j_{ij} = |G(x)|^{-\frac{1}{2}} \partial_i \left( |G(x)|^{\frac{1}{2}} \right), \tag{A7} \]
+
+so Equation (A6) becomes:
+
+\[ D^c_{e_i}[v^i] = |G(x)|^{-\frac{1}{2}} \partial_i \left( |G(x)|^{\frac{1}{2}} v^i \right), \tag{A8} \]
+
+where v = v(x).
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1. Girolami, M.; Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B 2011, 73, 123–214.
+2. Amari, S.I.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2007; Volume 191.
+3. Marriott, P.; Salmon, M. Applications of Differential Geometry to Econometrics; Cambridge University Press: Cambridge, UK, 2000.
+4. Betancourt, M.; Girolami, M. Hamiltonian Monte Carlo for Hierarchical Models. 2013, arXiv: 1312.0906.
+5. Neal, R. MCMC using Hamiltonian Dynamics. In Handbook of Markov Chain Monte Carlo; Chapman and Hall/CRC: Boca Raton, FL, USA, 2011; pp. 113–162.
+6. Betancourt, M.; Stein, L.C. The Geometry of Hamiltonian Monte Carlo. 2011, arXiv: 1112.4118.
+7. Robert, C.P.; Casella, G. Monte Carlo Statistical Methods; Springer: New York, NY, USA, 2004; Volume 319.
+8. Tierney, L. Markov chains for exploring posterior distributions. Ann. Stat. 1994, 22, 1701–1728.
+9. Kipnis, C.; Varadhan, S. Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Commun. Math. Phys. 1986, 104, 1–19.
+10. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2012.
+11. Plummer, M.; Best, N.; Cowles, K.; Vines, K. CODA: Convergence diagnosis and output analysis for MCMC. R News 2006, 6, 7–11.
+12. Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435.
+13. Jones, G.L.; Hobert, J.P. Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Stat. Sci. 2001, 16, 312–334.
+14. Jones, G.L. On the Markov chain central limit theorem. Probab. Surv. 2004, 1, 299–320.
+15. Gelman, A.; Rubin, D.B. Inference from iterative simulation using multiple sequences. Stat. Sci. 1992, 7, 457–472.
+16. Sherlock, C.; Fearnhead, P.; Roberts, G.O. The random walk Metropolis: Linking theory and practice through a case study. Stat. Sci. 2010, 25, 172–190.
+17. Sherlock, C.; Roberts, G. Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets. Bernoulli 2009, 15, 774–798.
+18. Sherlock, C. Optimal scaling of the random walk Metropolis: General criteria for the 0.234 acceptance rule. J. Appl. Probab. 2013, 50, 1–15.
+19. Beskos, A.; Kalogeropoulos, K.; Pazos, E. Advanced MCMC methods for sampling on diffusion pathspace. Stoch. Processes Appl. 2013, 123, 1415–1453.
+20. Roberts, G.O.; Rosenthal, J.S. Optimal scaling for various Metropolis–Hastings algorithms. Stat. Sci. 2001, 16, 351–367.
+21. Roberts, G.O.; Tweedie, R.L. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika 1996, 83, 95–110.
+22. Mengersen, K.L.; Tweedie, R.L. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Stat. 1996, 24, 101–121.
+23. Jarner, S.F.; Hansen, E. Geometric ergodicity of Metropolis algorithms. Stoch. Processes Appl. 2000, 85, 341–361.
+24. Christensen, O.F.; Møller, J.; Waagepetersen, R.P. Geometric ergodicity of Metropolis–Hastings algorithms for conditional simulation in generalized linear mixed models. Methodol. Comput. Appl. Probab. 2001, 3, 309–327.
+25. Neal, P.; Roberts, G. Optimal scaling for random walk Metropolis on spherically constrained target densities. Methodol. Comput. Appl. Probab. 2008, 10, 277–297.
+26. Jarner, S.F.; Tweedie, R.L. Necessary conditions for geometric and polynomial ergodicity of random-walk-type. Bernoulli 2003, 9, 559–578.
+27. Øksendal, B. Stochastic Differential Equations; Springer: New York, NY, USA, 2003.
+28. Rogers, L.C.G.; Williams, D. Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus; Cambridge University Press: Cambridge, UK, 2000; Volume 2.
+29. Meyn, S.P.; Tweedie, R.L. Stability of Markovian processes III: Foster–Lyapunov criteria for continuous-time processes. Adv. Appl. Probab. 1993, 25, 518–548.
+30. Coffey, W.; Kalmykov, Y.P.; Waldron, J.T. The Langevin Equation: with Applications to Stochastic Problems in Physics, Chemistry, and Electrical Engineering; World Scientific: Singapore, Singapore, 2004; Volume 14.
+31. Roberts, G.O.; Tweedie, R.L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 1996, 2, 341–363.
+32. Roberts, G.O.; Stramer, O. Langevin diffusions and Metropolis–Hastings algorithms. Methodol. Comput. Appl. Probab. 2002, 4, 337–357.
+33. Xifara, T.; Sherlock, C.; Livingstone, S.; Byrne, S.; Girolami, M. Langevin diffusions and the Metropolis-adjusted Langevin algorithm. Stat. Probab. Lett. 2013, 91, 14–19.
+34. Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 1946, 186, 453–461.
+35. Critchley, F.; Marriott, P.; Salmon, M. Preferred point geometry and statistical manifolds. Ann. Stat. 1993, 21, 1197–1224.
+36. Marriott, P. On the local geometry of mixture models. Biometrika 2002, 89, 77–93.
+37. Barndorff-Nielsen, O.; Cox, D.; Reid, N. The role of differential geometry in statistical theory. Int. Stat. Rev. 1986, 54, 83–96.
+38. Boothby, W.M. An Introduction to Differentiable Manifolds and Riemannian Geometry; Academic Press: San Diego, CA, USA, 1986; Volume 120.
+39. Lee, J.M. Smooth Manifolds; Springer: New York, NY, USA, 2003.
+40. Do Carmo, M.P. Riemannian Geometry; Springer: New York, NY, USA, 1992.
+41. Nash, J.F., Jr. The imbedding problem for Riemannian manifolds. In The Essential John Nash; Princeton University Press: Princeton, NJ, USA, 2002; p. 151.
+42. Manton, J.H. A Primer on Stochastic Differential Geometry for Signal Processing. 2013, arXiv: 1302.0430.
+43. Stewart, J. Multivariable Calculus; Cengage Learning: Boston, MA, USA, 2011.
+44. Hsu, E.P. Stochastic Analysis on Manifolds; American Mathematical Society: Providence, RI, USA, 2002; Volume 38.
+45. Kent, J. Time-reversible diffusions. Adv. Appl. Probab. 1978, 10, 819–835.
+46. Radhakrishna Rao, C. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
+47. Christensen, O.F.; Roberts, G.O.; Sköld, M. Robust Markov chain Monte Carlo methods for spatial generalized linear mixed models. J. Comput. Graph. Stat. 2006, 15, 1–17.
+48. Petra, N.; Martin, J.; Stadler, G.; Ghattas, O. A computational framework for infinite-dimensional Bayesian inverse problems: Part II. Stochastic Newton MCMC with application to ice sheet flow inverse problems. 2013, arXiv: 1308.6221.
+49. Pawitan, Y. In All Likelihood: Statistical Modelling and Inference Using Likelihood; Oxford University Press: Oxford, UK, 2001.
+50. Betancourt, M. A General Metric for Riemannian Manifold Hamiltonian Monte Carlo. In Geometric Science of Information; Springer: New York, NY, USA, 2013; pp. 327–334.
+51. Higham, N.J. Computing the nearest correlation matrix—a problem from finance. IMA J. Numer. Anal. 2002, 22, 329–343.
+52. Sejdinovic, D.; Garcia, M.L.; Strathmann, H.; Andrieu, C.; Gretton, A. Kernel Adaptive Metropolis–Hastings. 2013, arXiv: 1307.5302.
+53. Martin, J.; Wilcox, L.C.; Burstedde, C.; Ghattas, O. A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM J. Sci. Comput. 2012, 34, A1460–A1487.
+54. Calderhead, B.; Girolami, M. Statistical analysis of nonlinear dynamical systems using differential geometric sampling methods. Interface Focus 2011, 1, 821–835.
+55. Stathopoulos, V.; Girolami, M.A. Markov chain Monte Carlo inference for Markov jump processes via the linear noise approximation. Philos. Trans. R. Soc. A 2013, 371, 20110541.
+
+56.
+Konukoglu, E.; Relan, J.; Cilingir, U.; Menze, B.H.; Chinchapatnam, P.; Jadidi, A.; Cochet, H.; Hocini, M.;
+Delingette, H.; Jaïs, P.; et al. Efficient probabilistic model personalization integrating uncertainty on data and
+parameters: Application to eikonal-diffusion models in cardiac electrophysiology. Prog. Biophys. Mol. Biol.
+2011, 107, 134–146.
+57.
+Do Carmo, M.P. Differential Geometry of Curves and Surfaces; Prentice-Hall: Englewood Cliffs,
+NJ, USA, 1976; Volume 2.
+58.
+Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, Singapore, 2007; Volume 1.
+59.
+Cotter, S.; Roberts, G.; Stuart, A.; White, D. MCMC methods for functions: Modifying old algorithms to
+make them faster. Stat. Sci. 2013, 28, 424–446.
+60.
+Da Prato, G.; Zabczyk, J. Stochastic Equations in Infinite Dimensions; Cambridge University Press: Cambridge,
+UK, 2008.
+61.
+Law, K.J. Proposals which speed up function-space MCMC. J. Comput. Appl. Math. 2014, 262, 127–138.
+62.
+Ottobre, M.; Pillai, N.S.; Pinski, F.J.; Stuart, A.M. A Function Space HMC Algorithm With Second Order
+Langevin Diffusion Limit. 2013, arXiv: 1308.0543.
+63.
+Horowitz, A.M. A generalized guided Monte Carlo algorithm. Phys. Lett. B 1991, 268, 247–252.
+64.
+Mardia, K.V.; Jupp, P.E. Directional Statistics; Wiley: New York, NY, USA, 2009; Volume 494.
+65.
+Byrne, S.; Girolami, M. Geodesic Monte Carlo on embedded manifolds. Scand. J. Stat. 2013, 40, 825–845.
+66.
+Diaconis, P.; Holmes, S.; Shahshahani, M. Sampling from a manifold. In Advances in Modern Statistical Theory
+and Applications: A Festschrift in Honor of Morris L. Eaton; Institute of Mathematical Statistics: Washington,
+DC, USA, 2013; pp. 102–125.
+67.
+Latuszynski, K.; Roberts, G.O.; Thiery, A.; Wolny, K. Discussion on “Riemann manifold Langevin and
+Hamiltonian Monte Carlo methods” (by Girolami, M. and Calderhead, B.). J. R. Stat. Soc. Ser. B 2011,
+73, 188–189.
+68.
+Capinski, M.; Kopp, P.E. Measure, Integral and Probability; Springer: New York, NY, USA, 2004.
+69.
+Schutz, B.F. Geometrical Methods of Mathematical Physics; Cambridge University Press: Cambridge, UK, 1984.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+Article
+Variational Bayes for Regime-Switching
+Log-Normal Models
+
+Hui Zhao and Paul Marriott *
+
+University of Waterloo, 200 University Avenue West, Waterloo, ON N2L 3G1, Canada; E-Mail:
+h6zhao@uwaterloo.ca
+*
+E-Mail: pmarriot@uwaterloo.ca; Tel.: +1-519-888-4567.
+
+Received: 14 April 2014; in revised form: 12 June 2014 / Accepted: 7 July 2014 /
+Published: 14 July 2014
+
+Abstract: The power of projection using divergence functions is a major theme in information
+geometry. One version of this is the variational Bayes (VB) method. This paper looks at VB in the
+context of other projection-based methods in information geometry. It also describes how to apply
+VB to the regime-switching log-normal model and how it provides a computationally fast solution to
+quantify the uncertainty in the model specification. The results show that the method can recover
+exactly the model structure, gives reasonable point estimates and is very computationally efficient.
+The potential problems of the method in quantifying the parameter uncertainty are discussed.
+
+Keywords: information geometry; variational Bayes; regime-switching log-normal model; model
+selection; covariance estimation
+
+1. Introduction
+
+While, in principle, the calculation of the posterior distribution is mathematically straightforward,
+in practice, the computation of many of its features, such as posterior densities, normalizing constants
+and posterior moments, is a major challenge in Bayesian analysis. Such computations typically
+involve high dimensional integrals, which often have no analytical or tractable forms. The variational
+Bayes (VB) method was developed to generate tractable approximations to these quantities. This
+method provides analytic approximations to the posterior distribution by minimizing the Kullback–
+Leibler (KL) divergence from the approximations to the actual posterior and has been demonstrated to
+be computationally very fast.
+VB gains its computational advantages by making simplifying assumptions about the posterior
+dependence structure.
+For example, in the simplest form, it assumes posterior independence
+between selected sets of parameters. Under these assumptions, the resultant approximate posterior
+is either known analytically or can be computed by a simple iteration algorithm similar to the
+Expectation-Maximization (EM) algorithm. In this paper, we show that, as well as having advantages
+of computational speed, the VB algorithm does an excellent job of model selection, in particular in
+finding the appropriate number of regimes.
+While the simplification in the dependence gives computational advantages, it also comes at
+a cost. For example, we also found that the posterior variance may be underestimated. In [1], we
+propose a novel method to compute the true posterior covariance matrix by only using the information
+obtained from VB approximations.
+The use of projections to particular families is, of course, not new to information geometry (IG).
+In [2], we find the famous Pythagorean results concerning projection using α-divergences to α-families,
+and other important results on projections based on divergences can be found in [3] and [4] (Chapter 7).
+
+Entropy 2014, 16, 3832–3847; doi:10.3390/e16073832
+www.mdpi.com/journal/entropy
+
+1.1. Variational Bayes
+
+Suppose that, in a Bayesian inference problem, we use q(τ) to approximate the posterior p(τ|y),
+where y is the data and τ = {τ1, · · · , τp} is the model parameter vector. The KL divergence between
+them is defined as
+
+\[ \mathrm{KL}[q(\tau)\,\|\,p(\tau|y)] = \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)} \, d\tau, \tag{1} \]
+
+provided the integral exists. We want to balance two things, having the discrepancy between p and q
+small, while keeping q tractable. Hence, we want to seek q(τ), which minimizes Equation (1), while
+keeping q(τ) in an analytically tractable form. First, note that the evaluation of Equation (1) requires
+p(τ|y), which may be unavailable, since in the general Bayesian problem, its normalizing constant is
+one of the main intractable integrals. However, we note that:
+
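+\begin{align}
+\mathrm{KL}[q(\tau)\,\|\,p(\tau|y)] &= \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)\,p(y)} \, d\tau + \log p(y) \notag \\
+&= -\int q(\tau) \log \frac{p(\tau, y)}{q(\tau)} \, d\tau + \log p(y). \tag{2}
+\end{align}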
+
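+Thus, since log p(y) does not depend on q, minimizing Equation (1) is equivalent to maximizing the
+integral $\int q(\tau) \log \frac{p(\tau, y)}{q(\tau)} \, d\tau$ that appears on the right-hand side of Equation (2). The key
+computational point is that, often, the term p(τ, y) is available even when the full posterior
+$p(\tau, y) / \int p(\tau, y) \, d\tau$ is not.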
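
The identity in Equation (2) can be checked numerically on a small discrete problem. The sketch below is illustrative only: the three-point parameter space, the joint values and the candidate q are hypothetical numbers, not taken from the paper, and sums replace the integrals.

```python
import math

# Hypothetical toy problem: tau takes three values, and joint[k] = p(tau = k, y)
# for one fixed observation y. The numbers are illustrative only.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                          # p(y)
posterior = [j / evidence for j in joint]      # p(tau | y)


def kl(q, p):
    """KL[q || p] for discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def free_energy(q, joint):
    """Sum over tau of q(tau) * log(p(tau, y) / q(tau))."""
    return sum(qi * math.log(ji / qi) for qi, ji in zip(q, joint) if qi > 0)


q = [0.5, 0.3, 0.2]                            # an arbitrary candidate approximation
lhs = kl(q, posterior)                         # left-hand side of Equation (2)
rhs = math.log(evidence) - free_energy(q, joint)
print(abs(lhs - rhs) < 1e-12)
```

Only the unnormalized joint enters `free_energy`, which is the practical point of the rearrangement: the quantity being maximized never needs p(y).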
+
+Definition 1. Let $F(q) = \int q(\tau) \log \frac{p(\tau, y)}{q(\tau)} \, d\tau$ and:
+
+\[ \hat{q} = \arg\max_{q \in Q} F(q), \tag{3} \]
+
+where Q is a predetermined set of probability density functions over the parameter space. Then $\hat{q}$ is called the
+variational approximation or variational posterior distribution, and functions of $\hat{q}$ (such as mean, variance, etc.)
+are called variational parameters.
+
+Some of the power of Definition 1 comes when we assume that all elements of Q have tractable
+posteriors. In that case, all variational parameters will then also be tractable when the optimization
+can be achieved. A prime example of a choice for Q is the set of all densities that factorize as
+
+\[ q(\tau) = \prod_{i=1}^{d} q_i(\tau_i). \]
+
+This reduces the computational problem from computing a high dimensional integral to one of
+computing a number of one-dimensional ones. Furthermore, as we see in the example of this paper, it
+is often the case that the variational families are standard exponential families (since they are often
+‘maximum entropy models’ in some sense), and the optimisation problem (3) can be solved by simple
+iterative methods with very fast convergence.
+The core of the method builds on the basis of the principle of the variational free energy
+minimization in physics, which is concerned with finding the maxima and minima of a functional over
+a class of functions, and the method gains its name from this root. Early developments of the method
+can be found in machine learning, especially in applications on neural networks [5,6]. The method
+has been successfully applied in many different disciplines and domains, for example, in independent
+component analysis [7,8], graphical models [9,10], information retrieval [11] and factor analysis [12].
+
+In the statistical literature, an early application of the variational principle can be found in the
+work of [13] to construct Bayes estimators. In recent years, the method has obtained more attention
+from both the application and theoretical perspective, for example [14–18].
+
+1.2. Regime-Switching Models
+
+In this paper, we illustrate the strengths and weaknesses of VB through a detailed case study.
+In particular, we look at a model that is used in finance, risk management and actuarial science, the
+so-called regime-switching log-normal model (RSLN) proposed, in this context, by [19].
+Switching between different states, or regimes, is a common phenomenon in many time series,
+and regime-switching models, originally proposed by [20], have been used to model these switching
+processes. As demonstrated in [21], the maximum likelihood estimate (MLE) does not give a simple
+method to deal with parameter uncertainty; for details of this method, see [21]. The asymptotic
+normality of maximum likelihood estimators may not apply for sample sizes commonly found in
+practice. Hence, to understand parameter uncertainty, [21] considered the RSLN model in a Bayesian
+framework using the Metropolis–Hastings algorithm. Furthermore, model uncertainty, in particular
+selecting the correct number of regimes, is a major issue. Hence, model selection criteria have to be
+used to choose the “best” model. Hardy [19] found that a two-regime RSLN model maximized the
+Bayes information criterion (BIC) [22] for both monthly TSE 300 total return data and S&P 500 total
+return data; however, according to the Akaike information criterion (AIC) [23], a three-regime model
+was optimal for the S&P data. To account for the model uncertainty associated with the number of
+regimes, [24] offered a trans-dimensional model using reversible jump MCMC [25]. We note that BIC
+is not necessarily ideal for model selection with state space models [26], while it is still commonly used
+in the literature.
+MCMC methods make possible the computation of all posterior quantities; however there are a
+number of practical issues associated with their implementation. A primary concern is determining
+that the generated chain has, in fact, “converged”. In practice, MCMC practitioners have to resort
+to convergence diagnostic techniques. Furthermore, the computational cost can be a concern. Other
+implementational issues include the difficulty of making good initialisation choices, implementing the
+MCMC algorithm in one long chain or several shorter chains in parallel, etc. Detailed discussions can
+be found in [27].
+One of the main contributions of this paper is to apply the variational Bayes (VB) method to the
+RSLN model and present a solution to quantify the uncertainty in model specification. The VB method
+is a technique that provides analytical approximations to the posterior quantities, and in practice, it has
+been demonstrated to be a much faster alternative to MCMC methods.
+
+2. Variational Bayes and Informational Geometry
+
+In this section, we explore the relationship between VB and IG, in particular the statistical
+properties of divergence-based projections onto exponential families. Here, we used the IG of [2], in
+particular the ±1 dual affine parameters for exponential families. One of the most striking results
+from [2] is the Pythagorean property of these dual affine coordinate systems. This is illustrated in
+Figure 1, which shows a schematic representing a model space containing the distribution f0(x) and
+an exponential family f (x; θ).
+
+[Figure 1: schematic with labels f0(x), f(x; θ), −1-geodesic and +1-geodesic.]
+
+Figure 1. Projections onto an exponential family.
+
+The Pythagorean result comes from using the KL divergence to project onto the exponential family
+f(x; θ) = ν(x) exp{s(x)θ − ψ(θ)}, i.e.,
+
+\[ \min_{\theta} \int -\log \frac{f(x; \theta)}{f_0(x)} \, f_0(x) \, dx. \]
+
+All distributions that project to the same point form a −1-flat space defined by all distributions f(x)
+with the same mean, i.e.,
+
+\[ E_{\theta}(s(x)) = E_{f(x)}(s(x)), \]
+
+and further, it is Fisher orthogonal to the +1-flat family f (x; θ). The statistical interpretation of this
+concerns the behaviour of a model f (x, θ) when the data generation process does not lie in the model.
+In contrast to this, we have the VB method, which uses the reverse KL divergence for the projection,
+i.e.,
+
+\[ \min_{\theta} \int \log \frac{f(x; \theta)}{f_0(x)} \, f(x; \theta) \, dx. \]
+
+This results in a Fisher orthogonal projection, shown in Figure 1, but now using a +1-flat family.
+This does not have the property that the mean of s(x) is constant, but as we shall see, it does have nice
+computational properties when used in the context of Bayesian analysis.
+In order to investigate the information geometry of VB, we consider two examples. The first,
+in Section 3.1, is selected to maximally illustrate the underlying geometric issues and to get some
+understanding of the quality of the VB approximation. The second, in Section 3.2, shows an important
+real-world application from actuarial science and is illustrated with simulated and real data.
+
+3. Applications of Variational Bayes
+
+3.1. Geometric Foundation
+
+We consider the simplest model that shows dependence. Let X1, X2 be two binary random
+variables, with distribution π := (π00, π10, π01, π11), where P(X1 = i, X2 = j) = πij, i, j ∈ {0, 1}.
+Further, let the marginal distributions be denoted by π1 = P(X1 = 1), π2 = P(X2 = 1). We want to
+consider the geometry of the VB projection from a general distribution to the family of independent
+distributions. This represents the way that VB gains its computational advantages by simplifying the
+posterior dependence structure.
+The model space is illustrated in Figure 2, where π is represented by a point in the three simplex,
+and the independence surface, where π00π11 = π10π01, is also shown.
+
+[Figure 2: three-dimensional plot with axes x, y and z, each running from 0 to 1.]
+
+Figure 2. Space of distributions with independence surface: marginal probability and dependence.
+
+Both the interior of the simplex and independence surface are exponential families, and it is
+convenient to use the natural parameters for the interior of the simplex:
+
+\[ \xi_1 = \log \frac{\pi_{10}}{\pi_{00}}, \quad \xi_2 = \log \frac{\pi_{01}}{\pi_{00}}, \quad \xi_3 = \log \frac{\pi_{11}\pi_{00}}{\pi_{10}\pi_{01}}, \]
+
+where the independence surface is given by ξ3 = 0. The independence surface can also be
+parameterised by the marginal distributions π1, π2 or the corresponding natural parameters
+$\xi_i^{\mathrm{ind}} := \log(\pi_i/(1 - \pi_i))$. Any distribution π, represented in natural parameters by (ξ1, ξ2, ξ3), has its VB
+approximation defined implicitly by the simultaneous equations:
+
+\begin{align}
+\xi_1^{\mathrm{ind}}(\pi_1) &= \xi_1 + \xi_3 \pi_2, \tag{4} \\
+\xi_2^{\mathrm{ind}}(\pi_2) &= \xi_2 + \xi_3 \pi_1. \tag{5}
+\end{align}
+
+These can be solved, as is typical with VB methods, by iterating updated estimates of π1 and π2
+across the two equations. We show this in a realistic example in the following section.
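
A minimal sketch of this alternating iteration for Equations (4) and (5) is given below. The function name, starting values, tolerance and the example natural parameters are assumptions made for illustration; the paper does not prescribe them. Each update inverts the logit on the left-hand side of one equation while holding the other marginal fixed.

```python
import math


def sigmoid(x):
    """Inverse of the logit map pi -> log(pi / (1 - pi))."""
    return 1.0 / (1.0 + math.exp(-x))


def vb_independent(xi1, xi2, xi3, tol=1e-12, max_iter=1000):
    """Alternate between Equations (4) and (5) until the marginals stop changing.

    Returns (pi1, pi2), the marginals of the independent approximation.
    The starting point, tol and max_iter are illustrative choices.
    """
    pi1, pi2 = 0.5, 0.5
    for _ in range(max_iter):
        new_pi1 = sigmoid(xi1 + xi3 * pi2)        # Equation (4)
        new_pi2 = sigmoid(xi2 + xi3 * new_pi1)    # Equation (5)
        converged = abs(new_pi1 - pi1) + abs(new_pi2 - pi2) < tol
        pi1, pi2 = new_pi1, new_pi2
        if converged:
            break
    return pi1, pi2


# Hypothetical natural parameters with moderate dependence (xi3 != 0).
pi1, pi2 = vb_independent(xi1=-0.4, xi2=0.7, xi3=0.8)
print(pi1, pi2)
```

When ξ3 = 0, the target already lies on the independence surface and the iteration returns its own marginals after one pass.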
+Having seen the VB solution in this simple model, we can investigate the quality of the
+approximation. If we were using the forward KL projection, as proposed by [2], then the mean will
+be preserved by the approximation, while, of course, the variance structure is distorted. In the case of
+using the reverse KL projection, as used by VB, the mean will not be preserved, but in this example,
+we can investigate the distortion explicitly. Let (ξ1(α), ξ2(α), ξ3(α)) be a +1-geodesic, which cuts
+the independence surface orthogonally and is parameterised by α, where α = 0 corresponds to the
+independence surface. In this example, all such geodesics can be computed explicitly. Figure 3 shows
+the distortion associated with the VB approximation. In the left-hand panel, we show the mean, which
+is the marginal probability, P(X1 = 1), for all points on the orthogonal geodesic. We see, as expected,
+that this is not constant, but it is locally constant at α = 0, showing that the distortion of the mean can
+be small near the independence surface. The right-hand panel shows the dependence, as measured
+by the log-odds, for points on the geodesic. As expected, the VB does not preserve the dependence
+structure; indeed, it is designed to exploit the simplification of the dependence structure.
+
+Figure 3. Distortion implied by variational Bayes (VB) approximation.
+
+3.2. Variational Bayes for the RSLN Model
+
+The regime-switching log-normal model [19] with a fixed finite number, K, of regimes can be
+described as a bivariate discrete time process with the observed data sequence $w_{1:T} = \{w_t\}_{t=1}^{T}$ and
+the unobserved regime sequence $S_{1:T} = \{S_t\}_{t=1}^{T}$, where $S_t \in \{1, \dots, K\}$ and T is the number of
+observations. The logarithm of $w_t$, denoted by $y_t = \log w_t$, is assumed normally distributed, having
+mean $\mu_i$ and variance $\sigma_i^2$, both dependent on the hidden regime $S_t$. The sequence $S_{1:T}$ is assumed
+to follow a first order Markov chain having transition probabilities $A = (a_{ij})$, with the probabilities
+$\pi = (\pi_i)_{i=1}^{K}$ to start the first regime.
+The RSLN model is a special case of more general state-space models, which were studied in
+detail by [28]. In this paper, we use this model and simulated and real data to illustrate the VB method
+in practice. We also calibrate its performance by referring to [24], which used MCMC methods to fit the
+same model to the same data. Here, we are regarding the MCMC analysis as a form of “gold standard”,
+but with the cost of being orders-of-magnitude slower than VB in computational time.
+In the Bayesian framework, we use a symmetric Dirichlet prior for π, that is,
+$p(\pi) = \mathrm{Dir}(\pi; C^{\pi}/K, \dots, C^{\pi}/K)$, for $C^{\pi} > 0$. Let $a_i$ denote the i-th row vector of A. The prior for A is
+chosen as
+
+\[ p(A) = \prod_{i=1}^{K} p(a_i) = \prod_{i=1}^{K} \mathrm{Dir}(a_i; C^{A}/K, \dots, C^{A}/K), \quad C^{A} > 0, \]
+
+and the prior distribution for $\{(\mu_i, \sigma_i^2)\}_{i=1}^{K}$ is chosen to be normal-inverse gamma,
+
+\[ p(\{\mu_i, \sigma_i^2\}_{i=1}^{K}) = \prod_{i=1}^{K} N\!\left(\mu_i \mid \sigma_i^2; \gamma, \frac{\sigma_i^2}{\eta^2}\right) \mathrm{IG}(\sigma_i^2; \alpha, \beta). \]
+
+In the above setting, $C^{\pi}$, $C^{A}$, γ, η², α and β are hyper-parameters. Thus, the joint posterior distribution
+of π, A, $\{\mu_i, \sigma_i^2\}_{i=1}^{K}$ and $S_{1:T}$ is $P(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T} \mid y_{1:T})$ and is proportional to:
+
+\[ p(S_1 \mid \pi) \prod_{t=1}^{T-1} p(S_{t+1} \mid S_t; A) \prod_{t=1}^{T} p(y_t \mid S_t; \{\mu_i, \sigma_i^2\}_{i=1}^{K}) \, p(\pi) \, p(A) \, p(\{\mu_i, \sigma_i^2\}_{i=1}^{K}). \tag{6} \]
+
+This posterior distribution and its corresponding marginal posterior distributions are analytically
+intractable. In VB, we seek an approximation of Equation (6), denoted by $q(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T})$,
+for which we want to balance two things: having the discrepancy between Equation (6) and q small,
+while keeping q tractable. In general, there are two ways to choose q. The first is to specify a particular
+distributional family for q, for example the multivariate normal distribution. The other is to choose
+q with a simpler dependency structure than that of Equation (6); for example, we choose q, which
+factorizes as:
+
+\[ q(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}) = q(\pi) \prod_{i=1}^{K} q(a_i) \prod_{i=1}^{K} q(\mu_i \mid \sigma_i^2) \, q(\sigma_i^2) \, q(S_{1:T}). \tag{7} \]
+
+The Kullback–Leibler (KL) divergence [29] can be used as the measure of dissimilarity between
+Equations (6) and (7). For succinctness, we denote $\tau = (\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T})$; thus, the KL divergence
+is defined as:
+
+\[ \mathrm{KL}(q(\tau) \,\|\, p(\tau|y)) = \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)} \, d\tau. \tag{8} \]
+
+Note that the evaluation of Equation (8) requires p(τ|y), which is unavailable. However, we note that:
+
+\[ \mathrm{KL}(q(\tau) \,\|\, p(\tau|y)) = \log p(y) - \int q(\tau) \log \frac{p(\tau, y)}{q(\tau)} \, d\tau. \]
+
+Given the factorization Equation (7), this can be written as:
+
+\begin{multline*}
+\mathrm{KL}(q(\tau) \,\|\, p(\tau|y)) = \log p(y) - \int \sum_{S_{1:T}} q(\pi) \, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2) \, q(\sigma_i^2) \, q(S_{1:T}) \\
+\times \log \frac{p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T})}{q(\pi) \, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2) \, q(\sigma_i^2) \, q(S_{1:T})} \, d\pi \, dA \, d\{\mu_i, \sigma_i^2\}_{i=1}^{K}.
+\end{multline*}
+
+Consider first the q(π) term. The right-hand side can be rearranged as:
+
+\[ \mathrm{KL}\!\left( q(\pi) \,\middle\|\, \frac{\exp\left\{ \int \sum_{S_{1:T}} q(S_{1:T}) \, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2) \, q(\sigma_i^2) \log p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T}) \, dA \, d\{\mu_i, \sigma_i^2\}_{i=1}^{K} \right\}}{Z_{\pi}} \right) + K_{\pi}, \tag{9} \]
+
+where:
+
+\begin{multline*}
+K_{\pi} = \int \sum_{S_{1:T}} q(S_{1:T}) \, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2) \, q(\sigma_i^2) \log \left[ q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2) \, q(\sigma_i^2) \, q(S_{1:T}) \right] dA \, d\{\mu_i, \sigma_i^2\}_{i=1}^{K} \\
+- \log Z_{\pi} + \log p(y),
+\end{multline*}
+
+and $Z_{\pi}$ is a normalizing term. The first term of Equation (9) is the only term that depends on q(π).
+Thus, the minimum value of KL(q(τ) || p(τ|y)) with respect to q(π) is achieved when this term equals zero. Hence, we
+obtain:
+
+\[ q(\pi) = \frac{\exp\left\{ \int \sum_{S_{1:T}} q(S_{1:T}) \, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2) \, q(\sigma_i^2) \log p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T}) \, dA \, d\{\mu_i, \sigma_i^2\}_{i=1}^{K} \right\}}{Z_{\pi}}. \tag{10} \]
+
+Given the joint distribution $p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T})$ in the form of Equation (6), the
+straightforward evaluation of Equation (10) results in:
+
+\[ q(\pi) \propto \prod_{i=1}^{K} \pi_i^{C^{\pi}/K + w_i^{s} - 1} = \mathrm{Dir}(\pi; w_1^{\pi}, \dots, w_K^{\pi}); \quad w_i^{\pi} = \frac{C^{\pi}}{K} + w_i^{s}, \quad w_i^{s} = E_{q(S_{1:T})}[S_{1,i}], \tag{11} \]
+
+where $S_{1,i} = 1$ if the process is in state i at time 1, and zero otherwise.
+
+Similarly, we can rearrange Equation (9) with respect to $\{q(a_i)\}_{i=1}^{K}$, $\{q(\mu_i|\sigma_i^2)\}_{i=1}^{K}$, $\{q(\sigma_i^2)\}_{i=1}^{K}$ and
+$q(S_{1:T})$, respectively, and using the same arguments, we obtain:
+
+\begin{align}
+q(A) &= \prod_{i=1}^{K} \mathrm{Dir}(a_i; w_{i1}^{A}, \dots, w_{iK}^{A}); \quad w_{ij}^{A} = \frac{C^{A}}{K} + v_{ij}^{s}, \tag{12} \\
+q(\mu_i|\sigma_i^2) &= N\!\left(\gamma_i', \frac{\sigma_i^2}{\kappa_i}\right), \quad \gamma_i' = \frac{\eta^2 \gamma + p_i^{s}}{\eta^2 + q_i^{s}}, \quad \kappa_i = \eta^2 + q_i^{s}, \tag{13} \\
+q(\sigma_i^2) &= \mathrm{IG}(\alpha_i', \beta_i'), \quad \alpha_i' = \alpha + \frac{q_i^{s}}{2}, \quad \beta_i' = \beta + \frac{r_i^{s}}{2} + \frac{\eta^2}{2}(\gamma_i' - \gamma)^2, \tag{14} \\
+q(S_{1:T}) &= \frac{1}{\tilde{Z}} \prod_{i=1}^{K} (\pi_i^{*})^{S_{1,i}} \prod_{t=1}^{T-1} \prod_{i=1}^{K} \prod_{j=1}^{K} (a_{ij}^{*})^{S_{t,i} S_{t+1,j}} \prod_{t=1}^{T} \prod_{i=1}^{K} (\theta_{i,t}^{*})^{S_{t,i}}, \tag{15}
+\end{align}
+
+where $S_{t,i} = 1$ if the process is in state i at time t, and zero otherwise, and with
+
+\[ \pi_i^{*} = e^{E_{q(\pi)}[\log \pi_i]}, \quad a_{ij}^{*} = e^{E_{q(A)}[\log a_{ij}]}, \quad \theta_{i,t}^{*} = e^{E_{q(\mu_i|\sigma_i^2) q(\sigma_i^2)}[\log \phi_i(y_t)]}, \]
+
+\[ v_{ij}^{s} = \sum_{t=1}^{T-1} E_{q(S_{1:T})}[S_{t,i} S_{t+1,j}], \quad p_i^{s} = \sum_{t=1}^{T} E_{q(S_{1:T})}[S_{t,i}] \, y_t, \]
+
+\[ q_i^{s} = \sum_{t=1}^{T} E_{q(S_{1:T})}[S_{t,i}], \quad r_i^{s} = \sum_{t=1}^{T} (\gamma_i' - y_t)^2 E_{q(S_{1:T})}[S_{t,i}]. \]
+
+Here, ψ is the digamma function, φ is the normal
+density function and the exact functional forms used in the updates are shown in Algorithm 1.
+
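+Algorithm 1 Variational Bayes algorithm for the regime-switching log-normal (RSLN) model.
+
+Initialize $w_i^{s(0)}$, $v_{ij}^{s(0)}$, $p_i^{s(0)}$, $q_i^{s(0)}$ and $r_i^{s(0)}$ at step 0
+while $w_i^{\pi(t-1)}$, $w_{ij}^{A(t-1)}$, $\gamma_i^{\prime(t-1)}$, $\alpha_i^{\prime(t-1)}$, $\beta_i^{\prime(t-1)}$, $\pi_i^{*(t-1)}$, $a_{ij}^{*(t-1)}$ and $\theta_{i,t}^{*(t-1)}$ do not converge do
+
+1. Compute $w_i^{\pi(t)}$, $w_{ij}^{A(t)}$, $\gamma_i^{\prime(t)}$, $\kappa_i^{(t)}$, $\alpha_i^{\prime(t)}$ and $\beta_i^{\prime(t)}$ at step t by:
+
+\[ w_i^{\pi(t)} = \frac{C^{\pi}}{K} + w_i^{s(t-1)}, \quad w_{ij}^{A(t)} = \frac{C^{A}}{K} + v_{ij}^{s(t-1)}, \quad \gamma_i^{\prime(t)} = \frac{\eta^2 \gamma + p_i^{s(t-1)}}{\eta^2 + q_i^{s(t-1)}}, \]
+
+\[ \kappa_i^{(t)} = \eta^2 + q_i^{s(t-1)}, \quad \alpha_i^{\prime(t)} = \alpha + \frac{q_i^{s(t-1)}}{2}, \quad \beta_i^{\prime(t)} = \beta + \frac{r_i^{s(t-1)}}{2} + \frac{\eta^2}{2}\left(\gamma_i^{\prime(t)} - \gamma\right)^2 \]
+
+2. Compute $\pi_i^{*(t)}$, $\theta_{i,t}^{*(t)}$ and $a_{ij}^{*(t)}$ at step t by:
+
+\[ \pi_i^{*(t)} = \exp\left\{ \psi\!\left(w_i^{\pi(t)}\right) - \psi\!\left(\sum_{k=1}^{K} w_k^{\pi(t)}\right) \right\}, \quad a_{ij}^{*(t)} = \exp\left\{ \psi\!\left(w_{ij}^{A(t)}\right) - \psi\!\left(\sum_{k=1}^{K} w_{ik}^{A(t)}\right) \right\}, \]
+
+\[ \theta_{i,t}^{*(t)} = \exp\left\{ -\frac{1}{2} \log 2\pi - \frac{1}{2}\left(\log \beta_i^{\prime(t)} - \psi\!\left(\alpha_i^{\prime(t)}\right)\right) - \frac{1}{2}\left[ \left(y_t - \gamma_i^{\prime(t)}\right)^2 \frac{\alpha_i^{\prime(t)}}{\beta_i^{\prime(t)}} + \frac{1}{\kappa_i^{(t)}} \right] \right\} \]
+
+3. Compute $w_i^{s(t)}$, $v_{ij}^{s(t)}$, $p_i^{s(t)}$, $q_i^{s(t)}$ and $r_i^{s(t)}$ at step t by (sums over the time index run to T for $p_i^{s}$, $q_i^{s}$ and $r_i^{s}$, as in the definitions following Equation (15)):
+
+\[ w_i^{s(t)} = E_{q^{(t)}(S_{1:T})}[S_{1,i}], \quad v_{ij}^{s(t)} = \sum_{t=1}^{T-1} E_{q^{(t)}(S_{1:T})}[S_{t,i} S_{t+1,j}], \quad p_i^{s(t)} = \sum_{t=1}^{T} E_{q^{(t)}(S_{1:T})}[S_{t,i}] \, y_t, \]
+
+\[ q_i^{s(t)} = \sum_{t=1}^{T} E_{q^{(t)}(S_{1:T})}[S_{t,i}], \quad r_i^{s(t)} = \sum_{t=1}^{T} \left(\gamma_i^{\prime(t)} - y_t\right)^2 E_{q^{(t)}(S_{1:T})}[S_{t,i}] \]
+
+t ⇐ t + 1
+end while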
+
+The VB method proceeds, as was shown with the simple Equations (4) and (5), by iteratively
+updating the variational parameters to solve a set of simultaneous equations. In this example, the
+update equations for the variables π, A, $\{\mu_i, \sigma_i^2\}_{i=1}^{K}$ and $S_{1:T}$ are given explicitly by Algorithm 1. For the
+initialisation, we choose symmetric values for most of the parameters and choose random values for
+others, as appropriate. For this example, this worked very satisfactorily, although we note that, for more
+general state space models, [28] states that finding good initial values can be non-trivial.
+
+3.3. Interpretation of Results
+
+First, all approximating distributions above turn out to lie in well-known parametric families.
+The only unknown quantities are the parameters of these distributions, which are often called the
+variational parameters.
+The evaluation of the parameters of q(π), q(A), $q(\mu_i|\sigma_i^2)$ and $q(\sigma_i^2)$ requires knowledge of $q(S_{1:T})$,
+and also, the evaluation of $\pi_i^{*}$, $a_{ij}^{*}$ and $\theta_{i,t}^{*}$ requires knowledge of q(π), q(A), $q(\mu_i|\sigma_i^2)$ and $q(\sigma_i^2)$. This
+structure leads to an iterative updating scheme, described in Algorithm 1.
+The main computational effort in Algorithm 1 is computing $E_{q(S_{1:T})}[S_{t,i}]$ and $E_{q(S_{1:T})}[S_{t,i} S_{t+1,j}]$,
+which have no simple tractable forms. We note that the distributional form of $q(S_{1:T})$ has a very
+similar structure to the conditional distribution $p(S_{1:T}|Y_{1:T}, \tau)$, for which the forward-backward
+algorithm [30] is commonly used to compute $E_{p(S_{1:T}|Y_{1:T},\tau)}[S_{t,i}|Y_{1:T}, \tau]$ and $E_{p(S_{1:T}|Y_{1:T},\tau)}[S_{t,i} S_{t+1,j}|Y_{1:T}, \tau]$.
+Therefore, we also use the forward-backward algorithm to compute $E_{q(S_{1:T})}[S_{t,i}]$ and $E_{q(S_{1:T})}[S_{t,i} S_{t+1,j}]$.
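
The forward-backward computation referred to above can be sketched as follows. This is a generic scaled forward-backward pass, not the authors' code: the function name and the list-based layout (`theta_star[t][i]` for the weight of regime i at time t, regimes indexed from 0) are assumptions. Because the pass rescales at every step, the unnormalized weights π*, a* and θ* from Equation (15) can be passed in directly.

```python
def forward_backward(pi_star, a_star, theta_star):
    """Return (gamma, xi) for q(S_{1:T}) proportional to the product in Equation (15).

    gamma[t][i] = E_q[S_{t,i}]              for t = 0..T-1
    xi[t][i][j] = E_q[S_{t,i} S_{t+1,j}]    for t = 0..T-2
    """
    T, K = len(theta_star), len(pi_star)

    # Forward pass, normalizing at each step to avoid numerical underflow.
    alpha, scale = [], []
    cur = [pi_star[i] * theta_star[0][i] for i in range(K)]
    for t in range(T):
        if t > 0:
            cur = [theta_star[t][j] * sum(alpha[t - 1][i] * a_star[i][j] for i in range(K))
                   for j in range(K)]
        c = sum(cur)
        scale.append(c)
        alpha.append([v / c for v in cur])

    # Backward pass, reusing the forward scaling constants.
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(K):
            beta[t][i] = sum(a_star[i][j] * theta_star[t + 1][j] * beta[t + 1][j]
                             for j in range(K)) / scale[t + 1]

    gamma = [[alpha[t][i] * beta[t][i] for i in range(K)] for t in range(T)]
    xi = [[[alpha[t][i] * a_star[i][j] * theta_star[t + 1][j] * beta[t + 1][j] / scale[t + 1]
            for j in range(K)] for i in range(K)] for t in range(T - 1)]
    return gamma, xi
```

Summing `gamma[t][i]` over t gives $q_i^{s}$, and summing `xi[t][i][j]` over t gives $v_{ij}^{s}$, which are the statistics needed in step 3 of Algorithm 1.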
+
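+The conditional distribution $q(\mu_i|\sigma_i^2)$ is $N\!\left(\mu_i|\sigma_i^2; \gamma_i', \sigma_i^2/\kappa_i\right)$; hence, the marginal distribution of $\mu_i$
+is the location-scale t distribution, denoted as $t_{2\alpha_i'}\!\left(\mu_i; \gamma_i', \kappa_i/(\beta_i'/\alpha_i')\right)$, where the density function of $t_{\nu}(x; \mu, \lambda)$
+is defined as
+
+\[ p(x|\nu, \mu, \lambda) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left[1 + \frac{\lambda (x - \mu)^2}{\nu}\right]^{-\frac{\nu+1}{2}}, \]
+
+for x, μ ∈ (−∞, +∞) and ν, λ > 0.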
+
+4. Numerical Studies
+
+4.1. Simulated Data
+
+In this section, we apply the VB solutions to four sets of simulated data, which are used in [24].
+Through these simulated studies, we will test the performance of VB on detecting the number of
+regimes and compare it with those of the BIC and the MCMC methods [24]. For this paper, we present
+only an initial study with a relatively small number of datasets. The results are highly promising,
+but more extensive studies are needed to draw comprehensive conclusions. Furthermore, see [28] for
+general results on VB in hidden state space models.
+To estimate the number of regimes, we construct a matrix, called the relative magnitude matrix
+(RMM), defined as $A' = (\hat{a}'_{ij})$, where $\hat{a}'_{ij} = w_{ij}^{A}/w_0^{A}$, $w_0^{A} = \sum_{i=1}^{K} \sum_{j=1}^{K} w_{ij}^{A}$ and $w_{ij}^{A}$ is the parameter of q(A).
+Our model selection procedure is to fit a VB with a large number of regimes and to examine the rows
+and columns in the RMM. If the values of the entries in the i-th row and the i-th column of A′ are
+all equal to $\frac{C^{A}/K}{T - 1 + C^{A} \times K}$, then we declare regime i nonexistent. This method is validated by the
+following observations. It can be shown that the term $v_{ij}^{s}$ in $w_{ij}^{A}$ is equal to the expected number of times
+the process leaves regime i and enters regime j. Therefore, for the i-th regime, values of zero for
+all of $v_{ji}^{s}$ and $v_{ij}^{s}$ with j = 1, · · · , K indicate that there is no transition process entering or leaving regime i.
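
The RMM rule can be sketched in a few lines. The function names, the relative tolerance used to decide that an entry "equals" the baseline, and the example counts are all hypothetical; the paper states the rule as exact equality and reports only rounded matrices. Regimes are indexed from 0 here.

```python
def relative_magnitude_matrix(w_A):
    """Divide every w^A_ij by the grand total w^A_0."""
    total = sum(sum(row) for row in w_A)
    return [[w / total for w in row] for row in w_A]


def active_regimes(w_A, C_A, T, rel_tol=1e-6):
    """Return the indices of regimes whose RMM row or column exceeds the baseline.

    baseline = (C_A / K) / (T - 1 + C_A * K) is the value taken by an entry with
    v^s_ij = 0. rel_tol is an assumed tolerance for floating-point comparison.
    """
    K = len(w_A)
    baseline = (C_A / K) / (T - 1 + C_A * K)
    rmm = relative_magnitude_matrix(w_A)
    active = []
    for i in range(K):
        row_and_col = rmm[i] + [rmm[j][i] for j in range(K)]
        if any(e > baseline * (1.0 + rel_tol) for e in row_and_col):
            active.append(i)
    return active


# Hypothetical expected transition counts v^s_ij for K = 4 and T = 671;
# only regimes 0 and 3 are ever visited, and the counts sum to T - 1.
C_A, K, T = 0.1, 4, 671
v = [[96.0, 0.0, 0.0, 21.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [20.0, 0.0, 0.0, 533.0]]
w_A = [[C_A / K + v[i][j] for j in range(K)] for i in range(K)]
print(active_regimes(w_A, C_A, T))
```

The baseline follows from the update for $w_{ij}^{A}$: the $v_{ij}^{s}$ sum to T − 1 over all pairs, so the grand total is $T - 1 + C^{A} \times K$.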
+Table 1 specifies the parameters for the four cases, and we generate 671 observations for each
+case (equal to the number of months from January 1956 to September 2011). The parameters used in
+Case 1 are identical to the maximum likelihood estimates for TSX monthly return data from 1956 to
+1999 [19]. Case 2 only has one regime present. Case 3 is similar to Case 1, but the two regimes have the
+same mean. Case 4 adds a third regime. For each case, we use MLE to fit a one-regime, two-regime,
+three-regime and four-regime RSLN model and report the corresponding BIC and log-likelihood scores.
+We then misspecify the number of regimes and run a four-regime VB algorithm.
+
+Table 1. Parameters of the simulated data. Transition probability matrices are written row by row, with rows separated by ";".
+
+Case  Regime 1 (μ1, σ1)  Regime 2 (μ2, σ2)  Regime 3 (μ3, σ3)  Transition probability
+1     (0.012, 0.035)     (−0.016, 0.078)    -                  [0.963 0.037; 0.210 0.790]
+2     (0.014, 0.050)     -                  -                  -
+3     (0.000, 0.035)     (0.000, 0.078)     -                  [0.963 0.037; 0.210 0.790]
+4     (0.012, 0.035)     (−0.016, 0.078)    (0.04, 0.01)       [0.953 0.037 0.01; 0.210 0.780 0.01; 0.80 0.190 0.01]
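
For illustration, data of the kind described in Table 1 could be generated along the following lines. This is a minimal sketch, not the simulation code used in the paper or in [24]: the function name, the seed and the uniform initial regime distribution are assumptions, and regimes are indexed from 0.

```python
import math
import random


def simulate_rsln(mu, sigma, A, pi0, T, seed=0):
    """Simulate T observations from a regime-switching log-normal model.

    mu[i], sigma[i]: mean and standard deviation of y_t = log w_t in regime i.
    A[i][j]: probability of moving from regime i to regime j.
    pi0: initial regime probabilities (an assumption; Table 1 does not give them).
    Returns (states, w): the hidden regime path and the simulated series w_t.
    """
    rng = random.Random(seed)
    K = len(mu)
    s = rng.choices(range(K), weights=pi0)[0]
    states, w = [], []
    for _ in range(T):
        states.append(s)
        y = rng.gauss(mu[s], sigma[s])      # log-return in the current regime
        w.append(math.exp(y))
        s = rng.choices(range(K), weights=A[s])[0]
    return states, w


# Case 1 of Table 1, with 671 observations.
states, w = simulate_rsln(
    mu=[0.012, -0.016],
    sigma=[0.035, 0.078],
    A=[[0.963, 0.037], [0.210, 0.790]],
    pi0=[0.5, 0.5],
    T=671,
)
print(len(w), sum(states) / len(states))
```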
+
+Table 2 shows the number of iterations that VB takes to converge in each case and the
+corresponding computational time (on a MacBook, 2 GHz processor). On average, VB converges after
+a hundred iterations and takes about one minute. On the same computer, a $10^4$-iteration reversible jump
+MCMC (RJMCMC) run takes about 10 h to finish. Using diagnostics, this seemed to be enough for
+convergence, while not being an “unfair” comparison in terms of time with VB. We can see that the
+computational efficiency will be a very attractive feature of the VB method. The results of the BIC with
+the log-likelihood (in parentheses), the relative magnitude matrices and the posterior probabilities
+for the models with the different number of regimes estimated by MCMC (cited from Hartman and
+Heaton [24]) are given in Table 3. In Case 1, the BIC favors the two-regime model. The posterior
+probability estimated by MCMC for the one-regime model is the largest, but there is still a large
+probability for the two regime model. Note that the prior specification for the number of regimes
+can affect these numbers and is always an issue with these forms of multidimensional MCMC. The
+relative magnitude matrix clearly shows that there are only two regimes whose $\hat{a}'_{ij}$ are not negligible.
+This implies VB removes excess transition and emission processes and discovers the exact number of
+hidden regimes. In Case 2 and Case 3, both VB and the BIC can select the correct number of regimes,
+and the posterior probability for the one-regime model estimated by MCMC is still the largest. In Case
+4, VB does not detect the third regime. The transition probability to this regime is only 0.01, and the
+means and standard deviations of Regime 1 make the rare data from Regime 3 easily merged within
+the data from Regime 1. From Table 3, it is clear that, for all of the cases, the log-likelihood always
+increases as the number of regimes increases.
+
+Table 2. Computational efficiency of VB.
+
+                        Case 1   Case 2   Case 3   Case 4
+Iterations to converge  62       182      132      94
+Computational time [s]  27.161   80.842   58.510   45.044
+
+Table 3. The estimated number of regimes by VB, BIC and MCMC.
+
+Case  No. of regimes  MLE: BIC (log-likelihood)  RJMCMC: posterior probability
+1     1               1,108.875 (1,115.384)      0.647
+1     2               1,158.227 (1,174.499)      0.214
+1     3               1,156.370 (1,182.405)      0.088
+1     4               1,153.150 (1,188.948)      <0.052
+2     1               1,045.448 (1,051.957)      0.864
+2     2               1,038.360 (1,054.632)      0.109
+2     3               1,030.733 (1,056.768)      0.020
+2     4               1,026.882 (1,062.680)      <0.006
+3     1               1,110.903 (1,117.411)      0.629
+3     2               1,139.214 (1,155.486)      0.221
+3     3               1,131.904 (1,157.719)      0.098
+3     4               1,121.921 (1,157.940)      <0.052
+4     1               1,044.819 (1,051.328)      0.641
+4     2               1,092.610 (1,108.881)      0.203
+4     3               1,087.435 (1,113.470)      0.094
+4     4               1,080.240 (1,116.038)      <0.06
+
+VB: relative magnitude matrix (four-regime fit, one 4 × 4 matrix per case).
+
+Case 1:
+0.14357  0.00004  0.00004  0.03153
+0.00004  0.00004  0.00004  0.00004
+0.00004  0.00004  0.00004  0.00004
+0.03018  0.00004  0.00004  0.79428
+
+Case 2:
+0.99944  0.00004  0.00004  0.00004
+0.00004  0.00004  0.00004  0.00004
+0.00004  0.00004  0.00004  0.00004
+0.00004  0.00004  0.00004  0.00004
+
+Case 3:
+0.11322  0.00004  0.00004  0.02647
+0.00004  0.00004  0.00004  0.00004
+0.00004  0.00004  0.00004  0.00004
+0.02659  0.00004  0.00004  0.83327
+
+Case 4:
+0.22643  0.00004  0.00004  0.05518
+0.00004  0.00004  0.00004  0.00004
+0.00004  0.00004  0.00004  0.00004
+0.05377  0.00004  0.00004  0.66417
+
+4.2. Real Data
+
+In this section, we apply the VB solution to the TSX monthly total return index in the period from
+January, 1956, to December, 1999 (528 observations in total and studied in [19,21]).
+A four-regime VB is implemented first. VB converges after 100 iterations in about 34.284 s (on a
+MacBook, 2 GHz processor). The relative magnitude matrix, given in Table 4, clearly shows that VB
+identifies two regimes. This matches both the BIC- and AIC-based results [19]. Based on these results,
+we then fit a two-regime VB, which converges after 83 iterations in about 14.241 s. Table 5 gives the
+marginal distributions for all of the parameters. Figure 4 presents the corresponding density functions,
+where we can see that all of the plots show a symmetric and bell-shaped pattern.
+
+Table 4. Estimations of the number of regimes for TSX data.
+
+Period                        Relative Magnitude Matrix (R. M. M.)
+January 1956–December 1999    [ 0.11496  0.00005  0.00005  0.02803 ]
+                              [ 0.00005  0.00005  0.00005  0.00005 ]
+                              [ 0.00005  0.00005  0.00005  0.00005 ]
+                              [ 0.02853  0.00005  0.00005  0.82791 ]
+
+Table 5. The marginal distributions of the parameters estimated by VB.
+
+Parameter  Distribution                     Mean               s.d.      Transition Probability
+μ_1        t_{454.61}(0.0123, 370778.19)    0.0123             0.00165   -
+σ_1^2      IG(227.30, 0.28)                 0.00122 (0.0349)   0.00008   -
+μ_2        t_{80.39}(−0.0161, 12987.55)     −0.0161            0.00889   -
+σ_2^2      IG(40.20, 0.24)                  0.00603 (0.0777)   0.00098   -
+p_{1,2}    Beta(15.21, 434.78)              0.0338             0.00851   [ 0.9662  0.0338 ]
+p_{2,1}    Beta(15.00, 61.21)               0.1969             0.04525   [ 0.1969  0.8031 ]
+
+
+
+Figure 4. The VB marginal distributions of the parameters. (a) μ_2 (left) and μ_1 (right);
+(b) σ_1^2 (left) and σ_2^2 (right); (c) p_{1,2} (left) and p_{2,1} (right).
+
+Table 6 (the upper part) gives the maximum likelihood estimates (cited from [19]), mean
+parameters computed by the MCMC method (cited from [21]) and mean parameters computed
+by VB. It clearly shows that the point estimates by VB are very close to those by MLE and MCMC.
+The numbers in parentheses in Table 6 are the standard deviations computed by the three methods,
+respectively. It is worth noting that all of the standard deviations estimated by VB are smaller than those from the
+MLE or MCMC methods. In fact, other researchers also report the underestimation of posterior
+variance in other VB applications, for example [31,32]. In the paper [1], we look at some diagnostic
+methods that can assess how well VB approximates the true posterior, particularly with regard to
+its covariance structure. The methods proposed also allow us to generate simple corrections when the
+approximation error is large.
+
+Table 6. Estimates and standard deviations by VB, MLE and MCMC.
+
+        μ_1                σ_1                p_{1,2}            μ_2                 σ_2                p_{2,1}
+VB      0.0123 (0.00165)   0.0349 (0.00008)   0.0338 (0.00851)   −0.0161 (0.00889)   0.0777 (0.00098)   0.1969 (0.04525)
+MLE     0.0123 (0.002)     0.0347 (0.001)     0.0371 (0.012)     −0.0157 (0.010)     0.0778 (0.009)     0.2101 (0.086)
+MCMC    0.0122 (0.002)     0.0351 (0.002)     0.0334 (0.012)     −0.0164 (0.010)     0.0804 (0.009)     0.2058 (0.065)
+
+5. Conclusions
+
+Variational Bayes can be thought of in terms of information geometry as a projection-based
+approximation technique; it provides a framework to approximate posteriors. We applied this method
+to the regime-switching log-normal model and provided solutions to account for both model uncertainty
+and parameter uncertainty. The numerical results show that our method can recover exactly the
+number of regimes and gives reasonable point estimates. The VB method is also demonstrated to be
+very computationally efficient.
+The application to the TSX monthly total return index data for the period from January 1956 to
+December 1999 confirms similar results in the literature on the number of regimes.
+
+Author Contributions
+
+The article was written by Hui Zhao under the guidance of Paul Marriott. All authors have read
+and approved the final manuscript.
+
+Conflicts of Interest
+
+The authors declare no conflict of interest.
+
+References
+
+1. Zhao, H.; Marriott, P. Diagnostics for variational Bayes approximations. 2013, arXiv:1309.5117.
+2. Amari, S.-I. Differential-Geometrical Methods in Statistics; Springer: New York, NY, USA, 1990.
+3. Eguchi, S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat. 1983, 11, 793–803.
+4. Kass, R.; Vos, P. Geometrical Foundations of Asymptotic Inference; Wiley: New York, NY, USA, 1997.
+5. Hinton, G.E.; van Camp, D. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th ACM Conference on Computational Learning Theory, Santa Cruz, CA, USA, 26–28 July 1993; ACM: New York, NY, USA, 1993.
+6. MacKay, D. Developments in Probabilistic Modelling with Neural Networks—Ensemble Learning. In Neural Networks: Artificial Intelligence and Industrial Applications; Springer: London, UK, 1995; pp. 191–198.
+7. Attias, H. Independent Factor Analysis. Neur. Comput. 1999, 11, 803–851.
+8. Lappalainen, H. Ensemble Learning For Independent Component Analysis. In Proceedings of the First International Workshop on Independent Component Analysis, Aussois, France, 11–15 January 1999; pp. 7–12.
+9. Beal, M.; Ghahramani, Z. The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures. Bayesian Stat. 2003, 7, 453–463.
+10. Winn, J. Variational Message Passing and its Applications. Ph.D. Thesis, Department of Physics, University of Cambridge, Cambridge, UK, 2003.
+11. Blei, D.M.; Ng, A.Y.; Jordan, M.I.; Lafferty, J. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
+12. Ghahramani, Z.; Beal, M.J. A Variational Inference for Bayesian Mixtures of Factor Analysers. Adv. Neur. Inf. Process. Syst. 2000, 12, 449–455.
+13. Haff, L.R. The Variational Form of Certain Bayes Estimators. Ann. Stat. 1991, 19, 1163–1190.
+14. Faes, C.; Ormerod, J.T.; Wand, M.P. Variational Bayesian Inference for Parametric and Nonparametric Regression With Missing Data. J. Am. Stat. Assoc. 2011, 106, 959–971.
+15. McGrory, C.; Titterington, D.; Reeves, R.; Pettitt, A.N. Variational Bayes for estimating the parameters of a hidden Potts model. Stat. Comput. 2009, 19, 329–340.
+16. Ormerod, J.T.; Wand, M.P. Gaussian Variational Approximate Inference for Generalized Linear Mixed Models. J. Comput. Graph. Stat. 2011, 21, 1–16.
+17. Hall, P.; Humphreys, K.; Titterington, D.M. On the Adequacy of Variational Lower Bound Functions for Likelihood-Based Inference in Markovian Models with Missing Values. J. R. Stat. Soc. Ser. B 2002, 64, 549–564.
+18. Wang, B.; Titterington, M. Convergence Properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Anal. 2006, 1, 625–650.
+19. Hardy, M.R. A Regime-Switching Model of Long-Term Stock Returns. N. Am. Actuar. J. 2001, 5, 41–53.
+20. Hamilton, J.D. A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle. Econometrica 1989, 57, 357–384.
+21. Hardy, M.R. Bayesian Risk Management for Equity-Linked Insurance. Scand. Actuar. J. 2002, 2002, 185–211.
+22. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
+23. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
+24. Hartman, B.M.; Heaton, M.J. Accounting for regime and parameter uncertainty in regime-switching models. Insur. Math. Econ. 2011, 49, 429–437.
+25. Green, P.J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 1995, 82, 711–732.
+26. Watanabe, S. Algebraic Geometry and Statistical Learning Theory; Cambridge University Press: Cambridge, UK, 2009.
+27. Brooks, S.P. Markov Chain Monte Carlo Method and Its Application. J. R. Stat. Soc. Ser. D 1998, 47, 69–100.
+28. Ghahramani, Z.; Hinton, G.E. Variational learning for switching state-space models. Neur. Comput. 1998, 12, 831–864.
+29. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
+30. Baum, L.E.; Petrie, T.; Soules, G.; Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 1970, 41, 164–171.
+31. Rue, H.; Martino, S.; Chopin, N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Stat. Soc. Ser. B 2009, 71, 319–392.
+32. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+On Clustering Histograms with k-Means by Using
+Mixed α-Divergences
+
+Frank Nielsen 1,2,*, Richard Nock 3 and Shun-ichi Amari 4
+
+1 Sony Computer Science Laboratories, Inc, Tokyo 141-0022, Japan
+2 École Polytechnique, 91128 Palaiseau Cedex, France
+3 NICTA and The Australian National University, Locked Bag 9013, Alexandria NSW 1435, Australia
+4 RIKEN Brain Science Institute, 2-1 Hirosawa Wako City, Saitama 351-0198, Japan; E-Mail: amari@brain.riken.jp
+* E-Mail: Frank.Nielsen@acm.org; Tel.: +81-3-5448-4380.
+
+Received: 15 May 2014; in revised form: 10 June 2014 / Accepted: 13 June 2014 /
+Published: 17 June 2014
+
+Abstract: Clustering sets of histograms has become popular thanks to the success of the generic
+method of bag-of-X used in text categorization and in visual categorization applications. In this paper,
+we investigate the use of a parametric family of distortion measures, called the α-divergences, for
+clustering histograms. Since it usually makes sense to deal with symmetric divergences in information
+retrieval systems, we symmetrize the α-divergences using the concept of mixed divergences. First,
+we present a novel extension of k-means clustering to mixed divergences. Second, we extend the
+k-means++ seeding to mixed α-divergences and report a guaranteed probabilistic bound. Finally, we
+describe a soft clustering technique for mixed α-divergences.
+
+Keywords: bag-of-X; α-divergence; Jeffreys divergence; centroid; k-means clustering; k-means seeding
+
+1. Introduction: Motivation and Background
+
+1.1. Clustering Histograms in the Bag-of-Word Modeling Paradigm
+
+A common task of information retrieval (IR) systems is to classify documents into categories.
+Given a training set of documents labeled with categories, one asks to classify new incoming documents.
+Text categorisation [1,2] proceeds by first defining a dictionary of words from a corpus. It then
+models each document by a word count yielding a word distribution histogram per document (see
+the University of California, Irvine, UCI, machine learning repository for such data-sets [3]). The
+importance of the words in the dictionary can be weighted by the term frequency-inverse document
+frequency [2] (tf-idf), which takes into account both the frequency of the words in a given document
+and the frequency of the words across all documents. Namely, the tf-idf weight for a given word in a
+given document is the product of the frequency of that word in the document and the logarithm of
+the ratio of the number of documents to the document frequency of the word [2]. Defining a
+proper distance between histograms allows one to:
+
+• Classify a new on-line document: We first calculate its word distribution histogram signature and
+then seek the labeled document with the most similar histogram to deduce its category tag.
+• Find the initial set of categories: we cluster all document histograms and assign a category
+per cluster.
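+
+The tf-idf weighting described before the list above can be sketched in a few lines of Python. This is only an
+illustration: the function name tf_idf, the toy corpus and the choice of normalizing the term frequency by the
+document length are assumptions made here and are not taken from [2].
+

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document, with weight = tf * log(N / df).

    Assumption: tf is the count of the word divided by the document length.
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain the word at least once.
    df = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({word: (count / total) * math.log(n_docs / df[word])
                        for word, count in counts.items()})
    return weights

# Hypothetical toy corpus of three tokenized documents.
docs = [["cat", "sat", "cat"], ["dog", "sat"], ["cat", "dog", "ran"]]
weights = tf_idf(docs)
```

+
+A word that occurs in every document receives a zero weight under this scheme, since log(N / N) = 0.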
+
+This text classification method based on the representation of the bag-of-words (BoWs) has also
+been instrumental in computer vision for efficient object categorization [4] and recognition in natural
+images [5]. This paradigm is called bag-of-features [6] (BoFs) in the general case. It first requires one
+to create a dictionary of “visual words” by quantizing keypoints (e.g., affine invariant descriptors of
+image patches) of the training database. Quantization is performed using the k-means [7–9] algorithm
+
+Entropy 2014, 16, 3273–3301; doi:10.3390/e16063273
+www.mdpi.com/journal/entropy
+
+that partitions n data X = {x1, ..., xn} into k pairwise disjoint clusters C1, ..., Ck, where each data
+element belongs to the closest cluster center (i.e., the cluster prototype). From a given initialization,
+batched k-means first assigns data points to their closest centers and then updates the cluster centers
+and reiterates this process until convergence is met to a local minimum (not necessarily the global
+minimum) after a provably finite number of steps. Csurka et al. [4] used the squared Euclidean
+distance for building the visual vocabulary. Depending on the chosen features, other distances
+have proven useful. For example, the symmetrized Kullback–Leibler (KL) divergence was shown to
+perform experimentally better than the Euclidean or squared Euclidean distances for a compressed
+histogram of gradient descriptors [10] (CHoGs), even if it is not a metric distance, since it fails to
+satisfy the triangle inequality. To summarize, k-means histogram clustering with respect to the
+symmetrized KL (called Jeffreys divergence J) can be used to quantize both visual words and document
+categories. Nowadays, the seminal bag-of-word method has been generalized fruitfully to various
+settings using the generic bag-of-X paradigm, like the bag-of-textons [6], the bag-of-readers [11], etc.
+Bag-of-X represents each data element (e.g., document, image, etc.) as a histogram of codeword count indices.
+Furthermore, the semantic space [12] paradigm has been recently explored to overcome two drawbacks
+of the bag-of-X paradigms: the high-dimensionality of the histograms (number of bins) and difficult
+human interpretation of the codewords due to the lack of semantic information. In semantic space,
+modeling relies on semantic multinomials that are discrete frequency histograms; see [12].
+In summary, clustering histograms with respect to symmetric distances (like the symmetrized KL
+divergence) is playing an increasing role. It turns out that the symmetrized KL divergence belongs to a
+1-parameter family of divergences, called symmetrized α-divergences, or Jeffreys α-divergence [13].
+
+1.2. Contributions
+
+Since divergences D(p : q) are usually asymmetric distortion measures between two objects
+p and q, one often has to consider two kinds of centroids obtained by carrying out the minimization
+process either on the left argument or on the right argument of the divergences; see [14]. In theory,
+it is enough to consider only one type of centroid, say the right centroid, since the left centroid with
+respect to a divergence D(p : q) is equivalent to the right centroid with respect to the mirror divergence
+D′(p : q) = D(q : p).
+In this paper, we consider mixed divergences [15] that allow one to handle in a unified way the
+arithmetic symmetrization S(p, q) = (1/2)(D(p : q) + D(q : p)) of a given divergence D(p : q) with both
+the sided divergences: D(p : q) and its mirror divergence D′(p : q). The mixed α-divergence is the
+mixed divergence obtained for the α-divergence. We term α-clustering the clustering with respect
+to α-divergences and mixed α-clustering the clustering w.r.t. mixed α-divergences [16]. Our main
+contributions are to extend the celebrated batched k-means [7–9] algorithm to mixed divergences
+by associating two dual centroids per cluster and to generalize the probabilistically guaranteed
+good seeding of k-means++ [17] to mixed α-divergences. The mixed α-seedings provide guaranteed
+probabilistic clustering bounds by picking up seeds from the data and do not require explicitly
+computing centroids. This yields a clustering technique that is fast in practice, even when cluster
+centers are not available in closed form. We also consider clustering histograms by explicitly building
+the symmetrized α-centroids and end up with a variational k-means when the centroids are not
+available in closed form. Finally, we investigate soft mixed α-clustering and discuss topics related to
+α-clustering. Note that clustering with respect to non-symmetrized α-divergences has been recently
+investigated independently in [18] and proven useful in several applications.
+
+1.3. Outline of the Paper
+
+The paper is organized as follows: Section 2 introduces the notion of mixed divergences, presents
+an extension of k-means to mixed divergences and recalls some properties of α-divergences. Section 3
+describes the α-seeding techniques and reports a probabilistically-guaranteed bound on the clustering
+quality. Section 4 investigates the various sided/symmetrized/mixed calculations of the α-centroids.
+
+Section 5 presents the soft α-clustering with respect to α-mixed divergences. Finally, Section 6
+summarises the contributions, discusses related topics and hints at further perspectives. The paper is
+followed by two appendices. Appendix B studies several properties of α-divergences that are used to
+derive the guaranteed probabilistic performance of the α-seeding. Appendix C proves that α-sided
+centroids are quasi-arithmetic means for the power generator functions.
+
+2. Mixed Centroid-Based k-Means Clustering
+
+2.1. Divergences, Centroids and k-Means
+
+Consider a set H of n histograms h_1, ..., h_n, each with d bins, with all positive real-valued bins:
+h_j^i > 0, ∀ 1 ≤ i ≤ d, 1 ≤ j ≤ n. A histogram h is called a frequency histogram when its bins sum
+to one: w(h) = w_h = ∑_i h^i = 1. Otherwise, it is called a positive histogram that can eventually be
+normalized to a frequency histogram:
+
+    \tilde{h} \doteq \frac{h}{w(h)}.    (1)
+
+The frequency histograms belong to the (d−1)-dimensional open probability simplex Δ_d:
+
+    \Delta_d \doteq \left\{ (x^1, ..., x^d) \in \mathbb{R}^d \;\middle|\; \forall i, \; x^i > 0, \text{ and } \sum_{i=1}^d x^i = 1 \right\}.    (2)
+
+That is, although frequency histograms have d bins, the constraint that those bin values should
+sum up to one yields d−1 degrees of freedom. In probability theory, frequency and count
+histograms model either discrete multinomial probabilities or discrete positive measures (also called
+positive arrays [19]).
+The celebrated k-means clustering [8,9] is one of the most famous clustering techniques that has
+been generalized in many ways [20,21]. In information geometry [22], a divergence D(p : q) is a
+smooth C^3 differentiable dissimilarity measure that is not necessarily symmetric (D(p : q) ≠ D(q : p),
+hence the notation “:” instead of the classical “,” reserved for metric distances), but is non-negative and
+satisfies the separability property: D(p : q) = 0 iff p = q. More precisely, let ∂_i D(x : y) = ∂/∂x^i D(x : y) and
+∂_{,i} D(x : y) = ∂/∂y^i D(x : y). Then, we require ∂_i D(x : x) = ∂_{,i} D(x : x) = 0 and −∂_i ∂_{,j} D(x : y) positive
+definite for defining a divergence. For a distance function D(· : ·), we denote by D(x : H) the weighted
+average distance of x to a set of weighted histograms:
+
+    D(x : H) \doteq \sum_{j=1}^n w_j D(x : h_j).    (3)
+
+An important class of divergences on frequency histograms is the f-divergences [23–25] defined for a
+convex generator f (with f(1) = f′(1) = 0 and f′′(1) = 1):
+
+    I_f(p : q) \doteq \sum_{i=1}^d q^i f\left( \frac{p^i}{q^i} \right).
+Those divergences preserve information monotonicity [19] under any arbitrary transition probability
+(Markov morphisms). f-divergences can be extended to positive arrays [19].
+The k-means algorithm on a set of weighted histograms can be tailored to any divergence as
+follows: First, we initialize the k cluster centers C = {c1, ..., ck} (say, by picking up randomly arbitrary
+distinct seeds). Then, we iteratively repeat until convergence the following two steps:
+
+• Assignment: Assign each histogram h_j to its closest cluster center:
+
+    l(h_j) \doteq \arg\min_{l=1}^{k} D(h_j : c_l).
+
+This yields a partition of the histogram set H = ∪_{l=1}^k A_l, where A_l denotes the set of histograms
+of the l-th cluster: A_l = {h_j | l(h_j) = l}.
+• Center relocation: Update the cluster centers by taking their centroids:
+
+    c_l \doteq \arg\min_x \sum_{h_j \in A_l} w_j D(h_j : x).
+
+Throughout this paper, centroid shall be understood in the broader sense of a barycenter when
+weights are non-uniform.
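+
+The two steps above can be sketched as follows for one concrete choice of divergence. This is a minimal
+illustration under stated assumptions, not the implementation used in the paper: it takes D(h : c) to be the
+extended Kullback–Leibler divergence KL(h : c), for which the minimizer over the right argument is the weighted
+arithmetic mean, and the names extended_kl, assign and kl_kmeans as well as the synthetic data are hypothetical.
+

```python
import numpy as np

def extended_kl(p, q):
    """Extended Kullback-Leibler divergence KL(p : q), summed over the last axis."""
    return np.sum(p * np.log(p / q) + q - p, axis=-1)

def assign(H, centers):
    """Assignment step: index of the closest center for every histogram."""
    dist = extended_kl(H[:, None, :], centers[None, :, :])  # shape (n, k)
    return dist.argmin(axis=1)

def kl_kmeans(H, w, centers, n_iter=100):
    """Lloyd-style k-means for D(h : c) = KL(h : c), from the given initial centers."""
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = assign(H, centers)
        new_centers = centers.copy()
        for l in range(len(centers)):
            mask = labels == l
            if mask.any():  # an empty cluster keeps its previous center
                # For KL(h : x), the minimizer over x is the weighted arithmetic mean.
                new_centers[l] = np.average(H[mask], axis=0, weights=w[mask])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(H, centers), centers

# Hypothetical data: two well-separated groups of positive histograms with 3 bins.
rng = np.random.default_rng(0)
A = rng.uniform(0.9, 1.1, size=(20, 3)) * np.array([8.0, 1.0, 1.0])
B = rng.uniform(0.9, 1.1, size=(20, 3)) * np.array([1.0, 1.0, 8.0])
H = np.vstack([A, B])
w = np.ones(len(H))
labels, centers = kl_kmeans(H, w, centers=H[[0, -1]])
```

+
+For other divergences, only extended_kl and the relocation line would change; when no closed-form centroid is
+known, the relocation step would need a numerical minimizer instead.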
+
+2.2. Mixed Divergences and Mixed k-Means Clustering
+
+Since divergences are potentially asymmetric, we can define two-sided k-means or always consider
+a right-sided k-means, but then define another sided divergence D′(p : q) = D(q : p). We can
+also consider the symmetrized k-means with respect to the symmetrized divergence: S(p, q) =
+D(p : q) + D(q : p). Eventually, we may skew the symmetrization with a parameter λ ∈ [0, 1]:
+Sλ(p, q) = λD(p : q) + (1 − λ)D(q : p) (and consider other averaging schemes instead of the arithmetic
+mean).
+In order to handle those sided and symmetrized k-means under the same framework, let us
+introduce the notion of mixed divergences [15] as follows:
+
+Definition 1 (Mixed divergence).
+
+    M_\lambda(p : q : r) \doteq \lambda D(p : q) + (1 - \lambda) D(q : r),    (4)
+
+for λ ∈ [0, 1].
+
+A mixed divergence includes the sided divergences for λ ∈ {0, 1} and the symmetrized (arithmetic
+mean) divergence for λ = 1/2.
+We generalize k-means clustering to mixed k-means clustering [15] by considering two centers
+per cluster (for the special cases of λ = 0, 1, it is enough to consider only one). Algorithm 1 sketches
+the generic mixed k-means algorithm. Note that a simple initialization consists of choosing randomly
+the k distinct seeds from the dataset with li = ri.
+
+Algorithm 1: Mixed divergence-based k-means clustering.
+
+Input: Weighted histogram set H, divergence D(· : ·), integer k > 0, real λ ∈ [0, 1];
+Initialize left-sided/right-sided seeds C = {(l_i, r_i)}_{i=1}^k;
+repeat
+    // Assignment
+    for i = 1, 2, ..., k do
+        C_i ← {h ∈ H : i = arg min_j M_λ(l_j : h : r_j)};
+    // Dual-sided centroid relocation
+    for i = 1, 2, ..., k do
+        r_i ← arg min_x D(C_i : x), where D(C_i : x) = ∑_{h_j ∈ C_i} w_j D(h_j : x);
+        l_i ← arg min_x D(x : C_i), where D(x : C_i) = ∑_{h_j ∈ C_i} w_j D(x : h_j);
+until convergence;
+Output: Partition of H into k clusters following C;
+
+Notice that the mixed k-means clustering is different from the k-means clustering with respect to
+the symmetrized divergences Sλ that considers only one centroid per cluster.
+
+2.3. Sided, Symmetrized and Mixed α-Divergences
+
+For α ≠ ±1, we define the family of α-divergences [26] on positive arrays [27] as:
+
+    D_\alpha(p : q) \doteq \sum_{i=1}^d \frac{4}{1 - \alpha^2} \left( \frac{1 - \alpha}{2} p^i + \frac{1 + \alpha}{2} q^i - (p^i)^{\frac{1 - \alpha}{2}} (q^i)^{\frac{1 + \alpha}{2}} \right)
+                    = D_{-\alpha}(q : p), \quad \alpha \in \mathbb{R} \setminus \{-1, 1\},    (5)
+
+
+with the limit cases D_{−1}(p : q) = KL(p : q) and D_1(p : q) = KL(q : p), where KL is the extended
+Kullback–Leibler divergence:
+
+    \mathrm{KL}(p : q) \doteq \sum_{i=1}^d \left( p^i \log \frac{p^i}{q^i} + q^i - p^i \right).    (6)
+
+Divergence D_0 is the squared Hellinger symmetric distance (scaled by a multiplicative factor of
+four) extended to positive arrays:
+
+    D_0(p : q) = 2 \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 \mathrm{d}x = 4 H^2(p, q),    (7)
+
+with the Hellinger distance:
+
+    H(p, q) = \sqrt{ \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 \mathrm{d}x }.    (8)
+
+Note that α-divergences are defined for the full range of α values: α ∈ R. Observe that the
+α-divergences of Equation (5) are homogeneous of degree one: D_α(λp : λq) = λ D_α(p : q) for λ > 0.
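+
+The properties stated so far can be checked numerically. The sketch below is an illustration only: it evaluates
+Equation (5) for discrete positive arrays (so the integral in Equation (7) becomes a sum over bins), and the
+function names alpha_divergence and extended_kl and the example arrays p and q are assumptions made here.
+

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """alpha-divergence of Equation (5) between positive arrays, for alpha != +-1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return 4.0 / (1.0 - alpha**2) * np.sum(a * p + b * q - p**a * q**b)

def extended_kl(p, q):
    """Extended Kullback-Leibler divergence of Equation (6)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q) + q - p)

# Hypothetical positive (unnormalized) histograms with three bins.
p = np.array([0.2, 0.5, 1.3])
q = np.array([0.4, 0.4, 0.9])

duality_gap = alpha_divergence(p, q, 0.3) - alpha_divergence(q, p, -0.3)
scaling_gap = alpha_divergence(3 * p, 3 * q, 0.3) - 3 * alpha_divergence(p, q, 0.3)
hellinger_gap = alpha_divergence(p, q, 0.0) - 2 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
# Approaching alpha = -1 should recover KL(p : q) up to a small approximation error.
kl_limit_gap = alpha_divergence(p, q, -1 + 1e-6) - extended_kl(p, q)
```

+
+All four gaps should be close to zero: the first three up to floating-point round-off, the last one up to the
+distance of α from the limit value −1.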
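+When histograms p and q are both frequency histograms, we have:
+
+    D_\alpha(\tilde{p} : \tilde{q}) = \frac{4}{1 - \alpha^2} \left( 1 - \sum_{i=1}^d (\tilde{p}^i)^{\frac{1 - \alpha}{2}} (\tilde{q}^i)^{\frac{1 + \alpha}{2}} \right)
+                                    = D_{-\alpha}(\tilde{q} : \tilde{p}), \quad \alpha \in \mathbb{R} \setminus \{-1, 1\},    (9)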
+
+and the extended Kullback–Leibler divergence reduces to the traditional Kullback–Leibler
+divergence: KL(˜p : ˜q) = ∑_{i=1}^d ˜p^i log(˜p^i / ˜q^i).
+The Kullback–Leibler divergence between frequency histograms ˜p and ˜q (α = ±1) is interpreted
+as the cross-entropy minus the Shannon entropy:
+
+    \mathrm{KL}(\tilde{p} : \tilde{q}) \doteq H^{\times}(\tilde{p} : \tilde{q}) - H(\tilde{p}).
+
+Often, ˜p denotes the true model (hidden by nature), and ˜q is the estimated model from observations.
+However, in information retrieval, both ˜p and ˜q play the same symmetrical role, and we prefer to deal
+with a symmetric divergence.
+The Pearson and Neyman χ² distances are obtained for α = −3 and α = 3, respectively:
+
+    D_3(\tilde{p} : \tilde{q}) = \frac{1}{2} \sum_i \frac{(\tilde{q}^i - \tilde{p}^i)^2}{\tilde{p}^i},    (10)
+
+    D_{-3}(\tilde{p} : \tilde{q}) = \frac{1}{2} \sum_i \frac{(\tilde{q}^i - \tilde{p}^i)^2}{\tilde{q}^i}.    (11)
+
+The α-divergences belong to the class of Csiszár f-divergences with the following generator:
+
+    f(t) = \begin{cases}
+        \frac{4}{1 - \alpha^2} \left( 1 - t^{(1 + \alpha)/2} \right), & \text{if } \alpha \neq \pm 1, \\
+        t \ln t, & \text{if } \alpha = 1, \\
+        -\ln t, & \text{if } \alpha = -1.
+    \end{cases}    (12)
+
+Remark 1. Historically, the α-divergences were introduced by Chernoff [28,29] in the context of hypothesis
+testing. In Bayesian binary hypothesis testing, we are asked to decide whether an observation belongs to one
+class or the other class, based on priors w_1 and w_2 and class-conditional probabilities p_1 and p_2. The average
+expected error of the best decision rule, the maximum a posteriori (MAP) rule, is called the probability of error,
+denoted by P_e. When the prior probabilities are identical (w_1 = w_2 = 1/2), we have
+P_e(p_1, p_2) = (1/2) ∫ min(p_1(x), p_2(x)) dx.
+Let S(p, q) = ∫ min(p(x), q(x)) dx denote the intersection similarity measure, with 0 < S ≤ 1 (generalizing
+the histogram intersection distance often used in computer vision [30]). S is bounded by the α-Chernoff affinity
+coefficient:
+
+    S(p, q) \le C_\beta(p, q) = \int p^{\beta}(x) q^{1 - \beta}(x) \mathrm{d}x,
+
+for all β ∈ [0, 1]. We can convert the affinity coefficient 0 < C_β ≤ 1 into a divergence D_β by simply taking
+D_β = 1 − C_β. Since the absolute value of divergences does not matter, we can rescale the
+divergence appropriately. One nice rescaling multiplies by 1/(β(1−β)): D_β = (1 − C_β)/(β(1−β)). This makes the
+parameterized divergence coincide with the fundamental Kullback–Leibler divergence for the limit values β ∈ {0, 1}.
+Last, choosing β = (1−α)/2 yields the well-known expression of the α-divergences.
+
+Interestingly, the α-divergences can be interpreted as a generalized α-Kullback–Leibler
+divergence [26] with deformed logarithms.
+Next, we introduce the mixed α-divergence of a histogram x to two histograms p and q as follows:
+
+Definition 2 (Mixed α-divergence). The mixed α-divergence of a histogram x to two histograms p and q is
+defined by:
+
+    M_{\lambda,\alpha}(p : x : q) = \lambda D_\alpha(p : x) + (1 - \lambda) D_\alpha(x : q)
+                                  = \lambda D_{-\alpha}(x : p) + (1 - \lambda) D_{-\alpha}(q : x)
+                                  = M_{1 - \lambda, -\alpha}(q : x : p).    (13)
+
+The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:
+
+    S_\alpha(p, q) = M_{\frac{1}{2},\alpha}(q : p : q) = M_{\frac{1}{2},\alpha}(p : q : p).
+
+The skew symmetrized α-divergence is defined by:
+
+    S_{\lambda,\alpha}(p : q) = \lambda D_\alpha(p : q) + (1 - \lambda) D_\alpha(q : p).
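+
+The identities of Definition 2 can be verified numerically. The following sketch is illustrative only; the
+helper names alpha_divergence and mixed_alpha_divergence and the example arrays are assumptions, and the code
+covers only α ≠ ±1.
+

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) of Equation (5), for alpha != +-1."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return 4.0 / (1.0 - alpha**2) * np.sum(a * p + b * q - p**a * q**b)

def mixed_alpha_divergence(p, x, q, lam, alpha):
    """M_{lam,alpha}(p : x : q) = lam * D_alpha(p : x) + (1 - lam) * D_alpha(x : q)."""
    return lam * alpha_divergence(p, x, alpha) + (1.0 - lam) * alpha_divergence(x, q, alpha)

# Hypothetical positive histograms and parameters.
p = np.array([0.2, 0.5, 1.3])
x = np.array([0.6, 0.3, 1.0])
q = np.array([0.4, 0.4, 0.9])
lam, alpha = 0.3, 0.5

lhs = mixed_alpha_divergence(p, x, q, lam, alpha)
rhs = mixed_alpha_divergence(q, x, p, 1.0 - lam, -alpha)  # last line of Equation (13)
jeffreys = mixed_alpha_divergence(q, p, q, 0.5, alpha)    # alpha-Jeffreys divergence S_alpha(p, q)
symmetrized = 0.5 * (alpha_divergence(p, q, alpha) + alpha_divergence(q, p, alpha))
```

+
+Here lhs and rhs should agree, and jeffreys should equal the arithmetic symmetrization of D_α.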
+
+2.4. Notations and Hard/Soft Clusterings
+
+Throughout the paper, superscript index i denotes the histogram bin numbers and subscript
+index j the histogram numbers. Index l is used to iterate on the clusters. The left-sided, right-sided
+and symmetrized histogram positive and frequency α-centroids are denoted by lα, rα, sα and ˜lα, ˜rα, ˜sα,
+respectively.
+In this paper, we investigate the following kinds of clusterings for sets of histograms:
+
+Hard clustering. Each histogram belongs to exactly one cluster:
+
+• k-means with respect to mixed divergences Mλ,α.
+• k-means with respect to symmetrized divergences Sλ,α.
+
+• Randomized seeding for mixed/symmetrized k-means by extending k-means++ with
+guaranteed probabilistic bounds for α-divergences.
+
+Soft clustering. Each histogram belongs to all clusters according to some weight distribution:
+the soft mixed α-clustering.
+
+3. Coupled k-Means++ α-Seeding
+
+It is well-known that the Lloyd k-means clustering algorithm monotonically decreases the loss
+function and stops after a finite number of iterations at a local optimum. Optimizing globally
+the k-means loss is NP-hard [17] when d > 1 and k > 1. In practice, the performance of the
+k-means algorithm heavily relies on the initialization. A breakthrough was obtained by the k-means++
+seeding [17], which guarantees in expectation a good starting partition. We extend this scheme to
+the coupled α-clustering. However, although k-means++ has proven popular and is
+often used in practice with very good results, it has recently been pointed out that “worst case”
+configurations exist, even in small dimensions, on which the algorithm cannot significantly beat
+its expected approximability with high probability [31]. Still, the expected approximability ratio,
+roughly in O(log(k)), is very good, as long as the number of clusters is not too large.
+
+Algorithm 2: Mixed α-seeding; MAS(H, k, λ, α)
+
+Input: Weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ R;
+Let C ← {(h_j, h_j)}, with h_j picked from H with uniform probability;
+for i = 2, 3, ..., k do
+    Pick at random histogram h ∈ H with probability:
+
+        \pi_H(h) \doteq \frac{w_h M_{\lambda,\alpha}(c_h : h : c_h)}{\sum_{y \in H} w_y M_{\lambda,\alpha}(c_y : y : c_y)},    (14)
+
+    // where (c_h, c_h) \doteq \arg\min_{(z,z) \in C} M_{\lambda,\alpha}(z : h : z);
+    C ← C ∪ {(h, h)};
+Output: Set of initial cluster centers C;
+
+Algorithm 2 provides our adaptation of k-means++ seeding [15,17]. It works for all three of our
+sided/symmetrized and mixed clustering settings:
+
+• Pick λ = 1 for the left-sided centroid initialization,
+• Pick λ = 0 for the right-sided centroid initialization (a left-sided initialization for −α),
+• Pick an arbitrary λ for the λ-J_α (skew Jeffreys) centroids or mixed λ centroids. Indeed, the
+initialization is the same (see the MAS procedure in Algorithm 2).
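+
+The MAS procedure can be sketched as below. This is a hedged illustration rather than the authors' code: the
+function names, the clipping of tiny negative round-off values, the explicit zeroing of the cost of already
+chosen seeds and the random test data are all assumptions made here, and the sketch supposes α ≠ ±1 and at
+least k distinct histograms in H.
+

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) of Equation (5), summed over the last axis (alpha != +-1)."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return 4.0 / (1.0 - alpha**2) * np.sum(a * p + b * q - p**a * q**b, axis=-1)

def mixed_alpha_seeding(H, w, k, lam, alpha, rng):
    """Return the indices of k seeds drawn according to distribution (14)."""
    n = len(H)
    seeds = [int(rng.integers(n))]   # first seed: uniform over H
    cost = np.full(n, np.inf)        # cost[j] = min over seeds c of M(c : h_j : c)
    for _ in range(1, k):
        c = H[seeds[-1]]
        m = lam * alpha_divergence(c, H, alpha) + (1.0 - lam) * alpha_divergence(H, c, alpha)
        cost = np.minimum(cost, np.maximum(m, 0.0))  # clip tiny negative round-off
        cost[seeds] = 0.0            # a chosen histogram is at divergence zero from itself
        prob = w * cost
        seeds.append(int(rng.choice(n, p=prob / prob.sum())))
    return seeds

# Hypothetical data: 50 random positive histograms with 4 bins and unit weights.
rng = np.random.default_rng(0)
H = rng.uniform(0.5, 2.0, size=(50, 4))
w = np.ones(len(H))
seeds = mixed_alpha_seeding(H, w, k=5, lam=0.5, alpha=0.3, rng=rng)
```

+
+Each returned index i stands for the seed pair (h_i, h_i); no centroid is computed during seeding.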
+
+Our proof follows and generalizes the proof described for the case of mixed Bregman seeding [15]
+(Lemma 2). In fact, our proof is more precise, as it quantifies the expected potential with respect to the
+optimum only, whereas in [15], the optimal potential is averaged with a dual optimal potential, which
+depends on the optimal centers, but may be larger than the optimum sought.
+
+Theorem 1. Let $C_{\lambda,\alpha}$ denote for short the cost function related to the clustering type chosen (left-, right-, skew
+Jeffreys or mixed) in MAS and $C^{opt}_{\lambda,\alpha}$ denote the optimal related clustering in k clusters, for λ
+∈ [0, 1] and α ∈ (−1, 1). Then, on average, with respect to distribution (14), the initial clustering of
+MAS satisfies:
+
+$$E_\pi[C_{\lambda,\alpha}] \le 4 \begin{cases} f(\lambda)\, g(k)\, h^2(\alpha)\, C^{opt}_{\lambda,\alpha} & \text{if } \lambda \in (0, 1) \\ g(k)\, z(\alpha)\, h^4(\alpha)\, C^{opt}_{\lambda,\alpha} & \text{otherwise} \end{cases}. \qquad (15)$$
+
+Entropy 2014, 16, 3273–3301
+
+Here, $f(\lambda) = \max\left\{\frac{1-\lambda}{\lambda}, \frac{\lambda}{1-\lambda}\right\}$, $g(k) = 2(2 + \log k)$, $z(\alpha) = \left(\frac{1+|\alpha|}{1-|\alpha|}\right)^{\frac{8|\alpha|^2}{(1-|\alpha|)^2}}$, $h(\alpha) = \max_i p_i^{|\alpha|} / \min_i p_i^{|\alpha|}$;
+the min is defined on strictly positive coordinates, and π denotes the picking distribution of Algorithm 2.
+
+Remark 2. The bound is particularly good when λ is close to 1/2, and in particular for the α-Jeffreys clustering,
+as in these cases, the only additional penalty compared to the Euclidean case [17] is $h^2(\alpha)$, a penalty that relies
+on an optimal triangle inequality for α-divergences that we provide in Lemma A6 below.
+
+Remark 3. This guaranteed initialization is particularly useful for α-Jeffreys clustering, as there is no closed
+form solution for the centroids (except when α = ±1, see [32]).
+
+Algorithm 3: Mixed α-hard clustering: MAhC(H, k, λ, α)
+
+Input: Weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ R;
+Let $C = \{(l_i, r_i)\}_{i=1}^{k}$ ← MAS(H, k, λ, α);
+repeat
+    //Assignment
+    for i = 1, 2, ..., k do
+        $A_i \leftarrow \{h \in H : i = \arg\min_j M_{\lambda,\alpha}(l_j : h : r_j)\}$;
+    //Centroid relocation
+    for i = 1, 2, ..., k do
+        $r_i \leftarrow \left(\sum_{h \in A_i} w_i h^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}}$;
+        $l_i \leftarrow \left(\sum_{h \in A_i} w_i h^{\frac{1+\alpha}{2}}\right)^{\frac{2}{1+\alpha}}$;
+until convergence;
+Output: Partition of H in k clusters following C;
+
+Algorithm 3 presents the general hard mixed k-means clustering, which can be adapted also to
+left- (λ = 1) and right- (λ = 0) α-clustering.
+For skew Jeffreys centers, since the centroids are not available in closed form [32], we adopt a
+variational approach of k-means by updating iteratively the centroid in each cluster (thus improving
+the overall loss function without computing the optimal centroids that would eventually require
+infinitely many iterations).
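+
+The assignment and relocation steps of the hard clustering loop can be sketched as follows. This is a hedged illustration under stated assumptions: α ≠ ±1, histograms are lists of positive floats, the weights are renormalized inside each cluster, and an empty cluster simply keeps its previous centers. The names `alpha_mean` and `mixed_alpha_hard_clustering` are hypothetical.
+
```python
def alpha_div(p, q, alpha):
    """D_alpha(p : q) for positive arrays, assuming alpha != +-1."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    scale = 4.0 / (1.0 - alpha * alpha)
    return scale * sum(a * x + b * y - x ** a * y ** b for x, y in zip(p, q))


def mixed_div(l, h, r, lam, alpha):
    """M_{lam,alpha}(l : h : r) = lam * D(l : h) + (1 - lam) * D(h : r)."""
    return lam * alpha_div(l, h, alpha) + (1.0 - lam) * alpha_div(h, r, alpha)


def alpha_mean(members, weights, alpha):
    """Right-sided alpha-centroid: weighted power mean with exponent (1 - alpha) / 2."""
    e = (1.0 - alpha) / 2.0
    total = sum(weights)
    dim = len(members[0])
    return [(sum(wj * h[i] ** e for wj, h in zip(weights, members)) / total) ** (1.0 / e)
            for i in range(dim)]


def mixed_alpha_hard_clustering(H, w, centers, lam, alpha, max_iter=100):
    """Alternate assignment and centroid relocation until labels stop changing."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)),
                          key=lambda j: mixed_div(centers[j][0], h, centers[j][1], lam, alpha))
                      for h in H]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centers)):
            idx = [i for i, lab in enumerate(labels) if lab == j]
            if not idx:
                continue  # assumption: an empty cluster keeps its previous centers
            members = [H[i] for i in idx]
            ws = [w[i] for i in idx]
            right = alpha_mean(members, ws, alpha)
            left = alpha_mean(members, ws, -alpha)  # l_alpha = r_{-alpha}
            centers[j] = (left, right)
    return labels, centers
```
+
+The initial `centers` argument would typically be the output of the mixed α-seeding.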
+
+4. Sided, Symmetrized and Mixed α-Centroids
+
+The k-means clustering requires assigning data elements to their closest cluster center and
+then updating those cluster centers by taking their centroids. This section investigates the centroid
+computations for the sided, symmetrized and mixed α-divergences.
+Note that the mixed α-seeding presented in Section 3 does not require computing centroids and,
+yet, guarantees probabilistically a good clustering partition.
+Since mixed α-divergences are f-divergences, we start with the generic f-centroids.
+
+4.1. Csiszár f-Centroids
+
+The centroids induced by f-divergences of a set of positive measures (that relaxes the
+normalisation constraint) have been studied by Ben-Tal et al. [33]. Those entropic centroids are
+shown to be unique, since f-divergences are convex statistical distances in both arguments. Let Ef
+denote the energy to minimize when considering f-divergences:
+
+$$E_f \doteq \min_{x \in X} I_f(H : x) = \min_{x \in X} \sum_{j=1}^{n} w_j I_f(h_j : x), \qquad (16)$$
+
+$$= \min_{x \in X} \sum_{j=1}^{n} w_j \sum_{i=1}^{d} h^i_j\, f\left(\frac{x^i}{h^i_j}\right). \qquad (17)$$
+
+When the domain is the open probability simplex X = Δd, we get a constrained optimisation
+problem to solve. We transform this constrained minimisation problem (i.e., x ∈ Δd) into an equivalent
+unconstrained minimisation problem by using the Lagrange multiplier, γ:
+
+$$\min_{x \in \mathbb{R}^d} \sum_{j=1}^{n} w_j I_f(h_j : x) + \gamma \left(\sum_{i=1}^{d} x^i - 1\right). \qquad (18)$$
+
+Taking the derivatives according to xi, we get:
+
+$$\forall i \in \{1, ..., d\}, \quad \sum_{j=1}^{n} w_j f'\left(\frac{x^i}{h^i_j}\right) - \gamma = 0. \qquad (19)$$
+
+We now consider this equation for α-divergences and symmetrized α-divergences, both
+f-divergences.
+
+4.2. Sided Positive and Frequency α-Centroids
+
+The positive sided α-centroids for a set of weighted histograms were reported in [34] using the
+representational Bregman divergence. We summarise the results in the following theorem:
+
+Theorem 2 (Sided positive α-centroids [34]). The left-sided $l_\alpha$ and right-sided $r_\alpha$ positive weighted α-centroid
+coordinates of a set of n positive histograms $h_1, ..., h_n$ are weighted α-means:
+
+$$r^i_\alpha = f_\alpha^{-1}\left(\sum_{j=1}^{n} w_j f_\alpha(h^i_j)\right), \quad l^i_\alpha = r^i_{-\alpha}$$
+
+with $f_\alpha(x) = \begin{cases} x^{\frac{1-\alpha}{2}} & \alpha \neq \pm 1, \\ \log x & \alpha = 1. \end{cases}$
+
+Furthermore, the frequency-sided α-centroids are simply the normalized-sided α-centroids.
+
+Theorem 3 (Sided frequency α-centroids [16]). The coordinates of the sided frequency α-centroids of a set of
+n weighted frequency histograms are the normalised weighted α-means.
+
+Table 1 summarizes the results concerning the sided positive and frequency α-centroids.
+
+Table 1. Positive and frequency α-centroids: the frequency α-centroids are normalized positive
+α-centroids, where w(h) denotes the cumulative sum of the histogram bins. The arithmetic mean
+is obtained for $r_{-1} = l_1$ and the geometric mean for $r_1 = l_{-1}$.
+
+Right-sided centroid
+    Positive centroid: $r^i_\alpha = \left(\sum_{j=1}^{n} w_j (h^i_j)^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}}$ for α ≠ 1; $r^i_1 = \prod_{j=1}^{n} (h^i_j)^{w_j}$ for α = 1
+    Frequency centroid: $\tilde{r}^i_\alpha = \frac{r^i_\alpha}{w(\tilde{r}_\alpha)}$
+
+Left-sided centroid
+    Positive centroid: $l^i_\alpha = r^i_{-\alpha} = \left(\sum_{j=1}^{n} w_j (h^i_j)^{\frac{1+\alpha}{2}}\right)^{\frac{2}{1+\alpha}}$ for α ≠ −1; $l^i_{-1} = \prod_{j=1}^{n} (h^i_j)^{w_j}$ for α = −1
+    Frequency centroid: $\tilde{l}^i_\alpha = \tilde{r}^i_{-\alpha} = \frac{r^i_{-\alpha}}{w(\tilde{r}_{-\alpha})}$
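+
+The closed forms of Table 1 translate directly into code. The sketch below is illustrative only (the function names are hypothetical); it treats the geometric-mean limit explicitly and assumes strictly positive bins.
+
```python
import math


def right_alpha_centroid(H, w, alpha):
    """Right-sided positive alpha-centroid, coordinate by coordinate."""
    total = sum(w)
    dim = len(H[0])
    if alpha == 1.0:  # limit case: weighted geometric mean
        return [math.exp(sum(wj * math.log(h[i]) for wj, h in zip(w, H)) / total)
                for i in range(dim)]
    e = (1.0 - alpha) / 2.0
    return [(sum(wj * h[i] ** e for wj, h in zip(w, H)) / total) ** (1.0 / e)
            for i in range(dim)]


def left_alpha_centroid(H, w, alpha):
    """Left-sided centroid, using l_alpha = r_{-alpha}."""
    return right_alpha_centroid(H, w, -alpha)


def normalize(c):
    """Frequency centroid: divide by the cumulative sum of the bins."""
    s = sum(c)
    return [x / s for x in c]
```
+
+For example, with two equally weighted histograms [1, 4] and [4, 16], α = −1 gives the arithmetic mean [2.5, 10] and α = 1 the geometric mean [2, 8] for the right-sided centroid.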
+
+4.3. Mixed α-Centroids
+
+The mixed α-centroids for a set of n weighted histograms are defined as the minimizers of:
+
+$$\sum_j w_j M_{\lambda,\alpha}(l : h_j : r). \qquad (20)$$
+
+We state the theorem generalizing [15]:
+
+Theorem 4. The two mixed α-centroids are the left-sided and right-sided α-centroids.
+
+Figure 1 depicts a clustering result obtained with our α-clustering software. We remark that the clusters
+found are all approximately subclusters of the “distinct” clusters that appear on the figure. When
+those distinct clusters are actually the optimal clusters—which is likely to be the case when they are
+separated by large minimal distance to other clusters—this is clearly a desirable qualitative property
+as long as the number of experimental clusters is not too large compared to the number of optimal
+clusters. We remark also that in the experiment displayed, there is no closed form solution for the
+cluster centers.
+
+Figure 1. Snapshot of the α-clustering software. Here, n = 800 frequency histograms of three bins with
+k = 8, and α = 0.7 and λ = 1/2.
+
+4.4. Symmetrized Jeffreys-Type α-Centroids
+
+The Kullback–Leibler divergence can be symmetrized in various ways: Jeffreys divergence,
+Jensen–Shannon divergence and Chernoff information, just to mention a few. Here, we consider the
+following symmetrization of α-divergences extending Jeffreys J-divergence:
+
+$$S_\alpha(p, q) = \frac{1}{2}\left(D_\alpha(p : q) + D_\alpha(q : p)\right) = S_{-\alpha}(p, q),$$
+
+$$= M_{\frac{1}{2},\alpha}(p : q : p). \qquad (21)$$
+
+For α = ±1, we get half of Jeffreys divergence:
+
+$$S_{\pm 1}(p, q) = \frac{1}{2} \sum_{i=1}^{d} (p^i - q^i) \log \frac{p^i}{q^i}.$$
+
+In particular, when p and q are frequency histograms, we have for α ≠ ±1:
+
+$$J_\alpha(\tilde{p} : \tilde{q}) = \frac{8}{1-\alpha^2}\left(1 - \sum_{i=1}^{d} H_{\frac{1-\alpha}{2}}(\tilde{p}^i, \tilde{q}^i)\right), \qquad (22)$$
+
+where $H_{\frac{1-\alpha}{2}}(a, b)$ is a symmetric Heinz mean [35,36]:
+
+$$H_\beta(a, b) = \frac{a^\beta b^{1-\beta} + a^{1-\beta} b^\beta}{2}.$$
+
+Heinz means interpolate the arithmetic and geometric means and satisfy the inequality:
+
+$$\sqrt{ab} = H_{\frac{1}{2}}(a, b) \le H_\alpha(a, b) \le H_0(a, b) = \frac{a+b}{2}.$$
+
+(Another interesting property of Heinz means is the integral representation of the logarithmic mean:
+$L(x, y) = \frac{x-y}{\log x - \log y} = \int_0^1 H_\beta(x, y)\,d\beta$. This allows one to prove easily that $\sqrt{xy} \le L(x, y) \le \frac{x+y}{2}$.)
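+
+The interpolation property and the integral representation of the logarithmic mean are easy to check numerically. The snippet below is a small sketch using a midpoint rule; the function names and the number of quadrature points are arbitrary choices.
+
```python
import math


def heinz_mean(a, b, beta):
    """Symmetric Heinz mean H_beta(a, b) of two positive numbers."""
    return 0.5 * (a ** beta * b ** (1.0 - beta) + a ** (1.0 - beta) * b ** beta)


def log_mean(x, y):
    """Logarithmic mean L(x, y), with L(x, x) = x by continuity."""
    return x if x == y else (x - y) / (math.log(x) - math.log(y))


def integrated_heinz(x, y, n=2000):
    """Midpoint-rule approximation of the integral of H_beta(x, y) over beta in [0, 1]."""
    return sum(heinz_mean(x, y, (i + 0.5) / n) for i in range(n)) / n
```
+
+For a = 2 and b = 8, H_{1/2} is the geometric mean 4, H_0 is the arithmetic mean 5, and the integral agrees with L(2, 8) = 6 / log 4 up to the quadrature error.
+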
+The Jα-divergence is a Csiszár f-divergence [24,25].
+Observe that it is enough to consider α ∈ [0, ∞) and that the symmetrized α-divergence for
+positive and frequency histograms coincide only for α = ±1.
+For α = ±1, Sα(p, q) tends to the Jeffreys divergence:
+
+$$J(p, q) = KL(p, q) + KL(q, p) = \sum_{i=1}^{d} (p^i - q^i)(\log p^i - \log q^i). \qquad (23)$$
+
+The Jeffreys divergence writes mathematically the same for frequency histograms:
+
+$$J(\tilde{p}, \tilde{q}) = KL(\tilde{p}, \tilde{q}) + KL(\tilde{q}, \tilde{p}) = \sum_{i=1}^{d} (\tilde{p}^i - \tilde{q}^i)(\log \tilde{p}^i - \log \tilde{q}^i). \qquad (24)$$
+
+We state the results reported in [32]:
+
+Theorem 5 (Jeffreys positive centroid [32]). The Jeffreys positive centroid $c = (c^1, ..., c^d)$ of a set $\{h_1, ..., h_n\}$
+of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W
+analytic function:
+
+$$c^i = \frac{a^i}{W\left(\frac{a^i}{g^i} e\right)},$$
+
+where $a^i = \sum_{j=1}^{n} \pi_j h^i_j$ denotes the coordinate-wise arithmetic weighted means and $g^i = \prod_{j=1}^{n} (h^i_j)^{\pi_j}$ the
+coordinate-wise geometric weighted means.
+
+The Lambert analytic function W [37] (positive branch) is defined by $W(x)e^{W(x)} = x$ for x ≥ 0.
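+
+A short sketch of how the closed form of Theorem 5 might be evaluated follows. The Python standard library has no Lambert W function, so the snippet uses a plain Newton iteration on w e^w = x as a stand-in; the starting point log(1 + x) and the tolerance are assumptions of this illustration, and the function names are hypothetical.
+
```python
import math


def lambert_w(x, tol=1e-14, max_iter=100):
    """Positive branch W(x) for x >= 0, by Newton's method on w * exp(w) = x."""
    w = math.log1p(x)  # starting guess, at or above the root for x >= 0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


def jeffreys_positive_centroid(H, weights):
    """c^i = a^i / W(e * a^i / g^i), with a, g the weighted arithmetic and geometric means."""
    total = sum(weights)
    centroid = []
    for i in range(len(H[0])):
        a = sum(wj * h[i] for wj, h in zip(weights, H)) / total
        g = math.exp(sum(wj * math.log(h[i]) for wj, h in zip(weights, H)) / total)
        centroid.append(a / lambert_w(math.e * a / g))
    return centroid
```
+
+A quick sanity check is that a single histogram is its own centroid, since then a = g and W(e) = 1.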
+
+Theorem 6 (Jeffreys frequency centroid [32]). Let $\tilde{c}$ denote the Jeffreys frequency centroid and $\tilde{c}' = \frac{c}{w_c}$ the
+normalised Jeffreys positive centroid. Then, the approximation factor $\alpha_{\tilde{c}'} = \frac{S_1(\tilde{c}', \tilde{H})}{S_1(\tilde{c}, \tilde{H})}$ is such that $1 \le \alpha_{\tilde{c}'} \le \frac{1}{w_c}$
+(with $w_c \le 1$).
+
+Therefore, we shall consider α ̸= ±1 in the remainder.
+We state the following lemma generalizing the former results in [38] that were tailored to the
+symmetrized Kullback–Leibler divergence or the symmetrized Bregman divergence [14]:
+
+Lemma 1 (Reduction property). The symmetrized $J_\alpha$-centroid of a set of n weighted histograms amounts to
+computing the symmetrized α-centroid for the weighted α-mean and −α-mean:
+
+$$\min J_\alpha(x, H) = \min_x \left(D_\alpha(x : r_\alpha) + D_\alpha(l_\alpha : x)\right).$$
+
+Proof. It follows that the minimization problem $\min_x S_\alpha(x, H) = \sum_{j=1}^{n} w_j S_\alpha(x, h_j)$ reduces to the
+following minimization:
+
+$$\min \sum_{i=1}^{d} x^i - (x^i)^{\frac{1+\alpha}{2}} \bar{h}^i_\alpha - (x^i)^{\frac{1-\alpha}{2}} \bar{h}^i_{-\alpha}. \qquad (25)$$
+
+This is equivalent to minimizing:
+
+$$\equiv \sum_{i=1}^{d} x^i - (x^i)^{\frac{1+\alpha}{2}} \left((\bar{h}^i_\alpha)^{\frac{2}{1-\alpha}}\right)^{\frac{1-\alpha}{2}} - (x^i)^{\frac{1-\alpha}{2}} \left((\bar{h}^i_{-\alpha})^{\frac{2}{1+\alpha}}\right)^{\frac{1+\alpha}{2}},$$
+
+$$\equiv \sum_{i=1}^{d} x^i - (x^i)^{\frac{1+\alpha}{2}} (r^i_\alpha)^{\frac{1-\alpha}{2}} - (x^i)^{\frac{1-\alpha}{2}} (l^i_\alpha)^{\frac{1+\alpha}{2}}$$
+
+$$\equiv D_\alpha(x : r_\alpha) + D_\alpha(l_\alpha : x).$$
+
+Note that for α = ±1, the lemma states that the minimization problem is equivalent to minimizing
+KL(a : x) + KL(x : g) with respect to x, where $a = l_1$ and $g = r_1$ denote the arithmetic and geometric
+means, respectively.
+
+The lemma states that the optimization problem with n weighted histograms is equivalent to the
+optimization with only two equally weighted histograms.
+The positive symmetrized α-centroid is equivalent to computing a representation symmetrized
+Bregman centroid [14,34].
+The frequency symmetrized α-centroid asks to minimize the following problem:
+
+$$\min_{\tilde{x} \in \Delta_d} \sum_j w_j S_\alpha(\tilde{x}, \tilde{h}_j).$$
+
+Instead of seeking $\tilde{x}$ in the probability simplex, we can optimize on the unconstrained
+domain $\mathbb{R}^{d-1}$ by using a reparameterization. Indeed, frequency histograms belong to the exponential
+families [39] (multinomials).
+Exponential families also include many other continuous distributions, like the Gaussian, Beta or
+Dirichlet distributions. It turns out the α-divergences can be computed in closed-form for members of
+the same exponential family:
+
+Lemma 2. The α-divergence for distributions belonging to the same exponential families amounts to computing
+a divergence on the corresponding natural parameters:
+
+$$A_\alpha(p : q) = \frac{4}{1-\alpha^2}\left(1 - e^{-J_F^{\left(\frac{1-\alpha}{2}\right)}(\theta_p : \theta_q)}\right),$$
+
+where $J_F^{\beta}(\theta_1 : \theta_2) = \beta F(\theta_1) + (1-\beta) F(\theta_2) - F(\beta\theta_1 + (1-\beta)\theta_2)$ is a skewed Jensen divergence defined for
+the log-normaliser F of the family.
+
+The proof follows from the fact that $\int p^{\alpha}(x) q^{1-\alpha}(x)\,dx = e^{-J_F^{(\alpha)}(\theta_p : \theta_q)}$; see [40].
+
+First, we convert a frequency histogram $\tilde{h}$ to its natural parameter θ with $\theta^i = \log \frac{\tilde{h}^i}{\tilde{h}^d}$; see [39].
+The log-normaliser is a non-separable convex function $F(\theta) = \log(1 + \sum_i e^{\theta^i})$. To convert back a
+multinomial to a frequency histogram with d bins, we first set $\tilde{h}^d = \frac{1}{1 + \sum_{l=1}^{d-1} e^{\theta^l}}$ and then retrieve the
+other bin values as $\tilde{h}^i = \tilde{h}^d e^{\theta^i}$.
+The centroids with respect to skewed Jensen divergences have been investigated in [13,40].
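+
+The conversion between a frequency histogram and its natural parameter, and the closed form of Lemma 2, can be checked numerically. The sketch below assumes α ≠ ±1 and strictly positive bins; the function names are illustrative.
+
```python
import math


def to_natural(h):
    """theta^i = log(h^i / h^d) for i = 1, ..., d-1 (the last bin is the reference)."""
    return [math.log(x / h[-1]) for x in h[:-1]]


def log_normalizer(theta):
    """F(theta) = log(1 + sum_i exp(theta^i))."""
    return math.log(1.0 + sum(math.exp(t) for t in theta))


def from_natural(theta):
    """Recover the d-bin frequency histogram from its natural parameter."""
    last = 1.0 / (1.0 + sum(math.exp(t) for t in theta))
    return [last * math.exp(t) for t in theta] + [last]


def skew_jensen(theta1, theta2, beta):
    """J_F^beta(theta1 : theta2) for the multinomial log-normalizer F."""
    mix = [beta * a + (1.0 - beta) * b for a, b in zip(theta1, theta2)]
    return (beta * log_normalizer(theta1) + (1.0 - beta) * log_normalizer(theta2)
            - log_normalizer(mix))


def alpha_div_natural(p, q, alpha):
    """alpha-divergence of Lemma 2, computed on the natural parameters."""
    j = skew_jensen(to_natural(p), to_natural(q), (1.0 - alpha) / 2.0)
    return 4.0 / (1.0 - alpha * alpha) * (1.0 - math.exp(-j))


def alpha_div_direct(p, q, alpha):
    """Same divergence computed directly on frequency histograms."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return 4.0 / (1.0 - alpha * alpha) * (1.0 - sum(x ** a * y ** b for x, y in zip(p, q)))
```
+
+Both routes should agree up to rounding, and the round trip through the natural parameter should return the original histogram.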
+
+Remark 4. Note that for the special case of α = 0 (squared Hellinger centroid), the sided and symmetrized
+centroids coincide. In that case, the coordinates $s^i_0$ of the squared Hellinger centroid are:
+
+$$s^i_0 = \left(\sum_{j=1}^{n} w_j \sqrt{h^i_j}\right)^2, \quad 1 \le i \le d.$$
+
+Remark 5. The symmetrized positive α-centroids can be solved in special cases (α = ±3, α = ±1 corresponding
+to the symmetrized χ2 and Jeffreys positive centroids). For frequency centroids, when dealing with binary
+histograms (d = 2), we have only one degree of freedom and can solve the binary frequency centroids. Binary
+histograms (and mixtures thereof) are used in computer vision and pattern recognition [41].
+
+Remark 6. Since α-divergences are Csiszár f-divergences and f-divergences can always be symmetrized by
+taking generator $s(t) = f(t) + t f(\frac{1}{t})$, we deduce that symmetrized α-divergences $S_\alpha$ are f-divergences for
+the generator:
+
+$$f(t) = -\log((1-\alpha) + \alpha t) - t \log\left((1-\alpha) + \frac{\alpha}{t}\right). \qquad (26)$$
+
+Hence, $S_\alpha$ divergences are convex in both arguments, and the $s_\alpha$ centroids are unique.
+
+5. Soft Mixed α-Clustering
+
+Algorithm 4 reports the general clustering with soft membership, which can be adapted to left
+(λinit = 1), right (λinit = 0) or mixed clustering. We have not considered a weighted histogram set in
+order not to overload the notation and because the extension is straightforward.
+Again, for skew Jeffreys centers, we shall adopt a variational approach. Notice that the soft
+clustering approach learns all parameters, including λ (if not constrained to zero or one) and α ∈ R.
+This is not the case for Matsuyama’s α-expectation maximization (EM) algorithm [42] in which α is
+fixed beforehand (and, thus, not learned).
+Assuming we model the prior for histograms by:
+
+$$p_{\lambda,\alpha,j}(h_i) \propto \lambda \exp(-D_\alpha(l_j : h_i)) + (1-\lambda) \exp(-D_\alpha(h_i : r_j)), \qquad (27)$$
+
+the negative log-likelihood involves the α-depending quantity:
+
+$$\sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) \log p_{\lambda,\alpha,j}(h_i) \ge \sum_{j=1}^{k} \sum_{i=1}^{m} M_{\lambda,\alpha}(l_j : h_i : r_j)\, p(j|h_i), \qquad (28)$$
+
+because of the concavity of the logarithm function. Therefore, the maximization step for α
+involves finding:
+
+$$\arg\max_\alpha \sum_{j=1}^{k} \sum_{i=1}^{m} M_{\lambda,\alpha}(l_j : h_i : r_j)\, p(j|h_i). \qquad (29)$$
+
+Algorithm 4: Mixed α-soft clustering; MAsC(H, k, λ, α)
+
+Input: Histogram set H with |H| = m, integer k > 0, real λ ← λinit ∈ [0, 1], real α ∈ R;
+Let $C = \{(l_i, r_i)\}_{i=1}^{k}$ ← MAS(H, k, λ, α);
+repeat
+    //Expectation
+    for i = 1, 2, ..., m do
+        for j = 1, 2, ..., k do
+            $p(j|h_i) = \frac{\pi_j \exp(-M_{\lambda,\alpha}(l_j : h_i : r_j))}{\sum_{j'} \pi_{j'} \exp(-M_{\lambda,\alpha}(l_{j'} : h_i : r_{j'}))}$;
+    //Maximization
+    for j = 1, 2, ..., k do
+        $\pi_j \leftarrow \frac{1}{m} \sum_i p(j|h_i)$;
+        $l_j \leftarrow \left(\frac{1}{\sum_i p(j|h_i)} \sum_i p(j|h_i)\, h_i^{\frac{1+\alpha}{2}}\right)^{\frac{2}{1+\alpha}}$;
+        $r_j \leftarrow \left(\frac{1}{\sum_i p(j|h_i)} \sum_i p(j|h_i)\, h_i^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}}$;
+    //Alpha - Lambda
+    $\alpha \leftarrow \alpha - \eta_1 \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) \frac{\partial}{\partial\alpha} M_{\lambda,\alpha}(l_j : h_i : r_j)$;
+    if λinit ≠ 0, 1 then
+        $\lambda \leftarrow \lambda - \eta_2 \left(\sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) D_\alpha(l_j : h_i) - \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) D_\alpha(h_i : r_j)\right)$;
+    //for some small η1, η2; ensure that λ ∈ [0, 1].
+until convergence;
+Output: Soft clustering of H according to k densities p(j|.) following C;
+
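+No closed-form solution is known, so we compute the gradient update in Algorithm 4 with:
+
+$$\frac{\partial M_{\lambda,\alpha}(l_j : h_i : r_j)}{\partial\alpha} = \lambda \frac{\partial D_\alpha(l_j : h_i)}{\partial\alpha} + (1-\lambda) \frac{\partial D_\alpha(h_i : r_j)}{\partial\alpha}, \qquad (30)$$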
+
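+$$\frac{\partial D_\alpha(p : q)}{\partial\alpha} = \frac{2}{(1-\alpha)^2} \times \left(q - \left(\frac{1-\alpha}{1+\alpha}\right)^2 p - \frac{1-\alpha}{1+\alpha}\, p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}} \left(\frac{4\alpha}{1-\alpha^2} + \ln\frac{q}{p}\right)\right). \qquad (31)$$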
+
+The update in λ is easier as:
+
+$$\frac{\partial M_{\lambda,\alpha}(l_j : h_i : r_j)}{\partial\lambda} = D_\alpha(l_j : h_i) - D_\alpha(h_i : r_j). \qquad (32)$$
+
+Maximizing the likelihood in λ would imply choosing λ ∈ {0, 1} (a hard choice for left/right centers),
+yet we prefer the soft update for the parameter, like for α.
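+
+One expectation/maximization pass of the soft clustering can be sketched as below, keeping α and λ fixed (their gradient updates are omitted here). The sketch assumes α ≠ ±1 and an unweighted histogram set; the shift by the smallest divergence before exponentiating is a numerical safeguard added for this illustration and is not part of Algorithm 4. The function names are hypothetical.
+
```python
import math


def alpha_div(p, q, alpha):
    """D_alpha(p : q) for positive arrays, assuming alpha != +-1."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    scale = 4.0 / (1.0 - alpha * alpha)
    return scale * sum(a * x + b * y - x ** a * y ** b for x, y in zip(p, q))


def mixed_div(l, h, r, lam, alpha):
    """M_{lam,alpha}(l : h : r) = lam * D(l : h) + (1 - lam) * D(h : r)."""
    return lam * alpha_div(l, h, alpha) + (1.0 - lam) * alpha_div(h, r, alpha)


def soft_step(H, centers, pi, lam, alpha):
    """Return posteriors p(j|h_i), updated mixing weights and updated (l_j, r_j)."""
    m, k, dim = len(H), len(centers), len(H[0])
    post = []
    for h in H:  # expectation
        div = [mixed_div(l, h, r, lam, alpha) for l, r in centers]
        base = min(div)  # shift the exponents to avoid underflow
        score = [pi[j] * math.exp(base - div[j]) for j in range(k)]
        total = sum(score)
        post.append([s / total for s in score])
    new_pi, new_centers = [], []
    for j in range(k):  # maximization
        mass = sum(post[i][j] for i in range(m))
        new_pi.append(mass / m)
        sided = []
        for e in ((1.0 + alpha) / 2.0, (1.0 - alpha) / 2.0):  # left, then right
            sided.append([(sum(post[i][j] * H[i][b] ** e for i in range(m)) / mass) ** (1.0 / e)
                          for b in range(dim)])
        new_centers.append((sided[0], sided[1]))
    return post, new_pi, new_centers
```
+
+Iterating `soft_step` and interleaving small gradient steps on α and λ gives the overall loop of Algorithm 4.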
+
+6. Conclusions
+
+The family of α-divergences plays a fundamental role in information geometry: these statistical
+distortion measures are the canonical divergences of dual spaces on probability distribution manifolds
+with constant curvature $\kappa = \frac{1-\alpha^2}{4}$ and the canonical divergences of dually flat manifolds for positive
+distribution manifolds [19].
+In this work, we have presented three techniques for clustering (positive or frequency) histograms
+using k-means:
+
+(1) Sided left or right α-centroid k-means,
+(2) Symmetrized Jeffreys-type α-centroid (variational) k-means, and
+(3) Coupled k-means with respect to mixed α-divergences relying on dual α-centroids.
+
+Sided and mixed dual centroids are always available in closed form and are therefore highly
+attractive from the standpoint of implementation. Symmetrized Jeffreys centroids are in general not
+available in closed-form and require one to implement a variational k-means by updating incrementally
+the cluster centroids in order to monotonically decrease the k-means loss function. From the clustering
+standpoint, this appears not to be a problem when guaranteed expected approximations to the optimal
+clustering are enough.
+Indeed, we also presented and analyzed an extension of k-means++ [17] for seeding those k-means
+algorithms. The mixed α-seeding initializations do not require one to calculate centroids and behave
+like a discrete k-means by picking up the seeds among the data. We reported guaranteed probabilistic
+clustering bounds. Thus, it yields a fast hard/soft data partitioning technique with respect to mixed or
+symmetrized α-divergences. Recently, the advantage of clustering using α-divergences by tuning α in
+applications has been demonstrated in [18]. We thus expect the computationally fast mixed α-seeding
+with guaranteed performance to be useful in a growing number of applications.
+
+Acknowledgments
+
+NICTA is funded by the Australian Government as represented by the Department of Broadband,
+Communication and the Digital Economy and the Australian Research Council through the ICT Centre
+of Excellence program.
+
+Author Contributions
+All authors contributed equally to the design of the research. The research was carried out by all
+authors. Frank Nielsen and Richard Nock wrote the paper. Frank Nielsen implemented the algorithms
+and performed experiments. All authors have read and approved the final manuscript.
+
+Conflicts of Interest
+The authors declare no conflict of interest.
+
+Appendix Proof Sketch of Theorem 1
+
+We give here the key results allowing one to obtain the proof of the Theorem, following the proof
+scheme of [15]. In order not to overload the notation, weights are considered uniform. The extension to
+non-uniform weights is immediate as it boils down to duplicating histograms in the histogram set and
+does not change the approximation result.
+Let A ⊆ H be an arbitrary cluster of Copt. Let us define UA and πA as the uniform and biased
+distributions conditioned to A. The key to the proof is to relate the expected potential of A under UA
+and πA to its contribution to the optimal potential.
+
+Lemma A1. Let A ⊆ H be an arbitrary cluster of $C^{opt}$. Then:
+
+$$E_{c \sim U_A}[M_{\lambda,\alpha}(A, c)] = M_{opt,\lambda,\alpha}(A) + M_{opt,\lambda,-\alpha}(A) = E_{c \sim U_A}[M_{\lambda,-\alpha}(A, c)],$$
+
+where $U_A$ is the uniform distribution over A.
+
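+Proof. α-coordinates have the property that for any subset A ⊆ H, $(1/|A|) \sum_{p \in A} u_\alpha(p) = u_\alpha(r_{\alpha,A})$.
+Hence, we have:
+
+$$\forall c \in A, \quad \sum_{p \in A} D_\alpha(p : c) = \sum_{p \in A} D_{\phi_\alpha}(u_\alpha(p) : u_\alpha(c))$$
+$$= \sum_{p \in A} D_{\phi_\alpha}(u_\alpha(p) : u_\alpha(r_{\alpha,A})) + |A|\, D_{\phi_\alpha}(u_\alpha(r_{\alpha,A}) : u_\alpha(c))$$
+$$= \sum_{p \in A} D_\alpha(p : r_{\alpha,A}) + |A|\, D_\alpha(r_{\alpha,A} : c). \qquad (A1)$$
+
+Because $D_\alpha(p : q) = D_{-\alpha}(q : p)$ and $l_\alpha = r_{-\alpha}$, we obtain:
+
+$$\forall c \in A, \quad \sum_{p \in A} D_\alpha(c : p) = \sum_{p \in A} D_{-\alpha}(p : c)$$
+$$= \sum_{p \in A} D_{-\alpha}(p : r_{-\alpha,A}) + |A|\, D_{-\alpha}(r_{-\alpha,A} : c)$$
+$$= \sum_{p \in A} D_\alpha(l_{\alpha,A} : p) + |A|\, D_\alpha(c : l_{\alpha,A}). \qquad (A2)$$
+
+It comes now from (A1) and (A2) that:
+
+$$E_{c \sim U_A}[M_{\lambda,\alpha}(A, c)] = \frac{1}{|A|} \sum_{c \in A} \sum_{p \in A} \left\{\lambda D_\alpha(c : p) + (1-\lambda) D_\alpha(p : c)\right\}$$
+$$= (1-\lambda) \sum_{p \in A} D_\alpha(p : r_{\alpha,A}) + (1-\lambda) \sum_{p \in A} D_\alpha(r_{\alpha,A} : p) + \lambda \sum_{p \in A} D_\alpha(l_{\alpha,A} : p) + \lambda \sum_{p \in A} D_\alpha(p : l_{\alpha,A})$$
+$$= (1-\lambda) M_{opt,0,\alpha}(A) + \lambda M_{opt,1,\alpha}(A) + (1-\lambda) M_{opt,0,-\alpha}(A) + \lambda M_{opt,1,-\alpha}(A)$$
+$$= M_{opt,\lambda,\alpha}(A) + M_{opt,\lambda,-\alpha}(A). \qquad (A3)$$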
+
+This gives the left-hand side equality of the Lemma. The right-hand side follows from the fact that
+$E_{c \sim U_A}[M_{\lambda,-\alpha}(A, c)] = M_{opt,1-\lambda,\alpha}(A) + M_{opt,1-\lambda,-\alpha}(A)$.
+
+Instead of $M_{opt,\lambda,\alpha}(A) + M_{opt,\lambda,-\alpha}(A)$, we want a term depending solely on $M_{opt,\lambda,\alpha}(A)$ as it is
+the “true” optimum. We now give two lemmata that shall be useful in obtaining this upper bound.
+The first is of independent interest, as it shows that any α-divergence is a scaled, squared Hellinger
+distance between geometric means of points.
+
+Lemma A2. For any p, q and α ≠ 1, there exists r ∈ [p, q], such that $(1-\alpha)^2 D_\alpha(p : q) = D_0(p^{1-\alpha} r^\alpha : q^{1-\alpha} r^\alpha)$.
+
+Proof. By the definition of Bregman divergences, for any x, y, there exists some z ∈ [x, y], such that:
+
+$$D_{\phi_\alpha}(x : y) = \frac{1}{2}(x-y)^2 \phi''_\alpha(z) = \frac{1}{2}(x-y)^2 \left(1 + \frac{1-\alpha}{2} z\right)^{\frac{2\alpha}{1-\alpha}},$$
+
+and since $u_\alpha$ is continuous and strictly increasing, for any p, q, there exists some r ∈ [p, q], such that:
+
+$$D_\alpha(p : q) = D_{\phi_\alpha}(u_\alpha(p) : u_\alpha(q))$$
+$$= \frac{1}{2}(u_\alpha(p) - u_\alpha(q))^2 \left(1 + \frac{1-\alpha}{2} u_\alpha(r)\right)^{\frac{2\alpha}{1-\alpha}}$$
+$$= \frac{2}{(1-\alpha)^2} \left(p^{\frac{1-\alpha}{2}} - q^{\frac{1-\alpha}{2}}\right)^2 r^\alpha$$
+$$= \frac{2}{(1-\alpha)^2} \left(p^{1-\alpha} + q^{1-\alpha} - 2(pq)^{\frac{1-\alpha}{2}}\right) r^\alpha$$
+$$= \frac{1}{(1-\alpha)^2} D_0(p^{1-\alpha} r^\alpha : q^{1-\alpha} r^\alpha).$$
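+
+Lemma A2 can be illustrated on scalars: solving the identity for r and checking that it falls between p and q. The sketch below assumes p ≠ q, α ≠ ±1 and α ≠ 0 (so that r can be recovered from r^α), and uses D_0(a : b) = 2(√a − √b)², which is the α = 0 case of the divergence. The function names are hypothetical.
+
```python
import math


def alpha_div_scalar(p, q, alpha):
    """D_alpha(p : q) for positive scalars, assuming alpha != +-1."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return 4.0 / (1.0 - alpha * alpha) * (a * p + b * q - p ** a * q ** b)


def hellinger_div(a, b):
    """D_0(a : b) = 2 * (sqrt(a) - sqrt(b))^2."""
    return 2.0 * (math.sqrt(a) - math.sqrt(b)) ** 2


def intermediate_point(p, q, alpha):
    """Solve (1 - alpha)^2 D_alpha(p : q) = r^alpha * D_0(p^(1-alpha) : q^(1-alpha)) for r."""
    ratio = ((1.0 - alpha) ** 2 * alpha_div_scalar(p, q, alpha)
             / hellinger_div(p ** (1.0 - alpha), q ** (1.0 - alpha)))
    return ratio ** (1.0 / alpha)
```
+
+The factor r^α can be pulled out of D_0 because D_0 is homogeneous of degree one in its two arguments.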
+
+Lemma A3. Let discrete random variable x take non-negative values $x_1, x_2, ..., x_m$ with uniform probabilities.
+Then, for any β > −1, we have $\mathrm{var}(x^{1+\beta}/u^\beta) \le \mathrm{var}(x)$, with $u \doteq (1+\beta)^\beta \max_i x_i$.
+
+Proof. First, ∀β > −1, remark that for any x, function $f(x) = x(u^\beta - x^\beta)$ is increasing for $x \le u/(1+\beta)^\beta$.
+Hence, assuming that the $x_i$s are put in non-increasing order without loss of generality, we
+have $f(x_i) \ge f(x_j)$, and so, $x_i(u^\beta - x_i^\beta) \ge x_j(u^\beta - x_j^\beta)$, ∀i ≥ j, as long as $x_i \le u/(1+\beta)^\beta$. Choosing
+$u = x_1(1+\beta)^\beta$ yields, after reordering and putting the exponent, $(x_i^{1+\beta} - x_j^{1+\beta})^2 \le (x_i u^\beta - x_j u^\beta)^2$.
+Hence:
+
+$$\frac{1}{m} \sum_i x_i^{2(1+\beta)} - \left(\frac{1}{m} \sum_i x_i^{1+\beta}\right)^2 = \frac{1}{2m^2} \sum_{i,j} (x_i^{1+\beta} - x_j^{1+\beta})^2$$
+$$\le \frac{1}{2m^2} \sum_{i,j} (x_i u^\beta - x_j u^\beta)^2$$
+$$= \frac{u^{2\beta}}{2m^2} \sum_{i,j} (x_i - x_j)^2$$
+$$= u^{2\beta} \left(\frac{1}{m} \sum_i x_i^2 - \left(\frac{1}{m} \sum_i x_i\right)^2\right).$$
+
+Dividing by $u^{2\beta}$ the leftmost and rightmost terms and using the fact that $\mathrm{var}(\lambda x) = \lambda^2 \mathrm{var}(x)$ yields
+the statement of the Lemma.
+
+We are now ready to upper bound $M_{opt,\lambda,-\alpha}(A)$ as a function of $M_{opt,\lambda,\alpha}(A)$.
+
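+Lemma A4. For any cluster A of $C^{opt}$,
+
+$$M_{opt,\lambda,-\alpha}(A) \le M_{opt,\lambda,\alpha}(A) \times \begin{cases} f(\lambda) & \text{if } \lambda \in (0, 1) \\ z(\alpha) h^2(\alpha) & \text{otherwise} \end{cases},$$
+
+where z(α), f(λ) and h(α) are defined in Theorem 1.
+
+Proof. The case λ ≠ 0, 1 is fast, as we have by definition:
+
+$$M_{opt,\lambda,-\alpha}(A) = \sum_{p \in A} \lambda D_{-\alpha}(l_{-\alpha,A} : p) + (1-\lambda) D_{-\alpha}(p : r_{-\alpha,A})$$
+$$= \sum_{p \in A} \lambda D_\alpha(p : l_{-\alpha,A}) + (1-\lambda) D_\alpha(r_{-\alpha,A} : p)$$
+$$= \sum_{p \in A} \lambda D_\alpha(p : r_{\alpha,A}) + (1-\lambda) D_\alpha(l_{\alpha,A} : p)$$
+$$\le \max\left\{\frac{1-\lambda}{\lambda}, \frac{\lambda}{1-\lambda}\right\} M_{opt,\lambda,\alpha}(A)$$
+$$= f(\lambda) M_{opt,\lambda,\alpha}(A).$$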
+
+Suppose now that λ = 0 and α ≥ 0. Because $M_{opt,0,-\alpha}(A) = \sum_{p \in A} D_{-\alpha}(p : r_{-\alpha,A}) = \sum_{p \in A} D_\alpha(l_{\alpha,A} : p) = M_{opt,1,\alpha}(A)$,
+what we wish to do is upper bound $\sum_{p \in A} D_\alpha(l_{\alpha,A} : p) = M_{opt,1,\alpha}(A)$
+as a function of $\sum_{p \in A} D_\alpha(p : r_{\alpha,A}) = M_{opt,0,\alpha}(A)$. We use Lemmata A2 and A3 in the following
+derivations, using r(p) to refer to the r in Lemma A2, assuming α ≥ 0. We also note $\mathrm{var}_A(f(p))$ as
+the variance, under the uniform distribution over A, of discrete random variable f(p), for p ∈ A. We
+have:
+
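+$$\sum_{p \in A} D_\alpha(l_{\alpha,A} : p) = \sum_{p \in A} D_{-\alpha}(p : l_{\alpha,A})$$
+$$= \frac{1}{(1+\alpha)^2} \sum_{p \in A} r(p)^{-\alpha} D_0(p^{1+\alpha} : l_{\alpha,A}^{1+\alpha})$$
+$$\le \frac{1}{(1+\alpha)^2 \min_A p^\alpha} \sum_{p \in A} D_0(p^{1+\alpha} : l_{\alpha,A}^{1+\alpha})$$
+$$= \frac{1}{(1+\alpha)^2 \min_A p^\alpha} \sum_{p \in A} \left(p^{1+\alpha} + l_{\alpha,A}^{1+\alpha} - 2 p^{\frac{1+\alpha}{2}} l_{\alpha,A}^{\frac{1+\alpha}{2}}\right)$$
+$$= \frac{|A|}{(1+\alpha)^2 \min_A p^\alpha} \left(\frac{1}{|A|} \sum_{p \in A} p^{1+\alpha} - \left(\frac{1}{|A|} \sum_{p \in A} p^{\frac{1+\alpha}{2}}\right)^2\right)$$
+$$= \frac{|A|\, \mathrm{var}_A(p^{\frac{1+\alpha}{2}})}{(1+\alpha)^2 \min_A p^\alpha}. \qquad (A4)$$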
+
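+We have used the expression of left centroid $l_{\alpha,A}^{1+\alpha}$ to simplify the expressions. Now, picking $x_i = p_i^{\frac{1-\alpha}{2}}$,
+$\beta = 2\alpha/(1-\alpha)$ and $u = \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{2\alpha}{1-\alpha}} \max_A p^{\frac{1-\alpha}{2}}$ in Lemma A3 yields:
+
+$$\mathrm{var}_A(p^{\frac{1+\alpha}{2}}) = u^{2\beta}\, \mathrm{var}_A(p^{\frac{1+\alpha}{2}} / u^\beta)$$
+$$= u^{2\beta}\, \mathrm{var}_A\left(p^{\frac{1-\alpha}{2}} p^\alpha / u^\beta\right)$$
+$$= u^{2\beta}\, \mathrm{var}(x^{1+\beta} / u^\beta)$$
+$$\le u^{2\beta}\, \mathrm{var}(x)$$
+$$= u^{2\beta}\, \mathrm{var}_A\left(p^{\frac{1-\alpha}{2}}\right)$$
+$$= \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2}} \max_A p^{2\alpha}\, \mathrm{var}_A\left(p^{\frac{1-\alpha}{2}}\right). \qquad (A5)$$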
+
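+Plugging this in (A4) yields:
+
+$$\sum_{p \in A} D_\alpha(l_{\alpha,A} : p) \le \frac{\left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2}} |A| \max_A p^{2\alpha}\, \mathrm{var}_A\left(p^{\frac{1-\alpha}{2}}\right)}{(1+\alpha)^2 \min_A p^\alpha}$$
+$$= \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2} - 2} \left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \frac{|A| \min_A p^\alpha\, \mathrm{var}_A(p^{\frac{1-\alpha}{2}})}{(1-\alpha)^2}$$
+$$= \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2} - 2} \left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \frac{\min_A p^\alpha}{(1-\alpha)^2} \sum_{p \in A} D_0(p^{1-\alpha} : r_{\alpha,A}^{1-\alpha}) \qquad (A6)$$
+$$\le \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2} - 2} \left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \frac{1}{(1-\alpha)^2} \sum_{p \in A} r(p)^\alpha D_0(p^{1-\alpha} : r_{\alpha,A}^{1-\alpha})$$
+$$= \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2} - 2} \left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \sum_{p \in A} D_\alpha(p : r_{\alpha,A})$$
+$$\le z(\alpha) \left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \sum_{p \in A} D_\alpha(p : r_{\alpha,A}). \qquad (A7)$$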
+
+Here, (A6) follows backwards the path of derivations that led to (A4). The cases λ = 1 or α < 0 are
+obtained using the same chains of derivations, which completes the proof of Lemma A4.
+
+Lemma A4 can be directly used to refine the bound of Lemma A1 in the uniform distribution. We
+give the Lemma for the biased distribution, directly integrating the refinement of the bound.
+
+Lemma A5. Let A be an arbitrary cluster of $C^{opt}$ and C an arbitrary clustering. If we add a random couple
+(c, c) to C, chosen from A with π as in Algorithm 2, then:
+
+$$E_{c \sim \pi_A}[M_{\lambda,\alpha}(A, c)] \le 4 \begin{cases} f(\lambda) h^2(\alpha) M_{opt,\lambda,\alpha}(A) & \text{if } \lambda \in (0, 1) \\ z(\alpha) h^4(\alpha) M_{opt,\lambda,\alpha}(A) & \text{otherwise} \end{cases}, \qquad (A8)$$
+
+where f(λ) and h(α) are defined in Theorem 1.
+
+Proof. The proof essentially follows the proof of Lemma 3 in [15]. To complete it, we need a triangle
+inequality involving α-divergences. We give it here.
+
+Lemma A6. For any p, q, r and α, we have:
+
+$$\sqrt{D_\alpha(p : q)} \le \left(\frac{\max_i\{p_i, q_i, r_i\}}{\min_i\{p_i, q_i, r_i\}}\right)^{|\alpha|} \left(\sqrt{D_\alpha(p : r)} + \sqrt{D_\alpha(r : q)}\right) \qquad (A9)$$
+
+(where the min is over strictly positive values)
+
+Remark: take α = 0; we find the triangle inequality for the squared Hellinger distance.
+
+Proof. Using the proof of Lemma 2 in [15] for Bregman divergence $D_{\phi_\alpha}$, we get:
+
+$$\sqrt{D_{\phi_\alpha}(x : z)} \le \rho(\alpha) \left(\sqrt{D_{\phi_\alpha}(x : y)} + \sqrt{D_{\phi_\alpha}(y : z)}\right), \qquad (A10)$$
+
+where:
+
+$$\rho(\alpha) = \max_{u,v} \frac{\left(1 + \frac{1-\alpha}{2} u\right)^{\frac{2\alpha}{1-\alpha}}}{\left(1 + \frac{1-\alpha}{2} v\right)^{\frac{2\alpha}{1-\alpha}}}. \qquad (A11)$$
+
+Taking $x = u_\alpha(p)$, $y = u_\alpha(q)$, $z = u_\alpha(r)$ yields $\rho(\alpha) = \max_{s,t \in \{p_i, q_i, r_i\}} (s/t)^{|\alpha|}$ and the statement of
+Lemma A6.
+
+The rest of the proof of Lemma A5 follows the proof of Lemma 3 in [15].
+
+We now have all of the ingredients for our proof, and it remains to use Lemma 4 in [15] to complete the
+proof of Theorem 1.
+
+Appendix Properties of α-Divergences
+
+For positive arrays p and q, the α-divergence Dα(p : q) can be defined as an equivalent
+representational Bregman divergence [19,34] Bϕα(uα(p) : uα(q)) over the (uα, vα)-structure [43] with:
+
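+$$\phi_\alpha(x) \doteq \frac{2}{1+\alpha} \left(1 + \frac{1-\alpha}{2} x\right)^{\frac{2}{1-\alpha}}, \qquad (A12)$$
+
+$$u_\alpha(p) \doteq \frac{2}{1-\alpha} \left(p^{\frac{1-\alpha}{2}} - 1\right), \qquad (A13)$$
+
+$$v_\alpha(p) \doteq \frac{2}{1+\alpha}\, p^{\frac{1+\alpha}{2}}, \qquad (A14)$$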
+where we assume that α ̸= ±1. Otherwise, for α = ±1, we compute Dα(p : q) by taking the sided
+Kullback–Leibler divergence extended to positive arrays.
+In the proof of Theorem 1, we have used two properties of α-divergences of independent interest:
+
+• any α-divergence can be explained as a scaled squared Hellinger distance between geometric
+means of its arguments and a point that belongs to their segment (Lemma A2);
+• any α-divergence satisfies a generalized triangle inequality (Lemma A6). Notice that this Lemma
+is optimal in the sense that for α = 0, it is possible to recover the triangle inequality of the
+Hellinger distance.
+
+The following lemma shows how to bound the mixed divergence as a function of an α-divergence.
+
+Lemma A7. For any positive arrays l, h, r and α ≠ ±1, define $\eta \doteq \lambda(1-\alpha)/(1-\alpha(2\lambda-1)) \in [0, 1]$, $g_\eta$
+with $g^i_\eta \doteq (l^i)^\eta (r^i)^{1-\eta}$ and $a_\eta$ with $a^i_\eta \doteq \eta l^i + (1-\eta) r^i$. Then, we have:
+
+$$M_{\lambda,\alpha}(l : h : r) \le \frac{1-\alpha^2(2\lambda-1)^2}{1-\alpha^2} D_{\alpha(2\lambda-1)}(g_\eta : h) + \frac{2(1-\alpha(2\lambda-1))}{1-\alpha^2} \sum_i \left(a^i_\eta - g^i_\eta\right).$$
+
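+Proof. For all index i, we have:
+
+$$M_{\lambda,\alpha}(l^i : h^i : r^i) = \lambda D_\alpha(l^i : h^i) + (1-\lambda) D_\alpha(h^i : r^i)$$
+$$= \frac{4}{1-\alpha^2} \left(\frac{\lambda(1-\alpha)}{2} l^i + \frac{(1-\lambda)(1+\alpha)}{2} r^i + \frac{1+\alpha(2\lambda-1)}{2} h^i \right. \qquad (A15)$$
+$$\left. - \lambda (l^i)^{\frac{1-\alpha}{2}} (h^i)^{\frac{1+\alpha}{2}} - (1-\lambda) (r^i)^{\frac{1+\alpha}{2}} (h^i)^{\frac{1-\alpha}{2}}\right). \qquad (A16)$$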
+
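+The arithmetic-geometric-harmonic (AGH) inequality implies:
+
+$$\lambda (l^i)^{\frac{1-\alpha}{2}} (h^i)^{\frac{1+\alpha}{2}} + (1-\lambda) (r^i)^{\frac{1+\alpha}{2}} (h^i)^{\frac{1-\alpha}{2}} \ge (l^i)^{\frac{\lambda(1-\alpha)}{2}} (r^i)^{\frac{(1-\lambda)(1+\alpha)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}}$$
+$$= \left((l^i)^{\frac{\lambda(1-\alpha)}{1-\alpha(2\lambda-1)}} (r^i)^{\frac{(1-\lambda)(1+\alpha)}{1-\alpha(2\lambda-1)}}\right)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}}$$
+$$= \left((l^i)^\eta (r^i)^{1-\eta}\right)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}}$$
+$$= (g^i_\eta)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}}.$$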
+
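+It follows that (A16) yields:
+
+$$M_{\lambda,\alpha}(l^i : h^i : r^i) \le \frac{4}{1-\alpha^2} \left(\frac{1-\alpha(2\lambda-1)}{2} \left(\eta l^i + (1-\eta) r^i\right) + \right. \qquad (A17)$$
+$$\left. \frac{1+\alpha(2\lambda-1)}{2} h^i - (g^i_\eta)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}}\right)$$
+$$= \frac{1-\alpha^2(2\lambda-1)^2}{1-\alpha^2} D_{\alpha(2\lambda-1)}(g^i_\eta : h^i) + \frac{2(1-\alpha(2\lambda-1))}{1-\alpha^2} \left(a^i_\eta - g^i_\eta\right), \qquad (A18)$$
+
+out of which we get the statement of the Lemma.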
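+
+The bound of Lemma A7 lends itself to a quick numerical sanity check on random positive arrays. The sketch below assumes |α| < 1 so that both α and α(2λ − 1) stay away from ±1; the function names are hypothetical.
+
```python
import random


def alpha_div(p, q, alpha):
    """D_alpha(p : q) for positive arrays, assuming alpha != +-1."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    scale = 4.0 / (1.0 - alpha * alpha)
    return scale * sum(a * x + b * y - x ** a * y ** b for x, y in zip(p, q))


def mixed_div(l, h, r, lam, alpha):
    """M_{lam,alpha}(l : h : r) = lam * D(l : h) + (1 - lam) * D(h : r)."""
    return lam * alpha_div(l, h, alpha) + (1.0 - lam) * alpha_div(h, r, alpha)


def lemma_a7_bound(l, h, r, lam, alpha):
    """Right-hand side of Lemma A7."""
    skew = alpha * (2.0 * lam - 1.0)           # alpha(2 lam - 1)
    eta = lam * (1.0 - alpha) / (1.0 - skew)   # weight of l in the two means
    g = [li ** eta * ri ** (1.0 - eta) for li, ri in zip(l, r)]   # geometric
    a = [eta * li + (1.0 - eta) * ri for li, ri in zip(l, r)]     # arithmetic
    denom = 1.0 - alpha * alpha
    return ((1.0 - skew * skew) / denom * alpha_div(g, h, skew)
            + 2.0 * (1.0 - skew) / denom * sum(ai - gi for ai, gi in zip(a, g)))


def check_lemma_a7(trials=200, dim=3, seed=1):
    """Return True if the bound holds (up to rounding) on random positive inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        l, h, r = ([rng.uniform(0.1, 10.0) for _ in range(dim)] for _ in range(3))
        lam, alpha = rng.random(), rng.uniform(-0.9, 0.9)
        bound = lemma_a7_bound(l, h, r, lam, alpha)
        if mixed_div(l, h, r, lam, alpha) > bound + 1e-9 * (1.0 + abs(bound)):
            return False
    return True
```
+
+The slack in the bound comes only from the weighted arithmetic-geometric inequality applied coordinate-wise.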
+
+Appendix Sided α-Centroids
+
+For the sake of completeness, we prove the following theorem:
+
+Theorem A1 (Sided positive α-centroids [34]). The left-sided $l_\alpha$ and right-sided $r_\alpha$ positive weighted
+α-centroid coordinates of a set of n positive histograms $h_1, ..., h_n$ are weighted α-means:
+
+$$r^i_\alpha = f_\alpha^{-1}\left(\sum_{j=1}^{n} w_j f_\alpha(h^i_j)\right), \quad l^i_\alpha = r^i_{-\alpha}$$
+
+with:
+
+$$f_\alpha(x) = \begin{cases} x^{\frac{1-\alpha}{2}} & \alpha \neq \pm 1, \\ \log x & \alpha = 1. \end{cases}$$
+
+Proof. We distinguish three cases: α ̸= ±1, α = −1 and α = 1.
+First, consider the general case α ̸= ±1. We have to minimize:
+
+$$R_\alpha(x, H) = \frac{4}{1-\alpha^2} \sum_{j=1}^{n} w_j \times \sum_{i=1}^{d} \left(\frac{1-\alpha}{2} h^i_j + \frac{1+\alpha}{2} x^i - (h^i_j)^{\frac{1-\alpha}{2}} (x^i)^{\frac{1+\alpha}{2}}\right).$$
+
+Removing all additive terms independent of $x^i$ and the overall constant multiplicative factor $\frac{4}{1-\alpha^2} \neq 0$,
+we get the following equivalent minimisation problem:
+
+$$R'_\alpha(x, H) = \sum_{i=1}^{d} \frac{1+\alpha}{2} x^i - (x^i)^{\frac{1+\alpha}{2}} \underbrace{\left(\sum_{j=1}^{n} w_j (h^i_j)^{\frac{1-\alpha}{2}}\right)}_{\bar{h}^i_\alpha}, \qquad (A19)$$
+
+where $\bar{h}^i_\alpha$ denotes the following aggregation term:
+
+$$\bar{h}^i_\alpha = \sum_{j=1}^{n} w_j (h^i_j)^{\frac{1-\alpha}{2}}.$$
+
+Setting coordinate-wise the derivative to zero of Equation (A19) (i.e., $\nabla_x R'(x, H) = 0$), we get:
+
+$$\frac{1+\alpha}{2} - \frac{1+\alpha}{2} (x^i)^{\frac{\alpha-1}{2}} \bar{h}^i_\alpha = 0.$$
+
+Thus, we find that the coordinates of the right-sided α-centroids are:
+
+$$c^i_\alpha = (\bar{h}^i_\alpha)^{\frac{2}{1-\alpha}} = \left(\sum_{j=1}^{n} w_j (h^i_j)^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}} = \hat{h}^i_\alpha.$$
+
+We recognise the expression of a quasi-arithmetic mean for the strictly monotonous generator $f_\alpha(x)$:
+
+$$r^i_\alpha = f_\alpha^{-1}\left(\sum_{j=1}^{n} w_j f_\alpha(h^i_j)\right), \qquad (A20)$$
+
+with:
+
+$$f_\alpha(x) = x^{\frac{1-\alpha}{2}}, \quad f_\alpha^{-1}(x) = x^{\frac{2}{1-\alpha}}, \quad \alpha \neq \pm 1.$$
+
+Therefore, we conclude that the coordinates of the positive α-centroid are the weighted α-means of
+the histogram coordinates (for α ̸= ±1). Quasi-arithmetic means are also called in the literature
+quasi-linear means or f-means.
+When α = −1, we search for the right-sided extended Kullback–Leibler divergence centroid by
+minimising:
+
+$$R_{-1}(x; \tilde{H}) = \sum_{j=1}^{n} w_j \sum_{i=1}^{d} h^i_j \log \frac{h^i_j}{x^i} + x^i - h^i_j.$$
+
+It is equivalent to minimizing:
+
+$$R'_{-1}(x; \tilde{H}) = \sum_{i=1}^{d} x^i - \underbrace{\left(\sum_{j=1}^{n} w_j h^i_j\right)}_{a^i} \log x^i,$$
+
+where a denotes the arithmetic mean. Solving coordinate-wise, we get $c^i = a^i = \sum_{j=1}^{n} w_j h^i_j$.
+When α = 1, the right-sided reverse extended KL centroid is a left-sided extended KL centroid. The
+minimisation problem is:
+
+$$R_1(x; \tilde{H}) = \sum_{j=1}^{n} w_j \sum_{i=1}^{d} x^i \log \frac{x^i}{h^i_j} + h^i_j - x^i.$$
+
+Since $\sum_j w_j = 1$, we solve coordinate-wise and find $\log x = \sum_j w_j \log h_j$. That is, $r^i_1$ is the geometric
+mean:
+
+$$r^i_1 = \prod_{j=1}^{n} (h^i_j)^{w_j}.$$
+
+Both the arithmetic mean and the geometric mean are power means in the limit case (and hence
+quasi-arithmetic means). Thus,
+
+\[
+r^i_\alpha = f_\alpha^{-1}\left( \sum_{j=1}^{n} w_j f_\alpha(h^i_j) \right), \tag{A21}
+\]
+
+
+with:
+
+\[
+f_\alpha(x) = \begin{cases} x^{\frac{1-\alpha}{2}}, & \alpha \neq \pm 1, \\ \log x, & \alpha = 1. \end{cases}
+\]
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+New Riemannian Priors on the Univariate Normal Model
+
+Salem Said *, Lionel Bombrun and Yannick Berthoumieu
+
+Groupe Signal et Image, CNRS Laboratoire IMS, Institut Polytechnique de Bordeaux, Université de Bordeaux,
+UMR 5218, Talence, 33405, France; E-Mails: lionel.bombrun@u-bordeaux.fr (L.B.);
+yannick.berthoumieu@u-bordeaux.fr (Y.B.)
+* E-Mail: salem.said@u-bordeaux.fr; Tel.: +33-(0)5-4000-6185.
+
+Received: 17 April 2014; in revised form: 23 June 2014 / Accepted: 9 July 2014 /
+Published: 17 July 2014
+
+Abstract: The current paper introduces new prior distributions on the univariate normal model, with
+the aim of applying them to the classification of univariate normal populations. These new prior
+distributions are entirely based on the Riemannian geometry of the univariate normal model, so that
+they can be thought of as “Riemannian priors”. Precisely, if $\{p_\theta; \theta \in \Theta\}$ is any parametrization of the
+univariate normal model, the paper considers prior distributions $G(\bar\theta, \gamma)$ with hyperparameters $\bar\theta \in \Theta$
+and $\gamma > 0$, whose density with respect to Riemannian volume is proportional to $\exp(-d^2(\theta, \bar\theta)/2\gamma^2)$,
+where $d^2(\theta, \bar\theta)$ is the square of Rao’s Riemannian distance. The distributions $G(\bar\theta, \gamma)$ are termed
+Gaussian distributions on the univariate normal model. The motivation for considering a distribution
+G( ¯θ, γ) is that this distribution gives a geometric representation of a class or cluster of univariate
+normal populations. Indeed, G( ¯θ, γ) has a unique mode ¯θ (precisely, ¯θ is the unique Riemannian
+center of mass of G( ¯θ, γ), as shown in the paper), and its dispersion away from ¯θ is given by γ.
+Therefore, one thinks of members of the class represented by G( ¯θ, γ) as being centered around ¯θ
+and lying within a typical distance determined by γ. The paper defines rigorously the Gaussian
+distributions G( ¯θ, γ) and describes an algorithm for computing maximum likelihood estimates of
+their hyperparameters. Based on this algorithm and on the Laplace approximation, it describes
+how the distributions G( ¯θ, γ) can be used as prior distributions for Bayesian classification of
+large univariate normal populations.
+In a concrete application to texture image classification,
+it is shown that this leads to an improvement in performance over the use of conjugate priors.
+
+Keywords: Fisher information; Riemannian metric; prior distribution; univariate normal distribution;
+image classification
+
+1. Introduction
+
+In this paper, a new class of prior distributions is introduced on the univariate normal model. The
+new prior distributions, which will be called Gaussian distributions, are based on the Riemannian
+geometry of the univariate normal model. The paper introduces these new distributions, uncovers
+some of their fundamental properties and applies them to the problem of the classification of univariate
+normal populations. It shows that, in the context of a real-life application to texture image classification,
+the use of these new prior distributions leads to improved performance in comparison with the use of
+more standard conjugate priors.
+To motivate the introduction of the new prior distributions, considered in the following, recall
+some general facts on the Riemannian geometry of parametric models.
+In information geometry [1], it is well known that a parametric model $\{p_\theta; \theta \in \Theta\}$, where $\Theta \subset \mathbb{R}^p$,
+can be equipped with a Riemannian geometry, determined by Fisher’s information matrix, say $I(\theta)$.
+
+Entropy 2014, 16, 4015–4031; doi:10.3390/e16074015
+www.mdpi.com/journal/entropy
+
+Indeed, assuming $I(\theta)$ is strictly positive definite, for each $\theta \in \Theta$, a Riemannian metric on $\Theta$ is defined
+by:
+
+\[
+ds^2(\theta) = \sum_{i,j=1}^{p} I_{ij}(\theta)\, d\theta_i\, d\theta_j \tag{1}
+\]
+
+The fact that the length element Equation (1) is invariant to any change of parametrization was realized
+by Rao [2], who was the first to propose the application of Riemannian geometry in statistics.
+Once the Riemannian metric Equation (1) is introduced, the whole machinery of Riemannian
+geometry becomes available for application to statistical problems relevant to the parametric model
+{pθ; θ ∈ Θ}. This includes the notion of Riemannian distance between two distributions, pθ and pθ′,
+which is known as Rao’s distance, say d(θ, θ′), the notion of Riemannian volume, which is exactly the
+same as Jeffreys prior [3], and the notion of Riemannian gradient, which can be used in numerical
+optimization and coincides with the so-called natural gradient of Amari [4].
+It is quite natural to apply Rao’s distance to the problem of classifying populations that belong to
+the parametric model {pθ; θ ∈ Θ}. In the case where this parametric model is the univariate normal
+model, this approach to classification is implemented in [5]. For more general parametric models,
+beyond the univariate normal model, similar applications of Rao’s distance to problems of image
+segmentation and statistical tests can be found in [6–8].
+The idea of [5] is quite elegant. In general, it requires that some classes $\{S_L; L = 1, \ldots, C\}$ (based
+on a learning sequence) have been identified with “centers” $\bar\theta_L \in \Theta$. Then, in order to assign a test
+population, given by the parameter $\theta_t$, to a class $L^*$, it is proposed to choose the $L^*$ that minimizes Rao’s
+distance $d^2(\theta_t, \bar\theta_L)$ over $L = 1, \ldots, C$. In the specific context of the classification of univariate normal
+populations [5], this leads to the introduction of hyperbolic Voronoi diagrams.
+The present paper is also concerned with the case where the parametric model $\{p_\theta; \theta \in \Theta\}$ is a
+univariate normal model. It starts from the idea that a class $S_L$ should be identified not only with
+a center $\bar\theta_L$, as in [5], but also with a kind of “variance”, say $\gamma^2$, which will be called a dispersion
+parameter. Accordingly, assigning a test population given by the parameter $\theta_t$ to a class $L$ should be
+based on a tradeoff between the square of Rao’s distance $d^2(\theta_t, \bar\theta_L)$ and the dispersion parameter $\gamma^2$.
+Of course, this idea has a strong Bayesian flavor. It proposes to give more “confidence” to classes
+that have a smaller dispersion parameter. Thus, in order to implement it, in a concrete way, the paper
+starts by introducing prior distributions on the univariate normal model, which it calls Gaussian
+distributions. By definition, a Gaussian distribution $G(\bar\theta, \gamma^2)$ has a probability density function, with
+respect to Riemannian volume, given by:
+
+\[
+p(\theta \mid \bar\theta, \gamma) \propto \exp\left( -\frac{d^2(\theta, \bar\theta)}{2\gamma^2} \right) \tag{2}
+\]
+
+Given this definition of a Gaussian distribution (which is developed in detail in Section 3),
+classification of univariate normal populations can be carried out by associating to each class $S_L$ of
+univariate normal populations a Gaussian distribution $G(\bar\theta_L, \gamma_L^2)$ and by assigning any test population
+with parameter $\theta_t$ to the class $L^*$ that maximizes the likelihood $p(\theta_t \mid \bar\theta_L, \gamma_L)$ over $L = 1, \ldots, C$.
+The present paper develops in a rigorous way the general approach to the classification of
+univariate normal populations, which has just been described. It proceeds as follows.
+Section 2, which is basically self-contained, provides the concepts, regarding the Riemannian
+geometry of the univariate normal model, which will be used throughout the paper.
+Section 3 introduces Gaussian distributions on the univariate normal model and uncovers some
+of their general properties. In particular, Section 3.2 of this section gives a Riemannian gradient descent
+algorithm for computing maximum likelihood estimates of the parameters ¯θ and γ of a Gaussian
+distribution.
+Section 4 states the general approach to classification of univariate normal populations proposed
+in this paper. It deals with two problems: (i) given a class S of univariate normal populations Si, how
+to fit a Gaussian distribution G(¯z, γ) to this class; and (ii) given a test univariate normal population St
+and a set of classes {SL, L = 1, . . . , C}, how to assign St to a suitable class SL∗.
+In the present paper, the chosen approach for resolving these two problems is marginalized
+likelihood estimation, in the asymptotic framework where each univariate normal population contains
+a large number of data points. In this asymptotic framework, the Laplace approximation plays a
+major role [9]. In particular, it reduces the first problem, of fitting a Gaussian distribution to a class
+of univariate normal populations, to the problem of maximum likelihood estimation, covered in
+Section 3.2.
+The final result of Section 4 is the decision rule Equation (37). This generalizes the one developed
+in [5] and already explained above, by taking into account the dispersion parameter γ, in addition to
+the center ¯θ, for each class.
+In Section 5, the formalism of Section 4 is applied to texture image classification, using the
+VisTeX image database [10]. This database is used to compare the performance obtained using
+Gaussian distributions, as in Section 4, to that obtained using conjugate prior distributions. It is
+shown that Gaussian distributions, proposed in the current paper, lead to a significant improvement
+in performance.
+Before going on, it should be noted that probability density functions of the form (2), on general
+Riemannian manifolds, were considered by Pennec in [11]. However, they were not specifically used
+as prior distributions, but rather as a representation of uncertainty in medical image analysis and
+directional or shape statistics.
+
+2. Riemannian Geometry of the Univariate Normal Model
+
+The current section presents in a self-contained way the results on the Riemannian geometry of
+the univariate normal model, which are required for the remainder of the paper. Section 2.1 recalls
+the fact that the univariate normal model can be reparametrized, so that its Riemannian geometry is
+essentially the same as that of the Poincaré upper half plane. Section 2.2 uses this fact to give analytic
+formulas for distance, geodesics and integration on the univariate normal model. Finally, Section 2.3
+presents, in general form, the Riemannian gradient descent algorithm.
+
+2.1. Derivation of the Fisher Metric
+
+This paper considers the Riemannian geometry of the univariate normal model, as based on the
+Fisher metric (1). To be precise, the univariate normal model has a two-dimensional parameter space
+$\Theta = \{\theta = (\mu, \sigma) \mid \mu \in \mathbb{R},\ \sigma > 0\}$, and is given by:
+
+\[
+p_\theta(x) = |2\pi\sigma^2|^{-1/2} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \tag{3}
+\]
+
+where each $p_\theta$ is a probability density function with respect to the Lebesgue measure on $\mathbb{R}$. The Fisher
+information matrix, obtained from Equation (3), is the following:
+
+\[
+I(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\ 0 & \dfrac{2}{\sigma^2} \end{pmatrix}
+\]
+
+As in [12], this expression can be made more symmetric by introducing the parametrization $z = (x, y)$,
+where $x = \mu/\sqrt{2}$ and $y = \sigma$. This yields the Fisher information matrix:
+
+\[
+I(z) = 2 \times \begin{pmatrix} \dfrac{1}{y^2} & 0 \\ 0 & \dfrac{1}{y^2} \end{pmatrix}
+\]
+
+It is suitable to drop the factor two in this expression and introduce the following Riemannian metric
+for the univariate normal model,
+
+\[
+ds^2(z) = \frac{dx^2 + dy^2}{y^2} \tag{4}
+\]
+
+This is essentially the same as the Fisher metric (up to the factor two) and will be considered throughout
+the following. The resulting Rao’s distance and Riemannian geometry are given in the following
+paragraph.
+
+2.2. Distance, Geodesics and Volume
+
+The Riemannian metric (4), obtained in the last paragraph, happens to be a very well-known
+object in differential geometry. Precisely, the parameter space H = {z = (x, y)|y > 0} equipped with
+the metric (4) is known as the Poincaré upper half plane and is a basic model of a two-dimensional
+hyperbolic space [13].
+Rao’s distance between two points $z_1 = (x_1, y_1)$ and $z_2 = (x_2, y_2)$ in $H$ can be expressed as follows
+(for results in the present paragraph, see [13], or any suitable reference on hyperbolic geometry),
+
+\[
+d(z_1, z_2) = \operatorname{acosh}\left( 1 + \frac{(x_1 - x_2)^2 + (y_1 - y_2)^2}{2 y_1 y_2} \right) \tag{5}
+\]
+
+where acosh denotes the inverse hyperbolic cosine.
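+
+As a quick sanity check of Equation (5) (the two points are chosen here for illustration only), take $z_1 = (0, 1)$ and $z_2 = (0, e)$ on the $y$-axis:
+
+\[
+d(z_1, z_2) = \operatorname{acosh}\left( 1 + \frac{(e-1)^2}{2e} \right) = \operatorname{acosh}\left( \frac{e^2 + 1}{2e} \right) = \operatorname{acosh}(\cosh 1) = 1 .
+\]
+
+Moving up the $y$-axis from height $1$ to height $e$ therefore covers exactly one unit of Rao’s distance. The same computation gives $d\big((0, y_1), (0, y_2)\big) = |\ln(y_2/y_1)|$ for any two points on a vertical line.
+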
+Starting from $z_1$, in any given direction, it is possible to draw a unique geodesic ray $\gamma : \mathbb{R}_+ \to H$.
+This is a curve having the property that $\gamma(0) = z_1$ and, for any $t \in \mathbb{R}_+$, if $\gamma(t) = z_2$ then $d(z_1, z_2) = t$.
+In other words, the length of $\gamma$ between $z_1$ and $z_2$ is equal to the distance between $z_1$ and $z_2$.
+The equation of a geodesic ray starting from $z \in H$ is conveniently written down in complex
+notation (that is, by treating points of $H$ as complex numbers). To begin, consider the case of $z = i$
+(which stands for $x = 0$ and $y = 1$). The geodesic in the direction making an angle $\psi$ with the $y$-axis is
+the curve,
+
+\[
+\gamma_i(t) = \frac{e^{t/2}\cos(\psi/2)\, i - e^{-t/2}\sin(\psi/2)}{e^{t/2}\sin(\psi/2)\, i + e^{-t/2}\cos(\psi/2)} \tag{6}
+\]
+
+In particular, $\psi = 0$ gives $\gamma_i(t) = e^{t} i$ and $\psi = \pi$ gives $\gamma_i(t) = e^{-t} i$. If $\psi$ is not a multiple of $\pi$, $\gamma_i(t)$
+traces out a portion of a circle, which becomes parallel to the $y$-axis in the limit $t \to \infty$. For a general starting
+point $z$, the geodesic ray in the direction making an angle $\psi$ with the $y$-axis can be written:
+
+\[
+\gamma_z(t, \psi) = x + y\, \gamma_i(t/y, \psi) \tag{7}
+\]
+
+where z = (x, y) and γi(t, ψ) is given by Equation (6). A more detailed treatment of Rao’s distance (5)
+and of geodesics in the Poincaré upper half plane, along with applications in image clustering, can be
+found in [5].
+The Riemannian volume (or area, since $H$ is of dimension 2) element corresponding to the
+Riemannian metric (4) is $dA(z) = dx\,dy/y^2$. Accordingly, the integral of a function $f : H \to \mathbb{R}$ with
+respect to $dA$ is given by:
+
+\[
+\int_H f(z)\, dA(z) = \int_0^{+\infty} \int_{-\infty}^{+\infty} \frac{f(x, y)}{y^2}\, dx\, dy \tag{8}
+\]
+
+In many cases, the analytic computation of this integral can be greatly simplified by using polar
+coordinates $(r, \phi)$ defined with respect to some “origin” $\bar z \in H$. Polar coordinates $(r, \phi)$ map to the
+point $z(r, \phi)$ given by:
+
+\[
+z(r, \phi) = \gamma_{\bar z}\left( r, \frac{\pi}{2} - \phi \right) \tag{9}
+\]
+
+where the right-hand side is defined according to Equation (7). The polar coordinates $(r, \phi)$ do indeed
+define a global coordinate system of $H$, in the sense that the map that takes a complex number
+$re^{i\phi}$ to the point $z(r, \phi)$ in $H$ is a diffeomorphism. The standard notation from differential geometry is:
+
+\[
+\exp_{\bar z}\left( re^{i\phi} \right) = z(r, \phi) \tag{10}
+\]
+
+In these coordinates, the Riemannian metric (4) takes on the form:
+
+\[
+ds^2(z) = dr^2 + \sinh^2 r\, d\phi^2 \tag{11}
+\]
+
+The integral Equation (8) can be computed in polar coordinates using the formula [13],
+
+\[
+\int_H f(z)\, dA(z) = \int_0^{2\pi} \int_0^{+\infty} (f \circ \exp_{\bar z})\left( re^{i\phi} \right) \sinh(r)\, dr\, d\phi \tag{12}
+\]
+
+where $\exp_{\bar z}$ was defined in Equation (10) and $\circ$ denotes composition. This is particularly useful when
+$f \circ \exp_{\bar z}$ does not depend on $\phi$.
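+
+For instance (an illustrative case that is not taken from the paper), let $f$ be the indicator function of the geodesic disc of radius $R$ centered at $\bar z$, so that $f \circ \exp_{\bar z}$ equals $1$ for $r \le R$ and $0$ otherwise. Equation (12) then gives the hyperbolic area of this disc:
+
+\[
+\int_0^{2\pi} \int_0^{R} \sinh(r)\, dr\, d\phi = 2\pi\, (\cosh R - 1) ,
+\]
+
+which behaves like the Euclidean area $\pi R^2$ for small $R$ and grows exponentially for large $R$.
+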
+
+2.3. Riemannian Gradient Descent
+
+In this paper, the problem of minimizing, or maximizing, a differentiable function f : H → R will
+play a central role. A popular way of handling the minimization of a differentiable function defined on
+a Riemannian manifold (such as H) is through Riemannian gradient descent [14].
+Here, the definition of Riemannian gradient is reviewed, and a generic description of Riemannian
+gradient descent is provided. The Riemannian gradient of $f$ is here defined as a mapping $\nabla f : H \to \mathbb{C}$
+with the following property:
+
+\[
+\frac{1}{y^2} \times \mathrm{Re}\left\{ \nabla f(z)\, h^* \right\} = \mathrm{Re}\left\{ df(z)\, h^* \right\} \tag{13}
+\]
+
+for any complex number $h$, where Re denotes the real part, $*$ denotes conjugation and $df$ is the
+“derivative”, $df = (\partial f/\partial x) + (\partial f/\partial y)\, i$. For example, if $f(z) = y$, it follows from Equation (13) that
+$\nabla f(z) = y^2\, i$.
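+
+To see how Equation (13) determines the gradient, it may help to write it out in real coordinates (a short derivation added here for illustration; the symbols $a$, $b$, $u$, $v$ are introduced only for this purpose). Writing $\nabla f(z) = a + b\, i$ and $h = u + v\, i$, Equation (13) reads:
+
+\[
+\frac{1}{y^2}\, (a u + b v) = \frac{\partial f}{\partial x}\, u + \frac{\partial f}{\partial y}\, v \qquad \text{for all real } u, v,
+\]
+
+so that $a = y^2\, \partial f/\partial x$ and $b = y^2\, \partial f/\partial y$, that is,
+
+\[
+\nabla f(z) = y^2\, df(z) .
+\]
+
+In other words, the Riemannian gradient is the Euclidean one rescaled by $y^2$.
+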
+Riemannian gradient descent consists in following the direction of −∇ f at each step, with the
+length of the step (in other words, the step size) being determined by the user. The generic algorithm
+is, up to some variations, the following:
+
+INPUT    $\hat z \in H$                                              % Initial guess
+WHILE    $\|\nabla f(\hat z)\| > \varepsilon$                        % $\varepsilon \approx 0$ machine precision
+         $\hat z \leftarrow \exp_{\hat z}(-\lambda \nabla f(\hat z))$  % $\lambda > 0$ step size, depends on $\hat z$
+END WHILE
+OUTPUT   $\hat z$                                                    % near critical point of $f$
+
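+Here, in the condition for the while loop, $\|\nabla f(z_k)\|$ is the Riemannian norm of the gradient
+$\nabla f(z_k)$, where $z_k = (x_k, y_k)$ denotes the current estimate $\hat z$. In other words,
+
+\[
+\|\nabla f(z_k)\|^2 = \frac{1}{y_k^2} \times \mathrm{Re}\left\{ \nabla f(z_k)\, \nabla f(z_k)^* \right\}
+\]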
+
+Just like a classical gradient descent algorithm, the above Riemannian gradient descent consists in
+following the direction of the negative gradient $-\nabla f(\hat z)$, in order to define a new estimate. This is
+repeated as long as the gradient is appreciably nonzero, in the sense of the loop condition.
+The generic algorithm described above has no guarantee of convergence. Convergence and
+behavior near limit points depend on the function $f$, on the initialization of the algorithm and on the
+step sizes $\lambda$. For these aspects, the reader may consult [14] (Chapter 4).
+
+3. Riemannian Prior on the Univariate Normal Model
+
+The current section introduces new prior distributions on the univariate normal model. These may
+be referred to as “Riemannian priors”, since they are entirely based on the Riemannian geometry of
+this model, and will also be called “Gaussian distributions”, when viewed as probability distributions
+on the Poincaré half plane.
+Here, Section 3.1 defines in a rigorous way Gaussian distributions on $H$ (based on the intuitive
+Formula (2)). A Gaussian distribution $G(\bar z, \gamma)$ has two parameters, $\bar z \in H$, called the center of mass, and
+$\gamma > 0$, called the dispersion parameter. Section 3.2 uses the Riemannian gradient descent algorithm of
+Section 2.3 to provide an algorithm for computing maximum likelihood estimates of $\bar z$ and $\gamma$. Finally,
+Section 3.3 proves that $\bar z$ is the Riemannian center of mass or Karcher mean of the distribution $G(\bar z, \gamma)$
+(historically, it is more correct to speak of the “Fréchet mean”, since this concept was proposed by
+Fréchet in 1948 [15]), and that $\gamma$ is uniquely related to the mean squared Rao’s distance from $\bar z$.
+The reader may wish to note that the results of Section 3.3 are not used in the following, so this
+paragraph may be skipped on a first reading.
+
+3.1. Gaussian Distributions on H
+
+A Gaussian distribution $G(\bar z, \gamma)$ on $H$ is a probability distribution with the following probability
+density function:
+
+\[
+p(z \mid \bar z, \gamma) = \frac{1}{Z(\gamma)} \exp\left( -\frac{d^2(z, \bar z)}{2\gamma^2} \right) \tag{14}
+\]
+
+Here, $\bar z \in H$ is called the center of mass and $\gamma > 0$ the dispersion parameter of the distribution $G(\bar z, \gamma)$.
+The squared distance $d^2(z, \bar z)$ refers to Rao’s distance (5). The probability density function (14) is
+understood with respect to the Riemannian volume element $dA(z)$. In other words, the normalization
+constant $Z(\gamma)$ is given by:
+
+\[
+Z(\gamma) = \int_H f(z)\, dA(z), \qquad f(z) = \exp\left( -\frac{d^2(z, \bar z)}{2\gamma^2} \right)
+\]
+
+Using polar coordinates, as in Equation (12), it is possible to calculate this integral explicitly. To do so,
+let $(r, \phi)$ be polar coordinates whose origin is $\bar z$. Then, $d^2(z, \bar z) = r^2$ when $z = z(r, \phi)$, as in Equation (9). It follows that:
+
+\[
+(f \circ \exp_{\bar z})(r, \phi) = \exp\left( -\frac{r^2}{2\gamma^2} \right) \tag{15}
+\]
+
+According to Equation (12), the integral $Z(\gamma)$ reduces to:
+
+\[
+Z(\gamma) = \int_0^{2\pi} \int_0^{+\infty} \exp\left( -\frac{r^2}{2\gamma^2} \right) \sinh(r)\, dr\, d\phi
+\]
+
+which is readily calculated,
+
+\[
+Z(\gamma) = 2\pi \times \sqrt{\frac{\pi}{2}}\, \gamma \times e^{\gamma^2/2} \times \operatorname{erf}\left( \frac{\gamma}{\sqrt{2}} \right) \tag{16}
+\]
+
+where erf denotes the error function. Formula (16) completes the definition of the Gaussian distribution
+G(¯z, γ). This definition is the same as suggested in [11], with the difference that, in the present work, it
+has been possible to compute exactly the normalization constant Z(γ).
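+
+The step from the polar integral to Formula (16) can be spelled out as follows (a derivation sketch added for the reader; it only uses $\sinh r = (e^{r} - e^{-r})/2$, completion of the square, and the definition of the error function). The $\phi$-integral contributes a factor $2\pi$, and
+
+\[
+\int_0^{+\infty} e^{-r^2/2\gamma^2} \sinh(r)\, dr
+= \frac{e^{\gamma^2/2}}{2} \left[ \int_0^{+\infty} e^{-(r-\gamma^2)^2/2\gamma^2}\, dr - \int_0^{+\infty} e^{-(r+\gamma^2)^2/2\gamma^2}\, dr \right]
+= \frac{e^{\gamma^2/2}}{2} \int_{-\gamma^2}^{\gamma^2} e^{-s^2/2\gamma^2}\, ds
+= \sqrt{\frac{\pi}{2}}\, \gamma\, e^{\gamma^2/2}\, \operatorname{erf}\left( \frac{\gamma}{\sqrt{2}} \right) ,
+\]
+
+where the last equality uses the substitution $u = s/(\gamma\sqrt{2})$. Multiplying by $2\pi$ recovers Formula (16).
+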
+It is noteworthy that the normalization constant Z(γ) depends only on γ and not on ¯z. This shows
+that the shape of the probability density function (14) does not depend on ¯z, which only plays the role
+of a location parameter. At a deeper mathematical level, this reflects the fact that H is a homogeneous
+Riemannian space [13].
+
+The probability density function (14) bears a clear resemblance to the usual Gaussian (or normal)
+probability density function. Indeed, both are proportional to the exponential of minus the “squared
+distance”, but in one case, the distance is interpreted as Euclidean distance and, in the other (that of
+Equation (14)), as Rao’s distance.
+
+3.2. Maximum Likelihood Estimation of ¯z and γ
+
+Consider the problem of computing maximum likelihood estimates of the parameters $\bar z$ and $\gamma$ of
+the Gaussian distribution $G(\bar z, \gamma)$, based on independent samples $\{z_i\}_{i=1}^{N}$ from this distribution. Given
+the expression (14) of the density $p(z \mid \bar z, \gamma)$, the log-likelihood function $\ell(\bar z, \gamma)$ can be written,
+
+\[
+\ell(\bar z, \gamma) = -N \log\{Z(\gamma)\} - \frac{1}{2\gamma^2} \sum_{i=1}^{N} d^2(z_i, \bar z) \tag{17}
+\]
+
+Since $\bar z$ only appears in the second term, the maximum likelihood estimate of $\bar z$, say $\hat z$, can be computed
+first. It is given by the minimization problem:
+
+\[
+\hat z = \operatorname{argmin}_{z \in H}\ \frac{1}{2} \sum_{i=1}^{N} d^2(z_i, z) \tag{18}
+\]
+
+In other words, the maximum likelihood estimate ˆz minimizes the sum of squared Rao distances to the
+samples zi. This exhibits ˆz as the Riemannian center of mass, also called the Karcher or the Fréchet
+mean [16], of the samples zi.
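+
+As a minimal illustration of Equation (18) (the two samples are chosen here for convenience and are not from the paper), take $N = 2$ with $z_1 = (0, 1)$ and $z_2 = (0, e^2)$. Both lie on the vertical geodesic $t \mapsto e^{t} i$, at $t = 0$ and $t = 2$, and for two samples the minimizer is the midpoint of the geodesic joining them:
+
+\[
+\hat z = (0, e), \qquad d(z_1, \hat z) = d(\hat z, z_2) = 1 .
+\]
+
+Note that the height of $\hat z$ is the geometric mean of the two heights, not their arithmetic mean $(1 + e^2)/2$, in keeping with the hyperbolic rather than Euclidean nature of Rao’s distance.
+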
+The notion of Riemannian center of mass is currently a widely popular one in signal and image
+processing, with applications ranging from blind source separation and radar signal processing [17,18]
+to shape and motion analysis [19,20]. The definition of Gaussian distributions, proposed in the
+present paper, shows how the notion of Riemannian center of mass is related to maximum likelihood
+estimation, thereby giving it a statistical foundation.
+An original result, due to Cartan and cited in [16], states that $\hat z$, as defined in
+Equation (18), exists and is unique, since $H$, with the Riemannian metric (4), has constant negative
+curvature. Here, $\hat z$ is computed using Riemannian gradient descent, as described in Section 2.3. The
+cost function $f$ to be minimized is given by (the factor $N^{-1}$ is conventional),
+
+\[
+f(z) = \frac{1}{2N} \sum_{i=1}^{N} d^2(z_i, z) \tag{19}
+\]
+
+Its Riemannian gradient $\nabla f(z)$ is easily found by noting the following fact. Let $f_i(z) = (1/2)d^2(z, z_i)$.
+Then, the Riemannian gradient of this function is (see [21] (page 407)),
+
+\[
+\nabla f_i(z) = -\log_z(z_i) \tag{20}
+\]
+
+where $\log_z : H \to \mathbb{C}$ is the inverse of $\exp_z : \mathbb{C} \to H$. It follows from Equation (20) that,
+
+\[
+\nabla f(z) = -\frac{1}{N} \sum_{i=1}^{N} \log_z(z_i) \tag{21}
+\]
+
+The analytic expression of logz, for any z ∈ H, will be given below (see Equation (23)).
+Here, the gradient descent algorithm for computing ˆz is described. This algorithm uses a constant
+step size λ, which is fixed manually.
+
+Once the maximum likelihood estimate $\hat z$ has been computed, using the gradient descent
+algorithm, the maximum likelihood estimate of $\gamma$, say $\hat\gamma$, is found by solving the equation:
+
+\[
+F(\gamma) = \frac{1}{N} \sum_{i=1}^{N} d^2(z_i, \hat z) \qquad \text{where} \quad F(\gamma) = \gamma^3 \times \frac{d}{d\gamma} \log\{Z(\gamma)\} \tag{22}
+\]
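+
+Equation (22) follows from setting the derivative of the log-likelihood (17) with respect to $\gamma$ to zero at $\bar z = \hat z$ (a short derivation added for clarity; the closed form of $F$ below is obtained here by differentiating Formula (16) and is not stated in the paper):
+
+\[
+\frac{\partial \ell}{\partial \gamma} = -N\, \frac{d}{d\gamma} \log\{Z(\gamma)\} + \frac{1}{\gamma^3} \sum_{i=1}^{N} d^2(z_i, \hat z) = 0
+\quad \Longleftrightarrow \quad
+\gamma^3\, \frac{d}{d\gamma} \log\{Z(\gamma)\} = \frac{1}{N} \sum_{i=1}^{N} d^2(z_i, \hat z) .
+\]
+
+Using Formula (16), $\log Z(\gamma) = \mathrm{const} + \log\gamma + \gamma^2/2 + \log \operatorname{erf}(\gamma/\sqrt{2})$, so that
+
+\[
+F(\gamma) = \gamma^2 + \gamma^4 + \sqrt{\frac{2}{\pi}}\, \frac{\gamma^3\, e^{-\gamma^2/2}}{\operatorname{erf}(\gamma/\sqrt{2})} .
+\]
+
+For small $\gamma$ this gives $F(\gamma) \approx 2\gamma^2$, the value expected for a two-dimensional Euclidean Gaussian.
+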
+
+The gradient descent algorithm for computing $\hat z$ is the following,
+
+INPUT    $\{z_1, \ldots, z_N\}$                                      % $N$ independent samples from $G(\bar z, \gamma)$
+         $\hat z \in H$                                              % Initial guess
+WHILE    $\|\nabla f(\hat z)\| > \varepsilon$                        % $\varepsilon \approx 0$ machine precision
+         $\hat z \leftarrow \exp_{\hat z}(-\lambda \nabla f(\hat z))$  % $\nabla f(\hat z)$ given by Equation (21)
+                                                                     % step size $\lambda$ is constant
+END WHILE
+OUTPUT   $\hat z$                                                    % near Riemannian center of mass
+
+Application of Formula (21) requires computation of $\log_{\hat z}(z_i)$ for $i = 1, \ldots, N$. Fortunately, this
+can be done analytically as follows. In general, for $\hat z = (\bar x, \bar y)$,
+
+\[
+\log_{\hat z}(z) = \bar y\, \log_i\left( \frac{z - \bar x}{\bar y} \right) \tag{23}
+\]
+
+where $\log_i$ is found by inverting Equation (6). Precisely,
+
+\[
+\log_i(z) = r e^{i\phi} \tag{24}
+\]
+
+where, for $z = (x, y)$ with $x \neq 0$,
+
+\[
+r = \operatorname{acosh}\left( 1 + \frac{x^2 + (y-1)^2}{2y} \right)
+\]
+
+and:
+
+\[
+\cos(\phi) = \frac{x}{y \sinh(r)}, \qquad \sin(\phi) = \frac{\cosh(r) - y^{-1}}{\sinh(r)}
+\]
+
+and, for $z = (0, y)$,
+
+\[
+\log_i(z) = \ln(y)\, i
+\]
+
+with ln denoting the natural logarithm.
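+
+As a worked instance of Equation (24) (the point is chosen here purely for illustration), take $z = (1, 1)$:
+
+\[
+r = \operatorname{acosh}\left( 1 + \tfrac{1}{2} \right) = \operatorname{acosh}\left( \tfrac{3}{2} \right), \qquad
+\sinh(r) = \sqrt{\tfrac{9}{4} - 1} = \tfrac{\sqrt{5}}{2}, \qquad
+\cos(\phi) = \frac{2}{\sqrt{5}}, \qquad
+\sin(\phi) = \frac{\tfrac{3}{2} - 1}{\sqrt{5}/2} = \frac{1}{\sqrt{5}} ,
+\]
+
+so that $\cos^2(\phi) + \sin^2(\phi) = 4/5 + 1/5 = 1$, as it should, and $\log_i(z) = \operatorname{acosh}(3/2)\, (2 + i)/\sqrt{5} \approx 0.861 + 0.430\, i$.
+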
+
+3.3. Significance of ¯z and γ
+
+The parameters ¯z and γ of a Gaussian distribution G(¯z, γ) have been called the center of mass
+and the dispersion parameter. In the present paragraph, it is proven that,
+
+$$\bar{z} = \operatorname{argmin}_{z \in H}\; \frac{1}{2} \int_H d^2(z', z)\, p(z'|\bar{z}, \gamma)\, dA(z') \qquad (25)$$
+
+and also that:
+
+$$F(\gamma) = \int_H d^2(z', \bar{z})\, p(z'|\bar{z}, \gamma)\, dA(z') \qquad (26)$$
+
+where F(γ) was defined in Equation (22) and p(z′|¯z, γ) is the probability density function of G(¯z, γ),
+given in Equation (14).
+
+Note that Equations (25) and (26) are asymptotic versions of Equations (18) and (22). Indeed,
+Equations (25) and (26) can be written:
+
+$$\bar{z} = \operatorname{argmin}_{z \in H}\; \frac{1}{2} E_{\bar{z},\gamma}\, d^2(z', z) \qquad F(\gamma) = E_{\bar{z},\gamma}\, d^2(z, \bar{z}) \qquad (27)$$
+
+where E¯z,γ denotes the expectation with respect to G(¯z, γ), and the expectation is carried out on the
+variable z′ in the first formula. Now, these two formulae are the same as Equations (18) and (22), but
+with expectation instead of empirical mean.
+Note, moreover, that Equations (25) and (26) can be interpreted as follows. If z′ is distributed
+according to the Gaussian distribution G(¯z, γ), then Equation (25) states that ¯z is the unique point, out
+of all z ∈ H, which minimizes the expectation of squared Rao’s distance to z′. Moreover, Equation (26)
+states that the expectation of squared Rao’s distance between ¯z and z′ is equal to F(γ), so F(γ) is the
+least possible expected squared Rao’s distance between a point z ∈ H and z′. This interpretation
+justifies calling ¯z the center of mass of G(¯z, γ) and shows that γ is uniquely related to the expected
+dispersion, as measured by squared Rao’s distance, away from ¯z.
+In order to prove Equation (25), consider the log-likelihood function,
+
+$$\ell(\bar{z}, \gamma; z) = -\log\{Z(\gamma)\} - \frac{1}{2\gamma^2}\, d^2(z, \bar{z}) \qquad (28)$$
+
+Let $f_z(\bar{z}) = (1/2)\,d^2(z, \bar{z})$. The score function with respect to ¯z is, by definition,
+
+$$\nabla_{\bar{z}}\, \ell(\bar{z}, \gamma; z) = \nabla f_z(\bar{z}) \qquad (29)$$
+
+where ∇¯z indicates that the Riemannian gradient (defined in Equation (13) of Section 2.3) is taken with respect to
+the variable ¯z. Under certain regularity conditions, which are here easily verified, the expectation of
+the score function is identically zero,
+
+$$E_{\bar{z},\gamma} \nabla f_z(\bar{z}) = 0 \qquad (30)$$
+
+Let f (z) be defined by:
+
+$$f(z) = E_{\bar{z},\gamma}\, f_{z'}(z) = \frac{1}{2} E_{\bar{z},\gamma}\, d^2(z', z)$$
+
+with the expectation carried out on the variable z′. Clearly, f (z) is the expression to be minimized
+in Equation (25) (or in the first formula in Equation (27), which is just the same). By interchanging
+Riemannian gradient and expectation,
+
+∇ f (¯z) = E¯z,γ∇ fz(¯z) = 0
+
+where the last equality follows from Equation (30).
+It has just been proved that ¯z is a stationary point of f (a point where the gradient is zero).
+Theorem 2.1 in [16] states the function f has one and only one stationary point, which is moreover a
+global minimizer. This concludes the proof of Equation (25).
+The proof of Equation (26) follows exactly the same method, defining the score function with
+respect to γ and noting that its expectation is identically zero.
+
+4. Classification of Univariate Normal Populations
+
+The previous section studied Gaussian distributions on H, “as they stand”, focusing on the
+fundamental issue of maximum likelihood estimation of their parameters. The present Section
+considers the use of Gaussian distributions as prior distributions on the univariate normal model.
+The main motivation behind the introduction of Gaussian distributions is that a Gaussian
+distribution G(¯z, γ) can be used to give a geometric representation of a cluster or class of univariate
+normal populations. Recall that each point (x, y) ∈ H is identified with a univariate normal population
+with mean $\mu = \sqrt{2}\,x$ and standard deviation $\sigma = y$. The idea is that populations belonging to the same
+cluster, represented by G(¯z, γ), should be viewed as centered on ¯z and lying within a typical distance
+determined by γ.
+In the remainder of this Section, it is shown how the maximum likelihood estimation algorithm of
+Section 3.2 can be used to fit the hyperparameters ¯z and γ to data, consisting of a class S = {Si; i =
+1, . . . , K} of univariate normal populations. This is then applied to the problem of the classification
+of univariate normal populations. The whole development is based on marginalized likelihood
+estimation, as follows.
+Assume each population Si contains Ni points, Si = {sj; j = 1, . . . , Ni}, and the points sj, in any
+class, are drawn from a univariate normal distribution with mean μ and standard deviation σ. The
+focus will be on the asymptotic case where the number Ni of points in each population Si is large.
+In order to fit the hyperparameters ¯z and γ to the data S, assume moreover that the distribution
+of z = (x, y), where $(x, y) = (\mu/\sqrt{2}, \sigma)$, is a Gaussian distribution G(¯z, γ). Then, the distribution of S
+can be written in integral form:
+
+$$p(S|\bar{z}, \gamma) = \prod_{i=1}^{K} \int_H p(S_i|z)\, p(z|\bar{z}, \gamma)\, dA(z) \qquad (31)$$
+
+where p(z|¯z, γ) is the probability density of a Gaussian distribution G(¯z, γ), defined in Equation (14).
+Moreover, expressing p(Si|z) as a product of univariate normal distributions p(sj|z), it follows,
+
+$$p(S|\bar{z}, \gamma) = \prod_{i=1}^{K} \int_H \prod_{j=1}^{N_i} p(s_j|z)\, p(z|\bar{z}, \gamma)\, dA(z) \qquad (32)$$
+
+This expression, given the data S, is to be maximized over (¯z, γ). Using the Laplace approximation,
+this task is reduced to the maximum likelihood estimation problem, addressed in Section 3.2.
+The Laplace approximation will here be applied in its “basic form” [9]. That is, up to terms of
+order $N_i^{-1}$. To do so, write each of the integrals in Equation (32), using Equation (8) of Section 2.2.
+These integrals then take on the form:
+
+$$\int_0^{+\infty} \int_{-\infty}^{+\infty} \prod_{j=1}^{N_i} \left|2\pi y^2\right|^{-1/2} \exp\left(-\frac{\left(s_j - \sqrt{2}\,x\right)^2}{2y^2}\right) \times p(z|\bar{z}, \gamma) \times \frac{1}{y^2}\, dx\, dy \qquad (33)$$
+
+where the univariate normal distribution p(sj|z) has been replaced by its full expression. Now, this
+expression can be written $p(s_j|z) = \exp[-N_i h(x, y)]$, where:
+
+$$h(x, y) = -\frac{1}{2} \ln\left(2\pi y^2\right) - \frac{B_i^2 + V_i^2}{2y^2}$$
+
+Here, $B_i^2$ and $V_i^2$ are the empirical bias and variance, within population $S_i$,
+
+$$B_i = \hat{S}_i - \sqrt{2}\,x \qquad V_i^2 = N_i^{-1} \sum_{j=1}^{N_i} \left(\hat{S}_i - s_j\right)^2$$
+
+where $\hat{S}_i$ is the empirical mean of the population, $\hat{S}_i = N_i^{-1} \sum_{j=1}^{N_i} s_j$.
+The expression h(x, y) is maximized when x = ˆxi and y = ˆyi, where ˆzi = ( ˆxi, ˆyi) is the couple of
+maximum likelihood estimates of the parameters (x, y), based on the population Si.
+
+According to the Laplace approximation, the integral Equation (33) is equal to:
+
+$$2\pi \left|\partial^2 h(\hat{x}_i, \hat{y}_i)\right|^{-1/2} \times \exp\left[-N_i h(\hat{x}_i, \hat{y}_i)\right] \times p(\hat{z}_i|\bar{z}, \gamma) \times \frac{1}{\hat{y}_i^2} + O(N_i^{-1})$$
+
+where ∂2h( ˆxi, ˆyi) is the matrix of second derivatives of h, and | · | denotes the determinant. Now, since
+h is essentially the logarithm of p(sj|z), a direct calculation shows that ∂2h( ˆxi, ˆyi) is the same as the
+Fisher information matrix derived in Section 2.1 (where it was denoted I(z)). Thus, the first factor in
+the above expression is $2\pi \hat{y}_i^2$, and cancels out with the last factor.
+Finally, the Laplace approximation of the integral Equation (33) reads:
+
+$$2\pi \times \exp\left[-N_i h(\hat{x}_i, \hat{y}_i)\right] \times p(\hat{z}_i|\bar{z}, \gamma) + O(N_i^{-1})$$
+
+and the resulting approximation of the distribution of S, as given by Equation (32), can be written:
+
+$$p(S|\bar{z}, \gamma) \approx \prod_{i=1}^{K} \alpha \times p(\hat{z}_i|\bar{z}, \gamma) \qquad (34)$$
+
+where α is a constant, which does not depend either on the data or on the parameters, and p(ˆzi|¯z, γ)
+has the expression (14).
+Accepting this expression to give the distribution of the data S, conditionally on the
+hyperparameters (¯z, γ), the task of estimating these hyperparameters becomes the same as the
+maximum likelihood estimation problem, described in Section 3.2.
+In conclusion, if one assumes the populations Si belong to a single cluster or class S and wishes
+to fit the hyperparameters ¯z and γ of a Gaussian distribution representing this cluster, it is enough to
+start by computing the maximum likelihood estimates ˆxi and ˆyi for each population Si and then to
+consider these as input to the maximum likelihood estimation algorithm described in Section 3.2.
+The same reasoning just carried out, using the Laplace approximation, can be generalized to
+the problem of classification of univariate normal populations. Indeed, assume that classes {SL, L =
+1, . . . , C}, each containing some number KL of univariate normal populations, have been identified
+based on some training sequence. Using the Laplace approximation and the maximum likelihood
+estimation approach of Section 3.2, to each one of these classes, it is possible to fit hyperparameters
+(¯zL, γL) of a Gaussian distribution G(¯zL, γL) on H.
+For a test population St, the maximum likelihood rule, for deciding which of the classes SL this
+test population St belongs to, requires finding the following maximum:
+
+$$L^* = \operatorname{argmax}_L\, p(S_t|\bar{z}_L, \gamma_L) \qquad (35)$$
+
+and assigning the test population St to the class with label L∗. If the number of points Nt in the
+population St is large, the Laplace approximation, in the same way used above, approximates the
+maximum in Equation (35) by:
+
+$$L^* = \operatorname{argmax}_L\, p(\hat{z}_t|\bar{z}_L, \gamma_L) \qquad (36)$$
+
+where ˆzt = ( ˆxt, ˆyt) is the couple of maximum likelihood estimates computed based on the test
+population St and where p(ˆzt|¯zL, γL) is given by Equation (14). Now, writing out Equation (14), the
+decision rule becomes:
+
+$$L^* = \operatorname{argmax}_L \left\{ -\log\{Z(\gamma_L)\} - \frac{1}{2\gamma_L^2}\, d^2(\hat{z}_t, \bar{z}_L) \right\} \qquad (37)$$
+
+Under the homoscedasticity assumption, that all of the γL are equal, this decision rule essentially
+becomes the same as the one proposed in [5], which requires St to be assigned to the “nearest” cluster,
+in terms of Rao’s distance. Indeed, if all the γL are equal, then Equation (37) is the same as,
+
+$$L^* = \operatorname{argmin}_L\, d^2(\hat{z}_t, \bar{z}_L) \qquad (38)$$
+
+This decision rule is expected to be less efficient than the one proposed in Equation (37), which also takes
+into account the uncertainty associated with each cluster, as measured by its dispersion parameter γL.
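+
+The reduction from Equation (37) to Equation (38) can be spelled out in one line (a short sketch, writing γ for the common value of the γL):
+
+```latex
+L^* = \operatorname{argmax}_L \left\{ -\log\{Z(\gamma)\} - \frac{1}{2\gamma^2}\, d^2(\hat{z}_t, \bar{z}_L) \right\}
+    = \operatorname{argmax}_L \left\{ -\, d^2(\hat{z}_t, \bar{z}_L) \right\}
+    = \operatorname{argmin}_L \, d^2(\hat{z}_t, \bar{z}_L) ,
+% since -\log\{Z(\gamma)\} does not depend on L and 1/(2\gamma^2) > 0.
+```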
+
+5. Application to Image Classification
+
+In this section, the framework proposed in Section 4, for classification of univariate normal
+populations, is applied to texture image classification using Gabor filters. Several authors have found
+that Gabor energy features are well-suited texture descriptors. In the following, consider 24 Gabor
+energy sub-bands that are the result of three scales and eight orientations. Hence, each texture image
+can be decomposed as the collection of those 24 sub-bands. For more information concerning the
+implementation, the interested reader is referred to [22].
+Starting from the VisTex database of 40 images [10] (these are displayed in Figure 1), each image
+was divided into 16 non-overlapping subimages of 128 × 128 pixels each. A training sequence was
+formed by choosing randomly eight subimages out of each image. To each subimage in the training
+sequence, a bank of 24 Gabor filters was applied. The result of applying a Gabor filter with scale s
+and orientation o to a subimage i belonging to an image L is a univariate normal population Si,s,o of
+128 × 128 points (one point for each pixel, after the filter is applied).
+
+Figure 1. Forty images of the VisTex database.
+
+These populations Si,s,o (called sub-bands) are considered independent, each one of them
+univariate normal with mean $\mu_{i,s,o} = \sqrt{2}\,x_{i,s,o}$, standard deviation $\sigma_{i,s,o} = y_{i,s,o}$ and with $z_{i,s,o} = (x_{i,s,o}, y_{i,s,o})$.
+The couple of maximum likelihood estimates for these parameters is denoted ˆzi,s,o =
+( ˆxi,s,o, ˆyi,s,o). An image L (recall, there are 40 images) contains, in each sub-band, eight populations
+Si,s,o, with which hyperparameters ¯zL,s,o and γL,s,o are associated, by applying the maximum likelihood
+estimation algorithm of Section 3.2 to the inputs ˆzi,s,o.
+If St is a test subimage, then one should begin by applying the 24 Gabor filters to it, obtaining
+independent univariate normal populations St,s,o, and then compute for each population the couple
+of maximum likelihood estimates ˆzt,s,o = ( ˆxt,s,o, ˆyt,s,o). The decision rule Equation (37) of Section 4
+requires that St should be assigned to the image L∗, which realizes the maximum:
+
+$$L^* = \operatorname{argmax}_L \sum_{s,o} \left[ -\log\{Z(\gamma_{L,s,o})\} - \frac{1}{2\gamma_{L,s,o}^2}\, d^2(\hat{z}_{t,s,o}, \bar{z}_{L,s,o}) \right] \qquad (39)$$
+
+When considering the homoscedasticity assumption, i.e., γL,s,o = γs,o for all L, this decision rule
+becomes:
+
+$$L^* = \operatorname{argmin}_L \sum_{s,o} d^2(\hat{z}_{t,s,o}, \bar{z}_{L,s,o}) \qquad (40)$$
+
+For this concrete application, to the VisTex database, it is pertinent to compare the rate of successful
+classification (or overall accuracy) obtained using the Riemannian prior, based on the framework
+of Section 4, to that obtained using a more classical conjugate prior, i.e., a normal-inverse gamma
+distribution of the mean $\mu = \sqrt{2}\,x$ and the standard deviation $\sigma = y$. This conjugate prior is given by:
+
+$$p(\mu|\sigma, \mu_p, \kappa_p) = \frac{\sqrt{\kappa_p}}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\kappa_p}{2\sigma^2} (\mu - \mu_p)^2\right)$$
+
+with an inverse gamma prior, on σ2,
+
+$$p(\sigma^2|\alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \left(\sigma^2\right)^{-(\alpha+1)} \exp\left(-\frac{\beta}{\sigma^2}\right) \qquad (41)$$
+
+Using this conjugate prior, instead of a Riemannian prior, and following the procedure of applying the
+Laplace approximation, a different decision rule is obtained, where L∗ is taken to be the maximum of
+the following expression:
+
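+$$\sum_{s,o} \left[ \frac{\ln \kappa_{pL,s,o}}{2} - \frac{\kappa_{pL,s,o}}{2\hat{y}_{t,s,o}^2} \left(\sqrt{2}\,\hat{x}_{t,s,o} - \mu_{pL,s,o}\right)^2 + \alpha_{L,s,o} \ln \beta_{L,s,o} - \ln \Gamma(\alpha_{L,s,o}) - 2(\alpha_{L,s,o} + 1) \ln \hat{y}_{t,s,o} - \frac{\beta_{L,s,o}}{\hat{y}_{t,s,o}^2} \right] \qquad (42)$$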
+
+where, as in Equation (39), ˆxt,s,o and ˆyt,s,o are the maximum likelihood estimates computed for the
+population St,s,o.
+Both the Riemannian and conjugate priors have been applied to the VisTex database, with half of
+the database used for training and half for testing. In the course of 100 Monte Carlo runs, a significant
+gain of about 3% is observed with the Riemannian prior compared to the conjugate prior. This is
+summarized in the following table.
+
+Prior Model | Overall Accuracy
+Riemannian prior, Equation (39) | 71.88% ± 2.16%
+Riemannian prior, homoscedasticity assumption, Equation (40) | 69.06% ± 1.96%
+Conjugate prior, Equation (42) | 68.73% ± 2.92%
+
+Recall that the overall accuracy is the ratio of the number of successfully classified subimages
+to the total number of subimages. The table shows that the use of a Riemannian prior, even under a
+homoscedasticity assumption, yields significant improvement upon the use of a conjugate prior.
+
+6. Conclusions
+
+Motivated by the problem of the classification of univariate normal populations, this paper
+introduced a new class of prior distributions on the univariate normal model. With the univariate
+normal model viewed as the Poincaré half plane H, these new prior distributions, called Gaussian
+distributions, were meant to reflect the geometric picture (in terms of Rao’s distance) that a cluster or
+class of univariate normal populations can be represented as having a center ¯z ∈ H and a “variance”
+or dispersion γ2. Precisely, a Gaussian distribution G(¯z, γ) has a probability density function p(z),
+with respect to Riemannian volume of the Poincaré half plane, which is proportional to $\exp\left(-\frac{d^2(z,\bar{z})}{2\gamma^2}\right)$.
+Using Gaussian distributions as prior distributions in the problem of the classification of univariate
+normal populations was shown to lead to a new, more general and efficient decision rule. This
+decision rule was implemented in a real-world application to texture image classification, where it
+led to significant improvement in performance, in comparison to decision rules obtained by using
+conjugate priors.
+The general approach proposed in this paper contains several simplifications and approximations,
+which could be improved upon in future work. First, it is possible to use different prior distributions,
+which are more geometrically rich than Gaussian distributions, to represent classes of univariate normal
+populations. For example, it may be helpful to replace Gaussian distributions that are “isotropic”, in
+the sense of having a scalar dispersion parameter γ, by non-isotropic distributions, with a dispersion
+matrix Γ (a 2 × 2 symmetric positive definite matrix). Another possibility would be to represent
+each class of univariate normal populations by a finite mixture of Gaussian distributions, instead of
+representing it by a single Gaussian distribution.
+These variants, which would allow classes with a more complex geometric structure to be taken
+into account, can be integrated in the general framework proposed in the paper, based on: (i) fitting
+each class to a prior distribution (Gaussian non-isotropic, mixture of Gaussians); and (ii) choosing, for
+a test population, the most adequate class, based on a decision rule. These two steps can be realized as
+above, through the Laplace approximation and maximum likelihood estimation, or through alternative
+techniques, based on Markov chain Monte Carlo stochastic optimization.
+In addition to generalizing the approach of this paper and improving its performance, a further
+important objective for future work will be to extend it to other parametric models, beyond univariate
+normal models. Indeed, there is an increasing number of parametric models (generalized Gaussian,
+elliptical models, etc.), whose Riemannian geometry is becoming well understood and where the
+present approach may be helpful.
+
+Author Contributions
+
+Salem Said carried out the mathematical development and specified the algorithms appearing in
+Sections 2, 3 and 4. Lionel Bombrun carried out all numerical simulations and proposed the theoretical
+development of Section 4. Yannick Berthoumieu devised the main idea of the paper, that is, the use of
+Riemannian priors as a geometric representation of a class or cluster of univariate normal populations.
+All authors have read and approved the final manuscript.
+
+Conflicts of Interest
+
+The authors declare no conflict of interest.
+
+References
+
+1. Amari, S.I.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000.
+2. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
+3. Kass, R.E. The geometry of asymptotic inference. Stat. Sci. 1989, 4, 188–234.
+4. Amari, S.I. Natural gradient works efficiently in learning. Neur. Comput. 1998, 10, 251–276.
+5. Nielsen, F.; Nock, R. Hyperbolic Voronoi diagrams made easy. 2009, arXiv:0903.3287.
+6. Lenglet, C.; Rousson, M.; Deriche, R.; Faugeras, O. Statistics on the manifold of multivariate normal distributions: Theory and application to diffusion tensor MRI processing. J. Math. Imaging Vis. 2006, 25, 423–444.
+7. Verdoolaege, G.; Scheunders, P. On the geometry of multivariate generalized Gaussian models. J. Math. Imaging Vis. 2012, 43, 180–193.
+8. Berkane, M.; Oden, K. Geodesic estimation in elliptical distributions. J. Multival. Anal. 1997, 63, 35–46.
+9. Erdélyi, A. Asymptotic Expansions; Dover Books: Mineola, New York, NY, USA, 2010.
+10. MIT Vision and Modeling Group. Vision Texture. Available online: http://vismod.media.mit.edu/pub/VisTex (accessed on 10 June 2014).
+11. Pennec, X. Intrinsic statistics on Riemannian manifold: Basic tools for geometric measurements. J. Math. Imaging Vis. 2006, 25, 127–154.
+12. Atkinson, C.; Mitchell, A.F.S. Rao’s distance measure. Sankhya Ser. A 1981, 43, 345–365.
+13. Gallot, S.; Hulin, D.; Lafontaine, J. Riemannian Geometry; Springer-Verlag: Berlin, Germany, 2004.
+14. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Cambridge, MA, USA, 2006.
+15. Fréchet, M. Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l’I.H.P. 1948, 10, 215–310. (In French)
+16. Afsari, B. Riemannian Lp center of mass: Existence, Uniqueness and convexity. Proc. Am. Math. Soc. 2011, 139, 655–673.
+17. Manton, J.H. A centroid (Karcher mean) approach to the joint approximate diagonalisation problem: The real symmetric case. Digit. Sign. Process. 2006, 16, 468–478.
+18. Arnaudon, M.; Barbaresco, F. Riemannian medians and means with applications to RADAR signal processing. IEEE J. Sel. Top. Sign. Process. 2013, 7, 595–604.
+19. Le, H. On the consistency of procrustean mean shapes. Adv. Appl. Prob. 1998, 30, 53–63.
+20. Turaga, P.; Veeraraghavan, A.; Chellappa, R. Statistical Analysis on Stiefel and Grassmann Manifolds with Applications in Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; doi: 10.1109/CVPR.2008.4587733.
+21. Chavel, I. Riemannian geometry: A modern introduction; Cambridge University Press: Princeton, MA, USA, 2008.
+22. Grigorescu, S.E.; Petkov, N.; Kruizinga, P. Comparison of texture features based on Gabor filter. IEEE Trans. Image Process. 2002, 11, 1160–1167.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+entropy
+
+Article
+Combinatorial Optimization with Information
+Geometry: The Newton Method
+
+Luigi Malagò 1 and Giovanni Pistone 2,*
+
+1 Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico, 39/41, 20135 Milano, Italy;
+E-Mail: malago@di.unimi.it
+2 de Castro Statistics, Collegio Carlo Alberto, Via Real Collegio 30, 10024 Moncalieri, Italy
+*
+E-Mail: giovanni.pistone@carloalberto.org; Tel.: +39-011-670-5033; Fax: +39-011-670-5082.
+
+Received: 31 March 2014; in revised form: 10 July 2014 / Accepted: 11 July 2014 /
+Published: 28 July 2014
+
+Abstract: We discuss the use of the Newton method in the computation of $\max(p \mapsto E_p[f])$, where
+p belongs to a statistical exponential family on a finite state space. In a number of papers, the authors
+have applied first order search methods based on information geometry. Second order methods
+have been widely used in optimization on manifolds, e.g., matrix manifolds, but appear to be new
+in statistical manifolds. These methods require the computation of the Riemannian Hessian in a
+statistical manifold. We use a non-parametric formulation of information geometry in view of further
+applications in the continuous state space cases, where the construction of a proper Riemannian
+structure is still an open problem.
+
+Keywords: statistical manifold; Riemannian Hessian; combinatorial optimization; Newton method
+
+1. Introduction
+
+In this paper, statistical exponential families [1] are thought of as differentiable manifolds along
+the approach called information geometry [2] or the exponential statistical manifold [3]. Specifically,
+our aim is to discuss optimization on statistical manifolds using the Newton method, as is suggested
+in ([4] (Ch. 5 and 6)); see also the monograph [5]. This method is based on classical Riemannian
+geometry [6], but here, we put our emphasis on coordinate-free differential geometry; see [7,8].
+We mainly refer to the above-mentioned references [2,4], with one notable exception in the
+description of the tangent space. Our manifold will be an exponential family EV of positive densities,
+V being a vector space of sufficient statistics. Given a one-dimensional statistical model p(t) ∈ EV,
+t ∈ I, we define its velocity at time t to be its Fisher score $s(t) = \frac{d}{dt} \ln p(t)$ [9]. The Fisher score s(t)
+is a random variable with zero expectation with respect to p(t), Ep(t) [s(t)] = 0. Because of that, the
+tangent space at p ∈ EV is a vector space of random variables with zero expectation at p. A vector field
+is a mapping from p to a random variable V(p), such that for all p ∈ E, the random variable V(p) is
+centered at p, Ep [V(p)] = 0. In other words, each point of the manifold has a different tangent space,
+and this tangent space can be used as a non-parametric model space of the manifold. In this formalism,
+a vector field is a mapping from densities to centered random variables, that is, it is what in statistics is
+called a pivot of the statistical model. To avoid confusion with the product of random variables, we
+do not use the standard notation for the action of a vector field on a real function. This approach is
+possibly unusual in differential geometry, but it is fully natural from the statistical point of view, where
+the Fisher score has a central place. Moreover, this approach scales nicely from the finite state space to
+the general state space; see the discussion in [9] and the review in [3].
+A complete construction of the geometric framework based on the idea of using the Fisher scores
+as elements of the tangent bundle has been actually worked out. In this paper, we go on by considering
+a second order geometry based on the non-parametric settings.
+
+Entropy 2014, 16, 4260–4289; doi:10.3390/e16084260
+www.mdpi.com/journal/entropy
+
+Our main motivation for such a geometrical construction is its application to combinatorial
+optimization using exponential families, whose first order version was developed in [10–14]. We give
+here an illustration of the methods in the following toy example.
+Consider the function f (x1, x2) = a0 + a1x1 + a2x2 + a12x1x2, with x1, x2 = ±1, a0, a1, a2, a12 ∈ R.
+The function f is a real random variable on the sample space Ω = {+1, −1}2 with the uniform
+probability λ.
+Note that the coordinate mappings X1, X2 of Ω generate an orthonormal basis
+1, X1, X2, X1X2 of L2(Ω, λ) and that f is the general form of a real random variable on such a space. Let
+P> be the open simplex of positive densities on (Ω, λ), and let EV be a statistical model, i.e., a subset
+of P>. The relaxed mapping F: EV → R,
+
+F(p) = Ep [ f ] = a0 + a1 Ep [X1] + a2 Ep [X2] + a12 Ep [X1X2] ,
+(1)
+
+is strictly bounded by the maximum of f, F(p) = Ep [ f ] < maxx∈Ω f (x), unless f is constant. We are
+looking for a sequence pn, n ∈ N, such that Epn [ f ] → maxx∈Ω f (x) as n → ∞. The existence of such a
+sequence is a nontrivial condition for the model E. Precisely, the closure of EV must contain a density,
+whose support is contained in the set of maxima {x ∈ Ω| f (x) = max f }. This condition is satisfied by
+the independence model, V = Span {X1, X2}, where we can write:
+
+$$F(\eta_1, \eta_2) = a_0 + a_1\eta_1 + a_2\eta_2 + a_{12}\eta_1\eta_2, \qquad \eta_i = E_p[X_i], \qquad (2)$$
+
+See Figure 1.
+The gradient of Equation (2) has components ∂1F = a1 + a12η2, ∂2F = a2 + a12η1, and the flow
+along the gradient produces increasing values for F; however, the gradient flow does not converge to
+the maximum of F; see the dotted line in Figure 2. However, one can follow the suggestion by [15] and
+use a modified gradient (the “natural” gradient) flow that produces better results in our problem; see
+Figure 3. Full details on this example are given in Section 2.5.2.
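+
+For the parameter values used in the figures, a1 = 1, a2 = 2, a12 = 3, the toy example can be checked by hand. The following is a sketch that takes a0 = 0 as an illustrative assumption, since a0 only shifts all values by a constant:
+
+```latex
+% Values of f at the four points of \Omega = \{+1,-1\}^2 (a_0 = 0 assumed)
+f(+1,+1) = 1 + 2 + 3 = 6, \qquad f(+1,-1) = 1 - 2 - 3 = -4,
+f(-1,+1) = -1 + 2 - 3 = -2, \qquad f(-1,-1) = -1 - 2 + 3 = 0 .
+% Hence \max_{x \in \Omega} f(x) = 6, attained only at (+1,+1).
+% Gradient of the relaxed function (2) in the expectation parameters
+\partial_1 F = 1 + 3\eta_2, \qquad \partial_2 F = 2 + 3\eta_1 ,
+% which vanishes at the interior saddle point
+(\eta_1, \eta_2) = (-2/3, -1/3), \qquad F(-2/3, -1/3) = -\tfrac{2}{3} .
+```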
+
+[Figure 1 plot: titled “Expectation parameters”, with axes η1 and η2 ranging over [−1, 1].]
+
+Figure 1. Relaxation of the Function (2) on the independence model. a1 = 1, a2 = 2, a12 = 3.
+
+[Figure 2 plot: titled “Expectation parameters”, with axes η1 and η2 ranging over [−2, 2].]
+
+Figure 2. Gradient flow of the Function (2). The domain has been increased to include values outside
+the square [−1, +1]2.
+
+[Figure 3 plot: titled “Gradient vs Natural gradient”, with axes η1 and η2 ranging over [−1, 1].]
+
+Figure 3. Gradient flow (blue line) and natural gradient flow (black line) for the Function (2), starting
+at (−1/4, −1/4).
+
+In combinatorial optimization, the values of the function f are assumed to be available at each
+point, and the curve of steepest ascent of the relaxed function is learned through a simulation procedure
+based on exponential statistical models.
+
+In this paper, we introduce, in Section 2, the geometry of exponential families and its first order
+calculus. The second order calculus and the Hessian are discussed in Section 3. Finally, in Section 4,
+we apply the formalism to the discussion of the Newton method in the context of the maximization of
+the relaxed function.
+
+2. Models on a Finite State Space
+
+We consider here the exponential statistical manifold on the set of positive densities on a measure
+space (Ω, μ) with Ω finite and counting measure μ. The setup we describe below is not strictly required
+in the finite case, because in such a case, other approaches are possible, but it provides a mathematical
+formalism that has its own pros and that scales naturally to the infinite case.
+We provide below a schematic presentation of our formalism as an introduction to this section.
+
+• Two different exponential families can actually be the same statistical model, as the set of
+densities in the two exponential families are actually equal. This fact is due to both the
+arbitrariness of the reference density and the fact that sufficient statistics are actually a vector
+basis of the vector space generated by the sufficient statistics. In a non-parametric approach,
+we can refer directly to the vector space of centered log-densities, while the change of reference
+density is geometrically interpreted as a change of chart. The set of all possible such charts
+defines a manifold.
+• We make a specific interpretation of the tangent bundle as the vector space of Fisher’s scores at
+each density and use such tangent spaces as the space of coordinates. This produces a different
+tangent space/space of coordinates at each density, and different tangent spaces are mapped
+one onto another by a proper parallel transport, which is nothing else than the re-centering of
+random variables.
+• If a basis is chosen, a parametrization is given, and such a parametrization is, in fact, a new
+chart, whose values are real vectors. In the real parametrization, the natural scalar product in
+each scores space is given by Fisher’s information matrix.
+• Riemannian gradients are defined in the usual way. It is customary in information geometry to
+call “natural gradient” the real coordinate presentation of the Riemannian gradient. The natural
+gradient is computed by applying the inverse of the Fisher information matrix to the Euclidean
+gradient. It seems that there are three gradients involved, but they all represent the same object
+when correctly understood.
+• The classical notion of expectation parameters for exponential families carries on as another
+chart on the statistical manifold, which gives rise to a further presentation of a geometrical
+object.
+• While the statistical manifold is unique, there are at least three relevant connections as structures
+on the vector bundles of the manifold: one relating to the exponential charts, one relating to the
+expectation charts and one depending on the Riemannian structure.
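+
+For the independence model used in the toy example of the Introduction, the relation between the Euclidean and the natural gradient can be made explicit. This is a sketch under the assumption that X1, X2 ∈ {+1, −1} are independent with ηi = Ep[Xi], so that the Fisher information matrix in the natural parameters θ is diagonal with entries Var(Xi) = 1 − ηi²; the dotted symbols for time derivatives along the flow are introduced here for illustration only:
+
+```latex
+% Fisher information of the independence model
+I(\theta) = \frac{\partial \eta}{\partial \theta} = \operatorname{diag}\left(1 - \eta_1^2,\; 1 - \eta_2^2\right) ,
+% chain rule: Euclidean gradient in theta versus eta
+\nabla_\theta F = I(\theta)\, \nabla_\eta F ,
+% natural gradient flow in theta, and the same flow seen in eta
+\dot{\theta} = I(\theta)^{-1} \nabla_\theta F = \nabla_\eta F ,
+\qquad
+\dot{\eta} = I(\theta)\, \dot{\theta} = I(\theta)\, \nabla_\eta F ,
+% with F from Equation (2)
+\dot{\eta}_1 = (1 - \eta_1^2)(a_1 + a_{12}\eta_2) ,
+\qquad
+\dot{\eta}_2 = (1 - \eta_2^2)(a_2 + a_{12}\eta_1) .
+```
+
+The factors 1 − ηi² vanish on the boundary of the square [−1, +1]², so under these assumptions a natural gradient flow started inside the square stays inside it, unlike the Euclidean gradient flow.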
+
+2.1. Exponential Families As Manifolds
+
+On the finite sample space Ω, #Ω = n, let a set of random variables B = {X1, . . . , Xm} be
+given, such that $\sum_j \alpha_j X_j$ is constant if, and only if, the αj’s are zero, or, equivalently, such that X0 =
+1, X1, . . . , Xm are affinely independent. The condition implies, necessarily, the linear independence of
+B. A common choice is to take a set of linearly independent and μ-centered random variables.
+We write V = Span {X1, . . . , Xm} and define the following exponential family of positive densities
+p ∈ P>:
+
+$$E_V = \left\{ q \in P_> \;\middle|\; q \propto e^{V} p,\; V \in V \right\}. \qquad (3)$$
+
+Given any couple p, q ∈ EV, there exists a unique set of parameters θ = θp(q), such that:
+
+$$q = \exp\left( \sum_j \theta_j\, {}^{e}U_p X_j - \psi_p(\theta) \right) \cdot p \qquad (4)$$
+
+Entropy 2014, 16, 4260–4289
+
+where eUp is the centering at p, that is,
+
+\[ {}^{e}U_p \colon V \ni U \mapsto U - \mathrm{E}_p[U] \in {}^{e}U_p V. \tag{5} \]
+
+The linear mapping eUp is one-to-one on V, and eUpXj, j = 1, . . . , m, is a basis of eUpV. We view
+each choice of a specific reference p as providing a chart centered at p on the exponential family EV,
+namely:
+
+\[ \sigma_p \colon \exp\Bigl( \sum_{j} \theta_j \, {}^{e}U_p X_j - \psi_p(\theta) \Bigr) \cdot p \mapsto \theta. \tag{6} \]
+
+If:
+
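+\[ U = {}^{e}U_p U + \mathrm{E}_p[U] = \sum_{j=1}^{m} \theta_j \, {}^{e}U_p X_j + \mathrm{E}_p[U], \tag{7} \]
+
+then:
+
+\[ \mathrm{E}_p\bigl[ U \, {}^{e}U_p X_i \bigr] = \sum_{j=1}^{m} \theta_j \, \mathrm{E}_p\bigl[ {}^{e}U_p X_i \, {}^{e}U_p X_j \bigr], \tag{8} \]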
+
+so that $\theta = I_B^{-1}(p)\, \mathrm{E}_p[U \, {}^{e}U_p X]$, where:
+
+\[ I_B(p) = \bigl[ \operatorname{Cov}_p(X_i, X_j) \bigr]_{ij} = \mathrm{E}_p[X X'] - \mathrm{E}_p[X]\, \mathrm{E}_p[X'] \tag{9} \]
+
+is the Fisher information matrix of the basis B = {X1, . . . , Xm}.
+The mappings:
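+\[ \sigma_p \colon \mathcal{E}_V \ni q \mapsto U \mapsto \theta \in \mathbb{R}^m \tag{10} \]
+
+where:
+
+\[ s_p \colon q \mapsto U = \log\frac{q}{p} - \mathrm{E}_p\Bigl[ \log\frac{q}{p} \Bigr], \tag{11} \]
+
+\[ \sigma_p \colon q \mapsto \theta = I_B^{-1}(p)\, \mathrm{E}_p\bigl[ U \, {}^{e}U_p X \bigr] = I_B^{-1}(p)\, \mathrm{E}_p\Bigl[ \log\frac{q}{p} \; {}^{e}U_p X \Bigr], \tag{12} \]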
+
+are global charts in the non-parametric and parametric coordinates, respectively.
+Notice that
+Equation (12) provides the regression coefficients of the least squares estimate on eUpV of the
+log-likelihood.
+We denote by ep : Rm → EV the inverse of σp, i.e.,
+
+\[ e_p(\theta) = \exp\Bigl( \sum_{j=1}^{m} \theta_j \, {}^{e}U_p X_j - \psi_p(\theta) \Bigr) \cdot p, \tag{13} \]
+
+so that the representation of the divergence q �→ D (p ∥q) in the chart σp is ψp:
+
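+\[ \psi_p(\theta) = \log \mathrm{E}_p\Bigl[ \mathrm{e}^{\sum_{j=1}^{m} \theta_j \, {}^{e}U_p X_j} \Bigr] = \mathrm{E}_p\Bigl[ \log \frac{p}{e_p(\theta)} \Bigr] = D\bigl( p \,\|\, e_p(\theta) \bigr). \tag{14} \]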
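+
+As a numerical sanity check of Equations (12) and (14), the following minimal Python sketch uses a hypothetical toy model that is not taken from the paper: Ω = {0, 1, 2}, a single statistic X(x) = x and the uniform reference density p. It recovers θ from q = ep(θ) through the chart σp and compares ψp(θ) with D(p ∥ q). All function names are illustrative.
+

```python
import math

# Hypothetical toy model: sample space {0, 1, 2}, one statistic X(x) = x.
X = [0.0, 1.0, 2.0]
p = [1 / 3, 1 / 3, 1 / 3]  # reference density (uniform, counting measure)


def expect(density, values):
    """Expectation of a random variable under a density."""
    return sum(d * v for d, v in zip(density, values))


def centered(density, values):
    """Centering of a random variable at the given density."""
    mean = expect(density, values)
    return [v - mean for v in values]


def psi(theta):
    """Cumulant function psi_p(theta) = log E_p[exp(theta * centered X)]."""
    xc = centered(p, X)
    return math.log(expect(p, [math.exp(theta * x) for x in xc]))


def e_p(theta):
    """Inverse chart, Equation (13): density with coordinate theta."""
    xc = centered(p, X)
    return [math.exp(theta * x - psi(theta)) * pi for x, pi in zip(xc, p)]


def sigma_p(q):
    """Chart of Equation (12) in the scalar case m = 1."""
    xc = centered(p, X)
    fisher = expect(p, [x * x for x in xc])  # I_B(p) = Var_p(X)
    log_ratio = [math.log(qi / pi) for qi, pi in zip(q, p)]
    return expect(p, [l * x for l, x in zip(log_ratio, xc)]) / fisher


def kl(a, b):
    """Kullback-Leibler divergence D(a || b)."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))


theta = 0.7
q = e_p(theta)
recovered = sigma_p(q)   # should reproduce theta
divergence = kl(p, q)    # should equal psi(theta)
```

+
+In this scalar case the Fisher information reduces to the variance of X under p, so the chart is an ordinary regression coefficient of log(q/p) on the centered statistic.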
+
+The mapping IB : p �→ Covp (X, X) ∈ Rm×m is represented in the chart centered at p by:
+
+\[ I_{B,p}(\theta) = I_B(e_p(\theta)) = \bigl[ \operatorname{Cov}_{e_p(\theta)}(X_i, X_j) \bigr]_{i,j} = \operatorname{Hess} \psi_p(\theta); \tag{15} \]
+
+see [1].
+
+2.2. Change of Chart
+
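+Fix p, ¯p ∈ EV; then, we can express p in the chart centered at ¯p,
+
+\[ p = \exp\bigl( \bar U - k_{\bar p}(\bar U) \bigr) \cdot \bar p, \qquad \bar U \in {}^{e}U_{\bar p} V, \qquad k_{\bar p}(\bar U) = \log \mathrm{E}_{\bar p}\bigl[ \mathrm{e}^{\bar U} \bigr]. \tag{16} \]
+
+In coordinates, $\bar U = \sum_{j=1}^{m} \bar\theta_j \, {}^{e}U_{\bar p} X_j$.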
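+For all q ∈ EV, $q = \exp(U - k_p(U)) \cdot p$ with $U \in {}^{e}U_p V$ and $k_p(U) = \log \mathrm{E}_p[\mathrm{e}^{U}]$; in coordinates,
+$U = \sum_{j=1}^{m} \theta_j \, {}^{e}U_p X_j$. We can write:
+
+\[
+\begin{aligned}
+q &= \exp\bigl( U - k_p(U) \bigr) \cdot p \\
+  &= \exp\bigl( U - k_p(U) \bigr) \exp\bigl( \bar U - k_{\bar p}(\bar U) \bigr) \cdot \bar p \\
+  &= \exp\bigl( U - k_p(U) + \bar U - k_{\bar p}(\bar U) \bigr) \cdot \bar p \\
+  &= \exp\Bigl( \bigl( U + \bar U - \mathrm{E}_{\bar p}[U] \bigr) - \bigl( k_p(U) + k_{\bar p}(\bar U) - \mathrm{E}_{\bar p}[U] \bigr) \Bigr) \cdot \bar p,
+\end{aligned}
+\tag{17}
+\]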
+
+hence, the non-parametric coordinate of q in the chart centered at ¯p is U + ¯U − E ¯p [U] = eU ¯p(U) + ¯U.
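+From Equation (12):
+
+\[ \sigma_{\bar p}(q) = I_B^{-1}(\bar p)\, \mathrm{E}_{\bar p}\bigl[ ({}^{e}U_{\bar p} U + \bar U)\, {}^{e}U_{\bar p} X \bigr] = \theta + \bar\theta. \tag{18} \]
+
+This provides the change of charts $\sigma_{\bar p} \circ \sigma_p^{-1} \colon \theta \mapsto \theta + \bar\theta$. This atlas of charts defines the affine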
+manifold (EV, (σp)). This fact has deep consequences that we do not discuss here, e.g., our manifold is
+an instance of a Hessian manifold [16].
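+
+The additive change of charts in Equation (18) can be checked numerically. The sketch below is only an illustration under assumed data (Ω = {0, 1, 2}, one statistic X(x) = x, an arbitrary reference ¯p); the helper names are hypothetical.
+

```python
import math

# Hypothetical toy model: sample space {0, 1, 2}, one statistic X(x) = x.
X = [0.0, 1.0, 2.0]


def expect(density, values):
    return sum(d * v for d, v in zip(density, values))


def family_point(ref, theta):
    """Density with coordinate theta in the chart centered at ref."""
    mean = expect(ref, X)
    weights = [r * math.exp(theta * (x - mean)) for r, x in zip(ref, X)]
    total = sum(weights)  # equals exp(psi_ref(theta))
    return [w / total for w in weights]


def chart(ref, q):
    """Scalar version of Equation (12) with reference density ref."""
    mean = expect(ref, X)
    xc = [x - mean for x in X]
    fisher = expect(ref, [x * x for x in xc])
    score = [math.log(qi / ri) * x for qi, ri, x in zip(q, ref, xc)]
    return expect(ref, score) / fisher


p_bar = [0.2, 0.3, 0.5]
theta_bar, theta = 0.4, -1.1
p = family_point(p_bar, theta_bar)   # p has coordinate theta_bar at p_bar
q = family_point(p, theta)           # q has coordinate theta at p
coordinate_at_p_bar = chart(p_bar, q)
```

+
+The coordinate of q in the chart centered at ¯p comes out as θ + ¯θ, which is the affine structure stated above.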
+
+2.3. Tangent Bundle
+
+The space of Fisher scores at p is eUpV, and it is identified with the tangent space of the manifold
+at p, TpEV; see the discussion in [3,9]. Let us check the consistency of this statement with our
+θ-parametrization.
+Let:
+
+\[ q(\tau) = \exp\Bigl( \sum_{j=1}^{m} \theta_j(\tau) \, {}^{e}U_{q(0)} X_j - \psi_{q(0)}(\theta(\tau)) \Bigr) \cdot q(0), \tag{19} \]
+
+τ ∈ I, I an open interval containing zero, a curve in EV. In the chart centered at q(0), we have from
+Equation (12):
+
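+\[
+\begin{aligned}
+\sigma_{q(0)}(q(\tau)) &= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\Bigl[ \log\frac{q(\tau)}{q(0)} \; {}^{e}U_{q(0)} X \Bigr] \\
+&= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\Bigl[ \Bigl( \sum_{j=1}^{m} \theta_j(\tau) \, {}^{e}U_{q(0)} X_j - \psi_{q(0)}(\theta(\tau)) \Bigr) {}^{e}U_{q(0)} X \Bigr] \\
+&= I_B^{-1}(q(0)) \sum_{j=1}^{m} \theta_j(\tau)\, \mathrm{E}_{q(0)}\bigl[ {}^{e}U_{q(0)} X_j \; {}^{e}U_{q(0)} X \bigr] \\
+&= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\bigl[ {}^{e}U_{q(0)} X \; {}^{e}U_{q(0)} X' \bigr]\, \theta(\tau) \\
+&= \theta(\tau).
+\end{aligned}
+\tag{20}
+\]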
+
+The vector space eUpV is represented by the coordinates in the base eUpB. The tangent bundle
+TEV as a manifold is defined by the charts (σp, ˙σp) on the domain:
+
+\[ T\mathcal{E}_V = \left\{ (p, v) \,\middle|\, p \in \mathcal{E}_V,\ v \in T_p\mathcal{E}_V \right\} \tag{21} \]
+
+with:
+\[ (\sigma_p, \dot\sigma_p) \colon (q, V) \mapsto \Bigl( I_B^{-1}(p)\, \mathrm{E}_p\Bigl[ \log\frac{q}{p} \; {}^{e}U_p X \Bigr],\ I_B^{-1}(p)\, \mathrm{E}_p\bigl[ V \, {}^{e}U_p X \bigr] \Bigr). \tag{22} \]
+
+The dot notation ˙σp for the charts on the tangent spaces is justified by the computation in Equation (23)
+below:
+
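+\[
+\begin{aligned}
+\left. \frac{d}{d\tau} \sigma_{q(0)}(q(\tau)) \right|_{\tau=0}
+&= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\Bigl[ \left. \frac{d}{d\tau} \log q(\tau) \right|_{\tau=0} {}^{e}U_{q(0)} X \Bigr] \\
+&= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\bigl[ \delta q(0)\; {}^{e}U_{q(0)} X \bigr] = \dot\sigma_{q(0)}(\delta q(0)).
+\end{aligned}
+\tag{23}
+\]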
+
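+The velocity at τ = 0 is $\delta q(0) = \left. \frac{d}{d\tau} \log q(\tau) \right|_{\tau=0} \in T_{q(0)}\mathcal{E}_V$ and:
+
+\[
+\left. \frac{d}{d\tau} \theta(\tau) \right|_{\tau=0}
+= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\Bigl[ \left. \frac{d}{d\tau} \log q(\tau) \right|_{\tau=0} {}^{e}U_{q(0)} X \Bigr]
+= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\bigl[ \delta q(0)\; {}^{e}U_{q(0)} X \bigr],
+\tag{24}
+\]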
+
+which is consistent with both the definition of tangent space as set of Fisher scores and with the chart
+of the tangent bundle as defined in Equation (22).
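+The velocity at a generic τ is $\delta q(\tau) = \frac{d}{d\tau} \log q(\tau) \in T_{q(\tau)}\mathcal{E}_V$ and has coordinates in the chart centered at q(0):
+
+\[
+\frac{d}{d\tau} \theta(\tau)
+= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\Bigl[ \frac{d}{d\tau} \log q(\tau)\; {}^{e}U_{q(0)} X \Bigr]
+= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\bigl[ \delta q(\tau)\; {}^{e}U_{q(0)} X \bigr].
+\tag{25}
+\]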
+
+If V, W are vector fields on TEV, i.e., V(p), W(p) ∈ TpEV = eUpV, p ∈ EV, we define a Riemannian
+metric g(V, W) by:
+
+\[ g(V, W)(p) = g_p(V(p), W(p)) = \mathrm{E}_p[V(p) W(p)] \tag{26} \]
+
+In coordinates at p, $V(p) = \sum_j \dot\sigma_p^j(V) \, {}^{e}U_p X_j$ and $W(p) = \sum_j \dot\sigma_p^j(W) \, {}^{e}U_p X_j$, so that:
+
+\[ g_p(V(p), W(p)) = \dot\sigma_p(V)' \, I_B(p) \, \dot\sigma_p(W). \tag{27} \]
+
+2.4. Gradients
+
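+Given a function φ: EV → R, let $\varphi_p = \varphi \circ e_p$, $e_p = \sigma_p^{-1}$, be its representation in the chart centered
+at p:
+
+\[ \mathbb{R}^m \xrightarrow{\; e_p \;} \mathcal{E}_V \xrightarrow{\; \varphi \;} \mathbb{R}, \qquad \varphi_p = \varphi \circ e_p \colon \mathbb{R}^m \to \mathbb{R}. \tag{28} \]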
+
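+The derivative of $\theta \mapsto \varphi_p(\theta)$ at θ = 0 along α ∈ Rm is:
+
+\[ \nabla\varphi_p(0)\alpha = \nabla\varphi_p(0) I_B^{-1}(p) I_B(p) \alpha = \bigl( I_B^{-1}(p) \nabla\varphi_p(0)' \bigr)' I_B(p) \alpha = g_p\bigl( I_B^{-1}(p) \nabla\varphi_p(0)', \alpha \bigr). \tag{29} \]
+
+The mapping $\widetilde\nabla\varphi \colon p \mapsto I_B^{-1}(p) (\nabla\varphi_p(0))' \in \mathbb{R}^m$ that appears in Equation (29) is Amari’s natural
+gradient of φ: EV; see [15]. It is a standard notion in Riemannian geometry; cf. [4] (p. 46).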
+
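+More generally, the derivative of $\theta \mapsto \varphi_p(\theta)$ at θ along α ∈ Rm is:
+
+\[
+\nabla\varphi_p(\theta)\alpha = \nabla\varphi_p(\theta) I_B^{-1}(e_p(\theta)) I_B(e_p(\theta)) \alpha
+= \bigl( I_B^{-1}(e_p(\theta)) \nabla\varphi_p(\theta)' \bigr)' I_B(e_p(\theta)) \alpha
+= g_{e_p(\theta)}\bigl( I_B^{-1}(e_p(\theta)) \nabla\varphi_p(\theta)', \alpha \bigr).
+\tag{30}
+\]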
+
+Let us compare ∇φq(0) and ∇φp(θ) when q = ep(θ). As φp = φ ◦ ep and φq = φ ◦ eq, we have the
+change of charts:
+φq = φ ◦ eq = φ ◦ ep ◦ σp ◦ eq = φp ◦ σp ◦ eq,
+(31)
+
+hence ∇φq(0) = ∇φp(σp(q))J(σp ◦ eq)(0), where J(σp ◦ eq) is the Jacobian of σp ◦ eq. As σp ◦ eq(θ) =
+θ + σp(q), we have J(σp ◦ eq) = Id, and in conclusion, ∇φep(θ)(0) = ∇φp(θ). For all p ∈ EV and
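+θ ∈ Rm,
+
+\[ \widetilde\nabla\varphi(e_p(\theta)) = I_B^{-1}(e_p(\theta)) \nabla\varphi_p(\theta). \tag{32} \]
+
+Alternatively, for all q, p ∈ EV, $\widetilde\nabla\varphi \colon \mathcal{E}_V \to \mathbb{R}^m$ is defined by:
+
+\[ \widetilde\nabla\varphi(q) = I_B^{-1}(q) \nabla\varphi_p(\sigma_p(q)). \tag{33} \]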
+
+The Riemannian gradient of φ: EV is the vector field ∇φ, such that DYφ = g(∇φ, Y). Note that
+the Riemannian gradient takes values in the tangent bundle, while the natural gradient takes values in
+Rm. We compute the Riemannian gradient at p as follows. If y = ˙σp(Y(p)),
+
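+\[ D_Y\varphi(p) = d\varphi_p(0) y = g_p\bigl( \widetilde\nabla\varphi(p), y \bigr) = \mathrm{E}_p[\nabla\varphi(p) Y(p)], \tag{34} \]
+
+hence $\widetilde\nabla\varphi(p) = I_B^{-1}(p) \nabla\varphi_p(0)'$ is the representation in the chart centered at p of the vector field
+∇φ: EV. Explicitly, we have (see Equation (22)),
+
+\[ \widetilde\nabla\varphi(p) = I_B^{-1}(p) (\nabla\varphi_p(0))' = I_B^{-1}(p)\, \mathrm{E}_p\bigl[ \nabla\varphi(p)\; {}^{e}U_p X \bigr], \tag{35} \]
+
+\[ \nabla\varphi(p) = \sum_j \bigl( \widetilde\nabla\varphi(p) \bigr)_j \, {}^{e}U_p X_j \tag{36} \]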
+
+The Euclidean gradient ∇φp(θ) is sometimes called the “vanilla gradient.” It is equal to the
+covariance between the Riemannian gradient ∇φ(p) and the basis X, (∇φp(0))′ = Ep [∇φ(p) eUpX].
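+We summarize in a display the relations between our three gradients: Euclidean $\nabla\varphi_p(0)$, natural
+$\widetilde\nabla\varphi(p)$ and Riemannian $\nabla\varphi(p)$.
+
+[Commutative diagrams: the chart (σp, ˙σp) maps TEV to R2m and covers the chart σp: EV → Rm through the projections π and π1; ˙σp maps TpEV to Rm; the matrix IB(p) relates the natural and the Euclidean gradients in Rm.]
+
+\[ \dot\sigma_p \circ \nabla\varphi(p) = I_B^{-1} \nabla\varphi_p(0) = \widetilde\nabla\varphi(p) \tag{37} \]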
+In the following, we shall frequently use the fact that the representation of the gradient vector
+field ∇φ in a generic chart centered at p is:
+
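+\[ (\nabla\varphi)_p(\theta) = \dot\sigma_p\bigl( \nabla\varphi(e_p(\theta)) \bigr) = (\widetilde\nabla\varphi)(e_p(\theta)) = I_{B,p}^{-1}(\theta) \nabla\varphi_p(\theta). \tag{38} \]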
+
+It should be noted that the leftmost term (∇φ)p(θ) is the presentation of the gradient in the charts
+of the tangent bundle, while in the rightmost term, ∇φp(θ) denotes the Euclidean gradient of the
+presentation of the function φ in the charts of the manifold.
+
+2.4.1. Expectation Parameters
+
+As ψp is strictly convex, the gradient mapping $\theta \mapsto (\nabla\psi_p(\theta))'$ is a homeomorphism from the
+space of parameters Rm to the interior of the convex set generated by the image of eUpX; see [1]. The
+function μp : EV defined by:
+
+\[ \mu_p(q) = \mathrm{E}_q\bigl[ {}^{e}U_p X \bigr] = \mathrm{E}_q[X] - \mathrm{E}_p[X] = (\nabla\psi_p(\theta))', \qquad \theta = \sigma_p(q) \tag{39} \]
+
+is a chart for all p ∈ EV. The value of the inverse q = Lp(μ) is characterized as the unique q ∈ EV, such
+that μ = Eq [eUpX], i.e., the maximum likelihood estimator.
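+Let us compute the change of chart from p to ¯p:
+
+\[ \mu_{\bar p} \circ \mu_p^{-1}(\eta) = \bar\eta = \eta + \mathrm{E}_p[X] - \mathrm{E}_{\bar p}[X]. \tag{40} \]
+
+In fact, $\mu = \mathrm{E}_{L_p(\mu)}[{}^{e}U_p X]$ and $\bar\mu = \mu_{\bar p}(L_p(\mu)) = \mathrm{E}_{L_p(\mu)}[{}^{e}U_{\bar p} X]$.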
+We do not discuss here the rich theory started in [2] about the duality between σp and μp. We
+limit ourselves to the computation of the Riemannian gradient in the expectation parameters. If φ: EV,
+
+φp(θ) = φ ◦ ep(θ) = φ ◦ Lp ◦ μp ◦ ep(θ) = (φ ◦ Lp) ◦ (∇ψp)(θ),
+(41)
+
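+because $\mu_p \circ e_p(\theta) = \mathrm{E}_{e_p(\theta)}[{}^{e}U_p X] = \nabla\psi_p(\theta)$, hence:
+
+\[ \nabla\varphi_p(\theta) = \nabla(\varphi \circ L_p)(\nabla\psi_p(\theta)) \operatorname{Hess} \psi_p(\theta), \tag{42} \]
+
+\[ \widetilde\nabla\varphi(p) = I_B(p)^{-1} \bigl( \nabla(\varphi \circ L_p)(0) \operatorname{Hess} \psi_p(0) \bigr)' = \bigl( \nabla(\varphi \circ L_p)(0) \bigr)', \tag{43} \]
+
+\[ \nabla\varphi(p) = \nabla(\varphi \circ L_p)(0) \; {}^{e}U_p X, \tag{44} \]
+
+that is, the natural gradient $\widetilde\nabla\varphi$ at p = Lp(0) is equal to the Euclidean gradient of $\mu \mapsto \varphi \circ L_p(\mu)$ at
+μ = 0.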
+
+2.4.2. Vector Fields
+
+If V is a vector field of TEV and φ: EV is a real function, then we define the action of V on φ, ∇Vφ,
+to be the real function:
+∇Vφ: EV ∋ p �→ ∇Vφ(p) = ∇φp(0) ˙σp (V(p)) .
+(45)
+
+We prefer to avoid the standard notation Vφ, because in our setting, V(p) is a random variable, and
+the product V(p)φ(p) is otherwise defined as the ordinary product.
+Let us represent ∇Vφ in the chart centered at p:
+
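+\[ (\nabla_V\varphi)_p(\theta) = \nabla_V\varphi(e_p(\theta)) = \nabla\varphi_{e_p(\theta)}(0)\, \dot\sigma_{e_p(\theta)}\bigl( V(e_p(\theta)) \bigr) = \nabla\varphi_p(\theta) V_p(\theta), \tag{46} \]
+
+where we have used the equality $\nabla\varphi_{e_p(\theta)}(0) = \nabla\varphi_p(\theta)$ and $V_p(\theta) = \dot\sigma_{e_p(\theta)}\bigl( V(e_p(\theta)) \bigr)$.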
+If W is a vector field, we can compute ∇W∇Vφ at p as:
+
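+\[
+\begin{aligned}
+\nabla_W \nabla_V \varphi(p) &= \nabla(\nabla_V\varphi)_p(0)\, \dot\sigma_p(W(p)) \\
+&= V_p(0)' \operatorname{Hess} \varphi_p(0) W_p(0) + \nabla\varphi_p(0)\, J V_p(0)\, W_p(0),
+\end{aligned}
+\tag{47}
+\]
+
+where J denotes the Jacobian matrix.
+The Lie bracket [W, V]φ (see [7] (§4.2), [8] (V, §1), [4] (Section 5.3.1)) is given by:
+
+\[ [W, V]\varphi(p) = \nabla_W \nabla_V \varphi(p) - \nabla_V \nabla_W \varphi(p) = \nabla\varphi_p(0) \bigl( J V_p(0) W_p(0) - J W_p(0) V_p(0) \bigr), \tag{48} \]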
+
+because of Equation (47) and the symmetry of the Hessian.
+
+The flow of the smooth vector field V : EV is a family of curves γ(t, p), p ∈ EV, t ∈ Jp, Jp open real
+interval containing zero, such that for all p ∈ EV and t ∈ Jp,
+
+γ(0, p) = p,
+(49)
+
+δγ(t, p) = V(γ(t, p)).
+(50)
+
+As uniqueness holds in Equation (50) (see [8] (VI, §1) or [7] (§4.1)), we have the semi-group property
+γ(s + t, p) = γ(s, γ(t, p)), and Equation (50) is equivalent to δγ(0, p) = V(γ(0, p)), p ∈ EV.
+If a flow of V is available, we have an interpretation of ∇Vφ as a derivative of φ along γ(t, p),
+
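+\[
+\left. \frac{d}{dt} \varphi(\gamma(t, p)) \right|_{t=0}
+= \left. \nabla\varphi_p(\sigma_p(\gamma(t, p)))\, \frac{d}{dt} \sigma_p(\gamma(t, p)) \right|_{t=0}
+= \nabla\varphi_p(0)\, \dot\sigma_p(V(p)) = \nabla_V\varphi(p).
+\tag{51}
+\]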
+
+2.5. Examples
+
+The following examples are intended to show how the formalism of gradients is usable in
+performing basic computations.
+
+2.5.1. Expectation
+
+Let f be any random variable, and define F: EV by F(p) = Ep [ f ]. In the chart centered at p, we
+have:
+
+\[ F_p(\theta) = \int f \exp\Bigl( \sum_j \theta_j \, {}^{e}U_p X_j - \psi_p(\theta) \Bigr) \cdot p \; d\mu \tag{52} \]
+
+and the Euclidean gradient:
+∇Fp(0) = Covp ( f, X) ∈ (Rm)′.
+(53)
+
+The natural gradient is:
+
+\[ \widetilde\nabla F(p) = \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p(X, f) \in \mathbb{R}^m, \tag{54} \]
+
+and the Riemannian gradient is:
+
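+\[ \nabla F(p) = \bigl( \widetilde\nabla F(p) \bigr)' \, {}^{e}U_p X = \operatorname{Cov}_p(f, X) \operatorname{Cov}_p(X, X)^{-1} \, {}^{e}U_p X \in T_p\mathcal{E}_V. \tag{55} \]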
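+
+The three gradients of the expectation can be illustrated on a small example. The following sketch assumes a hypothetical model (Ω = {0, 1, 2}, one statistic X(x) = x, an arbitrary f and p) and checks Equation (53) by a central finite difference; it also verifies that the residual of f after subtracting the Riemannian gradient is orthogonal to the centered statistic.
+

```python
import math

# Hypothetical toy data: one statistic X(x) = x on {0, 1, 2}.
X = [0.0, 1.0, 2.0]
f = [1.0, -2.0, 4.0]
p = [0.5, 0.3, 0.2]


def expect(density, values):
    return sum(d * v for d, v in zip(density, values))


def family_point(ref, theta):
    """Density with coordinate theta in the chart centered at ref."""
    mean = expect(ref, X)
    weights = [r * math.exp(theta * (x - mean)) for r, x in zip(ref, X)]
    total = sum(weights)
    return [w / total for w in weights]


xc = [x - expect(p, X) for x in X]               # centered statistic
cov_fx = expect(p, [fi * x for fi, x in zip(f, xc)])
var_x = expect(p, [x * x for x in xc])

euclidean = cov_fx                               # Equation (53)
natural = cov_fx / var_x                         # Equation (54), m = 1
riemannian = [natural * x for x in xc]           # Equation (55), a random variable

# Central finite difference of theta -> E_{e_p(theta)}[f] at theta = 0.
h = 1e-6
finite_diff = (expect(family_point(p, h), f) - expect(family_point(p, -h), f)) / (2 * h)

# Residual of the L2(p) projection of f onto the tangent space.
f_centered = [fi - expect(p, f) for fi in f]
residual = expect(p, [(fc - r) * x for fc, r, x in zip(f_centered, riemannian, xc)])
```
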
+
+From Equation (55), it follows that ∇F(p) is the L2(p)-projection of f onto eUpV, while the natural gradient in
+Equation (54) gives the coordinates of the projection. Let us consider the family of curves:
+
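+\[ \gamma(t, p) = \exp\Bigl( \sum_{j=1}^{m} t \bigl( \widetilde\nabla F(p) \bigr)_j \, {}^{e}U_p X_j - \psi_p\bigl( t\, \widetilde\nabla F(p) \bigr) \Bigr) \cdot p, \qquad t \in \mathbb{R}. \tag{56} \]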
+
+The velocity is:
+
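+\[ \delta\gamma(t, p) = \frac{d}{dt} \Bigl( \sum_{j=1}^{m} t \bigl( \widetilde\nabla F(p) \bigr)_j \, {}^{e}U_p X_j - \psi_p\bigl( t\, \widetilde\nabla F(p) \bigr) \Bigr) = \nabla F(p) - \mathrm{E}_{\gamma(t,p)}[\nabla F(p)], \tag{57} \]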
+
+which is different from ∇F(γ(t, p)), unless f ∈ V ⊕ R. Then, γ is not, in general, the flow of ∇F, but it
+is a local approximation, as δγ(0, p) = ∇F(p).
+These computations are the basis of model-based methods in combinatorial optimization; see [10–14].
+
+2.5.2. Binary Independent Variables
+
+Here, we present, in full generality, the toy example of the Introduction; see [17] for more
+information on the application to combinatorial optimization. Our example is a very special case of
+Ising exactly solvable models [18], our aim being here to explore the geometric framework.
+Let $\Omega = \{+1, -1\}^m$ with counting measure μ, and let the space V be generated by the coordinate
+projections B = {X1, . . . , Xd}, d = m. Note that we use here the coding +1, −1 (from physics) instead of
+the coding 0, 1, which is more common in combinatorial optimization. The exponential family is
+$\mathcal{E}_V = \bigl\{ \exp\bigl( \sum_{j=1}^{m} \theta_j X_j - \psi_\lambda(\theta) \bigr) \cdot 2^{-m} \bigr\}$, $\lambda(x) = 2^{-m}$ for x ∈ Ω being the uniform density. The
+independence of the sufficient statistics Xj under all distributions in EV implies:
+
+\[ \psi_\lambda(\theta) = \sum_{j=1}^{m} \psi(\theta_j), \qquad \psi(\theta) = \log(\cosh(\theta)). \tag{58} \]
+
+We have:
+
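+\[ \nabla\psi_\lambda(\theta) = [\tanh(\theta_j) : j = 1, \dots, d] = \eta_\lambda(\theta), \tag{59} \]
+
+\[ \operatorname{Hess} \psi_\lambda(\theta) = \operatorname{diag}\bigl( \cosh^{-2}(\theta_j) : j = 1, \dots, d \bigr) = \operatorname{diag}\bigl( \mathrm{e}^{-2\psi(\theta_j)} : j = 1, \dots, d \bigr) = I_{B,\lambda}(\theta), \tag{60} \]
+
+\[ I_{B,\lambda}(\theta)^{-1} = \operatorname{diag}\bigl( \cosh^{2}(\theta_j) : j = 1, \dots, d \bigr) = \operatorname{diag}\bigl( \mathrm{e}^{2\psi(\theta_j)} : j = 1, \dots, d \bigr). \tag{61} \]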
+
+The quadratic function $f(X) = a_0 + \sum_j a_j X_j + \sum_{\{i,j\}} a_{i,j} X_i X_j$ has expected value at p = eλ(θ), i.e.,
+relaxed value, equal to:
+
+\[ F(p) = F_\lambda(\theta) = \mathrm{E}_\theta[f(X)] = a_0 + \sum_j a_j \tanh(\theta_j) + \sum_{\{i,j\}} a_{i,j} \tanh(\theta_i) \tanh(\theta_j), \tag{62} \]
+
+and covariance with Xk ∈ B equal to:
+
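+\[
+\begin{aligned}
+\operatorname{Cov}_\theta(f(X), X_k) &= \sum_j a_j \operatorname{Cov}_\theta(X_j, X_k) + \sum_{\{i,j\}} a_{i,j} \operatorname{Cov}_\theta(X_i X_j, X_k) \\
+&= a_k \operatorname{Var}_\theta(X_k) + \sum_{i \neq k} a_{i,k}\, \mathrm{E}_\theta[X_i] \operatorname{Var}_\theta(X_k) \\
+&= \cosh^{-2}(\theta_k) \Bigl( a_k + \sum_{i \neq k} a_{i,k} \tanh(\theta_i) \Bigr).
+\end{aligned}
+\tag{63}
+\]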
+
+In the computation, we have used the independence and the special algebra of ±1, which implies
+$X_i^2 = 1$, so that $\operatorname{Cov}_\theta(X_i X_j, X_k) = 0$ if i, j ≠ k, otherwise $\operatorname{Cov}_\theta(X_i X_k, X_k) = \mathrm{E}_\theta[X_i] - \mathrm{E}_\theta[X_i]\, \mathrm{E}_\theta[X_k]^2$;
+see [13].
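+
+Equations (62) and (63) can be verified by brute-force enumeration. The sketch below assumes three binary variables and arbitrary illustrative coefficients (none of them come from the paper) and compares the enumerated expectation and covariances with the closed forms.
+

```python
import itertools
import math

# Hypothetical coefficients for d = 3 binary variables coded as +1 / -1.
theta = [0.3, -0.8, 1.2]
a0 = 0.5
a = [1.0, -2.0, 0.7]
pair = {(0, 1): 0.4, (0, 2): -1.5, (1, 2): 2.0}  # a_{i,j} for i < j

states = list(itertools.product([1, -1], repeat=3))


def prob(x):
    """Product density exp(sum theta_j x_j - psi_lambda(theta)) * 2^-m."""
    return math.prod(math.exp(t * xi) / (2 * math.cosh(t)) for t, xi in zip(theta, x))


def f(x):
    linear = sum(ai * xi for ai, xi in zip(a, x))
    quadratic = sum(c * x[i] * x[j] for (i, j), c in pair.items())
    return a0 + linear + quadratic


def expect(g):
    return sum(prob(x) * g(x) for x in states)


def cov_brute(k):
    """Covariance of f(X) and X_k by enumeration."""
    return expect(lambda x: f(x) * x[k]) - expect(f) * expect(lambda x: x[k])


def cov_formula(k):
    """Right-hand side of Equation (63)."""
    inner = a[k] + sum(
        pair[(min(i, k), max(i, k))] * math.tanh(theta[i]) for i in range(3) if i != k
    )
    return inner / math.cosh(theta[k]) ** 2


def relaxed_formula():
    """Right-hand side of Equation (62)."""
    t = [math.tanh(v) for v in theta]
    return a0 + sum(ai * ti for ai, ti in zip(a, t)) + sum(
        c * t[i] * t[j] for (i, j), c in pair.items()
    )
```
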
+
+The Euclidean gradient, the natural gradient and the Riemannian gradient are, respectively,
+
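+\[ \nabla F_\lambda(\theta) = \Bigl[ \cosh^{-2}(\theta_j) \Bigl( a_j + \sum_{i \neq j} a_{i,j} \tanh(\theta_i) \Bigr) : j = 1, \dots, d \Bigr], \tag{64} \]
+
+\[ \widetilde\nabla F(e_\lambda(\theta)) = \Bigl[ a_j + \sum_{i \neq j} a_{i,j} \tanh(\theta_i) : j = 1, \dots, d \Bigr], \tag{65} \]
+
+\[ \nabla F(e_\lambda(\theta)) = \sum_{j=1}^{m} \Bigl( a_j + \sum_{i \neq j} a_{i,j}\, \mathrm{E}_\theta[X_i] \Bigr) \bigl( X_j - \mathrm{E}_\theta[X_j] \bigr). \tag{66} \]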
+
+The (natural) gradient flow equations are:
+
+\[ \dot\theta_j(t) = a_j + \sum_{i \neq j} a_{i,j} \tanh(\theta_i(t)), \qquad j = 1, \dots, d. \tag{67} \]
+
+Equations (64)–(66) are usable in practice if the aj’s and the ai,j’s are estimable. Otherwise, one
+can use Equation (63) and the following forms of the gradients:
+
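+\[ \nabla F_\lambda(\theta) = \bigl[ \operatorname{Cov}_\theta(X_j, f(X)) : j = 1, \dots, d \bigr], \tag{68} \]
+
+\[ \widetilde\nabla F(e_\lambda(\theta)) = \bigl[ \cosh^{2}(\theta_j) \operatorname{Cov}_\theta(f(X), X_j) : j = 1, \dots, d \bigr], \tag{69} \]
+
+in which case, the gradient flow equations are:
+
+\[ \dot\theta_j(t) = \cosh^{2}(\theta_j) \operatorname{Cov}_\theta(f(X), X_j), \qquad j = 1, \dots, d. \tag{70} \]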
+
+Let us study the relaxed function in the expectation parameters ηj = ηj(θ), j = 1, . . . , d,
+
+\[ F_\lambda(\eta) = a_0 + \sum_j a_j \eta_j + \sum_{\{i,j\}} a_{i,j} \eta_i \eta_j, \qquad \eta \in \left]-1, +1\right[^m. \tag{71} \]
+
+The Euclidean gradient with respect to η has components:
+
+\[ \partial_j F_\lambda(\eta) = a_j + \sum_{i \neq j} a_{i,j} \eta_i, \tag{72} \]
+
+which are equal to the components of the natural gradient; see Section 2.4.1. As:
+
+\[ \dot\eta_j(t) = \frac{d}{dt} \tanh(\theta_j(t)) = \cosh^{-2}(\theta_j(t))\, \dot\theta_j(t) = \bigl( 1 - \eta_j(t)^2 \bigr) \dot\theta_j(t), \qquad j = 1, \dots, m, \tag{73} \]
+
+the gradient flow expressed in the η-parameters has equations:
+
+\[ \dot\eta_j(t) = \bigl( 1 - \eta_j(t)^2 \bigr) \Bigl( a_j + \sum_{i \neq j} a_{i,j} \eta_i(t) \Bigr), \qquad j = 1, \dots, d. \tag{74} \]
+
+Alternatively, in vector form,
+
+\[ \dot\eta(t) = \operatorname{diag}\bigl( 1 - \eta_j(t)^2 : j = 1, \dots, d \bigr) \bigl( a + A \eta(t) \bigr), \tag{75} \]
+
+where a = [aj : j = 1, . . . , d]t and Ai,j = 0 if i = j, Aij = ai,j. The matrix A is symmetric with zero
+diagonal, and it has the meaning of the adjacency matrix of the (weighted) interaction graph. We do
+not know a closed-form solution of Equation (74). An example of a numerical solution is shown in
+Figure 3.
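+
+Since no closed form is available, Equation (75) can be integrated numerically. The following is a minimal explicit Euler sketch for d = 2 with assumed coefficients and step size (these values are illustrative and are not the ones behind Figure 3); it records the relaxed value along the discrete trajectory.
+

```python
# Hypothetical coefficients for two binary variables.
a = [0.5, -0.3]
A = [[0.0, 1.0],
     [1.0, 0.0]]  # symmetric, zero diagonal; A[0][1] plays the role of a_{1,2}


def relaxed(eta):
    """Relaxed function of Equation (71) with a_0 = 0."""
    return a[0] * eta[0] + a[1] * eta[1] + A[0][1] * eta[0] * eta[1]


def velocity(eta):
    """Right-hand side of Equation (75)."""
    return [
        (1.0 - eta[j] ** 2) * (a[j] + sum(A[i][j] * eta[i] for i in range(2)))
        for j in range(2)
    ]


eta = [0.0, 0.0]   # start from the uniform distribution
dt = 0.01          # assumed step size
values = [relaxed(eta)]
for _ in range(500):
    v = velocity(eta)
    eta = [e + dt * vj for e, vj in zip(eta, v)]
    values.append(relaxed(eta))
```

+
+For this small step the recorded values do not decrease and η stays inside the open cube, in agreement with the continuous-time flow, along which the relaxed function is non-decreasing.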
+
+2.5.3. Escort Probabilities
+
+For a given a > 0, consider the function $C^{(a)} \colon \mathcal{E}_V$ defined by $C^{(a)}(p) = \int p^a \, d\mu$. We have:
+
+\[ C^{(a)}_p(\theta) = \int \exp\Bigl( a \sum_{j=1}^{m} \theta_j \, {}^{e}U_p X_j - a \psi_p(\theta) \Bigr) p^a \, d\mu \tag{76} \]
+
+and:
+
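+\[
+dC^{(a)}_p(0)\alpha = \int a \Bigl( \sum_{j=1}^{m} \alpha_j \, {}^{e}U_p X_j \Bigr) p^a \, d\mu
+= \sum_{j=1}^{m} \alpha_j \int a \, {}^{e}U_p X_j \, p^a \, d\mu
+= \sum_{j=1}^{m} \alpha_j \operatorname{Cov}_p\bigl( X_j, a p^{a-1} \bigr),
+\tag{77}
+\]
+
+that is, the Euclidean gradient is $\nabla C^{(a)}_p(0) = \operatorname{Cov}_p\bigl( a p^{a-1}, X \bigr)$ (row vector). The natural gradient is
+computed from Equation (35) as:
+
+\[ \widetilde\nabla C^{(a)}(p) = I_B^{-1}(p) \bigl( \nabla C^{(a)}_p(0) \bigr)' = \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p\bigl( X, a p^{a-1} \bigr), \tag{78} \]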
+
+while the Riemannian gradient follows from Equation (36):
+
+\[ \nabla C^{(a)}(p) = \operatorname{Cov}_p\bigl( a p^{a-1}, X \bigr) \operatorname{Cov}_p(X, X)^{-1} \, {}^{e}U_p X. \tag{79} \]
+
+Note that the Riemannian gradient is the orthogonal projection of the random variable $a p^{a-1}$ onto
+the tangent space TpEV = eUpV.
+The probability density $p^a / C^{(a)}(p)$ is called the escort density in the literature on non-extensive
+statistical mechanics; see, e.g., [19] (Section 7.4).
+We now compute the tangent mapping of $\mathcal{E}_V \ni p \mapsto p^a / C^{(a)}(p) \in \mathcal{P}_>$. Let us extend the basis
+X1, . . . , Xm to a basis X1, . . . , Xn, n ≥ m, whose exponential family is full, i.e., equal to P>. The
+non-parametric coordinate of $q = \bigl( \exp\bigl( \sum_{j=1}^{m} \theta_j \, {}^{e}U_p X_j - \psi_p(\theta) \bigr) p \bigr)^a / C^{(a)}_p(\theta)$ in the chart centered at
+$\bar p = p^a / C^{(a)}_p(0)$ is the ¯p-centering of the random variable:
+
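+\[
+\begin{aligned}
+\log\frac{q}{\bar p} &= \log \frac{ \bigl( \exp\bigl( \sum_{j=1}^{m} \theta_j \, {}^{e}U_p X_j - \psi_p(\theta) \bigr) p \bigr)^a / C^{(a)}_p(\theta) }{ p^a / C^{(a)}_p(0) } \\
+&= a \sum_{j=1}^{m} \theta_j \, {}^{e}U_p X_j - a \psi_p(\theta) + \ln C^{(a)}_p(0) - \ln C^{(a)}_p(\theta),
+\end{aligned}
+\tag{80}
+\]
+
+that is,
+
+\[ v = a \sum_{j=1}^{m} \theta_j \, {}^{e}U_{\bar p} X_j. \tag{81} \]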
+
+The coordinates of v in the basis eU ¯pX1, . . . , eU ¯pXn are (aθ1, . . . , aθm, 0, . . . , 0), and the Jacobian of
+θ �→ (aθ, 0n−m) is the m × n matrix [aIm|0m×(n−m)].
+
+2.5.4. Polarization Measure
+
+The polarization measure has been introduced in Economics by [20]. Here, we consider the
+qualitative version of [21]. If π is a distribution of a finite set, the probability that in three independent
+samples from π there are exactly two equal is $3 \sum_j \pi_j^2 (1 - \pi_j)$. If p ∈ EV, define:
+
+\[ G(p) = \int p^2 (1 - p) \, d\mu = C^{(2)}(p) - C^{(3)}(p), \tag{82} \]
+
+where C(2) and C(3) are defined as in Example 2.5.3.
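+
+The probabilistic reading of G can be checked by enumeration. The sketch below assumes a counting reference measure and a hypothetical three-point distribution π; it compares 3·G(π) with the probability that exactly two of three independent samples coincide.
+

```python
import itertools

# Hypothetical distribution on a three-point set (counting measure).
pi = [0.5, 0.3, 0.2]


def C(a):
    """C^(a)(pi) = sum of pi_j ** a, the discrete form of the integral of p^a."""
    return sum(x ** a for x in pi)


G = C(2) - C(3)  # Equation (82)

# Probability that exactly two of three independent samples are equal.
exactly_two_equal = sum(
    pi[i] * pi[j] * pi[k]
    for i, j, k in itertools.product(range(len(pi)), repeat=3)
    if len({i, j, k}) == 2
)
```
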
+From Equation (78), we find the natural gradient:
+
+\[ \widetilde\nabla G(p) = \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p\bigl( X, 2p - 3p^2 \bigr). \tag{83} \]
+
+Note that $\widetilde\nabla G(p) = 0$ if p is constant; see Figure 4.
+
+Figure 4. Normalized polarization.
+
+3. Second Order Calculus
+
+In this section, we turn to considering second order calculus, in particular Hessians, in order to
+prepare the discussion of the Newton method for the relaxed optimization of Section 4.
+
+3.1. Metric Derivative (Levi–Civita connection)
+
+Let V, W : EV be vector fields, that is, V(p), W(p) ∈ TpEV = eUpV, p ∈ EV. Consider the real
+function R = g(V, W): EV → R, whose value at p ∈ EV is R(p) = gp(V(p), W(p)) = Ep [V(p)W(p)].
+Assuming smoothness, we want to compute the derivative of R along the vector field Y : EV, that is,
+(DYR)(p) = dRp(0)α, with α = ˙σp(Y(p)). The expression of R in the chart centered at p is, according
+to Equation (27),
+
+θ �→ Rp(θ) = ˙σp(V(ep(θ)))′IB(ep(θ)) ˙σp(W(ep(θ))) = Vp(θ)′IB,p(θ)Wp(θ),
+(84)
+
+where Vp and Wp are the presentation in the chart of the vector fields V and W, respectively.
+
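+The i-th component $\partial_i R_p(\theta)$ of the Euclidean gradient $\nabla R_p(\theta)$ is:
+
+\[
+\begin{aligned}
+\partial_i R_p(\theta) &= \partial_i \bigl( V_p(\theta)' I_{B,p}(\theta) W_p(\theta) \bigr) \\
+&= \partial_i V_p(\theta)' I_{B,p}(\theta) W_p(\theta) + V_p(\theta)' \partial_i I_{B,p}(\theta) W_p(\theta) + V_p(\theta)' I_{B,p}(\theta) \partial_i W_p(\theta) \\
+&= \Bigl( \partial_i V_p(\theta) + \tfrac{1}{2} I_{B,p}^{-1}(\theta) \partial_i I_{B,p}(\theta) V_p(\theta) \Bigr)' I_{B,p}(\theta) W_p(\theta) \\
+&\quad + V_p(\theta)' I_{B,p}(\theta) \Bigl( \partial_i W_p(\theta) + \tfrac{1}{2} I_{B,p}^{-1}(\theta) \partial_i I_{B,p}(\theta) W_p(\theta) \Bigr),
+\end{aligned}
+\tag{85}
+\]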
+
+so that the derivative at θ along α = ˙σep(θ)(Y(ep(θ))) is:
+
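+\[
+\begin{aligned}
+dR_p(\theta)\alpha &= \Bigl( dV_p(\theta)\alpha + \tfrac{1}{2} I_{B,p}^{-1}(\theta) \bigl( dI_{B,p}(\theta)\alpha \bigr) V_p(\theta) \Bigr)' I_{B,p}(\theta) W_p(\theta) \\
+&\quad + V_p(\theta)' I_{B,p}(\theta) \Bigl( dW_p(\theta)\alpha + \tfrac{1}{2} I_{B,p}^{-1}(\theta) \bigl( dI_{B,p}(\theta)\alpha \bigr) W_p(\theta) \Bigr).
+\end{aligned}
+\tag{86}
+\]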
+
+Proposition 1. If we define DYV to be the vector field on EV, whose value at q = ep(θ) has coordinates
+centered at p given by:
+
+\[ \dot\sigma_p(D_Y V(q)) = dV_p(\theta)\alpha + \tfrac{1}{2} I_{B,p}^{-1}(\theta) \bigl( dI_{B,p}(\theta)\alpha \bigr) V_p(\theta), \qquad \alpha = \dot\sigma_p(Y(q)), \tag{87} \]
+
+then:
+DYg(V, W) = g(DYV, W) + g(V, DYW),
+(88)
+
+i.e., Equation (87) is a metric covariant derivative; see [6] (Ch. 2 §3), [8] (VIII §4), [4] (§5.3.2).
+
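+The metric derivative Equation (87) could be computed from the flow of the vector field Y. Let
+$(t, p) \mapsto \gamma(t, p)$ be the flow of the vector field Y, i.e., δγ(t, p) = Y(γ(t, p)) and γ(0, p) = p. Using
+Equation (23), we have:
+
+\[
+\begin{aligned}
+\left. \frac{d}{dt} \dot\sigma_p(V(\gamma(t, p))) \right|_{t=0} &= \left. \frac{d}{dt} V_p(\sigma_p(\gamma(t, p))) \right|_{t=0} \\
+&= \left. dV_p(\sigma_p(\gamma(t, p)))\, \frac{d}{dt} \sigma_p(\gamma(t, p)) \right|_{t=0} = dV_p(0)\, \dot\sigma_p(\delta\gamma(0, p)) = dV_p(0)\, \dot\sigma_p(Y(p)),
+\end{aligned}
+\tag{89}
+\]
+
+and:
+
+\[
+\left. \frac{d}{dt} I_B(\gamma(t, p)) \right|_{t=0} = \left. \frac{d}{dt} I_{B,p}(\sigma_p(\gamma(t, p))) \right|_{t=0} = dI_{B,p}(0)\, \dot\sigma_p(\delta\gamma(0, p)) = dI_{B,p}(0)\, \dot\sigma_p(Y(p)),
+\tag{90}
+\]
+
+so that:
+
+\[
+\dot\sigma_p(D_Y V(p)) = \left. \frac{d}{dt} \dot\sigma_p(V(\gamma(t, p))) \right|_{t=0} + \frac{1}{2} I_B^{-1}(p) \left( \left. \frac{d}{dt} I_B(\gamma(t, p)) \right|_{t=0} \right) V_p(0).
+\tag{91}
+\]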
+Let us check the symmetry of the metric covariant derivative to show that it is actually the unique
+Riemannian or Levi–Civita affine connection; see [6] (Th. 3.6).
+The Lie bracket of the vector fields V and W is the vector field [V, W], whose coordinates are:
+
+[V, W]p(θ) = dVp(0)˙σp(W(p)) − dWp(0) ˙σp(V(p)).
+(92)
+
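+As the ij entry of $\partial_k I_{B,p}(0)$ is $\partial_k \partial_i \partial_j \psi_p(0)$, the symmetry $(dI_{B,p}(0)\alpha)\beta = (dI_{B,p}(0)\beta)\alpha$ holds,
+and we have:
+
+\[
+\begin{aligned}
+\dot\sigma_p\bigl( D_W V(p) - D_V W(p) \bigr) &= dV_p(0)\, \dot\sigma_p(W(p)) + \tfrac{1}{2} I_B^{-1}(p) \bigl( dI_{B,p}(0)\, \dot\sigma_p(W(p)) \bigr) V_p(0) \\
+&\quad - dW_p(0)\, \dot\sigma_p(V(p)) - \tfrac{1}{2} I_B^{-1}(p) \bigl( dI_{B,p}(0)\, \dot\sigma_p(V(p)) \bigr) W_p(0) \\
+&= \dot\sigma_p\bigl( [V, W](p) \bigr).
+\end{aligned}
+\tag{93}
+\]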
+
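+The term $\Gamma^k(p) = \tfrac{1}{2} I_{B,p}^{-1}(0)\, \partial_k I_{B,p}(0)$ of Equation (87) is sometimes referred to as the Christoffel
+matrix, but we do not use this terminology in this paper. As:
+
+\[ I_{B,p}(\theta) = I_B(e_p(\theta)) = \bigl[ \operatorname{Cov}_{e_p(\theta)}(X_i, X_j) \bigr]_{i,j=1,\dots,m} = \bigl[ \partial_i \partial_j \psi_p(\theta) \bigr]_{i,j=1,\dots,m}, \tag{94} \]
+
+we have $\partial_k I_B(e_p(\theta)) = [\partial_i \partial_j \partial_k \psi_p(\theta)]_{i,j=1,\dots,m} = \bigl[ \operatorname{Cov}_{e_p(\theta)}(X_i, X_j, X_k) \bigr]_{i,j=1,\dots,m}$ and:
+
+\[ \Gamma^k(p) = \frac{1}{2} \bigl[ \operatorname{Cov}_p(X_i, X_j) \bigr]^{-1}_{i,j=1,\dots,m} \bigl[ \operatorname{Cov}_p(X_i, X_j, X_k) \bigr]_{i,j=1,\dots,m}. \tag{95} \]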
+If V, W are vector fields of TEV, we have:
+
+\[ \Gamma(p, V, W) = \frac{1}{2} I_B^{-1}(p) \operatorname{Cov}_p(X, V, W) = \frac{1}{2} I_B^{-1}(p)\, \mathrm{E}_p\bigl[ {}^{e}U_p X \, V W \bigr], \tag{96} \]
+
+which is the projection of V(p)W(p)/2 on eUpV.
+Notice also that:
+
+\[ \bigl( dI_{B,p}^{-1}(0)\alpha \bigr) I_{B,p}(0) = -I_{B,p}^{-1}(0) \bigl( dI_{B,p}(0)\alpha \bigr) I_{B,p}^{-1}(0) I_{B,p}(0) = -I_{B,p}^{-1}(0) \bigl( dI_{B,p}(0)\alpha \bigr). \tag{97} \]
+
+3.2. Acceleration
+
+Let p(t), t ∈ I, be a smooth curve in EV. Then, the velocity $\delta p(t) = \frac{d}{dt} \log(p(t))$ is a vector field
+V(p(t)) = δp(t), defined on the support p(I) of the curve. As the curve is the flow of the velocity
+field, we can compute the metric derivative of the velocity along the velocity itself, $D_{\delta p}\delta p$, from
+Equation (91) with V(p(0)) = δp(0):
+
+\[
+\begin{aligned}
+\dot\sigma_{p(0)}\bigl( D_{\delta p}\delta p \bigr)(p(0)) &= \left. \frac{d}{dt} \dot\sigma_{p(0)}(\delta p(t)) \right|_{t=0} + \frac{1}{2} I_B^{-1}(p(0)) \left( \left. \frac{d}{dt} I_B(p(t)) \right|_{t=0} \right) \dot\sigma_{p(0)}(\delta p(0)) \\
+&= \left. \frac{d^2}{dt^2} \sigma_{p(0)}(p(t)) \right|_{t=0} + \frac{1}{2} I_B^{-1}(p(0)) \left( \left. \frac{d}{dt} I_B(p(t)) \right|_{t=0} \right) \dot\sigma_{p(0)}(\delta p(0)),
+\end{aligned}
+\tag{98}
+\]
+
+which can be defined to be the Riemannian acceleration of the curve at t = 0.
+Let us write θ(t) = σp(p(t)), p = p(0) and:
+
+\[ p(t) = \exp\Bigl( \sum_{j=1}^{m} \theta_j(t) \, {}^{e}U_p X_j - \psi_p(\theta(t)) \Bigr) \cdot p, \tag{99} \]
+
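+so that $\dot\sigma_p(\delta p(0)) = \dot\theta(0)$ and $\left. \frac{d^2}{dt^2} \sigma_p(p(t)) \right|_{t=0} = \ddot\theta(0)$. We have:
+
+\[
+\left. \frac{d}{dt} I_B(p(t)) \right|_{t=0} = \left. \frac{d}{dt} I_{B,p}(\theta(t)) \right|_{t=0} = \left. \frac{d}{dt} \operatorname{Hess} \psi_p(\theta(t)) \right|_{t=0} = \operatorname{Cov}_p\Bigl( X, X, \sum_{j=1}^{m} \dot\theta_j(0) X_j \Bigr)
+\tag{100}
+\]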
+
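+so that the acceleration at p has coordinates:
+
+\[
+\ddot\theta(0) + \frac{1}{2} \sum_{i,j=1}^{m} \dot\theta_i(0) \dot\theta_j(0) \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p(X, X_i, X_j)
+= \ddot\theta(0) + \frac{1}{2} \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p\Bigl( X, \sum_{i=1}^{m} \dot\theta_i(0) X_i, \sum_{j=1}^{m} \dot\theta_j(0) X_j \Bigr).
+\tag{101}
+\]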
+
+A geodesic is a curve whose acceleration is zero at each point. The exponential map is the mapping
+Exp: TEV → EV defined by:
+(p, U) �→ Expp U = p(1),
+(102)
+
+where t �→ p(t) is the geodesic, such that p(0) = p and δp(0) = U, for all U, such that the geodesic
+exists for t = 1.
+The exponential map is a particular retraction, that is, a family of mappings Rp, p ∈ E, from
+the tangent space at p to the manifold, here Rp : TpE → E, such that Rp(0) = p and dRp(0) = Id;
+see [4] (§5.4). It should be noted that exponential manifolds have natural retractions other than Exp, a
+notable one being the exponential family itself. A retraction provides a crucial step in gradient search
+algorithms by mapping a direction of increase of the objective function to a new trial point.
+
+3.2.1. Example: Binary Independent 2.5.2 Continued.
+
+Let us consider the binary independent model of Section 2.5.2. We have
+
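+\[ I_B(e_\lambda(\theta)) = I_{B,\lambda}(\theta) = \operatorname{diag}\bigl( \cosh^{-2}(\theta_j) : j = 1, \dots, d \bigr), \tag{103} \]
+
+it follows that
+
+\[ \partial_k I_{B,\lambda}(\theta) = \partial_k \operatorname{diag}\bigl( \cosh^{-2}(\theta_j) : j = 1, \dots, d \bigr) = -2 \cosh^{-3}(\theta_k) \sinh(\theta_k) E_{kk}, \tag{104} \]
+
+where Ekk is the d × d matrix with entry one at (k, k), zero otherwise. The k-th Christoffel matrix in
+the second term in the definition of the metric derivative (aka Levi–Civita connection) is:
+
+\[ \Gamma^k_B(e_\lambda(\theta)) = \Gamma^k_\lambda(\theta) = \frac{1}{2} I_{B,\lambda}^{-1}(\theta)\, \partial_k I_{B,\lambda}(\theta) = -\tanh(\theta_k) E_{kk}. \tag{105} \]
+
+In terms of the moments, we have $I_{B,\lambda}(\theta) = \operatorname{Cov}_\theta(X, X') = \operatorname{Hess} \psi_\lambda(\theta)$. As $\partial_k \partial_i \partial_j \psi_\lambda(\theta) =
+\operatorname{Cov}_\theta(X_k, X_i, X_j)$, we can write:
+
+\[ \partial_k I_{B,\lambda}(\theta) = \partial_k \operatorname{diag}\bigl( \operatorname{Var}_\theta(X_j) : j = 1, \dots, d \bigr) = \operatorname{Cov}_\theta(X_k, X_k, X_k) E_{kk} \tag{106} \]
+
+and:
+
+\[
+\Gamma^k_\lambda(\theta) = \frac{1}{2} \operatorname{Cov}_\theta(X_k, X_k)^{-1} \operatorname{Cov}_\theta(X_k, X_k, X_k) E_{kk}
+= \frac{1}{2} \bigl( 1 - \eta_k^2 \bigr)^{-1} \bigl( -2\eta_k + 2\eta_k^3 \bigr) E_{kk} = -\eta_k E_{kk}.
+\tag{107}
+\]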
+
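+The equations for the geodesics starting from θ(0) with velocity $\dot\theta(0) = u$ are:
+
+\[ \ddot\theta_k(t) + \sum_{i,j=1}^{m} \Gamma^k_{ij}(\theta(t))\, \dot\theta_i(t) \dot\theta_j(t) = \ddot\theta_k(t) - \tanh(\theta_k(t)) \bigl( \dot\theta_k(t) \bigr)^2 = 0, \qquad k = 1, \dots, d. \tag{108} \]
+
+The ordinary differential equation:
+
+\[ \ddot\theta - \tanh(\theta)\, \dot\theta^2 = 0 \tag{109} \]
+
+has the closed form solution:
+
+\[ \theta(t) = \operatorname{gd}^{-1}\Bigl( \operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\, t \Bigr) = \tanh^{-1}\Bigl( \sin\Bigl( \operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\, t \Bigr) \Bigr) \tag{110} \]
+
+for all t, such that:
+
+\[ -\pi/2 < \operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\, t < \pi/2, \tag{111} \]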
+
+where gd: R →] − π/2, +π/2[ is the Gudermannian function, that is, gd′(x) = 1/ cosh x, gd(0) = 0;
+in closed form, gd(x) = arcsin(tanh(x)). In fact, if θ is a solution of Equation (109), then:
+
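+\[ \frac{d}{dt} \operatorname{gd}(\theta(t)) = \frac{\dot\theta(t)}{\cosh(\theta(t))} \tag{112} \]
+
+\[
+\frac{d^2}{dt^2} \operatorname{gd}(\theta(t)) = -\frac{\sinh(\theta(t)) \bigl( \dot\theta(t) \bigr)^2}{\cosh^2(\theta(t))} + \frac{\ddot\theta(t)}{\cosh(\theta(t))}
+= \frac{1}{\cosh(\theta(t))} \Bigl( \ddot\theta(t) - \tanh(\theta(t)) \bigl( \dot\theta(t) \bigr)^2 \Bigr) = 0,
+\tag{113}
+\]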
+
+so that t �→ gd(θ(t)) coincides (where it is defined) with an affine function characterized by the initial
+conditions.
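+
+The closed form in Equation (110) can be compared with a direct numerical integration of Equation (109). The sketch below uses a classical fourth-order Runge–Kutta scheme with assumed initial data (θ(0) = 0.3, ˙θ(0) = 0.5, chosen so that condition (111) holds up to t = 1); the function names are illustrative.
+

```python
import math


def gd(x):
    """Gudermannian function, gd(x) = arcsin(tanh(x))."""
    return math.asin(math.tanh(x))


def gd_inv(v):
    """Inverse Gudermannian on ]-pi/2, pi/2[."""
    return math.atanh(math.sin(v))


def closed_form(theta0, u, t):
    """Geodesic of Equation (110)."""
    return gd_inv(gd(theta0) + u * t / math.cosh(theta0))


def rhs(state):
    """First-order form of Equation (109): (theta, v) -> (v, tanh(theta) v^2)."""
    theta, v = state
    return (v, math.tanh(theta) * v * v)


def rk4(theta0, u, t_end, steps):
    """Integrate Equation (109) with the classical Runge-Kutta scheme."""
    h = t_end / steps
    s = (theta0, u)
    for _ in range(steps):
        k1 = rhs(s)
        k2 = rhs((s[0] + h / 2 * k1[0], s[1] + h / 2 * k1[1]))
        k3 = rhs((s[0] + h / 2 * k2[0], s[1] + h / 2 * k2[1]))
        k4 = rhs((s[0] + h * k3[0], s[1] + h * k3[1]))
        s = (
            s[0] + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            s[1] + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]),
        )
    return s[0]


theta0, u, t_end = 0.3, 0.5, 1.0   # assumed initial data
theta_numeric = rk4(theta0, u, t_end, 1000)
theta_exact = closed_form(theta0, u, t_end)
```
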
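+In particular, at t = 1, the geodesic Equation (110) defines the Riemannian exponential
+Exp: TEV → EV.
+If (p, U) ∈ TEV, that is, p ∈ EV and U ∈ TpEV, then σλ(p) = θ(0) and
+$U = \sum_j u_j \, {}^{e}U_p X_j$, $\dot\sigma_\lambda(U) = u$. If:
+
+\[ -\pi/2 < \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} < \pi/2, \tag{114} \]
+
+then we can take $\dot\theta(0) = u$ and t = 1, so that:
+
+\[
+\begin{aligned}
+\operatorname{Exp}_p \colon U &\xrightarrow{\ \dot\sigma_\lambda\ } u \mapsto \Bigl[ \operatorname{gd}^{-1}\Bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \Bigr) : j = 1, \dots, d \Bigr] \\
+&\xrightarrow{\ e_\lambda\ } \prod_{j=1}^{m} \exp\Bigl( \operatorname{gd}^{-1}\Bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \Bigr) X_j - \psi\Bigl( \operatorname{gd}^{-1}\Bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \Bigr) \Bigr) \Bigr)\, 2^{-m}.
+\end{aligned}
+\tag{115}
+\]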
+
+We have:
+
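+\[ \exp\bigl( \operatorname{gd}^{-1}(v) \bigr) = \exp\bigl( \tanh^{-1}(\sin(v)) \bigr) = \sqrt{\frac{1 + \sin v}{1 - \sin v}} \tag{116} \]
+
+and:
+
+\[ \psi\bigl( \operatorname{gd}^{-1}(v) \bigr) = \log\bigl( \cosh\bigl( \operatorname{gd}^{-1}(v) \bigr) \bigr) = \log\Bigl( \frac{1}{\cos v} \Bigr), \tag{117} \]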
+
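+hence $u \mapsto \operatorname{Exp}_p\bigl( \sum_{j=1}^{d} u_j \, {}^{e}U_p X_j \bigr)$ is given for:
+
+\[ u \in \prod_{j=1}^{d} \Bigl] \cosh(\theta_j) \bigl( -\pi/2 - \operatorname{gd}(\theta_j) \bigr),\ \cosh(\theta_j) \bigl( \pi/2 - \operatorname{gd}(\theta_j) \bigr) \Bigr[, \tag{118} \]
+
+by:
+
+\[
+\begin{aligned}
+\operatorname{Exp}_\theta(u) &= \prod_{j=1}^{m} \cos\Bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \Bigr) \left( \frac{ 1 + \sin\bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \bigr) }{ 1 - \sin\bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \bigr) } \right)^{X_j/2} 2^{-m} \\
+&= \prod_{j=1}^{m} \Bigl( 1 + \sin\Bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \Bigr) X_j \Bigr)\, 2^{-m} \in \mathcal{E}_V.
+\end{aligned}
+\tag{119}
+\]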
+
+The expectation parameters are:
+
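+\[
+\eta_i(t) = \mathrm{E}_{\theta=0}\Bigl[ X_i \prod_{j=1}^{m} \Bigl( 1 + \sin\Bigl( \operatorname{gd}(\theta_j) + \frac{t u_j}{\cosh(\theta_j)} \Bigr) X_j \Bigr) \Bigr]
+= \sin\Bigl( \operatorname{gd}(\theta_i) + \frac{t u_i}{\cosh(\theta_i)} \Bigr),
+\tag{120}
+\]
+
+and:
+
+\[ \operatorname{gd}(\theta_j) = \arcsin(\eta_j), \qquad \cosh(\theta_j) = \frac{1}{\bigl( 1 - \eta_j^2 \bigr)^{1/2}}, \tag{121} \]
+
+so that the exponential in terms of the expectation parameters is:
+
+\[ \operatorname{Exp}_\eta(u) = \Bigl[ \sin\Bigl( \arcsin \eta_j + \bigl( 1 - \eta_j^2 \bigr)^{1/2} u_j \Bigr) : j = 1, \dots, m \Bigr]. \tag{122} \]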
+
+The inverse of the Riemannian exponential provides a notion of translation between two elements
+of the exponential model, which is a particular parametrization of the model:
+
+\[ \overrightarrow{\eta_1 \eta_2} = \operatorname{Exp}_{\eta_1}^{-1} \eta_2 = \Bigl[ \bigl( 1 - (\eta_1^j)^2 \bigr)^{-1/2} \bigl( \arcsin \eta_2^j - \arcsin \eta_1^j \bigr) : j = 1, \dots, m \Bigr] \tag{123} \]
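+
+Equations (122) and (123) are inverse to each other, which is easy to confirm numerically. The sketch below assumes two arbitrary points in the open square ]−1, 1[² (the first one is the base point of Figure 5, the second is hypothetical) and performs a round trip.
+

```python
import math


def exp_eta(eta, u):
    """Riemannian exponential in expectation parameters, Equation (122)."""
    return [
        math.sin(math.asin(e) + math.sqrt(1.0 - e * e) * uj)
        for e, uj in zip(eta, u)
    ]


def log_eta(eta1, eta2):
    """Inverse exponential, Equation (123): tangent coordinates from eta1 to eta2."""
    return [
        (math.asin(b) - math.asin(a)) / math.sqrt(1.0 - a * a)
        for a, b in zip(eta1, eta2)
    ]


eta1 = [0.75, 0.75]   # base point
eta2 = [-0.2, 0.4]    # hypothetical target point
u = log_eta(eta1, eta2)
round_trip = exp_eta(eta1, u)
```
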
+
+In particular, at θ = 0, we have the geodesic:
+
+\[ t \mapsto \prod_{j=1}^{d} \bigl( 1 + \sin(t u_j) X_j \bigr)\, 2^{-m}, \qquad |t| < \frac{\pi}{2 \max_j |u_j|} \tag{124} \]
+
+Some geodesic curves are shown in Figure 5.
+
+[Plot: geodesic curves in the expectation parameters; horizontal axis η1 and vertical axis η2, both ranging from −1.0 to 1.0.]
+
+Figure 5. Geodesics from η = (0.75, 0.75).
+
+3.3. Riemannian Hessian
+
+Let φ: EV → R with Riemannian gradient $\nabla\varphi(p) = \sum_i (\widetilde\nabla\varphi)_i(p) \, {}^{e}U_p X_i$, $\widetilde\nabla\varphi(p) = I_B^{-1}(p) \nabla\varphi_p(0)$.
+The Riemannian Hessian of φ is the metric derivative of the gradient ∇φ along the vector field Y, that
+is, HessY φ = DY∇φ; see [6] (Ch. 6, Ex. 11), [4] (§5.5). In the following, we denote by the symbol Hess,
+without a subscript, the ordinary Hessian matrix.
+From Equation (87), we have the coordinates of HessY φ(p). Given a generic tangent vector α, we
+compute from Equation (38):
+
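+\[
+\begin{aligned}
+\left. d(\nabla\varphi)_p(\theta)\alpha \right|_{\theta=0} &= \left. d\bigl( I_{B,p}^{-1}(\theta) \nabla\varphi_p(\theta) \bigr)\alpha \right|_{\theta=0} \\
+&= \bigl( dI_{B,p}^{-1}(0)\alpha \bigr) \nabla\varphi_p(0) + I_{B,p}^{-1}(0) \operatorname{Hess} \varphi_p(0)\alpha \\
+&= -I_B^{-1}(p) \bigl( dI_{B,p}(0)\alpha \bigr) \widetilde\nabla\varphi(p) + I_B^{-1}(p) \operatorname{Hess} \varphi_p(0)\alpha
+\end{aligned}
+\tag{125}
+\]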
+
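+and, upon substitution of $(\nabla\varphi)_p$ for $V_p$ in Equation (87),
+
+\[
+\begin{aligned}
+\dot\sigma_p(\operatorname{Hess}_Y \varphi(p)) &= d(\nabla\varphi)_p(0)\alpha + \tfrac{1}{2} I_B^{-1}(p) \bigl( dI_{B,p}(0)\alpha \bigr) (\nabla\varphi)_p(0), \qquad \alpha = \dot\sigma_p(Y(p)) \\
+&= -I_B^{-1}(p) \bigl( dI_{B,p}(0)\alpha \bigr) \widetilde\nabla\varphi(p) + I_B^{-1}(p) \operatorname{Hess} \varphi_p(0)\alpha + \tfrac{1}{2} I_B^{-1}(p) \bigl( dI_{B,p}(0)\alpha \bigr) \widetilde\nabla\varphi(p) \\
+&= I_B^{-1}(p) \operatorname{Hess} \varphi_p(0)\alpha - \tfrac{1}{2} I_B^{-1}(p) \bigl( dI_{B,p}(0)\alpha \bigr) \widetilde\nabla\varphi(p) \\
+&= I_B^{-1}(p) \Bigl( \operatorname{Hess} \varphi_p(0)\alpha - \tfrac{1}{2} \bigl( dI_{B,p}(0)\alpha \bigr) \widetilde\nabla\varphi(p) \Bigr)
+\end{aligned}
+\tag{126}
+\]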
+
+HessY φ is characterized by knowing the value of g(HessY φ, X): EV for all vector fields X. We have
+from Equation (126), with α = ˙σp(Y(p)) and β = ˙σp(X(p)),
+
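+\[ g_p\bigl( \operatorname{Hess}_{Y(p)} \varphi(p), X(p) \bigr) = \beta' \operatorname{Hess} \varphi_p(0)\alpha - \frac{1}{2} \beta' \bigl( dI_{B,p}(0)\alpha \bigr) \widetilde\nabla\varphi(p). \tag{127} \]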
+
+This is the presentation of the Riemannian Hessian as a bi-linear form on TEV; see the comments
+in [4] (Prop. 5.5.2-3). Note that the Riemannian Hessian is positive definite if:
+
+\[ \alpha' \operatorname{Hess} \varphi_p(0)\alpha \geq \frac{1}{2} \alpha' \bigl( dI_{B,p}(0)\alpha \bigr) \widetilde\nabla\varphi(p), \qquad \alpha \in \mathbb{R}^m. \tag{128} \]
+
+4. Application to Combinatorial Optimization
+
+We conclude our paper by showing how the geometric method applies to the problem of finding
+the maximum of the expected value of a function.
+
+4.1. Hessian of a Relaxed Function
+
+Here is a key example of vector field. Let f be any bounded random variable, and define the
+relaxed function to be φ(p) = Ep [ f ], p ∈ P>. Define F(p) to be the projection of f, as an element of
+L2(p), onto TpEV = eUpV, i.e., F(p) is the element of eUpV, such that:
+
+Ep [( f − F(p))v] = 0,
+v ∈ eUpV
+(129)
+
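+In the basis eUpB, we have $F(p) = \sum_i \hat f_{p,i} \, {}^{e}U_p X_i$ and:
+
+\[ \operatorname{Cov}_p(f, X_j) = \sum_i \hat f_{p,i}\, \mathrm{E}_p\bigl[ {}^{e}U_p X_i \; {}^{e}U_p X_j \bigr], \qquad j = 1, \dots, m, \tag{130} \]
+
+so that $\hat f_p = I_B^{-1}(p) \operatorname{Cov}_p(X, f)$ and
+
+\[ F(p) = \hat f_p' \, {}^{e}U_p X = \operatorname{Cov}_p(f, X)\, I_B^{-1}(p) \, {}^{e}U_p X. \tag{131} \]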
+
+Let us compute the gradient of the relaxed function φ = E· [ f ] : EV. We have $\varphi_p(\theta) = \mathrm{E}_{e_p(\theta)}[f]$,
+and from the properties of exponential families, the Euclidean gradient is $\nabla\varphi_p(0) = \operatorname{Cov}_p(f, X)$. It
+follows that the natural gradient is:
+
+\[ \widetilde\nabla\varphi(p) = I_B^{-1}(p) \operatorname{Cov}_p(X, f) = \hat f_p, \tag{132} \]
+
+and the Riemannian gradient is ∇φ(p) = F(p).
+From the properties of exponential families, we have:
+
+Hess φp(0) = Covp (X, X, f ) ,
+
+so that, in this case, Equation (127), when written in terms of the moments, is:
+
+\[
+\beta'\operatorname{Cov}_p(X, X, f)\,\alpha - \tfrac{1}{2}\,\beta'\operatorname{Cov}_p(X, X, \alpha\cdot X)\operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p(X, f). \tag{133}
+\]
+
+4.1.1. Example: Binary Independent (Sections 2.5.2 and 3.2.1 Continued)
+
+We list below the computation of the Hessian in the case of two binary independent variables.
+Computations were done with Sage [22], which allows both the reduction $x_i^2 = 1$ in the ring of
+polynomials and the simplifications in the symbolic ring of parameters.
+
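+\[
+\operatorname{Cov}_\eta(X, f) =
+\begin{pmatrix}
+-(\eta_1^2 - 1)\,a_1 - (\eta_1^2\eta_2 - \eta_2)\,a_{12} \\
+-(\eta_2^2 - 1)\,a_2 - (\eta_1\eta_2^2 - \eta_1)\,a_{12}
+\end{pmatrix}
+=
+\begin{pmatrix}
+-(\eta_1 - 1)(\eta_1 + 1)(a_{12}\eta_2 + a_1) \\
+-(\eta_2 - 1)(\eta_2 + 1)(a_{12}\eta_1 + a_2)
+\end{pmatrix}
+\tag{134}
+\]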
+
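+\[
+\operatorname{Cov}_\eta(X, X) =
+\begin{pmatrix}
+-\eta_1^2 + 1 & 0 \\
+0 & -\eta_2^2 + 1
+\end{pmatrix}
+=
+\begin{pmatrix}
+-(\eta_1 - 1)(\eta_1 + 1) & 0 \\
+0 & -(\eta_2 - 1)(\eta_2 + 1)
+\end{pmatrix}
+\tag{135}
+\]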
+
+
+\[
+\operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, f) =
+\begin{pmatrix}
+a_{12}\eta_2 + a_1 \\
+a_{12}\eta_1 + a_2
+\end{pmatrix}
+= \nabla F(\eta)
+\tag{136}
+\]
+
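+\[
+\begin{aligned}
+\operatorname{Cov}_\eta(X, X, f)
+&=
+\begin{pmatrix}
+2(\eta_1^3 - \eta_1)\,a_1 + 2(\eta_1^3\eta_2 - \eta_1\eta_2)\,a_{12} & (\eta_1^2\eta_2^2 - \eta_1^2 - \eta_2^2 + 1)\,a_{12} \\
+(\eta_1^2\eta_2^2 - \eta_1^2 - \eta_2^2 + 1)\,a_{12} & 2(\eta_1\eta_2^3 - \eta_1\eta_2)\,a_{12} + 2(\eta_2^3 - \eta_2)\,a_2
+\end{pmatrix} \\
+&=
+\begin{pmatrix}
+2(\eta_1 - 1)(\eta_1 + 1)(a_{12}\eta_2 + a_1)\,\eta_1 & (\eta_2 - 1)(\eta_2 + 1)(\eta_1 - 1)(\eta_1 + 1)\,a_{12} \\
+(\eta_2 - 1)(\eta_2 + 1)(\eta_1 - 1)(\eta_1 + 1)\,a_{12} & 2(\eta_2 - 1)(\eta_2 + 1)(a_{12}\eta_1 + a_2)\,\eta_2
+\end{pmatrix}
+\end{aligned}
+\tag{137}
+\]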
+
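+\[
+\operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, X, f) =
+\begin{pmatrix}
+-2(a_{12}\eta_2 + a_1)\,\eta_1 & -a_{12}\eta_2^2 + a_{12} \\
+-a_{12}\eta_1^2 + a_{12} & -2(a_{12}\eta_1 + a_2)\,\eta_2
+\end{pmatrix}
+\tag{138}
+\]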
+
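+\[
+\operatorname{Cov}_\eta(X, X, \nabla F(\eta)) =
+\begin{pmatrix}
+2(a_{12}\eta_2 + a_1)(\eta_1 + 1)(\eta_1 - 1)\,\eta_1 & 0 \\
+0 & 2(a_{12}\eta_1 + a_2)(\eta_2 + 1)(\eta_2 - 1)\,\eta_2
+\end{pmatrix}
+\tag{139}
+\]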
+
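+\[
+\operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, X, \nabla F(\eta)) =
+\begin{pmatrix}
+-2(a_{12}\eta_2 + a_1)\,\eta_1 & 0 \\
+0 & -2(a_{12}\eta_1 + a_2)\,\eta_2
+\end{pmatrix}
+\tag{140}
+\]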
+
+The Riemannian Hessian as a matrix in the basis of the tangent space is:
+
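+\[
+\operatorname{Hess} F(\eta) = \operatorname{Cov}_\eta(X, X)^{-1}\Bigl(\operatorname{Cov}_\eta(X, X, f) - \tfrac{1}{2}\operatorname{Cov}_\eta(X, X, \nabla F(\eta))\Bigr) =
+\begin{pmatrix}
+-(a_{12}\eta_2 + a_1)\,\eta_1 & -a_{12}(\eta_2 + 1)(\eta_2 - 1) \\
+-a_{12}(\eta_1 + 1)(\eta_1 - 1) & -(a_{12}\eta_1 + a_2)\,\eta_2
+\end{pmatrix}
+\tag{141}
+\]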
+
+As a check, let us compute the Riemannian Hessian as a natural Hessian in the Riemannian
+parameters, $\operatorname{Hess}\phi\circ\operatorname{Exp}_p(u)\big|_{u=0}$; see [4] (Prop. 5.5.4). We have:
+
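+\[
+\begin{aligned}
+F\circ\operatorname{Exp}_\eta(u) ={}& a_{12}\sin\Bigl(\sqrt{-\eta_1^2 + 1}\,u_1 + \arcsin(\eta_1)\Bigr)\sin\Bigl(\sqrt{-\eta_2^2 + 1}\,u_2 + \arcsin(\eta_2)\Bigr) \\
+&+ a_1\sin\Bigl(\sqrt{-\eta_1^2 + 1}\,u_1 + \arcsin(\eta_1)\Bigr) + a_2\sin\Bigl(\sqrt{-\eta_2^2 + 1}\,u_2 + \arcsin(\eta_2)\Bigr)
+\end{aligned}
+\tag{142}
+\]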
+
+and:
+
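+\[
+\begin{aligned}
+\operatorname{Hess} F\circ\operatorname{Exp}_\eta(u)\Big|_{u=0}
+&=
+\begin{pmatrix}
+(\eta_1^2 - 1)\,a_{12}\eta_1\eta_2 + (\eta_1^2 - 1)\,a_1\eta_1 & (\eta_1^2 - 1)(\eta_2^2 - 1)\,a_{12} \\
+(\eta_1^2 - 1)(\eta_2^2 - 1)\,a_{12} & (\eta_2^2 - 1)\,a_{12}\eta_1\eta_2 + (\eta_2^2 - 1)\,a_2\eta_2
+\end{pmatrix} \\
+&=
+\begin{pmatrix}
+(a_{12}\eta_2 + a_1)(\eta_1 + 1)(\eta_1 - 1)\,\eta_1 & a_{12}(\eta_1 + 1)(\eta_1 - 1)(\eta_2 + 1)(\eta_2 - 1) \\
+a_{12}(\eta_1 + 1)(\eta_1 - 1)(\eta_2 + 1)(\eta_2 - 1) & (a_{12}\eta_1 + a_2)(\eta_2 + 1)(\eta_2 - 1)\,\eta_2
+\end{pmatrix}.
+\end{aligned}
+\tag{143}
+\]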
+
+
+Note the presence of the factor Covη (X, X).
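+
+As a quick consistency check, one can multiply the Riemannian Hessian of Equation (141) by the covariance matrix of Equation (135). This is only a sketch: the shorthand $A = a_{12}\eta_2 + a_1$ and $B = a_{12}\eta_1 + a_2$ is introduced here for brevity and is not part of the original notation. Under that shorthand, the product appears to reproduce Equation (143) entry by entry:
+
+\[
+\operatorname{Cov}_\eta(X, X)\operatorname{Hess} F(\eta) =
+\begin{pmatrix}
+1 - \eta_1^2 & 0 \\
+0 & 1 - \eta_2^2
+\end{pmatrix}
+\begin{pmatrix}
+-A\,\eta_1 & -a_{12}(\eta_2^2 - 1) \\
+-a_{12}(\eta_1^2 - 1) & -B\,\eta_2
+\end{pmatrix}
+=
+\begin{pmatrix}
+(\eta_1^2 - 1)\,A\,\eta_1 & a_{12}(\eta_1^2 - 1)(\eta_2^2 - 1) \\
+a_{12}(\eta_1^2 - 1)(\eta_2^2 - 1) & (\eta_2^2 - 1)\,B\,\eta_2
+\end{pmatrix}.
+\]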
+
+4.2. Newton Method
+
+The Newton method is an iterative method that generates a sequence of points pt, with t = 0, 1, . . . ,
+that converges towards a stationary point p̂ of F(p) = Ep [ f ], p ∈ EV, that is, a critical point of the
+vector field p ↦ ∇F(p), ∇F(p̂) = 0. Here, we follow [4] (Ch. 5–6), and in particular Algorithm 5 on
+Page 113.
+Let ∇F be a gradient field. We reproduce in our case the basic derivation of the Newton method
+in the following. Note that, in this section, we use the notation Hess •[α] to denote Hessα •. Using
+the definition of metric derivative, we have for a geodesic curve [0, 1] ∋ t ↦ p(t) ∈ EV connecting
+p = p(0) to p̂ = p(1) that:
+
+\[
+\frac{d}{dt}\,g_{p(t)}\bigl(\nabla F(p(t)),\, \delta p(t)\bigr) = g_{p(t)}\bigl(\operatorname{Hess} F(p(t))[\delta p(t)],\, \delta p(t)\bigr) \tag{144}
+\]
+
+hence the increment from p to ˆp is:
+
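+\[
+g_{\hat{p}}\bigl(\nabla F(\hat{p}),\, \delta p(1)\bigr) - g_p\bigl(\nabla F(p),\, \delta p(0)\bigr) = \int_0^1 g_{p(t)}\bigl(\operatorname{Hess} F(p(t))[\delta p(t)],\, \delta p(t)\bigr)\,dt. \tag{145}
+\]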
+
+Now, we assume that ∇F( ˆp) = 0 and that in Equation (145), the integral is approximated by the
+initial value of the integrand, that is to say, the Hessian is approximately constant on the geodesic from
+p to ˆp; we obtain:
+− gp (∇F(p), δp(0)) = gp (Hess F(p)[δp(0)], δp(0)) + ϵ.
+(146)
+
+If we can solve the Newton equation:
+
+Hess F(p(t))[u] = −∇F(p)
+(147)
+
+then u is approximately equal to the initial velocity of the geodesic connecting p to p̂, that is,
+$\hat{p} = \operatorname{Exp}_p(u)$.
+The particular structure of the exponential manifold suggests at least two natural retractions
+that could be used to move from u to p̂. Namely, we have the Riemannian exponential
+$(\theta_t, \theta_{t+1}) \mapsto \operatorname{Exp}_{\theta_t}(\theta_{t+1} - \theta_t)$ and the e-retraction coming from the exponential family itself and
+defined by $(\theta_t, \theta_{t+1}) \mapsto e_{\theta_t}(\theta_{t+1} - \theta_t)$, with $\theta_{t+1} - \theta_t = u_t$.
+In the θ parameters, with the e-retraction, the Newton method generates a sequence (θt) according
+to the following updating rule:
+
+\[
+\theta_{t+1} = \theta_t - \lambda\operatorname{Hess} F(\theta_t)^{-1}\,\widetilde{\nabla} F(\theta_t) \tag{148}
+\]
+
+where λ > 0 is an extra parameter intended to control the step size and, in turn, the convergence to ˆθ;
+see [5].
+We can rewrite Equation (148) in terms of covariances as:
+
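+\[
+\theta_{t+1} = \theta_t - \lambda\Bigl(\operatorname{Cov}_{\theta_t}(X, X, f) - \tfrac{1}{2}\operatorname{Cov}_{\theta_t}\bigl(X, X, \widetilde{\nabla} F(\theta_t)\bigr)\Bigr)^{-1}\,\widetilde{\nabla} F(\theta_t). \tag{149}
+\]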
+
+4.3. Example: Binary Independent
+
+In the η parameters, the Newton step is:
+
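+\[
+u = -\operatorname{Hess} F(\eta)^{-1}\,\nabla F(\eta) =
+\begin{pmatrix}
+\dfrac{a_{12}^2\eta_1 + a_{12}a_2 + (a_1a_{12}\eta_1 + a_1a_2)\,\eta_2}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\,\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\,\eta_2} \\[2ex]
+\dfrac{a_1a_2\eta_1 + a_1a_{12} + (a_{12}a_2\eta_1 + a_{12}^2)\,\eta_2}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\,\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\,\eta_2}
+\end{pmatrix}
+\tag{150}
+\]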
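+
+The structure of Equation (150) may be easier to read in factored form. The following is a sketch obtained by inverting the 2 × 2 matrix of Equation (141) directly; the symbol D for the common denominator is introduced here and is not part of the original notation:
+
+\[
+D = \det\operatorname{Hess} F(\eta) = (a_{12}\eta_2 + a_1)(a_{12}\eta_1 + a_2)\,\eta_1\eta_2 - a_{12}^2(\eta_1^2 - 1)(\eta_2^2 - 1),
+\qquad
+u = \frac{1}{D}
+\begin{pmatrix}
+(a_{12}\eta_1 + a_2)(a_1\eta_2 + a_{12}) \\
+(a_{12}\eta_2 + a_1)(a_2\eta_1 + a_{12})
+\end{pmatrix}.
+\]
+
+Expanding the products gives back the numerators and the denominator of Equation (150). In this form it is apparent that the Newton step is undefined exactly where D = 0, and that at η = 0 it reduces to $u = -(a_2/a_{12},\, a_1/a_{12})$.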
+
+
+and the new η in the Riemannian retraction is:
+
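+\[
+\operatorname{Exp}_\eta(u) =
+\begin{pmatrix}
+\sin\left(\dfrac{\bigl(a_{12}^2\eta_1 + a_{12}a_2 + (a_1a_{12}\eta_1 + a_1a_2)\,\eta_2\bigr)\sqrt{-\eta_1^2 + 1}}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\,\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\,\eta_2} + \arcsin(\eta_1)\right) \\[3ex]
+\sin\left(\dfrac{\bigl(a_1a_2\eta_1 + a_1a_{12} + (a_{12}a_2\eta_1 + a_{12}^2)\,\eta_2\bigr)\sqrt{-\eta_2^2 + 1}}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\,\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\,\eta_2} + \arcsin(\eta_2)\right)
+\end{pmatrix}.
+\tag{151}
+\]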
+
+In Figure 6, we represented the vector field associated with the Newton step in the η parameters,
+with λ = 0.05, using the Riemannian retraction, for the case a1 = 1, a2 = 2 and a12 = 3, with:
+
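+\[
+\operatorname{Exp}_\eta(u) =
+\begin{pmatrix}
+\sin\left(\lambda\,\dfrac{\sqrt{-\eta_1^2 + 1}\,\bigl((3\eta_1 + 2)\,\eta_2 + 9\eta_1 + 6\bigr)}{3(2\eta_1 + 3)\,\eta_2^2 + 9\eta_1^2 + (3\eta_1^2 + 2\eta_1)\,\eta_2 - 9} + \arcsin(\eta_1)\right) \\[3ex]
+\sin\left(\lambda\,\dfrac{\bigl(3(2\eta_1 + 3)\,\eta_2 + 2\eta_1 + 3\bigr)\sqrt{-\eta_2^2 + 1}}{3(2\eta_1 + 3)\,\eta_2^2 + 9\eta_1^2 + (3\eta_1^2 + 2\eta_1)\,\eta_2 - 9} + \arcsin(\eta_2)\right)
+\end{pmatrix}.
+\tag{152}
+\]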
+
+The red dotted lines represented in the figure identify the basins of attraction of the vector field and
+correspond to the solutions of the explicit equation in η for which the Newton step u is not defined.
+This vector field can be compared to that in Figure 7, associated with the Newton step for F(η) using
+the Euclidean geometry. In the Euclidean geometry, F(η) is a quadratic function with one saddle point,
+so that from any η, the Newton step points in the direction of the critical point. This makes the Newton
+step unsuitable for an optimization algorithm. On the other hand, in the Riemannian geometry, the
+vertices of the polytope are critical points for F(η), and they determine the presence of multiple basins
+of attraction, as expected.
+
+[Figure 6 plot: panel "Expectation parameters"; axes η1 and η2, each ranging from −1.0 to 1.0.]
+
+Figure 6. The Newton step in the η parameters, Riemannian retraction, λ = 0.05. The red dotted lines
+identify the different basins of attraction and correspond to the points for which the Newton step is not
+defined; cf. Equation (150). The instability close to the critical lines is represented by the longer arrows.
+
+
+[Figure 7 plot: panel "Expectation parameters"; axes η1 and η2, each ranging from −1.0 to 1.0.]
+
+Figure 7. The Newton step in the η parameters, Euclidean geometry, λ = 0.05.
+
+[Figure 8 plot: panel "Natural parameters"; axes θ1 and θ2, each ranging from −2 to 2.]
+
+Figure 8. The Newton step in the θ parameters, exponential retraction, λ = 0.015. The red dotted
+lines identify the different basins of attraction and correspond to the points for which the Newton step
+is not defined. The instability along the critical lines, which identifies the basins of attraction, is not
+represented.
+
+
+[Figure 9 plot: panel "Natural parameters"; axes θ1 and θ2, each ranging from −2 to 2.]
+
+Figure 9. The Newton step in the θ parameters, Euclidean geometry, λ = 0.15. The red dotted lines
+identify the different basins of attraction and correspond to the points for which the Newton step is
+not defined. The instability along the critical lines, which identifies the basins of attraction, is not
+represented.
+
+Figure 8 shows the Newton step in the θ parameters based on the e-retraction of Equation (149),
+while Figure 9 represents the Newton step evaluated with respect to the Euclidean geometry. A
+comparison of the two vector fields shows that, differently from the η parameters, the number of
+basins of attraction is the same in the two geometries; however, the scale of the vectors is different.
+In particular, notice how on the plateau, for diverging θ, the Newton step in the Euclidean geometry
+vanishes, while in the Riemannian geometry, it gets larger. This behavior suggests better convergence
+properties for an optimization algorithm based on the Newton step evaluated using the proper
+Riemannian geometry. In the θ parameters, the boundaries of the basins of attraction represented by
+the red dotted lines have been computed numerically and correspond to the values of θ for which the
+update step is not defined.
+Finally, notice that in both the η and θ parameters, the step is not always in the direction of descent
+for the function, a common behavior of the Newton method, which converges to the critical points.
+
+5. Discussion and Conclusions
+
+In this paper, we introduced second-order calculus over a statistical manifold, following the
+approach described in [4], which has been adapted to the special case of exponential statistical
+models [2,3]. By defining the Riemannian Hessian and using the notion of retraction, we developed
+the proper machinery necessary for the definition of the updating rule of the Newton method for the
+optimization of a function defined over an exponential family.
+The examples discussed in the paper show that by taking into account the proper Riemannian
+geometry of a statistical exponential family, the vector fields associated with the Newton step in the
+different parametrizations change profoundly. Not only do new basins of attraction associated with local
+and global minima appear, as for the expectation parameters, but the magnitude of the Newton
+step is also affected, as over the plateau in the natural parameters. Such differences are expected to have a
+strong impact on the performance of an optimization algorithm based on the Newton step, from both
+the point of view of achievable convergence and the speed of convergence to the optimum.
+
+
+The Newton method is a popular second order optimization technique based on the computation
+of the Hessian of the function to be optimized and is well known for its super-linear convergence
+properties. However, the use of the Newton method poses a number of issues in practice.
+First of all, as the examples in Figures 6 and 8 show, the Newton step does not always point
+in the direction of the natural gradient, and the algorithm may not converge to a (local) optimum
+of the function. Such behavior is not unexpected; indeed the Newton method tends to converge to
+critical points of the function to be optimized, which include local minima, local maxima and saddle
+points. In order to obtain a direction of ascent for the function to be optimized, the Hessian must
+be negative-definite, i.e., its eigenvalues must be strictly negative, which is not guaranteed in the
+general case. Another important remark is related to the computational complexity associated with
+the evaluation of the Hessian, compared to the (natural) gradient. Indeed, to obtain the Newton step d,
+Christoffel matrices have to be evaluated, together with the third order covariances between sufficient
+statistics and the function, and the Hessian has to be inverted. Finally, notice that when the Hessian is
+close to being non-invertible, numerical problems may arise in the computation of the Newton step,
+and the algorithm may become unstable and diverge.
+In the literature, different methods have been proposed to overcome these issues. Among them,
+we mention quasi-Newton methods, where the update vector is obtained using a modified Hessian,
+which has been made negative-definite, for instance, by adding a proper correction matrix.
+This paper represents the first step in the design of an algorithm based on the Newton method
+for the optimization over a statistical model. The authors are working on the computational aspects
+related to the implementation of the method, and a new paper with experimental results is in progress.
+
+Acknowledgments: Luigi Malagò was supported by the Xerox University Affairs Committee Award and by
+de Castro Statistics, Collegio Carlo Alberto, Moncalieri. Giovanni Pistone is supported by de Castro Statistics,
+Collegio Carlo Alberto, Moncalieri, and is a member of GNAMPA–INdAM, Roma.
+
+Author Contributions
+
+All authors contributed to the design of the research. The research was carried out by all authors.
+The study of the Hessian and of the Newton method in statistical manifolds was originally suggested
+by Luigi Malagò. The manuscript was written by Luigi Malagò and Giovanni Pistone. All authors
+have read and approved the final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1.
+Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory;
+Number 9 in IMS Lecture Notes. Monograph Series; Institute of Mathematical Statistics: Hayward, CA, USA,
+1986; p. 283.
+2.
+Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI,
+USA, 2000; p. 206.
+3.
+Pistone, G. Nonparametric Information Geometry. In Geometric Science of Information, Proceedings of the
+First International Conference, GSI 2013, Paris, France, 28–30 August 2013; Nielsen, F., Barbaresco, F., Eds.;
+Lecture Notes in Computer Science, Volume 8085; Springer: Berlin/Heidelberg, Germany, 2013; pp. 5–36.
+4.
+Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University
+Press: Princeton, NJ, USA, 2008; pp. xvi+224.
+5.
+Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer Series in Operations Research and Financial
+Engineering; Springer: New York, NY, USA, 2006; pp. xxii+664.
+6.
+Do Carmo, M.P. Riemannian geometry; Mathematics: Theory & Applications; Birkhäuser Boston Inc.: Boston,
+MA, USA, 1992; pp. xiv+300.
+7.
+Abraham, R.; Marsden, J.E.; Ratiu, T.
+Manifolds, Tensor Analysis, and Applications, 2nd ed.; Applied
+Mathematical Sciences, Volume 75; Springer: New York, NY, USA, 1988; pp. x+654.
+
+
+8.
+Lang, S. Differential and Riemannian Manifolds, 3rd ed.; Graduate Texts in Mathematics; Springer: New York,
+NY, USA, 1995; pp. xiv+364.
+9.
+Pistone, G. Algebraic varieties vs. differentiable manifolds in statistical models. In Algebraic and Geometric
+Methods in Statistics; Gibilisco, P., Riccomagno, E., Rogantin, M.P., Wynn, H.P., Eds.; Cambridge University
+Press: Cambridge, UK, 2010.
+10.
+Malagò, L.; Matteucci, M.; Dal Seno, B. An information geometry perspective on estimation of distribution
+algorithms: Boundary analysis. In Proceedings of the 2008 GECCO Conference Companion On Genetic and
+Evolutionary Computation (GECCO ’08); ACM: New York, NY, USA, 2008; pp. 2081–2088.
+11.
+Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Relaxation as a Unifying Approach in 0/1 Programming. In
+Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity,
+Sparsity & Polyhedra (DISCML), Whistler Resort & Spa, BC, Canada, 11–12 December 2009.
+12.
+Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Natural Gradient Descent by Estimation of Empirical
+Covariances. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), New Orleans, LA,
+USA, 5–8 June 2011; pp. 949–956.
+13.
+Malagò, L.; Matteucci, M.; Pistone, G. Towards the geometry of estimation of distribution algorithms based
+on the exponential family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms
+(FOGA ’11), Schwarzenberg, Austria, 5–8 January 2011 ; ACM: New York, NY, USA, 2011; pp. 230–242.
+14.
+Malagò, L.; Matteucci, M.; Pistone, G. Natural gradient, fitness modelling and model selection: A unifying
+perspective. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Cancun, Mexico,
+20–23 June 2013; pp. 486–493.
+15.
+Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276.
+16.
+Shima, H. The Geometry of Hessian Structures; World Scientific Publishing Co. Pte. Ltd.: Hackensack, NJ, USA,
+2007; pp. xiv+246.
+17.
+Malagò, L. On the Geometry of Optimization Based on the Exponential Family Relaxation. Ph.D. Thesis,
+Politecnico di Milano, Milano, Italy, 2012.
+18.
+Gallavotti, G. Statistical Mechanics: A Short Treatise; Texts and Monographs in Physics; Springer: Berlin,
+Germany, 1999; pp. xiv+339.
+19.
+Naudts, J. Generalised exponential families and associated entropy functions. Entropy 2008, 10, 131–149.
+20.
+Esteban, J.; Ray, D. On the Measurement of Polarization. Econometrica 1994, 62, 819–851.
+21.
+Montalvo, J.; Reynal-Querol, M. Ethnic polarization, potential conflict, and civil wars. Am. Econ. Rev. 2005,
+796–816.
+22.
+Stein, W. et al. Sage Mathematics Software (Version 6.0). The Sage Development Team, 2013. Available
+online: http://www.sagemath.org (accessed on 27 March 2014).
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+Information Geometric Complexity of a Trivariate
+Gaussian Statistical Model
+
+Domenico Felice 1,2,*, Carlo Cafaro 3 and Stefano Mancini 1,2
+
+1 School of Science and Technology, University of Camerino, I-62032 Camerino, Italy; E-Mail:
+stefano.mancini@unicam.it
+2 INFN-Sezione di Perugia, Via A. Pascoli, I-06123 Perugia, Italy
+3 Department of Mathematics, Clarkson University, Potsdam, 13699 NY, USA; E-Mail: carlocafaro2000@yahoo.it
+*
+E-Mail: domenico.felice@unicam.it
+
+Received: 1 April 2014; in revised form: 21 May 2014 / Accepted: 22 May 2014 /
+Published: 26 May 2014
+
+Abstract: We evaluate the information geometric complexity of entropic motion on low-dimensional
+Gaussian statistical manifolds in order to quantify how difficult it is to make macroscopic predictions
+about systems in the presence of limited information. Specifically, we observe that the complexity of
+such entropic inferences not only depends on the amount of available pieces of information but also
+on the manner in which such pieces are correlated. Finally, we uncover that, for certain correlational
+structures, the impossibility of reaching the most favorable configuration from an entropic inference
+viewpoint seems to lead to an information geometric analog of the well-known frustration effect that
+occurs in statistical physics.
+
+Keywords: probability theory; Riemannian geometry; complexity
+
+1. Introduction
+
+One of the main efforts in physics is modeling and predicting natural phenomena using relevant
+information about the system under consideration. Theoretical physics has had a general measure of
+the uncertainty associated with the behavior of a probabilistic process for more than 100 years: the
+Shannon entropy [1]. The Shannon information theory was applied to dynamical systems and became
+successful in describing their unpredictability [2].
+Along a similar avenue we may set Entropic Dynamics [3] which makes use of inductive inference
+(Maximum Entropy Methods [4]) and Information Geometry [5]. This is clearly remarkable given
+that microscopic dynamics can be far removed from the phenomena of interest, such as in complex
+biological or ecological systems. Extension of ED to temporally-complex dynamical systems on curved
+statistical manifolds led to relevant measures of chaoticity [6]. In particular, an information geometric
+approach to chaos (IGAC) has been pursued studying chaos in informational geodesic flows describing
+physical, biological or chemical systems. It is the information geometric analogue of conventional
+geometrodynamical approaches [7] where the classical configuration space is being replaced by a
+statistical manifold with the additional possibility of considering chaotic dynamics arising from non
+conformally flat metrics. Within this framework, it seems natural to consider as a complexity measure
+the (time average) statistical volume explored by geodesic flows, namely an Information Geometry
+Complexity (IGC).
+This quantity might help uncover connections between microscopic dynamics and experimentally
+observable macroscopic dynamics, which is a fundamental issue in physics [8]. An interesting
+manifestation of such a relationship appears in the study of the effects of microscopic external
+noise (noise imposed on the microscopic variables of the system) on the observed collective motion
+(macroscopic variables) of a globally coupled map [9]. These effects are quantified in terms of the
+complexity of the collective motion. Furthermore, it turns out that noise at a microscopic level reduces
+
+Entropy 2014, 16, 2944–2958; doi:10.3390/e16062944
+www.mdpi.com/journal/entropy
+
+the complexity of the macroscopic motion, which in turn is characterized by the number of effective
+degrees of freedom of the system.
+The investigation of the macroscopic behavior of complex systems in terms of the underlying
+statistical structure of its microscopic degrees of freedom also reveals effects due to the presence of
+microcorrelations [10]. In this article we first show which macro-states should be considered in a
+Gaussian statistical model in order to have a reduction in time of the Information Geometry Complexity.
+Then, dealing with correlated bivariate and trivariate Gaussian statistical models, the ratio between
+the IGC in the presence and in the absence of microcorrelations is explicitly computed, finding an
+intriguing, even though not yet deeply understood, connection with the phenomenon of geometric
+frustration [11].
+The layout of the article is as follows. In Section 2 we introduce a general statistical model
+discussing its geometry and describing both its dynamics and information geometry complexity. In
+Section 3, Gaussian statistical models (up to a trivariate model) are considered. There, we compute
+the asymptotic temporal behaviors of their IGCs. Finally, in Section 4 we draw our conclusions by
+outlining our findings and proposing possible further investigations.
+
+2. Statistical Models and Information Geometry Complexity
+
+Given n real-valued random variables X1, . . . , Xn defined on the sample space Ω with joint
+probability density p : Rn → R satisfying the conditions
+
+\[
+p(x) \ge 0 \quad (\forall x \in \mathbb{R}^n) \qquad\text{and}\qquad \int_{\mathbb{R}^n} dx\, p(x) = 1, \tag{1}
+\]
+let us consider a family P of such distributions and suppose that they can be parametrized using m
+real-valued variables (θ1, . . . , θm) so that
+
+P = {pθ = p(x|θ)|θ = (θ1, . . . , θm) ∈ Θ},
+(2)
+
+where Θ ⊆ Rm is the parameter space and the mapping θ → pθ is injective. In such a way, P is an
+m-dimensional statistical model on Rn.
+The mapping ϕ : P → Rm defined by ϕ(pθ) = θ allows us to consider ϕ = [θi] as a coordinate
+system for P. Assuming parametrizations which are C∞, we can turn P into a C∞ differentiable
+manifold (thus, P is called a statistical manifold) [5].
+The values x1, . . . , xn taken by the random variables define the micro-state of the system, while the
+values θ1, . . . , θm taken by parameters define the macro-state of the system.
+Let P = {pθ|θ ∈ Θ} be an m-dimensional statistical model. Given a point θ, the Fisher information
+matrix of P in θ is the m × m matrix G(θ) = [gij], where the (i, j) entry is defined by
+
+\[
+g_{ij}(\theta) := \int_{\mathbb{R}^n} dx\, p(x|\theta)\,\partial_i \log p(x|\theta)\,\partial_j \log p(x|\theta), \tag{3}
+\]
+
+with $\partial_i$ standing for $\partial/\partial\theta^i$. The matrix G(θ) is symmetric, positive semidefinite and determines a
+Riemannian metric on the parameter space Θ [5]. Hence, it is possible to define a Riemannian statistical
+manifold M := (Θ, g), where g = gijdθi ⊗ dθj (i, j = 1, . . . , m) is the metric whose components gij are
+given by Equation (3) (throughout the paper we use the Einstein sum convention).
+Given the Riemannian manifold M = (Θ, g), it is well known that there exists only one
+linear connection ∇ (the Levi-Civita connection) on M that is compatible with the metric g and
+symmetric [12]. We remark that the manifold M has one chart, Θ being an open set of Rm, and the
+Levi-Civita connection is uniquely defined by means of the Christoffel coefficients
+
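+\[
+\Gamma^k_{ij} = \frac{1}{2}\, g^{kl}\left(\frac{\partial g_{lj}}{\partial\theta^i} + \frac{\partial g_{il}}{\partial\theta^j} - \frac{\partial g_{ij}}{\partial\theta^l}\right), \qquad (i, j, k = 1, \dots, m) \tag{4}
+\]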
+
+
+where gkl is the (k, l) entry of the inverse of the Fisher matrix G(θ).
+The idea of curvature is the fundamental tool to understand the geometry of the manifold
+M = (Θ, g). Actually, it is the basic geometric invariant and the intrinsic way to obtain it is by
+means of geodesics. It is well known that, given any point θ ∈ M and any vector v tangent to
+M at θ, there is a unique geodesic starting at θ with initial tangent vector v. Indeed, within the
+considered coordinate system, the geodesics are solutions of the following nonlinear second order
+coupled ordinary differential equations [12]
+
+\[
+\frac{d^2\theta^k}{d\tau^2} + \Gamma^k_{ij}\,\frac{d\theta^i}{d\tau}\,\frac{d\theta^j}{d\tau} = 0, \tag{5}
+\]
+
+with τ denoting the time.
+The recipe to compute some curvatures at a point θ ∈ M is the following: first, select a
+2-dimensional subspace Π of the tangent space to M at θ; second, follow the geodesics through
+θ whose initial tangent vectors lie in Π and consider the 2-dimensional submanifold SΠ swept out
+by them inheriting a Riemannian metric from M; finally, compute the Gaussian curvature of SΠ at θ,
+which can be obtained from its Riemannian metric as stated in the Theorema Egregium [13]. The number
+K(Π) found in such manner is called the sectional curvature of M at θ associated with the plane Π. In
+terms of local coordinates, to compute the sectional curvature we need the curvature tensor,
+
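+\[
+R^h_{ijk} = \frac{\partial\Gamma^h_{jk}}{\partial\theta^i} - \frac{\partial\Gamma^h_{ik}}{\partial\theta^j} + \Gamma^l_{jk}\Gamma^h_{il} - \Gamma^l_{ik}\Gamma^h_{jl}. \tag{6}
+\]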
+
+For any basis (ξ, η) for a 2-plane Π ⊂ TθM, the sectional curvature at θ ∈ M is given by [12]
+
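+\[
+K(\xi, \eta) = \frac{R(\xi, \eta, \eta, \xi)}{|\xi|^2|\eta|^2 - \langle\xi, \eta\rangle^2}, \tag{7}
+\]
+
+where R is the Riemann curvature tensor, which is written in coordinates as $R = R_{ijkl}\, d\theta^i \otimes d\theta^j \otimes d\theta^k \otimes d\theta^l$
+with $R_{ijkl} = g_{lh}R^h_{ijk}$, and $\langle\cdot, \cdot\rangle$ is the inner product defined by the metric g.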
+The sectional curvature is directly related to the topology of the manifold; along this direction
+the Cartan-Hadamard Theorem [13] is enlightening by stating that any complete, simply connected
+n-dimensional manifold with non positive sectional curvature is diffeomorphic to Rn.
+We can consider upon the statistical manifold M = (Θ, g) the macro-variables θ as accessible
+information and then derive the information dynamical Equation (5) from a standard principle of least
+action of Jacobi type [3]. The geodesic Equations (5) describe a reversible dynamics whose solution is
+the trajectory between an initial and a final macrostate θinitial and θfinal, respectively. The trajectory can
+be equally traversed in both directions [10]. Actually, an equation relating instability with geometry
+exists, and it gives hope that some global information about the average degree of instability (chaos)
+of the dynamics is encoded in global properties of the statistical manifolds [7]. The fact that this might
+happen is proved by the special case of constant-curvature manifolds, for which the Jacobi-Levi-Civita
+equation simplifies to [7]
+\[
+\frac{d^2 J^i}{d\tau^2} + K J^i = 0, \tag{8}
+\]
+
+where K is the constant sectional curvature of the manifold (see Equation (7)) and J is the geodesic
+deviation vector field. On a positively curved manifold, the norm of the separating vector J does not
+grow, whereas on a negatively curved manifold, the norm of J grows exponentially in time, and if the
+manifold is compact, so that its geodesics are sooner or later obliged to fold, this provides an example of
+chaotic geodesic motion [14].
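+
+The two regimes can be read off from the explicit solutions of Equation (8). The following is a sketch under the stated assumption of constant K; the initial data $J^i(0)$ and $\dot{J}^i(0)$ are notation introduced here for illustration:
+
+\[
+J^i(\tau) =
+\begin{cases}
+J^i(0)\cos\bigl(\sqrt{K}\,\tau\bigr) + \dfrac{\dot{J}^i(0)}{\sqrt{K}}\,\sin\bigl(\sqrt{K}\,\tau\bigr), & K > 0, \\[2ex]
+J^i(0) + \dot{J}^i(0)\,\tau, & K = 0, \\[1ex]
+J^i(0)\cosh\bigl(\sqrt{|K|}\,\tau\bigr) + \dfrac{\dot{J}^i(0)}{\sqrt{|K|}}\,\sinh\bigl(\sqrt{|K|}\,\tau\bigr), & K < 0.
+\end{cases}
+\]
+
+For K > 0 the deviation stays bounded and oscillates, while for K < 0 it behaves like $e^{\sqrt{|K|}\,\tau}$ at large τ, which is the exponential growth referred to above.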
+
+
+Taking into consideration these facts, we single out as a suitable indicator of dynamical (temporal)
+complexity the information geometric complexity, defined as the average dynamical statistical
+volume [15]
+
+\[
+\widetilde{\mathrm{vol}}\Bigl[D^{(\mathrm{geodesic})}_\Theta(\tau)\Bigr] := \frac{1}{\tau}\int_0^\tau d\tau'\,\mathrm{vol}\Bigl[D^{(\mathrm{geodesic})}_\Theta(\tau')\Bigr], \tag{9}
+\]
+
+where
+
+\[
+\mathrm{vol}\Bigl[D^{(\mathrm{geodesic})}_\Theta(\tau')\Bigr] := \int_{D^{(\mathrm{geodesic})}_\Theta(\tau')} \sqrt{\det(G(\theta))}\; d\theta, \tag{10}
+\]
+
+with G(θ) the information matrix whose components are given by Equation (3). The integration space
+$D^{(\mathrm{geodesic})}_\Theta(\tau')$ is defined as follows
+
+\[
+D^{(\mathrm{geodesic})}_\Theta(\tau') := \Bigl\{\theta = (\theta^1, \dots, \theta^m) : \theta^k(0) \le \theta^k \le \theta^k(\tau')\Bigr\}, \tag{11}
+\]
+where $\theta^k \equiv \theta^k(s)$ with $0 \le s \le \tau'$ such that $\theta^k(s)$ satisfies (5). The quantity $\mathrm{vol}\bigl[D^{(\mathrm{geodesic})}_\Theta(\tau')\bigr]$ is the
+volume of the effective parameter space explored by the system at time τ′. The temporal average
+has been introduced in order to average out the possibly very complex fine details of the entropic
+dynamical description of the system’s complexity dynamics.
+Relevant properties, concerning complexity of geodesic paths on curved statistical manifolds, of
+the quantity (10) compared to the Jacobi vector field are discussed in [16].
+
+3. The Gaussian Statistical Model
+
+In the following we devote our attention to a Gaussian statistical model P whose elements are
+multivariate normal joint distributions for n real-valued variables X1, . . . , Xn given by
+
+\[
+p(x|\theta) = \frac{1}{\sqrt{(2\pi)^n \det C}}\,\exp\left[-\frac{1}{2}(x - \mu)^t C^{-1}(x - \mu)\right], \tag{12}
+\]
+where $\mu = \bigl(E(X_1), \dots, E(X_n)\bigr)$ is the n-dimensional mean vector and C denotes the n × n covariance
+matrix with entries $c_{ij} = E(X_iX_j) - E(X_i)E(X_j)$, i, j = 1, . . . , n. Since μ is an n-dimensional real vector
+and C is an n × n symmetric matrix, the number of parameters involved in this model should be $n + \frac{n(n+1)}{2}$.
+Moreover C is a symmetric, positive definite matrix, hence we have the parameter space given by
+
+Θ := {(μ, C)|μ ∈ Rn, C ∈ Rn×n, C > 0}.
+(13)
+
+Hereafter we consider the statistical model given by Equation (12) when the covariance matrix C has
+only variances $\sigma_i^2 = E(X_i^2) - (E(X_i))^2$ as parameters. In fact we assume that the off-diagonal entry
+(i, j) of the covariance matrix C equals ρσiσj with ρ ∈ R quantifying the degree of correlation.
+We may further notice that the function fij(x) := ∂i log p(x|θ)∂j log p(x|θ), when p(x|θ) is given
+by Equation (12), is a polynomial in the variables xi (i = 1, . . . , n) whose degree is not greater than four.
+
+\[
+\partial_i \log p(x|\theta) = \frac{1}{p(x|\theta)}\,\partial_i p(x|\theta) = \partial_i \log\frac{1}{\sqrt{(2\pi)^n \det C}} + \partial_i\left[-\frac{1}{2}(x - \mu)^t C^{-1}(x - \mu)\right], \tag{14}
+\]
+and, therefore, the differentiation does not affect variables xi. With this in mind, in order to compute
+the integral in (3), we can use the following formula [17]
+
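+\[
+\frac{1}{\sqrt{(2\pi)^n \det C}}\int dx\, f_{ij}(x)\,\exp\left[-\frac{1}{2}(x - \mu)^t C^{-1}(x - \mu)\right] = \exp\left[\frac{1}{2}\sum_{h,k=1}^{n} c_{hk}\,\frac{\partial}{\partial x_h}\frac{\partial}{\partial x_k}\right] f_{ij}\,\bigg|_{x=\mu}, \tag{15}
+\]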
+
+where the exponential denotes the power series over its argument (the differential operator).
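+
+As a quick illustration of how formula (15) is used, the following worked example treats the simplest case n = 1 with θ = μ and a fixed variance σ², so that the covariance matrix reduces to the single entry c11 = σ². It is only a sketch, assuming that the integral in (3) is the usual Fisher information integral of the product ∂i log p ∂j log p; the symbol f_{μμ} is introduced here purely for illustration.
+
+```latex
+\begin{align*}
+f_{\mu\mu}(x) &= \left(\partial_\mu \log p(x|\mu)\right)^2 = \frac{(x-\mu)^2}{\sigma^4}, \\
+\exp\left[\frac{\sigma^2}{2}\frac{\partial^2}{\partial x^2}\right] f_{\mu\mu}\Big|_{x=\mu}
+  &= \left[ f_{\mu\mu} + \frac{\sigma^2}{2}\, f_{\mu\mu}'' + \frac{\sigma^4}{8}\, f_{\mu\mu}'''' + \cdots \right]_{x=\mu} \\
+  &= 0 + \frac{\sigma^2}{2}\cdot\frac{2}{\sigma^4} + 0 = \frac{1}{\sigma^2}.
+\end{align*}
+```
+
+The series terminates because f_{μμ} is a polynomial of degree two. Setting σ = 1 gives the metric g = dμ² of Case 1 in Section 3.1.1.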
+
+
+Entropy 2014, 16, 2944–2958
+
+3.1. The monovariate Gaussian Statistical Model
+
+We now start to apply the concepts of the previous section to a Gaussian statistical model of
+Equation (12) for n = 1. In this case, the dimension of the statistical Riemannian manifold M = (Θ, g)
+is at most two. Indeed, to describe elements of the statistical model P given by Equation (12), we
+basically need the mean μ = E(X) and variance σ2 = E(X − μ)2. We deal separately with the
+cases when the monovariate model has only μ as macro-variable (Case 1), when σ is the unique
+macro-variable (Case 2), and finally when both μ and σ are macro-variables (Case 3).
+
+3.1.1. Case 1
+
+Consider the monovariate model with only μ as macro-variable by setting σ = 1. In this case
+the manifold M is trivially the real flat straight line, since μ ∈ (−∞, +∞). Indeed, the integral
+
+in (3) is equal to 1 when the distribution p(x|θ) reads as $p(x|\mu) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\mu)^2\right]$, so the metric
+is $g = d\mu^2$. Furthermore, from Equations (4) and (5) the information dynamics is described by
+the geodesic μ(τ) = A1τ + A2, where A1, A2 ∈ R. Hence, the volume of Equation (10) is
+$\mathrm{vol}\left[D_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int d\mu = A_1\tau + A_2$; since this quantity must be positive we assume A1, A2 > 0.
+
+Finally, the asymptotic behavior of the IGC (9) is
+
+\[ \widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{A_1}{2}\right)\tau. \tag{16} \]
+
+This shows that the complexity increases linearly in time, meaning that acquiring information about μ
+and updating it is not enough to increase our knowledge about the micro-state of the system.
+
+3.1.2. Case 2
+
+Consider now the monovariate Gaussian statistical model of Equation (12) when μ = E(X) = 0
+and the macro-variable is only σ. In this case the probability distribution function reads
+$p(x|\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{x^2}{2\sigma^2}\right]$, while the Fisher–Rao metric becomes $g = \frac{2}{\sigma^2}\, d\sigma^2$. Emphasizing that in this case the
+manifold is flat as well, we derive the information dynamics by means of Equations (4) and (5) and we
+obtain the geodesic $\sigma(\tau) = A_1 \exp(A_2\tau)$. The volume in Equation (10) is then
+
+\[ \mathrm{vol}\left[D_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int \frac{\sqrt{2}}{\sigma}\, d\sigma = \sqrt{2}\,\log\left[A_1 \exp(A_2\tau)\right]. \tag{17} \]
+
+Again, to have positive volume we have to assume A1, A2 > 0. Finally, the (asymptotic) IGC (9)
+becomes
+
+\[ \widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{\sqrt{2}A_2}{2}\right)\tau. \tag{18} \]
+
+This shows that also in this case the complexity increases linearly in time, meaning that acquiring
+information about σ and updating it is not enough to increase our knowledge about the micro-state of
+the system.
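+
+The steps of Case 2 can be retraced explicitly. The sketch below assumes that Equation (5) is the standard geodesic equation with the Christoffel coefficients of Equation (4), and that the IGC of Equation (9) is the time average (1/τ)∫ vol dτ' of the volume over [0, τ]; neither equation is reproduced in this section, so both readings are assumptions.
+
+```latex
+\begin{align*}
+g_{\sigma\sigma} = \frac{2}{\sigma^2}
+  \;&\Rightarrow\; \Gamma^{\sigma}_{\sigma\sigma} = \frac{1}{2}\, g^{\sigma\sigma}\,\partial_\sigma g_{\sigma\sigma} = -\frac{1}{\sigma}, \\
+\frac{d^2\sigma}{d\tau^2} - \frac{1}{\sigma}\left(\frac{d\sigma}{d\tau}\right)^2 = 0
+  \;&\Leftrightarrow\; \frac{d^2}{d\tau^2}\log\sigma = 0
+  \;\Rightarrow\; \sigma(\tau) = A_1 \exp(A_2\tau), \\
+\frac{1}{\tau}\int_0^{\tau} \sqrt{2}\,\left(\log A_1 + A_2\tau'\right) d\tau'
+  &= \sqrt{2}\,\log A_1 + \frac{\sqrt{2}A_2}{2}\,\tau \;\approx\; \frac{\sqrt{2}A_2}{2}\,\tau \quad (\tau \to \infty).
+\end{align*}
+```
+
+Under these assumptions the leading term agrees with Equation (18), and the linear growth comes entirely from the exponential form of the geodesic.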
+
+3.1.3. Case 3
+
+The take-home message of the previous cases is that we have to account for both mean μ and
+variance σ as macro-variables to look for possible non-increasing complexity. Hence, consider the
+probability distribution function given by
+
+\[ p(x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right]. \tag{19} \]
+
+
+The dimension of the Riemannian manifold M = (Θ, g) is two, where the parameter space Θ is given
+by Θ = {(μ, σ)|μ ∈ (−∞, +∞), σ > 0} and the Fisher–Rao metric reads $g = \frac{1}{\sigma^2}\, d\mu^2 + \frac{2}{\sigma^2}\, d\sigma^2$. Here,
+the sectional curvature given by Equation (7) is a negative function and, despite the fact that it is not
+constant, we expect a decreasing behavior in time of the IGC. Thanks to Equation (4), we find that the
+only non-vanishing Christoffel coefficients are $\Gamma^1_{12} = -\frac{1}{\sigma}$, $\Gamma^2_{11} = \frac{1}{2\sigma}$ and $\Gamma^2_{22} = -\frac{1}{\sigma}$. Substituting them
+into Equation (5) we derive the following geodesic equations
+
+\[ \begin{cases} \dfrac{d^2\mu(\tau)}{d\tau^2} - \dfrac{2}{\sigma}\dfrac{d\sigma}{d\tau}\dfrac{d\mu}{d\tau} = 0, \\[2ex] \dfrac{d^2\sigma(\tau)}{d\tau^2} - \dfrac{1}{\sigma}\left(\dfrac{d\sigma}{d\tau}\right)^2 + \dfrac{1}{2\sigma}\left(\dfrac{d\mu}{d\tau}\right)^2 = 0. \end{cases} \tag{20} \]
+
+The integration of the above coupled differential equations is non-trivial. We follow the method
+described in [10] and arrive at
+
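+\[ \sigma(\tau) = \frac{2\sigma_0 \exp\left(\frac{\sigma_0|A_1|}{\sqrt{2}}\tau\right)}{1 + \exp\left(\frac{2\sigma_0|A_1|}{\sqrt{2}}\tau\right)}, \qquad \mu(\tau) = -\frac{2\sigma_0\sqrt{2}\,A_1}{|A_1|\left[1 + \exp\left(\frac{2\sigma_0|A_1|}{\sqrt{2}}\tau\right)\right]}, \tag{21} \]
+
+where σ0 and A1 are real constants. Then, using (21), the volume of Equation (10) is
+
+\[ \mathrm{vol}\left[D_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int \frac{\sqrt{2}}{\sigma^2}\, d\sigma\, d\mu = \frac{\sqrt{2}A_1}{|A_1|} \exp\left(-\frac{\sigma_0|A_1|}{\sqrt{2}}\tau\right). \tag{22} \]
+
+Since the last quantity must be positive, we assume A1 > 0. Finally, employing the above expression
+in Equation (9) we arrive at
+
+\[ \widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{2}{\sigma_0 A_1}\right)\frac{1}{\tau}. \tag{23} \]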
+
+We can now see a reduction in time of the complexity meaning that acquiring information about both
+μ and σ and updating them allows us to increase our knowledge about the micro state of the system.
+Hence, comparing Equations (16), (18) and (23) we conclude that entropic inferences on a
+Gaussian distributed micro-variable are carried out in a more efficient manner when both its mean and
+its variance are available in the form of information constraints. Macroscopic predictions when only
+one of these pieces of information is available are more complex.
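+
+The passage from the exponentially decaying volume (22) to the power law (23) can be checked in one line. The sketch below assumes, as above, that the IGC of Equation (9) is the time average of the volume over [0, τ] (Equation (9) is not reproduced in this section) and that A1 > 0.
+
+```latex
+\begin{align*}
+\sqrt{\det g} &= \sqrt{\frac{1}{\sigma^2}\cdot\frac{2}{\sigma^2}} = \frac{\sqrt{2}}{\sigma^2}, \\
+\frac{1}{\tau}\int_0^{\tau} \sqrt{2}\,\exp\left(-\frac{\sigma_0 A_1}{\sqrt{2}}\tau'\right) d\tau'
+  &= \frac{2}{\sigma_0 A_1}\,\frac{1 - \exp\left(-\frac{\sigma_0 A_1}{\sqrt{2}}\tau\right)}{\tau}
+  \;\approx\; \frac{2}{\sigma_0 A_1}\,\frac{1}{\tau} \quad (\tau \to \infty).
+\end{align*}
+```
+
+The first line recovers the integrand of Equation (22) from the Fisher–Rao metric of Case 3; the second shows that the 1/τ decay is produced by the averaging, since the integral of the volume itself converges.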
+
+3.2. Bivariate Gaussian Statistical Model
+
+Consider now the Gaussian statistical model P of the Equation (12) when n = 2. In this case
+the dimension of the Riemannian manifold M = (Θ, g) is at most four. From the analysis of the
+monovariate Gaussian model in Section 3.1 we have understood that both mean and variance should
+be considered. Hence the minimal assumption is to consider E(X1) = E(X2) = μ and E(X1 − μ)2 =
+E(X2 − μ)2 = σ2. Furthermore, in this case we have also to take into account the possible presence of
+(micro) correlations, which appear at the level of macro-states as off-diagonal terms in the covariance
+matrix. In short, this implies considering the following probability distribution function
+
+\[ p(x_1, x_2|\mu,\sigma) = \frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2\sigma^2(1-\rho^2)}\left[(x_1-\mu)^2 - 2\rho(x_1-\mu)(x_2-\mu) + (x_2-\mu)^2\right]\right\}, \tag{24} \]
+
+where ρ ∈ (−1, 1).
+Thanks to Equation (15) we compute the Fisher information matrix G and find $g = g_{11}\, d\mu^2 + g_{22}\, d\sigma^2$ with
+
+\[ g_{11} = \frac{2}{\sigma^2(\rho+1)}; \qquad g_{22} = \frac{4}{\sigma^2}. \tag{25} \]
+
+
+The only non-trivial Christoffel coefficients (4) are $\Gamma^1_{12} = -\frac{1}{\sigma}$, $\Gamma^2_{11} = \frac{1}{2\sigma(\rho+1)}$ and $\Gamma^2_{22} = -\frac{1}{\sigma}$. In this case
+as well, the sectional curvature (Equation (7)) of the manifold M is a negative function and so we may
+expect a decreasing asymptotic behavior for the IGC. From Equation (5) it follows that the geodesic
+equations are
+
+\[ \begin{cases} \dfrac{d^2\mu(\tau)}{d\tau^2} - \dfrac{2}{\sigma}\dfrac{d\sigma}{d\tau}\dfrac{d\mu}{d\tau} = 0, \\[2ex] \dfrac{d^2\sigma(\tau)}{d\tau^2} - \dfrac{1}{\sigma}\left(\dfrac{d\sigma}{d\tau}\right)^2 + \dfrac{1}{2(1+\rho)\sigma}\left(\dfrac{d\mu}{d\tau}\right)^2 = 0, \end{cases} \tag{26} \]
+
+whose solutions are,
+
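+\[ \sigma(\tau) = \frac{2\sigma_0 \exp\left(\frac{\sigma_0|A_1|}{\sqrt{2(1+\rho)}}\tau\right)}{1 + \exp\left(\frac{2\sigma_0|A_1|}{\sqrt{2(1+\rho)}}\tau\right)}, \qquad \mu(\tau) = -\frac{2\sigma_0\sqrt{2(1+\rho)}\,A_1}{|A_1|\left[1 + \exp\left(\frac{2\sigma_0|A_1|}{\sqrt{2(1+\rho)}}\tau\right)\right]}. \tag{27} \]
+
+Using (27) in Equation (10) gives the volume
+
+\[ \mathrm{vol}\left[D_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int \frac{2\sqrt{2}}{\sqrt{1+\rho}\,\sigma^2}\, d\sigma\, d\mu = \frac{4A_1}{|A_1|} \exp\left(-\frac{\sigma_0|A_1|}{\sqrt{2(1+\rho)}}\tau\right). \tag{28} \]
+
+To have it positive we have to assume A1 > 0. Finally, employing (28) in (9) leads to the IGC
+
+\[ \widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{4\sqrt{2}}{\sigma_0 A_1}\right)\frac{\sqrt{1+\rho}}{\tau}, \tag{29} \]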
+
+with ρ ∈ (−1, 1). We may compare the asymptotic expressions of the IGCs in the presence and in the
+absence of correlations, obtaining
+
+\[ R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right]}{\widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right]_{\rho=0}} = \sqrt{1+\rho}, \tag{30} \]
+
+where “strong” stands for the fully connected lattice underlying the micro-variables. The ratio $R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho)$
+is a monotonically increasing function of ρ.
+While the temporal behavior of the IGC (29) is similar to the IGC in (23), here correlations play
+a fundamental role. From Equation (30), we conclude that entropic inferences on two Gaussian
+distributed micro-variables on a fully connected lattice are carried out in a more efficient manner when
+the two micro-variables are negatively correlated. Instead, when such micro-variables are positively
+correlated, macroscopic predictions become more complex than in the absence of correlations.
+Intuitively, this is due to the fact that for anticorrelated variables, an increase in one variable
+implies a decrease in the other one (different directional change): variables become more distant, thus
+more distinguishable in the Fisher–Rao information metric sense. Similarly, for positively correlated
+variables, an increase or decrease in one variable always predicts the same directional change for the
+second variable: variables do not become more distant, and thus do not become more distinguishable in the Fisher–Rao
+information metric sense. This may lead us to guess that in the presence of anticorrelations, motion on
+curved statistical manifolds via the Maximum Entropy updating methods becomes less complex.
+
+3.3. Trivariate Gaussian Statistical Model
+
+In this section we consider a Gaussian statistical model P of the Equation (12) when n = 3.
+In this case as well, in order to understand the asymptotic behavior of the IGC in the presence of
+correlations between the micro-states, we make the minimal assumption that, given the random vector
+X = (X1, X2, X3) distributed according to a trivariate Gaussian, then E(X1) = E(X2) = E(X3) = μ
+and $E(X_1-\mu)^2 = E(X_2-\mu)^2 = E(X_3-\mu)^2 = \sigma^2$. Therefore, the space of the parameters of P is
+given by Θ = {(μ, σ)|μ ∈ R, σ > 0}.
+The manifold M = (Θ, g) changes its metric structure depending on the number of correlations
+between micro-variables, namely, one, two, or three. The covariance matrices corresponding to these
+cases read, modulo the congruence via a permutation matrix [17],
+
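+\[ C_1 = \sigma^2 \begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad C_2 = \sigma^2 \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & 0 \\ \rho & 0 & 1 \end{pmatrix}, \qquad C_3 = \sigma^2 \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix}. \tag{31} \]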
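+
+The metric components quoted for the three cases in Sections 3.3.1 to 3.3.3 can be cross-checked with a short calculation. The sketch below writes C = σ²Σ, with Σ the correlation matrix, and uses the all-ones vector 1; both symbols are introduced here only for illustration and do not appear in the original text. It relies on the standard fact that the quadratic form (x − μ1)ᵗΣ⁻¹(x − μ1)/σ² is chi-squared distributed with n degrees of freedom.
+
+```latex
+\begin{align*}
+\log p(x|\mu,\sigma) &= -n\log\sigma - \frac{1}{2\sigma^2}\,(x-\mu\mathbf{1})^t\,\Sigma^{-1}(x-\mu\mathbf{1}) + \text{const}, \\
+g_{11} &= E\left[(\partial_\mu \log p)^2\right] = \frac{\mathbf{1}^t\,\Sigma^{-1}\,\mathbf{1}}{\sigma^2}, \qquad
+g_{22} = E\left[(\partial_\sigma \log p)^2\right] = \frac{2n}{\sigma^2}, \qquad g_{12} = 0, \\
+\mathbf{1}^t\,\Sigma_1^{-1}\,\mathbf{1} &= \frac{2}{1+\rho} + 1 = \frac{3+\rho}{1+\rho}, \qquad
+\mathbf{1}^t\,\Sigma_2^{-1}\,\mathbf{1} = \frac{3-4\rho}{1-2\rho^2}, \qquad
+\mathbf{1}^t\,\Sigma_3^{-1}\,\mathbf{1} = \frac{3}{1+2\rho}.
+\end{align*}
+```
+
+With n = 3 this gives g22 = 6/σ² in all three cases, and with n = 2 the same formulas give the bivariate components of Equation (25).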
+
+3.3.1. Case 1
+
+First, we consider the trivariate Gaussian statistical model of Equation (12) when C ≡ C1. Then,
+proceeding as in Section 3.2, we have $g = g_{11}\, d\mu^2 + g_{22}\, d\sigma^2$, where $g_{11} = \frac{3+\rho}{(1+\rho)\sigma^2}$ and $g_{22} = \frac{6}{\sigma^2}$. Also in
+this case we find that the sectional curvature of Equation (7) is a negative function. Hence, as stated
+in Section 2, we may expect a decreasing (in time) behavior of the information geometry complexity.
+Furthermore, we obtain the geodesics
+
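+\[ \sigma(\tau) = \frac{2\sigma_0 \exp\left(\sigma_0\sqrt{A(\rho)}\,\tau\right)}{1 + \exp\left(2\sigma_0\sqrt{A(\rho)}\,\tau\right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}}\,\frac{1}{1 + \exp\left(2\sigma_0\sqrt{A(\rho)}\,\tau\right)}, \tag{32} \]
+
+where $A(\rho) = \frac{A_1^2(3+\rho)}{6(1+\rho)}$ and A1 ∈ R. We remark that A(ρ) > 0 for all ρ ∈ (−1, 1). Then, the volume (10)
+becomes
+
+\[ \mathrm{vol}\left[D_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int \sqrt{\frac{6(3+\rho)}{1+\rho}}\,\frac{1}{\sigma^2}\, d\sigma\, d\mu = \frac{6A_1}{|A_1|} \exp\left(-\sigma_0\sqrt{A(\rho)}\,\tau\right), \tag{33} \]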
+
+requiring A1 > 0 for its positivity. Finally, using (33) in (9) we arrive at the asymptotic behavior of
+the IGC
+
+\[ \widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{6\sqrt{6}}{\sigma_0 A_1}\right)\sqrt{\frac{1+\rho}{3+\rho}}\,\frac{1}{\tau}. \tag{34} \]
+
+Comparing (34) in the presence and in the absence of correlations yields
+
+\[ R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right]}{\widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right]_{\rho=0}} = \sqrt{3}\,\sqrt{\frac{1+\rho}{3+\rho}}, \tag{35} \]
+
+where “weak” stands for a low degree of connection in the lattice underlying the micro-variables.
+Notice that $R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho)$ is a monotonically increasing function of the argument ρ ∈ (−1, 1).
+
+3.3.2. Case 2
+
+When the trivariate Gaussian statistical model of Equation (12) has C ≡ C2, the condition C > 0
+constrains the correlation coefficient to be $\rho \in \left(-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right)$. Proceeding again as in Section 3.2 we
+have $g = g_{11}\, d\mu^2 + g_{22}\, d\sigma^2$, where $g_{11} = \frac{3-4\rho}{(1-2\rho^2)\sigma^2}$ and $g_{22} = \frac{6}{\sigma^2}$. The sectional curvature of Equation (7)
+is a negative function as well, and so we may apply the arguments of Section 2, expecting a decrease
+in time of the complexity. Furthermore, we obtain the geodesics
+
+\[ \sigma(\tau) = \frac{2\sigma_0 \exp\left(\sigma_0\sqrt{A(\rho)}\,\tau\right)}{1 + \exp\left(2\sigma_0\sqrt{A(\rho)}\,\tau\right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}}\,\frac{1}{1 + \exp\left(2\sigma_0\sqrt{A(\rho)}\,\tau\right)}, \tag{36} \]
+
+where $A(\rho) = \frac{A_1^2(3-4\rho)}{6(1-2\rho^2)}$ and A1 ∈ R. We remark that A(ρ) > 0 for all $\rho \in \left(-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right)$. Then, the
+volume (10) becomes
+
+\[ \mathrm{vol}\left[D_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int \sqrt{\frac{6(3-4\rho)}{1-2\rho^2}}\,\frac{1}{\sigma^2}\, d\sigma\, d\mu = \frac{6A_1}{|A_1|} \exp\left(-\sigma_0\sqrt{A(\rho)}\,\tau\right). \tag{37} \]
+
+We have to set A1 > 0 for the positivity of the volume (37), and using it in (9) we arrive at the
+asymptotic behavior of the IGC
+
+\[ \widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{6\sqrt{6}}{\sigma_0 A_1}\right)\sqrt{\frac{1-2\rho^2}{3-4\rho}}\,\frac{1}{\tau}. \tag{38} \]
+
+Then, comparing (38) in the presence and in the absence of correlations yields
+
+\[ R^{\mathrm{mildly\ weak}}_{\mathrm{trivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right]}{\widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right]_{\rho=0}} = \sqrt{3}\,\sqrt{\frac{1-2\rho^2}{3-4\rho}}, \tag{39} \]
+
+where “mildly weak” stands for a lattice (underlying micro-variables) neither fully connected nor with
+minimal connection.
+This is a function of the argument $\rho \in \left(-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right)$ that attains the maximum $\sqrt{3/2}$ at $\rho = \frac{1}{2}$, while
+at the extrema of the interval $\left(-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right)$ it tends to zero.
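+
+The location and height of the maximum of the ratio in Equation (39) follow from a direct differentiation, sketched here; the auxiliary function h(ρ) is a name introduced only for this check.
+
+```latex
+\begin{align*}
+h(\rho) &:= \frac{1-2\rho^2}{3-4\rho}, \qquad
+h'(\rho) = \frac{-4\rho(3-4\rho) + 4(1-2\rho^2)}{(3-4\rho)^2} = \frac{4(2\rho-1)(\rho-1)}{(3-4\rho)^2}, \\
+h'(\rho) &> 0 \ \text{for}\ \rho < \tfrac{1}{2}, \qquad h'(\rho) < 0 \ \text{for}\ \tfrac{1}{2} < \rho < \tfrac{\sqrt{2}}{2}, \\
+\sqrt{3}\,\sqrt{h\left(\tfrac{1}{2}\right)} &= \sqrt{3}\,\sqrt{\tfrac{1}{2}} = \sqrt{\tfrac{3}{2}}, \qquad
+h\left(\pm\tfrac{\sqrt{2}}{2}\right) = 0.
+\end{align*}
+```
+
+The second root of the numerator, ρ = 1, lies outside the admissible interval, so ρ = 1/2 is the only stationary point there. The peak value equals the bivariate ratio of Equation (30) evaluated at ρ = 1/2.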
+
+3.3.3. Case 3
+
+Last, we consider the trivariate Gaussian statistical model of Equation (12) when C ≡ C3. In
+this case, the condition C > 0 requires the correlation coefficient to be $\rho \in \left(-\frac{1}{2}, 1\right)$. Proceeding again
+as in Section 3.2 we have $g = g_{11}\, d\mu^2 + g_{22}\, d\sigma^2$, where $g_{11} = \frac{3}{(1+2\rho)\sigma^2}$ and $g_{22} = \frac{6}{\sigma^2}$. We find that the
+sectional curvature of Equation (7) is a negative function; hence, we may expect a decreasing (in time)
+behavior of the complexity. The geodesics are
+
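+\[ \sigma(\tau) = \frac{2\sigma_0 \exp\left(\sigma_0\sqrt{A(\rho)}\,\tau\right)}{1 + \exp\left(2\sigma_0\sqrt{A(\rho)}\,\tau\right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}}\,\frac{1}{1 + \exp\left(2\sigma_0\sqrt{A(\rho)}\,\tau\right)}, \tag{40} \]
+
+where $A(\rho) = \frac{A_1^2}{2(1+2\rho)}$ and A1 ∈ R. We note that A(ρ) > 0 for all $\rho \in \left(-\frac{1}{2}, 1\right)$. Using (40), we compute
+
+\[ \mathrm{vol}\left[D_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int \frac{3\sqrt{2}}{\sqrt{1+2\rho}}\,\frac{1}{\sigma^2}\, d\sigma\, d\mu = \frac{6\sqrt{2}A_1}{|A_1|} \exp\left(-\sigma_0\sqrt{A(\rho)}\,\tau\right). \tag{41} \]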
+
+Also in this case we need to assume A1 > 0 to have positive volume. Finally, substituting Equation (41)
+into Equation (9), the asymptotic behavior of the IGC is
+
+\[ \widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{12}{\sigma_0 A_1}\right)\sqrt{1+2\rho}\,\frac{1}{\tau}. \tag{42} \]
+
+The comparison of (42) in the presence and in the absence of correlations yields
+
+\[ R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right]}{\widetilde{\mathrm{vol}}\left[D_\Theta^{(\mathrm{geodesic})}(\tau)\right]_{\rho=0}} = \sqrt{1+2\rho}, \tag{43} \]
+
+where “strong” stands for a fully connected lattice underlying the (three) micro-variables. We remark
+that the latter ratio is a monotonically increasing function of the argument $\rho \in \left(-\frac{1}{2}, 1\right)$.
+
+The behaviors of R(ρ) of Equations (30), (35), (39) and (43) are reported in Figure 1.
+
+Figure 1. Ratio R(ρ) of volumes vs. degree of correlations ρ (horizontal axis: ρ from −1 to 1, with ρpeak marked; vertical axis: R(ρ)). Solid line refers to $R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho)$; dotted line
+refers to $R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho)$; dashed line refers to $R^{\mathrm{mildly\ weak}}_{\mathrm{trivariate}}(\rho)$; dash-dotted line refers to $R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho)$.
+
+The non-monotonic behavior of the ratio $R^{\mathrm{mildly\ weak}}_{\mathrm{trivariate}}(\rho)$ in Equation (39) corresponds to the
+information geometric complexities for the mildly weak connected three-dimensional lattice.
+Interestingly, the growth stops at a critical value $\rho_{\mathrm{peak}} = \frac{1}{2}$ at which $R^{\mathrm{mildly\ weak}}_{\mathrm{trivariate}}(\rho_{\mathrm{peak}}) = R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho_{\mathrm{peak}})$. From
+Equation (43), we conclude that entropic inferences on three Gaussian distributed micro-variables on
+a fully connected lattice are carried out in a more efficient manner when the micro-variables are
+negatively correlated. Instead, when such micro-variables are positively correlated, macroscopic
+predictions become more complex than in the absence of correlations.
+Furthermore, the ratio $R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho)$ of the information geometric complexities for this fully connected three-dimensional
+lattice increases in a monotonic fashion. These conclusions are similar to those presented for the
+bivariate case. However, there is a key feature of the IGC to emphasize when passing from the
+two-dimensional to the three-dimensional manifolds associated with fully connected lattices: the
+effects of negative correlations and positive correlations are amplified with respect to the respective
+absence-of-correlations scenarios,
+
+\[ \frac{R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho)}{R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho)} = \sqrt{\frac{1+2\rho}{1+\rho}}, \tag{44} \]
+
+where $\rho \in \left(-\frac{1}{2}, 1\right)$.
+Specifically, carrying out entropic inferences on the higher-dimensional manifold in the presence
+of anti-correlations, that is for $\rho \in \left(-\frac{1}{2}, 0\right)$, is less complex than on the lower-dimensional manifold, as is
+evident from Equation (44). The converse is true in the presence of positive correlations, that is for
+ρ ∈ (0, 1).
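+
+The sign dependence of the ratio in Equation (44) can be read off from a one-line rearrangement, given here as a small worked check.
+
+```latex
+\begin{align*}
+\frac{1+2\rho}{1+\rho} &= 1 + \frac{\rho}{1+\rho}, \\
+-\tfrac{1}{2} < \rho < 0 \;&\Rightarrow\; \sqrt{\frac{1+2\rho}{1+\rho}} < 1, \qquad
+0 < \rho < 1 \;\Rightarrow\; \sqrt{\frac{1+2\rho}{1+\rho}} > 1.
+\end{align*}
+```
+
+Since 1 + ρ > 0 on the whole admissible range, the correction term ρ/(1 + ρ) has the sign of ρ, which is why the ratio crosses one exactly at ρ = 0.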
+
+4. Concluding Remarks
+
+In summary, we considered low dimensional Gaussian statistical models (up to a trivariate model)
+and have investigated their dynamical (temporal) complexity. This has been quantified by the volume
+of geodesics for parameters characterizing the probability distribution functions. To the best of our
+knowledge, there is no dynamic measure of complexity of geodesic paths on curved statistical manifolds
+that could be compared to our IGC. However, it could be worthwhile to understand the connection, if
+any, between our IGC and the complexity of paths of dynamic systems introduced in [20]. Specifically,
+according to the Alekseev-Brudno theorem in the algorithmic theory of dynamical systems [21], a way
+to predict each new segment of chaotic trajectory is obtained by adding information proportional to the
+length of this segment and independent of the full previous length of trajectory. This means that this
+information cannot be extracted from observation of the previous motion, even an infinitely long one!
+If the instability is a power law, then the required information per unit time is inversely proportional
+to the full previous length of the trajectory and, asymptotically, the prediction becomes possible.
+For the sake of completeness, we also point out that the relevance of volumes in quantifying the
+static model complexity of statistical models was already pointed out in [22] and [23]: complexity is
+related to the volume of a model in the space of distributions regarded as a Riemannian manifold
+of distributions with a natural metric defined by the Fisher–Rao metric tensor. Finally, we would
+like to point out that two of the Authors have recently associated Gaussian statistical models to
+networks [17]. Specifically, it is assumed that random variables are located on the vertices of the
+network while correlations between random variables are regarded as weighted edges of the network.
+Within this framework, a static network complexity measure has been proposed as the volume of the
+corresponding statistical manifold. We emphasize that such a static measure could be, in principle,
+applied to time-dependent networks by accommodating time-varying weights on the edges [24]. This
+requires the consideration of a time-sequence of different statistical manifolds. Thus, we could follow
+the time-evolution of a network complexity through the time evolution of the volumes of the associated
+manifolds.
+In this work we uncover that in order to have a reduction in time of the complexity one has to
+consider both mean and variance as macro-variables. This leads to different topological structures of
+the parameter space in (13); in particular, we have to consider at least a 2-dimensional manifold in
+order to have effects such as a power law decay of the complexity. Hence, the minimal hypothesis in a
+multivariate Gaussian model consists in considering all mean values equal and all covariances equal.
+In such a case, however, the complexity shows interesting features depending on the correlation among
+micro-variables (as summarized in Figure 1). For a trivariate model with only two correlations the
+information geometric complexity ratio exhibits a non monotonic behavior in ρ (correlation parameter)
+taking zero value at the extrema of the range of ρ. In contrast, for closed configurations (bivariate
+and trivariate models with all micro-variables correlated with each other) the complexity ratio exhibits a
+monotonic behavior in terms of the correlation parameter. The fact that in such a case this ratio cannot
+be zero at the extrema of the range of ρ is reminiscent of the geometric frustration phenomenon that
+occurs in the presence of loops [11].
+Specifically, recall that a geometrically frustrated system cannot simultaneously minimize all
+interactions because of geometric constraints [11,18]. For example, geometric frustration can occur
+in an Ising model which is an array of spins (for instance, atoms that can take states ±1) that are
+magnetically coupled to each other. If one spin is, say, in the +1 state then it is energetically favorable
+for its immediate neighbors to be in the same state in the case of a ferromagnetic model. On the
+contrary, in antiferromagnetic systems, nearest neighbor spins want to align in opposite directions.
+This rule can be easily satisfied on a square. However, due to geometrical frustration, it is not possible
+to satisfy it on a triangle: for an antiferromagnetic triangular Ising model, any three neighboring spins
+are frustrated. Geometric frustration in triangular Ising models can be observed by considering spin
+configurations with total spin J = ±1 and analyzing the fluctuations in energy of the spin system as
+a function of temperature. There is no peak at all in the standard deviation of the energy in the case
+J = −1, and a monotonic behavior is recorded. This indicates that the antiferromagnetic system does
+not have a phase transition to a state with long-range order. Instead, in the case J = +1, a peak in
+the energy fluctuations emerges. This significant change in the behavior of energy fluctuations as a
+function of temperature in triangular configurations of spin systems is a signature of the presence of
+frustrated interactions in the system [19].
+
+
+In this article, we observe a significant change in the behavior of the information geometric
+complexity ratios as a function of the correlation coefficient in the trivariate Gaussian statistical models.
+Specifically, in the fully connected trivariate case, no peak arises and a monotonic behavior in ρ of
+the information geometric complexity ratio is observed. In the mildly weak connected trivariate
+case, instead, a peak in the information geometric complexity ratio is recorded at ρpeak ≥ 0. This
+dramatic disparity of behavior can be ascribed to the fact that when carrying out statistical inferences
+with positively correlated Gaussian random variables, the maximum entropy favorable scenario is
+incompatible with these working hypotheses. Thus, the system appears frustrated.
+These considerations lead us to conclude that we have uncovered a very interesting information
+geometric resemblance of the more standard geometric frustration effect in Ising spin models. However,
+for a conclusive claim of the existence of an information geometric analog of the frustration effect, we
+feel we have to further deepen our understanding. A forthcoming research project along these lines
+will be a detailed investigation of both arbitrary triangular and square configurations of correlated
+Gaussian random variables where we take into consideration both the presence of different intensities
+and signs of pairwise interactions (ρij ̸= ρik if j ̸= k, ∀i).
+
+Acknowledgments: Domenico Felice and Stefano Mancini acknowledge the financial support of the Future
+and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the
+European Commission, under the FET-Open grant agreement TOPDRIM, number FP7-ICT-318121.
+
+Author Contributions: The authors have equally contributed to the paper. All authors read and approved the
+final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1. Feldman, D.F.; Crutchfield, J.P. Measures of Statistical Complexity: Why? Phys. Lett. A 1998, 238, 244–252.
+2. Kolmogorov, A.N. A new metric invariant of transitive dynamical systems and of automorphism of Lebesgue
+spaces. Doklady Akademii Nauk SSSR 1958, 119, 861–864.
+3. Caticha, A. Entropic Dynamics. Bayesian Inference and Maximum Entropy Methods in Science and Engineering,
+the 22nd International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and
+Engineering, Moscow, Idaho, 3-7 August 2002; Fry, R.L., Ed.; American Institute of Physics: College Park,
+MD, USA, 2002; Volume 617, p. 302.
+4. Caticha, A.; Preuss, R. Maximum entropy and Bayesian data analysis: Entropic prior distributions. Phys. Rev.
+E 2004, 70, 046127.
+5. Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: New York, NY, USA, 2000.
+6. Cafaro, C. Works on an information geometrodynamical approach to chaos. Chaos Solitons Fractals 2009, 41,
+886–891.
+7. Pettini, M. Geometry and Topology in Hamiltonian Dynamics and Statistical Mechanics; Springer-Verlag:
+Berlin/Heidelberg, Germany, 2007.
+8. Lebowitz, J.L. Microscopic Dynamics and Macroscopic Laws. Ann. N. Y. Acad. Sci. 1981, 373, 220–233.
+9. Shibata, T.; Chawanya, T.; Chawanya, K. Noiseless Collective Motion out of Noisy Chaos. Phys. Rev. Lett.
+1999, 82, doi: http://dx.doi.org/10.1103/PhysRevLett.82.4424.
+10. Ali, S.A.; Cafaro, C.; Kim, D.-H.; Mancini, S. The effect of the microscopic correlations on the information
+geometric complexity of Gaussian statistical models. Physica A 2010, 389, 3117–3127.
+11. Sadoc, J.F.; Mosseri, R. Geometrical Frustration; Cambridge University Press: Cambridge, UK, 2006.
+12. Lee, J.M. Riemannian Manifolds: An Introduction to Curvature; Springer: New York, NY, USA, 1997.
+13. Do Carmo, M.P. Riemannian Geometry; Springer: New York, NY, USA, 1992.
+14. Cafaro, C.; Ali, S.A. Jacobi fields on statistical manifolds of negative curvature. Physica D 2007, 234, 70–80.
+15. Cafaro, C.; Giffin, A.; Ali, S.A.; Kim, D.-H. Reexamination of an information geometric construction of
+entropic indicators of complexity. Appl. Math. Comput. 2010, 217, 2944–2951.
+16. Cafaro, C.; Mancini, S. Quantifying the complexity of geodesic paths on curved statistical manifolds through
+information geometric entropies and Jacobi fields. Physica D 2011, 240, 607–618.
+17. Felice, D.; Mancini, S.; Pettini, M. Quantifying Networks Complexity from Information Geometry Viewpoint.
+J. Math. Phys. 2014, 55, 043505.
+18. Moessner, R.; Ramirez, A.P. Geometrical Frustration. Phys. Today 2006, 59, 24–29.
+19. MacKay, D.J.C. Information Theory, Inference, and Learning Algorithms; Cambridge University Press: Cambridge,
+UK, 2003.
+20. Brudno, A.A. The complexity of the trajectories of a dynamical system. Uspekhi Mat. Nauk 1978, 33, 207–208.
+21. Alekseev, V.M.; Yacobson, M.V. Symbolic dynamics and hyperbolic dynamic systems. Phys. Rep. 1981, 75,
+290–325.
+22. Myung, J.; Balasubramanian, V.; Pitt, M.A. Counting probability distributions: differential geometry and
+model selection. Proc. Natl. Acad. Sci. USA 2000, 97, 11170.
+23. Rodriguez, C.C. The volume of bitnets. AIP Conf. Proc. 2004, 735, 555–564.
+24. Motter, A.E.; Albert, R. Networks in motion. Phys. Today 2012, 65, 43–48.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+
+entropy
+
+Article
+Learning from Complex Systems: On the Roles of
+Entropy and Fisher Information in Pairwise Isotropic
+Gaussian Markov Random Fields
+
+Alexandre Levada
+
+Computing Department, Federal University of São Carlos, Rod. Washington Luiz, km 235, São Carlos, SP, Brazil;
+E-Mail: alexandre@dc.ufscar.br
+
+Received: 4 December 2013 / Accepted: 30 January 2014 / Published: 18 February 2014
+
+Abstract: Markov random field models are powerful tools for the study of complex systems.
+However, little is known about how the interactions between the elements of such systems are
+encoded, especially from an information-theoretic perspective. In this paper, our goal is to
+shed light on the connection between Fisher information, Shannon entropy, information geometry and
+the behavior of complex systems modeled by isotropic pairwise Gaussian Markov random fields.
+We propose analytical expressions to compute local and global versions of these measures using
+Besag’s pseudo-likelihood function, characterizing the system’s behavior through its Fisher curve, a
+parametric trajectory across the information space that provides a geometric representation for the
+study of complex systems in which temperature deviates from infinity. Computational experiments
+show how the proposed tools can be useful in extracting relevant information from complex patterns.
+The obtained results quantify and support our main conclusion, which is: in terms of information,
+moving towards higher entropy states (A → B) is different from moving towards lower entropy states
+(B → A), since the Fisher curves are not the same, given a natural orientation (the direction of time).
+
+Keywords: Markov random fields; information theory; Fisher information; entropy; maximum
+pseudo-likelihood estimation
+
+1. Introduction
+
+With the increasing value of information in modern society and the massive volume of digital
+data that is available, there is an urgent need for developing novel methodologies for data filtering and
+analysis in complex systems. In this scenario, the notion of what is informative or not is a top priority.
+Sometimes, patterns that at first may appear to be locally irrelevant may turn out to be extremely
+informative in a more global perspective. In complex systems, this is a direct consequence of the
+intricate non-linear relationship between the pieces of data along different locations and scales.
+Within this context, information theoretic measures play a fundamental role in a huge variety of
+applications, since they represent statistical knowledge in a systematic, elegant and formal framework.
+Since the first works of Shannon [1], and later with many other generalizations [2–4], the concept of
+entropy has been adapted and successfully applied to almost every field of science, among which we
+can cite physics [5], mathematics [6–8], economics [9] and, fundamentally, information theory [10–12].
+Similarly, the concept of Fisher information [13,14] has been shown to reveal important properties of
+statistical procedures, from lower bounds on estimation methods [15–17] to information geometry [18,19].
+Roughly speaking, Fisher information can be thought of as the likelihood analog of entropy, which is a
+probability-based measure of uncertainty.
+In general, classical statistical inference is focused on capturing information about location and
+dispersion of unknown parameters of a given family of distribution and studying how this information
+is related to uncertainty in estimation procedures. In typical situations, an exponential family of
+
+Entropy 2014, 16, 1002–1036; doi:10.3390/e16021002
+www.mdpi.com/journal/entropy
+distributions and independence hypothesis (independent random variables) are often assumed, giving
+the likelihood function a series of desirable mathematical properties [15–17].
+Although mathematically convenient for many problems, in complex systems modeling the
+independence assumption is not reasonable, because much of the information is somehow encoded
+in the relations between the random variables [20,21]. In order to overcome this limitation, Markov
+random field (MRF) models appear to be a natural generalization of the classical approach by the
+replacement of the independence assumption by a more realistic conditional independence assumption.
+Basically, in every MRF, knowledge of a finite-support neighborhood around a given variable isolates it
+from all the remaining variables. A further simplification consists in considering a pairwise interaction
+model, constraining the size of the maximum clique to be two (in other words, the model captures
+only binary relationships). Moreover, if the MRF model is isotropic, which means that the parameter
+controlling the interactions between neighboring variables is invariant to change in the directions,
+all the information regarding the spatial dependence structure of the system is conveyed by a single
+parameter, from now on denoted by β (or simply, the inverse temperature).
+In this paper, we assume an isotropic pairwise Gaussian Markov random field (GMRF) model [22,23],
+also known as an auto-normal model or a conditional auto-regressive model [24,25]. Basically, the question
+that motivated this work and that we are trying to elucidate here is: What kind of information is encoded
+by the β parameter in such a model? We want to know how this parameter, and as a consequence, the
+whole spatial dependence structure of a complex system modeled by a Gaussian Markov random field, is
+related to both local and global information theoretic measures, more precisely the observed and expected
+Fisher information, as well as self-information and Shannon entropy.
+In searching for answers for our fundamental question, investigations led us to an exact expression
+for the asymptotic variance of the maximum pseudo-likelihood (MPL) estimator of β in an isotropic
+pairwise GMRF model, suggesting that asymptotic efficiency is not guaranteed. In the context of statistical
+data analysis, Fisher information plays a central role in providing tools and insights for modeling
+the interactions between complex systems and their components. The advantage of MRF models
+over the traditional statistical ones is that MRFs take into account the dependence between pieces of
+information as a function of the system’s temperature, which may even be variable along time. Briefly
+speaking, this investigation aims to explore ways to measure and quantify distances between complex
+systems operating in different thermodynamical conditions. By analyzing and comparing the behavior
+of local patterns observed throughout the system (defined over a regular 2D lattice), it is possible
+to measure how informative those patterns are for a given inverse temperature, or simply β (which
+encodes the expected global behavior).
+In summary, our idea is to describe the behavior of a complex system in terms of information
+as its temperature deviates from infinity (when the particles are statistically independent) to a lower
+bound. The obtained results suggest that, in the beginning, when the temperature is infinite and the
+information equilibrium prevails, the information is somehow spread along the system. However,
+when temperature is low and this equilibrium condition does not hold anymore, we have a more
+sparse representation in terms of information, since this information is concentrated in the boundaries
+of the regions that define a smooth global configuration. In the vast remainder of this “universe”, due
+to this smoothness constraint, the strong alignment between the particles prevails, which is exactly the
+expected global behavior for temperatures below a critical value (making the majority of the interaction
+patterns along the system uninformative).
+The remainder of the paper is organized as follows: Section 2 discusses a technique for the
+estimation of the inverse temperature parameter, called the maximum pseudo-likelihood (MPL)
+approach, and provides derivations for the observed Fisher information in an isotropic pairwise GMRF
+model. Intuitive interpretations for the two versions of this local measure are discussed. In Section 3,
+we derive analytical expressions for the computation of the expected Fisher information, which allows
+us to assign a global information measure for a given system configuration. Similarly, in Section 4, an
+expression for the global entropy of a system modeled by a GMRF is shown. The results suggest a
+connection between maximum pseudo-likelihood and minimum entropy criteria in the estimation of
+the inverse temperature parameter on GMRFs. Section 5 discusses the uncertainty in the estimation
+of this important parameter by defining an expression for the asymptotic variance of its maximum
+pseudo-likelihood estimator in terms of both forms of Fisher information. In Section 6, the definition
+of the Fisher curve of a system as a parametric trajectory in the information space is proposed. Section
+7 shows the experimental setup. Computational simulations with both Markov chain Monte Carlo
+algorithms and some real data were conducted, showing how the proposed tools can be used to extract
+relevant information from complex systems. Finally, Section 8 presents our conclusions, final remarks
+and possibilities for future works.
+
+2. Fisher Information in Isotropic Pairwise GMRFs
+
+The remarkable Hammersley–Clifford theorem [26] states the equivalence between Gibbs random
+fields (GRF) and Markov random fields (MRF), which implies that any MRF can be defined either in
+terms of a global (joint Gibbs distribution) or a local (set of local conditional density functions) model.
+For our purposes, we will choose the latter representation.
+
+Definition 1. An isotropic pairwise Gaussian Markov random field regarding a local neighborhood system,
+ηi, defined on a lattice S = {s1, s2, . . . , sn} is completely characterized by a set of n local conditional density
+functions (LCDFs) $p(x_i \mid \eta_i, \vec{\theta})$, given by:
+
+\begin{equation}
+p\left(x_i \mid \eta_i, \vec{\theta}\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2 \right\}
+\tag{1}
+\end{equation}
+
+with $\vec{\theta} = (\mu, \sigma^2, \beta)$, where μ and σ² are the expected value and the variance of the random variables,
+and β = 1/T is the parameter that controls the interaction between the variables (inverse temperature).
+Note that, for β = 0, the model degenerates to the usual Gaussian distribution. From an information
+geometry perspective [18,19], this means that we are constrained to a sub-manifold within the
+Riemannian manifold of probability distributions, where the natural Riemannian metric (tensor)
+is given by the Fisher information. It has been shown that the geometric structure of exponential
+family distributions exhibits constant curvature. However, little is known about information geometry
+on more general statistical models, such as GMRFs. For β > 0, some degree of correlation between
+the observations is expected, making the interactions grow stronger. Typical choices for ηi are the
+first and second order non-causal neighborhood systems, defined by the sets of four and eight nearest
+neighbors, respectively.
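+
+As a worked illustration of Equation (1), the local conditional density is a Gaussian whose mean is shifted by the neighbors. The sketch below uses hypothetical values (μ = 10, σ² = 4, β = 0.2 and four neighbors) that do not come from the paper; they are assumptions chosen only to keep the arithmetic simple.
+
+```latex
+% Equation (1) read as a conditional Gaussian:
+x_i \mid \eta_i \;\sim\; \mathcal{N}\!\left( \mu + \beta \sum_{j \in \eta_i} (x_j - \mu),\; \sigma^2 \right)
+
+% Assumed (hypothetical) values: \mu = 10, \sigma^2 = 4, \beta = 0.2,
+% first-order neighborhood with observed neighbors (12, 11, 9, 12).
+\sum_{j \in \eta_i} (x_j - \mu) = 2 + 1 - 1 + 2 = 4
+\qquad\Longrightarrow\qquad
+E[x_i \mid \eta_i] = 10 + 0.2 \cdot 4 = 10.8, \quad \mathrm{Var}[x_i \mid \eta_i] = 4
+```
+
+With β = 0 the same neighbors have no effect and the conditional density reduces to N(10, 4), which matches the remark that the model degenerates to the usual Gaussian distribution.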
+
+2.1. Maximum Pseudo-Likelihood Estimation
+
+Maximum likelihood estimation is intractable in MRF parameter estimation, due to the existence
+of the partition function in the joint Gibbs distribution. An alternative, proposed by Besag [24], is
+maximum pseudo-likelihood estimation, which is based on the conditional independence principle.
+The pseudo-likelihood function is defined as the product of the LCDFs for all the n variables of the
+system, modeled as a random field.
+
+Definition 2. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the set corresponding to the observations at time
+t, the pseudo-likelihood function of the model is defined by:
+
+\begin{equation}
+L\left(\vec{\theta}; X^{(t)}\right) = \prod_{i=1}^{n} p\left(x_i \mid \eta_i, \vec{\theta}\right)
+\tag{2}
+\end{equation}
+
+
+Note that the pseudo-likelihood function is a function of the parameters. For better mathematical
+tractability, it is usual to take the logarithm of $L(\vec{\theta}; X^{(t)})$. Plugging Equation (1) into Equation (2) and
+taking the logarithm leads to:
+
+\begin{equation}
+\log L\left(\vec{\theta}; X^{(t)}\right) = -\frac{n}{2} \log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2
+\tag{3}
+\end{equation}
+
+By differentiating Equation (3) with respect to each parameter and properly solving the
+pseudo-likelihood equations, we obtain the following maximum pseudo-likelihood estimators for the
+parameters, μ, σ2 and β:
+
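+\begin{equation}
+\hat{\beta}_{MPL} = \frac{\sum_{i=1}^{n} \left[ \left(x_i - \mu\right) \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]}{\sum_{i=1}^{n} \left[ \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2}
+\tag{4}
+\end{equation}
+
+\begin{equation}
+\hat{\mu}_{MPL} = \frac{1}{n\left(1 - k\beta\right)} \sum_{i=1}^{n} \left( x_i - \beta \sum_{j \in \eta_i} x_j \right)
+\tag{5}
+\end{equation}
+
+\begin{equation}
+\hat{\sigma}^2_{MPL} = \frac{1}{n} \sum_{i=1}^{n} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2
+\tag{6}
+\end{equation}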
+
+where k denotes the cardinality of the non-causal neighborhood set ηi. Note that if β = 0, the MPL
+estimators of both μ and σ2 become the widely known sample mean and sample variance.
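+
+The step from Equation (3) to the estimators of β and μ can be sketched as follows. The shorthand S_i for the centered neighborhood sum is an assumption of this sketch, not notation used in the paper, and the result for μ presumes 1 − kβ ≠ 0.
+
+```latex
+% Shorthand (assumption of this sketch): S_i = \sum_{j \in \eta_i} (x_j - \mu)
+\begin{align*}
+\frac{\partial}{\partial\beta} \log L
+  &= \frac{1}{\sigma^2} \sum_{i=1}^{n} \left( x_i - \mu - \beta S_i \right) S_i = 0
+  \;\Longrightarrow\;
+  \hat{\beta}_{MPL} = \frac{\sum_{i=1}^{n} (x_i - \mu)\, S_i}{\sum_{i=1}^{n} S_i^2} \\
+\frac{\partial}{\partial\mu} \log L
+  &= \frac{1 - k\beta}{\sigma^2} \sum_{i=1}^{n} \left( x_i - \mu - \beta S_i \right) = 0
+  \;\Longrightarrow\;
+  \sum_{i=1}^{n} \Big( x_i - \beta \sum_{j \in \eta_i} x_j \Big) = n \mu \left( 1 - k\beta \right)
+\end{align*}
+```
+
+Dividing the last identity by n(1 − kβ) gives the estimator of μ, and the first line is the estimator of β written with the shorthand expanded.
+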
+Since the cardinality of the neighborhood system, k = |ηi|, is spatially invariant (we are assuming
+a regular neighborhood system) and each variable is dependent on a fixed number of neighbors on a
+lattice, ˆβMPL can be rewritten in terms of cross-covariances:
+
+\begin{equation}
+\hat{\beta}_{MPL} = \frac{\sum_{j \in \eta_i} \hat{\sigma}_{ij}}{\sum_{j \in \eta_i} \sum_{k \in \eta_i} \hat{\sigma}_{jk}}
+\tag{7}
+\end{equation}
+
+where $\hat{\sigma}_{ij}$ denotes the sample covariance between the central variable, xi, and xj ∈ ηi. Similarly, $\hat{\sigma}_{jk}$
+denotes the sample covariance between two variables belonging to the neighborhood system, ηi (the
+definition of the neighborhood system, ηi, does not include the location, si).
+
+2.2. Fisher Information of Spatial Dependence Parameters
+
+Basically, Fisher information measures the amount of information a sample conveys about
+an unknown parameter. It can be thought of as the likelihood analog of entropy, which is a
+probability-based measure of uncertainty. Often, when we are dealing with independent and identically
+distributed (i.i.d.) random variables, the computation of the global Fisher information present in
+a random sample $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ is quite straightforward, since each observation, xi,
+i = 1, 2, . . . , n, brings exactly the same amount of information (when we are dealing with independent
+samples, the superscript, t, is usually suppressed, since the underlying dependence structure does
+not change through time). However, this is not true for spatial dependence parameters in MRFs,
+since different configuration patterns (xi ∪ ηi) provide distinct contributions to the local observed
+Fisher information, which can be used to derive a reasonable approximation to the global Fisher
+information [27].
+
+
+2.3. The Information Equality
+
+It is widely known from statistical inference theory that, under certain regularity conditions,
+information equality holds in the case of independent observations in the exponential family [15–17].
+In other words, we can compute the Fisher information of a random sample regarding a parameter of
+interest, θ, by:
+
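+\begin{equation}
+I\left(\theta; X^{(t)}\right) = E\left[ \left( \frac{\partial}{\partial\theta} \log L\left(\theta; X^{(t)}\right) \right)^2 \right] = -E\left[ \frac{\partial^2}{\partial\theta^2} \log L\left(\theta; X^{(t)}\right) \right]
+\tag{8}
+\end{equation}
+
+where $L(\theta; X^{(t)})$ denotes the likelihood function at a time instant, t. In our investigations, to avoid the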
+joint Gibbs distribution, often intractable due to the presence of the partition function (global Gibbs
+field), we replace the usual likelihood function by Besag’s pseudo-likelihood function, and then, we
+work with the local model instead (local Markov field).
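+
+As a sanity check of Equation (8) in the independent case, both expressions can be evaluated for the mean of n i.i.d. Gaussian observations. This textbook example is an illustration added here; it is not part of the paper's derivation.
+
+```latex
+% n i.i.d. observations x_i ~ N(\mu, \sigma^2), parameter of interest \theta = \mu
+\begin{align*}
+\log L(\mu) &= -\frac{n}{2} \log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \\
+E\left[ \left( \frac{\partial}{\partial\mu} \log L \right)^2 \right]
+  &= \frac{1}{\sigma^4}\, E\left[ \left( \sum_{i=1}^{n} (x_i - \mu) \right)^2 \right]
+   = \frac{n \sigma^2}{\sigma^4} = \frac{n}{\sigma^2} \\
+-E\left[ \frac{\partial^2}{\partial\mu^2} \log L \right] &= \frac{n}{\sigma^2}
+\end{align*}
+```
+
+Both forms agree, and each observation contributes the same amount, 1/σ², to the total.
+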
+However, given the intrinsic spatial dependence structure of Gaussian Markov random field
+models, information equilibrium is not a natural condition. As we will discuss later, in general,
+information equality fails.
+Thus, in a GMRF model, we have to consider two kinds of Fisher
+information, from now on denoted by Type I (due to the first derivative of the pseudo-likelihood
+function) and Type II (due to the second derivative of the pseudo-likelihood function). Eventually,
+when certain conditions are satisfied, these two values of information will converge to a unique bound.
+Essentially, β is the parameter responsible for controlling whether both forms of information converge or
+diverge. Knowing the role of β (inverse temperature) in a GMRF model, it is expected that for β = 0
+(or T → ∞), information equilibrium prevails. In fact, we will see in the following sections that as β
+deviates from zero (and long-term correlations start to emerge), the divergence between the two kinds
+of information increases.
+In terms of information geometry, it has been shown that the geometric structure of the exponential
+family of distributions is basically given by the Fisher information matrix, which is the natural
+Riemannian metric (metric tensor) [18,19]. So, when the inverse temperature parameter is zero, the
+geometric structure of the model is a surface, since the parametric space is 2D (μ and σ²). However,
+as the inverse temperature parameter starts to increase, the original surface is gradually transformed
+to a 3D Riemannian manifold, equipped with a novel metric tensor (the 3 × 3 Fisher information
+matrix for μ, σ2 and β). In this context, by measuring the Fisher information regarding the inverse
+temperature parameter along an interval ranging from βMIN = A = 0 to βMAX = B, we are essentially
+trying to capture part of the deformation in the geometric structure of the model. In this paper, we
+focus on the computation of this measure. In future works we expect to derive the complete Fisher
+information matrix in order to completely characterize the transformations in the metric tensor.
+
+2.4. Observed Fisher Information
+
+In order to quantify the amount of information conveyed by a local configuration pattern in a
+complex system, the concept of observed Fisher information must be defined.
+
+Definition 3. Consider an MRF defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood system, ηi. The
+Type I local observed Fisher information for the observation, xi, regarding the spatial dependence parameter, β, is
+defined in terms of its local conditional density function as:
+
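+\begin{equation}
+\phi_\beta(x_i) = \left[ \frac{\partial}{\partial\beta} \log p\left(x_i \mid \eta_i, \vec{\theta}\right) \right]^2
+\tag{9}
+\end{equation}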
+
+
+Hence, for an isotropic pairwise GMRF model, the Type I local observed Fisher information
+regarding β for the observation, xi, is given by:
+
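+\begin{align}
+\phi_\beta(x_i) &= \frac{1}{\sigma^4} \left\{ \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left(x_j - \mu\right) \right] \left[ \sum_{j \in \eta_i} \left(x_j - \mu\right) \right] \right\}^2 \notag \\
+&= \frac{1}{\sigma^4} \left[ \sum_{j \in \eta_i} \left(x_i - \mu\right)\left(x_j - \mu\right) - \beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left(x_j - \mu\right)\left(x_k - \mu\right) \right]^2 \tag{10}
+\end{align}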
+
+Definition 4. Consider an MRF defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood system, ηi. The
+Type II local observed Fisher information for the observation, xi, regarding the spatial dependence parameter, β,
+is defined in terms of its local conditional density function as:
+
+\begin{equation}
+\psi_\beta(x_i) = -\frac{\partial^2}{\partial\beta^2} \log p\left(x_i \mid \eta_i, \vec{\theta}\right)
+\tag{11}
+\end{equation}
+
+In case of an isotropic pairwise GMRF model, the Type II local observed Fisher information
+regarding β for the observation, xi, is given by:
+
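+\begin{equation}
+\psi_\beta(x_i) = \frac{1}{\sigma^2} \left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left(x_j - \mu\right)\left(x_k - \mu\right) \right]
+\tag{12}
+\end{equation}
+
+Note that ψβ(xi) does not depend on xi, only on the neighborhood system, ηi.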
+Therefore, we have two local measures, φβ(xi) and ψβ(xi), that can be assigned to every element of
+a system modeled by an isotropic pairwise GMRF. In the following, we will discuss some interpretations
+for what is being measured with the proposed tools and how to define global versions for these
+measures by means of the expected Fisher information.
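+
+Both local measures follow directly from differentiating the logarithm of Equation (1) with respect to β. The short derivation below is a sketch; the shorthand S_i for the centered neighborhood sum is an assumption introduced here for compactness.
+
+```latex
+% Shorthand (assumption of this sketch): S_i = \sum_{j \in \eta_i} (x_j - \mu)
+\begin{align*}
+\log p\left(x_i \mid \eta_i, \vec{\theta}\right)
+  &= -\frac{1}{2} \log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \left( x_i - \mu - \beta S_i \right)^2 \\
+\frac{\partial}{\partial\beta} \log p
+  &= \frac{1}{\sigma^2} \left( x_i - \mu - \beta S_i \right) S_i
+  \;\Longrightarrow\;
+  \phi_\beta(x_i) = \frac{1}{\sigma^4} \left[ \left( x_i - \mu - \beta S_i \right) S_i \right]^2 \\
+\frac{\partial^2}{\partial\beta^2} \log p
+  &= -\frac{S_i^2}{\sigma^2}
+  \;\Longrightarrow\;
+  \psi_\beta(x_i) = \frac{S_i^2}{\sigma^2}
+   = \frac{1}{\sigma^2} \sum_{j \in \eta_i} \sum_{k \in \eta_i} (x_j - \mu)(x_k - \mu)
+\end{align*}
+```
+
+The second derivative contains no x_i, which is why the Type II measure depends only on the neighborhood.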
+
+2.5. The Role of Fisher Information in GMRF Models
+
+At this point, a relevant issue is the interpretation of these Fisher information measures in a
+complex system modeled by an isotropic pairwise GMRF. Roughly speaking, φβ(xi) is the quadratic
+rate of change of the logarithm of the local likelihood function at xi, given a global value of β. As
+this global value of β determines what would be the expected global behavior (if β is large, a high
+degree of correlation among the observations is expected and if β is close to zero, the observations are
+independent), it is reasonable to assume that configuration patterns showing values of φβ(xi) close to
+zero are more likely to be observed throughout the field, since their likelihood values are high (close to
+the maximum local likelihood condition). In other words, these patterns are more “aligned” with what is
+considered to be the expected global behavior, and therefore, they convey little information about the
+spatial dependence structure (these samples are not informative, since they are expected to exist in a
+system operating at that particular value of inverse temperature).
+Now, let us move on to configuration patterns showing high values of φβ(xi). Those samples
+can be considered landmarks, because they convey a large amount of information about the global
+spatial dependence structure. Roughly speaking, those points are very informative, since they are
+not expected to exist for that particular value of β (which guides the expected global behavior of the
+system). Therefore, Type I local observed Fisher information minimization in GMRFs can be a useful
+tool in producing novel configuration patterns that are more likely to exist given the chosen value of
+inverse temperature. Basically, φβ(xi) tells us how informative a given pattern is for that specific global
+behavior (represented by a single parameter in an isotropic pairwise GMRF model). In summary, this
+measure quantifies the degree of agreement between an observation, xi, and the configuration defined
+by its neighborhood system for a given β.
+As we will see later in the experiments section, typical informative patterns (those showing
+high values of φβ(xi)) in an organized system are located at the boundaries of the regions defining
+homogeneous areas (since these boundary samples show an unexpected behavior for large β, which is:
+there is no strong agreement between xi and its neighbors).
+Let us analyze the Type II local observed Fisher information, ψβ(xi). Informally speaking, this
+measure can be interpreted as a curvature measure, that is, how curved the local likelihood function
+is at xi. Thus, patterns showing low values of ψβ(xi) tend to have a nearly flat local likelihood function.
+This means that we are dealing with a pattern that could have been observed for a variety of β values
+(a large set of β values has approximately the same likelihood). An implication of this fact is that
+in a system dominated by this kind of pattern (patterns for which ψβ(xi) is close to zero), small
+perturbations may cause a sharp change in β (and, therefore, in the expected global behavior). In other
+words, these patterns are more susceptible to changes, since they do not have a “stable” configuration
+(which raises our uncertainty about the true value of β).
+On the other hand, if the global configuration is mostly composed of patterns exhibiting large
+values of ψβ(xi), changes on the global structure are unlikely to happen (uncertainty on β is sufficiently
+small). Basically, ψβ(xi) measures the degree of agreement or dependence among the observations
+belonging to the same neighborhood system. If, at a given xi, the observations belonging to ηi are
+totally symmetric around the mean value, ψβ(xi) would be zero. This is reasonable to expect, since in this
+situation there is no information about the induced spatial dependence structure (that is,
+there is no contextual information available at this point). Notice that the role of ψβ(xi) is not the same
+as φβ(xi). Actually, these two measures are almost inversely related, since if at xi the value of φβ(xi)
+is high (it is a landmark or boundary pattern), then it is expected that ψβ(xi) will be low (in decision
+boundaries or edges, the uncertainty about β is higher, causing ψβ(xi) to be small). In fact, we will
+observe this behavior in some computational experiments conducted in future sections of the paper.
+It is important to mention that these rather informal arguments define the basis for understanding
+the meaning of the asymptotic variance of maximum pseudo-likelihood estimators, as we will discuss
+ahead. In summary, ψβ(xi) is a measure of how sure or confident we are about the local spatial
+dependence structure (at a given point, xi), since a high average curvature is desired for predicting the
+system’s global behavior in a reasonable manner (reducing the uncertainty of β estimation).
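+
+The interpretations above can be made concrete with a small numerical sketch of the two local measures. All values below (μ = 0, σ² = 1, β = 0.25 and the three four-neighbor patterns) are hypothetical and chosen only for illustration; S_i denotes the centered neighborhood sum, a shorthand not used in the paper.
+
+```latex
+% Assumed values: \mu = 0, \sigma^2 = 1, \beta = 0.25, S_i = \sum_{j \in \eta_i} (x_j - \mu)
+% \phi_\beta(x_i) = [(x_i - \mu - \beta S_i) S_i]^2 / \sigma^4, \quad \psi_\beta(x_i) = S_i^2 / \sigma^2
+\begin{array}{llccc}
+\text{pattern} & \text{neighbors} & x_i & \phi_\beta(x_i) & \psi_\beta(x_i) \\
+\hline
+\text{aligned}   & (1, 1, 1, 1)   & 1  & [(1 - 1) \cdot 4]^2 = 0     & 4^2 = 16 \\
+\text{boundary}  & (1, 1, 1, -1)  & -1 & [(-1 - 0.5) \cdot 2]^2 = 9  & 2^2 = 4 \\
+\text{symmetric} & (1, -1, 1, -1) & 1  & [(1 - 0) \cdot 0]^2 = 0     & 0^2 = 0
+\end{array}
+```
+
+In this toy setting the aligned pattern has zero Type I information and high curvature, the boundary pattern has higher Type I information and lower curvature, and the symmetric neighborhood carries no contextual information at all.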
+
+3. Expected Fisher Information
+
+In order to avoid the use of approximations in the computation of the global Fisher information
+in an isotropic pairwise GMRF, in this section, we provide an exact expression for ˆφβ and ˆψβ as Type
+I and Type II expected Fisher information. One advantage of using the expected Fisher information
+instead of its global observed counterpart is the faster computing time. As we will see, instead of
+computing a single local measure for each observation, xi ∈ X, and then taking the average, both
+Φβ and Ψβ expressions depend only on the covariance matrix of the configuration patterns observed
+along the random field.
+
+3.1. The Type I Expected Fisher Information
+
+Recall that the Type I expected Fisher information, from now on denoted by Φβ, is given by:
+
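+\begin{equation}
+\Phi_\beta = E\left[ \left( \frac{\partial}{\partial\beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right)^2 \right]
+\tag{13}
+\end{equation}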
+
+The Type II expected Fisher information, from now on denoted by Ψβ, is given by:
+
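+\begin{equation}
+\Psi_\beta = -E\left[ \frac{\partial^2}{\partial\beta^2} \log L\left(\vec{\theta}; X^{(t)}\right) \right]
+\tag{14}
+\end{equation}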
+
+
+We first proceed to the definition of Φβ. Plugging Equation (3) in Equation (13), and after some
+algebra, we obtain the following expression, which is composed of four main terms:
+
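+\begin{align}
+\Phi_\beta &= \frac{1}{\sigma^4} E\left\{ \left[ \sum_{s=1}^{n} \left( x_s - \mu - \beta \sum_{j \in \eta_s} (x_j - \mu) \right) \left( \sum_{j \in \eta_s} (x_j - \mu) \right) \right]^2 \right\} \tag{15} \\
+&= \frac{1}{\sigma^4} E\Bigg\{ \sum_{s=1}^{n} \sum_{r=1}^{n} \left[ x_s - \mu - \beta \sum_{j \in \eta_s} (x_j - \mu) \right] \left[ x_r - \mu - \beta \sum_{k \in \eta_r} (x_k - \mu) \right] \notag \\
+&\qquad \times \left[ \sum_{j \in \eta_s} (x_j - \mu) \right] \left[ \sum_{k \in \eta_r} (x_k - \mu) \right] \Bigg\} \notag \\
+&= \frac{1}{\sigma^4} E\Bigg\{ \sum_{s=1}^{n} \sum_{r=1}^{n} \Bigg[ (x_s - \mu)(x_r - \mu) - \beta \sum_{k \in \eta_r} (x_s - \mu)(x_k - \mu) - \beta \sum_{j \in \eta_s} (x_r - \mu)(x_j - \mu) \notag \\
+&\qquad + \beta^2 \sum_{j \in \eta_s} \sum_{k \in \eta_r} (x_j - \mu)(x_k - \mu) \Bigg] \Bigg[ \sum_{j \in \eta_s} \sum_{k \in \eta_r} (x_j - \mu)(x_k - \mu) \Bigg] \Bigg\} \notag \\
+&= \frac{1}{\sigma^4} \sum_{s=1}^{n} \sum_{r=1}^{n} \Bigg\{ \sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ (x_s - \mu)(x_r - \mu)(x_j - \mu)(x_k - \mu) \right] \notag \\
+&\qquad - \beta \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ (x_s - \mu)(x_j - \mu)(x_k - \mu)(x_l - \mu) \right] \notag \\
+&\qquad - \beta \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ (x_r - \mu)(x_m - \mu)(x_j - \mu)(x_k - \mu) \right] \notag \\
+&\qquad + \beta^2 \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ (x_m - \mu)(x_j - \mu)(x_k - \mu)(x_l - \mu) \right] \Bigg\} \notag
+\end{align}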
+
+Hence, the expression for Φβ is composed of four main terms, each one of them involving
+a summation of higher-order cross-moments. According to Isserlis’ theorem [28], for zero-mean, jointly normally
+distributed random variables, we can compute higher-order moments in terms of the covariance
+matrix through the following identity:
+
+\begin{equation}
+E\left[X_1 X_2 X_3 X_4\right] = E\left[X_1 X_2\right] E\left[X_3 X_4\right] + E\left[X_1 X_3\right] E\left[X_2 X_4\right] + E\left[X_2 X_3\right] E\left[X_1 X_4\right]
+\tag{16}
+\end{equation}
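+
+Two special cases give a quick check of the identity in Equation (16). They are standard facts about centered Gaussian variables, added here as an illustration; the symbols σ_X², σ_Y² and σ_XY are local to this sketch.
+
+```latex
+% Case 1: X_1 = X_2 = X_3 = X_4 = X with X ~ N(0, \sigma^2)
+E\left[X^4\right] = \sigma^2 \sigma^2 + \sigma^2 \sigma^2 + \sigma^2 \sigma^2 = 3\sigma^4
+
+% Case 2: X_1 = X_2 = X and X_3 = X_4 = Y, jointly Gaussian and centered,
+% with variances \sigma_X^2, \sigma_Y^2 and covariance \sigma_{XY}
+E\left[X^2 Y^2\right] = \sigma_X^2 \sigma_Y^2 + 2 \sigma_{XY}^2
+```
+
+The first case recovers the familiar fourth moment of a Gaussian, and the second shows how every fourth-order cross-moment in Equation (15) reduces to products of covariances.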
+
+Then, the first term of Equation (15) is reduced to:
+
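+\begin{align}
+&\sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ (x_s - \mu)(x_r - \mu)(x_j - \mu)(x_k - \mu) \right] \tag{17} \\
+&\quad = \sum_{j \in \eta_s} \sum_{k \in \eta_r} \Big\{ E\left[ (x_s - \mu)(x_r - \mu) \right] E\left[ (x_j - \mu)(x_k - \mu) \right] + E\left[ (x_s - \mu)(x_j - \mu) \right] E\left[ (x_r - \mu)(x_k - \mu) \right] \notag \\
+&\qquad + E\left[ (x_r - \mu)(x_j - \mu) \right] E\left[ (x_s - \mu)(x_k - \mu) \right] \Big\} \notag \\
+&\quad = \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( \sigma_{sr}\sigma_{jk} + \sigma_{sj}\sigma_{rk} + \sigma_{rj}\sigma_{sk} \right) \notag
+\end{align}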
+
+
+where σsr denotes the covariance between variables xs and xr (note that in an MRF, we have σsr = 0 if
+xr ∉ ηs). We now proceed to the expansion of the second main term of Equation (15). Similarly, by
+applying Isserlis’ identity, we have:
+
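+\begin{equation}
+\sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ (x_s - \mu)(x_j - \mu)(x_k - \mu)(x_l - \mu) \right] = \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left( \sigma_{sj}\sigma_{kl} + \sigma_{sk}\sigma_{jl} + \sigma_{jk}\sigma_{sl} \right)
+\tag{18}
+\end{equation}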
+
+The third term of Equation (15) can be rewritten as:
+
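+\begin{equation}
+\sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ (x_r - \mu)(x_m - \mu)(x_j - \mu)(x_k - \mu) \right] = \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( \sigma_{rm}\sigma_{jk} + \sigma_{rj}\sigma_{mk} + \sigma_{mj}\sigma_{rk} \right)
+\tag{19}
+\end{equation}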
+
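+Finally, the fourth term is:
+
+\begin{equation}
+\sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ (x_m - \mu)(x_j - \mu)(x_k - \mu)(x_l - \mu) \right] = \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left( \sigma_{mj}\sigma_{kl} + \sigma_{mk}\sigma_{jl} + \sigma_{ml}\sigma_{jk} \right)
+\tag{20}
+\end{equation}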
+
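+Therefore, by combining Equations (17)–(20), we have the complete expression for Φβ,
+the Type I expected Fisher information for an isotropic pairwise GMRF model regarding the inverse
+temperature parameter, as:
+
+\begin{align}
+\Phi_\beta &= \frac{1}{\sigma^4} \sum_{s=1}^{n} \sum_{r=1}^{n} \Bigg[ \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( \sigma_{sr}\sigma_{jk} + \sigma_{sj}\sigma_{rk} + \sigma_{rj}\sigma_{sk} \right) \tag{21} \\
+&\qquad - \beta \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left( \sigma_{sj}\sigma_{kl} + \sigma_{sk}\sigma_{jl} + \sigma_{jk}\sigma_{sl} \right) \notag \\
+&\qquad - \beta \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( \sigma_{rm}\sigma_{jk} + \sigma_{rj}\sigma_{mk} + \sigma_{mj}\sigma_{rk} \right) \notag \\
+&\qquad + \beta^2 \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left( \sigma_{mj}\sigma_{kl} + \sigma_{mk}\sigma_{jl} + \sigma_{ml}\sigma_{jk} \right) \Bigg] \notag
+\end{align}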
+
+However, since we are interested in studying how the spatial correlations change as the system evolves,
+we need to estimate a value for Φβ given a single global state $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$. Hence, to
+compute Φβ from a single static configuration X(t) (a snapshot of the system at a given moment),
+we consider n = 1 in the previous equation, which means, among other things, that s = r (which
+implies ηs = ηr) and that observations belonging to different local neighborhoods are independent
+from each other (as we are dealing with a pairwise interaction Markovian process, it does not make
+sense to model the interactions between variables that are far away from each other in the lattice).
+Before proceeding, we would like to clarify some points regarding the estimation of the β
+parameter and the computation of the expected Fisher information in the isotropic pairwise GMRF
+model. Basically, there are two main possibilities: (1) the parameter is spatially-invariant, which
+means that we have a unique value, $\hat{\beta}^{(t)}$, for a global configuration of the system, X(t) (this is our
+assumption); or (2) the parameter is spatially-variant, which means that we have a set of $\hat{\beta}_s$ values,
+for s = 1, 2, . . . , n, each one of them estimated from $X_s = \{x_s^{(1)}, x_s^{(2)}, \ldots, x_s^{(t)}\}$ (we are observing the
+outcomes of a random pattern along time in a fixed position of the lattice). When we are dealing with
+the first model (β is spatially-invariant), all possible observation patterns (samples) are extracted from
+the global configuration by a sliding window (with the shape of the neighborhood system) that moves
+through the lattice at a fixed time instant, t. In this case, we are interested in studying the spatial
+correlations, not the temporal ones. In other words, we would like to investigate how the spatial
+structure of a GMRF model is related to Fisher information (this is exactly the scenario described
+above, for which n = 1). Our motivation here is to characterize, via information-theoretic measures,
+the behavior of the system as it evolves from states of minimum entropy to states of maximum entropy
+(and vice versa) by providing a geometrical tool based on the definition of the Fisher curve, which will
+be introduced in the following sections.
+Therefore, in our case (n = 1), Equation (21) is further simplified for practical usage. By unifying
+s and r to a unique index, i, we have a final expression for Φβ in terms of the local covariances between
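+the random variables in a given neighborhood system (e.g., for the eight nearest neighbors):
+
+\begin{align}
+\Phi_\beta &= \frac{1}{\sigma^4} \Bigg[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left( \sigma^2 \sigma_{jk} + 2\sigma_{ij}\sigma_{ik} \right) - 2\beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \left( \sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk} \right) \tag{22} \\
+&\qquad + \beta^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \sum_{m \in \eta_i} \left( \sigma_{jk}\sigma_{lm} + \sigma_{jl}\sigma_{km} + \sigma_{jm}\sigma_{kl} \right) \Bigg] \notag
+\end{align}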
+
+Note that we have two types of covariances in the definition of Φβ for an isotropic pairwise GMRF: (1)
+covariances between the central variable, xi, and a neighboring variable, xj, denoted by σij, for j ∈ ηi;
+and (2) covariances between two neighboring variables, xj and xk, for j, k ∈ ηi. In the next sections, we
+will see how to compute the value of Ψβ directly from the covariance matrix of the local patterns.
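+
+The reduction from the general four-term expression to the single-configuration form can be sketched term by term. The steps below only set s = r = i (so that ηs = ηr = ηi) and relabel summation indices; writing σii = σ² for the variance of the central variable follows the convention used in the simplified expression itself.
+
+```latex
+% Set s = r = i, so \eta_s = \eta_r = \eta_i and \sigma_{ii} = \sigma^2.
+\begin{align*}
+\text{term 1:}\quad & \sigma_{ii}\sigma_{jk} + \sigma_{ij}\sigma_{ik} + \sigma_{ij}\sigma_{ik}
+   = \sigma^2 \sigma_{jk} + 2\sigma_{ij}\sigma_{ik} \\
+\text{term 2:}\quad & \sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{jk}\sigma_{il} \\
+\text{term 3:}\quad & \sigma_{im}\sigma_{jk} + \sigma_{ij}\sigma_{mk} + \sigma_{mj}\sigma_{ik}
+   \;\xrightarrow{(m,j,k) \to (j,k,l)}\;
+   \sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{jk}\sigma_{il} \\
+\text{term 4:}\quad & \sigma_{mj}\sigma_{kl} + \sigma_{mk}\sigma_{jl} + \sigma_{ml}\sigma_{jk}
+   \;\xrightarrow{(m,j,k,l) \to (j,k,l,m)}\;
+   \sigma_{jk}\sigma_{lm} + \sigma_{jl}\sigma_{km} + \sigma_{jm}\sigma_{kl}
+\end{align*}
+```
+
+After relabeling, terms 2 and 3 are identical sums over ηi, which is where the factor −2β comes from.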
+
+3.2. The Type II Expected Fisher Information
+
+Following the same methodology of replacing the likelihood function by the pseudo-likelihood
+function of the GMRF model, a closed form expression for Ψβ is developed. Plugging Equation (3)
+into Equation (14) leads us to:
+
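+\begin{align}
+\Psi_\beta &= \frac{1}{\sigma^2} \sum_{i=1}^{n} E\left\{ \left[ \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2 \right\} \tag{23} \\
+&= \frac{1}{\sigma^2} \sum_{i=1}^{n} E\left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left(x_j - \mu\right)\left(x_k - \mu\right) \right] \notag \\
+&= \frac{1}{\sigma^2} \sum_{i=1}^{n} \left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} E\left[ \left(x_j - \mu\right)\left(x_k - \mu\right) \right] \right] \notag \\
+&= \frac{1}{\sigma^2} \sum_{i=1}^{n} \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \notag
+\end{align}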
+
+Note that unlike Φβ, Ψβ does not depend explicitly on β (inverse temperature). As we have seen
+before, Φβ is a quadratic function of the spatial dependence parameter.
+In order to simplify the notations and also to make computations easier, the expressions for Φβ
+and Ψβ can be rewritten in a matrix-vector form. Let Σp be the covariance matrix of the random vectors
+⃗pi, i = 1, 2, . . . , n, obtained by lexicographic ordering of the local configuration patterns xi ∪ ηi. Thus,
+considering a neighborhood system, ηi, of size K, we have Σp given by a (K + 1) × (K + 1) symmetric
+matrix (for K + 1 odd, i.e., K = 4, 8, 12, . . .):
+
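+\begin{equation*}
+\Sigma_p = \begin{pmatrix}
+\sigma_{1,1} & \cdots & \sigma_{1,K/2} & \sigma_{1,(K/2)+1} & \sigma_{1,(K/2)+2} & \cdots & \sigma_{1,K+1} \\
+\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
+\sigma_{K/2,1} & \cdots & \sigma_{K/2,K/2} & \sigma_{K/2,(K/2)+1} & \sigma_{K/2,(K/2)+2} & \cdots & \sigma_{K/2,K+1} \\
+\sigma_{(K/2)+1,1} & \cdots & \sigma_{(K/2)+1,K/2} & \sigma_{(K/2)+1,(K/2)+1} & \sigma_{(K/2)+1,(K/2)+2} & \cdots & \sigma_{(K/2)+1,K+1} \\
+\sigma_{(K/2)+2,1} & \cdots & \sigma_{(K/2)+2,K/2} & \sigma_{(K/2)+2,(K/2)+1} & \sigma_{(K/2)+2,(K/2)+2} & \cdots & \sigma_{(K/2)+2,K+1} \\
+\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
+\sigma_{K+1,1} & \cdots & \sigma_{K+1,K/2} & \sigma_{K+1,(K/2)+1} & \sigma_{K+1,(K/2)+2} & \cdots & \sigma_{K+1,K+1}
+\end{pmatrix}
+\end{equation*}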
+
+Let $\Sigma_p^{-}$ be the submatrix of dimensions K × K obtained by removing the central row and central
+column of Σp (the covariances between xi and each one of its neighbors, xj). Then, for K + 1 odd, we
+have:
+
+\begin{equation}
+\Sigma_p^{-} = \begin{pmatrix}
+\sigma_{1,1} & \cdots & \sigma_{1,K/2} & \sigma_{1,(K/2)+2} & \cdots & \sigma_{1,K+1} \\
+\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
+\sigma_{K/2,1} & \cdots & \sigma_{K/2,K/2} & \sigma_{K/2,(K/2)+2} & \cdots & \sigma_{K/2,K+1} \\
+\sigma_{(K/2)+2,1} & \cdots & \sigma_{(K/2)+2,K/2} & \sigma_{(K/2)+2,(K/2)+2} & \cdots & \sigma_{(K/2)+2,K+1} \\
+\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
+\sigma_{K+1,1} & \cdots & \sigma_{K+1,K/2} & \sigma_{K+1,(K/2)+2} & \cdots & \sigma_{K+1,K+1}
+\end{pmatrix}
+\tag{24}
+\end{equation}
+
+Thus, $\Sigma_p^{-}$ is a matrix that stores only the covariances among the neighboring variables. Furthermore,
+let $\vec{\rho}$ be the vector of dimensions K × 1 formed by all the elements of the central row of Σp, excluding
+the middle one (which is a variance actually), that is:
+
+\begin{equation}
+\vec{\rho} = \begin{pmatrix} \sigma_{(K/2)+1,1} & \cdots & \sigma_{(K/2)+1,K/2} & \sigma_{(K/2)+1,(K/2)+2} & \cdots & \sigma_{(K/2)+1,K+1} \end{pmatrix}
+\tag{25}
+\end{equation}
+
+Therefore, we can rewrite Equation (23) (for n = 1) using Kronecker products. The following definition
+provides a fast way to compute Φβ exploring these tensor products.
+
+Definition 5. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi, of size K (usual choices for K are even values: four, eight, 12, 20 or 24). Assuming that
+$X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time t and $\vec{\rho}$ and $\Sigma_p^{-}$ are defined as in
+Equations (25) and (24), the Type I expected Fisher information, Φβ, for this state, X(t), is:
+
+\begin{equation}
+\Phi_\beta = \frac{1}{\sigma^4} \left[ \sigma^2 \left\| \Sigma_p^{-} \right\|_+ + 2 \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+ - 6\beta \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^{-} \right\|_+ + 3\beta^2 \left\| \Sigma_p^{-} \otimes \Sigma_p^{-} \right\|_+ \right]
+\tag{26}
+\end{equation}
+
+where ∥A∥+ denotes the summation of all the entries of the matrix, A (not to be confused with a matrix
+norm) and ⊗ denotes the Kronecker (tensor) product. From an information geometry perspective,
+the presence of tensor products indicates the intrinsic differential geometry of a manifold in the form
+of the Riemann curvature tensor [18]. Note that all the necessary information for computing the
+Fisher information is somehow encoded in the covariance matrix of the local configuration patterns,
+(xi ∪ ηi), i = 1, 2, . . . , n, as would be expected in the case of Gaussian variables (second-order statistics).
+The same procedure is applied to the Type II expected Fisher information.
+
+Entropy 2014, 16, 1002–1036
+
+Definition 6. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi, of size K (usual choices for K are four, eight, 12, 20 or 24). Assuming that
+$X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time t and $\Sigma_p^{-}$ is defined as
+Equation (24), the Type II expected Fisher information, Ψβ, for this state, $X^{(t)}$, is given by:
+
+\[
+\Psi_\beta = \frac{1}{\sigma^2} \left\| \Sigma_p^{-} \right\|_{+}
+\tag{27}
+\]
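+
+A minimal sketch of how Equations (26) and (27) might be evaluated for a single local patch (n = 1), assuming the vector of Equation (25) is supplied as a plain list of K numbers and the matrix of Equation (24) as a K × K list of rows. The helper names (total, kron, fisher_information) and the toy numbers are illustrative assumptions and are not taken from the paper.
+
```python
def total(matrix):
    """Sum of all entries of a matrix given as a list of rows (the ||.||_+ operator)."""
    return sum(sum(row) for row in matrix)


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]


def fisher_information(rho, sigma_minus, sigma2, beta):
    """Return (Phi, Psi) following Equations (26) and (27) for one patch.

    rho         -- list of K covariances between x_i and its neighbors
    sigma_minus -- K x K list of rows with the covariances among neighbors
    sigma2      -- variance sigma^2 (assumed known or already estimated)
    beta        -- inverse temperature parameter
    """
    rho_col = [[r] for r in rho]  # K x 1 column vector
    rho_row = [list(rho)]         # 1 x K row vector (the transpose)
    s_sigma = total(sigma_minus)
    s_rr = total(kron(rho_col, rho_row))
    s_rs = total(kron(rho_row, sigma_minus))
    s_ss = total(kron(sigma_minus, sigma_minus))
    phi = (sigma2 * s_sigma + 2 * s_rr - 6 * beta * s_rs
           + 3 * beta ** 2 * s_ss) / sigma2 ** 2
    psi = s_sigma / sigma2
    return phi, psi


# Toy patch with K = 2 neighbors (hypothetical numbers).
phi, psi = fisher_information([1.0, 1.0], [[2.0, 0.5], [0.5, 2.0]], 2.0, 0.5)
print(phi, psi)
```
+
+Since every entry of a Kronecker product is the product of one entry from each factor, the sum of its entries equals the product of the two entry sums. The explicit products are kept here only to mirror the notation; in practice they could be replaced by products of sums.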
+
+3.3. Information Equilibrium in GMRF Models
+
+From the definition of both Φβ and Ψβ, a natural question arises: under what
+conditions do we have Φβ = Ψβ in an isotropic pairwise GMRF model? As we can see from
+Equations (26) and (27), the difference between Φβ and Ψβ, from now on denoted by $\Delta_\beta(\vec{\rho}, \Sigma_p^{-})$,
+is simply:
+
+\[
+\Delta_\beta\left( \vec{\rho}, \Sigma_p^{-} \right) = \frac{1}{\sigma^4} \left[ 2 \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_{+} - 6\beta \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^{-} \right\|_{+} + 3\beta^2 \left\| \Sigma_p^{-} \otimes \Sigma_p^{-} \right\|_{+} \right]
+\tag{28}
+\]
+
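+Then, intuitively, the condition for information equality is achieved when $\Delta_\beta(\vec{\rho}, \Sigma_p^{-}) = 0$. As
+$\Delta_\beta(\vec{\rho}, \Sigma_p^{-})$ is a simple quadratic function of the inverse temperature parameter, β, we can easily find
+that the value, β∗, for which $\Delta_\beta(\vec{\rho}, \Sigma_p^{-}) = 0$, is:
+
+\[
+\beta^{*} = \frac{\left\| \vec{\rho}^{\,T} \otimes \Sigma_p^{-} \right\|_{+}}{\left\| \Sigma_p^{-} \otimes \Sigma_p^{-} \right\|_{+}} \pm \frac{\sqrt{3}}{3} \, \frac{\sqrt{3 \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^{-} \right\|_{+}^{2} - 2 \left\| \Sigma_p^{-} \otimes \Sigma_p^{-} \right\|_{+} \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_{+}}}{\left\| \Sigma_p^{-} \otimes \Sigma_p^{-} \right\|_{+}}
+\tag{29}
+\]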
+
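+provided that $3 \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^{-} \right\|_{+}^{2} \geq 2 \left\| \Sigma_p^{-} \otimes \Sigma_p^{-} \right\|_{+} \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_{+}$ and $\left\| \Sigma_p^{-} \otimes \Sigma_p^{-} \right\|_{+} \neq 0$.
+Note that if $\left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_{+} = 0$, then one solution for the above equation is β∗ = 0.
+In other words, when
+σij = 0, ∀j ∈ ηi (no correlation between xi and its neighbors, xj), information equilibrium is achieved for
+β∗ = 0, which in this case, is the maximum pseudo-likelihood estimate of β, since in this matrix-vector
+notation, $\hat{\beta}_{MPL}$ is given by:
+
+\[
+\hat{\beta}_{MPL} = \frac{\sum_{j \in \eta_i} \hat{\sigma}_{ij}}{\sum_{j \in \eta_i} \sum_{k \in \eta_i} \hat{\sigma}_{jk}} = \frac{\left\| \vec{\rho} \right\|_{+}}{\left\| \Sigma_p^{-} \right\|_{+}}
+\tag{30}
+\]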
+
+In the isotropic pairwise GMRF model, if β = 0, then we have $\left\| \vec{\rho} \right\|_{+} = 0$, and as a consequence,
+Φβ = Ψβ. However, the converse is not necessarily true, that is, we may observe that Φβ = Ψβ for a
+non-zero β. One example is β∗, a solution of $\Delta_\beta(\vec{\rho}, \Sigma_p^{-}) = 0$.
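+
+A small sketch of Equations (29) and (30), assuming the three entry sums (of ⃗ρ ⊗ ⃗ρT, of ⃗ρT ⊗ Σp−, and of Σp− ⊗ Σp−) have already been computed. The function names and argument names (s_rr, s_rs, s_ss, s_rho, s_sigma) are hypothetical shorthand introduced here for illustration only.
+
```python
import math


def equilibrium_betas(s_rr, s_rs, s_ss):
    """Real roots of 2*s_rr - 6*beta*s_rs + 3*beta**2*s_ss = 0, as in Equation (29).

    s_rr -- sum of the entries of rho (x) rho^T
    s_rs -- sum of the entries of rho^T (x) Sigma_p^-
    s_ss -- sum of the entries of Sigma_p^- (x) Sigma_p^-
    Returns an empty tuple when the stated existence condition fails.
    """
    if s_ss == 0:
        raise ValueError("Equation (29) requires a non-zero s_ss")
    disc = 3 * s_rs ** 2 - 2 * s_ss * s_rr
    if disc < 0:
        return ()
    centre = s_rs / s_ss
    half_width = (math.sqrt(3) / 3) * math.sqrt(disc) / s_ss
    return (centre - half_width, centre + half_width)


def beta_mpl(s_rho, s_sigma):
    """Maximum pseudo-likelihood estimate of Equation (30): ||rho||_+ / ||Sigma_p^-||_+."""
    return s_rho / s_sigma


# Hypothetical sums for a toy patch.
print(equilibrium_betas(4.0, 10.0, 25.0))
print(beta_mpl(2.0, 5.0))
```
+
+When the first sum is zero, one of the two returned roots is zero up to rounding, which matches the remark that β∗ = 0 is a solution in the uncorrelated case.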
+
+4. Entropy in Isotropic Pairwise GMRFs
+
+Our definition of entropy is done by repeating the same process employed to derive Φβ and Ψβ.
+Knowing that the entropy of random variable x is defined by the expected value of self-information,
+given by −log p(x), it can be thought of as a probability-based counterpart to the Fisher information.
+
+Definition 7. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time t,
+then the entropy, Hβ, for this state $X^{(t)}$ is given by:
+
+\begin{align}
+H_\beta &= -E\left[ \log L\left( \vec{\theta}; X^{(t)} \right) \right] = -E\left[ \log \prod_{i=1}^{n} p\left( x_i \mid \eta_i, \vec{\theta} \right) \right] = \tag{31} \\
+&= \frac{n}{2} \log\left( 2\pi\sigma^2 \right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} E\left\{ \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2 \right\} = \notag \\
+&= \frac{n}{2} \log\left( 2\pi\sigma^2 \right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left\{ E\left[ \left( x_i - \mu \right)^2 \right] - 2\beta E\left[ \sum_{j \in \eta_i} \left( x_i - \mu \right) \left( x_j - \mu \right) \right] + \beta^2 E\left\{ \left[ \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2 \right\} \right\} \notag
+\end{align}
+
+After some algebra, the expression for Hβ becomes:
+
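+\begin{align}
+H_\beta &= \frac{n}{2} \log\left( 2\pi\sigma^2 \right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left\{ \sigma^2 - 2\beta \sum_{j \in \eta_i} \sigma_{ij} + \beta^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \right\} = \tag{32} \\
+&= \left[ \frac{n}{2} \log\left( 2\pi\sigma^2 \right) + \frac{n}{2} \right] - \frac{\beta}{\sigma^2} \sum_{i=1}^{n} \left[ \sum_{j \in \eta_i} \sigma_{ij} \right] + \frac{\beta^2}{2\sigma^2} \sum_{i=1}^{n} \left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \right] \notag
+\end{align}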
+
+Using the same matrix-vector notation introduced in the previous sections, we can further simplify the
+expression for Hβ (considering n = 1).
+
+Definition 8. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time t
+and $\vec{\rho}$ and $\Sigma_p^{-}$ are defined as Equations (25) and (24), the entropy, Hβ, for this state, $X^{(t)}$, is given by:
+
+\[
+H_\beta = H_G - \left[ \frac{\beta}{\sigma^2} \left\| \vec{\rho} \right\|_{+} - \frac{\beta^2}{2\sigma^2} \left\| \Sigma_p^{-} \right\|_{+} \right] = H_G - \left[ \frac{\beta}{\sigma^2} \left\| \vec{\rho} \right\|_{+} - \frac{\beta^2}{2} \Psi_\beta \right]
+\tag{33}
+\]
+
+where HG denotes the entropy of a Gaussian random variable with variance σ2 and Ψβ is the Type II
+expected Fisher information.
+Note that Shannon entropy is a quadratic function of the spatial dependence parameter, β.
+Since the coefficient of the quadratic term is strictly non-negative (Ψβ is the Type II expected Fisher
+information), entropy is a convex function of β. Furthermore, as expected, when β = 0 and there is no
+induced spatial dependence in the system, the resulting expression for Hβ is the usual entropy of a
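+Gaussian random variable, HG. Thus, there is a value, $\hat{\beta}_{MH}$, for the inverse temperature parameter,
+which minimizes the entropy of the system. In fact, $\hat{\beta}_{MH}$ is given by:
+
+\[
+\frac{\partial H_\beta}{\partial \beta} = \frac{\beta}{\sigma^2} \left\| \Sigma_p^{-} \right\|_{+} - \frac{1}{\sigma^2} \left\| \vec{\rho} \right\|_{+} = 0
+\tag{34}
+\]
+
+\[
+\hat{\beta}_{MH} = \frac{\left\| \vec{\rho} \right\|_{+}}{\left\| \Sigma_p^{-} \right\|_{+}} = \hat{\beta}_{MPL}
+\]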
+
+showing that the maximum pseudo-likelihood and the minimum-entropy estimates are equivalent
+in an isotropic pairwise GMRF model. Moreover, using the derived equations, we see a relationship
+between Φβ, Ψβ and Hβ:
+
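+\begin{align}
+\Phi_\beta - \Psi_\beta &= \Delta_\beta\left( \vec{\rho}, \Sigma_p^{-} \right) \tag{35} \\
+\frac{\partial^2 H_\beta}{\partial \beta^2} &= \Psi_\beta \notag
+\end{align}
+
+where the functional $\Delta_\beta(\vec{\rho}, \Sigma_p^{-})$ that represents the difference between Φβ and Ψβ is defined by
+Equation (28). These equations relate the entropy and one form of Fisher information (Ψβ) in GMRF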
+models, showing that Ψβ can be roughly viewed as the curvature of Hβ. In this sense, in a hypothetical
+information equilibrium condition Ψβ = Φβ = 0, the entropy’s curvature would be null (Hβ would
+never change). These results suggest that an increase in the value of Ψβ, which means stability (a
+measure of agreement between the neighboring observations of a given point), contributes to the curve
+and, therefore, to inducing a change in the entropy of the system. In this context, the analysis of the
+Fisher information could bring us insights in predicting the entropy of a system.
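+
+The relations above can be checked numerically with a short sketch of Equation (33) for a single site (n = 1). It assumes the two entry sums (of ⃗ρ and of Σp−) and σ2 are given; the names entropy, curvature, s_rho and s_sigma, the step size and the toy numbers are assumptions made for illustration.
+
```python
import math


def entropy(beta, s_rho, s_sigma, sigma2):
    """Equation (33) for n = 1: H = H_G - (beta/sigma^2 * ||rho||_+ - beta^2/2 * Psi)."""
    h_gauss = 0.5 * (math.log(2 * math.pi * sigma2) + 1)  # entropy of N(mu, sigma^2)
    psi = s_sigma / sigma2                                # Equation (27)
    return h_gauss - (beta / sigma2 * s_rho - 0.5 * beta ** 2 * psi)


def curvature(beta, s_rho, s_sigma, sigma2, step=1e-3):
    """Central second difference of the entropy with respect to beta."""
    h_plus = entropy(beta + step, s_rho, s_sigma, sigma2)
    h_mid = entropy(beta, s_rho, s_sigma, sigma2)
    h_minus = entropy(beta - step, s_rho, s_sigma, sigma2)
    return (h_plus - 2 * h_mid + h_minus) / step ** 2


# Hypothetical sums: ||rho||_+ = 2, ||Sigma_p^-||_+ = 5, sigma^2 = 2.
s_rho, s_sigma, sigma2 = 2.0, 5.0, 2.0
beta_hat = s_rho / s_sigma  # minimum-entropy value, equal to the MPL estimate
print(entropy(beta_hat, s_rho, s_sigma, sigma2))
print(curvature(beta_hat, s_rho, s_sigma, sigma2), s_sigma / sigma2)
```
+
+Because the entropy is quadratic in β, the second difference should agree with Ψβ up to rounding at any β, not only at the minimizer.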
+
+5. Asymptotic Variance of MPL Estimators
+
+It is known from the statistical inference literature that unbiasedness is a property that is not
+granted by maximum likelihood estimation, nor by maximum pseudo-likelihood (MPL) estimation.
+Actually, there is no universal method that guarantees the existence of unbiased estimators for a fixed
+n-sized sample. Often, in the exponential family of distributions, maximum likelihood estimators
+(MLEs) coincide with the UMVU (uniformly minimum variance unbiased) estimators, because MLEs
+are functions of complete sufficient statistics. There is an important result in statistical inference that
+shows that if the MLE is unique, then it is a function of sufficient statistics. We could enumerate
+and make a huge list of several properties that make maximum likelihood estimation a reference
+method [15–17]. One of the most important properties concerns the asymptotic behavior of MLEs:
+when we make the sample size grow infinitely (n → ∞), MLEs become asymptotically unbiased and
+efficient. Unfortunately, there is no result showing that the same occurs in maximum pseudo-likelihood
+estimation. The objective of this section is to propose a closed expression for the asymptotic variance
+of the maximum pseudo-likelihood estimator of β in an isotropic pairwise GMRF model. Unsurprisingly, this
+variance is completely defined as a function of both forms of expected Fisher information, Ψβ and Φβ,
+since, for general values of the inverse temperature parameter, the information equality condition fails.
+
+5.1. The Asymptotic Variance of the Inverse Temperature Parameter
+
+In mathematical statistics, asymptotic evaluations uncover several fundamental properties of
+inference methods, providing a powerful and general tool for studying and characterizing the behavior
+of estimators. In this section, our objective is to derive an expression for the asymptotic variance
+of the maximum pseudo-likelihood estimator of the inverse temperature parameter (β) in isotropic
+pairwise GMRF models. It is known from the statistical inference literature that both maximum
+likelihood and maximum pseudo-likelihood estimators share two important properties: consistency
+and asymptotic normality [29,30]. It is possible, therefore, to completely characterize their behaviors
+in the limiting case. In other words, the asymptotic distribution of ˆβMPL is normal, centered around
+the real parameter value (since consistency means that the estimator is asymptotically unbiased),
+with the asymptotic variance representing the uncertainty about how far we are from the mean (real
+value). From a statistical perspective, $\hat{\beta}_{MPL} \approx N(\beta, \upsilon_\beta)$, where υβ denotes the asymptotic variance
+of the maximum pseudo-likelihood estimator. It is known that the asymptotic covariance matrix of
+maximum pseudo-likelihood estimators is given by [31]:
+
+\[
+C(\vec{\theta}) = H^{-1}(\vec{\theta}) \, J(\vec{\theta}) \, H^{-1}(\vec{\theta})
+\tag{36}
+\]
+
+with:
+
+\[
+H(\vec{\theta}) = E_\beta\left[ \nabla^2 \log L\left( \vec{\theta}; X^{(t)} \right) \right]
+\tag{37}
+\]
+
+\[
+J(\vec{\theta}) = \operatorname{Var}_\beta\left[ \nabla \log L\left( \vec{\theta}; X^{(t)} \right) \right]
+\tag{38}
+\]
+
+where H is built from the Hessian (second derivatives) and J from the Jacobian (first derivatives) of the logarithm of the
+pseudo-likelihood function. Thus, considering the parameter of interest, β, we have the following
+definition for its asymptotic variance, υβ (the derivatives are taken with respect to β):
+
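+\[
+\upsilon_\beta = \frac{\operatorname{Var}_\beta\left[ \frac{\partial}{\partial \beta} \log L\left( \vec{\theta}; X^{(t)} \right) \right]}{E_\beta^2\left[ \frac{\partial^2}{\partial \beta^2} \log L\left( \vec{\theta}; X^{(t)} \right) \right]} = \frac{E_\beta\left[ \left( \frac{\partial}{\partial \beta} \log L\left( \vec{\theta}; X^{(t)} \right) \right)^2 \right] - E_\beta^2\left[ \frac{\partial}{\partial \beta} \log L\left( \vec{\theta}; X^{(t)} \right) \right]}{E_\beta^2\left[ \frac{\partial^2}{\partial \beta^2} \log L\left( \vec{\theta}; X^{(t)} \right) \right]}
+\tag{39}
+\]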
+
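+However, note that the expected value of the first derivative of $\log L(\vec{\theta}; X^{(t)})$ with relation to β is zero:
+
+\[
+E\left[ \frac{\partial}{\partial \beta} \log L\left( \vec{\theta}; X^{(t)} \right) \right] = \frac{1}{\sigma^2} \sum_{i=1}^{n} \left\{ E\left[ x_i - \mu \right] - \beta \sum_{j \in \eta_i} E\left[ x_j - \mu \right] \right\} = 0
+\tag{40}
+\]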
+
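+Therefore, the second term of the numerator of Equation (39) vanishes and the final expression for the
+asymptotic variance of the inverse temperature parameter is given as the ratio between Φβ and $\Psi_\beta^2$:
+
+\begin{align}
+\upsilon_\beta = \frac{1}{\left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \right]^2} \Bigg[ & \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left( \sigma^2 \sigma_{jk} + 2 \sigma_{ij} \sigma_{ik} \right) - 2\beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \left( \sigma_{ij} \sigma_{kl} + \sigma_{ik} \sigma_{jl} + \sigma_{il} \sigma_{jk} \right) \notag \\
+& + \beta^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \sum_{m \in \eta_i} \left( \sigma_{jk} \sigma_{lm} + \sigma_{jl} \sigma_{km} + \sigma_{jm} \sigma_{kl} \right) \Bigg] \tag{41}
+\end{align}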
+
+This derivation leads us to another definition concerning an isotropic pairwise GMRF.
+
+Definition 9. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time t,
+and $\vec{\rho}$ and $\Sigma_p^{-}$ are defined as Equations (25) and (24), the asymptotic variance of the maximum pseudo-likelihood
+estimator of the inverse temperature parameter, β, is given by (using the same matrix-vector notation from the
+previous sections):
+
+\begin{align}
+\upsilon_\beta &= \frac{\sigma^2 \left\| \Sigma_p^{-} \right\|_{+} + 2 \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_{+} - 6\beta \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^{-} \right\|_{+} + 3\beta^2 \left\| \Sigma_p^{-} \otimes \Sigma_p^{-} \right\|_{+}}{\left\| \Sigma_p^{-} \right\|_{+}^{2}} = \tag{42} \\
+&= \frac{\sigma^2}{\left\| \Sigma_p^{-} \right\|_{+}} + \frac{\sigma^4 \Delta_\beta\left( \vec{\rho}, \Sigma_p^{-} \right)}{\left\| \Sigma_p^{-} \right\|_{+}^{2}} = \frac{1}{\Psi_\beta} + \frac{1}{\Psi_\beta^2} \left( \Phi_\beta - \Psi_\beta \right) \notag
+\end{align}
+
+Note that when information equilibrium prevails, that is, Φβ = Ψβ, the asymptotic variance is
+given by the inverse of the expected Fisher information. However, the interpretation of this equation
+indicates that the uncertainty in the estimation of the inverse temperature parameter is minimized when
+Ψβ is maximized. Essentially, this means that on average, the local pseudo-likelihood functions are not
+flat, that is, small changes in the local configuration patterns along the system cannot cause abrupt
+changes in the expected global behavior (the global spatial dependence structure is not susceptible to
+sharp changes). To reach this condition, there must be a reasonable degree of agreement between the
+neighboring elements throughout the system, a behavior that is usually associated with low temperature
+states (β is above a critical value and there is a visible induced spatial dependence structure).
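+
+As a quick illustration of the last form of Equation (42), the sketch below evaluates the asymptotic variance from the two expected Fisher information values. The function name asymptotic_variance and the example values are hypothetical.
+
```python
def asymptotic_variance(phi, psi):
    """Asymptotic variance of the MPL estimator of beta, following Equation (42).

    phi -- Type I expected Fisher information
    psi -- Type II expected Fisher information (must be non-zero)
    """
    if psi == 0:
        raise ValueError("psi must be non-zero")
    # 1/psi is the usual inverse-information term; the second term is the
    # correction that appears when the information equality fails.
    return 1.0 / psi + (phi - psi) / psi ** 2


# Under information equilibrium (phi == psi) the correction vanishes.
print(asymptotic_variance(2.5, 2.5))
# Away from equilibrium the value equals phi / psi**2.
print(asymptotic_variance(4.5, 2.5))
```
+
+Writing the result as 1/Ψβ plus a correction, rather than directly as Φβ/Ψβ², makes the departure from the information equality explicit.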
+
+6. The Fisher Curve of a System
+
+With the definition of Φβ, Ψβ and Hβ, we have the necessary tools to compute three important
+information-theoretic measures of a global configuration of the system. Our idea is that we can study
+the behavior of a complex system by constructing a parametric curve in this information-theoretic
+space as a function of the inverse temperature parameter, β. Our expectation is that the resulting
+trajectory provides a geometrical interpretation of how the system moves from an initial configuration,
+A (with a low entropy value for instance), to a desired final configuration, B (with a greater value
+of entropy, for instance), since the Fisher information plays an important role in providing a natural
+metric to the Riemannian manifolds of statistical models [18,19]. We will call the path from global State
+A to global State B the Fisher curve (from A to B) of the system, denoted by $\vec{F}_A^B(\beta)$. Instead of using
+time as the parameter to build the curve, $\vec{F}$, we parametrize $\vec{F}$ by the inverse temperature parameter, β.
+
+Definition 10. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi, and $X^{(\beta_1)}, X^{(\beta_2)}, \ldots, X^{(\beta_n)}$ be a sequence of outcomes (global configurations) produced by different
+values of βi (inverse temperature parameters) for which A = βMIN = β1 < β2 < · · · < βn = βMAX = B.
+The system’s Fisher curve from A to B is defined as the function $\vec{F} : \mathbb{R} \to \mathbb{R}^3$ that maps each configuration,
+$X^{(\beta_i)}$, to a point $(\Phi_{\beta_i}, \Psi_{\beta_i}, H_{\beta_i})$ from the information space, that is:
+
+\[
+\vec{F}_A^B(\beta) = \left( \Phi_\beta, \Psi_\beta, H_\beta \right), \qquad \beta = A, \ldots, B
+\tag{43}
+\]
+
+where Φβ, Ψβ and Hβ denote the Type I expected Fisher information, the Type II expected Fisher
+information and the Shannon entropy of the global configuration, $X^{(\beta)}$, defined by:
+
+\[
+\Phi_\beta = \frac{1}{\sigma^4} \left[ \sigma^2 \left\| \Sigma_p^{-} \right\|_{+} + 2 \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_{+} - 6\beta \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^{-} \right\|_{+} + 3\beta^2 \left\| \Sigma_p^{-} \otimes \Sigma_p^{-} \right\|_{+} \right]
+\tag{44}
+\]
+
+\[
+\Psi_\beta = \frac{1}{\sigma^2} \left\| \Sigma_p^{-} \right\|_{+}
+\tag{45}
+\]
+
+\[
+H_\beta = \frac{1}{2} \left[ \log\left( 2\pi\sigma^2 \right) + 1 \right] - \left[ \frac{\beta}{\sigma^2} \left\| \vec{\rho} \right\|_{+} - \frac{\beta^2}{2} \Psi_\beta \right]
+\tag{46}
+\]
+
+In the next sections, we show some computational experiments that illustrate the effectiveness of
+the proposed tools in measuring the information encoded in complex systems. We want to investigate
+what happens to the Fisher curve as the inverse temperature parameter is modified in order to control
+the system’s global behavior. Our main conclusion, which is supported by experimental analysis, is
+that $\vec{F}_A^B(\beta) \neq \vec{F}_B^A(\beta)$. In other words, in terms of information, moving towards higher entropy states
+is not the same as moving towards lower entropy states, since the Fisher curves that represent the
+trajectory between the initial State A and the final State B are significantly different.
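+
+One way to assemble the points of a Fisher curve from Equations (44) to (46) is sketched below. It assumes that, for every value of β, the entry sums of ⃗ρ and Σp− have already been estimated from the corresponding configuration, and it uses the algebraic fact that the entry sum of a Kronecker product is the product of the entry sums of its factors. The names fisher_point and fisher_curve and the sample tuples are hypothetical.
+
```python
import math


def fisher_point(beta, s_rho, s_sigma, sigma2):
    """Return (Phi, Psi, H) for one configuration, following Equations (44)-(46).

    s_rho   -- sum of the entries of rho
    s_sigma -- sum of the entries of Sigma_p^-
    """
    # Entry sums of the Kronecker products reduce to products of entry sums.
    phi = (sigma2 * s_sigma + 2 * s_rho ** 2 - 6 * beta * s_rho * s_sigma
           + 3 * beta ** 2 * s_sigma ** 2) / sigma2 ** 2
    psi = s_sigma / sigma2
    h = (0.5 * (math.log(2 * math.pi * sigma2) + 1)
         - (beta / sigma2 * s_rho - 0.5 * beta ** 2 * psi))
    return phi, psi, h


def fisher_curve(samples):
    """Map a sequence of (beta, s_rho, s_sigma, sigma2) tuples to curve points."""
    return [fisher_point(*sample) for sample in samples]


# Two hypothetical configurations: no spatial dependence, then beta = 0.5.
curve = fisher_curve([(0.0, 0.0, 5.0, 2.0), (0.5, 2.0, 5.0, 2.0)])
for point in curve:
    print(point)
```
+
+Passing the samples in the order in which they were generated keeps the orientation of the curve, which matters when the path from A to B is compared with the path from B to A.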
+
+
+7. Computational Simulations
+
+This section discusses some numerical experiments proposed to illustrate some applications of
+the derived tools in both simulations and real data. Our computational investigations were divided
+into two main sets of experiments:
+
+(1) Local analysis: analysis of the local and global versions of the measures (φβ, ψβ, Φβ, Ψβ and Hβ),
+considering a fixed inverse temperature parameter;
+(2) Global analysis: analysis of the global versions of the measures (Φβ, Ψβ and Hβ) along Markov
+chain Monte Carlo (MCMC) simulations in which the inverse temperature parameter is modified
+to control the expected global behavior.
+
+7.1. Learning from Spatial Data with Local Information-Theoretic Measures
+
+First, in order to illustrate a simple application of both forms of local observed Fisher
+information, φβ and ψβ, we performed an experiment using some synthetic images generated by
+the Metropolis–Hastings algorithm. The basic idea of this simulation process is to start at an initial
+configuration in which temperature is infinite (or β = 0). This basic initial condition is randomly
+chosen, and after a fixed number of steps, the algorithm produces a configuration that is considered to
+be a valid outcome of an isotropic pairwise GMRF model. Figure 1 shows an example of the initial
+condition and the resulting system configuration after 1,000 iterations considering a second order
+neighborhood system (eight nearest neighbors). The model parameters were chosen as: μ = 0, σ2 = 5
+and β = 0.8.
+
+
+Figure 1. Example of Gaussian Markov random field (GMRF) model outputs. The values of the inverse
+temperature parameter, β, in the left and right configurations are zero and 0.8, respectively.
+
+Three Fisher information maps were generated from both initial and resulting configurations.
+The first map was obtained by calculating the value, φβ(xi), for every point of the system, that is for
+i = 1, 2, . . . , n. Similarly, the second one was obtained by using ψβ(xi). The last information map was
+built by using the ratio between φβ(xi) and ψβ(xi), motivated by the fact that boundaries are often
+composed of patterns that are not expected to be “aligned” to the global behavior (and, therefore, show
+high values of φβ(xi)) and also are somehow unstable (show low values of ψβ(xi)). We will call
+this measure, Lβ(xi) = φβ(xi)/ψβ(xi), the local L-information, since it is defined in terms of the first
+two derivatives of the logarithm of the local pseudo-likelihood function. Figure 2 shows the obtained
+information maps as images. Note that while φβ has a strong response for boundaries (the edges are
+light), ψβ has a weak one (so the edges are dark), evidence in favor of considering L-information in
+boundary detection procedures. Note also that in the initial condition, when the temperature is infinite,
+the informative patterns are almost uniformly spread all over the system, while the final configuration
+shows a more sparse representation in terms of information. Figure 3 shows the distribution of local
+L-information for both systems’ configurations depicted in Figure 1.
+
+Figure 2. Fisher information maps. The first row shows the information maps of the system when the
+temperature is infinite (β = 0). The second row shows the same maps when the temperature is low
+(β = 0.8). The first and second columns show information maps that were generated by computing
+φβ(xi) and ψβ(xi) for each observation in the lattice. The third column was produced by computing the
+local L-information, that is, the ratio between both local information measures. In terms of information,
+low temperature configurations are more sparse, since most local patterns are uninformative, due to
+the strong alignment of the particles throughout the system, which is the expected global behavior for
+β above a certain critical value.
+
+Figure 3. Distribution of local L-information. When the temperature is infinite, the information
+is spread along the system. For low temperature configurations, the number of local patterns with
+zero information content significantly increases, that is, the system is more sparse in terms of Fisher
+information.
+
+7.2. Analyzing Dynamical Systems with Global Information-Theoretic Measures
+
+In order to study the behavior of a complex system that evolves from an initial State A to
+another State B, we use the Metropolis–Hastings algorithm, an MCMC simulation method, to generate
+a sequence of valid isotropic pairwise GMRF model outcomes for different values of the inverse
+temperature parameter, β. This process is an attempt to perform a random walk on the state space of
+the system, that is, in the space of all possible global configurations in order to analyze the behavior of
+the proposed global measures: entropy and both forms of Fisher information. The main purpose of
+this experiment is to observe what happens to Φβ, Ψβ and Hβ when the system evolves from a random
+initial state to other global configurations. In other words, we want to investigate the Fisher curve of
+the system in order to characterize its behavior in terms of information. Basically, the idea is to use the
+Fisher curve as a kind of signature for the expected behavior of any system modeled by an isotropic
+pairwise GMRF, making it possible to gain insights into the understanding of large complex systems.
+
+
+To simulate a system in which we can control the inverse temperature parameter, we define an
+updating rule for β based on fixed increments. In summary, we start with a minimum value βMIN
+(when βMIN = 0, the temperature of the system is infinite). Then, the value of β in the iteration, t, is
+defined as the value of β in t − 1 plus a small increment (Δβ), until it reaches a pre-defined upper bound,
+βMAX. The process is then repeated with negative increments −Δβ, until the inverse temperature
+reaches its minimum value, βMIN, again. This process continues for a fixed number of iterations, NMAX,
+during an MCMC simulation. As a result of this approach, a sequence of GMRF samples is produced.
+We use this sequence to calculate Φβ, Ψβ and Hβ and define the Fisher curve ⃗F, for β = βMIN, . . . , βMAX.
+Figure 4 shows some of the system’s configurations along an MCMC simulation. In this experiment, the
+parameters were defined as: βMIN = 0, Δβ = 0.001, βMAX = 0.15, NMAX = 1,000, μ = 0, σ2 = 5
+and ηi = {(i − 1, j − 1), (i − 1, j), (i − 1, j + 1), (i, j − 1), (i, j + 1), (i + 1, j − 1), (i + 1, j), (i + 1, j + 1)}.
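+
+The updating rule for β described above can be sketched as a simple triangular sweep. This is only an illustration under stated assumptions: the function name beta_schedule is hypothetical, the step count is obtained by rounding (βMAX − βMIN)/Δβ, and each turning point is visited once per sweep, a detail the text does not specify.
+
```python
def beta_schedule(beta_min, beta_max, delta, n_max):
    """Return n_max inverse temperatures sweeping up to beta_max and back down."""
    steps = round((beta_max - beta_min) / delta)
    if steps <= 0:
        raise ValueError("beta_max must exceed beta_min by at least one increment")
    betas = []
    k, direction = 0, 1  # integer step counter avoids accumulating rounding error
    for _ in range(n_max):
        betas.append(beta_min + k * delta)
        if k == steps:
            direction = -1  # upper bound reached: start decreasing
        elif k == 0:
            direction = 1   # lower bound reached: start increasing
        k += direction
    return betas


# Parameter values quoted in the text for this experiment.
betas = beta_schedule(0.0, 0.15, 0.001, 1000)
print(len(betas), betas[0], betas[150], betas[300])
```
+
+Each value in the returned list would then drive one iteration of the sampler, so the list length equals the number of iterations, NMAX.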
+
+Figure 4. Global configurations along a Markov chain Monte Carlo (MCMC) simulation. Evolution of
+the system as the inverse temperature parameter, β, is modified to control the expected global behavior.
+
+A plot of both forms of the expected Fisher information, Φβ and Ψβ, for each iteration of
+the MCMC simulation is shown in Figure 5.
+The graph produced by this experiment shows
+some interesting results. First of all, regarding upper and lower bounds on these measures, it is
+possible to note that when there is no induced spatial dependence structure (β ≈ 0), we have an
+information equilibrium condition (Φβ = Ψβ and the information equality holds). In this condition,
+the observations are practically independent in the sense that all local configuration patterns convey
+approximately the same amount of information. Thus, it is hard to find and separate the two categories
+of patterns we know: the informative and the non-informative ones. Since they all behave in a similar
+manner, there is no informative pattern to highlight. Moreover, in this information equilibrium
+situation, Ψβ reaches its lower bound (in this simulation, we observed that in the equilibrium
+Φβ ≈ Ψβ ≈ 8), indicating that this condition emerges when the system is most susceptible to a
+change in the expected global behavior, since the uncertainty about β is maximum at this moment. In
+other words, modification in the behavior of a small subset of local patterns may guide the system to a
+totally different stable configuration in the future.
+The results also show that the difference between Φβ and Ψβ is maximum when the system
+operates with large values of β, that is, when organization emerges and there is a strong dependence
+structure among the random variables (the global configuration shows clear visible clusters and
+boundaries between them). In such states, it is expected that the majority of patterns be aligned to the
+global behavior, which causes the appearance of few, but highly informative patterns: those connecting
+elements from different regions (boundaries). Besides that, the results suggest that it takes more time
+for the system to go from the information equilibrium state to organization than the opposite. We
+will see how this fact becomes clear by analyzing the Fisher curve along Markov chain Monte Carlo
+(MCMC) simulations. Finally, the results also suggest that both Φβ and Ψβ are bounded above by a
+value that is possibly related to the size of the neighborhood system.
+
+
+Figure 5. Evolution of Fisher information along an MCMC simulation. As the difference between Φβ
+and Ψβ is maximized (*), the uncertainty about the real inverse temperature parameter is minimized
+and the number of informative patterns increases. In the information equilibrium condition (**), it is
+hard to find informative patterns, since there is no induced spatial dependence structure.
+
+Figure 6 shows the real parameter values used to generate the GMRF outputs (blue line), the
+maximum pseudo-likelihood estimate used to calculate Φβ and Ψβ (red line) and also a plot of the
+asymptotic variances (uncertainty about the inverse temperature) along the entire MCMC simulation.
+
+
+Figure 6. Real and estimated inverse temperatures along the MCMC simulation. The system’s global
+behavior is controlled by the real inverse temperature parameter values (blue line), used to generate
+the GMRF outputs. The maximum pseudo-likelihood estimate is used to compute both Φβ and Ψβ.
+Note that the uncertainty about the inverse temperature increases as β → 0 and the system approaches
+the information equilibrium condition.
+
+We now proceed to the analysis of the Shannon entropy of the system along the simulation.
+Despite showing a behavior similar to Ψβ, the range of values for entropy is significantly smaller. In
+this simulation, we observed that 0 ≤ Hβ ≤ 4.5, 0 ≤ Φβ ≤ 18 and 8 ≤ Ψβ ≤ 61. An interesting point is
+that knowledge of Φβ and Ψβ allows us to infer the entropy of the system. For example, looking at
+Figures 5 and 7, we can see that Φβ and Ψβ start to diverge a little earlier (t ≈ 80) than the entropy
+in a GMRF model begins to grow (t ≈ 120). Therefore, in an isotropic pairwise GMRF model, if the
+system is close to the information equilibrium condition, then Hβ is low, since there is little variability
+in the observed configuration patterns. When the difference between Φβ and Ψβ is large, Hβ increases.
+
+Figure 7. Evolution of Shannon entropy along an MCMC simulation. Hβ starts to grow when the
+system leaves the equilibrium condition, where the entropy in the isotropic pairwise GMRF model is
+identical to the entropy of a simple Gaussian random variable (since β → 0).
+
+Another interesting global information-theoretic measure is L-information, from now on denoted
+by Lβ, since it conveys all the information about the likelihood function (in a GMRF model, only the
+first two derivatives of L(⃗θ; X(t)) are not null). Lβ is defined as the ratio between the two forms of
+expected Fisher information, Φβ and Ψβ. A nice property about this measure is that 0 ≤ Lβ ≤ 1. With
+this single measurement, it is possible to gain insights about the global system behavior. Figure 8 shows
+that a value close to one indicates a system approximating the information equilibrium condition, while
+a value close to zero indicates a system close to the maximum entropy condition (a stable configuration
+with boundaries and informative patterns).
+
+Figure 8. Evolution of L-information along an MCMC simulation. When Lβ approaches one, the
+system tends to the information equilibrium condition. For values close to zero, the system tends to the
+maximum entropy condition.
+
+To investigate the intrinsic non-linear connection between Φβ, Ψβ and Hβ in a complex system
+modeled by an isotropic pairwise GMRF model, we now analyze its Fisher curves. The first curve,
+which is a planar one, is defined as ⃗F(β) = (Φβ, Ψβ), for A = βmin to B = βmax and shows how Fisher
+information changes when the inverse temperature of the system is modified to control the global
+behavior. Figure 9 shows the results. In the first image, the blue piece of the curve is the path from
+A to B, that is, $\vec{F}(\beta)_A^B$, and the red piece is the inverse path (from B to A), that is, $\vec{F}(\beta)_B^A$. We must
+emphasize that $\vec{F}(\beta)_A^B$ is the trajectory from a lower entropy global configuration to a higher entropy
+global configuration. On the other hand, when the system moves from B to A, we are moving towards
+entropy minimization. To make this clear, the second image of Figure 9 illustrates the same Fisher
+curve as before, but now in three dimensions, that is, ⃗F(β) = (Φβ, Ψβ, Hβ). For comparison purposes,
+Figure 10 shows the Fisher curves for another MCMC simulation with different parameter settings.
+Note that the shape of the curves is quite similar to that of the curves in Figure 9.
+
+[Figure 9 panels. First panel: "2D Fisher curve for a GMRF model", axes PHI and PSI, with points A and B and the equilibrium line marked. Second panel: "3D Fisher curve for a GMRF model", axes PHI, PSI and H, with points A and B marked.]
+
+Figure 9. 2D and 3D Fisher curves of a complex system along an MCMC simulation. The graph shows
+a parametric curve obtained by varying the β parameter from βMIN to βMAX and back. Note that, from
+a differential geometry perspective, as the divergence between Φβ and Ψβ increases, the torsion of the
+parametric curve becomes evident (the curve leaves the plane of constant entropy).
+
+[Figure 10 panels. First panel: "2D Fisher curve for an isotropic pairwise GMRF model", axes PHI and PSI, with points A and B and the equilibrium line marked. Second panel: "3D Fisher curve for an isotropic pairwise GMRF model", axes PHI, PSI and H, with points A and B marked.]
+
+Figure 10. 2D and 3D Fisher curves along another MCMC simulation. The graph shows a parametric
+curve obtained by varying the β parameter from βMIN to βMAX and back. Note that, from a geometrical
+perspective, the properties of these curves are essentially the same as the ones from the previous
+simulation.
+
+We can see that the majority of points along the Fisher curve are concentrated around two regions
+of high curvature: (A) around the information equilibrium condition (an absence of short-term and
+long-term correlations, since β = 0); and (B) around the maximum entropy value, where the divergence
+between the information values is maximum (self-organization emerges, since β is greater than a
+critical value, βc). The points that lie in the middle of the path connecting these two regions represent
+the system undergoing a phase transition. Its properties change rapidly and in an asymmetric way,
+since ⃗F(β)_A^B ≠ ⃗F(β)_B^A for a given natural orientation.
+By now, some observations can be highlighted. First, the natural orientation of the Fisher curve
+defines the direction of time. The natural A–B path (increase in entropy) is given by the blue curve and
+the natural B–A path (decrease in entropy) is given by the red curve. In other words, the only possible
+way to walk from A to B (increase Hβ) by the red path or to walk from B to A (decrease Hβ) by the
+blue path would be moving back in time (by running the recorded simulation backwards). We
+believe that a possible explanation for this fact could be that the deformation process that takes the
+original geometric structure (with constant curvature) of the usual Gaussian model (A) to the novel
+geometric structure of the isotropic pairwise GMRF model (B) is not reversible. In other words, the
+way the model is "curved" is not simply the reversal of the "flattening" process (when it is restored to its
+constant curvature form). Thus, even the basic notion of time seems to be deeply connected with the
+relationship between entropy and Fisher information in a complex system: in the natural orientation
+(forward in time), it seems that the divergence between Φβ and Ψβ is the cause of an increase in
+the entropy, and the decrease of entropy is the cause of the convergence of Φβ and Ψβ. During the
+experimental analysis, we repeated the MCMC simulations with different parameter settings, and
+the observed behavior for Fisher information and entropy was the same. Figure 11 shows the graphs
+of Φβ, Ψβ and Hβ for another recorded MCMC simulation. The results indicate that in the natural
+orientation (in the direction of time), an increase in Ψβ seems to be a trigger for an increase in the
+entropy and a decrease in the entropy seems to be a trigger for a decrease in Ψβ. Roughly speaking,
+Ψβ “pushes Hβ up” and Hβ “pushes Ψβ down”.
+
+
+Figure 11. Relations between entropy and Fisher information. When a system modeled by an isotropic
+pairwise GMRF evolves in the natural orientation (forward in time), two rules that relate Fisher
+information and entropy can be observed: (1) an increase in Ψβ is the cause of an increase in Hβ (the
+increase in Hβ is a consequence of the increase in Ψβ); (2) a decrease in Hβ is the cause of a decrease in
+Ψβ (the decrease in Ψβ is a consequence of the decrease in Hβ). In other words, when moving towards
+higher entropy states, changes in Fisher information precede changes in entropy (Ψβ “pushes Hβ
+up”). When moving towards lower entropy states, changes in entropy precede changes in Fisher
+information (Hβ “pushes Ψβ down”).
+
+In summary, the central idea discussed here is that while entropy provides a measure
+of order/disorder of the system at a given configuration, X(t), Fisher information links these
+thermodynamical states through a path (Fisher curve). Thus, Fisher information is a powerful
+mathematical tool in the study of complex and dynamical systems, since it establishes how these
+different thermodynamical states are related along the evolution of the inverse temperature. Instead of
+knowing whether the entropy, Hβ, is increasing or decreasing, with Fisher information, it is possible to
+know how and why this change is happening.
+
+7.2.1. The Effect of Induced Perturbations in the System
+
+To test whether a system can recover part of its original configuration after a perturbation is
+induced, we conducted another computational experiment. During a stable simulation, two kinds of
+perturbations were induced in the system: (1) the value of the inverse temperature parameter was set
+to zero for the next consecutive two iterations; (2) the value of the inverse temperature parameter was
+set to the equilibrium value, β∗ (the solution of Equation 28), for the next consecutive two iterations.
+We should mention that in both cases, the original value of β is recovered after these two iterations
+are completed.
+When the system is disturbed by setting β to zero, the simulations indicate that the system is
+not successful in recovering components from its previous stable configuration (note that Φβ and Ψβ
+clearly touch one another in the graph). When the same perturbation is induced, but using the smallest
+of the two β∗ values (minimum solution of Equation 28), after a short period of turbulence, the system
+can recover parts (components, clusters) of its previous stable state. This behavior suggests that this
+softer perturbation is not enough to remove all the information encoded within the spatial dependence
+structure of the system: it preserves some of the long-term correlations in the data (stronger bonds) and
+only slightly remodels the large clusters present in the system. Figures 12 and 13 illustrate the results.
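+
+The two perturbations described above can be sketched as a schedule for the inverse temperature.
+This is only an illustration: the function name beta_schedule, the number of iterations and the
+numerical values used for β and β∗ are assumptions made for the example, not values taken from
+the reported experiments.
+

```python
def beta_schedule(n_iter, beta, t_perturb, beta_perturb, duration=2):
    """Return the inverse temperature used at each iteration.

    beta is held fixed, except for `duration` consecutive iterations
    starting at `t_perturb`, during which it is replaced by `beta_perturb`.
    Afterwards the original value of beta is recovered.
    """
    schedule = []
    for t in range(n_iter):
        if t_perturb <= t < t_perturb + duration:
            schedule.append(beta_perturb)
        else:
            schedule.append(beta)
    return schedule


# Perturbation (1): beta is set to zero for two iterations.
hard = beta_schedule(n_iter=10, beta=0.8, t_perturb=5, beta_perturb=0.0)

# Perturbation (2): beta is set to a hypothetical equilibrium value beta* = 0.15.
soft = beta_schedule(n_iter=10, beta=0.8, t_perturb=5, beta_perturb=0.15)
```

+
+In an actual experiment each entry of the schedule would be passed to one MCMC sweep; the sampler
+itself is not shown here.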
+
+7.3. Considerations and Final Remarks
+
+The goal of this section is to summarize the main results obtained in this paper, focusing on the
+interpretation of the Fisher curve of a system modeled by a GMRF. First, our system is initialized with a
+random configuration, simulating that in the moment of its creation, the temperature is infinite (β = 0).
+We observe two important things at this moment: (1) there is a perfect symmetry in information, since
+the equilibrium condition prevails, that is, Φβ = Ψβ; (2) the entropy of the system is minimal. By a
+mere convention, we name this initial state of minimal entropy, A.
+As the global temperature is reduced (β increases), this “universe” deviates from its initial
+condition. As the system drifts away from the initial condition, we clearly see a break in the
+symmetry of information (Φβ diverges from Ψβ), which apparently is the cause for an increase in the
+system’s entropy, since this symmetry break seems to precede an increase in the entropy, H. This is a
+fundamental symmetry break, since other forms of ruptures that will happen in the future and will give
+rise to several properties of the system, including the basic notion of time as an irreversible process,
+follow from this first one. During this first stage of evolution, the system evolves to the condition of
+maximum entropy, named B.
+Hence, after the break in the information equilibrium condition, there is a significant increase in
+the entropy as the system continues to evolve. This stage lasts while the temperature of the system is
+further reduced or kept constant. When the temperature starts to increase (β decreases), another
+form of symmetry break takes place. By moving towards the initial condition (A) from B, changes in
+the entropy seem to precede changes in Fisher information (when moving from A to B, we observe
+exactly the opposite). Moreover, the variations in entropy and Fisher information towards A are
+not symmetric with the variations observed when we move towards B, a direct consequence of that
+first fundamental break of the information equilibrium condition. By continuing this process of
+increasing the temperature of the system until infinity (β is approaching zero), we take our system to a
+configuration that is equivalent to the initial condition, that is, where information equilibrium prevails.
+This fundamental symmetry break becomes evident when we look at the Fisher curve of the
+system. We clearly see that the path from the state of minimum entropy, A, to the state of maximum
+entropy, B, defined by the curve ⃗F_A^B (the blue trajectory in Figure 9), is not the same as the path from B
+to A, defined by the curve ⃗F_B^A (the red trajectory in Figure 9). An implication of this behavior is that if
+the system is moving along the arrow of time, then we are moving through the Fisher curve in the
+clockwise orientation. Thus, the only way to go from A to B along ⃗F_B^A (the red path) is going back in
+time.
+Therefore, if that first fundamental symmetry break did not exist, or even if it had happened, but
+all the subsequent evolution of Φβ, Ψβ and Hβ were absolutely symmetric (i.e., the variations in these
+measures were exactly the same when moving from A to B and when moving from B to A), what we
+would actually see is that ⃗F_A^B = ⃗F_B^A. As a consequence, to decrease/increase the system’s temperature
+would be like moving towards the future/past. In fact, the basic notion of time in that system would
+be compromised, since time would be a perfectly reversible process (just similar to a spatial dimension,
+in which we can move in both directions). In other words, we would not distinguish whether the
+system is moving forward or moving backwards in time.
+
+Figure 12. Disturbing the system to induce changes. Variation of Φβ and Ψβ after the system is
+disturbed by an abrupt change in the value of β. In the first image, the inverse temperature is set
+to zero. Note that Φβ and Ψβ touch one another, indicating that no residual information is kept, as
+if the simulation had been restarted from a random configuration. In the second image, the inverse
+temperature is set to the equilibrium value, β∗. The results suggest that this kind of perturbation is not
+enough to remove all the information within the spatial dependence structure, allowing the system to
+recover a significant part of its original configuration after a short stabilization period.
+
+Figure 13. The sequence of outputs along the MCMC simulation before and after the system is
+disturbed. The first row (when β is set to zero) shows that the system evolved to a different stable
+configuration after the perturbation. The second row (when β is set to β∗) indicates that the system
+was able to recover a significant part of its previous stable configuration.
+
+8. Conclusions
+
+The definition of what is information in a complex system is a fundamental concept in
+the study of many problems. In this paper, we discussed the roles of two important statistical
+measures in isotropic pairwise Markov random fields composed of Gaussian variables: Shannon
+entropy and Fisher information. By using the pseudo-likelihood function of the GMRF model, we
+derived analytical expressions for these measures. The definition of a Fisher curve as a geometric
+representation for the study and analysis of complex systems allowed us to reveal the intrinsic
+non-linear relation between these information-theoretic measures and gain insights about the behavior
+of such systems. Computational experiments demonstrate the effectiveness of the proposed tools
+in decoding information from the underlying spatial dependence structure of a Gaussian-Markov
+random field. Typical informative patterns in a complex system are located at the boundaries of
+the clusters. One of the main conclusions of this scientific investigation concerns the notion of time
+in a complex system. The obtained results suggest that the relationship between Fisher information
+and entropy determines whether the system is moving forward or backward in time. Apparently,
+in the natural orientation (when the system is evolving forward in time), when β is growing, that
+is, the temperature of the system is decreasing, an increase in Fisher information leads to an increase
+in the system’s entropy, and when β is decreasing, that is, the temperature of the system is growing,
+a decrease in the system’s entropy leads to a decrease in the Fisher information. In future work,
+we expect to completely characterize the metric tensor that represents the geometric structure of the
+isotropic pairwise GMRF model by specifying all the elements of the Fisher information matrix. Future
+investigations should also include the definition and analysis of the proposed tools in other Markov
+random field models, such as the Ising and Potts pairwise interaction models. Besides, a topic of
+interest concerns the investigation of minimum and maximum information paths in graphs to explore
+intrinsic similarity measures between objects belonging to a common surface or manifold in ℜn. We
+believe this study could bring benefits to some pattern recognition and data analysis computational
+applications.
+
+Acknowledgments: The author would like to thank CNPQ (Brazilian Council for Research and Development) for
+the financial support through research grant number 475054/2011-3.
+
+Conflicts of Interest: The author declares no conflict of interest.
+
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+
+entropy
+
+Article
+Network Decomposition and Complexity Measures:
+An Information Geometrical Approach
+
+Masatoshi Funabashi
+
+Sony Computer Science Laboratories, inc. Takanawa muse bldg. 3F, 3-14-13, Higashi Gotanda, Shinagawa-ku,
+Tokyo 141-0022, Japan; E-Mail: masa_funabashi@csl.sony.co.jp; Tel.: +81-3-5448-4380; Fax: +81-3-5448-4273
+
+Received: 28 March 2014; in revised form: 24 June 2014 / Accepted: 14 July 2014 /
+Published: 23 July 2014
+
+Abstract: We consider the graph representation of a stochastic model with n binary variables and
+develop an information-theoretical framework to measure the degree of statistical association
+between subsystems, as well as the association represented by each edge of the graph. In addition,
+we propose novel measures of complexity with respect to system decompositionability, by
+introducing the geometric product of Kullback–Leibler (KL-) divergences. These complexity
+measures satisfy the boundary condition of vanishing in the limits of completely random and
+completely ordered states, and also when an independent subsystem of any size exists. Such complexity
+measures, based on geometric means, are relevant to the heterogeneity of dependencies between
+subsystems and to the amount of information propagation shared across the entire system.
+
+Keywords: information geometry; complexity measure; complex network; system decompositionability;
+geometric mean
+
+1. Introduction
+
+Complex systems sciences emphasize the importance of non-linear interactions that cannot be
+easily approximated linearly. In other words, the degrees of non-linear interactions are the source of
+complexity. The classical reductionism approach generally decomposes a system into its components
+with linear interactions, and tries to evaluate whether the whole property of the system can still
+be reproduced. If this decomposition of a system destroys too much information to reproduce the
+system’s whole property, the plausibility of such reductionism is lost. Inversely, if we can evaluate
+how much information is ignored by the decomposition, we can assume how much complexity of the
+whole system is lost. This gives us a way to measure the complexity of a system with respect to the
+system decomposition.
+In stochastic systems described as a set of joint distributions, the interaction can basically be
+expressed as the statistical association between the variables. The simplest reductionism approach is to
+separate the whole system into some subsets of variables, and assume the independence between them.
+If such decomposition does not affect the system’s property, the isolated subsystem is independent
+from the rest. On the other hand, if the decomposition loses too much information, then the subsystem
+is inside of a larger subsystem with strong internal dependencies and cannot be easily separated.
+Stochastic models have often been represented as graphs and
+studied under the name of complex networks [1–3]. Generally, the nodes represent the variables and
+the weights on the edges are the statistical association between them. However, if we consider the
+information contained in the different orders of dependencies among variables, the graph with a single
+kind of edges is not sufficient to express the whole information of the system [4]. An edge of a graph
+with n nodes contains the information of statistical association up to the n-th order dependencies
+among n variables. If we try to decompose the system into independent parts by cutting this information, we
+have to consider what it means to cut the edge of the graph from the information theoretical point
+of view.
+
+Entropy 2014, 16, 4132–4167; doi:10.3390/e16074132
+www.mdpi.com/journal/entropy
+
+Indeed, analysis of the degree of dependencies existing between variables has led to many definitions
+of complexity in stochastic models [5], which have mostly been studied from an information-theoretical
+perspective. Beginning with the seminal works of Lempel and Ziv (e.g., [6]), the computation-oriented
+definition of complexity takes a deterministic formalization and measures the information necessary to
+reproduce a given symbolic sequence exactly, which is classified as algorithmic complexity [7–9].
+On the other hand, the statistical approach to complexity, namely statistical complexity, assumes
+some stochastic model as its theoretical basis, and refers to the structure of the information source on it
+in a measure-theoretic way [10–12].
+One of the most classical statistical complexities is the mutual information between two stochastic
+variables, and its generalized form to measure dependence between n variables is proposed (e.g., [13])
+and explored in relevance to statistical models and theories by several authors [14–16].
+We should also recall that complexity is not necessarily defined by information theory alone,
+but is also motivated by the organization of living systems, such as brain activity. The TSE complexity
+shows further extension of generalized mutual information into biological context, where complexity
+exists as the heterogeneity between different system hierarchies [17]. These statistical complexities
+are all based on the boundary condition of vanishing at the limit of completely random and ordered
+state [18].
+The complexity measure is usually a projection from the system’s variables to a one-dimensional
+quantity, composed to express the degree of a characteristic that we define to be important for
+what “complexity” means. Since a complexity measure is always a many-to-one association, it both
+compresses information, to classify the system from simple to complex, and loses resolution
+of the system’s phase space. If the system has n variables, we generally need n independent complexity
+measures to completely characterize the system with real-value resolution. The problem of
+defining a complexity measure lies in balancing the compression of information on the
+system’s complexity, with theoretical support, against keeping the resolution of the system
+identification high enough to avoid trivial classification. The latter criterion gains importance as
+the system size becomes larger. A better complexity measure is therefore a set of indices, as
+few as possible, that characterizes the major features related to the complexity of the system.
+In this sense, the ensemble of complexity measures is also analogous to the feature space of a support
+vector machine. A non-trivial set of complexity measures needs to be complementary to each other in
+parameter space for the best possible discrimination of different systems.
+In this paper, we first consider a stochastic system with binary variables and theoretically
+develop a way to measure the information between subsystems that is consistent with the information
+represented by the edges of the graph representation.
+Next, we focus on the generalized mutual information as the starting point of the argument,
+and consider incorporating network heterogeneity into novel measures of complexity with
+respect to the system’s decompositionability. This approach will be shown to be complementary to
+TSE complexity, as the difference between the arithmetic and geometric means of information.
+
+2. System Decomposition
+
+Let us consider the stochastic system with n binary variables $x = (x_1, \cdots, x_n)$ where
+$x_i \in \{0, 1\}$ $(1 \le i \le n)$. We denote the joint distribution of x by p(x). We define the
+decomposition $p_{dec}(x)$ of p(x) into two subsystems $y^1 = (x^1_1, \cdots, x^1_{n_1})$ and
+$y^2 = (x^2_1, \cdots, x^2_{n_2})$ ($n_1 + n_2 = n$, $y^1 \cup y^2 = x$, $y^1 \cap y^2 = \emptyset$) as follows:
+
+$$p_{dec}(x) = p(y^1)\,p(y^2), \tag{1}$$
+
+where p(y1) and p(y2) are the joint distributions of y1 and y2, respectively. For simplicity, hereafter
+we denote the system decomposition using the smallest subscript of variables in each subsystem. For
+example, in case n = 4, y1 = (x1, x3) and y2 = (x2, x4), we describe the decomposed system pdec(x)
+as < 1212 >. The system decomposition means to cut all statistical association between the two
+subsystems, which is expressed as setting the independent relation between them.
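+
+As a minimal numerical sketch of Equation (1), the decomposition can be computed from a joint
+probability table. The array layout (one axis per binary variable), the function names decompose and
+kl, and the random test distribution are assumptions made for this example. The KL-divergence is
+used here because it is the measure of lost association adopted later in this section.
+

```python
import numpy as np


def decompose(p, labels):
    """Return p_dec(x) = p(y1) p(y2) for a joint table p of shape (2,)*n.

    labels is a string such as "1212": character i names the subsystem
    (1 or 2) that variable x_{i+1} belongs to.
    """
    n = p.ndim
    axes1 = tuple(i for i in range(n) if labels[i] == "1")
    axes2 = tuple(i for i in range(n) if labels[i] == "2")
    # Marginal of y1: sum out the variables of y2 (and vice versa).
    p1 = p.sum(axis=axes2, keepdims=True)
    p2 = p.sum(axis=axes1, keepdims=True)
    # Broadcasting rebuilds the full (2,)*n table of the product.
    return p1 * p2


def kl(p, q):
    """KL-divergence D[p : q] in nats, skipping zero-probability states of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2))      # arbitrary joint distribution, n = 4
p /= p.sum()

p_dec = decompose(p, "1212")      # y1 = (x1, x3), y2 = (x2, x4)
loss = kl(p, p_dec)               # association cut by the decomposition
```

+
+Decomposing p_dec a second time with the same labels leaves it unchanged, which reflects that the two
+subsystems are already independent after the first decomposition.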
+We will further consider the Equation (1) in terms of the graph representation. We define
+the undirected graph Γ := (V, E) of the system p(x), whose vertices V = {x1, · · · , xn} and edges
+E = V × V represent the variables and the statistical association, respectively. To express the system,
+we set the value of each vertex as the value of the corresponding variable, and the weight of each edge
+as the degree of dependency between the connected variables.
+There is however a problem considering the representation with a single kind of edge. The
+statistical association among variables is not only between two variables, but can be independently
+defined among plural variables up to the n-th order. Therefore, the exact definition of the weight
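+of the edges remains unclear. To clarify these problematics, we consider the hierarchical marginal
+distributions $\boldsymbol{\eta}$ as another coordinates of the system p(x) as follows:
+
+$$\boldsymbol{\eta} = (\boldsymbol{\eta}^1; \boldsymbol{\eta}^2; \cdots; \boldsymbol{\eta}^n), \tag{2}$$
+
+where
+
+$$\begin{aligned}
+\boldsymbol{\eta}^1 &= (\eta_1, \cdots, \eta_i, \cdots, \eta_n), \quad (1 < i < n), \\
+\boldsymbol{\eta}^2 &= (\eta_{1,2}, \cdots, \eta_{i,j}, \cdots, \eta_{n-1,n}), \quad (1 < i < j < n), \\
+&\;\;\vdots \\
+\boldsymbol{\eta}^n &= \eta_{1,2,\cdots,n},
+\end{aligned} \tag{3}$$
+
+and
+
+$$\begin{aligned}
+\eta_1 &= \sum_{i_2,\cdots,i_n \in \{0,1\}} p(1, i_2, \cdots, i_n), \\
+&\;\;\vdots \\
+\eta_n &= \sum_{i_1,\cdots,i_{n-1} \in \{0,1\}} p(i_1, \cdots, i_{n-1}, 1), \\
+\eta_{1,2} &= \sum_{i_3,\cdots,i_n \in \{0,1\}} p(1, 1, i_3, \cdots, i_n), \\
+&\;\;\vdots \\
+\eta_{n-1,n} &= \sum_{i_1,\cdots,i_{n-2} \in \{0,1\}} p(i_1, \cdots, i_{n-2}, 1, 1), \\
+&\;\;\vdots \\
+\eta_{1,2,\cdots,n} &= p(1, 1, \cdots, 1).
+\end{aligned} \tag{4}$$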
+
+Since the definition of $\boldsymbol{\eta}$ is a linear transformation of p(x), both coordinates have the
+degrees of freedom $\sum_{k=1}^{n} {}_nC_k$.
+The subcoordinates $\boldsymbol{\eta}^1$ are simply the set of marginal distributions of each variable. The
+subcoordinates $\boldsymbol{\eta}^k$ $(1 < k \le n)$ include the statistical association among k variables, that cannot
+be expressed with the coordinates less than the k-th order. This means that the different statistical
+associations exist independently in each order among the corresponding sets of the variables. The
+statistical association represented by the weight of a graph edge $\{x_i, x_j\}$ is therefore the superposition
+of the different dependencies defined on every subset of x including $x_i$ and $x_j$.
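+
+The hierarchical marginals of Equation (4) can be computed directly from a joint table. The following is
+a sketch under assumed conventions: the function name eta_coordinates, the dictionary keyed by 1-based
+index tuples and the uniform toy distribution are illustrative choices, not part of the original
+formulation.
+

```python
import itertools

import numpy as np


def eta_coordinates(p):
    """Map each non-empty index subset S to eta_S = P(x_i = 1 for all i in S).

    p is a joint table of shape (2,)*n; keys use 1-based indices as in the text.
    """
    n = p.ndim
    eta = {}
    for k in range(1, n + 1):
        for subset in itertools.combinations(range(n), k):
            # Fix the variables in S to 1 and sum over all the others.
            index = tuple(1 if i in subset else slice(None) for i in range(n))
            eta[tuple(i + 1 for i in subset)] = float(p[index].sum())
    return eta


p = np.full((2, 2, 2), 1 / 8)     # three independent fair bits (toy example)
eta = eta_coordinates(p)
```

+
+For n = 3 the dictionary has 3 + 3 + 1 = 7 entries, in agreement with the count of degrees of freedom
+given above.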
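+To measure the degree of statistical association in each order, the information geometry established
+the following setting [19]. We first define another coordinates
+$\boldsymbol{\theta} = (\boldsymbol{\theta}^1; \boldsymbol{\theta}^2; \cdots; \boldsymbol{\theta}^n)$ that are the dual
+coordinates of $\boldsymbol{\eta}$ with respect to the Legendre transformation of the exponential family’s potential
+function $\psi(\boldsymbol{\theta})$ to its conjugate potential $\phi(\boldsymbol{\eta})$ as follows:
+
+$$\begin{aligned}
+\boldsymbol{\theta}^1 &= (\theta_1, \cdots, \theta_n), \\
+\boldsymbol{\theta}^2 &= (\theta_{1,2}, \cdots, \theta_{n-1,n}), \\
+&\;\;\vdots \\
+\boldsymbol{\theta}^n &= \theta_{1,2,\cdots,n},
+\end{aligned} \tag{5}$$
+
+where
+
+$$\begin{aligned}
+\psi(\boldsymbol{\theta}) &= \log \frac{1}{p(0, \cdots, 0)}, \\
+\phi(\boldsymbol{\eta}) &= \sum_i \theta_i \eta_i + \sum_{i<j} \theta_{i,j} \eta_{i,j} + \cdots + \theta_{1,2,\cdots,n}\, \eta_{1,2,\cdots,n} - \psi(\boldsymbol{\theta}), \\
+\theta_i &= \frac{\partial \phi(\boldsymbol{\eta})}{\partial \eta_i}, \quad (1 \le i \le n), \\
+\theta_{i,j} &= \frac{\partial \phi(\boldsymbol{\eta})}{\partial \eta_{i,j}}, \quad (1 \le i < j \le n), \\
+&\;\;\vdots \\
+\theta_{1,2,\cdots,n} &= \frac{\partial \phi(\boldsymbol{\eta})}{\partial \eta_{1,2,\cdots,n}}.
+\end{aligned} \tag{6}$$
+
+Note that $\boldsymbol{\eta}$ can be inversely derived from $\boldsymbol{\theta}$, following the Legendre transformation
+between $\phi(\boldsymbol{\eta})$ and $\psi(\boldsymbol{\theta})$:
+
+$$\begin{aligned}
+\eta_i &= \frac{\partial \psi(\boldsymbol{\theta})}{\partial \theta_i}, \quad (1 \le i \le n), \\
+\eta_{i,j} &= \frac{\partial \psi(\boldsymbol{\theta})}{\partial \theta_{i,j}}, \quad (1 \le i < j \le n), \\
+&\;\;\vdots \\
+\eta_{1,2,\cdots,n} &= \frac{\partial \psi(\boldsymbol{\theta})}{\partial \theta_{1,2,\cdots,n}}.
+\end{aligned} \tag{7}$$
+
+Using the coordinates $\boldsymbol{\theta}$, the system is described in the form of the exponential family as follows:
+
+$$\log p(x) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{i,j} x_i x_j + \cdots + \theta_{1,2,\cdots,n}\, x_1 x_2 \cdots x_n - \psi(\boldsymbol{\theta}). \tag{8}$$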
+
+The information geometry revealed that the exponential family of probability distributions forms a
+manifold with a dual-flat structure. More precisely, the coordinates $\boldsymbol{\theta}$ form a flat manifold with respect
+to the Fisher information matrix as the Riemannian metric, and α-connection with α = 1. Dually to $\boldsymbol{\theta}$,
+the coordinates $\boldsymbol{\eta}$ are flat with respect to the same metric but α-connection with α = −1. It is known that
+$\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ are orthogonal to each other with respect to the Fisher information matrix. This structure gives
+us a way to decompose the degree of statistical association among variables into separated elements of
+arbitrary orders. We define the so-called k-cut mixture coordinates $\boldsymbol{\zeta}^k$ as follows [14]:
+
+$$\begin{aligned}
+\boldsymbol{\zeta}^k &= (\boldsymbol{\eta}^{k-}; \boldsymbol{\theta}^{k+}), &\qquad (9) \\
+\boldsymbol{\eta}^{k-} &= (\boldsymbol{\eta}^1, \cdots, \boldsymbol{\eta}^k), &\qquad (10) \\
+\boldsymbol{\theta}^{k+} &= (\boldsymbol{\theta}^{k+1}, \cdots, \boldsymbol{\theta}^n). &\qquad (11)
+\end{aligned}$$
+
+
+We also define the k-cut mixture coordinates $\boldsymbol{\zeta}^k_0 = (\boldsymbol{\eta}^{k-}; 0, \cdots, 0)$ with no dependency above the
+k-th order. We denote the system specified with $\boldsymbol{\zeta}^k$ and $\boldsymbol{\zeta}^k_0$ as $p(x, \boldsymbol{\zeta}^k)$ and
+$p(x, \boldsymbol{\zeta}^k_0)$, respectively.
+Then the degree of the statistical association more than the k-th order in the system can be
+measured by the Kullback-Leibler (KL-) divergence $D[p(x, \boldsymbol{\zeta}) : p(x, \boldsymbol{\zeta}^k_0)]$.
+
+$$2N \cdot D[p(x, \boldsymbol{\zeta}) : p(x, \boldsymbol{\zeta}^k_0)] \sim \chi^2\Big(\sum_{i=k+1}^{n} {}_nC_i\Big), \tag{12}$$
+
+where D[· : ·] is the KL-divergence from the first system to the second one.
+Here, the decomposition is performed according to the orders of statistical association, which does
+not spatially distinguish the vertices. If we define the weight of an edge $\{x_i, x_j\}$ with the KL-divergence,
+the above k-cut coordinates $\boldsymbol{\zeta}^k$ are not appropriate to measure the information represented in each
+edge. We need to set another mixture coordinates so as to separate only the information existing
+between $x_i$ and $x_j$, regardless of its order.
+Let us return to the definition of the system decomposition and consider it on the dual-flat
+coordinates $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$.
+
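+Proposition 1. The independence between the two decomposed systems $y^1 = (x^1_1, \cdots, x^1_{n_1})$ and
+$y^2 = (x^2_1, \cdots, x^2_{n_2})$ can be expressed on the new coordinates $\boldsymbol{\eta}^{\mathrm{dec}}$ as follows:
+
+\[
+\begin{aligned}
+\eta^{\mathrm{dec}}_{i} &= \eta_i, \quad (1 \le i \le n), \\
+\eta^{\mathrm{dec}}_{i,j} &=
+\begin{cases}
+\eta_{i,j}, & \text{if } \{x_i, x_j\} \subseteq y^1 \text{ or } \subseteq y^2, \\
+\eta_i \eta_j, & \text{else},
+\end{cases}
+\quad (1 \le i < j \le n), \\
+\eta^{\mathrm{dec}}_{i,j,k} &=
+\begin{cases}
+\eta_{i,j,k}, & \text{if } \{x_i, x_j, x_k\} \subseteq y^1 \text{ or } \subseteq y^2, \\
+\eta_{i,j}\,\eta_k, & \text{else if } \{x_i, x_j\} \subseteq y^1 \text{ or } \subseteq y^2, \\
+\eta_i\,\eta_{j,k}, & \text{else if } \{x_j, x_k\} \subseteq y^1 \text{ or } \subseteq y^2, \\
+\eta_j\,\eta_{i,k}, & \text{else (if } \{x_i, x_k\} \subseteq y^1 \text{ or } \subseteq y^2\text{)},
+\end{cases}
+\quad (1 \le i < j < k \le n), \\
+&\;\;\vdots \\
+\eta^{\mathrm{dec}}_{1,2,\cdots,n} &= \eta_{s[i,k_1,\cdots,k_{n_1-1}]}\;\eta_{s[j,l_1,\cdots,l_{n_2-1}]}, \quad (x_i \in y^1,\; x_j \in y^2),
+\end{aligned}
+\tag{13}
+\]
+
+where s[· · · ] is the ascending sort of the internal sequence.
+Then the corresponding dual coordinates $\boldsymbol{\theta}^{\mathrm{dec}}$ take 0 elements as follows:
+
+\[
+\begin{aligned}
+\theta^{\mathrm{dec}}_{i,j} &= 0, \quad (1 \le i < j \le n), && \text{if } \{x_i, x_j\} \cap y^1 \neq \emptyset \text{ and } \{x_i, x_j\} \cap y^2 \neq \emptyset, \\
+\theta^{\mathrm{dec}}_{i,j,k} &= 0, \quad (1 \le i < j < k \le n), && \text{if } \{x_i, x_j, x_k\} \cap y^1 \neq \emptyset \text{ and } \{x_i, x_j, x_k\} \cap y^2 \neq \emptyset, \\
+&\;\;\vdots \\
+\theta^{\mathrm{dec}}_{1,2,\cdots,n} &= 0.
+\end{aligned}
+\tag{14}
+\]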
+
+Proof. For simplicity, we show the cases of n = 2 and n = 3 for the first node separation.
+
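+For n = 2, the above defined $\boldsymbol{\eta}^{\mathrm{dec}}$ for the system decomposition < 12 > gives its dual coordinates
+$\boldsymbol{\theta}^{\mathrm{dec}}$ as follows:
+
+\[
+\begin{aligned}
+\theta^{\mathrm{dec}}_{1} &= \log \frac{\eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{1,2}}{1 - \eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{2} + \eta^{\mathrm{dec}}_{1,2}} = \log \frac{\eta_1}{1 - \eta_1}, \\
+\theta^{\mathrm{dec}}_{2} &= \log \frac{\eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{1,2}}{1 - \eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{2} + \eta^{\mathrm{dec}}_{1,2}} = \log \frac{\eta_2}{1 - \eta_2}, \\
+\theta^{\mathrm{dec}}_{1,2} &= \log \frac{\eta^{\mathrm{dec}}_{1,2}\,(1 - \eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{2} + \eta^{\mathrm{dec}}_{1,2})}{(\eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{1,2})(\eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{1,2})} = 0,
+\end{aligned}
+\tag{15}
+\]
+
+which means that the first and second nodes are independent.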
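+For n = 3, the above defined $\boldsymbol{\eta}^{\mathrm{dec}}$ for the system decomposition < 122 > gives its dual coordinates
+$\boldsymbol{\theta}^{\mathrm{dec}}$ as follows:
+
+\[
+\begin{aligned}
+\theta^{\mathrm{dec}}_{1} &= \log \frac{\eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{1,2} - \eta^{\mathrm{dec}}_{1,3} + \eta^{\mathrm{dec}}_{1,2,3}}{1 - \eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{3} + \eta^{\mathrm{dec}}_{1,2} + \eta^{\mathrm{dec}}_{1,3} + \eta^{\mathrm{dec}}_{2,3} - \eta^{\mathrm{dec}}_{1,2,3}} = \log \frac{\eta_1}{1 - \eta_1}, \\
+\theta^{\mathrm{dec}}_{2} &= \log \frac{\eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{1,2} - \eta^{\mathrm{dec}}_{2,3} + \eta^{\mathrm{dec}}_{1,2,3}}{1 - \eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{3} + \eta^{\mathrm{dec}}_{1,2} + \eta^{\mathrm{dec}}_{1,3} + \eta^{\mathrm{dec}}_{2,3} - \eta^{\mathrm{dec}}_{1,2,3}} = \log \frac{\eta_2 - \eta_{2,3}}{1 - \eta_2 - \eta_3 + \eta_{2,3}}, \\
+\theta^{\mathrm{dec}}_{3} &= \log \frac{\eta^{\mathrm{dec}}_{3} - \eta^{\mathrm{dec}}_{1,3} - \eta^{\mathrm{dec}}_{2,3} + \eta^{\mathrm{dec}}_{1,2,3}}{1 - \eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{3} + \eta^{\mathrm{dec}}_{1,2} + \eta^{\mathrm{dec}}_{1,3} + \eta^{\mathrm{dec}}_{2,3} - \eta^{\mathrm{dec}}_{1,2,3}} = \log \frac{\eta_3 - \eta_{2,3}}{1 - \eta_2 - \eta_3 + \eta_{2,3}},
+\end{aligned}
+\tag{16}
+\]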
+
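+\[
+\begin{aligned}
+\theta^{\mathrm{dec}}_{1,2} &= \log \frac{(\eta^{\mathrm{dec}}_{1,2} - \eta^{\mathrm{dec}}_{1,2,3})(1 - \eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{3} + \eta^{\mathrm{dec}}_{1,2} + \eta^{\mathrm{dec}}_{1,3} + \eta^{\mathrm{dec}}_{2,3} - \eta^{\mathrm{dec}}_{1,2,3})}{(\eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{1,2} - \eta^{\mathrm{dec}}_{1,3} + \eta^{\mathrm{dec}}_{1,2,3})(\eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{1,2} - \eta^{\mathrm{dec}}_{2,3} + \eta^{\mathrm{dec}}_{1,2,3})} = 0, \\
+\theta^{\mathrm{dec}}_{1,3} &= \log \frac{(\eta^{\mathrm{dec}}_{1,3} - \eta^{\mathrm{dec}}_{1,2,3})(1 - \eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{3} + \eta^{\mathrm{dec}}_{1,2} + \eta^{\mathrm{dec}}_{1,3} + \eta^{\mathrm{dec}}_{2,3} - \eta^{\mathrm{dec}}_{1,2,3})}{(\eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{1,2} - \eta^{\mathrm{dec}}_{1,3} + \eta^{\mathrm{dec}}_{1,2,3})(\eta^{\mathrm{dec}}_{3} - \eta^{\mathrm{dec}}_{1,3} - \eta^{\mathrm{dec}}_{2,3} + \eta^{\mathrm{dec}}_{1,2,3})} = 0, \\
+\theta^{\mathrm{dec}}_{2,3} &= \log \frac{(\eta^{\mathrm{dec}}_{2,3} - \eta^{\mathrm{dec}}_{1,2,3})(1 - \eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{3} + \eta^{\mathrm{dec}}_{1,2} + \eta^{\mathrm{dec}}_{1,3} + \eta^{\mathrm{dec}}_{2,3} - \eta^{\mathrm{dec}}_{1,2,3})}{(\eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{1,2} - \eta^{\mathrm{dec}}_{2,3} + \eta^{\mathrm{dec}}_{1,2,3})(\eta^{\mathrm{dec}}_{3} - \eta^{\mathrm{dec}}_{1,3} - \eta^{\mathrm{dec}}_{2,3} + \eta^{\mathrm{dec}}_{1,2,3})} \\
+&= \log \frac{\eta_{2,3}\,(1 - \eta_2 - \eta_3 + \eta_{2,3})}{(\eta_2 - \eta_{2,3})(\eta_3 - \eta_{2,3})},
+\end{aligned}
+\tag{17}
+\]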
+
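+\[
+\begin{aligned}
+\theta^{\mathrm{dec}}_{1,2,3} = \log \Bigg[ & \frac{\eta^{\mathrm{dec}}_{1,2,3}}{(\eta^{\mathrm{dec}}_{1,2} - \eta^{\mathrm{dec}}_{1,2,3})(\eta^{\mathrm{dec}}_{1,3} - \eta^{\mathrm{dec}}_{1,2,3})(\eta^{\mathrm{dec}}_{2,3} - \eta^{\mathrm{dec}}_{1,2,3})} \\
+& \times \frac{(\eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{1,2} - \eta^{\mathrm{dec}}_{1,3} + \eta^{\mathrm{dec}}_{1,2,3})(\eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{1,2} - \eta^{\mathrm{dec}}_{2,3} + \eta^{\mathrm{dec}}_{1,2,3})(\eta^{\mathrm{dec}}_{3} - \eta^{\mathrm{dec}}_{1,3} - \eta^{\mathrm{dec}}_{2,3} + \eta^{\mathrm{dec}}_{1,2,3})}{1 - \eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{3} + \eta^{\mathrm{dec}}_{1,2} + \eta^{\mathrm{dec}}_{1,3} + \eta^{\mathrm{dec}}_{2,3} - \eta^{\mathrm{dec}}_{1,2,3}} \Bigg] = 0,
+\end{aligned}
+\tag{18}
+\]
+
+which means the first node is independent from the other nodes.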
+The generalization is possible with the use of a recurrence formula between system sizes n and
+n + 1, according to the symmetry of the model and the Legendre transformation between the $\boldsymbol{\eta}^{\mathrm{dec}}$ and $\boldsymbol{\theta}^{\mathrm{dec}}$
+coordinates.
+Numerical proof can be obtained by directly computing the 0 elements of $\boldsymbol{\theta}^{\mathrm{dec}}$ from $\boldsymbol{\eta}^{\mathrm{dec}}$.
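The numerical check mentioned above can be sketched for n = 3 and the decomposition < 122 >. This is a minimal illustration, not the authors' code: the helper names (`eta_from_p`, `p_from_eta`, `theta_from_p`, `theta_dec_for_122`) and the random test distribution are assumptions, nodes are indexed from 0, and the conversions between p, η and θ use the standard inclusion-exclusion relations for binary variables.

```python
import itertools
import numpy as np

def subsets(n):
    """All non-empty index subsets of {0, ..., n-1} as sorted tuples."""
    return [s for r in range(1, n + 1) for s in itertools.combinations(range(n), r)]

def eta_from_p(p):
    """eta_S = probability that x_i = 1 for every i in S."""
    n = p.ndim
    eta = {}
    for s in subsets(n):
        idx = tuple(1 if i in s else slice(None) for i in range(n))
        eta[s] = float(p[idx].sum())
    return eta

def p_from_eta(eta, n):
    """Inclusion-exclusion: p(x) = sum over T containing supp(x) of (-1)^(|T|-|supp(x)|) eta_T."""
    full = dict(eta)
    full[()] = 1.0  # eta of the empty set is the total probability
    p = np.zeros((2,) * n)
    for x in itertools.product((0, 1), repeat=n):
        supp = {i for i in range(n) if x[i]}
        rest = [i for i in range(n) if not x[i]]
        total = 0.0
        for r in range(len(rest) + 1):
            for extra in itertools.combinations(rest, r):
                total += (-1) ** r * full[tuple(sorted(supp | set(extra)))]
        p[x] = total
    return p

def theta_from_p(p):
    """theta_A = sum over B subset of A of (-1)^(|A|-|B|) log p(indicator of B)."""
    n = p.ndim
    theta = {}
    for a in subsets(n):
        total = 0.0
        for r in range(len(a) + 1):
            for b in itertools.combinations(a, r):
                x = tuple(1 if i in b else 0 for i in range(n))
                total += (-1) ** (len(a) - r) * np.log(p[x])
        theta[a] = float(total)
    return theta

def theta_dec_for_122(p):
    """Build eta^dec for <122> (node 0 split from nodes 1, 2) and return (eta, theta^dec)."""
    eta = eta_from_p(p)
    eta_dec = dict(eta)
    eta_dec[(0, 1)] = eta[(0,)] * eta[(1,)]
    eta_dec[(0, 2)] = eta[(0,)] * eta[(2,)]
    eta_dec[(0, 1, 2)] = eta[(0,)] * eta[(1, 2)]
    return eta, theta_from_p(p_from_eta(eta_dec, 3))

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2)) + 0.1   # strictly positive random joint distribution
p /= p.sum()
eta, theta_dec = theta_dec_for_122(p)
for key in [(0, 1), (0, 2), (0, 1, 2)]:
    print(key, theta_dec[key])    # expected to vanish up to rounding error
```

The three printed coordinates are the ones whose subscripts traverse the two subsystems, so they should vanish up to floating-point error, while the remaining θ coordinates keep the values given in Equations (16) and (17).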
+
+The definition of $\boldsymbol{\eta}^{\mathrm{dec}}$ means decomposing the hierarchical marginal distributions $\boldsymbol{\eta}$ into the
+products of the subsystems’ marginal distributions whenever the subscripts traverse the two subsystems.
+Therefore, only the statistical associations between the two subsystems are set to be independent, while
+the internal dependencies of each subsystem remain unchanged. This is analytically equivalent to
+composing another set of mixture coordinates $\boldsymbol{\xi}$, namely the < · · · >-cut coordinates, with a proper description
+of the system decomposition with < · · · >. The $\boldsymbol{\xi}$ consist of the $\boldsymbol{\eta}$ coordinates whose subscripts
+do not traverse the decomposed subsystems, and the $\boldsymbol{\theta}$ coordinates whose subscripts traverse
+them.
+For simplicity, we only describe here the case n = 4 and the decomposition < 1133 > (the first and
+second nodes form one subsystem, and the third and fourth nodes form the other). The system p(x) is
+expressed with the < 1133 >-cut coordinates $\boldsymbol{\xi}$ as
+
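+\[
+\begin{aligned}
+\xi_{1} &= \eta_{1}, \;\cdots,\; \xi_{4} = \eta_{4}, \\
+\xi_{1,2} &= \eta_{1,2}, \\
+\xi_{1,3} &= \theta_{1,3}, \\
+\xi_{1,4} &= \theta_{1,4}, \\
+\xi_{2,3} &= \theta_{2,3}, \\
+\xi_{2,4} &= \theta_{2,4}, \\
+\xi_{3,4} &= \eta_{3,4}, \\
+\xi_{1,2,3} &= \theta_{1,2,3}, \;\cdots,\; \xi_{2,3,4} = \theta_{2,3,4}, \\
+\xi_{1,2,3,4} &= \theta_{1,2,3,4}.
+\end{aligned}
+\tag{19}
+\]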
+
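+The decomposed system with no statistical association between the two subsystems has the
+following coordinates $\boldsymbol{\xi}^{\mathrm{dec}}$, which is, in any decomposition, equivalent to setting all θ in $\boldsymbol{\xi}$ to 0:
+
+\[
+\begin{aligned}
+\xi^{\mathrm{dec}}_{1} &= \eta_{1}, \;\cdots,\; \xi^{\mathrm{dec}}_{4} = \eta_{4}, \\
+\xi^{\mathrm{dec}}_{1,2} &= \eta_{1,2}, \\
+\xi^{\mathrm{dec}}_{1,3} &= 0, \\
+\xi^{\mathrm{dec}}_{1,4} &= 0, \\
+\xi^{\mathrm{dec}}_{2,3} &= 0, \\
+\xi^{\mathrm{dec}}_{2,4} &= 0, \\
+\xi^{\mathrm{dec}}_{3,4} &= \eta_{3,4}, \\
+\xi^{\mathrm{dec}}_{1,2,3} &= 0, \;\cdots,\; \xi^{\mathrm{dec}}_{2,3,4} = 0, \\
+\xi^{\mathrm{dec}}_{1,2,3,4} &= 0.
+\end{aligned}
+\tag{20}
+\]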
+
+This is analytically equivalent to the definition of the decomposition (13)–(14) in the case of < 1133 >.
+Therefore, the KL-divergence $D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{\mathrm{dec}})]$ measures the information lost by the system
+decomposition. The following asymptotic agreement with the χ2 test also holds.
+
+Proposition 2.
+
+\[
+2N \cdot D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{\mathrm{dec}})] \sim \chi^2(\sharp\theta(\boldsymbol{\xi})),
+\tag{21}
+\]
+
+where $\sharp\theta(\boldsymbol{\xi})$ is the number of $\boldsymbol{\theta}$ coordinates appearing in the $\boldsymbol{\xi}$ coordinates.
+
+3. Edge Cutting
+
+We further expand the concept of system decomposition to eventually quantify the total amount of
+information expressed by an edge of the graph. Let us consider cutting an edge {xi, xj} (1 ≤ i < j ≤ n)
+of the graph with n vertices. Hereafter we call this operation the edge cutting i − j. In the same way
+as the system decomposition, the edge cutting corresponds to modifying the $\boldsymbol{\eta}$ coordinates to produce the $\boldsymbol{\eta}^{\mathrm{ec}}$
+coordinates as follows:
+
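+\[
+\begin{aligned}
+\eta^{\mathrm{ec}}_{i,j} &= \eta_i \eta_j, \\
+\eta^{\mathrm{ec}}_{s[i,j,k_1]} &= \eta_{s[i,k_1]}\,\eta_{s[j,k_1]}, \\
+\eta^{\mathrm{ec}}_{s[i,j,k_1,k_2]} &= \eta_{s[i,k_1,k_2]}\,\eta_{s[j,k_1,k_2]}, \\
+&\;\;\vdots \\
+\eta^{\mathrm{ec}}_{s[i,j,k_1,\cdots,k_{n-2}]} &= \eta_{s[i,k_1,\cdots,k_{n-2}]}\,\eta_{s[j,k_1,\cdots,k_{n-2}]}, \quad (\{i, j, k_1, \cdots, k_{n-2}\} = \{1, \cdots, n\}),
+\end{aligned}
+\tag{22}
+\]
+
+and the rest of $\boldsymbol{\eta}^{\mathrm{ec}}$ remains the same as those of $\boldsymbol{\eta}$.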
+The formation of $\boldsymbol{\eta}^{\mathrm{ec}}$ from $\boldsymbol{\eta}$ consists of replacing the k-th order elements (k ≥ 3) of $\boldsymbol{\eta}$ that include both
+i and j in their subscripts with the product of the (k − 1)-th order $\boldsymbol{\eta}$ in the maximum subgraphs (k − 1 vertices),
+each including i or j. This means that all orders of statistical association including the variables xi and
+xj are set to be independent only between them. Other relations that do not simultaneously include xi
+and xj remain unchanged.
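The substitution rule of Equation (22) can be sketched as a small dictionary transformation. This is an illustrative sketch only: the function name `edge_cut_eta`, the 0-based node indices, and the numerical η values are assumptions chosen for the example, with η stored as a dictionary keyed by sorted index tuples.

```python
def edge_cut_eta(eta, i, j):
    """Return eta^ec for the edge cutting i-j, following Eq. (22).

    eta maps sorted index tuples (0-based) to marginal probabilities.
    Every element whose subscript contains both i and j is replaced by the
    product of the two elements obtained by dropping j and dropping i.
    """
    eta_ec = dict(eta)
    for s in eta:
        if i in s and j in s:
            without_j = tuple(k for k in s if k != j)  # s[i, k1, ...]
            without_i = tuple(k for k in s if k != i)  # s[j, k1, ...]
            eta_ec[s] = eta[without_j] * eta[without_i]
    return eta_ec

# Hypothetical eta coordinates of a 3-node system (values are illustrative only).
eta = {
    (0,): 0.5, (1,): 0.4, (2,): 0.3,
    (0, 1): 0.25, (0, 2): 0.2, (1, 2): 0.15,
    (0, 1, 2): 0.1,
}
eta_ec = edge_cut_eta(eta, 0, 1)
print(eta_ec[(0, 1)], eta_ec[(0, 1, 2)])
```

Only the elements whose subscripts contain both endpoints of the cut edge change; all other entries are copied unchanged, in line with the remark that the rest of the coordinates remains the same.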
+Certain combinations of edge cuttings coincide with system decompositions. For example, in case
+n = 4, the edge cuttings 1 − 2, 1 − 3, and 1 − 4 are equivalent to the system decomposition < 1222 >.
+We define the i − j-cut mixture coordinates $\boldsymbol{\xi}$ for orthogonal decomposition of the statistical
+association represented by the edge {xi, xj}. Although the actual calculation can be performed with the $\boldsymbol{\eta}$
+coordinates only, this generalization is necessary to have a geometrical definition of the orthogonality. For
+simplicity, we only describe $\boldsymbol{\xi}$ in the case of n = 4:
+
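+\[
+\begin{aligned}
+\xi_{1} &= \eta_{1}, \;\cdots,\; \xi_{4} = \eta_{4}, \\
+\xi_{1,2} &= \theta_{1,2}, \\
+\xi_{1,3} &= \eta_{1,3}, \\
+\xi_{1,4} &= \eta_{1,4}, \\
+\xi_{2,3} &= \eta_{2,3}, \\
+\xi_{2,4} &= \eta_{2,4}, \\
+\xi_{3,4} &= \eta_{3,4}, \\
+\xi_{1,2,3} &= \theta_{1,2,3}, \\
+\xi_{1,2,4} &= \theta_{1,2,4}, \\
+\xi_{1,3,4} &= \eta_{1,3,4}, \\
+\xi_{2,3,4} &= \eta_{2,3,4}, \\
+\xi_{1,2,3,4} &= \theta_{1,2,3,4},
+\end{aligned}
+\tag{23}
+\]
+
+where orthogonality between the elements of $\boldsymbol{\eta}$ and $\boldsymbol{\theta}$ holds with respect to the Fisher information
+matrix.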
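+Calculating the dual coordinates $\boldsymbol{\theta}^{\mathrm{ec}}$ of $\boldsymbol{\eta}^{\mathrm{ec}}$, we can define the coordinates $\boldsymbol{\xi}^{\mathrm{ec}}$ of the system after
+the edge cutting 1 − 2 as follows:
+
+\[
+\begin{aligned}
+\xi^{\mathrm{ec}}_{1} &= \eta_{1}, \;\cdots,\; \xi^{\mathrm{ec}}_{4} = \eta_{4}, \\
+\xi^{\mathrm{ec}}_{1,2} &= \theta^{\mathrm{ec}}_{1,2}, \\
+\xi^{\mathrm{ec}}_{1,3} &= \eta_{1,3}, \\
+\xi^{\mathrm{ec}}_{1,4} &= \eta_{1,4}, \\
+\xi^{\mathrm{ec}}_{2,3} &= \eta_{2,3}, \\
+\xi^{\mathrm{ec}}_{2,4} &= \eta_{2,4}, \\
+\xi^{\mathrm{ec}}_{3,4} &= \eta_{3,4}, \\
+\xi^{\mathrm{ec}}_{1,2,3} &= \theta^{\mathrm{ec}}_{1,2,3}, \\
+\xi^{\mathrm{ec}}_{1,2,4} &= \theta^{\mathrm{ec}}_{1,2,4}, \\
+\xi^{\mathrm{ec}}_{1,3,4} &= \eta_{1,3,4}, \\
+\xi^{\mathrm{ec}}_{2,3,4} &= \eta_{2,3,4}, \\
+\xi^{\mathrm{ec}}_{1,2,3,4} &= \theta^{\mathrm{ec}}_{1,2,3,4}.
+\end{aligned}
+\]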
+
+Note that the edge cutting cannot be defined simply by setting the corresponding elements of $\boldsymbol{\theta}^{\mathrm{ec}}$ to 0.
+Then the KL-divergence $D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{\mathrm{ec}})]$ represents the total amount of information
+represented by the edge 1 − 2.
+The following asymptotic agreement with the χ2 test also holds:
+
+Proposition 3.
+
+\[
+2N \cdot D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{\mathrm{ec}})] \sim \chi^2\!\left(1 + \sum_{k=1}^{n-2} {}_{n-2}C_k\right).
+\tag{24}
+\]
+
+We call this χ2 value, or the KL-divergence itself, the edge information of edge 1 − 2.
+
+4. Generalized Mutual Information as Complexity with Respect to the Total System
+Decomposition
+In previous sections, we have introduced a measure of complexity in terms of system
+decomposition, by measuring the KL-divergence between a given system and its independently
+decomposed subsystems.
+We consider here the total system decomposition, and measure the
+informational distance I between the system and the totally decomposed system where all elements
+are independent.
+
+\[
+I := \sum_{i=1}^{n} H(x_i) - H(x_1, \cdots, x_n),
+\tag{25}
+\]
+
+where
+
+\[
+H(x) := -\sum_{x} p(x) \log p(x).
+\tag{26}
+\]
+
+This quantity is the generalization of mutual information, and is named in various ways such as
+generalized mutual information, integration, complexity, multi-information, etc. according to different
+authors. For simplicity, we call I the “multi-information”, following [15]. This quantity can be
+interpreted as a measure of complexity that sums up the order-wise statistical association existing in
+each subset of components with information geometrical formalization [14].
+For simplicity, we denote the multi-information I of n-dimensional stochastic binary variables as
+follows, using the notation of the system decomposition:
+
+\[
+I = D[\langle 111 \cdots 1 \rangle : \langle 123 \cdots n \rangle].
+\tag{27}
+\]
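As a small worked illustration of Equations (25) and (26), the multi-information of a joint distribution over binary variables can be computed directly from entropies. The function names `entropy` and `multi_information` and the two test distributions are assumptions made for this sketch; the joint distribution is stored as an array with one axis per variable, and natural logarithms are used.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H = -sum p log p (natural log), skipping zero-probability cells."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def multi_information(p):
    """I = sum_i H(x_i) - H(x_1, ..., x_n) for a joint array with one axis per variable."""
    p = np.asarray(p, dtype=float)
    n = p.ndim
    marginal_entropies = sum(
        entropy(p.sum(axis=tuple(j for j in range(n) if j != i))) for i in range(n)
    )
    return marginal_entropies - entropy(p)

# Two perfectly correlated bits carry log 2 nats of multi-information.
correlated = np.array([[0.5, 0.0], [0.0, 0.5]])
# Two independent bits carry none.
independent = np.outer([0.3, 0.7], [0.6, 0.4])
print(multi_information(correlated), multi_information(independent))
```

For two variables the quantity reduces to the ordinary mutual information, which is why the perfectly correlated pair gives log 2 and the independent pair gives 0.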
+
+5. Rectangle-Bias Complexity
+
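+The multi-information contains some degrees of freedom in the case n > 2. That is, we can define a
+set of distributions {p(x)|I = const.} with different parameters but the same I value. This fact can be
+clearly explained with the use of information geometry. From the Pythagorean relation, we obtain the
+following relations in the case of n = 3:
+
+\[
+\begin{aligned}
+D[\langle 111 \rangle : \langle 113 \rangle] + D[\langle 113 \rangle : \langle 123 \rangle] &= D[\langle 111 \rangle : \langle 123 \rangle], \\
+D[\langle 111 \rangle : \langle 121 \rangle] + D[\langle 121 \rangle : \langle 123 \rangle] &= D[\langle 111 \rangle : \langle 123 \rangle], \\
+D[\langle 111 \rangle : \langle 122 \rangle] + D[\langle 122 \rangle : \langle 123 \rangle] &= D[\langle 111 \rangle : \langle 123 \rangle].
+\end{aligned}
+\tag{28}
+\]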
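The Pythagorean relations in Equation (28) can be checked numerically on a random 3-node distribution. In this sketch the decomposed system is taken to be the product of the subsystem marginals, and a decomposition is written as a label string such as "122" where equal characters mark variables in the same subsystem; the helper names `decompose` and `kl` are assumptions of the example.

```python
import numpy as np

def decompose(p, labels):
    """Product of subsystem marginals; axes sharing a label character stay jointly distributed."""
    n = p.ndim
    q = np.ones_like(p)
    for lab in sorted(set(labels)):
        drop = tuple(i for i, c in enumerate(labels) if c != lab)
        q = q * p.sum(axis=drop, keepdims=True)  # marginal, broadcast back to full shape
    return q

def kl(p, q):
    """KL-divergence D[p : q] for strictly positive arrays of equal shape."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2)) + 0.1   # strictly positive random joint distribution
p /= p.sum()

total = kl(p, decompose(p, "123"))  # D[<111> : <123>], the multi-information I
for mid in ("113", "121", "122"):
    q = decompose(p, mid)
    lhs = kl(p, q) + kl(q, decompose(q, "123"))
    print(mid, lhs, total)          # the two numbers should agree
```

Each intermediate decomposition splits the same total I into two non-negative parts, which is what allows the systems to be placed on a circle whose diameter is fixed by I.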
+
+Using these relations, we can schematically represent the decomposed systems on a circle diagram
+with diameter $\sqrt{I}$. This representation is based on the analogy between the Pythagorean relation
+of the KL-divergence and that of Euclidean geometry, where the circumferential angle of a semi-circular
+arc is always $\pi/2$.
+Figure 1 represents two different cases with the same I value in the case n = 3. The distance between
+two systems in the same diagram corresponds to the square root of the KL-divergence between
+them. Clearly the left and right figures represent different dependencies between nodes, although
+they both have the same I value. Such geometrical variation is possible because of the abundance of degrees of
+freedom in the dual coordinates compared to the given constraint. There exist 7 degrees of freedom in the η
+or θ coordinates for n = 3, while the only constraint is the invariance of the I value, which only reduces 1
+degree of freedom. The remaining 6 degrees of freedom can then be deployed to produce geometrical
+variation in the circle diagram. With respect to system decomposition, it is difficult in the left figure to
+obtain decomposed systems without losing much information, while in the right figure there exists a
+relatively easy decomposition < 122 >, which loses less information than any decomposition in the left
+figure. We call this degree of ease of decomposition, with respect to the information lost, system
+decompositionability. In this sense, the left system is more complex although the 2 systems both have the
+same I value. In particular, in the case D[< 111 >:< 122 >] = D[< 111 >:< 113 >] = D[< 111 >:< 121 >],
+the system does not have any easiest way of decomposition, and any isolation of a node loses a significant
+amount of information.
+To further incorporate such geometrical structure reflecting system decompositionability into a
+measure of complexity, we consider a mathematical way to distinguish between these two figures.
+Although the total sum of KL-divergences along the sequence of system decomposition is always
+identical to I by the Pythagorean relation, their product can vary according to the geometrical composition
+in the circle diagram. This is analogous to the isoperimetric inequality of rectangles, where the
+square gives the maximum area among rectangles of constant perimeter.
+We provisionally propose a new measure of complexity as follows, namely the rectangle-bias
+complexity Cr:
+
+\[
+C_r = \frac{1}{|SD| - 2} \sum_{\langle \cdots \rangle \in SD} D[\langle 11 \cdots 1 \rangle : \langle \cdots \rangle] \cdot D[\langle \cdots \rangle : \langle 12 \cdots n \rangle],
+\tag{29}
+\]
+
+where SD is the set of possible system decompositions in n binary variables, and |SD| is the number of
+elements of SD. For example, SD = {< 111 >, < 122 >, < 121 >, < 113 >, < 123 >} and |SD| = 5
+for n = 3. This measure distinguishes between the two systems in Figure 1, and gives a larger value
+for the left figure. It also gives the maximum value in the case D[< 111 >:< 122 >] = D[< 111 >:< 113 >] =
+D[< 111 >:< 121 >].
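The definition of Cr in Equation (29) can be sketched for n = 3 as follows. As in the text, the set SD is listed explicitly; the helper names (`decompose`, `kl`, `rectangle_bias_complexity`), the label-string encoding of decompositions, and the random test distribution are assumptions of this example, and the decomposed system is taken to be the product of subsystem marginals.

```python
import numpy as np

def decompose(p, labels):
    """Product of subsystem marginals; axes sharing a label character stay jointly distributed."""
    q = np.ones_like(p)
    for lab in sorted(set(labels)):
        drop = tuple(i for i, c in enumerate(labels) if c != lab)
        q = q * p.sum(axis=drop, keepdims=True)
    return q

def kl(p, q):
    """KL-divergence D[p : q] for strictly positive arrays of equal shape."""
    return float(np.sum(p * np.log(p / q)))

def rectangle_bias_complexity(p, decompositions):
    """C_r = 1/(|SD|-2) * sum over SD of D[whole : sd] * D[sd : fully decomposed]."""
    finest = "".join(str(i + 1) for i in range(p.ndim))   # e.g. "123"
    total = 0.0
    for sd in decompositions:
        q = decompose(p, sd)
        total += kl(p, q) * kl(q, decompose(q, finest))
    return total / (len(decompositions) - 2)

SD3 = ["111", "122", "121", "113", "123"]   # the five decompositions listed for n = 3

rng = np.random.default_rng(2)
p = rng.random((2, 2, 2)) + 0.1
p /= p.sum()
I = kl(p, decompose(p, "123"))
Cr = rectangle_bias_complexity(p, SD3)
print(Cr, I ** 2 / 4)
```

The terms for < 111 > and < 123 > vanish because one of their two factors is zero, which is consistent with the normalization by |SD| − 2. Since the two factors of every remaining term sum to I, each product is at most I²/4, so the printed Cr should not exceed the second number.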
+
+Figure 1. Circle diagrams of system decomposition in a 3-node network. Both systems have the same value of
+multi-information I, which is expressed as the identical diameter length of the circles. Two variations are
+shown, where the left system is more complex (Cr high) in the sense that any system decomposition
+loses more information than the easiest one (< 122 >) in the right figure (Cr low).
+
+6. Complementarity between Complexities Defined with Arithmetic and Geometric Means
+
+We evaluate the possibility and the limit of rectangle-bias complexity Cr comparing with other
+proposed measures of complexity.
+Interest in measuring network heterogeneity has developed toward the incorporation
+of multi-scale characteristics into complexity measures. The TSE complexity is motivated by the
+structure of the functional differentiation of brain activity, and measures the difference of neural
+integration between all sizes of subsystems and the whole system [17]. The biologically motivated TSE
+complexity has also been investigated from a theoretical point of view, to further give it desirable properties
+as a universal complexity measure independent of system size [20]. The hierarchical structure of
+the exponential family in information geometry also leads to the order-wise description of statistical
+association, which can be regarded as a multi-scale complexity measure [14]. The relation between
+the order-wise dependencies and the TSE complexity has been theoretically investigated to establish the
+order-wise component correspondence between them [15].
+These indices of network heterogeneity, however, all depend on the arithmetic mean of the
+component-wise information theoretical measure. We show that these arithmetic means still fail to
+measure certain modularity based on the statistical independence between subsystems.
+Figure 2 presents simplified cases that complexity measures with arithmetic means fail to
+distinguish. We consider two systems with different heterogeneity but identical multi-information
+I. Here, the multi-information cannot reflect the network heterogeneity. The TSE complexity and its
+information geometrical correspondence in [15] are sensitive to the network heterogeneity,
+but since the arithmetic mean is taken over all subsystems, they do not distinguish the component-wise
+break of symmetry between different scales. The TSE complexity renormalized with respect to the
+multi-information I still has the same insensitivity. Even when the information of each
+subsystem scale is incorporated, the arithmetic mean can balance out the scale-wise variations, and a large
+range of heterogeneity at different scales can realize the same value of these complexities. For
+applications in neuroscience, the assumption of a model with simple parametric heterogeneity and the
+comparison of TSE complexity between different I values alleviate this limitation [17].
+
+Figure 2. Schematic examples of stochastic systems with identical multi-information I that complexity
+measures with arithmetic mean fail to distinguish. (a): Example 1 of a stochastic system with
+homogeneous mean of edge information and symmetric fluctuation of its heterogeneity; (b):
+Example 2 of a heterogeneous stochastic system with bimodal edge information distribution and
+identical multi-information I and complexity based on arithmetic mean as example 1; (c): schematic
+representation of the distribution of statistical association (edge information) in upper network; (d):
+schematic representation of the distribution of statistical association (edge information) in upper
+network.
+
+In contrast to complexities with arithmetic mean, the rectangle-bias complexity Cr is related to
+the geometric mean. The Cr can distinguish the two systems in Figure 2, giving a relatively high Cr
+value to the left system and a low value to the right one.
+This does not mean, however, that the Cr has a finer resolution than other complexity
+measures. The constant conditions of complexity measures are constraints on the $\sum_{k=1}^{n} {}_nC_k$ degrees of
+freedom in the model parameter space, which define different geometrical compositions of the corresponding
+submanifolds. We basically need $\sum_{k=1}^{n} {}_nC_k$ independent measures to assure the real-value resolution
+of network feature characterization. Complexities with arithmetic and geometric means just give
+complementary information on network heterogeneity, or different constant-complexity submanifold
+structures in the statistical manifold, as depicted in Figure 3. Therefore, it is also possible to construct a
+class of systems that has identical I and Cr values but different TSE complexity. Complexity measures
+should be utilized in combination, with respect to the non-linear separation capacity of network
+features of interest.
+
+Figure 3. Schematic representation of complementarity between complexity measures based on arithmetic
+mean (Ca) and geometric mean (Cg) of informational distance. An example of the n − 1 dimensional
+constant-complexity submanifolds with respect to Ca = const. and Cg = const. conditions are depicted
+with yellow and orange surface, respectively. The dimension of the whole statistical manifold S is the
+parameter number n.
+
+7. Cuboid-Bias Complexity with Respect to System Decompositionability
+
+We consider the expansion of Cr to general system size n. The situation for n ≥ 4 differs from that for
+n = 3 and less in the existence of a hierarchical structure between system decompositions.
+Figure 4 shows the hierarchy of the system decompositions in the case n = 4. Such hierarchical
+structure between system decompositions is not homogeneous with respect to the number of
+subsystems, and depends on the isomorphic types of the decomposed systems. This fact produces a certain
+difference of meaning in complexity between the KL-divergences when considering the system
+decompositionability.
+
+Figure 4. Hierarchy of system decomposition for 4 nodes network (n = 4). Possible sequences of Seq =
+{SD1(is) → SD2(is) → SD3(is) → SD4(is)|1 ≤ is ≤ |Seq| = 18} are connected with the lines.
+
+A simple example in 4 nodes network is shown in Figure 5.
+
+Figure 5. Hierarchical effect of sequential system decomposition on cuboid volume and rectangle surface
+on circle graph. We consider to increase the diameter of the green circle from dotted to dashed one
+without changing those of the red and blue circles, which gives different effect on the change of
+D[< 1222 >:< 1233 >] and D[< 1133 >:< 1134 >] according to the hierarchical structure of the
+decomposition sequences.
+
+We consider the modification of 2 KL-divergences in the figure, D[< 1111 >:< 1222 >] and
+D[< 1111 >:< 1133 >], from the diameter of the green dotted circle to the dashed one.
+The joint distribution P(x1, x2, x3, x4) of a discrete distribution with 4 binary variables
+(x1, x2, x3, x4) (x1, x2, x3, x4 ∈ {0, 1}) has $2^4 − 1 = 15$ parameters, which define the dual-flat
+coordinates of the statistical manifold in information geometry.
+On the other hand, the possible system decompositions for n = 4 are the following:
+
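+\[
+\begin{aligned}
+SD := \{ & \langle 1111 \rangle, \langle 1114 \rangle, \langle 1131 \rangle, \langle 1211 \rangle, \langle 1222 \rangle, \\
+& \langle 1133 \rangle, \langle 1212 \rangle, \langle 1221 \rangle, \langle 1134 \rangle, \langle 1214 \rangle, \\
+& \langle 1231 \rangle, \langle 1224 \rangle, \langle 1232 \rangle, \langle 1233 \rangle, \langle 1234 \rangle \}.
+\end{aligned}
+\tag{31}
+\]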
+
+Since the number of possible system decompositions is 15, and each is associated with the
+modification of different sets of P(x1, x2, x3, x4) parameters, the system decompositions and
+KL-divergences between them can be defined independently. This also holds even under the constant
+condition of I value or other complexity measures except the ones imposing dependency between
+system decompositions.
+This means that we can independently modify the diameter of green dotted circle in Figure 5,
+without changing the diameters of the red and blue circles, which define the system decompositions
+< 1233 > and < 1134 > in the sub-hierarchy of < 1222 > and < 1133 >, respectively. Other
+KL-divergences can also be maintained as given constant values for the same reason.
+The rectangle-biased complexity Cr increases its value with such modification, but does not reflect
+the heterogeneity of KL-divergences according to the hierarchy of system decompositions. If we
+consider the system decompositionability as the mean facility to decompose the given system into
+its finest components with respect to the “all” possible system decompositions, such hierarchical
+difference also has a meaning in the definition of complexity.
+The effect of modifying the diameter of the green dotted circle differs between the
+decomposition sequences < 1111 >→< 1222 >→< 1233 >→< 1234 > and
+< 1111 >→< 1133 >→< 1134 >→< 1234 >. The decrease of the KL-divergence D[< 1222 >:< 1233 >]
+is less than that of D[< 1133 >:< 1134 >], since the diameter of the red dotted circle is larger than the
+blue one in Figure 5. This means that changing the KL-divergences
+D[< 1111 >:< 1222 >] and D[< 1111 >:< 1133 >] by the same amount produces a larger effect on the sequence
+< 1111 >→< 1133 >→< 1134 >→< 1234 > than on < 1111 >→< 1222 >→< 1233 >→< 1234 >, if
+compared at the sequence level. The rectangle-biased complexity Cr does not reflect such characteristics,
+since it does not distinguish the hierarchical structure between the diameters of the green, red
+and blue dotted circles.
+To incorporate such hierarchical effect in a complexity measure with geometric mean, we have
+the natural expansion of the rectangle-biased complexity Cr as the cuboid-bias complexity Cc, which is
+defined as follows:
+
+\[
+C_c := \frac{1}{|Seq|} \sum_{i_s=1}^{|Seq|} \prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)],
+\tag{32}
+\]
+
+where Seq represents the possible sequences of hierarchical system decompositions as follows:
+
+\[
+Seq = \{ SD_1(i_s) \to SD_2(i_s) \to \cdots SD_i(i_s) \cdots \to SD_n(i_s) \mid 1 \le i_s \le |Seq| \}.
+\tag{33}
+\]
+
+The elements $SD_i(i_s)$ of Seq correspond to the system decompositions, which are aligned according to
+the hierarchy with the following algorithmic procedure (based on [15]):
+
+(1) Initialization: Set the initial system decomposition of all sequences in Seq as the whole
+system $SD_1(i_s) := \langle 111 \cdots 1 \rangle$ ($1 \le i_s \le |Seq|$).
+(2) Step i → i + 1: If the system decomposition is the total system decomposition ($SD_i(i_s) = \langle 123 \cdots n \rangle$),
+then stop. Otherwise, choose a non-decomposed subsystem $SS_i(i_s)$ of the system
+decomposition $SD_i(i_s)$, and further divide it into two independent subsystems $SS^1_i(i_s)$ and
+$SS^2_i(i_s)$, different for each $i_s$. $SD_{i+1}(i_s)$ is then defined as the system decomposition of the total system
+that independently separates the subsystems $SS^1_i(i_s)$ and $SS^2_i(i_s)$, in addition to the previous
+decomposition $SD_i(i_s)$.
+(3) Go to the next step i + 1 → i + 2.
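The procedure in steps (1) to (3) can be sketched as an enumeration of chains of set partitions, where every step splits one remaining block into two non-empty parts. This is only an illustrative reading of the algorithm: the representation of a decomposition as a frozenset of blocks and the names `two_way_splits` and `sequences` are assumptions of the example.

```python
from itertools import combinations

def two_way_splits(block):
    """Yield each unordered split of a block into two non-empty parts exactly once."""
    items = sorted(block)
    first, rest = items[0], items[1:]
    # Fix the smallest element in the first part so mirrored splits are not repeated.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            part1 = frozenset((first,) + combo)
            yield part1, frozenset(block) - part1

def sequences(partition):
    """Yield every decomposition sequence from `partition` down to all-singleton blocks."""
    splittable = [b for b in partition if len(b) > 1]
    if not splittable:              # total system decomposition reached: stop
        yield [partition]
        return
    for block in splittable:        # choose a non-decomposed subsystem
        for part1, part2 in two_way_splits(block):
            nxt = (partition - {block}) | {part1, part2}
            for tail in sequences(nxt):
                yield [partition] + tail

def whole_system(n):
    """Initial decomposition <11...1>: a single block containing all n nodes."""
    return frozenset({frozenset(range(1, n + 1))})

for n in (3, 4):
    seqs = list(sequences(whole_system(n)))
    print(n, len(seqs))
```

Under this reading the enumeration yields 3 sequences for n = 3 and 18 for n = 4, each containing n decompositions, in agreement with the sequence counts shown in Figure 4.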
+
+The value of |Seq| corresponds to the number of different sequences generated by this algorithm. For
+example, |Seq| = 3 and |Seq| = 18 hold for n = 3 and n = 4, respectively. The general analytical form
+$|Seq|_n$ of |Seq| with system size n is obtained as the following recurrence formula:
+
+\[
+|Seq|_n = \sum_{i=1}^{\lfloor n/2 \rfloor} {}_nC_i \, |Seq|_{n-i} \, |Seq|_i,
+\tag{34}
+\]
+
+where ⌊·⌋ is the floor function, with the formal definition $|Seq|_1 := 1$.
+The products of KL-divergences according to the hierarchical sequences of system decompositions
+in Equation (32) are related to the volume of (n − 1)-dimensional cuboids in the circle diagram. An
+example in the case of n = 4 is presented in Figure 5, where two cuboids with 3 orthogonal edges of the
+different decomposition sequences < 1111 >→< 1222 >→< 1233 >→< 1234 > and
+< 1111 >→< 1133 >→< 1134 >→< 1234 > are depicted, whose cuboid volumes are
+
+\[
+\sqrt{D[\langle 1111 \rangle : \langle 1222 \rangle]\, D[\langle 1222 \rangle : \langle 1233 \rangle]\, D[\langle 1233 \rangle : \langle 1234 \rangle]},
+\tag{35}
+\]
+
+and
+
+\[
+\sqrt{D[\langle 1111 \rangle : \langle 1133 \rangle]\, D[\langle 1133 \rangle : \langle 1134 \rangle]\, D[\langle 1134 \rangle : \langle 1234 \rangle]},
+\tag{36}
+\]
+
+respectively.
+
+In the same way as for Cr, we took the arithmetic average of the cuboid volumes in the definition of Cc
+so as to renormalize the combinatorial increase of the decomposition paths (|Seq|) with the
+system size n.
+Note that, on the other hand, we did not renormalize the rectangle-bias complexity Cr and the
+cuboid-bias complexity Cc by taking the exact geometric mean of each product of KL-divergences,
+such as $\sqrt[n-1]{\prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)]}$. This is for further accessibility to theoretical analysis such as the
+variational method (see the “Further Consideration” section), and does not change the qualitative behavior
+of Cr and Cc since the power root is a monotonically increasing function. This treatment can be
+interpreted as taking the (n − 1)-th power of the geometric means for the hierarchical sequences of
+KL-divergences.
+A more comprehensive example of the utility of the cuboid-bias complexity Cc with respect to
+the rectangle-biased one Cr is shown in Figure 6. We consider 6-node networks (n = 6) with the
+same I and Cr values but different heterogeneity. The system in the top left figure has a circularly
+connected structure with medium intensity, while that of the top right figure has 3 strongly connected
+subsystems. These systems have qualitatively five different ways of system decomposition that are
+the basic generators of all hierarchical sequences $Seq = \{SD_1(i_s) \to \cdots \to SD_5(i_s) \mid 1 \le i_s \le |Seq|\}$ for
+these networks. The five basic system decompositions are shown with the numbers ①, ②, ②′, ③ and
+④ in the top figures.
+The circle diagrams of these systems are depicted in the middle figures. To suppose the same
+constant value of Cr in both systems, the following condition is satisfied in the middle right figure:
+D[< 111111 >: ②] < D[< 111111 >: ① in Middle Left figure] < D[< 111111 >: ①] <
+D[< 111111 >: ② in Middle Left figure] < D[< 111111 >: ③] < D[< 111111 >: ④]. Furthermore, the
+total surface of right triangles sharing the circle diameter as hypotenuse in the middle left and the
+middle right figures is conditioned to be identical, therefore the rectangle-bias complexity Cr fails to
+distinguish them.
+On the other hand, under the same condition, the cuboid-bias complexity Cc distinguishes between
+these two systems and gives a higher value to the left one. The volumes of the 5-dimensional cuboids of the
+decomposition sequence $\langle 111111 \rangle \xrightarrow{\;①\,②\,②'\,③\,④\;} \langle 123456 \rangle$ are schematically shown in the bottom
+figures, maintaining the quantitative difference between KL-divergences. Since the multi-information
+I is identical between the two systems, so is the value of the KL-divergence D[< 111111 >:< 123456 >],
+which is the sum of the KL-divergences along the sequence $\langle 111111 \rangle \xrightarrow{\;①\,②\,②'\,③\,④\;} \langle 123456 \rangle$
+from the Pythagorean theorem. This means that the inequality between the cuboid volumes can be
+represented as the isoperimetric inequality of high-dimensional cuboids. As a consequence, the left
+system has a quantitatively higher value of Cc than the right one. The cuboid-bias complexity Cc is also
+sensitive to such heterogeneity.
+
+Figure 6. Meaning of taking geometric mean over the sequence of system decomposition in cuboid-bias complexity
+Cc. (a): Example of 6-node network with circularly connected structure with medium intensity. Edge
+width is proportional to edge information; (b): Example of 6-node network with strongly connected
+3 subsystems. Edge width is proportional to edge information. The multi-information I of the two
+systems in the top figures is conditioned to be identical. The dotted lines schematically represent possible
+system decompositions. (c,d): Circle diagrams of each system decomposition in the upper networks. The
+total surface of right triangles sharing the circle diameter as hypotenuse in (c) and (d) is conditioned to
+be identical, therefore the rectangle-bias complexity Cr fails to distinguish them. (e,f): 5-dimensional cuboids
+of the upper networks (Figure 6a,b) whose edges are the roots of KL-divergences for the chain of system
+decompositions $\langle 111111 \rangle \xrightarrow{\;①\,②\,②'\,③\,④\;} \langle 123456 \rangle$. Only the first 3-dimensional part is shown with
+solid lines, and the remaining 2-dimensional part is represented with dotted lines. The volume of the
+cuboid in (e) is larger than the one in (f), according to the isoperimetric inequality of high-dimensional
+cuboids. The total squared length of each side is identical between the two cuboids, which represents the
+multi-information I = D[< 111111 >:< 123456 >].
+
+8. Regularized Cuboid-Bias Complexity with Respect to Generalized Mutual Information
+
+We further consider the geometrical composition of system decompositions in the circle diagram
+and argue for the necessity of renormalizing the cuboid-bias complexity Cc with the multi-information I,
+which gives another measure of complexity, namely the “regularized cuboid-bias complexity $C^R_c$.”
+We consider the situation in actual data where the multi-information I varies. Figure 7 shows
+the n = 3 cases where the Cc fails to distinguish. Both the blue and red systems are supposed to have
+the same Cc value by adjusting the red system to have relatively smaller values of KL-divergences
+
+300
+
+
+Entropy 2014, 16, 4132–4167
+
+D[< 111 >:< 122 >] and D[< 113 >:< 123 >] than the blue one. Such conditioning is possible since
+the KL-divergences are independent parameters with each other.
+
+Figure 7. Examples of 3-node systems with identical cuboid-bias complexity $C_c$ but different
+multi-information $I$ on the circle graph. (a) System with smaller $I$ but larger $C_c^R$; (b) system with
+larger $I$ but smaller $C_c^R$; (c) superposition of the two systems. The regularized cuboid-bias
+complexity $C_c^R$ distinguishes between the blue and red systems.
+
+Although the $C_c$ value is identical, the two systems have different geometrical compositions of
+system decompositions in the circle diagram. The red system has a relatively easier decomposition
+$\langle 111 \rangle \to \langle 122 \rangle$ when renormalized by the total system decomposition
+$\langle 111 \rangle \to \langle 123 \rangle$. This relative decompositionability under renormalization
+by the multi-information $I$ can be clearly understood by superimposing the circle diagrams of the two
+systems and comparing the angles between each decomposition path and the total decomposition path
+(Figure 7c). The red system has a larger angle between the decomposition paths
+$\langle 111 \rangle \to \langle 122 \rangle$ and $\langle 111 \rangle \to \langle 123 \rangle$ than any
+other in the blue system, which represents the relative facility of the decomposition under
+renormalization by $I$. In these terms, the paths $\langle 111 \rangle \to \langle 121 \rangle$ in the red
+and blue systems keep the same relative facility, and the path
+$\langle 111 \rangle \to \langle 113 \rangle$ is easier in the blue system.
+To express the system decompositionability based on these geometrical compositions in a
+comprehensive manner, we define the regularized cuboid-bias complexity $C_c^R$ as follows:
+
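+\begin{equation}
+C_c^R := \frac{1}{|Seq|} \sum_{i_s=1}^{|Seq|} \prod_{i=1}^{n-1}
+\frac{D[SD_i(i_s) : SD_{i+1}(i_s)]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]}
+:= \frac{C_c}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}}
+:= \frac{C_c}{I^{n-1}}. \tag{37}
+\end{equation}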
+
+The red system then has a quantitatively smaller $C_c^R$ value than the blue system in Figure 7.
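+
+As a rough numerical illustration of Equation (37), the sketch below computes $C_c$ and $C_c^R$ from a
+table of stepwise KL-divergences, one row per decomposition sequence. The function names and the
+numbers are hypothetical; the values are not computed from any distribution and are only chosen so
+that each row sums to the same multi-information $I$.

```python
import math

def cuboid_bias(seq_divs):
    """Cc: arithmetic mean over sequences of the product of stepwise KL-divergences."""
    products = [math.prod(steps) for steps in seq_divs]
    return sum(products) / len(products)

def regularized_cuboid_bias(seq_divs, total_info):
    """CRc = Cc / I**(n-1), where n-1 is the number of steps in each sequence."""
    n_minus_1 = len(seq_divs[0])
    return cuboid_bias(seq_divs) / total_info ** n_minus_1

# Hypothetical n = 3 system: three sequences, two steps each, every row summing to I = 1.
base = [(0.6, 0.4), (0.7, 0.3), (0.8, 0.2)]
# Same geometry with every KL-divergence doubled, so that I = 2.
scaled = [(2 * a, 2 * b) for a, b in base]

print(cuboid_bias(base), cuboid_bias(scaled))
print(regularized_cuboid_bias(base, 1.0), regularized_cuboid_bias(scaled, 2.0))
```

+In this toy case, doubling every divergence multiplies $C_c$ by four, while $C_c^R$ is unchanged,
+which is the scale dependence that the regularization is meant to remove.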
+
+9. Modular Complexity with Respect to the Easiest System Decomposition Path
+
+We have so far considered the system decompositionability with respect to all possible
+decomposition sequences. This was also a way to keep local fluctuations of the network
+heterogeneity from being reflected in specific decomposition paths. On the other hand, the easiest
+decomposition is particularly important when considering the modularity of the system. If there exists
+a hierarchical structure of modularity on different scales with different coherence of the system, the
+KL-divergences and the sequence of the easiest decomposition give much information.
+
+Figure 8 schematically shows a typical example with two levels of modularity. Such a
+structure with different scales of statistical coherence appears as functional segregation in neural
+systems [17], and is expected to be observed widely in complex systems.
+The hierarchical topology of the easiest decomposition path reflects these structures. For
+example, in the system of Figure 8, the decompositions between $\langle 1\,1 \cdots 1 \rangle$ and
+$\langle 1\,1\,1\,1\,5\,5\,5\,5\,9\,9\,9\,9\,13\,13\,13\,13 \rangle$ are easier than those inside the
+4-node subsystems. The values of the KL-divergences also reflect the hierarchy, giving relatively low
+values for the decompositions between the 4-node subsystems and high values inside them. By examining
+the shortest decomposition path and the associated KL-divergences among the possible Seq, one can
+project the hierarchical structure of the modularity existing in the system.
+
+Figure 8. Example of a 16-node system $\langle 11 \cdots 1 \rangle$ that has different levels of
+modularity. The four 4-node subsystems $\langle 1111 \rangle$ (blue blocks) are loosely connected and
+easy to decompose, while the inside of each component (red blocks) is tightly connected. The degree of
+connection represents statistical dependency or edge information between subsystems. Such a
+hierarchical structure can be detected by observing the decomposition path of the modular complexity
+$C_m$.
+
+For this reason, we define the modular complexity $C_m$ as follows, which is the shortest-path
+component of the cuboid-bias complexity $C_c$:
+
+\begin{equation}
+C_m := \prod_{i=1}^{n-1} D[SD_i(i_{\min}) : SD_{i+1}(i_{\min})], \tag{38}
+\end{equation}
+
+where the index $i_{\min}$ of the sequence
+$SD_1(i_{\min}) \to SD_2(i_{\min}) \to \cdots \to SD_n(i_{\min})$ is chosen as follows:
+
+\begin{equation}
+i_{\min} = \{i_1\} \cap \{i_2\} \cap \cdots \cap \{i_{n-1}\}, \tag{39}
+\end{equation}
+
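+where
+
+\begin{equation}
+\begin{aligned}
+\{i_1\} &= \operatorname*{argmin}_{i_s} \{ D[SD_1(i_s) : SD_2(i_s)] \mid 1 \le i_s \le |Seq| \}, \\
+\{i_2\} &= \operatorname*{argmin}_{i_1} \{ D[SD_2(i_1) : SD_3(i_1)] \mid i_1 \in \{i_1\} \}, \\
+&\;\;\vdots \\
+\{i_{n-1}\} &= \operatorname*{argmin}_{i_{n-2}} \{ D[SD_{n-1}(i_{n-2}) : SD_n(i_{n-2})] \mid i_{n-2} \in \{i_{n-2}\} \},
+\end{aligned} \tag{40}
+\end{equation}
+
+which eventually gives
+
+\begin{equation}
+i_{\min} = i_{n-1}. \tag{41}
+\end{equation}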
+
+This means that, beginning from the undecomposed state $\langle 11 \cdots 1 \rangle$, we continue to
+choose the shortest decomposition path in the next hierarchy of system decomposition. The minimization
+of the path length is guaranteed by the sequential minimization, since the geometric mean of an
+isometric path division is bounded below by its minimum component. $i_{\min}$ is unique if the system
+is completely heterogeneous (i.e., $D[SD_1(i_k) : SD_2(i_k)] \neq D[SD_1(i_l) : SD_2(i_l)]$,
+$1 \le i_k < i_l \le |Seq|$); otherwise, multiple decomposition paths that give the same $C_m$ value are
+possible, according to the homogeneity of the system. Besides its value, the modular complexity $C_m$
+should be used together with the sequence information of the shortest decomposition path to evaluate
+the modularity structure of a system.
+Cases where $C_m$ is identical but $C_c$ differs can be constructed by varying the system
+decompositions other than those on the shortest path
+$SD_1(i_{\min}) \to SD_2(i_{\min}) \to \cdots \to SD_n(i_{\min})$ without modifying the index $i_{\min}$.
+Inverse examples with identical $C_c$ and different $C_m$ also exist, due to the complementarity between
+$C_m$ and $C_c$.
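+
+The sequential selection in Equations (39)–(41) can be sketched as a step-by-step filter over the
+candidate sequences. This is a minimal sketch: the function names, the tolerance `tol` used to detect
+ties, and the numbers are all assumptions made for illustration, not values from the paper.

```python
import math

def shortest_path_indices(seq_divs, tol=1e-12):
    """Keep, step by step, only the sequences whose next KL-divergence is smallest."""
    candidates = list(range(len(seq_divs)))
    n_steps = len(seq_divs[0])
    for step in range(n_steps):
        best = min(seq_divs[k][step] for k in candidates)
        candidates = [k for k in candidates if seq_divs[k][step] - best <= tol]
    return candidates

def modular_complexity(seq_divs):
    """Cm: product of the stepwise KL-divergences along one selected shortest path."""
    i_min = shortest_path_indices(seq_divs)[0]
    return math.prod(seq_divs[i_min])

# Hypothetical stepwise KL-divergences, one row per decomposition sequence.
heterogeneous = [(0.6, 0.4), (0.7, 0.3), (0.8, 0.2)]
homogeneous = [(0.5, 0.5), (0.5, 0.5), (0.9, 0.1)]

print(shortest_path_indices(heterogeneous), modular_complexity(heterogeneous))
print(shortest_path_indices(homogeneous))
```

+The second call returns more than one index, which corresponds to the non-unique case described
+above, where several paths give the same $C_m$ value.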
+We finally define the regularized modular complexity $C_m^R$ as follows, for the same reason as
+defining $C_c^R$ from $C_c$:
+
+\begin{equation}
+C_m^R := \prod_{i=1}^{n-1}
+\frac{D[SD_i(i_{\min}) : SD_{i+1}(i_{\min})]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]}
+:= \frac{C_m}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}}
+:= \frac{C_m}{I^{n-1}}. \tag{42}
+\end{equation}
+
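+Proposition 4. The cuboid-bias complexities $C_c$ and $C_c^R$ are bounded by the modular complexities
+$C_m$ and $C_m^R$, respectively:
+
+\begin{equation}
+C_c \le C_m, \tag{43}
+\end{equation}
+
+\begin{equation}
+C_c^R \le C_m^R. \tag{44}
+\end{equation}
+
+They coincide at the maximum values under a given multi-information $I$:
+
+\begin{equation}
+\max\{C_m \mid I = \mathrm{const.}\} = \max\{C_c \mid I = \mathrm{const.}\}, \tag{45}
+\end{equation}
+
+\begin{equation}
+\max\{C_m^R\} = \max\{C_c^R\}. \tag{46}
+\end{equation}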
+
+These relations (43)–(46) are numerically shown in the “Numerical Comparison” section.
+The modular complexities are larger because of the hierarchical dependency of the
+KL-divergence values along decomposition paths. In the shortest decomposition path defining the
+modular complexities, the easier system decompositions relatively increase their values, since they
+incorporate a larger number of edge cuttings. Since we eventually cut all edges to obtain
+$\langle 12 \cdots n \rangle$ at the end of the decomposition sequence, collecting the edges with
+relatively weak edge information and cutting them together augments the value of the product of
+KL-divergences. The modular complexities are then the maximum-value components among the possible
+decomposition paths calculated in the cuboid-bias complexities:
+
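+\begin{equation}
+C_m = \max \left\{ \prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)] \;\middle|\; 1 \le i_s \le |Seq| \right\}, \tag{47}
+\end{equation}
+
+\begin{equation}
+C_m^R = \max \left\{ \frac{\prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)]}
+{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}} \;\middle|\; 1 \le i_s \le |Seq| \right\}. \tag{48}
+\end{equation}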
+
+The difference between the cuboid-bias complexities and the modular complexities is an index of
+the geometrical variation of decomposed systems in the circle graph, which reflects the fluctuation of
+the sequence-wise system decompositionability. If the variation of the system decompositionability for
+each system decomposition is large, accordingly the modular complexities tend to give higher values
+than the cuboid-bias complexities.
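+
+Taking Equation (47) as given, inequality (43) reduces to the fact that an arithmetic mean never
+exceeds its largest term. The sketch below only illustrates that step. The helper name is
+hypothetical, and the random numbers are placeholders that do not come from any distribution and do
+not obey the Pythagorean constraint on real KL-divergences.

```python
import math
import random

def mean_and_max(seq_divs):
    """Return (Cc, Cm) as the mean and the maximum of the per-sequence products."""
    products = [math.prod(steps) for steps in seq_divs]
    return sum(products) / len(products), max(products)

random.seed(0)
# Placeholder stepwise values: 6 sequences with 3 steps each.
varied = [[random.uniform(0.0, 1.0) for _ in range(3)] for _ in range(6)]
uniform = [[0.5, 0.5] for _ in range(3)]

c_c, c_m = mean_and_max(varied)
print(c_c <= c_m, c_m - c_c)        # the gap grows with sequence-wise variation
print(mean_and_max(uniform))        # no variation: both values coincide
```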
+
+10. Numerical Comparison
+
+We numerically investigate the complementarity between the proposed complexities $C_c$, $C_c^R$,
+$C_m$, and $C_m^R$. Since the minimum node number giving non-trivial meaning to these measures is
+$n = 4$, the corresponding dimension of the parameter space is $\sum_{k=1}^{n} {}_n C_k = 15$. The
+constant-complexity submanifolds are therefore difficult to visualize due to the high dimensionality.
+For simplicity, we focus on a 2-dimensional subspace of this parameter space, whose first axis ranges
+from random to maximum dependencies of the system and whose second axis represents the system
+decompositionability of $\langle 1133 \rangle$.
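+For this purpose, we introduce the following parameters $\alpha$ and $\beta$
+($0 \le \alpha, \beta \le 1$) in the $\eta$-coordinates of the discrete distribution of a 4-dimensional
+binary stochastic variable:
+
+\begin{equation}
+\begin{aligned}
+\eta_1 &= \eta_0, \\
+\eta_2 &= \eta_0, \\
+\eta_3 &= \eta_0, \\
+\eta_4 &= \eta_0, \\
+\eta_{1,2} &= \eta_1 \eta_2 + \alpha (\eta_0 - \epsilon - \eta_1 \eta_2), \\
+\eta_{3,4} &= \eta_3 \eta_4 + \alpha (\eta_0 - \epsilon - \eta_3 \eta_4), \\
+\eta_{1,3} &= \eta_1 \eta_3 + \alpha\beta (\eta_0 - \epsilon - \eta_1 \eta_3), \\
+\eta_{1,4} &= \eta_1 \eta_4 + \alpha\beta (\eta_0 - \epsilon - \eta_1 \eta_4), \\
+\eta_{2,3} &= \eta_2 \eta_3 + \alpha\beta (\eta_0 - \epsilon - \eta_2 \eta_3), \\
+\eta_{2,4} &= \eta_2 \eta_4 + \alpha\beta (\eta_0 - \epsilon - \eta_2 \eta_4), \\
+\eta_{1,2,3} &= \eta_{1,2} \eta_3 + \alpha\beta (\eta_0 - 2\epsilon - \eta_{1,2} \eta_3), \\
+\eta_{1,2,4} &= \eta_{1,2} \eta_4 + \alpha\beta (\eta_0 - 2\epsilon - \eta_{1,2} \eta_4), \\
+\eta_{1,3,4} &= \eta_1 \eta_{3,4} + \alpha\beta (\eta_0 - 2\epsilon - \eta_1 \eta_{3,4}), \\
+\eta_{2,3,4} &= \eta_2 \eta_{3,4} + \alpha\beta (\eta_0 - 2\epsilon - \eta_2 \eta_{3,4}), \\
+\eta_{1,2,3,4} &= \eta_{1,2} \eta_{3,4} + \alpha\beta (\eta_0 - 3\epsilon - \eta_{1,2} \eta_{3,4}).
+\end{aligned} \tag{49}
+\end{equation}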
+
+Here, $\alpha$ represents the degree of statistical association from random ($\alpha = 0$) to maximum
+($\alpha = 1$), and $\beta$ controls the system decompositionability of $\langle 1133 \rangle$. If
+$\beta = 1$, the system has the maximum KL-divergence $D[\langle 1111 \rangle : \langle 1133 \rangle]$
+under the constraint of the $\alpha$ parameter, and $\beta = 0$ gives
+$D[\langle 1111 \rangle : \langle 1133 \rangle] = 0$.
+$\epsilon$ is the minimum value of the joint distribution of the 4-dimensional variable, which is
+defined to be greater than 0 to avoid singularity in the dual-flat coordinates of the statistical
+manifold. $\epsilon = 1.0 \times 10^{-10}$ and $\eta_0 = 0.5$ were chosen for the calculation.
+
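+The system with maximum statistical association under a given $\eta_0$ corresponds to the
+$\alpha = \beta = 1$ condition in the given parameters, whose $\eta$-coordinates become as follows:
+
+\begin{equation}
+\begin{aligned}
+\eta_1 &= \eta_0, \\
+&\;\;\vdots \\
+\eta_4 &= \eta_0, \\
+\eta_{1,2} &= \eta_0 - \epsilon, \\
+&\;\;\vdots \\
+\eta_{3,4} &= \eta_0 - \epsilon, \\
+\eta_{1,2,3} &= \eta_0 - 2\epsilon, \\
+&\;\;\vdots \\
+\eta_{2,3,4} &= \eta_0 - 2\epsilon, \\
+\eta_{1,2,3,4} &= \eta_0 - 3\epsilon.
+\end{aligned} \tag{50}
+\end{equation}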
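+On the other hand, the totally decomposed system corresponds to the $\alpha = 0$ condition, and the
+$\eta$-coordinates are:
+
+\begin{equation}
+\begin{aligned}
+\eta_1 &= \eta_0, \\
+&\;\;\vdots \\
+\eta_4 &= \eta_0, \\
+\eta_{1,2} &= \eta_0 \eta_0, \\
+&\;\;\vdots \\
+\eta_{3,4} &= \eta_0 \eta_0, \\
+\eta_{1,2,3} &= \eta_0 \eta_0 \eta_0, \\
+&\;\;\vdots \\
+\eta_{2,3,4} &= \eta_0 \eta_0 \eta_0, \\
+\eta_{1,2,3,4} &= \eta_0 \eta_0 \eta_0 \eta_0.
+\end{aligned} \tag{51}
+\end{equation}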
+
+Note that the completely deterministic case $\eta_0 = 1.0$ and $\alpha = \beta = 1$ gives $I = 0$.
+The intuitive meaning of the parameters $\alpha$ and $\beta$ is also schematically depicted in
+Figure 9d.
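+
+The parameterization of Equation (49) and its two limits, Equations (50) and (51), can be checked
+with a short script. This is a sketch under assumptions: the function name, the dictionary layout
+keyed by index tuples, and the helper `mix` are illustrative choices, while the default values of
+`eta0` and `eps` are the ones quoted in the text.

```python
def eta_coordinates(alpha, beta, eta0=0.5, eps=1.0e-10):
    """Return the eta-coordinates of Equation (49) as a dict keyed by index tuples."""
    eta = {(i,): eta0 for i in (1, 2, 3, 4)}

    def mix(base, weight, n_eps):
        # Interpolate from the independent value `base` towards eta0 - n_eps * eps.
        return base + weight * (eta0 - n_eps * eps - base)

    for i, j in ((1, 2), (3, 4)):                      # pairs inside <1133> subsystems
        eta[(i, j)] = mix(eta[(i,)] * eta[(j,)], alpha, 1)
    for i, j in ((1, 3), (1, 4), (2, 3), (2, 4)):      # pairs across the two subsystems
        eta[(i, j)] = mix(eta[(i,)] * eta[(j,)], alpha * beta, 1)
    eta[(1, 2, 3)] = mix(eta[(1, 2)] * eta[(3,)], alpha * beta, 2)
    eta[(1, 2, 4)] = mix(eta[(1, 2)] * eta[(4,)], alpha * beta, 2)
    eta[(1, 3, 4)] = mix(eta[(1,)] * eta[(3, 4)], alpha * beta, 2)
    eta[(2, 3, 4)] = mix(eta[(2,)] * eta[(3, 4)], alpha * beta, 2)
    eta[(1, 2, 3, 4)] = mix(eta[(1, 2)] * eta[(3, 4)], alpha * beta, 3)
    return eta

maximal = eta_coordinates(alpha=1.0, beta=1.0)      # Equation (50)
independent = eta_coordinates(alpha=0.0, beta=0.5)  # Equation (51), for any beta
split = eta_coordinates(alpha=1.0, beta=0.0)        # <1133> fully decomposable
print(maximal[(1, 2, 3, 4)], independent[(1, 2, 3, 4)], split[(1, 2, 3, 4)])
```

+With $\beta = 0$ the highest-order coordinate factorizes as $\eta_{1,2}\eta_{3,4}$, which matches the
+statement that $D[\langle 1111 \rangle : \langle 1133 \rangle] = 0$ in that case.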
+
+Figure 9. Contour plots of the complexity landscape of $I$, $C_c$, $C_m$, $C_c^R$, and $C_m^R$ on the
+$\alpha$-$\beta$ plane. (a) Superposition of the contour plots of $C_c$ and $C_m$; (b) superposition of
+the contour plots of $C_c^R$ and $C_m^R$; (c) contour plot of $I$. The color of the contour plots
+corresponds to the color gradient of the 3D plots in Figure 10; (d) schematic representation of the
+system in different regions of the $\alpha$-$\beta$ plane. Edge width represents the degree of edge
+information, and independence is depicted with dotted lines.
+
+Figure 10 shows the landscape of the proposed complexities on the $\alpha$-$\beta$ plane. Their contour
+plots are depicted in Figure 9. Each of the proposed complexities differs from the others at almost
+every point on the $\alpha$-$\beta$ plane, except on the intersection lines. Therefore, these measures
+serve as independent features of the system, each with its specific meaning with respect to the system
+decompositionability. The $\alpha$-$\beta$ plane shows a section of the actual structure of the
+complementarity between the proposed complexity measures expressed in Figure 3.
+The relations between the cuboid-bias complexities and the modular complexities in
+Equations (43)–(46) are also numerically confirmed. The modular complexities are no smaller
+than the corresponding cuboid-bias complexities, and they coincide at the parameter values
+$\alpha = \beta = 1$, which give the maximum values and dependencies in this parameterization.
+
+Figure 10. Landscape of the complexities $I$, $C_c$, $C_m$, $C_c^R$, and $C_m^R$ on the
+$\alpha$-$\beta$ plane. (a) Multi-information $I$; (b) cuboid-bias complexity $C_c$; (c) modular
+complexity $C_m$; (d) regularized cuboid-bias complexity $C_c^R$; (e) regularized modular complexity
+$C_m^R$. All complexity measures show the complementarity, intersecting with each other, and all except
+the multi-information $I$ satisfy the boundary conditions of vanishing at $\alpha = 0$ and $\beta = 0$.
+Note that the regularized complexities $C_c^R$ and $C_m^R$ show a singularity of convergence at
+$\alpha \to 0$ due to the regularization by an infinitesimal value.
+
+In the general case without the parameterization by $\alpha$, $\beta$ and $\eta_0$, the boundary
+conditions of $C_c$, $C_c^R$, $C_m$ and $C_m^R$ include that of the multi-information $I$, namely
+vanishing at the completely random or ordered state. This is common to other complexity measures such
+as the LMC complexity, and fits the basic intuition that complexity is situated equally far from the
+completely predictable and disordered states [21,22].
+The proposed complexities further incorporate boundary conditions that vanish with the existence
+of a completely independent subsystem of any size. This means that the $C_c$, $C_c^R$, $C_m$ and
+$C_m^R$ of a system become 0 if we add another independent variable. This property does not reflect the
+intuition of complexity defined by the arithmetic average of statistical measures. The proposed
+complexities are best interpreted in comparison with other complexity measures such as the
+multi-information $I$, and by interactively changing the system scale to avoid trivial results caused
+by a small independent subsystem. For example, the proposed complexities could be used as information
+criteria for model selection problems, especially with an approximate modular structure based on the
+statistical independence of data between subsystems. We maintain that the complementarity principle
+between multiple complexity measures of different foundations is the key to understanding complexity
+in a comprehensive manner.
+To characterize the properties of $C_c$, $C_c^R$, $C_m$ and $C_m^R$ in relation to the diverse
+composition of each system decomposition, it is useful to consider the geometry of their contour
+structure, as compared in Figure 9. The contours can be formalized as
+$C_c, C_c^R, C_m, C_m^R = \mathrm{const.}$ for each complexity measure, and
+$D[\langle 11 \cdots 1 \rangle : SD_i(i_s)] = \mathrm{const.}$ ($1 \le i \le n - 1$,
+$1 \le i_s \le |Seq|$) for each system decomposition.
+For that purpose, algebraic geometry can be considered a promising tool. Algebraic
+geometry investigates the geometrical properties of polynomial equations [23]. The complexities $C_c$,
+$C_c^R$, $C_m$ and $C_m^R$ can be interpreted as polynomial functions by taking each system
+decomposition as a new coordinate, and are therefore directly accessible to algebraic geometry.
+However, if we want to investigate the contours of the complexities on the $p$ parameter space, the
+logarithmic function appears through the definition of the KL-divergence; it is a transcendental
+function and falls outside the analytical scope of algebraic geometry. To introduce compatibility
+between the $p$ parameter space of information geometry and algebraic geometry, it suffices to describe
+the model by replacing the logarithmic functions with another $n$ variables such as $q = \log p$, and
+to reconsider the intersection between the result from algebraic geometry on the coordinates $(p, q)$
+and the $q = \log p$ condition. The contours of $C_c$, $C_c^R$, $C_m$ and $C_m^R$ are also important
+when seeking the utility of these measures as a potential for interpreting the dynamics of statistical
+association as geodesics.
+
+11. Further Consideration
+
+11.1. Pythagorean Relations in System Decomposition and Edge Cutting
+
+We further look back at the system decomposition and edge cutting in terms of the Pythagorean
+relation between KL-divergences, which is based on the orthogonality between the $\theta$ and $\eta$
+coordinates.
+In system decomposition, the distribution of the decomposed system is analytically obtained from
+the product of the subsystems’ $\eta$ coordinates, which is equivalent to setting all
+$\theta^{dec}$ parameters to 0 in the mixture coordinates $\xi^{dec}$. From the consistency of the
+$\theta^{dec}$ parameters in $\xi^{dec}$ being 0 in all system decompositions, we have the Pythagorean
+relation according to the inclusion relation of system decompositions. For example, the following
+holds:
+
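+\begin{equation}
+\begin{aligned}
+D[\langle 1111 \rangle : \langle 1234 \rangle] ={}& D[\langle 1111 \rangle : \langle 1222 \rangle] \\
+&+ D[\langle 1222 \rangle : \langle 1233 \rangle] \\
+&+ D[\langle 1233 \rangle : \langle 1234 \rangle].
+\end{aligned} \tag{52}
+\end{equation}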
+
+The proof proceeds in the same way as for the k-cut coordinates that isolate k-tuple statistical
+association between variables [14].
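+
+Equation (52) can be checked numerically on an arbitrary joint distribution of four binary
+variables. The sketch below reads each decomposed system as the product of the marginals of its
+subsystems, e.g. $\langle 1222 \rangle$ as $p(x_1)\,p(x_2, x_3, x_4)$; this reading, the helper names,
+and the randomly generated distribution are assumptions made for illustration.

```python
import itertools
import math
import random

STATES = list(itertools.product((0, 1), repeat=4))

def marginal(p, idx):
    """Marginal distribution of the variables at positions idx."""
    out = {}
    for x, px in p.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + px
    return out

def decompose(p, blocks):
    """Product of the marginals of p over the given blocks of variables."""
    margs = [marginal(p, b) for b in blocks]
    return {
        x: math.prod(m[tuple(x[i] for i in b)] for m, b in zip(margs, blocks))
        for x in STATES
    }

def kl(p, q):
    """KL-divergence D[p : q] in nats; assumes q > 0 wherever p > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

random.seed(0)
weights = [random.random() + 0.1 for _ in STATES]   # arbitrary strictly positive weights
p1111 = {x: w / sum(weights) for x, w in zip(STATES, weights)}

p1222 = decompose(p1111, [(0,), (1, 2, 3)])
p1233 = decompose(p1111, [(0,), (1,), (2, 3)])
p1234 = decompose(p1111, [(0,), (1,), (2,), (3,)])

lhs = kl(p1111, p1234)
rhs = kl(p1111, p1222) + kl(p1222, p1233) + kl(p1233, p1234)
print(lhs, rhs)
```

+Under this reading, the three terms on the right are the mutual informations peeled off at each
+decomposition step, so the identity is the chain rule of mutual information.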
+On the other hand, the edge cutting previously defined using the product of the remaining maximum
+cliques’ $\eta$ coordinates does not coincide with the $\theta^{ec} = 0$ condition in the mixture
+coordinates $\xi^{ec}$. We defined the $\eta^{ec}$ values of edge cutting based only on the orthogonal
+relation between the $\eta$ and $\theta$ coordinates, by generalizing the rule of system decomposition
+in the $\eta^{ec}$ coordinates, and did not consider the Pythagorean relation between different edge
+cuttings.
+It is then possible to define another way of edge cutting using the $\theta^{ec} = 0$ condition in
+$\xi^{ec}$. Indeed, in the k-cut mixture coordinates, the $\theta^{k+} = 0$ condition is derived from
+the independence condition of the variables in all orders, and k-tuple statistical association is
+measured by reestablishing the $\eta$ parameters for the statistical association up to the
+$(k-1)$-tuple order. In the same way, we can set the $\theta^{dec} = 0$ condition for $\xi^{dec}$ of a
+system decomposition, and reestablish edges with respect to the $\eta$ parameters, except the one in
+focus for edge cutting.
+As a simple example, consider the system decomposition $\langle 1222 \rangle$ and the edge cutting
+$1 - 2$ in a 4-node graph. We have the mixture coordinates $\xi^{dec}$ for the system decomposition as
+follows:
+
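+\begin{equation}
+\begin{aligned}
+\xi^{dec}_{1,2} &= \theta^{dec}_{1,2} = 0, \\
+\xi^{dec}_{1,3} &= \theta^{dec}_{1,3} = 0, \\
+\xi^{dec}_{1,4} &= \theta^{dec}_{1,4} = 0, \\
+\xi^{dec}_{1,2,3} &= \theta^{dec}_{1,2,3} = 0, \\
+\xi^{dec}_{1,3,4} &= \theta^{dec}_{1,3,4} = 0, \\
+\xi^{dec}_{1,2,3,4} &= \theta^{dec}_{1,2,3,4} = 0,
+\end{aligned} \tag{53}
+\end{equation}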
+
+where all the rest of ξdec coordinates is equivalent to that of η coordinates.
+We then consider the new way of edge cutting 1 − 2 by recovering the statistical association in
+edges 1 − 3 and 1 − 4 from system decomposition < 1222 >, orthogonally to that of edge 1 − 2. The
+new mixture coordinate ξEC changes to the following:
+
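+\begin{equation}
+\begin{aligned}
+\xi^{EC}_{1,2} &= \theta^{EC}_{1,2} = 0, \\
+\xi^{EC}_{1,3} &= \eta_{1,3}, \\
+\xi^{EC}_{1,4} &= \eta_{1,4}, \\
+\xi^{EC}_{1,2,3} &= \theta^{EC}_{1,2,3} = 0, \\
+\xi^{EC}_{1,3,4} &= \eta_{1,3,4}, \\
+\xi^{EC}_{1,2,3,4} &= \theta^{EC}_{1,2,3,4} = 0,
+\end{aligned} \tag{54}
+\end{equation}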
+
+and the rest is equivalent to that of the $\eta$ coordinates.
+This new $\xi^{EC}$ is also compatible with the k-cut coordinates formalization owing to its simple
+$\theta^{EC} = 0$ conditions. To obtain $\xi^{EC}$ for an arbitrary edge cutting $i - j$, one should
+take the $\theta^{EC}$ containing $i$ and $j$ in their subscripts, set them to 0, and combine them with
+the $\eta$ coordinates for the rest of the subscripts. For multiple edge cuttings
+$i - j, \cdots, k - l$ ($1 \le i, j, k, l \le n$), it suffices to take the $\theta^{EC}$ containing $i$
+and $j$, ..., $k$ and $l$ in their subscripts, respectively, and set them to 0.
+We finally obtain the Pythagorean relation between edge cuttings. Denoting the general edge
+cutting(s) coordinates as $\xi^{i-j, \cdots, k-l}$, the following holds for the example of the system
+decomposition $\langle 1222 \rangle$:
+
+\begin{equation}
+\begin{aligned}
+D[\langle 1111 \rangle : \langle 1222 \rangle] ={}& D[\langle 1111 \rangle : p(\xi^{1-2})] \\
+&+ D[p(\xi^{1-2}) : p(\xi^{1-2,1-3})] \\
+&+ D[p(\xi^{1-2,1-3}) : p(\xi^{1-2,1-3,1-4})].
+\end{aligned} \tag{55}
+\end{equation}
+
+Despite the consistency with the dual structure between $\theta$ and $\eta$, we do not generally have
+an analytical solution to determine the $\eta^{EC}$ values from the $\theta^{EC} = 0$ conditions. We
+have to resort to a numerical algorithm to solve the $\theta^{EC} = 0$ conditions with respect to the
+$\eta^{EC}$ values, which are in general high-degree simultaneous polynomials. Furthermore, the
+numerical convergence of the solution has to be very strict, since a tiny deviation from the conditions
+can become non-negligible after passing through the fractional and logarithmic functions of the
+$\theta$ coordinates.
+On the other hand, the previously defined edge cutting with $\xi^{ec}$, using the product between the
+subgraphs’ $\eta$ coordinates, is analytically simple and does not need to consider the recovery of
+other edges from the system decomposition or the independence hypothesis. We therefore chose the
+previous way of edge cutting for both calculability and clarity of the concept.
+There have been many attempts to approximate complex networks by low-dimensional systems
+with the use of statistical physics and network theory. As a contemporary example, moment-closure
+approximation provides various ways to abstract essential dynamics, e.g., in discrete adaptive
+networks [24]. Although the approximation relies on several theoretical assumptions such as the random
+graph approximation, it is difficult to quantitatively reproduce the dynamics even in some of the
+simplest models. This is partly due to the homogeneous treatment of statistics, such as truncation at
+the pair-wise order. Edge cutting can offer a complementary view on the evaluation of moment-closure
+approximations. Using the orthogonal decomposition between edge information, one can evaluate which
+part of the network links and which order of statistics contain essential information, which does not
+necessarily conform to the top-down theoretical treatment.
+
+11.2. Complexity of the Systems with Continuous Phase Space
+
+We have developed the concept of system decompositionability based on discrete binary variables.
+One can also apply the same principle to continuous variables.
+For an ergodic map $G : X \to X$ on a continuous space $X$, the KS entropy $h(\mu, G)$ is defined as
+the supremum of the entropy rate over all possible system decompositions $A$, when the invariant
+measure $\mu$ exists:
+
+\begin{equation}
+h(\mu, G) = \sup_{A} h(\mu, G, A), \tag{56}
+\end{equation}
+
+where $A$ is a disjoint decomposition of $X$ that consists of non-trivial sets $a_i$, whose total
+number is $n(A)$, defined as
+
+\begin{equation}
+X = \bigcup_{i=1}^{n(A)} a_i, \tag{57}
+\end{equation}
+
+\begin{equation}
+a_i \cap a_j = \emptyset, \quad i \neq j, \quad 1 \le i, j \le n(A), \tag{58}
+\end{equation}
+
+which is the natural extension of system decomposition to continuous space.
+The entropy rate $h(\mu, G, A)$ in Equation (56) is defined as
+
+\begin{equation}
+h(\mu, G, A) = \lim_{n \to \infty} \frac{1}{n} H(\mu, A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)), \tag{59}
+\end{equation}
+
+according to the entropy $H(\mu, A)$ based on the decomposition $A = \{a_i\}$,
+
+\begin{equation}
+H(\mu, A) = - \sum_{i=1}^{n(A)} \mu(a_i) \ln \mu(a_i), \tag{60}
+\end{equation}
+
+and the product $C = A \vee B$ defined as
+
+\begin{equation}
+C = A \vee B = \{ c_i = a_j \cap b_k \mid 1 \le j \le n(A),\ 1 \le k \le n(B) \}. \tag{61}
+\end{equation}
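+
+Equations (60) and (61) can be tried out on a finite toy space. The sketch below is only an
+illustration: the eight-point space, the uniform measure, and the two partitions are hypothetical
+choices, and empty intersections are dropped from the product since they carry no measure.

```python
import math
from fractions import Fraction

def entropy(measure, partition):
    """H(mu, A) = -sum_i mu(a_i) ln mu(a_i) for a finite partition, as in Equation (60)."""
    total = 0.0
    for cell in partition:
        mass = float(sum(measure[x] for x in cell))
        if mass > 0:
            total -= mass * math.log(mass)
    return total

def join(part_a, part_b):
    """A v B: the non-empty intersections of cells of A and B, as in Equation (61)."""
    return [a & b for a in part_a for b in part_b if a & b]

points = range(8)
mu = {x: Fraction(1, 8) for x in points}   # hypothetical uniform measure on a toy space
A = [frozenset(x for x in points if x % 2 == 0), frozenset(x for x in points if x % 2 == 1)]
B = [frozenset(x for x in points if x < 4), frozenset(x for x in points if x >= 4)]
C = join(A, B)

print(len(C), entropy(mu, A), entropy(mu, B), entropy(mu, C))
```

+Here the two partitions are independent under the uniform measure, so the entropy of the product is
+the sum of the two entropies.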
+
+In a more general case, the topological entropy $h_T(G)$ is defined simply with the number of
+decomposed subsystem elements obtained by preimages, as follows. It requires neither ergodicity nor,
+therefore, the existence of an invariant measure $\mu$:
+
+\begin{equation}
+h_T(G) = \sup_{A} \lim_{n \to \infty} \frac{1}{n} \ln n(A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)). \tag{62}
+\end{equation}
+
+Topological entropy takes the maximum value over the possible preimage divisions, in order to
+measure the complexity in terms of the mixing degree of the orbits. For example, if the KS entropy
+is positive, $h(\mu, G) > 0$, the dynamics of $G$ on an invariant set of the invariant measure $\mu$ is
+chaotic for almost all initial conditions. When the topological entropy is positive, $h_T(G) > 0$, the
+dynamics of $G$ contains chaotic orbits, but not necessarily as an attracting chaotic invariant set,
+since $h_T(G) \ge h(\mu, G)$ and the KS entropy can still be zero.
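+
+The cell count inside Equation (62) can be illustrated on a grid. The sketch below fixes one
+partition, $A = \{[0, 1/2), [1/2, 1)\}$, and one finite $n$, so it evaluates only the expression
+under the limit and the supremum, not $h_T(G)$ itself. The doubling map, the grid resolution `bits`,
+and all function names are assumptions chosen for illustration.

```python
import math

def count_itineraries(step, n, bits=12):
    """Count the cells of A v G^-1(A) v ... v G^-(n-1)(A) hit by grid points k / 2**bits."""
    size = 2 ** bits
    seen = set()
    for k in range(size):
        word = []
        x = k
        for _ in range(n):
            word.append(x >= size // 2)   # which half of [0, 1) the point lies in
            x = step(x, size)
        seen.add(tuple(word))             # one itinerary per cell of the refined partition
    return len(seen)

def doubling(x, size):
    """G(x) = 2x mod 1, acting on integer grid indices."""
    return (2 * x) % size

def identity(x, size):
    """G(x) = x, a map without any mixing."""
    return x

n = 8
rate_doubling = math.log(count_itineraries(doubling, n)) / n
rate_identity = math.log(count_itineraries(identity, n)) / n
print(rate_doubling, rate_identity)
```

+For the doubling map the number of cells doubles with every preimage, so the rate stays at
+$\ln 2$, while for the identity map the count stays at 2 and the rate decays as $\ln 2 / n$.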
+Although these definitions are useful to characterize the existence of chaotic dynamics, the system
+decompositionability is another property representing a different aspect of the system complexity. It
+is rather a matter of the existence of independent dynamical components, or of the degree of orbit
+localization between arbitrary system decompositions. We propose the following “geometric topological
+entropy” $h_g(G)$, applying the same principle of taking the geometric product over all hierarchical
+structures of the system decomposition $A$:
+
+\begin{equation}
+h_g(G) := \prod_{\sigma(A) > 0} \lim_{n \to \infty} \frac{1}{n} \ln n(A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)), \tag{63}
+\end{equation}
+
+where $\sigma(A) > 0$ means taking all components of $A$ that have positive Lebesgue measure on $X$.
+This gives 0 if the preimage of a certain $a_i \in A$ is $a_i$ itself, meaning that there exists a
+subsystem $a_i$ whose range is invariant under $G$ and closed by itself. The system $X$ can then be
+completely divided into $a_i$ and the rest. This corresponds to the existence of an independent
+subsystem in the cuboid-bias and modular complexities. If such independent components do not exist,
+$h_g(G)$ still reflects the degree of orbit localization for all possible system decompositions in a
+multiplicative manner. The condition $\sigma(A) > 0$ avoids trivial cases such as the existence of an
+unstable limit cycle, whose Lebesgue measure is 0.
+A typical example giving $h_g(G) = 0$ is a function having independent ergodic components, such
+as the Chirikov-Taylor map with appropriate parameters [25].
+
+12. Conclusions and Discussion
+
+We have theoretically developed a framework to measure the degree of statistical association
+existing between subsystems as well as that represented by each edge of the graph representation.
+We then reconsidered the problem of how to define complexity measures in terms of the construction
+of a non-linear feature space. We defined a new type of complexity based on the geometrical product of
+KL-divergences representing the degree of system decompositionability. Different complexity measures,
+including the newly proposed ones, were compared on a complementarity basis on the statistical
+manifold.
+Applications of the presented theory can encompass a large field of complex systems and data science,
+such as social networks, genetic expression networks, neural activities, ecological databases, and any
+kind of complex network with binary co-occurrence matrix data, e.g., [26–29], databases: [30–34].
+Continuous variables are also accessible by appropriate discretization of the information source with,
+e.g., the entropy maximization principle.
+In contrast to the arithmetic mean of information over the whole system, the geometric mean has not
+been investigated sufficiently in the analysis of complex networks. In a different field, however,
+theoretical ecology has already pointed out the importance of the geometric mean when considering the
+long-term fitness of a species population in a randomly varying environment [35,36]. Long-term fitness
+refers to the ecological complexity of its survival strategy under large stochastic fluctuation. Here,
+we can find a useful analogy between the growth rate of a population in ecology and the
+spatio-temporal propagation rate of information between subsystems in general. If we take an arbitrary
+subsystem and consider the amount of information it can exchange with all other subsystems, the
+proposed complexity measures with the geometric mean reflect the minimum amount among all possible
+other subsystems, which cannot be distinguished with the arithmetic mean. The propagation rate of a
+population in ecology and the information transmission in a complex network have mathematically
+analogous structures. In population ecology, the variance of the growth rate is crucial to evaluate the
+long-term survival of the population. Even if the arithmetic mean of the growth rate is high, a large
+variance will lead to a low geometric mean, even with a small number of situations of exceptionally
+small fitness, which ecologically means the extinction of an entire species. In a stochastic network,
+the variance of system decompositionability is essential to evaluate the amount of information shared
+between subsystems, or the information persistence in the entire network. Even if the
+multi-information $I$ is high, a large heterogeneity of edge information can lead to the informational
+isolation of a certain subsystem, which means the extinction of its information. If such a subsystem
+is situated on the transmission pathway, information cannot propagate across these nodes. Therefore,
+the proposed complexity measures $C_c$, $C_c^R$, $C_m$ and $C_m^R$ generally reflect the minimum
+information propagation rate spread over the entire system, without exception for isolated divisions.
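+
+The contrast between the two means can be made concrete with a few numbers. The growth rates below
+are hypothetical and serve only to show how a single exceptionally small value dominates the
+geometric mean while barely affecting the arithmetic mean.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    """Geometric mean of strictly positive values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

steady = [1.1, 1.1, 1.1, 1.1]        # hypothetical growth rates with no variance
volatile = [1.5, 1.5, 1.5, 0.01]     # higher on average, with one near-extinction event

for name, rates in (("steady", steady), ("volatile", volatile)):
    print(name, arithmetic_mean(rates), geometric_mean(rates))
```

+The volatile series has the larger arithmetic mean, yet its geometric mean falls below 1, which in
+the ecological reading corresponds to a shrinking population.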
+Some recent studies on adaptive networks focus on the evolution of network topology in response
+to node activity, such as game-theoretic evolution of strategies [37], opinion dynamics on an evolving
+network [38], epidemic spreading on an adaptive network [39], etc. Analysis of the coevolution network
+between variables and interactions can capture important dynamical features of complex systems. The
+newly proposed complexity measures can complement topological network analysis with an analysis of
+its statistical dynamics. In addition to the topological change of the network model (e.g., linking
+dynamics of game theory, opinion community network structure, contact network of epidemic
+transmission), one can evaluate the emerging statistical association between the variables, which does
+not necessarily coincide with the network topology. An interesting feature of non-linear dynamics is
+the unexpected correlation between distant variables, which is quantified by the Tsallis entropy [40].
+The complementary relation between concrete interaction and the resulting statistical association can
+provide a twofold methodology to characterize the coevolutionary dynamics of adaptive networks. Such a
+strategy can promote integrated science from laboratory experiments to open-field in natura
+situations, where actual multi-scale problems remain to be solved [41].
+Arithmetic and geometric means can be integrated in a mutual formula called generalized
+mean [42]. Therefore, the proposed complexity measures with geometric mean of KL-divergence is
+an expansion of preexisting complexity measures with mixture coordinates. Table 1 summarizes the
+generalization of complexity measure in this article. Based on the k-cut coordinates ı, the weighted
+sum of KL-divergence representing k-tuple order of statistical association derived complexity measures
+with (weighted) arithmetic mean such as multi-information I and TSE complexity. On the other hand,
+we showed that subsystem-wise correlation can also be isolated with the use of mixture coordinates,
+namely < · · · >-cut coordinates ¸. To quantify the heterogeneity of system decompositionability, we
+generally took a weighted geometric mean of KL-divergence in CC, CR
+C, Cm and CR
+m. Here, the shortest
+path selection of Cm and CR
+m, and regularization of CR
+C and CR
+m with respect to multi-information I
+can be interpreted as the weight function of geometric mean. This perspective brings a definition
+of a generalized class of complexity measures based on the mixture coordinates and generalized
+mean of KL-divergence. Information discrepancy can also be generalized from KL-divergence to
+Bregman divergence, providing access to the concept of multiple centroids in large stochastic data
+analysis such as image processing [43]. The blank columns of the Table 1 imply the possibility of
+other complexity measures in this class. For example, the weighted geometric mean of KL-divergence
+defined between k-cut coordinates is expected to yield complexity measures that are sensitive to
+the heterogeneity of correlation orders. The weighted arithmetic mean of KL-divergence defined
+between < · · · >-cut coordinates should be sensitive to the mean decompositionability of arbitrary
+subsystem. Since these measures take analytically different form on mixture coordinates and/or mean
+functions, their derivatives do not coincide, which gives independent information about the system on
+a complementary basis of the statistical manifold, as long as the number of complexity measures is
+smaller than the number of degrees of freedom of the system.
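+
+As an illustrative aside, the generalized mean referred to above can be sketched in Python. This is a minimal sketch rather than the author's implementation: the function name generalized_mean, the equal weights and the sample KL-divergence values are hypothetical. The exponent p = 1 gives the weighted arithmetic mean, and the limit p → 0 gives the weighted geometric mean.
+
```python
import math


def generalized_mean(values, weights, p):
    """Weighted power mean M_p of positive values; p = 0 is the geometric-mean limit."""
    total = sum(weights)
    w = [wi / total for wi in weights]  # normalise the weights to sum to one
    if p == 0:
        # limit p -> 0: exponential of the weighted mean of the logarithms
        return math.exp(sum(wi * math.log(v) for wi, v in zip(w, values)))
    return sum(wi * v ** p for wi, v in zip(w, values)) ** (1.0 / p)


# Hypothetical KL-divergence values (all positive) with equal weights.
kl_values = [0.5, 2.0, 8.0]
weights = [1.0, 1.0, 1.0]
arithmetic = generalized_mean(kl_values, weights, 1)  # weighted arithmetic mean
geometric = generalized_mean(kl_values, weights, 0)   # weighted geometric mean
```
+
+With these assumed values the arithmetic mean is 3.5 and the geometric mean is approximately 2.0; a small positive exponent such as p = 1e-6 gives a value close to the geometric mean.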
+
+Table 1. Classification of complexity measures with KL-divergence on mixture coordinates.
+
+Mixture Coordinates | Generalized Mean of KL-Divergence: Arithmetic Mean | Generalized Mean of KL-Divergence: Geometric Mean
+k-cut ı | TSE complexity, I | (blank)
+< · · · >-cut ¸ | (blank) | C_C, C_C^R, C_m, C_m^R
+
+Acknowledgments: This study was partially supported by CNRS, the long term study abroad support program
+of the university of Tokyo, and the French government (Promotion Simone de Beauvoir).
+
+Conflicts of Interest: The author declares no conflict of interest.
+
+References
+
+1. Boccaletti, S.; Latora, V.; Moreno, Y.; Chavez, M.; Hwang, D.U. Complex Networks: Structure and Dynamics. Phys. Rep. 2006, 424, 175–308.
+2. Strogatz, S.H. Exploring Complex Networks. Nature 2001, 410, 268–276.
+3. Wasserman, S.; Faust, K. Social Network Analysis; Cambridge University Press: Cambridge, UK, 1994.
+4. Funabashi, M.; Cointet, J.P.; Chavalarias, D. Complex Network. In Studies in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2009; Volume 207, pp. 161–172.
+5. Badii, R.; Politi, A. Complexity: Hierarchical Structures and Scaling in Physics; Cambridge University Press: Cambridge, UK, 2008.
+6. Lempel, A.; Ziv, J. On the Complexity of Finite Sequences. IEEE Trans. Inf. Theory 1976, 22, 75–81.
+7. Li, M.; Vitanyi, P. Texts in Computer Science. In An Introduction to Kolmogorov Complexity and Its Applications, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 1997.
+8. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: New York, NY, USA, 2006.
+9. Bennett, C. On the Nature and Origin of Complexity in Discrete, Homogeneous, Locally-Interacting Systems. Found. Phys. 1986, 16, 585–592.
+10. Grassberger, P. Toward a Quantitative Theory of Self-Generated Complexity. Int. J. Theor. Phys. 1986, 25, 907–938.
+11. Crutchfield, J.P.; Feldman, D.P. Regularities Unseen, Randomness Observed: The Entropy Convergence Hierarchy. Chaos 2003, 15, 25–54.
+12. Crutchfield, J.P. Inferring Statistical Complexity. Phys. Rev. Lett. 1989, 63, 105–108.
+13. Prichard, D.; Theiler, J. Generalized Redundancies for Time Series Analysis. Physica D 1995, 84, 476–493.
+14. Amari, S. Information Geometry on Hierarchy of Probability Distributions. IEEE Trans. Inf. Theory 2001, 47, 1701–1711.
+15. Ay, N.; Olbrich, E.; Bertschinger, N.; Jost, J. A Unifying Framework for Complexity Measures of Finite Systems; Report 06-08-028; Santa Fe Institute: Santa Fe, NM, USA, 2006.
+16. MacKay, R.S. Nonlinearity in Complexity Science. Nonlinearity 2008, 21, T273–T281.
+17. Tononi, G.; Sporns, O.; Edelman, M. A Measure for Brain Complexity: Relating Functional Segregation and Integration in the Nervous System. Proc. Natl. Acad. Sci. USA 1994, 91, 5033.
+18. Feldman, D.P.; Crutchfield, J.P. Measures of Statistical Complexity: Why? Phys. Lett. A 1998, 238, 244–252.
+19. Nakahara, H.; Amari, S. Information-Geometric Measure for Neural Spikes. Neural Comput. 2002, 14, 2269–2316.
+20. Olbrich, E.; Bertschinger, N.; Ay, N.; Jost, J. How Should Complexity Scale with System Size? Eur. Phys. J. B 2008, 63, 407–415.
+21. Feldman, D.P.; Crutchfield, J.P. Measures of Statistical Complexity: Why? Phys. Lett. A 1998, 238, 244–252.
+22. Lopez-Ruiz, R.; Mancini, H.; Calbet, X. A Statistical Measure of Complexity. Phys. Lett. A 1995, 209, 321–326.
+
+Entropy 2014, 16, 4132–4167
+
+23. Hodge, W.; Pedoe, D. Methods of Algebraic Geometry; Cambridge Mathematical Library, Cambridge University Press: Cambridge, UK, 1994; Volume 1–3.
+24. Demirel, G.; Vazquez, F.; Bohme, G.; Gross, T. Moment-closure Approximations for Discrete Adaptive Networks. Physica D 2014, 267, 68–80.
+25. Fraser, G., Ed. The New Physics for the Twenty-First Century; Cambridge University Press: Cambridge, UK, 2006; p. 335.
+26. Scott, J. Social Network Analysis: A Handbook; SAGE Publications Ltd.: London, UK, 2000.
+27. Geier, F.; Timmer, J.; Fleck, C. Reconstructing Gene-Regulatory Networks from Time Series, Knock-Out Data, and Prior Knowledge. BMC Syst. Biol. 2007, 1, doi:10.1186/1752-0509-1-11.
+28. Brown, E.N.; Kass, R.E.; Mitra, P.P. Multiple Neural Spike Train Data Analysis: State-of-the-Art and Future Challenges. Nat. Neurosci. 2004, 7, 456–461.
+29. Yee, T.W. The Analysis of Binary Data in Quantitative Plant Ecology. Ph.D. Thesis, The University of Auckland, New Zealand, 1993.
+30. Stanford Large Network Dataset Collection. Available online: http://snap.stanford.edu/data/ (accessed on 19 July 2014).
+31. BioGRID. Available online: http://thebiogrid.org/ (accessed on 19 July 2014).
+32. Neuroscience Information Framework. Available online: http://www.neuinfo.org/ (accessed on 19 July 2014).
+33. Global Biodiversity Information Facility. Available online: http://www.gbif.org/ (accessed on 19 July 2014).
+34. UCI Network Data Repository. Available online: http://networkdata.ics.uci.edu/index.php (accessed on 19 July 2014).
+35. Lewontin, R.C.; Cohen, D. On Population Growth in a Randomly Varying Environment. Proc. Natl. Acad. Sci. USA 1969, 62, 1056–1060.
+36. Yoshimura, J.; Clark, C.W. Individual Adaptations in Stochastic Environments. Evol. Ecol. 1969, 5, 173–192.
+37. Wu, B.; Zhou, D.; Wang, L. Evolutionary Dynamics on Stochastic Evolving Networks for Multiple-Strategy Games. Phys. Rev. E 2011, 84, 046111.
+38. Fu, F.; Wang, L. Coevolutionary Dynamics of Opinions and Networks: From Diversity to Uniformity. Phys. Rev. E 2008, 78, 016104.
+39. Gross, T.; D’Lima, C.J.D.; Blasius, B. Epidemic Dynamics on an Adaptive Network. Phys. Rev. Lett. 2006, 96, 208701.
+40. Tsallis, C. Possible Generalization of Boltzmann-Gibbs Statistics. J. Stat. Phys. 1988, 52, 479–487.
+41. Quintana-Murci, L.; Alcais, A.; Abel, L.; Casanova, J.L. Immunology in natura: Clinical, Epidemiological and Evolutionary Genetics of Infectious Diseases. Nat. Immunol. 2007, 8, 1165–1171.
+42. Hardy, G.; Littlewood, J.; Polya, G. Inequalities; Cambridge University Press: Cambridge, UK, 1967; Chapter 3.
+43. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904.
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+entropy
+
+Article
+The Entropy-Based Quantum Metric
+
+Roger Balian
+
+Institut de Physique Théorique, CEA/Saclay, F-91191 Gif-sur-Yvette Cedex, France;
+E-Mail: roger@balian.fr
+
+Received: 15 May 2014; in revised form: 25 June 2014 / Accepted: 11 July 2014 /
+Published: 15 July 2014
+
+Abstract: The von Neumann entropy S( ˆD) generates in the space of quantum density matrices
+ˆD the Riemannian metric ds² = −d²S( ˆD), which is physically founded and which characterises
+the amount of quantum information lost by mixing ˆD and ˆD + d ˆD. A rich geometric structure is
+thereby implemented in quantum mechanics. It includes a canonical mapping between the spaces
+of states and of observables, which involves the Legendre transform of S( ˆD). The Kubo scalar
+product is recovered within the space of observables. Applications are given to equilibrium and
+non-equilibrium quantum statistical mechanics. There the formalism is specialised to the relevant space of
+observables and to the associated reduced states issued from the maximum entropy criterion, which
+result from the exact states through an orthogonal projection. Von Neumann’s entropy specialises
+into a relevant entropy. Comparison is made with other metrics. The Riemannian properties of the
+metric ds² = −d²S( ˆD) are derived. The curvature arises from the non-Abelian nature of quantum
+mechanics; its general expression and its explicit form for q-bits are given, as well as geodesics.
+
+Keywords: quantum entropy; metric; q-bit; information; geometry; geodesics; relevant entropy
+
+1. A Physical Metric for Quantum States
+
+Quantum physical quantities pertaining to a given system, termed as “observables” ˆO, behave
+as non-commutative random variables and are elements of a C*-algebra. We will consider below
+systems for which these observables can be represented by n-dimensional Hermitean matrices in a
+finite-dimensional Hilbert space H. In quantum (statistical) mechanics, the “state” of such a system
+encompasses the expectation values of all its observables [1]. It is represented by a density matrix ˆD,
+which plays the rôle of a probability distribution, and from which one can derive the expectation value
+of ˆO in the form
+$$\langle \hat{O} \rangle = \mathrm{Tr}\, \hat{D} \hat{O} = (\hat{D}; \hat{O}) . \tag{1}$$
+
+Density matrices should be Hermitean (< ˆO > is real for ˆO = ˆO†), normalised (the expectation
+value of the unit observable is Tr ˆD = 1) and non-negative (variances < ˆO² > − < ˆO >² are
+non-negative). They depend on n² − 1 real parameters. If we keep aside the multiplicative structure of
+the set of operators and focus on their linear vector space structure, Equation (1) appears as a linear
+mapping of the space of observables onto real numbers. We can therefore regard the observables and
+the density operators ˆD as elements of two dual vector spaces, and expectation values (1) appear as
+scalar products.
+It is of interest to define a metric in the space of states. For instance, the distance between an
+exact state ˆD and an approximation ˆDapp would then characterise the quality of this approximation.
+However, all physical quantities come out in the form (1) which lies astride the two dual spaces of
+observables and states. In order to build a metric having physical relevance, we need to rely on another
+meaningful quantity which pertains only to the space of states.
+
+Entropy 2014, 16, 3878–3888; doi:10.3390/e16073878
+
+We note at this point that quantum states are probabilistic objects that gather information about
+the considered system. Then, the amount of missing information is measured by von Neumann’s
+entropy
+$$S(\hat{D}) \equiv -\mathrm{Tr}\, \hat{D} \ln \hat{D} . \tag{2}$$
+
+Introduced in the context of quantum measurements, this quantity is identified with the
+thermodynamic entropy when ˆD is an equilibrium state. In non-equilibrium statistical mechanics,
+it encompasses, in the form of “relevant entropy” (see Section 5 below), various entropies defined
+through the maximum entropy criterion. It is also introduced in quantum computation. Alternative
+entropies have been introduced in the literature, but they do not present all the distinctive and natural
+features of von Neumann’s entropy, such as additivity and concavity.
+As S( ˆD) is a concave function, and as it is the sole physically meaningful quantity apart from
+expectation values, it is natural to rely on it for our purpose. We thus define [2] the distance ds between
+two neighbouring density matrices ˆD and ˆD + d ˆD as the square root of
+
+$$ds^2 = -d^2 S(\hat{D}) = \mathrm{Tr}\, d\hat{D}\, d\ln\hat{D} . \tag{3}$$
+
+This Riemannian metric is of the Hessian form since the metric tensor is generated by taking second
+derivatives of the function S( ˆD) with respect to the n² − 1 coordinates of ˆD. We may take for such
+coordinates the real and imaginary parts of the matrix elements, or equivalently (Section 6) some linear
+transform of these (keeping aside the norm Tr ˆD = 1).
+
+2. Interpretation in the Context of Quantum Information
+
+The simplest example, related to quantum information theory, is that of a q-bit (two-level system
+or spin 1/2) for which n = 2. Its states, represented by 2 × 2 Hermitean normalised density matrices ˆD,
+can conveniently be parameterised, on the basis of Pauli matrices, by the components rμ = D12 + D21,
+i(D12 − D21), D11 − D22 (μ = 1, 2, 3) of a 3-dimensional vector r lying within the unit Poincaré–Bloch
+sphere (r ≤ 1). From the corresponding entropy
+
+$$S = \frac{1+r}{2} \ln\frac{2}{1+r} + \frac{1-r}{2} \ln\frac{2}{1-r} , \tag{4}$$
+
+we derive the metric
+
+$$ds^2 = \frac{1}{1-r^2} \left( \frac{\mathbf{r} \cdot d\mathbf{r}}{r} \right)^2 + \frac{1}{2r} \ln\frac{1+r}{1-r} \left| \frac{\mathbf{r} \times d\mathbf{r}}{r} \right|^2 , \tag{5}$$
+
+which is a natural Riemannian metric for q-bits, or more generally for positive 2 × 2 matrices. The
+metric tensor characterizing (5) diverges in the vicinity of pure states r = 1, due to the singularity of
+the entropy (2) for vanishing eigenvalues of ˆD. However, the distance between two arbitrary (even
+pure) states ˆD′ and ˆD′′ measured along a geodesic is always finite. We shall see (Equation (29)) that
+for n = 2 the geodesic distance s between two neighbouring pure states ˆD′ and ˆD′′, represented by
+unit vectors r′ and r′′ making a small angle δϕ ∼ |r′ − r′′|, behaves as δs² ∼ δϕ² ln(4√π/δϕ). The
+singularity of the metric tensor manifests itself through this logarithmic factor.
+Identifying von Neumann’s entropy to a measure of missing information, we can give a simple
+interpretation to the distance between two states. Indeed, the concavity of entropy expresses that some
+information is lost when two statistical ensembles described by different density operators merge. By
+mixing two equal size populations described by the neighbouring distributions ˆD′ = ˆD + δ ˆD/2 and
+ˆD′′ = ˆD − δ ˆD/2 separated by a distance δs, we lose an amount of information given by
+
+$$\Delta S \equiv S(\hat{D}) - \frac{S(\hat{D}') + S(\hat{D}'')}{2} \sim \frac{\delta s^2}{8} , \tag{6}$$
+
+and thereby directly related to the distance δs defined by (3). The proof of this equivalence relies on
+the expansion of the entropies S( ˆD′) and S( ˆD′′) around ˆD, and is valid when Tr δ ˆD² is negligible
+compared to the smallest eigenvalue of ˆD. If ˆD′ and ˆD′′ are distant, the quantity 8ΔS cannot be
+regarded as the square of a distance that would be generated by a local metric. The equivalence (6) for
+neighbouring states shows that ds² is the metric that is the best suited to measure losses of information
+by mixing.
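+
+The equivalence (6) can be checked numerically for a q-bit. The following Python lines are a minimal sketch under stated assumptions, not part of the original article: the helper names entropy, metric_ds2 and information_loss, the Bloch vector r_vec and the separation delta are hypothetical choices. The sketch evaluates the entropy (4) on the two neighbouring states and on their mixture, and compares 8ΔS with the squared distance given by the metric (5).
+
```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def entropy(r):
    """Von Neumann entropy of a q-bit with Bloch radius 0 <= r < 1, Eq. (4)."""
    if r == 0.0:
        return math.log(2.0)
    return (0.5 * (1 + r) * math.log(2 / (1 + r))
            + 0.5 * (1 - r) * math.log(2 / (1 - r)))


def metric_ds2(r_vec, dr):
    """Squared distance of Eq. (5) for a displacement dr at the point r_vec (0 < r < 1)."""
    r = norm(r_vec)
    radial = sum(a * b for a, b in zip(r_vec, dr)) / r
    cross = (r_vec[1] * dr[2] - r_vec[2] * dr[1],
             r_vec[2] * dr[0] - r_vec[0] * dr[2],
             r_vec[0] * dr[1] - r_vec[1] * dr[0])
    tangential_sq = (norm(cross) / r) ** 2
    # (1/2r) ln((1+r)/(1-r)) is the same quantity as atanh(r)/r
    return radial ** 2 / (1 - r * r) + math.atanh(r) / r * tangential_sq


def information_loss(r_vec, delta):
    """Delta S of Eq. (6) for the mixture of r_vec + delta/2 and r_vec - delta/2."""
    plus = [a + 0.5 * b for a, b in zip(r_vec, delta)]
    minus = [a - 0.5 * b for a, b in zip(r_vec, delta)]
    return entropy(norm(r_vec)) - 0.5 * (entropy(norm(plus)) + entropy(norm(minus)))


r_vec = (0.3, -0.2, 0.5)      # hypothetical mixed state inside the Bloch sphere
delta = (1e-3, 4e-3, -2e-3)   # hypothetical small separation between the two populations
lhs = 8 * information_loss(r_vec, delta)
rhs = metric_ds2(r_vec, delta)
```
+
+For such a small separation, lhs and rhs should agree up to corrections of higher order in delta.
+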
+The singularity of δs² at the edge of the positivity domain of ˆD may suggest that the result (6)
+holds only within this domain. In fact, this equivalence remains nearly valid even in the limit of
+pure states because ΔS itself involves a similar singularity. Indeed, if the states ˆD′ = |ψ′ >< ψ′|
+and ˆD′′ = |ψ′′ >< ψ′′| are pure and close to each other, the loss of information ΔS behaves as
+8ΔS ∼ δϕ² ln(4/δϕ) where δϕ² ∼ 2 Tr δD². This result should be compared to various geodesic
+distances between pure quantum states, which behave as δs² ∼ δϕ² ln(4√π/δϕ) for the present metric,
+and as δs²_BH = 4δs²_FS ∼ δϕ² ∼ Tr( ˆD′ − ˆD′′)² for the Bures–Helstrom and the quantum Fubini–Study
+metrics, respectively (see Section 7; these behaviours hold not only for n = 2 but for arbitrary n since
+only the space spanned by |ψ′ > and |ψ′′ > is involved). Thus, among these metrics, only ds² = −d²S
+can be interpreted in terms of information loss, whether the states ˆD′ and ˆD′′ are pure or mixed.
+At the other extreme, around the most disordered state ˆD = ˆI/n, in the region ∥ n ˆD − ˆI ∥ ≪ 1,
+the metric becomes Euclidean since ds² = Tr d ˆD d ln ˆD ∼ n Tr(d ˆD)² (for n = 2, ds² = dr²). For a given
+shift d ˆD, the qualitative change of a state ˆD, as measured by the distance ds, gets larger and larger as
+the state ˆD becomes purer and purer, that is, when the information content of ˆD increases.
+
+3. Geometry of Quantum Statistical Mechanics
+
+A rich geometric structure is generated for both states and observables by von Neumann’s
+entropy through introduction of the metric ds² = −d²S. Now, this metric (3) supplements the
+algebraic structure of the set of observables and the above duality between the vector spaces of states
+and of observables, with scalar product (1). Accordingly, we can define naturally within the space of
+states scalar products, geodesics, angles, curvatures.
+We can also regard the coordinates of d ˆD and d ln ˆD as covariant and contravariant components
+of the same infinitesimal vector (Section 6). To this aim, let us introduce the mapping
+
+$$\hat{D} \equiv \frac{e^{\hat{X}}}{\mathrm{Tr}\, e^{\hat{X}}} \tag{7}$$
+
+between ˆD in the space of states and ˆX in the space of observables. The operator ˆX appears as a
+parameterisation of ˆD. (The normalisation of ˆD entails that ˆX, defined within an arbitrary additive
+constant operator X0 ˆI, also depends on n² − 1 independent real parameters.) The metric (3) can then
+be re-expressed in terms of ˆX in the form
+
+$$ds^2 = \mathrm{Tr}\, d\hat{D}\, d\hat{X} = \mathrm{Tr} \int_0^1 d\xi\, \hat{D}\, e^{-\xi\hat{X}}\, d\hat{X}\, e^{\xi\hat{X}}\, d\hat{X} - (\mathrm{Tr}\, \hat{D}\, d\hat{X})^2 = d^2 \ln \mathrm{Tr}\, e^{\hat{X}} = d^2 F , \tag{8}$$
+
+where we introduced the function
+
+$$F(\hat{X}) \equiv \ln \mathrm{Tr}\, e^{\hat{X}} \tag{9}$$
+
+of the observable ˆX. (The addition of X0 ˆI to ˆX results in the addition of the irrelevant constant X0 to F.)
+This mapping provides us with a natural metric in the space of observables, from which we recover
+the scalar product between d ˆX1 and d ˆX2 in the form of a Kubo correlation in the state ˆD. The metric
+(8) has been quoted in the literature under the name of Bogoliubov–Kubo–Mori.
+
+4. Covariance and Legendre Transformation
+
+We can recover the above geometric mapping (7) between ˆD and ˆX, or between the covariant
+and contravariant coordinates of d ˆD, as the outcome of a Legendre transformation, by considering
+the function F( ˆX). Taking its differential dF = Tr exp( ˆX) d ˆX / Tr exp( ˆX), we identify the partial derivatives
+of F( ˆX) with the coordinates of the state ˆD = exp( ˆX)/ Tr exp( ˆX), so that ˆD appears as conjugate to ˆX in the
+sense of Legendre transformations. Expressing then ˆX as function of ˆD and inserting into F − Tr ˆD ˆX,
+we recognise that the Legendre transform of F( ˆX) is von Neumann’s entropy F − Tr ˆD ˆX = S( ˆD) =
+− Tr ˆD ln ˆD. The conjugation between ˆD and ˆX is embedded in the equations
+
+$$dF = \mathrm{Tr}\, \hat{D}\, d\hat{X} ; \qquad dS = -\mathrm{Tr}\, \hat{X}\, d\hat{D} . \tag{10}$$
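+
+For a q-bit, the Legendre relation between F and S can be made concrete. The sketch below is illustrative only and rests on stated assumptions: the observable is written ˆX = χ · σ on the Pauli basis (so its eigenvalues are ±|χ|), the names massieu_F and bloch_vector and the numerical value of χ are hypothetical. It checks that F − Tr ˆD ˆX reproduces the q-bit entropy, and that the gradient of F with respect to χ gives the Bloch vector of ˆD, as expressed by dF = Tr ˆD d ˆX.
+
```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def massieu_F(chi):
    """F(X) = ln Tr exp(X) for X = chi . sigma, whose eigenvalues are +|chi| and -|chi|."""
    return math.log(2 * math.cosh(norm(chi)))


def bloch_vector(chi):
    """Bloch vector of D = exp(X)/Tr exp(X): length tanh|chi|, direction chi/|chi|."""
    c = norm(chi)
    return [math.tanh(c) * x / c for x in chi]


def entropy(r):
    """Von Neumann entropy of a q-bit with Bloch radius 0 < r < 1."""
    return (0.5 * (1 + r) * math.log(2 / (1 + r))
            + 0.5 * (1 - r) * math.log(2 / (1 - r)))


chi = (0.4, -0.7, 0.2)   # hypothetical coordinates of X on the Pauli basis
r_vec = bloch_vector(chi)

# Legendre transform S = F - Tr D X, using Tr D X = r . chi for a q-bit.
legendre = massieu_F(chi) - sum(a * b for a, b in zip(r_vec, chi))

# dF = Tr D dX: a central finite difference of F with respect to each chi component.
h = 1e-6
grad = []
for i in range(3):
    up, down = list(chi), list(chi)
    up[i] += h
    down[i] -= h
    grad.append((massieu_F(up) - massieu_F(down)) / (2 * h))
```
+
+Here legendre should coincide with entropy(norm(r_vec)), and grad with r_vec, up to finite-difference error.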
+
+Legendre transformations are commonly used in equilibrium thermodynamics. Let us show that
+they come out in this context directly as a special case of the present general formalism. The entropy of
+thermodynamics is a function of the extensive variables, energy, volume, particle numbers, etc. Let us
+focus for illustration on the energy U, keeping the other extensive variables fixed. The thermodynamic
+entropy S(U), a function of the single variable U, generates the inverse temperature as β = ∂S/∂U.
+Its Legendre transform is the Massieu potential F(β) = S − βU. In order to compare these properties
+with the present formalism, we recall how thermodynamics comes out in the framework of statistical
+mechanics. The thermodynamic entropy S(U) is identified with the von Neumann entropy (2) of the
+Boltzmann–Gibbs canonical equilibrium state ˆD, and the internal energy with U = Tr ˆD ˆH. In the
+relation (7), the operator ˆX reads ˆX = −β ˆH (within an irrelevant additive constant). By letting U or
+β vary, we select within the spaces of states and of observables a one-dimensional subset. In these
+restricted subsets, ˆD is parameterised by the single coordinate U, and the corresponding ˆX by the
+coordinate −β.
+By specialising the general relations (10) to these subsets, we recover the thermodynamic relations
+dF = −U dβ and dS = β dU. We also recover, by restricting the metric (3) or (8) to these subsets, the
+usual thermodynamic metric ds² = −(∂²S/∂U²) dU² = −dU dβ.
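+
+As a small numerical illustration of this thermodynamic metric, the sketch below considers a canonical state with a hypothetical three-level spectrum (the values in levels and beta are assumptions, as are the function names). It compares ds²/dβ² obtained from −dU/dβ and from the second derivative of F = ln Z with the energy variance in the canonical state, with which both are expected to coincide.
+
```python
import math


def log_partition(beta, levels):
    """Massieu potential F(beta) = ln sum_i exp(-beta E_i)."""
    return math.log(sum(math.exp(-beta * e) for e in levels))


def mean_energy(beta, levels):
    """Internal energy U = Tr D H in the canonical state."""
    weights = [math.exp(-beta * e) for e in levels]
    z = sum(weights)
    return sum(w * e for w, e in zip(weights, levels)) / z


def energy_variance(beta, levels):
    weights = [math.exp(-beta * e) for e in levels]
    z = sum(weights)
    mean = sum(w * e for w, e in zip(weights, levels)) / z
    return sum(w * (e - mean) ** 2 for w, e in zip(weights, levels)) / z


levels = (0.0, 1.0, 2.5)   # hypothetical energy spectrum
beta = 0.8                 # hypothetical inverse temperature
h = 1e-4

# ds^2 = -dU dbeta, so ds^2 / dbeta^2 = -dU/dbeta (central difference).
from_dU = -(mean_energy(beta + h, levels) - mean_energy(beta - h, levels)) / (2 * h)

# ds^2 = d^2 F restricted to the one-parameter family X = -beta H.
from_d2F = (log_partition(beta + h, levels) - 2 * log_partition(beta, levels)
            + log_partition(beta - h, levels)) / h ** 2

variance = energy_variance(beta, levels)
```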
+More generally, we can consider the Boltzmann–Gibbs states of equilibrium statistical mechanics
+as the points of a manifold embedded in the full space of states. The thermodynamic extensive
+variables, which parameterise these states, are the expectation values of the conserved macroscopic
+observables, that is, they are a subset of the expectation values (1) which parameterise arbitrary
+density operators. Then the standard geometric structure of thermodynamics simply results from the
+restriction of the general metric (3) to this manifold of Boltzmann–Gibbs states. The commutation of
+the conserved observables simplifies the reduced thermodynamic metric, which presents the same
+features as a Fisher metric (see Section 6).
+
+5. Relevant Entropy and Geometry of the Projection Method
+
+The above ideas also extend to non-equilibrium quantum statistical mechanics [2–4]. When
+introducing the metric (3), we indicated that it may be used to estimate the quality of an approximation.
+Let us illustrate this point with the Nakajima–Zwanzig–Mori–Robertson projection method, best
+introduced through maximum entropy. Consider some set { ˆAk} of “relevant observables”, whose
+time-dependent expectation values ak ≡ < ˆAk > = Tr ˆD ˆAk we wish to follow, discarding all other
+variables. The exact state ˆD encodes the variables {ak} that we are interested in, but also the expectation
+values (1) of the other observables that we wish to eliminate. This elimination is performed by
+associating at each time with ˆD a “reduced state” ˆDR which is equivalent to ˆD as regards the set
+ak = Tr ˆDR ˆAk, but which provides no more information than the values {ak}. The former condition
+provides the constraints < ˆAk > = ak, and the latter condition is implemented by means of the
+maximum entropy criterion: one expresses that, within the set of density matrices compatible with
+these constraints, ˆDR is the one which maximises von Neumann’s entropy (2), that is, which contains
+solely the information about the relevant variables ak. The least biased state ˆDR thus defined has the
+form ˆDR = exp( ˆXR)/ Tr exp( ˆXR), where ˆXR ≡ ∑k λk ˆAk involves the time-dependent Lagrange multipliers λk,
+which are related to the set ak through Tr ˆDR ˆAk = ak.
+
+The von Neumann entropy S( ˆDR) ≡ SR{ak} of this reduced state ˆDR is called the “relevant
+entropy” associated with the considered relevant observables ˆAk. It measures the amount of missing
+information, when only the values {ak} of the relevant variables are given. During its evolution, ˆD
+keeps track of the initial information about all the variables < ˆO > and its entropy S( ˆD) remains
+constant in time. It is therefore smaller than the relevant entropy S( ˆDR) which accounts for the
+loss of information about the irrelevant variables. Depending on the choice of relevant observables
+{ ˆAk}, the corresponding relevant entropies SR{ak} encompass various current entropies, such as the
+non-equilibrium thermodynamic entropy or Boltzmann’s H-entropy.
+The same structure as the one introduced above for the full spaces of observables and states is
+recovered in this context. Here, for arbitrary values of the parameters λk, the exponents ˆXR = ∑k λk ˆAk
+constitute a subspace of the full vector space of observables, and the parameters {λk} appear as the
+coordinates of ˆXR on the basis { ˆAk}. The corresponding states ˆDR, parameterised by the set {ak},
+constitute a subset of the space of states, the manifold R of “reduced states”. (Note that this manifold is
+not a hyperplane, contrary to the space of relevant observables; it is embedded in the full vector space
+of states, but does not constitute a subspace.) By regarding SR{ak} as a function of the coordinates {ak},
+we can define a metric ds² = −d²SR{ak} on the manifold R, which is the restriction of the metric (3).
+Its alternative expression ds² = ∑k dak dλk = d²FR{λk}, where FR{λk} ≡ ln Tr exp ∑k λk ˆAk, is a
+restriction of (8). The correspondence between the two parameterisations {ak} and {λk} is again
+implemented by the Legendre transformation which relates SR{ak} and FR{λk}.
+The projection method relies on the mapping ˆD ↦ ˆDR which associates ˆDR to ˆD. It consists
+in replacing the Liouville–von Neumann equation of motion for ˆD by the corresponding dynamical
+equation for ˆDR on the manifold R, or equivalently for the coordinates {ak} or for the coordinates {λk},
+a programme that is in practice achieved through some approximations. This mapping is obviously
+a projection in the sense that ˆD ↦ ˆDR ↦ ˆDR, but moreover the introduction of the metric (3) shows
+that the vector ˆD − ˆDR in the space of states is perpendicular to the manifold R at the point ˆDR.
+This property is readily shown by writing, in this metric, the scalar product Tr d ˆD d ˆX′ of the vector
+d ˆD = ˆD − ˆDR with an arbitrary vector d ˆD′ in the tangent plane of R. The latter is conjugate to any
+combination d ˆX′ of observables ˆAk, and this scalar product vanishes because Tr ˆD ˆAk = Tr ˆDR ˆAk. Thus
+the mapping ˆD ↦ ˆDR appears as an orthogonal projection, so that the relevant state ˆDR associated
+with ˆD may be regarded as its best possible approximation on the manifold R.
+
+6. Properties of the Metric
+
+The metric tensor can be evaluated explicitly in a basis where the matrix ˆD is diagonal. Denoting
+by Di its eigenvalues and by dDij the matrix elements of its variations, we obtain from (3)
+
+$$ds^2 = \mathrm{Tr} \int_0^\infty d\xi \left( \frac{d\hat{D}}{\hat{D} + \xi} \right)^2 = \sum_{ij} \frac{\ln D_i - \ln D_j}{D_i - D_j}\, dD_{ij}\, dD_{ji} . \tag{11}$$
+
+(For Di = Dj, whether or not i = j, the ratio is defined as 1/Di by continuity.) In the same basis, the
+form (8) of the metric reads
+
+$$ds^2 = \frac{1}{Z} \sum_{ij} \frac{e^{X_i} - e^{X_j}}{X_i - X_j}\, dX_{ij}\, dX_{ji} - \left( \frac{\sum_i e^{X_i}\, dX_{ii}}{Z} \right)^2 , \tag{12}$$
+
+with Z = ∑i exp(Xi). (For Xi = Xj, the ratio is exp(Xi).) The singularity of the metric (11) in the vicinity of
+vanishing eigenvalues of ˆD, in particular near pure states (end of Section 2), is not apparent in the
+representation (12) of this metric, because the mapping from ˆD to ˆX sends the eigenvalue Xi to −∞
+when Di tends to zero.
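+
+The eigenbasis expression (11) can be cross-checked against the q-bit metric (5). The sketch below is a hedged illustration: the function names ratio and metric_eq11, the Bloch radius r and the displacement (dx, dy, dz) are hypothetical. It takes the Bloch vector along the third axis, so that ˆD is diagonal with eigenvalues (1 ± r)/2 and d ˆD = (dr · σ)/2.
+
```python
import math


def ratio(di, dj):
    """(ln Di - ln Dj)/(Di - Dj), defined as 1/Di by continuity when Di = Dj."""
    if di == dj:
        return 1.0 / di
    return (math.log(di) - math.log(dj)) / (di - dj)


def metric_eq11(eigenvalues, dD):
    """ds^2 of Eq. (11); dD is a Hermitean matrix (nested lists) in the eigenbasis of D."""
    n = len(eigenvalues)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # dD[i][j] * dD[j][i] = |dD_ij|^2 is real for a Hermitean variation
            total += ratio(eigenvalues[i], eigenvalues[j]) * (dD[i][j] * dD[j][i]).real
    return total


r = 0.6                           # hypothetical Bloch radius, vector along the third axis
dx, dy, dz = 0.01, -0.02, 0.03    # hypothetical displacement dr
eigenvalues = [(1 + r) / 2, (1 - r) / 2]
dD = [[dz / 2, complex(dx, -dy) / 2],
      [complex(dx, dy) / 2, -dz / 2]]   # matrix of (dr . sigma)/2

ds2_eq11 = metric_eq11(eigenvalues, dD)
# Eq. (5): radial part along the third axis, tangential part in the orthogonal plane.
ds2_eq5 = dz ** 2 / (1 - r ** 2) + math.atanh(r) / r * (dx ** 2 + dy ** 2)
```
+
+Both expressions should agree to rounding error, since (5) is the n = 2 case of (11).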
+Let us compare the expression (11) with the corresponding classical metric, which is obtained
+by starting from Shannon’s entropy instead of von Neumann’s entropy. For discrete probabilities pi,
+we have then S{pi} = − ∑i pi ln pi and hence the same definition ds² = −d²S{pi} as above of an
+entropy-based metric yields ds² = ∑i (dpi)²/pi, which is identified with the Fisher information metric.
+The present metric thus appears as the extension to quantum statistical mechanics of the Fisher metric
+when the latter is interpreted in terms of entropy. In fact, the terms of (11) which involve the diagonal
+elements i = j of the variations d ˆD reduce to (dDii)²/Di. This result was expected since density matrices
+behave as probability distributions if both ˆD and d ˆD are diagonal.
+Let us more generally consider in (11), instead of solely diagonal variations dDii, variations dDij
+with indices i and j such that |Di − Dj| ≪ Di + Dj. The expansion of Di and Dj around (Di + Dj)/2
+in the corresponding ratios of (11) yields (ln Di − ln Dj)/(Di − Dj) ∼ 2/(Di + Dj). The considered
+terms of (11) are therefore the same as in the Bures–Helstrom metric
+
+$$ds^2_{BH} = \sum_{ij} \frac{2}{D_i + D_j}\, dD_{ij}\, dD_{ji} , \tag{13}$$
+
+introduced long ago as an extension to matrices of the Fisher metric [5]. We thus recover this
+Bures–Helstrom metric as an approximation of the present entropy-based metric ds² = −d²S( ˆD).
+For n = 2, ds²_BH is obtained from the expression (5) of ds² by omitting the factor tanh⁻¹ r/r entering
+the second term.
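+
+The quality of this approximation can be illustrated with a few lines of Python. This is only a sketch with hypothetical eigenvalue pairs (the names entropy_ratio and bures_helstrom_ratio are also assumptions); it compares the coefficient of dDij dDji in (11) with the one in (13) for nearly degenerate and for well separated eigenvalues.
+
```python
import math


def entropy_ratio(di, dj):
    """Coefficient of dD_ij dD_ji in the entropy-based metric (11), for Di != Dj."""
    return (math.log(di) - math.log(dj)) / (di - dj)


def bures_helstrom_ratio(di, dj):
    """Coefficient of dD_ij dD_ji in the Bures-Helstrom metric (13)."""
    return 2.0 / (di + dj)


close = (0.31, 0.30)   # hypothetical nearly degenerate eigenvalues
far = (0.90, 0.01)     # hypothetical well separated eigenvalues

# Relative excess of the entropy-based coefficient over the Bures-Helstrom one.
rel_close = entropy_ratio(*close) / bures_helstrom_ratio(*close) - 1
rel_far = entropy_ratio(*far) / bures_helstrom_ratio(*far) - 1
```
+
+The relative excess is tiny for the nearly degenerate pair and of order one for the well separated pair, in line with the condition |Di − Dj| ≪ Di + Dj stated above.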
+In order to express the properties of the Riemannian metric (3) in a general form, which will
+exhibit the tensor structure, we use a Liouville representation. There, the observables ˆO = O_μ ˆΩ^μ,
+regarded as elements of a vector space, are represented by their coordinates O_μ on a complete basis
+ˆΩ^μ of n² observables. The space of states is spanned by the dual basis ˆΣ_μ, such that Tr ˆΩ^ν ˆΣ_μ = δ^ν_μ, and
+the states ˆD = D^μ ˆΣ_μ are represented by their coordinates D^μ. Thus, the expectation value (1) is the
+scalar product D^μ O_μ. In the matrix representation which appears as a special case, μ denotes a pair of
+indices i, j, ˆΩ^μ stands for | j >< i |, ˆΣ_μ for | i >< j |, O_μ denotes the matrix element Oji and D^μ the
+element Dij. For the q-bit (n = 2) considered in Section 2, we have chosen the Pauli operators ˆσμ as
+basis ˆΩ^μ for observables, and ˆσμ/2 as dual basis ˆΣ_μ for states, so that the coordinates D^μ = Tr ˆD ˆΩ^μ
+of ˆD = ( ˆI + rμ ˆσμ)/2 are the components rμ of the vector r. (The unit operator ˆI is kept aside since ˆD
+is normalised and since constants added to ˆX are irrelevant.) The function F{X} = ln Tr exp( ˆX) of the
+coordinates X_μ of the observable ˆX, and the von Neumann entropy S{D} as function of the coordinates
+D^μ of the state ˆD, are related by the Legendre transformation F = S + D^μ X_μ, and the relations (10) are
+expressed by D^μ = ∂F/∂X_μ, X_μ = −∂S/∂D^μ. The metric tensor is given by
+
+$$g^{\mu\nu} = \frac{\partial^2 F}{\partial X_\mu \partial X_\nu} , \qquad g_{\mu\nu} = -\frac{\partial^2 S}{\partial D^\mu \partial D^\nu} , \tag{14}$$
+
+and the correspondence issued from (7) between covariant and contravariant infinitesimal variations
+of ˆX and ˆD is implemented as dD^μ = g^{μν} dX_ν, dX_μ = g_{μν} dD^ν.
+These expressions exhibit the Hessian nature of the metric. This property simplifies the expression
+of the Christoffel symbol, which reduces to
+
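+$$\Gamma_{\mu\nu\rho} = -\frac{1}{2} \frac{\partial^3 S}{\partial D^\mu \partial D^\nu \partial D^\rho} , \tag{15}$$
+
+and which provides a parametric representation ˆD(t) of the geodesics in the space of states through
+
+$$\frac{d^2 D^\mu}{dt^2} + g^{\mu\sigma} \Gamma_{\sigma\nu\rho} \frac{dD^\nu}{dt} \frac{dD^\rho}{dt} = 0 . \tag{16}$$
+
+Then, the Riemann curvature tensor comes out as
+
+$$R_{\mu\rho\,\nu\sigma} = g^{\xi\zeta} \left( \Gamma_{\mu\sigma\xi} \Gamma_{\nu\rho\zeta} - \Gamma_{\mu\nu\xi} \Gamma_{\rho\sigma\zeta} \right) , \tag{17}$$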
+
+the Ricci tensor and the scalar curvature as
+
+$$R_{\mu\nu} = g^{\rho\sigma} R_{\mu\rho\,\nu\sigma} , \qquad R = g^{\mu\nu} R_{\mu\nu} . \tag{18}$$
+
+We have noted that the classical equivalent of the entropy-based metric ds² = −d²S is the Fisher
+metric ∑i (dpi)²/pi, which as regards the curvature is equivalent to a Euclidean metric. While the space of
+classical probabilities is thus flat, the above equations show that the space of quantum states is curved.
+This curvature arises from the non-commutation of the observables; it vanishes for the completely
+disordered state ˆD = ˆI/n. Curvature can thus be used as a measure of the degree of classicality of a
+state.
+
+7. Geometry of the Space of q-Bits
+
+In the illustrative example of a q-bit, the operator ˆX = χμ ˆσμ associated with ˆD is parameterised
+by the 3 components of the vector χμ (μ = 1, 2, 3), related to r by χ = tanh⁻¹ r and χμ/χ = rμ/r. The
+metric tensor given by (5) is expressed as
+
+$$g_{\mu\nu} = K r_\mu r_\nu + \frac{\chi}{r} \delta_{\mu\nu} , \qquad K \equiv \frac{1}{r} \frac{d}{dr} \frac{\chi}{r} = \frac{1}{r^2} \left( \frac{1}{1-r^2} - \frac{\chi}{r} \right) , \tag{19}$$
+
+$$g^{\mu\nu} = (1 - r^2)\, p^{\mu\nu} + \frac{r}{\chi}\, q^{\mu\nu} .$$
+
+(We have defined r^μ = r_μ, δ^{μν} = δ^μ_ν = δ_{μν} so as to introduce the projectors r^μ r^ν/r² ≡ p^{μν} ≡ δ^{μν} − q^{μν}
+in the Euclidean 3-dimensional space, and thus to simplify the subsequent calculations.) In polar
+coordinates r = (r, θ, ϕ), the infinitesimal distance takes the form
+
+$$ds^2 = dr\, d\chi + r\chi \left( d\theta^2 + \sin^2\theta\, d\varphi^2 \right) . \tag{20}$$
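+
+The two tensors of Equation (19) can be checked to be inverse to each other. The sketch below is illustrative only; the function name metric_tensors and the Bloch vector r_vec are hypothetical. It builds both 3 × 3 tensors from the projectors along and orthogonal to r, and multiplies them.
+
```python
import math


def metric_tensors(r_vec):
    """Covariant and contravariant q-bit metric tensors of Eq. (19) as 3x3 nested lists."""
    r = math.sqrt(sum(c * c for c in r_vec))
    chi = math.atanh(r)
    K = (1.0 / (1 - r * r) - chi / r) / (r * r)
    lower, upper = [], []
    for m in range(3):
        row_lower, row_upper = [], []
        for n in range(3):
            delta = 1.0 if m == n else 0.0
            p = r_vec[m] * r_vec[n] / (r * r)   # projector along r
            q = delta - p                       # projector orthogonal to r
            row_lower.append(K * r_vec[m] * r_vec[n] + chi / r * delta)
            row_upper.append((1 - r * r) * p + r / chi * q)
        lower.append(row_lower)
        upper.append(row_upper)
    return lower, upper


r_vec = (0.2, 0.5, -0.4)   # hypothetical mixed state, 0 < r < 1
lower, upper = metric_tensors(r_vec)
# Matrix product of the two tensors; it should be the 3x3 identity.
product = [[sum(lower[m][k] * upper[k][n] for k in range(3)) for n in range(3)]
           for m in range(3)]
```
+
+The check works because K r² + χ/r = 1/(1 − r²), so the covariant tensor equals p/(1 − r²) + (χ/r) q in terms of the two projectors.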
+
+We determine from (15) and (19) the explicit form
+
+$$\Gamma_{\mu\nu\rho} = \frac{K}{2} \left( r_\mu \delta_{\nu\rho} + r_\nu \delta_{\mu\rho} + r_\rho \delta_{\mu\nu} \right) + \frac{1}{2r} \frac{dK}{dr}\, r_\mu r_\nu r_\rho \tag{21}$$
+
+of the Christoffel symbol. By raising its first index with g^{μν} and using polar coordinates, we obtain
+from (16) the equations of geodesics for n = 2. Within the Poincaré–Bloch sphere the geodesics are
+deduced by rotations from a one-parameter family of curves which lie in the θ = π/2, |ϕ| ≤ π/2
+half-plane and which are symmetric with respect to the ϕ = 0 axis. This family is characterized by the
+equations (where χ = tanh⁻¹ r):
+
+$$\frac{d^2 r}{dt^2} + \frac{r}{1-r^2} \left( \frac{dr}{dt} \right)^2 - \frac{r}{2} \left[ 1 + \frac{\chi}{r} \left( 1 - r^2 \right) \right] \left( \frac{d\varphi}{dt} \right)^2 = 0 , \tag{22}$$
+
+$$\frac{d^2 \varphi}{dt^2} + \frac{1}{r} \frac{dr}{dt} \frac{d\varphi}{dt} + \frac{1}{\chi} \frac{d\chi}{dt} \frac{d\varphi}{dt} = 0 , \tag{23}$$
+
+and the boundary conditions at t = 0:
+
+$$r(0) = a , \quad \varphi(0) = 0 , \quad \frac{dr(0)}{dt} = 0 , \quad \frac{d\varphi(0)}{dt} = \frac{1}{k} , \quad k^2 = a \tanh^{-1} a . \tag{24}$$
+
+Equation (23) provides, using the boundary conditions (24):
+
+$$\frac{d\varphi}{dt} = \frac{k}{r\chi} . \tag{25}$$
+
+Insertion of (25) into (22) gives rise to an equation for r (t), which can be integrated by regarding t as a
+function of ζ = arcsin r. One obtains:
+
+\[ \left(\frac{dr}{dt}\right)^2 = \left(1 - r^2\right)\left(1 - \frac{k^2}{r\chi}\right) . \tag{26} \]
+
+The scale of t has been fixed by relating to r (0) the boundary condition (24) for dϕ (0) /dt, a choice
+which ensures that $ds^2 = dr\,d\chi + r\chi\,d\phi^2 = dt^2$, and hence that the parameter t measures the distance
+along geodesics.
+For k = 0, we obtain r = |sin t|, ϕ = ±π/2. Thus, the longest geodesics are the diameters of
+the Poincaré–Bloch sphere. We find the value π for their “length”, that is, for the geodesic distance
+between two orthogonal pure states. At the other extreme, when the middle point r = a, ϕ = 0 of
+a geodesic lies close to the surface r = 1 of the sphere, the asymptotic form of the equation (26) is
+solved as
+
+\[ t = \pm 2k\sqrt{\pi}\,e^{-k^2}\operatorname{erf}\xi , \qquad \xi = \sqrt{\tfrac{1}{2}\ln\frac{1-a}{1-r}} , \qquad k^2 = \tfrac{1}{2}\ln\frac{2}{1-a} \tag{27} \]
+
+(by taking ξ as variable instead of r). The determination of the explicit equations of such short geodesic
+curves is achieved by integrating (25) into
+
+\[ \phi = \frac{t}{k} = \pm 2\sqrt{\pi}\,e^{-k^2}\operatorname{erf}\xi . \tag{28} \]
+
+From (27) and (28) we can determine the geodesic distance between two neighbouring pure states $\hat D' = |\psi'\rangle\langle\psi'|$
+and $\hat D'' = |\psi''\rangle\langle\psi''|$ represented by the points $r_{\max} = 1$, $\phi_{\max} = \pm\tfrac{1}{2}\delta\phi$ with $\delta\phi$ small.
+At these two points, we have $\xi \to \infty$, $\operatorname{erf}\xi = 1$, and this determines $k$ in terms of $\tfrac{1}{2}\delta\phi$ through (28).
+The length of the geodesic that joins them, given by (27), is:
+
+\[ \delta s^2 = \delta\phi^2 \ln\frac{4\sqrt{\pi}}{\delta\phi} , \qquad \delta\phi = \arccos\left|\langle\psi'|\psi''\rangle\right| . \tag{29} \]
+
+Thus, in spite of its singularity for r = 1, the present 3-dimensional metric (5) in the space r, θ, ϕ defines
+distances between pure states represented by points on the surface r = 1 of the Poincaré–Bloch sphere.
+However, it should be noted that the presence of the logarithmic factor in (29) forbids such distances
+to be generated by a 2-dimensional metric in the space θ, ϕ. In fact, the distance (29) is measured along
+a geodesic that penetrates the sphere r = 1, because no geodesic is tangent to the surface of this sphere
+nor lies on its surface.
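To make the effect of the logarithmic factor in (29) concrete, the following minimal sketch evaluates the distance between two neighbouring pure states and compares it with the angle δϕ itself. The function name is an illustrative assumption, and the formula is only meaningful in the small-δϕ regime in which (29) was derived.

```python
import math

def entropy_metric_distance(dphi):
    """Distance delta_s of Equation (29) for a small angle dphi > 0."""
    return dphi * math.sqrt(math.log(4.0 * math.sqrt(math.pi) / dphi))

def enhancement_factor(dphi):
    """Ratio delta_s / dphi, i.e. the square root of the logarithmic factor."""
    return entropy_metric_distance(dphi) / dphi

if __name__ == "__main__":
    # The ratio keeps growing as dphi shrinks, so delta_s is not
    # proportional to dphi.
    for dphi in (1e-1, 1e-2, 1e-4, 1e-8):
        print(dphi, entropy_metric_distance(dphi), enhancement_factor(dphi))
```

Because the ratio is not constant, no rescaling of the angular line element alone reproduces these distances, which is the point made in the paragraph above.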
+In contrast, all geodesics produced by the Bures–Helstrom metric are tangent to the surface of the
+sphere, or are its great circles. They are given by Equations (25) and (26), where χ is replaced by r and
+k by a; the solution of these equations provides the ellipses
+
+\[ r\cos\phi = a\cos t , \qquad r\sin\phi = \sin t . \tag{30} \]
+
+Here as above, the largest distance π is reached for orthogonal pure states represented by opposite
+points on the sphere, but now a peculiarity occurs. Whereas the metric $ds^2 = -d^2S$ produces a single
+geodesic, the diameter joining these two points (with “length” π), the Bures metric produces a double
+infinity of geodesics, the half-ellipses (30) having as long axis this diameter, and having all the same
+“length” π. Other pairs of pure states are joined by geodesics which are arcs of great circles, and their
+Bures distance $\delta s_{BH} = \delta\phi$ is identified with the ordinary length of the arc. Here for n = 2 as in the
+general case, the 3-dimensional Bures–Helstrom metric admits a restriction to pure states generated by
+a 2-dimensional metric, which is identified with the quantum Fubini–Study metric, itself defined only
+for pure states by $s_{FS} = \arccos\left|\langle\psi'|\psi''\rangle\right| = \tfrac{1}{2} s_{BH}$.
+
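+Returning to the metric $ds^2 = -d^2S$, the Riemann curvature is obtained from (17) as
+
+\[ R^{\mu}{}_{\rho\nu\sigma} = \frac{K}{4}\Big[ \big(r^2 + \tfrac{r}{\chi} - 1\big)\big(q^{\mu}_{\sigma} q_{\nu\rho} - q^{\mu}_{\nu} q_{\rho\sigma}\big) + \big(r^2 - \tfrac{r}{\chi} + 1\big)\big(p^{\mu}_{\sigma} q_{\nu\rho} - p^{\mu}_{\nu} q_{\rho\sigma}\big) + \frac{r}{\chi}\,\frac{1}{1 - r^2}\big(r^2 - \tfrac{r}{\chi} + 1\big)\big(q^{\mu}_{\sigma} p_{\nu\rho} - q^{\mu}_{\nu} p_{\rho\sigma}\big) \Big] . \tag{31} \]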
+
+Contracting with $g^{\rho\sigma}$ the indices of (31) as in (18), we finally derive the Ricci curvature
+
+\[ R^{\mu}_{\nu} = -\frac{Kr}{2\chi}\left[ r^2 \delta^{\mu}_{\nu} + \frac{\chi - r}{\chi}\,p^{\mu}_{\nu} \right] , \tag{32} \]
+
+and the scalar curvature
+
+\[ R = -\frac{Kr}{2\chi}\left[ 3r^2 + \frac{\chi - r}{\chi} \right] . \tag{33} \]
+
+Both are negative in the whole Poincaré sphere. In the limit r → 0, the curvature R vanishes as
+$R \sim -\tfrac{10}{9} r^2$, as expected from the general argument of Section 2: a weakly polarised spin behaves
+classically. At the other extreme r → 1, R behaves as $R \sim -2\,[(1 - r)\,|\ln(1 - r)|]^{-1}$; it diverges, again
+as expected: pure states have the largest quantum nature.
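The two limits quoted above can be checked numerically. The sketch below simply evaluates the scalar curvature (33) with K taken from (19); the function name is an illustrative assumption and the code is a sanity check, not part of the original derivation.

```python
import math

def scalar_curvature(r):
    """Scalar curvature R(r) of Equation (33) for a polarisation 0 < r < 1."""
    chi = math.atanh(r)
    # K as defined in Equation (19).
    K = (1.0 / (1.0 - r * r) - chi / r) / (r * r)
    return -(K * r / (2.0 * chi)) * (3.0 * r * r + (chi - r) / chi)

if __name__ == "__main__":
    for r in (0.01, 0.1, 0.5, 0.99, 0.999999):
        small_r_estimate = -10.0 / 9.0 * r * r  # expected behaviour as r -> 0
        print(r, scalar_curvature(r), small_r_estimate)
```

For small r the two printed columns agree closely, while for r close to 1 the curvature becomes large and negative, in line with the divergence stated in the text.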
+
+The metric $ds^2 = -d^2S$, introduced above in the context of quantum mechanics for mixed states
+(and their pure limit) and information theory, might more generally be useful to characterise distances
+in spaces of positive matrices.
+
+Conflicts of Interest
+The author declares no conflict of interest.
+
+References
+
+1. Thirring, W. Quantum Mechanics of Large Systems. In A Course of Mathematical Physics; Volume 4; Springer-Verlag: New York, NY, USA, 1983.
+2. Balian, R.; Alhassid, Y.; Reinhardt, H. Dissipation in many-body systems: A geometric approach based on information theory. Phys. Rep. 1986, 131, 1–146.
+3. Balian, R. Incomplete descriptions and relevant entropies. Am. J. Phys. 1999, 67, 1078–1090.
+4. Balian, R. Information in statistical physics. Stud. Hist. Philos. Mod. Phys. 2005, 36, 323–353.
+5. Bures, D. An extension of Kakutani’s theorem. Trans. Am. Math. Soc. 1969, 135, 199–212.
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+Extending the Extreme Physical Information to
+Universal Cognitive Models via a Confident
+Information First Principle
+
+Xiaozhao Zhao 1, Yuexian Hou 1,2,*, Dawei Song 1,3 and Wenjie Li 2
+
+1 School of Computer Science and Technology, Tianjin University, Tianjin 300072, China; E-Mails:
+0.25eye@gmail.com (X.Z.); dawei.song2010@gmail.com (D.S.)
+2 Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China;
+E-Mail: cswjli@comp.polyu.edu.hk
+3 Department of Computing and Communications, The Open University, Milton Keynes MK76AA, UK
+* E-Mail: yxhou@tju.edu.cn; Tel.: +86-022-27406538.
+
+Received: 25 March 2014; in revised form: 6 June 2014 / Accepted: 20 June 2014 /
+Published: 1 July 2014
+
+Abstract: The principle of extreme physical information (EPI) can be used to derive many known
+laws and distributions in theoretical physics by extremizing the physical information loss K, i.e.,
+the difference between the observed Fisher information I and the intrinsic information bound J
+of the physical phenomenon being measured. However, for complex cognitive systems of high
+dimensionality (e.g., human language processing and image recognition), the information bound
+J could be far larger than I (J ≫ I) because of insufficient observations, which would lead
+to serious over-fitting problems in the derivation of cognitive models. Moreover, there is a lack
+of an established exact invariance principle that gives rise to the bound information in universal
+cognitive systems. This limits the direct application of EPI. To narrow down the gap between I and
+J, in this paper, we propose a confident-information-first (CIF) principle to lower the information
+bound J by preserving confident parameters and ruling out unreliable or noisy parameters in the
+probability density function being measured. The confidence of each parameter can be assessed
+by its contribution to the expected Fisher information distance between the physical phenomenon
+and its observations. In addition, given a specific parametric representation, this contribution can
+often be directly assessed by the Fisher information, which establishes a connection with the inverse
+variance of any unbiased estimate for the parameter via the Cramér–Rao bound. We then consider
+the dimensionality reduction in the parameter spaces of binary multivariate distributions. We show
+that the single-layer Boltzmann machine without hidden units (SBM) can be derived using the CIF
+principle. An illustrative experiment is conducted to show how the CIF principle improves the
+density estimation performance.
+
+Keywords: information geometry; Boltzmann machine; Fisher information; parametric reduction
+
+1. Introduction
+
+Information has been found to play an increasingly important role in physics. As stated in
+Wheeler [1]: “All things physical are information-theoretic in origin and this is a participatory
+universe...Observer participancy gives rise to information; and information gives rise to physics”.
+Following this viewpoint, Frieden [2] unifies the derivation of physical laws in major fields of physics,
+from the Dirac equation to the Maxwell-Boltzmann velocity dispersion law, using the extreme physical
+information principle (EPI). More specifically, a variety of equations and distributions can be derived by
+extremizing the physical information loss K, i.e., the difference between the observed Fisher information
+I and the intrinsic information bound J of the physical phenomenon being measured.
+
+Entropy 2014, 16, 3670–3688; doi:10.3390/e16073670
+www.mdpi.com/journal/entropy
+
+The first quantity, I, measures the amount of information as a finite scalar implied by the data
+with some suitable measure [2]. It is formally defined as the trace of the Fisher information matrix [3].
+In addition to I, the second quantity, the information bound J, is an invariant that characterizes the
+information that is intrinsic to the physical phenomenon [2]. During the measurement procedure, there
+may be some loss of information, which entails I = κJ, where κ ≤ 1 is called the efficiency coefficient
+of the EPI process in transferring the Fisher information from the phenomenon (specified by J) to
+the output (specified by I). For closed physical systems, in particular, any solution for I attains some
+fraction of J between 1/2 (for classical physics) and one (for quantum physics) [4].
+However, it is usually not the case in cognitive science. For complex cognitive systems (e.g.,
+human language processing and image recognition), the target probability density function (pdf) being
+measured is often of high dimensionality (e.g., thousands of words in a human language vocabulary
+and millions of pixels in an observed image). Thus, it is infeasible for us to obtain a sufficient collection
+of observations, leading to excessive information loss between the observer and nature. Moreover,
+there is a lack of an established exact invariance principle that gives rise to the bound information in
+universal cognitive systems. This limits the direct application of EPI in cognitive systems.
+In terms of statistics and machine learning, the excessive information loss between the observer
+and nature will lead to serious over-fitting problems, since the insufficient observations may not
+provide necessary information to reasonably identify the model and support the estimation of the
+target pdf in complex cognitive systems. Actually, a similar problem is also recognized in statistics and
+machine learning, known as the model selection problem [5]. In general, we would require a complex
+model with a high-dimensional parameter space to sufficiently depict the original high-dimensional
+observations. However, over-fitting usually occurs when the model is excessively complex with
+respect to the given observations. To avoid over-fitting, we would need to adjust the complexity of the
+models to the available amount of observations and, equivalently, to adjust the information bound J
+corresponding to the observed information I.
+In order to derive feasible computational models for cognitive phenomena, we propose a
+confident-information-first (CIF) principle in addition to EPI to narrow down the gap between I and J
+(thus, a reasonable efficiency coefficient κ is implied), as illustrated in Figure 1. However, we do not
+intend to actually derive the distribution laws by solving the differential equations of the extremization
+of the new information loss K′. Instead, we assume that the target distribution belongs to some general
+multivariate binary distribution family and focus on the problem of seeking a proper information
+bound with respect to the constraint of the parametric number and the given observations.
+
+Figure 1. (a) The paradigm of the extreme physical information principle (EPI) to derive physical laws
+by the extremization of the information loss K∗ (K∗ = J/2 for classical physics and K∗ = 0 for quantum
+physics); (b) the paradigm of confident-information-first (CIF) to derive computational models by
+reducing the information loss K′ using a new physical bound J′.
+
+The key to the CIF approach is how to systematically reduce the physical information bound for
+high-dimensional complex systems. As stated in Frieden [2], the information bound J is a functional
+form that depends upon the physical parameters of the system. The information is contained in
+the variations of the observations (often imperfect, due to insufficient sampling, noise and intrinsic
+limitations of the “observer”), and can be further quantified using the Fisher information of system
+parameters (or coordinates) [3] from the estimation theory. Therefore, the physical information bound
+J of a complex system can be reduced by transforming it to a simpler system using some parametric
+reduction approach. Assuming there exists an ideal parametric model S that is general enough to
+represent all system phenomena (which gives the ultimate information bound in Figure 1), our goal is
+to adopt a parametric reduction procedure to derive a lower-dimensional sub-model M (which gives
+the reduced information bound in Figure 1) for a given dataset (usually insufficient or perturbed by
+noise) by reducing the number of free parameters in S.
+Formally speaking, let q(ξ) be the ideal distribution with parameters ξ that describes the physical
+system and q(ξ + Δξ) be the observations of the system with some small fluctuation Δξ in parameters.
+In [6], the averaged information distance I(Δξ) between the distribution and its observations, the
+so-called shift information, is used as a disorder measure of the fluctuated observations to reinterpret
+the EPI principle. More specifically, in the framework of information geometry, this information
+distance could also be assessed using the Fisher information distance induced by the Fisher–Rao
+metric, which can be decomposed into the variation in the direction of each system parameter [7].
+In principle, it is possible to divide system parameters into two categories, i.e., the parameters with
+notable variations and the parameters with negligible variations, according to their contributions to the
+whole information distance. Additionally, the parameters with notable contributions are considered
+to be confident, since they are important for reliably distinguishing the ideal distribution from its
+observation distributions. On the other hand, the parameters with negligible contributions can be
+considered to be unreliable or noisy. Then, the CIF principle can be stated as the parameter selection
+criterion that maximally preserves the Fisher information distance in an expected sense with respect
+to the constraint of the parametric number and the given observations (if available), when projecting
+distributions from the parameter space of S into that of the reduced sub-model M. We call it the
+distance-based CIF. As a result, we could manipulate the information bound of the underlying system
+by preserving the information of confident parameters and ruling out noisy parameters.
+In this paper, the CIF principle is analyzed in the multivariate binary distribution family in the
+mixed-coordinate system [8]. It turns out that, in this problematic configuration, the confidence of
+a parameter can be directly evaluated by its Fisher information, which also establishes a connection
+with the inverse variance of any unbiased estimate for the parameter via the Cramér–Rao bound [3].
+Hence, the CIF principle can also be interpreted as the parameter selection procedure that keeps the
+parameters with reliable estimates and rules out unreliable or noisy parameters. This CIF is called
+the information-based CIF. Note that the definition of confidence in distance-based CIF depends on
+both Fisher information and the scale of fluctuation, and the confidence in the information-based CIF
+(i.e., Fisher information) can be seen as a special case of confidence measure with respect to certain
+coordinate systems. This simplification allows us to further apply the CIF principle to improve existing
+learning algorithms for the Boltzmann machine.
+The paper is organized as follows. In Section 2, we introduce the parametric formulation for
+the general multivariate binary distributions in terms of information geometry (IG) framework [7].
+Then, Section 3 describes the implementation details of the CIF principle. We also give a geometric
+interpretation of CIF by showing that it can maximally preserve the expected information distance (in
+Section 3.2.1), as well as the analysis on the scale of the information distance in each individual system
+parameter (in Section 3.2.2). In Section 4, we demonstrate that a widely used cognitive model, i.e., the
+Boltzmann machine, can be derived using the CIF principle. Additionally, an illustrative experiment is
+conducted to show how the CIF principle can be utilized to improve the density estimation performance
+of the Boltzmann machine in Section 5.
+
+2. The Multivariate Binary Distributions
+
+Similar to EPI, the derivation of CIF depends on the analysis of the physical information bound,
+where the choice of system parameters, also called “Fisher coordinates” in Frieden [2], is crucial.
+Based on information geometry (IG) [7], we introduce some choices of parameterizations for binary
+multivariate distributions (denoted as statistical manifold S) with a given number of variables n, i.e.,
+the open simplex of all probability distributions over binary vector $x \in \{0, 1\}^n$.
+
+2.1. Notations for Manifold S
+
+In IG, a family of probability distributions is considered as a differentiable manifold with certain
+parametric coordinate systems. In the case of binary multivariate distributions, four basic coordinate
+systems are often used: p-coordinates, η-coordinates, θ-coordinates and mixed-coordinates [7,9].
+Mixed-coordinates are of vital importance for our analysis.
+For the p-coordinates [p] with n binary variables, the probability distribution over $2^n$ states
+of x can be completely specified by any $2^n - 1$ positive numbers indicating the probability of the
+corresponding exclusive states on n binary variables. For example, the p-coordinates of n = 2 variables
+could be $[p] = (p_{01}, p_{10}, p_{11})$. Note that IG requires all probability terms to be positive [7].
+For simplicity, we use the capital letters I, J, . . . to index the coordinate parameters of probabilistic
+distribution. To distinguish the notation of Fisher information (conventionally used in literature,
+e.g., data information I and information bound J in Section 1) from the coordinate indexes, we
+make explicit explanations when necessary from now on. An index I can be regarded as a subset
+of {1, 2, . . . , n}. Additionally, $p_I$ stands for the probability that all variables indicated by I are equal
+to one and the complemented variables are zero. For example, if I = {1, 2, 4} and n = 4, then
+$p_I = p_{1101} = \mathrm{Prob}(x_1 = 1, x_2 = 1, x_3 = 0, x_4 = 1)$. Note that the null set can also be a legal index of
+the p-coordinates, which indicates the probability that all variables are zero, denoted as $p_{0\ldots0}$.
+Another coordinate system often used in IG is η-coordinates, which is defined by:
+
+\[ \eta_I = E[X_I] = \mathrm{Prob}\Big\{\prod_{i \in I} x_i = 1\Big\} \tag{1} \]
+
+where the value of $X_I$ is given by $\prod_{i \in I} x_i$ and the expectation is taken with respect to the probability
+distribution over x. Grouping the coordinates by their orders, the η-coordinate system is denoted
+as $[\eta] = (\eta^1_i, \eta^2_{ij}, \ldots, \eta^n_{1,2,\ldots,n})$, where the superscript indicates the order number of the corresponding
+parameter. For example, $\eta^2_{ij}$ denotes the set of all η parameters with the order number two.
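As an illustration of Equation (1), the following minimal sketch converts p-coordinates into η-coordinates. The dictionary-of-frozensets representation, the helper names and the two-variable distribution are assumptions made only for this example; they are not taken from the paper.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items` as frozensets, smallest first."""
    items = sorted(items)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

def eta_coordinates(p, n):
    """Map p-coordinates {I: p_I} to eta-coordinates {I: eta_I} via Equation (1).

    eta_I is the probability that every variable in I equals one, i.e. the
    sum of p_K over all states K that contain I.
    """
    states = subsets(range(1, n + 1))
    return {I: sum(p[K] for K in states if I <= K) for I in states if I}

# Hypothetical two-variable distribution (values chosen only for illustration).
p_example = {
    frozenset(): 0.4,        # p_00
    frozenset({1}): 0.3,     # p_10
    frozenset({2}): 0.2,     # p_01
    frozenset({1, 2}): 0.1,  # p_11
}

if __name__ == "__main__":
    for index, value in eta_coordinates(p_example, 2).items():
        print(sorted(index), round(value, 6))
```

Here the first-order coordinates are the marginals of each variable and the second-order coordinate is the probability that both variables equal one.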
+The θ-coordinates (natural coordinates) are defined by:
+
+\[ \log p(x) = \sum_{I \subseteq \{1,2,\ldots,n\},\, I \neq \emptyset} \theta^I X_I - \psi(\theta) \tag{2} \]
+
+where $\psi(\theta) = \log\big(\sum_x \exp\{\sum_I \theta^I X_I(x)\}\big)$ is the cumulant generating function and its value equals
+$-\log \mathrm{Prob}\{x_i = 0, \forall i \in \{1, 2, \ldots, n\}\}$. The θ-coordinate is denoted as $[\theta] = (\theta^i_1, \theta^{ij}_2, \ldots, \theta^{1,\ldots,n}_n)$, where
+the subscript indicates the order number of the corresponding parameter. Note that the order indices
+are located at different positions in [η] and [θ] following the convention in Amari et al. [8].
+The relation between coordinate systems [η] and [θ] is bijective. More formally, they are connected
+by the Legendre transformation:
+
+\[ \theta^I = \frac{\partial \varphi(\eta)}{\partial \eta_I} , \qquad \eta_I = \frac{\partial \psi(\theta)}{\partial \theta^I} \tag{3} \]
+
+where ψ(θ) is given in Equation (2) and $\varphi(\eta) = \sum_x p(x; \eta) \log p(x; \eta)$ is the negative of entropy. It can
+be shown that ψ(θ) and φ(η) meet the following identity [7]:
+
+\[ \psi(\theta) + \varphi(\eta) - \sum_I \theta^I \eta_I = 0 \tag{4} \]
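The identity (4) can be verified numerically on a small example. The sketch below obtains the θ-coordinates by Möbius inversion of Equation (2), $\theta^I = \sum_{K \subseteq I} (-1)^{|I-K|} \log p_K$, which is a standard consequence of (2) rather than a formula stated explicitly above; the data representation, function names and the two-variable distribution are illustrative assumptions.

```python
import math
from itertools import combinations

def subsets(items):
    """All subsets of `items` as frozensets, smallest first."""
    items = sorted(items)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

def theta_coordinates(p, n):
    """theta^I = sum over K subset of I of (-1)^{|I-K|} log p_K (Moebius inversion of Eq. (2))."""
    return {I: sum((-1) ** len(I - K) * math.log(p[K]) for K in subsets(I))
            for I in subsets(range(1, n + 1)) if I}

def legendre_residual(p, n):
    """Left-hand side of Equation (4); it should vanish up to rounding error."""
    states = subsets(range(1, n + 1))
    theta = theta_coordinates(p, n)
    eta = {I: sum(p[K] for K in states if I <= K) for I in states if I}
    psi = -math.log(p[frozenset()])                   # psi(theta) = -log p_{0...0}
    phi = sum(v * math.log(v) for v in p.values())    # negative entropy
    return psi + phi - sum(theta[I] * eta[I] for I in theta)

# Hypothetical two-variable distribution (values chosen only for illustration).
p_example = {
    frozenset(): 0.4,
    frozenset({1}): 0.3,
    frozenset({2}): 0.2,
    frozenset({1, 2}): 0.1,
}

if __name__ == "__main__":
    print(theta_coordinates(p_example, 2))
    print(legendre_residual(p_example, 2))
```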
+
+Next, we introduce mixed-coordinates, which is important for our derivation of CIF. In general,
+the manifold S of probability distributions could be represented by the l-mixed-coordinates [8]:
+
+\[ [\zeta]_l = (\eta^1_i, \eta^2_{ij}, \ldots, \eta^l_{i,j,\ldots,k}, \theta^{i,j,\ldots,k}_{l+1}, \ldots, \theta^{1,\ldots,n}_n) \tag{5} \]
+
+where the first part consists of η-coordinates with order less than or equal to l (denoted by $[\eta_{l-}]$) and the
+second part consists of θ-coordinates with order greater than l (denoted by $[\theta_{l+}]$), $l \in \{1, \ldots, n - 1\}$.
+
+2.2. Fisher Information Matrix for Parametric Coordinates
+
+For a general coordinate system [ξ], the i-th row and j-th column element of the Fisher information
+matrix for [ξ] (denoted by $G_\xi$) is defined as the covariance of the scores of $[\xi_i]$ and $[\xi_j]$ [3], i.e.,
+
+\[ g_{ij} = E\left[ \frac{\partial \log p(x; \xi)}{\partial \xi_i} \cdot \frac{\partial \log p(x; \xi)}{\partial \xi_j} \right] \]
+
+under the regularity condition for the pdf that the partial derivatives exist. The Fisher information
+measures the amount of information in the data that a statistic carries about the unknown
+parameters [10]. The Fisher information matrix is of vital importance to our analysis, because the
+inverse of Fisher information matrix gives an asymptotically tight lower bound to the covariance
+matrix of any unbiased estimate for the considered parameters [3]. Another important concept related
+to our analysis is the orthogonality defined by Fisher information. Two coordinate parameters ξi and
+ξj are called orthogonal if and only if their Fisher information vanishes, i.e., gij = 0, meaning that their
+influences on the log likelihood function are uncorrelated.
+
+The Fisher information for [θ] can be rewritten as $g_{IJ} = \partial^2 \psi(\theta) / \partial\theta^I \partial\theta^J$, and for [η], it is $g^{IJ} = \partial^2 \varphi(\eta) / \partial\eta_I \partial\eta_J$ [7].
+Let $G_\theta = (g_{IJ})$ and $G_\eta = (g^{IJ})$ be the Fisher information matrices for [θ] and [η], respectively. It can be
+shown that $G_\theta$ and $G_\eta$ are mutually inverse matrices, i.e., $\sum_J g^{IJ} g_{JK} = \delta^I_K$, where $\delta^I_K = 1$ if I = K and
+zero otherwise [7]. In order to generally compute $G_\theta$ and $G_\eta$, we develop the following Propositions 1
+and 2. Note that Proposition 1 is a generalization of Theorem 2 in Amari et al. [8].
+
+Proposition 1. The Fisher information between two parameters θI and θJ in [θ], is given by:
+
+\[ g_{IJ}(\theta) = \eta_{I \cup J} - \eta_I \eta_J \tag{6} \]
+
+Proof. in Appendix A.
+
+Proposition 2. The Fisher information between two parameters ηI and ηJ in [η], is given by:
+
+\[ g^{IJ}(\eta) = \sum_{K \subseteq I \cap J} (-1)^{|I-K| + |J-K|} \cdot \frac{1}{p_K} \tag{7} \]
+
+where | · | denotes the cardinality operator.
+
+Proof. in Appendix B.
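Propositions 1 and 2 can be checked against each other numerically, since the two matrices they define should be mutually inverse. The following is a minimal pure-Python sketch under that reading; the data layout, the helper names and the two-variable distribution are assumptions made for illustration only.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items` as frozensets, smallest first."""
    items = sorted(items)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

def fisher_matrices(p, n):
    """Return (indices, G_theta, G_eta) built from Propositions 1 and 2."""
    states = subsets(range(1, n + 1))
    idx = [I for I in states if I]  # non-empty indices in a fixed order
    eta = {I: sum(p[K] for K in states if I <= K) for I in idx}
    # Proposition 1: g_IJ(theta) = eta_{I union J} - eta_I * eta_J
    g_theta = [[eta[I | J] - eta[I] * eta[J] for J in idx] for I in idx]
    # Proposition 2: g^IJ(eta) = sum_{K subset of I∩J} (-1)^{|I-K|+|J-K|} / p_K
    g_eta = [[sum((-1) ** (len(I - K) + len(J - K)) / p[K]
                  for K in subsets(I & J)) for J in idx] for I in idx]
    return idx, g_theta, g_eta

def max_deviation_from_identity(a, b):
    """Largest absolute entry of (a @ b - identity) for square lists of lists."""
    size = len(a)
    worst = 0.0
    for i in range(size):
        for j in range(size):
            entry = sum(a[i][k] * b[k][j] for k in range(size))
            worst = max(worst, abs(entry - (1.0 if i == j else 0.0)))
    return worst

# Hypothetical two-variable distribution (values chosen only for illustration).
p_example = {
    frozenset(): 0.4,
    frozenset({1}): 0.3,
    frozenset({2}): 0.2,
    frozenset({1, 2}): 0.1,
}

if __name__ == "__main__":
    _, g_theta, g_eta = fisher_matrices(p_example, 2)
    print(max_deviation_from_identity(g_theta, g_eta))
```

The printed deviation should be at the level of floating-point rounding, consistent with the statement that $G_\theta$ and $G_\eta$ are mutually inverse.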
+
+Based on the Fisher information matrices Gη and Gθ, we can calculate the Fisher information
+matrix Gζ for the l-mixed-coordinate system [ζ]l, as follows:
+
+Proposition 3. The Fisher information matrix Gζ of the l-mixed-coordinates [ζ]l is given by:
+
+\[ G_\zeta = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix} \tag{8} \]
+
+where $A = ((G_\eta^{-1})_{I_\eta})^{-1}$, $B = ((G_\theta^{-1})_{J_\theta})^{-1}$, $G_\eta$ and $G_\theta$ are the Fisher information matrices of [η] and [θ],
+respectively, $I_\eta$ is the index set of the parameters shared by [η] and $[\zeta]_l$, i.e., $\{\eta^1_i, \ldots, \eta^l_{i,j,\ldots,k}\}$, and $J_\theta$ is the index
+set of the parameters shared by [θ] and $[\zeta]_l$, i.e., $\{\theta^{i,j,\ldots,k}_{l+1}, \ldots, \theta^{1,\ldots,n}_n\}$.
+
+Proof. in Appendix C.
+
+3. The General CIF Principle
+
+In this section, we propose the CIF principle to reduce the physical information bound for
+high-dimensional systems. Given a target distribution q(x) ∈ S, we consider the problem of
+realizing it by a lower-dimensional submanifold. This is defined as the problem of parametric
+reduction for multivariate binary distributions. The family of multivariate binary distributions has
+been proven to be useful when we deal with discrete data in a variety of applications in statistical
+machine learning and artificial intelligence, such as the Boltzmann machine in neural networks [11,12]
+and the Rasch model in human sciences [13,14].
+Intuitively, if we can construct a coordinate system so that the confidences of its parameters
+entail a natural hierarchy, in which highly confident parameters are clearly distinguished from
+and orthogonal to low-confidence ones, then we can conveniently implement CIF by keeping
+the highly confident parameters unchanged and setting the low-confidence parameters to neutral
+values. Therefore, the choice of coordinates (or parametric representations) in CIF is crucial to its
+usage. This strategy is infeasible in terms of p-coordinates, η-coordinates or θ-coordinates, since the
+orthogonality condition cannot hold in these coordinate systems. In this section, we will show that the
+l-mixed-coordinates [ζ]l meet the requirement of CIF.
+In principle, the confidence of parameters should be assessed according to their contributions to
+the expected information distance between the ideal distribution and its fluctuated observations. This is
+called the distance-based CIF (see Section 1). For some coordinate systems, e.g., the mixed-coordinate
+system [ζ]l, the confidence of a parameter can also be directly evaluated by its Fisher information. This
+is called the information-based CIF (see Section 1). The information-based CIF (i.e., Fisher information)
+can be seen as an approximation to distance-based CIF, since it neglects the influence of parameter
+scaling to the expected information distance. However, considering the standard mixed-coordinates
+[ζ]l for the manifold of multivariate binary distributions, it turns out that both distance-based CIF and
+information-based CIF entail the same submanifold M (refer to Section 3.2 for detailed reasons).
+For the purpose of legibility, we will start with the information-based CIF, where the parameter’s
+confidence is simply measured using its Fisher information. After that, we show that the
+information-based CIF leads to an optimal submanifold M, which is also optimal in terms of the
+more rigorous distance-based CIF.
+
+3.1. The Information-Based CIF Principle
+
+In this section, we will show that the l-mixed-coordinates $[\zeta]_l$ meet the requirement of the
+information-based CIF. According to Proposition 3 and the following Proposition 4, the confidences of
+coordinate parameters (measured by Fisher information) in $[\zeta]_l$ entail a natural hierarchy: the first part
+of high confident parameters $[\eta_{l-}]$ are separated from the second part of low confident parameters
+$[\theta_{l+}]$. Additionally, those low confident parameters $[\theta_{l+}]$ have the neutral value of zero.
+
+Proposition 4. The diagonal elements of A are lower bounded by one, and those of B are upper bounded by one.
+
+Proof. in Appendix D.
+
+Moreover, the parameters in $[\eta_{l-}]$ are orthogonal to the ones in $[\theta_{l+}]$, indicating that we could
+estimate these two parts independently [9]. Hence, we can implement the information-based
+CIF for parametric reduction in $[\zeta]_l$ by replacing the low-confidence parameters with the neutral value
+zero and reconstructing the resulting distribution. It turns out that the submanifold of S
+tailored by information-based CIF becomes $[\zeta]_{lt} = (\eta^1_i, \ldots, \eta^l_{ij\ldots k}, 0, \ldots, 0)$. We call $[\zeta]_{lt}$ the
+l-tailored-mixed-coordinates.
+To grasp an intuitive picture for the CIF strategy and its significance w.r.t mixed-coordinates,
+let us consider an example with [p] = (p001 = 0.15, p010 = 0.1, p011 = 0.05, p100 = 0.2, p101 =
+0.1, p110 = 0.05, p111 = 0.3). Then, the confidences for coordinates in [η], [θ] and [ζ]2 are given by the
+diagonal elements of the corresponding Fisher information matrices. Applying the two-tailored CIF in
+mixed-coordinates, the loss ratio of Fisher information is 0.001%, and the ratio of the Fisher information
+of the tailored parameter ($\theta^{123}_3$) to the remaining η parameter with the smallest Fisher information is
+0.06%. On the other hand, the above two ratios become 7.58% and 94.45% (in η-coordinates) or 12.94%
+and 92.31% (in θ-coordinates), respectively. We can see that [ζ]2 gives us a much better way to tell
+apart confident parameters from noisy ones.
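For the tailored parameter in this example, the Fisher information in the 2-mixed coordinates can be worked out by hand. With n = 3 and l = 2 the block B of Proposition 3 is 1×1; since $G_\theta^{-1} = G_\eta$, it equals the reciprocal of $g^{IJ}(\eta)$ for I = J = {1, 2, 3}, which by Proposition 2 is $\sum_K 1/p_K$ over all eight states. The sketch below carries out this arithmetic; the function name and the bit-string keys are illustrative assumptions, and $p_{000}$ is inferred from normalisation.

```python
def top_order_theta_information(p):
    """Fisher information of theta^{1...n} in the (n-1)-mixed coordinates.

    This is the 1x1 block B of Proposition 3, i.e. 1 / sum_K (1 / p_K).
    """
    return 1.0 / sum(1.0 / value for value in p.values())

# p-coordinates of the example in the text, keyed by the bit string x1 x2 x3.
p = {"001": 0.15, "010": 0.10, "011": 0.05, "100": 0.20,
     "101": 0.10, "110": 0.05, "111": 0.30}
p["000"] = 1.0 - sum(p.values())  # remaining mass; 0.05 up to rounding

if __name__ == "__main__":
    print(top_order_theta_information(p))  # close to 1/95
```

The result, 1/95 ≈ 0.0105, lies well below one, in agreement with the upper bound on the diagonal elements of B stated in Proposition 4.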
+
+3.2. The Distance-Based CIF: A Geometric Point-of-View
+
+In the previous section, the information-based CIF entails a submanifold of S determined by the
+l-tailored-mixed-coordinates [ζ]lt. A more rigorous definition for the confidence of coordinates is the
+distance-based confidence used in the distance-based CIF, which relies on both of the coordinate’s
+Fisher information and its fluctuation scaling. In this section, we will show that the submanifold
+M determined by [ζ]lt is also an optimal submanifold M in terms of the distance-based CIF. Note that,
+for other coordinate systems (e.g., arbitrarily rescaling coordinates), the information-based CIF may
+not entail the same submanifold as the distance-based CIF.
+Let q(x), with coordinate ζq, denote the exact solution to the physical phenomenon being
+measured. Additionally, the act of observation would cause small random perturbations to q(x),
+leading to some observation q′(x) with coordinate ζq + Δζq. When two distributions q(x) and q′(x) are
+close, the divergence between q(x) and q′(x) on manifold S could be assessed by the Fisher information
+distance: $D(q, q') = (\Delta\zeta_q \cdot G_\zeta \cdot \Delta\zeta_q)^{1/2}$, where $G_\zeta$ is the Fisher information matrix and the perturbation
+$\Delta\zeta_q$ is small. The Fisher information distance between two close distributions q(x) and q′(x) on
+manifold S is the Riemannian distance under the Fisher–Rao metric, which is shown to be the square
+root of twice the Kullback–Leibler divergence from q(x) to q′(x) [8]. Note that we adopt the
+Fisher information distance as the distance measure between two close distributions, since it is shown
+to be the unique metric meeting a set of natural axioms for the distribution metrics [7,15,16], e.g., the
+invariant property with respect to reparametrizations and the monotonicity with respect to the random
+maps on variables.
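The local relation between the Fisher information distance and the K-L divergence can be illustrated numerically. The sketch below works in the full set of p-coordinates, where the Fisher quadratic form of a perturbation that preserves normalisation reduces to $\sum_x \Delta p(x)^2 / p(x)$; this choice of coordinates, the function names and the two nearby distributions are assumptions made for illustration.

```python
import math

def kl_divergence(q, q_prime):
    """Kullback-Leibler divergence from q to q_prime (lists of probabilities)."""
    return sum(a * math.log(a / b) for a, b in zip(q, q_prime))

def fisher_distance(q, q_prime):
    """Local Fisher information distance sqrt(sum dp^2 / p) between close distributions."""
    return math.sqrt(sum((b - a) ** 2 / a for a, b in zip(q, q_prime)))

def sqrt_twice_kl(q, q_prime):
    """Square root of twice the K-L divergence, to be compared with fisher_distance."""
    return math.sqrt(2.0 * kl_divergence(q, q_prime))

# Hypothetical distribution and a small perturbation of it (both sum to one).
q = [0.4, 0.3, 0.2, 0.1]
q_prime = [0.4005, 0.2995, 0.2003, 0.0997]

if __name__ == "__main__":
    print(fisher_distance(q, q_prime), sqrt_twice_kl(q, q_prime))
```

The two printed numbers agree to within a fraction of a percent for this perturbation, and the agreement improves as the perturbation shrinks.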
+Let M be a smooth k-dimensional submanifold in S ($k < 2^n - 1$). Given the point q(x) ∈ S,
+the projection [8] of q(x) on M is the point p(x) that belongs to M and is closest to q(x) with respect
+to the Kullback–Leibler divergence (K-L divergence) [17] from the distribution q(x) to p(x). On the
+submanifold M, the projections of q(x) and q′(x) are p(x) and p′(x), with coordinates ζp and ζp + Δζp,
+respectively, shown in Figure 2.
+Let the preserved Fisher information distance be D(p, p′) after projecting on M. In order to retain
+the information contained in observations, we need the ratio $D(p, p') / D(q, q')$ to be as large as possible in the
+expected sense, with respect to the given dimensionality k of M. The next two sections will illustrate
+that CIF leads to an optimal submanifold M based on different assumptions on the perturbations $\Delta\zeta_q$.
+
+Figure 2. By projecting a point q(x) on S to a submanifold M, the l-tailored mixed-coordinates [ζ]lt
+gives a desirable M that maximally preserves the expected Fisher information distance when projecting
+a ε-neighborhood centered at q(x) onto M.
+
+3.2.1. Perturbations in Uniform Neighborhood
+
+Let Bq be an ε-sphere surface centered at q(x) on manifold S, i.e., Bq = {q′ ∈ S | KL(q, q′) = ε},
+where KL(·, ·) denotes the K-L divergence and ε is small. Additionally, q′(x) is a neighbor of q(x)
+uniformly sampled on Bq, as illustrated in Figure 2. Recall that, for a small ε, the K-L divergence can be
+approximated by half of the squared Fisher information distance. Thus, in the parameterization
+of [ζ]l, Bq is indeed the surface of a hyper-ellipsoid (centered at q(x)) determined by Gζ. The
+following proposition shows that the general CIF would lead to an optimal submanifold M that
+maximally preserves the expected information distance, where the expectation is taken upon the
+uniform neighborhood, Bq.
+
+Proposition 5. Consider the manifold S in l-mixed-coordinates [ζ]l. Let k be the number of free parameters
+in the l-tailored-mixed-coordinates [ζ]lt. Then, among all k-dimensional submanifolds of S, the submanifold
+determined by [ζ]lt can maximally preserve the expected information distance induced by the Fisher–Rao metric.
+
+Proof. See Appendix E.
+
+3.2.2. Perturbations in Typical Distributions
+
+To facilitate our analysis, we make a basic assumption on the underlying distributions q(x)
+that at least (2^n − 2^{n/2}) p-coordinates are of the scale ϵ, where ϵ is a sufficiently small value. Thus,
+residual p-coordinates (at most 2^{n/2}) are all significantly larger than zero (of scale Θ(1/2^{n/2})), and
+their sum approximates one. Note that these assumptions are common situations in real-world data
+collections [18], since the frequent (or meaningful) patterns are only a small fraction of all of the
+system states.
+Next, we introduce a small perturbation Δp to the p-coordinates [p] for the ideal distribution
+q(x). The scale of each fluctuation ΔpI is assumed to be proportional to the standard deviation of the
+corresponding p-coordinate pI by some small coefficients (upper bounded by a constant a), which can
+be approximated by the inverse of the square root of its Fisher information via the Cramér–Rao bound.
+It turns out that we can assume the perturbation ΔpI to be a√pI.
+In this section, we adopt the l-mixed-coordinates [ζ]l = (ηl−; θl+), where l = 2 is used in
+the following analysis. Let Δζq = (Δη2−; Δθ2+) be the increment of the mixed-coordinates after the
+perturbation. The squared Fisher information distance D^2(q, q′) = Δζq · Gζ · Δζq could be decomposed
+into the direction of each coordinate in [ζ]l. We will clarify that, under typical cases, the scale of the
+Fisher information distance in each coordinate of θl+ (reduced by CIF) is asymptotically negligible,
+compared to that in each coordinate of ηl− (preserved by CIF).
+The scale of squared Fisher information distance in the direction of ηI is proportional to ΔηI ·
+(Gζ)I,I · ΔηI, where (Gζ)I,I is the Fisher information of ηI in terms of the mixed-coordinates [ζ]2. From
+Equation (1), for any I of order one (or two), ηI is the sum of 2^{n−1} (or 2^{n−2}) p-coordinates, and the scale
+is Θ(1). Hence, the increment Δη2− is proportional to Θ(1), denoted as a · Θ(1). It is difficult to give
+an explicit expression of (Gζ)I,I analytically. However, the Fisher information (Gζ)I,I of ηI is bounded
+by the (I, I)-th element of the inverse covariance matrix [19], which is exactly 1/gI,I(θ) = 1/(ηI − ηI^2) (see
+Proposition 3). Hence, the scale of (Gζ)I,I is also Θ(1). It turns out that the scale of squared Fisher
+information distance in the direction of ηI is a^2 · Θ(1).
+Similarly, for the part θ2+, the scale of squared Fisher information distance in the direction of
+θJ is proportional to ΔθJ · (Gζ)J,J · ΔθJ, where (Gζ)J,J is the Fisher information of θJ in terms of the
+mixed-coordinates [ζ]2. The scale of θJ is maximally f(k)|log(√ϵ)| based on Equation (2), where k is
+the order of θJ and f(k) is the number of p-coordinates of scale Θ(1/2^{n/2}) that are involved in the
+calculation of θJ. Since we assume that f(k) ≤ 2^{n/2}, the maximum scale of θJ is 2^{n/2}|log(√ϵ)|. Thus,
+the increment ΔθJ is of a scale bounded by a · 2^{n/2}|log(√ϵ)|. Similar to our previous derivation, the
+Fisher information (Gζ)J,J of θJ is bounded by the (J, J)-th element of the inverse covariance matrix,
+which is exactly 1/gJ,J(η) (see Proposition 3). Hence, the scale of (Gζ)J,J is (2^k − f(k))^{−1} ϵ. In summary,
+the scale of squared Fisher information distance in the direction of θJ is bounded by the scale of
+a^2 · Θ(2^n ϵ |log(√ϵ)|^2 / (2^k − f(k))). Since ϵ is a sufficiently small value and a is constant, the scale of squared Fisher
+information distance in the direction of θJ is asymptotically zero.
+In summary, in terms of modeling the fluctuated observations of typical cognitive systems, the
+original Fisher information distance between the physical phenomenon (q(x)) and observations (q′(x))
+is systematically reduced using CIF by projecting them on an optimal submanifold M. Based on our
+above analysis, the scale of Fisher information distance in the directions of [ηl−] preserved by CIF is
+significantly larger than that of the directions [θl+] reduced by CIF.
+
+4. Derivation of Boltzmann Machine by CIF
+
+In the previous section, the CIF principle is uncovered in the [ζ]l coordinates. Now, we consider
+an implementation of CIF when l equals two, which gives rise to the single-layer Boltzmann machine
+without hidden units (SBM).
+
+4.1. Notations for SBM
+
+The energy function for SBM is given by:
+
+\[ E_{SBM}(x; \xi) = -\frac{1}{2} x^T U x - b^T x \tag{9} \]
+
+where ξ = {U, b} are the parameters and the diagonals of U are set to zero. The Boltzmann
+distribution over x is p(x; ξ) = (1/Z) exp{−E_SBM(x; ξ)}, where Z is a normalization factor. Actually,
+the parametrization for SBM could be naturally expressed by the coordinate systems in IG (e.g.,
+[θ] = (θ^i_1 = b_i, θ^{ij}_2 = U_ij, θ^{ijk}_3 = 0, . . . , θ^{1,2,...,n}_n = 0)).
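To make the definition concrete, the sketch below evaluates the SBM energy of Equation (9) and the resulting Boltzmann distribution by enumerating all binary states of a small model. The three-unit weight matrix `U` and bias vector `b` are made-up values chosen only for illustration, and brute-force enumeration is an assumption that is practical only for small n.

```python
import itertools
import math

def sbm_energy(x, U, b):
    """Energy E(x) = -1/2 x^T U x - b^T x, with U symmetric and zero on the diagonal."""
    n = len(x)
    quad = sum(U[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    lin = sum(b[i] * x[i] for i in range(n))
    return -0.5 * quad - lin

def sbm_distribution(U, b):
    """Boltzmann distribution p(x) = exp(-E(x)) / Z over all 2^n binary states."""
    n = len(b)
    states = list(itertools.product([0, 1], repeat=n))
    weights = [math.exp(-sbm_energy(x, U, b)) for x in states]
    Z = sum(weights)  # normalization factor
    return {x: w / Z for x, w in zip(states, weights)}

# Hypothetical parameters for a 3-unit SBM.
U = [[0.0, 0.8, -0.5],
     [0.8, 0.0, 0.3],
     [-0.5, 0.3, 0.0]]
b = [0.1, -0.2, 0.4]
p = sbm_distribution(U, b)
print(sum(p.values()))  # probabilities sum to one up to rounding
```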
+
+4.2. The Derivation of SBM using CIF
+
+Given any underlying probability distribution q(x) on the general manifold S over {x}, the
+logarithm of q(x) can be represented by a linear decomposition of θ-coordinates, as shown in
+Equation (2). Since it is impractical to recognize all coordinates for the target distribution, we would
+like to only approximate part of them and end up with a k-dimensional submanifold M of S, where k
+(≪ 2^n − 1) is the number of free parameters. Here, we set k to be the same dimensionality as SBM,
+i.e., k = n(n+1)/2, so that all candidate submanifolds are comparable to the submanifold endowed by
+SBM (denoted as Msbm). Next, the rationale underlying the design of Msbm can be illustrated using the
+general CIF.
+Let the two-mixed-coordinates of q(x) on S be [ζ]2 = (η^1_i, η^2_ij, θ^{i,j,k}_3, . . . , θ^{1,...,n}_n). Applying the
+general CIF on [ζ]2, our parametric reduction rule is to preserve the high confident part parameters
+[η2−] and replace low confident parameters [θ2+] by a fixed neutral value of zero. Thus, we derive
+the two-tailored-mixed-coordinates: [ζ]2t = (η^1_i, η^2_ij, 0, . . . , 0), as the optimal approximation of q(x)
+by the k-dimensional submanifolds. On the other hand, given the two-mixed-coordinates of q(x),
+the projection p(x) ∈ Msbm of q(x) is proven to be [ζ]p = (η^1_i, η^2_ij, 0, . . . , 0) [8]. Thus, SBM defines a
+probabilistic parameter space that is derived from CIF.
+
+4.3. The Learning Algorithms for SBM
+
+Let q(x) be the underlying probability distribution from which samples D = {d1, d2, . . . , dN} are
+generated independently. Then, our goal is to train an SBM (with stationary probability p(x)) based on
+D that realizes q(x) as faithfully as possible. Here, we briefly introduce two typical learning algorithms
+for SBM: maximum-likelihood and contrastive divergence [11,20,21].
+Maximum-likelihood (ML) learning realizes a gradient ascent of log-likelihood of D:
+
+\[ \Delta U_{ij} = \varepsilon \frac{\partial l(\xi; D)}{\partial U_{ij}} = \varepsilon \left( E_q[x_i x_j] - E_p[x_i x_j] \right) \tag{10} \]
+
+where ε is the learning rate and l(ξ; D) = (1/N) ∑_{n=1}^{N} log p(dn; ξ). Eq[·] and Ep[·] are expectations over
+q(x) and p(x), respectively. Actually, Eq[xixj] and Ep[xixj] are the coordinates η^2_ij of q(x) and p(x),
+respectively. Eq[xixj] could be unbiasedly estimated from the sample. Markov chain Monte Carlo [22]
+is often used to approximate Ep[xixj] with an average over samples from p(x).
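The ML update in Equation (10) can be sketched for a tiny model in which Ep[xixj] is computed exactly by enumeration instead of by Markov chain Monte Carlo. The target distribution `q_probs`, the learning rate and the number of steps are illustrative assumptions. The bias update, which matches first-order moments, is the standard companion rule and is not spelled out in the text above.

```python
import itertools
import math

def sbm_probs(U, b):
    """Exact SBM probabilities over all 2^n states (feasible only for small n)."""
    n = len(b)
    states = list(itertools.product([0, 1], repeat=n))
    weights = []
    for x in states:
        # For symmetric U with zero diagonal, 1/2 x^T U x = sum_{i<j} U_ij x_i x_j.
        pair = sum(U[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
        bias = sum(b[i] * x[i] for i in range(n))
        weights.append(math.exp(pair + bias))
    Z = sum(weights)
    return states, [w / Z for w in weights]

def moments(states, probs, n):
    """First-order moments E[x_i] and second-order moments E[x_i x_j]."""
    first = [sum(p * x[i] for x, p in zip(states, probs)) for i in range(n)]
    second = [[sum(p * x[i] * x[j] for x, p in zip(states, probs))
               for j in range(n)] for i in range(n)]
    return first, second

def max_pair_gap(q2, p2, n):
    """Largest |E_q[x_i x_j] - E_p[x_i x_j]| over pairs i < j."""
    return max(abs(q2[i][j] - p2[i][j]) for i in range(n) for j in range(i + 1, n))

n = 3
states = list(itertools.product([0, 1], repeat=n))
# Hypothetical target distribution q over the 8 states, in itertools.product order.
q_probs = [0.20, 0.05, 0.10, 0.15, 0.05, 0.10, 0.05, 0.30]
q1, q2 = moments(states, q_probs, n)

U = [[0.0] * n for _ in range(n)]
b = [0.0] * n
_, p_probs = sbm_probs(U, b)
gap_before = max_pair_gap(q2, moments(states, p_probs, n)[1], n)

lr = 0.5
for _ in range(2000):
    _, p_probs = sbm_probs(U, b)
    p1, p2 = moments(states, p_probs, n)
    for i in range(n):
        b[i] += lr * (q1[i] - p1[i])                 # assumed bias rule
        for j in range(n):
            if i != j:
                U[i][j] += lr * (q2[i][j] - p2[i][j])  # Equation (10)

_, p_probs = sbm_probs(U, b)
gap_after = max_pair_gap(q2, moments(states, p_probs, n)[1], n)
print(gap_before, gap_after)
```

After training, the second-order moments of the model approach those of the target, which is the sense in which ML learning moves the η-coordinates of p(x) towards those of q(x).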
+Contrastive divergence (CD) learning realizes the gradient descent of a different objective function
+to avoid the difficulty of computing the log-likelihood gradient, shown as follows:
+
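+\[ \Delta U_{ij} = -\varepsilon \frac{\partial \left( KL(q_0 \| p) - KL(p_m \| p) \right)}{\partial U_{ij}} = \varepsilon \left( E_{q_0}[x_i x_j] - E_{p_m}[x_i x_j] \right) \tag{11} \]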
+
+where q0 is the sample distribution, pm is the distribution by starting the Markov chain with the data
+and running m steps and KL(·||·) denotes the K-L divergence. Taking samples in D as initial states, we
+could generate a set of samples for pm(x). Those samples can be used to estimate Epm[xixj].
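A minimal sketch of one CD-1 weight update (m = 1 in Equation (11)) is given below. The toy data set, the zero initialization and the learning rate are illustrative assumptions. Every neuron is resampled from the same initial state, which is one reading of the one-step transition and not necessarily the exact sampling schedule used by the authors.

```python
import math
import random

def firing_prob(i, x0, U, b):
    """Probability that neuron i takes value 1 given the initial state x0."""
    s = b[i] + sum(U[i][j] * x0[j] for j in range(len(x0)) if j != i)
    return 1.0 / (1.0 + math.exp(-s))

def cd1_delta_U(data, U, b, lr, rng):
    """One CD-1 estimate of the weight update: lr * (E_q0[x_i x_j] - E_p1[x_i x_j])."""
    n, N = len(b), len(data)
    pos = [[0.0] * n for _ in range(n)]  # statistics of the data (q0)
    neg = [[0.0] * n for _ in range(n)]  # statistics after one transition (p1)
    for x0 in data:
        # One-step transition started from the data point x0.
        x1 = [1 if rng.random() < firing_prob(i, x0, U, b) else 0 for i in range(n)]
        for i in range(n):
            for j in range(n):
                pos[i][j] += x0[i] * x0[j] / N
                neg[i][j] += x1[i] * x1[j] / N
    return [[lr * (pos[i][j] - neg[i][j]) if i != j else 0.0 for j in range(n)]
            for i in range(n)]

# Hypothetical toy sample and zero-initialized parameters.
data = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)]
U = [[0.0] * 3 for _ in range(3)]
b = [0.0] * 3
delta_U = cd1_delta_U(data, U, b, lr=0.1, rng=random.Random(0))
print(delta_U)
```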
+From the perspective of IG, we can see that ML/CD learning is to update parameters in SBM,
+so that its corresponding coordinates [η2−] are getting closer to the data (along with the decreasing
+gradient). This is consistent with our theoretical analysis in Section 3 and Section 4.2 that SBM uses
+the most confident information (i.e., [η2−]) for approximating an arbitrary distribution in an expected
+sense.
+
+5. Experimental Study: Incorporate Data into CIF
+
+In the information-based CIF, the actual values of the data were not used to explicitly affect the
+output PDF (e.g., the derivation of SBM in Section 4). The data constrains the state of knowledge about
+the unknown PDF. In order to force the estimate of our probabilistic model to obey the data, we need
+to further reduce the difference between data information and physical information bound. How can
+this be done?
+In this section, the CIF principle will also be used to modify an existing SBM training algorithm (i.e.,
+CD-1) by incorporating data information. Given a particular dataset, the CIF can be used to further
+recognize less-confident parameters in SBM and to reduce them properly. Our solution here is to
+apply CIF to take effect on the learning trajectory with respect to specific samples and, hence, further
+confine the parameter space to the region indicated by the most confident information contained in
+the samples.
+
+5.1. A Sample-Specific CIF-Based CD Learning for SBM
+
+The main modification of our CIF-based CD algorithm (CD-CIF for short) is that we generate
+the samples for pm(x) based on those parameters with confident information, where the confident
+information carried by certain parameter is inherited from the sample and could be assessed using its
+Fisher information computed in terms of the sample.
+For CD-1 (i.e., m = 1), the firing probability for the i-th neuron after a one-step transition from the
+initial state x^(0) = {x^(0)_1, x^(0)_2, . . . , x^(0)_n} is:
+
+\[ p(x_i^{(1)} = 1 \mid x^{(0)}) = \frac{1}{1 + \exp\{ -\sum_{j \neq i} U_{ij} x_j^{(0)} - b_i \}} \tag{12} \]
+
+For CD-CIF, the firing probability for the i-th neuron in Equation (12) is modified as follows:
+
+\[ p(x_i^{(1)} = 1 \mid x^{(0)}) = \frac{1}{1 + \exp\{ -\sum_{(j \neq i) \,\&\, (F(U_{ij}) > \tau)} U_{ij} x_j^{(0)} - b_i \}} \tag{13} \]
+
+where τ is a pre-selected threshold, F(Uij) = Eq0[xixj] − Eq0[xixj]^2 is the Fisher information of Uij (see
+Equation (6)) and the expectations are estimated from the given sample D. We can see that those
+weights whose Fisher information is less than τ are considered to be unreliable w.r.t. D. In practice,
+we could set τ by the ratio r to specify the proportion of the total Fisher information TFI of all
+parameters that we would like to retain, i.e., ∑_{F(Uij)>τ, i<j} F(Uij) = r · TFI.
+In summary, CD-CIF is realized in two phases. In the first phase, we initially “guess” whether
+a certain parameter could be faithfully estimated based on the finite sample. In the second phase, we
+approximate the gradient using the CD scheme, except that the CIF-based firing function in
+Equation (13) is used.
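The two phases can be sketched as follows. The first function estimates F(Uij) from a sample, the second keeps the pairs with the largest Fisher information until they account for a fraction r of the total (a greedy stand-in for choosing the threshold τ, which is an assumption about how τ is derived from r), and the third evaluates the masked firing probability of Equation (13). The sample `data` and all function names are hypothetical.

```python
import math

def pair_fisher_information(data):
    """Estimate F(U_ij) = E[x_i x_j] - E[x_i x_j]^2 from binary samples, for i < j."""
    n, N = len(data[0]), len(data)
    fisher = {}
    for i in range(n):
        for j in range(i + 1, n):
            m = sum(x[i] * x[j] for x in data) / N
            fisher[(i, j)] = m - m * m
    return fisher

def confident_pairs(fisher, r):
    """Keep the highest-information pairs until they hold a fraction r of the total."""
    total = sum(fisher.values())
    kept, acc = set(), 0.0
    for pair, f in sorted(fisher.items(), key=lambda kv: kv[1], reverse=True):
        if acc >= r * total - 1e-12:
            break
        kept.add(pair)
        acc += f
    return kept

def cif_firing_prob(i, x0, U, b, kept):
    """Firing probability that sums only over weights judged confident."""
    s = b[i]
    for j in range(len(x0)):
        if j != i and (min(i, j), max(i, j)) in kept:
            s += U[i][j] * x0[j]
    return 1.0 / (1.0 + math.exp(-s))

# Hypothetical sample of six binary vectors over three variables.
data = [(1, 1, 0), (1, 1, 0), (1, 0, 0), (0, 0, 1), (1, 1, 1), (0, 1, 1)]
fisher = pair_fisher_information(data)
kept = confident_pairs(fisher, r=0.7)

U = [[0.0, 1.0, 5.0],
     [1.0, 0.0, -2.0],
     [5.0, -2.0, 0.0]]
b = [0.0, 0.0, 0.0]
prob = cif_firing_prob(0, (0, 0, 1), U, b, kept)
print(sorted(kept), prob)
```

In this toy sample the pair (0, 2) carries the least Fisher information and is dropped, so the large weight between units 0 and 2 no longer influences the firing probability of unit 0.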
+
+5.2. Experimental Results
+In this section, we empirically investigate our justifications for the CIF principle, especially how
+the sample-specific CIF-based CD learning (see Section 5) works in the context of density estimation.
+Experimental Setup and Evaluation Metric: We utilize the random distribution uniformly generated
+from the open probability simplex over 10 variables as underlying distributions, whose sample size
+N may vary. Three learning algorithms are investigated: ML, CD-1 and our CD-CIF. K-L divergence is
+used to evaluate the goodness-of-fit of the SBMs trained by various algorithms. For each sample size N,
+we run 100 instances (20 randomly generated distributions × 5 random runs) and report the
+averaged K-L divergences. Note that we focus on the case that the variable number is relatively small
+(n = 10) in order to analytically evaluate the K-L divergence and give a detailed study on algorithms.
+Changing the number of variables has only a trivial influence on the experimental results, since we
+obtained qualitatively similar observations on various variable numbers (not reported here).
+Automatically Adjusting r for Different Sample Sizes: The Fisher information is additive for i.i.d.
+sampling. When sample sizes change, it is natural to require that the total amount of Fisher information
+contained in all tailored parameters is steady. Hence, we have α = (1 − r)N, where α indicates the
+amount of Fisher information and becomes a constant when the learning model and the underlying
+distribution family are given. It turns out that we can first identify α using the optimal r w.r.t several
+distributions generated from the underlying distribution family and then determine the optimal r’s for
+various sample sizes using r = 1 − α/N. In our experiments, we set α = 35.
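As a small worked example of the rule r = 1 − α/N with α = 35, the sketch below tabulates the retained ratio for a few sample sizes. Clamping the ratio at zero for N ≤ α is an added assumption, since the text does not say how such small samples are handled.

```python
def retained_ratio(sample_size, alpha=35.0):
    """Ratio r of total Fisher information to retain, r = 1 - alpha / N (clamped at 0)."""
    return max(0.0, 1.0 - alpha / sample_size)

for N in (100, 500, 1200):
    print(N, round(retained_ratio(N), 4))
```

The ratio approaches one as N grows, so fewer parameters are tailored for larger samples, in line with the additivity argument above.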
+Density Estimation Performance: The averaged K-L divergences between SBMs (learned by ML,
+CD-1 and CD-CIF with the r automatically determined) and the underlying distribution are shown in
+Figure 3a. In the case of relatively small samples (N ≤ 500) in Figure 3a, our CD-CIF method shows
+significant improvements over ML (from 10.3% to 16.0%) and CD-1 (from 11.0% to 21.0%). This is
+because we could not expect to have reliable identifications for all model parameters from insufficient
+samples, and hence, CD-CIF gains its advantages by using parameters that could be confidently
+estimated. This result is consistent with our previous theoretical insight that Fisher information gives a
+reasonable guidance for parametric reduction via the confidence criterion. As the sample size increases
+(N ≥ 600), CD-CIF, ML and CD-1 tend to have similar performances, since, with relatively large
+samples, most model parameters can be reasonably estimated, hence the effect of parameter reduction
+using CIF gradually becomes marginal. In Figure 3b and Figure 3c, we show how sample size affects
+the interval of r. For N = 100, CD-CIF achieves significantly better performances for a wide range of r,
+while for N = 1200, CD-CIF can only marginally outperform baselines for a narrow range of r.
+
+[Figure 3, panels (a)–(d): plot content not recoverable from the text extraction.]
+
+Figure 3. (a): the performance of CD-CIF on different sample sizes; (b) and (c): The performances
+of CD-CIF with various values of r on two typical sample sizes, i.e., 100 and 1200; (d) illustrates one
+learning trajectory of the last 100 steps for ML (squares), CD-1 (triangles) and CD-CIF (circles).
+
+Effects on Learning Trajectory: We use the 2D visualizing technology SNE [20] to investigate learning
+trajectories and dynamical behaviors of three comparative algorithms. We start three methods with the
+same parameter initialization. Then, each intermediate state is represented by a 55-dimensional vector
+formed by its current parameter values. From Figure 3d, we can see that: (1) In the final 100 steps, the
+three methods seem to end up staying in different regions of the parameter space, and CD-CIF confines
+the parameter in a relatively thinner region compared to ML and CD-1; (2) The true distribution is
+usually located on the side of CD-CIF, indicating its potential for converging to the optimal solution.
+Note that the above claims are based on general observations, and Figure 3d is shown as an illustration.
+Hence, we may conclude that CD-CIF regularizes the learning trajectories in a desired region of the
+parameter space using the sample-specific CIF.
+
+6. Conclusions
+
+Different from the traditional EPI, the CIF principle proposed in this paper aims at finding a
+way to derive computational models for universal cognitive systems by a dimensionality reduction
+approach in parameter spaces: specifically, by preserving the confident parameters and reducing the
+less confident parameters. In principle, the confidence of parameters should be assessed according
+to their contributions to the expected information distance between the ideal distribution and its
+fluctuated observations. This is called the distance-based CIF. For some coordinate systems, e.g.,
+the mixed-coordinate system [ζ]l, the confidence of a parameter can also be directly evaluated by
+its Fisher information, which establishes a connection with the inverse variance of any unbiased
+estimate for the parameter via the Cramér–Rao bound. This is called the information-based CIF.
+The criterion of information-based CIF (i.e., Fisher information) can be seen as an approximation to
+distance-based CIF, since it neglects the influence of parameter scaling to the expected information
+distance. However, considering the standard mixed-coordinates [ζ]l for the manifold of multivariate
+binary distributions, it turns out that both distance-based CIF and information-based CIF entail the
+same optimal submanifold M.
+The CIF provides a strategy for the derivation of probabilistic models. The SBM is a specific
+example in this regard. It has been theoretically shown that the SBM can achieve a reliable
+representation in parameter spaces by using the CIF principle.
+The CIF principle can also be used to modify existing SBM training algorithms by incorporating
+data information, such as CD-CIF. One interesting result shown in our experiments is that: although
+CD-CIF is a biased algorithm, it could significantly outperform ML when the sample is insufficient.
+This suggests that CIF gives us a reasonable criterion for utilizing confident information from the
+underlying data, while ML lacks a mechanism to do so.
+In the future, we will further develop the formal justification of CIF w.r.t various contexts (e.g.,
+distribution families or models).
+
+Acknowledgments: We would like to thank the anonymous reviewers for their valuable comments. We also
+thank Mengjiao Xie and Shuai Mi for their helpful discussions. This work is partially supported by the Chinese
+National Program on Key Basic Research Project (973 Program, Grant No. 2013CB329304 and 2014CB744604), the
+Natural Science Foundation of China (Grant Nos. 61272265, 61070044, 61272291, 61111130190 and 61105072).
+
+Appendix
+
+Appendix A Proof of Proposition 1
+
+Proof. By definition, we have:
+
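+\[ g_{IJ} = \frac{\partial^2 \psi(\theta)}{\partial \theta^I \partial \theta^J} \]
+
+where ψ(θ) is defined by Equation (4). Hence, we have:
+
+\[ g_{IJ} = \frac{\partial^2 \left( \sum_I \theta^I \eta_I - \varphi(\eta) \right)}{\partial \theta^I \partial \theta^J} = \frac{\partial \eta_I}{\partial \theta^J} \]
+
+By differentiating ηI, defined by Equation (1), with respect to θJ, we have:
+
+\[ g_{IJ} = \frac{\partial \eta_I}{\partial \theta^J} = \frac{\partial \sum_x X_I(x) \exp\{ \sum_I \theta^I X_I(x) - \psi(\theta) \}}{\partial \theta^J} = \sum_x X_I(x) [X_J(x) - \eta_J] p(x; \theta) = \eta_{I \cup J} - \eta_I \eta_J \]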
+
+This completes the proof.
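The closed form just derived can be checked numerically on a two-variable log-linear model by comparing it with a central finite-difference estimate of the derivative of ηI with respect to θJ. The parameter values in `theta` and the step size `h` are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import itertools
import math

SUBSETS = [(0,), (1,), (0, 1)]  # non-empty index sets I for n = 2

def indicator(I, x):
    """X_I(x): product of the binary variables x_i for i in I."""
    return int(all(x[i] == 1 for i in I))

def eta(theta):
    """Expectation coordinates eta_I = E[X_I] of the log-linear model with parameters theta."""
    states = list(itertools.product([0, 1], repeat=2))
    weights = [math.exp(sum(theta[I] * indicator(I, x) for I in SUBSETS)) for x in states]
    Z = sum(weights)
    return {I: sum(w * indicator(I, x) for x, w in zip(states, weights)) / Z
            for I in SUBSETS}

def union(I, J):
    return tuple(sorted(set(I) | set(J)))

theta = {(0,): 0.3, (1,): -0.7, (0, 1): 1.1}  # hypothetical parameter values
h = 1e-5
base = eta(theta)
max_err = 0.0
for I in SUBSETS:
    for J in SUBSETS:
        plus, minus = dict(theta), dict(theta)
        plus[J] += h
        minus[J] -= h
        numeric = (eta(plus)[I] - eta(minus)[I]) / (2 * h)  # finite-difference derivative
        closed = base[union(I, J)] - base[I] * base[J]      # eta_{I u J} - eta_I eta_J
        max_err = max(max_err, abs(numeric - closed))
print(max_err)
```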
+
+Appendix B Proof of Proposition 2
+
+Proof. By definition, we have:
+
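+\[ g^{IJ} = \frac{\partial^2 \varphi(\eta)}{\partial \eta_I \partial \eta_J} \]
+
+where φ(η) is defined by Equation (4). Hence, we have:
+
+\[ g^{IJ} = \frac{\partial^2 \left( \sum_J \theta^J \eta_J - \psi(\theta) \right)}{\partial \eta_I \partial \eta_J} = \frac{\partial \theta^I}{\partial \eta_J} \]
+
+Based on Equations (2) and (1), the θI and pK could be calculated by solving a linear equation of [p]
+and [η], respectively. Hence, we have:
+
+\[ \theta^I = \sum_{K \subseteq I} (-1)^{|I-K|} \log(p_K); \quad p_K = \sum_{K \subseteq J} (-1)^{|J-K|} \eta_J \]
+
+Therefore, the partial derivation of θI with respect to ηJ is:
+
+\[ g^{IJ} = \frac{\partial \theta^I}{\partial \eta_J} = \sum_K \frac{\partial \theta^I}{\partial p_K} \cdot \frac{\partial p_K}{\partial \eta_J} = \sum_{K \subseteq I \cap J} (-1)^{|I-K|+|J-K|} \cdot \frac{1}{p_K} \]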
+
+This completes the proof.
+
+Appendix C Proof of Proposition 3
+
+Proof. The Fisher information matrix of [ζ] could be partitioned into four parts:
+
+\[ G_\zeta = \begin{pmatrix} A & C \\ D & B \end{pmatrix} \]
+
+It can be verified that in the mixed coordinate, the θ-coordinate of order k is orthogonal to any
+η-coordinate less than k-order, implying the corresponding element of the Fisher information matrix is
+zero (C = D = 0) [23]. Hence, Gζ is a block diagonal matrix.
+According to the Cramér–Rao bound [3], a parameter (or a pair of parameters) has a unique
+asymptotically tight lower bound of the variance (or covariance) of the unbiased estimate, which is
+given by the corresponding element of the inverse of the Fisher information matrix involving this
+parameter (or this pair of parameters). Recall that Iη is the index set of the parameters shared by [η]
+and [ζ]l and that Jθ is the index set of the parameters shared by [θ] and [ζ]l; we have (Gζ^{−1})_{Iζ} = (Gη^{−1})_{Iη}
+and (Gζ^{−1})_{Jζ} = (Gθ^{−1})_{Jθ}, i.e.,
+
+\[ G_\zeta^{-1} = \begin{pmatrix} (G_\eta^{-1})_{I_\eta} & 0 \\ 0 & (G_\theta^{-1})_{J_\theta} \end{pmatrix} \]
+
+Since Gζ is a block diagonal matrix, the proposition follows.
+
+Appendix D Proof of Proposition 4
+
+Proof. Assume the Fisher information matrix of [θ] to be:
+
+\[ G_\theta = \begin{pmatrix} U & X \\ X^T & V \end{pmatrix} \]
+
+which is partitioned based on Iη and Jθ. Based on Proposition 3, we have A = U^{−1}. Obviously, the diagonal elements
+of U are all smaller than one. According to the succeeding Lemma A1, we can see that the diagonal
+elements of A (i.e., U^{−1}) are greater than one.
+Next, we need to show that the diagonal elements of B are smaller than 1. Using the Schur
+complement of Gθ, the bottom-right block of Gθ^{−1}, i.e., (Gθ^{−1})_{Jθ}, equals (V − X^T U^{−1} X)^{−1}. Thus, the
+diagonal elements of B satisfy Bjj = (V − X^T U^{−1} X)jj < Vjj < 1. Hence, we complete the proof.
+
+Lemma A1. With an l × l positive definite matrix H, if Hii < 1, then (H^{−1})ii > 1, ∀i ∈ {1, 2, . . . , l}.
+
+Proof. Since H is positive definite, it is a Gramian matrix of l linearly independent vectors v1, v2, . . . , vl,
+i.e., Hij = ⟨vi, vj⟩ (⟨·, ·⟩ denotes the inner product). Similarly, H^{−1} is the Gramian matrix of l linearly
+independent vectors w1, w2, . . . , wl and (H^{−1})ij = ⟨wi, wj⟩. It is easy to verify that ⟨wi, vi⟩ = 1, ∀i ∈
+{1, 2, . . . , l}. If Hii < 1, we can see that the norm ∥vi∥ = √Hii < 1. Since ∥wi∥ × ∥vi∥ ≥ ⟨wi, vi⟩ = 1,
+we have ∥wi∥ > 1. Hence, (H^{−1})ii = ⟨wi, wi⟩ = ∥wi∥^2 > 1.
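Lemma A1 can be illustrated numerically with a small positive definite matrix whose diagonal entries are below one. The matrix `H` is a made-up example (symmetric and strictly diagonally dominant, hence positive definite), and the Gauss–Jordan inversion is a plain-Python stand-in for a linear algebra library.

```python
def invert(H):
    """Invert a positive definite matrix by Gauss-Jordan elimination (no pivoting needed)."""
    n = len(H)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(H)]
    for c in range(n):
        pivot = aug[c][c]  # strictly positive for a positive definite matrix
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

# Hypothetical positive definite matrix with all diagonal entries below one.
H = [[0.9, 0.3, 0.1],
     [0.3, 0.8, 0.2],
     [0.1, 0.2, 0.7]]
H_inv = invert(H)
inv_diag = [H_inv[i][i] for i in range(3)]
print(inv_diag)  # each entry exceeds one, as the lemma states
```

The Cauchy–Schwarz step in the proof gives the slightly stronger bound (H^{−1})ii ≥ 1/Hii, which the example also satisfies.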
+
+
+Appendix E Proof of Proposition 5
+
+Proof. Let Bq be an ε-ball surface centered at q(x) on manifold S, i.e., Bq = {q′ ∈ S | KL(q, q′) = ε},
+where KL(·, ·) denotes the Kullback–Leibler divergence and ε is small. ζq is the coordinates of q(x). Let
+q(x) + dq be a neighbor of q(x) uniformly sampled on Bq and ζq(x)+dq be its corresponding coordinates.
+For a small ε, we can calculate the expected information distance between q(x) and q(x) + dq as follows:
+
+\[ E_{B_q} = \int \left[ (\zeta_{q(x)+dq} - \zeta_q)^T G_\zeta (\zeta_{q(x)+dq} - \zeta_q) \right]^{1/2} dB_q \tag{A1} \]
+
+where Gζ is the Fisher information matrix at q(x).
+Since Fisher information matrix Gζ is both positive definite and symmetric, there exists a singular
+value decomposition Gζ = U^TΛU where U is an orthogonal matrix and Λ is a diagonal matrix with
+diagonal entries equal to the eigenvalues of Gζ (all ≥ 0).
+Applying the singular value decomposition into Equation (A1), the distance becomes:
+
+\[ E_{B_q} = \int \left[ (\zeta_{q(x)+dq} - \zeta_q)^T U^T \Lambda U (\zeta_{q(x)+dq} - \zeta_q) \right]^{1/2} dB_q \tag{A2} \]
+
+Note that U is an orthogonal matrix, and the transformation U(ζq(x)+dq − ζq) is a norm-preserving rotation.
+Now, we need to show that among all tailored k-dimensional submanifolds of S, [ζ]lt is the
+one that preserves maximum information distance. Assume IT = {i1, i2, . . . , ik} is the index of k
+coordinates that we choose to form the tailored submanifold T in the mixed-coordinates [ζ]. According
+to the fundamental analytical properties of the surface of the hyper-ellipsoid and the orthogonality of
+the mixed-coordinates, there exists a strict positive monotonicity between the expected information
+distance EBq for T and the sum of eigenvalues of the sub-matrix (Gζ)IT, where the sum equals to the
+trace of (Gζ)IT. That is, the greater the trace of (Gζ)IT, the greater the expected information distance
+EBq for T.
+Next, we show that the sub-matrix of Gζ specified by [ζ]lt gives a maximum trace. Based on
+Proposition 4, the elements on the main diagonal of the sub-matrix A are lower bounded by one and
+those of B upper bounded by one. Therefore, [ζ]lt gives the maximum trace among all sub-matrices of
+Gζ. This completes the proof.
+
+Author Contributions: Theoretical study and proof: Yuexian Hou and Xiaozhao Zhao. Conceived and designed
+the experiments: Xiaozhao Zhao, Yuexian Hou, Dawei Song and Wenjie Li. Performed the
+experiments: Xiaozhao Zhao. Analyzed the data: Xiaozhao Zhao, Yuexian Hou. Wrote the manuscript:
+Xiaozhao Zhao, Dawei Song, Wenjie Li and Yuexian Hou. All authors have read and approved the
+final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1. Wheeler, J.A. Time Today; Cambridge University Press: Cambridge, UK, 1994; pp. 1–29.
+2. Frieden, B.R. Science from Fisher Information: A Unification; Cambridge University Press: Cambridge, UK, 2004.
+3. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
+4. Frieden, B.R.; Gatenby, R.A. Principle of maximum Fisher information from Hardy’s axioms applied to statistical systems. Phys. Rev. E 2013, 88, 042144.
+5. Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference: A Practical Information—Theoretic Approach; Springer: Berlin/Heidelberg, Germany, 2002.
+6. Vstovsky, G.V. Interpretation of the extreme physical information principle in terms of shift information. Phys. Rev. E 1995, 51, 975–979.
+7. Amari, S.; Nagaoka, H. Methods of Information Geometry; Translations of Mathematical Monographs; Oxford University Press: Oxford, UK, 1993.
+8. Amari, S.; Kurata, K.; Nagaoka, H. Information geometry of Boltzmann machines. IEEE Trans. Neural Netw. 1992, 3, 260–271.
+9. Hou, Y.; Zhao, X.; Song, D.; Li, W. Mining pure high-order word associations via information geometry for information retrieval. ACM Trans. Inf. Syst. 2013, 31, 12:1–12:32.
+10. Kass, R.E. The geometry of asymptotic inference. Stat. Sci. 1989, 4, 188–219.
+11. Ackley, D.H.; Hinton, G.E.; Sejnowski, T.J. A learning algorithm for Boltzmann machines. Cogn. Sci. 1985, 9, 147–169.
+12. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
+13. Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960.
+14. Bond, T.; Fox, C. Applying the Rasch Model: Fundamental Measurement in the Human Sciences; Psychology Press: London, UK, 2013.
+15. Gibilisco, P. Algebraic and Geometric Methods in Statistics; Cambridge University Press: Cambridge, UK, 2010.
+16. Čencov, N.N. Statistical Decision Rules and Optimal Inference; American Mathematical Society: Washington, D.C., USA, 1982.
+17. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
+18. Buhlmann, P.; van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory And Applications; Springer: Berlin/Heidelberg, Germany, 2011.
+19. Bobrovsky, B.; Mayer-Wolf, E.; Zakai, M. Some classes of global Cramér-Rao bounds. Ann. Stat. 1987, 15, 1421–1438.
+20. Hinton, G.E. Training products of experts by minimizing contrastive divergence. Neural Comput. 2002, 14, 1771–1800.
+21. Carreira-Perpinan, M.A.; Hinton, G.E. On contrastive divergence learning. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, Key West, FL, USA, 6–8 January 2005; pp. 33–40.
+22. Gilks, W.R.; Richardson, S.; Spiegelhalter, D. Introducing markov chain monte carlo. In Markov Chain Monte Carlo in Practice; Chapman and Hall/CRC: London, UK, 1996; pp. 1–19.
+23. Nakahara, H.; Amari, S. Information geometric measure for neural spikes. Neural Comput. 2002, 14, 2269–2316.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+MDPI
+St. Alban-Anlage 66
+4052 Basel
+
+Switzerland
+Tel. +41 61 683 77 34
+Fax +41 61 302 89 18
+www.mdpi.com
+
+Entropy Editorial Office
+E-mail: entropy@mdpi.com
+
+www.mdpi.com/journal/entropy
+
+
+
+ISBN 978-3-03897-633-2
+
+
diff --git a/papers/references/Amari2016_Placeholder.md b/papers/references/Amari2016_Placeholder.md
deleted file mode 100644
index d1ecc4e8..00000000
--- a/papers/references/Amari2016_Placeholder.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Information Geometry and Its Applications (Amari 2016)
-
-This reference is a published book/monograph. 
-Due to copyright and its format, the full PDF is not hosted in this repository.
-
-**Citation:**
-Amari, S. (2016). *Information Geometry and Its Applications.* Springer.
diff --git a/papers/references/Bastos2012.pdf b/papers/references/Bastos2012.pdf
new file mode 100644
index 00000000..28ce3bd5
--- /dev/null
+++ b/papers/references/Bastos2012.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:af1955d6a49759e0a78310c2e1948cf421e3c78babd18a294b7b621a0c7cd599
+size 533952
diff --git a/papers/references/Bastos2012.txt b/papers/references/Bastos2012.txt
new file mode 100644
index 00000000..652e6b8b
--- /dev/null
+++ b/papers/references/Bastos2012.txt
@@ -0,0 +1,1759 @@
+Canonical microcircuits for predictive coding
+
+Andre M. Bastos (1,2,6), W. Martin Usrey (1,3,4), Rick A. Adams (8), George R. Mangun (2,3,5),
+Pascal Fries (6,7), and Karl J. Friston (8)
+
+(1) Center for Neuroscience, University of California-Davis, Davis, CA 95618 USA.
+(2) Center for Mind and Brain, University of California-Davis, Davis, CA 95618 USA.
+(3) Department of Neurology, University of California-Davis, Davis, CA 95618 USA.
+(4) Department of Neurobiology, Physiology and Behavior, University of California-Davis, Davis, CA 95618 USA.
+(5) Department of Psychology, University of California-Davis, Davis, CA 95618 USA.
+(6) Ernst Strüngmann Institute (ESI) for Neuroscience in Cooperation with Max Planck Society, Deutschordenstraße 46, 60528 Frankfurt, Germany.
+(7) Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen, Kapittelweg 29, 6525 EN Nijmegen, Netherlands.
+(8) The Wellcome Trust Centre for Neuroimaging, University College London, Queen Square, London WC1N 3BG, UK.
+
+Summary
+
+This review considers the influential notion of a canonical (cortical) microcircuit in light of recent
+theories about neuronal processing. Specifically, we reconcile quantitative studies of
+microcircuitry with the functional logic of neuronal computations. We revisit the established idea
+that message passing among hierarchical cortical areas implements a form of Bayesian inference –
+paying careful attention to the implications for intrinsic connections among neuronal populations.
+By deriving canonical forms for these computations, one can associate specific neuronal
+populations with specific computational roles. This analysis discloses a remarkable
+correspondence between the microcircuitry of the cortical column and the connectivity implied by
+predictive coding. Furthermore, it provides some intuitive insights into the functional asymmetries
+between feedforward and feedback connections and the characteristic frequencies over which they
+operate.
+
+Keywords
+
+neuronal; connectivity; cortical; microcircuit; computation; predictive coding; free energy
+principle; gamma oscillations; beta oscillations
+
+NIH Public Access Author Manuscript
+Neuron. Author manuscript; available in PMC 2013 September 19.
+Published in final edited form as: Neuron. 2012 November 21; 76(4): 695–711. doi:10.1016/j.neuron.2012.10.038.
+
+Correspondence: Karl Friston, The Wellcome Trust Centre for Neuroimaging, Institute of Neurology, University College London, 12
+Queen Square, London, WC1N 3BG, UK. Tel (44) 207 833 7454 k.friston@ucl.ac.uk.
+
+Introduction
+
+The idea that the brain actively constructs explanations for its sensory inputs is now
+generally accepted. This notion builds on a long history of proposals that the brain uses
+internal or generative models to make inferences about the causes of its sensorium
+(Helmholtz, 1860; Gregory 1968, 1980; Dayan et al., 1995). In terms of implementation,
+predictive coding is, arguably, the most plausible neurobiological candidate for making
+these inferences (Srinivasan et al., 1982; Mumford, 1992; Rao and Ballard, 1999). This
+review considers the canonical microcircuit in light of predictive coding. We focus on the
+intrinsic connectivity within a cortical column and the extrinsic connections between
+columns in different cortical areas. We try to relate this circuitry to neuronal computations
+by showing that the computational dependencies – implied by predictive coding – recapitulate
+the physiological dependencies implied by quantitative studies of intrinsic connectivity. This
+issue is important as distinct neuronal dynamics in different cortical layers are becoming
+increasingly apparent (de Kock et al., 2007; Sakata and Harris, 2009; Maier et al., 2010;
+Bollimunta et al., 2011). For example, recent findings suggest that the superficial layers of
+cortex show neuronal synchronization and spike-field coherence predominantly in the
+gamma frequencies, while deep layers prefer lower (alpha or beta) frequencies (Roopun et
+al., 2006, 2008; Maier et al., 2010; Buffalo et al., 2011). Since feedforward connections
+originate predominantly from superficial layers and feedback connections from deep layers,
+these differences suggest that feedforward connections use relatively high frequencies,
+compared to feedback connections, as recently demonstrated empirically (Bosman et al.,
+2012). These asymmetries call for something quite remarkable: namely, a synthesis of
+spectrally distinct inputs to a cortical column and the segregation of its outputs. This
+segregation can only arise from local neuronal computations that are structured and
+precisely interconnected. It is the nature of this intrinsic connectivity – and the dynamics it
+supports – that we consider. The aim of this review is to speculate about the functional roles
+of neuronal populations in specific cortical layers in terms of predictive coding. Our long-
+term aim is to create computationally informed models of microcircuitry that can be tested
+with dynamic causal modelling (David et al., 2006; Moran et al., 2008, 2011).
+
+This review comprises three sections. We start with an overview of the anatomy and
+physiology of cortical connections – with an emphasis on quantitative advances. The second
+section considers the computational role of the canonical microcircuit that emerges from
+these studies. The third section provides a formal treatment of predictive coding and defines
+the requisite computations in terms of differential equations. We then associate the form of
+these equations with the canonical microcircuit to define a computational architecture. We
+conclude with some predictions about intrinsic connections and note some important
+asymmetries in feedforward and feedback connections that emerge from this treatment.
+
+The anatomy and physiology of cortical connections
+
+This section reviews laminar-specific connections that underlie the notion of a canonical
+microcircuit (Douglas et al., 1989; Douglas and Martin, 1991, 2004). We first focus on
+mammalian visual cortex and then consider whether visual microcircuitry can be generalized
+to a canonical circuit for the entire cortex. Both functional and anatomical techniques have
+been applied to study intrinsic (intracortical) and extrinsic connections. We will emphasise
+the insights from recent studies that combine both techniques.
+
+Intrinsic connections and the canonical microcircuit
+
+The seminal work of Douglas and Martin (1991), in the cat visual system, produced a model
+of how information flows through the cortical column. Douglas and Martin recorded
+intracellular potentials from cells in primary visual cortex during electrical stimulation of its
+thalamic afferents. They noted a stereotypical pattern of fast excitation, followed by slower
+and longer-lasting inhibition. The latency of the ensuing hyperpolarisation distinguished
+responses in supragranular and infragranular layers. Using conductance-based models, they
+showed that a simple model could reproduce these responses. Their model contained
+superficial and deep pyramidal cells with a common pool of inhibitory cells. All three
+neuronal populations received thalamic drive and were fully interconnected. The deep
+pyramidal cells received relatively weak thalamic drive but strong inhibition (Figure 1).
+These interconnections allowed the circuit to amplify transient thalamic inputs to generate
+sustained activity in the cortex, while maintaining a balance between excitation and
+inhibition, two tasks that must be solved by any cortical circuit. Their circuit, although based
+on recordings from cat visual cortex, was also proposed as a basic theme that might be
+present and replicated, with minor variations, throughout the cortical sheet (Douglas et al.,
+1989).
+
+Subsequent studies have used intracellular recordings and histology to measure spikes (and
+depolarisation) in pre- and post-synaptic cells, whose cellular morphology can be determined.
+This approach quantifies both the connection probability – defined as the number of
+observed connections divided by total number of pairs recorded – and connection strength –
+defined in terms of post-synaptic responses. Thomson et al. (2002) used these techniques to
+study layers 2 to 5 (L2 to L5) of the cat and rat visual systems. The most frequently
+connected cells were located in the same cortical layer, where the largest interlaminar
+projections were the ‘feedforward’ connections from L4 to L3 and from L3 to L5. Excitatory
+reciprocal ‘feedback’ connections were not observed (L3 to L4) or less commonly observed
+(L5 to L3), suggesting that excitation spreads within the column in a feedforward fashion.
+Feedback connections were typically seen when pyramidal cells in one layer targeted
+inhibitory cells in another (see Thomson and Bannister, 2003 for a review).
+
+While many studies have focused on excitatory connections, a few have examined inhibitory
+connections. These are more difficult to study, because inhibitory cells are less common
+than excitatory cells, and because there are at least seven distinct morphological classes
+(Salin and Bullier, 1995). However, recent advances in optogenetics have made it possible
+to more easily target inhibitory cells: Kätzel and colleagues combined optogenetics and
+whole-cell recording to investigate the intrinsic connectivity of inhibitory cells in mouse
+cortical areas M1, S1, and V1 (Kätzel et al., 2010). They transgenically expressed
+channelrhodopsin in inhibitory neurons and activated them, while recording from pyramidal
+cells. This allowed them to assess the effect of inhibition as a function of laminar position
+relative to the recorded neuron.
+
+Several conclusions can be drawn from this approach (Kätzel et al., 2010): first, L4
+inhibitory connections are more restricted in their lateral extent, relative to other layers. This
+supports the notion that L4 responses are dominated by thalamic inputs, while the remaining
+laminae integrate afferents from a wider cortical patch. Second, the primary source of
+inhibition originates from cells in the same layer, reflecting the prevalence of inhibitory
+intralaminar connections. Third, several interlaminar motifs appeared to be general – at least
+in granular cortex: principally, a strong inhibitory connection from L4 onto supragranular
+L2/3 and from infragranular layers onto L4. For more information on the cell-type
+specificity of inhibitory connections, see Yoshimura and Callaway (2005). Figure 2
+provides a summary of key excitatory and inhibitory intralaminar connections.
+
+Microcircuits in the sensorimotor cortex
+
+Do the features of visual microcircuits generalize to other cortical areas? Recently, two
+studies have mapped the intrinsic connectivity of mouse sensory and motor cortices: Lefort
+and colleagues (2009) used multiple whole-cell recordings in mouse barrel cortex to
+determine the probability of monosynaptic connections – and the corresponding connection
+strength. As in visual cortex, the strongest connections were intralaminar and the strongest
+interlaminar connections were the ascending L4 to L2, and descending L3 to L5.
+
+One puzzle about canonical microcircuits is whether motor cortex has a local circuitry that is
+qualitatively similar to sensory cortex. This question is important because motor cortex lacks
+a clearly defined granular L4 (a property that earns it the name “agranular cortex”). Weiler
+and colleagues combined whole-cell recordings in mouse motor cortex with photo-
+stimulation to uncage glutamate (Weiler et al., 2008). This allowed them to systematically
+stimulate the cortical column in a grid – centred on the pyramidal neuron from which they
+recorded. By recording from pyramidal neurons in L2-6 (L1 lacks pyramidal cells), the
+authors mapped the excitatory influence that each layer exerts over the others. They found
+that the L2/3 to L5A/B connection was the strongest – accounting for one-third of the total
+synaptic current in the circuit. The second strongest interlaminar connection was the
+reciprocal L5A to L2/3 connection. This pathway may be homologous to the prominent
+L4/5A to L2/3 pathway in sensory cortex. Also – as in sensory cortex – recurrent
+(intralaminar) connections were prominent, particularly in L2, L5A/B and L6. The largest
+fraction of synaptic input arrived in L5A/B, consistent with its key role in accumulating
+information from a wide range of afferents – before sending its output to the corticospinal
+tract. In summary, strong input-layer to superficial, and superficial to deep connectivity,
+together with strong intralaminar connectivity, suggests that the intrinsic circuitry of motor
+cortex is similar to other cortical areas.
+
+The anatomy and physiology of extrinsic connections
+
+Clearly, an account of microcircuits must refer to the layers of origin of extrinsic
+connections and their laminar targets. Although the majority of presynaptic inputs arise from
+intrinsic connections, cortical areas are also richly interconnected, where the balance
+between intrinsic and extrinsic processing mediates functional integration among specialised
+cortical areas (Engel et al., 2010). By numbers alone, intrinsic connections appear to
+dominate – 95% of all neurons labelled with a retrograde tracer lie within about 2 mm of the
+injection site (Markov et al., 2011). The remaining 5% represent cells giving rise to extrinsic
+connections, which – although sparse – can be extremely effective in driving their targets. A
+case in point is the LGN to V1 connection: although it is only the sixth strongest connection
+to V1, LGN afferents have a substantial effect on V1 responses (Markov et al., 2011).
+
+Hierarchies and functional asymmetries—Current dogma holds that the cortex is
+hierarchically organized. The idea of a cortical hierarchy rests on the distinction between
+three types of extrinsic connections: feedforward connections, that link an earlier area to a
+higher area, feedback connections, that link a higher to an earlier area, and lateral
+connections, that link areas at the same level (reviewed in Felleman and Van Essen, 1991).
+These connections are distinguished by their laminar origins and targets. Feedforward
+connections originate largely from superficial pyramidal cells and target L4, while feedback
+connections originate largely from deep pyramidal cells and terminate outside of L4
+(Felleman and Van Essen, 1991). Clearly, this description of cortical hierarchies is a
+simplification and can be nuanced in many ways: for example, as the hierarchical distance
+between two areas increases, the percentage of cells that send feedforward (resp. feedback)
+projections from a lower (resp. higher) level becomes increasingly biased towards the
+superficial (resp. deep) layers (Barone et al., 2000; Vezoli, 2004).
+
+In addition to the laminar specificity of their origins and targets, feedforward and feedback
+connections also differ in their synaptic physiology. The traditional view holds that
+feedforward connections are strong and driving, capable of eliciting spiking activity in their
+targets and conferring classical receptive field properties – the prototypical example being
+the synaptic connection between LGN and V1 (Sherman and Guillery, 1998). Feedback
+connections are thought to modulate (extra-classical) receptive field characteristics
+according to the current context; e.g., visual occlusion, attention, salience, etc. The
+prototypical example of a feedback connection is the cortical L6 to LGN connection.
+Sherman and Guillery identified several properties that distinguish drivers from modulators.
+Driving connections tend to show a strong ionotropic component in their synaptic response,
+evoke large EPSPs, and respond to multiple EPSPs with depressing synaptic effects.
+Modulatory connections produce metabotropic and ionotropic responses when stimulated,
+evoke weak EPSPs, and show paired-pulse facilitation (Sherman and Guillery, 1998; 2011).
+
+These distinctions were based upon the inputs to the LGN, where retinal input is driving and
+cortical input is modulatory. Until recently, little data were available to assess whether a
+similar distinction applies to cortico-cortical feedforward and feedback connections.
+However, recent studies show that cortical feedback connections express not only
+modulatory but also driving characteristics:
+
+Are feedback connections driving, modulatory or both?—Although it is generally
+thought that feedback connections are weak and modulatory (Crick and Koch, 1998;
+Sherman and Guillery, 1998), recent evidence suggests that feedback connections do more
+than modulate lower level responses: Sherman and colleagues recorded cells in mouse areas
+V1/V2 and A1/A2, while stimulating feedforward or feedback afferents. In both cases,
+driving-like responses as well as modulatory-like responses were observed (Covic and
+Sherman, 2011; De Pasquale and Sherman, 2011). This indicates that – for these
+hierarchically proximate areas – feedback connections can drive their targets just as strongly
+as feedforward connections. This is consistent with earlier studies showing that feedback
+connections can be driving: Mignard and Malpeli (1991) studied the feedback connection
+between areas 18 and 17, while layer A of the LGN was pharmacologically inactivated. This
+silenced the cells in L4 in area 17 but spared activity in superficial layers. However,
+superficial cells were silenced when area 18 was lesioned. This is consistent with a driving
+effect of feedback connections from area 18, in the absence of geniculate input. In summary,
+feedback connections can mediate modulatory and driving effects. This is important from
+the point of view of predictive coding, because top-down predictions have to elicit
+obligatory responses in their targets (cells reporting prediction errors):
+
+In predictive coding, feedforward connections convey prediction errors, while feedback
+connections convey predictions from higher cortical areas to suppress prediction errors in
+lower areas. In this scheme, feedback connections should therefore be capable of exerting
+strong (driving) influences on earlier areas to suppress or counter feedforward driving
+inputs. However, as we will see later, these influences also need to exert nonlinear or
+modulatory effects. This is because top-down predictions are necessarily context sensitive:
+e.g., the occlusion of one visual object by another. In short, predictive coding requires
+feedback connections to drive cells in lower levels in a context sensitive fashion, which
+necessitates a modulatory aspect to their postsynaptic effects.
+
+Are feedback connections excitatory or inhibitory?—Crucially, because feedback
+connections convey predictions – that serve to explain and thereby reduce prediction errors
+in lower levels – their effective (polysynaptic) connectivity is generally assumed to be
+inhibitory. An overall inhibitory effect of feedback connections is consistent with in vivo
+studies. For example, electrophysiological studies of the mismatch negativity suggest that
+neural responses to deviant stimuli – that violate sensory predictions established by a regular
+stimulus sequence – are enhanced relative to predicted stimuli (Garrido et al., 2009).
+Similarly, violating expectations of auditory repetition causes enhanced gamma-band
+responses in early auditory cortex (Todorovic et al., 2011). These enhanced responses are
+thought to reflect an inability of higher cortical areas to predict, and thereby suppress, the
+activity of populations encoding prediction error (Garrido et al., 2007; Wacongne et al.,
+2011). The suppression of predictable responses can also be regarded as repetition
+suppression, observed in single unit recordings from the inferior temporal cortex of macaque
+monkeys (Desimone, 1996). Furthermore, neurons in monkey inferotemporal cortex respond
+significantly less to a predicted sequence of natural images, compared to an unpredicted
+sequence (Meyer and Olson, 2011).
+
+The inhibitory effect of feedback connections is further supported by neuroimaging studies
+(Murray et al., 2002; Murray, 2005; Harrison et al., 2007; Summerfield et al., 2008, 2011;
+Alink et al., 2010). These studies show that predictable stimuli evoke smaller responses in
+early cortical areas. Crucially, this suppression cannot be explained in terms of local
+adaptation, because the attributes of the stimuli that can be predicted are not represented in
+early sensory cortex (e.g., Harrison et al. 2007). It should be noted that the suppression of
+responses to predictable stimuli can coexist with (top-down) attentional enhancement of
+signal processing (Wyart et al., 2012): in predictive coding, attention is mediated by
+increasing the gain of populations encoding prediction error (Spratling, 2008; Feldman and
+Friston, 2010). The resulting attentional modulation (e.g., Hopfinger et al., 2000) can
+interact with top-down predictions to override their suppressive influence – as demonstrated
+empirically (Kok et al., 2011). See Buschman and Miller (2007), Saalmann et al. (2007),
+Anderson et al. (2011), and Armstrong et al. (2012) for further discussion of top-down
+connections in attention.
+
+Further evidence for the inhibitory (suppressive) effect of feedback connections comes from
+neuropsychology: Patients with damage to the prefrontal cortex (PFC) show disinhibition of
+event related potential responses (ERP) to repeating stimuli (Knight et al., 1989; Yamaguchi
+and Knight, 1990; but see Barceló et al., 2000). In contrast, they show reduced-amplitude
+P300 ERPs in response to novel stimuli – as if there were a failure to communicate top-
+down predictions to sensory cortex (Knight, 1984). Furthermore, normal subjects show a
+rapid adaptation to deviant stimuli as they become predictable – an effect not seen in
+prefrontal patients.
+
+Several invasive studies complement these human studies in suggesting an overall inhibitory
+role for feedback connections. In a recent seminal study, Olsen et al. studied corticothalamic
+feedback between L6 of V1 and the LGN using transgenic expression of channelrhodopsin
+in L6 cells of V1. By driving these cells optogenetically – while recording units in V1 and
+the LGN – the authors showed that deep L6 principal cells inhibited their extrinsic targets in
+the LGN and their intrinsic targets in cortical layers 2 to 5 (Olsen et al., 2012). This
+suppression was powerful – in the LGN, visual responses were suppressed by 76%.
+Suppression was also high in V1, around 80-84% (Olsen et al., 2012). This evidence is in
+line with classical studies of corticogeniculate contributions to length tuning in the LGN,
+showing that cortical feedback contributes to the surround suppression of feline LGN cells:
+without feedback, LGN cells are disinhibited and show weaker surround suppression
+(Murphy and Sillito, 1987; Sillito et al., 1993; but see Alitto and Usrey, 2008).
+
+While these studies provide convincing evidence that cortical feedback to the LGN is
+inhibitory, the evidence is more complicated for corticocortical feedback connections
+(Sandell and Schiller, 1982; Johnson and Burkhalter, 1996, 1997). Hupé and colleagues
+cooled area V5/MT while recording from areas V1, V2, and V3 in the monkey. When visual
+stimuli were presented in the classical receptive field (CRF), cooling of area V5/MT
+decreased unit activity in earlier areas, suggesting an excitatory effect of extrinsic feedback
+(Hupé et al., 1998). However, when the authors used a stimulus that spanned the extra-
+classical RF, the responses of V1 neurons were – on average – enhanced after cooling area
+V5, consistent with the suppressive role of feedback connections. These results indicate that
+the inhibitory effects of feedback connections may depend on (natural) stimuli that require
+integration over the visual field. Similar effects were observed when area V2 was cooled and
+neurons were measured in V1: when stimuli were presented only to the CRF, cooling V2
+decreased V1 spiking activity; however, when stimuli were present in the CRF and the
+surround, cooling V2 increased V1 activity (Bullier et al., 1996). Finally, others have argued
+for an inhibitory effect of feedback based on the timing and spatial extent of surround
+suppression in monkey V1 – concluding that the far surround suppression effects were most
+likely mediated by feedback (Bair et al., 2003).
+
+The empirical finding that feedback connections can both facilitate and suppress firing in
+lower hierarchical areas – depending on the content of classical and extra-classical receptive
+fields – is consistent with predictive coding: Rao and Ballard (1999) trained a hierarchical
+predictive coding network to recognise natural images. They showed that higher levels in
+the hierarchy learn to predict visual features that extend across many CRFs in the lower
+levels (e.g. tree trunks or horizons). Hence, higher visual areas come to predict that visual
+stimuli will span the receptive fields of cells in lower visual areas. In this setting, a stimulus
+that is confined to a CRF would elicit a strong prediction error signal (because it cannot be
+predicted). This provides a simple explanation for the findings of Hupé et al. (1998) and
+Bullier et al. (1996): when feedback connections are deactivated, there are no top-down
+predictions to explain responses in lower areas – leading to a disinhibition of responses in
+earlier areas, when – and only when – stimuli can be predicted over multiple CRFs.
+
+Feedback connections and layer 1—How might the inhibitory effect of feedback
+connections be mediated? The established view is that extrinsic corticocortical connections
+are exclusively excitatory (using glutamate as their excitatory neurotransmitter) – although
+recent evidence suggests that inhibitory extrinsic connections exist and may play an
+important role in synchronizing distant regions (Melzer et al., 2012). However, one
+important route by which feedback connections could mediate selective inhibition is via
+their termination in L1 (Anderson and Martin, 2006; Shipp, 2007): layer 1 is sometimes
+referred to as acellular due to its pale appearance with Nissl staining (the classical method
+for separating layers that selectively labels cell bodies). Indeed, a recent study concluded
+that L1 contains less than 0.5% of all cells in a cortical column (Meyer et al., 2011). These
+L1 cells are almost all inhibitory and interconnect strongly with each other, via electrical
+connections and chemical synapses (Chu et al., 2003). Simultaneous whole cell patch clamp
+recordings show that they provide strong monosynaptic inhibition to L2/3 pyramidal cells,
+whose apical dendrites project into L1 (Chu et al., 2003; Wozny and Williams, 2011). This
+means L1 inhibitory cells are in a prime position to mediate inhibitory effects of extrinsic
+feedback. The laminar location highlighted by these studies – the bottom of L1 and the top
+of L2/3 – has recently been shown to be a “hotspot” of inhibition in the column (Meyer et
+al., 2011). Indeed, a study of rat barrel cortex – that stimulated (and inactivated) L1 –
+showed that it exerts a powerful inhibitory effect on whisker-evoked responses (Shlosberg et
+al., 2006). These studies suggest that corticocortical feedback connections could deliver
+strong inhibition, if they were to recruit the inhibitory potential of L1.
+
+In terms of the excitatory and modulatory effect of feedback connections, predictive input
+from higher cortical areas might have an important impact via the distal dendrites of
+pyramidal neurons (Larkum et al., 2009). Furthermore, there is a specific type of
+GABAergic neuron that appears to control distal dendritic excitability, gating top down
+excitatory signals differentially during behavior (Gentet et al., 2012). Table 1 summarises
+the studies we have discussed in relation to the role of feedback connections.
+
+Feedforward and transthalamic connections
+
+While the evidence for an inhibitory effect of feedback connections has to be evaluated
+carefully, the evidence for an excitatory effect of feedforward connections is unequivocal.
+For example, in the monkey, V1 projects monosynaptically to V2, V3, V3a, V4, and V5/MT
+(Zeki, 1978; Zeki and Shipp, 1988). In all cases – when V1 is reversibly inactivated through
+cooling – single-cell activity in target areas is strongly suppressed (Girard and Bullier, 1989;
+Girard et al., 1991a, 1991b, 1992). In the cases of V2 and V3, the result of cooling area V1
+is a near-total silencing of single unit activity. These studies illustrate that activity in higher
+cortical areas depends on driving inputs from earlier cortical areas that establish their
+receptive field properties.
+
+Finally, while many studies have focused on extrinsic connections that project directly from
+one cortical area to the next, there is mounting evidence that feedforward driving
+connections (and perhaps feedback) in the cortex could be mediated by transthalamic
+pathways (Sherman and Guillery, 1998, 2011). The strongest evidence for this claim comes
+from the somatosensory system, where it was shown recently that the posterior medial
+nucleus of the thalamus (POm) – a higher-order thalamic nucleus that receives direct input
+from cortex – can relay information between S1 and S2 (Theyel et al., 2009). In addition, the
+thalamic reticular nucleus has been proposed to mediate the inhibition that might underlie
+cross-modal attention or top-down predictions (Yamaguchi and Knight, 1990; Crick, 1984;
+Wurtz et al., 2011). Furthermore, computational considerations and recent experimental
+findings point to a potentially important role for higher-order thalamic nuclei in coordinating
+and synchronizing cortical responses (Vicente et al., 2008; Saalmann et al., 2012). The
+degree to which cortical areas are integrated directly via corticocortical or indirectly via
+cortico-thalamo-cortical connections – and the extent to which transthalamic pathways
+dissociate feedforward from feedback connections in the same way as we have proposed for
+the cortico-cortical connections – are open questions.
+
+The canonical microcircuit
+
+Central to the idea of a canonical microcircuit is the notion that a cortical column contains
+the circuitry necessary to perform requisite computations, and that these circuits can be
+replicated with minor variations throughout the cortex. One of the clearest examples of how
+cortical circuits process simple inputs – to generate complex outputs – is the emergence of
+orientation tuning in V1. Orientation tuning is a distinctly cortical phenomenon because
+geniculocortical relay cells show no orientation preferences. A further elaboration of cortical
+responses can be found in the distinction between simple and complex cells – while simple
+cells possess spatially confined receptive fields, complex cells are orientation-tuned but
+show less preference for the location of an oriented bar. Hubel and Wiesel proposed a model
+for how intrinsic and extrinsic connectivity could establish a circuit explaining these
+receptive field properties. They proposed that orientation tuning in simple cells could be
+generated by a single cortical cell receiving input from several ON centre – OFF surround
+geniculate cells arranged along a particular orientation, thereby endowing it with a
+preference for bars oriented in a particular direction (Hubel and Wiesel, 1962). Complex
+cells were hypothesized to receive inputs from several simple cells – with the same
+orientation preference and slightly varying receptive field locations. Thus, complex cells
+were thought not to receive direct LGN input but to be higher order cells in cortex.
+Subsequent findings supported these predictions, showing that input layer 4Cα and 4Cβ
+contained the largest proportion of cells receiving monosynaptic geniculate input, while
+superficial and deep layer cells contain a larger number of cells receiving disynaptic or
+polysynaptic input (Bullier and Henry, 1980). Furthermore, simple cells project
+monosynaptically onto complex cells, where they exert a strong feedforward influence
+(Alonso and Martinez, 1998; Alonso, 2002). These models suggest that intrinsic cortical
+circuitry allows processing to proceed along discrete steps that are capable of producing
+response properties in outputs that are not present in inputs.
+
+Segregation of processing streams
+
+A key property of canonical circuits is the segregation of parallel streams of processing. For
+example, in primates, parvocellular input enters the cortex primarily in layer 4Cβ, whereas
+magnocellular inputs enter in 4Cα. The corticogeniculate feedback pathway from L6
+maintains this segregation, as upper L6 cells preferentially synapse onto parvocellular cells
+in the LGN, while lower L6 cells target the magnocellular LGN layers (Fitzpatrick et al.,
+1994; Briggs and Usrey, 2009). Further examples of stream segregation are also present in
+
+Bastos et al.
+Page 8
+
+Neuron. Author manuscript; available in PMC 2013 September 19.
+
+NIH-PA Author Manuscript
+
+
+the dorsal “where” and the ventral “what” pathways, and in the projection from V1 to the
+thick, thin, and inter-stripe regions of V2 (Zeki and Shipp, 1988; Sincich and Horton, 2005).
+
+Superficial and deep layers are anatomically interconnected, but mounting evidence suggests
+that they constitute functionally distinct processing streams: in an elegant experiment,
+Roopun et al. (2006) showed that L2/3 of rat somatomotor cortex shows prominent gamma
+oscillations that are co-expressed with beta oscillations in L5. Both rhythms persisted when
+superficial and deep layers were disconnected at the level of L4. Maier and colleagues
+(2010) used multilaminar recordings to show strong LFP coherence amongst sites within the
+superficial layers (the superficial compartment), as well as strong coherence amongst sites in
+deep layers (the deep compartment), but weak inter-compartment coherence. These studies
+indicate a segregation of – potentially autonomous – supragranular and infragranular
+dynamics. Maier et al. found that supragranular sites had higher broadband gamma power
+than infragranular sites. This pattern was reversed in the alpha and beta range, with greater
+power in the infragranular and granular layers. Finally, the spiking activity of neurons in the
+superficial layers of visual cortex is more coherent with gamma frequency oscillations in
+the local field potential, while neurons in deep layers are more coherent with alpha
+frequency oscillations (Buffalo et al., 2011). This finding is consistent with an earlier study
+by Livingstone (1996) showing that 50% of cells in L2/3 of squirrel monkey V1 expressed
+gamma oscillations, compared to less than 20% of cells in L4C and infragranular layers. The
+different spectral behaviour of superficial and deep layers has led to the interesting proposal
+that feedforward and feedback signalling may be mediated by distinct (high and low)
+frequencies (reviewed in Wang, 2010, see also Buschman and Miller, 2007 in the context of
+attention), a proposal that has recently received experimental support - at least for the
+feedforward connections (Bosman et al., 2012; see also Gregoriou et al., 2009).
+
+Integration and segregation within canonical circuits—Given this functional and
+anatomical segregation into parallel streams, the question naturally arises, how are these
+streams integrated? It has been previously suggested that integration occurs through the
+synchronized firing of multiple neurons that form a neural ensemble (Gray et al., 1989;
+Singer, 1999), while others have emphasized inter-areal phase-synchronization or coherence
+(Varela et al., 2001; Fries, 2005; Fujisawa and Buzsáki, 2011). While a full treatment of this
+question is beyond the scope of the current review, we propose that the canonical
+microcircuit contains a clue for how the dialectic between segregation and integration might
+be resolved. While top-down and bottom-up inputs and outputs may be segregated in layers,
+streams, and frequency bands, the canonical microcircuit specifies the circuitry for how the
+basic units of cortex are interconnected and therefore how the intrinsic activity of the
+cortical column is entrained by extrinsic inputs. This intrinsic connectivity specifies how the
+cells of origin and termination of extrinsic projections are interconnected, and thus
+determine how top-down and bottom-up streams are integrated within each cortical column.
+
+Spatial segregation and cortical columns
+
+The notion of a canonical microcircuit implicitly assumes that each circuit is distinct from
+its neighbours, which could presumably carry out computations in parallel. Therefore, the
+canonical microcircuit specifies the spatial scale over which processing is integrated. The
+most likely candidate for this spatial scale is the cortical column, which can vary over three
+orders of magnitude between minicolumns, columns, and hypercolumns. Minicolumns are
+only a few cells wide, estimated to be about 50-60 micrometers in diameter by Mountcastle
+(1997) and are seen in Nissl sections of cortex as slight variations in cell density.
+Minicolumns were originally proposed as elementary units of cortex by Lorente de No
+(1949) and appear to reflect the migration of cells from the ventricular zone to the cortical
+sheet during foetal development (reviewed in Horton and Adams, 2005). Hubel and Wiesel
+estimated that orientation columns were on this order of magnitude, about 25-50
+micrometers wide, although they failed to establish a correspondence between orientation
+columns observed physiologically and the minicolumns seen in Nissl sections (Hubel and
+Wiesel, 1974). A cortical column was classically defined as a vertical alignment of cells
+containing neurons with similar receptive field properties, such as orientation preference and
+ocular dominance in V1; or touch in somatosensory cortex (Mountcastle, 1957; Hubel and
+Wiesel, 1972). These columns were suggested by Mountcastle to encompass a number of
+minicolumns, with a width of 300-400 micrometers (Mountcastle, 1997). Finally, Hubel and
+Wiesel defined a hypercolumn to be the unit of cortex necessary to traverse all possible
+values of a particular receptive field property, such as orientation or eye dominance –
+estimated to be between 0.5 and 1 mm wide (Hubel and Wiesel, 1974).
+
+Columns, connections and computations—So is the cortical column the basic unit
+of cortical computation? Some authors emphasize that even within a dendrite, there are all
+the necessary biophysical mechanisms for performing surprisingly advanced computations,
+such as direction selectivity, coincidence detection, or temporal integration (Häusser and
+Mel, 2003; London and Häusser, 2005). Others argue that single neurons can process their
+inputs at the dendrite, soma, and initial segment, such that the output spike trains of just two
+interconnected cells could mediate computations like independent components analysis
+(Klampfl et al., 2009). Others posit that cortical columns form the basic computational unit
+(Mountcastle, 1997; Hubel and Wiesel, 1972; but see Horton and Adams, 2005). Donald
+Hebb proposed that neurons distributed over several cortical areas could form a functional
+computational unit called a neural assembly (Hebb, 1949). This view has re-emerged in
+recent years, with the development of the requisite recording and analytic techniques for
+evaluating this proposal (Buzsáki, 2010; Canolty et al., 2010; Singer et al., 1997;
+Lopes-dos-Santos et al., 2011).
+
+Computational modelling studies indicate that cortical columns with structured connectivity
+are computationally more efficient than a network containing the same number of neurons
+but with random connectivity (Haeusler and Maass, 2007). Others suggest that this circuitry
+allows the cortex to organize and integrate bottom-up, lateral, and top-down information
+(Ullman, 1995; Raizada and Grossberg, 2003). Douglas and Martin suggest that the rich
+anatomical connectivity of L2/3 pyramidal cells allows them to collect information from
+top-down, lateral, and bottom-up inputs, and – through processing in the dendritic tree –
+select the most likely interpretation of their inputs. More recently, George and Hawkins have
+suggested that the canonical microcircuit implements a form of Bayesian processing
+(George and Hawkins, 2009). In the following section, we pursue similar ideas, but ground
+them in the framework of predictive coding, and propose a cortical circuit that could
+implement predictive coding through canonical interconnections. In particular, we find that
+the proposed circuitry agrees remarkably well with quantitative characterisations of the
+canonical microcircuit (Haeusler and Maass, 2007).
+
+A canonical microcircuit for predictive coding
+
+This section considers the computational role of cortical microcircuitry in more detail. We
+try to show that the computations performed by canonical microcircuits can be specified
+more precisely than one might imagine and that these computations can be understood
+within the framework of predictive coding. In brief, we will show that (hierarchical
+Bayesian) inference about the causes of sensory input can be cast as predictive coding. This
+is important because it provides formal constraints on the dynamics one would expect to
+find in neuronal circuits. Having established these constraints, we then attempt to match
+them with the neurobiological constraints afforded by the canonical microcircuit. The
+endpoint of this exercise is a canonical microcircuit for predictive coding.
+
+Predictive coding and the free energy principle
+
+It might be thought impossible to specify the computations performed by the brain.
+However, there are some fairly fundamental constraints on the basic form of neuronal
+dynamics. The argument goes as follows – and can be regarded as a brief summary of the
+free energy principle (see Friston, 2010 for details):
+
+• Biological systems are homoeostatic (or allostatic), which means that they
+minimise the dispersion (entropy) of their interoceptive and exteroceptive states.
+
+• Entropy is the average of surprise over time, which means biological systems
+minimise the surprise associated with their sensory states at each point in time.
+
+• In statistics, surprise is the negative logarithm of Bayesian model evidence, which
+means biological systems – like the brain – must continually maximise the
+Bayesian evidence for their (generative) model of sensory inputs.
+
+• Maximising Bayesian model evidence corresponds to Bayesian filtering of sensory
+inputs. This is also known as predictive coding.
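+
+The chain of reasoning above can be written compactly. The following is only a sketch under
+assumed notation that does not appear in the text: $s(t)$ denotes sensory states, $m$ the
+generative model, $q(\vartheta)$ an approximate posterior over hidden variables $\vartheta$, and
+$F$ the variational free energy. The first line additionally assumes ergodicity.
+
+```latex
+% Assumed symbols: s(t) sensory states, m generative model, q approximate posterior.
+\begin{aligned}
+% Entropy as the long-term average of surprise (ergodic assumption)
+H[p(s \mid m)] &= \lim_{T \to \infty} \frac{1}{T} \int_0^T -\ln p\big(s(t) \mid m\big)\, dt \\
+% Free energy is surprise plus a non-negative divergence, hence an upper bound on surprise
+F &= -\ln p(s \mid m) + D_{\mathrm{KL}}\big[\, q(\vartheta) \,\|\, p(\vartheta \mid s, m) \,\big] \;\ge\; -\ln p(s \mid m)
+\end{aligned}
+```
+
+Under these assumptions, minimising $F$ at each point in time both bounds surprise (and hence
+entropy) and drives $q(\vartheta)$ towards the true posterior, which is the sense in which
+maximising model evidence corresponds to Bayesian filtering.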
+
+These arguments mean that by minimising surprise, through selecting appropriate
+sensations, the brain is implicitly maximising the evidence for its own existence – this is
+known as active inference. In other words, to maintain homoeostasis the brain must predict
+its sensory states on the basis of a model. Fulfilling those predictions corresponds to
+accumulating evidence for that model – and the brain that embodies it. The implicit
+maximisation of Bayesian model evidence provides an important link to the Bayesian brain
+hypothesis (Hinton and van Camp, 1993; Dayan et al., 1995; Knill and Pouget, 2004) and
+many other compelling proposals about perceptual synthesis – including analysis by
+synthesis (Neisser, 1967; Yuille and Kersten, 2006), epistemological automata (MacKay,
+1956), the principle of minimum redundancy (Attneave, 1954; Barlow, 1961; Dan et
+al., 1996), the Infomax principle (Linsker, 1990; Atick, 1992; Kay and Phillips, 2011), and
+perception as hypothesis testing (Gregory, 1968; 1980).
+
+The most popular scheme – for Bayesian filtering in neuronal circuits – is predictive coding
+(Srinivasan et al., 1982; Buchsbaum and Gottschalk, 1983; Rao and Ballard, 1999). In this
+context, surprise corresponds (roughly) to prediction error. In predictive coding, top-down
+predictions are compared with bottom-up sensory information to form a prediction error.
+This prediction error is used to update higher-level representations – upon which top-down
+predictions are based. These optimised predictions then reduce prediction error at lower
+levels.
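+
+The scheme just described can be illustrated with a deliberately minimal worked example. This
+is a sketch, not the authors' model: it assumes a single level, a scalar cause with expectation
+$\mu$, an identity mapping from cause to sensation, a sensory sample $s$ with precision $\Pi_s$,
+and a prior (top-down) expectation $\eta$ with precision $\Pi_p$. All of these symbols are
+introduced here for illustration.
+
+```latex
+% Illustrative single-level example; s, eta, Pi_s, Pi_p are assumed symbols.
+\begin{aligned}
+\xi_s &= \Pi_s (s - \mu), \qquad \xi_p = \Pi_p (\mu - \eta)
+  && \text{precision-weighted prediction errors} \\
+\dot{\mu} &= \xi_s - \xi_p
+  && \text{gradient descent on } \tfrac{1}{2}\Pi_s (s-\mu)^2 + \tfrac{1}{2}\Pi_p (\mu-\eta)^2 \\
+\mu^{*} &= \frac{\Pi_s s + \Pi_p \eta}{\Pi_s + \Pi_p}
+  && \text{fixed point } (\dot{\mu} = 0) \\
+\mu^{*} &= \frac{3 \cdot 2 + 1 \cdot 0}{3 + 1} = 1.5
+  && \text{for } s = 2,\ \eta = 0,\ \Pi_s = 3,\ \Pi_p = 1
+\end{aligned}
+```
+
+In this toy case the optimised expectation is a precision-weighted average of the bottom-up
+sample and the top-down prediction, so the more precise source dominates the estimate.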
+
+To predict sensations, the brain must be equipped with a generative model of how its
+sensations are caused (Helmholtz, 1860). Indeed, this led Geoffrey Hinton and colleagues to
+propose that the brain is an inference (Helmholtz) machine (Hinton and Zemel, 1994; Dayan
+et al., 1995). A generative model describes how variables or causes in the environment
+conspire to produce sensory input. Generative models map from (hidden) causes to (sensory)
+consequences. Perception then corresponds to the inverse mapping from sensations to their
+causes, while action can be thought of as the selective sampling of sensations. Crucially, the
+form of the generative model dictates the form of the inversion – for example, predictive
+coding. Figure 3 depicts a general model as a probabilistic graphical model. A special case
+of these models is the class of hierarchical dynamic models (see Figure 4), which grandfathers most
+parametric models in statistics and machine learning (see Friston, 2008). These models
+explain sensory data in terms of hidden causes and states. Hidden causes and states are both
+hidden variables that cause sensations but they play slightly different roles: hidden causes
+link different levels of the model and mediate conditional dependencies among hidden states
+at each level. Conversely, hidden states model conditional dependencies over time (i.e.,
+memory) by modelling dynamics in the world. In short, hidden causes and states mediate
+structural and dynamic dependencies respectively.
+
+The details of the graph in Figure 3 are not important; it just provides a way of describing
+conditional dependencies among hidden states and causes responsible for generating sensory
+input. These dependencies mean that we can interpret neuronal activity as message passing
+among the nodes of a generative model, where each canonical microcircuit contains
+representations or expectations about hidden states and causes. In other words, the form of
+the underlying generative model defines the form of the predictive coding architecture used
+to invert the model. This is illustrated in Figure 4, where each node has a single parent. We
+will deal with this simple sort of model because it lends itself to an unambiguous description
+in terms of bottom-up (feedforward) and top-down (feedback) message passing. We now
+look at how perception or model inversion – recovering the hidden states and causes of this
+model given sensory data – might be implemented at the level of a microcircuit.
+
+Predictive coding and message passing
+
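+In predictive coding, representations (or conditional expectations) generate top-down
+predictions to produce prediction errors. These prediction errors are then passed up the
+hierarchy in the reverse direction, to update conditional expectations. This ensures an
+accurate prediction of sensory input and all its intermediate representations. This hierarchical
+message passing can be expressed mathematically as a gradient descent on the (sum of
+squared) prediction errors $\xi^{(i)} = \Pi^{(i)} \tilde{\varepsilon}^{(i)}$, where the prediction errors are weighted by their
+precision (inverse variance).
+
+\[
+\begin{aligned}
+\dot{\tilde{\mu}}_v^{(i)} &= D\tilde{\mu}_v^{(i)} - \partial_{\tilde{v}} \tilde{\varepsilon}^{(i)} \cdot \xi^{(i)} - \xi_v^{(i+1)} \\
+\dot{\tilde{\mu}}_x^{(i)} &= D\tilde{\mu}_x^{(i)} - \partial_{\tilde{x}} \tilde{\varepsilon}^{(i)} \cdot \xi^{(i)} \\
+\xi_v^{(i)} &= \Pi_v^{(i)} \tilde{\varepsilon}_v^{(i)} = \Pi_v^{(i)} \big( \tilde{\mu}_v^{(i-1)} - g^{(i)}(\tilde{\mu}_x^{(i)}, \tilde{\mu}_v^{(i)}) \big) \\
+\xi_x^{(i)} &= \Pi_x^{(i)} \tilde{\varepsilon}_x^{(i)} = \Pi_x^{(i)} \big( D\tilde{\mu}_x^{(i)} - f^{(i)}(\tilde{\mu}_x^{(i)}, \tilde{\mu}_v^{(i)}) \big)
+\end{aligned}
+\tag{1}
+\]
+
+The first pair of equalities just says that conditional expectations about hidden causes and
+states $(\tilde{\mu}_v^{(i)}, \tilde{\mu}_x^{(i)})$ are updated based upon the way we would predict them to change – the first
+term – and subsequent terms that minimise prediction error. The second pair of equations
+simply expresses prediction error $(\xi_v^{(i)}, \xi_x^{(i)})$ as the difference between conditional
+expectations about hidden causes and (the changes in) hidden states and their predicted
+values, weighted by their precisions $(\Pi_v^{(i)}, \Pi_x^{(i)})$. These predictions are nonlinear functions of
+conditional expectations $(g^{(i)}, f^{(i)})$ at each level of the hierarchy and the level above.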
+
+It is difficult to overstate the generality and importance of Equation (1) – it grandfathers
+nearly every known statistical estimation scheme, under parametric assumptions about
+additive noise. These range from ordinary least squares to advanced Bayesian filtering
+schemes (see Friston 2008). In this general setting, Equation (1) minimises variational free
+energy and corresponds to generalised predictive coding. Under linear models, it reduces to
+linear predictive coding, also known as Kalman-Bucy filtering (see Friston et al, 2010 for
+details).
+
+In neuronal network terms, Equation (1) says that prediction error units receive messages
+from the same level and the level above. This is because the hierarchical form of the model
+only requires conditional expectations from neighbouring levels to form prediction errors, as
+can be seen schematically in Figure 4. Conversely, expectations are driven by prediction
+error from the same level and the level below – updating expectations about hidden states
+and causes respectively. These constitute the bottom-up and lateral messages that drive
+conditional expectations to provide better predictions – or representations – that suppress
+prediction error. This updating corresponds to an accumulation of prediction errors – in that
+the rate of change of conditional expectations is proportional to prediction error.
+Electrophysiologically, this means that one would expect to see a transient prediction error
+response to bottom-up afferents (in neuronal populations encoding prediction error) that is
+suppressed to baseline firing rates by sustained responses (in neuronal populations encoding
+predictions). This is the essence of recurrent message passing between hierarchical levels to
+suppress prediction error (see Friston 2008 for a more detailed discussion).
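+
+The transient-then-suppressed error response can be made concrete with a toy calculation. This
+is a sketch under assumptions that are not in the text: a single scalar expectation $\mu$
+with dynamics $\dot{\mu} = \Pi_s (s - \mu) - \Pi_p (\mu - \eta)$, where $s$ is a constant
+sensory input, $\eta$ a constant top-down prediction, and $\Pi_s$, $\Pi_p$ their precisions.
+
+```latex
+% Illustrative linear example; s, eta, Pi_s, Pi_p and mu(0) are assumed quantities.
+\begin{aligned}
+\mu(t) &= \mu^{*} + \big(\mu(0) - \mu^{*}\big)\, e^{-(\Pi_s + \Pi_p)\, t},
+  \qquad \mu^{*} = \frac{\Pi_s s + \Pi_p \eta}{\Pi_s + \Pi_p} \\
+\xi_s(t) &= \Pi_s \big(s - \mu(t)\big)
+  \;\longrightarrow\; \frac{\Pi_s \Pi_p (s - \eta)}{\Pi_s + \Pi_p} \quad (t \to \infty) \\
+% Numbers: s = 2, eta = 0, Pi_s = 3, Pi_p = 1, mu(0) = 0
+\xi_s(0) &= 3 \cdot (2 - 0) = 6, \qquad \xi_s(\infty) = \frac{3 \cdot 1 \cdot 2}{4} = 1.5
+\end{aligned}
+```
+
+In this example the error is large at stimulus onset and decays exponentially as the
+expectation accumulates it, settling at a smaller residual set by the disagreement between
+the input and the prior.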
+
+The nature of this message passing is remarkably consistent with the anatomical and
+physiological features of cortical hierarchies. An important prediction is that the nonlinear
+functions of the generative model – modelling context sensitive dependencies among hidden
+variables – appear only in the top-down and lateral predictions. This means that,
+neurobiologically, we would predict feedback connections to possess nonlinear or
+neuromodulatory characteristics, in contrast to feedforward connections that mediate a linear
+mixture of prediction errors. This functional asymmetry is exactly consistent with the
+empirical evidence reviewed above. Another key feature of Equation (1) is that the top-
+down predictions produce prediction errors through subtraction. In other words, feedback
+connections should exert inhibitory effects, of the sort seen empirically. Table 2 summarises
+the features of extrinsic connectivity (reviewed in the previous section) that are explained by
+predictive coding. In the remainder of this review, we focus on intrinsic connections and
+cortical microcircuits.
+
+The cortical microcircuit and predictive coding
+
+We now try to associate the variables in Equation (1) with specific populations in the
+canonical microcircuit. Figure 5 illustrates a remarkable correspondence between the form
+of Equation (1) and the connectivity of the canonical microcircuit. Furthermore, the
+resulting scheme corresponds almost exactly to the computational architecture proposed by
+Mumford (1992). This correspondence rests upon the following intuitive steps:
+
+• First, we divide the excitatory cells in the superficial and deep layers into principal
+(pyramidal) cells and excitatory interneurons. This accommodates the fact that (in
+macaque V1) a significant percentage of superficial L2/3 cells (about half) and
+deep L5 excitatory cells (about 80%) do not project outside the cortical column
+(Callaway and Wiser, 1996; Briggs and Callaway, 2005).
+
+• Second, we know that the superficial and deep pyramidal cells provide feedforward
+and feedback connections respectively. This means that superficial pyramidal cells
+must encode and broadcast prediction errors on hidden causes $\xi_v^{(i+1)}$, while deep
+pyramidal cells must encode conditional expectations $(\tilde{\mu}_v^{(i)}, \tilde{\mu}_x^{(i)})$ so that they can
+elaborate feedback predictions.
+
+• Third, we know that the (spiny stellate) excitatory cells in the granular layer receive
+feedforward connections encoding prediction errors $\xi_v^{(i)}$ on the hidden causes of the
+level below.
+
+• This leaves the inhibitory interneurons in the granular layer, which, for symmetry,
+we associate with prediction errors on the hidden states.
+
+• The remaining populations are the excitatory and inhibitory interneurons in the
+supragranular layer, to which we assign expectations about hidden causes and
+states respectively. These are mapped through descending (intrinsic) feedforward
+connections to cells in the deep layers that generate predictions. We do not suppose
+that this is a simple one-to-one mapping – rather it mediates the nonlinear
+transformation of expectations to predictions required by the earlier cortical level.
+
+This arrangement accommodates the fact that the dependencies among hidden states are
+confined to each node (by the nature of graphical models), which means that their
+expectations and prediction errors should be encoded by interneurons. Furthermore, the
+splitting of excitatory cells in the upper layers into two populations (encoding expectations
+and prediction errors on hidden causes) is sensible, because there is a one-to-one mapping
+between the expectations on hidden causes and their prediction errors.
+
+The ensuing architecture bears a striking correspondence to the microcircuit of Haeusler
+and Maass (2007), shown in the left panel of Figure 5 – in the sense that nearly every connection
+required by the predictive coding scheme appears to be present in terms of quantitative
+measures of intrinsic connectivity. However, there are two exceptions that both involve
+connections to the inhibitory cells in the granular layer (shown as dotted lines in Figure 5).
+Predictive coding requires that these cells (that encode prediction errors on hidden states)
+compare the expected changes in hidden states with the actual changes. This suggests that
+there should be interlaminar projections from supragranular (inhibitory) and infragranular
+(excitatory) cells. In terms of their synaptic characteristics, one would predict that these
+intrinsic connections would be of a feedback sort – in the sense that they convey predictions.
+Although not considered in the Haeusler and Maass scheme, feedback connections from
+infragranular layers are an established component of the canonical microcircuit (see Figure
+2).
+
+Functional asymmetries in the microcircuit
+
+The circuitry in Figure 5 appears consistent with the broad scheme of ascending
+(feedforward) and descending (feedback) intrinsic connections: feedforward prediction
+errors from a lower cortical level arrive at granular layers and are passed forward to
+excitatory and inhibitory interneurons in supragranular layers, encoding expectations. Strong
+and reciprocal intralaminar connections couple superficial excitatory interneurons and
+pyramidal cells. Excitatory and inhibitory interneurons in supragranular layers then send
+strong feedforward connections to the infragranular layer. These connections enable deep
+pyramidal cells and excitatory interneurons to produce (feedback) predictions, which ascend
+back to L4 or descend to a lower hierarchical level. This arrangement recapitulates the
+functional asymmetries between extrinsic feedforward and feedback connections and is
+consistent with the empirical characteristics of intrinsic connections.
+
+If we focus on the superficial and deep pyramidal cells, the form of the recognition
+dynamics in Equation (1) tells us something quite fundamental: We would anticipate higher
+frequencies in the superficial pyramidal cells, relative to the deep pyramidal cells. One can
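+see this easily by taking the Fourier transform of the first equality in Equation (1):
+
+\[
+(j\omega)\, \tilde{\mu}_v^{(i)}(\omega) = D\tilde{\mu}_v^{(i)}(\omega) - \partial_{\tilde{v}} \tilde{\varepsilon}^{(i)} \cdot \xi^{(i)}(\omega) - \xi_v^{(i+1)}(\omega)
+\tag{2}
+\]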
+
+This equation says that the contribution of any (angular) frequency ω in the prediction errors
+(encoded by superficial pyramidal cells) to the expectations (encoded by the deep pyramidal
+cells) is suppressed in proportion to that frequency (Friston 2008). In other words, high
+frequencies should be attenuated when passing from superficial to deep pyramidal cells.
+There is nothing mysterious about this attenuation – it is a simple consequence of the fact
+that conditional expectations accumulate prediction errors, thereby suppressing high-
+frequency fluctuations to produce smooth estimates of hidden causes. This smoothing –
+inherent in Bayesian filtering – leads to an asymmetry in frequency content of superficial
+and deep cells: for example, superficial cells should express more gamma relative to beta,
+and deep cells should express more beta relative to gamma (Roopun et al., 2006, 2008;
+Maier et al., 2010).
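+
+The attenuation argument can be checked on a stripped-down case. This is a sketch under
+simplifying assumptions that are not the authors' full scheme: a scalar expectation $\mu$
+accumulates a single prediction error $\xi$ through a constant gain $k$, and the remaining
+terms of the update are ignored. The frequencies of 20 Hz and 40 Hz are illustrative stand-ins
+for beta and gamma.
+
+```latex
+% Simplified accumulator; k is an assumed constant gain, other update terms are dropped.
+\begin{aligned}
+\dot{\mu}(t) &= k\, \xi(t)
+  && \text{expectations accumulate prediction error} \\
+(j\omega)\, \mu(\omega) &= k\, \xi(\omega)
+  && \text{Fourier transform} \\
+\frac{|\mu(\omega)|}{|\xi(\omega)|} &= \frac{k}{\omega}
+  && \text{gain falls in proportion to frequency} \\
+\frac{k / \omega_{\gamma}}{k / \omega_{\beta}} &= \frac{\omega_{\beta}}{\omega_{\gamma}} = \frac{20}{40} = \frac{1}{2}
+  && \text{a 40 Hz component is halved relative to a 20 Hz one}
+\end{aligned}
+```
+
+An integrator of this kind acts as a low-pass filter, so whatever the spectrum of the prediction
+errors, the expectations that accumulate them carry relatively more power at low frequencies.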
+
+Figure 6 provides a schematic illustration of the spectral asymmetry predicted by Equation
+(2). Note that predictions about the relative amplitudes of high and low frequencies in
+superficial and deep layers pertain to all frequencies – there is nothing in predictive coding
+per se to suggest characteristic frequencies in the gamma and beta ranges. However, one
+might speculate that the characteristic frequencies of canonical microcircuits have evolved to
+model and – through active inference – create the sensorium (Friston, 2010; Berkes et al.,
+2011; Engbert et al., 2011). Indeed, there is empirical evidence to support this notion in the
+visual (Lakatos et al., 2008; Meirovithz et al., 2012; Bosman et al., 2009) and motor domains
+(Gwin and Ferris, 2012).
+
+In summary, predictions are formed by a linear accumulation of prediction errors.
+Conversely, prediction errors are nonlinear functions of predictions. This means that the
+conversion of prediction errors into predictions (Bayesian filtering) necessarily entails a loss
+of high frequencies. However, the nonlinearity in the mapping from predictions to prediction
+errors means that high frequencies can be created (consider the effect of squaring a sine
+wave, which would convert beta into gamma). In short, prediction errors should express
+higher frequencies than the predictions that accumulate them. This is another example of a
+potentially important functional asymmetry between feedforward and feedback message
+passing that emerges under predictive coding. It is particularly interesting given recent
+evidence that feedforward connections may use higher frequencies than feedback
+connections (Bosman et al., 2012).
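+
+The parenthetical remark about squaring a sine wave follows from a standard trigonometric
+identity. The 20 Hz and 40 Hz values below are illustrative choices for beta and gamma, not
+figures taken from the text.
+
+```latex
+% Squaring a sinusoid doubles its frequency (plus a constant offset).
+\begin{aligned}
+\sin^2(\omega t) &= \tfrac{1}{2}\big(1 - \cos(2\omega t)\big) \\
+\omega = 2\pi \cdot 20\ \text{Hz} \;&\Rightarrow\; 2\omega = 2\pi \cdot 40\ \text{Hz}
+\end{aligned}
+```
+
+A quadratic nonlinearity applied to a beta-band prediction therefore yields a prediction error
+with power at twice that frequency, which falls in the gamma range for this choice of numbers.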
+
+Conclusion
+
+In conclusion, there is a remarkable correspondence between the anatomy and physiology of
+the canonical microcircuit and the formal constraints implied by generalised predictive
+coding. Having said this, the mapping between computational and neuronal architectures
+is not unique: even if predictive coding is an appropriate implementation of
+Bayesian filtering, there are many variations on the arrangement shown in Figure 5. For
+example, feedback connections could arise directly from cells encoding conditional
+expectations in supragranular layers. Indeed, there is emerging evidence that feedback
+connections between proximate hierarchical levels originate from both deep and superficial
+layers (Markov et al., 2011). Note that this putative splitting of extrinsic streams is only
+predicted in the light of empirical constraints on intrinsic connectivity.
+
+One of our motivations – for considering formal constraints on connectivity – was to
+produce dynamic causal models of canonical microcircuits. Dynamic causal modelling
+enables one to compare different connectivity models, using empirical electrophysiological
+responses (David et al., 2006; Moran et al., 2008, 2011). This form of modelling rests upon
+Bayesian model comparison and allows one to assess the evidence for one microcircuit
+relative to another. In principle, this provides a way to evaluate different microcircuit
+models, in terms of their ability to explain observed activity. One might imagine that the
+particular circuits for predictive coding presented in this paper will be nuanced as more
+anatomical and physiological information becomes available. The ability to compare
+competing models or microcircuits – using optogenetics, local field potentials and
+electroencephalography – may be important for refining neurobiologically informed
+microcircuits. In short, many of the predictions and assumptions we have made about the
+specific form of the microcircuit for predictive coding may be testable in the near future.
+
+Acknowledgments
+
+This work was supported by the Wellcome Trust and the NSF Graduate Research Fellowship under Grant No.
+2009090358 to A.M.B. Support was also provided by NIH grants MH055714 (G.R.M.) and EY013588 (W.M.U.),
+and NSF grant 1228535 (G.R.M and W.M.U). The authors would like to thank Julien Vezoli, Will Penny, Dimitris
+Pinotsis, Stewart Shipp, Vladimir Litvak, Conrado Bosman, Laurent Perrinet and Henry Kennedy for helpful
+discussions. We would also like to thank our reviewers for helpful comments and guidance.
+
+References
+
+Alink A, Schwiedrzik CM, Kohler A, Singer W, Muckli L. Stimulus predictability reduces responses
+in primary visual cortex. J. Neurosci. 2010; 30:2960–2966. [PubMed: 20181593]
+Alitto HJ, Usrey WM. Origin and dynamics of extraclassical suppression in the lateral geniculate
+nucleus of the macaque monkey. Neuron. 2008; 57:135–146. [PubMed: 18184570]
+Alonso JM. Book Review: Neural Connections and Receptive Field Properties in the Primary Visual
+Cortex. The Neuroscientist. 2002; 8:443. [PubMed: 12374429]
+Alonso JM, Martinez LM. Functional connectivity between simple cells and complex cells in cat
+striate cortex. Nat. Neurosci. 1998; 1:395–403. [PubMed: 10196530]
+Anderson JC, Kennedy H, Martin KA. Pathways of Attention: Synaptic Relationships of Frontal Eye
+Field to V4, Lateral Intraparietal Cortex, and Area 46 in Macaque Monkey. The Journal of
+Neuroscience. 2011; 31:10872. [PubMed: 21795539]
+Anderson JC, Martin KAC. Synaptic connection from cortical area V4 to V2 in macaque monkey. The
+Journal of Comparative Neurology. 2006; 495:709–721. [PubMed: 16506191]
+Armstrong, KM.; Schafer, RJ.; Chang, MH.; Moore, T. Attention and action in the frontal eye field. In
+The Neuroscience of Attention: Attentional Control and Selection. Mangun, GR., editor. Oxford
+University Press; New York: 2012. p. 151-166.
+Arnal LH, Wyart V, Giraud A-L. Transitions in neural oscillations reflect prediction errors generated
+in audiovisual speech. Nature Neuroscience. 2011; 14:797–801.
+Atick JJ. Could information theory provide an ecological theory of sensory processing? Network:
+Computation in Neural Systems. 1992; 3:213–251.
+Attneave F. Some informational aspects of visual perception. Psychol Rev. 1954; 61:183–193.
+[PubMed: 13167245]
+Bair W, Cavanaugh JR, Movshon JA. Time course and time-distance relationships for surround
+suppression in macaque V1 neurons. The Journal of Neuroscience. 2003; 23:7690. [PubMed:
+12930809]
+Barceló F, Suwazono S, Knight RT. Prefrontal modulation of visual processing in humans. Nat.
+Neurosci. 2000; 3:399–403. [PubMed: 10725931]
+Barlow, HB. Possible principles underlying the transformations of sensory messages. In: Rosenblith,
+WA., editor. Sensory Communication. MIT Press; Cambridge, MA: 1961. p. 217-234.
+Barone P, Batardiere A, Knoblauch K, Kennedy H. Laminar distribution of neurons in extrastriate
+areas projecting to visual areas V1 and V4 correlates with the hierarchical rank and indicates the
+operation of a distance rule. The Journal of Neuroscience. 2000; 20:3263. [PubMed: 10777791]
+Berkes P, Orban G, Lengyel M, Fiser J. Spontaneous Cortical Activity Reveals Hallmarks of an
+Optimal Internal Model of the Environment. Science. 2011; 331:83–87. [PubMed: 21212356]
+Bollimunta A, Mo J, Schroeder CE, Ding M. Neuronal Mechanisms and Attentional Modulation of
+Corticothalamic Alpha Oscillations. The Journal of Neuroscience. 2011; 31:4935. [PubMed:
+21451032]
+Bosman CA, Schoffelen J-M, Brunet N, Oostenveld R, Bastos AM, Womelsdorf T, Rubehn B,
+Stieglitz T, De Weerd P, Fries P. Attentional Stimulus Selection through Selective
+Synchronization between Monkey Visual Areas. Neuron. 2012; 75:875–888. [PubMed: 22958827]
+Briggs F, Callaway EM. Laminar patterns of local excitatory input to layer 5 neurons in macaque
+primary visual cortex. Cerebral Cortex. 2005; 15:479. [PubMed: 15319309]
+Briggs F, Usrey WM. Parallel processing in the corticogeniculate pathway of the macaque monkey.
+Neuron. 2009; 62:135–146. [PubMed: 19376073]
+Buchsbaum G, Gottschalk A. Trichromacy, opponent colours coding and optimum colour information
+transmission in the retina. Proc. R. Soc. Lond., B, Biol. Sci. 1983; 220:89–113. [PubMed:
+6140684]
+Buffalo EA, Fries P, Landman R, Buschman TJ, Desimone R. Laminar differences in gamma and
+alpha coherence in the ventral stream. Proceedings of the National Academy of Sciences. 2011;
+108:11262.
+Bullier J, Henry GH. Ordinal position and afferent input of neurons in monkey striate cortex. J. Comp.
+Neurol. 1980; 193:913–935. [PubMed: 6253535]
+Bullier J, Hupé JM, James A, Girard P. Functional interactions between areas V1 and V2 in the
+monkey. Journal of Physiology-Paris. 1996; 90:217–220.
+Buschman TJ, Miller EK. Top-Down Versus Bottom-Up Control of Attention in the Prefrontal and
+Posterior Parietal Cortices. Science. 2007; 315:1860–1862. [PubMed: 17395832]
+Buzsáki G. Neural Syntax: Cell Assemblies, Synapsembles, and Readers. Neuron. 2010; 68:362–385.
+[PubMed: 21040841]
+Callaway EM. Local circuits in primary visual cortex of the macaque monkey. Annual Review of
+Neuroscience. 1998; 21:47–74.
+Callaway EM, Wiser AK. Contributions of individual layer 2-5 spiny neurons to local circuits in
+macaque primary visual cortex. Vis. Neurosci. 1996; 13:907–922. [PubMed: 8903033]
+Canolty RT, Ganguly K, Kennerley SW, Cadieu CF, Koepsell K, Wallis JD, Carmena JM. Oscillatory
+phase coupling coordinates anatomically dispersed functional cell assemblies. Proceedings of the
+National Academy of Sciences. 2010; 107:17356.
+Chu Z, Galarreta M, Hestrin S. Synaptic interactions of late-spiking neocortical neurons in layer 1. The
+Journal of Neuroscience. 2003; 23:96. [PubMed: 12514205]
+Covic EN, Sherman SM. Synaptic properties of connections between the primary and secondary
+auditory cortices in mice. Cereb. Cortex. 2011; 21:2425–2441. [PubMed: 21385835]
+Crick F. Function of the thalamic reticular complex: the searchlight hypothesis. Proceedings of the
+National Academy of Sciences of the United States of America. 1984; 81:4586. [PubMed:
+6589612]
+Crick F, Koch C. Constraints on cortical and thalamic projections: the no-strong-loops hypothesis.
+Nature. 1998; 391:245–250. [PubMed: 9440687]
+Dan Y, Atick JJ, Reid RC. Efficient coding of natural scenes in the lateral geniculate nucleus:
+experimental test of a computational theory. The Journal of Neuroscience. 1996; 16:3351–3362.
+[PubMed: 8627371]
+David O, Kiebel SJ, Harrison LM, Mattout J, Kilner JM, Friston KJ. Dynamic causal modeling of
+evoked responses in EEG and MEG. NeuroImage. 2006; 30:1255–1272. [PubMed: 16473023]
+Dayan P, Hinton GE, Neal RM, Zemel RS. The Helmholtz machine. Neural Comput. 1995; 7:889–
+904. [PubMed: 7584891]
+Desimone R. Neural mechanisms for visual memory and their role in attention. Proc. Natl. Acad. Sci.
+U.S.A. 1996; 93:13494–13499. [PubMed: 8942962]
+Douglas RJ, Martin K. A functional microcircuit for cat visual cortex. The Journal of Physiology.
+1991; 440:735. [PubMed: 1666655]
+Douglas RJ, Martin KA, Whitteridge D. A canonical microcircuit for neocortex. Neural Computation.
+1989; 1:480–488.
+Douglas RJ, Martin KAC. Neuronal Circuits of the Neocortex. Annu. Rev. Neurosci. 2004; 27:419–
+451. [PubMed: 15217339]
+Engbert R, Mergenthaler K, Sinn P, Pikovsky A. PNAS Plus: An integrated model of fixational eye
+movements and microsaccades. Proceedings of the National Academy of Sciences. 2011;
+108:E765–E770.
+Engel, AK.; Friston, KJ.; Kelso, JA.; König, P.; Kovács, I.; MacDonald, A.; Miller, EK.; Phillips,
+WA.; Silverstein, SM.; Tallon-Baudry, C., et al. Coordination in Behavior and Cognition. In: von
+der Malsburg, C.; Phillips, WA.; Singer, W., editors. Dynamic Coordination in the Brain: From
+Neurons to Mind. MIT Press; Cambridge, MA: 2010. p. 267-299.
+Feldman H, Friston KJ. Attention, uncertainty, and free-energy. Front Hum Neurosci. 2010; 4:215.
+[PubMed: 21160551]
+Felleman DJ, Van Essen DC. Distributed hierarchical processing in the primate cerebral cortex. Cereb.
+Cortex. 1991; 1:1–47. [PubMed: 1822724]
+Fitzpatrick D, Usrey WM, Schofield BR, Einstein G. The sublaminar organization of corticogeniculate
+neurons in layer 6 of macaque striate cortex. Vis. Neurosci. 1994; 11:307–315. [PubMed:
+7516176]
+Fries P. A mechanism for cognitive dynamics: neuronal communication through neuronal coherence.
+Trends in Cognitive Sciences. 2005; 9:474–480. [PubMed: 16150631]
+Fries P, Reynolds JH, Rorie AE, Desimone R. Modulation of oscillatory neuronal synchronization by
+selective visual attention. Science. 2001; 291:1560–1563. [PubMed: 11222864]
+Friston K. Hierarchical Models in the Brain. PLoS Comput Biol. 2008; 4:e1000211. [PubMed:
+18989391]
+Friston K. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience. 2010;
+11:127–138.
+Fujisawa S, Buzsáki G. A 4 Hz Oscillation Adaptively Synchronizes Prefrontal, VTA, and
+Hippocampal Activities. Neuron. 2011; 72:153–165. [PubMed: 21982376]
+Garrido MI, Kilner JM, Kiebel SJ, Friston KJ. Evoked brain responses are generated by feedback
+loops. Proc. Natl. Acad. Sci. U.S.A. 2007; 104:20961–20966. [PubMed: 18087046]
+Garrido MI, Kilner JM, Stephan KE, Friston KJ. The mismatch negativity: a review of underlying
+mechanisms. Clinical Neurophysiology. 2009; 120:453–463. [PubMed: 19181570]
+Gentet LJ, Kremer Y, Taniguchi H, Huang ZJ, Staiger JF, Petersen CCH. Unique functional properties
+of somatostatin-expressing GABAergic neurons in mouse barrel cortex. Nature Neuroscience.
+2012; 15:607–612.
+George D, Hawkins J. Towards a mathematical theory of cortical micro-circuits. PLoS Computational
+Biology. 2009; 5:e1000532. [PubMed: 19816557]
+Gilbert CD, Wiesel TN. Functional organization of the visual cortex. Prog. Brain Res. 1983; 58:209–
+218. [PubMed: 6138809]
+Girard P, Bullier J. Visual activity in area V2 during reversible inactivation of area 17 in the macaque
+monkey. Journal of Neurophysiology. 1989; 62:1287. [PubMed: 2600626]
+Girard P, Salin PA, Bullier J. Visual activity in areas V3a and V3 during reversible inactivation of area
+V1 in the macaque monkey. Journal of Neurophysiology. 1991a; 66:1493. [PubMed: 1765790]
+Girard P, Salin PA, Bullier J. Visual activity in macaque area V4 depends on area 17 input.
+Neuroreport. 1991b; 2:81–84. [PubMed: 1883988]
+Girard P, Salin PA, Bullier J. Response selectivity of neurons in area MT of the macaque monkey
+during reversible inactivation of area V1. Journal of Neurophysiology. 1992; 67:1437. [PubMed:
+1629756]
+Gray CM, König P, Engel AK, Singer W. Oscillatory responses in cat visual cortex exhibit inter-
+columnar synchronization which reflects global stimulus properties. Nature. 1989; 338:334–337.
+[PubMed: 2922061]
+Gregoriou GG, Gotts SJ, Zhou H, Desimone R. High-Frequency, Long-Range Coupling Between
+Prefrontal and Visual Cortex During Attention. Science. 2009; 324:1207–1210. [PubMed:
+19478185]
+Gregory RL. Perceptual illusions and brain models. Proc. R. Soc. Lond., B. Biol. Sci. 1968; 171:279–
+296. [PubMed: 4387405]
+Gregory RL. Perceptions as hypotheses. Philos. Trans. R. Soc. Lond., B. Biol. Sci. 1980; 290:181–197.
+[PubMed: 6106237]
+Gwin JT, Ferris DP. Beta- and gamma-range human lower limb corticomuscular coherence. Front
+Hum Neurosci. 2012; 6:258. [PubMed: 22973219]
+Haeusler S, Maass W. A statistical analysis of information-processing properties of lamina-specific
+cortical microcircuit models. Cerebral Cortex. 2007; 17:149. [PubMed: 16481565]
+Harrison LM, Stephan KE, Rees G, Friston KJ. Extra-classical receptive field effects measured in
+striate cortex with fMRI. Neuroimage. 2007; 34:1199–1208. [PubMed: 17169579]
+Häusser M, Mel B. Dendrites: bug or feature? Current Opinion in Neurobiology. 2003; 13:372–383.
+[PubMed: 12850223]
+Hebb, DO. The Organization of Behavior: A Neuropsychological Theory. Wiley; New York: 1949.
+Helmholtz, H. English translation. Dover; New York: 1860. Handbuch der Physiologischen Optik.
+Hinton G, van Camp D. Keeping neural networks simple by minimizing the description length of
+weights. Proceedings of COLT-93. 1993:5–13.
+Hinton, GE.; Zemel, RS. Autoencoders, Minimum Description Length, and Helmholtz Free Energy.
+In: Cowan, JD.; Tesauro, G.; Alspector, J., editors. Advances in Neural Information Processing
+Systems 6. Morgan Kaufmann; San Mateo, CA: 1994.
+Hopfinger JB, Buonocore MH, Mangun GR. The neural mechanisms of top-down attentional control.
+Nature Neuroscience. 2000; 3:284–291.
+Horton JC, Adams DL. The cortical column: a structure without a function. Philosophical Transactions
+of the Royal Society B: Biological Sciences. 2005; 360:837–862.
+Hubel DH, Wiesel TN. Receptive fields, binocular interaction and functional architecture in the cat's
+visual cortex. The Journal of Physiology. 1962; 160:106. [PubMed: 14449617]
+Hubel DH, Wiesel TN. Laminar and columnar distribution of geniculo-cortical fibers in the macaque
+monkey. The Journal of Comparative Neurology. 1972; 146:421–450. [PubMed: 4117368]
+Hubel DH, Wiesel TN. Sequence regularity and geometry of orientation columns in the monkey striate
+cortex. J. Comp. Neurol. 1974; 158:267–293. [PubMed: 4436456]
+Hupé JM, James AC, Payne BR, Lomber SG, Girard P, Bullier J. Cortical feedback improves
+discrimination between figure and background by V1, V2 and V3 neurons. Nature. 1998;
+394:784–787. [PubMed: 9723617]
+Johnson RR, Burkhalter A. Microcircuitry of forward and feedback connections within rat visual
+cortex. J. Comp. Neurol. 1996; 368:383–398. [PubMed: 8725346]
+Johnson RR, Burkhalter A. A polysynaptic feedback circuit in rat visual cortex. The Journal of
+Neuroscience. 1997; 17:7129. [PubMed: 9278547]
+Kätzel D, Zemelman BV, Buetfering C, Wölfel M, Miesenböck G. The columnar and laminar
+organization of inhibitory connections to neocortical excitatory cells. Nature Neuroscience. 2010;
+14:100–107.
+Kay JW, Phillips WA. Coherent Infomax as a computational goal for neural systems. Bull. Math. Biol.
+2011; 73:344–372. [PubMed: 20821064]
+Klampfl S, Legenstein R, Maass W. Spiking neurons can learn to solve information bottleneck
+problems and extract independent components. Neural Computation. 2009; 21:911–959. [PubMed:
+19018708]
+Knight RT. Decreased response to novel stimuli after prefrontal lesions in man. Electroencephalogr
+Clin Neurophysiol. 1984; 59:9–20. [PubMed: 6198170]
+Knight RT, Scabini D, Woods DL. Prefrontal cortex gating of auditory transmission in humans. Brain
+Res. 1989; 504:338–342. [PubMed: 2598034]
+Knill DC, Pouget A. The Bayesian brain: the role of uncertainty in neural coding and computation.
+Trends in Neurosciences. 2004; 27:712–719. [PubMed: 15541511]
+de Kock CPJ, Bruno RM, Spors H, Sakmann B. Layer- and cell-type-specific suprathreshold stimulus
+representation in rat primary somatosensory cortex. J. Physiol. (Lond.). 2007; 581:139–154.
+[PubMed: 17317752]
+Kok P, Rahnev D, Jehee JFM, Lau HC, de Lange FP. Attention Reverses the Effect of Prediction in
+Silencing Sensory Signals. Cerebral Cortex. 2011; 22:2197–2206. [PubMed: 22047964]
+Lakatos P, Karmos G, Mehta AD, Ulbert I, Schroeder CE. Entrainment of neuronal oscillations as a
+mechanism of attentional selection. Science. 2008; 320:110–113. [PubMed: 18388295]
+Larkum ME, Nevian T, Sandler M, Polsky A, Schiller J. Synaptic Integration in Tuft Dendrites of
+Layer 5 Pyramidal Neurons: A New Unifying Principle. Science. 2009; 325:756–760. [PubMed:
+19661433]
+Linsker R. Perceptual neural organization: some approaches based on network models and information
+theory. Annual Review of Neuroscience. 1990; 13:257–281.
+Livingstone MS. Oscillatory firing and interneuronal correlations in squirrel monkey striate cortex.
+Journal of Neurophysiology. 1996; 75:2467–2485. [PubMed: 8793757]
+London M, Häusser M. Dendritic computation. Annu. Rev. Neurosci. 2005; 28:503–532. [PubMed:
+16033324]
+Lopes-dos-Santos V, Conde-Ocazionez S, Nicolelis MAL, Ribeiro ST, Tort ABL. Neuronal Assembly
+Detection and Cell Membership Specification by Principal Component Analysis. PLoS ONE.
+2011; 6:e20996. [PubMed: 21698248]
+MacKay, DM. Automata Studies. Shannon, CE.; McCarthy, J., editors. Princeton Univ. Press;
+Princeton, NJ: 1956. p. 235-251.
+Maier A, Adams GK, Aura C, Leopold DA. Distinct superficial and deep laminar domains of activity
+in the visual cortex during rest and stimulation. Frontiers in Systems Neuroscience. 2010; 4
+Markov NT, Misery P, Falchier A, Lamy C, Vezoli J, Quilodran R, Gariel MA, Giroud P, Ercsey-
+Ravasz M, Pilaz LJ, et al. Weight consistency specifies regularities of macaque cortical networks.
+Cerebral Cortex. 2011; 21:1254. [PubMed: 21045004]
+Meirovithz E, Ayzenshtat I, Werner-Reiss U, Shamir I, Slovin H. Spatiotemporal Effects of
+Microsaccades on Population Activity in the Visual Cortex of Monkeys during Fixation. Cerebral
+Cortex. 2011; 22:294–307. [PubMed: 21653284]
+Melzer S, Michael M, Caputi A, Eliava M, Fuchs EC, Whittington MA, Monyer H. Long-Range-
+Projecting GABAergic Neurons Modulate Inhibition in Hippocampus and Entorhinal Cortex.
+Science. 2012; 335:1506–1510. [PubMed: 22442486]
+Meyer HS, Schwarz D, Wimmer VC, Schmitt AC, Kerr JND, Sakmann B, Helmstaedter M. Inhibitory
+interneurons in a cortical column form hot zones of inhibition in layers 2 and 5A. Proceedings of
+the National Academy of Sciences. 2011; 108:16807–16812.
+Meyer T, Olson CR. Statistical learning of visual transitions in monkey inferotemporal cortex.
+Proceedings of the National Academy of Sciences. 2011; 108:19401–19406.
+Mignard M, Malpeli JG. Paths of information flow through visual cortex. Science. 1991; 251:1249.
+[PubMed: 1848727]
+Moran RJ, Stephan KE, Kiebel SJ, Rombach N, O'Connor WT, Murphy KJ, Reilly RB, Friston KJ.
+Bayesian estimation of synaptic physiology from the spectral responses of neural masses.
+Neuroimage. 2008; 42:272–284. [PubMed: 18515149]
+Moran RJ, Symmonds M, Stephan KE, Friston KJ, Dolan RJ. An in vivo assay of synaptic function
+mediating human cognition. Curr. Biol. 2011; 21:1320–1325. [PubMed: 21802302]
+Mountcastle VB. Modality and topographic properties of single neurons of cat's somatic sensory
+cortex. Journal of Neurophysiology. 1957; 20:408. [PubMed: 13439410]
+Mountcastle VB. The columnar organization of the neocortex. Brain. 1997; 120:701. [PubMed:
+9153131]
+Mumford D. On the computational architecture of the neocortex. II. The role of cortico-cortical loops.
+Biol Cybern. 1992; 66:241–251. [PubMed: 1540675]
+Murphy PC, Sillito AM. Corticofugal feedback influences the generation of length tuning in the visual
+pathway. Nature. 1987; 329:727–729. [PubMed: 3670375]
+Murray SO. Spatially Specific fMRI Repetition Effects in Human Visual Cortex. Journal of
+Neurophysiology. 2005; 95:2439–2445. [PubMed: 16394067]
+Murray SO, Kersten D, Olshausen BA, Schrater P, Woods DL. Shape perception reduces activity in
+human primary visual cortex. Proc. Natl. Acad. Sci. U.S.A. 2002; 99:15164–15169. [PubMed:
+12417754]
+Neisser, U. Cognitive psychology. Appleton-Century-Crofts; New York: 1967.
+de No, LR. Cerebral cortex: architecture, intracortical connections, motor projections. In: Fulton, JF.,
+editor. Physiology of the Nervous System. Oxford University Press; Oxford: 1949. p. 288-330.
+Olsen SR, Bortone DS, Adesnik H, Scanziani M. Gain control by layer six in cortical circuits of vision.
+Nature. 2012
+De Pasquale R, Sherman SM. Synaptic properties of corticocortical connections between the primary
+and secondary visual cortical areas in the mouse. J. Neurosci. 2011; 31:16494–16506. [PubMed:
+22090476]
+Raizada RDS, Grossberg S. Towards a theory of the laminar architecture of cerebral cortex:
+Computational clues from the visual system. Cerebral Cortex. 2003; 13:100–113. [PubMed:
+12466221]
+Rao RP, Ballard DH. Predictive coding in the visual cortex: a functional interpretation of some extra-
+classical receptive-field effects. Nature Neuroscience. 1999; 2:79–87.
+Roopun AK, Kramer MA, Carracedo LM, Kaiser M, Davies CH, Traub RD, Kopell NJ, Whittington
+MA. Period concatenation underlies interactions between gamma and beta rhythms in neocortex.
+Front Cell Neurosci. 2008; 2:1. [PubMed: 18946516]
+Roopun AK, Middleton SJ, Cunningham MO, LeBeau FE, Bibbig A, Whittington MA, Traub RD. A
+beta2-frequency (20–30 Hz) oscillation in nonsynaptic networks of somatosensory cortex.
+Proceedings of the National Academy of Sciences. 2006; 103:15646.
+Saalmann YB, Pigarev IN, Vidyasagar TR. Neural Mechanisms of Visual Attention: How Top-Down
+Feedback Highlights Relevant Locations. Science. 2007; 316:1612–1615. [PubMed: 17569863]
+Saalmann YB, Pinsk MA, Wang L, Li X, Kastner S. The Pulvinar Regulates Information Transmission
+Between Cortical Areas Based on Attention Demands. Science. 2012; 337:753–756. [PubMed:
+22879517]
+Sakata S, Harris KD. Laminar Structure of Spontaneous and Sensory-Evoked Population Activity in
+Auditory Cortex. Neuron. 2009; 64:404–418. [PubMed: 19914188]
+Salin PA, Bullier J. Corticocortical connections in the visual system: structure and function. Physiol.
+Rev. 1995; 75:107–154. [PubMed: 7831395]
+Sandell JH, Schiller PH. Effect of cooling area 18 on striate cortex cells in the squirrel monkey.
+Journal of Neurophysiology. 1982; 48:38. [PubMed: 6288886]
+Sherman SM, Guillery R. On the actions that one nerve cell can have on another: distinguishing
+“drivers” from “modulators.” Proceedings of the National Academy of Sciences of the United
+States of America. 1998; 95:7121. [PubMed: 9618549]
+Sherman SM, Guillery RW. Distinct functions for direct and transthalamic corticocortical connections.
+Journal of Neurophysiology. 2011; 106:1068–1077. [PubMed: 21676936]
+Shipp S. Structure and function of the cerebral cortex. Current Biology. 2007; 17:443–449.
+Shlosberg D, Amitai Y, Azouz R. Time-Dependent, Layer-Specific Modulation of Sensory Responses
+Mediated by Neocortical Layer 1. Journal of Neurophysiology. 2006; 96:3170–3182. [PubMed:
+17110738]
+Sillito AM, Cudeiro J, Murphy PC. Orientation sensitive elements in the corticofugal influence on
+centre-surround interactions in the dorsal lateral geniculate nucleus. Exp Brain Res. 1993; 93:6–
+16. [PubMed: 8467892]
+Sincich LC, Horton JC. The circuitry of V1 and V2: integration of color, form, and motion. Annu.
+Rev. Neurosci. 2005; 28:303–326. [PubMed: 16022598]
+Singer W. Neuronal synchrony: a versatile code for the definition of relations? Neuron. 1999; 24:49–
+65. 111–125. [PubMed: 10677026]
+Singer W, Engel AK, Kreiter AK, Munk MH, Neuenschwander S, Roelfsema PR. Neuronal
+assemblies: necessity, signature and detectability. Trends Cogn. Sci. (Regul. Ed.). 1997; 1:252–
+261. [PubMed: 21223920]
+Spratling MW. Reconciling predictive coding and biased competition models of cortical function.
+Front Comput Neurosci. 2008; 2:4. [PubMed: 18978957]
+Srinivasan MV, Laughlin SB, Dubs A. Predictive coding: a fresh view of inhibition in the retina. Proc.
+R. Soc. Lond., B, Biol. Sci. 1982; 216:427–459. [PubMed: 6129637]
+Summerfield C, Trittschuh EH, Monti JM, Mesulam M-M, Egner T. Neural repetition suppression
+reflects fulfilled perceptual expectations. Nature Neuroscience. 2008; 11:1004–1006.
+Summerfield C, Wyart V, Johnen VM, de Gardelle V. Human Scalp Electroencephalography Reveals
+that Repetition Suppression Varies with Expectation. Front Hum Neurosci. 2011; 5:67. [PubMed:
+21847378]
+Theyel BB, Llano DA, Sherman SM. The corticothalamocortical circuit drives higher-order cortex in
+the mouse. Nature Neuroscience. 2009; 13:84–88.
+Thomson AM, Bannister AP. Interlaminar connections in the neocortex. Cerebral Cortex. 2003; 13:5–
+14. [PubMed: 12466210]
+Todorovic A, van Ede F, Maris E, de Lange FP. Prior Expectation Mediates Neural Adaptation to
+Repeated Sounds in the Auditory Cortex: An MEG Study. Journal of Neuroscience. 2011;
+31:9118–9123. [PubMed: 21697363]
+Ullman S. Sequence seeking and counter streams: a computational model for bidirectional information
+flow in the visual cortex. Cereb. Cortex. 1995; 5:1–11. [PubMed: 7719126]
+Usrey WM, Fitzpatrick D. Specificity in the axonal connections of layer VI neurons in tree shrew
+striate cortex: evidence for distinct granular and supragranular systems. The Journal of
+Neuroscience. 1996; 16:1203. [PubMed: 8558249]
+Varela F, Lachaux JP, Rodriguez E, Martinerie J. The brainweb: phase synchronization and large-scale
+integration. Nature Reviews Neuroscience. 2001; 2:229–239.
+Vezoli J. Quantitative Analysis of Connectivity in the Visual Cortex: Extracting Function from
+Structure. The Neuroscientist. 2004; 10:476–482. [PubMed: 15359013]
+Vicente R, Gollo LL, Mirasso CR, Fischer I, Pipa G. Dynamical relaying can yield zero time lag
+neuronal synchrony despite long conduction delays. Proceedings of the National Academy of
+Sciences. 2008; 105:17157.
+Wacongne C, Labyt E, van Wassenhove V, Bekinschtein T, Naccache L, Dehaene S. Evidence for a
+hierarchy of predictions and prediction errors in human cortex. Proceedings of the National
+Academy of Sciences. 2011; 108:20754–20759.
+Wang X-J. Neurophysiological and computational principles of cortical rhythms in cognition. Physiol.
+Rev. 2010; 90:1195–1268. [PubMed: 20664082]
+Weiler N, Wood L, Yu J, Solla SA, Shepherd GMG. Top-down laminar organization of the excitatory
+network in motor cortex. Nat Neurosci. 2008; 11:360–366. [PubMed: 18246064]
+Wozny C, Williams SR. Specificity of Synaptic Connectivity between Layer 1 Inhibitory Interneurons
+and Layer 2/3 Pyramidal Neurons in the Rat Neocortex. Cerebral Cortex. 2011
+Wurtz RH, McAlonan K, Cavanaugh J, Berman RA. Thalamic pathways for active vision. Trends
+Cogn. Sci. (Regul. Ed.). 2011; 15:177–184. [PubMed: 21414835]
+Wyart V, Nobre AC, Summerfield C. Dissociable prior influences of signal probability and relevance
+on visual contrast sensitivity. Proceedings of the National Academy of Sciences. 2012;
+109:3593–3598.
+Yamaguchi S, Knight RT. Gating of somatosensory input by human prefrontal cortex. Brain Res.
+1990; 521:281–288. [PubMed: 2207666]
+Yoshimura Y, Callaway EM. Fine-scale specificity of cortical networks depends on inhibitory cell
+type and connectivity. Nat. Neurosci. 2005; 8:1552–1559. [PubMed: 16222228]
+Yuille A, Kersten D. Vision as Bayesian inference: analysis by synthesis? Trends Cogn. Sci. (Regul.
+Ed.). 2006; 10:301–308. [PubMed: 16784882]
+Zeki S, Shipp S. The functional logic of cortical connections. Nature. 1988; 335:311–317. [PubMed:
+3047584]
+Zeki SM. The cortical projections of foveal striate cortex in the rhesus monkey. J. Physiol. (Lond.).
+1978; 277:227–244. [PubMed: 418174]
+
+Figure 1.
+This is a schematic of the classical microcircuit adapted from Douglas and Martin (1991).
+This minimal circuitry comprises superficial (layers 2 and 3) and deep (layers 5 and 6)
+pyramidal cells and a population of smooth inhibitory cells. Feedforward inputs – from the
+thalamus – target all cell populations, but with an emphasis on inhibitory interneurons and
+superficial and granular layers. Note the symmetrical deployment of inhibitory and
+excitatory intrinsic connections that maintain a balance of excitation and inhibition.
+
+Figure 2.
+This is a simplified schematic of the key intrinsic connections among excitatory (E) and
+inhibitory (I) populations in granular (L4), supragranular (L1/2/3) and infragranular (L5/6)
+layers. The excitatory interlaminar connections are based largely on Gilbert and Wiesel
+(1983). Forward connections denote feedforward extrinsic corticocortical or thalamocortical
+afferents that are reciprocated by backward or feedback connections. Anatomical and
+functional data suggest that afferent input enters primarily into L4 and is conveyed to
+superficial layers L2/3 that are rich in pyramidal cells, which project forward to the next
+cortical area, forming a disynaptic route between thalamus and secondary cortical areas
+(Callaway, 1998). Information from L2/3 is then sent to L5 and L6, which send (intrinsic)
+feedback projections back to L4 (Usrey and Fitzpatrick, 1996). L5 cells originate feedback
+connections to earlier cortical areas as well as to the pulvinar, superior colliculus, and brain
+stem. In summary, forward input is segregated by intrinsic connections into a superficial
+forward stream and a deep backward stream. In this schematic, we have juxtaposed densely
+interconnected excitatory and inhibitory populations within each layer.
+
+Figure 3.
+This schematic shows an example of a generative model. Generative models describe how
+(sensory) data are caused. In this figure, sensory states (blue circles on the periphery) are
+generated by hidden variables (in the centre). The left panel shows the model as a
+probabilistic graphical model, where unknown variables (hidden causes and states) are
+associated with the nodes of a dependency graph and conditional dependencies are indicated
+by arrows. Hidden states confer memory on the model by virtue of having dynamics, while
+hidden causes connect nodes. A graphical model describes the conditional dependencies
+among hidden variables generating data. These dependencies are typically modelled as
+(differential) equations with nonlinear mappings and random fluctuations with precision
+(inverse variance) Π(i) (see the equations in the insert on the left). This allows one to specify
+the precise form of the probabilistic generative model and leads to a simple and efficient
+inversion scheme (predictive coding; see next figure). Here 
+ denotes the set of hidden
+causes that constitute the parents of sensory s̃(i) or hidden x̃(i) states. The ~ indicates states in
+generalised coordinates of motion: x̃ = (x, x′, x″, ...). An intuitive version of the model is
+shown on the right: here, we imagine that a singing bird is the cause of sensations, which –
+through a cascade of dynamical hidden states – produces modality-specific consequences
+(e.g., the auditory object of a bird song and the visual object of a song bird). These
+intermediate causes are themselves (hierarchically) unpacked to generate sensory signals.
+The generative model therefore maps from causes (e.g., concepts) to consequences (e.g.,
+sensations), while its inversion corresponds to mapping from sensations to concepts or
+representations. This inversion corresponds to perceptual synthesis – in which the generative
+model is used to generate predictions. Note that this inversion implicitly resolves the binding
+problem – by explaining multisensory cues with a single cause.
+
+Figure 4.
+This figure describes the predictive coding scheme associated with a simple hierarchical
+model shown on the left. In this model each node has a single parent. The ensuing inversion
+or generalised predictive coding scheme is shown on the right. The key quantities in this
+scheme are (conditional) expectations of the hidden states and causes and their associated
+prediction errors. The basic architecture – implied by the inversion of the graphical
+(hierarchical) model – suggests that prediction errors (caused by unpredicted fluctuations in
+hidden variables) are passed up the hierarchy to update conditional expectations. These
+conditional expectations now provide predictions that are passed down the hierarchy to form
+prediction errors. We presume that the forward and backward message passing between
+hierarchical levels is mediated by extrinsic (feedforward and feedback) connections.
+Neuronal populations encoding conditional expectations and prediction errors now have to
+be deployed in a canonical microcircuit to understand the computational logic of intrinsic
+connections – within each level of the hierarchy – as shown in the next figure.
+
+
+Figure 5.
+The left hand panel is the canonical microcircuit based on Haeusler and Maass (2007),
+where we have removed inhibitory cells from the deep layers – because they have very little
+interlaminar connectivity. The numbers denote connection strengths (mean amplitude of
+PSPs measured at soma in mV) and connection probabilities (in parentheses) according to
+Thomson et al. (2002). The right panel shows the proposed cortical microcircuit for
+predictive coding, where the quantities of the previous figure have been associated with
+various cell types. Here, prediction error populations are highlighted in pink. Inhibitory
+connections are shown in red, while excitatory connections are in black. The dotted lines
+refer to connections that are not present in the microcircuit on the left (but see Figure 2). In
+this scheme, expectations (about causes and states) are assigned to (excitatory and
+inhibitory) interneurons in the supragranular layers, which are passed to infragranular layers.
+The corresponding prediction errors occupy granular layers, while superficial pyramidal
+cells encode prediction errors that are sent forward to the next hierarchical level. Conditional
+expectations and prediction errors on hidden causes are associated with excitatory cell types,
+while the corresponding quantities for hidden states are assigned to inhibitory cells. Dark
+circles indicate pyramidal cells. Finally, we have placed the precision of the feedforward
+prediction errors against the superficial pyramidal cells. This quantity controls the
+postsynaptic sensitivity or gain to (intrinsic and top-down) pre-synaptic inputs. We have
+previously discussed this in terms of attentional modulation, which may be intimately linked
+to the synchronisation of pre-synaptic inputs and ensuing postsynaptic responses (Fries et al
+2001; Feldman and Friston, 2010).
+
+
+Figure 6.
+This schematic illustrates the functional asymmetry between the spectral activity of
+superficial and deep cells predicted theoretically. In this illustrative example, we have
+ignored the effects of influences on the expectations of hidden causes (encoded by deep
+pyramidal cells), other than the prediction error on causes (encoded by superficial pyramidal
+cells). The lower panel shows the spectral density of deep pyramidal cell activity, given the
+spectral density of superficial pyramidal cell activity in the upper panel. The equation
+expresses the spectral density of the deep cells as a function of the spectral density of the
+superficial cells; using Equation (2). This schematic is meant to illustrate how the relative
+amounts of low (beta) and high (gamma) frequency activity in superficial and deep cells can
+be explained by the evidence accumulation implicit in predictive coding.
+
+Table 1
+
+Electrophysiological and neuroimaging findings consistent with predictive coding.
+
+Prediction violated | Area studied | Neuronal expression of prediction error | Study
+Learned visual object pairings | Monkey inferotemporal cortex (IT) | Enhanced firing rate | Meyer and Olson, 2011
+Natural image statistics | Monkey V1, V2, V3 | Enhanced firing rate | Hupé et al., 1998; Bullier et al., 1996; Bair et al., 2003
+Repetitive auditory stream | Early human auditory cortex | Enhanced Event Related Potentials (ERPs), enhanced gamma-band power | Garrido et al., 2007, 2009; Todorovic et al., 2011
+Coherence of visual form and motion | Human V1, V2, V3, V4, V5/MT | Enhanced BOLD response | Murray et al 2002; Murray et al 2005; Harrison et al., 2007
+Audio-visual congruence of speech | Visual and auditory cortex | Gamma-band oscillatory activity | Arnal et al., 2011
+Predictability of visual stimuli as a function of attention | Human V1, V2, V3 | Enhanced BOLD response when unattended, reduced BOLD when attended | Kok et al., 2011
+Hierarchical expectations in auditory sequences | Human temporal cortex | Enhanced Event Related Potentials (ERPs) | Wacongne et al., 2011
+Expected repetition (or alternation) of face stimuli | FFA in fMRI, parietal and central electrodes of EEG | Enhanced BOLD response, diminished repetition suppression of ERP | Summerfield et al., 2008, 2011
+Apparent motion of visual stimulus | V1 | Enhanced BOLD response | Alink et al., 2010
+
+
+
+Table 2
+
+The functional (computational) correlates of the anatomy and physiology of cortical hierarchies and their
+extrinsic connections.
+
+Anatomy and physiology | Functional correlates
+Hierarchical organisation of cortical areas (Zeki and Shipp 1988; Felleman and Van Essen, 1991; Barone et al., 2000; Vezoli, 2004) | Encoding of conditional dependencies in terms of a graphical model (Mumford, 1992; Rao and Ballard, 1999; Friston 2008).
+Distinct (laminar-specific) neuronal responses (Douglas et al., 1989; Douglas and Martin, 1991) | Encoding expected states of the world (superficial pyramidal cells) and prediction errors (deep pyramidal cells) (Mumford, 1992; Friston 2008).
+Distinct (laminar-specific) extrinsic connections (Zeki and Shipp 1988; Felleman and Van Essen, 1991; Barone et al., 2000; Vezoli, 2004; Markov et al., 2011). | Forward connections convey prediction error (from superficial pyramidal cells) and backward connections convey predictions (from deep pyramidal cells) (Mumford, 1992; Friston 2008).
+Reciprocal extrinsic connectivity (Zeki and Shipp 1988; Felleman and Van Essen, 1991; Barone et al., 2000; Vezoli, 2004; Markov et al., 2011) | Recurrent dynamics are intrinsically stable because they are trying to suppress prediction error (Crick and Koch 1998; Friston 2008).
+Feedback extrinsic connections are (driving and) modulatory (Mignard and Malpeli 1991; Bullier et al., 1996; Sherman and Guillery 1998; Covic and Sherman, 2011; De Pasquale and Sherman, 2011). | Forwards (driving) and backwards (driving and modulatory) connections mediate the (linear) influence of prediction errors and the (linear and non-linear) construction of predictions (Friston 2008; 2010).
+Feedback extrinsic connections are inhibitory (Murphy and Sillito, 1987; Sillito et al., 1993; Chu et al., 2003; Olsen et al. 2012; Meyer et al., 2011; Wozny and Williams, 2011). | Top-down predictions suppress or counter prediction errors produced by bottom up inputs (Mumford, 1992; Rao and Ballard, 1999; Friston 2008).
+Differences in neuronal dynamics of superficial and deep layers (de Kock et al., 2007; Sakata and Harris, 2009; Maier et al., 2010; Bollimunta et al., 2011; Buffalo et al., 2011). | Principal cells elaborating predictions (deep pyramidal cells) may show distinct (low-pass) dynamics, relative to those encoding error (superficial pyramidal cells) (Friston 2008).
+Dense intrinsic and horizontal connectivity (Thomson and Bannister, 2003; Katzel et al., 2010). | Lateral predictions and prediction errors mediating winnerless competition and competitive lateral dependencies (Desimone, 1996; Friston 2010).
+Predominance of nonlinear synaptic (dendritic and neuromodulatory) infrastructure in superficial layers (Häusser and Mel, 2003; London and Häusser, 2005; Gentet et al., 2012). | Required to scale prediction errors, in proportion to their precision, affording a form of cortical bias or gain control that encodes uncertainty (Feldman and Friston 2010; Spratling, 2008)
+
+
+
diff --git a/papers/references/Bastos2012_Placeholder.md b/papers/references/Bastos2012_Placeholder.md
deleted file mode 100644
index 3b1ac7e1..00000000
--- a/papers/references/Bastos2012_Placeholder.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Canonical microcircuits for predictive coding (Bastos 2012)
-
-This reference defines the anatomical pathways of the cortical microcircuit (L2/3, L4, L5, L6) and how they implement active inference.
-Due to copyright and its format, the full PDF is not hosted in this repository.
-
-**Citation:**
-Bastos, A. M. et al. (2012). *Neuron* **76**, 695.
diff --git a/papers/references/Schlosshauer2007.pdf b/papers/references/Schlosshauer2007.pdf
new file mode 100644
index 00000000..6b0ad828
--- /dev/null
+++ b/papers/references/Schlosshauer2007.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:eecf513088343ae57d312f1892d2e7069214be969b54f31fc64e14339bafd4cd
+size 2028534
diff --git a/papers/references/Schlosshauer2007.txt b/papers/references/Schlosshauer2007.txt
new file mode 100644
index 00000000..7f1a1127
--- /dev/null
+++ b/papers/references/Schlosshauer2007.txt
@@ -0,0 +1,3798 @@
+The quantum-to-classical transition and decoherence
+
+Maximilian Schlosshauer
+Department of Physics, University of Portland, 5000 North Willamette Boulevard, Portland, Oregon 97203, USA
+
+I give a pedagogical overview of decoherence and its role in providing a dynamical account of the
+quantum-to-classical transition. The formalism and concepts of decoherence theory are reviewed,
+followed by a survey of master equations and decoherence models. I also discuss methods for
+mitigating decoherence in quantum information processing and describe selected experimental
+investigations of decoherence processes.
+Note: Please see arXiv:1911.06282 [quant-ph] (published as Phys. Rep. 831, 1–57, 2019) for
+a much more extensive and up-to-date review of decoherence.
+
+CONTENTS
+
+I. Introduction
+II. Basic formalism and concepts
+   A. Decoherence and interference damping
+   B. Environmental monitoring and information transfer
+   C. Environment-induced superselection and decoherence-free subspaces
+      1. Pointer states and the commutativity criterion
+      2. Decoherence-free subspaces
+   D. Proliferation of information and quantum Darwinism
+   E. Decoherence versus dissipation and noise
+III. Master equations
+   A. Born–Markov master equations
+   B. Lindblad master equations
+   C. Non-Markovian decoherence
+IV. Decoherence models
+   A. Collisional decoherence
+   B. Quantum Brownian motion
+   C. Spin–boson models
+   D. Spin-environment models
+V. Qubit decoherence, quantum error correction, and error avoidance
+   A. Correction of decoherence-induced quantum errors
+   B. Quantum computation on decoherence-free subspaces
+   C. Environment engineering and dynamical decoupling
+VI. Experimental studies of decoherence
+   A. Atoms in a cavity
+   B. Matter-wave interferometry
+   C. Superconducting systems
+VII. Decoherence and the foundations of quantum mechanics
+References
+
+I. INTRODUCTION
+
+Realistic quantum systems are never completely iso-
+lated from their environment. When a quantum system
+interacts with its environment, it will in general become
+entangled with a large number of environmental degrees
+of freedom. This entanglement influences what we can
+locally observe upon measuring the system. In partic-
+ular, quantum interference effects with respect to cer-
+tain physical quantities (most notably, “classical” quan-
+tities such as position) become effectively suppressed,
+making them prohibitively difficult to observe in most
+cases of practical interest. This is the process of deco-
+herence, sometimes also called dynamical decoherence or
+environment-induced decoherence [1–10]. Stated in gen-
+eral and interpretation-neutral terms, decoherence de-
+scribes how entangling interactions with the environment
+influence the statistics of results of future measurements
+on the system.
+Formally, decoherence can be viewed as a dynamical
+filter on the space of quantum states, singling out those
+states that, for a given system, can be stably prepared
+and maintained, while effectively excluding most other
+states, in particular, nonclassical superposition states of
+the kind popularized by Schrödinger’s cat. In this way,
+decoherence lies at the heart of the quantum-to-classical
+transition. It ensures consistency between quantum and
+classical predictions for systems observed to behave clas-
+sically. It provides a quantitative, dynamical account of
+the boundary between quantum and classical physics. In
+any concrete experimental situation, decoherence theory
+specifies the physical requirements, both qualitative and
+quantitative, for pushing the quantum–classical bound-
+ary toward the quantum realm. Decoherence is a pure
+quantum effect, to be distinguished from classical dissi-
+pation and stochastic fluctuations (noise).
+Decoherence processes are extremely efficient. Even
+when the environment does not, from a classical point
+of view, impart significant classical perturbations on the
+system, quantum-mechanically the system will in most
+circumstances become rapidly and strongly entangled
+with the environment. Furthermore, due to the many un-
+controllable degrees of freedom of the environment, such
+entanglement is usually irreversible for all practical pur-
+poses. Increasingly realistic models of decoherence pro-
+cesses have been developed, progressing from toy models
+to complex models tailored to specific experiments (see
+Sec. IV). Advances in experimental techniques have made
+it possible to observe the gradual action of decoherence
+in experiments such as matter-wave interferometry [11],
+cavity QED [12], and superconducting systems [13] (see
+Sec. VI).
+
+arXiv:1404.2635v2 [quant-ph] 20 Nov 2019
+
+The superposition states necessary for quantum in-
+formation processing are typically also those most sus-
+ceptible to decoherence. Thus, decoherence is a major
+barrier to implementing devices for quantum informa-
+tion processing such as quantum computers (see Sec. V).
+Qubit systems must be engineered to minimize environ-
+mental interactions detrimental to the preparation and
+longevity of the desired superposition states. At the
+same time, they must remain sufficiently open to al-
+low for their control. Quantum error correction can
+undo some of the decoherence-induced degradation of
+the superposition state and will be an integral part of
+quantum computers (see Sec. V A). Not only is deco-
+herence relevant to quantum information, but also vice
+versa. An information-centric view of quantum mechan-
+ics proves helpful in conveying the essence of the deco-
+herence process and is also used in recent explorations
+of the role of the environment as an information channel
+(see Sec. II B).
+It is a curious “historical accident” (Joos’s term [14,
+p. 13]) that the role of the environment in quantum me-
+chanics was appreciated only relatively late. While one
+can find—for example, in Heisenberg’s writings [15]—a
+few early anticipatory remarks about the role of environ-
+mental interactions in the quantum-mechanical descrip-
+tion of systems, it wasn’t until the 1970s that the ubiquity
+and implications of environmental entanglement were re-
+alized by Zeh [1, 16]. It took another decade for the for-
+malism of decoherence to be developed, chiefly by Zurek
+[2, 3], and for concrete models and numerical estimates
+of decoherence rates to be worked out [17, 18].
+Review papers on decoherence include Refs. [4–6, 10, 19].
+There are two books on decoherence: a volume
+by Joos et al. [8] (a collection of chapters written by
+different authors) and a monograph by this author [9].
+Ref. [20] also contains material on decoherence. Foun-
+dational implications of decoherence are discussed in
+Refs. [6, 7, 9, 21].
+
+II. BASIC FORMALISM AND CONCEPTS
+
+In the double-slit experiment, we cannot observe an in-
+terference pattern if we also measure which slit the parti-
+cle went through (that is, if we obtain perfect which-path
+information). In fact, there is a continuous tradeoff be-
+tween interference (phase information) and which-path
+information: the better we can distinguish the two pos-
+sible paths, the less visible the interference pattern be-
+comes [22]. What is more, for a decrease in interference
+visibility to occur it suffices that there are degrees of
+freedom somewhere in the world that, if they were mea-
+sured, would allow us to make, with a certain degree of
+confidence, a statement about the path of the particle
+through the slits. While we cannot say that prior to
+their measurement, these degrees of freedom have en-
+coded information about a particular, definitive path of
+the particle—instead, we have merely correlations involv-
+ing both possible paths—no actual measurement is re-
+quired to bring about the decrease in interference visibil-
+ity. It is enough that, in principle, we could make such
+a measurement to obtain which-path information.
+This is somewhat loose talk, and conceptual caveats
+lurk. But it captures quite well the essence of what is
+happening in decoherence, where those “degrees of free-
+dom somewhere in the world” are the degrees of freedom
+of the system’s environment interacting with the system,
+leading to the creation of quantum correlations (entan-
+glement) between system and environment. Decoherence
+can thus be thought of as a process arising from the con-
+tinuous monitoring of the system by the environment;
+effectively, the environment is performing nondemolition
+measurements on the system (see Sec. II B). We now give
+a formal quantum-mechanical account of what we have
+just tried to convey in words, and then flesh out the con-
+sequences and details.
+
+A. Decoherence and interference damping
+
+Consider again the double-slit experiment and denote
+the quantum states of the particle (call it S, for “sys-
+tem”) corresponding to passage through slit 1 and 2 by
+|s1⟩ and |s2⟩, respectively. Suppose that the particle in-
+teracts with another system E—for example, a detec-
+tor or an environment—such that if the quantum state
+of the particle before the interaction is |s1⟩, then the
+quantum state of E will become |E1⟩ (and similarly for
+|s2⟩), resulting in the final composite states |s1⟩ |E1⟩ and
+|s2⟩ |E2⟩, respectively. For an initial superposition state
+α |s1⟩+β |s2⟩, the final composite state will be entangled,
+
+|Ψ⟩ = α |s1⟩ |E1⟩ + β |s2⟩ |E2⟩ .
+(1)
+
+The statistics of all possible local measurements on S
+are exhaustively encoded in the reduced density matrix
+ρS,
+
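+\[ \rho_S = \mathrm{Tr}_E(\rho_{SE}) = \mathrm{Tr}_E |\Psi\rangle\langle\Psi|
+   = |\alpha|^2 |s_1\rangle\langle s_1| + |\beta|^2 |s_2\rangle\langle s_2|
+   + \alpha\beta^* |s_1\rangle\langle s_2| \langle E_2|E_1\rangle + \alpha^*\beta |s_2\rangle\langle s_1| \langle E_1|E_2\rangle. \] (2)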
+
+For example, suppose we measure particle’s position by
+letting the particle impinge on a distant detection screen.
+Statistically, the resulting particle probability density
+p(x) will be given by
+
+\[ p(x) = \mathrm{Tr}_S(\rho_S x)
+   = |\alpha|^2 |\psi_1(x)|^2 + |\beta|^2 |\psi_2(x)|^2
+   + 2\,\mathrm{Re}\left\{ \alpha\beta^* \psi_1(x) \psi_2^*(x) \langle E_2|E_1\rangle \right\}, \] (3)
+
+where ψi(x) ≡ ⟨x|si⟩. The last term represents the in-
+terference contribution. Thus, the visibility of the inter-
+ference pattern is quantified by the overlap ⟨E2|E1⟩, i.e.,
+by the distinguishability of |E1⟩ and |E2⟩. In the lim-
+iting case of perfect distinguishability, ⟨E2|E1⟩ = 0, no
+interference pattern will be observable and we obtain the
+classical prediction. Phase relations have become locally
+(i.e., with respect to S) inaccessible, and there is no mea-
+surement on S that can reveal coherence between |s1⟩ and
+|s2⟩. The coherence is now between the states |s1⟩ |E1⟩
+and |s2⟩ |E2⟩, requiring an appropriate global measure-
+ment (acting jointly on S and E) for it to be revealed.
+Conversely, if the interaction between S and E is such
+that E is completely unable to resolve the path of the
+particle, then |E1⟩ and |E2⟩ are indistinguishable and full
+coherence is retained at the level of S, as is also directly
+obvious from Eq. (1). In the intermediary regime where
+0 < |⟨E2|E1⟩| < 1, meaning that |E1⟩ and |E2⟩ can be
+distinguished in a one-shot measurement with nonzero
+probability p = 1 − |⟨E2|E1⟩|² < 1, an interference pat-
+tern of reduced visibility is obtained. Equation (3) shows
+that the reduction in visibility increases as |E1⟩ and |E2⟩
+become more distinguishable.
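As a numerical illustration of Eqs. (1) to (3), the following minimal Python sketch builds the entangled state for a two-dimensional environment and traces the environment out. The function names, the two-dimensional environment, and the angle `theta` that parameterizes the overlap ⟨E2|E1⟩ = cos(theta) are assumptions made for this example only; they do not appear in the review.

```python
import math

def reduced_state(alpha, beta, env1, env2):
    """Trace out E from |Psi> = alpha|s1>|E1> + beta|s2>|E2>, as in Eq. (2)."""
    # Coefficients psi[s][e] of |Psi> in the product basis |s>|e>.
    psi = [[alpha * e for e in env1], [beta * e for e in env2]]
    # Partial trace over E: rho[i][j] = sum_e psi[i][e] * conj(psi[j][e]).
    return [
        [sum(a * complex(b).conjugate() for a, b in zip(psi[i], psi[j])) for j in range(2)]
        for i in range(2)
    ]

def coherence(theta):
    """|rho_12| for alpha = beta = 1/sqrt(2) and <E2|E1> = cos(theta) (assumed toy model)."""
    env1 = [1.0, 0.0]
    env2 = [math.cos(theta), math.sin(theta)]
    amp = 1.0 / math.sqrt(2.0)
    return abs(reduced_state(amp, amp, env1, env2)[0][1])

if __name__ == "__main__":
    for theta in (0.0, math.pi / 4, math.pi / 2):
        print(f"theta = {theta:.3f}  |rho_12| = {coherence(theta):.3f}")
```

For equal amplitudes the off-diagonal element has magnitude |⟨E2|E1⟩|/2, so it falls from 1/2 (indistinguishable environment states) to 0 (perfectly distinguishable states), mirroring the loss of fringe visibility described above.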
+Here is another way of putting the matter. Looking
+back at Eq. (1), we see that E encodes which-way infor-
+mation about S in the same “relative-state” sense [23] in
+which EPR correlations [24–26] may be said to encode
+“information.” That is, if ⟨E2|E1⟩ = 0 and we were to
+measure E and found it to be in state |E1⟩, we could, in
+EPR’s words [24, p. 777], “predict with certainty” that
+we will find S in |s1⟩.¹ Whenever such a prediction is
+possible were we to measure E, no interference effects be-
+tween the components |s1⟩ and |s2⟩ can be measured at
+S, even if E is never actually measured. If |⟨E2|E1⟩| > 0,
+then E encodes only partial which-way information about
+S, in the sense that a measurement of E could not reliably
+distinguish between |E1⟩ and |E2⟩; instead, sometimes
+the measurement will result in an outcome compatible
+with both |E1⟩ and |E2⟩. Consequently, an interference
+experiment carried out on S would find reduced visibil-
+ity, representing diminished local coherence between the
+components |s1⟩ and |s2⟩.
+As hinted above, the description developed so far de-
+scribes the essence of the decoherence process if we iden-
+tify the particle S more generally with an arbitrary quan-
+tum system and the second system E with the environ-
+ment of S. Then an idealized account of the decoherence
+interaction has form
+
+\[ \left( \sum_i c_i |s_i\rangle \right) |E_0\rangle \longrightarrow \sum_i c_i |s_i\rangle |E_i(t)\rangle. \] (4)
+
+Footnote 1: Of course, this must not be read as saying that S was already
+in |s1⟩ (i.e., “went through slit 1”) prior to the measurement
+of E. Nor does it mean that the result of a subsequent path
+measurement on S is necessarily determined, by virtue of the
+measurement on E, prior to this S-measurement’s actually being
+carried out. After all, as Peres has cautioned us [27], unperformed
+measurements have no outcomes. So while the picture
+of E as “encoding which-path information” about S is certainly
+suggestive and helpful, it should be used with an understanding
+of its conceptual pitfalls.
+
+We have here introduced a time parameter t, where t = 0
+corresponds to the onset of the environmental interac-
+tion, with |Ei(t)⟩ ≡ |E0⟩ for all i; at t < 0 the system
+and environment are assumed to be uncorrelated (an as-
+sumption common to most decoherence models).
+A single environmental particle interacting with the
+system will typically only insufficiently resolve the com-
+ponents |si⟩ in the system’s superposition state. But be-
+cause of the large number of such particles (and, hence,
+degrees of freedom), the overlap between their different
+joint states |Ei(t)⟩ will rapidly decrease as a result of
+the build-up of many interaction events. Specifically, in
+many decoherence models an exponential decay of over-
+lap is found [3, 5, 9, 17, 20, 28–31],
+
+\[ \langle E_i(t) | E_j(t) \rangle \propto e^{-t/\tau_d} \quad \text{for } i \neq j. \] (5)
+
+Here τd is the characteristic decoherence timescale, which
+can be evaluated for particular choices of the parameters
+in each model (see Sec. IV).
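The build-up of many weak interaction events into an exponential decay can be sketched with a toy calculation. The sketch below assumes, purely for illustration, that the environment consists of independent, identical particles, that each scattering event leaves the same single-particle overlap r with 0 < r < 1, and that events occur at a constant rate; none of these assumptions, nor the numerical values, are taken from the review.

```python
import math

def joint_overlap(single_overlap, n_particles):
    """Overlap of two product environment states built from n identical factors."""
    return single_overlap ** n_particles

def decoherence_time(single_overlap, scattering_rate):
    """tau_d such that single_overlap ** (scattering_rate * t) == exp(-t / tau_d)."""
    return -1.0 / (scattering_rate * math.log(single_overlap))

if __name__ == "__main__":
    r, rate = 0.99, 1.0e6  # hypothetical values: overlap per event, events per second
    tau_d = decoherence_time(r, rate)
    for t in (1e-6, 1e-5, 1e-4):
        # Both columns agree: the product of overlaps is an exponential in t.
        print(t, joint_overlap(r, rate * t), math.exp(-t / tau_d))
```

Even when each event barely distinguishes the system states (r close to 1), the joint overlap r^N shrinks rapidly once N is large, which is the mechanism behind the short timescale τd in Eq. (5).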
+
+B. Environmental monitoring and information transfer
+
+We will now motivate, in a different and more rigorous
+way, the picture of decoherence as a process of environmental
+monitoring. First, we express the influence of the environment
+in a completely general way. We assume that at t = 0 there are
+no correlations between system S and environment E,
+ρSE(0) = ρS(0) ⊗ ρE(0). We write ρE(0) in its diagonal
+decomposition, ρE(0) = Σi pi |Ei⟩⟨Ei|, where Σi pi = 1 and the
+states |Ei⟩ form an orthonormal basis of the Hilbert space of E.
+If H denotes the Hamiltonian (here assumed to be time-independent)
+of SE and U(t) = e^{−iHt} represents the unitary time evolution
+operator, then the density matrix of S evolves according to
+
+\[ \rho_S(t) = \mathrm{Tr}_E \left\{ U(t) \left[ \rho_S(0) \otimes \left( \sum_i p_i |E_i\rangle\langle E_i| \right) \right] U^\dagger(t) \right\}
+   = \sum_{ij} p_i \langle E_j| U(t) |E_i\rangle \, \rho_S(0) \, \langle E_i| U^\dagger(t) |E_j\rangle. \] (6)
+
+Introducing the Kraus operators [32] defined by
+Eij ≡ √pi ⟨Ej| U(t) |Ei⟩, we obtain
+
+\[ \rho_S(t) = \sum_{ij} E_{ij} \, \rho_S(0) \, E_{ij}^\dagger. \] (7)
+
+It is customary to combine the two indices i and j into a
+single index and write the Kraus operators as
+
+\[ W_k \equiv \sqrt{p_{i_k}} \, \langle E_{j_k}| U(t) |E_{i_k}\rangle, \] (8)
+
+such that
+
+\[ \rho_S(t) = \sum_k W_k \, \rho_S(0) \, W_k^\dagger. \] (9)
+
+This Kraus-operator formalism (also called operator-sum
+formalism) represents the effect of the environment as
+a sequence of (in general nonunitary) transformations of
+ρS generated by the operators Wk. The Kraus operators
+exhaustively encode information about the initial state
+of the environment and about the dynamics of the joint
+SE system. Because the evolution of SE is unitary, the
+Kraus operators satisfy the completeness constraint
+
+\[ \sum_k W_k^\dagger W_k = I_S, \] (10)
+
+where IS is the identity operator in the Hilbert space of
+S. Equations (9) and (10) together imply that the Wk are
+the generators of a completely positive map Φ : ρS(0) ↦
+ρS, also known as a quantum operation [32] or quantum
+channel.²
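The Kraus construction can be made concrete with a small sketch. It assumes, for illustration only, a one-qubit system and a one-qubit environment that starts in a pure state (so a single p_i equals 1) and interact through a CNOT gate with the system as control; the helper names are hypothetical. The sketch extracts the operators ⟨E_j| U |E_i⟩ from the joint unitary and applies the operator-sum form of Eq. (9).

```python
def kraus_operators(u, env_index, dim_s, dim_e):
    """Return W_j = <E_j| U |E_i> for a pure initial environment state |E_i>.

    Basis ordering is |s>|e> -> index s * dim_e + e.
    """
    return [
        [[u[s_out * dim_e + j][s_in * dim_e + env_index] for s_in in range(dim_s)]
         for s_out in range(dim_s)]
        for j in range(dim_e)
    ]

def dagger(m):
    """Conjugate transpose of a matrix given as a list of rows."""
    return [[complex(m[r][c]).conjugate() for r in range(len(m))] for c in range(len(m[0]))]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def apply_channel(ops, rho):
    """Operator-sum evolution rho -> sum_k W_k rho W_k^dagger."""
    out = [[0j] * len(rho) for _ in rho]
    for w in ops:
        out = add(out, matmul(matmul(w, rho), dagger(w)))
    return out

# Hypothetical interaction: CNOT with S as control and E as target.
CNOT = [[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 0]]

OPS = kraus_operators(CNOT, env_index=0, dim_s=2, dim_e=2)
RHO_PLUS = [[0.5, 0.5], [0.5, 0.5]]  # |+><+|, maximal coherence between |s1> and |s2>
RHO_OUT = apply_channel(OPS, RHO_PLUS)

if __name__ == "__main__":
    print(RHO_OUT)  # populations are kept, off-diagonal elements are removed
```

In this toy model the two Kraus operators are the projectors onto the system basis states, so the environment records the system state perfectly and the reduced state loses all coherence while its trace is preserved, which requires the operators to sum to the identity in the order W†W.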
+
+We will now use Eq. (9) to formally motivate the view
+that decoherence corresponds to an indirect measurement
+of the system by the environment, and that it thus re-
+sults from a transfer of information from the system to
+the environment (see also Ref. [19]). In such an indirect
+measurement, we let the system S interact with a
+probe—here the environment E—followed by a projec-
+tive measurement on E. The probe is treated as a quan-
+tum system. This procedure aims to yield information
+about S without performing a projective (and thus de-
+structive) direct measurement on S. To model such an
+indirect measurement, consider again an initial composite
+density operator ρSE(0) = ρS(0) ⊗ ρE(0) evolving
+under the action of U(t) = e^{−iHt}, where H is the total
+Hamiltonian. Consider a projective measurement M
+on E with eigenvalues α and corresponding projectors
+Pα ≡ |α⟩⟨α|, with Pα² = Pα† = Pα. The probability of
+obtaining outcome α in this measurement when S is described
+by the density operator ρS(t) is
+
+\[ \mathrm{Prob}(\alpha \mid \rho_S(t)) = \mathrm{Tr}_E \left( P_\alpha \rho_E(t) \right)
+   = \mathrm{Tr}_E \left\{ P_\alpha \mathrm{Tr}_S \left\{ U(t) \left( \rho_S(0) \otimes \rho_E(0) \right) U^\dagger(t) \right\} \right\}. \] (11)
+
+The density matrix of S conditioned on the particular
+outcome α is
+
+\[ \rho_S^{(\alpha)}(t) = \frac{ \mathrm{Tr}_E \left\{ [I \otimes P_\alpha] \, \rho_{SE}(t) \, [I \otimes P_\alpha] \right\} }{ \mathrm{Prob}(\alpha \mid \rho_S(t)) }
+   = \frac{ \mathrm{Tr}_E \left\{ [I \otimes P_\alpha] \, U(t) \, [\rho_S(0) \otimes \rho_E(0)] \, U^\dagger(t) \, [I \otimes P_\alpha] \right\} }{ \mathrm{Prob}(\alpha \mid \rho_S(t)) }. \] (12)
+
+Footnote 2: The Kraus formalism is of limited use in calculating decoherence
+dynamics for concrete situations of physical interest. This is so
+because finding the Kraus operators corresponds to diagonalizing
+the full Hamiltonian of SE, usually a prohibitively difficult
+task. Moreover, the Kraus operators contain all contributions
+to the evolution of the reduced density matrix, while for considerations
+of decoherence we are typically interested only in
+the nonunitary terms, and certain contributions (such as back-action
+effects from the system on the environment) can often be
+neglected. (This is where master equations come into play; see
+Sec. III.)
+
+Inserting the diagonal decomposition ρE(0) = Σk pk |Ek⟩⟨Ek|
+and carrying out the trace gives [19]
+
+\[ \rho_S^{(\alpha)}(t) = \sum_k \frac{ M_{\alpha,k} \, \rho_S(0) \, M_{\alpha,k}^\dagger }{ \mathrm{Prob}(\alpha \mid \rho_S(t)) }. \] (13)
+
+Here we have introduced the measurement operators
+
+Mα,k ≡ √pk ⟨α| U(t) |Ek⟩ ,
+(14)
+
+which obey the completeness constraint
+Σα,k M†α,k Mα,k = IS. Equation (12) describes the
+effect of the indirect measurement on the state of the
+system. If, however, we do not actually inquire about
+the result of this measurement, we must assign to the
+system a density operator that is a sum over all the
+possible conditional states ρS^(α)(t) weighted by their
+probabilities Prob (α | ρS(t)),
+
+\[ \rho_S(t) = \sum_\alpha \mathrm{Prob}(\alpha \mid \rho_S(t)) \, \rho_S^{(\alpha)}(t)
+   = \sum_{\alpha,k} M_{\alpha,k} \, \rho_S(0) \, M_{\alpha,k}^\dagger. \] (15)
+
+Note that this expression is formally analogous to the
+Kraus-operator expression of Eq. (9), which described
+the effect of a general environmental interaction on the
+state of the system. Recall, further, that the situation we
+encounter in decoherence is precisely one in which we do
+not actually read out the environment—or, in the present
+picture, in which we do not inquire about the result of the
+indirect measurement.
+This suggests that decoherence
+can indeed be understood as an indirect measurement—
+a monitoring—of the system by its environment.
+
+C. Environment-induced superselection and decoherence-free subspaces
+
+Decoherence can occur in any basis; which observable
+is monitored by the environment depends on the spe-
+cific form of the system–environment interaction. The
+preferred states (or preferred observables) of the system
+emerge dynamically as those states that are the most ro-
+bust to the interaction with the environment, in the sense
+that they become least entangled with the environment;
+thus, they are the states most immune to decoherence.
+This is the stability criterion for the selection of pre-
+ferred states, resulting in the dynamical selection of pre-
+ferred states (“environment-induced superselection”) [1–
+3, 16]. These environment-superselected preferred states
+(or observables) are sometimes also called pointer states
+(or pointer observables) [2], since they correspond to the
+physical quantities that are most easily “read off” at the
+level of the system, akin to the pointer on the dial of a
+measurement apparatus.
+
+1. Pointer states and the commutativity criterion
+
+To find the preferred states, we decompose the
+total system–environment Hamiltonian into the self-
+Hamiltonians of the system S and environment E rep-
+resenting the intrinsic dynamics, and a part Hint repre-
+senting the interaction between system and environment,
+
+H = HS + HE + Hint.
+(16)
+
+In many cases of practical interest, Hint dominates
+the evolution of the system, H ≈ Hint (the quantum-
+measurement limit of decoherence). We look for system
+states |si⟩ such that the composite system–environment
+state, when starting from a product state |si⟩ |E0⟩ at
+t = 0, remains in the product form |si⟩ |Ei(t)⟩ for all
+t > 0 under the action of Hint (we shall assume here
+that Hint is not explicitly time-dependent). That is, we
+demand that (setting ℏ ≡ 1 from here on)
+
+\[ e^{-iH_{\mathrm{int}}t} |s_i\rangle |E_0\rangle = \lambda_i |s_i\rangle \, e^{-iH_{\mathrm{int}}t} |E_0\rangle \equiv |s_i\rangle |E_i(t)\rangle. \] (17)
+
+Thus, the pointer states |si⟩ are the eigenstates of the
+part of the interaction Hamiltonian Hint pertaining to the
+Hilbert space of the system, with eigenvalues λi. These
+states will be stationary under Hint [2]. It follows that the
+pointer observable defined by OS = Σi oi |si⟩⟨si| commutes
+with Hint,
+
+\[ \left[ O_S, H_{\mathrm{int}} \right] = 0. \] (18)
+
+This commutativity criterion [2, 3] is particularly easy to
+apply when Hint takes the tensor-product form Hint =
+S ⊗ E, as is frequently the case. Then the environment-
+superselected observables will be those observables that
+commute with S.
+If S is Hermitian, it represents the
+physical quantity monitored by the environment. In general,
+any $H_{\text{int}}$ can be written as a diagonal decomposition
+of (unitary but not necessarily Hermitian) system and
+environment operators Sα and Eα, $H_{\text{int}} = \sum_\alpha S_\alpha \otimes E_\alpha$.
+If the Sα are Hermitian, such a Hamiltonian represents
+the simultaneous environmental monitoring of different
+observables Sα of the system. A sufficient condition for
+{|si⟩} to form a set of pointer states of the system is then
+given by the requirement that the |si⟩ be simultaneous
+eigenstates of the operators Sα,
+
+\[ S_\alpha |s_i\rangle = \lambda_i^{(\alpha)} |s_i\rangle \quad \text{for all } \alpha \text{ and } i. \tag{19} \]
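+
+As a numerical illustration of Eq. (17), the following minimal sketch assumes a toy model that is not taken from the text: a single qubit with S = σz coupled to N environment qubits through Hint = σz ⊗ Σk gk σx(k), with every environment qubit starting in |0⟩. In this assumed model the pointer states |0⟩ and |1⟩ of the system remain in product form, while the coherence of a superposition of the two is multiplied by the overlap r(t) = ⟨E1(t)|E0(t)⟩ = Πk cos(2 gk t). The function name and the coupling values gk are hypothetical and chosen only for illustration.

```python
import math


def decoherence_factor(couplings, t):
    """Overlap <E_1(t)|E_0(t)> of the two environment states correlated with
    the pointer states |0> and |1> of the system.

    Assumed toy model: H_int = sigma_z (x) sum_k g_k sigma_x^(k), with each
    environment qubit initially in |0>. Each qubit then contributes a factor
    cos(2 g_k t) to the overlap.
    """
    r = 1.0
    for g in couplings:
        r *= math.cos(2.0 * g * t)
    return r


# Hypothetical coupling strengths (units with hbar = 1), for illustration only.
couplings = [0.3 + 0.1 * k for k in range(20)]

for t in (0.0, 0.5, 1.0, 2.0):
    print(f"t = {t:3.1f}   r(t) = {decoherence_factor(couplings, t):+.3e}")
```

+At t = 0 the overlap equals 1, so no coherence has been lost; for generic later times the product of many cosines is small, which is how the environment suppresses superpositions of pointer states in this sketch.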
+
+Interaction Hamiltonians frequently describe the scat-
+tering of surrounding particles (photons, air molecules,
+etc.), leading to collisional decoherence (see Sec. IV A).
+Since the force laws describing such processes typically
+depend on some power of distance, the interaction Hamil-
+tonian will then commute with the position operator.
+Thus, the pointer states will be approximate eigenstates
+of position (i.e., narrow position-space wave packets).
+This explains why superpositions of mesoscopically and
+macroscopically distinct positions are prohibitively diffi-
+cult to observe [2, 3, 17, 31, 33–39]. Collisional decoher-
+ence can also be dominant in microscopic systems when
+these systems occur in distinct spatial configurations that
+couple strongly to the surrounding medium. For exam-
+ple, chiral molecules such as sugar are always observed to
+be in chirality eigenstates (left-handed or right-handed),
+which are superpositions of different energy eigenstates.
+Any attempt to prepare such molecules in energy eigen-
+states leads to immediate decoherence into the environ-
+mentally stable chirality eigenstates [40, 41].
+The quantum limit of decoherence [42] arises when the
+modes of the environment are slow in comparison with
+the evolution of the system—that is, when the highest
+frequencies (i.e., energies) available in the environment
+are smaller than the separation between the energy eigen-
+states of the system. Then the environment will be able
+to monitor only quantities that are constants of motion.
+In the case of nondegeneracy, this quantity will be the en-
+ergy of the system, leading to the environment-induced
+superselection of energy eigenstates for the system [42].3
+
+In many realistic situations, the commutativity crite-
+rion, Eq. (18), can only be fulfilled approximately [43, 44].
+In addition, the self-Hamiltonian of the system and the
+interaction Hamiltonian may contribute in roughly equal
+strengths (e.g., in models for quantum Brownian motion
+[4, 45]; see Sec. IV B), rendering neither the quantum-
+measurement limit of negligible intrinsic dynamics nor
+the quantum limit of decoherence of a slow environ-
+ment appropriate. In such cases, more general methods
+for determining the preferred states are required. The
+predictability-sieve strategy [43, 44, 46] computes the time
+dependence of the amount of decoherence introduced into
+the system for a large set of initial states of the system
+evolving under the total system–environment Hamiltonian.
+Typically, this decoherence is measured using either
+the purity $\mathrm{Tr}\left(\rho_S^2\right)$ or the von Neumann entropy
+
+3 Textbooks on quantum mechanics usually attribute a special role
+to such energy eigenstates (for closed systems) since they are
+stationary under the action of the Hamiltonian. In this closed-
+system picture, however, arbitrary superpositions of energy
+eigenstates should nonetheless be perfectly legitimate. Thus, it
+is important to realize that the environment-induced superselec-
+tion of energy eigenstates is not equivalent to a situation in which
+the presence of the environment could be neglected altogether;
+instead, the environment plays the crucial role of continuously
+monitoring the energy of the system, leading to a local suppres-
+sion of coherence between energy eigenstates.
+
+$S(\rho_S) = -\mathrm{Tr}\left(\rho_S \log_2 \rho_S\right)$ of the reduced density matrix
+ρS. The states most immune to decoherence will be those
+which lead to the smallest decrease in purity or the small-
+est increase in von Neumann entropy.
+Application of
+this method leads to a ranking of the possible preferred
+states with respect to their robustness to the interac-
+tion with the environment. For particular models it has
+been explicitly shown that the states picked out by the
+predictability sieve are robust to the particular choice of
+the measure of decoherence. For example, in the model
+for quantum Brownian motion, different measures lead
+to the same minimum-uncertainty wave packets in phase
+space [5, 8, 16, 44, 47, 48].
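+
+The two decoherence measures used by the predictability sieve can be made concrete for a single qubit. The sketch below is only an illustration under stated assumptions: the state is parametrized as ρ = [[p, c], [c*, 1 − p]] with |c|² ≤ p(1 − p), and the function name is hypothetical. It evaluates the purity Tr(ρ²) and the von Neumann entropy in bits as the coherence c is reduced by hand; it does not implement the sieve itself, which would scan many initial states under the full dynamics.

```python
import math


def purity_and_entropy(p, c):
    """Purity Tr(rho^2) and von Neumann entropy (in bits) of the qubit state
    rho = [[p, c], [conj(c), 1 - p]].

    Assumes a valid density matrix, i.e. 0 <= p <= 1 and |c|^2 <= p (1 - p).
    """
    purity = p ** 2 + (1.0 - p) ** 2 + 2.0 * abs(c) ** 2
    # Eigenvalues of a unit-trace Hermitian 2x2 matrix.
    radius = math.sqrt((p - 0.5) ** 2 + abs(c) ** 2)
    eigenvalues = (0.5 + radius, 0.5 - radius)
    entropy = 0.0
    for x in eigenvalues:
        if x > 1e-15:  # 0 * log(0) is taken as 0
            entropy -= x * math.log2(x)
    return purity, entropy


# Equal populations; the coherence c is lowered step by step.
for coherence in (0.5, 0.25, 0.0):
    purity, entropy = purity_and_entropy(0.5, coherence)
    print(f"c = {coherence:4.2f}   purity = {purity:.3f}   entropy = {entropy:.3f} bits")
```

+In this example the purity falls from 1 to 1/2 and the entropy rises from 0 to 1 bit as the off-diagonal element vanishes, so both measures rank the states in the same order.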
+
+2. Decoherence-free subspaces
+
+The pointer-state condition of Eq. (19) can be
+strengthened to the concept of pointer subspaces [3] or
+decoherence-free subspaces (DFS) [49–58]. These are sub-
+spaces of the Hilbert space of the system in which every
+state in the subspace is immune to decoherence; this is
+a nontrivial requirement, since in general superpositions
+of pointer states will not be pointer states themselves.
+One important condition for this to happen is that the
+preferred states |si⟩ defined by Eq. (19) form an orthonormal
+basis of the subspace, and that the eigenvalues $\lambda_i^{(\alpha)}$
+in Eq. (19) are independent of i, i.e., that all |si⟩ are
+simultaneous degenerate eigenstates of each Sα,
+
+\[ S_\alpha |s_i\rangle = \lambda^{(\alpha)} |s_i\rangle \quad \text{for all } \alpha \text{ and } i. \tag{20} \]
+
+This condition states that the action of a given Sα must
+be the same for all basis states |si⟩ of the DFS, and thus
+the existence of a DFS corresponds to a symmetry in the
+structure of the system–environment interaction, i.e., to
+a dynamical symmetry. A necessary condition for such a
+symmetry to obtain is the absence of terms in Hint that
+act jointly on system and environment in a nontrivial
+manner.
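+An arbitrary state |ψ⟩ in the DFS can then be written
+as $|\psi\rangle = \sum_i c_i |s_i\rangle$ and will evolve according to
+
+\[ e^{-iH_{\text{int}}t} |\psi\rangle |E_0\rangle = |\psi\rangle e^{-i\left(\sum_\alpha \lambda^{(\alpha)} E_\alpha\right)t} |E_0\rangle \equiv |\psi\rangle |E_\psi(t)\rangle . \tag{21} \]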
+
+Thus, the state |ψ⟩ does not become entangled with the
+environment and is therefore immune to decoherence.
+When the self-Hamiltonian HS of the system cannot be
+neglected, one needs to additionally ensure that none of
+the basis states |si⟩ of the DFS will drift out of the sub-
+space under the evolution generated by HS. Otherwise
+an initially decoherence-free state would again become
+prone to decoherence. The concept of DFS can be gener-
+alized to the formalism of noiseless subsystems (or noise-
+less quantum codes) [58–60].
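+
+The degeneracy condition of Eq. (20) can be checked by direct enumeration in a simple case. The sketch below assumes collective dephasing of n qubits, i.e. a single system operator S = Σk σz(k) in Hint = S ⊗ E; this model and the function names are assumptions made for illustration and are not taken from the text. Computational basis states that share the same eigenvalue of S span a candidate DFS.

```python
from itertools import product


def collective_eigenvalue(bits):
    """Eigenvalue of S = sum_k sigma_z^(k) on the basis state |bits>,
    with bit 0 counted as +1 and bit 1 counted as -1."""
    return sum(+1 if b == 0 else -1 for b in bits)


def degenerate_subspaces(n_qubits):
    """Group computational basis states by their eigenvalue of S.

    Under the assumed H_int = S (x) E, every state inside one group satisfies
    Eq. (20) with the same lambda, so the group spans a decoherence-free
    subspace with respect to this interaction.
    """
    groups = {}
    for bits in product((0, 1), repeat=n_qubits):
        groups.setdefault(collective_eigenvalue(bits), []).append(bits)
    return groups


groups = degenerate_subspaces(2)
for eigenvalue, states in sorted(groups.items()):
    print(f"lambda = {eigenvalue:+d}: {states}")
```

+For two qubits the states |01⟩ and |10⟩ share the eigenvalue 0, so any superposition of them acquires only a common factor under this assumed interaction and does not become entangled with the environment.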
+
+D. Proliferation of information and quantum Darwinism
+
+Quantum Darwinism [61–69] builds on the ideas of de-
+coherence and environmental encoding of information, by
+broadening the role of the environment to that of a com-
+munication and amplification channel. Interactions be-
+tween the system and its environment lead to the redun-
+dant storage of selected information about the system in
+many fragments of the environment. By measuring some
+of these fragments, observers can indirectly obtain infor-
+mation about the system without appreciably disturbing
+the system itself. Indeed, this represents how we typi-
+cally observe objects. For example, we see an object not
+by directly interacting with it, but by intercepting scat-
+tered photons that encode information about the object’s
+spatial structure [67, 68].
+
+In this sense, quantum Darwinism provides a dynami-
+cal explanation for the robustness of states of (especially)
+macroscopic objects to observation. It was found that
+the observable of the system that can be imprinted most
+completely and redundantly in many distinct fragments
+of the environment coincides with the pointer observable
+selected by the system–environment interaction [62–65];
+conversely, most other states do not seem to be redun-
+dantly storable. Indeed, it has been shown that the re-
+dundant proliferation of information regarding pointer
+states is as inevitable as decoherence itself [70]. Quantum
+Darwinism has been studied in several concrete models,
+for example, in spin environments [64], quantum Brow-
+nian motion [71], and photon and photon-like environ-
+ments [67, 68, 70]. The efficiency of the amplification pro-
+cess described by quantum Darwinism can be expressed
+in terms of the quantum Chernoff information [70].
+
+The structure and amount of information that the
+environment encodes about the system can be quanti-
+fied using the measure of (classical [62, 63] or quantum
+[5, 64, 65]) mutual information. Classical mutual infor-
+mation is based on the choice of particular observables of
+the system S and the environment E and quantifies how
+well one can predict the outcome of a measurement of a
+given observable of S by measuring some observable on
+a fraction of E [62, 63]. Quantum mutual information is
+defined as S(ρS)+S(ρE)−S(ρ), where ρS, ρE, and ρ are
+the density matrices of S, E, and the composite system
+SE, respectively, and S(ρ) = −Tr (ρ log2 ρ) is the von
+Neumann entropy associated with ρ. Quantum mutual
+information quantifies the degree of quantum correlations
+between S and E. Classical and quantum mutual infor-
+mation give similar results [5, 62–65] because the differ-
+ence between the two measures, known as the quantum
+discord [72], disappears when decoherence is sufficiently
+effective to select a well-defined pointer basis [72].
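+
+The redundancy described above can be illustrated with the quantum mutual information S(ρS) + S(ρF) − S(ρSF) between the system and a fragment F of the environment. The sketch below assumes an idealized branching state a|0⟩|0...0⟩ + b|1⟩|1...1⟩ of one system qubit and n environment qubits, with |a|² = p; this state and the function names are assumptions chosen for illustration. For this state every reduced density matrix that omits at least one qubit is diagonal, so its entropy is the binary entropy h(p), and only the full system–environment state is pure.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of the distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def mutual_information(p, n_env, fragment_size):
    """Quantum mutual information I(S:F) in bits for the assumed state
    a|0>|0...0> + b|1>|1...1> with |a|^2 = p, where F contains
    fragment_size of the n_env environment qubits."""
    if not 0 <= fragment_size <= n_env or n_env < 1:
        raise ValueError("need n_env >= 1 and 0 <= fragment_size <= n_env")
    h = binary_entropy(p)
    s_system = h
    s_fragment = h if fragment_size >= 1 else 0.0
    # S+F is pure only when F is the whole environment.
    s_joint = 0.0 if fragment_size == n_env else h
    return s_system + s_fragment - s_joint


for k in (0, 1, 5, 9, 10):
    print(f"fragment size {k:2d}: I(S:F) = {mutual_information(0.5, 10, k):.1f} bits")
```

+In this idealized example a single environment qubit already carries the full classical information h(p) about the system, and the mutual information stays on that plateau until the entire environment is included, where it doubles.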
+
+E. Decoherence versus dissipation and noise
+
+While the presence of dissipation implies the pres-
+ence of decoherence, the converse is not necessarily true.
+When dissipation and decoherence are both present, they
+typically occur on vastly different timescales; the deco-
+herence timescale is typically many orders of magnitude
+shorter than the relaxation timescale. A rule-of-thumb
+estimate for the ratio of the relaxation timescale τr to the
+decoherence timescale τd for a massive object described
+by a superposition of two different positions a distance
+∆x apart is [18]
+
+\[ \frac{\tau_r}{\tau_d} \sim \left( \frac{\Delta x}{\lambda_{\text{dB}}} \right)^2 , \tag{22} \]
+
+where $\lambda_{\text{dB}} = (2mkT)^{-1/2}$ is the thermal de Broglie wavelength
+of the object. For an object of mass m = 1 g at
+room temperature in a coherent superposition of two locations
+a distance ∆x = 1 cm apart, τr/τd is on the order
+of $10^{40}$. Thus, for macroscopic objects the dissipative influence
+of the environment is usually completely negligible
+on the timescale relevant to the decoherence induced
+by this environment.
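+
+The order of magnitude quoted for Eq. (22) can be reproduced with a few lines of arithmetic. The sketch below restores ℏ in SI units, so that the thermal de Broglie wavelength reads ℏ/(2mkT)^{1/2}, and assumes T = 300 K for "room temperature"; the function names are hypothetical.

```python
import math

HBAR = 1.054571817e-34  # reduced Planck constant, J s
K_B = 1.380649e-23      # Boltzmann constant, J/K


def thermal_de_broglie_wavelength(mass_kg, temperature_k):
    """lambda_dB = hbar / sqrt(2 m k T), the SI form of (2mkT)^(-1/2)."""
    return HBAR / math.sqrt(2.0 * mass_kg * K_B * temperature_k)


def relaxation_to_decoherence_ratio(mass_kg, temperature_k, separation_m):
    """Rule-of-thumb ratio tau_r / tau_d ~ (Delta x / lambda_dB)^2 of Eq. (22)."""
    wavelength = thermal_de_broglie_wavelength(mass_kg, temperature_k)
    return (separation_m / wavelength) ** 2


# m = 1 g, Delta x = 1 cm, and an assumed room temperature of 300 K.
ratio = relaxation_to_decoherence_ratio(1e-3, 300.0, 1e-2)
print(f"tau_r / tau_d ~ {ratio:.1e}")
```

+With these inputs the ratio falls between 10^40 and 10^41, consistent with the estimate in the text.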
+Decoherence is a consequence of environmental entan-
+glement. In the literature on quantum computing, how-
+ever, the term “decoherence” is often used to refer to
+any process that affects the qubits, including perturba-
+tions due to classical fluctuations and imperfections. Ex-
+amples for sources of such classical noise in the context
+of quantum computing are the fluctuations in the inten-
+sity [73] and duration [74] of the laser beam incident on
+qubits in an ion trap, inhomogeneities in the magnetic
+fields in NMR quantum computing [75], and bias fluctu-
+ations in superconducting qubits [76]. The distinction be-
+tween classical noise and quantum decoherence has been
+further blurred in quantum error correction, since the
+error-correcting schemes are insensitive to the physical
+origin of the qubit errors (see Sec. V A).
+Phenomenologically and formally the influence of clas-
+sical noise processes may be described in a manner simi-
+lar to the effect of environmental entanglement, namely,
+in terms of a decay of the off-diagonal elements (in-
+terference terms) in the local density matrix (in the
+environment-superselected basis).
+But in the case of
+noise, the decay of the off-diagonal elements occurs be-
+cause the system’s density matrix is identified with an
+average over a physical ensemble of systems (or, put dif-
+ferently, over the different instances of particular noise
+processes), while in the case of decoherence the decay is
+due to an entanglement-induced delocalization of phase
+coherence for individual systems. The fundamental dif-
+ference between these physical processes is masked by the
+density-matrix description. Indeed, one can always find
+an experimental procedure that would, at least in princi-
+ple, distinguish between the different physical processes
+underlying formally similar density-matrix descriptions.
+In contrast with decoherence, noise does not create
+system–environment entanglement and can in principle
+
+always be undone using only local operations (witness,
+for example, the reversal of ensemble dephasing in NMR
+experiments using the spin-echo technique). In any indi-
+vidual realization of the noise process the dynamics of the
+system are completely unitary, and thus no coherence is
+lost from the system. By contrast, if the system becomes
+entangled with environmental degrees of freedom, at the
+very least we would need to perform a pair of measure-
+ments on the environment before and after the interac-
+tion with the system in order to gather enough informa-
+tion to reverse the effect of decoherence by application
+of an appropriate countertransformation. Moreover, as
+also seen experimentally [77], these measurements would
+not always constitute a sufficient procedure for “undo-
+ing” decoherence (see also Sec. IV.C of Ref. [5]).
+The loss of phase coherence due to environmental
+entanglement is sometimes simulated (with the above
+caveats) by classical fluctuations perturbing the system,
+i.e., by the addition of certain time-dependent terms to
+the self-Hamiltonian of the system. This strategy was
+implemented, for example, in theoretical [73, 78] and ex-
+perimental [77, 79] studies of the influence of fluctuating
+parameters in ion-trap quantum computers.
+
+III. MASTER EQUATIONS
+
+In the usual approach to modeling decoherence, the
+reduced density matrix ρS(t) is obtained from
+
+\[ \rho_S(t) = \mathrm{Tr}_E\, \rho_{SE}(t) \equiv \mathrm{Tr}_E \left\{ U(t)\rho_{SE}(0)U^\dagger(t) \right\}, \tag{23} \]
+
+where U(t) is the time-evolution operator for the compos-
+ite system SE. The task of calculating ρSE(t) is often
+computationally cumbersome or even intractable. It is
+also unnecessarily detailed, because we are usually only
+interested in the dynamics of the system. A master equa-
+tion allows us to calculate ρS(t) directly from an expres-
+sion of the form
+
+\[ \rho_S(t) = V(t)\rho_S(0), \tag{24} \]
+
+where the superoperator V(t) is the dynamical map gen-
+erating the evolution of ρS(t).
+If the master equation
+is exact, then we merely have the identity $V(t)\rho_S(0) \equiv \mathrm{Tr}_E \left\{ U(t)\rho_{SE}(0)U^\dagger(t) \right\}$
+and no computational advantage
+is gained.
+Therefore, master equations are typically
+based on simplifying approximations.
+In modeling decoherence, we focus on master equations
+that are first-order time-local differential equations of the
+form
+
+\[ \frac{d}{dt}\rho_S(t) = L\left[\rho_S(t)\right] \equiv -i\left[H'_S, \rho_S(t)\right] + D\left[\rho_S(t)\right]. \tag{25} \]
+
+This equation is local in time in the sense that the change
+of ρS at time t depends only on ρS evaluated at t. The
+superoperator L acting on ρS(t) typically depends on the
+initial state of the environment and the different terms
+in the Hamiltonian.
+We have decomposed L into two
+parts to distinguish their physical interpretation. The
+first term, $-i\left[H'_S, \rho_S(t)\right]$, is unitary and given by the
+Liouville–von Neumann commutator with the “renormalized”
+Hamiltonian $H'_S$ of the system. (Because the environment
+typically leads to a renormalization of the energy
+levels of the system, this Hamiltonian does in general
+not coincide with the unperturbed free Hamiltonian HS
+of S that would generate the evolution of S in absence of
+the environment.) The second, nonunitary term D[ρS(t)]
+represents decoherence (and often also dissipation) due to
+the environment.
+
+A. Born–Markov master equations
+
+Born–Markov master equations allow for many deco-
+herence problems to be treated in a mathematically sim-
+ple, and often closed, form. They are based on the fol-
+lowing two approximations:
+
+1. The Born approximation. The system–environment
+coupling is sufficiently weak and
+the environment is reasonably large such that
+changes of the density operator ρE of the environment
+are negligible and the system–environment
+density operator remains approximately
+factorized at all times, ρSE(t) ≈ ρS(t) ⊗ ρE.
+
+2. The Markov approximation. Memory effects of the
+environment are negligible, in the sense that any
+self-correlations within the environment created by
+the coupling to the system decay rapidly compared
+to the characteristic relaxation timescale of the
+open quantum system.
+
+Comparisons between the predictions of models based
+on Born–Markov master equations and experimental
+data indicate that the Born and Markov assumptions are
+reasonable in many physical situations (but see Sec. III C
+below for exceptions and non-Markovian models). Assuming
+these approximations hold and writing the interaction
+Hamiltonian as $H_{\text{int}} = \sum_\alpha S_\alpha \otimes E_\alpha$, the Born–Markov
+master equation reads [9, 20]
+
+\[ \frac{d}{dt}\rho_S(t) = -i\left[H_S, \rho_S(t)\right] - \sum_\alpha \left\{ \left[S_\alpha, B_\alpha \rho_S(t)\right] + \left[\rho_S(t) C_\alpha, S_\alpha\right] \right\}, \tag{26} \]
+
+where the system operators Bα and Cα are defined as
+
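+\[ B_\alpha \equiv \int_0^\infty d\tau \sum_\beta c_{\alpha\beta}(\tau)\, S_\beta^{(I)}(-\tau), \tag{27a} \]
+
+\[ C_\alpha \equiv \int_0^\infty d\tau \sum_\beta c_{\beta\alpha}(-\tau)\, S_\beta^{(I)}(-\tau). \tag{27b} \]
+
+Here $S_\alpha^{(I)}(-\tau)$ denotes the operator Sα in the interaction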
+picture. In the following, we will simplify notation by
+
+omitting the superscript “I”; instead we use the conven-
+tion that all operators bearing explicit time arguments
+are to be understood as interaction-picture operators.
+(For density operators, however, we will maintain the
+superscript notation in order to distinguish them from
+Schrödinger-picture density operators, which also carry
+a time argument.) The quantities cαβ(τ) appearing in
+Eq. (27) are given by
+
+\[ c_{\alpha\beta}(\tau) \equiv \left\langle E_\alpha(\tau) E_\beta \right\rangle_{\rho_E} . \tag{28} \]
+
+These environment self-correlation functions quantify
+how much information the environment retains over time
+about its interaction with the system. The Markov ap-
+proximation corresponds to the assumption of a rapid
+decay of the cαβ(τ) relative to the timescale set by the
+evolution of the system.
+In many situations of interest, the general form of the
+Born–Markov master equation, Eq. (26), simplifies con-
+siderably.
+For example, typically only a single system
+observable S is monitored by the environment, Hint =
+S ⊗E. Also, the time dependence of the operators Sα(τ)
+and Eα(τ) is often simple, facilitating the calculation of
+the quantities Bα and Cα.
+Examples are discussed in
+Sec. IV.
+
+B. Lindblad master equations
+
+Lindblad master equations constitute a particular, al-
+beit quite general, class of time-local Markovian mas-
+ter equations. They arise from the requirement that the
+evolution of the reduced density matrix generated by the
+master equation must ensure complete positivity [20, 80–
+85]. Complete positivity guarantees that the dynamical
+map $\rho_S(0) \mapsto \rho_S(t) = V(t)\rho_S(0)$ described by the master
+equation generates physically consistent dynamics even
+when S is initially entangled with another system. While
+complete positivity is automatically fulfilled if the evo-
+lution is exact, approximate master equations will not
+necessarily ensure complete positivity [20, 83–86]. The
+Lindblad master equation is a special case of the gen-
+eral Born–Markov master equation that ensures complete
+positivity and takes the general form [81, 82]
+
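+\[ \frac{d}{dt}\rho_S(t) = -i\left[H'_S, \rho_S(t)\right] + \frac{1}{2}\sum_{\alpha\beta} \gamma_{\alpha\beta} \left\{ \left[S_\alpha, \rho_S(t) S_\beta^\dagger\right] + \left[S_\alpha \rho_S(t), S_\beta^\dagger\right] \right\}, \tag{29} \]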
+
+where $H'_S$ is the renormalized Hamiltonian of the system.
+The coefficients γαβ are time-independent and exhaustively
+encapsulate information about the physical
+parameters of the decoherence processes (and possibly
+dissipation processes).
+One can show that the matrix
+Γ ≡ (γαβ) formed by the coefficients γαβ is positive, i.e.,
+all its eigenvalues κµ are ≥ 0. Therefore, Eq. (29) can be
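+simplified by diagonalizing Γ, which results in the diagonal
+form [82, 87]
+
+\[ \frac{d}{dt}\rho_S(t) = -i\left[H'_S, \rho_S(t)\right] - \frac{1}{2}\sum_\mu \kappa_\mu \left\{ L_\mu^\dagger L_\mu \rho_S(t) + \rho_S(t) L_\mu^\dagger L_\mu - 2 L_\mu \rho_S(t) L_\mu^\dagger \right\}. \tag{30} \]
+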
+
+The Lindblad operators Lµ are linear combinations of the
+original operators Sα, with coefficients determined by the
+diagonalization of Γ. The Lindblad structure of a mas-
+ter equation can also be motivated from the requirement
+that it gives rise to the most general form of generators
+of quantum dynamical semigroups [20, 81, 82, 84, 87–89].
+It is possible to bring any Born–Markov master equation
+into Lindblad form by imposing the rotating-wave approximation.
+This assumption, ubiquitous in quantum
+optics, is justified whenever the timescale set by the typ-
+ical energy differences ℏ(ω − ω′) of the system Hamilto-
+nian is short in comparison with the relaxation timescale
+of the system. (See Sec. 3.3.1 of Ref. [20] for details.)
+Because the Sα are not necessarily Hermitian, the
+Lindblad operators do not always correspond to physical
+observables. But when they do, we can rewrite Eq. (30)
+in compact double-commutator form,
+
+\[ \frac{d}{dt}\rho_S(t) = -i\left[H'_S, \rho_S(t)\right] - \frac{1}{2}\sum_\mu \kappa_\mu \left[L_\mu, \left[L_\mu, \rho_S(t)\right]\right]. \tag{31} \]
+
+As an example, consider a situation in which the envi-
+ronment monitors the position of a system. With L = x
+and the “free”-particle Hamiltonian $H'_S = H_S = p^2/2m$,
+Eq. (31) becomes
+
+\[ \frac{d}{dt}\rho_S(t) = -\frac{i}{2m}\left[p^2, \rho_S(t)\right] - \frac{1}{2}\kappa \left[x, \left[x, \rho_S(t)\right]\right]. \tag{32} \]
+
+Expressing this master equation in the position represen-
+tation results in
+
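+\[ \frac{\partial \rho_S(x, x', t)}{\partial t} = -\frac{i}{2m}\left( \frac{\partial^2}{\partial x'^2} - \frac{\partial^2}{\partial x^2} \right)\rho_S(x, x', t) - \frac{1}{2}\kappa\, (x - x')^2 \rho_S(x, x', t). \tag{33} \]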
+
+This is the classic equation of motion for decoherence due
+to environmental scattering first derived in Ref. [17].
+Lindblad master equations provide an intuitive and
+simple way of representing the environmental monitoring
+of an open quantum system. Most of the real physics be-
+hind this monitoring process is hidden in the coefficients
+κµ appearing in Eq. (30). If the Lindblad operators are
+chosen to be dimensionless, the κµ can be directly in-
+terpreted as decoherence rates, since they have units of
+inverse time.
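+
+The interpretation of κ as a decoherence rate can be checked on the simplest case of Eq. (31). The sketch below assumes a single qubit with H′S = 0 and one dimensionless Lindblad operator L = σz; this choice, the midpoint integrator, and all function names are assumptions made for illustration. Evaluating the double commutator shows that the off-diagonal element obeys dρ01/dt = −2κρ01, so the numerical result is compared with ρ01(0)e^{−2κt}.

```python
import math

SIGMA_Z = [[1.0, 0.0], [0.0, -1.0]]


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def commutator(a, b):
    ab, ba = matmul(a, b), matmul(b, a)
    return [[ab[i][j] - ba[i][j] for j in range(2)] for i in range(2)]


def lindblad_rhs(rho, kappa):
    """Right-hand side of Eq. (31) for H'_S = 0 and a single L = sigma_z."""
    double = commutator(SIGMA_Z, commutator(SIGMA_Z, rho))
    return [[-0.5 * kappa * double[i][j] for j in range(2)] for i in range(2)]


def evolve(rho, kappa, t_final, steps=2000):
    """Integrate the master equation with the explicit midpoint method."""
    dt = t_final / steps
    for _ in range(steps):
        k1 = lindblad_rhs(rho, kappa)
        mid = [[rho[i][j] + 0.5 * dt * k1[i][j] for j in range(2)]
               for i in range(2)]
        k2 = lindblad_rhs(mid, kappa)
        rho = [[rho[i][j] + dt * k2[i][j] for j in range(2)] for i in range(2)]
    return rho


# Equal superposition of the two sigma_z eigenstates; kappa and t are arbitrary.
rho0 = [[0.5, 0.5], [0.5, 0.5]]
kappa, t = 1.0, 1.0
rho_t = evolve(rho0, kappa, t)
print("numerical  rho_01(t):", rho_t[0][1])
print("analytical rho_01(t):", 0.5 * math.exp(-2.0 * kappa * t))
```

+The populations stay fixed while the coherence decays exponentially, as expected for eigenstates of L acting as pointer states.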
+Equation (31) shows that the decoherence term van-
+ishes if
+
+\[ \left[L_\mu, \rho_S(t)\right] = 0 \quad \text{for all } \mu, t. \tag{34} \]
+
+In this case, ρS(t) evolves unitarily. Since the Lµ are lin-
+ear combinations of the Sα, Eq. (34) typically means that
+[Sα, ρS(t)] = 0 for all α, t. This implies that simultane-
+ous eigenstates of all Sα will be immune to decoherence,
+which is precisely the pointer-state criterion of Eq. (19).
+In quantum-jump and quantum-trajectory approaches,
+the evolution of the reduced density matrix is conditioned
+on an explicitly observed sequence of measurement re-
+sults in the environment. This allows for the (formal)
+description of a single realization of the system evolving
+stochastically, conditioned on a particular measurement
+record. The dynamics are then described by a master
+equation of the Lindblad type, Eq. (31), for the reduced
+density matrix $\rho_S^C$ conditioned on the measurement
+records of the Lindblad operators Lµ,
+
+\[ d\rho_S^C = -i\left[H_S, \rho_S^C\right] dt - \frac{1}{2}\sum_\mu \kappa_\mu \left[L_\mu, \left[L_\mu, \rho_S^C\right]\right] dt + \sum_\mu \sqrt{\kappa_\mu}\, W[L_\mu]\rho_S^C\, dW_\mu. \tag{35} \]
+
+Here, $W[L]\rho \equiv L\rho + \rho L^\dagger - \rho\, \mathrm{Tr}\left\{ L\rho + \rho L^\dagger \right\}$, and the $dW_\mu$
+denote so-called Wiener increments. Equation (35) corresponds
+to a diffusive unraveling of the Lindblad equation
+into individual quantum trajectories, which can then be
+expressed by means of a stochastic Schrödinger equation
+[90–102].
+
+C. Non-Markovian decoherence
+
+The derivation of the Born–Markov master equation
+assumes that the coupling between system and environ-
+ment is weak and memory effects of the environment can
+be neglected.
+These conditions, however, are not met
+in certain situations of physical interest.
+An example
+would be a superconducting qubit strongly coupled to a
+low-temperature environment of other two-level systems
+[103, 104]. Also, a recent experiment [105] has measured
+strongly non-Ohmic spectral densities for the environ-
+ment of a quantum nanomechanical system; such densi-
+ties lead to non-Markovian evolution.
+In many cases, pronounced memory effects in the envi-
+ronment will cause strong dependencies of the evolution
+of the reduced density operator on the past history of the
+system–environment composite and therefore make it im-
+possible to describe the reduced dynamics by a differen-
+tial equation that is local in time. Surprisingly, however,
+one can show that even non-Markovian dynamics some-
+times can still be described by a time-local differential
+equation of the form
+
+\[ \frac{d}{dt}\rho_S(t) = K(t)\rho_S(t), \tag{36} \]
+
+where the superoperator K(t) depends only on t.
+For
+example, a non-Markovian master equation for quantum
+Brownian motion (see Sec. IV B) can be obtained through
+a formal modification of the Born–Markov master equa-
+tion [4, 5]. In general, it is often possible to arrive at
+non-Markovian but time-local master equations via the
+so-called time-convolutionless projection operator tech-
+nique [106–109].
+
+IV. DECOHERENCE MODELS
+
+Many physical systems can be represented either by
+a qubit if the state space of the system is discrete and
+effectively two-dimensional, or by a particle described by
+continuous phase-space coordinates. Needless to say, in
+the case of quantum information processing the qubit
+representation is of particular relevance.
+Similarly, a wide range of environments can be modeled
+as a collection of quantum harmonic oscillators or qubits.
+Harmonic-oscillator environments are of great generality.
+At low energies, many systems interacting with an en-
+vironment can effectively be represented by one or two
+coordinates of the system linearly coupled to an environ-
+ment of harmonic oscillators; indeed, sufficiently weak in-
+teractions with an arbitrary environment can be mapped
+onto a system linearly coupled to a harmonic-oscillator
+environment [110, 111].
+Environments represented by qubits are often the appropriate
+model in the low-temperature regime, where decoherence
+is typically dominated by interactions with localized
+modes, such as paramagnetic spins, paramagnetic
+electronic impurities, tunneling charges, defects, and nu-
+clear spins [103, 104, 112]. Each of the localized modes is
+represented by a finite-dimensional Hilbert space with a
+finite energy cutoff. We can therefore model these modes
+as a set of discrete states. Typically, only two such states
+are relevant, and thus the localized modes can be mapped
+onto an environment of qubits. Since qubits can be formally
+represented by spin-$\tfrac{1}{2}$ particles, such models are
+known as spin-environment models.
+In the following, we will discuss four important stan-
+dard models, namely, collisional decoherence (Sec. IV A),
+quantum Brownian motion (Sec. IV B), the spin–boson
+model (Sec. IV C), and the spin–spin model (Sec. IV D).
+For details on these and other decoherence models, in-
+cluding derivations of the relevant master equations, see
+Secs. 3 and 5 of Ref. [9].
+
+A. Collisional decoherence
+
+Collisional decoherence arises from the scattering of en-
+vironmental particles by a massive free quantum particle.
+Models of collisional decoherence were first studied in the
+classic paper by Joos and Zeh [17]. A more rigorous and
+general treatment was later developed by Hornberger and
+collaborators [31, 36–39] (see also [34, 35, 113]), which,
+among other refinements, remedied a flaw in Joos and
+Zeh’s original derivation that had resulted in decoher-
+ence rates that were too large by a factor of 2π [31].
+
+If we assume that the central particle is much more
+massive than the environmental particles such that its
+center-of-mass state is not disturbed by the scattering
+events (no recoil), the time evolution of the reduced den-
+sity matrix is given by [9, 17, 31, 34, 35]
+
+\[ \frac{\partial \rho_S(x, x', t)}{\partial t} = -F(x - x')\rho_S(x, x', t). \tag{37} \]
+
+This master equation describes pure spatial decoherence
+without dissipation. The decoherence factor F(x − x′)
+plays the role of a localization rate.
+It represents the
+characteristic decoherence rate at which spatial coher-
+ences between two positions x and x′ become locally sup-
+pressed and is given by
+
+\[ F(x - x') = \int_0^\infty dq\, \varrho(q) v(q) \int \frac{d\hat{n}\, d\hat{n}'}{4\pi} \left( 1 - e^{iq(\hat{n} - \hat{n}')\cdot(x - x')} \right) \left| f(q\hat{n}, q\hat{n}') \right|^2 . \tag{38} \]
+
+Here ϱ(q) denotes the number density of incoming par-
+ticles with magnitude of momentum equal to q = |q|, ˆn
+and ˆn′ are unit vectors (with dˆn and dˆn′ representing the
+associated solid-angle differentials), and v(q) denotes the
+speed of particles with momentum q. For the scattering
+of massive environmental particles we have v(q) = q/m,
+where m is each particle’s mass, while for the scatter-
+ing of photons and other massless particles v(q) is equal
+to the speed of light. The quantity |f(qˆn, qˆn′)|2 is the
+differential cross section for the scattering of an environ-
+mental particle from initial momentum q = qˆn to final
+momentum q′ = qˆn′.
+Whenever the mass of the central particle becomes
+comparable to the mass of the environmental particles (as
+in the case of air molecules scattered by small molecules
+and free electrons [114]), the no-recoil assumption does
+not hold and more general models for collisional deco-
+herence have to be considered [35, 36].
+The resulting
+dynamics include dissipation, as well as decoherence in
+both position and momentum.
+To further evaluate the decoherence factor F(x − x′),
+Eq. (38), we distinguish two important limiting cases. In
+the short-wavelength limit, the typical wavelength of the
+scattered environmental particles is much shorter than
+the coherent separation ∆x = |x − x′| between the well-
+localized wave packets in the spatial superposition state
+of the system.
+Then a single scattering event will be
+able to fully resolve this separation and thus carry away
+complete which-path information, leading to maximum
+spatial decoherence per scattering event. In this limit,
+F(x − x′) turns out to be simply equal to the total scat-
+tering rate Γtot [9]. This implies the existence of an upper
+limit for the decoherence rate when increasing the sepa-
+ration ∆x, in contrast with decoherence rates obtained
+from linear models [compare Eqs. (22) and (54)]. Equa-
+tion (37) then shows that spatial interference terms will
+become exponentially suppressed at a rate set by Γtot,
+
+\[ \rho_S(x, x', t) = \rho_S(x, x', 0)\, e^{-\Gamma_{\text{tot}} t}. \tag{39} \]
+
+TABLE I. Estimates of decoherence timescales (in seconds)
+for the suppression of spatial interferences over a distance ∆x
+equal to the size a of the object ($\Delta x = a = 10^{-3}$ cm for a
+dust grain and $\Delta x = a = 10^{-6}$ cm for a large molecule). See
+Ref. [9] for details.
+
+Environment                    Dust grain    Large molecule
+Cosmic background radiation    $1$           $10^{24}$
+Photons at room temperature    $10^{-18}$    $10^{6}$
+Best laboratory vacuum         $10^{-14}$    $10^{-2}$
+Air at normal pressure         $10^{-31}$    $10^{-19}$
+
+In the opposite long-wavelength limit, the environmen-
+tal wavelengths are much larger than the coherent sep-
+aration ∆x = |x − x′|, which implies that an individual
+scattering event will reveal only incomplete which-path
+information. For this case, one can show that spatial co-
+herences become exponentially suppressed at a rate that
+depends on the square of the separation ∆x [9],
+
+\[ \rho_S(x, x', t) = \rho_S(x, x', 0)\, e^{-\Lambda (\Delta x)^2 t}, \tag{40} \]
+
+where Λ is a scattering constant that encapsulates the
+physical details of the interaction.
+Thus, the quantity $\Lambda(\Delta x)^2$ plays the role of a decoherence rate.
+The dependence of this rate on ∆x is reasonable: if the environmental
+wavelengths are much larger than ∆x, it will
+require a large number of scattering events to encode
+an appreciable amount of which-path information in the
+environment, and this amount will increase, for a given
+number of scattering events, as ∆x becomes larger. Note
+that if ∆x is increased beyond the typical wavelength of
+the environment, the short-wavelength limit needs to be
+considered instead, for which the decoherence rate is in-
+dependent of ∆x and attains its maximum possible value.
+Numerical values of collisional decoherence rates ob-
+tained from Eqs. (39) and (40), with the physically rele-
+vant scattering parameters Γtot and Λ appropriately eval-
+uated, have shown the extreme efficiency of collisions in
+suppressing spatial interferences; Table I shows a few
+classic order-of-magnitude estimates [8, 9, 17].
+Excel-
+lent agreement between theory and experiment has been
+demonstrated for the decoherence of fullerenes due to col-
+lisions with background gas molecules in a Talbot–Lau
+interferometer [31, 115–118] (see Sec. VI B and Fig. 2),
+and for the decoherence of sodium atoms in a Mach–
+Zehnder interferometer due to the scattering of photons
+[119] and gas molecules [120].
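+
+As a quick consistency check on Table I, the long-wavelength rate in
+Eq. (40) can be inverted to read off the scattering constant implied by
+a tabulated timescale. The sketch below assumes that the long-wavelength
+limit applies to room-temperature photons scattering off a large
+molecule (thermal wavelengths far exceed 10^{-6} cm) and uses only the
+order-of-magnitude entry from Table I; the symbol τ_Δx for the
+decoherence time is introduced here purely for illustration.
+
+```latex
+\begin{align*}
+\tau_{\Delta x} &\equiv \frac{1}{\Lambda (\Delta x)^2}
+  && \text{(decoherence time implied by Eq.~(40))} \\
+\Lambda &= \frac{1}{\tau_{\Delta x} (\Delta x)^2}
+  = \frac{1}{(10^{6}\,\mathrm{s}) (10^{-6}\,\mathrm{cm})^2}
+  = 10^{6}\,\mathrm{cm^{-2}\,s^{-1}}
+  && \text{(large molecule, photons at room temperature)}
+\end{align*}
+```
+
+Because the rate grows as (∆x)^2 in this regime, doubling the separation
+shortens the decoherence time by a factor of four, until ∆x reaches the
+environmental wavelength and the rate saturates at Γtot.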
+
+B. Quantum Brownian motion
+
+A classic and extensively studied model of decoherence
+and dissipation is the one-dimensional motion of a par-
+ticle weakly coupled to a thermal bath of noninteracting
+harmonic oscillators, a model known as quantum Brown-
+ian motion. The self-Hamiltonian HE of the environment
+
+is given by
+
+\[ H_E = \sum_i \left( \frac{1}{2 m_i} p_i^2 + \frac{1}{2} m_i \omega_i^2 q_i^2 \right), \tag{41} \]
+
+where mi and ωi denote the mass and natural frequency
+of the ith oscillator, and qi and pi are the canonical posi-
+tion and momentum operators. The interaction Hamilto-
+nian Hint describes the bilinear coupling of the system’s
+position coordinate x to the positions qi of the environ-
+mental oscillators, $H_{\mathrm{int}} = x \otimes \sum_i c_i q_i$, where the ci de-
+note coupling strengths. This interaction represents the
+continuous environmental monitoring of the position co-
+ordinate of the system.
+The Born–Markov master equation describing the evo-
+lution of the density matrix ρS(t) of the system is given
+by [9, 45]
+
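+\begin{align}
+\frac{d}{dt}\rho_S(t) = {} & -i\big[H_S, \rho_S(t)\big] - \int_0^\infty d\tau\, \Big\{ \nu(\tau) \big[x, [x(-\tau), \rho_S(t)]\big] \nonumber \\
+& - i\eta(\tau) \big[x, \{x(-\tau), \rho_S(t)\}\big] \Big\}. \tag{42}
+\end{align}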
+
+Here, x(τ) denotes the system’s position operator in the
+interaction picture, $x(\tau) = e^{iH_S\tau}\, x\, e^{-iH_S\tau}$.
+The curly
+brackets { · , · } in the second line denote the anticom-
+mutator {A, B} ≡ AB + BA. The functions
+
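+\begin{align}
+\nu(\tau) &= \int_0^\infty d\omega\, J(\omega) \coth\!\left( \frac{\omega}{2kT} \right) \cos(\omega\tau), \tag{43} \\
+\eta(\tau) &= \int_0^\infty d\omega\, J(\omega) \sin(\omega\tau), \tag{44}
+\end{align}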
+
+are known as the noise kernel and dissipation kernel, re-
+spectively. The function J(ω), called the spectral density
+of the environment, is given by
+
+\[ J(\omega) \equiv \sum_i \frac{c_i^2}{2 m_i \omega_i}\, \delta(\omega - \omega_i). \tag{45} \]
+
+In general, spectral densities encapsulate the physi-
+cal properties of the environment.
+One frequently re-
+places the collection of individual environmental oscilla-
+tors by an (often phenomenologically motivated) contin-
+uous function J(ω) of the environmental frequencies ω.
+If we specialize to the important case of the system rep-
+resented by a harmonic oscillator with self-Hamiltonian
+
+\[ H_S = \frac{1}{2M} p^2 + \frac{1}{2} M \Omega^2 x^2, \tag{46} \]
+
+the resulting Born–Markov master equation is
+
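+\begin{align}
+\frac{d}{dt}\rho_S(t) = {} & -i\big[H_S + \tfrac{1}{2} M \tilde{\Omega}^2 x^2, \rho_S(t)\big] - i\gamma \big[x, \{p, \rho_S(t)\}\big] \nonumber \\
+& - D \big[x, [x, \rho_S(t)]\big] - f \big[x, [p, \rho_S(t)]\big]. \tag{47}
+\end{align}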
+
+
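+The coefficients $\tilde{\Omega}^2$, γ, D, and f are defined as
+
+\begin{align}
+\tilde{\Omega}^2 &\equiv -\frac{2}{M} \int_0^\infty d\tau\, \eta(\tau) \cos(\Omega\tau), \tag{48a} \\
+\gamma &\equiv \frac{1}{M\Omega} \int_0^\infty d\tau\, \eta(\tau) \sin(\Omega\tau), \tag{48b} \\
+D &\equiv \int_0^\infty d\tau\, \nu(\tau) \cos(\Omega\tau), \tag{48c} \\
+f &\equiv -\frac{1}{M\Omega} \int_0^\infty d\tau\, \nu(\tau) \sin(\Omega\tau). \tag{48d}
+\end{align}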
+
+The first term on the right-hand side of Eq. (47) repre-
+sents the unitary dynamics of a harmonic oscillator whose
+natural frequency is shifted by $\tilde{\Omega}$. The second term de-
+scribes momentum damping (dissipation) at a rate pro-
+portional to γ, which depends only on the spectral den-
+sity but not the temperature of the environment. The
+third term is of the Lindblad double-commutator form
+[see Eq. (31)] and describes decoherence of spatial coher-
+ences over a distance ∆X at a rate $D(\Delta X)^2$. Note that
+D depends on both the spectral density J(ω) and the
+temperature T of the environment. The fourth term also
+represents decoherence, but its influence on the dynam-
+ics of the system is usually negligible, especially at higher
+temperatures. In the long-time limit γt ≫ 1, the master
+equation (47) describes dispersion in position space given
+by
+
+\[ \Delta X^2(t) = \frac{D}{2 m^2 \gamma^2}\, t. \tag{49} \]
+
+That is, the width ∆X(t) of the ensemble in position
+space asymptotically scales as $\Delta X(t) \propto \sqrt{t}$, just as in
+classical Brownian motion; hence the term “quantum
+Brownian motion.”
+Figure 1 shows the time evolution of position-space and
+momentum-space superpositions of two Gaussian wave
+packets in the Wigner picture, as described by Eq. (47)
+[28]. Interference between the two wave packets is rep-
+resented by oscillations between the direct peaks. The
+interaction with the environment damps these oscilla-
+tions.
+The damping occurs on different timescales for
+the two initial conditions. While the momentum coordi-
+nate is not directly monitored by the environment, the
+intrinsic dynamics, through their creation of spatial su-
+perpositions from superpositions of momentum, result
+in decoherence in momentum space.
+This interplay of
+environmental monitoring and intrinsic dynamics leads
+to the emergence of pointer states that are minimum-
+uncertainty Gaussians (coherent states) well-localized in
+both position and momentum, thus approximating clas-
+sical points in phase space [5, 8, 16, 28, 44, 47, 48].
+Let us consider the important case of an ohmic spectral
+density J(ω) ∝ ω with a high-frequency cutoff Λ,
+
+\[ J(\omega) = \frac{2 M \gamma_0}{\pi}\, \omega\, \frac{\Lambda^2}{\Lambda^2 + \omega^2}. \tag{50} \]
+
+In the limit of a high-temperature environment (kT ≫ Ω
+and kT ≫ Λ), we arrive at the Caldeira–Leggett master
+
+FIG. 1. [Figure not reproduced: Wigner-function plots over position x
+and momentum p.] Evolution of superpositions of Gaussian wave packets
+in quantum Brownian motion as studied in Ref. [28], visual-
+ized in the Wigner representation. Time increases from top
+to bottom. In the left column, the initial wave packets are
+separated in position; in the right column, the separation is
+in momentum.
+
+equation [121],
+
+\begin{align}
+\frac{d}{dt}\rho_S(t) = {} & -i\big[H'_S, \rho_S(t)\big] - i\gamma_0 \big[x, \{p, \rho_S(t)\}\big] \nonumber \\
+& - 2 M \gamma_0 kT \big[x, [x, \rho_S(t)]\big], \tag{51}
+\end{align}
+
+where
+
+\[ H'_S = H_S + \tfrac{1}{2} M \tilde{\Omega}^2 x^2 = \frac{1}{2M} p^2 + \frac{1}{2} M \left( \Omega^2 - 2\gamma_0 \Lambda \right) x^2 \tag{52} \]
+is the frequency-shifted Hamiltonian $H'_S$ of the system.
+This equation has been widely and successfully used to
+model decoherence and dissipation processes, even in
+cases where the assumptions were not strictly fulfilled
+(for example, in quantum-optical settings, where often
+kT ≲ Λ [122]).
+In the position representation, the final term on the
+right-hand side of Eq. (51) can be written as
+
+\[ -\gamma_0 \left( \frac{x - x'}{\lambda_{\mathrm{dB}}} \right)^2 \rho_S(x, x', t), \tag{53} \]
+
+where $\lambda_{\mathrm{dB}} = (2MkT)^{-1/2}$ is the thermal de Broglie wave-
+length. This term describes spatial localization with a
+decoherence rate $\tau^{-1}_{|x-x'|}$ given by [18]
+
+\[ \tau^{-1}_{|x-x'|} = \gamma_0 \left( \frac{x - x'}{\lambda_{\mathrm{dB}}} \right)^2. \tag{54} \]
+
+This is Eq. (22), and as discussed there, given that λdB is
+extremely small for macroscopic and even mesoscopic ob-
+jects, we see that superpositions of macroscopically sepa-
+rated center-of-mass positions will typically be decohered
+on timescales many orders of magnitude shorter than the
+dissipation (relaxation) timescale $\gamma_0^{-1}$. Over timescales
+on the order of the decoherence time, we may therefore
+often neglect the dissipative term in Eq. (51), leading to
+the pure-decoherence master equation
+
+\[ \frac{d}{dt}\rho_S(t) = -i\big[H'_S, \rho_S(t)\big] - 2 M \gamma_0 kT \big[x, [x, \rho_S(t)]\big]. \tag{55} \]
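+
+To see how large the separation of timescales in Eq. (54) can become,
+here is a rough worked estimate. The object (mass M = 1 g, temperature
+T = 300 K, separation ∆x = 1 cm) is an assumed illustrative example,
+not taken from the text, and ħ is restored explicitly, so that
+λdB = ħ/(2MkT)^{1/2}.
+
+```latex
+\begin{align*}
+\lambda_{\mathrm{dB}} &= \frac{\hbar}{\sqrt{2 M k T}}
+ = \frac{1.05 \times 10^{-34}\,\mathrm{J\,s}}
+        {\sqrt{2\,(10^{-3}\,\mathrm{kg})\,(4.14 \times 10^{-21}\,\mathrm{J})}}
+ \approx 3.7 \times 10^{-23}\,\mathrm{m}, \\
+\frac{\tau^{-1}_{|x-x'|}}{\gamma_0}
+ &= \left( \frac{\Delta x}{\lambda_{\mathrm{dB}}} \right)^2
+ \approx \left( \frac{10^{-2}\,\mathrm{m}}{3.7 \times 10^{-23}\,\mathrm{m}} \right)^2
+ \approx 7 \times 10^{40}.
+\end{align*}
+```
+
+Under these assumptions the coherence between the two positions is lost
+roughly 40 orders of magnitude faster than the motion is damped, which
+is why dropping the dissipative term in Eq. (51) is often a good
+approximation on decoherence timescales.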
+
+C. Spin–boson models
+
+In the spin–boson model, a qubit interacts with an
+environment of harmonic oscillators. The seminal review
+paper by Leggett et al. [29] discusses the dynamics of the
+spin–boson model in great detail.
+Let us first consider a simplified spin–boson model
+where the self-Hamiltonian of the system is taken to be
+$H_S = \tfrac{1}{2}\omega_0\sigma_z$, with eigenstates |0⟩ and |1⟩. In contrast
+with the more general case discussed below, this Hamilto-
+nian does not include a tunneling term $-\tfrac{1}{2}\Delta_0\sigma_x$, and thus
+HS does not generate any nontrivial intrinsic dynamics.
+We employ the familiar self-Hamiltonian, Eq. (41), for
+an environment of harmonic oscillators, and choose the
+bilinear interaction Hamiltonian $H_{\mathrm{int}} = \sigma_z \otimes \sum_i c_i q_i$. Using
+the raising and lowering operators a† and a, we can
+recast the total Hamiltonian as
+
+\[ H = \frac{1}{2}\omega_0\sigma_z + \sum_i \omega_i a_i^\dagger a_i + \sigma_z \otimes \sum_i \left( g_i a_i^\dagger + g_i^* a_i \right). \tag{56} \]
+
+Note that since $[H, \sigma_z] = 0$, no transitions between |0⟩
+and |1⟩ can be induced by H. There is no energy ex-
+change between the system and the environment, and we
+therefore deal with a model of decoherence without dis-
+sipation. Such a model is a good representation of rapid
+decoherence processes during which the amount of dissi-
+pation is negligible, as is often the case in physical appli-
+cations. The resulting evolution can be solved exactly [9].
+For an ohmic spectral density with a high-frequency cut-
+off, it is found that superpositions of the form α |0⟩+β |1⟩
+are exponentially decohered on a timescale set by the
+thermal correlation time (kT)−1 of the environment.
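+
+The exact solution mentioned above can be summarized in a short sketch.
+For the Hamiltonian (56) with an initially uncorrelated thermal bath,
+the populations stay constant and only the off-diagonal element of ρS
+decays. The overall numerical prefactor c and the exponential form of
+the cutoff used below depend on how the couplings gi are normalized
+into J(ω); both are assumptions made here for illustration (see Ref. [9]
+for the precise expressions).
+
+```latex
+\begin{align*}
+\rho_{01}(t) &= \rho_{01}(0)\, e^{-i\omega_0 t}\, e^{-\Gamma(t)}, \\
+\Gamma(t) &= c \int_0^\infty d\omega\, J(\omega)
+  \coth\!\left( \frac{\omega}{2kT} \right)
+  \frac{1 - \cos(\omega t)}{\omega^2}, \\
+J(\omega) &= \alpha\, \omega\, e^{-\omega/\Lambda}
+  \quad \text{(ohmic, assumed cutoff form)}, \\
+\Gamma(t) &\approx c\, \pi \alpha\, kT\, t
+  \quad \text{for } kT \gg \Lambda \text{ and } t \gg \Lambda^{-1}.
+\end{align*}
+```
+
+The linear growth of Γ(t) corresponds to exponential decay of the
+coherence at a rate proportional to αkT, consistent with a decoherence
+timescale set by (kT)^{-1}.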
+Inclusion of a tunneling term $-\tfrac{1}{2}\Delta_0\sigma_x$ yields the gen-
+eral spin–boson model defined by the Hamiltonian
+
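+\begin{align}
+H = {} & \frac{1}{2}\omega_0\sigma_z - \frac{1}{2}\Delta_0\sigma_x + \sum_i \left( \frac{1}{2 m_i} p_i^2 + \frac{1}{2} m_i \omega_i^2 q_i^2 \right) \nonumber \\
+& + \sigma_z \otimes \sum_i c_i q_i. \tag{57}
+\end{align}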
+
+The rich non-Markovian dynamics of this model have
+been analyzed in Refs. [29, 123]. The particular dynamics
+strongly depend on the various parameters, such as the
+temperature of the environment, the form of the spec-
+tral density (subohmic, ohmic, or supraohmic), and the
+system–environment coupling strength. For each param-
+eter regime, a characteristic dynamical behavior emerges:
+localization, exponential or incoherent relaxation, expo-
+nential decay, and strongly or weakly damped coherent
+oscillations [29].
+In the weak-coupling limit, one can derive the Born–
+Markov master equation in much the same way as in
+the case of quantum Brownian motion (note the similar
+structure of the Hamiltonians). The result is (see Ref. [9]
+for details)
+
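+\begin{align}
+\frac{d}{dt}\rho_S(t) = {} & -i\left( H'_S \rho_S(t) - \rho_S(t) H'^{\dagger}_S \right) \nonumber \\
+& - \tilde{D}\, [\sigma_z, [\sigma_z, \rho_S(t)]] + \zeta \sigma_z \rho_S(t) \sigma_y + \zeta^* \sigma_y \rho_S(t) \sigma_z. \tag{58}
+\end{align}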
+
+The first term on the right-hand side of the master equa-
+tion (58) represents the evolution under the environment-
+shifted self-Hamiltonian $H'_S$, the second term corre-
+sponds to decoherence in the σz eigenbasis of the system
+at a rate given by $\tilde{D}$, and the last two terms describe
+the decay of the two-level system. $H'_S$ is the renormal-
+ized (and in general non-Hermitian) Hamiltonian of the
+system. The coefficients $\zeta^*$, $\tilde{D}$, $\tilde{f}$, and $\tilde{\gamma}$ are given by
+
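+\begin{align}
+\zeta^* &= \tilde{f} - i\tilde{\gamma}, \tag{59a} \\
+\tilde{D} &= \int_0^\infty d\tau\, \nu(\tau) \cos(\Delta_0 \tau), \tag{59b} \\
+\tilde{f} &= \int_0^\infty d\tau\, \nu(\tau) \sin(\Delta_0 \tau), \tag{59c} \\
+\tilde{\gamma} &= \int_0^\infty d\tau\, \eta(\tau) \sin(\Delta_0 \tau), \tag{59d}
+\end{align}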
+
+with the noise and the dissipation kernels ν(τ) and η(τ)
+taking the same form as in quantum Brownian motion
+[see Eqs. (43) and (44)].
+
+D. Spin-environment models
+
+A qubit linearly coupled to a collection of other
+qubits—known also as a spin–spin model—is often a good
+model of a single two-level system, such as a supercon-
+ducting qubit, strongly coupled to a low-temperature en-
+vironment [103, 104].
+The model of a harmonic oscil-
+lator interacting with a spin environment may be rele-
+vant to the description of decoherence and dissipation in
+quantum-nanomechanical systems and micron-scale ion
+traps [124]. For details on the theory of spin-environment
+models, see Refs. [104, 125–127].
+A simple version of a spin–spin model is described by
+the total Hamiltonian
+
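+\begin{align}
+H = H_S + H_{\mathrm{int}} &= -\frac{1}{2}\Delta_0\sigma_x + \frac{1}{2}\sigma_z \otimes \sum_{i=1}^{N} g_i \sigma_z^{(i)} \nonumber \\
+&\equiv -\frac{1}{2}\Delta_0\sigma_x + \frac{1}{2}\sigma_z \otimes E. \tag{60}
+\end{align}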
+
+Here, HS represents the intrinsic dynamics given by a
+tunneling term, while Hint describes the environmental
+monitoring of the observable σz.
+The model can be solved exactly [128, 129], and
+the resulting dynamics illustrate the dependence of the
+preferred basis on the relative strengths of the self-
+Hamiltonian of the system and the interaction Hamil-
+tonian.
+The preferred basis emerges as the local ba-
+sis that is most robust under the total Hamiltonian.
+When the interaction Hamiltonian dominates over the
+self-Hamiltonian, the pointer states are found to be eigen-
+states of the interaction Hamiltonian, in agreement with
+the commutativity criterion, Eq. (18). Conversely, when
+the modes of the environment are slow and the self-
+Hamiltonian dominates the evolution of the system (the
+quantum limit of decoherence [42]), the pointer states are
+the eigenstates of the Hamiltonian of the system.
+In the weak-coupling limit, spin environments can be
+mapped onto oscillator environments [110, 130]. Specifi-
+cally, the reduced dynamics of a system weakly coupled
+to a spin environment can be described by the system
+coupled to an equivalent oscillator environment described
+by an explicitly temperature-dependent spectral density
+of the form
+
+\[ J_{\mathrm{eff}}(\omega, T) \equiv J(\omega) \tanh\!\left( \frac{\omega}{2kT} \right), \tag{61} \]
+
+where J(ω) is the original spectral density of the spin
+environment. (See Sec. 5.4.2 of Ref. [9] for details and
+examples.)
+
+V. QUBIT DECOHERENCE, QUANTUM ERROR CORRECTION, AND ERROR AVOIDANCE
+
+Quantum computation and quantum information pro-
+cessing rely on coherent superpositions of mesoscopically
+or macroscopically distinct states that are highly suscep-
+tible to decoherence. Avoiding, controlling, and mitigat-
+ing decoherence is therefore of paramount importance.
+While the qubits need to be protected from detrimental
+environmental interactions, we also need to be able to
+control and measure them via a macroscopic apparatus.
+The formidable challenge of designing a quantum com-
+puter consists of meeting both demands in a balanced
+way. Even so, decoherence induced by interactions with
+the environment and the control apparatus, as well as
+noise due to faulty gate operations, will likely be too
+strong to allow for useful quantum computations to be
+carried out [74, 131]. What is also needed is an active
+
+mitigation of the effects of decoherence through active
+quantum error correction [132–136].
+We may distinguish two limiting cases for modeling
+decoherence in qubits. The first limit is that of indepen-
+dent qubit decoherence. Here, each qubit couples indepen-
+dently to its own environment, without any interactions
+between these environments. For example, this may be
+the case if the qubits are spatially well-separated (rela-
+tive to the typical coherence length of the environment)
+and only couple to their immediate surroundings. Then
+the error processes affecting the qubits will be completely
+uncorrelated. Thus, if the probability of a particular er-
+ror to affect one qubit is p, the probability of this error
+to occur in K qubits will be $p^K$. Many error-correcting
+schemes are only efficient in correcting such single-qubit
+errors, and thus the assumption of independent decoher-
+ence frequently underlies these schemes. This assump-
+tion, however, is unrealistic when the qubits are located
+spatially close to each other. In this case, all qubits ap-
+proximately feel the same environment, and it is likely
+that errors will become correlated among multiple qubits.
+The limiting case corresponding to this situation is that
+of collective qubit decoherence, in which all qubits couple
+to exactly the same environment.
+
+A. Correction of decoherence-induced quantum errors
+
+Consider a single qubit S, initially described by a pure
+state |ψ⟩ and interacting with an environment E. One
+can show that an arbitrary evolution of the combined
+qubit–environment state can always be written in the
+form
+
+\[ |\psi\rangle |e_0\rangle \longrightarrow I |\psi\rangle |e_I\rangle + \sum_{s=x,y,z} (\sigma_s |\psi\rangle) |e_s\rangle, \tag{62} \]
+
+where the Pauli operators σs act on the Hilbert space of
+S, and |eI⟩ and {|es⟩} are environmental states that are
+not necessarily orthogonal or normalized. Thus, any in-
+fluence of the environment on the qubit can be expressed
+simply in terms of a weighted sum of the Pauli operators
+and the identity operator acting on the original state of
+the qubit. The effects of σx and σz on the qubit state are
+often referred to as a bit-flip error and a phase-flip error, re-
+spectively. If we restrict our attention to environmental
+entanglement and the resulting decoherence effects, then
+only phase-flip errors need to be taken into account.
+For N qubits, Eq. (62) generalizes to
+
+\[ |\psi\rangle |e_0\rangle \longrightarrow \sum_i (E_i |\psi\rangle) |e_i\rangle. \tag{63} \]
+
+Here |ψ⟩ is the initial N-qubit state, and the error op-
+erators Ei are tensor products of N operators involv-
+ing identity and Pauli operators. Equation (63) repre-
+sents a worst-case scenario.
+In many cases, simplified
+versions can be used. One important case is that of par-
+tial decoherence. Here, only a small number K < N of
+qubits become entangled with the environment between
+two successive applications of an error-correcting mech-
+anism. Then it will be sufficient to restrict our attention
+to the $2^K$ possible error operators made up of at most K
+operators σz and N − K identity operators. In the case
+of independent qubit decoherence, we only need to con-
+sider a collection of independent phase-flip errors acting
+on single qubits, represented by error operators of the
+form E = I ⊗ · · · ⊗ I ⊗ σz ⊗ I ⊗ · · · ⊗ I.
+Given the entangled state on the right-hand side of
+Eq. (63), the goal of quantum error correction is to re-
+store the initial (unknown) state |ψ⟩. We let an ancilla,
+described by an initial state |a0⟩, interact with the qubit
+system such that
+
+\[ |a_0\rangle \left( \sum_i (E_i |\psi\rangle) |e_i\rangle \right) \longrightarrow \sum_i |a_i\rangle (E_i |\psi\rangle) |e_i\rangle. \tag{64} \]
+
+Let us assume that the ancilla states |ai⟩ are at least
+approximately mutually orthogonal, such that they can
+be distinguished by measurement. We now measure the
+observable $O_A = \sum_i a_i |a_i\rangle\langle a_i|$ on the ancilla, with $a_i \neq a_j$
+for $i \neq j$. The projective measurement will yield a
+particular outcome, say, ak, and lead to the reduction of
+the entangled state,
+
+\[ \sum_i |a_i\rangle (E_i |\psi\rangle) |e_i\rangle \longrightarrow |a_k\rangle (E_k |\psi\rangle) |e_k\rangle. \tag{65} \]
+
+The outcome ak of the measurement tells us the counter-
+transformation needed to restore the initial qubit state.
+Applying $E_k^{-1} = E_k^\dagger$ to the system gives
+
+\[ |a_k\rangle (E_k |\psi\rangle) |e_k\rangle \xrightarrow{\;E_k^{-1}\;} |a_k\rangle |\psi\rangle |e_k\rangle. \tag{66} \]
+
+Note that, as required in order to avoid introducing ad-
+ditional decoherence in the computational basis of the
+qubit system, we have obtained no information whatso-
+ever about the state of the system.
+This account of quantum error correction has been
+highly idealized.
+Let us mention three complications.
+First, it is impossible to design an interaction between
+the computational qubits and the ancilla that would al-
+low us to distinguish, by measuring the ancilla, between
+all possible errors. Second, in realistic settings the error
+operators Ei may be very complex, and it remains to be
+seen whether and how the corresponding countertrans-
+formations can be applied without introducing signifi-
+cant additional decoherence.
+Third, the ancilla qubits
+are physically similar to the computational qubits and
+can therefore be expected to be equally prone to en-
+vironmental interactions (and thus decoherence) as the
+computational qubits themselves. Since the inclusion of
+ancilla qubits increases the total number of qubits in the
+quantum computer, and since decoherence rates typically
+scale exponentially with the size of the system, it will re-
+quire sophisticated experimental designs to ensure not
+only that quantum error correction works in practice,
+but also that it does not aggravate the problem of qubit
+decoherence.
+
+B. Quantum computation on decoherence-free subspaces
+
+We introduced the concept of decoherence-free sub-
+spaces (DFS) [49–58], or pointer subspaces [3], in
+Sec. II C 2. DFS allow us to encode quantum informa-
+tion in “quiet corners” of the Hilbert space to protect
+it from environmental effects. In contrast with quantum
+error correction, DFS prevent errors from happening in
+the first place and thus represent a strategy for intrinsic
+error avoidance.
+The two limiting cases of independent qubit decoher-
+ence and collective qubit decoherence delineate the lim-
+its on the size of a DFS. To illustrate this relation-
+ship, let us consider the case of collective decoherence
+of an N-qubit system interacting with an oscillator bath
+[49, 51, 53, 56, 137].
+The interaction Hamiltonian for
+this generalized spin–boson model is taken to be [com-
+pare Eq. (56)]
+
+\[ H_{\mathrm{int}} = \sum_{i=1}^{N} \sigma_z^{(i)} \otimes \sum_j \left( g_{ij} a_j^\dagger + g_{ij}^* a_j \right) \equiv \sum_{i=1}^{N} \sigma_z^{(i)} \otimes E_i. \tag{67} \]
+The assumption of collective decoherence implies that
+the couplings gij (and thus the environment operators
+Ei) must be independent of the index i. Then Eq. (67)
+becomes
+
+\[ H_{\mathrm{int}} = \left( \sum_i \sigma_z^{(i)} \right) \otimes E \equiv S_z \otimes E. \tag{68} \]
+
+Recall that a DFS is spanned by a degenerate set of
+eigenstates of the system operators Sα of the interaction
+Hamiltonian [see Eq. (20)]. Thus, in our case the DFS
+will be spanned by degenerate eigenstates of the collec-
+tive spin operator Sz. Any N-qubit product state of the
+computational basis states |0⟩ and |1⟩ (the eigenstates of
+σz with eigenvalues +1 and −1, respectively) will be an
+eigenstate of Sz. There are N + 1 different possible inte-
+ger eigenvalues m, spaced in steps of 2 and ranging from m = −N (corresponding
+to the basis state |1 · · · 1⟩) to m = +N (corresponding
+to |0 · · · 0⟩). The largest number of mutually orthogonal
+computational-basis states with the same eigenvalue m
+of Sz is given by the set S0 of basis states with m = 0,
+i.e., those with N/2 qubits in the state |0⟩ (for even N). There are
+$n_0 = \binom{N}{N/2}$
+such states in this set, spanning a DFS of di-
+mension n0. For large values of N, we can approximate
+the binomial coefficient using Stirling’s formula,
+
+\[ \log_2 \binom{N}{N/2} \approx N - \frac{1}{2} \log_2(\pi N/2) \xrightarrow{\;N \gg 1\;} N. \tag{69} \]
+
+Therefore, in the limiting case of collective decoherence,
+the dimension of our DFS approaches the dimension of
+the original Hilbert space, and the encoding efficiency
+approaches unity. For example, for N = 4 qubits, the set
+
+\[ S_0 = \{ |0011\rangle, |0101\rangle, |0110\rangle, |1001\rangle, |1010\rangle, |1100\rangle \} \tag{70} \]
+
+
+spans a maximum-size DFS of dimension six, to be com-
+pared with the dimension of the original Hilbert space,
+which is $2^4 = 16$. Thus, given the model for collective de-
+coherence considered here, using four physical qubits we
+can encode up to two logical qubits in a DFS (since en-
+coding three logical qubits would already require a DFS
+of dimension $2^3 = 8$).
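+
+For concreteness, the N = 4 case can be checked against the Stirling
+estimate in Eq. (69). The arithmetic below is only an illustration; the
+symbol k for the number of logical qubits is introduced here and does
+not appear in the text.
+
+```latex
+\begin{align*}
+n_0 &= \binom{4}{2} = 6, \qquad \log_2 6 \approx 2.58, \\
+4 - \tfrac{1}{2} \log_2(2\pi) &\approx 4 - 1.33 = 2.67
+  \quad \text{(Eq.~(69) with } N = 4\text{)}, \\
+k &= \lfloor \log_2 n_0 \rfloor = 2
+  \quad \text{(logical qubits that fit into the DFS)}.
+\end{align*}
+```
+
+Even at N = 4 the Stirling estimate is within about 0.1 of the exact
+value of log2 n0, and the ratio (log2 n0)/N rises from roughly 0.65
+toward unity as N grows.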
+As mentioned in Sec. II C 2, the existence of a DFS
+corresponds to a dynamical symmetry. Our model rep-
+resents a case of perfect dynamical symmetry, since
+the system–environment interaction, Eq. (68), is com-
+pletely symmetric with respect to any permutations of
+the qubits, thereby leading to a DFS of maximum size.
+What happens if the symmetry is broken by additional
+small independent coupling terms? It has been shown
+[50, 138] that, to first order in the perturbation strength,
+the storage of quantum information in DFS is stable to
+such perturbations to all orders in time, but that the pro-
+cessing of such quantum information encoded in DFS is
+robust only to first order in time.
+In the case of purely independent qubit decoherence,
+the environment operators Ei appearing in Eq. (67) will
+now differ from one another. To find a DFS, we follow
+the usual strategy [see Eq. (20)] of determining a set of
+orthonormal basis states {|si⟩} such that
+
+\[ \left( I^{(1)} \otimes \cdots \otimes I^{(j-1)} \otimes \sigma_z^{(j)} \otimes I^{(j+1)} \otimes \cdots \otimes I^{(N)} \right) |s_i\rangle = \lambda^{(j)} |s_i\rangle \tag{71} \]
+
+for all i and 1 ≤ j ≤ N. The only state fulfilling this
+eigenvalue problem is |0 · · · 0⟩.
+Since we need at least
+a two-dimensional subspace to encode a single logical
+qubit, the case of independent decoherence in the spin–
+boson model does not allow for the existence of a DFS for
+quantum computation. In the language of pointer sub-
+spaces, there is only a single exact pointer state, and this
+environment-superselected preferred state of the system
+will be the ground state |0 · · · 0⟩.
+In realistic settings, neither the assumption of purely
+independent decoherence nor the limit of entirely collec-
+tive decoherence will be entirely appropriate. We can,
+however, use a DFS to protect the qubits from collective
+decoherence effects, and we can recover from single-qubit
+errors due to independent decoherence using active error-
+correction methods. These two approaches can be con-
+catenated [54] to enable universal fault-tolerant quantum
+computation even when the restriction to single-qubit er-
+rors is dropped [55, 139].
+
+C. Environment engineering and dynamical decoupling
+
+For reasonably large DFS to exist, the system–
+environment interaction must exhibit a sufficiently high
+degree of symmetry. Such symmetries are unlikely to
+arise naturally in typical experimental settings.
+
+One way of overcoming this limitation is based on envi-
+ronment engineering. Here, one tries to generate certain
+symmetries in the structure of the system–environment
+interactions. For example, an appropriately engineered
+symmetrization could make superposition states in Bose–
+Einstein condensates correspond to (approximate) de-
+generate eigenstates of the interaction Hamiltonian, in
+which case such states would lie within a DFS, thereby
+significantly enhancing their longevity [140].
+In ion
+traps, changing the parameters in the effective interac-
+tion Hamiltonian for the trapped ion allows one to se-
+lect different pointer subspaces and thereby control into
+which DFS the trapped ion is driven [77, 79, 141, 142].
+Another approach to the active creation of DFS is
+known as dynamical decoupling [143–148].
+Here time-
+dependent modifications are introduced into the Hamil-
+tonian of the system that counteract the influence of
+the environment. These modifications take the form of
+sequences of rapid projective measurements or strong
+control-field pulses acting on the system (“quantum
+bang-bang control” [143]).
+Even if the structure of
+the system–environment interaction Hamiltonian is not
+known, decoherence can be suppressed arbitrarily well
+in the limit of an infinitely fast rate of the decoupling
+control field, thus dynamically creating a DFS (which
+then represents a dynamically decoupled subspace). In
+the realistic case of a finite control rate, sufficient (albeit
+imperfect) protection from decoherence can be achieved
+via this decoupling technique, provided the control rate
+is larger than the fastest timescale set by the rate of for-
+mation of environmental entanglement.
+
+VI. EXPERIMENTAL STUDIES OF DECOHERENCE
+
+Decoherence, of course, happens all around us, and
+in this sense its consequences are readily observed. But
+what we would like to do is to be able to experimen-
+tally study the gradual and controlled action of deco-
+herence. In this endeavor, several obstacles have to be
+overcome. We need to prepare the system in a superpo-
+sition of mesoscopically or even macroscopically distin-
+guishable states with a sufficiently long decoherence time
+such that the gradual action of decoherence can be re-
+solved. We must be able to monitor decoherence without
+introducing a significant amount of additional, unwanted
+decoherence. We would also like to have sufficient con-
+trol over the environment so we can tune the strength
+and form of its interaction with the system.
+Starting
+in the mid-1990s, several such experiments have been
+performed, for example, using cavity QED [12], meso-
+scopic molecules [149], and superconducting systems such
+as SQUIDs and Cooper-pair boxes [13]. Bose–Einstein
+condensates [150] and quantum nanomechanical systems
+[151, 152] are promising candidates for future experimen-
+tal tests of decoherence.
+These experiments are important for several reasons.
+They are impressive demonstrations of the possibility
+of generating nonclassical quantum states in mesoscopic
+and macroscopic systems. They show that the quantum–
+classical boundary is smooth and can be shifted by vary-
+ing the relevant experimental parameters.
+They allow
+us to test and improve decoherence models, and they
+help us design devices for quantum information process-
+ing that are good at evading the detrimental influence
+of the environment. Finally, such experiments may be
+used to test quantum mechanics itself [13]. Such tests re-
+quire sufficient shielding of the system from decoherence
+so that an observed (full or partial) collapse of the wave-
+function could be unambiguously attributed to some novel
+nonunitary mechanism in nature, such as those proposed
+in dynamical reduction models [153–155]. This shielding,
+however, is difficult to implement in practice, because
+the large number of particles required for the reduction
+mechanism to become effective will also lead to strong
+decoherence [114, 156].
+The superpositions realized in
+current experiments are still not sufficiently macroscopic
+to rule out collapse theories, although it has been demon-
+strated [118] that matter-wave interferometry with large
+molecular clusters (in the mass range between $10^6$ and
+$10^8$ amu) would be able to test the collapse theories pro-
+posed in Refs. [154, 155]; such experiments may soon
+become technologically feasible [11].
+
+A. Atoms in a cavity
+
+In 1996 Brune et al. generated a superposition of ra-
+diation fields with classically distinguishable phases in-
+volving several photons [12, 150, 157]. This experiment
+was the first to realize a mesoscopic Schrödinger-cat state
+and allowed for the controlled observation and manipu-
+lation of its decoherence. A rubidium atom is prepared
+in a superposition of energy eigenstates |g⟩ and |e⟩ cor-
+responding to two circular Rydberg states.
+The atom
+enters a cavity C containing a radiation field of
+a few photons. If the atom is in the state |g⟩, the
+field remains unchanged, whereas if it is in the state
+|e⟩, the coherent state |α⟩ of the field undergoes a phase
+shift φ, $|\alpha\rangle \to |e^{i\phi}\alpha\rangle$; the experiment achieved φ ≈ π.
+An initial superposition of the atom is therefore am-
+plified into an entangled atom–field state of the form
+$\frac{1}{\sqrt{2}} \left( |g\rangle |\alpha\rangle + |e\rangle |{-\alpha}\rangle \right)$.
+The atom then passes through
+an additional cavity, further transforming the superposi-
+tion. Finally, the energy state of the atom is measured.
+This disentangles the atom and the field and leaves the
+latter in a superposition of the mesoscopically distinct
+states |α⟩ and |−α⟩.
+To monitor the decoherence of this superposition, a
+second rubidium atom is sent through the apparatus. Af-
+ter interacting with the field superposition state in the
+cavity C, the atom will always be found in the same en-
+ergy state as the first atom if the superposition has not
+been decohered. This correlation rapidly decays with in-
+creasing decoherence. Thus, by recording the measure-
+
+ment correlation as a function of the wait time τ between
+sending the first and second atom through the appara-
+tus, the decoherence of the field state can be monitored.
+Experimental results were in excellent agreement with
+theoretical predictions [158, 159]. It was found that de-
+coherence became faster as the phase shift φ and the
+mean number $\bar{n} = |\alpha|^2$ of photons in the cavity C were
+increased. Both results are expected, since an increase
+in φ and $\bar{n}$ means that the components in the superpo-
+sition become more distinguishable. Recent experiments
+have realized superposition states involving several tens
+of photons [160] and have monitored the gradual deco-
+herence of such states [161].
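The scaling described above (faster decoherence for larger φ and n̄) can be illustrated with a minimal sketch. It assumes the standard zero-temperature photon-loss estimate t_D ≈ 2 T_cav / D², where D = |α − e^{iφ}α| = 2 √n̄ |sin(φ/2)| is the phase-space separation of the two field components. This formula, the function name, and the cavity lifetime in the usage example are illustrative assumptions and are not taken from the text.

```python
import math


def cat_decoherence_time(n_bar, phi, t_cav):
    """Estimate the decoherence time of the superposition |a> + |e^{i phi} a>.

    Assumption (not from the source text): zero-temperature photon loss,
    short-time limit, so that t_D = 2 * t_cav / D**2 with
    D = 2 * sqrt(n_bar) * |sin(phi / 2)|.
    """
    # Squared phase-space distance between the two coherent components.
    d_squared = 4.0 * n_bar * math.sin(phi / 2.0) ** 2
    if d_squared == 0.0:
        # Identical components: there is no superposition to decohere.
        return math.inf
    return 2.0 * t_cav / d_squared


if __name__ == "__main__":
    t_cav = 160e-6  # hypothetical cavity damping time in seconds
    for n_bar in (1.0, 3.0, 10.0):
        t_d = cat_decoherence_time(n_bar, math.pi, t_cav)
        print(f"n_bar = {n_bar:4.1f}: t_D = {t_d * 1e6:.1f} microseconds")
```

In this model, doubling n̄ halves the estimated decoherence time, and for fixed n̄ the time is shortest at φ = π, where the two components are farthest apart.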
+
+B. Matter-wave interferometry
+
+In these experiments (see Ref. [11] for a review), spatial
+interference patterns are demonstrated for mesoscopic
+molecules ranging from fullerenes [162] to molecular clus-
+ters involving hundreds of atoms, with a total size of
+up to 60 Å and masses of several thousand amu (see
+Fig. 2) [163, 164].
+Since the de Broglie wavelength of
+such molecules is on the order of picometers, standard
+double-slit interferometry is out of reach. Instead, the
+experiments make use of the Talbot effect, an interfer-
+ence phenomenon in which a plane wave incident on a
+diffraction grating creates an image of the grating at mul-
+tiples of a distance L behind the grating. In the experi-
+ment, the molecular density (at a macroscopic distance L
+from the grating) is scanned along the direction perpen-
+dicular to the molecular beam.
+An oscillatory density
+pattern (corresponding to the image of the slits in the
+grating) is observed, confirming the existence of coher-
+ence and interference between the different paths of each
+individual molecule passing through the grating. Recent
+experiments have used an improved version of the origi-
+nal Talbot–Lau setup [165], as well as optical ionization
+gratings [166].
+Decoherence is measured as a decrease of the visibil-
+ity of this pattern (Fig. 2). The controlled decoherence
+due to collisions with background gas particles [115, 116]
+and due to emission of thermal radiation from heated
+molecules [168] has been observed, showing a smooth de-
+cay of visibility in agreement with theoretical predictions
+[31, 117, 167]. These successes have led to speculations
+that one could perform similar experiments using even
+larger particles such as proteins and viruses [115, 169]
+or carbonaceous aerosols [170]. Such experiments will be
+limited by collisional and thermal decoherence and by
+noise due to inertial forces and vibrations [115, 169, 170].
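As a rough check on the picometer-scale wavelengths mentioned above, the following sketch evaluates the de Broglie relation λ = h / (m v) for two of the molecules listed in the Fig. 2 caption, and models the pressure dependence of the fringe visibility as a simple exponential. The beam velocity of 100 m/s, the parameters `v0` and `p0`, and the function names are hypothetical choices made for illustration; they are not values from the text.

```python
import math

PLANCK_H = 6.62607015e-34    # Planck constant in J s (exact SI value)
AMU_KG = 1.66053906660e-27   # kilograms per atomic mass unit


def de_broglie_wavelength_m(mass_amu, velocity_m_s):
    """Return the de Broglie wavelength h / (m v) in meters."""
    if mass_amu <= 0 or velocity_m_s <= 0:
        raise ValueError("mass and velocity must be positive")
    return PLANCK_H / (mass_amu * AMU_KG * velocity_m_s)


def visibility(pressure, v0, p0):
    """Exponential decay of fringe visibility with background-gas pressure.

    v0 is the visibility at zero pressure and p0 the decay pressure; both
    are free parameters of this sketch.
    """
    return v0 * math.exp(-pressure / p0)


if __name__ == "__main__":
    velocity = 100.0  # hypothetical beam velocity in m/s
    for name, mass in (("C60", 720.0), ("PFNS10", 6910.0)):
        wavelength_pm = de_broglie_wavelength_m(mass, velocity) * 1e12
        print(f"{name}: lambda = {wavelength_pm:.2f} pm")
```

At the assumed velocity, both wavelengths come out at a few picometers or below, far smaller than the molecules themselves, which is why near-field (Talbot–Lau) schemes are used instead of a far-field double slit.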
+
+C. Superconducting systems
+
+Superconducting quantum interference devices (SQUIDs)
+and Cooper-pair boxes have important applications in
+quantum information processing. A
+
+
+[Figure 2 image. Left panel: gallery of molecules (a)–(f), reproduced from Ref. [163]. Right panel: visibility (%) versus pressure (in 10⁻⁶ mbar).]
+
+FIG. 2. Left: Molecular clusters used in recent interference
+experiments, drawn to scale (the scale bar represents 10 Å).
+Figure from Ref. [163].
+(a) Fullerene C60 (m = 720 amu,
+60 atoms). (b) Perfluoroalkylated nanosphere PFNS8 (m =
+5672 amu, 356 atoms).
+(c) PFNS10 (m = 6910 amu, 430
+atoms). (d) Tetraphenylporphyrin TPP (m = 614 amu, 78
+atoms).
+(e) TPPF84 (m = 2814 amu, 202 atoms).
+(f)
+TPPF152 (m = 5310 amu, 430 atoms). Right: Visibility of
+interference fringes of C70 fullerenes as a function of the pres-
+sure of the background gas. Measured values (circles) agree
+well with the theoretical prediction (solid line) [31, 117, 167]
+describing an exponential decay of visibility with pressure.
+Figure adapted from Ref. [115].
+
+SQUID consists of a ring of superconducting material
+interrupted by thin insulating barriers, the Josephson
+junctions (Fig. 3a).
+At sufficiently low temperatures,
+electrons of opposite spin condense into bosonic Cooper
+pairs.
+Quantum-mechanical tunneling of Cooper pairs
+through the junctions leads to the flow of a resistance-
+free supercurrent around the loop (Josephson effect),
+which creates a magnetic flux threading the loop. The
+collective center-of-mass motion of a macroscopic number
+(∼ 10⁹) of Cooper pairs can then be represented by
+a wave function labeled by a single macroscopic variable,
+namely, the total trapped flux Φ through the loop.
+The two possible directions of the supercurrent define
+a qubit with basis states {|⟳⟩ , |⟲⟩}.
+By adjusting an
+external magnetic field, the SQUID can be biased such
+
+[Figure 3 image. (a) SQUID schematic, labeled: Josephson junction, superconducting ring, supercurrent. (b) Probability for ⟳ versus delay time τ (ns).]
+
+FIG. 3. (a) Schematic illustration of a SQUID. A supercon-
+ducting ring is interrupted by Josephson junctions, leading
+to a dissipationless supercurrent.
+(b) Decoherence of a su-
+perposition of clockwise and counterclockwise supercurrents
+in a superconducting qubit. The damping of the oscillation
+amplitude corresponds to the gradual loss of coherence from
+the system. Figure adapted from Ref. [173].
+
+that the two lowest-lying energy eigenstates |0⟩ and |1⟩
+are equal-weight superpositions of the persistent-current
+states |⟳⟩ and |⟲⟩.
+Such superposition states involving µA currents were
+first experimentally observed in 2000 using spectroscopic
+measurements [171, 172]. Their decoherence was subse-
+quently measured using Ramsey interferometry [173], as
+follows. Two consecutive microwave pulses are applied to
+the system. During the delay time τ between the pulses,
+the system evolves freely. After application of the second
+pulse, the system is left in a superposition of |⟳⟩ and |⟲⟩,
+with the relative amplitudes exhibiting an oscillatory de-
+pendence on τ. A series of measurements in the basis
+{|⟳⟩ , |⟲⟩} over a range of delay times τ then allows one
+to trace out an oscillation of the occupation probabilities
+for |⟳⟩ and |⟲⟩ as a function of τ (Fig. 3b). The envelope
+of the oscillation is damped as a consequence of decoher-
+ence acting on the system during the free evolution of
+duration τ. From the decay of the envelope we can infer
+the decoherence timescale; the original experiment gave
+20 ns [173], while subsequent experiments have achieved
+decoherence times of several µs [174].
+Superposition states and their decoherence have also
+been observed in superconducting devices whose key vari-
+able is charge (or phase), instead of the flux variable used
+in SQUIDs. Such Cooper-pair boxes consist of a small
+superconducting island onto which Cooper pairs can tun-
+nel from a reservoir through a Josephson junction. Two
+different charge states of the island, differing by at least
+one Cooper pair, define the basis states. Coherent os-
+cillations between such charge states were first observed
+
+
+
+in 1999 [175]. In 2002, Vion et al. [176] reported thou-
+sands of coherent oscillations with a decoherence time
+of 0.5 µs. Similar results have been obtained for phase
+qubits [177, 178], demonstrating decoherence times of
+several µs.
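The step of inferring a decoherence timescale from the damped Ramsey envelope can be sketched as a log-linear least-squares fit. The sketch assumes a purely exponential envelope A(τ) = A₀ exp(−τ/T₂), which is a modeling assumption and not a statement from the text (measured envelopes can also be Gaussian or mixed). The function name and the synthetic data are hypothetical; only the 20 ns scale echoes the value quoted above.

```python
import math


def fit_envelope_decay_time(delays_ns, envelope):
    """Fit ln(A) = ln(A0) - tau / T2 by least squares and return T2 in ns.

    Assumes an exponential envelope with strictly positive amplitudes.
    """
    if len(delays_ns) != len(envelope) or len(delays_ns) < 2:
        raise ValueError("need at least two (delay, amplitude) pairs")
    logs = [math.log(a) for a in envelope]
    n = len(delays_ns)
    mean_t = sum(delays_ns) / n
    mean_y = sum(logs) / n
    # Ordinary least-squares slope of ln(A) against tau.
    sxx = sum((t - mean_t) ** 2 for t in delays_ns)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(delays_ns, logs))
    slope = sxy / sxx
    return -1.0 / slope


if __name__ == "__main__":
    # Synthetic, noise-free envelope generated with T2 = 20 ns.
    delays = [0.0, 5.0, 10.0, 20.0, 30.0]
    amplitudes = [0.3 * math.exp(-t / 20.0) for t in delays]
    print(f"fitted T2 = {fit_envelope_decay_time(delays, amplitudes):.2f} ns")
```

With real data the envelope amplitudes would first have to be extracted from the oscillating occupation probabilities, for example from the local maxima and minima of the Ramsey fringes.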
+
+VII. DECOHERENCE AND THE FOUNDATIONS OF QUANTUM MECHANICS
+
+Can decoherence address foundational problems? If so,
+which ones, and how? Addressing these subtle questions
+is beyond the scope of this review; a few brief remarks
+must suffice here.
+(See Refs. [6, 7, 9, 21] for in-depth
+discussions.) Decoherence, at its heart, is a technical re-
+sult concerning the dynamics and measurement statistics
+of open quantum systems. From this view, decoherence
+merely addresses a consistency problem, by explaining
+how and when the quantum probability distributions ap-
+proach the classically expected distributions. Since deco-
+herence follows directly from an application of the quan-
+tum formalism to interacting quantum systems, it is not
+tied to any particular interpretation of quantum mechan-
+ics, nor does it supply such an interpretation, nor does it
+amount to a theory that could make predictions beyond
+those of standard quantum mechanics.
+The predictively relevant part of decoherence theory
+relies on reduced density matrices, whose formalism and
+interpretation presume the collapse postulate and Born’s
+rule. If we understand the “quantum measurement problem”
+as the question of how to reconcile the linear, deterministic
+evolution described by the Schrödinger equation with the
+occurrence of random measurement outcomes, then
+decoherence has not solved this problem
+[6, 9].
+Decoherence does, however, address an aspect
+sometimes associated with the quantum measurement
+problem, namely the preferred-basis problem (at least
+in the sense described in Sec. II C). Further explorations
+of the role of the environment, such as in quantum Dar-
+winism (see Sec. II D), can help illuminate fundamental
+questions concerning information transfer and amplifica-
+tion in the quantum setting.
+Decoherence has been used to identify internal consistency
+issues in interpretations of quantum mechanics,
+and the picture associated with the decoherence process
+has sometimes been seen as suggestive of particular inter-
+pretations of quantum mechanics [6, 7]. Indeed, histori-
+cally decoherence theory arose in the context of Zeh’s [1]
+independent formulation of an Everett-style interpreta-
+tion (see Ref. [179] for a historical analysis). Ultimately,
+however, it seems that certain interpretations simply may
+be more in need of decoherence than others for defin-
+ing their structure; see Ref. [180] for the example of an
+Everett-style interpretation [23]. At the end of the day,
+any interpretation that does not involve entities, claims,
+or structures in contradiction with the predictions of de-
+coherence theory (which is to say, with the predictions of
+quantum mechanics) will arguably remain viable.
+
+[1] H.D. Zeh, Found. Phys. 1, 69 (1970)
+[2] W.H. Zurek, Phys. Rev. D 24, 1516 (1981)
+[3] W.H. Zurek, Phys. Rev. D 26, 1862 (1982).
+doi:10.1103/PhysRevD.26.1862
+
+[4] J.P. Paz, W.H. Zurek, in Coherent Atomic Matter
+Waves, Les Houches Session LXXII, Les Houches Summer
+School Series, vol. 72, ed. by R. Kaiser, C. Westbrook,
+F. David (Springer, Berlin, 2001), pp. 533–614
+
+[5] W.H. Zurek, Rev. Mod. Phys. 75, 715 (2003).
+doi:10.1103/RevModPhys.75.715
+
+[6] M. Schlosshauer, Rev. Mod. Phys. 76, 1267 (2004)
+[7] G. Bacciagaluppi, in The Stanford Encyclopedia of
+Philosophy, ed. by E.N. Zalta (2012). Online at
+http://plato.stanford.edu/archives/win2012/entries/qm-decoherence
+
+[8] E. Joos, H.D. Zeh, C. Kiefer, D. Giulini, J. Kupsch, I.O.
+Stamatescu, Decoherence and the Appearance of a Clas-
+sical World in Quantum Theory, 2nd edn. (Springer,
+New York, 2003)
+
+[9] M. Schlosshauer, Decoherence and the Quantum-to-Classical
+Transition (Springer, Berlin/Heidelberg, 2007)
+
+[10] M. Schlosshauer, Phys. Rep. 831, 1 (2019).
+doi:10.1016/j.physrep.2019.10.001
+
+[11] K. Hornberger, S. Gerlich, S. Nimmrichter, P. Haslinger,
+M. Arndt, Rev. Mod. Phys. 84, 157 (2012)
+
+[12] J.M. Raimond, M. Brune, S. Haroche, Rev. Mod. Phys.
+73, 565 (2001)
+
+[13] A.J. Leggett, J. Phys.:
+Condens. Matter 14, R415
+(2002)
+
+[14] E. Joos, in Decoherence: Theoretical, Experimental, and
+Conceptual Problems, ed. by P. Blanchard, D. Giulini,
+E. Joos, C. Kiefer, I.O. Stamatescu (Springer, Berlin,
+2000), pp. 1–17
+
+[15] M. Schlosshauer, K. Camilleri, AIP Conf. Proc. 1327,
+26 (2011)
+
+[16] O. Kübler, H.D. Zeh, Ann. Phys. (N.Y.) 76, 405 (1973)
+[17] E. Joos, H.D. Zeh, Z. Phys. B: Condens. Matter 59, 223
+(1985)
+
+[18] W.H. Zurek, in Frontiers of Nonequilibrium Statistical
+Mechanics, ed. by G.T. Moore, M.O. Scully (Plenum
+Press, New York, 1986), pp. 145–149. First published
+in 1984 as Los Alamos report LAUR 84-2750
+
+[19] K. Hornberger,
+in Entanglement and Decoherence:
+Foundations and Modern Trends,
+Lecture Notes in
+Physics, vol. 768, ed. by A. Buchleitner, C. Viviescas,
+M. Tiersch (Springer, Berlin, 2009), pp. 221–276
+
+[20] H.P. Breuer, F. Petruccione, The Theory of Open Quan-
+tum Systems (Oxford University Press, Oxford, 2002)
+
+[21] M. Schlosshauer, A. Fine, in Quantum Mechanics at the
+Crossroads: New Perspectives from History, Philosophy
+and Physics, ed. by J. Evans, A. Thorndike (Springer,
+Berlin, 2006), pp. 125–148
+
+[22] W.K. Wootters, W.H. Zurek, Phys. Rev. D 19, 473
+(1979)
+
+
+
+[23] H. Everett, Rev. Mod. Phys. 29, 454 (1957)
+[24] A. Einstein, B. Podolsky, N. Rosen, Phys. Rev. 47, 777
+(1935)
+
+[25] J.S. Bell, Physics 1, 195 (1964)
+[26] J.S. Bell, Rev. Mod. Phys. 38, 447 (1966)
+[27] A. Peres, Am. J. Phys. 46 (1978)
+[28] J.P. Paz, S. Habib, W.H. Zurek, Phys. Rev. D 47, 488
+(1993)
+
+[29] A.J. Leggett, S. Chakravarty, A.T. Dorsey, M.P.A.
+Fisher, A. Garg, Rev. Mod. Phys. 59, 1 (1987)
+
+[30] S.G. Mokarzel, A.N. Salgueiro, M.C. Nemes, Phys. Rev.
+A 65, 044101 (2002)
+
+[31] K. Hornberger, J.E. Sipe, Phys. Rev. A 68, 012105
+(2003)
+
+[32] K. Kraus, States, Effects, and Operations (Springer,
+Berlin, 1983)
+
+[33] W.H. Zurek, Phys. Today 44, 36 (1991). See also the
+updated version available as eprint quant-ph/0306072
+
+[34] M.R. Gallis, G.N. Fleming, Phys. Rev. A 42, 38 (1990)
+[35] L. Diósi, Europhys. Lett. 30, 63 (1995)
+[36] K. Hornberger, Phys. Rev. Lett. 97, 060601 (2006)
+[37] K. Hornberger, B. Vacchini, Phys. Rev. A 77, 022112
+(2008)
+
+[38] M. Busse, K. Hornberger, J. Phys. A: Math. Theor. 42,
+362001 (2009)
+
+[39] M. Busse, K. Hornberger, J. Phys. A: Math. Theor. 43,
+015303 (2010)
+
+[40] R.A. Harris, L. Stodolsky, J. Chem. Phys. 74, 2145
+(1981)
+
+[41] H.D. Zeh, in Decoherence:
+Theoretical, Experimen-
+tal, and Conceptual Problems, ed. by P. Blanchard,
+D. Giulini, E. Joos, C. Kiefer, I. Stamatescu, Lecture
+Notes in Physics No. 538 (Springer, Berlin, 2000), pp.
+19–42
+
+[42] J.P. Paz, W.H. Zurek, Phys. Rev. Lett. 82, 5181 (1999)
+[43] W.H. Zurek, S. Habib, J.P. Paz, Phys. Rev. Lett. 70,
+1187 (1993)
+
+[44] W.H. Zurek, Prog. Theor. Phys. 89, 281 (1993)
+[45] B.L. Hu, J.P. Paz, Y. Zhang, Phys. Rev. D 45, 2843
+(1992)
+
+[46] W.H. Zurek, Philos. Trans. R. Soc. London, Ser. A 356,
+1793 (1998)
+
+[47] L. Diósi, C. Kiefer, Phys. Rev. Lett. 85, 3552 (2000)
+[48] J. Eisert, Phys. Rev. Lett. 92, 210401 (2004)
+[49] G.M. Palma, K.A. Suominen, A.K. Ekert, Proc. R. Soc.
+Lond. A 452, 567 (1996)
+
+[50] D.A. Lidar, I.L. Chuang, K.B. Whaley, Phys. Rev. Lett.
+81, 2594 (1998)
+
+[51] P. Zanardi, M. Rasetti, Phys. Rev. Lett. 79, 3306 (1997)
+[52] P. Zanardi, M. Rasetti, Mod. Phys. Lett. B 11, 1085
+(1997)
+
+[53] P. Zanardi, Phys. Rev. A 57, 3276 (1998)
+[54] D.A. Lidar, D. Bacon, K.B. Whaley, Phys. Rev. Lett.
+82, 4556 (1999)
+
+[55] D. Bacon, J. Kempe, D.A. Lidar, K.B. Whaley, Phys.
+Rev. Lett. 85, 1758 (2000)
+
+[56] L.M. Duan, G.C. Guo, Phys. Rev. A 57, 737 (1998)
+[57] P. Zanardi, Phys. Rev. A 63, 012301 (2001)
+[58] E. Knill, R. Laflamme, L. Viola, Phys. Rev. Lett. 82,
+2525 (2000)
+
+[59] J. Kempe, D. Bacon, D.A. Lidar, K.B. Whaley, Phys.
+Rev. A 63, 042307 (2001)
+
+[60] D.A. Lidar, K.B. Whaley, in Irreversible Quantum Dy-
+namics, Springer Lecture Notes in Physics, vol. 622, ed.
+
+by F. Benatti, R. Floreanini (Springer, Berlin, 2003),
+pp. 83–120. Also available as eprint quant-ph/0301032
+
+[61] W.H. Zurek, in Science and Ultimate Reality, ed. by
+J.D. Barrow, P.C.W. Davies, C.H. Harper (Cambridge
+University Press, Cambridge, England, 2004), pp. 121–
+137
+
+[62] H. Ollivier, D. Poulin, W.H. Zurek, Phys. Rev. Lett. 93,
+220401 (2004)
+
+[63] H. Ollivier, D. Poulin, W.H. Zurek, Phys. Rev. A 72,
+042113 (2005)
+
+[64] R. Blume-Kohout, W.H. Zurek, Found. Phys. 35, 1857
+(2005)
+
+[65] R. Blume-Kohout, W.H. Zurek, Phys. Rev. A 73,
+062310 (2006)
+
+[66] W.H. Zurek, Nature Phys. 5, 181 (2009)
+[67] C.J. Riedel, W.H. Zurek, Phys. Rev. Lett. 105, 020404
+(2010)
+
+[68] C.J. Riedel, W.H. Zurek, New J. Phys. 13, 073038
+(2011)
+
+[69] C.J. Riedel, W.H. Zurek, M. Zwolak, New J. Phys. 14,
+083010 (2012)
+
+[70] M. Zwolak, C.J. Riedel, W.H. Zurek, Phys. Rev. Lett.
+112, 140406 (2014)
+
+[71] R. Blume-Kohout, W.H. Zurek, Phys. Rev. Lett. 101,
+240405 (2008)
+
+[72] H. Ollivier, W.H. Zurek, Phys. Rev. Lett. 88, 017901
+(2002)
+
+[73] S. Schneider, G.J. Milburn, Phys. Rev. A 57, 3748
+(1998)
+
+[74] C. Miquel, J.P. Paz, W.H. Zurek, Phys. Rev. Lett. 78,
+3971 (1997)
+
+[75] L.M.K. Vandersypen, I.L. Chuang, Rev. Mod. Phys. 76,
+1037 (2004)
+
+[76] J.M. Martinis, S. Nam, J. Aumentado, K.M. Lang,
+C. Urbina, Phys. Rev. B 67, 094510 (2003)
+
+[77] C.J. Myatt, B.E. King, Q.A. Turchette, C.A. Sackett,
+D. Kielpinski, W.M. Itano, C. Monroe, D.J. Wineland,
+Nature 403, 269 (2000)
+
+[78] S. Schneider, G.J. Milburn, Phys. Rev. A 59, 3766
+(1999)
+
+[79] Q.A. Turchette, C.J. Myatt, B.E. King, C.A. Sackett,
+D. Kielpinski, W.M. Itano, C. Monroe, D.J. Wineland,
+Phys. Rev. A 62, 053807 (2000)
+
+[80] K. Kraus, Ann. Phys. 64, 311 (1971)
+[81] V. Gorini, A. Kossakowski, E.C.G. Sudarshan, J. Math.
+Phys. 17, 821 (1976)
+
+[82] G. Lindblad, Commun. Math. Phys. 48, 119 (1976)
+[83] R. Alicki, M. Fannes, Quantum Dynamical Systems
+(Oxford University Press, Oxford, 2001)
+
+[84] R. Alicki, K. Lendi, Quantum Dynamical Semigroups
+and Applications, Lect. Notes Phys., vol. 717, 2nd edn.
+(Springer, Berlin/Heidelberg, 2007)
+
+[85] F. Benatti, R. Floreanini, Int. J. Mod. Phys. B 19, 3063
+(2005). doi:10.1142/S0217979205032097
+
+[86] R. Dümcke, H. Spohn, Z. Phys. B 34, 419 (1979)
+[87] V. Gorini, A. Frigerio, M. Verri, A. Kossakowski, E.C.G.
+Sudarshan, Rep. Math. Phys. 13, 149 (1978)
+
+[88] E.B. Davies, Commun. Math. Phys. 39, 91 (1974)
+[89] A. Kossakowski, Rep. Math. Phys. 3, 247 (1972)
+[90] A. Barchielli, V.P. Belavkin, J. Phys. A: Math. Gen. 24,
+1495 (1991)
+
+[91] V.P. Belavkin, in Lecture Notes in Control and Infor-
+mation Sciences, vol. 121 (Springer, Berlin, 1989), pp.
+245–265
+
+
+
+[92] V.P. Belavkin, J. Phys. A: Math. Gen. 22, L1109 (1989)
+[93] V.P. Belavkin, Phys. Lett. A 140, 355 (1989)
+[94] V.P. Belavkin, in Chaos: The Interplay Between
+Stochastic and Deterministic Behaviour, ed. by P. Garbaczewski,
+M. Wolf, A. Veron, Lecture Notes in Physics
+(Springer, 1995), pp. 21–41
+
+[95] L. Diósi, Phys. Lett. A 129, 419 (1988)
+[96] L. Diósi, Phys. Lett. A 132, 233 (1988)
+[97] L. Diósi, J. Phys. A 21, 2885 (1988)
+[98] N. Gisin, Phys. Rev. Lett. 52, 1657 (1984)
+[99] N. Gisin, Helv. Phys. Acta 62, 363 (1989)
+
+[100] H.M. Wiseman, Phys. Rev. A 49, 2133 (1994)
+[101] H.S. Goan, G.J. Milburn, H.M. Wiseman, H.B. Sun,
+Phys. Rev. B 63, 125326 (2001)
+
+[102] M.B. Plenio, P.L. Knight, Rev. Mod. Phys. 70, 101
+(1998)
+
+[103] N.V. Prokof’ev, P.C.E. Stamp, Rep. Prog. Phys. 63,
+669 (2000)
+
+[104] M. Dubé, P.C.E. Stamp, Chem. Phys. 268, 257 (2001)
+[105] S. Gröblacher, A. Trubarov, N. Prigge, M. Aspelmeyer,
+J. Eisert, Nature Comm. 6, 7606 (2015)
+
+[106] S. Chaturvedi, F. Shibata, Z. Phys. B 35, 297 (1979)
+[107] F. Shibata, T. Arimitsu, J. Phys. Soc. Jpn. 49, 891
+(1980)
+
+[108] A. Royer, Phys. Rev. A 6, 1741 (1972)
+[109] A. Royer, Phys. Lett. A 315, 335 (2003)
+[110] R. Feynman, F.L. Vernon, Ann. Phys. (N.Y.) 24, 118
+(1963)
+
+[111] A. Caldeira, A. Leggett, Ann. Phys. (N.Y.) 149, 374
+(1983)
+
+[112] O.V. Lounasmaa, Experimental Principles and Methods
+below 1 K (Academic Press, New York, 1974)
+
+[113] S.L. Adler, J. Phys. A: Math. Gen. 39, 14067 (2006)
+[114] M. Tegmark, Found. Phys. Lett. 6, 571 (1993)
+[115] L. Hackermüller, K. Hornberger, B. Brezger,
+A. Zeilinger, M. Arndt, Appl. Phys. B 77, 781 (2003)
+
+[116] K. Hornberger, S. Uttenthaler, B. Brezger,
+L. Hackermüller, M. Arndt, A. Zeilinger, Phys. Rev. Lett. 90,
+160401 (2003)
+
+[117] K. Hornberger, J.E. Sipe, M. Arndt, Phys. Rev. A 70,
+053608 (2004)
+
+[118] S. Nimmrichter, K. Hornberger, P. Haslinger, M. Arndt,
+Phys. Rev. A 83, 043621 (2011)
+
+[119] D.A. Kokorowski, A.D. Cronin, T.D. Roberts, D.E.
+Pritchard, Phys. Rev. Lett. 86, 2191 (2001)
+
+[120] H. Uys, J.D. Perreault, A.D. Cronin, Phys. Rev. Lett.
+95, 150403 (2005)
+
+[121] A.O. Caldeira, A.J. Leggett, Physica A 121, 587 (1983)
+[122] D.F. Walls, M.J. Collett, G.J. Milburn, Phys. Rev. D
+32, 3208 (1985)
+
+[123] U. Weiss, Quantum Dissipative Systems (World Scien-
+tific, Singapore, 1999)
+
+[124] M. Schlosshauer, A.P. Hines, G.J. Milburn, Phys. Rev.
+A 77, 022111 (2008)
+
+[125] P.C.E. Stamp, in Tunnelling in Complex Systems, ed.
+by S. Tomsovic (World Scientific, Singapore, 1998), pp.
+101–197
+
+[126] N.V. Prokof’ev, P.C.E. Stamp, (1995)
+[127] N.V. Prokof’ev, P.C.E. Stamp, J. Phys. Chem. Lett. 5,
+L663 (1993)
+
+[128] V.V. Dobrovitski, H.A. De Raedt, M.I. Katsnelson,
+B.N. Harmon, Phys. Rev. Lett. 90, 210401 (2003).
+doi:10.1103/PhysRevLett.90.210401
+
+[129] F.M. Cucchietti, J.P. Paz, W.H. Zurek, Phys. Rev. A
+72, 052113 (2005). doi:10.1103/PhysRevA.72.052113
+
+[130] A.O. Caldeira, A.H. Castro Neto, T.O. de Carvalho,
+Phys. Rev. B 48, 13974 (1993)
+
+[131] C. Miquel, J.P. Paz, R. Perazzo, Phys. Rev. A 54, 2605
+(1996)
+
+[132] A.M. Steane, Phys. Rev. Lett. 77, 793 (1996)
+[133] P.W. Shor, Phys. Rev. A 52, R2493 (1995)
+[134] A.M. Steane, in Decoherence and Its Implications in
+Quantum Computation and Information Transfer, ed.
+by P. Turchi, A. Gonis (IOS Press, Amsterdam, 2001),
+pp. 284–298. Also available as eprint quant-ph/0304016
+
+[135] E. Knill, R. Laflamme, A. Ashikhmin, H. Barnum,
+L. Viola, W. Zurek, LA Science 27, 188 (2002)
+
+[136] M.A. Nielsen, I.L. Chuang, Quantum Computation and
+Quantum Information (Cambridge University Press,
+Cambridge, 2000)
+
+[137] J.H. Reina, L. Quiroga, N.F. Johnson, Phys. Rev. A
+65, 032326 (2002)
+
+[138] D. Bacon, D.A. Lidar, K.B. Whaley, Phys. Rev. A 60,
+1944 (1999)
+
+[139] D.A. Lidar, D. Bacon, J. Kempe, K.B. Whaley, Phys.
+Rev. A 63, 022307 (2001)
+
+[140] D.A.R. Dalvit, J. Dziarmaga, W.H. Zurek, Phys. Rev.
+A 62, 013607 (2000)
+
+[141] J.F. Poyatos, J.I. Cirac, P. Zoller, Phys. Rev. Lett. 77,
+4728 (1996)
+
+[142] A.R.R. Carvalho, P. Milman, R.L. de Matos Filho,
+L. Davidovich, Phys. Rev. Lett. 86, 4988 (2001)
+
+[143] L. Viola, S. Lloyd, Phys. Rev. A 58, 2733 (1998)
+[144] L. Viola, E. Knill, S. Lloyd, Phys. Rev. Lett. 82, 2417
+(1999)
+
+[145] P. Zanardi, Phys. Lett. A 258, 77 (1999)
+[146] L. Viola, E. Knill, S. Lloyd, Phys. Rev. Lett. 85, 3520
+(2000)
+
+[147] L.A. Wu, D.A. Lidar, Phys. Rev. Lett. 88, 207902
+(2002)
+
+[148] L.A. Wu, M.S. Byrd, D.A. Lidar, Phys. Rev. Lett. 89,
+127901 (2002)
+
+[149] M. Arndt, K. Hornberger, A. Zeilinger, Phys. World 18,
+35 (2005)
+
+[150] R. Kaiser, C. Westbrook, F. David (eds.).
+Coherent
+Atomic Matter Waves, Les Houches Session LXXII, Les
+Houches Summer School Series (Springer, Berlin, 2001)
+
+[151] M. Blencowe, Phys. Rep. 395, 159 (2004)
+[152] M. Aspelmeyer, T.J. Kippenberg, F. Marquardt, Rev.
+Mod. Phys. 86, 1391 (2014)
+
+[153] A. Bassi, G.C. Ghirardi, Phys. Rep. 379, 257 (2003)
+[154] S.L. Adler, J. Phys. A 40, 2935 (2007)
+[155] A. Bassi, D.A. Deckert, L. Ferialdi, EPL 92, 50006
+(2010)
+
+[156] S. Nimmrichter, K. Hornberger, Phys. Rev. Lett. 110,
+160403 (2013)
+
+[157] M. Brune, E. Hagley, J. Dreyer, X. Maître, A. Maali,
+C. Wunderlich, J.M. Raimond, S. Haroche, Phys. Rev.
+Lett. 77, 4887 (1996)
+
+[158] L. Davidovich, M. Brune, J.M. Raimond, S. Haroche,
+Phys. Rev. A 53, 1295 (1996)
+
+[159] X. Maître, E. Hagley, J. Dreyer, A. Maali, C.W.M.
+Brune, J.M. Raimond, S. Haroche, J. Mod. Opt. 44,
+2023 (1997)
+
+[160] A. Auffeves, P. Maioli, T. Meunier, S. Gleyzes,
+G. Nogues, M. Brune, J.M. Raimond, S. Haroche, Phys.
+Rev. Lett. 91, 230405 (2003)
+
+
+
+[161] S. Deléglise, I. Dotsenko, C. Sayrin, J. Bernu, M. Brune,
+J.M. Raimond, S. Haroche, Nature 455, 510 (2008)
+
+[162] M. Arndt, O. Nairz, J. Vos-Andreae, C. Keller,
+G. van der Zouw, A. Zeilinger, Nature 401, 680 (1999)
+
+[163] S. Gerlich, S. Eibenberger, M. Tomandl, S. Nimmrichter,
+K. Hornberger, P.J. Fagan, J. Tüxen, M. Mayor,
+M. Arndt, Nature Comm. 2, 263 (2012)
+
+[164] S. Eibenberger, S. Gerlich, M. Arndt, M. Mayor,
+J. Tüxen, Phys. Chem. Chem. Phys. 15, 14696 (2013)
+
+[165] S. Gerlich, L. Hackermüller, K. Hornberger, A. Stibor,
+H. Ulbricht, F. Goldfarb, T. Savas, M. Müri, M. Mayor,
+M. Arndt, Nature Phys. 3, 711 (2007)
+
+[166] P. Haslinger, N. Dörre, P. Geyer, J. Rodewald,
+S. Nimmrichter, M. Arndt, Nature Phys. 9, 144 (2013)
+
+[167] K. Hornberger, L. Hackermüller, M. Arndt, Phys. Rev.
+A 71, 023601 (2005)
+
+[168] L. Hackermüller, K. Hornberger, B. Brezger,
+A. Zeilinger, M. Arndt, Nature 427, 711 (2004)
+
+[169] M. Arndt, O. Nairz, A. Zeilinger, in Quantum
+[Un]Speakables: From Bell to Quantum Information,
+ed. by R.A. Bertlmann, A. Zeilinger (Springer, Berlin,
+2002), pp. 333–351
+
+[170] K. Hornberger, Phys. Rev. A 73, 052102 (2006)
+[171] J.R. Friedman, V. Patel, W. Chen, S.K. Yolpygo, J.E.
+Lukens, Nature 406, 43 (2000)
+
+[172] C.H. van der Wal, A.C.J. ter Haar, F.K. Wilhelm, R.N.
+Schouten, C.J.P.M. Harmans, T.P. Orlando, S. Lloyd,
+J.E. Mooij, Science 290, 773 (2000)
+
+[173] I. Chiorescu, Y. Nakamura, C.J.P.M. Harmans, J.E.
+Mooij, Science 21, 1869 (2003)
+
+[174] P. Bertet, I. Chiorescu, G. Burkard, K. Semba, C.J.P.M.
+Harmans, D.P. DiVincenzo, J.E. Mooij, Phys. Rev.
+Lett. 95, 257002 (2005)
+
+[175] Y. Nakamura, Y.A. Pashkin, J.S. Tsai, Nature 398, 786
+(1999)
+
+[176] D. Vion, A. Aassime, A. Cottet, P. Joyez, H. Pothier,
+C. Urbina, D. Esteve, M.H. Devoret, Science 296, 886
+(2002)
+
+[177] Y. Yu, S. Han, X. Chu, S.I. Chu, Z. Wang, Science 296,
+889 (2002)
+
+[178] J.M. Martinis, S. Nam, J. Aumentado, C. Urbina, Phys.
+Rev. Lett. 89, 117901 (2002)
+
+[179] K. Camilleri, Stud. Hist. Phil. Mod. Phys. 40, 290
+(2009)
+
+[180] D. Wallace, in Many Worlds? Everett, Quantum Theory
+and Reality, ed. by S. Saunders, J. Barrett, A. Kent,
+D. Wallace (Oxford University Press, Oxford, 2010), pp.
+53–72
+
+
diff --git a/papers/references/Schlosshauer2007_Placeholder.md b/papers/references/Schlosshauer2007_Placeholder.md
deleted file mode 100644
index eb2c06da..00000000
--- a/papers/references/Schlosshauer2007_Placeholder.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Decoherence and the Quantum-to-Classical Transition (Schlosshauer 2007)
-
-This textbook provides the authoritative derivations of the Spin-Boson Hamiltonian, Ohmic spectral densities, and decoherence functions used to model system-environment interactions.
-Due to copyright and its format, the full PDF is not hosted in this repository.
-
-**Citation:**
-Schlosshauer, M. (2007). *Decoherence and the Quantum-to-Classical Transition*. Springer.
diff --git a/papers/references/Sparso2001.pdf b/papers/references/Sparso2001.pdf
new file mode 100644
index 00000000..0507b33e
--- /dev/null
+++ b/papers/references/Sparso2001.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:35bad8700412d0272a2d3b70216bb7581007c6ae945b927201e31b359965e589
+size 2602288
diff --git a/papers/references/Sparso2001.txt b/papers/references/Sparso2001.txt
new file mode 100644
index 00000000..a1dc2b4b
--- /dev/null
+++ b/papers/references/Sparso2001.txt
@@ -0,0 +1,24580 @@
+PRINCIPLES OF
+ASYNCHRONOUS CIRCUIT DESIGN
+– A Systems Perspective
+
+Edited by
+JENS SPARSØ
+Technical University of Denmark
+
+STEVE FURBER
+The University of Manchester, UK
+
+Kluwer Academic Publishers
+Boston/Dordrecht/London
+
+
+Contents
+
+Preface
+
+Part I
+Asynchronous circuit design – A tutorial
+Author: Jens Sparsø
+
+1 Introduction
+1.1 Why consider asynchronous circuits?
+1.2 Aims and background
+1.3 Clocking versus handshaking
+1.4 Outline of Part I
+
+2 Fundamentals
+2.1 Handshake protocols
+2.1.1 Bundled-data protocols
+2.1.2 The 4-phase dual-rail protocol
+2.1.3 The 2-phase dual-rail protocol
+2.1.4 Other protocols
+2.2 The Muller C-element and the indication principle
+2.3 The Muller pipeline
+2.4 Circuit implementation styles
+2.4.1 4-phase bundled-data
+2.4.2 2-phase bundled data (Micropipelines)
+2.4.3 4-phase dual-rail
+2.5 Theory
+2.5.1 The basics of speed-independence
+2.5.2 Classification of asynchronous circuits
+2.5.3 Isochronic forks
+2.5.4 Relation to circuits
+2.6 Test
+2.7 Summary
+
+3 Static data-flow structures
+3.1 Introduction
+3.2 Pipelines and rings
+
+
+vi
+PRINCIPLES OF ASYNCHRONOUS CIRCUIT DESIGN
+
+3.3
+Building blocks
+31
+3.4
+A simple example
+33
+3.5
+Simple applications of rings
+35
+3.5.1
+Sequential circuits
+35
+3.5.2
+Iterative computations
+35
+3.6
+FOR, IF, and WHILE constructs
+36
+3.7
+A more complex example: GCD
+38
+3.8
+Pointers to additional examples
+39
+3.8.1
+A low-power filter bank
+39
+3.8.2
+An asynchronous microprocessor
+39
+3.8.3
+A fine-grain pipelined vector multiplier
+40
+3.9
+Summary
+40
+
+4 Performance
+4.1 Introduction
+4.2 A qualitative view of performance
+4.2.1 Example 1: A FIFO used as a shift register
+4.2.2 Example 2: A shift register with parallel load
+4.3 Quantifying performance
+4.3.1 Latency, throughput and wavelength
+4.3.2 Cycle time of a ring
+4.3.3 Example 3: Performance of a 3-stage ring
+4.3.4 Final remarks
+4.4 Dependency graph analysis
+4.4.1 Example 4: Dependency graph for a pipeline
+4.4.2 Example 5: Dependency graph for a 3-stage ring
+4.5 Summary
+
+5 Handshake circuit implementations
+5.1 The latch
+5.2 Fork, join, and merge
+5.3 Function blocks – The basics
+5.3.1 Introduction
+5.3.2 Transparency to handshaking
+5.3.3 Review of ripple-carry addition
+5.4 Bundled-data function blocks
+5.4.1 Using matched delays
+5.4.2 Delay selection
+5.5 Dual-rail function blocks
+5.5.1 Delay insensitive minterm synthesis (DIMS)
+5.5.2 Null Convention Logic
+5.5.3 Transistor-level CMOS implementations
+5.5.4 Martin’s adder
+5.6 Hybrid function blocks
+5.7 MUX and DEMUX
+5.8 Mutual exclusion, arbitration and metastability
+5.8.1 Mutual exclusion
+5.8.2 Arbitration
+5.8.3 Probability of metastability
+5.9 Summary
+
+6 Speed-independent control circuits
+6.1 Introduction
+6.1.1 Asynchronous sequential circuits
+6.1.2 Hazards
+6.1.3 Delay models
+6.1.4 Fundamental mode and input-output mode
+6.1.5 Synthesis of fundamental mode circuits
+6.2 Signal transition graphs
+6.2.1 Petri nets and STGs
+6.2.2 Some frequently used STG fragments
+6.3 The basic synthesis procedure
+6.3.1 Example 1: a C-element
+6.3.2 Example 2: a circuit with choice
+6.3.3 Example 2: Hazards in the simple gate implementation
+6.4 Implementations using state-holding gates
+6.4.1 Introduction
+6.4.2 Excitation regions and quiescent regions
+6.4.3 Example 2: Using state-holding elements
+6.4.4 The monotonic cover constraint
+6.4.5 Circuit topologies using state-holding elements
+6.5 Initialization
+6.6 Summary of the synthesis process
+6.7 Petrify: A tool for synthesizing SI circuits from STGs
+6.8 Design examples using Petrify
+6.8.1 Example 2 revisited
+6.8.2 Control circuit for a 4-phase bundled-data latch
+6.8.3 Control circuit for a 4-phase bundled-data MUX
+6.9 Summary
+
+7 Advanced 4-phase bundled-data protocols and circuits
+7.1 Channels and protocols
+7.1.1 Channel types
+7.1.2 Data-validity schemes
+7.1.3 Discussion
+7.2 Static type checking
+7.3 More advanced latch control circuits
+7.4 Summary
+
+8 High-level languages and tools
+8.1 Introduction
+8.2 Concurrency and message passing in CSP
+8.3 Tangram: program examples
+8.3.1 A 2-place shift register
+8.3.2 A 2-place (ripple) FIFO
+8.3.3 GCD using while and if statements
+8.3.4 GCD using guarded commands
+8.4 Tangram: syntax-directed compilation
+8.4.1 The 2-place shift register
+8.4.2 The 2-place FIFO
+8.4.3 GCD using guarded repetition
+8.5 Martin’s translation process
+8.6 Using VHDL for asynchronous design
+8.6.1 Introduction
+8.6.2 VHDL versus CSP-type languages
+8.6.3 Channel communication and design flow
+8.6.4 The abstract channel package
+8.6.5 The real channel package
+8.6.6 Partitioning into control and data
+8.7 Summary
+Appendix: The VHDL channel packages
+A.1 The abstract channel package
+A.2 The real channel package
+
+Part II
+Balsa - An Asynchronous Hardware Synthesis System
+Author: Doug Edwards, Andrew Bardsley
+
+9 An introduction to Balsa
+9.1 Overview
+9.2 Basic concepts
+9.3 Tool set and design flow
+9.4 Getting started
+9.4.1 A single-place buffer
+9.4.2 Two-place buffers
+9.4.3 Parallel composition and module reuse
+9.4.4 Placing multiple structures
+9.5 Ancillary Balsa tools
+9.5.1 Makefile generation
+9.5.2 Estimating area cost
+9.5.3 Viewing the handshake circuit graph
+9.5.4 Simulation
+
+10 The Balsa language
+10.1 Data types
+10.2 Data typing issues
+10.3 Control flow and commands
+10.4 Binary/unary operators
+10.5 Program structure
+10.6 Example circuits
+10.7 Selecting channels
+
+11 Building library components
+11.1 Parameterised descriptions
+11.1.1 A variable width buffer definition
+11.1.2 Pipelines of variable width and depth
+11.2 Recursive definitions
+11.2.1 An n-way multiplexer
+11.2.2 A population counter
+11.2.3 A Balsa shifter
+11.2.4 An arbiter tree
+
+12 A simple DMA controller
+12.1 Global registers
+12.2 Channel registers
+12.3 DMA controller structure
+12.4 The Balsa description
+12.4.1 Arbiter tree
+12.4.2 Transfer engine
+12.4.3 Control unit
+
+Part III
+Large-Scale Asynchronous Designs
+
+13 Descale
+Joep Kessels & Ad Peeters, Torsten Kramer and Volker Timm
+13.1 Introduction
+13.2 VLSI programming of asynchronous circuits
+13.2.1 The Tangram toolset
+13.2.2 Handshake technology
+13.2.3 GCD algorithm
+13.3 Opportunities for asynchronous circuits
+13.4 Contactless smartcards
+13.5 The digital circuit
+13.5.1 The 80C51 microcontroller
+13.5.2 The prefetch unit
+13.5.3 The DES coprocessor
+13.6 Results
+13.7 Test
+13.8 The power supply unit
+13.9 Conclusions
+
+14 An Asynchronous Viterbi Decoder
+Linda E. M. Brackenbury
+14.1 Introduction
+14.2 The Viterbi decoder
+14.2.1 Convolution encoding
+14.2.2 Decoder principle
+14.3 System parameters
+14.4 System overview
+14.5 The Path Metric Unit (PMU)
+14.5.1 Node pair design in the PMU
+14.5.2 Branch metrics
+14.5.3 Slot timing
+14.5.4 Global winner identification
+14.6 The History Unit (HU)
+14.6.1 Principle of operation
+14.6.2 History Unit backtrace
+14.6.3 History Unit implementation
+14.7 Results and design evaluation
+14.8 Conclusions
+14.8.1 Acknowledgement
+14.8.2 Further reading
+
+15 Processors
+Jim D. Garside
+15.1 An introduction to the Amulet processors
+15.1.1 Amulet1 (1994)
+15.1.2 Amulet2e (1996)
+15.1.3 Amulet3i (2000)
+15.2 Some other asynchronous microprocessors
+15.3 Processors as design examples
+15.4 Processor implementation techniques
+15.4.1 Pipelining processors
+15.4.2 Asynchronous pipeline architectures
+15.4.3 Determinism and non-determinism
+15.4.4 Dependencies
+15.4.5 Exceptions
+15.5 Memory – a case study
+15.5.1 Sequential accesses
+15.5.2 The Amulet3i RAM
+15.5.3 Cache
+15.6 Larger asynchronous systems
+15.6.1 System-on-Chip (DRACO)
+15.6.2 Interconnection
+15.6.3 Balsa and the DMA controller
+15.6.4 Calibrated time delays
+15.6.5 Production test
+15.7 Summary
+
+Epilogue
+
+References
+
+Index
+
+
+Preface
+
+This book was compiled to address a perceived need for an introductory text
+on asynchronous design. There are several highly technical books on aspects of
+the subject, but no obvious starting point for a designer who wishes to become
+acquainted for the first time with asynchronous technology. We hope this book
+will serve as that starting point.
+The reader is assumed to have some background in digital design. We as-
+sume that concepts such as logic gates, flip-flops and Boolean logic are famil-
+iar. Some of the latter sections also assume familiarity with the higher levels of
+digital design such as microprocessor architectures and systems-on-chip, but
+readers unfamiliar with these topics should still find the majority of the book
+accessible.
+The intended audience for the book comprises the following groups:
+
+Industrial designers with a background in conventional (clocked) digital
+design who wish to gain an understanding of asynchronous design in
+order, for example, to establish whether or not it may be advantageous
+to use asynchronous techniques in their next design task.
+
+Students in Electronic and/or Computer Engineering who are taking a
+course that includes aspects of asynchronous design.
+
+The book is structured in three parts. Part I is a tutorial in asynchronous
+design. It addresses the most important issue for the beginner, which is how to
+think about asynchronous systems. The first big hurdle to be cleared is that of
+mindset – asynchronous design requires a different mental approach from that
+normally employed in clocked design. Attempts to take an existing clocked
+system, strip out the clock and simply replace it with asynchronous handshakes
+are doomed to disappoint. Another hurdle is that of circuit design methodol-
+ogy – the existing body of literature presents an apparent plethora of disparate
+approaches. The aim of the tutorial is to get behind this and to present a single
+unified and coherent perspective which emphasizes the common ground. In
+this way the tutorial should enable the reader to begin to understand the char-
+acteristics of asynchronous systems in a way that will enable them to ‘think
+outside the box’ of conventional clocked design and to create radical new de-
+sign solutions that fully exploit the potential of clockless systems.
+Once the asynchronous design mindset has been mastered, the second hur-
+dle is designer productivity. VLSI designers are used to working in a highly
+productive environment supported by powerful automatic tools. Asynchronous
+design lags in its tools environment, but things are improving. Part II of the
+book gives an introduction to Balsa, a high-level synthesis system for asyn-
+chronous circuits. It is written by Doug Edwards (who has managed the Balsa
+development at the University of Manchester since its inception) and Andrew
+Bardsley (who has written most of the software). Balsa is not the solution to all
+asynchronous design problems, but it is capable of synthesizing very complex
+systems (for example, the 32-channel DMA controller used on the DRACO
+chip described in Chapter 15) and it is a good way to develop an understanding
+of asynchronous design ‘in the large’.
+Knowing how to think about asynchronous design and having access to suit-
+able tools leaves one question: what can be built in this way? In Part III we
+offer a number of examples of complex asynchronous systems as illustrations
+of the answer to this question. In each of these examples the designers have
+been asked to provide descriptions that will provide the reader with insights
+into the design process. The examples include a commercial smart card chip
+designed at Philips and a Viterbi decoder designed at the University of Manch-
+ester. Part III closes with a discussion of the issues that come up in the design
+of advanced asynchronous microprocessors, focusing on the Amulet processor
+series, again developed at the University of Manchester.
+Although the book is a compilation of contributions from different authors,
+each of these has been specifically written with the goals of the book in mind –
+to provide answers to the sorts of questions that a newcomer to asynchronous
+design is likely to ask. In order to keep the book accessible and to avoid it
+becoming an intimidating size, much valuable work has had to be omitted. Our
+objective in introducing you to asynchronous design is that you might become
+acquainted with it. If your relationship develops further, perhaps even into the
+full-blown affair that has smitten a few, included among whose number are the
+contributors to this book, you will, of course, want to know more. The book
+includes an extensive bibliography that will provide food enough for even the
+most insatiable of appetites.
+
+JENS SPARSØ AND STEVE FURBER, SEPTEMBER 2001
+
+
+Acknowledgments
+
+Many people have helped significantly in the creation of this book. In addi-
+tion to writing their respective chapters, several of the authors have also read
+and commented on drafts of other parts of the book, and the quality of the work
+as a whole has been enhanced as a result.
+The editors are also grateful to Alan Williams, Russell Hobson and Steve
+Temple, for their careful reading of drafts of this book and their constructive
+suggestions for improvement.
+Part I of the book has been used as a course text and the quality and con-
+sistency of the content improved by feedback from the students on the spring
+2001 course “49425 Design of Asynchronous Circuits” at DTU.
+Any remaining errors or omissions are the responsibility of the editors.
+
+The writing of this book was initiated as a dissemination effort within the
+European Low-Power Initiative for Electronic System Design (ESD-LPD), and
+this book is part of the book series from this initiative. As will become clear,
+the book goes far beyond the dissemination of results from projects within
+the ESD-LPD cluster, and the editors would like to acknowledge the support
+of the working group on asynchronous circuit design, ACiD-WG, that has
+provided a fruitful forum for interaction and the exchange of ideas. The ACiD-
+WG has been funded by the European Commission since 1992 under several
+Framework Programmes: FP3 Basic Research (EP7225), FP4 Technologies
+for Components and Subsystems (EP21949), and FP5 Microelectronics (IST-
+1999-29119).
+
+
+
+Foreword
+
+This book is the third in a series on novel low-power design architectures,
+methods and design practices. It results from a large European project started
+in 1997, whose goal is to promote the further development and the faster and
+wider industrial use of advanced design methods for reducing the power con-
+sumption of electronic systems.
+Low-power design became crucial with the widespread use of portable in-
+formation and communication terminals, where a small battery has to last for a
+long period. High-performance electronics, in addition, suffers from a contin-
+uing increase in the dissipated power per square millimeter of silicon, due to
+increasing clock-rates, which causes cooling and reliability problems or other-
+wise limits performance.
+The European Union’s Information Technologies Programme ‘Esprit’ there-
+fore launched a ‘Pilot action for Low-Power Design’, which eventually grew
+to 19 R&D projects and one coordination project, with an overall budget of 14
+million EUROs. This action is known as the European Low-Power Initiative
+for Electronic System Design (ESD-LPD) and will be completed in the year
+2002. It aims to develop or demonstrate new design methods for power reduc-
+tion, while the coordination project ensures that the methods, experiences and
+results are properly documented and publicised.
+The initiative addresses low-power design at various levels. These include
+system and algorithmic level, instruction set processor level, custom proces-
+sor level, register transfer level, gate level, circuit level and layout level. It
+covers data-dominated, control-dominated and asynchronous architectures. 10
+projects deal mainly with digital circuits, 7 with analog and mixed-signal cir-
+cuits, and 2 with software-related aspects. The principal application areas are
+communication, medical equipment and e-commerce devices.
+
+The following list describes the objectives of the 20 projects. It is sorted by
+decreasing funding budget.
+
+CRAFT CMOS Radio Frequency Circuit Design for Wireless Application
+
+Advanced CMOS RF circuit design including blocks such as LNA, down con-
+verter mixers & phase shifters, oscillators and frequency synthesisers, integrated
+filters, delta sigma conversion, power amplifiers
+
+Development of novel models for active and passive devices as well as fine-tuning
+and validation based on first silicon prototypes
+
+Analysis and specification of sophisticated architectures to meet, in particular,
+low-power single-chip implementation
+
+PAPRICA Power and Part Count Reduction Innovative Communication Architecture
+
+Feasibility assessment of DQIF, through physical design and characterisation of
+the core blocks
+
+Low-power RF design techniques in standard CMOS digital processes
+
+RF design tools and framework; PAPRICA Design Kit
+
+Demonstration of a practical implementation of a specific application
+
+MELOPAS Methodology for Low Power Asic design
+
+To develop a methodology to evaluate the power consumption of a complex ASIC
+early on in the design flow
+
+To develop a hardware/software co-simulation tool
+
+To quickly achieve a drastic reduction in the power consumption of electronic
+equipment
+
+TARDIS Technical Coordination and Dissemination
+
+To organise the communication between design experiments and to exploit their
+potential synergy
+
+To guide the capturing of methods and experiences gained in the design experi-
+ments
+
+To organise and promote the wider dissemination and use of the gathered design
+know-how and experience
+
+LUCS Low-Power Ultrasound Chip Set.
+
+Design methodology on low-power ADC, memory and circuit design
+
+Prototype demonstration of a hand-held medical ultrasound scanner
+
+
+ALPINS Analog Low-Power Design for Communications Systems
+
+Low-voltage voice band smoothing filters and analog-to-digital and digital-to-
+analog converters for an analog front-end circuit for a DECT system
+
+High linear transconductor-capacitor (gm-C) filter for GSM Analog Interface Cir-
+cuit operating at supply voltages as low as 2.5V
+
+Formal verification tools, which will be implemented in the industrial partner’s
+design environment. These tools support the complete design process from sys-
+tem level down to transistor level
+
+SALOMON System-level analog-digital trade-off analysis for low power
+
+A general top-down design flow for mixed-signal telecom ASICs
+
+High-level models of analog and digital blocks and power estimators for these
+blocks
+
+A prototype implementation of the design flow with particular software tools to
+demonstrate the general design flow
+
+DESCALE Design Experiment on a Smart Card Application for Low Energy
+
+The application of highly innovative handshake technology
+
+Aiming at some 3 to 5 times less power and some 10 times smaller peak currents
+compared to synchronously operated solutions
+
+SUPREGE A low-power SUPerREGEnerative transceiver for wireless data transmission at
+short distances
+
+Design trade-offs and optimisation of the micro power receiver/transmitter as a
+function of various parameters (power consumption, area, bandwidth, sensitivity,
+etc)
+
+Modulation/demodulation and interface with data transmission systems
+
+Realisation of the integrated micro power receiver/transmitter based on the super-
+regeneration principle
+
+PREST Power REduction for System Technologies
+
+Survey of contemporary Low-Power Design techniques and commercial power
+analysis software tools
+
+Investigation of architectural and algorithmic design techniques with a power
+consumption comparison
+
+Investigation of Asynchronous design techniques and Arithmetic styles
+
+Set-up and assessment of a low-power design flow
+
+Fabrication and characterisation of a Viterbi demonstrator to assess the most
+promising power reduction techniques
+
+
+DABLP Low-Power Exploration for Mapping DAB Applications to Multi-Processors
+
+A DAB channel decoder architecture with reduced power consumption
+
+Refined and extended ATOMIUM methodology and supporting tools
+
+COSAFE Low-Power Hardware-Software Co-Design for Safety-Critical Applications
+
+The development of strategies for power-efficient assignment of safety critical
+mechanisms to hardware or software
+
+The design and implementation of a low-power, safety-critical ASIP, which re-
+alises the control unit of a portable infusion pump system
+
+AMIED Asynchronous Low-Power Methodology and Implementation of an Encryption/De-
+cryption System
+
+Implementation of the IDEA encryption/decryption method with drastically re-
+duced power consumption
+
+Advanced low-power design flow with emphasis on algorithm and architecture
+optimisations
+
+Industrial demonstration of the asynchronous design methodology based on com-
+mercial tools
+
+LPGD A Low-Power Design Methodology/Flow and its Application to the Implementation of
+a DCS1800-GSM/DECT Modulator/Demodulator
+
+To complete the development of a top-down, low-power design methodology/flow
+for DSP applications
+
+To demonstrate the methods on the example of an integrated GFSK/GMSK Modu-
+lator-Demodulator (MODEM) for DCS1800-GSM/DECT applications
+
+SOFLOPO Low-Power Software Development for Embedded Applications
+
+Develop techniques and guidelines for mapping a specific algorithm code onto
+appropriate instruction subsets
+
+Integrate these techniques into software for the power-conscious ARM-RISC and
+DSP code optimisation
+
+I-MODE Low-Power RF to Base Band Interface for Multi-Mode Portable Phone
+
+To raise the level of integration in a DECT/DCS1800 transceiver, by implement-
+ing the necessary analog base band low-pass filters and data converters in CMOS
+technology using low-power techniques
+
+COOL-LOGOS Power Reduction through the Use of Local don’t Care Conditions and Global
+Gate Resizing Techniques: An Experimental Evaluation.
+
+To apply the developed low-power design techniques to an existing 24-bit DSP,
+which is already fabricated
+
+To assess the merit of the new techniques using experimental silicon through com-
+parisons of the projected power reduction (in simulation) and actually measured
+reduction of new DSP; assessment of the commercial impact
+
+
+LOVO Low Output VOltage DC/DC converters for low-power applications
+
+Development of technical solutions for the power supplies of advanced low-
+power systems
+
+New methods for synchronous rectification for very low output voltage power
+converters
+
+PCBIT Low-Power ISDN Interface for Portable PC’s
+
+Design of a PC-Card board that implements the PCBIT interface
+
+Integrate levels 1 and 2 of the communication protocol in a single ASIC
+
+Incorporate power management techniques in the ASIC design:
+
+– system level: shutdown of idle modules in the circuit
+– gate level: precomputation, gated-clock FSMs
+
+COLOPODS Design of a Cochlear Hearing Aid Low-Power DSP System
+
+Selection of a future oriented low-power technology enabling future power re-
+duction through integration of analog modules
+
+Design of a speech processor IC yielding a power reduction of 90% compared to
+the 3.3 Volt implementation
+
+The low power design projects have achieved the following results:
+
+Projects that have designed prototype chips can demonstrate power re-
+ductions of 10 to 30 percent.
+
+New low-power design libraries have been developed.
+
+New proven low-power RF architectures are now available.
+
+New smaller and lighter mobile equipment has been developed.
+
+Instead of running a number of Esprit projects at the same time indepen-
+dently of each other, during this pilot action the projects have collaborated
+strongly. This is achieved mostly by the novel feature of this action, which
+is the presence and role of the coordinator: DIMES - the Delft Institute of
+Microelectronics and Submicron-technology, located in Delft, the Netherlands
+(http://www.dimes.tudelft.nl). The task of the coordinator is to co-ordinate,
+facilitate, and organize:
+
+the information exchange between projects;
+
+the systematic documentation of methods and experiences;
+
+the publication and the wider dissemination to the public.
+
+
+The most important achievements, credited to the presence of the coordina-
+tor are:
+
+New personnel contacts have been made, and as a consequence the re-
+sulting synergy between partners resulted in better and faster develop-
+ments.
+
+The organization of low-power design workshops, special sessions at
+conferences, and a low-power design web site:
+
+http://www.esdlpd.dimes.tudelft.nl.
+
+At this site all of the public reports from the projects can be found, as
+can all kinds of information about the initiative itself.
+
+The design methodology, design methods and/or design experience are
+disclosed, are well-documented and available.
+
+Based on the work of the projects, and in cooperation with the projects,
+the publication of a low-power design book series is planned. Written
+by members of the projects, this series of books on low-power design
+will disseminate novel design methodologies and design experiences
+that were obtained during the run-time of the European Low Power Ini-
+tiative for Electronic System Design, to the general public.
+
+In conclusion, the major contribution of this project cluster is, in addition
+to the technical achievements already mentioned, the acceleration of the in-
+troduction of novel knowledge on low-power design methods into mainstream
+development processes.
+We would like to thank all project partners from all of the different compa-
+nies and organizations who make the Low-Power Initiative a success.
+
+Rene van Leuken, Reinder Nouta, Alexander de Graaf, Delft, June 2001
+
+
+I
+
+ASYNCHRONOUS CIRCUIT DESIGN
+– A TUTORIAL
+
+Author: Jens Sparsø
+Technical University of Denmark
+jsp@imm.dtu.dk
+
+Abstract
+Asynchronous circuits have characteristics that differ significantly from those
+of synchronous circuits and, as will be clear from some of the later chapters
+in this book, it is possible to exploit these characteristics to design circuits with
+very interesting performance parameters in terms of their power, performance,
+electromagnetic emissions (EMI), etc.
+Asynchronous design is not yet a well-established and widely-used design
+methodology. There are textbooks that provide comprehensive coverage of the
+underlying theories, but the field has not yet matured to a point where there is an
+established curriculum and university tradition for teaching courses on asynchronous
+circuit design to electrical engineering and computer engineering students.
+As this author sees the situation there is a gap between understanding the fun-
+damentals and being able to design useful circuits of some complexity. The aim
+of Part I of this book is to provide a tutorial on asynchronous circuit design that
+fills this gap.
+More specifically the aims are: (i) to introduce readers with a background in
+synchronous digital circuit design to the fundamentals of asynchronous circuit
+design such that they are able to read and understand the literature, and (ii) to
+provide readers with an understanding of the “nature” of asynchronous circuits
+such that they are able to design non-trivial circuits with interesting performance
+parameters.
+The material is based on experience from the design of several asynchronous
+chips, and it has evolved over the last decade from tutorials given at a number
+of European conferences and from a number of special topics courses taught
+at the Technical University of Denmark and elsewhere. In May 1999 I gave a
+one-week intensive course at Delft University of Technology and it was while
+preparing for this course that I felt the material was shaping up, and I set out
+to write the following text. Most of the material has recently been used and
+debugged in a course at the Technical University of Denmark in the spring of 2001.
+Supplemented by a few journal articles and a small design project, the text may
+be used for a one semester course on asynchronous design.
+
+Keywords:
+asynchronous circuits, tutorial
+
+
+
+Chapter 1
+
+INTRODUCTION
+
+1.1.
+Why consider asynchronous circuits?
+
+Most digital circuits designed and fabricated today are “synchronous”. In
+essence, they are based on two fundamental assumptions that greatly simplify
+their design: (1) all signals are binary, and (2) all components share a common
+and discrete notion of time, as defined by a clock signal distributed throughout
+the circuit.
+Asynchronous circuits are fundamentally different; they also assume bi-
+nary signals, but there is no common and discrete time. Instead the circuits
+use handshaking between their components in order to perform the necessary
+synchronization, communication, and sequencing of operations. Expressed in
+‘synchronous terms’ this results in a behaviour that is similar to systematic
+fine-grain clock gating and local clocks that are not in phase and whose period
+is determined by actual circuit delays – registers are only clocked where and
+when needed.
+This difference gives asynchronous circuits inherent properties that can be
+(and have been) exploited to advantage in the areas listed and motivated below.
+The interested reader may find further introduction to the mechanisms behind
+the advantages mentioned below in [140].
+
+Low power consumption, [136, 138, 42, 45, 99, 102]
+
+... due to fine-grain clock gating and zero standby power consumption.
+
+High operating speed, [156, 157, 88]
+
+... operating speed is determined by actual local latencies rather than
+global worst-case latency.
+
+Less emission of electro-magnetic noise, [136, 109]
+
+... the local clocks tend to tick at random points in time.
+
+Robustness towards variations in supply voltage, temperature, and fabri-
+cation process parameters, [87, 98, 100]
+
+... timing is based on matched delays (and can even be insensitive to
+circuit and wire delays).
+
+Better composability and modularity, [92, 80, 142, 128, 124]
+
+... because of the simple handshake interfaces and the local timing.
+
+No clock distribution and clock skew problems,
+
+... there is no global signal that needs to be distributed with minimal
+phase skew across the circuit.
+
+On the other hand there are also some drawbacks. The asynchronous con-
+trol logic that implements the handshaking normally represents an overhead
+in terms of silicon area, circuit speed, and power consumption. It is therefore
+pertinent to ask whether or not the investment pays off, i.e. whether the use of
+asynchronous techniques results in a substantial improvement in one or more
+of the above areas. Other obstacles are a lack of CAD tools and strategies and
+a lack of tools for testing and test vector generation.
+Research in asynchronous design goes back to the mid 1950s [93, 92], but
+it was not until the late 1990s that projects in academia and industry demon-
+strated that it is possible to design asynchronous circuits which exhibit signifi-
+cant benefits in nontrivial real-life examples, and that commercialization of the
+technology began to take place. Recent examples are presented in [106] and in
+Part III of this book.
+
+1.2.
+Aims and background
+
+There are already several excellent articles and book chapters that introduce
+asynchronous design [54, 33, 34, 35, 140, 69, 124] as well as several mono-
+graphs and textbooks devoted to asynchronous design including [106, 14, 25,
+18, 95] – why then write yet another introduction to asynchronous design?
+There are several reasons:
+
+My experience from designing several asynchronous chips [123, 103],
+and from teaching asynchronous design to students and engineers over
+the past 10 years, is that it takes more than knowledge of the basic prin-
+ciples and theories to design efficient asynchronous circuits. In my ex-
+perience there is a large gap between the introductory articles and book
+chapters mentioned above explaining the design methods and theories
+on the one side, and the papers describing actual designs and current re-
+search on the other side. It takes more than knowing the rules of a game
+to play and win the game. Bridging this gap involves experience and a
+good understanding of the nature of asynchronous circuits. An experi-
+ence that I share with many other researchers is that “just going asyn-
+chronous” results in larger, slower and more power consuming circuits.
+The crux is to use asynchronous techniques to exploit characteristics in
+the algorithm and architecture of the application in question. This fur-
+
+
+Chapter 1: Introduction
+5
+
+ther implies that asynchronous techniques may not always be the right
+solution to the problem.
+
+Another issue is that asynchronous design is a rather young discipline.
+Different researchers have proposed different circuit structures and de-
+sign methods. At a first glance they may seem different – an observation
+that is supported by different terminologies; but a closer look often re-
+veals that the underlying principles and the resulting circuits are rather
+similar.
+
+Finally, most of the above-mentioned introductory articles and book
+chapters are comprehensive in nature. While being appreciated by those
+already working in the field, the multitude of different theories and ap-
+proaches in existence represents an obstacle for the newcomer wishing
+to get started designing asynchronous circuits.
+
+Compared to the introductory texts mentioned above, the aims of this tu-
+torial are: (1) to provide an introduction to asynchronous design that is more
+selective, (2) to stress basic principles and similarities between the different ap-
+proaches, and (3) to take the introduction further towards designing practical
+and useful circuits.
+
+1.3.
+Clocking versus handshaking
+
+Figure 1.1(a) shows a synchronous circuit. For simplicity the figure shows a
+pipeline, but it is intended to represent any synchronous circuit. When design-
+ing ASICs using hardware description languages and synthesis tools, designers
+focus mostly on the data processing and assume the existence of a global clock.
+For example, a designer would express the fact that data clocked into register
+R3 is a function CL3 of the data clocked into R2 at the previous clock as the
+following assignment of variables: R3 := CL3(R2). Figure 1.1(a) represents
+this high-level view with a universal clock.
+When it comes to physical design, reality is different. Today's ASICs use a
+structure of clock buffers resulting in a large number of (possibly gated) clock
+signals as shown in figure 1.1(b). It is well known that it takes CAD tools
+and engineering effort to design the clock gating circuitry and to minimize
+and control the skew between the many different clock signals. Guaranteeing
+the two-sided timing constraints – the setup to hold time window around the
+clock edge – in a world that is dominated by wire delays is not an easy task.
+The buffer-insertion-and-resynthesis process that is used in current commercial
+CAD tools may not converge and, even if it does, it relies on delay models that
+are often of questionable accuracy.
+
+
+[Figure 1.1: panels (a)-(d); labels include registers R1-R4, combinational blocks CL3 and CL4, CLK, clock gate signal, CTL, Req, Ack, Data, and "Channel" or "Link".]
+
+Figure 1.1.
+(a) A synchronous circuit, (b) a synchronous circuit with clock drivers and clock
+gating, (c) an equivalent asynchronous circuit, and (d) an abstract data-flow view of the
+asynchronous circuit. (The figure shows a pipeline, but it is intended to represent any
+circuit topology).
+
+
+
+Asynchronous design represents an alternative to this. In an asynchronous
+circuit the clock signal is replaced by some form of handshaking between
+neighbouring registers; for example the simple request-acknowledge based
+handshake protocol shown in figure 1.1(c). In the following chapter we look
+at alternative handshake protocols and data encodings, but before departing
+into these implementation details it is useful to take a more abstract view as
+illustrated in figure 1.1(d):
+
+think of the data and handshake signals connecting one register to the
+next in figure 1.1(c) as a “handshake channel” or “link,”
+
+think of the data stored in the registers as tokens tagged with data values
+(that may be changed along the way as tokens flow through combina-
+tional circuits), and
+
+think of the combinational circuits as being transparent to the handshak-
+ing between registers; a combinational circuit simply absorbs a token on
+each of its input links, performs its computation, and then emits a to-
+ken on each of its output links (much like a transition in a Petri net, cf.
+section 6.2.1).
+
+Viewed this way, an asynchronous circuit is simply a static data-flow struc-
+ture [36]. Intuitively, correct operation requires that data tokens flowing in the
+circuit do not disappear, that one token does not overtake another, and that new
+tokens do not appear out of nowhere. A simple rule that can ensure this is the
+following:
+
+A register may input and store a new data token from its predecessor if its
+successor has input and stored the data token that the register was previ-
+ously holding. [The states of the predecessor and successor registers are
+signaled by the incoming request and acknowledge signals respectively.]
+
+Following this rule data is copied from one register to the next along the path
+through the circuit. In this process subsequent registers will often be holding
+copies of the same data value but the old duplicate data values will later be
+overwritten by new data values in a carefully ordered manner, and a handshake
+cycle on a link will always enclose the transfer of exactly one data-token. Un-
+derstanding this “token flow game” is crucial to the design of efficient circuits,
+and we will address these issues later, extending the token-flow view to cover
+structures other than pipelines. Our aim here is just to give the reader an intu-
+itive feel for the fundamentally different nature of asynchronous circuits.
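The token-flow rule above can be tried out in a few lines of Python. This is only a minimal sketch under simplifying assumptions: a register whose token has already been taken by its successor is modelled as `None` (in a real circuit it would still hold a stale copy), and the names `sweep` and `run` are illustrative, not taken from the text.

```python
def sweep(regs, pending, sink_ready=True):
    """Apply the token-flow rule once to every register, right to left.

    regs[i] is a data token, or None once the successor has taken it.
    pending is the list of tokens the left environment still wants to send.
    Returns the new register contents and the token delivered (or None).
    """
    regs = list(regs)
    delivered = None
    # The right environment absorbs the last token if it is ready.
    if sink_ready and regs[-1] is not None:
        delivered, regs[-1] = regs[-1], None
    # A register accepts a new token only if its old one has been taken.
    for i in range(len(regs) - 1, 0, -1):
        if regs[i] is None and regs[i - 1] is not None:
            regs[i], regs[i - 1] = regs[i - 1], None
    # The left environment injects a token when the first register is free.
    if regs[0] is None and pending:
        regs[0] = pending.pop(0)
    return regs, delivered


def run(tokens, n_regs=3):
    """Push all tokens through an initially empty pipeline."""
    regs, pending, received = [None] * n_regs, list(tokens), []
    while pending or any(r is not None for r in regs):
        regs, out = sweep(regs, pending)
        if out is not None:
            received.append(out)
    return received
```

In this model tokens arrive in the order they were sent, none are lost or duplicated, and a pipeline whose right environment never responds simply fills up and stalls.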
+An important message is that the “handshake-channel and data-token view”
+represents a very useful abstraction that is equivalent to the register transfer
+level (RTL) used in the design of synchronous circuits. This data-flow ab-
+straction, as we will call it, separates the structure and function of the circuit
+from the implementation details of its components.
+
+
+
+Another important message is that it is the handshaking between the regis-
+ters that controls the flow of tokens, whereas the combinational circuit blocks
+must be fully transparent to this handshaking. Ensuring this transparency is not
+always trivial; it takes more than a traditional combinational circuit, so we will
+use the term ’function block’ to denote a combinational circuit whose input
+and output ports are handshake-channels or links.
+Finally, some more down-to-earth engineering comments may also be rele-
+vant. The synchronous circuit in figure 1.1(b) is “controlled” by clock pulses
+that are in phase with a periodic clock signal, whereas the asynchronous circuit
+in figure 1.1(c) is controlled by locally derived clock pulses that can occur at
+any time; the local handshaking ensures that clock pulses are generated where
+and when needed. This tends to randomize the clock pulses over time, and is
+likely to result in less electromagnetic emission and a smoother supply current
+without the large di/dt spikes that characterize a synchronous circuit.
+
+1.4.
+Outline of Part I
+
+Chapter 2 presents a number of fundamental concepts and circuits that are
+important for the understanding of the following material. Read through it but
+don’t get stuck; you may want to revisit relevant parts later.
+Chapters 3 and 4 address asynchronous design at the data-flow level: chap-
+ter 3 explains the operation of pipelines and rings, introduces a set of hand-
+shake components and explains how to design (larger) computing structures,
+and chapter 4 addresses performance analysis and optimization of such struc-
+tures, both qualitatively and quantitatively.
+Chapter 5 addresses the circuit level implementation of the handshake com-
+ponents introduced in chapter 3, and chapter 6 addresses the design of hazard-
+free sequential (control) circuits. The latter includes a general introduction to
+the topics and in-depth coverage of one specific method: the design of speed-
+independent control circuits from signal transition graph specifications. These
+techniques are illustrated by control circuits used in the implementation of
+some of the handshake components introduced in chapter 3.
+All of the above chapters 2–6 aim to explain the basic techniques and meth-
+ods in some depth. The last two chapters are briefer. Chapter 7 introduces
+more advanced topics related to the implementation of circuits using the 4-
+phase bundled-data protocol, and chapter 8 addresses hardware description
+languages and synthesis tools for asynchronous design. Chapter 8 is by no
+means comprehensive; it focuses on CSP-like languages and syntax-directed
+compilation, but also describes how asynchronous design can be supported by
+a standard language, VHDL.
+
+
+Chapter 2
+
+FUNDAMENTALS
+
+This chapter provides explanations of a number of topics and concepts that
+are of fundamental importance for understanding the following chapters and
+for appreciating the similarities between the different asynchronous design
+styles. The presentation style will be somewhat informal and the aim is to
+provide the reader with intuition and insight.
+
+2.1.
+Handshake protocols
+
+The previous chapter showed one particular handshake protocol known as a
+return-to-zero handshake protocol, figure 1.1(c). In the asynchronous commu-
+nity it is given a more informative name: the 4-phase bundled-data protocol.
+
+2.1.1
+Bundled-data protocols
+
+The term bundled-data refers to a situation where the data signals use nor-
+mal Boolean levels to encode information, and where separate request and
+acknowledge wires are bundled with the data signals, figure 2.1(a). In the 4-
+phase protocol illustrated in figure 2.1(b) the request and acknowledge wires
+also use normal Boolean levels to encode information, and the term 4-phase
+refers to the number of communication actions: (1) the sender issues data and
+sets request high, (2) the receiver absorbs the data and sets acknowledge high,
+(3) the sender responds by taking request low (at which point data is no longer
+guaranteed to be valid) and (4) the receiver acknowledges this by taking ac-
+knowledge low. At this point the sender may initiate the next communication
+cycle.
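As a hedged illustration of the four communication actions, the sketch below checks a sampled (Req, Ack) trace against the 4-phase ordering and counts completed handshake cycles. The function name and the trace representation are assumptions made for this example only.

```python
# Legal successor of each (req, ack) state in a 4-phase handshake.
NEXT_4PHASE = {
    (0, 0): (1, 0),  # (1) sender issues data and sets request high
    (1, 0): (1, 1),  # (2) receiver absorbs data and sets acknowledge high
    (1, 1): (0, 1),  # (3) sender takes request low
    (0, 1): (0, 0),  # (4) receiver takes acknowledge low
}


def count_4phase_transfers(trace):
    """Count completed 4-phase cycles in a list of (req, ack) samples.

    The trace is assumed to start from the idle state (0, 0).
    Raises ValueError if the samples violate the protocol order.
    """
    state, done = (0, 0), 0
    for sample in trace:
        if sample == state:
            continue  # repeated sample, nothing changed
        if sample != NEXT_4PHASE[state]:
            raise ValueError(f"illegal transition {state} -> {sample}")
        state = sample
        if state == (0, 0):
            done += 1  # back to idle: one data token transferred
    return done
```

Each transfer costs four signal transitions, two of which only return the wires to zero.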
+The 4-phase bundled data protocol is familiar to most digital designers, but
+it has a disadvantage in the superfluous return-to-zero transitions that cost un-
+necessary time and energy. The 2-phase bundled-data protocol shown in fig-
+ure 2.1(c) avoids this. The information on the request and acknowledge wires
+is now encoded as signal transitions on the wires and there is no difference
+between a 0 → 1 and a 1 → 0 transition, they both represent a “signal event”.
+Ideally the 2-phase bundled-data protocol should lead to faster circuits than
+the 4-phase bundled-data protocol, but often the implementation of circuits
+responding to events is complex and there is no general answer as to which
+protocol is best.
+
+Figure 2.1.
+(a) A bundled-data channel. (b) A 4-phase bundled-data protocol. (c) A 2-phase
+bundled-data protocol.
+
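To make the transition-signalling idea concrete, here is a small sketch that counts transfers on a 2-phase channel from sampled (Req, Ack) levels. It assumes that both wires start low and that a transfer is pending whenever the two levels differ; the function name is illustrative.

```python
def count_2phase_transfers(trace):
    """Count completed 2-phase handshakes in a list of (req, ack) samples.

    Any transition on req is a request event and any transition on ack is
    an acknowledge event; the two kinds of event must strictly alternate.
    """
    req, ack, done = 0, 0, 0
    for r, a in trace:
        if (r, a) == (req, ack):
            continue  # nothing changed
        if req == ack:
            # Channel idle: only the sender may signal, by toggling req.
            if a != ack:
                raise ValueError("acknowledge event without a request")
        else:
            # Transfer pending: only the receiver may signal, by toggling ack.
            if r != req:
                raise ValueError("second request before acknowledge")
        req, ack = r, a
        if req == ack:
            done += 1  # acknowledge event completes the transfer
    return done
```

A waveform that a 4-phase reading would count as two transfers carries four transfers when read as 2-phase events, since no transitions are spent on returning to zero.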
+At this point some discussion of terminology is appropriate. Instead of the
+term bundled-data that is used throughout this text, some texts use the term
+single-rail. The term ‘bundled-data’ hints at the timing relationship between
+the data signals and the handshake signals, whereas the term ‘single-rail’ hints
+at the use of one wire to carry one bit of data. Also, the term single-rail may
+be considered consistent with the dual-rail data representation discussed in the
+next section. Instead of the term 4-phase handshaking (or signaling) some texts
+use the terms return-to-zero (RTZ) signaling or level signaling, and instead of
+the term 2-phase handshaking (or signaling) some texts use the terms non-
+return-to-zero (NRZ) signaling or transition signaling. Consequently a return-
+to-zero single-rail protocol is the same as a 4-phase bundled-data protocol,
+etc.
+The protocols introduced above all assume that the sender is the active party
+that initiates the data transfer over the channel. This is known as a push chan-
+nel. The opposite, the receiver asking for new data, is also possible and is
+called a pull channel. In this case the directions of the request and acknowl-
+edge signals are reversed, and the validity of data is indicated in the acknowl-
+edge signal going from the sender to the receiver. In abstract circuit diagrams
+showing links/channels as one symbol we will often mark the active end of a
+channel with a dot, as illustrated in figure 2.1(a).
+To complete the picture we mention a number of variations: (1) a channel
+without data that can be used for synchronization, and (2) a channel where
+data is transmitted in both directions and where req and ack indicate validity
+of the data that is exchanged. The latter could be used to interface a read-
+only memory: the address would be bundled with req and the data would be
+bundled with ack. These alternatives are explained later in section 7.1.1. In the
+following sections we will restrict the discussion to push channels.
+All the bundled-data protocols rely on delay matching, such that the order
+of signal events at the sender’s end is preserved at the receiver’s end. On a
+push channel, data is valid before request is set high, expressed formally as
+Valid(Data) ≺ Req. This ordering should also be valid at the receiver’s end,
+and it requires some care when physically implementing such circuits. Possible
+solutions are:
+
+To control the placement and routing of the wires, possibly by routing
+all signals in a channel as a bundle. This is trivial in a tile-based datapath
+structure.
+
+To have a safety margin at the sender’s end.
+
+To insert and/or resize buffers after layout (much as is done in today’s
+synthesis and layout CAD tools).
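The bundling constraint can be written as a simple timing check at the receiver's end. The sketch below is an illustration only: the arrival times, the `margin` parameter, and both function names are assumptions, and real delay matching is done with the layout and timing tools mentioned above.

```python
def bundling_holds(data_arrivals, req_arrival, margin=0.0):
    """True if every data bit arrives at least `margin` before Req rises.

    data_arrivals: arrival time of each data wire at the receiver.
    req_arrival:   arrival time of the rising request at the receiver.
    """
    return max(data_arrivals) + margin <= req_arrival


def required_req_arrival(data_arrivals, margin=0.0):
    """Earliest request arrival time that still satisfies the constraint."""
    return max(data_arrivals) + margin
```

The slowest data wire sets the bound, which is why routing all wires of a channel as a bundle keeps the required margin small.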
+
+An alternative is to use a more sophisticated protocol that is robust to wire
+delays. In the following sections we introduce a number of such protocols that
+are completely insensitive to delays.
+
+2.1.2
+The 4-phase dual-rail protocol
+
+The 4-phase dual-rail protocol encodes the request signal into the data sig-
+nals using two wires per bit of information that has to be communicated, fig-
+ure 2.2. In essence it is a 4-phase protocol using two request wires per bit of
+information d; one wire d.t is used for signaling a logic 1 (or true), and an-
+other wire d.f is used for signaling logic 0 (or false). When observing a 1-bit
+channel one will see a sequence of 4-phase handshakes where the participating
+“request” signal in any handshake cycle can be either d.t or d.f. This protocol
+is very robust; two parties can communicate reliably regardless of delays in the
+wires connecting the two parties – the protocol is delay-insensitive.
+
+[Figure 2.2 encoding table, columns (d.t, d.f): Empty ("E") = (0, 0); Valid "0" = (0, 1); Valid "1" = (1, 0); (1, 1) not used.]
+
+Figure 2.2.
+A delay-insensitive channel using the 4-phase dual-rail protocol.
+
+Viewed together the {x.f, x.t} wire pair is a codeword; {x.f, x.t} = {1, 0}
+and {x.f, x.t} = {0, 1} represent “valid data” (logic 0 and logic 1 respectively)
+and {x.f, x.t} = {0, 0} represents “no data” (or “spacer” or “empty value” or
+“NULL”). The codeword {x.f, x.t} = {1, 1} is not used, and a transition from
+one valid codeword to another valid codeword is not allowed, as illustrated in
+figure 2.2.
+This leads to a more abstract view of 4-phase handshaking: (1) the sender
+issues a valid codeword, (2) the receiver absorbs the codeword and sets ac-
+knowledge high, (3) the sender responds by issuing the empty codeword, and
+(4) the receiver acknowledges this by taking acknowledge low. At this point
+the sender may initiate the next communication cycle. An even more abstract
+view of what is seen on a channel is a data stream of valid codewords separated
+by empty codewords.
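A minimal sketch of the 1-bit dual-rail code follows. It writes each codeword as a `(d.t, d.f)` pair, which is a choice of this example; the helper names are illustrative as well.

```python
EMPTY = (0, 0)  # the "no data" codeword, written as (d.t, d.f)


def encode(bit):
    """Return the valid dual-rail codeword (d.t, d.f) for one bit."""
    return (1, 0) if bit else (0, 1)


def decode(pair):
    """Return 0 or 1 for a valid codeword and None for the empty codeword."""
    if pair == EMPTY:
        return None
    if pair == (1, 1):
        raise ValueError("codeword (1, 1) is not used")
    return pair[0]  # d.t is high exactly when the bit is 1


def stream(bits):
    """Codewords seen on the channel: each valid word is followed by empty."""
    for bit in bits:
        yield encode(bit)
        yield EMPTY
```

For example, `list(stream([1, 0]))` yields a valid "1", the empty word, a valid "0", and the empty word again, so two valid codewords are never adjacent.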
+Let’s now extend this approach to bit-parallel channels. An N-bit data chan-
+nel is formed simply by concatenating N wire pairs, each using the encoding
+described above. A receiver is always able to detect when all bits are valid (to
+which it responds by taking acknowledge high), and when all bits are empty
+(to which it responds by taking acknowledge low). This is intuitive, but there
+is also some mathematical background – the dual-rail code is a particularly
+simple member of the family of delay-insensitive codes [147], and it has some
+nice properties:
+
+any concatenation of dual-rail codewords is itself a dual-rail codeword.
+
+for a given N (the number of bits to be communicated), the set of all
+possible codewords can be disjointly divided into 3 sets:
+
+– the empty codeword where all N wire pairs are {0,0}.
+
+– the intermediate codewords where some wire-pairs assume the
+empty state and some wire pairs assume valid data.
+
+– the 2^N different valid codewords.
+
+Figure 2.3 illustrates the handshaking on an N-bit channel: a receiver will
+see the empty codeword, a sequence of intermediate codewords (as more and
+more bits/wire-pairs become valid) and eventually a valid codeword. After
+receiving and acknowledging the codeword, the receiver will see a sequence of
+intermediate codewords (as more and more bits become empty), and eventually
+the empty codeword to which the receiver responds by driving acknowledge
+low again.
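The three-way partition of N-bit codewords can be checked by brute force. The following sketch uses illustrative names and writes each wire pair as `(d.t, d.f)`; it classifies a word the way a receiver's completion detector would and counts the members of each class.

```python
from itertools import product


def classify(word):
    """Classify a sequence of (d.t, d.f) pairs as empty, intermediate or valid."""
    if any(pair == (1, 1) for pair in word):
        raise ValueError("pair (1, 1) is not a dual-rail codeword")
    n_valid = sum(pair != (0, 0) for pair in word)
    if n_valid == 0:
        return "empty"
    if n_valid == len(word):
        return "valid"
    return "intermediate"


def census(n_bits):
    """Count how many n-bit dual-rail words fall into each class."""
    counts = {"empty": 0, "intermediate": 0, "valid": 0}
    for word in product([(0, 0), (0, 1), (1, 0)], repeat=n_bits):
        counts[classify(word)] += 1
    return counts
```

For N = 3 this gives one empty word, 2^3 = 8 valid words, and 18 intermediate words out of the 3^3 = 27 possible combinations.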
+
+
+[Figure 2.3: waveforms of Data (alternating between all empty and all valid) and Acknowledge over time.]
+
+Figure 2.3.
+Illustration of the handshaking on a 4-phase dual-rail channel.
+
+2.1.3
+The 2-phase dual-rail protocol
+
+The 2-phase dual-rail protocol also uses 2 wires {d.t, d.f} per bit, but the
+information is encoded as transitions (events) as explained previously. On an
+N-bit channel a new codeword is received when exactly one wire in each of
+the N wire pairs has made a transition. There is no empty value; a valid mes-
+sage is acknowledged and followed by another message that is acknowledged.
+Figure 2.4 shows the signal waveforms on a 2-bit channel using the 2-phase
+dual-rail protocol.
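A receiver for this protocol has to compare the current wire levels with the levels at the previous codeword. The sketch below is one hypothetical way to express that comparison; the function name and the `(d.t, d.f)` level pairs are assumptions of the example.

```python
def decode_2phase(prev, curr):
    """Decode one 2-phase dual-rail codeword from wire levels.

    prev, curr: lists of (d.t, d.f) wire levels, one pair per bit, taken at
    the previous codeword and now. Returns the list of bits once exactly one
    wire in every pair has toggled, or None while the codeword is incomplete.
    """
    bits = []
    for (pt, pf), (ct, cf) in zip(prev, curr):
        t_toggled = pt != ct
        f_toggled = pf != cf
        if t_toggled and f_toggled:
            raise ValueError("both wires of a pair toggled")
        if not (t_toggled or f_toggled):
            return None  # this bit has not arrived yet
        bits.append(1 if t_toggled else 0)
    return bits
```

Because only transitions carry information, the absolute levels on the wires say nothing about the data value; the receiver has to remember the previous levels.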
+
+[Figure 2.4: waveforms of d1.t, d1.f, d0.t, d0.f and Ack on a 2-bit channel (d1.t, d1.f), (d0.t, d0.f); the values shown are 00, 01, 00, 11.]
+
+Figure 2.4.
+Illustration of the handshaking on a 2-phase dual-rail channel.
+
+2.1.4
+Other protocols
+
+The previous sections introduced the four most common channel protocols:
+the 4-phase bundled-data push channel, the 2-phase bundled-data push chan-
+nel, the 4-phase dual-rail push channel and the 2-phase dual-rail push channel;
+but there are many other possibilities. The two wires per bit used in the dual-
+rail protocol can be seen as a one-hot encoding of that bit and often it is useful
+to extend to 1-of-n encodings in control logic and higher-radix data encodings.
+
+
+
+If the focus is on communication rather than computation, m-of-n encodings
+may be of relevance. The solution space can be expressed as the cross product
+of a number of options including:
+
+{2-phase, 4-phase} × {bundled-data, dual-rail, 1-of-n, ...} × {push, pull}
+
+The choice of protocol affects the circuit implementation characteristics
+(area, speed, power, robustness, etc.). Before continuing with these imple-
+mentation issues it is necessary to introduce the concept of indication or ac-
+knowledgement, as well as a new component, the Muller C-element.
+
+2.2.
+The Muller C-element and the indication principle
+
+In a synchronous circuit the role of the clock is to define points in time
+where signals are stable and valid. In between the clock-ticks, the signals
+may exhibit hazards and may make multiple transitions as the combinational
+circuits stabilize. This does not matter from a functional point of view. In
+asynchronous (control) circuits the situation is different. The absence of a
+clock means that, in many circumstances, signals are required to be valid all
+the time, that every signal transition has a meaning and, consequently, that
+hazards and races must be avoided.
+Intuitively a circuit is a collection of gates (normally including some feed-
+back loops), and when the output of a gate changes it is seen by other gates
+that in turn may decide to change their outputs accordingly. As an example fig-
+ure 2.5 shows one possible implementation of the CTL circuit in figure 1.1(c).
+The intention here is not to explain its function, just to give an impression of
+the type of circuit we are discussing. It is obvious that hazards on the Ro,
+Ai, and Lt signals would be disastrous if the circuit is used in the pipeline of
+figure 1.1(c).
+
+[Figure 2.5: the CTL circuit, built from AND (&) and OR (+) gates, with signals Ri, Ao, Ro, Ai and Lt.]
+
+Figure 2.5.
+An example of an asynchronous control circuit. Lt is a “local” clock that is
+intended to control a latch.
+
+
+[Figure 2.6: OR gate symbol with inputs a, b and output y, and its truth table (a, b → y): 0 0 → 0; 0 1 → 1; 1 0 → 1; 1 1 → 1.]
+
+Figure 2.6.
+A normal OR gate
+
+[Figure 2.7: C-element symbol and a possible implementation, with inputs a, b and output y.]
+
+Some specifications:
+
+1: if a = b then y := a
+
+2: a = b → y := a
+
+3: y = ab + y(a + b)
+
+4: truth table (a, b → y): 0 0 → 0; 0 1 → no change; 1 0 → no change; 1 1 → 1
+
+Figure 2.7.
+The Muller C-element: symbol, possible implementation, and some alternative
+specifications.
+
+
+The concept of indication or acknowledgement plays an important role in
+the design of such circuits. Consider the simple 2-input OR gate in figure 2.6.
+An observer seeing the output change from 1 to 0 may conclude that both
+inputs are now at 0. However, when seeing the output change from 0 to 1 the
+observer is not able to make conclusions about both inputs. The observer only
+knows that at least one input is 1, but it does not know which. We say that
+the OR gate only indicates or acknowledges when both inputs are 0. Through
+similar arguments it can be seen that an AND gate only indicates when both
+inputs are 1.
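The indication argument can be mechanised for small gates by reading the truth table. This sketch is a static simplification (it ignores timing) and the function name is illustrative: it reports the output values that are produced by exactly one input combination, so that seeing such a value pins down all inputs.

```python
from itertools import product


def indicated_outputs(gate, n_inputs=2):
    """Map each output value to the unique input combination that causes it.

    Output values that can be produced by several input combinations are
    left out, because observing them does not reveal the inputs.
    """
    causes = {}
    for inputs in product((0, 1), repeat=n_inputs):
        causes.setdefault(gate(*inputs), []).append(inputs)
    return {out: combos[0] for out, combos in causes.items() if len(combos) == 1}


def or_gate(a, b):
    return a | b


def and_gate(a, b):
    return a & b
```

Here `indicated_outputs(or_gate)` reports only output 0 with inputs (0, 0), and `indicated_outputs(and_gate)` reports only output 1 with inputs (1, 1), matching the argument above.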
+Signal transitions that are not indicated or acknowledged in other signal
+transitions are the source of hazards and should be avoided. We will address
+this issue in greater detail later in section 2.5.1 and in chapter 6.
+A circuit that is better in this respect is the Muller C-element shown in fig-
+ure 2.7. It is a state-holding element much like an asynchronous set-reset latch.
+When both inputs are 0 the output is set to 0, and when both inputs are 1 the
+output is set to 1. For other input combinations the output does not change.
+Consequently, an observer seeing the output change from 0 to 1 may conclude
+that both inputs are now at 1; and similarly, an observer seeing the output
+change from 1 to 0 may conclude that both inputs are now 0.
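The behaviour just described fits in a one-line next-state function. The sketch below is a purely behavioural model with an illustrative name, not a gate-level implementation.

```python
def c_element(a, b, y):
    """Next output of a Muller C-element with inputs a, b and current output y.

    The output follows the inputs when they agree and holds its value when
    they differ: y' = ab + y(a + b).
    """
    return (a & b) | (y & (a | b))
```

Starting from y = 0, the output can only rise when both inputs are 1, and from y = 1 it can only fall when both inputs are 0, so both output transitions indicate the state of both inputs.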
+
+
+
+Combining this with the observation that all asynchronous circuits rely on
+handshaking that involves cyclic transitions between 0 and 1, it should be clear
+that the Muller C-element is indeed a fundamental component that is exten-
+sively used in asynchronous circuits.
+
+2.3.
+The Muller pipeline
+
+Figure 2.8 shows a circuit that is built from C-elements and inverters. The
+circuit is known as a Muller pipeline or a Muller distributor. Variations and ex-
+tensions of this circuit form the (control) backbone of almost all asynchronous
+circuits. It may not always be obvious at a first glance, but if one strips off the
+cluttering details, the Muller pipeline is always there as the crux of the matter.
+The circuit has a beautiful and symmetric behaviour, and once you understand
+its behaviour, you have a very good basis for understanding most asynchronous
+circuits.
+The Muller pipeline in figure 2.8 is a mechanism that relays handshakes.
+After all of the C-elements have been initialized to 0 the left environment
+may start handshaking. To understand what happens let’s consider the ith C-
+element, C[i]: It will propagate (i.e. input and store) a 1 from its predecessor,
+C[i-1], only if its successor, C[i+1], is 0. In a similar way it will propagate
+(i.e. input and store) a 0 from its predecessor if its successor is 1. It is often
+useful to think of the signals propagating in an asynchronous circuit as a se-
+quence of waves, as illustrated at the bottom of figure 2.8. Viewed this way, the
+role of a C-element stage in the pipeline is to propagate crests and troughs of
+waves in a carefully controlled way that maintains the integrity of each wave.
+On any interface between C-element pipeline stages an observer will see
+correct handshaking, but the timing may differ from the timing of the hand-
+shaking on the left hand environment; once a wave has been injected into the
+Muller pipeline it will propagate with a speed that is determined by actual de-
+lays in the circuit.
+Eventually the first handshake (request) injected by the left hand environ-
+ment will reach the right hand environment. If the right hand environment
+does not respond to the handshake, the pipeline will eventually fill. If this hap-
+pens the pipeline will stop handshaking with the left hand environment – the
+Muller pipeline behaves like a ripple-through FIFO!
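The filling behaviour can be reproduced with a small behavioural simulation. This is a sketch under stated assumptions: gates fire one at a time in a fixed order until nothing changes, the left environment performs 4-phase handshakes, the right environment never acknowledges, and all names are illustrative.

```python
def c_element(a, b, y):
    """Behavioural C-element: follow the inputs when equal, else hold y."""
    return (a & b) | (y & (a | b))


def settle(state, req_in, ack_in):
    """Fire C-elements until the pipeline is stable and return the new state.

    Stage i sees its predecessor's output (req_in for stage 0) and the
    inverted output of its successor (ack_in for the last stage).
    """
    state = list(state)
    changed = True
    while changed:
        changed = False
        for i in range(len(state)):
            pred = req_in if i == 0 else state[i - 1]
            succ = ack_in if i == len(state) - 1 else state[i + 1]
            new = c_element(pred, 1 - succ, state[i])
            if new != state[i]:
                state[i], changed = new, True
    return state


def fill(n_stages, max_handshakes=10):
    """Handshake from the left while the right environment stays silent."""
    state, accepted = [0] * n_stages, 0
    for _ in range(max_handshakes):
        state = settle(state, req_in=1, ack_in=0)
        if state[0] != 1:
            break  # no acknowledge: the pipeline is full
        state = settle(state, req_in=0, ack_in=0)
        accepted += 1
    return state, accepted
```

With four stages the model accepts two handshakes and then stops acknowledging, leaving the C-elements in the alternating state (0, 1, 0, 1).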
+In addition to this elegant behaviour, the pipeline has a number of beautiful
+symmetries. Firstly, it does not matter if you use 2-phase or 4-phase handshak-
+ing. It is the same circuit. The difference is in how you interpret the signals
+and use the circuit. Secondly, the circuit operates equally well from right to
+left. You may reverse the definition of signal polarities, reverse the role of the
+request and acknowledge signals, and operate the circuit from right to left. It
+is analogous to electrons and holes in a semiconductor; when current flows in
+one direction it may be carried by electrons flowing in one direction or by holes
+flowing in the opposite direction.
+Finally, the circuit has the interesting property that it works correctly regard-
+less of delays in gates and wires – the Muller-pipeline is delay-insensitive.
+
+[Figure 2.8: a chain of C-elements between a Left and a Right environment, with Req and Ack signals between stages; the stage rule is: if C[i-1] ≠ C[i+1] then C[i] := C[i-1].]
+
+Figure 2.8.
+The Muller pipeline or Muller distributor.
+
+2.4.
+Circuit implementation styles
+
+As mentioned previously, the choice of handshake protocol affects the cir-
+cuit implementation (area, speed, power, robustness, etc.). Most practical cir-
+cuits use one of the following protocols introduced in section 2.1:
+
+4-phase bundled-data – which most closely resembles the design of syn-
+chronous circuits and which normally leads to the most efficient circuits,
+due to the extensive use of timing assumptions.
+
+2-phase bundled-data – introduced under the name Micropipelines by Ivan
+Sutherland in his 1988 Turing Award lecture.
+
+4-phase dual-rail – the classic approach rooted in David Muller’s pioneering
+work in the 1950s.
+
+Common to all protocols is the fact that the corresponding circuit imple-
+mentations all use variations of the Muller pipeline for controlling the storage
+elements. Below we explain the basics of pipelines built using simple transpar-
+ent latches as storage elements. More optimized and elaborate circuit imple-
+mentations and more complex circuit structures are the topics of later chapters.
+
+[Figure 2.9: (a) a FIFO of latches whose EN inputs are driven by C-elements, with Req, Ack and Data signals; (b) the same pipeline with combinational blocks between the latches.]
+
+Figure 2.9.
+A simple 4-phase bundled-data pipeline.
+
+2.4.1
+4-phase bundled-data
+
+A 4-phase bundled-data pipeline is particularly simple. A Muller pipeline
+is used to generate local clock pulses. The clock pulse generated in one stage
+overlaps with the pulses generated in the neighbouring stages in a carefully
+controlled interlocked manner. Figure 2.9(a) shows a FIFO, i.e. a pipeline
+without data processing, and figure 2.9(b) shows how combinational circuits
+(also called function blocks) can be added between the latches. To maintain
+correct behaviour, matching delays have to be inserted in the request signal
+paths.
+You may view this circuit as a traditional “synchronous” data-path, con-
+sisting of latches and combinational circuits that are clocked by a distributed
+gated-clock driver, or you may view it as an asynchronous data-flow structure
+composed of two types of handshake components: latches and function blocks,
+as indicated by the dashed boxes.
+The pipeline implementation shown in figure 2.9 is particularly simple but
+it has some drawbacks: when it fills, the state of the C-elements is (0, 1, 0,
+1, etc.), and as a consequence only every other latch is storing data. This
+is no worse than in a synchronous circuit using master-slave flip-flops, but
+it is possible to design asynchronous pipelines and FIFOs that are better in
+this respect. Another disadvantage is speed. The throughput of a pipeline
+or FIFO depends on the time it takes to complete a handshake cycle and for
+the above implementation this involves communication with both neighbours.
+Chapter 7 addresses alternative implementations that are both faster and have
+better occupancy when full.
+
+[Figure 2.10: a Muller pipeline of C-elements driving the C and P inputs of capture-pass latches, with Req, Ack and Data signals.]
+
+Figure 2.10.
+A simple 2-phase bundled-data pipeline.
+
+2.4.2
+2-phase bundled data (Micropipelines)
+
+A 2-phase bundled-data pipeline also uses a Muller pipeline as the backbone
+control circuit, but the control signals are interpreted as events or transitions,
+figure 2.10. For this reason special capture-pass latches are needed: events
+on the C and P inputs alternate, causing the latch to alternate between cap-
+ture mode and pass mode. This calls for a special latch design as shown in
+figure 2.11 and explained below. The switch symbol in figure 2.11 is a multi-
+plexer, and the event controlled latch can be understood as two ordinary level
+sensitive latches (operating in an alternating fashion) followed by a multiplexer
+and a buffer.
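A behavioural model helps to see how alternating events on C and P switch the latch between its two modes. The class below is a hypothetical sketch, not the circuit of figure 2.11: it assumes the latch is transparent whenever the levels of C and P are equal, which is consistent with a latch that starts in pass mode with both signals low.

```python
class CapturePassLatch:
    """Event-controlled latch: C events capture, P events pass."""

    def __init__(self):
        self.c = 0        # level of the capture control wire
        self.p = 0        # level of the pass control wire
        self.held = None  # value stored while in capture mode

    def transparent(self):
        # Pass mode whenever C and P have seen the same number of events.
        return self.c == self.p

    def event_c(self, data_in):
        """A transition on C: store the current input and stop following it."""
        if not self.transparent():
            raise RuntimeError("events on C and P must alternate")
        self.held = data_in
        self.c ^= 1

    def event_p(self):
        """A transition on P: become transparent again."""
        if self.transparent():
            raise RuntimeError("events on C and P must alternate")
        self.p ^= 1

    def output(self, data_in):
        return data_in if self.transparent() else self.held
```

Two successive capture events use opposite edges of C, which is why the latch cannot be an ordinary level-sensitive latch driven directly by C.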
+Figure 2.10 shows a pipeline without data processing. Combinational cir-
+cuits with matching delay elements can be inserted between latches in a similar
+way to the 4-phase bundled-data approach in figure 2.9.
+The 2-phase bundled-data approach was pioneered by Ivan Sutherland in the
+late 1980s and an excellent introduction is given in his 1988 Turing Award Lec-
+ture [128]. The title Micropipelines is often used synonymously with the use
+of the 2-phase bundled-data protocol, but it also refers to the use of a particular
+set of components that are based on event signalling. In addition to the latch in
+figure 2.11 these are: AND, OR, Select, Toggle, Call and Arbiter. The above
+figures 2.10 and 2.11 are similar to figures 15 and 12 in [128], but they emphasise
+more strongly the fact that the control structure is a Muller pipeline. Some alter-
+
+
+20
+Part I: Asynchronous circuit design – A tutorial
+
+Figure 2.11. Implementation and operation of a capture-pass event controlled latch. At time
+t0 the latch is transparent (i.e. in pass mode) and signals C and P are both low. An event on the
+C input turns the latch into capture mode, etc.
+(Operation shown in the diagram: t0: C=0, P=0, pass; t1: C=1, P=0, capture; t2: C=1, P=1, pass; t3: C=0, P=1, capture.)
+
+native latch designs that are (significantly) smaller and (significantly) slower
+are also presented in [128].
+At the conceptual level the 2-phase bundled-data approach is elegant and ef-
+ficient; compared to the 4-phase bundled-data approach it avoids the power and
+performance loss that is incurred by the return-to-zero part of the handshaking.
+However, as illustrated by the latch design, the implementation of components
+that respond to signal transitions is often more complex than the implemen-
+tation of components that respond to normal level signals. In addition to the
+storage elements explained above, conditional control logic that responds to
+signal transitions tends to be complex as well. This has been experienced by
+this author [123], by the University of Manchester [42, 45] and by many others.
+Having said this, the 2-phase bundled-data approach may be the preferred
+solution in systems with unconditional data-flows and very high speed require-
+ments. But as just mentioned, the higher speed comes at a price: larger silicon
+area and higher power consumption. In this respect asynchronous design is no
+different from synchronous design.
+
+2.4.3 4-phase dual-rail
+
+A 4-phase dual-rail pipeline is also based on the Muller pipeline, but in a
+more elaborate way that has to do with the combined encoding of data and
+request. Figure 2.12 shows the implementation of a 1-bit wide and three stage
+deep pipeline without data processing. It can be understood as two Muller
+pipelines connected in parallel, using a common acknowledge signal per stage
+to synchronize operation. The pair of C-elements in a pipeline stage can store
+the empty codeword (d.t, d.f) = (0, 0), causing the acknowledge signal out
+of that stage to be 0, or it can store one of the two valid codewords (0, 1)
+and (1, 0), causing the acknowledge signal out of that stage to be logic 1.
+At this point, and referring back to section 2.2, the reader should notice that
+because the codeword (1, 1) is illegal and does not occur, the acknowledge
+signal generated by the OR gate safely indicates the state of the pipeline stage
+as being “valid” or “empty.”
+
+Figure 2.12. A simple 3-stage 1-bit wide 4-phase dual-rail pipeline. (Diagram labels: C-elements, OR gates, d.t, d.f and Ack.)
+
+Figure 2.13. An N-bit latch with completion detection. (Diagram labels: inputs di[0..2].t and di[0..2].f, outputs do[0..2].t and do[0..2].f, ack_i, ack_o, C-elements, OR gates, and an “Alternative completion detector” with “All valid” and “All empty” signals.)
+
+An N-bit wide pipeline can be implemented by using a number of 1-bit
+pipelines in parallel. This does not guarantee to a receiver that all bits in a
+word arrive at the same time, but often the necessary synchronization is done
+in the function blocks. In [124, 125] we describe a design of this style using
+the DIMS combinational circuits explained below.
+
+Figure 2.14. A 4-phase dual-rail AND gate: symbol, truth table, and implementation.
+Truth table (E = empty, F = false, T = true):
+  a  b  |  y.f  y.t
+  E  E  |   0    0
+  F  F  |   1    0
+  F  T  |   1    0
+  T  F  |   1    0
+  T  T  |   0    1
+  (all other input combinations: NO CHANGE)
+(Implementation labels: inputs a.t, a.f, b.t, b.f; four C-elements for the minterms 00, 01, 10 and 11; an OR gate driving y.f; output y.t.)
+
+If bit-parallel synchronization is needed, the individual acknowledge signals
+can be combined into one global acknowledge using a C-element. Figure 2.13
+shows an N-bit wide latch. The OR gates and the C-element in the dashed box
+form a completion detector that indicates whether the N-bit dual-rail codeword
+stored in the latch is empty or valid. The figure also shows an implementation
+of a completion detector using only a 2-input C-element.
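+
+The completion detector can be illustrated with a small behavioural sketch. The function names below are illustrative assumptions, codewords are written as (d.t, d.f) pairs, and the C-element is modelled by its usual rule: the output follows the inputs when they all agree and otherwise keeps its previous value.
+
```python
EMPTY = (0, 0)  # the empty codeword, written as (d.t, d.f)


def encode(bit):
    """Valid dual-rail codeword for one bit: (1, 0) is true, (0, 1) is false."""
    return (1, 0) if bit else (0, 1)


def c_element(inputs, previous):
    """Muller C-element: set when all inputs are 1, reset when all are 0."""
    if all(inputs):
        return 1
    if not any(inputs):
        return 0
    return previous  # inputs disagree: hold the previous output


def completion(word, previous):
    """One OR gate per bit, combined by a C-element into a single signal."""
    for d_t, d_f in word:
        if d_t and d_f:
            raise ValueError("codeword (1, 1) is illegal")
    per_bit_valid = [d_t | d_f for d_t, d_f in word]
    return c_element(per_bit_valid, previous)


done = 0
history = []
for word in (
    [EMPTY, EMPTY, EMPTY],               # all empty
    [encode(1), EMPTY, EMPTY],           # partially valid
    [encode(1), encode(0), encode(1)],   # all valid
    [EMPTY, encode(0), encode(1)],       # partially empty
    [EMPTY, EMPTY, EMPTY],               # all empty again
):
    done = completion(word, done)
    history.append(done)
```
+
+In this sketch the detector output only rises once every bit is valid and only falls once every bit is empty, so bits that arrive at different times do not cause a premature acknowledge.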
+Let us now look at how combinational circuits for 4-phase dual-rail cir-
+cuits are implemented. As mentioned in chapter 1 combinational circuits must
+be transparent to the handshaking between latches. Therefore, all outputs of
+a combinational circuit must not become valid until after all inputs have be-
+come valid. Otherwise the receiving latch may prematurely set acknowledge
+high (before all signals from the sending latch have become valid). In a sim-
+ilar way all outputs of a combinational circuit must not become empty until
+after all inputs have become empty. Otherwise the receiving latch may pre-
+maturely set acknowledge low (before all signals from the sending latch have
+become empty). Consequently a combinational circuit for the 4-phase dual-
+rail approach involves state holding elements and it exhibits a hysteresis-like
+behaviour in the empty-to-valid and valid-to-empty transitions.
+A particularly simple approach, using only C-elements and OR gates, is
+illustrated in figure 2.14, which shows the implementation of a dual-rail AND
+gate. The circuit can be understood as a direct mapping from sum-of-minterms
+expressions for each of the two output wires into hardware. The circuit waits
+for all its inputs to become valid. When this happens exactly one of the four
+C-elements goes high. This again causes the relevant output wire to go high
+corresponding to the gate producing the desired valid output. When all inputs
+become empty the C-elements are all set low, and the output of the dual-rail
+AND gate becomes empty again. Note that the C-elements provide both the
+necessary ’and’ operator and the hysteresis in the empty-to-valid and valid-to-
+empty transitions that is required for transparent handshaking. Note also that
+(again) the OR gate is never exposed to more than one input signal being high.
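+
+The behaviour described above can be sketched as follows. This is a behavioural model under stated assumptions, not a transistor-level description: the class name is hypothetical, dual-rail values are written as (t, f) pairs, and one C-element is used per input minterm.
+
```python
def c_element(a, b, previous):
    """2-input Muller C-element: follows the inputs when they agree, else holds."""
    return a if a == b else previous


class DualRailAnd:
    """Dual-rail AND gate built from four C-elements and one OR gate."""

    def __init__(self):
        # One state-holding C-element per minterm (value of a, value of b).
        self.minterm = {(a, b): 0 for a in (0, 1) for b in (0, 1)}

    def update(self, a, b):
        a_t, a_f = a
        b_t, b_f = b
        rail_a = {1: a_t, 0: a_f}  # wire that is high when a has this value
        rail_b = {1: b_t, 0: b_f}
        for key in list(self.minterm):
            va, vb = key
            self.minterm[key] = c_element(rail_a[va], rail_b[vb], self.minterm[key])
        y_t = self.minterm[(1, 1)]
        y_f = self.minterm[(0, 0)] | self.minterm[(0, 1)] | self.minterm[(1, 0)]
        return (y_t, y_f)


E, F, T = (0, 0), (0, 1), (1, 0)  # empty, false, true as (t, f) pairs
gate = DualRailAnd()
results = [
    gate.update(E, E),  # all inputs empty
    gate.update(T, E),  # only a is valid: the output must stay empty
    gate.update(T, F),  # both valid: the output becomes valid
    gate.update(E, F),  # only a is empty: the output must stay valid
    gate.update(E, E),  # all inputs empty: the output becomes empty
    gate.update(T, T),
]
```
+
+The second and fourth steps show the hysteresis: the output changes only after all inputs have become valid, or all inputs have become empty.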
+Other dual-rail gates such as OR and EXOR can be implemented in a sim-
+ilar fashion, and a dual-rail inverter involves just a swap of the true and false
+wires. The transistor count in these basic dual-rail gates is obviously rather
+high, and in chapter 5 we explore more efficient circuit implementations. Here
+our interest is in the fundamental principles.
+Given a set of basic dual-rail gates one can construct dual-rail combinational
+circuits for arbitrary Boolean expressions using normal combinational circuit
+synthesis techniques. The transparency to handshaking that is a property of
+the basic gates is preserved when composing gates into larger combinational
+circuits.
+The fundamental ideas explained above all go back to David Muller’s work
+in the late 1950s and early 1960s [93, 92]. While [93] develops the fundamen-
+tal theory for the design of speed-independent circuits, [92] is a more practical
+introduction including a design example: a bit-serial multiplier using latches
+and gates as explained above.
+
+2.5. Theory
+
+Asynchronous circuits can be classified, as we will see below, as being self-
+timed, speed-independent or delay-insensitive depending on the delay assump-
+tions that are made. In this section we introduce some important theoretical
+concepts that relate to this classification. The goal is to communicate the basic
+ideas and provide some intuition on the problems and solutions, and a reader
+who wishes to dig deeper into the theory is referred to the literature. Some
+recent starting points are [95, 54, 69, 35, 18].
+
+2.5.1 The basics of speed-independence
+
+We will start by reviewing the basics of David Muller’s model of a cir-
+cuit and the conditions for it being speed-independent [93]. A circuit is mod-
+eled along with its (dummy) environment as a closed network of gates, closed
+meaning that all inputs are connected to outputs and vice versa. Gates are
+modeled as Boolean operators with arbitrary non-zero delays, and wires are
+assumed to be ideal. In this context the circuit can be described as a set of
+concurrent Boolean functions, one for each gate output. The state of the circuit
+is the set of all gate outputs. Figure 2.15 illustrates this for a stage of a Muller
+pipeline with an inverter and a buffer mimicking the handshaking behaviour of
+the left and right hand environments.
+
+Figure 2.15. Muller model of a Muller pipeline stage with “dummy” gates modeling the environment behaviour.
+Gate equations (a prime denotes the next output; signals shown are ri, ai, ci, yi, ri+1 and ai+1):
+  ri'   = not(ci)
+  ci'   = ri yi + (ri + yi) ci
+  yi'   = not(ai+1)
+  ai+1' = ci
+
+A gate whose output is consistent with its inputs is said to be stable; its
+“next output” is the same as its “current output”, zi' = zi. A gate whose inputs
+have changed in such a way that an output change is called for is said to be
+excited; its “next output” is different from its “current output”, i.e. zi' ≠ zi.
+After an arbitrary delay an excited gate may spontaneously change its output
+and become stable. We say that the gate fires, and as excited gates fire and
+become stable with new output values, other gates in turn become excited, etc.
+To illustrate this, suppose that the circuit in figure 2.15 is in state
+(ri, yi, ci, ai+1) = (0, 1, 0, 0). In this state (the inverter) ri is excited corresponding to the
+left environment being about to take request high. After the firing of ri↑ the
+circuit reaches state (ri, yi, ci, ai+1) = (1, 1, 0, 0) and ci now becomes excited.
+For synthesis and analysis purposes one can construct the complete state graph
+representing all possible sequences of gate firings. This is addressed in detail
+in chapter 6. Here we will restrict the discussion to an explanation of the
+fundamental ideas.
+In the general case it is possible that several gates are excited at the same
+time (i.e. in a given state). If one of these gates, say zi, fires the interesting
+thing is what happens to the other excited gates which may have zi as one
+of their inputs: they may remain excited, or they may find themselves with a
+different set of input signals that no longer calls for an output change. A circuit
+is speed-independent if the latter never happens. The practical implication of
+an excited gate becoming stable without firing is a potential hazard. Since
+delays are unknown the gate may or may not have changed its output, or it
+may be in the middle of doing so when the ‘counter-order’ comes calling for
+the gate output to remain unchanged.
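+
+The notions of excited gates, firing and speed-independence can be made concrete with a small state-graph exploration. The sketch below assumes the usual next-state function of a C-element together with an inverter and a buffer as dummy environment, and uses the state order (ri, yi, ci, ai+1); the function names are illustrative and this is not the synthesis procedure of chapter 6.
+
```python
NAMES = ("r", "y", "c", "a")  # state order: (ri, yi, ci, ai+1)


def next_values(state):
    """Next output of every gate, computed from the current state."""
    r, y, c, a = state
    return (
        1 - c,                    # r' = not c  (left environment, inverter)
        1 - a,                    # y' = not a  (inverter on the acknowledge input)
        (r & y) | ((r | y) & c),  # c' = r y + (r + y) c  (C-element)
        c,                        # a' = c  (right environment, buffer)
    )


def excited(state):
    """Indices of gates whose next output differs from their current output."""
    nxt = next_values(state)
    return {i for i in range(len(state)) if nxt[i] != state[i]}


def fire(state, i):
    """Let gate i take its next output; all other gate outputs are unchanged."""
    new_state = list(state)
    new_state[i] = next_values(state)[i]
    return tuple(new_state)


def explore(start):
    """Enumerate reachable states and record excited gates that get disabled."""
    seen, todo, violations = {start}, [start], []
    while todo:
        state = todo.pop()
        before = excited(state)
        for i in before:
            successor = fire(state, i)
            disabled = (before - {i}) - excited(successor)
            if disabled:
                violations.append((state, NAMES[i], {NAMES[j] for j in disabled}))
            if successor not in seen:
                seen.add(successor)
                todo.append(successor)
    return seen, violations


states, violations = explore((0, 1, 0, 0))
first_excited = {NAMES[i] for i in excited((0, 1, 0, 0))}
```
+
+An empty `violations` list means that no excited gate ever becomes stable without firing in any reachable state, which is the speed-independence condition described above.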
+Since the model involves a Boolean state variable for each gate (and for
+each wire segment in the case of delay-insensitive circuits) the state space be-
+comes very large even for very simple circuits. In chapter 6 we introduce signal
+transition graphs as a more abstract representation from which circuits can be
+synthesized.
+Now that we have a model for describing and reasoning about the behaviour
+of gate-level circuits let’s address the classification of asynchronous circuits.
+
+
+
+Figure 2.16. A circuit fragment with gate and wire delays. The output of gate A forks to inputs
+of gates B and C. (Diagram labels: gates A, B and C with gate delays dA, dB and dC, and wire delays d1, d2 and d3.)
+
+2.5.2 Classification of asynchronous circuits
+
+At the gate level, asynchronous circuits can be classified as being self-timed,
+speed-independent or delay-insensitive depending on the delay assumptions
+that are made. Figure 2.16 serves to illustrate the following discussion. The
+figure shows three gates: A, B, and C, where the output signal from gate A is
+connected to inputs on gates B and C.
+A speed-independent (SI) circuit as introduced above is a circuit that oper-
+ates “correctly” assuming positive, bounded but unknown delays in gates and
+ideal zero-delay wires. Referring to figure 2.16 this means arbitrary dA, dB,
+and dC, but d1 = d2 = d3 = 0. Assuming ideal zero-delay wires is not very
+realistic in today’s semiconductor processes. By allowing arbitrary d1 and d2
+and by requiring d2 = d3 the wire delays can be lumped into the gates, and
+from a theoretical point of view the circuit is still speed-independent.
+A circuit that operates “correctly” with positive, bounded but unknown de-
+lays in wires as well as in gates is delay-insensitive (DI). Referring to fig-
+ure 2.16 this means arbitrary dA, dB, dC, d1, d2, and d3. Such circuits are obvi-
+ously extremely robust. One way to show that a circuit is delay-insensitive is to
+use a Muller model of the circuit where wire segments (after forks) are modeled
+as buffer components. If this equivalent circuit model is speed-independent,
+then the circuit is delay-insensitive.
+Unfortunately the class of delay-insensitive circuits is rather small. Only
+circuits composed of C-elements and inverters can be delay-insensitive [82],
+and the Muller pipeline in figures 2.5, 2.8, and 2.15 is one important example.
+Circuits that are delay-insensitive with the exception of some carefully identified
+wire forks where d2 = d3 are called quasi-delay-insensitive (QDI). Such
+wire forks, where signal transitions occur at the same time at all end-points, are
+called isochronic (and discussed in more detail in the next section). Typically
+these isochronic forks are found in gate-level implementations of basic build-
+ing blocks where the designer can control the wire delays. At the higher levels
+of abstraction the composition of building blocks would typically be delay-
+insensitive. After these comments it is obvious that a distinction between DI,
+QDI and SI makes good sense.
+Because the class of delay-insensitive circuits is so small, basically exclud-
+ing all circuits that compute, most circuits that are referred to in the literature
+as delay-insensitive are only quasi-delay-insensitive.
+Finally a word about self-timed circuits: speed-independence and delay-
+insensitivity as introduced above are (mathematically) well defined properties
+under the unbounded gate and wire delay model. Circuits whose correct opera-
+tion relies on more elaborate and/or engineering timing assumptions are simply
+called self-timed.
+
+2.5.3 Isochronic forks
+
+From the above it is clear that the distinction between speed-independent
+circuits and delay-insensitive circuits relates to wire forks and, more specifi-
+cally, to whether the delays to all end-points of a forking wire are identical or
+not. If the delays are identical, the wire-fork is called isochronic.
+The need for isochronic forks is related to the concept of indication intro-
+duced in section 2.2. Consider a situation in figure 2.16 where gate A has
+changed its output. Eventually this change is observed on the inputs of gates
+B and C, and after some time gates B and C may respond to the new input by
+producing a new output. If this happens we say that the output change on gate
+A is indicated by output changes on gates B and C. If, on the other hand, only
+gate B responds to the new input, it is not possible to establish whether gate C
+has seen the input change as well. In this case it is necessary to strengthen the
+assumptions to d2 = d3 (i.e. that the fork is isochronic) and conclude that since
+the input signal change was indicated by the output of B, gate C has also seen
+the change.
+
+2.5.4 Relation to circuits
+
+In the 2-phase and 4-phase bundled-data approaches the control circuits are
+normally speed-independent (or in some cases even delay-insensitive), but the
+data-path circuits with their matched delays are self-timed. Circuits designed
+following the 4-phase dual-rail approach are generally quasi-delay-insensitive.
+In the circuits shown in figures 2.12 and 2.14 the forks that connect to the inputs
+of several C-elements must be isochronic, whereas the forks that connect to the
+inputs of several OR gates are delay-insensitive.
+The different circuit classes, DI, QDI, SI and self-timed, are not mutually
+exclusive ways to build complete systems, but useful abstractions that can be
+used at different levels of design. In most practical designs they are mixed.
+For example, in the Amulet processors [44, 43, 48] SI design is used for lo-
+cal asynchronous controllers, bundled-data for local data processing, and DI
+is used for high-level composition. Another example is the hearing-aid filter
+bank design presented in [103]. It uses the DI dual-rail 4-phase protocol inside
+RAM-modules and arithmetic circuits to provide robust completion indication,
+and 4-phase bundled-data with SI control at the top levels of design, i.e. some-
+what different from the Amulet designs. This emphasizes that the choice of
+handshake protocol and circuit implementation style is among the factors to
+consider when optimizing an asynchronous digital system.
+It is important to stress that speed-independence and delay-insensitivity are
+mathematical properties that can be verified for a given implementation. If an
+abstract component – such as a C-element or a complex And-Or-Invert gate
+– is replaced by its implementation using simple gates and possibly some
+wire-forks, then the circuit may no longer be speed-independent or delay-
+insensitive.
+As an illustrative example we mention that the simple Muller
+pipeline stage in figures 2.8 and 2.15 is no longer delay-insensitive if the C-
+element is replaced by the gate-level implementation shown in figure 2.5 that
+uses simple AND and OR gates. Furthermore, even simple gates are abstrac-
+tions; in CMOS the primitives are N and P transistors, and even the simplest
+gates include forks.
+In chapter 6 we will explore the design of SI control circuits in great detail
+(because theory and synthesis tools are well developed). As SI circuits ignore
+wire delays completely some care is needed when physically implementing
+these circuits. In general one might think that the zero wire-delay assumption
+is trivially satisfied in small circuits involving 10-20 gates, but this need not be
+the case: a normal place and route CAD tool might spread the gates of a small
+controller all over the chip. Even if the gates are placed next to each other
+they may have different logic thresholds on their inputs which in combination
+with slowly rising or falling signals can cause (and have caused!) circuits
+to malfunction. For static CMOS and for circuits operating with low supply
+voltages (e.g. VDD ≈ VtN + |VtP|) this is less of a problem, but for dynamic
+circuits using a larger VDD (e.g. 3.3 V or 5.0 V) the logic thresholds can be
+very different. This often overlooked problem is addressed in detail in [134].
+
+2.6. Test
+
+When it comes to the commercial exploitation of asynchronous circuits the
+problem of test comes to the fore. Test is a major topic in its own right, and
+it is beyond the scope of this tutorial to do anything more than mention a few
+issues and challenges. Although the following text is brief it assumes some
+knowledge of testing. The material does not constitute a foundation for the
+following chapters and it may be skipped.
+The previous discussion about Muller circuits (excited gates and the firing
+of gates), the principle of indication, and the discussion of isochronic forks
+tie in nicely with a discussion of testing for stuck-at faults. In the stuck-at
+fault model defects are modeled at the gate level as (individual) inputs and
+outputs being stuck-at-1 or stuck-at-0. The principle of indication says that all
+input signal transitions on a gate must be indicated by an output signal tran-
+sition on the gate. Furthermore, asynchronous circuits make extensive use of
+handshaking and this causes signals to exhibit cyclic transitions between 0 and
+1. In this scenario, the presence of a stuck-at fault is likely to cause the cir-
+cuit to halt; if one component stops handshaking the stall tends to “propagate”
+to neighbouring components, and eventually the entire circuit halts. Conse-
+quently, the development of a set of test patterns that exhaustively tests for all
+stuck-at faults is simply a matter of developing a set of test patterns that toggle
+all nodes, and this is generally a comparatively simple task.
+Since isochronic forks are forks where a signal transition in one or more
+branches is not indicated in the gates that take these signals as inputs, it follows
+that isochronic forks imply untestable stuck-at faults.
+Testing asynchronous circuits incurs additional problems. As we will see
+in the following chapters, asynchronous circuits tend to implement registers
+using latches rather than flip-flops. In combination with the absence of a global
+clock, this makes it less straightforward to connect registers into scan-paths.
+Another consequence of the distributed self-timed control (i.e. the lack of a
+global clock) is that it is less straightforward to single-step the circuit through
+a sequence of well-defined states. This makes it less straightforward to steer
+the circuit into particular quiescent states, which is necessary for IDDQ testing,
+the technique that is used to test for shorts and opens, which are faults that
+are typical in today’s CMOS processes.
+The extensive use of state-holding elements (such as the Muller C-element),
+together with the self-timed behaviour, makes it difficult to test the feed-back
+circuitry that implements the state holding behaviour. Delay-fault testing rep-
+resents yet another challenge.
+The above discussion may leave the impression that the problem of testing
+asynchronous circuits is largely unsolved. This is not correct. The truth is
+rather that the techniques for testing synchronous circuits are not directly ap-
+plicable. The situation is quite similar to the design of asynchronous circuits
+that we will address in detail in the following chapters. Here a mix of new
+and well-known techniques is also needed. A good starting point for reading
+about the testing of asynchronous circuits is [120]. Finally, we mention that
+testing is also touched upon in chapters 13 and 15.
+
+2.7. Summary
+
+This chapter introduced a number of fundamental concepts. We will now
+return to the main track of designing circuits. The reader will probably want to
+revisit some of the material in this chapter again while reading the following
+chapters.
+
+
+Chapter 3
+
+STATIC DATA-FLOW STRUCTURES
+
+In this chapter we will develop a high-level view of asynchronous design
+that is equivalent to RTL (register transfer level) in synchronous design. At
+this level the circuits may be viewed as static data-flow structures. The aim is
+to focus on the behaviour of the circuits, and to abstract away the details of the
+handshake signaling which can be considered an orthogonal implementation
+issue.
+
+3.1. Introduction
+
+The various handshake protocols and the associated circuit implementa-
+tion styles presented in the previous chapters are rather different. However,
+when looking at the circuits at a more abstract level – the data-flow handshake-
+channel level introduced in chapter 1 – these differences diminish, and it makes
+good sense to view the choice of handshake protocol and circuit implementa-
+tion style as low level implementation decisions that can be made largely in-
+dependently from the more abstract design decisions that establish the overall
+structure and operation of the circuit.
+Throughout this chapter we will assume a 4-phase protocol since this is
+most common. From a data-flow point of view this means that we will be
+dealing with data streams composed of alternating valid and empty values – in
+a two-phase protocol we would see only a sequence of valid values, but apart
+from that everything else would be the same. Furthermore we will be dealing
+with simple latches as storage elements. The latches are controlled according
+to the simple rule stated in chapter 1:
+
+A latch may input and store a new token (valid or empty) from its pre-
+decessor if its successor latch has input and stored the token that it was
+previously holding.
+
+Latches are the only components that initiate and take an active part in hand-
+shaking; all other components are “transparent” to the handshaking. To ease
+the distinction between latches and combinational circuits and to emphasize
+the token flow in circuit diagrams, we will use a box symbol with double verti-
+cal lines to represent latches throughout the rest of this tutorial (see figure 3.1).
+
+
+Figure 3.1. A possible state of a five stage pipeline. (Latch contents: L0 = E (token), L1 = V (bubble), L2 = V (token), L3 = E (bubble), L4 = E (token).)
+
+Figure 3.2. Ring: (a) a possible state; and (b) a sequence of data transfers. (Diagram labels: two tokens and one bubble; snapshots of the V/E latch contents at times t0, t1, t2 and t3.)
+
+3.2. Pipelines and rings
+
+Figure 3.1 shows a snapshot of a pipeline composed of five latches. The
+“box arrows” represent channels or links consisting of request, acknowledge
+and data signals (as explained on page 5). The valid value in L1 has just
+been copied into L2 and the empty value in L3 has just been copied into
+L4. This means that L1 and L3 are now holding old duplicates of the val-
+ues now stored in L2 and L4. Such old duplicates are called “bubbles”, and the
+newest/rightmost valid and empty values are called “tokens”. To distinguish
+tokens from bubbles, tokens are represented with a circle around the value. In
+this way a latch may hold a valid token, an empty token or a bubble. Bubbles
+can be viewed as catalysts: a bubble allows a token to move forward, and in
+supporting this the bubble moves backwards one step.
+Any circuit should have one or more bubbles, otherwise it will be in a dead-
+lock state. This is a matter of initializing the circuit properly, and we will
+elaborate on this shortly. Furthermore, as we will see later, the number of
+bubbles also has a significant impact on performance.
+In a pipeline with at least three latches, it is possible to connect the output
+of the last stage to the input of the first, forming a ring in which data tokens
+can circulate autonomously. Assuming the ring is initialized as shown in fig-
+ure 3.2(a) at time t0 with a valid token, an empty token and a bubble, the first
+steps of the circulation process are shown in figure 3.2(b), at times t1, t2 and
+
+
+Chapter 3: Static data-flow structures
+31
+
+t3. Rings are the backbone structures of circuits that perform iterative compu-
+tations. The cycle time of the ring in figure 3.2 is 6 “steps” (the state at t6 will
+be identical to the state at t0). Both the valid token and the empty token have
+to make one round trip. A round trip involves 3 “steps” and as there is only
+one bubble to support this the cycle time is 6 “steps”. It is interesting to note
+that a 4-stage ring initialized to hold a valid token, an empty token and two
+bubbles can iterate in 4 “steps”. It is also interesting to note that the addition
+of one more latch does not re-time the circuit or alter its function (as would be
+the case in a synchronous circuit); it is still a ring in which a single data token
+is circulating.
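+
+The step counts quoted above can be reproduced with a small token-bubble simulation. It is a sketch under stated assumptions: each latch holds "V" or "E", a latch counts as a bubble when its successor already holds the same value, a bubble copies its predecessor whenever the predecessor holds a different value, and all enabled latches move in the same step. The function names are illustrative.
+
```python
def step(ring):
    """Advance the ring by one step; every enabled latch copies its predecessor."""
    n = len(ring)
    nxt = list(ring)
    for i in range(n):
        pred = ring[i - 1]        # index -1 wraps around to the last latch
        succ = ring[(i + 1) % n]
        is_bubble = ring[i] == succ  # successor has already stored our value
        if is_bubble and pred != ring[i]:
            nxt[i] = pred
    return nxt


def cycle_steps(ring, limit=1000):
    """Steps until the initial state recurs, or None if the ring never returns."""
    start = list(ring)
    state = step(start)
    if state == start:
        return None               # nothing can move: deadlock
    steps = 1
    while state != start:
        if steps >= limit:
            return None
        state = step(state)
        steps += 1
    return steps


three_stage = cycle_steps(["V", "E", "E"])       # valid token, bubble, empty token
four_stage = cycle_steps(["V", "V", "E", "E"])   # two tokens and two bubbles
no_bubbles = cycle_steps(["V", "E", "V", "E"])   # every latch holds a token
```
+
+Under these assumptions the 3-stage ring returns to its initial state after 6 steps and the 4-stage ring after 4 steps, while a ring without bubbles cannot move at all.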
+
+3.3. Building blocks
+
+Figure 3.3 shows a minimum set of components that is sufficient to im-
+plement asynchronous circuits (static data-flow structures with deterministic
+behaviour, i.e. without arbiters). The components can be grouped in four cat-
+egories as explained below. In the next section we will see examples of the
+token-flow behaviour in structures composed of these components. Compo-
+nents for mutual exclusion and arbitration are covered in section 5.8.
+
+Latches provide storage for variables and implement the handshaking that
+supports the token flow. In addition to the normal latch a number of
+degenerate latches are often needed: a latch with only an output channel
+is a source that produces tokens (with the same constant value), and a
+latch with only an input channel is a sink that consumes tokens. Fig-
+ure 2.9 shows the implementation of a 4-phase bundled-data latch, fig-
+ure 2.11 shows the implementation of a 2-phase bundled-data latch, and
+figures 2.12 – 2.13 show the implementation of a 4-phase dual-rail latch.
+
+Function blocks are the asynchronous equivalent of combinatorial circuits.
+They are transparent/passive from a handshaking point of view. A func-
+tion block will: (1) wait for tokens on its inputs (an implicit join), (2)
+perform the required combinatorial function, and (3) issue tokens on its
+outputs. Both empty and valid tokens are handled in this way. Some
+implementations assume that the inputs have been synchronized. In this
+case it may be necessary to use an explicit join component. The imple-
+mentation of function blocks is addressed in detail in chapter 5.
+
+Unconditional flow control: Fork and join components are used to handle
+parallel threads of computation. In engineering terms, forks are used
+when the output from one component is input to more components, and
+joins are used when data from several independent channels needs to
+be synchronized – typically because they are (independent) inputs to a
+circuit. In the following we will often omit joins and forks from circuit
+diagrams: the fan-out of a channel implies a fork, and the fan-in of
+several channels implies a join.
+
+Figure 3.3. A minimum and, for most cases, sufficient set of asynchronous components.
+(Diagram labels: Latch, Source, Sink, Function block, Fork, Join, Merge, MUX and DEMUX, with alternative symbols. A function block behaves like a join, combinational logic and a fork, carried out in sequence.)
+
+A merge component has two or more input channels and one output
+channel. Handshakes on the input channels are assumed to be mutually
+exclusive and the merge relays input tokens/handshakes to the output.
+
+Conditional flow control: MUX and DEMUX components perform the usual
+functions of selecting among several inputs or steering the input to one
+of several outputs. The control input is a channel just like the data in-
+puts and outputs. A MUX will synchronize the control channel and the
+relevant input channel and send the input data to the data output. The
+other input channel is ignored. Similarly a DEMUX will synchronize
+the control and data input channels and steer the input to the selected
+output channel.
+
+As mentioned before the latches implement the handshaking and thereby the
+token flow in a circuit. All other components must be transparent to the hand-
+shaking. This has significant implications for the implementation of these com-
+ponents!
+
+3.4. A simple example
+
+Figure 3.4 shows an example of a circuit composed of latches, forks and
+joins that we will use to illustrate the token-flow behaviour of an asynchronous
+circuit. The structure can be described as pipeline segments and a ring con-
+nected into a larger structure using fork and join components.
+
+Figure 3.4.
+An example asynchronous circuit composed of latches, forks and joins.
+
+Assume that the circuit is initialized as shown in figure 3.5 at time t0: all
+latches are initialized to the empty value except for the bottom two latches in
+the ring that are initialized to contain a valid value and an empty value. Values
+enclosed in circles are tokens and the rest are bubbles. Assume further that
+the left and right hand environments (not shown) take part in the handshakes
+that the circuit is prepared to perform. Under these conditions the operation
+of the circuit (i.e. the flow of tokens) is as illustrated in the snapshots labeled
+t0–t11. The left hand environment performs one handshake cycle inputting a
+valid value followed by an empty value. In a similar way the right environment
+takes part in one handshake cycle and consumes a valid value and an empty
+value.
+Because the flow of tokens is controlled by local handshaking the circuit
+could exhibit many other behaviours. For example, at time t5 the circuit is
+ready to accept a new valid value from its left environment. Notice also that
+if the initial state had no tokens in the ring, then the circuit would deadlock
+after a few steps. It is highly recommended that the reader tries to play the
+token-bubble data-flow game; perhaps using the same circuit but with different
+initial states.
+
+
+[Figure 3.5: a sequence of snapshots of the circuit state, starting at t0. Legend: V = valid token, E = empty token; circled values are tokens and the rest are bubbles.]
+
+Figure 3.5.
+A possible operation sequence of the example circuit from figure 3.4.
+
+
+
+3.5.
+Simple applications of rings
+
+This section presents a few simple and obvious circuits based on a single
+ring.
+
+3.5.1
+Sequential circuits
+
+Figure 3.6 shows a straightforward implementation of a finite state machine.
+Its structure is similar to a synchronous finite state machine; it consists of a
+function block and a ring that holds the current state. The machine accepts an
+“input token” that is joined with the “current state token”. Then the function
+block computes the output and the next state, and finally the fork splits these
+into an “output token” and a “next state token.”
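+
+The join, function block and fork steps described above can be mimicked in software. The following Python fragment is only an illustrative sketch under simplifying assumptions: handshaking, latches and empty tokens are not modelled, and the names run_fsm and parity are hypothetical and do not come from the text.
+
```python
def run_fsm(f, initial_state, inputs):
    """Software analogy of the ring-based state machine (not a circuit model).

    f(input_token, state) must return (output_token, next_state).
    """
    state = initial_state            # the token initially held in the ring
    outputs = []
    for token in inputs:
        joined = (token, state)      # join: input token meets current-state token
        output, next_state = f(*joined)   # function block F
        outputs.append(output)       # fork: the output token leaves the circuit ...
        state = next_state           # ... and the next-state token re-enters the ring
    return outputs, state


def parity(bit, state):
    """Hypothetical example machine: running parity of the input bits."""
    next_state = state ^ bit
    return next_state, next_state


outputs, final_state = run_fsm(parity, 0, [1, 0, 1, 1])
print(outputs, final_state)
```
+
+Each loop iteration corresponds to one input token being consumed and one output token being produced; the state variable plays the role of the token circulating in the ring.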
+
+[Figure 3.6 labels: Input, Output, function block F, Current state, Next state; latch contents V, E, E.]
+
+Figure 3.6.
+Implementation of an asynchronous finite state machine using a ring.
+
+3.5.2
+Iterative computations
+
+A ring can also be used to build circuits that implement iterative computations.
+Figure 3.7 shows a template circuit. The idea is that the circuit will:
+(1) accept an operand, (2) sequence through the same operation a number of
+times until the computation terminates and (3) output the result. The necessary
+control is not shown. The figure shows one particular implementation. Possible
+variations involve locating the latches and the function block differently
+in the ring as well as decomposing the function block and putting these (simpler)
+function blocks between more latches. In [156] Ted Williams presents
+a circuit that performs division using a self-timed 5-stage ring. This design
+was later used in a floating point coprocessor in a commercial microprocessor
+[157].
+
+[Figure 3.7 labels: Operand(s), Result, function block F, MUX and DEMUX with inputs 0 and 1, latches holding E.]
+
+Figure 3.7.
+Implementation of an iterative computation using a ring.
+
+3.6.
+FOR, IF, and WHILE constructs
+
+Very often the desired function of a circuit is expressed using a program-
+ming language (C, C++, VHDL, Verilog, etc.). In this section we will show
+implementation templates for a number of typical conditional structures and
+loop structures. A reader who is familiar with control-data-flow graphs, per-
+haps from high-level synthesis, will recognize the great similarities between
+asynchronous circuits and control-data-flow graphs [36, 127].
+
+if <cond> then <body1> else <body2>
+An asynchronous circuit template
+for implementing an if statement is shown in figure 3.8(a). The data-type of
+the input and output channels to the if circuit is a record containing all vari-
+ables in the <cond> expression and the variables manipulated by <body1> and
+<body2>. The data-type of the output channel from the cond block is a Boolean
+that controls the DEMUX and MUX components. The FORK associated with
+this channel is not shown.
+Since the execution of <body1> and <body2> is mutually exclusive it is
+possible to replace the controlled MUX in the bottom of the circuit with a
+simpler MERGE as shown in figure 3.8(b). The circuit in figure 3.8 contains
+no feedback loops and no latches – it can be considered a large function block.
+The circuit can be pipelined for improved performance by inserting latches.
+
+[Figure 3.8 panels: (a) a cond block controlling a DEMUX and a MUX around body1 and body2; (b) the same circuit with the MUX replaced by a merge. The input and output channels carry {variables}.]
+
+Figure 3.8.
+A template for implementing if statements.
+
+for <count> do <body>
+An asynchronous circuit template for implementing
+a for statement is shown in figure 3.9. The data-type of the input channel to
+the for circuit is a record containing all variables manipulated in the <body>
+and the loop count, <count>, that is assumed to be a non-negative integer. The
+data-type of the output channel is a record containing all variables manipulated
+in the <body>.
+
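+[Figure 3.9 labels: count block, body block, MUX and DEMUX with inputs 0 and 1; the input channel carries {variables}, count and the output channel carries {variables}; the latches on the MUX control input hold the initial tokens 0 and E.]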
+
+Figure 3.9.
+A template for implementing for statements.
+
+The data-type of the output channel from the count block is a Boolean,
+and one handshake on the input channel of the count block encloses <count>
+handshakes on the output channel: <count> - 1 handshakes providing the
+Boolean value “1” and one (final) handshake providing the Boolean value “0”.
+Notice the two latches on the control input to the MUX. They must be initial-
+ized to contain a data token with the value “0” and an empty token in order to
+enable the for circuit to read the variables into the loop.
+After executing the for statement once, the last handshake of the count block
+will steer the variables in the loop onto the output channel and put a “0” token
+and an empty token into the two latches, thereby preparing the for circuit for
+a subsequent activation. The FORK in the input and the FORK on the output
+of the count block are not shown. Similarly a number of latches are omitted.
+Remember: (1) all rings must contain at least 3 latches and (2) for each latch
+initialized to hold a data token there must also be a latch initialized to hold an
+empty token (when using 4-phase handshaking).
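+
+The Boolean stream produced by the count block can be sketched as follows. This is only an illustration of the sequence stated above; the function name count_block is hypothetical, and the restriction to count >= 1 is an assumption of this sketch, since the text does not spell out what happens for a count of zero.
+
```python
def count_block(count):
    """Yield the control values produced for one handshake on the count input.

    Assumption of this sketch: count >= 1.
    """
    if count < 1:
        raise ValueError("this sketch assumes count >= 1")
    for _ in range(count - 1):
        yield 1   # <count> - 1 handshakes provide the Boolean value "1"
    yield 0       # the final handshake steers the variables onto the output channel


print(list(count_block(4)))
```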
+
+
+
+while <cond> do <body>
+An asynchronous circuit template for implement-
+ing a while statement is shown in figure 3.10. Inputs to (and outputs from) the
+circuit are the variables in the <cond> expression and the variables manipu-
+lated by <body>. As before in the for circuit, it is necessary to put two latches
+initialized to contain a data token with the value “0” and an empty token on
+the control input of the MUX. And as before a number of latches are omitted
+in the two rings that constitute the while circuit. When the while circuit termi-
+nates (after zero or more iterations) data is steered out of the loop and this also
+causes the latches on the MUX control input to become initialized properly for
+the subsequent activation of the circuit.
+
+[Figure 3.10 labels: cond block, body block, MUX and DEMUX with inputs 0 and 1; the channels carry {variables}; the latches on the MUX control input hold the initial tokens 0 and E.]
+
+Figure 3.10.
+A template for implementing while statements.
+
+3.7.
+A more complex example: GCD
+
+Using the templates just introduced we will now design a small example
+circuit, GCD, that computes the greatest common divisor of two integers. GCD
+is often used as an introductory example, and figure 3.11 shows a programming
+language specification of the algorithm.
+In addition to its role as a design example in the current context, GCD can
+also serve to illustrate the similarities and differences between different design
+techniques. In chapter 8 we will use the same example to illustrate the Tangram
+language and the associated syntax-directed compilation process (section 8.3.3
+on pages 127–128).
+The implementation of GCD is shown in figure 3.12. It consists of a while
+template whose body is an if template. Figure 3.12 shows the circuit including
+all the necessary latches (with their initial states). The implementation makes
+no attempt at sharing resources – it is a direct mapping following the imple-
+mentation templates presented in the previous section.
+
+
+
+input (a,b);
+while a != b do
+  if a > b
+    then a := a - b;
+    else b := b - a;
+output (a);
+
+Figure 3.11.
+A programming language specification of GCD.
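+
+For readers who want to check the algorithm before studying the circuit, the specification can be transcribed directly into Python. This is a software sketch of the algorithm only, not of the circuit; the check for positive inputs is an assumption added here because the subtraction-based loop does not terminate when one operand is zero or negative.
+
```python
def gcd(a: int, b: int) -> int:
    """Subtraction-based GCD, mirroring the while/if structure of the specification."""
    if a <= 0 or b <= 0:
        # Assumption of this sketch: operands are positive integers.
        raise ValueError("operands must be positive integers")
    while a != b:          # the while template
        if a > b:          # the if template forming the loop body
            a = a - b
        else:
            b = b - a
    return a


print(gcd(12, 18))
```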
+
+[Figure 3.12 labels: input channel A,B; output channel GCD(A,B); blocks A==B, A>B, A-B and B-A; MUX and DEMUX components with inputs 0 and 1; latches with initial contents E, and initial control tokens.]
+
+Figure 3.12.
+An asynchronous circuit implementation of GCD.
+
+3.8.
+Pointers to additional examples
+
+3.8.1
+A low-power filter bank
+
+In [103] we reported on the design of a low-power IFIR filter bank for a
+digital hearing aid. It is a circuit that was designed following the approach
+presented in this chapter. The paper also provides some insight into the design
+of low power circuits as well as the circuit level implementation of memory
+structures and datapath units.
+
+3.8.2
+An asynchronous microprocessor
+
+In [23] we reported on the design of a MIPS microprocessor, called ARISC.
+Although there are many details to be understood in a large-scale design like
+a microprocessor, the basic architecture shown in figure 3.13 can be understood
+as a simple data-flow structure. The solid-black rectangles represent
+latches, the box-arrows represent channels, and the text-boxes represent function
+blocks (combinatorial circuits).
+The processor is a simple pipelined design with instructions retiring in program
+order. It consists of a fetch-decode-issue ring with a fixed number of tokens.
+This ensures a fixed instruction prefetch depth. The issue stage forks decoded
+instructions into the execute pipeline and initiates the fetch of one more
+instruction. Register forwarding is avoided by a locking mechanism: when an
+instruction is issued for execution the destination register is locked until the
+write-back has taken place. If a subsequent instruction has a read-after-write
+data hazard this instruction is stalled until the register is unlocked. The tokens
+flowing in the design contain all operands and control signals related to the
+execution of an instruction, i.e. similar to what is stored in a pipeline stage in a
+synchronous processor. For further information the interested reader is referred
+to [23]. Other asynchronous microprocessors are based on similar principles.
+
+[Figure 3.13 block labels: PC Read, PC ALU, Inst. Mem., Decode, Issue, Flush, Lock, UnLock, REG Read, REG Write, Arith., Logic, Shift, Data Mem., CP0, Bolt On.]
+
+Figure 3.13.
+Architecture of the ARISC microprocessor.
+
+3.8.3
+A fine-grain pipelined vector multiplier
+
+The GCD circuit and the ARISC presented in the preceding sections use bit-
+parallel communication channels. An example of a static data-flow structure
+that uses 1-bit channels and fine grain pipelining is the serial-parallel vector
+multiplier design reported in [124, 125]. Here all necessary word-level syn-
+chronization is performed implicitly by the function blocks. The large number
+of interacting rings and pipeline segments in the static data-flow representa-
+tion of the design makes it rather complex. After reading the next chapter on
+performance analysis the interested reader may want to look at this design; it
+contains several interesting optimizations.
+
+3.9.
+Summary
+
+This chapter developed a high-level view of asynchronous design that is
+equivalent to RTL (register transfer level) in synchronous design – static data
+flow structures. The next chapter addresses performance analysis at this level of
+abstraction.
+
+
+Chapter 4
+
+PERFORMANCE
+
+In this chapter we will address the performance analysis and optimization
+of asynchronous circuits. The material extends and builds upon the “static
+data-flow structures view” introduced in the previous chapter.
+
+4.1.
+Introduction
+
+In a synchronous circuit, performance analysis and optimization is a matter
+of finding the longest latency signal path between two registers; this determines
+the period of the clock signal. The global clock partitions the circuit into many
+combinatorial circuits that can be analyzed individually. This is known as static
+timing analysis and it is a rather simple task, even for a large circuit.
+For an asynchronous circuit, performance analysis and optimization is a
+global and therefore much more complex problem. The use of handshaking
+makes the timing in one component dependent on the timing of its neighbours,
+which again depends on the timing of their neighbours, etc. Furthermore, the
+performance of a circuit does not depend only on its structure, but also on
+how it is initialized and used by its environment. The performance of an asyn-
+chronous circuit can even exhibit transients and oscillations.
+We will first develop a qualitative understanding of the dynamics of the to-
+ken flow in asynchronous circuits. A good understanding of this is essential for
+designing circuits with good performance. We will then introduce some quan-
+titative performance parameters that characterize individual pipeline stages and
+pipelines and rings composed of identical pipeline stages. Using these param-
+eters one can make first-level design decisions. Finally we will address how
+more complex and irregular structures can be analyzed.
+The following text represents a major revision of material from [124] and
+it is based on original work by Ted Williams [153, 154, 155]. If consulting
+these references the reader should be aware of the exact definition of a token.
+Throughout this book a token is defined as a valid data value or an empty data
+value, whereas in the cited references (that deal exclusively with 4-phase
+handshaking) a token is a valid-empty data pair. The definition used here
+accentuates the similarity between a token in an asynchronous circuit and the token in
+a Petri net. Furthermore it provides some unification between 4-phase
+handshaking and 2-phase handshaking – 2-phase handshaking is the same game,
+but without empty-tokens.
+In the following we will assume 4-phase handshaking, and the examples we
+provide all use bundled-data circuits. It is left as an exercise for the reader
+to make the simple adaptations that are necessary for dealing with 2-phase
+handshaking.
+
+4.2.
+A qualitative view of performance
+
+4.2.1
+Example 1: A FIFO used as a shift register
+
+The fundamental concepts can be illustrated by a simple example: a FIFO
+composed of a number of latches in which there are N valid tokens separated
+by N empty tokens, and whose environment alternates between reading a token
+from the FIFO and writing a token into the FIFO (see figure 4.1(a)). In this way
+the number of tokens in the FIFO is invariant. This example is relevant because
+many designs use FIFOs in this way, and because it models the behaviour of
+shift registers as well as rings – structures in which the number of tokens is
+also invariant.
+A relevant performance figure is the throughput, which is the rate at which
+tokens are input to or output from the shift register. This figure is inversely
+proportional to the time it takes to shift the contents of the chain of latches
+one position to the right.
+Figure 4.1(b) illustrates the behaviour of an implementation in which there
+are 2 latches per valid token (2N latches in total) and figure 4.1(c) illustrates
+the behaviour of an implementation in which there are 3 latches per valid token
+(3N latches in total). In both examples the number of valid tokens in the FIFO
+is N = 3, and the only difference
+between the two situations in figure 4.1(b) and 4.1(c) is the number of bubbles.
+In figure 4.1(b) at time t1 the environment reads the valid token, D1, as
+indicated by the solid channel symbol. This introduces a bubble that enables
+data transfers to take place one at a time (t2–t5). At time t6 the environment
+inputs a valid token, D4, and at this point all elements have been shifted one
+position to the right. Hence, the time used to move all elements one place to
+the right is proportional to the number of tokens, in this case 2N = 6 time steps.
+Adding more latches increases the number of bubbles, which again increases
+the number of data transfers that can take place simultaneously, thereby im-
+proving the performance. In figure 4.1(c) the shift register has 3N stages and
+therefore one bubble per valid-empty token-pair. The effect of this is that N
+data transfers can occur simultaneously and the time used to move all elements
+one place to the right is constant; 2 time steps.
+If the number of latches was increased to 4N there would be one token per
+bubble, and the time to move all tokens one step to the right would be only
+1 time step. In this situation the pipeline is half full and the latches holding
+bubbles act as slave latches (relative to the latches holding tokens). Increasing
+the number of bubbles further would not increase the performance further. Finally,
+it is interesting to notice that the addition of just one more latch holding
+a bubble to figure 4.1(b) would double the performance. The asynchronous
+designer has great freedom in trading more latches for performance.
+
+[Figure 4.1 panels: (a) A FIFO and its environment; (b) N data tokens and N empty tokens in 2N stages; (c) N data tokens and N empty tokens in 3N stages. Panels (b) and (c) show snapshots at successive times, starting at t0, of the data tokens D1 to D4, the empty tokens E and the bubbles.]
+
+Figure 4.1.
+A FIFO and its environment. The environment alternates between reading a token
+from the FIFO and writing a token into the FIFO.
+As the number of bubbles in a design depends on the number of latches per
+token, the above analysis illustrates that performance optimization of a given
+circuit is primarily a task of structural modification – circuit level optimization
+like transistor sizing is of secondary importance.
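+
+The effect of bubbles on the shift time can be explored with a small token-bubble simulation. The following Python sketch rests on several simplifying assumptions that are not part of the text: the FIFO and its environment are replaced by a closed ring, every data transfer takes exactly one time step, a token may only move into a latch that held a bubble at the start of the step, and valid and empty tokens are not distinguished. The function name steps_to_shift_once is hypothetical.
+
```python
def steps_to_shift_once(ring):
    """Count time steps until every token has moved at least one position.

    ring: list of booleans, True = latch holds a token, False = bubble.
    Simplified model: unit-time transfers, closed ring, no V/E distinction.
    """
    size = len(ring)
    # Give each token an identity so we can tell when it has moved.
    cells = [i if occupied else None for i, occupied in enumerate(ring)]
    waiting = {c for c in cells if c is not None}
    steps = 0
    while waiting:
        # A token can move only if the next latch held a bubble at the start of the step.
        movers = [i for i in range(size)
                  if cells[i] is not None and cells[(i + 1) % size] is None]
        if not movers:
            raise RuntimeError("deadlock: the ring contains no bubbles")
        for i in movers:
            nxt = (i + 1) % size
            cells[nxt], cells[i] = cells[i], None
            waiting.discard(cells[nxt])
        steps += 1
    return steps


N = 3  # valid tokens; the ring holds 2N tokens (N valid and N empty)
print(steps_to_shift_once([True] * (2 * N) + [False]))      # a single bubble
print(steps_to_shift_once([True, True, False] * N))         # one bubble per token pair
print(steps_to_shift_once([True, False] * (2 * N)))         # one bubble per token
```
+
+Under these assumptions a single bubble gives 2N steps, one bubble per valid-empty pair gives 2 steps, and one bubble per token gives 1 step, in line with the qualitative discussion above.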
+
+4.2.2
+Example 2: A shift register with parallel load
+
+In order to illustrate another point – that the distribution of tokens and bub-
+bles in a circuit can vary over time, depending on the dynamics of the circuit
+and its environment – we offer another example: a shift register with parallel
+load. Figure 4.2 shows an initial design of a 4-bit shift register. The circuit has
+a bit-parallel input channel, din[3:0], connecting it to a data producing envi-
+ronment. It also has a 1-bit data channel, do, and a 1-bit control channel, ctl,
+connecting it to a data consuming environment. Operation is controlled by the
+data consuming environment which may request the circuit to: (ctl = 0) perform
+a parallel load and to provide the least significant bit from the bit-parallel
+channel on the do channel, or (ctl = 1) to perform a right shift and provide
+the next bit on the do channel. In this way the data consuming environment
+always inputs a control token (valid or empty) to which the circuit always re-
+sponds by outputting a data token (valid or empty). During a parallel load, the
+previous content of the shift register is steered into the “dead end” sink-latches.
+During a right shift the constant 0 is shifted into the most significant position
+– corresponding to a logical right shift. The data consuming environment is
+not required to read all the input data bits, and it may continue reading zeros
+beyond the most significant input data bit.
+The initial design shown in figure 4.2 suffers from two performance-limiting
+inexpediencies: firstly, it has the same problem as the shift register in
+figure 4.1(b) – there are too few bubbles, and the peak data rate on the
+bit-serial output reduces linearly with the length of the shift register. Secondly,
+the control signal is forked to all of the MUXes and DEMUXes in the design.
+This implies a high fan-out of the request and data signals (which requires a
+couple of buffers) and synchronization of all the individual acknowledge signals
+(which requires a C-element with many inputs, possibly implemented as
+a tree of C-elements). The first problem can be avoided by adding a 3rd latch
+to the datapath in each stage of the circuit corresponding to the situation in
+figure 4.1(c), but if the extra latches are added to the control path instead, as
+shown in figure 4.3(a), they will solve both problems.
+
+[Figure 4.2 labels: data producing environment, din[3:0] split into din[3], din[2], din[1], din[0]; data consuming environment with channels ctl and do; a MUX and a DEMUX with inputs 0 and 1 in each stage; latch contents E, d3, E, d2, E, d1; constant 0 input.]
+
+Figure 4.2.
+Initial design of the shift register with parallel load.
+
+[Figure 4.3 panels (a), (b) and (c): the same circuit with extra latches in the ctl path, output channel labelled dout, shown in three successive states with latch contents d0 to d3, 0, 1 and E.]
+
+Figure 4.3.
+Improved design of the shift register with parallel load.
+This improved design exhibits an interesting and illustrative dynamic be-
+haviour: initially, the data latches are densely packed with tokens and all the
+control latches contain bubbles, figure 4.3(a). The first step of the parallel load
+cycle is shown in figure 4.3(b), and figure 4.3(c) shows a possible state after
+the data consuming environment has read a couple of bits. The most-significant
+stage is just about to perform its “parallel load” and the bubbles are now in the
+chain of data latches. If at this point the data consuming environment paused,
+the tokens in the control path would gradually disappear while tokens in the
+datapath would pack again. Note that at any time the total number of tokens in
+the circuit is constant!
+
+4.3.
+Quantifying performance
+
+4.3.1
+Latency, throughput and wavelength
+
+When the overall structure of a design is being decided, it is important to
+determine the optimal number of latches or pipeline stages in the rings and
+pipeline fragments from which the design is composed. In order to establish a
+basis for first order design decisions, this section will introduce some quantita-
+tive performance parameters. We will restrict the discussion to 4-phase hand-
+shaking and bundled-data circuit implementations and we will consider rings
+with only a single valid token. Subsection 4.3.4, which concludes this section
+on performance parameters, will comment on adapting to other protocols and
+implementation styles.
+The performance of a pipeline is usually characterized by two parameters:
+latency and throughput (or its inverse called period or cycle time). For an asyn-
+chronous pipeline a third parameter, the dynamic wavelength, is important as
+well. With reference to figure 4.4 and following [153, 154, 155] these param-
+eters are defined as follows:
+
+Latency: The latency is the delay from the input of a data item until the corresponding
+output data item is produced. When data flows in the forward
+direction, acknowledge signals propagate in the reverse direction. Consequently
+two parameters are defined:
+
+The forward latency, $L_f$, is the delay from new data on the input of
+a stage (Data[i-1] or Req[i-1]) to the production of the corresponding
+output (Data[i] or Req[i]) provided that the acknowledge
+signals are in place when data arrives. $L_{f,V}$ and $L_{f,E}$ denote the
+latencies for propagating a valid token and an empty token respectively.
+It is assumed that these latencies are constants, i.e. that
+they are independent of the value of the data. [As forward propagation
+of an empty token does not “compute” it may be desirable
+to minimize $L_{f,E}$. In the 4-phase bundled-data approach this can
+be achieved through the use of an asymmetric delay element.]
+
+The reverse latency, $L_r$, is the delay from receiving an acknowledge
+from the succeeding stage (Ack[i+1]) until the corresponding
+acknowledge is produced to the preceding stage (Ack[i]) provided
+that the request is in place when the acknowledge arrives. $L_{r\uparrow}$ and
+$L_{r\downarrow}$ denote the latencies of propagating Ack$\uparrow$ and Ack$\downarrow$ respectively.
+
+[Figure 4.4 panels: a bundled-data pipeline with signals Req[i-1], Data[i-1], Ack[i], Req[i], Data[i], Ack[i+1], latch L[i], function block F[i] and delay element d; and a dual-rail pipeline with signals Data[i-1], Ack[i], Data[i], Ack[i+1], latch L[i] and function block F[i].]
+
+Figure 4.4.
+Generic pipelines for definition of performance parameters.
+
+Period: The period, $P$, is the delay between the input of a valid token (followed
+by its succeeding empty token) and the input of the next valid
+token, i.e. a complete handshake cycle. For a 4-phase protocol this involves:
+(1) forward propagation of a valid data value, (2) reverse propagation
+of acknowledge, (3) forward propagation of the empty data value,
+and (4) reverse propagation of acknowledge. Therefore a lower bound
+on the period is:
+
+$$P \geq L_{f,V} + L_{r\uparrow} + L_{f,E} + L_{r\downarrow} \qquad (4.1)$$
+
+Many of the circuits we consider in this book are symmetric, i.e. $L_{f,V} = L_{f,E}$
+and $L_{r\uparrow} = L_{r\downarrow}$, and for these circuits the period is simply:
+
+$$P = 2L_f + 2L_r \qquad (4.2)$$
+
+We will also consider circuits where $L_{f,V} > L_{f,E}$ and, as we will see in
+section 4.4.1 and again in section 7.3, the actual implementation of the
+latches may lead to a period that is larger than the minimum possible
+given by equation 4.1. In section 4.4.1 we analyze a pipeline whose
+period is:
+
+$$P = 2L_r + 2L_{f,V} \qquad (4.3)$$
+
+Throughput: The throughput, $T$, is the number of valid tokens that flow
+through a pipeline stage per unit time: $T = 1/P$.
+
+Dynamic wavelength: The dynamic wavelength, $W_d$, of a pipeline is the number
+of pipeline stages that a forward-propagating token passes through
+during $P$:
+
+$$W_d = \frac{P}{L_f} \qquad (4.4)$$
+
+Explained differently: $W_d$ is the distance – measured in pipeline stages
+– between successive valid or empty tokens, when they flow unimpeded
+down a pipeline. Think of a valid token as the crest of a wave and its
+associated empty token as the trough of the wave. If $L_{f,V} \neq L_{f,E}$ the
+average forward latency $L_f = \frac{1}{2}(L_{f,V} + L_{f,E})$ should be used in the above
+equation.
+
+Static spread: The static spread, S, is the distance – measured in pipeline
+stages – between successive valid (or empty) tokens in a pipeline that is
+full (i.e. contains no bubbles). Sometimes the term occupancy is used;
+this is the inverse of S.
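+
+The definitions of period, throughput and dynamic wavelength above translate directly into a few lines of Python. This is a minimal sketch, not part of the original text: the function and parameter names are hypothetical, the period is taken to be the lower bound of equation 4.1, and the numeric values (in ns) are the FIFO and pipeline latencies used in Example 3 of section 4.3.3.
+
```python
def period(lf_valid, lf_empty, lr_up, lr_down):
    """Lower bound on the period of a 4-phase stage (equation 4.1)."""
    return lf_valid + lr_up + lf_empty + lr_down


def throughput(p):
    """Valid tokens per unit time: T = 1/P."""
    return 1.0 / p


def dynamic_wavelength(p, lf_valid, lf_empty):
    """W_d = P / L_f, using the average forward latency (equation 4.4)."""
    lf_average = 0.5 * (lf_valid + lf_empty)
    return p / lf_average


# FIFO stage of Example 3: L_f = 2 ns, L_r = 3 ns (symmetric latencies).
p_fifo = period(2, 2, 3, 3)
print(p_fifo, dynamic_wavelength(p_fifo, 2, 2))

# Pipeline stage of Example 3: L_f = 5 ns, L_r = 3 ns (symmetric latencies).
p_pipe = period(5, 5, 3, 3)
print(p_pipe, dynamic_wavelength(p_pipe, 5, 5))
```
+
+For symmetric latencies the period function reduces to equation 4.2, P = 2Lf + 2Lr.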
+
+4.3.2
+Cycle time of a ring
+
+The parameters defined above are local performance parameters that char-
+acterize the implementation of individual pipeline stages. When a number of
+pipeline stages are connected to form a ring, the following parameter is rele-
+vant:
+
+Cycle time: The cycle time of a ring, $T_{Cycle}$, is the time it takes for a token
+(valid or empty) to make one round trip through all of the pipeline stages
+in the ring. To achieve maximum performance (i.e. minimum cycle
+time), the number of pipeline stages per valid token must match the dynamic
+wavelength, in which case $T_{Cycle} = P$. If the number of pipeline
+stages is smaller, the cycle time will be limited by the lack of bubbles,
+and if there are more pipeline stages the cycle time will be limited by
+the forward latency through the pipeline stages. In [153, 154, 155] these
+two modes of operation are called bubble limited and data limited, respectively.
+
+[Figure 4.5 plot: $T_{Cycle}$ as a function of the number of stages $N$, with its minimum $P$ at $N = W_d$. For $N < W_d$ (bubble limited) $T_{Cycle} = \frac{2N}{N-2} L_r$; for $N > W_d$ (data limited) $T_{Cycle} = N L_f$.]
+
+Figure 4.5.
+Cycle time of a ring as a function of the number of pipeline stages in it.
+
+The cycle time of an N-stage ring in which there is one valid token,
+one empty token and $N - 2$ bubbles can be computed from one of the
+following two equations (illustrated in figure 4.5):
+
+When $N \geq W_d$ the cycle time is limited by the forward latency
+through the N stages:
+
+$$T_{Cycle}(\text{DataLimited}) = N \cdot L_f \qquad (4.5)$$
+
+If $L_{f,V} \neq L_{f,E}$ use $L_f = \max\{L_{f,V}; L_{f,E}\}$.
+
+When $N \leq W_d$ the cycle time is limited by the reverse latency. With
+N pipeline stages, one valid token and one empty token, the ring
+contains $N - 2$ bubbles, and as a cycle involves 2N data transfers
+(N valid and N empty), the cycle time becomes:
+
+$$T_{Cycle}(\text{BubbleLimited}) = \frac{2N}{N-2} L_r \qquad (4.6)$$
+
+If $L_{r\uparrow} \neq L_{r\downarrow}$ use $L_r = \frac{1}{2}(L_{r\uparrow} + L_{r\downarrow})$.
+
+For the sake of completeness it should be mentioned that a third possible
+mode of operation called control limited exists for some circuit config-
+urations [153, 154, 155]. This is, however, not relevant to the circuit
+implementation configurations presented in this book.
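+
+The two cycle-time equations (4.5 and 4.6) can be combined into one small helper. The sketch below is an illustration, not the author's method: it assumes symmetric latencies and a ring with one valid token, one empty token and N - 2 bubbles, and it selects the applicable mode by taking the larger of the two expressions. That choice relies on the observation that the two expressions are equal exactly when N = 2 + 2Lr/Lf = P/Lf = Wd. The function name cycle_time is hypothetical; the values (in ns) are those of Example 3 in section 4.3.3.
+
```python
def cycle_time(n_stages, lf, lr):
    """Cycle time of an N-stage ring with one valid and one empty token.

    Assumes symmetric forward (lf) and reverse (lr) latencies.
    """
    if n_stages < 3:
        raise ValueError("a ring must contain at least 3 stages")
    data_limited = n_stages * lf                          # equation 4.5
    bubble_limited = 2 * n_stages * lr / (n_stages - 2)   # equation 4.6
    # Below W_d the bubble-limited term dominates, above W_d the data-limited term.
    return max(data_limited, bubble_limited)


for n in (3, 4, 5, 6):
    # FIFO stage: L_f = 2 ns, L_r = 3 ns; pipeline stage: L_f = 5 ns, L_r = 3 ns.
    print(n, cycle_time(n, 2, 3), cycle_time(n, 5, 3))
```
+
+For the FIFO stage the minimum is reached at 5 stages, which is its dynamic wavelength; for the pipeline stage, whose dynamic wavelength is 3.2, the 3-stage ring gives the smallest cycle time.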
+
+The topic of performance analysis and optimization has been addressed in
+some more recent papers [31, 90, 91, 37] and in some of these the term “slack
+matching” is used (referring to the process of balancing the timing of forward
+flowing tokens and backward flowing bubbles).
+
+
+
+4.3.3
+Example 3: Performance of a 3-stage ring
+
+[Figure 4.6 panels: (a) pipeline stage [i] with signals Req[i-1], Data[i-1], Ack[i], Req[i], Data[i], Ack[i+1], a latch L, a combinatorial circuit CL, a C-element (tc = 2), an inverter (ti = 1) and a delay element (td = 3); (b) the same stage with the forward latency path Lf and the reverse latency path Lr marked.]
+
+Figure 4.6.
+A simple 4-phase bundled-data pipeline stage, and an illustration of its forward
+and reverse latency signal paths.
+
+Let us illustrate the above by a small example: a 3-stage ring composed of
+identical 4-phase bundled-data pipeline stages that are implemented as illus-
+trated in figure 4.6(a). The data path is composed of a latch and a combinatorial
+circuit, CL. The control part is composed of a C-element and an inverter that
+controls the latch and a delay element that matches the delay in the combinato-
+rial circuit. Without the combinatorial circuit and the delay element we have a
+simple FIFO stage. For illustrative purposes the components in the control part
+are assigned the following latencies: C-element: $t_c = 2$ ns, inverter: $t_i = 1$ ns,
+and delay element: $t_d = 3$ ns.
+Figure 4.6(b) shows the signal paths corresponding to the forward and reverse
+latencies, and table 4.1 lists the expressions and the values of these parameters.
+From these figures the period and the dynamic wavelength for the
+two circuit configurations are calculated. For the FIFO, $W_d = 5.0$ stages, and
+for the pipeline, $W_d = 3.2$. A ring can only contain an integer number of stages
+and if $W_d$ is not integer it is necessary to analyze rings with $\lfloor W_d \rfloor$ and $\lceil W_d \rceil$
+
+Table 4.1.
+Performance of different simple ring configurations.
+
+                     | FIFO                      | Pipeline
+Parameter            | Expression      | Value   | Expression           | Value
+---------------------+-----------------+---------+----------------------+-----------
+Lr                   | tc + ti         | 3 ns    | tc + ti              | 3 ns
+Lf                   | tc              | 2 ns    | tc + td              | 5 ns
+P = 2 Lf + 2 Lr      | 4 tc + 2 ti     | 10 ns   | 4 tc + 2 ti + 2 td   | 16 ns
+Wd                   |                 | 5 stages|                      | 3.2 stages
+TCycle (3 stages)    | 6 Lr            | 18 ns   | 6 Lr                 | 18 ns
+TCycle (4 stages)    | 4 Lr            | 12 ns   | 4 Lf                 | 20 ns
+TCycle (5 stages)    | 3.3 Lr = 5 Lf   | 10 ns   | 5 Lf                 | 25 ns
+TCycle (6 stages)    | 6 Lf            | 12 ns   | 6 Lf                 | 30 ns
+
+
+
+stages and determine which yields the smallest cycle time. Table 4.1 shows the
+results of the analysis including cycle times for rings with 3 to 6 stages.
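+
+The pipeline column of table 4.1 can be reproduced by evaluating both equation 4.5 and equation 4.6 for each ring size and keeping the larger value. Taking the maximum is our reading of "limited by", not a formula stated in the text. With $L_r = 3$ ns and $L_f = 5$ ns:
+
+```latex
+\begin{align*}
+N = 3:&\quad \max\!\left(\tfrac{2 \cdot 3}{3-2} \cdot 3,\; 3 \cdot 5\right) = \max(18, 15) = 18\ \text{ns} \\
+N = 4:&\quad \max\!\left(\tfrac{2 \cdot 4}{4-2} \cdot 3,\; 4 \cdot 5\right) = \max(12, 20) = 20\ \text{ns} \\
+N = 5:&\quad \max\!\left(\tfrac{2 \cdot 5}{5-2} \cdot 3,\; 5 \cdot 5\right) = \max(10, 25) = 25\ \text{ns} \\
+N = 6:&\quad \max\!\left(\tfrac{2 \cdot 6}{6-2} \cdot 3,\; 6 \cdot 5\right) = \max(9, 30) = 30\ \text{ns}
+\end{align*}
+```
+
+The same calculation with $L_f = 2$ ns gives 18, 12, 10 and 12 ns for the FIFO, so the smallest cycle time is reached at 3 stages for the pipeline and at 5 stages for the FIFO.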
+
+4.3.4
+Final remarks
+
+The above presentation made a number of simplifying assumptions: (1)
+only rings and pipelines composed of identical pipeline stages were considered,
+(2) it assumed function blocks with symmetric delays (i.e. circuits where
+$L_{f,V} = L_{f,E}$), (3) it assumed function blocks with constant latencies (i.e.
+ignoring the important issue of data-dependent latencies and average-case
+performance), (4) it considered rings with only a single valid token, and (5) the
+analysis considered only 4-phase handshaking and bundled-data circuits.
+For 4-phase dual-rail implementations (where request is embedded in the
+data encoding) the performance parameter equations defined in the previous
+section apply without modification. For designs using a 2-phase protocol, some
+straightforward modifications are necessary: there are no empty tokens and
+hence there is only one value for the forward latency Lf and one value for the
+reverse latency Lr. It is also a simple matter to state expressions for the cycle
+time of rings with more tokens.
+It is more difficult to deal with data-dependent latencies in the function
+blocks and to deal with non-identical pipeline stages. Despite these deficien-
+cies the performance parameters introduced in the previous sections are very
+useful as a basis for first-order design decisions.
+
+4.4.
+Dependency graph analysis
+
+When the pipeline stages incorporate different function blocks, or function
+blocks with asymmetric delays, it is a more complex task to determine the crit-
+ical path. It is necessary to construct a graph that represents the dependencies
+between signal transitions in the circuit, and to analyze this graph and identify
+the critical path cycle [19, 153, 154, 155]. This can be done in a systematic or
+even mechanical way but the amount of detail makes it a complex task.
+The nodes in such a dependency graph represent rising or falling signal
+transitions, and the edges represent dependencies between the signal transitions.
+Formally, a dependency graph is a marked graph [28]. Let us look at a couple
+of examples.
+
+4.4.1
+Example 4: Dependency graph for a pipeline
+
+As a first example let us consider a (very long) pipeline composed of identical
+stages using a function block with asymmetric delays causing $L_{f,E} < L_{f,V}$.
+Figure 4.7(a) shows a 3-stage section of this pipeline. Each pipeline stage has
+
+
+
+the following latency parameters:
+
+$L_{f,V} = t_{d(0 \to 1)} + t_c = 5\ \text{ns} + 2\ \text{ns} = 7\ \text{ns}$
+$L_{f,E} = t_{d(1 \to 0)} + t_c = 1\ \text{ns} + 2\ \text{ns} = 3\ \text{ns}$
+$L_{r\uparrow} = L_{r\downarrow} = t_i + t_c = 3\ \text{ns}$
+
+There is a close relationship between the circuit diagram and the dependency
+graph. As signals alternate between rising transitions ($\uparrow$) and falling transitions
+($\downarrow$) – or between valid and empty data values – the graph has two nodes per
+circuit element. Similarly the graph has two edges per wire in the circuit.
+Figure 4.7(b) shows the two graph fragments that correspond to a pipeline
+stage, and figure 4.7(c) shows the dependency graph that corresponds to the 3
+pipeline stages in figure 4.7(a).
+A label outside a node denotes the circuit delay associated with the signal
+transition. We use a particular style for the graphs that we find illustrative: the
+nodes corresponding to the forward flow of valid and empty data values are
+organized as two horizontal rows, and nodes representing the reverse flowing
+acknowledge signals appear as diagonal segments connecting the rows.
+The cycle time or period of the pipeline is the time from a signal transition
+until the same signal transition occurs again. The cycle time can therefore be
+determined by finding the longest simple cycle in the graph, i.e. the cycle with
+the largest accumulated circuit delay which does not contain a sub-cycle. The
+dotted cycle in figure 4.7(c) is the longest simple cycle. Starting at point A the
+corresponding period is:
+
+$P = \underbrace{t_{D(0 \to 1)} + t_C}_{L_{f,V}} + \underbrace{t_I + t_C}_{L_r} + \underbrace{t_{D(0 \to 1)} + t_C}_{L_{f,V}} + \underbrace{t_I + t_C}_{L_r} = 2L_r + 2L_{f,V} = 20\ \text{ns}$
+
+Note that this is the period given by equation 4.3 on page 49. An alternative
+cycle time candidate is the following:
+
+$\underbrace{R[i]{\uparrow};\ Req[i]{\uparrow}}_{L_{f,V}};\ \underbrace{A[i-1]{\downarrow};\ Req[i-1]{\downarrow}}_{L_r};\ \underbrace{R[i]{\downarrow};\ Req[i]{\downarrow}}_{L_{f,E}};\ \underbrace{A[i-1]{\uparrow};\ Req[i-1]{\uparrow}}_{L_r}$
+
+and the corresponding period is:
+
+$P = 2L_r + L_{f,V} + L_{f,E} = 16\ \text{ns}$
+
+Note that this is the minimum possible period given by equation 4.1 on page 48.
+The period is determined by the longest cycle which is 20 ns. Thus, this example
+illustrates that for some (simple) latch implementations it may not be
+possible to reduce the cycle time by using function blocks with asymmetric
+delays ($L_{f,E} < L_{f,V}$).
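+
+To make the comparison between the two candidate cycles explicit, the difference between their periods can be written out with the latency values of this example. This is only a restatement of the numbers above, not an additional result.
+
+```latex
+\begin{align*}
+P_{\text{longest}} - P_{\text{alternative}}
+  &= (2L_r + 2L_{f,V}) - (2L_r + L_{f,V} + L_{f,E}) \\
+  &= L_{f,V} - L_{f,E} = 7\ \text{ns} - 3\ \text{ns} = 4\ \text{ns}
+\end{align*}
+```
+
+Shortening $t_{d(1 \to 0)}$ therefore only shortens the alternative cycle; in this latch implementation the period stays at $2L_r + 2L_{f,V}$ as long as $L_{f,E} \le L_{f,V}$.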
+
+
+
+[Figure 4.7: (a) circuit diagram of Stage[i-1], Stage[i] and Stage[i+1], each with a latch (L), a combinatorial circuit (CL), a C-element (tc = 2), an inverter (ti = 1) and an asymmetric delay element (td(0->1) = 5, td(1->0) = 1), connected by the signals Req, Ack and Data; (b) the two graph fragments for the 0->1 and 1->0 transitions of Req[i], with nodes R[i], A[i] and Req[i] and delays td[i], tc and ti; (c) the dependency graph for the three stages with node delays td(0->1) = 5 ns, td(1->0) = 1 ns, tc = 2 ns and ti = 1 ns, in which the longest simple cycle is dotted and point A is marked.]
+
+Figure 4.7.
+Data dependency graph for a 3-stage section of a pipeline: (a) the circuit diagram,
+(b) the two graph fragments corresponding to a pipeline stage, and (c) the resulting
+data-dependency graph.
+
+4.4.2
+Example 5: Dependency graph for a 3-stage ring
+
+As another example of dependency graph analysis let us consider a three
+stage 4-phase bundled-data ring composed of different pipeline stages, fig-
+ure 4.8(a): stage 1 with a combinatorial circuit that is matched by a symmetric
+
+
+
+[Figure 4.8: (a) circuit diagram of the ring with Stage 1, Stage 2 and Stage 3 connected by Req1/Ack1/Data1, Req2/Ack2/Data2 and Req3/Ack3/Data3; stages 1 and 3 contain a combinatorial circuit (CL); delay labels in the diagram: tc = 2 ns, ti = 1 ns, td2 = 2 ns, td3(0->1) = 6 ns, td3(1->0) = 1 ns; (b) the dependency graph with nodes R1, A1, Req1, R2, A2, Req2, R3, A3, Req3 and node delays td1(0->1) = 2 ns, td1(1->0) = 2 ns, td3(0->1) = 6 ns, td3(1->0) = 1 ns, tc = 2 ns and ti = 1 ns; points A and B are marked.]
+
+Figure 4.8.
+Data dependency graph for an example 3-stage ring: (a) the circuit diagram for
+the ring and (b) the resulting data-dependency graph.
+
+delay element, stage 2 without combinatorial logic, and stage 3 with a combi-
+natorial circuit that is matched by an asymmetric delay element.
+The dependency graph is similar to the dependency graph for the 3-stage
+pipeline from the previous section. The only difference is that the output port
+of stage 3 is connected to the input port of stage 1, forming a closed graph.
+There are several “longest simple cycle” candidates:
+1 A cycle corresponding to the forward flow of valid-tokens:
+
+$(R1{\uparrow};\ Req1{\uparrow};\ R2{\uparrow};\ Req2{\uparrow};\ R3{\uparrow};\ Req3{\uparrow})$
+
+For this cycle, the cycle time is $T_{Cycle} = 14$ ns.
+
+2 A cycle corresponding to the forward flow of empty-tokens:
+
+$(R1{\downarrow};\ Req1{\downarrow};\ R2{\downarrow};\ Req2{\downarrow};\ R3{\downarrow};\ Req3{\downarrow})$
+
+For this cycle, the cycle time is $T_{Cycle} = 9$ ns.
+
+
+
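+3 A cycle corresponding to the backward flowing bubble:
+
+$(A1{\downarrow};\ Req1{\downarrow};\ A3{\uparrow};\ Req3{\uparrow};\ A2{\downarrow};\ Req2{\downarrow};\ A1{\uparrow};\ Req1{\uparrow};\ A3{\downarrow};\ Req3{\downarrow};\ A2{\uparrow};\ Req2{\uparrow})$
+
+For this cycle, the cycle time is $T_{Cycle} = 6L_r = 18$ ns.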
+
+The 3-stage ring contains one valid-token, one empty-token and one bub-
+ble, and it is interesting to note that the single bubble is involved in six
+data transfers, and therefore makes two reverse round trips for each for-
+ward round trip of the valid-token.
+
+4 There is, however, another cycle with a slightly longer cycle time, as
+illustrated in figure 4.8(b). It is the cycle corresponding to the backward-flowing
+bubble where the sequence:
+
+$(A1{\downarrow};\ Req1{\downarrow};\ A3{\uparrow})$
+is replaced by
+$(R3{\uparrow})$
+
+For this cycle the cycle time is $T_{Cycle} = 20$ ns (rather than $6L_r = 18$ ns).
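+
+The four cycle times can be checked by adding the node delays along each cycle. The sketch below assumes that the forward latency of stage $k$ is $t_{dk} + t_c$, that stage 2 (which has no combinatorial logic) contributes no matched delay, and that $L_r = t_i + t_c$ as in the previous example; $T_1$ to $T_4$ are our own labels for the four candidates.
+
+```latex
+\begin{align*}
+T_1 &= (t_{d1(0 \to 1)} + t_c) + t_c + (t_{d3(0 \to 1)} + t_c) = 4 + 2 + 8 = 14\ \text{ns} \\
+T_2 &= (t_{d1(1 \to 0)} + t_c) + t_c + (t_{d3(1 \to 0)} + t_c) = 4 + 2 + 3 = 9\ \text{ns} \\
+T_3 &= 6\,(t_i + t_c) = 6 \cdot 3 = 18\ \text{ns} \\
+T_4 &= T_3 - (t_i + t_c + t_i) + t_{d3(0 \to 1)} = 18 - 4 + 6 = 20\ \text{ns}
+\end{align*}
+```
+
+In $T_4$ the three replaced nodes (two inverter transitions and one C-element transition, 4 ns in total) give way to the single 6 ns rising transition of the asymmetric delay element in stage 3, which is why this cycle is 2 ns longer than the pure bubble cycle.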
+
+A dependency graph analysis of a 4-stage ring is very similar. The only
+difference is that there are two bubbles in the ring. In the dependency graph
+this corresponds to the existence of two “bubble cycles” that do not interfere
+with each other.
+The dependency graph approach presented above assumes a closed circuit
+that results in a closed dependency graph. If a component such as a pipeline
+fragment is to be analyzed it is necessary to include a (dummy) model of its
+environment as well – typically in the form of independent and eager token
+producers and token consumers, i.e. dummy circuits that simply respond to
+handshakes. Figure 2.15 on page 24 illustrated this for a single pipeline stage
+control circuit.
+Note that a dependency graph as introduced above is similar to a signal
+transition graph (STG) which we will introduce more carefully in chapter 6.
+
+4.5.
+Summary
+
+This chapter addressed the performance analysis of asynchronous circuits
+at several levels: firstly, by providing a qualitative understanding of perfor-
+mance based on the dynamics of tokens flowing in a circuit; secondly, by in-
+troducing quantitative performance parameters that characterize pipelines and
+rings composed of identical pipeline stages and, thirdly, by introducing de-
+pendency graphs that enable the analysis of pipelines and rings composed of
+non-identical stages.
+At this point we have covered the design and performance analysis of asyn-
+chronous circuits at the “static data-flow structures” level, and it is time to
+address low-level circuit design principles and techniques. This will be the
+topic of the next two chapters.
+
+
+Chapter 5
+
+HANDSHAKE CIRCUIT IMPLEMENTATIONS
+
+In this chapter we will address the implementation of handshake compo-
+nents. First, we will consider the basic set of components introduced in sec-
+tion 3.3 on page 32: (1) the latch, (2) the unconditional data-flow control el-
+ements join, fork and merge, (3) function blocks, and (4) the conditional flow
+control elements MUX and DEMUX. In addition to these basic components
+we will also consider the implementation of mutual exclusion elements and
+arbiters and touch upon the (unavoidable) problem of metastability. The major
+part of the chapter (sections 5.3–5.6) is devoted to the implementation of func-
+tion blocks and the material includes a number of fundamental concepts and
+circuit implementation styles.
+
+5.1.
+The latch
+
+As mentioned previously, the role of latches is: (1) to provide storage for
+valid and empty tokens, and (2) to support the flow of tokens via handshak-
+ing with neighbouring latches. Possible implementations of the handshake
+latch were shown in chapter 2: Figure 2.9 on page 18 shows how a 4-phase
+bundled-data handshake latch can be implemented using a conventional latch
+and a control circuit (the figure shows several such examples assembled into
+pipelines). In a similar way figure 2.11 on page 20 shows the implementation
+of a 2-phase bundled-data latch, and figures 2.12-2.13 on page 21 show the
+implementation of a 4-phase dual-rail latch.
+A handshake latch can be characterized in terms of the throughput, the dy-
+namic wavelength and the static spread of a FIFO that is composed of identical
+latches. Common to the two 4-phase latch designs mentioned above is that a
+FIFO will fill with every other latch holding a valid token and every other latch
+holding an empty token (as illustrated in figure 4.1(b) on page 43). Thus, the
+static spread for these FIFOs is $S = 2$.
+A 2-phase implementation does not involve empty tokens and consequently
+it may be possible to design a latch whose static spread is $S = 1$. Note, however,
+that the implementation of the 2-phase bundled-data handshake latch in
+
+
+figure 2.11 on page 20 involves several level-sensitive latches; the utilization
+of the level sensitive latches is no better.
+Ideally, one would want to pack a valid token into every level-sensitive latch,
+and in chapter 7 we will address the design of 4-phase bundled-data handshake
+latches that have a smaller static spread.
+
+5.2.
+Fork, join, and merge
+
+Possible 4-phase bundled-data and 4-phase dual-rail implementations of the
+fork, join, and merge components are shown in figure 5.1. For simplicity the
+figure shows a fork with two output channels only, and join and merge compo-
+nents with two input channels only. Furthermore, all channels are assumed to
+be 1-bit channels. It is, of course, possible to generalize to three or more inputs
+and outputs respectively, and to extend to n-bit channels. Based on the expla-
+nation given below this should be straightforward, and it is left as an exercise
+for the reader.
+
+4-phase fork and join
+A fork involves a C-element to combine the acknowl-
+edge signals on the output channels into a single acknowledge signal on the
+input channel. Similarly a 4-phase bundled-data join involves a C-element to
+combine the request signals on the input channels into a single request signal
+on the output channel. The 4-phase dual-rail join does not involve any active
+components as the request signal is encoded into the data.
+The particular fork in figure 5.1 duplicates the input data, and the join con-
+catenates the input data. This happens to be the way joins and forks are mostly
+used in our static data-flow structures, but there are many alternatives: for ex-
+ample, the fork could split the input data which would make it more symmetric
+to the join in figure 5.1. In any case the difference is only in how the input data
+is transferred to the output. From a control point of view the different alter-
+natives are identical: a join synchronizes several input channels and a fork
+synchronizes several output channels.
+
+4-phase merge
+The implementation of the merge is a little more elaborate.
+Handshakes on the input channels are mutually exclusive, and the merge sim-
+ply relays the active input handshake to the output channel.
+Let us consider the implementation of the 4-phase bundled-data merge first.
+It consists of an asynchronous control circuit and a multiplexer that is con-
+trolled by the input request. The control circuit is explained below.
+The request signals on the input channels are mutually exclusive and may
+simply be ORed together to produce the request signal on the output channel.
+For each input channel, a C-element produces an acknowledge signal in response
+to an acknowledge on the output channel provided that the input channel
+has valid data. For example, the C-element driving the x-ack signal is set high
+
+
+
+[Figure 5.1: table of circuit diagrams with rows Fork, Join and Merge and columns "4-phase bundled-data" and "4-phase dual-rail". The channels are x, y and z, with signals x-req, x-ack, y-req, y-ack, z-req and z-ack in the bundled-data column and x.t, x.f, y.t, y.f, z.t and z.f in the dual-rail column (the dual-rail join output is z0.t, z0.f, z1.t, z1.f). The diagrams use C-elements (C, some with "+"-marked inputs), OR gates and a multiplexer (MUX).]
+
+Figure 5.1.
+4-phase bundled-data and 4-phase dual-rail implementations of the fork, join and
+merge components.
+
+when x-req and z-ack have both gone high, and it is reset when both signals have
+gone low again. As z-ack goes low in response to x-req going low, it will suffice to
+reset the C-element in response to z-ack going low. This optimization is possible
+if asymmetric C-elements are available, figure 5.2. A similar argument applies
+to the C-element that drives the y-ack signal. A more detailed introduction to
+generalized C-elements and related state-holding devices is given in chapter 6,
+sections 6.4.1 and 6.4.5.
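+
+The control part of the bundled-data merge described above can be summarised as Boolean next-state equations. This is a sketch in our own notation (signal names written with subscripts, a prime marking the next value of a state-holding output). The symmetric C-element equation $c' = ab + c(a + b)$ is the standard one; the asymmetric variant is our reading of the set and reset conditions given in the text, not an equation taken from the book.
+
+```latex
+\begin{align*}
+z_{req}  &= x_{req} + y_{req} \\
+x_{ack}' &= x_{req}\, z_{ack} + x_{ack}\,(x_{req} + z_{ack})
+          && \text{(symmetric C-element)} \\
+x_{ack}' &= x_{req}\, z_{ack} + x_{ack}\, z_{ack}
+          && \text{(asymmetric: reset by } z_{ack} \text{ low alone)}
+\end{align*}
+```
+
+The equations for $y_{ack}$ follow by replacing $x$ with $y$. The first line relies on the input handshakes being mutually exclusive, as stated above.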
+
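+[Figure 5.2: the asymmetric C-element symbol (C with a "+"-marked input) with inputs x-req and z-ack and output x-ack, next to a circuit implementation with inputs x-req and z-ack, output x-ack, and its "set" and "reset" parts labelled.]
+
+Figure 5.2.
+A possible implementation of the upper asymmetric C-element in the 4-phase
+bundled-data merge in figure 5.1.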
+
+
+
+The implementation of the 4-phase dual-rail merge is fairly similar. As
+request is encoded into the data signals an OR gate is used for each of the
+two output signals z.t and z.f. Acknowledge on an input channel is produced
+in response to an acknowledge on the output channel provided that the input
+channel has valid data. Since the example assumes 1-bit wide channels, the
+latter is established using an OR gate (marked “1”), but for N-bit wide channels
+a completion detector (as shown in figure 2.13 on page 21) would be required.
+
+2-phase fork, join and merge
+Finally a word about 2-phase bundled-data
+implementations of the fork, join and merge components: the implementation
+of 2-phase bundled-data fork and join components is identical to the imple-
+mentation of the corresponding 4-phase bundled-data components (assuming
+that all signals are initially low).
+The implementation of a 2-phase bundled-data merge, on the other hand,
+is complex and rather different, and it provides a good illustration of why the
+implementation of some 2-phase bundled-data components is complex. When
+observing an individual request or acknowledge signal the transitions will obvi-
+ously alternate between rising and falling, but since nothing is known about the
+sequence of handshakes on the input channels there is no relationship between
+the polarity of a request signal transition on an input channel and the polarity
+of the corresponding request signal transition on the output channel. Similarly
+there is no relationship between the polarity of an acknowledge signal transition
+on the output channel and the polarity of the corresponding acknowledge
+signal transition on the input channel. This calls for some kind of storage
+element on each request and acknowledge signal produced by the circuit.
+This brings complexity, as does the associated control logic.
+
+5.3.
+Function blocks – The basics
+
+This section will introduce the fundamental principles of function block de-
+sign, and subsequent sections will illustrate function block implementations
+for different handshake protocols. The running example will be an N-bit ripple
+carry adder.
+
+5.3.1
+Introduction
+
+A function block is the asynchronous equivalent of a combinatorial circuit:
+it computes one or more output signals from a set of input signals. The term
+“function block” is used to stress the fact that we are dealing with circuits with
+a purely functional behaviour.
+However, in addition to computing the desired function(s) of the input sig-
+nals, a function block must also be transparent to the handshaking that is im-
+plemented by its neighbouring latches. This transparency to handshaking is
+
+
+
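+[Figure 5.3: a function block ADD whose operand channels A, B and cin pass through a join on the input side and whose result channels SUM and cout leave through a fork on the output side.]
+
+Figure 5.3.
+A function block whose operands and results are provided on separate channels
+requires a join of the inputs and a fork on the output.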
+
+what makes function blocks different from combinatorial circuits and, as we
+will see, there are greater depths to this than is indicated by the word “trans-
+parent” – in particular for function blocks that implicitly indicate completion
+(which is the case for circuits using dual-rail signals).
+The most general scenario is where a function block receives its operands
+on separate channels and produces its results on separate channels, figure 5.3.
+The use of several independent input and output channels implies a join on the
+input side and a fork on the output side, as illustrated in the figure. These can
+be implemented separately, as explained in the previous section, or they can be
+integrated into the function block circuitry. In what follows we will restrict the
+discussion to a scenario where all operands are provided on a single channel
+and where all results are provided on a single channel.
+We will first address the issue of handshake transparency and then review
+the fundamentals of ripple carry addition, in order to provide the necessary
+background for discussing the different implementation examples that follow.
+A good paper on the design of function blocks is [97].
+
+5.3.2
+Transparency to handshaking
+
+The general concepts are best illustrated by considering a 4-phase dual-rail
+scenario – function blocks for bundled data protocols can be understood as
+a special case. Figure 5.4(a) shows two handshake latches connected directly
+and figure 5.4(b) shows the same situation with a function block added between
+the two latches. The function block must be transparent to the handshaking.
+Informally this means that if observing the signals on the ports of the latches,
+one should see the same sequence of handshake signal transitions; the only
+difference should be some slow-down caused by the latency of the function
+block.
+A function block is obviously not allowed to produce a request on its output
+before receiving a request on its input; put the other way round, a request on the
+output of the function block should indicate that all of the inputs are valid and
+that all (relevant) internal signals and all output signals have been computed.
+
+
+
+[Figure 5.4: (a) two LATCH blocks connected by a channel with Data and Ack signals; (b) the same two LATCH blocks with a function block F between them, with the function block's input data and output data and the Ack signal labelled.]
+
+Figure 5.4.
+(a) Two latches connected directly by a handshake channel and (b) the same situation
+with a function block added between the latches. The handshaking as seen by the latches
+in the two situations should be the same, i.e. the function block must be designed such that it is
+transparent to the handshaking.
+
+(Here we are touching upon the principle of indication once again.) In 4-phase
+protocols a symmetric set of requirements apply for the return-to-zero part of
+the handshaking.
+Function blocks can be characterized as either strongly indicating or weakly
+indicating depending on how they behave with respect to this handshake trans-
+parency. The signalling that can be observed on the channel between the two
+
+[Figure 5.5: signal traces over time for the input data, the output data and the acknowledge signal (0/1). Each data trace alternates between "all valid" and "all empty"; the arrows between the traces are labelled (1), (2a), (2b), (3), (4a) and (4b) and correspond to the event orderings listed below.]
+
+(1) “All inputs become defined” $\preceq$ “Some outputs become defined”
+(2) “All outputs become defined” $\preceq$ “Some inputs become undefined”
+(3) “All inputs become undefined” $\preceq$ “Some outputs become undefined”
+(4) “All outputs become undefined” $\preceq$ “Some inputs become defined”
+
+Figure 5.5.
+Signal traces and event orderings for a strongly indicating function block.
+
+
+
+[Figure 5.6: signal traces over time for the input data, the output data and the acknowledge signal (0/1). Each data trace alternates between "all valid" and "all empty"; the arrows between the traces are labelled (1), (2), (3a), (3b), (4), (5), (6a) and (6b) and correspond to the event orderings listed below.]
+
+(1) “Some inputs become defined” $\preceq$ “Some outputs become defined”
+(2) “All inputs become defined” $\preceq$ “All outputs become defined”
+(3) “All outputs become defined” $\preceq$ “Some inputs become undefined”
+(4) “Some inputs become undefined” $\preceq$ “Some outputs become undefined”
+(5) “All inputs become undefined” $\preceq$ “All outputs become undefined”
+(6) “All outputs become undefined” $\preceq$ “Some inputs become defined”
+
+Figure 5.6.
+Signal traces and event orderings for a weakly indicating function block.
+
+latches in figure 5.4(a) was illustrated in figure 2.3 on page 13. We can illus-
+trate the handshaking for the situation in figure 5.4(b) in a similar way.
+
+A function block is strongly indicating, as illustrated in figure 5.5, if (1)
+it waits for all of its inputs to become valid before it starts to compute and
+produce valid outputs, and if (2) it waits for all of its inputs to become
+empty before it starts to produce empty outputs.
+
+A function block is weakly indicating, as illustrated in figure 5.6, if (1)
+it starts to compute and produce valid outputs as soon as possible, i.e.
+when some but not all input signals have become valid, and if (2) it
+starts to produce empty outputs as soon as possible, i.e. when some but
+not all input signals have become empty.
+
+For a weakly indicating function block to behave correctly, it is necessary
+to require that it never produces all valid outputs until after all inputs have
+become valid, and that it never produces all empty outputs until after all inputs
+have become empty. This behaviour is identical to Seitz’s weak conditions in
+[121]. In [121] Seitz further explains that it can be proved that if the individual
+components satisfy the weak conditions then any “valid combinatorial circuit
+
+
+
+structure” of function blocks also satisfies the weak conditions, i.e. that func-
+tion blocks may be combined to form larger function blocks. By “valid com-
+binatorial circuit structure” we mean a structure where no components have
+inputs or outputs left unconnected and where there are no feed-back signal
+paths. Strongly indicating function blocks have the same property – a “valid
+combinatorial circuit structure” of strongly indicating function blocks is itself
+a strongly indicating function block.
+Notice that both weakly and strongly indicating function blocks exhibit a
+hysteresis-like behaviour in the valid-to-empty and empty-to-valid transitions:
+(1) some/all outputs must remain valid until after some/all inputs have become
+empty, and (2) some/all outputs must remain empty until after some/all inputs
+have become valid. It is this hysteresis that ensures handshake transparency,
+and the implementation consequence is that one or more state holding circuits
+(normally in the form of C-elements) are needed.
+Finally, a word about the 4-phase bundled-data protocol. Since Req$\uparrow$ is
+equivalent to “all data signals are valid” and since Req$\downarrow$ is equivalent to “all
+data signals are empty,” a 4-phase bundled-data function block can be categorized
+as strongly indicating.
+As we will see in the following, strongly indicating function blocks have
+worst-case latency. To obtain actual case latency weakly indicating function
+blocks must be used. Before addressing possible function block implementa-
+tion styles for the different handshake protocols it is useful to review the basics
+of binary ripple-carry addition, the running example in the following sections.
+
+5.3.3
+Review of ripple-carry addition
+
+Figure 5.7 illustrates the implementation principle of a ripple-carry adder.
+A 1-bit full adder stage implements:
+
+$s = a \oplus b \oplus c$    (5.1)
+$d = ab + ac + bc$    (5.2)
+
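+[Figure 5.7: a chain of 1-bit full adder stages 1, ..., i, ..., n. Stage i has operand inputs $a_i$ and $b_i$, carry input $c_i$, carry output $d_i$ and sum output $s_i$; cin enters the first stage and cout leaves the last stage.]
+
+Figure 5.7.
+A ripple-carry adder. The carry output of one stage $d_i$ is connected to the carry
+input of the next stage $c_{i+1}$.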
+
+
+
+In many implementations inputs a and b are recoded as:
+
+$p = a \oplus b$    (“propagate” carry)    (5.3)
+$g = ab$    (“generate” carry)    (5.4)
+$k = \bar{a}\bar{b}$    (“kill” carry)    (5.5)
+
+and the output signals are computed as follows:
+
+$s = p \oplus c$    (5.6)
+$d = g + pc$    (5.7a)
+or alternatively
+$\bar{d} = k + p\bar{c}$    (5.7b)
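+
+As a short check that the recoded carry of equation 5.7a agrees with the majority form of equation 5.2, the derivation below expands the propagate signal as $p = a \oplus b = a\bar{b} + \bar{a}b$ and uses $g = ab$. It is ordinary Boolean algebra added for illustration.
+
+```latex
+\begin{align*}
+g + pc &= ab + (a\bar{b} + \bar{a}b)\,c \\
+       &= a\,(b + \bar{b}c) + \bar{a}bc \\
+       &= ab + ac + \bar{a}bc \\
+       &= ab + ac + bc
+\end{align*}
+```
+
+The last step holds because $bc = abc + \bar{a}bc$ and the term $abc$ is already covered by $ab$.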
+
+For a ripple-carry adder, the worst case critical path is a carry rippling across
+the entire adder. If the latency of a 1-bit full adder is $t_{add}$ the worst case latency
+of an N-bit adder is $N \cdot t_{add}$. This is a very rare situation and in general the
+longest carry ripple during a computation is much shorter. Assuming random
+and uncorrelated operands the average latency is $\log(N) \cdot t_{add}$ and, if numerically
+small operands occur more frequently, the average latency is even less.
+Using normal Boolean signals (as in the bundled-data protocols) there is no
+way to know when the computation has finished and the resulting performance
+is thus worst-case.
+By using dual-rail carry signals (d.t, d.f) it is possible to design circuits
+that indicate completion as part of the computation and thus achieve actual
+case latency. The crux is that a dual-rail carry signal, d, conveys one of the
+following 3 messages:
+
+(d.t, d.f) = (0,0) = Empty: “The carry has not been computed yet” (possibly because it depends on c)
+(d.t, d.f) = (1,0) = True: “The carry is 1”
+(d.t, d.f) = (0,1) = False: “The carry is 0”
+
+Consequently it is possible for a 1-bit adder to output a valid carry without
+waiting for the incoming carry if its inputs make this possible ($a = b = 0$ or
+$a = b = 1$). This idea was first put forward in 1955 in a paper by Gilchrist [52].
+The same idea is explained in [62, pp. 75-78] and in [121].
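+
+One way to write this down (a sketch, not taken from the source; it assumes dual-rail operands $a$, $b$ and $c$ and considers only the empty-to-valid phase) is to apply the generate, kill and propagate terms of equations 5.3 to 5.7 to each rail of the carry:
+
+\[
+\begin{aligned}
+d.t &= a.t \cdot b.t + (a.t \cdot b.f + a.f \cdot b.t) \cdot c.t \\
+d.f &= a.f \cdot b.f + (a.t \cdot b.f + a.f \cdot b.t) \cdot c.f
+\end{aligned}
+\]
+
+With $a = b = 1$ the term $a.t \cdot b.t$ alone sets $d.t$, and with $a = b = 0$ the term $a.f \cdot b.f$ alone sets $d.f$; only in the propagate case does the output have to wait for $c$ to become valid.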
+
+5.4.
+Bundled-data function blocks
+
+5.4.1
+Using matched delays
+
+A bundled-data implementation of the adder in figure 5.7 is shown in fig-
+ure 5.8. It is composed of a traditional combinatorial circuit adder and a match-
+ing delay element. The delay element provides a constant delay that matches
+the worst case latency of the combinatorial adder. This includes the worst case
+
+
+66
+Part I: Asynchronous circuit design – A tutorial
+
+Figure 5.8.
+A 4-phase bundled-data implementation of the N-bit handshake adder from figure 5.7.
+(Diagram not reproduced: a combinational circuit with inputs a[n:1], b[n:1], c and outputs s[n:1], d, and a matched delay between Req-in and Req-out; the acknowledge signals are Ack-in and Ack-out.)
+
+critical path in the circuit – a carry rippling across the entire adder – as well as
+the worst case operating conditions. For reliable operation some safety margin
+is needed.
+In addition to the combinatorial circuit itself, the delay element represents
+a design challenge for the following reasons: to a first order the delay element
+will track delay variations that are due to the fabrication process spread as well
+as variations in temperature and supply voltage. On the other hand, wire de-
+lays can be significant and they are often beyond the designer’s control. Some
+design policy for matched delays is obviously needed. In a full custom de-
+sign environment one may use a dummy circuit with identical layout but with
+weaker transistors. In a standard cell automatic place and route environment
+one will have to accept a fairly large safety margin or do post-layout timing
+analysis and trimming of the delays. The latter sounds tedious but it is similar
+to the procedure used in synchronous design where setup and hold times are
+checked and delays trimmed after layout.
+In a 4-phase bundled-data design an asymmetric delay element may be
+preferable from a performance point of view, in order to perform the return-to-
+zero part of the handshaking as quickly as possible. Another issue is the power
+consumption of the delay element. In the ARISC processor design reported in
+[23] the delay elements consumed 10 % of the total power.
+
+5.4.2
+Delay selection
+
+In [105] Nowick proposed a scheme called “speculative completion”. The
+basic principle is illustrated in figure 5.9. In addition to the desired function
+some additional circuitry is added that selects among several matched delays.
+The estimate must be conservative, i.e. on the safe side. The estimation can
+be based on the input signals and/or on some internal signals in the circuit that
+implements the desired function.
+For an N-bit ripple-carry adder the propagate signals (c.f. equation 5.3)
+that form the individual 1-bit full adders (c.f. figure 5.7) may be used for the
+estimation. As an example of the idea consider a 16-bit adder. If $p_8 = 0$ the
+
+
+
+Figure 5.9.
+The basic idea of “speculative completion”. (Diagram not reproduced: a function block from Inputs to Outputs, an Estimate block, and a MUX that selects among small, medium and large matched delays between Req_in and Req_out.)
+
+longest carry ripple can be no longer than 8 stages, and if $p_{12} = p_8 = p_4 = 0$
+the longest carry ripple can be no longer than 4 stages. Based on such simple
+estimates a sufficiently large matched delay is selected. Again, if a 4-phase
+protocol is used, asymmetric delay elements are preferable from a performance
+point of view.
+To the designer the trade-off is between an aggressive estimate with a large
+circuit overhead (area and power) or a less aggressive estimate with less overhead.
+For more details on the implementation and the attainable performance
+gains the reader is referred to [105, 107].
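+
+The selection rule for the 16-bit example can be summarized as follows. This is a sketch only: $t_{small}$, $t_{medium}$ and $t_{large}$ are hypothetical names for matched delays that cover carry ripples of at most 4, 8 and 16 stages, and an actual design may use different estimates.
+
+\[
+t_{match} =
+\begin{cases}
+t_{small} & \text{if } p_{12} = p_8 = p_4 = 0 \\
+t_{medium} & \text{else if } p_8 = 0 \\
+t_{large} & \text{otherwise}
+\end{cases}
+\]
+
+Each case is conservative in the sense required above: the selected delay is never shorter than the longest carry ripple that the observed propagate signals allow.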
+
+5.5.
+Dual-rail function blocks
+
+5.5.1
+Delay insensitive minterm synthesis (DIMS)
+
+In chapter 2 (page 22 and figure 2.14) we explained the implementation of
+an AND gate for dual-rail signals. Using the same basic topology it is possible
+to implement other simple gates such as OR, EXOR, etc. An inverter involves
+no active circuitry as it is just a swap of the two wires.
+Arbitrary functions can be implemented by combining gates in exactly the
+same way as when one designs combinatorial circuits for a synchronous cir-
+cuit. The handshaking is implicitly taken care of and can be ignored when
+composing gates and implementing Boolean functions. This has the important
+implication that existing logic synthesis techniques and tools may be used, the
+only difference is that the basic gates are implemented differently.
+The dual-rail AND gate in figure 2.14 is obviously rather inefficient: 4 C-
+elements and 1 OR gate totaling approximately 30 transistors – a factor five
+greater than a normal AND gate whose implementation requires only 6 tran-
+sistors. By implementing larger functions the overhead can be reduced. To
+illustrate this figure 5.10(b)-(c) shows the implementation of a 1-bit full adder.
+We will discuss the circuit in figure 5.10(d) shortly.
+
+
+
+Figure 5.10.
+A 4-phase dual-rail full-adder: (a) Symbol, (b) truth table, (c) DIMS implementation
+and (d) an optimization that makes the full adder weakly indicating.
+(Diagrams not reproduced: the symbol ADD has dual-rail inputs a, b, c and dual-rail outputs s, d; the truth table lists the empty input, for which all outputs are 0, the intermediate codewords, marked NO CHANGE, and the eight valid inputs, with the “Kill” and “Generate” rows marked; the circuits in (c) and (d) are built from C-elements and OR gates.)
+
+The PLA-like structure of the circuit in figure 5.10(c) illustrates a general
+principle for implementing arbitrary Boolean functions. In [124] we called this
+approach DIMS – Delay-Insensitive Minterm Synthesis – because the circuits
+are delay-insensitive and because the C-elements in the circuits generate all
+minterms of the input variables. The truth tables have 3 groups of rows speci-
+fying the output when the input is: (1) the empty codeword to which the circuit
+responds by setting the output empty, (2) an intermediate codeword which does
+not affect the output, or (3) a valid codeword to which the circuit responds by
+setting the output to the proper valid value.
+The fundamental ideas explained above all go back to David Muller’s work
+in the late 1950s and early 1960s [93, 92]. While [93] develops the funda-
+mental theorem for the design of speed-independent circuits, [92] is a more
+practical introduction including a design example: a bit-serial multiplier using
+latches and gates as explained above.
+Referring to section 5.3.2, the DIMS circuits as explained here can be cat-
+egorized as strongly indicating, and hence they exhibit worst case latency. In
+
+
+
+an N-bit ripple-carry adder the empty-to-valid and valid-to-empty transitions
+will ripple in strict sequence from the least significant full adder to the most
+significant one.
+If we change the full-adder design slightly as illustrated in figure 5.10(d) a
+valid d may be produced before the c input is valid (“kill” or “generate”), and
+an N-bit ripple-carry adder built from such full adders will exhibit actual-case
+latency – the circuits are weakly indicating function blocks.
+The designs in figure 5.10(c) and 5.10(d), and ripple-carry adders built from
+these full adders, are all symmetric in the sense that the latency of propagating
+an empty value is the same as the latency of propagating the preceding valid
+value. This may be undesirable. Later in section 5.5.4 we will introduce an
+elegant design that propagates empty values in constant time (with the latency
+of 2 full adder cells).
+
+5.5.2
+Null Convention Logic
+
+The C-elements and OR gates from the previous sections can be seen as n-
+of-n and 1-of-n threshold gates with hysteresis, figure 5.11. By using arbitrary
+m-of-n threshold gates with hysteresis – an idea proposed by Theseus Logic,
+Inc., [39] – it is possible to reduce the implementation complexity. An m-
+of-n threshold gate with hysteresis will set its output high when any m inputs
+have gone high and it will set its output low when all its inputs are low. This
+elegant circuit implementation idea is the key element in Theseus Logic’s Null
+Convention Logic. At the higher levels of design NCL is no different from
+the data-flow view presented in chapter 3 and NCL has great similarities to the
+circuit design styles presented in [92, 122, 124, 97]. Figure 5.11 shows that
+
+Figure 5.11.
+NCL gates: m-of-n threshold gates with hysteresis ($1 \le m \le n$). (Diagram not reproduced: an array of gate symbols, each labelled with its threshold, in which the C-elements, the OR gates and the inverter are identified.)
+
+
+
+Figure 5.12.
+A full adder using NCL gates. (Diagram not reproduced: four threshold gates with thresholds 2, 2, 3 and 3, dual-rail inputs a, b, c and dual-rail outputs s, d.)
+
+OR gates and C-elements can be seen as special cases in the world of threshold
+gates. The digit inside a gate symbol is the threshold of the gate. Figure 5.12
+shows the implementation of a dual-rail full adder using NCL threshold gates.
+The circuit is weakly indicating.
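+
+The behaviour of such a gate can be stated compactly. As a sketch in notation introduced here (the inputs $x_1, \ldots, x_n$, the current output $z$ and the next output $z'$ are not names used in the source), an m-of-n threshold gate with hysteresis behaves as:
+
+\[
+z' =
+\begin{cases}
+1 & \text{if } \sum_{i=1}^{n} x_i \ge m \\
+0 & \text{if } \sum_{i=1}^{n} x_i = 0 \\
+z & \text{otherwise}
+\end{cases}
+\]
+
+For $m = n$ this is the C-element. For $m = 1$ the third case never occurs and the gate reduces to an OR gate.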
+
+5.5.3
+Transistor-level CMOS implementations
+
+The last two adder designs we will introduce are based on CMOS transistor-
+level implementations using dual-rail signals. Dual-rail signals are essentially
+what are produced by precharged differential logic circuits that are used in
+memory structures and in logic families like DCVSL, figure 5.13 [151, 55].
+In a bundled-data design the precharge signal can be the request signal on
+the input channel to the function block. In a dual-rail design the precharge
+p-type transistors may be replaced by transistor networks that detect when all
+
+Figure 5.13.
+A precharged differential CMOS combinatorial circuit. By adding the cross-coupled
+p-type transistors labeled “A” or the (weak) feedback-inverters labeled “B” the circuit
+becomes (pseudo)static. (Diagram not reproduced: an N transistor network driven by Inputs, precharge transistors controlled by Precharge, and outputs Out.t and Out.f.)
+
+
+
+Figure 5.14.
+Transistor-level implementation of the carry signal for the strongly indicating full
+adder from figure 5.10(c). (Schematic not reproduced: transistor networks gated by a.t, a.f, b.t, b.f, c.t and c.f, producing d.t and d.f.)
+
+inputs are empty. Similarly the pull down n-type transistor signal paths should
+only conduct when the required input signals are valid.
+Transistor implementations of the DIMS and NCL gates introduced above
+are thus straightforward. Figure 5.14 shows a transistor-level implementation
+of a carry circuit for a strongly-indicating full adder. In the pull-down circuit
+each column of transistors corresponds to a minterm. In general when imple-
+menting DCVSL gates it is possible to share transistors in the two pull-down
+networks, but in this particular case it has not been done in order to illustrate
+better the relationship between the transistor implementation and the gate im-
+plementation in figure 5.10(c).
+The high stacks of p-type transistors are obviously undesirable. They may
+be replaced by a single transistor controlled by an “all empty” signal generated
+elsewhere. Finally, we mention that the weakly-indicating full adder design
+presented in the next section includes optimizations that minimize the p-type
+transistor stacks.
+
+5.5.4
+Martin’s adder
+
+In [85] Martin addresses the design of dual-rail function blocks in general
+and he illustrates the ideas using a very elegant dual-rail ripple-carry adder.
+The adder has a small transistor count, it exhibits actual case latency when
+adding valid data, and it propagates empty values in constant time – the adder
+represents the ultimate in the design of weakly indicating function blocks.
+Looking at the weakly-indicating transistor-level carry circuit in figure 5.14
+we see that d remains valid until a, b, and c are all empty. If we designed a
+similar sum circuit its output s would also remain valid until a, b, and c are all
+empty. The weak conditions in figure 5.6 only require that one output remains
+
+
+
+Figure 5.15.
+(a) A 3-stage ripple-carry adder and graphs illustrating how valid data (b) and
+empty data (c) propagate through the circuit (Martin [85]). (Diagrams not reproduced: the panels are titled “Ripple-carry adder”, “Validity indication” and “Empty indication”; the graph nodes are the signals $a_i$, $b_i$, $c_i$, $d_i$ and $s_i$ for stages 1 to 3, and a legend distinguishes Kill/Generate from Propagate.)
+
+valid until all inputs have become invalid. Hence it is allowed to split the
+indication of a, b and c being empty among the carry and the sum circuits.
+In [85] Martin uses some very illustrative directed graphs to express how
+the output signals indicate when input signals and internal signals are valid or
+empty. The nodes in the graphs are the signals in the circuit and the directed
+edges represent indication dependencies. Solid edges represent guaranteed de-
+pendencies and dashed edges represent possible dependencies. Figure 5.15(a)
+shows three full adder stages of a ripple-carry adder, and figures 5.15(b) and
+5.15(c) show how valid and empty inputs respectively propagate through the
+circuit.
+The propagation and indication of valid values is similar to what we dis-
+cussed above in the other adder designs, but the propagation and indication of
+empty values is different and exhibits constant latency. When the outputs d3,
+s3, s2, and s1 are all valid this indicates that all input signals and all inter-
+nal carry signals are valid. Similarly when the outputs d3, s3, s2, and s1 are
+all empty this indicates that all input signals and all internal carry signals are
+empty – the ripple-carry adder satisfies the weak conditions.
+
+
+
+Figure 5.16.
+The CMOS transistor implementation of Martin’s adder [85, Fig. 3]. (Schematic not reproduced: transistor networks gated by a.t, a.f, b.t, b.f, c.t and c.f, producing s.t, s.f, d.t and d.f.)
+
+The corresponding transistor implementation of a full adder is shown in
+figure 5.16. It uses 34 transistors, which is comparable to a traditional combi-
+natorial circuit implementation.
+The principles explained above apply to the design of function blocks in
+general. “Valid/empty indication (or acknowledgement), dependency graphs”
+as shown in figure 5.15 are a very useful technique for understanding and de-
+signing circuits with low latency and the weakest possible indication.
+
+5.6.
+Hybrid function blocks
+
+The final adder we will present has 4-phase bundled-data input and output
+channels and a dual-rail carry chain. The design exhibits characteristics similar
+to Martin’s dual-rail adder presented in the previous section: actual case
+latency when propagating valid data, constant latency when propagating empty
+data, and a moderate transistor count. The basic structure of this hybrid adder
+is shown in figure 5.17. Each full adder is composed of a carry circuit and
+a sum circuit. Figure 5.18(a)-(b) shows precharged CMOS implementations
+of the two circuits. The idea is that the circuits precharge when $Req_{in} = 0$,
+evaluate when $Req_{in} = 1$, detect when all carry signals are valid and use this
+information to indicate completion, i.e. $Req_{out}\uparrow$. If the latency of the completion
+detector does not exceed the latency in the sum circuit in a full adder then
+a matched delay element is needed as indicated in figure 5.17.
+The size and latency of the completion detector in figure 5.17 grow with the
+size of the adder, and in wide adders the latency of the completion detector may
+significantly exceed the latency of the sum circuit. An interesting optimization
+that reduces the completion detector overhead – possibly at the expense of
+a small increase in overall latency ($Req_{in}\uparrow$ to $Req_{out}\uparrow$) – is to use a mix of
+strongly and weakly indicating function blocks [101]. Following the naming
+convention established in figure 5.7 on page 64 we could make, for example,
+
+
+Figure 5.17.
+Block diagram of a hybrid adder with 4-phase bundled-data input and output
+channels and with an internal dual-rail carry chain. (Diagram not reproduced: carry and sum circuits for stages 1, i and n with dual-rail carries $c_i$ and $d_i$, a completion detector and a C-element producing Req_out; Req_in precharges and evaluates all carry and sum circuits.)
+
+Figure 5.18.
+The CMOS transistor implementation of a full adder for the hybrid adder in
+figure 5.17: (a) a weakly indicating carry circuit, (b) the sum circuit and (c) a strongly indicating
+carry circuit. (Schematics not reproduced: precharged networks controlled by Req_in and gated by a, b, c.t and c.f, producing d.t, d.f and s.)
+
+
+
+adders 1, 4, 7, ... weakly indicating and all other adders strongly indicating. In
+this case only the carry signals out of stages 3, 6, 9, ... need to be checked to
+detect completion. For $i = 3, 6, 9, \ldots$, $d_i$ indicates the completion of $d_{i-1}$ and
+$d_{i-2}$ as well. Many other schemes for mixing strongly and weakly indicating
+full adders are possible. The particular scheme presented in [101] exploited the
+fact that typical-case operands (sampled audio) are numerically small values,
+and the design detects completion from a single carry signal.
+
+Summary – function block design
+
+The previous sections have explained the basics of how to implement func-
+tion blocks and have illustrated this using a variety of ripple-carry adders. The
+main points were “transparency to handshaking” and “actual case latency”
+through the use of weakly-indicating components.
+Finally, a word of warning to put things into the right perspective: to some
+extent the ripple-carry adders explained above over-sell the advantages of aver-
+age-case performance. It is easy to get carried away with elegant circuit de-
+signs but it may not be particularly relevant at the system level:
+
+In many systems the worst-case latency of a ripple-carry adder may sim-
+ply not be acceptable.
+
+In a system with many concurrently active components that synchronize
+and exchange data at high rates, the slowest component at any given time
+tends to dominate system performance; the average-case performance of
+a system may not be nearly as good as the average-case latency of its
+individual components.
+
+In many cases addition is only one part of a more complex compound
+arithmetic operation. For example, the final design of the asynchronous
+filter bank presented in [103] did not use the ideas presented above. In-
+stead we used entirely strongly-indicating full adders because this al-
+lowed an efficient two-dimensional precharged compound add-multiply-
+accumulate unit to be implemented.
+
+5.7.
+MUX and DEMUX
+
+Now that the principles of function block design have been covered we are
+ready to address the implementation of the MUX and DEMUX components,
+c.f. figure 3.3 on page 32. Let’s recapitulate their function: a MUX will syn-
+chronize the control channel and relay the data and the handshaking of the
+selected input channel to the output data channel. The other input channel is
+ignored (and may have a request pending). Similarly a DEMUX will synchro-
+nize the control and the data input channels and steer the input to the selected
+output channel. The other output channel is passive and in the idle state.
+
+
+
+If we consider only the “active” channels then the MUX and the DEMUX
+can be understood and designed as function blocks – they must be transparent
+to the handshaking in the same way as function blocks. The control chan-
+nel and the (selected) input data channel are first joined and then an output is
+produced. Since no data transfer can take place without the control channel
+and the (selected) input data channel both being active, the implementations
+become strongly indicating function blocks.
+Let’s consider implementations using 4-phase protocols. The simplest and
+most intuitive designs use a dual-rail control channel. Figure 5.19 shows the
+implementation of the MUX and the DEMUX using the 4-phase bundled-data
+
+Figure 5.19.
+Implementation of MUX and DEMUX. The input and output data channels x,
+y, and z use the 4-phase bundled-data protocol and the control channel ctl uses the 4-phase
+dual-rail protocol (in order to simplify the design). (Diagrams not reproduced: for each component the symbol and a 4-phase bundled-data implementation, with n-bit data paths, request and acknowledge signals for x, y and z, the control rails ctl.f and ctl.t with ctl_ack, and C-elements marked “Join”.)
+
+
+
+protocol on the input and output data channels and the 4-phase dual-rail protocol
+on the control channel. In both circuits ctl.t and ctl.f can be understood as
+two mutually exclusive requests that select between the two alternative input-to-output
+data transfers, and in both cases ctl.t and ctl.f are joined with the
+relevant input requests (at the C-elements marked “Join”). The rest of the
+MUX implementation is then similar to the 4-phase bundled-data MERGE in
+figure 5.1 on page 59. The rest of the DEMUX should be self explanatory; the
+handshaking on the two output ports is mutually exclusive and the acknowledge
+signals $y_{ack}$ and $z_{ack}$ are ORed to form $x_{ack} = ctl_{ack}$.
+All 4-phase dual-rail implementations of the MUX and DEMUX compo-
+nents are rather similar, and all 4-phase bundled-data implementations may be
+obtained by adding 4-phase bundled-data to 4-phase dual-rail protocol con-
+version circuits on the control input. At the end of chapter 6, an all 4-phase
+bundled-data MUX will be one of the examples we use to illustrate the design
+of speed-independent control circuits.
+
+5.8.
+Mutual exclusion, arbitration and metastability
+
+5.8.1
+Mutual exclusion
+
+Some handshake components (including MERGE) require that the commu-
+nication along several (input) channels is mutually exclusive. For the simple
+static data-flow circuit structures we have considered so far this has been the
+case, but in general one may encounter situations where a resource is shared
+between several independent parties/processes.
+The basic circuit needed to deal with such situations is a mutual exclusion
+element (MUTEX), figure 5.20 (we will explain the implementation shortly).
+The input signals R1 and R2 are two requests that originate from two inde-
+pendent sources, and the task of the MUTEX is to pass these inputs to the
+corresponding outputs G1 and G2 in such a way that at most one output is ac-
+tive at any given time. If only one input request arrives the operation is trivial.
+If one input request arrives well before the other, the latter request is blocked
+until the first request is de-asserted. The problem arises when both input sig-
+
+Figure 5.20.
+The mutual exclusion element: symbol and possible implementation. (Diagram not reproduced: the MUTEX symbol with request inputs R1, R2 and grant outputs G1, G2; the implementation consists of a bistable with internal nodes x1 and x2, followed by a metastability filter.)
+
+
+
+nals are asserted at the same time. Then the MUTEX is required to make an
+arbitrary decision, and this is where metastability enters the picture.
+The problem is exactly the same as when a synchronous circuit is exposed
+to an asynchronous input signal (one that does not satisfy set-up and hold time
+requirements). For a clocked flip-flop that is used to synchronize an asynchronous
+input signal, the question is whether the data signal made its transition
+before or after the active edge of the clock. As with the MUTEX the
+question is again which signal transition occurred first, and as with the MUTEX
+a random decision is needed if the transition of the data signal coincides
+with the active edge of the clock signal.
+The fundamental problem in a MUTEX and in a synchronizer flip-flop is
+that we are dealing with a bi-stable circuit that receives requests to enter each
+of its two stable states at the same time. This will cause the circuit to enter a
+metastable state in which it may stay for an unbounded length of time before
+randomly settling in one of its stable states. The problem of synchronization
+is covered in most textbooks on digital design and VLSI, and the analysis of
+metastability that is presented in these textbooks applies to our MUTEX com-
+ponent as well. A selection of references is: [95, sect. 9.4] [53, sect. 5.4 and
+6.5] [151, sect. 5.5.7] [115, sect. 6.2.2 and 9.4-5] [150, sect. 8.9].
+For the synchronous designer the problem is that metastability may per-
+sist beyond the time interval that has been allocated to recover from potential
+metastability. It is simply not possible to obtain a decision within a bounded
+length of time. The asynchronous designer, on the other hand, will eventually
+obtain a decision, but there is no upper limit on the time he will have to wait
+for the answer. In [22] the terms “time safe” and “value safe” are introduced
+to denote and classify these two situations.
+A possible implementation of the MUTEX, as shown in figure 5.20, in-
+volves a pair of cross coupled NAND gates and a metastability filter. The cross
+coupled NAND gates enable one input to block the other. If both inputs are
+asserted at the same time, the circuit becomes metastable with both signals
+x1 and x2 halfway between supply and ground. The metastability filter pre-
+vents these undefined values from propagating to the outputs; G1 and G2 are
+both kept low until signals x1 and x2 differ by more than a transistor threshold
+voltage.
+The metastability filter in figure 5.20 is a CMOS transistor-level implemen-
+tation from [83]. An NMOS predecessor of this circuit appeared in [121].
+Gate-level implementations are also possible: the metastability filter can be
+implemented using two buffers whose logic thresholds have been made partic-
+ularly high (or low) by “trimming” the strengths of the pull-up and pull-down
+transistor paths ([151, section 2.3]). For example, a 4-input NAND gate with
+all its inputs tied together implements a buffer with a particularly high logic
+
+
+
+threshold. The use of this idea in the implementation of mutual exclusion ele-
+ments is described in [6, 139].
+
+5.8.2
+Arbitration
+
+The MUTEX can be used to build a handshake arbiter that can be used to
+control access to a resource that is shared between several autonomous inde-
+pendent parties. One possible implementation is shown in figure 5.21.
+
+Figure 5.21.
+The handshake arbiter: symbol and possible implementation. (Diagram not reproduced: the ARBITER symbol with channels (R1, A1), (R2, A2) and (R0, A0); the implementation consists of a MUTEX with outputs G1 and G2, two AND gates producing y1 and y2, two C-elements and an OR gate, with the interfaces a’–aa’ and b’–bb’ marked.)
+
+The MUTEX ensures that signals G1 and G2 at the a’–aa’ interface are
+mutually exclusive. Following the MUTEX are two AND gates whose purpose
+is to ensure that handshakes on the (y1, A1) and (y2, A2) channels at the b’–bb’
+interface are mutually exclusive: y2 can only go high if A1 is low and
+y1 can only go high if signal A2 is low. In this way, if handshaking is in
+progress along one channel, it blocks handshaking on the other channel. As
+handshaking along channels (y1, A1) and (y2, A2) is mutually exclusive the
+rest of the arbiter is simply a MERGE, cf. figure 5.1 on page 59. If data needs
+to be passed to the shared resource a multiplexer is needed in exactly the same
+way as in the MERGE. The multiplexer may be controlled by signals y1 and/or
+y2.
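+
+Read as Boolean equations, the two AND gates described above would be as follows. This is a sketch inferred from the text rather than from the schematic, with an overbar denoting an inverted acknowledge input:
+
+\[
+y_1 = G_1 \cdot \bar{A_2}, \qquad y_2 = G_2 \cdot \bar{A_1}
+\]
+
+While a handshake on the (y1, A1) channel is still in progress A1 is high, so y2 is held low even if the MUTEX has already granted G2.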
+
+5.8.3
+Probability of metastability
+
+Let us finally take a quantitative look at metastability: if $P(\mathrm{met}_t)$ denotes
+the probability of the MUTEX being metastable for a period of time of $t$ or
+longer (within an observation interval of one second), and if this situation is
+considered a failure, then we may calculate the mean time between failure as:
+
+\[ \mathrm{MTBF} = \frac{1}{P(\mathrm{met}_t)} \tag{5.8} \]
+
+The probability $P(\mathrm{met}_t)$ may be calculated as:
+
+\[ P(\mathrm{met}_t) = P(\mathrm{met}_t \mid \mathrm{met}_{t=0}) \cdot P(\mathrm{met}_{t=0}) \tag{5.9} \]
+
+
+
+where:
+
+$P(\mathrm{met}_t \mid \mathrm{met}_{t=0})$ is the probability that the MUTEX is still metastable at
+time $t$ given that it was metastable at time $t = 0$.
+
+$P(\mathrm{met}_{t=0})$ is the probability that the MUTEX will enter metastability
+within a given observation interval.
+
+The probability $P(\mathrm{met}_{t=0})$ can be calculated as follows: the MUTEX will go
+metastable if its inputs R1 and R2 are exposed to transitions that occur almost
+simultaneously, i.e. within some small time window $\Delta$. If we assume that
+the two input signals are uncorrelated and that they have average switching
+frequencies $f_{R1}$ and $f_{R2}$ respectively, then:
+
+\[ P(\mathrm{met}_{t=0}) = \Delta \cdot f_{R1} \cdot f_{R2} \tag{5.10} \]
+
+which can be understood as follows: within an observation interval of one
+second the input signal R2 makes $1 \cdot f_{R2}$ attempts at hitting one of the $1 \cdot f_{R1}$
+time intervals of duration $\Delta$ where the MUTEX is vulnerable to metastability.
+The probability $P(\mathrm{met}_t \mid \mathrm{met}_{t=0})$ is determined as:
+
+\[ P(\mathrm{met}_t \mid \mathrm{met}_{t=0}) = e^{-t/\tau} \tag{5.11} \]
+
+where τ expresses the ability of the MUTEX to exit the metastable state spon-
+taneously. This equation can be explained in two different ways and experi-
+mental results have confirmed its correctness. One explanation is that the cross
+coupled NAND gates have no memory of how long they have been metastable,
+and that the only probability distribution that is “memoryless” is an exponen-
+tial distribution. Another explanation is that a small-signal model of the cross-
+coupled NAND gates at the metastable point has a single dominating pole.
+Combining equations 5.8–5.11 we obtain:
+
+\[ \mathrm{MTBF} = \frac{e^{t/\tau}}{\Delta \cdot f_{R1} \cdot f_{R2}} \tag{5.12} \]
+
+Experiments and simulations have shown that this equation is reasonably
+accurate provided that $t$ is not very small, and experiments or simulations may
+be used to determine the two parameters $\Delta$ and $\tau$. Representative values for
+good circuit designs implemented in a 0.25 µm CMOS process are $\Delta = 30$ ps
+and $\tau = 25$ ps.
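+
+As a worked illustration of equation 5.12, take the representative values of 30 ps for $\Delta$ and 25 ps for $\tau$. This is a sketch: the switching frequencies $f_{R1} = f_{R2} = 100$ MHz and the two waiting times $t$ are assumed values chosen here, not figures from the source.
+
+\[
+\begin{aligned}
+\Delta \cdot f_{R1} \cdot f_{R2} &= 30 \cdot 10^{-12}\,\mathrm{s} \cdot \left(10^{8}\,\mathrm{s}^{-1}\right)^{2} = 3 \cdot 10^{5}\,\mathrm{s}^{-1} \\
+t = 0.5\,\mathrm{ns}: \quad \mathrm{MTBF} &= \frac{e^{20}}{3 \cdot 10^{5}\,\mathrm{s}^{-1}} \approx \frac{4.9 \cdot 10^{8}}{3 \cdot 10^{5}}\,\mathrm{s} \approx 1.6 \cdot 10^{3}\,\mathrm{s} \\
+t = 1\,\mathrm{ns}: \quad \mathrm{MTBF} &= \frac{e^{40}}{3 \cdot 10^{5}\,\mathrm{s}^{-1}} \approx \frac{2.4 \cdot 10^{17}}{3 \cdot 10^{5}}\,\mathrm{s} \approx 7.8 \cdot 10^{11}\,\mathrm{s}
+\end{aligned}
+\]
+
+Under these assumptions, doubling the allowed settling time from 0.5 ns to 1 ns raises the mean time between failures from roughly 27 minutes to roughly 25 000 years, which illustrates how strongly the exponential term dominates.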
+
+5.9.
+Summary
+
+This chapter addressed the implementation of the various handshake components
+(latch, fork, join, merge, function blocks, mux, demux, mutex and
+arbiter). A significant part of the material addressed principles and techniques
+for implementing function blocks.
+
+
+Chapter 6
+
+SPEED-INDEPENDENT CONTROL CIRCUITS
+
+This chapter provides an introduction to the design of asynchronous sequen-
+tial circuits and explains in detail one well-developed specification and synthe-
+sis method: the synthesis of speed-independent control circuits from signal
+transition graph specifications.
+
+6.1.
+Introduction
+
+Over time many different formalisms and theories have been proposed for
+the design of asynchronous control circuits (e.g. sequential circuits or state
+machines). The multitude of approaches arises from the combination of: (a)
+different specification formalisms, (b) different assumptions about delay mod-
+els for gates and wires, and (c) different assumptions about the interaction
+between the circuit and its environment. Full coverage of the topic is far be-
+yond the scope of this book. Instead we will first present some of the basic
+assumptions and characteristics of the various design methods and give point-
+ers to relevant literature and then we will explain in detail one method: the
+design of speed-independent circuits from signal transition graphs – a method
+that is supported by a well-developed public domain tool, Petrify.
+A good starting point for further reading is a book by Myers [95]. It provides
+in-depth coverage of the various formalisms, methods, and theories for the
+design of asynchronous sequential circuits and it provides a comprehensive
+list of references.
+
+6.1.1
+Asynchronous sequential circuits
+
+To start the discussion figure 6.1 shows a generic synchronous sequential
+circuit and two alternative asynchronous control circuits: a Huffman style fun-
+damental mode circuit with buffers (delay elements) in the feedback signals,
+and a Muller style input-output mode circuit with wires in the feedback path.
+The synchronous circuit is composed of a set of registers holding the current
+state and a combinational logic circuit that computes the output signals and the
+next state signals. When the clock ticks the next state signals are copied into the
+registers thus becoming the current state. Reliable operation only requires that
+
+Part I: Asynchronous circuit design – A tutorial
+
+[Figure 6.1 panel labels: Synchronous (Clock, Current state, Next state); Asynchronous Huffman style fundamental mode; Asynchronous Muller style input-output mode. Each panel contains a Logic block with Inputs and Outputs.]
+
+Figure 6.1.
+(a) A synchronous sequential circuit. (b) A Huffman style asynchronous sequential
+circuit with buffers in the feedback path, and (c) a Muller style asynchronous sequential circuit
+with wires in the feedback path.
+
+the next state output signals from the combinational logic circuit are stable in a
+time window around the rising edge of the clock, an interval that is defined by
+the setup and hold time parameters of the register. Between two clock ticks the
+combinational logic circuit is allowed to produce signals that exhibit hazards.
+The only thing that matters is that the signals are ready and stable when the
+clock ticks.
+In an asynchronous circuit there is no clock and all signals have to be valid
+at all times. This implies that at least the output signals that are seen by the
+environment must be free from all hazards. To achieve this, it is sometimes
+necessary to avoid hazards on internal signals as well. This is why the syn-
+thesis of asynchronous sequential circuits is difficult. Because it is difficult
+researchers have proposed different methods that are based on different (sim-
+plifying) assumptions.
+
+6.1.2
+Hazards
+
+For the circuit designer a hazard is an unwanted glitch on a signal. Fig-
+ure 6.2 shows four possible hazards that may be observed. A circuit that is in
+a stable state does not spontaneously produce a hazard – hazards are related
+to the dynamic operation of a circuit. This again relates to the dynamics of
+the input signals as well as the delays in the gates and wires in the circuit. A
+discussion of hazards is therefore not possible without stating precisely which
+delay model is being used and what assumptions are made about the interaction
+between the circuit and its environment. There are greater theoretical depths
+in this area than one might think at a first glance.
+Gates are normally assumed to have delays. In section 2.5.3 we also dis-
+cussed wire delays, and in particular the implications of having different delays
+in different branches of a forking wire. In addition to gate and wire delays it is
+also necessary to specify which delay model is being used.
+
+
+
+[Figure 6.2 panels: Static-1 hazard; Static-0 hazard; Dynamic-10 hazard; Dynamic-01 hazard. Each panel compares the desired signal with the actual signal.]
+
+Figure 6.2.
+Possible hazards that may be observed on a signal.
+
+6.1.3
+Delay models
+
+A pure delay that simply shifts any signal waveform later in time is perhaps
+what first comes to mind. In the hardware description language VHDL this is
+called a transport delay. It is, however, not a very realistic model as it implies
+that the gates and wires have infinitely high bandwidth. A more realistic delay
+model is the inertial delay model. In addition to the time shifting of a signal
+waveform, an inertial delay suppresses short pulses. In the inertial delay model
+used in VHDL two parameters are specified, the delay time and the reject time,
+and pulses shorter than the reject time are filtered out. The inertial delay model
+is the default delay model used in VHDL.
+These two fundamental delay models come in several flavours depending on
+how the delay time parameter is specified. The simplest is a fixed delay where
+the delay is a constant. An alternative is a min-max delay where the delay is
+unknown but within a lower and upper bound: tmin ≤ tdelay ≤ tmax. A more
+pessimistic model is the unbounded delay where delays are positive (i.e. not
+zero), unknown and unbounded from above: 0 < tdelay < ∞. This is the delay
+model that is used for gates in speed-independent circuits.
+It is intuitive that the inertial delay model and the min-max delay model
+both have properties that help filter out some potential hazards.
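+
+The difference between the two fundamental delay models can be sketched on a
+binary waveform given as a list of (time, value) changes. The function names
+and the event representation below are hypothetical, and the pulse filter is a
+deliberately simplified reading of the inertial model (it assumes a binary
+signal whose values strictly alternate); it is not a reimplementation of the
+exact VHDL semantics.
+
```python
def transport_delay(events, delay):
    """Pure delay: shift every (time, value) change later; nothing is filtered."""
    return [(t + delay, v) for t, v in events]


def inertial_delay(events, delay, reject):
    """Shift changes by `delay` and suppress pulses shorter than `reject`.

    Assumes a binary signal whose changes strictly alternate in value, so two
    changes closer together than `reject` form a short pulse and cancel out.
    """
    kept = []
    for t, v in events:
        if kept and t - kept[-1][0] < reject:
            kept.pop()  # the previous change started a pulse that is too short
        else:
            kept.append((t, v))
    return [(t + delay, v) for t, v in kept]


if __name__ == "__main__":
    # Signal starts at 0, rises at t=0, shows a 2-unit low glitch at t=10.
    waveform = [(0, 1), (10, 0), (12, 1), (30, 0)]
    print("transport:", transport_delay(waveform, 3))
    print("inertial: ", inertial_delay(waveform, 3, reject=5))
```
+
+In this sketch the short glitch survives the transport delay but is removed by
+the inertial delay, which is the filtering property referred to above.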
+
+6.1.4
+Fundamental mode and input-output mode
+
+In addition to the delays in the gates and wires, it is also necessary to for-
+malize the interaction between the circuit being designed and its environment.
+Again, strong assumptions may simplify the design of the circuit. The design
+methods that have been proposed over time all have their roots in one of the
+following assumptions:
+
+Fundamental mode: The circuit is assumed to be in a state where all input
+signals, internal signals, and output signals are stable. In such a sta-
+ble state the environment is allowed to change one input signal. After
+that, the environment is not allowed to change the input signals again
+until the entire circuit has stabilized. Since internal signals such as state
+variables are unknown to the environment, this implies that the longest
+delay in the circuit must be calculated and the environment is required
+to keep the input signals stable for at least this amount of time. For this
+to make sense, the delays in gates and wires in the circuit have to be
+bounded from above. The limitation on the environment is formulated
+as an absolute time requirement.
+
+The design of asynchronous sequential circuits based on fundamental
+mode operation was pioneered by David Huffman in the 1950s [59, 60].
+
+Input-output mode: Again the circuit is assumed to be in a stable state. Here
+the environment is allowed to change the inputs. When the circuit has
+produced the corresponding output (and it is allowable that there are no
+output changes), the environment is allowed to change the inputs again.
+There are no assumptions about the internal signals and it is therefore
+possible that the next input change occurs before the circuit has stabi-
+lized in response to the previous input signal change.
+
+The restrictions on the environment are formulated as causal relations
+between input signal transitions and output signal transitions. For this
+reason the circuits are often specified using trace based methods where
+the designer specifies all possible sequences of input and output signal
+transitions that can be observed on the interface of the circuit. Signal
+transition graphs, introduced later, are such a trace-based specification
+technique.
+
+The design of asynchronous sequential circuits based on the input-output
+mode of operation was pioneered by David Muller in the 1950s [93, 92].
+As mentioned in section 2.5.1, these circuits are speed-independent.
+
+6.1.5
+Synthesis of fundamental mode circuits
+
+In the classic work by Huffman the environment was only allowed to change
+one input signal at a time. In response to such an input signal change, the
+combinational logic will produce new outputs, of which some are fed back,
+figure 6.1(b). In the original work it was further required that only one feed-
+back signal changes (at a time) and that the delay in the feedback buffer is
+large enough to ensure that the entire combinational circuit has stabilized be-
+fore it sees the change of the feedback signal. This change may, in turn, cause
+the combinational logic to produce new outputs, etc. Eventually through a se-
+quence of single signal transitions the circuit will reach a stable state where
+the environment is again allowed to make a single input change. Another way
+of expressing this behaviour is to say that the circuit starts out in a stable state
+
+
+
+[Figure 6.3 panels: Primitive flow table (rows s0 to s5, columns for inputs a,b = 00, 01, 11, 10 and output c); Mealy type state diagram (states s0 to s5, edges labelled ab/c); Burst mode specification (states s0 and s3, transitions a+b+/c+ and a-b-/c-).]
+
+Figure 6.3.
+Some alternative specifications of a Muller C-element: a Mealy state diagram, a
+primitive flow table, and a burst-mode state diagram.
+
+(which is defined to be a state that will persist until an input signal changes). In
+response to an input signal change the circuit will step through a sequence of
+transient, unstable states, until it eventually settles in a new stable state. This
+sequence of states is such that from one state to the next only one variable
+changes.
+The interested reader is encouraged to consult [75], [133] or [95] and to
+specify and synthesize a C-element. The following gives a flavour of the design
+process and the steps involved:
+
+The design may start with a state graph specification that is very simi-
+lar to the specification of a synchronous sequential circuit. This is op-
+tional. Figure 6.3 shows a Mealy type state graph specification of the
+C-element.
+
+The classic design process involves the following steps:
+
+The intended sequential circuit is specified in the form of a primitive
+flow table (a state table with one row per stable state). Figure 6.3 shows
+the primitive flow table specification of a C-element.
+
+A minimum-row reduced flow table is obtained by merging compatible
+states in the primitive flow table.
+
+The states are encoded.
+
+Boolean equations for output variables and state variables are derived.
+
+Later work has generalized the fundamental mode approach by allowing a
+restricted form of multiple-input and multiple-output changes. This approach
+is called burst mode [32, 27]. When in a stable state, a burst-mode circuit
+will wait for a set of input signals to change (in arbitrary order). After such
+an input burst has completed the machine computes a burst of output signals
+and new values of the internal variables. The environment is not allowed to
+produce a new input burst until the circuit has completely reacted to the pre-
+vious burst – fundamental mode is still assumed, but only between bursts of
+input changes. For comparison, figure 6.3 also shows a burst-mode specifica-
+tion of a C-element. Burst-mode circuits are specified using state graphs that
+are very similar to those used in the design of synchronous circuits. Several
+mature tools for synthesizing burst-mode controllers have been developed in
+academia [40, 160]. These tools are available in the public domain.
+
+6.2.
+Signal transition graphs
+
+The rest of this chapter will be devoted to the specification and synthesis
+of speed-independent control circuits. These circuits operate in input-output
+mode and they are naturally specified using signal transition graphs (STGs).
+An STG is a Petri net and it can be seen as a formalization of a timing
+diagram. The synthesis procedure that we will explain in the following consists
+of: (1) Capturing the behaviour of the intended circuit and its environment in
+an STG. (2) Generating the corresponding state graph, and adding state vari-
+ables if needed. (3) Deriving Boolean equations for the state variables and
+outputs.
+
+6.2.1
+Petri nets and STGs
+
+Briefly, a Petri net [3, 113, 94] is a graph composed of directed arcs and
+two types of nodes: transitions and places. Depending on the interpretation
+that is assigned to places, transitions and arcs, Petri nets can be used to model
+and analyze many different (concurrent) systems. Some places can be marked
+with tokens and the Petri net model can be “executed” by firing transitions. A
+transition is enabled to fire if there are tokens on all of its input places, and
+an enabled transition must eventually fire. When a transition fires, a token is
+removed from each input place and a token is added to each output place. We
+will show an example shortly. Petri nets offer a convenient way of expressing
+choice and concurrency.
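+
+The firing rule just described can be sketched in a few lines. The small
+fork/join net used here (places p0 to p4, transitions fork, a, b and join) is
+a hypothetical example made up for illustration; it is not one of the nets in
+the figures.
+
```python
def enabled(marking, pre):
    """A transition is enabled if every one of its input places holds a token."""
    return all(marking.get(place, 0) >= 1 for place in pre)


def fire(marking, pre, post):
    """Remove one token from each input place and add one to each output place."""
    if not enabled(marking, pre):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place in pre:
        new_marking[place] -= 1
    for place in post:
        new_marking[place] = new_marking.get(place, 0) + 1
    return new_marking


# Hypothetical net: transition name -> (input places, output places).
NET = {
    "fork": (["p0"], ["p1", "p2"]),
    "a": (["p1"], ["p3"]),
    "b": (["p2"], ["p4"]),
    "join": (["p3", "p4"], ["p0"]),
}

if __name__ == "__main__":
    marking = {"p0": 1}
    for name in ("fork", "a", "b", "join"):
        marking = fire(marking, *NET[name])
        print(name, "->", marking)
```
+
+After fork has fired, a and b are both enabled and may fire in either order
+(concurrency); join only becomes enabled once both have fired.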
+It is important to stress that there are many variations of and extensions to
+Petri nets – Petri nets are a family of related models and not a single, unique
+and well defined model. Often certain restrictions are imposed in order to make
+the analysis for certain properties practical. The STGs we will consider in the
+following belong to such a restricted subclass: an STG is a 1-bounded Petri net
+in which only simple forms of input choice are allowed. The exact meaning of
+
+
+
+[Figure 6.4 panels: C-element and dummy environment (signals a, b, c); Timing diagram of a, b and c; Petri net with transitions a+, b+, c+, a-, b-, c-; STG with the same transitions.]
+
+Figure 6.4.
+A C-element and its ‘well behaved’ dummy environment, its specification in the
+form of a timing diagram, a Petri net, and an STG formalization of the timing diagram.
+
+“1-bounded” and “simple forms of input choice” will be defined at the end of
+this section.
+In an STG the transitions are interpreted as signal transitions and the places
+and arcs capture the causal relations between the signal transitions. Figure 6.4
+shows a C-element and a ‘well behaved’ dummy environment that maintains
+the input signals until the C-element has changed its outputs. The intended be-
+haviour of the C-element could be expressed in the form of a timing diagram
+as shown in the figure. Figure 6.4 also shows the corresponding Petri net
+specification. The Petri net is marked with tokens on the input places to the
+a+ and b+ transitions, corresponding to state (a,b,c) = (0,0,0). The a+ and b+
+transitions may fire in any order, and when they have both fired the c+
+transition becomes enabled to fire, etc. Often STGs are drawn in a simpler form where
+most places have been omitted. Every arc that connects two transitions is then
+thought of as containing a place. Figure 6.4 shows the STG specification of the
+C-element.
+A given marking of a Petri net corresponds to a possible state of the sys-
+tem being modeled, and by executing the Petri net and identifying all possible
+markings it is possible to derive the corresponding state graph of the system.
+The state graph is generally much more complex than the corresponding Petri
+net.
+An STG describing a meaningful circuit enjoys certain properties, and for
+the synthesis algorithms used in tools like Petrify to work, additional properties
+and restrictions may be required. An STG is a Petri net with the following
+characteristics:
+
+1 Input free choice: The selection among alternatives must only be con-
+trolled by mutually exclusive inputs.
+
+2 1-bounded: There must never be more than one token in a place.
+
+3 Liveness: The STG must be free from deadlocks.
+
+An STG describing a meaningful speed-independent circuit has the following
+characteristics:
+
+4 Consistent state assignment: The transitions of a signal must strictly
+alternate between + and - in any execution of the STG.
+
+5 Persistency: If a signal transition is enabled it must take place, i.e. it
+must not be disabled by another signal transition. The STG specification
+of the circuit must guarantee persistency of internal signals (state vari-
+ables) and output signals, whereas it is up to the environment to guaran-
+tee persistency of the input signals.
+
+In order to be able to synthesize a circuit implementation the following char-
+acteristic is required:
+
+6 Complete state coding (CSC): Two or more different markings of the
+STG must not have the same signal values (i.e. correspond to the same
+state). If this is not the case, it is necessary to introduce extra state
+variables such that different markings correspond to different states. The
+synthesis tool Petrify will do this automatically.
+
+6.2.2
+Some frequently used STG fragments
+
+For the newcomer it may take a little practice to become familiar with spec-
+ifying and designing circuits using STGs. This section explains some of the
+most frequently used templates from which one can construct complete speci-
+fications.
+The basic constructs are: fork, join, choice and merge, see figure 6.5. The
+choice is restricted to what is called input free choice: the transitions follow-
+ing the choice place must represent mutually exclusive input signal transitions.
+This requirement is quite natural; we will only specify and design determin-
+istic circuits. Figure 6.6 shows an example Petri net that illustrates the use
+
+
+
+[Figure 6.5 panels: Fork; Join; Choice; Merge.]
+
+Figure 6.5.
+Petri net fragments for fork, join, free choice and merge constructs.
+
+[Figure 6.6: a Petri net with places P1 to P9 and transitions T1 to T8; the Fork, Join, Choice and Merge constructs are marked.]
+
+Figure 6.6.
+An example Petri net that illustrates the use of fork, join, free choice and merge.
+
+of fork, join, free choice and merge. The example system will either perform
+transitions T6 and T7 in sequence, or it will perform T1 followed by the con-
+current execution of transitions T2, T3 and T4 (which may occur in any order),
+followed by T5.
+Towards the end of this chapter we will design a 4-phase bundled-data ver-
+sion of the MUX component from figure 3.3 on page 32. For this we will need
+some additional constructs: a controlled choice and a Petri net fragment for the
+input end of a bundled-data channel.
+Figure 6.7 shows a Petri net fragment where place P1 and transitions T3
+and T4 represent a controlled choice: a token in place P1 will engage in either
+transition T3 or transition T4. The choice is controlled by the presence of
+a token in either P2 or P3. It is crucial that there can never be a token in
+both these places at the same time, and in the example this is ensured by the
+mutually exclusive input signal transitions T1 and T2.
+
+
+
+[Figure 6.7: a Petri net with places P0 to P3 and transitions T0 to T5. P0 is marked as a free choice place and P1 as a controlled choice place; the two mutually exclusive "paths" are indicated.]
+
+Figure 6.7.
+A Petri net fragment including a controlled choice.
+
+Figure 6.8 shows a Petri net fragment for a component with a one-bit input
+channel using a 4-phase bundled-data protocol. It could be the control chan-
+nel used in the MUX and DEMUX components introduced in figure 3.3 on
+page 32. The two transitions dummy1 and dummy2 do not represent transitions
+on the three signals in the channel, they are dummy transitions that facilitate
+expressing the specification. These dummy transitions represent an extension
+to the basic class of STGs.
+Note also that the four arcs connecting:
+place P5 and transition Ctl+
+place P5 and transition Ctl-
+place P6 and transition dummy2
+place P7 and transition dummy1
+have arrows at both ends. This is a shorthand notation for an arc in each
+direction. Note also that there are several instances where a place is both an
+input place and an output place for a transition. Place P5 and transition Ctl+
+is an example of this.
+The overall structure of the Petri net fragment can be understood as follows:
+at the top is a sequence of transitions and places that capture the handshaking
+on the Req and Ack signals. At the bottom is a loop composed of places P6
+and P7 and transitions Ctl+ and Ctl- that captures the control signal changing
+between high and low. The absence of a token in place P5 when Req is high
+expresses the fact that Ctl is stable in this period. When Req is low and a
+token is present in place P5, Ctl is allowed to make as many transitions as it
+desires. When Req+ fires, a token is put in place P4 (which is a controlled
+choice place). The Ctl signal is now stable, and depending on its value one of
+the two transitions dummy1 or dummy2 will become enabled and eventually
+
+
+
+[Figure 6.8: the bundled data interface (signals Req, Ack and Ctl) and the Petri net fragment with places P0 to P7 and transitions Req+, Ack+, Req-, Ack-, Ctl+, Ctl-, dummy1 and dummy2. The dummy transitions lead to the actions labelled Do the "Ctl=0" action and Do the "Ctl=1" action.]
+
+Figure 6.8.
+A Petri net fragment for a component with a one-bit input channel using a 4-phase
+bundled-data protocol.
+
+fire. At this point the intended input-to-output operation that is not included in
+this example may take place, and finally the handshaking on the control port
+finishes (Ack+; Req-; Ack-).
+
+6.3.
+The basic synthesis procedure
+
+The starting point for the synthesis process is an STG that satisfies the re-
+quirements listed on page 88. From this STG the corresponding state graph is
+derived by identifying all of the possible markings of the STG that are reach-
+able given its initial marking. The last step of the synthesis process is to derive
+Boolean equations for the state variables and output variables.
+We will go through a number of examples by hand in order to illustrate the
+techniques used. Since the state of a circuit includes the values of all of the
+signals in the circuit, the computational complexity of the synthesis process
+can be large, even for small circuits. In practice one would always use one
+of the CAD tools that has been developed – for example Petrify that we will
+introduce later.
+
+
+
+6.3.1
+Example 1: a C-element
+
+[Figure 6.9 panels: C element and its environment (signals a, b, c); State Graph with states (abc) 0*0*0, 10*0, 0*10, 110*, 1*1*1, 01*1, 1*01 and 001*; Karnaugh map for c, giving c = ab + ac + bc.]
+
+Figure 6.9.
+State graph and Boolean equation for the C-element STG from figure 6.4.
+
+Figure 6.9 shows the state graph corresponding to the STG specification in
+figure 6.4 on page 87. Variables that are excited in a given state are marked
+with an asterisk. Also shown in figure 6.9 is the Karnaugh map for output
+signal c. The Boolean expression for c must cover states in which c = 1 and
+states where it is excited, c = 0* (changing to 1). In order to better distinguish
+excited variables from stable ones in the Karnaugh maps, we will use R (rising)
+instead of 0* and F (falling) instead of 1* throughout the rest of this book.
+It is comforting to see that we can successfully derive the implementation of
+a known circuit, but the C-element is really too simple to illustrate all aspects
+of the design process.
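+
+The derivation of a state graph from an STG can be sketched by playing the
+token game exhaustively. The arc list below is an assumed reading of the
+C-element STG of figure 6.4 (a+ and b+ precede c+, which precedes a- and b-,
+which precede c-), with one implicit place per arc and 1-boundedness assumed
+so that a marking can be stored as a set of arcs. All function names are
+hypothetical.
+
```python
# Arcs of the C-element STG; every arc carries one implicit place (assumption).
ARCS = [("a+", "c+"), ("b+", "c+"), ("c+", "a-"), ("c+", "b-"),
        ("a-", "c-"), ("b-", "c-"), ("c-", "a+"), ("c-", "b+")]
TRANSITIONS = sorted({dst for _, dst in ARCS})
SIGNALS = "abc"
# Initial state: tokens on the arcs into a+ and b+, and (a, b, c) = (0, 0, 0).
INITIAL = (frozenset({("c-", "a+"), ("c-", "b+")}), (0, 0, 0))


def enabled(marking):
    """Transitions whose incoming arcs all hold a token."""
    return [t for t in TRANSITIONS
            if all(arc in marking for arc in ARCS if arc[1] == t)]


def fire(state, t):
    """Move tokens across transition t and update the signal it drives."""
    marking, values = state
    consumed = {arc for arc in ARCS if arc[1] == t}
    produced = {arc for arc in ARCS if arc[0] == t}
    new_values = list(values)
    new_values[SIGNALS.index(t[0])] = 1 if t[1] == "+" else 0
    return frozenset((marking - consumed) | produced), tuple(new_values)


def reachable_states():
    """All (marking, signal values) pairs reachable from the initial state."""
    seen, stack = {INITIAL}, [INITIAL]
    while stack:
        state = stack.pop()
        for t in enabled(state[0]):
            nxt = fire(state, t)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def next_c(state):
    """Value that c holds or is excited to take in this state."""
    marking, (_, _, c) = state
    now_enabled = enabled(marking)
    if "c+" in now_enabled:
        return 1
    if "c-" in now_enabled:
        return 0
    return c


def equation_matches_state_graph():
    """Check c = ab + ac + bc against every reachable state."""
    return all(((a & b) | (a & c) | (b & c)) == next_c((m, (a, b, c)))
               for m, (a, b, c) in reachable_states())


if __name__ == "__main__":
    print(len(reachable_states()), "states; equation holds:",
          equation_matches_state_graph())
```
+
+Under these assumptions the exploration finds eight states, one per
+combination of a, b and c, and the majority function agrees with the required
+next value of c in each of them.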
+
+6.3.2
+Example 2: a circuit with choice
+
+The following example provides a better illustration of the synthesis pro-
+cedure, and in a subsequent section we will come back to this example and
+explain more efficient implementations. The example is simple – the circuit
+has only 2 inputs and 2 outputs – and yet it brings forward all relevant issues.
+The example is due to Chris Myers of the University of Utah who presented it
+in his 1996 course EE 587 “Asynchronous VLSI System Design.” The example
+has roots in the papers [12, 13].
+Figure 6.10 shows a specification of the circuit. The circuit has two inputs
+a and b and two outputs c and d, and the circuit has two alternative behaviours
+as illustrated in the timing diagram. The corresponding STG specification is
+shown in figure 6.11 along with the state graph for the circuit. The STG
+
+[Figure 6.10: the circuit with inputs a and b and outputs c and d connected to its Environment, together with a timing diagram of the signals a, b, c and d.]
+
+Figure 6.10.
+The example circuit from [12, 13].
+
+[Figure 6.11: the STG (transitions a+, b+, c+, d+, a-, b-, c-, d-, with places P0 and P1) and the state graph with states (abcd) 0*0*00, 10*00, 1100*, 110*1, 1111*, 1*110, 01*10, 001*0 and 010*0, labelled with the decimal numbers 0, 8, 12, 13, 15, 14, 6, 2 and 4.]
+
+Figure 6.11.
+The STG specification and the corresponding state graph.
+
+[Figure 6.12 panels: Karnaugh map for c over a b and c d, with x for unreachable states and R and F for excited entries; Boolean equation for c: c = d + ¬a b + b c; An atomic complex gate; Using simple gates (internal signals x1, x2 and x3).]
+
+Figure 6.12.
+The Karnaugh map, the Boolean equation, and two alternative gate-level
+implementations of output signal c.
+
+includes only the free choice place P0 and the merge place P1. All arcs that
+directly connect two transitions are assumed to include a place. The states in
+the state diagram have been labeled with decimal numbers to ease filling out
+the Karnaugh maps.
+The STG satisfies all of the properties 1-6 listed on page 88 and it is thus
+possible to proceed and derive Boolean equations for output signals c and d.
+[Note: In state 0 both inputs are marked to be excited, (a,b) = (0*,0*), and
+in states 4 and 8 one of the signals is still 0 but no longer excited. This is a
+problem of notation only. In reality only one of the two variables is excited in
+state 0, but we don’t know which one. Furthermore, the STG is only required
+to be persistent with respect to the internal signals and the output signals. Per-
+sistency of the input signals must be guaranteed by the environment].
+For output c, figure 6.12 shows the Karnaugh map, the Boolean equation and
+two alternative gate implementations: one using a single atomic And-Or-Invert
+gate, and one using simple AND and OR gates. Note that there are states that
+are not reachable by the circuit. In the Karnaugh map these states correspond
+to don’t cares. The implementation of output signal d is left as an exercise for
+the reader (d = a b ¬c).
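+
+The equations for c and d can be checked mechanically against the reachable
+states. The sketch below takes the state codes as they appear in the state
+graph of figure 6.11 (an asterisk marks an excited signal) and assumes the
+inverted literals: input a is inverted in the equation for c, as the remark on
+the And-Or gate implies, and c is inverted in the equation for d. The function
+names are hypothetical.
+
```python
# Reachable states in the order a, b, c, d; '*' marks an excited signal.
STATE_CODES = ["0*0*00", "10*00", "1100*", "110*1", "1111*",
               "1*110", "01*10", "001*0", "010*0"]


def parse(code):
    """Return [(value, excited), ...] for the signals a, b, c, d."""
    signals = []
    for ch in code:
        if ch == "*":
            value, _ = signals[-1]
            signals[-1] = (value, True)
        else:
            signals.append((int(ch), False))
    return signals


def next_value(value, excited):
    """An excited signal is about to toggle; a stable one keeps its value."""
    return 1 - value if excited else value


def eq_c(a, b, c, d):
    return d | ((1 - a) & b) | (b & c)   # c = d + (not a) b + b c


def eq_d(a, b, c, d):
    return a & b & (1 - c)               # d = a b (not c)


def state_number(code):
    """Decimal label of a state, reading abcd as a binary number."""
    a, b, c, d = (value for value, _ in parse(code))
    return 8 * a + 4 * b + 2 * c + d


def equations_hold():
    for code in STATE_CODES:
        (a, _), (b, _), (c, c_exc), (d, d_exc) = parse(code)
        if eq_c(a, b, c, d) != next_value(c, c_exc):
            return False
        if eq_d(a, b, c, d) != next_value(d, d_exc):
            return False
    return True


if __name__ == "__main__":
    print("states:", sorted(state_number(s) for s in STATE_CODES))
    print("equations hold:", equations_hold())
```
+
+The check only covers the nine reachable states; the remaining seven input
+combinations are the don't cares of the Karnaugh map and are left unconstrained.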
+
+6.3.3
+Example 2: Hazards in the simple gate
+implementation
+
+The STG in figure 6.11 satisfies all of the implementation conditions 1-6
+(including persistency), and consequently an implementation where each out-
+put signal is implemented by a single atomic complex gate is hazard free. In
+the case of c we need a complex And-Or gate with inversion of input signal a.
+In general such an atomic implementation is not feasible and it is necessary to
+decompose the implementation into a structure of simpler gates. Unfortunately
+this will introduce extra variables, and these extra variables may not satisfy the
+persistency requirement that an excited signal transition must eventually fire.
+Speed-independence preserving logic decomposition is therefore a very inter-
+esting and relevant topic [20, 76].
+The implementation of c using simple gates that is shown in figure 6.12 is
+not speed-independent; it may exhibit both static and dynamic hazards, and it
+provides a good illustration of the mechanisms behind hazards. The problem
+is that the signals x1, x2 and x3 are not included in the original STG and state
+graph. A detailed analysis that includes these signals would not satisfy the
+persistency requirement. Below we explain possible failure sequences that may
+cause a static-1 hazard and a dynamic-10 hazard on output signal c. Figure 6.13
+illustrates the discussion.
+
+A static-1 hazard may occur when the circuit goes through the following se-
+quence of states: 12, 13, 15, 14. The transition from state 12 to state 13
+
+
+
+[Figure 6.13 panels: Potential static-1 hazard; Potential dynamic-10 hazard. Each panel repeats the Karnaugh map for c with the cubes a b, d and b c and the relevant state sequence marked.]
+
+Figure 6.13.
+The Karnaugh maps for output signal c showing state sequences that may lead to
+hazards.
+
+corresponds to d going high and the transition from state 15 to state 14
+corresponds to d going low again. In state 13 c is excited (R) and it is
+supposed to remain high throughout states 13, 15, 14, and 6. States 13
+and 15 are covered by the cube d, and state 14 is covered by cube bc that
+is supposed to “take over” and maintain c = 1 after d has gone low. If
+the AND gate with output signal x2 that corresponds to cube bc is slow,
+we have the problem: the static-1 hazard.
+
+A dynamic-10 hazard may occur when the circuit goes through the following
+sequence of states: 4, 6, 2, 0. This situation corresponds to the upper
+AND gate (with output signal x1) and the OR gate relaying b+ into
+c+ and b- into c-. However, after the c+ transition the lower AND
+gate, x2, becomes excited (R) as well, but the firing of this gate is not
+indicated by any other signal transition – the OR gate already has one
+input high. If the lower AND gate (x2) fires, it will later become excited
+(F) in response to c-. The net effect of this is that the lower AND gate
+(x2) may superimpose a 0-1-0 pulse onto the c output after the intended
+c- transition has occurred.
+
+In the above we did not consider the inverter with input signal a and output
+signal x3. Since a is not an input to any other gate, this decomposition is SI.
+In summary both types of hazard are related to the circuit going through a
+sequence of states that are covered by several cubes that are supposed to main-
+tain the signal at the same (stable) level. The cube that “takes over” represents
+a signal that may not be indicated by any other signal. In essence it is the same
+problem that we touched upon in section 2.2 on page 14 and in section 2.4.3
+on page 20 – an OR gate can only indicate when the first input goes high.
+
+
+
+6.4.
+Implementations using state-holding gates
+
+6.4.1
+Introduction
+
+During operation each variable in the circuit will go through a sequence of
+states where it is (stable) 0, followed by one or more states where it is excited
+(R), followed by a sequence of states where it is (stable) 1, followed by one or
+more states where it is excited (F), etc. In the above implementation we were
+covering all states where a variable, z, was high or excited to go high (z = 1
+and z = R = 0*).
+An alternative is to use a state-holding device such as a set-reset latch. The
+Boolean equations for the set and reset signals need only cover the z = R = 0*
+states and the z = F = 1* states respectively. This will lead to simpler equations
+and potentially simpler decompositions. Figure 6.14 shows the implementation
+template using a standard set-reset latch and an alternative solution based on a
+standard C-element. In the latter case the reset signal must be inverted. Later,
+in section 6.4.5, we will discuss alternative and more elaborate implementa-
+tions, but for the following discussion the basic topologies in figure 6.14 will
+suffice.
+
+Figure 6.14.
+Possible implementation templates using (simple) state holding elements: an SR
+flip-flop implementation and a standard C-element implementation, each driven
+by separate set logic and reset logic.
+
+At this point it is relevant to mention that the equations for when to set
+and reset the state-holding element for signal z can be found by rewriting the
+original equation (that covers states in which z = R and z = 1) into the following
+form (where a trailing ' denotes the complement):
+
+z = “Set” + z · (“Reset”)'        (6.1)
+
+For signal c in the above example (figure 6.12 on page 93) we would get the
+following set and reset functions: cset = d + a'b and creset = b' (which is identical
+to the result in figure 6.15 in section 6.4.3). Furthermore it is obvious that for
+all reachable states (only) the set and reset functions for a signal z must never
+be active at the same time:
+
+“Set” · “Reset” = 0
+
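+As a minimal sketch of how equation 6.1 and the mutual-exclusion requirement
+can be checked, the Python fragment below models signal c of example 2 with a
+state-holding element. It assumes that states are numbered with a as the most
+significant bit and d as the least (so state 13 is abcd = 1101), that the nine
+states listed are the reachable ones, and that the set and reset functions are
+c-set = d + a'b and c-reset = b'. All function names are illustrative only.

```python
# States are encoded as 4-bit numbers abcd (a = MSB), e.g. 13 = 1101.
# Assumption: these are the nine reachable states of example 2.
REACHABLE = [0, 8, 12, 13, 15, 14, 6, 2, 4]


def bits(state):
    """Split a state number into the signal values (a, b, c, d)."""
    return tuple((state >> shift) & 1 for shift in (3, 2, 1, 0))


def c_set(a, b, c, d):
    """Assumed set function for c: d + a'b."""
    return bool(d or ((not a) and b))


def c_reset(a, b, c, d):
    """Assumed reset function for c: b'."""
    return not b


def next_value(z, set_active, reset_active):
    """Equation 6.1: z = Set + z * (Reset)'."""
    return bool(set_active or (z and not reset_active))


def conflicts(states):
    """States in which set and reset are active at the same time."""
    return [s for s in states if c_set(*bits(s)) and c_reset(*bits(s))]


reachable_conflicts = conflicts(REACHABLE)
all_conflicts = conflicts(range(16))

# Value that the state-holding element drives onto c in each reachable state.
next_c = {s: next_value(bits(s)[2], c_set(*bits(s)), c_reset(*bits(s)))
          for s in REACHABLE}
```

+Under these assumptions set and reset never overlap in a reachable state,
+whereas the unreachable states with b = 0 and d = 1 activate both, which is
+why the requirement is stated for reachable states only.
+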
+
+Chapter 6: Speed-independent control circuits
+97
+
+The following sections will develop the idea of using state-holding elements
+and we will illustrate the techniques by re-implementing example 2 from the
+previous section.
+
+6.4.2
+Excitation regions and quiescent regions
+
+The above idea of using a state-holding device for each variable can be formal-
+ized as follows:
+
+An excitation region, ER, for a variable z is a maximally-connected set of
+states in which the variable is excited:
+
+ER(z+) denotes a region of states where z = R = 0*
+
+ER(z-) denotes a region of states where z = F = 1*
+
+A quiescent region, QR, for a variable z is a maximally-connected set of states
+in which the variable is not excited:
+
+QR(z+) denotes a region of states where z = 1
+
+QR(z-) denotes a region of states where z = 0
+
+For a given circuit the state space can be disjointly divided into one or more
+regions of each type.
+
+The set function (cover) for a variable z:
+
+must contain all states in the ER(z+) regions
+
+may contain states from the QR(z+) regions
+
+may contain states not reachable by the circuit
+
+The reset function (cover) for a variable z:
+
+must contain all states in the ER(z-) regions
+
+may contain states from the QR(z-) regions
+
+may contain states not reachable by the circuit
+
+In section 6.4.4 below we will add what is known as the monotonic cover
+constraint or the unique entry constraint in order to avoid hazards:
+
+A cube (product term) in the set or reset function of a variable must only
+be entered through a state where the variable is excited.
+
+Having mentioned this last constraint, we have above a complete recipe
+for the design of speed-independent circuits where each non-input signal is
+implemented by a state holding device. Let us continue with example 2.
+
+
+6.4.3
+Example 2: Using state-holding elements
+
+Figure 6.15 illustrates the above procedure for example 2 from sections 6.3.2
+and 6.3.3. As before, the Boolean equations (for the set and reset functions)
+may need to be implemented using atomic complex gates in order to ensure
+that the resulting circuit is speed-independent.
+
+Figure 6.15.
+Excitation and quiescent regions in the state diagram for signal c in the example
+circuit from figure 6.10, and the corresponding derivation of equations for the set and reset
+functions: c-set = d + a'b and c-reset = b'.
+
+Figure 6.16.
+Implementation of c using a standard C-element and simple gates, along with the
+Karnaugh map from which the set and reset functions were derived.
+
+6.4.4
+The monotonic cover constraint
+
+A standard C-element based implementation of signal c from above, with
+the set and reset functions implemented using simple gates, is shown in fig-
+ure 6.16 along with the Karnaugh map from which the set and reset functions
+are derived. The set function involves two cubes d and ab that are input signals
+to an OR gate. This implementation may exhibit a dynamic-10 hazard on the
+cset-signal in a similar way to that discussed previously. The Karnaugh map in
+figure 6.16 shows the sequence of states that may lead to a malfunction: (8, 12,
+13, 15, 14, 6, 0). Signal d is low in state 12, high in states 13 and 15, and low
+again in state 14. This sequence of states corresponds to a pulse on d. Through
+the OR gate this will create a pulse on the cset signal that will cause c to go
+high. Later in state 2, c will go low again. This is the desired behaviour. The
+problem is that the internal signal x1 that corresponds to the other cube in the
+expression for cset becomes excited (x1 = R) in state 6. If this AND gate is
+slow this may produce an unintended pulse on the cset signal after c has been
+reset again.
+If the cube a'b (that covers states 4, 5, 7, and 6) is reduced to include only
+states 4 and 5 corresponding to cset = d + a'bc' we would avoid the problem.
+The effect of this modification is that the OR gate is never exposed to more
+than one input signal being high, and when this is the case we do not have
+problems with the principle of indication (c.f. the discussion of indication and
+dual-rail circuits in chapter 2). Another way of expressing this is that a cover
+cube must only be entered through states belonging to an excitation region.
+This requirement is known as:
+
+the monotonic cover constraint: only one product term in a sum-of-
+products implementation is allowed to be high at any given time. Obvi-
+ously the requirement need only be satisfied in the states that are reach-
+able by the circuit, or alternatively
+
+the unique entry constraint: cover cubes may only be entered through
+excitation region states.
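+
+The unique entry constraint can be checked mechanically. The sketch below is
+an illustration only: it assumes the state graph of example 2 (states encoded
+as abcd with a as the most significant bit, transitions read off the STG),
+takes ER(c+) to be states 4 and 13, and compares the cover d + a'b with the
+reduced cover d + a'bc'. The helper names are hypothetical.

```python
# State graph of example 2; states are abcd with a as the MSB.
# Assumption: the transitions below are read off the STG of example 2.
TRANSITIONS = [
    (0, 8), (8, 12), (12, 13), (13, 15), (15, 14), (14, 6),  # a+ b+ d+ c+ d- a-
    (0, 4), (4, 6),                                          # b+ c+
    (6, 2), (2, 0),                                          # b- c-
]
ER_C_PLUS = {4, 13}  # states in which c is excited to rise


def bits(state):
    """Split a state number into the signal values (a, b, c, d)."""
    return tuple((state >> shift) & 1 for shift in (3, 2, 1, 0))


# Candidate cubes for the set function of c, as predicates over (a, b, c, d).
CUBE_D = ("d", lambda a, b, c, d: bool(d))
CUBE_NA_B = ("a'b", lambda a, b, c, d: bool((not a) and b))
CUBE_NA_B_NC = ("a'bc'", lambda a, b, c, d: bool((not a) and b and not c))


def unique_entry_violations(cubes):
    """List (cube, from_state, to_state) where a cube is entered outside ER(c+)."""
    bad = []
    for name, cube in cubes:
        for src, dst in TRANSITIONS:
            entered = not cube(*bits(src)) and cube(*bits(dst))
            if entered and dst not in ER_C_PLUS:
                bad.append((name, src, dst))
    return bad


original_cover = unique_entry_violations([CUBE_D, CUBE_NA_B])
reduced_cover = unique_entry_violations([CUBE_D, CUBE_NA_B_NC])
```

+With these assumptions the only violation reported for the original cover is
+cube a'b being entered on the transition from state 14 to state 6, which is the
+situation described above; the reduced cover reports none.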
+
+6.4.5
+Circuit topologies using state-holding elements
+
+In addition to the set-reset flip-flop and the standard C-element based tem-
+plates presented above, there are a number of alternative solutions for imple-
+menting variables using a state-holding device.
+A popular approach is the generalized C-element that is available to the
+CMOS transistor-level designer. Here the state-holding mechanism and the set
+and reset functions are implemented in one (atomic) compound structure of
+n- and p-type transistors. Figure 6.17 shows a gate-level symbol for a circuit
+where zset = ab and zreset = bc along with dynamic and static CMOS
+implementations.
+An alternative implementation that may be attractive to a designer using a
+standard cell library that includes (complex) And-Or-Invert gates is shown in
+figure 6.18. The circuit has the interesting property that it produces both the
+desired signal z and its complement z' and during transitions it never produces
+(z, z') = (1, 1). Again, the example is a circuit where zset = ab and zreset = bc.
+
+
+Figure 6.17.
+A generalized C-element: gate-level symbol, and some CMOS transistor
+implementations (dynamic or pseudostatic, and static).
+
+Figure 6.18.
+An SR implementation based on two complex And-Or-Invert gates.
+
+
+6.5.
+Initialization
+
+Initialization is an important aspect of practical circuit design, and unfortu-
+nately it has not been addressed in the above. The synthesis process assumes
+an initial state that corresponds to the initial marking of the STG, and the re-
+sulting synthesized circuit is a correct speed-independent implementation of
+the specification provided that the circuit starts out in the same initial state.
+Since the synthesized circuits generally use state-holding elements or circuitry
+with feedback loops it is necessary to actively force the circuit into the intended
+initial state.
+Consequently, the designer has to do a manual post-synthesis hack and ex-
+tend the circuit with an extra signal which, when active, sets all state-holding
+constructs into the desired state.
+Normally the circuits will not be speed-
+independent with respect to this initialization signal; it is assumed to be as-
+serted for long enough to cause the desired actions before it is de-asserted.
+For circuit implementations using state-holding elements such as set-reset
+latches and standard C-elements, initialization is trivial provided that these
+components have special clear/preset signals in addition to their normal inputs.
+In all other cases the designer has to add an initialization signal to the relevant
+Boolean equations explicitly. If the synthesis process is targeting a given cell
+library, the modified logic equations may need further logic decomposition,
+and as we have seen this may compromise speed-independence.
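+
+As a hedged illustration of adding an initialization signal to a Boolean
+equation, the sketch below extends equation 6.1 in two ways: gating the whole
+expression with the inverted signal forces the variable low, and OR-ing the
+signal in forces it high. The signal name init and both variants are
+assumptions made for this example; they are not taken from a particular cell
+library or synthesis tool.

```python
def gc_next(z, set_active, reset_active):
    """State-holding element without initialization: z = Set + z * (Reset)'."""
    return bool(set_active or (z and not reset_active))


def gc_next_init_low(z, set_active, reset_active, init):
    """Hypothetical variant that is forced to 0 while init is asserted."""
    return bool(not init and (set_active or (z and not reset_active)))


def gc_next_init_high(z, set_active, reset_active, init):
    """Hypothetical variant that is forced to 1 while init is asserted."""
    return bool(init or set_active or (z and not reset_active))


def settle(update, z, *inputs, max_steps=4):
    """Iterate the feedback loop until z stops changing."""
    for _ in range(max_steps):
        new_z = update(z, *inputs)
        if new_z == z:
            break
        z = new_z
    return z
```

+With init de-asserted both variants behave exactly like the original equation,
+so the extra input only matters during the initialization phase.
+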
+The fact that initialization is not included in the synthesis process is ob-
+viously a drawback, but normally one would implement a library of control
+circuits and use these as building blocks when designing circuits at the more
+abstract “static data-flow structures” level as introduced in chapter 3.
+Initializing all control circuits as outlined above is a simple and robust
+approach. However, initialization of asynchronous circuits based on handshake
+components may also be achieved by an implicit approach that exploits
+the function of the circuit to “propagate” initial signal values into the circuit.
+In Tangram (section 8.3, and chapter 13 in part III) this is called
+self-initialization [135].
+
+6.6.
+Summary of the synthesis process
+
+The previous sections have covered the basic theory for synthesizing SI con-
+trol circuits from STG specifications. The style of presentation has deliberately
+been chosen to be an informal one with emphasis on examples and the intuition
+behind the theory and the synthesis procedure.
+The theory has roots in work done by the following Universities and groups:
+University of Illinois [93], MIT [26, 24], Stanford [13], IMEC [145, 159], St.
+Petersburg Electrical Engineering Institute [146], and the multinational group
+of researchers who have developed the Petrify tool [29] that we will introduce
+in the next section. This author has attended several discussions from which
+it is clear that in some cases the concepts and theories have been developed
+independently by several groups, and I will refrain from attempting a precise
+history of the evolution. The reader who is interested in digging deeper into the
+subject is encouraged to consult the literature; in particular the book by Myers
+[95].
+In summary the synthesis process outlined in the previous sections involves
+the following steps:
+
+1 Specify the desired behaviour of the circuit and its (dummy) environ-
+ment using an STG.
+
+2 Check that the STG satisfies properties 1-5 on page 88: 1-bounded, con-
+sistent state assignment, liveness, only input free choice and controlled
+choice and persistency. An STG satisfying these conditions is a valid
+specification of an SI circuit.
+
+3 Check that the specification satisfies property 6 on page 88: complete
+state coding (CSC). If the specification does not satisfy CSC it is neces-
+sary to add one or more state variables or to change the specification
+(which is often possible in 4-phase control circuits where the down-
+going signal transitions can be shuffled around). Some tools (including
+Petrify) can insert state variables automatically, whereas re-shuffling of
+signals – which represents a modification of the specification – is a task
+for the designer.
+
+4 Select an implementation template and derive the Boolean equations for
+the variables themselves, or for the set and reset functions when state
+holding devices are used. Also decide if these equations can be imple-
+mented in atomic gates (typically complex AOI-gates) or if they are to
+be implemented by structures of simpler gates. These decisions may be
+set by switches in the synthesis tools.
+
+5 Derive the Boolean equations for the desired implementation template.
+
+6 Manually modify the implementation such that the circuit can be forced
+into the desired initial state by an explicit reset or initialization signal.
+
+7 Enter the design into a CAD tool and perform simulation and layout of
+the circuit (or the system in which the circuit is used as a component).
+
+6.7.
+Petrify: A tool for synthesizing SI circuits from STGs
+
+Petrify is a public domain tool for manipulating Petri nets and for syn-
+thesizing SI control circuits from STG specifications.
+It is available from
+http://www.lsi.upc.es/~jordic/petrify/petrify.html.
+
+
+Petrify is a typical UNIX program with many options and switches. As a
+circuit designer one would probably prefer a push-button synthesis tool that
+accepts a specification and produces a circuit. Petrify can be used this way but
+it is more than this. If you know how to play the game it is an interactive tool
+for specifying, checking, and manipulating Petri nets, STGs and state graphs.
+In the following section we will show some examples of how to design speed-
+independent control circuits.
+Input to Petrify is an STG described in a simple textual format. Using the
+program draw_astg that is part of the Petrify distribution (and that is based
+on the graph visualization package ‘dot’ developed at AT&T) it is possible to
+produce a drawing of the STGs and state graphs. The graphs are “nice” but the
+topological organization may be very different from how the designer thinks
+about the problem. Even the simple task of checking that an STG entered in
+textual form is indeed the intended STG may be difficult.
+To help ease this situation a graphical STG entry and simulation tool called
+VSTGL (Visual STG Lab) has been developed at the Technical University of
+Denmark. To help the designer obtain a correct specification VSTGL includes
+an interactive simulator that allows the designer to add tokens and to fire tran-
+sitions. It also carries out certain simple checks on the STG.
+VSTGL is available from http://vstgl.sourceforge.net/ and it is the
+result of a small student project done by two 4th year students. VSTGL is sta-
+ble and reliable, though naming of signal transitions may seem a bit awkward.
+Petrify can solve CSC violations by inserting state variables, and it can be
+controlled to target the implementation templates introduced in section 6.4:
+
+The -cg option will produce a complex-gate circuit (one where each non-
+input signal is implemented in a single complex gate).
+
+The -gc option will produce a generalized C-element circuit. The outputs
+from Petrify are the Boolean equations for the set and reset functions for
+each non-input signal.
+
+The -gcm option will produce a generalized C-element solution where
+the set and reset functions satisfy the monotonic cover requirement. Con-
+sequently the solution can also be mapped onto a standard C-element
+implementation where the set and reset functions are implemented us-
+ing simple AND and OR gates.
+
+The -tm option will cause Petrify to perform technology mapping onto
+a gate library that can be specified by the user. Technology mapping can
+obviously not be combined with the -cg and -gc options.
+
+Petrify comes with a manual and some examples. In the following section
+we will go through some examples drawn from the previous chapters of the
+book.
+
+
+6.8.
+Design examples using Petrify
+
+In the following we will illustrate the use of Petrify by specifying and syn-
+thesizing: (a) example 2 – the circuit with choice, (b) a control circuit for the
+4-phase bundled-data implementation of the latch from figure 3.3 on page 32
+and (c) a control circuit for the 4-phase bundled-data implementation of the
+MUX from figure 3.3 on page 32. For all of the examples we will assume push
+channels only.
+
+6.8.1
+Example 2 revisited
+
+As a first example, we will synthesize the different versions of example 2
+that we have already designed manually. Figure 6.19 shows the STG as it is
+entered into VSTGL. The corresponding textual input to Petrify (the ex2.g file)
+and the STG as it may be visualized by Petrify are shown in figure 6.20. Note
+in figure 6.20 that an index is added when a signal transition appears more than
+once in order to facilitate the textual input.
+
+Figure 6.19.
+The STG of example 2 as it is entered into VSTGL.
+
+Using complex gates
+
+> petrify ex2.g -cg -eqn ex2-cg.eqn
+
+The STG has CSC.
+# File generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# from <ex2.g> on 6-Mar-01 at 8:30 AM
+....
+
+
+.model ex2
+.inputs a b
+.outputs c d
+.graph
+P0 a+ b+
+c+ P1
+b+ c+
+P1 b-
+c- P0
+b- c-
+a- P1
+d- a-
+c+/1 d-
+a+ b+/1
+b+/1 d+
+d+ c+/1
+.marking { P0 }
+.end
+
+Figure 6.20.
+The textual description of the STG for example 2 and the drawing of the STG
+that is produced by Petrify.
+
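+To make the textual STG format more concrete, the following Python sketch
+plays the token game on the ex2.g description and counts the reachable
+markings. It is an illustration and not part of Petrify: the parser handles
+only the constructs used in this file, it assumes a 1-bounded net, and it
+names the implicit place on a transition-to-transition arc <src,dst>.

```python
import re
from collections import deque

EX2_G = """\
.model ex2
.inputs a b
.outputs c d
.graph
P0 a+ b+
c+ P1
b+ c+
P1 b-
c- P0
b- c-
a- P1
d- a-
c+/1 d-
a+ b+/1
b+/1 d+
d+ c+/1
.marking { P0 }
.end
"""

# A transition is a signal name followed by + or -, optionally with an index.
TRANSITION = re.compile(r"^\w+[+-](/\d+)?$")


def parse_stg(text):
    """Return (pre, post, initial_marking) for the .graph section of an STG."""
    pre, post = {}, {}  # transition -> set of input / output places
    marking = frozenset()
    in_graph = False
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0].startswith("."):
            in_graph = tokens[0] == ".graph"
            if tokens[0] == ".marking":
                marking = frozenset(t for t in tokens[1:] if t not in ("{", "}"))
            continue
        if not in_graph:
            continue
        src = tokens[0]
        for dst in tokens[1:]:
            src_is_t = bool(TRANSITION.match(src))
            dst_is_t = bool(TRANSITION.match(dst))
            if src_is_t and dst_is_t:
                place = f"<{src},{dst}>"  # implicit place between two transitions
                post.setdefault(src, set()).add(place)
                pre.setdefault(dst, set()).add(place)
            elif src_is_t:
                post.setdefault(src, set()).add(dst)
            else:
                pre.setdefault(dst, set()).add(src)
    return pre, post, marking


def reachable_markings(pre, post, initial):
    """Breadth-first exploration of all markings reachable from the initial one."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        marking = queue.popleft()
        for transition, inputs in pre.items():
            if inputs <= marking:  # enabled: every input place holds a token
                successor = frozenset((marking - inputs) | post.get(transition, set()))
                if successor not in seen:
                    seen.add(successor)
                    queue.append(successor)
    return seen


pre, post, initial = parse_stg(EX2_G)
states = reachable_markings(pre, post, initial)
```

+For this file the exploration finds nine markings, in line with the nine
+states reported in the Petrify output.
+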
+# The original TS had (before/after minimization) 9/9 states
+# Original STG: 2 places, 10 transitions, 13 arcs ...
+# Current STG: 4 places, 9 transitions, 18 arcs ...
+# It is a Petri net with 1 self-loop places
+...
+
+> more ex2-cg.eqn
+
+# EQN file for model ex2
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 7.00
+
+INORDER = a b c d;
+OUTORDER = [c] [d];
+[c] = b (c + a’) + d;
+[d] = a b c’;
+
+Using generalized C-elements:
+
+> petrify ex2.g -gc -eqn ex2-gc.eqn
+
+...
+
+> more ex2-gc.eqn
+
+# EQN file for model ex2
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 12.00
+
+
+INORDER = a b c d;
+OUTORDER = [c] [d];
+[0] = a’ b + d;
+[1] = a b c’;
+[d] = d c’ + [1];    # mappable onto gC
+[c] = c b + [0];    # mappable onto gC
+
+The equations for the generalized C-elements should be “interpreted”
+according to equation 6.1 on page 96.
+
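+One way to build confidence in such an interpretation is to compare the
+generalized C-element equations with the complex-gate equations state by
+state. The sketch below does this for example 2; it assumes the abcd state
+encoding with a as the most significant bit and the nine reachable states
+listed in the code, and the function names are illustrative.

```python
# Assumption: the nine reachable states of example 2, encoded as abcd (a = MSB).
REACHABLE = [0, 8, 12, 13, 15, 14, 6, 2, 4]


def bits(state):
    """Split a state number into the signal values (a, b, c, d)."""
    return tuple((state >> shift) & 1 for shift in (3, 2, 1, 0))


def c_complex(a, b, c, d):
    """Complex-gate equation: [c] = b (c + a') + d."""
    return bool((b and (c or not a)) or d)


def c_gc(a, b, c, d):
    """Generalized C-element equations: [0] = a' b + d; [c] = c b + [0]."""
    set_term = ((not a) and b) or d
    return bool((c and b) or set_term)


def d_complex(a, b, c, d):
    """Complex-gate equation: [d] = a b c'."""
    return bool(a and b and not c)


def d_gc(a, b, c, d):
    """Generalized C-element equations: [1] = a b c'; [d] = d c' + [1]."""
    set_term = a and b and not c
    return bool((d and not c) or set_term)


def differing(f, g, states):
    """States in which the two next-state functions disagree."""
    return [s for s in states if f(*bits(s)) != g(*bits(s))]


c_diff_all = differing(c_complex, c_gc, range(16))
d_diff_reachable = differing(d_complex, d_gc, REACHABLE)
d_diff_all = differing(d_complex, d_gc, range(16))
```

+Under these assumptions the two forms of c agree everywhere, while the two
+forms of d agree on every reachable state and differ only in unreachable
+states, which the synthesis is free to treat as don't-cares.
+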
+Using standard C-elements and set/reset functions that satisfy the monotonic
+cover constraint:
+
+> petrify ex2.g -gcm -eqn ex2-gcm.eqn
+
+...
+
+> more ex2-gcm.eqn
+
+# EQN file for model ex2
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 10.00
+
+INORDER = a b c d;
+OUTORDER = [c] [d];
+[0] = a’ b c’ + d;
+[d] = a b c’;
+[c] = c b + [0];    # mappable onto gC
+
+Again, the equations for the generalized C-element should be “interpreted”
+according to equation 6.1 on page 96.
+
+6.8.2
+Control circuit for a 4-phase bundled-data latch
+
+Figure 6.21 shows an asynchronous handshake latch with a dummy environ-
+ment on its left and right side. The latch can be implemented using a normal
+N-bit wide transparent latch and the control circuit we are about to design.
+A driver may be needed for the latch control signal Lt. In order to make the
+latch controller robust and independent of the delay in this driver, we may feed
+the buffered signal (Lt) back such that the controller knows when the signal
+has been presented to the latch. Figure 6.21 also shows fragments of the STG
+specification – the handshaking of the left and right hand environments and
+ideas about the behaviour of the latch controller. Initially Lt is low and the
+latch is transparent, and when new input data arrives they will flow through
+the latch. In response to Rin+, and provided that the right hand environment
+is ready for another handshake (Aout = 0), the controller may generate Rout+
+right away. Furthermore the data should be latched, Lt+, and an acknowledge
+sent to the left hand environment, Ain+. A symmetric scenario is possible in
+response to Rin- when the latch is switched back into the transparent mode.
+
+Figure 6.21.
+A 4-phase bundled-data handshake latch and some STG fragments that capture
+ideas about its behaviour. The latch enable is driven by Lt: EN = 0 means the
+latch is transparent and EN = 1 means the latch holds data.
+
+Figure 6.22.
+The resulting STG for the latch controller (as input to VSTGL).
+
+Combining these STG fragments results in the STG shown in figure 6.22.
+Running Petrify yields the following:
+
+> petrify lctl.g -cg -eqn lctl-cg.eqn
+
+The STG has CSC.
+# File generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# from <lctl.g> on 6-Mar-01 at 11:18 AM
+...
+# The original TS had (before/after minimization) 16/16 states
+# Original STG: 0 places, 10 transitions, 14 arcs ( 0 pt + ...
+# Current STG: 0 places, 10 transitions, 12 arcs ( 0 pt + ...
+# It is a Marked Graph
+.model lctl
+.inputs Aout Rin
+.outputs Lt Rout Ain
+.graph
+Rout+ Aout+ Lt+
+Lt+ Ain+
+Aout+ Rout-
+Rin+ Rout+
+Ain+ Rin-
+Rin- Rout-
+Ain- Rin+
+Rout- Lt- Aout-
+Aout- Rout+
+Lt- Ain-
+.marking { <Aout-,Rout+> <Ain-,Rin+> }
+.end
+
+> more lctl-cg.eqn
+
+# EQN file for model lctl
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 7.00
+
+INORDER = Aout Rin Lt Rout Ain;
+OUTORDER = [Lt] [Rout] [Ain];
+[Lt] = Rout;
+[Rout] = Rin (Rout + Aout’) + Aout’ Rout;
+[Ain] = Lt;
+
+The equation for [Rout] may be rewritten as:
+
+[Rout] = Rin Aout’ + Rout (Rin + Aout’)
+
+which can be recognized to be a C-element with inputs Rin and Aout’.
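+
+The claim can be confirmed by exhaustive comparison. The short Python sketch
+below evaluates the Petrify equation for all eight combinations of Rin, Aout
+and the current value of Rout, and compares it with a behavioural C-element
+model whose inputs are Rin and the complement of Aout. The function names are
+chosen for this example only.

```python
from itertools import product


def rout_petrify(rin, aout, rout):
    """Equation produced by Petrify: [Rout] = Rin (Rout + Aout') + Aout' Rout."""
    return bool((rin and (rout or not aout)) or ((not aout) and rout))


def c_element(x, y, z):
    """C-element: follow the inputs when they agree, otherwise hold the state z."""
    if x == y:
        return bool(x)
    return bool(z)


# Compare both models over every input/state combination.
mismatches = [
    (rin, aout, rout)
    for rin, aout, rout in product((0, 1), repeat=3)
    if rout_petrify(rin, aout, rout) != c_element(rin, 1 - aout, rout)
]
```

+An empty list of mismatches means the two descriptions agree for every
+combination of inputs and state.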
+
+
+6.8.3
+Control circuit for a 4-phase bundled-data MUX
+
+After the above two examples, where we have worked out already well-
+known circuit implementations, let us now consider a more complex example
+that cannot (easily) be done by hand. Figure 6.23 shows the handshake multi-
+plexer from figure 3.3 on page 32. It also shows how the handshake MUX can
+be implemented by a “regular” combinational circuit multiplexer and a control
+circuit. Below we will design a speed-independent control circuit for a 4-phase
+bundled-data MUX.
+
+Figure 6.23.
+The handshake MUX and the structure of a 4-phase bundled-data implementation.
+
+The MUX has three input channels and we must assume they are connected
+to three independent dummy environments. The dots remind us that the chan-
+nels are push channels. When specifying the behaviour of the MUX control
+circuit and its (dummy) environment it is important to keep this in mind. A
+typical error when drawing STGs is to specify an environment with a more
+limited behaviour than the real environment. For each of the three input
+channels the STG has cycles involving (Req+; Ack+; Req-; Ack-; etc.), and each
+of these cycles is initialized to contain a token.
+As mentioned previously, it is sometimes easier to deal with control
+channels using dual-rail (or in general 1-of-N) data encodings since this implies
+dealing with one-hot (decoded) control signals. As a first step towards the STG
+for a MUX using entirely 4-phase bundled-data channels, figure 6.24 shows an
+STG for a MUX where the control channel uses dual-rail signals (Ctl.t, Ctl.f
+and CtlAck). This STG can then be combined with the STG-fragment for a
+4-phase bundled-data channel from figure 6.8 on page 91, resulting in the STG
+in figure 6.25. The “intermediate” STG in figure 6.24 emphasizes the fact that
+the MUX can be seen as a controlled join – the two mutually exclusive and
+structurally identical halves are basically the STGs of a join.
+
+
+Figure 6.24.
+The STG specification of the control circuit for a 4-phase bundled-data MUX
+using a 4-phase dual-rail control channel. Combined with the STG fragment for a bundled-data
+(control) channel the resulting STG for an all 4-phase bundled-data MUX is obtained (figure 6.25).
+
+Figure 6.25.
+The final STG specification of the control circuit for the 4-phase bundled-data
+MUX. All channels, including the control channel, are 4-phase bundled-data.
+
+
+Below is the result of running Petrify, this time with the -o option that writes
+the resulting STG (possibly with state signals added) in a file rather than to
+stdout.
+
+> petrify MUX4p.g -o MUX4p-csc.g -gcm -eqn MUX4p-gcm.eqn
+
+State coding conflicts for signal In1Ack
+State coding conflicts for signal In0Ack
+State coding conflicts for signal OutReq
+The STG has no CSC.
+Adding state signal: csc0
+The STG has CSC.
+
+> more MUX4p-gcm.eqn
+
+# EQN file for model MUX4p
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 29.00
+
+INORDER = In0Req OutAck In1Req Ctl CtlReq In1Ack In0Ack OutReq
+CtlAck csc0;
+OUTORDER = [In1Ack] [In0Ack] [OutReq] [CtlAck] [csc0];
+[In1Ack] = OutAck csc0’;
+[In0Ack] = OutAck csc0;
+[2] = CtlReq (In1Req csc0’ + In0Req Ctl’);
+[3] = CtlReq’ (In1Req’ csc0’ + In0Req’ csc0);
+[OutReq] = OutReq [3]’ + [2];    # mappable onto gC
+[5] = OutAck’ csc0;
+[CtlAck] = CtlAck [5]’ + OutAck;    # mappable onto gC
+[7] = OutAck’ CtlReq’;
+[8] = CtlReq Ctl;
+[csc0] = csc0 [8]’ + [7];    # mappable onto gC
+
+As can be seen, the STG does not satisfy CSC (complete state coding) as
+several markings correspond to the same state vector, so Petrify adds an
+internal state-signal csc0. The intuition is that after CtlReq- the Boolean signal
+Ctl is no longer valid but the MUX control circuit has not yet finished its job.
+If the circuit can’t see what to continue doing from its input signals it needs
+an internal state variable in which to keep this information. The signal csc0 is
+an active-low signal: it is set low if Ctl ≠ 0 when CtlReq+ and it is set back
+to high when OutAck and CtlReq are both low. The fact that the signal csc0 is
+high when all channels are idle (all handshake signals are low) should be kept
+in mind when dealing with reset, c.f. section 6.5.
+The exact details of how the state variable is added can be seen from the
+STG that includes csc0 which is produced by Petrify before it synthesizes the
+logic expressions for the circuit.
+
+
+It is sometimes possible to avoid adding a state variable by re-shuffling sig-
+nal transitions. It is not always obvious what yields the best solution. In prin-
+ciple more concurrency should improve performance, but it also results in a
+larger state-space for the circuit and this often tends to result in larger and
+slower circuits. A discussion of performance also involves the interaction with
+the environment. There is plenty of room for exploring alternative solutions.
+
+Figure 6.26.
+The modified STG specification of the 4-phase bundled-data MUX control circuit.
+
+In figure 6.26 we have removed some concurrency from the MUX STG by
+ordering the transitions on In0Ack/In1Ack and CtlAck (In0Ack+ → CtlAck+,
+In1Ack+ → CtlAck+ etc.). This STG satisfies CSC and the resulting circuit is
+marginally smaller:
+
+> more MUX4p-gcm.eqn
+
+# EQN file for model MUX4pB
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 27.00
+
+INORDER = In0Req OutAck In1Req Ctl CtlReq In1Ack In0Ack OutReq CtlAck;
+OUTORDER = [In1Ack] [In0Ack] [OutReq] [CtlAck];
+[0] = Ctl CtlReq OutAck;
+[1] = Ctl' CtlReq OutAck;
+[2] = CtlReq (Ctl' In0Req + Ctl In1Req);
+[3] = CtlReq' (In0Ack' In1Req' + In0Req' In0Ack);
+[OutReq] = OutReq [3]' + [2];
+# mappable onto gC
+[CtlAck] = In1Ack + In0Ack;
+[In1Ack] = In1Ack OutAck + [0];
+# mappable onto gC
+[In0Ack] = In0Ack OutAck + [1];
+# mappable onto gC
+
+6.9.
+Summary
+
+This chapter has provided an introduction to the design of asynchronous se-
+quential (control) circuits with the main focus on speed-independent circuits
+and specifications using STGs. The material was presented from a practical
+view in order to enable the reader to go ahead and design his or her own speed-
+independent control circuits. This, rather than comprehensiveness, has been
+our goal, and as mentioned in the introduction we have largely ignored impor-
+tant alternative approaches including burst-mode and fundamental-mode cir-
+cuits.
+
+
+
+Chapter 7
+
+ADVANCED 4-PHASE BUNDLED-DATA
+PROTOCOLS AND CIRCUITS
+
+The previous chapters have explained the basics of asynchronous circuit
+design. In this chapter we will address 4-phase bundled-data protocols and
+circuits in more detail. This will include: (1) a variety of channel types, (2)
+protocols with different data-validity schemes, and (3) a number of more so-
+phisticated latch control circuits. These latch controllers are interesting for
+two reasons: they are very useful in optimizing the circuits for area, power and
+speed, and they are nice examples of the types of control circuits that can be
+specified and synthesized using the STG-based techniques from the previous
+chapter.
+
+7.1.
+Channels and protocols
+
+7.1.1
+Channel types
+
+So far we have considered only push channels where the sender is the active
+party that initiates the communication of data, and where the receiver is the
+passive party. The opposite situation, where the receiver is the active party
+that initiates the communication of data, is also possible, and such a channel
+is called a pull channel. A channel that carries no data is called a nonput
+channel and is used for synchronization purposes. Finally, it is also possible
+to communicate data from a receiver to a sender along with the acknowledge
+signal. Such a channel is called a biput channel. In a 4-phase bundled-data
+implementation data from the receiver is bundled with the acknowledge, and in
+a 4-phase dual-rail protocol the passive party will acknowledge the reception
+of a codeword by returning a codeword rather than just an acknowledge
+signal. Figure 7.1 illustrates these four channel types (nonput, push, pull, and
+biput) assuming a bundled-data protocol. Each channel type may, of course,
+use any of the handshake protocols (2-phase or 4-phase) and data encodings
+(bundled-data, dual-rail, m-of-n, etc.) introduced previously.
+
+
+7.1.2
+Data-validity schemes
+
+For the bundled-data protocols it is also relevant to define the time interval
+in which data is valid, and figure 7.2 illustrates the different possibilities.
+For a push channel the request signal carries the message “here is new data
+for you” and the acknowledge signal carries the information “thank you, I have
+absorbed the data, and you may release the data wires.” Similarly, for a pull
+channel the request signal carries the message “please send new data” and the
+acknowledge signal carries the message “here is the data that you requested.”
+It is the signal transitions on the request and acknowledge wires that are in-
+terpreted in this way. A 4-phase handshake involves two transitions on each
+wire and, depending on whether it is the rising or the falling transitions on
+the request and acknowledge signals that are interpreted in this way, several
+data-validity schemes emerge: early, broad, late and extended early.
+Since 2-phase handshaking does not involve any redundant signal transitions
+there is only one data-validity scheme for each channel type (push or pull), as
+illustrated in figure 7.2.
+It is common to all of the data-validity schemes that the data is valid some
+time before the event that indicates the start of the interval, and that it remains
+stable until some time after the event that indicates the end of the interval.
+Furthermore, all of the data-validity schemes express the requirements of the
+party that receives the data. The fact that a receiver signals “thank you, I have
+absorbed the data, and you may go ahead and release the data wires,” does
+not mean that this actually happens – the sender may prolong the data-validity
+interval, and the receiver may even rely on this.
+A typical example of this is the extended-early data-validity scheme in
+figure 7.2. On a push channel the data-validity interval begins some time before
+Req+ and ends some time after Req-.
+
+7.1.3
+Discussion
+
+The above classification of channel types and handshake protocols stems
+mostly from Peeters’ Ph.D. thesis [112]. The choice of channel type, hand-
+shake protocol and data-validity scheme obviously affects the implementation
+of the handshake components in terms of area, speed, and power. Just as a
+design may use a mix of different bundled-data and dual-rail protocols, it may
+also use a mix of channel types and data-validity schemes.
+For example, a 4-phase bundled-data push channel using a broad or an
+extended-early data-validity scheme is a very convenient input to a function
+block that is implemented using precharged CMOS circuitry: the request signal
+may directly control the precharge and evaluate transistors because the broad
+and the extended-early data-validity schemes guarantee that the input data is
+stable during the evaluate phase.
+
+
+[Figure 7.1: diagrams not reproduced. Panels: nonput channel (Req, Ack), push channel (bundled data: Req, Ack, n-bit Data), pull channel (bundled data: Req, Ack, n-bit Data), and biput channel (bundled data: Req, Ack, Data).]
+
+Figure 7.1.
+The four fundamental channel types: nonput, push, biput, and pull.
+
+[Figure 7.2: timing diagrams not reproduced. Panels: 4-phase protocol, push channel (Req, Ack, and Data for the early, broad, late and extended-early schemes); 4-phase protocol, pull channel (Req, Ack, and Data for the same four schemes); 2-phase protocols (Req, Ack, Data for a push channel and Data for a pull channel).]
+
+Figure 7.2.
+Data-validity schemes for 2-phase and 4-phase bundled data.
+
+
+Another interesting option in a 4-phase bundled-data design is to use func-
+tion blocks that assume a broad data validity scheme on the input channel and
+that produce a late data validity scheme on the output channel. Under these
+assumptions it is possible to use a symmetric delay element that matches only
+half of the latency of the combinatorial circuit. The idea is that the sum of the
+delays of Req+ and Req- matches the latency of the combinatorial circuit, and
+that Req- indicates valid output data. In [112, p.46] this is referred to as true
+single phase because the return-to-zero part of the handshaking is no longer
+redundant. This approach also has implications for the implementation of the
+components that connect to the function block.
+It is beyond the scope of this text to enter into a discussion of where and
+when to use the different options. The interested reader is referred to [112, 77]
+for more details.
+
+7.2.
+Static type checking
+
+When designing circuits it is useful to think of the combination of channel
+type and data-validity scheme as being similar to a data type in a programming
+language, and to do some static type checking of the circuit being designed
+by asking questions like: “what types are allowed on the input ports of this
+handshake component?” and “what types are produced on the output ports
+of this handshake component?”. The latter may depend on the type that was
+provided on the input port. A similar form of type checking for synchronous
+circuits using two-phase non-overlapping clocks has been proposed in [104]
+and used in the Genesil silicon compiler [67].
+
+[Figure 7.3: diagram not reproduced. Nodes: “broad”, “extended early”, “late” and “early”.]
+
+Figure 7.3.
+Hierarchical ordering of the four data-validity schemes for the 4-phase bundled-data
+protocol.
+
+Figure 7.3 shows a hierarchical ordering of the four possible types (data
+validity schemes) for a 4-phase bundled-data push channel: “broad” is the
+strongest type and it can be used as input to circuits that require any of the
+weaker types. Similarly “extended early” may be used where only “early” is
+required. Circuits that are transparent to handshaking (function blocks, join,
+fork, merge, mux, demux) produce outputs whose type is at most as strong as
+their (weakest) input type. In general the input and output types are the same
+but there are examples where this is not the case. The only circuit that can
+produce outputs whose type is stronger than the input type is a latch. Let us
+look at some examples:
+
+A join that concatenates two inputs of type “extended early” produces
+outputs that are only “early.”
+
+From the STG fragments in figure 6.21 on page 107 it may be seen that
+the simple 4-phase bundled-data latch controller from the previous chap-
+ters (figure 2.9 on page 18) assumes “early” inputs and that it produces
+“extended-early” outputs.
+
+The 4-phase bundled-data MUX design in section 6.8.3 assumes “extended
+early” on its control input (the STG in figure 6.25 on page 110
+specifies stable input from CtlReq+ to CtlReq-).
+
+The reader is encouraged to continue this exercise and perhaps draw the as-
+sociated timing diagrams from which the types of the outputs may be deduced.
+The type checking explained here is a very useful technique for debugging
+circuits that exhibit erroneous behaviour.
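+
+The static type check described in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the ordering table is an assumption read off the description of figure 7.3 (with “late” and “early” treated as incomparable), and the names WEAKER, satisfies and failing_connections are hypothetical.
+
```python
# Data-validity types for a 4-phase bundled-data push channel.
# Assumption: each type maps to the types directly weaker than it,
# following the description of figure 7.3 in the text.
WEAKER = {
    "broad": {"extended early", "late"},
    "extended early": {"early"},
    "late": set(),
    "early": set(),
}


def satisfies(provided: str, required: str) -> bool:
    """Return True if a channel of type `provided` may feed a port
    that requires type `required` (same type or any weaker type)."""
    if provided == required:
        return True
    return any(satisfies(weaker, required) for weaker in WEAKER[provided])


def failing_connections(connections):
    """Given (source, sink, provided_type, required_type) tuples,
    return the (source, sink) pairs that fail the type check."""
    return [
        (source, sink)
        for source, sink, provided, required in connections
        if not satisfies(provided, required)
    ]


if __name__ == "__main__":
    # A join of two "extended early" inputs only produces "early", so it
    # cannot drive a MUX control input that assumes "extended early".
    netlist = [
        ("join", "mux.ctl", "early", "extended early"),
        ("latch", "mux.ctl", "extended early", "extended early"),
    ]
    print(failing_connections(netlist))
```
+
+Keeping the ordering in a single table makes it easy to adjust if the hierarchy in figure 7.3 differs from the assumption made here.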
+
+7.3.
+More advanced latch control circuits
+
+In previous chapters we have only considered 4-phase bundled-data hand-
+shake latches using a latch controller consisting of a C-element and an inverter
+(figure 2.9 on page 18). In [41] this circuit is called a simple latch controller,
+and in [77] it is called an un-decoupled latch controller.
+
+[Figure 7.4: diagrams not reproduced. (a) shows a chain of handshake latches holding valid tokens D1, D2, D3 alternating with empty tokens E; (b) shows level-sensitive latches L with enable inputs EN, each controlled by a C-element, between Req/Ack/Data interfaces.]
+
+Figure 7.4.
+(a) A FIFO based on handshake latches, and (b) its implementation using simple
+latch controllers and level-sensitive latches. The FIFO fills with valid data in every other latch.
+A latch is transparent when EN = 0 and it is opaque (holding data) when EN = 1.
+
+[Figure 7.5: diagram not reproduced. Three level-sensitive latches holding D3, D2 and D1, each with an EN input driven by a latch control block, between Req/Ack/Data interfaces.]
+
+Figure 7.5.
+A FIFO where every level-sensitive latch holds valid data when the FIFO is full.
+The semi-decoupled and fully-decoupled latch controllers from [41] allow this behaviour.
+
+When a pipeline or FIFO that uses the simple latch controller fills, every
+second handshake latch will be holding a valid token and every other handshake
+latch will be holding an empty token as illustrated in figure 7.4(a); the
+static spread of the pipeline is S = 2.
+This token picture is a bit misleading. The empty tokens correspond to
+the return-to-zero part of the handshaking and in reality the latches are not
+“holding empty tokens” – they are transparent, and this represents a waste of
+hardware resource.
+Ideally one would want to store a valid token in every level-sensitive latch
+as illustrated in figure 7.5 and just “add” the empty tokens to the data stream
+on the interfaces as part of the handshaking. In [41] Furber and Day explain
+the design of two such improved 4-phase bundled-data latch control circuits:
+a semi-decoupled and a fully-decoupled latch controller. In addition to these
+specific circuits the paper also provides a nice illustration of the use of STGs
+for designing control circuits following the approach explained in chapter 6.
+The three latch controllers have the following characteristics:
+
+The simple or un-decoupled latch controller has the problem that new
+input data can only be latched when the previous handshake on the output
+channel has completed, i.e., after Aout-. Furthermore, the handshakes
+on the input and output channels interact tightly: Rout+ → Ain+
+and Rout- → Ain-.
+
+The semi-decoupled latch controller relaxes these requirements somewhat:
+new inputs may be latched after Rout-, and the controller may
+produce Ain+ independently of the handshaking on the output channel;
+the interaction between the input and output channels has been relaxed
+to: Aout+ → Ain-.
+
+The fully-decoupled latch controller further relaxes these requirements:
+new inputs may be latched after Aout+ (i.e. as soon as the downstream
+latch has indicated that it has latched the current data), and the handshaking
+on the input channel may complete without any interaction with
+the output channel.
+
+Table 7.1.
+Summary of the characteristics of the latch controllers in [41].
+
+Latch controller      Static spread, S    Period, P
+“Simple”              2                   2Lr + 2Lf.V
+“Semi-decoupled”      1                   2Lr + 2Lf.V
+“Fully-decoupled”     1                   2Lr + Lf.V + Lf.E
+
+Another potential drawback of the simple latch controller is that it is unable
+to take advantage of function blocks with asymmetric delays as explained in
+section 4.4.1 on page 52. The fully-decoupled latch controller presented in
+[41] does not have this problem. Due to the decoupling of the input and output
+channels the dependency graph critical cycle that determines the period, P,
+only visits nodes related to two neighbouring pipeline stages and the period
+becomes minimum (c.f. section 4.4.1). Table 7.1 summarizes the characteristics
+of the simple, semi-decoupled and fully-decoupled latch controllers.
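+
+As a small worked illustration, the period expressions of Table 7.1 can be evaluated for concrete latencies. The sketch below assumes the table entries read 2Lr + 2Lf.V for the simple and semi-decoupled controllers and 2Lr + Lf.V + Lf.E for the fully-decoupled controller; the function name period and the latency values are hypothetical.
+
```python
def period(controller: str, l_r: float, l_f_valid: float, l_f_empty: float) -> float:
    """Evaluate the period P for one of the three latch controllers.

    l_r       -- reverse latency (Lr)
    l_f_valid -- forward latency for valid tokens (Lf.V)
    l_f_empty -- forward latency for empty tokens (Lf.E)
    The formulas are an assumed reading of Table 7.1.
    """
    if controller in ("simple", "semi-decoupled"):
        return 2 * l_r + 2 * l_f_valid
    if controller == "fully-decoupled":
        return 2 * l_r + l_f_valid + l_f_empty
    raise ValueError(f"unknown latch controller: {controller!r}")


if __name__ == "__main__":
    # Hypothetical asymmetric function block: slow for valid data,
    # fast for the return-to-zero (empty) phase.
    for name in ("simple", "semi-decoupled", "fully-decoupled"):
        print(name, period(name, l_r=1.0, l_f_valid=5.0, l_f_empty=1.0))
```
+
+With these made-up numbers only the fully-decoupled controller benefits from the short empty-token latency, which is the point made in the text about asymmetric delays.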
+All of the above-mentioned latch controllers are “normally transparent”
+and this may lead to excessive power consumption because inputs that make
+multiple transitions before settling will propagate through several consecutive
+pipeline stages. By using “normally opaque” latch controllers every latch will
+act as a barrier. If a handshake latch that is holding a bubble is exposed to
+a token on its input, the latch controller switches the latch into the transpar-
+ent mode, and when the input data have propagated safely into the latch, it
+will switch the latch back into the opaque mode in which it will hold the data.
+In the design of the asynchronous MIPS processor reported in [23] we expe-
+rienced approximately a 50 % power reduction when using normally opaque
+latch controllers instead of normally transparent latch controllers.
+Figure 7.6 shows the STG specification and the circuit implementation of
+the normally opaque latch controller used in [23]. As seen from the STG there
+is quite a strong interaction between the input and output channels, but the
+dependency graph critical cycle that determines the period only visits nodes
+related to two neighbouring pipeline stages and the period is minimum. It may
+be necessary to add some delay into the Lt- to Rout+ path in order to ensure
+that input signals have propagated through the latch before Rout+. Furthermore
+the duration of the Lt = 0 pulse that causes the latch to be transparent is
+determined by gate delays in the latch controller itself, and the pulse must be
+long enough to ensure safe latching of the data. The latch controller assumes
+a broad data-validity scheme on its input channel and it provides a broad data-
+validity scheme on its output channel.
+
+[Figure 7.6: STG and circuit diagram not reproduced. The STG has transitions Rin+, Rin-, Ain+, Ain-, Rout+, Rout-, Aout+, Aout-, Lt+, Lt-, B+ and B-; the circuit uses C-elements and drives the latch enable EN from Lt, with data path Din to Dout. Lt = 0: latch is transparent. Lt = 1: latch is opaque (holding data).]
+
+Figure 7.6.
+The STG specification and the circuit implementation of the normally opaque
+fully-decoupled latch controller from [23].
+
+7.4.
+Summary
+
+This chapter introduced a selection of channel types, data-validity schemes
+and a selection of latch controllers. The presentation was rather brief; the aim
+was just to present the basics and to introduce some of the many options and
+possibilities for optimizing the circuits. The interested reader is referred to the
+literature for more details.
+Finally a warning: the static data-flow view of asynchronous circuits pre-
+sented in chapter 3 (i.e. that valid and empty tokens are copied forward con-
+trolled by the handshaking between latch controllers) and the performance
+analysis presented in chapter 4 assume that all handshake latches use the sim-
+ple normally transparent latch controller. When using semi-decoupled or fully-
+decoupled latch controllers, it is necessary to modify the token flow view, and
+to rework the performance analysis. To a first order one might substitute each
+semi-decoupled or fully-decoupled latch controller with a pair of simple latch
+controllers. Furthermore a ring need only include two handshake latches if
+semi-decoupled or fully-decoupled latch controllers are used.
+
+
+Chapter 8
+
+HIGH-LEVEL LANGUAGES AND TOOLS
+
+This chapter addresses languages and CAD tools for the high-level modeling
+and synthesis of asynchronous circuits. The aim is briefly to introduce some
+basic concepts and a few representative and influential design methods. The
+interested reader will find more details elsewhere in this book (in Part II and
+chapter 13) as well as in the original papers that are cited in the text. In the last
+section we address the use of VHDL for the design of asynchronous circuits.
+
+8.1.
+Introduction
+
+Almost all work on the high-level modeling and synthesis of asynchronous
+circuits is based on the use of a language that belongs to the CSP family of
+languages, rather than one of the two industry-standard hardware description
+languages, VHDL and Verilog. Asynchronous circuits are highly concurrent
+and communication between modules is based on handshake channels. Con-
+sequently a hardware description language for asynchronous circuit design
+should provide efficient primitives supporting these two characteristics. The
+CSP language proposed by Hoare [57, 58] meets these requirements. CSP
+stands for “Communicating Sequential Processes” and its key characteristics
+are:
+
+Concurrent processes.
+
+Sequential and concurrent composition of statements within a process.
+
+Synchronous message passing over point-to-point channels (supported
+by the primitives send, receive and – possibly – probe).
+
+CSP is a member of a large family of languages for programming concurrent
+systems in general: OCCAM [68], LOTOS [108, 16], and CCS [89], as well as
+languages defined specifically for designing asynchronous circuits: Tangram
+[142, 135], CHP [81], and Balsa [9, 10]. Further details are presented else-
+where in this book on Tangram (in Part III, chapter 13) and Balsa (in Part II).
+In this chapter we first take a closer look at the CSP language constructs
+supporting communication and concurrency. This will include a few sample
+programs to give a flavour of this type of language. Following this we briefly
+explain two rather different design methods that both take a CSP-like program
+as the starting point for the design:
+
+At Philips Research Laboratories, van Berkel, Peeters, Kessels et al. have
+developed a proprietary language, Tangram, and an associated silicon
+compiler [142, 141, 135, 112]. Using a process called syntax-directed
+compilation, the synthesis tool maps a Tangram program into a structure
+of handshake components. Using these tools several significant asyn-
+chronous chips have been designed within Philips [137, 138, 144, 73,
+74]. The last of these is a smart-card circuit that is described in chap-
+ter 13 on page 221.
+
+At Caltech Martin has developed a language CHP – Communicating
+Hardware Processes – and a set of tools that supports a partly manual,
+partly automated design flow that targets highly optimized transistor-
+level implementations of QDI 4-phase dual-rail circuits [80, 83].
+
+CHP has a syntax that is similar to CSP (using various special symbols)
+whereas Tangram has a syntax that is more like a traditional programming
+language (using keywords); but in essence they are both very similar to CSP.
+In the last section of this chapter we will introduce a VHDL-package that
+provides CSP-like message passing and explain an associated VHDL-based
+design flow that supports a manual step-wise refinement design process.
+
+8.2.
+Concurrency and message passing in CSP
+
+The “sequential processes” part of the CSP acronym denotes that each pro-
+cess is described by a program whose statements are executed in sequence one
+by one. A semicolon is used to separate statements (as in many other program-
+ming languages). The semicolon can be seen as an operator that combines
+statements into programs. In this respect a process in CSP is very similar to a
+process in VHDL. However, CSP also allows the parallel composition of statements
+within a process. The symbol “‖” denotes parallel composition. This
+feature is not found in VHDL, whereas the fork-join construct in Verilog does
+allow statement-level concurrency within a process.
+The “communicating” part of the CSP acronym refers to synchronous mes-
+sage passing using point-to-point channels as illustrated in figure 8.1, where
+two processes P1 and P2 are connected by a channel named C. Using a send
+statement, C!x, process P1 sends (denoted by the ‘!’ symbol) the value of its
+variable x on channel C, and using a receive statement, C?y, process P2 re-
+ceives (denoted by the ‘?’ symbol) from channel C a value that is assigned
+to its variable y. The channel is memoryless and the transfer of the value of
+variable x in P1 into variable y in P2 is an atomic action. This has the effect
+of synchronizing processes P1 and P2. Whichever comes first will wait for
+the other party, and the send and receive statements complete at the same time.
+The term rendezvous is sometimes used for this type of synchronization.
+
+P1:                 P2:
+var x ...           var y,z ...
+....                ....
+x:= 17;             C?y;
+C!x;                z:= y -17;
+....                ....
+
+Figure 8.1.
+Two processes P1 and P2 connected by a channel C. Process P1 sends the value of
+its variable x to the channel C, and process P2 receives the value and assigns it to its variable y.
+
+When a process executes a send (or receive) statement, it commits to the
+communication and suspends until the process at the other end of the channel
+performs its receive (or send) statement. This may not always be desirable, and
+Martin has extended CSP with a probe construct [79] which allows the process
+at the passive end of a channel to probe whether or not a communication is
+pending on the channel, without committing to any communication. The probe
+is a function which takes a channel name as its argument and returns a Boolean.
+The syntax for probing channel C is C̄ (the channel name written with an overbar).
+As an aside we mention that some languages for programming concurrent
+systems assume channels with (possibly unbounded) buffering capability. The
+implication of this is that the channel acts as a FIFO, and the communicating
+processes do not synchronize when they communicate. Consequently this form
+of communication is called asynchronous message passing.
+Going back to our synchronous message passing, it is obvious that the phys-
+ical implementation of a memoryless channel is simply a set of wires together
+with a protocol for synchronizing the communicating processes. It is also obvi-
+ous that any of the protocols that we have considered in the previous chapters
+may be used. Synchronous message passing is thus a very useful language
+construct that supports the high-level modeling of asynchronous circuits by
+abstracting away the exact details of the data encoding and handshake protocol
+used on the channel.
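+
+The rendezvous behaviour of a memoryless point-to-point channel can be imitated in software. The following Python sketch is an assumption-laden illustration, not part of CSP or Tangram: the class Channel and the function run_example are hypothetical names, and the two queues merely approximate the atomic transfer by making send block until the receiver has taken the value. It replays the P1/P2 example of figure 8.1.
+
```python
import queue
import threading


class Channel:
    """Hypothetical memoryless channel for one sender and one receiver."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)  # carries the value
        self._done = queue.Queue(maxsize=1)  # carries the acknowledgement

    def send(self, value):
        """C!x: offer the value, then wait until the receiver has taken it."""
        self._slot.put(value)
        self._done.get()

    def receive(self):
        """C?y: wait for a value, acknowledge it, and return it."""
        value = self._slot.get()
        self._done.put(None)
        return value


def run_example():
    """Run P1 and P2 from figure 8.1 and return the value of z in P2."""
    c = Channel()
    result = {}

    def p1():
        x = 17
        c.send(x)            # C!x

    def p2():
        y = c.receive()      # C?y
        result["z"] = y - 17

    threads = [threading.Thread(target=p1), threading.Thread(target=p2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["z"]


if __name__ == "__main__":
    print(run_example())
```
+
+Whichever thread reaches the channel first waits for the other, which mirrors the synchronization described above; a buffered queue without the acknowledgement step would instead model asynchronous message passing.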
+Unfortunately both VHDL and Verilog lack such primitives. It is possible
+to write low-level code that implements the handshaking, but it is highly unde-
+sirable to mix such low-level details into code whose purpose is to capture the
+high-level behaviour of the circuit.
+In the following section we will provide some small program examples to
+give a flavour of this type of language. The examples will be written in Tangram
+as they also serve the purpose of illustrating syntax-directed compilation
+in a subsequent section. The source code, handshake circuit figures, and frag-
+ments of the text have been kindly provided by Ad Peeters from Philips.
+Manchester University has recently developed a similar language and syn-
+thesis tool that is available in the public domain [10], and is introduced in Part
+II of this book. Other examples of related work are presented in [17] and [21].
+
+8.3.
+Tangram: program examples
+
+This section provides a few simple Tangram program examples: a 2-place
+shift register, a 2-place ripple FIFO, and a greatest common divisor function.
+
+8.3.1
+A 2-place shift register
+
+Figure 8.2 shows the code for a 2-place shift register named ShiftReg. It is
+a process with an input channel in and an output channel out, both carrying
+variables of type [0..255]. There are two local variables x and y that are
+initialized to 0. The process performs an unbounded repetition of a sequence
+of three statements: out!y; y:=x; in?x.
+
+[Block diagram not reproduced: ShiftReg with input channel in, variables x and y, and output channel out.]
+
+T = type [0..255]
+& ShiftReg : main proc(in? chan T & out! chan T).
+begin
+& var x,y: var T := 0
+|
+forever do
+out!y ; y:=x ; in?x
+od
+end
+
+Figure 8.2.
+A Tangram program for a 2-place shift register.
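+
+To make the behaviour of the shift register concrete, the loop body out!y; y:=x; in?x can be mimicked in Python. This is a behavioural sketch only: channels are replaced by plain lists, so no handshaking is modelled, and the function name shift_reg is hypothetical.
+
```python
def shift_reg(inputs, x=0, y=0):
    """Mimic the ShiftReg loop for a finite list of input values.

    Each iteration performs, in order:
      out!y   -- emit the current value of y
      y := x  -- shift x into y
      in?x    -- read the next input into x
    Both variables start at 0, as in the Tangram program.
    """
    outputs = []
    for value in inputs:
        outputs.append(y)  # out!y
        y = x              # y := x
        x = value          # in?x
    return outputs


if __name__ == "__main__":
    # The two initial zeros appear on the output before the first input does.
    print(shift_reg([1, 2, 3, 4]))
```
+
+Because the three statements are strictly sequential, an output must be consumed before the next input is accepted; the FIFO in the next subsection removes this coupling.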
+
+8.3.2
+A 2-place (ripple) FIFO
+
+Figure 8.3 shows the Tangram program for a 2-place first-in first-out buffer
+named Fifo. It can be understood as two 1-place buffers that are operating in
+parallel and that are connected by a channel c. At first sight it appears very
+similar to the 2-place shift register presented above, but a closer examination
+will show that it is more flexible and exhibits greater concurrency.
+
+[Block diagram not reproduced: Fifo with input channel in, variable x, internal channel c, variable y, and output channel out.]
+
+T = type [0..255]
+& Fifo : main proc(in? chan T & out! chan T).
+begin
+& x,y: var T
+& c : chan T
+|
+   forever do in?x ; c!x od
+|| forever do c?y ; out!y od
+end
+
+Figure 8.3.
+A Tangram program for a 2-place (ripple) FIFO.
+
+8.3.3
+GCD using while and if statements
+
+Figure 8.4 shows the code for a module that computes the greatest common
+divisor, the example from section 3.7. The “do x<>y then ... od” is a while
+statement and, apart from the syntactical differences, the code in figure 8.4 is
+identical to the code in figure 3.11 on page 39.
+The module has an input channel from which it receives the two operands,
+and an output channel on which it sends the result.
+
+int = type [0..255]
+& gcd_if : main proc (in?chan <<int,int>> & out!chan int).
+begin x,y:var int ff
+| forever do
+    in?<<x,y>>
+  ; do x<>y then
+      if x<y then y:=y-x
+             else x:=x-y
+      fi
+    od
+  ; out!x
+  od
+end
+
+Figure 8.4.
+A Tangram program for GCD using while and if statements.
+
+
+8.3.4
+GCD using guarded commands
+
+Figure 8.5 shows an alternative version of GCD. This time the module has
+separate input channels for the two operands and its body is based on the repe-
+tition of a guarded command. The guarded repetition can be seen as a general-
+ization of the while statement. The statement repeats until all guards are false.
+When at least one of the guards is true, exactly one command corresponding to
+such a true guard is selected (either deterministically or non-deterministically)
+and executed.
+
+int = type [0..255]
+& gcd_gc : main proc (in1,in2?chan int & out!chan int).
+begin x,y:var int ff
+| forever do
+    in1?x || in2?y
+  ; do x<y then y:=y-x
+    or y<x then x:=x-y
+    od
+  ; out!x
+  od
+end
+
+Figure 8.5.
+A Tangram program for GCD using guarded repetition.
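+
+The semantics of guarded repetition can be sketched in Python as a loop over (guard, command) pairs. This is an illustrative sketch under stated assumptions: the helper guarded_repeat and the function gcd_gc are hypothetical names, the first enabled command is always chosen (one deterministic selection policy among those the text allows), and both operands are assumed to be positive, since the subtraction loop does not terminate when exactly one operand is 0.
+
```python
def guarded_repeat(state, commands):
    """Repeat until all guards are false.

    `commands` is a list of (guard, command) pairs operating on `state`.
    In each round one command whose guard is true is executed; here the
    first enabled one is picked.
    """
    while True:
        enabled = [command for guard, command in commands if guard(state)]
        if not enabled:
            return state
        state = enabled[0](state)


def gcd_gc(x, y):
    """GCD by repeated subtraction, mirroring
    'do x<y then y:=y-x or y<x then x:=x-y od'."""
    commands = [
        (lambda s: s[0] < s[1], lambda s: (s[0], s[1] - s[0])),  # x<y: y:=y-x
        (lambda s: s[1] < s[0], lambda s: (s[0] - s[1], s[1])),  # y<x: x:=x-y
    ]
    final_x, _ = guarded_repeat((x, y), commands)
    return final_x


if __name__ == "__main__":
    print(gcd_gc(12, 18))
```
+
+In this particular program the two guards are mutually exclusive, so the selection policy does not affect the result.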
+
+8.4.
+Tangram: syntax-directed compilation
+
+Let us now address the synthesis process. The design flow uses an inter-
+mediate format based on handshake circuits. The front-end design activity
+is called VLSI programming and, using syntax-directed compilation, a Tan-
+gram program is mapped into a structure of handshake components. There is a
+one-to-one correspondence between the Tangram program and the handshake
+circuit as will be clear from the following examples. The compilation process
+is thus fully transparent to the designer, who works entirely at the Tangram
+program level.
+The back-end of the design flow involves a library of handshake circuits
+that the compiler targets as well as some tools for post-synthesis peephole
+optimization of the handshake circuits (i.e. replacing common structures of
+handshake components by more efficient equivalent ones). A number of hand-
+shake circuit libraries exist, allowing implementations using different hand-
+shake protocols (4-phase dual-rail, 4-phase bundled-data, etc.), and different
+implementation technologies (CMOS standard cells, FPGAs, etc.). The hand-
+shake components can be specified and designed: (i) manually, or (ii) using
+STGs and Petrify as explained in chapter 6, or (iii) using the lower steps in
+Martin’s transformation-based method that is presented in the next section.
+
+
+It is beyond the scope of this text to explain the details of the compilation
+process. We will restrict ourselves to providing a flavour of “syntax-directed
+compilation” by showing handshake circuits corresponding to the example
+Tangram programs from the previous section.
+
+8.4.1
+The 2-place shift register
+
+As a first example of syntax-directed compilation figure 8.6 shows the hand-
+shake circuit corresponding to the Tangram program in figure 8.2.
+
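+[Figure 8.6: handshake circuit not reproduced. A repeater at the top drives a 3-way sequencer (outputs numbered 0, 1, 2), which activates three transferers connecting channel in, variable x, variable y and channel out.]
+
+Figure 8.6.
+The compiled handshake circuit for the 2-place shift register.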
+
+Handshake components are represented by circular symbols, and the chan-
+nels that connect the components are represented by arcs. The small dots on
+the component symbols represent ports. An open dot denotes a passive port
+and a solid dot denotes an active port. The arrowhead represents the direction
+of the data transfer. A nonput channel does not involve the transfer of data and
+consequently it has no direction and no arrowhead. As can be seen in figure 8.6
+a handshake circuit uses a mix of push and pull channels.
+The structure of the program is a forever-do statement whose body consists
+of three statements that are executed sequentially (because they are separated
+by semicolons). Each of the three statements is a kind of assignment statement:
+the value of variable y is “assigned” to output channel out, the value of variable
+x is assigned to variable y, and the value received on input channel in is assigned
+to variable x. The structure of the handshake circuit is exactly the same:
+
+At the top is a repeater that implements the forever-do statement. A
+repeater waits for a request on its passive input port and then it performs
+an unbounded repetition of handshakes on its active output channel. The
+handshake on the input channel never completes.
+
+Below is a 3-way sequencer that implements the semicolons in the pro-
+gram text. The sequencer waits for a request on its passive input channel,
+then it performs in sequence a full handshake on each of its active out-
+
+
+130
+Part I: Asynchronous circuit design – A tutorial
+
+put channels (in the order indicated by the numbers in the symbol) and
+finally it completes the handshaking on the passive input channel. In
+this way the sequencer activates in turn the handshake circuit constructs
+that correspond to the individual statements in the body of the forever-do
+statement.
+
+The bottom row of handshake components includes two variables, x and
+y, and three transferers, denoted by ‘�’. Note that variables have pas-
+sive read and write ports. The transferers implement the three statements
+(out!y; y:=x; in?x) that form the body of the forever-do statement,
+each a form of assignment. A transferer waits for a request on its passive
+nonput channel and then initiates a handshake on its pull input channel.
+The handshake on the pull input channel is relayed to the push output
+channel. In this way the transferer pulls data from its input channel and
+pushes it onto its output channel. Finally, it completes the handshaking
+on the passive nonput channel.
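+
+The behaviour of the 3-way sequencer described above can be sketched in plain
+VHDL. This is only an illustration, not the circuit produced by the Tangram
+compiler: it assumes a 4-phase protocol on single-rail request/acknowledge
+wires, and the entity name sequencer3 and the port names p_req, p_ack, a_req
+and a_ack are hypothetical.
+
+```vhdl
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+-- Hypothetical behavioural model of a 3-way sequencer (4-phase assumed).
+entity sequencer3 is
+  port ( p_req : in  std_logic;                          -- passive port: request
+         p_ack : out std_logic := '0';                   -- passive port: acknowledge
+         a_req : out std_logic_vector(0 to 2) := (others => '0');  -- active ports
+         a_ack : in  std_logic_vector(0 to 2) );
+end sequencer3;
+
+architecture behav of sequencer3 is
+begin
+  process
+  begin
+    -- Wait for activation; the test avoids blocking if p_req is already high.
+    if p_req /= '1' then
+      wait until p_req = '1';
+    end if;
+    -- Full handshake on each active channel, in the order 0, 1, 2.
+    for i in 0 to 2 loop
+      a_req(i) <= '1';
+      wait until a_ack(i) = '1';
+      a_req(i) <= '0';
+      wait until a_ack(i) = '0';
+    end loop;
+    -- Only now complete the handshake on the passive port.
+    p_ack <= '1';
+    wait until p_req = '0';
+    p_ack <= '0';
+  end process;
+end behav;
+```
+
+A repeater could be modelled in the same style by replacing the for loop with
+an unbounded loop and never driving p_ack high.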
+
+8.4.2
+The 2-place FIFO
+
+Figure 8.7 shows the handshake circuit corresponding to the Tangram pro-
+gram in figure 8.3. The component labeled ‘psv’ in the handshake circuit of
+figure 8.7 is a so-called passivator. It relates to the internal channel c of the
+Fifo and implements the synchronization and communication between the ac-
+tive sender (c!x) and the active receiver (c?y).
+
+[Diagram: ports in and out, variables x and y, a passivator (psv), and two sequencers with outputs 0 and 1.]
+
+Figure 8.7.
+Compiled handshake circuit for the FIFO program.
+
+An optimization of the handshake circuit for Fifo is shown in figure 8.8.
+The synchronization in the datapath using a passivator has been replaced by a
+synchronization in the control using a ‘join’ component. One may observe that
+the datapath of this handshake circuit for the FIFO design is the same as that
+of the shift register, shown in figure 8.6. The only difference is in the control
+part of the circuits.
+
+
+
+[Diagram: ports in and out, variables x and y, and two sequencers with outputs 0 and 1.]
+
+Figure 8.8.
+Optimized handshake circuit for the FIFO program.
+
+8.4.3
+GCD using guarded repetition
+
+As a more complex example of syntax-directed compilation figure 8.9 shows
+the handshake circuit compiled from the Tangram program in figure 8.5. Com-
+pared with the previous handshake circuits, the handshake circuit for the GCD
+program introduces two new classes of components that are treated in more
+detail below.
+Firstly, the circuit contains a ‘bar’ and a ‘do’ component, both of which are
+data-dependent control components. Secondly, the handshake circuit contains
+components that do not directly correspond to language constructs, but rather
+implement sharing: the multiplexer (denoted by ‘mux’), the demultiplexer (de-
+noted by ‘dmx’), and the fork component (denoted by ‘�’).
+Warning: the Tangram fork is identical to the fork in figure 3.3 but the Tan-
+gram multiplexer and demultiplexer components are different. The Tangram
+multiplexer is identical to the merge in figure 3.3 and the Tangram demulti-
+plexer is a kind of “inverse merge.” Its output ports are passive and it requires
+the handshakes on the two outputs to be mutually exclusive.
+
+The ‘bar’ and the ‘do’ components:
+The do and bar components together
+implement the guarded command construct with two guards, in which the do
+component implements the iteration part (the do ... od part, including the evalu-
+ation of the disjunction of the two guards), and the bar component implements
+the choice part (the then ... or ... then part of the command).
+The do component, when activated through its passive port, first collects the
+disjunction of the value of all guards through a handshake on its active data
+port. When the value thus collected is true, it activates its active nonput port
+(to activate the selected command), and after completion starts a new evalua-
+tion cycle. When the value collected is false, the do component completes its
+operation by completing the handshake on the passive port.
+
+
+
+[Diagram: ports in1, in2 and out, variables x and y, two mux and two dmx components, a bar and a do component, and a sequencer with outputs 0, 1 and 2.]
+
+Figure 8.9.
+Compiled handshake circuit for the GCD program using guarded repetition.
+
+The bar component can be activated either through its passive data port, or
+through its passive control port. (The do component, for example, sequences
+these two activations.) When activated through the data port, it collects the
+value of two guards through a handshake on the active data ports, and then
+sends the disjunction of these values along the passive data port, thus complet-
+ing that handshake. When activated through the control port, the bar compo-
+nent activates an active control port of which the associated data port returned a
+‘true’ value in the most recent data cycle. (For simplicity, this selection is typ-
+ically implemented in a deterministic fashion, although this is not required at
+the level of the program.) One may observe that bar components can be com-
+bined in a tree or list to implement a guarded command list of arbitrary length.
+Furthermore, not every data cycle has to be followed by a control cycle.
+
+The ‘mux’, ‘dmx’, and ‘fork’ components:
+The program for GCD in
+figure 8.4 has two occurrences of variable x in which a value is written into x,
+namely input action in1?x and assignment x:=x-y. In the handshake circuit
+of figure 8.9, these two write actions for Tangram variable x are merged by the
+multiplexer component so as to arrive at the write port of handshake variable
+x.
+Variable x occurs at five different locations in the program as an expres-
+sion, once in the output expression out!x, twice in the guard expressions x<y
+and y<x, and twice in the assignment expressions x-y and y-x. These five in-
+spections of variable x could be implemented as five distinct read ports on the
+handshake variable x, which is shown in the handshake circuit in [135, Fig. 2.7,
+p.34]. In figure 8.9, a different compilation is shown, in which handshake vari-
+able x has three read ports:
+
+A read port dedicated to the occurrence in the output action.
+
+A read port dedicated to the guard expressions. Their evaluation is mu-
+tually inclusive, and hence can be combined using a synchronizing fork
+component.
+
+A read port dedicated to the assignment expressions. Their evaluation is
+mutually exclusive, and hence can be combined using a demultiplexer.
+
+The GCD example is discussed in further detail in chapter 13.
+
+8.5.
+Martin’s translation process
+
+The work of Martin and his group at Caltech has made fundamental contri-
+butions to asynchronous design and it has influenced the work of many other
+researchers. The methods have been used at Caltech to design several sig-
+nificant chips, most recently and most notably an asynchronous MIPS R3000
+processor [88]. As the following presentation of the design flow hints, the de-
+sign process is elaborate and sophisticated and is probably only an option to a
+person who has spent time with the Caltech group.
+The mostly manual design process involves the following steps (semantics-
+preserving transformations):
+(1) Process decomposition where each process is refined into a collection
+of interacting simpler processes. This step is repeated until all processes are
+simple enough to be dealt with in the next step in the process.
+(2) Handshake expansion where each communication channel is replaced
+by explicit wires and where each communication action (e.g. send or receive)
+is replaced by the signal transitions required by the protocol that is being used.
+For example a receive statement such as:
+
+C?y
+
+is replaced by a sequence of simpler statements – for example:
+
+[Creq]; y := data; Cack↑; [¬Creq]; Cack↓
+
+which is read as: “wait for request to go high”, “read the data”, “drive ac-
+knowledge high”, “wait for request to go low”, and “drive acknowledge low”.
+At this level it may be necessary to add state variables and/or to reshuffle
+signal transitions in order to obtain a specification that satisfies a condition
+similar to the CSC condition in chapter 6.
+(3) Production rule expansion where each handshaking expansion is re-
+placed by a set of production rules (or guarded commands), for example:
+
+a
+�b
+�� c
+�
+and
+�b
+�
+�c
+�� c
+�
+
+A production rule consists of a condition and an action, and the action is per-
+formed whenever the condition is true. As an aside we mention that the above
+two production rules express the same as the set and reset functions for the
+signal c on page 96. The production rules specify the behaviour of the internal
+signals and output signals of the process. The production rules are themselves
+simple concurrent processes and the guards must ensure that the signal tran-
+sitions occur in program order (i.e. that the semantics of the original CHP
+program are maintained). This may require strengthening the guards. Further-
+more, in order to obtain simpler circuit implementations, the guards may be
+modified and made symmetric.
+(4) Operator reduction where production rules are grouped into clusters and
+where each cluster is mapped onto a basic hardware component similar to a
+generalized C-element. The above two production rules would be mapped into
+the generalized C-element shown in figure 6.17 on page 100.
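+
+As an illustration of step (2), the handshake expansion of the receive
+statement C?y can be written as a VHDL process. This is a minimal sketch under
+stated assumptions, not part of Martin's design flow: the entity name
+recv_expansion, the port names c_req, c_ack and c_data, and the 8-bit data
+width are all hypothetical.
+
+```vhdl
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+-- Hypothetical model of the handshake expansion of "C?y" (4-phase, push channel).
+entity recv_expansion is
+  port ( c_req  : in  std_logic;
+         c_ack  : out std_logic := '0';
+         c_data : in  std_logic_vector(7 downto 0) );
+end recv_expansion;
+
+architecture behav of recv_expansion is
+begin
+  process
+    variable y : std_logic_vector(7 downto 0);
+  begin
+    -- Wait for request to go high (skip the wait if it is already high).
+    if c_req /= '1' then
+      wait until c_req = '1';
+    end if;
+    y := c_data;              -- read the data
+    c_ack <= '1';             -- drive acknowledge high
+    wait until c_req = '0';   -- wait for request to go low
+    c_ack <= '0';             -- drive acknowledge low
+  end process;
+end behav;
+```
+
+Each statement corresponds to one term of the expansion, in the same order as
+the reading given in step (2).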
+
+8.6.
+Using VHDL for asynchronous design
+
+8.6.1
+Introduction
+
+In this section we will introduce a couple of VHDL packages that provide
+the designer with primitives for synchronous message passing between pro-
+cesses – similar to the constructs found in the CSP-family of languages (send,
+receive and probe).
+The material was developed in an M.Sc. project and used in the design of a
+32-bit floating-point ALU using the IEEE floating-point number representation
+[110], and it has subsequently been used in a course on asynchronous circuit
+design. Others, including [95, 118, 149, 78], have developed related VHDL
+packages and approaches.
+The channel packages introduced in the following support only one type
+of channel, using a 32-bit 4-phase bundled-data push protocol. However, as
+VHDL allows the overloading of procedures and functions, it is straightfor-
+ward to define channels with arbitrary data types. All it takes is a little cut-and-
+paste editing. Providing support for protocols other than the 4-phase bundled-
+data push protocol will require more significant extensions to the packages.
+
+8.6.2
+VHDL versus CSP-type languages
+
+The previous sections introduced several CSP-like hardware description lan-
+guages for asynchronous design. The advantages of these languages are their
+support of concurrency and synchronous message passing, as well as a limited
+and well-defined set of language constructs that makes syntax-directed compi-
+lation a relatively simple task.
+Having said this there is nothing that prevents a designer from using one
+of the industry standard languages VHDL (or Verilog) for the design of asyn-
+chronous circuits. In fact some of the fundamental concepts in these languages
+– concurrent processes and signal events – are “nice fits” with the modeling
+and design of asynchronous circuits. To illustrate this figure 8.10 shows how
+the Tangram program from figure 8.2 could be expressed in plain VHDL. In
+addition to demonstrating the feasibility, the figure also highlights the limi-
+tations of VHDL when it comes to modeling asynchronous circuits: most of
+the code expresses low-level handshaking details, and this greatly clutters the
+description of the function of the circuit.
+VHDL obviously lacks built-in primitives for synchronous message passing
+on channels similar to those found in CSP-like languages. Another feature of
+the CSP family of languages that VHDL lacks is statement-level concurrency
+within a process. On the other hand there are also some advantages of using an
+industry standard hardware description language such as VHDL:
+
+It is well supported by existing CAD tool frameworks that provide sim-
+ulators, pre-designed modules, mixed-mode simulation, and tools for
+synthesis, layout and the back annotation of timing information.
+
+The same simulator and test bench can be used throughout the entire de-
+sign process from the first high-level specification to the final implemen-
+tation in some target technology (for example a standard cell layout).
+
+It is possible to perform mixed-mode simulations where some entities
+are modeled using behavioural specifications and others are implemented
+using the components of the target technology.
+
+
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+-- A type used in a port list has to be declared in a package.
+package shiftreg_types is
+  subtype T is std_logic_vector(7 downto 0);
+end package shiftreg_types;
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.shiftreg_types.all;
+
+entity ShiftReg is
+  port ( in_req   : in  std_logic;
+         in_ack   : out std_logic;
+         in_data  : in  T;
+         out_req  : out std_logic;
+         out_ack  : in  std_logic;
+         out_data : out T );
+end ShiftReg;
+
+architecture behav of ShiftReg is
+begin
+  process
+    variable x, y: T;
+  begin
+    loop
+      out_req <= '1';             -- out!y
+      out_data <= y;
+      wait until out_ack = '1';
+      out_req <= '0';
+      wait until out_ack = '0';
+      y := x;                     -- y := x
+      wait until in_req = '1';    -- in?x
+      x := in_data;
+      in_ack <= '1';
+      wait until in_req = '0';
+      in_ack <= '0';
+    end loop;
+  end process;
+end behav;
+
+Figure 8.10.
+VHDL description of the 2-place shift register FIFO stage from figure 8.2.
+
+Many real-world systems include both synchronous and asynchronous
+subsystems, and such hybrid systems can be modeled without any prob-
+lems in VHDL.
+
+8.6.3
+Channel communication and design flow
+
+The design flow presented in what follows is motivated by the advantages
+mentioned above. The goal is to augment VHDL with CSP-like channel com-
+munication primitives, i.e. the procedures send(<channel>, <variable>)
+and receive(<channel>,<variable>) and the function probe(<channel>).
+Another goal is to enable mixed-mode simulations where one end of a channel
+connects to an entity whose architecture body is a circuit implementation and
+the other end connects to an entity whose architecture body is a behavioural de-
+scription using the above communication primitives, figure 8.11(b). In this way
+
+
+
+[Diagram with three panels. (a) High-level model: Entity 1 executes
+Send(<channel>,<var>) and Entity 2 executes Receive(<channel>,<var>) on the
+shared channel. (b) Mixed-mode model: Entity 1 executes Send(<channel>,<var>);
+Entity 2 is a circuit with control (Req, Ack) and data (latches, comb. logic).
+(c) Low-level model: both entities are circuits with control and data parts.]
+
+Figure 8.11.
+The VHDL packages for channel communication support high-level, mixed-
+mode and gate-level/standard cell simulations.
+
+a manual top-down stepwise refinement design process is supported, where the
+same test bench is used throughout the entire design process from high-level
+specification to low-level circuit implementation, figure 8.11(a-c).
+In VHDL all communication between processes takes place via signals.
+Channels therefore have to be declared as signals, preferably one signal per
+channel. Since (for a push channel) the sender drives the request and data part
+of a channel, and the receiver drives the acknowledge part, there are two drivers
+to one signal. This is allowed in VHDL if the signal is a resolved signal. Thus,
+it is possible to define a channel type as a record with a request, an acknowl-
+edge and a data field, and then define a resolution function for the channel type
+which will determine the resulting value of the channel. This type of channel,
+with separate request and acknowledge fields, will be called a real channel and
+is described in section 8.6.5. In simulations there will be three traces for each
+channel, showing the waveforms of request and acknowledge along with the
+data that is communicated.
+A channel can also be defined with only two fields: one that describes the
+state of the handshaking (called the “handshake phase” or simply the “phase”)
+and one containing the data. The type of the phase field is an enumerated type,
+whose values can be the handshake phases a channel can assume, as well as
+the values with which the sender and receiver can drive the field. This type of
+channel will be called an abstract channel. In simulations there will be two
+traces for each channel, and it is easy to read the phases the channel assumes
+and the data values that are transferred.
+The procedures and definitions are organized into two VHDL-packages: one
+called “abstpack.vhd” that can be used for simulating high-level models and
+one called “realpack.vhd” that can be used at all levels of design. Full listings
+can be found in appendix 8.A at the end of this chapter. The design flow
+enabled by these packages is as follows:
+
+The circuit and its environment or test bench is first modelled and sim-
+ulated using abstract channels. All it takes is the following statement in
+the top level design unit: “use work.abstpack.all;”.
+
+The circuit is then partitioned into simpler entities. The entities still
+communicate using channels and the simulation still uses the abstract
+channel package. This step may be repeated.
+
+At some point the designer changes to using the real channel package
+by changing to: “use work.realpack.all;” in the top-level
+design unit. Apart from this simple change, the VHDL source code is
+identical.
+
+It is now possible to partition entities into control circuitry (that can be
+designed as explained in chapter 6) and data circuitry (that consists of or-
+dinary latches and combinational circuitry). Mixed mode simulations as
+illustrated in figure 8.11(b) are possible. Simulation models of the con-
+trol circuits may be their actual implementation in the target technology
+or simply an entity containing a set of concurrent signal assignments –
+for example the Boolean equations produced by Petrify.
+
+Eventually, when all entities have been partitioned into control and data,
+and when all leaf entities have been implemented using components of
+the target technology, the design is complete. Using standard technology
+mapping tools an implementation may be produced, and the circuit can
+be simulated with back annotated timing information.
+
+Note that the same simulation test bench can be used throughout the entire
+design process from the high-level specification to the low-level implementa-
+tion using components from the target technology.
+
+8.6.4
+The abstract channel package
+
+An abstract channel is defined in figure 8.12 with a data type called fp (a
+32-bit standard logic vector representing an IEEE floating-point number). The
+
+
+
+type handshake_phase is
+  (
+    u,      -- uninitialized
+    idle,   -- no communication
+    swait,  -- sender waiting
+    rwait,  -- receiver waiting
+    rcv,    -- receiving data
+    rec1,   -- recovery phase 1
+    rec2,   -- recovery phase 2
+    req,    -- request signal
+    ack,    -- acknowledge signal
+    error   -- protocol error
+  );
+
+subtype fp is std_logic_vector(31 downto 0);
+
+type uchannel_fp is
+  record
+    phase : handshake_phase;
+    data  : fp;
+  end record;
+
+type uchannel_fp_vector is array(natural range <>) of uchannel_fp;
+
+function resolved(s : uchannel_fp_vector) return uchannel_fp;
+
+subtype channel_fp is resolved uchannel_fp;
+
+Figure 8.12.
+Definition of an abstract channel.
+
+actual channel type is called channel_fp. It is necessary to define a channel
+for each data type used in the design. The data type can be an arbitrary type,
+including record types, but it is advisable to use data types that are built from
+std_logic because this is typically the type used by target component libraries
+(such as standard cell libraries) that are eventually used for the implementation.
+The meanings of the values of the type handshake_phase are described in
+detail below:
+
+u: Uninitialized channel. This is the default value of the drivers. As long as
+either the sender or the receiver drives the channel with this value, the channel
+stays uninitialized.
+
+idle: No communication. Both the sender and receiver drive the channel with
+the idle value.
+
+swait: The sender is waiting to perform a communication. The sender is driv-
+ing the channel with the req value and the receiver drives with the idle
+value.
+
+rwait: The receiver is waiting to perform a communication. The sender is
+driving the channel with the idle value and the receiver drives with the
+rwait value. This value is used both as a driving value and as a resulting
+value for a channel, just like the idle and u values.
+
+rcv: Data is transferred. The sender is driving the channel with the req value
+and the receiver drives it with the rwait value. After a predefined
+amount of time (tpd at the top of the package, see later in this section)
+the receiver changes its driving value to ack, and the channel changes
+its phase to rec1. In a simulation it is only possible to see the transferred
+value during the rcv phase and the swait phase. At all other times the
+data field assumes a predefined default data value.
+
+rec1: Recovery phase. This phase is not seen in a simulation, since the channel
+changes to the rec2 phase with no time delay.
+
+rec2: Recovery phase. This phase is not seen in a simulation, since the channel
+changes to the idle phase with no time delay.
+
+req: The sender drives the channel with this value, when it wants to perform
+a communication. A channel can never assume this value.
+
+ack: The receiver drives the channel with this value to acknowledge that the
+data has been received. A channel can never assume this value.
+
+error: Protocol error. A channel assumes this value when the resolution func-
+tion detects an error. It is an error if there is more than one driver with
+an rwait, req or ack value. This could be the result if more than two
+drivers are connected to a channel, or if a send command is accidentally
+used instead of a receive command or vice versa.
+
+Figure 8.13 shows a graphical illustration of the protocol of the abstract
+channel. The values in large letters are the resulting values of the channel, and
+the values in smaller letters below them are the driving values of the sender
+and receiver respectively. Both the sender and receiver are allowed to initiate
+a communication. This makes it possible in a simulation to see if either the
+
+[Diagram: state graph of the resolved channel values U, IDLE, SWAIT, RWAIT, RCV, REC1 and REC2, each annotated with the sender and receiver driving values (IDLE, REQ, RWAIT, ACK).]
+
+Figure 8.13.
+The protocol for the abstract channel. The values in large letters are the resulting
+resolved values of the channel, and the values in smaller letters below them are the driving
+values of the sender and receiver respectively.
+
+
+
+sender or receiver is waiting to communicate. It is the procedures send and
+receive that follow this protocol.
+Because channels with different data types are defined as separate types,
+the procedures send, receive and probe have to be defined for each of these
+channel types. Fortunately VHDL allows overloading of procedure names, so
+it is possible to make these definitions. The only differences between the def-
+initions of the channels are the data types, the names of the channel types and
+the default values of the data fields in the channels. So it is very easy to copy
+the definitions of one channel to make a new channel type. It is not necessary
+to redefine the type handshake_phase. All these definitions are conveniently
+collected in a VHDL package. This package can then be referenced wherever
+needed. An example of such a package with only one channel type can be
+seen in appendix A.1. The procedures initialize_in and initialize_out
+are used to initialize the input and output ends of a channel. If a sender or re-
+ceiver does not initialize a channel, no communications can take place on that
+channel.
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.abstract_channels.all;
+
+entity fp_latch is
+  generic (delay : time);
+  port ( d      : inout channel_fp;   -- input data channel
+         q      : inout channel_fp;   -- output data channel
+         resetn : in    std_logic
+       );
+end fp_latch;
+
+architecture behav of fp_latch is
+begin
+
+  process
+    variable data : fp;
+  begin
+    initialize_in(d);
+    initialize_out(q);
+    wait until resetn = '1';
+    loop
+      receive(d, data);
+      wait for delay;
+      send(q, data);
+    end loop;
+  end process;
+
+end behav;
+
+Figure 8.14.
+Description of a FIFO stage.
+
+
+
+[Diagram: three fp_latch instances, FIFO_stage_1, FIFO_stage_2 and FIFO_stage_3, connected in series, each with ports d, q and resetn; the channels ch_in and ch_out connect to the middle stage.]
+
+Figure 8.15.
+A FIFO built using the latch defined in figure 8.14.
+
+Figure 8.16.
+Simulation of the FIFO using the abstract channel package.
+
+A simple example of a subcircuit is the FIFO stage fp_latch shown in
+figure 8.14. Notice that the channels in the entity have the mode inout, and
+the FIFO stage waits for the reset signal resetn after the initialization. In that
+way it waits for other subcircuits which may actually use this reset signal for
+initialization.
+The FIFO stage uses a generic parameter delay. This delay is inserted for
+experimental reasons in order to show the different phases of the channels.
+Three FIFO stages are connected in a pipeline (figure 8.15) and fed with data
+values. The middle section has a delay that is twice as long as the other two
+stages. This will result in a blocked channel just before the slow FIFO stage
+and a starved channel just after the slow FIFO stage.
+The result of this experiment can be seen in figure 8.16. The simulator
+used is the Synopsys VSS. It is seen that ch_in is predominantly in the swait
+phase, which characterizes a blocked channel, and ch_out is predominantly in
+the rwait phase, which characterizes a starved channel.
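+
+The pipeline used in this experiment could be described structurally along the
+following lines. This is a sketch, not the authors' test bench: the wrapper
+entity fifo3, the outer port names src and snk, and the delay values 10 ns and
+20 ns are hypothetical, and the placement of ch_in and ch_out on either side of
+the slow middle stage is inferred from the description above. It relies on the
+fp_latch entity of figure 8.14 and on the channel package it uses.
+
+```vhdl
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.abstract_channels.all;   -- package name as used in figure 8.14
+
+entity fifo3 is                   -- hypothetical wrapper entity
+  port ( src    : inout channel_fp;   -- channel from the data source
+         snk    : inout channel_fp;   -- channel to the data sink
+         resetn : in    std_logic );
+end fifo3;
+
+architecture struct of fifo3 is
+
+  component fp_latch
+    generic (delay : time);
+    port ( d      : inout channel_fp;
+           q      : inout channel_fp;
+           resetn : in    std_logic );
+  end component;
+
+  signal ch_in  : channel_fp;   -- just before the slow stage (expected blocked)
+  signal ch_out : channel_fp;   -- just after the slow stage (expected starved)
+
+begin
+
+  FIFO_stage_1 : fp_latch
+    generic map (delay => 10 ns)
+    port map (d => src, q => ch_in, resetn => resetn);
+
+  FIFO_stage_2 : fp_latch
+    generic map (delay => 20 ns)   -- twice the delay of the other stages
+    port map (d => ch_in, q => ch_out, resetn => resetn);
+
+  FIFO_stage_3 : fp_latch
+    generic map (delay => 10 ns)
+    port map (d => ch_out, q => snk, resetn => resetn);
+
+end struct;
+```
+
+Because the channel type is resolved, each internal channel signal can be
+driven from both ends, which is why all channel ports have mode inout.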
+
+8.6.5
+The real channel package
+
+At some point in the design process it is time to separate communicating
+entities into control and data entities. This is supported by the real channel
+types, in which the request and acknowledge signals are separate std_logic
+signals – the type used by the target component models. The data type is the
+same as the abstract channel type, but the handshaking is modeled differently.
+A real channel type is defined in figure 8.17.
+
+subtype fp is std_logic_vector(31 downto 0);
+
+type uchannel_fp is
+  record
+    req  : std_logic;
+    ack  : std_logic;
+    data : fp;
+  end record;
+
+type uchannel_fp_vector is array(natural range <>) of uchannel_fp;
+
+function resolved(s : uchannel_fp_vector) return uchannel_fp;
+
+subtype channel_fp is resolved uchannel_fp;
+
+Figure 8.17.
+Definition of a real channel.
+
+All definitions relating to the real channels are collected in a package (sim-
+ilar to the abstract channel package) and use the same names for the channel
+types, procedures and functions. For this reason it is very simple to switch
+to simulating using real channels. All it takes is to change the name of the
+package in the use statements in the top level design entity. Alternatively, one
+can use the same name for both packages, in which case it is the last analyzed
+package that is used in simulations.
+An example of a real channel package with only one channel type can be
+seen in appendix A.2. This package defines a 32-bit standard logic 4-phase
+bundled-data push channel. The constant tpd in this package is the delay from
+a transition on the request or acknowledge signal to the response to this tran-
+sition. “Synopsys compiler directives” are inserted in several places in the
+package. This is because Synopsys needs to know the channel types and the
+resolution functions belonging to them when it generates an EDIF netlist to the
+floor planner, but not the procedures in the package.
+Figure 8.18 shows the result of repeating the simulation experiment from the
+previous section, this time using the real channel package. Notice the sequence
+of four-phase handshakes.
+Note that the data value on a channel is, at all times, whatever value the
+sender is driving onto the channel. An alternative would be to make the resolu-
+tion function put out the default data value outside the data-validity period, but
+this may cause the setup and hold times of the latches to be violated. The proce-
+dure send provides a broad data-validity scheme, which means that it can com-
+municate with receivers that require early, broad or late data-validity schemes
+on the channel. The procedure receive requires an early data-validity scheme,
+
+
+
+Figure 8.18.
+Simulation of the FIFO using the real channel package.
+
+which means that it can communicate with senders that provide early or broad
+data-validity schemes.
+The resolution functions for the real channels (and the abstract channels)
+can detect protocol errors. Examples of errors are more than one sender or
+receiver on a channel, and using a send command or a receive command at
+the wrong end of a channel. In such cases the channel assumes the X value on
+the request or acknowledge signals.
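+
+To make the error detection concrete, a resolution function for the real
+channel of figure 8.17 might look as follows. This is a minimal sketch and not
+the listing from the appendix: the package name real_channel_sketch is
+hypothetical, and it assumes that a driver leaves the fields it does not
+control at 'U', so that a non-'U' request identifies a sender and a non-'U'
+acknowledge identifies a receiver.
+
+```vhdl
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+package real_channel_sketch is    -- hypothetical package name
+  subtype fp is std_logic_vector(31 downto 0);
+  type uchannel_fp is
+    record
+      req  : std_logic;
+      ack  : std_logic;
+      data : fp;
+    end record;
+  type uchannel_fp_vector is array(natural range <>) of uchannel_fp;
+  function resolved(s : uchannel_fp_vector) return uchannel_fp;
+  subtype channel_fp is resolved uchannel_fp;
+end package real_channel_sketch;
+
+package body real_channel_sketch is
+
+  function resolved(s : uchannel_fp_vector) return uchannel_fp is
+    variable result      : uchannel_fp := ('U', 'U', (others => 'U'));
+    variable req_drivers : natural := 0;
+    variable ack_drivers : natural := 0;
+  begin
+    for i in s'range loop
+      if s(i).req /= 'U' then           -- assumption: this driver is a sender
+        req_drivers := req_drivers + 1;
+        result.req  := s(i).req;
+        result.data := s(i).data;       -- data always follows the sender
+      end if;
+      if s(i).ack /= 'U' then           -- assumption: this driver is a receiver
+        ack_drivers := ack_drivers + 1;
+        result.ack  := s(i).ack;
+      end if;
+    end loop;
+    -- More than one sender or receiver is a protocol error, flagged with 'X'.
+    if req_drivers > 1 then
+      result.req := 'X';
+    end if;
+    if ack_drivers > 1 then
+      result.ack := 'X';
+    end if;
+    return result;
+  end function resolved;
+
+end package body real_channel_sketch;
+```
+
+Using send at the receiving end of a channel shows up as two request drivers,
+so the same check covers both kinds of error mentioned above.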
+
+8.6.6
+Partitioning into control and data
+
+This section describes how to separate an entity into control and data enti-
+ties. This is possible when the real channel package is used but, as explained
+below, this partitioning has to follow certain guidelines.
+To illustrate how the partitioning is carried out, the FIFO stage in figure 8.14
+in the preceding section will be separated into a latch control circuit called
+latch_ctrl and a latch called std_logic_latch. The VHDL code is shown
+in figure 8.19, and figure 8.20 is a graphical illustration of the partitioning that
+includes the unresolved signals ud and uq as explained below.
+In VHDL a driver that drives a compound resolved signal has to drive all
+fields in the signal. Therefore a control circuit cannot drive only the acknowl-
+edge field in a channel. To overcome this problem a signal of the corresponding
+unresolved channel type has to be declared inside the partitioned entity. This
+is the function of the signals ud and uq of type uchannel_fp in figure 8.17.
+The control circuit then drives only the acknowledge field in this signal; this
+is allowed since the signal is unresolved. The rest of the fields remain unini-
+tialized. The unresolved signal then drives the channel; this is allowed since it
+drives all of the fields in the channel. The resolution function for the channel
+should ignore the uninitialized values that the channel is driven with. Compo-
+nents that use the send and receive procedures also drive those fields in the
+
+
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.real_channels.all;
+
+entity fp_latch is
+  port ( d      : inout channel_fp;   -- input data channel
+         q      : inout channel_fp;   -- output data channel
+         resetn : in    std_logic
+       );
+end fp_latch;
+
+architecture struct of fp_latch is
+
+  component latch_ctrl
+    port ( rin, aout, resetn : in  std_logic;
+           ain, rout, lt     : out std_logic
+         );
+  end component;
+
+  component std_logic_latch
+    generic (width : positive);
+    port ( lt : in  std_logic;
+           d  : in  std_logic_vector(width-1 downto 0);
+           q  : out std_logic_vector(width-1 downto 0)
+         );
+  end component;
+
+  signal lt     : std_logic;
+  signal ud, uq : uchannel_fp;
+
+begin
+
+  latch_ctrl1 : latch_ctrl
+    port map (d.req, q.ack, resetn, ud.ack, uq.req, lt);
+  std_logic_latch1 : std_logic_latch
+    generic map (width => 32)
+    port map (lt, d.data, uq.data);
+
+  d <= connect(ud);
+  q <= connect(uq);
+
+end struct;
+
+Figure8.19.
+Separation of the FIFO stage into an ordinary data latch and a latch control circuit.
+
+channel that they do not control with uninitialized values. For example, an out-
+put to a channel drives the acknowledge field in the channel with the U value.
+The fields in a channel that are used as inputs are connected directly from the
+channel to the circuits that have to read those fields.
+Notice in the description that the signals ud and uq do not drive d and q
+directly but through a function called connect. This function simply returns
+its parameter. It may seem unnecessary, but it has proved to be necessary when
+some of the subcircuits are described with a standard cell implementation. In
+a simulation a special “gate-level simulation engine” is used to simulate the
+standard cells [129]. During initialization it will set some of the signals to the
+value X instead of to the value U as it should. It has not been possible to get
+the channel resolution function to ignore these X values, because the gate-level
+simulation engine sets some of the values in the channel. By introducing the
+connect function, which is a behavioural description, the normal simulator
+takes over and evaluates the channel by means of the corresponding resolution
+function. It should be emphasized that it is a bug in the gate-level simulation
+engine that necessitates the addition of the connect function.
+
+Figure 8.20.
+Separation of control and data.
+(Diagram labels: FIFO_stage / fp_latch blocks between ch_in and ch_out, each
+containing latch_ctl with ports rin, rout, ain, aout, resetn and Lt, a std_logic_latch
+with ports d, q and Lt, and the unresolved signals ud and uq.)
+
+8.7.
+Summary
+
+This chapter addressed languages and CAD tools for high-level modeling
+and synthesis of asynchronous circuits. The text focused on a few representative
+and influential design methods that are based on languages that are similar
+to CSP. The reason for preferring these languages is that they support
+channel-based communication between processes (synchronous message passing)
+as well as concurrency at both process and statement level – two features that
+are important for modeling asynchronous circuits. The text also illustrated a
+synthesis method known as syntax-directed translation. Subsequent chapters
+in this book will elaborate much more on these issues.
+Finally the chapter illustrated how channel-based communication can be
+implemented in VHDL, and we provided two packages containing all the necessary
+procedures and functions, including send, receive and probe. These
+packages support a manual top-down stepwise-refinement design flow where
+the same test bench can be used to simulate the design throughout the entire
+design process from high-level specification to low-level circuit implementation.
+This chapter on languages and CAD-tools for asynchronous design concludes
+the tutorial on asynchronous circuit design and it is time to wrap up:
+Chapter 2 presented the fundamental concepts and theories, and provided point-
+ers to the literature. Chapters 3 and 4 presented an RTL-like abstract view
+on asynchronous circuits (tokens flowing in static data-flow structures) that is
+very useful for understanding their operation and performance. This material
+is probably where this tutorial supplements the existing body of literature the
+most. Chapters 5 and 6 addressed the design of datapath operators and con-
+trol circuits. Focus in chapter 6 was on speed-independent circuits, but this
+is not the only approach. In recent years there has also been great progress
+in synthesizing multiple-input-change fundamental-mode circuits. Chapter 7
+discussed more advanced 4-phase bundled-data protocols and circuits. Finally
+chapter 8 addressed languages and tools for high-level modeling and synthesis
+of asynchronous circuits.
+The tutorial deliberately made no attempt to cover all corners of the
+field – the aim was to pave a road into “the world of asynchronous design”.
+Now you are here at the end of the road; hopefully with enough background
+to carry on digging deeper into the literature, and equally importantly, with
+sufficient understanding of the characteristics of asynchronous circuits, that
+you can start designing your own circuits. And finally, asynchronous circuits
+do not represent an alternative to synchronous circuits. They have advantages
+in some areas and disadvantages in other areas and they should be seen as a
+supplement, and as such they add new dimensions to the solution space that
+the digital designer explores. Even today, many circuits cannot be categorized
+as either synchronous or asynchronous; they contain elements of both.
+The following chapters will introduce some recent industrial scale asyn-
+chronous chips. Additional designs are presented in [106].
+
+
+Appendix: The VHDL channel packages
+
+A.1.
+The abstract channel package
+
+-- Abstract channel package: (4-phase bundled-data push channel, 32-bit data)
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+package abstract_channels is
+
+  constant tpd : time := 2 ns;
+
+  -- Type definition for abstract handshake protocol
+
+  type handshake_phase is
+    (
+      u,      -- uninitialized
+      idle,   -- no communication
+      swait,  -- sender waiting
+      rwait,  -- receiver waiting
+      rcv,    -- receiving data
+      rec1,   -- recovery phase 1
+      rec2,   -- recovery phase 2
+      req,    -- request signal
+      ack,    -- acknowledge signal
+      error   -- protocol error
+    );
+
+  -- Floating point channel definitions
+
+  subtype fp is std_logic_vector(31 downto 0);
+
+  type uchannel_fp is
+    record
+      phase : handshake_phase;
+      data  : fp;
+    end record;
+
+  type uchannel_fp_vector is array(natural range <>) of
+    uchannel_fp;
+
+  function resolved(s : uchannel_fp_vector) return uchannel_fp;
+
+  subtype channel_fp is resolved uchannel_fp;
+
+  procedure initialize_in(signal ch : out channel_fp);
+
+  procedure initialize_out(signal ch : out channel_fp);
+
+  procedure send(signal ch : inout channel_fp; d : in fp);
+
+  procedure receive(signal ch : inout channel_fp; d : out fp);
+
+  function probe(signal ch : in channel_fp) return boolean;
+
+end abstract_channels;
+
+
+package body abstract_channels is
+
+  -- Resolution table for abstract handshake protocol
+
+  type table_type is array(handshake_phase, handshake_phase) of
+    handshake_phase;
+
+  constant resolution_table : table_type := (
+  --------------------------------------------------------------------------------
+  -- 2. parameter:                                                    |         |
+  --  u  idle   swait  rwait  rcv    rec1   rec2   req    ack    error | 1. par: |
+  --------------------------------------------------------------------------------
+    (u, u,     u,     u,     u,     u,     u,     u,     u,     u    ), --| u     |
+    (u, idle,  swait, rwait, rcv,   rec1,  rec2,  swait, rec2,  error), --| idle  |
+    (u, swait, error, rcv,   error, error, rec1,  error, rec1,  error), --| swait |
+    (u, rwait, rcv,   error, error, error, error, rcv,   error, error), --| rwait |
+    (u, rcv,   error, error, error, error, error, error, error, error), --| rcv   |
+    (u, rec1,  error, error, error, error, error, error, error, error), --| rec1  |
+    (u, rec2,  rec1,  error, error, error, error, rec1,  error, error), --| rec2  |
+    (u, error, error, error, error, error, error, error, error, error), --| req   |
+    (u, error, error, error, error, error, error, error, error, error), --| ack   |
+    (u, error, error, error, error, error, error, error, error, error));--| error |
+
+  -- Fp channel
+
+  constant default_data_fp : fp := "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX";
+
+  function resolved(s : uchannel_fp_vector) return uchannel_fp is
+    variable result : uchannel_fp := (idle, default_data_fp);
+  begin
+    for i in s'range loop
+      result.phase := resolution_table(result.phase, s(i).phase);
+      if (s(i).phase = req) or (s(i).phase = swait) or
+         (s(i).phase = rcv) then
+        result.data := s(i).data;
+      end if;
+    end loop;
+    if not((result.phase = swait) or (result.phase = rcv)) then
+      result.data := default_data_fp;
+    end if;
+    return result;
+  end resolved;
+
+  procedure initialize_in(signal ch : out channel_fp) is
+  begin
+    ch.phase <= idle after tpd;
+  end initialize_in;
+
+  procedure initialize_out(signal ch : out channel_fp) is
+  begin
+    ch.phase <= idle after tpd;
+  end initialize_out;
+
+  procedure send(signal ch : inout channel_fp; d : in fp) is
+  begin
+    if not((ch.phase = idle) or (ch.phase = rwait)) then
+      wait until (ch.phase = idle) or (ch.phase = rwait);
+    end if;
+    ch <= (req, d);
+    wait until ch.phase = rec1;
+    ch.phase <= idle;
+  end send;
+
+  procedure receive(signal ch : inout channel_fp; d : out fp) is
+  begin
+    if not((ch.phase = idle) or (ch.phase = swait)) then
+      wait until (ch.phase = idle) or (ch.phase = swait);
+    end if;
+    ch.phase <= rwait;
+    wait until ch.phase = rcv;
+    wait for tpd;
+    d := ch.data;
+    ch.phase <= ack;
+    wait until ch.phase = rec2;
+    ch.phase <= idle;
+  end receive;
+
+  function probe(signal ch : in channel_fp) return boolean is
+  begin
+    return (ch.phase = swait);
+  end probe;
+
+end abstract_channels;
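+
+To see how the abstract resolution table turns the values driven by a sender and a receiver into the phases of one transfer, the table can be replayed in a few lines of Python. This is only an illustrative model of the package above: the names PHASES, TABLE, resolved_phase and TRANSFER are assumptions of this sketch, and data values and the tpd delay are left out.
+

```python
PHASES = ["u", "idle", "swait", "rwait", "rcv",
          "rec1", "rec2", "req", "ack", "error"]
E = "error"

# Rows: running result (1. parameter); columns: next driver (2. parameter).
TABLE = {
    "u":     ["u"] * 10,
    "idle":  ["u", "idle", "swait", "rwait", "rcv", "rec1", "rec2", "swait", "rec2", E],
    "swait": ["u", "swait", E, "rcv", E, E, "rec1", E, "rec1", E],
    "rwait": ["u", "rwait", "rcv", E, E, E, E, "rcv", E, E],
    "rcv":   ["u", "rcv"] + [E] * 8,
    "rec1":  ["u", "rec1"] + [E] * 8,
    "rec2":  ["u", "rec2", "rec1", E, E, E, E, "rec1", E, E],
    "req":   ["u"] + [E] * 9,
    "ack":   ["u"] + [E] * 9,
    "error": ["u"] + [E] * 9,
}


def resolved_phase(drivers):
    """Fold the table over all drivers, starting from idle as the package does."""
    result = "idle"
    for phase in drivers:
        result = TABLE[result][PHASES.index(phase)]
    return result


# (sender, receiver) values driven during one send/receive pair.
TRANSFER = [("idle", "idle"), ("req", "idle"), ("req", "rwait"),
            ("req", "ack"), ("idle", "ack"), ("idle", "idle")]
trace = [resolved_phase(pair) for pair in TRANSFER]
```

+
+The trace visits idle, swait, rcv, rec1, rec2 and idle in turn, which are the phases that the send and receive procedures wait for. Two senders driving req at the same time resolve to error.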
+
+A.2.
+The real channel package
+
+-- Low-level channel package (4-phase bundled-data push channel, 32-bit data)
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+package real_channels is
+
+  -- synopsys synthesis_off
+  constant tpd : time := 2 ns;
+  -- synopsys synthesis_on
+
+  -- Floating point channel definitions
+
+  subtype fp is std_logic_vector(31 downto 0);
+
+  type uchannel_fp is
+    record
+      req  : std_logic;
+      ack  : std_logic;
+      data : fp;
+    end record;
+
+  type uchannel_fp_vector is array(natural range <>) of
+    uchannel_fp;
+
+  function resolved(s : uchannel_fp_vector) return uchannel_fp;
+
+  subtype channel_fp is resolved uchannel_fp;
+
+  -- synopsys synthesis_off
+  procedure initialize_in(signal ch : out channel_fp);
+
+  procedure initialize_out(signal ch : out channel_fp);
+
+  procedure send(signal ch : inout channel_fp; d : in fp);
+
+  procedure receive(signal ch : inout channel_fp; d : out fp);
+
+  function probe(signal ch : in uchannel_fp) return boolean;
+  -- synopsys synthesis_on
+
+  function connect(signal ch : in uchannel_fp) return channel_fp;
+
+end real_channels;
+
+package body real_channels is
+
+  -- Resolution table for 4-phase handshake protocol
+
+  -- synopsys synthesis_off
+  type stdlogic_table is array(std_logic, std_logic) of std_logic;
+
+  constant resolution_table : stdlogic_table := (
+  --  ----------------------------------------------------------------
+  --  | 2. parameter:                                    |         |
+  --  |  U    X    0    1    Z    W    L    H    -       | 1. par: |
+  --  ----------------------------------------------------------------
+      ( 'U', 'X', '0', '1', 'X', 'X', 'X', 'X', 'X' ),  -- |   U   |
+      ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   X   |
+      ( '0', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   0   |
+      ( '1', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   1   |
+      ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   Z   |
+      ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   W   |
+      ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   L   |
+      ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   H   |
+      ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' )); -- |   -   |
+  -- synopsys synthesis_on
+
+  -- Fp channel
+
+  -- synopsys synthesis_off
+  constant default_data_fp : fp := "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX";
+  -- synopsys synthesis_on
+
+  function resolved(s : uchannel_fp_vector) return uchannel_fp is
+    -- pragma resolution_method three_state
+    -- synopsys synthesis_off
+    variable result : uchannel_fp := ('U','U',default_data_fp);
+    -- synopsys synthesis_on
+  begin
+    -- synopsys synthesis_off
+    for i in s'range loop
+      result.req := resolution_table(result.req,s(i).req);
+      result.ack := resolution_table(result.ack,s(i).ack);
+      if (s(i).req = '1') or (s(i).req = '0') then
+        result.data := s(i).data;
+      end if;
+    end loop;
+    if not((result.req = '1') or (result.req = '0')) then
+      result.data := default_data_fp;
+    end if;
+    return result;
+    -- synopsys synthesis_on
+  end resolved;
+
+  -- synopsys synthesis_off
+  procedure initialize_in(signal ch : out channel_fp) is
+  begin
+    ch.ack <= '0' after tpd;
+  end initialize_in;
+
+  procedure initialize_out(signal ch : out channel_fp) is
+  begin
+    ch.req <= '0' after tpd;
+  end initialize_out;
+
+  procedure send(signal ch : inout channel_fp; d : in fp) is
+  begin
+    if ch.ack /= '0' then
+      wait until ch.ack = '0';
+    end if;
+    ch.req <= '1' after tpd;
+    ch.data <= d after tpd;
+    wait until ch.ack = '1';
+    ch.req <= '0' after tpd;
+  end send;
+
+  procedure receive(signal ch : inout channel_fp; d : out fp) is
+  begin
+    if ch.req /= '1' then
+      wait until ch.req = '1';
+    end if;
+    wait for tpd;
+    d := ch.data;
+    ch.ack <= '1';
+    wait until ch.req = '0';
+    ch.ack <= '0' after tpd;
+  end receive;
+
+  function probe(signal ch : in uchannel_fp) return boolean is
+  begin
+    return (ch.req = '1');
+  end probe;
+  -- synopsys synthesis_on
+
+  function connect(signal ch : in uchannel_fp) return channel_fp is
+  begin
+    return ch;
+  end connect;
+
+end real_channels;
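+
+The send and receive procedures of the real channel package follow a 4-phase bundled-data push handshake. As a rough illustration of the order of events, the Python sketch below replays that sequence with generators standing in for VHDL processes. The scheduler run, the dictionary ch and the process names sender and receiver are assumptions made for this sketch; the tpd delays and the U initial values of the VHDL code are left out.
+

```python
def send(ch, value):
    """Sender side of one handshake; each yield is a 'wait until'."""
    yield lambda: ch["ack"] == 0          # previous handshake has finished
    ch["req"], ch["data"] = 1, value      # raise request with valid data
    yield lambda: ch["ack"] == 1          # receiver has taken the data
    ch["req"] = 0                         # return to zero


def receive(ch, out):
    """Receiver side of one handshake."""
    yield lambda: ch["req"] == 1          # wait for a request
    out.append(ch["data"])                # data is valid while req is high
    ch["ack"] = 1
    yield lambda: ch["req"] == 0          # sender has withdrawn the request
    ch["ack"] = 0


def sender(ch, values):
    for value in values:
        yield from send(ch, value)


def receiver(ch, out, count):
    for _ in range(count):
        yield from receive(ch, out)


def run(processes):
    """Resume each process whenever the condition it waits for holds."""
    pending = []
    for proc in processes:
        try:
            pending.append((proc, next(proc)))
        except StopIteration:
            pass
    while pending:
        progressed, still_waiting = False, []
        for proc, condition in pending:
            if not condition():
                still_waiting.append((proc, condition))
                continue
            progressed = True
            try:
                still_waiting.append((proc, next(proc)))
            except StopIteration:
                pass
        if not progressed:
            raise RuntimeError("deadlock: no process can proceed")
        pending = still_waiting


ch = {"req": 0, "ack": 0, "data": None}
received = []
run([sender(ch, [5, 7]), receiver(ch, received, 2)])
```

+
+After the run both values have been transferred in order and both wires are back at 0, ready for the next handshake.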
+
+
+II
+
+BALSA - AN ASYNCHRONOUS HARDWARE
+SYNTHESIS SYSTEM
+
+Author: Doug Edwards, Andrew Bardsley
+Department of Computer Science
+The University of Manchester
+{doug,bardsley}@cs.man.ac.uk
+
+Abstract
+Balsa is a system for describing and synthesising asynchronous circuits based
+on syntax-directed compilation into communicating handshake circuits. In these
+chapters, the basic Balsa design flow is described and several simple circuit ex-
+amples are used to illustrate the Balsa language in an informal tutorial style. The
+section concludes with a walk-through of a major design exercise – a 4 channel
+DMA controller described entirely in Balsa.
+
+Keywords:
+asynchronous circuits, high-level synthesis
+
+
+
+Chapter 9
+
+AN INTRODUCTION TO BALSA
+
+9.1.
+Overview
+
+Balsa is both a framework for synthesising asynchronous hardware systems
+and a language for describing such systems. The approach adopted is that of
+syntax-directed compilation into communicating handshaking components and
+closely follows the Tangram system ([141, 135] and Chapter 13 on page 221)
+of Philips. The advantage of this approach is that the compilation is trans-
+parent: there is a one-to-one mapping between the language constructs in the
+specification and the intermediate handshake circuits that are produced. It is
+relatively easy for an experienced user to envisage the micro-architecture of the
+circuit that results from the original description. Incremental changes made at
+the language level result in predictable changes at the circuit implementation
+level. This is important if optimisations and design trade-offs are to be made
+easily and contrasts with synchronous VHDL synthesis in which small changes
+in the specification may make radical alterations to the resulting circuit.
+It is important to understand what Balsa offers the designer and what obli-
+gations are still placed upon the designer. The tight “edit description – syn-
+thesise – simulate – revise description” loop made possible by the fast com-
+pilation process makes it very easy for the design space of a system to be
+explored and prototypes rapidly evaluated. However, there is no substitute
+for creativity. Poor designs may be created as easily as elegant designs and
+some experience in designing asynchronous circuits is required before even a
+good designer of conventional clocked circuits will best be able to exploit the
+system. Be warned that although Balsa guarantees correct-by-construction cir-
+cuits, it does not guarantee correct systems. In particular, it is quite feasible,
+as in any asynchronous system, to describe an elegant circuit which will ex-
+hibit deadlock. Furthermore, post-layout simulation is still required in order
+to check that when the instantiated circuit has been placed and routed by con-
+ventional CAD tools, it meets basic timing requirements. On the other hand,
+a choice of implementation libraries is available allowing the designer to trade
+the greater process portability of a delay-insensitive implementation against,
+perhaps, smaller circuit area which may require a larger post-layout validation
+effort.
+Although Balsa has evolved from a research environment, it is not a toy sys-
+tem unsuited for large-scale designs; Balsa has been used to synthesise the 32
+channel DMA controller [11] for the Amulet3i asynchronous microprocessor
+macro-cell [48]. The controller has a complex specification and the resulting
+implementation occupies 2mm² on a 0.35µm 3-layer metal process. Balsa is at
+the time of writing being used to synthesise a complete Amulet core as part of
+the EU funded G3 smartcard project [46].
+As noted earlier, Balsa is very similar to Tangram. It is a less mature pack-
+age lacking some useful tools contained within the Tangram package such as
+the power performance analyser. However, Balsa is freely available whereas
+Tangram is not generally available outside Philips. As far as the expressiveness
+of the languages is concerned, Balsa adds powerful parameterisation using re-
+cursive expansion definition facilities whereas Tangram allows more flexibility
+in interacting with non delay-insensitive external interfaces. Balsa has delib-
+erately chosen not to add such features to ensure that its channels-only delay-
+insensitive model is not compromised.
+The reader should be aware that not all aspects of the Balsa language or its
+syntax are explored in the material that follows: a more detailed introduction
+is available in the Balsa User Guide available from [7]. The Balsa system is
+freely available from the same site. The system is still evolving: the description
+here refers to Balsa release 3.1.0.
+
+9.2.
+Basic concepts
+
+A circuit described in Balsa is compiled into a communicating network com-
+posed from a small (about 40) set of handshake components. The components
+are connected by channels over which atomic communications or handshakes
+take place. Channels may have datapaths associated with them (in which case
+a handshake involves the transfer of data), or may be purely control (in which
+case the handshake acts as a synchronisation or rendezvous point).
+Each channel connects exactly one passive port of a handshake component
+to one active port of another handshake component. An active port is a port
+which initiates a communication. A passive port responds (when it is ready) to
+the request from the active port by an acknowledge signal.
+Data channels may be push channels or pull channels. In a push channel,
+the direction of the data flow is from the active port to the passive port. This
+is similar to the communication style of micropipelines. Data validity is sig-
+nalled by request and released on acknowledge. In a pull channel, the direction
+of data flow is from the passive port to the active port. The active port requests
+a transfer, data validity is signalled by an acknowledge from the passive port.
+An example of a circuit composed from handshake components is shown in
+figure 9.1. Active ports are denoted by filled bubbles on a handshake compo-
+nent and passive ports are denoted by open bubbles.
+
+Figure 9.1.
+Two connected handshake components.
+(Diagram labels: a Transferrer (“→”) and a Case component (“@”, annotated "0;1" with
+outputs 0 and 1), drawn with individual request, acknowledge and bundled data wires.)
+
+Here, a Fetch component, or Transferrer (denoted by “→”), and a Case component
+(denoted by “@”) are connected by an internal data-bearing channel.
+Circuit action is activated by a request to the Transferrer which in turn issues
+a request to the environment on its active pull input port (on the left of the
+diagram). The environment supplies the demanded data indicating its validity
+by the acknowledgement signal. The Transferrer then presents a handshake re-
+quest and data to the Case component on its active push output port which the
+Case component receives on a passive port. Depending on the data value, the
+Case component issues a handshake to its environment on either the top right
+or bottom right port. Finally, when the acknowledgement is received by the
+Case component, an acknowledgement is returned along the original channel,
+terminating this handshake. The circuit is ready to operate once more.
+Data follows the direction of the request in this example and the acknowl-
+edgement to that request flows in the opposite direction. In this figure, indi-
+vidual physical request, acknowledgement and data wires are explicitly shown.
+Data is carried on separate wires from the signalling (it is “bundled” with the
+control) although this is not necessarily true for other data/signalling encoding
+schemes.
+The bundled-data scheme illustrated in figure 9.1 is not the only imple-
+mentation possible. Methodologies exist to implement channel connections
+with delay-insensitive signalling where timing relationships between individ-
+ual wires of an implemented channel do not affect the functionality of the cir-
+cuit. Handshake circuits can be implemented using these methodologies which
+are robust to naïve realisations, process variations and interconnect delay properties.
+Future releases of Balsa will include several alternative back-ends. A
+more detailed discussion of handshake protocols can be found in section 2.1
+on page 9 and section 7.1 on page 115.
+
+Normally, handshake circuit diagrams are not shown at the level of detail
+of figure 9.1, a channel usually being shown as a single arc with the direc-
+tion of data being denoted by an arrow head on the arc. Similarly, control
+only channels, comprising only request/acknowledge wires, are indicated by
+an arc without an arrowhead. The circuit complexity of handshake circuits is
+often low: for example, a Transferrer may be implemented using only wires.
+An example of a handshake circuit for a modulo-10 counter (see page 185) is
+shown in figure 9.2. The corresponding gate-level implementation is shown in
+figure 9.3.
+
+Figure 9.2.
+Handshake circuit of a modulo-10 counter.
+(Diagram labels: activate, aclk, count, count_reg, tmp, the comparison x /= 9, the
+increment x + 1, and 4-bit and 1-bit channels.)
+
+Figure 9.3.
+Gate-level circuit of a modulo-10 counter.
+(Diagram labels: control sequencing components (3 gates each), Compare (/= 9?),
+Incrementer, latch x4, latch, and the annotation “(no ack)”.)
+
+9.3.
+Tool set and design flow
+
+An overview of the Balsa design flow is shown in figure 9.4. Behavioural
+simulation is provided by LARD [38], a language developed within the Amulet
+group for modelling asynchronous systems. However, the target CAD sys-
+tem can also be used to perform more accurate simulations and to validate
+the design.
+Most of the Balsa tools are concerned with manipulating the
+Breeze handshake intermediate files produced by compiling Balsa descrip-
+tions. Breeze files can be used by back-end tools to provide implementations
+for Balsa descriptions, but also contain procedure and type definitions passed
+on from Balsa source files allowing Breeze to be used as the package descrip-
+tion format for Balsa.
+The Balsa system comprises the following collection of tools:
+
+balsa-c: the compiler for the Balsa language. The compiler produces
+Breeze from Balsa descriptions.
+
+balsa-netlist: produces a netlist, currently EDIF, Compass or Verilog,
+from a Breeze description, performing technology mapping and hand-
+shake expansion.
+
+breeze2ps: a tool which produces a PostScript file of the handshake cir-
+cuit graph.
+
+breeze2lard: a translator that converts a Breeze file to a LARD behavioural
+model.
+
+breeze-cost: a tool which gives an area cost estimate of the circuit.
+
+balsa-md: a tool for generating Makefiles for make(1).
+
+balsa-mgr: a graphical front-end to balsa-md with project management
+facilities.
+
+The interfaces between the Balsa and target CAD systems are handled by
+the following scripts:
+
+balsa-pv: uses powerview tools to produce an EDIF file from a top-level
+powerview schematic which incorporates Balsa generated circuits.
+
+balsa-xi: produces a Xilinx download file from an EDIF description of
+a compiled circuit.
+
+balsa-ihdl: an interface to the Cadence Verilog-XL environment.
+
+9.4.
+Getting started
+
+In this section, simple buffer circuits are described in Balsa introducing the
+basic elements of a Balsa description.
+
+Figure 9.4.
+Design flow.
+(Diagram legend: Balsa tools, non-Balsa tools, and file formats / data. Labels include
+Balsa, Breeze, balsa-c, balsa-netlist, breeze-cost, breeze2lard, balsa-md, balsa-mgr,
+balsa-li, balsa-lcd, balsa-pv, balsa-xi, balsa-ihdl, LARD, EDIF 2 0 0 netlist, Compass
+netlist, Verilog netlist, Powerview, Xilinx bitstream, Verilog-XL, Silicon Ensemble,
+Pearl, TimeMill, SDF, layout, cost estimate and simulation results.)
+
+9.4.1
+A single-place buffer
+
+This buffer circuit is the HDL equivalent of the “hello, world” program. Its
+Balsa description is:
+
+import [balsa.types.basic]
+-- a single line comment
+-- buffer1a: A single place buffer
+procedure buffer1 (input i : byte; output o : byte) is
+  variable x : byte
+begin
+  loop
+    i -> x    -- Input communication
+    ;         -- sequence the two communications
+    o <- x    -- Output communication
+  end
+end
+
+Commentary on the code
+
+This Balsa description builds a single-place buffer, 8-bits wide. The circuit
+requests a byte from the environment which, when ready, transfers the data
+to the register. The circuit signals to the environment on its output channel
+that data is available and the environment reads it when it chooses. This small
+program introduces:
+
+comments:
+Balsa supports both multi-line comments and single-line com-
+ments.
+
+modular compilation:
+Balsa supports modular compilation. The import
+statement in this example includes the definition of some standard data types
+such as byte, nibble, etc. The search path given in the import statement
+is a dot-separated directory path similar to that of Java (although multi-file
+packages are not implemented). The import statement may be used to include
+other precompiled Balsa programs thereby acting as a library mechanism. Any
+import statements must precede other declarations in the files.
+
+procedures:
+The procedure declaration introduces an object that looks sim-
+ilar to a procedure definition in a conventional programming language. A Balsa
+procedure is a process. The parameters of the procedure define the interface to
+the environment outside the circuit block. In this case, the module has an 8-bit
+input and an 8-bit output. The body of the procedure definition defines an al-
+gorithmic behaviour for the circuit; it also implies a structural implementation.
+In this example, a variable x (of type byte) is declared implying that an 8-bit
+wide storage element will appear in the synthesised circuit.
+The behaviour of the circuit is obvious from the code: 8-bit values are trans-
+ferred from the environment to the storage variable, x, and then sequentially
+output from the variable to the environment. This sequence of events is continually
+repeated (loop ... end).
+
+channel communication:
+The communication operators “->” and “<-” are
+channel assignments and imply a communication or handshake over the chan-
+nel. Because of the sequencing explicit in the description, the variable x will
+only accept a new value when it is ready; the value will only be passed out to
+the environment when requested. Note that the channel is always on the left-
+hand side of the operator and the corresponding variable or expression on the
+right-hand side.
+
+sequencing:
+The “;” operator separating the two assignments is not merely a
+syntactic statement separator, it explicitly denotes sequentiality. The contents
+of x are transferred to the output port after the input transfer has completed.
+Because a “;” connects two sequenced statements or blocks, it is an error to
+place a “;” after the last statement in a block.
+
+repetition:
+The loop ... end construct causes infinite repetition of the code
+contained within its body. Procedures without loop ... end are permitted and
+will terminate, allowing procedure calls to be sequenced if required.
+
+Compiling the circuit
+
+balsa-c buffer1a
+
+The compiler produces an output file buffer1a.breeze. This is a file in an in-
+termediate format which can be imported back into other Balsa source files
+(thereby providing a simple library mechanism). Breeze is a textual format file
+designed for ease of parsing and it is therefore somewhat opaque. A primitive
+graphical representation of the compiled circuit in terms of handshake compo-
+nents can be produced (as buffer1a.ps) by:
+
+breeze2ps buffer1a
+
+The synthesised circuit
+
+The resulting handshake circuit is shown in figure 9.5. This is not actually
+taken from the output of breeze2ps, but has been redrawn to make the diagram
+more readable. Although it is not necessary to understand the exact opera-
+tion of the compiled circuit, a knowledge of the structure is helpful to gain an
+understanding of how best to describe circuits which can be synthesised efficiently
+using Balsa. A brief description of the operation of the circuit is given
+below. The circuit has been annotated with the names of the various handshake
+components.
+
+Figure 9.5.
+Handshake circuit for a single-place buffer.
+(Diagram labels: activation port “➤”, Loop (“#”), Sequence (“;”), two Fetch components
+(“→”), Variable x, input i and output o.)
+
+The port at the top, denoted by “>”, is an activation port generating a hand-
+shake enclosing the behaviour of the circuit. It can be thought of as a reset
+signal which, when de-asserted, initiates the operation of the circuit. All com-
+piled Balsa programs contain an activation port.
+The activation port starts the operation of the Repeater (“#”) which initiates
+a handshake with the Sequencer. The Repeater corresponds directly to the
+loop ... end construct, and the Sequencer to the “;” operator. The Sequencer
+first issues a handshake to the left-hand Fetch component, causing data to be
+moved to the storage element in the Variable element. The Sequencer then
+handshakes with the right-hand Fetch component, causing data to be read from
+the Variable element. When these operations are complete, the Sequencer
+completes its handshake with the Repeater which starts the cycle again.
+
+9.4.2 Two-place buffers
+
+1st design
+
+Having built a single-place buffer, an obvious goal is a pipeline of single
+buffer stages. Initially consider a two-place buffer; there are a number of ways
+we might describe this. One choice is to define a circuit with two storage
+elements:
+
+-- buffer2a: Sequential 2-place buffer with assignment
+--           between variables
+import [balsa.types.basic]
+
+procedure buffer2 (input i : byte; output o : byte) is
+  variable x1, x2 : byte
+begin
+  loop
+    i -> x1;     -- Input communication
+    x2 := x1;    -- Implied communication
+    o <- x2      -- Output communication
+  end
+end
+
+In this example we explicitly introduce two storage elements, x1 and x2.
+The contents of the variable x1 are caused to be transferred to the variable x2
+by means of the assignment operator “:=”. However, transfer is still effected
+by means of a handshaking communication channel. This assignment operator
+is merely a way of concealing the channel for convenience.
+
+2nd design
+
+The implicit channel can be made explicit as shown in buffer2b.balsa:
+
+-- buffer2b: Sequential version with an explicit
+--           internal channel
+import [balsa.types.basic]
+
+procedure buffer2 (input i : byte; output o : byte) is
+  variable x1, x2 : byte
+  channel chan : byte
+begin
+  loop
+    i -> x1;                     -- Input communication
+    chan <- x1 || chan -> x2;    -- Transfer x1 to x2
+    o <- x2                      -- Output communication
+  end
+end
+
+The channel which, in the previous example, was concealed behind the use
+of the “:=” assignment operator, has been made explicit. The handshake circuit
+produced (after some simple optimisations) is identical to buffer2a. The “||”
+operator is explained in the next example.
+It is important to understand the significance of the operation of the circuits
+produced by buffer2a and buffer2b. Remember that “;” is more than a syntac-
+tic separator: it is an operator denoting sequence. Thus, first the input, i, is
+transferred to x1. When this operation is complete, x1 is transferred to x2 and
+finally the contents of x2 are written to the environment on port o. Only after
+this sequence of operations is complete can new data from the environment be
+read into x1 again.
+
+9.4.3 Parallel composition and module reuse
+
+The operation above is unnecessarily constrained: there is no reason why
+the circuit cannot be reading a new value into x1 at the same time that x2 is
+writing out its data to the environment. The program in buffer2c achieves this
+optimisation.
+
+-- buffer2c: a 2-place buffer using parallel composition
+import [buffer1a]
+
+procedure buffer2 (input i : byte; output o : byte) is
+channel c : byte
+begin
+buffer1 (i, c) ||
+buffer1 (c, o)
+end
+
+Commentary on the code
+
+In the program above, a 2-place buffer is composed from 2 single-place
+buffers. The output of the first buffer is connected to the input of the second
+buffer by their respective output and input ports. However, apart from com-
+munications across the common channel, the operation of the two buffers is
+independent.
+The deceptively simple program above illustrates a number of new features
+of the Balsa language:
+
+modular compilation:
+The buffer1a circuit is included by the import mech-
+anism described earlier. The circuit must have been compiled previously. The
+Makefile generation program balsa-md (see page 166) can be used to generate
+a Makefile which will automatically take care of such dependencies.
+
+connectivity by naming:
+The output of the first buffer is connected to the
+input of the second buffer because of the common channel name, c, in the
+parameter list in the instantiation of the buffers.
+
+parallel composition:
+The “||” operator specifies that the two units which it
+connects should operate in parallel. This does not mean that the two units may
+operate totally independently: in this example, the output of one buffer writes
+to the input of the other buffer, creating a point of synchronisation. Note also
+that the parallelism referred to is a temporal parallelism. The two buffers are
+physically connected in series.
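+
+The same pattern extends directly to longer pipelines. A minimal, untested
+sketch of a 3-place buffer follows, assuming buffer1a has already been
+compiled; the names buffer3, c1 and c2 are illustrative and do not come from
+the original text:
+
+```balsa
+-- buffer3: a 3-place buffer using parallel composition (illustrative sketch)
+import [buffer1a]
+
+procedure buffer3 (input i : byte; output o : byte) is
+  channel c1 : byte   -- links stage 1 to stage 2
+  channel c2 : byte   -- links stage 2 to stage 3
+begin
+  buffer1 (i, c1) ||
+  buffer1 (c1, c2) ||
+  buffer1 (c2, o)
+end
+```
+
+Each shared channel name creates one point of synchronisation between
+neighbouring stages, but every stage still has to be written out by hand.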
+
+9.4.4 Placing multiple structures
+
+If we wish to extend the number of places in the buffer, the previous tech-
+nique of explicitly enumerating every buffer becomes tedious. What is required
+is a means of parameterising the buffer length (though in any real hardware
+implementation the number of buffers cannot be variable and must be known
+beforehand). One technique, shown in buffer_n, is to use the for construct
+together with compile-time constants:
+
+-- buffer_n: an n-place parameterised buffer
+import [buffer1a]
+constant n = 8
+
+procedure buffer_n (input i : byte; output o : byte) is
+  array 1 .. n-1 of channel c : byte
+begin
+  buffer1 (i, c[1]) ||          -- First buffer
+  buffer1 (c[n-1], o) ||        -- Last buffer
+  for || i in 1 .. n-2 then     -- Buffer i
+    buffer1 (c[i], c[i+1])
+  end
+end
+
+Commentary on the Code
+
+constants:
+the value of an expression (of any type) may be bound to a name.
+The value of the expression is evaluated at compile time and the type of the
+name when used will be the same as the original expression in the constant
+declaration. Numbers can be given in decimal (starting with one of 1..9), hexa-
+decimal (0x prefix), octal (0 prefix) and binary (0b prefix).
+
+arrayed channels:
+procedure ports and locally declared channels may be
+arrayed. Each channel can be referred to by a numeric or enumerated index,
+but from the point of view of handshaking, each channel is distinct and no
+indexed channel has any relationship with any other such channel other than
+the name they share. Arraying is not part of a channel’s type.
+
+for loops:
+a for loop allows iteration over the instantiation of a subcircuit.
+The composition of the circuits may either be a parallel composition – as in the
+example above – or sequential. In the latter case, “;” should be substituted for
+“||” in the loop specifier. The iteration range of the loop must be resolvable at
+compile time.
+A more flexible approach uses parameterised procedures and is discussed
+later in chapter 11 on page 193.
+
+9.5. Ancillary Balsa tools
+
+9.5.1 Makefile generation
+
+Makefiles are commonly used in Unix by the utility make(1) to specify and
+control the processes by which complicated programs are compiled. Speci-
+fying the dependencies involved is often tedious and error prone. The Balsa
+system has a utility, balsa-md, to generate the Makefile for a given program
+automatically. The generated Makefile knows not only how to compile a Balsa
+module with multiple imports, but also how to generate and run test-harnesses
+for the simulation environment, LARD, used by Balsa. Balsa-mgr provides
+a convenient, intuitive, GUI front-end to balsa-md and considerably simpli-
+fies project management, in particular the handling of multiple test harnesses.
+However, since a textual description of any GUI is tedious, balsa-mgr will not
+be discussed further and only the facilities to which the underlying balsa-md
+provides a gateway will be described in the examples that follow. The interface
+presented by balsa-mgr is shown in figure 9.6.
+
+Figure 9.6.
+balsa-mgr IDE.
+
+9.5.2 Estimating area cost
+
+The area cost of a circuit may be estimated by executing the Makefile rule
+cost. For example, an extract of the output produced for the 2-place buffer is
+shown below:
+
+Part: buffer2
+(0 (component "$BrzFetch" (8) (10 2 9)))
+(0 (component "$BrzFetch" (8) (8 6 7)))
+(0 (component "$BrzFetch" (8) (5 4 3)))
+(20.75 (component "$BrzLoop" () (1 11)))
+(99.0 (component "$BrzSequence" (3) (11 (10 8 5))))
+(198.0 (component "$BrzVariable" (8 1 "x1[0..7]") (9 (6))))
+(198.0 (component "$BrzVariable" (8 1 "x2[0..7]") (7 (4))))
+
+Total cost: 515.75
+
+The exact format of the report produced is somewhat obscure. Each line
+corresponds to a handshake component. Its area cost is the first number on the
+line. The parameters after the component name correspond to the width of the
+various channels of that component and the internal channel names. The area
+reported is proportional to the cost of implementing the circuit in a particular
+silicon process and is of most use in comparing different circuit descriptions.
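+
+As a quick check of the extract above (a worked sum, not tool output), the
+per-component costs add up to the reported total; the three Fetch components
+each contribute a cost of 0:
+
+```latex
+\[
+  3 \times 0 + 20.75 + 99.0 + 198.0 + 198.0 = 515.75
+\]
+```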
+
+9.5.3 Viewing the handshake circuit graph
+
+A PostScript view of the handshake circuit graph can be produced by running
+the rule make ps. A (flattened) view of the handshake circuit graph for
+the example buffer2c is shown in figure 9.7.
+The two single-place buffers from which the circuit is composed are recog-
+nisable in the circuit. Apart from minor differences in the labelling of the
+handshake component symbols, the circuit is identical to that shown in fig-
+ure 8.6 discussed in section 8.4 on page 128 and the same optimisations have
+been (automatically) applied.
+
+9.5.4 Simulation
+
+Ignoring the various simulation possibilities available once the design has
+been converted to a silicon layout, there are three strategies for evaluating and
+simulating the design from Balsa:
+
+1 Default LARD test harness.
+
+The command make sim will generate a LARD test harness and run it.
+The test harness reads data from a file for each input port of the module
+under test. Data sent to output channels appears on the standard output.
+This method needs no knowledge of LARD at all.
+
+2 Balsa test harness.
+
+If a more sophisticated test sequence is required, Balsa is a sufficiently
+flexible language in its own right to be able to specify most test se-
+quences. A default LARD test harness can then be generated for the
+Balsa test harness. Again no detailed knowledge of LARD is required.
+
+3 Custom LARD test harness.
+
+For some applications, it may be necessary to write a custom test harness
+in LARD. The Makefile-generated test harness may be used as a template.
+
+Figure 9.7. Flattened view of buffer2c.
+
+The default test harness exercises the target Balsa block by repeatedly hand-
+shaking on all external channels; input data channels receive the value 0 on
+each handshake, although it is possible to associate an input channel with a
+data file.
+
+Simulating buffer2c
+
+A simulation can be generated by invoking the appropriate simulation rule
+from the Makefile, producing the following output:
+
+0: chan ‘i’: writing 0
+6: chan ‘i’: writing 0
+15: chan ‘o’: reading 0
+19: chan ‘i’: writing 0
+28: chan ‘o’: reading 0
+32: chan ‘i’: writing 0
+41: chan ‘o’: reading 0
+45: chan ‘i’: writing 0
+54: chan ‘o’: reading 0
+58: chan ‘i’: writing 0
+67: chan ‘o’: reading 0
+71: chan ‘i’: writing 0
+80: chan ‘o’: reading 0
+
+The simulation runs forever unless terminated (by Ctrl-C). The numbers
+reported on the left hand side of each channel activity line are simulation times.
+LARD uses a unit delay model so these values should be treated with caution.
+
+Simulation data file
+
+This particular simulation stimulus is not very informative. A better strategy
+is to arrange for the data on the input channel i to be externally defined. In
+the next example, a file contains the following set of test data (in a variety of
+number representations):
+
+1
+0x10
+022
+0b011101
+5
+
+The Makefile can be forced to generate a rule for running a simulation from
+this stimulus file. If the simulation is now run, the following output is
+produced:
+
+3: chan ‘i’: writing 1
+15: chan ‘o’: reading 1
+16: chan ‘i’: writing 16
+28: chan ‘o’: reading 16
+29: chan ‘i’: writing 18
+41: chan ‘o’: reading 18
+42: chan ‘i’: writing 29
+54: chan ‘o’: reading 29
+55: chan ‘i’: writing 5
+67: chan ‘o’: reading 5
+Program terminated
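+
+The values written match the literals in the stimulus file once each prefix is
+taken into account. As a sketch, the same literals could be bound to constants
+in Balsa source (the names stim_hex, stim_oct and stim_bin are hypothetical):
+
+```balsa
+constant stim_hex = 0x10       -- hexadecimal, value 16
+constant stim_oct = 022        -- octal (leading 0), value 18
+constant stim_bin = 0b011101   -- binary, value 29
+```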
+
+Channel viewer
+
+In the previous examples, the output of the simulation is textual, appearing
+on the standard output.
+LARD has a graphical interface which displays the
+handshakes and data values associated with the internal and external channels.
+Assuming the building of a test harness rule has been specified to balsa-md,
+the channel viewer can be invoked causing two windows to appear on the
+screen: the LARD interpreter control window and the channel viewer window
+itself.
+Starting the simulation will cause a trace of the various channels in the de-
+sign to appear in the channel view window. For each channel the request and
+acknowledge signals and data values are displayed.
+
+Figure 9.8.
+Channel viewer window.
+
+
+
+Chapter 10
+
+THE BALSA LANGUAGE
+
+In this chapter, a tutorial overview of the language is given together with
+several small designs which illustrate various aspects of the language.
+
+10.1. Data types
+
+Balsa is strongly typed with data types based on bit vectors. The results
+of expressions must be guaranteed to fit within the range of the underlying bit
+vector representation. Types are either anonymous or named. Type equivalence
+for anonymous types is checked on the basis of the size and properties of the
+type, whereas type equivalence for named types is checked against the point of
+declaration.
+There are two classes of anonymous types: numeric types which are de-
+clared with the bits keyword, and arrays of other types. Numeric types can
+be either signed or unsigned. Signedness has an effect on expression operators
+and casting. Only numeric types and arrays of other types may be used without
+first binding a name to those types. Balsa has three separate name spaces: one
+for procedure and function names, a second for variable and channel names
+and a third for type declarations.
+
+Numeric types
+
+Numeric types support the number range $[0, 2^n - 1]$ for n-bit unsigned
+numbers or $[-2^{n-1}, 2^{n-1} - 1]$ for n-bit signed numbers. Named numeric
+types are just aliases of the same range. An example of a numeric type
+declaration is:
+
+type word is 16 bits
+
+This defines a new type word which is unsigned (there is no unsigned
+keyword) covering the range $[0, 2^{16} - 1]$. Alternatively, a signed type
+could have been declared as:
+
+type sword is 16 signed bits
+
+which defines a new type sword covering the range $[-2^{15}, 2^{15} - 1]$.
+
+The only predefined type is bit. However, the standard Balsa distribution
+comes with a set of library declarations for such types as byte, nibble,
+boolean and cardinal as well as the constants true and false.
+
+Enumerated types
+
+Enumerated types consist of named numeric values. The named values are
+given values starting at zero and incrementing by one from left to right. Ele-
+ments with explicit values reset the counter and many names can be given to
+the same value, for example:
+
+type Colour is enumeration
+Black, Brown, Red, Orange, Yellow, Green, Blue,
+Violet, Purple=Violet, Grey, Gray=Grey, White
+end
+
+The value of the Violet element of Colour is 7, as is Purple. Both Grey
+and Gray have value 8. The total number of elements is 10. An enumeration
+can be padded to a fixed size by use of the over keyword:
+
+type SillyExample is enumeration
+e1=1, e2 over 4 bits
+end
+
+Here 2 bits are sufficient to specify the 3 possible values of the enumeration
+(0 is not bound to a name, e1 has the value 1 and e2 has the value 2). The over
+keyword ensures that the representation of the enumerated type is actually 4
+bits. Enumerated types must be bound to names by a type declaration before
+use.
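+
+As a further illustration of the rule that an explicit value resets the
+counter, consider the following untested sketch (the type and element names
+are hypothetical):
+
+```balsa
+type Mode is enumeration
+  Idle, Load=4, Store, Jump   -- Idle = 0, Load = 4, Store = 5, Jump = 6
+end
+```
+
+Three bits are sufficient here, since the largest value is 6.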
+
+Constants
+
+Constant values can be defined in terms of an expression resolvable at compile
+time. Constants may be declared in terms of a predefined type; otherwise
+they default to a numeric type. Examples are:
+
+constant minx = 5
+constant maxx = minx + 10
+constant hue = Red : Colour
+constant colour = Colour'Green    -- explicit enumeration element
+
+Record types
+
+Records are bit-wise compositions of named elements of possibly different
+(pre-declared) types with the first element occupying the least significant bit
+positions, e.g.:
+
+type Resistor is record
+  FirstBand, SecondBand, Multiplier : Colour;
+  Tolerance : ToleranceColour
+end
+
+Resistor has four elements: FirstBand, SecondBand, Multiplier of
+type Colour and Tolerance of type ToleranceColour (both types must have
+been declared previously). FirstBand is the first element and so represents the
+least significant portion of the bit-wise value of a type Resistor. Selection of
+elements within the record structure is accomplished with the usual dot
+notation. Thus if R15 is a variable of type Resistor, the value of its
+SecondBand can be extracted by R15.SecondBand. As with enumerations, record
+types can be padded using the over notation.
+
+Array types
+
+Arrays are numerically indexed compositions of same-typed values. An
+example of the declaration of an array type is:
+
+type RegBank_t is array 0..7 of byte
+
+This introduces a new type RegBank_t which is an array type of 8 elements
+indexed across the range [0, 7], each element being of type byte. The ordering
+of the range specifier is irrelevant; array 0..7 is equivalent to array 7..0.
+In general a single expression, expr, can be used to specify the array size: this
+is equivalent to a range of 0..expr-1. Anonymous array types are allowed in
+Balsa, so that variables can be declared as an array without first defining the
+array type:
+
+variable RegBank : array 0..7 of byte
+
+Arbitrary ranges of elements within an array can be accessed by an array
+slicing mechanism e.g. a[5..7] extracts elements a5, a6, and a7. As with all
+range specifiers, the ordering of the range is irrelevant. In general Balsa packs
+all composite typed structures in a least significant to most significant, left to
+right manner. Array slices always return values which are based at index 0.
+Arrays can be constructed by a tupling mechanism or by concatenation of
+other arrays of the same base type:
+
+variable a, b, c, d, e, f : byte
+variable z2 : array 2 of byte
+variable z4 : array 4 of byte
+variable z6 : array 6 of byte
+
+z4 := {a, b, c, d}             -- array construction
+z6 := z4 @ {e, f}              -- array concatenation
+z2 := (z4 @ {e, f}) [3..4]     -- element extraction by array slicing
+
+In the last example, the first element of z2 is set to d and the second element
+is set to e. Array slicing is useful for extracting arbitrary bitfields from other
+datatypes.
+
+Arrayed channels
+
+Channels may be arrayed, that is they may consist of several distinct chan-
+nels which can be referred to by a numeric or enumerated index. This is similar
+to the way in which variables can have an array type except that each channel
+is distinct for the purposes of handshaking and each indexed channel has no
+relationship to the other channels in the array other than the single name they
+share. The syntax for arrayed channels is different from that of array typed
+variables making it easier to disambiguate arrays from arrayed channels. As
+an example:
+
+array 4 of channel XYZ : array 4 of byte
+
+declares 4 channels, XYZ[0] to XYZ[3]; each channel carries the 32-bit wide
+type array 0..3 of byte. An example of the use of arrayed channels is shown
+in section 9.4.4 on page 165.
+
+10.2. Data typing issues
+
+As stated previously, Balsa is strongly typed: the left and right sides of as-
+signments are expected to have the same type. The only form of implicit type-
+casting is the promotion of numeric literals and constants to a wider numeric
+type. In particular, care must be taken to ensure that the result of an arithmetic
+operation will always be compatible with the declared result type. Consider
+the assignment statement x := x + 1. This is not a valid Balsa statement be-
+cause potentially the result is one bit wider than the width of the variable x. If
+the carry-out from the addition is to be ignored, the user must explicitly force
+the truncation by means of a cast.
+
+Casts
+
+If the variable x was declared as 32 bits, the correct form of the assignment
+above is:
+
+x := (x + 1 as 32 bits)
+
+The keyword as indicates the cast operation. The parentheses are a neces-
+sary part of the syntax. If the carry out of the addition of two 32-bit numbers
+is required, a record type can be used to hold the composite result:
+
+type AddResult is record
+  Result : 32 bits;
+  Carry : bit;
+end
+variable r : AddResult
+
+r := (a + b as AddResult)
+
+The expression r.Carry accesses the required carry bit, r.Result yields
+the 32-bit addition result.
+Casts are required when extracting bit fields. Here is an example from the
+instruction decoder of a simple microprocessor. The bottom 5 bits of the 16-bit
+instruction word contain a 5-bit signed immediate. It is required to extract the
+immediate field and sign-extend it to 16 bits:
+
+type Word is 16 signed bits
+type Imm5 is 5 signed bits
+
+variable Instr : 16 bits -- bottom 5 bits contain an immediate
+variable Imm16 : Word
+
+Imm16 := (((Instr as array 16 of bit) [0..4] as Imm5) as Word)
+
+First, the instruction word, Instr, is cast into an array of bits from which
+an arbitrary sub-range can be extracted:
+
+(Instr as array 16 of bit)
+
+Next the bottom (least significant) 5 bits must be extracted:
+
+(Instr as array 16 of bit) [0..4]
+
+The extracted 5 bits must now be cast back into a 5-bit signed number:
+
+((Instr as array 16 of bit) [0..4] as Imm5)
+
+The 5-bit signed number is then sign extended to the 16-bit immediate
+value:
+
+(((Instr as array 16 of bit) [0..4] as Imm5) as Word)
+
+The double cast is required because a straightforward cast from 5 bits to
+the variable Imm16 of type Word would have merely zero filled the topmost bit
+positions even though Word is a signed type. However, a cast from a signed
+numeric type to another (wider) signed numeric type will sign extend the nar-
+rower value into the width of the wider target type.
+Extracting bits from a field is a fairly common operation in many hardware
+designs. In general, the original datatype has to be cast into an array first before
+bitfield extraction. The smash operator “#” provides a convenient shorthand for
+casting an object into an array of bits. Thus the sign extension example above
+is more simply written:
+
+((#Instr [0..4] as Imm5) as Word)
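+
+The same idiom extracts unsigned fields, where no sign extension is wanted. An
+untested sketch, assuming (hypothetically) that the top 3 bits of the
+instruction word hold an opcode; the name Opcode and the field position are
+not from the original example:
+
+```balsa
+variable Instr : 16 bits
+variable Opcode : 3 bits
+
+-- bits 13 to 15 of Instr, cast to an unsigned 3-bit value
+Opcode := (#Instr [13..15] as 3 bits)
+```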
+
+Table 10.1. Balsa commands.
+
+Command                         Notes
+sync                            control only (dataless) handshake
+<-                              handshake data transfer from an expression to an output port
+->                              handshake data transfer to a variable from an input port
+:=                              assigns a value to a variable
+;                               sequence operator
+||                              parallel composition operator
+continue                        a no-op
+halt                            causes deadlock (useful in simulation)
+loop ... end                    repeat forever
+while ... else ... end          conditional loop
+for ... end                     structural (not temporal) iteration
+if ... then ... else ... end    conditional execution, may have multiple guarded commands
+case ... end                    conditional execution based on constant expressions
+select                          non-arbitrated choice operator
+arbitrate                       arbitrated choice operator
+print                           compile time printing of diagnostics
+
+Auto-assignment
+
+Statements of the form:
+
+x := f(x)
+
+are allowed in Balsa. However, the implementation generates an auxiliary vari-
+able which is then assigned back to the variable visible to the programmer –
+the variable is enclosed within a single handshake and cannot be read from
+and written to simultaneously. Since auto-assignment generates twice as many
+variables as might be suspected, it is probably better practice to avoid the auto-
+assignment, explicitly introduce the extra variable and then rewrite the program
+to hide the sequential update thereby avoiding any time penalty. An example
+of this approach is given in count10b on page 184.
+
+10.3. Control flow and commands
+
+Balsa’s command set is listed in table 10.1.
+
+Dataless handshakes
+
+sync ⟨Channel⟩ – awaits a handshake on the named channel. Circuit action
+does not proceed until the handshake is completed.
+
+Channel communications
+
+Data can be transferred between a variable and a channel, between channels
+or from a channel to a command code block as shown below:
+
+⟨Channel⟩ <- ⟨Variable⟩ – transfers data from a variable to the named channel.
+This may either be an internal channel local to a procedure or an output port
+listed in the procedure declaration.
+
+⟨Channel⟩ -> ⟨Variable⟩ – transfers data from the named channel to a
+variable. The channel may either be an internal channel local to a procedure or
+an input port listed in the procedure declaration.
+
+⟨Channel1⟩ -> ⟨Channel2⟩ – transfers data between channels.
+
+⟨Channel⟩ -> then ⟨Command⟩ – allows the data to be accessed throughout the
+command block. However, the handshake on the channel is not completed, and
+thus the data is not released, until the command block itself has terminated.
+
+Variable assignment
+
+⟨Variable⟩ := ⟨Expression⟩ – transfers the result of an expression into a
+variable. The result type of the expression and that of the variable must agree.
+
+Sequential composition
+
+⟨Command1⟩ ; ⟨Command2⟩ – the two commands execute sequentially. The first
+must terminate before the second commences.
+
+Parallel composition
+
+⟨Command1⟩ || ⟨Command2⟩ – composes two commands such that they operate
+concurrently and independently. Both commands must complete before
+the circuit action proceeds. Beware of inadvertently introducing dependencies
+between the two commands so that neither can proceed until the other has
+completed. The “||” operator binds tighter than “;”. If that is not what is
+intended, then commands may be grouped in blocks as shown below:
+
+[ ⟨Command1⟩ ; ⟨Command2⟩ ] || ⟨Command3⟩
+
+Note the use of square brackets to group commands rather than parentheses.
+Alternatively, the keywords begin ... end may be used and are mandatory if
+variables local to a block are to be declared.
+
+Continue and halt commands
+
+continue is effectively a no-op.
+The command halt causes a process
+thread to deadlock.
+
+Looping constructs
+
+The loop command causes an infinite repetition of a block of code. Finite
+loops may be constructed using the while construct. The simplest example of
+its use is:
+
+while ⟨Condition⟩ then ⟨Command⟩ end
+
+However, multiple guards are allowed, so a more general form of the con-
+struct is:
+
+while
+  ⟨Condition1⟩ then ⟨Command1⟩
+| ⟨Condition2⟩ then ⟨Command2⟩
+| ⟨Condition3⟩ then ⟨Command3⟩
+else
+  ⟨Command4⟩
+end
+
+The ability of the while construct to take an else clause is a minor conve-
+nience. The code sequence above could have been written, with only a small
+difference in the resultant handshake circuit implementation, as:
+
+while
+  ⟨Condition1⟩ then ⟨Command1⟩
+| ⟨Condition2⟩ then ⟨Command2⟩
+| ⟨Condition3⟩ then ⟨Command3⟩
+end;
+⟨Command4⟩
+
+If more than one guard is satisfied, the particular command that is executed
+is unspecified.
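+
+To make the single-guard form concrete, here is a small untested sketch of a
+finite loop. The procedure name pulses and the ports count and tick are
+hypothetical: for each value read from count, the circuit handshakes that many
+times on tick.
+
+```balsa
+-- pulses: read n from count, then perform n dataless handshakes on tick
+procedure pulses (input count : 4 bits; sync tick) is
+  variable n : 4 bits
+begin
+  loop
+    count -> n;
+    while n /= 0 then
+      sync tick;
+      n := (n - 1 as 4 bits)   -- cast discards the extra result bit
+    end
+  end
+end
+```
+
+The cast is needed because subtraction yields a result wider than its
+arguments, and the update of n is an auto-assignment of the kind described in
+section 10.2.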
+
+Structural iteration
+
+Balsa has a structural looping construct. In many programming languages
+it is a matter of convenience or style as to whether a loop is written in terms of
+a for loop or a while loop. This is not so in Balsa. The for loop is similar
+to VHDL’s for ... generate command and is used for iteratively laying out
+repetitive structures. An example of its use was given earlier in section 9.4.4 on
+page 165. An illustration of the inappropriate use of the for command is given
+in the example count10e on page 189. Structures may be iteratively instantiated
+to operate either sequentially or concurrently with one another depending on
+whether for ; or for || is employed.
+
+Conditional execution
+
+Balsa has two constructs to achieve conditional execution. Balsa’s case
+statement is similar to that found in conventional programming languages. A
+single guard may match more than one value of the guard expression.
+
+case x+y of
+  1 .. 4, 11 then o <- x
+| 5 .. 10 then o <- y
+else o <- z
+end
+
+An if ... then ... else statement allows conditional execution based on
+the evaluation of expressions at run-time. Its syntax is similar to that of the
+while loop. Note the sequencing implicit in nested if statements, such as that
+shown below:
+
+if ⟨Condition1⟩ then
+  ⟨Command1⟩
+else
+  if ⟨Condition2⟩ then
+    ⟨Command2⟩
+  end
+end
+
+The test for Condition2 is made after the test for Condition1. If it is
+known that the two conditions are mutually exclusive, the expression may be
+written:
+
+if ⟨Condition1⟩ then ⟨Command1⟩
+| ⟨Condition2⟩ then ⟨Command2⟩
+end
+
+The “|” separator causes Condition1 and Condition2 to be evaluated in
+parallel. The result is undefined if more than one guard (condition) is satisfied.
+
+10.4. Binary/unary operators
+
+Balsa’s binary/unary operators are shown in order of decreasing precedence
+in table 10.2.
+
+10.5. Program structure
+
+File structure
+
+A typical design will consist of several files containing procedure/type/constant
+declarations which come together in a top-level procedure that constitutes
+the overall design. This top-level procedure would typically be at the end of
+a file which imports all the other relevant design files. This importing feature
+forms a simple but effective way of allowing component reuse and maps sim-
+ply onto the notion of the imported procedures being either pre-compiled hand-
+shake circuits or existing (possibly hand crafted) library components. Declara-
+tions have a syntactically defined order (left to right, top to bottom) with each
+declaration having its scope defined from the point of declaration to the end of
+the current (or importing) file. Thus Balsa has the same simple “declare before
+use” rule of C and Modula, though without any facility for prototypes.
+
+Table 10.2. Balsa binary/unary operators.
+
+Symbol                Operation           Valid types     Notes
+.                     record indexing     record
+#                     smash               any             takes value from any type and reduces it
+                                                          to an array of bits
+[]                    array indexing      array           non-constant index possible (can generate
+                                                          lots of hardware)
+not, log, - (unary)   unary operators     numeric         log only works on constants and returns
+                                                          the ceiling: e.g. log 15 returns 4;
+                                                          “-” returns a result 1 bit wider than the
+                                                          argument
+^                     exponentiation      numeric
+*, /, %               multiply, divide,   numeric         only applicable to constants
+                      remainder
++, -                  add, subtract       numeric         results are 1 or 2 bits longer than
+                                                          the largest argument
+@                     concatenation       arrays
+<, >, <=, >=          inequalities        numeric,
+                                          enumerations
+=, /=                 equals,             all types       comparison is by sign extended
+                      not equals                          value for signed numeric types
+and                   bitwise AND         numeric         Balsa uses type 1 bits for if/while
+                                                          guards so bitwise and logical operators
+                                                          are the same.
+or, xor               bitwise OR, XOR     numeric
+
+Declarations
+
+Declarations introduce new type, constant or procedure names into the global
+name spaces from the point of declaration until the end of the enclosing block
+(or file in the case of top-level declarations). There are three disjoint name
+spaces: one for types, one for procedures and a third for all other declarations.
+At the top level, only constants are in this last category. However, variables and
+channels may be included in procedure local declarations. Where a declaration
+within an enclosed/inner block has the same name as one previously made in
+an outer/enclosing context, the local declaration will hide the outer declaration
+for the remainder of that inner block.
+
+Procedures
+
+Procedures form the bulk of a Balsa description. Each procedure has a name,
+a set of ports and an accompanying behavioural description. The sync key-
+word introduces dataless channels. Both dataless and data bearing channels can
+be members of “arrayed channels”. Arrayed channels allow numeric/enumer-
+ated indexing of otherwise functionally separate channels. Procedures can also
+carry a list of local declarations which may include other procedures, types and
+constants.
+
+Shared procedures
+
+Normally, each call to a procedure generates separate hardware to instan-
+tiate that procedure. A procedure may be shared, in which case calls to that
+procedure access common hardware thereby avoiding duplication of the cir-
+cuit at the cost of some multiplexing to allow sharing to occur. The use of
+shared procedures is discussed further on page 187.
+
+Functions
+
+In many programming languages, functions can be thought of as procedures
+without side effects that return a result. However, in Balsa there is a fundamen-
+tal difference between functions and procedures. Parameters to a procedure
+define handshaking channels that interface to the circuit block defined by the
+procedure. Function parameters, on the other hand, are just expression aliases
+returning values. An example of the use of function definitions can be found
+in the arbiter tree design on page 202.
+
+10.6.
+Example circuits
+
+In this section, various designs of counter are described in Balsa. In flavour,
+they resemble the specifications of conventional synchronous counters, since
+these designs are more familiar to newcomers to asynchronous systems. More
+sophisticated systolic counters, better suited to an asynchronous approach, are
+described by van Berkel [14].
+In this design, the role of the clock which updates the state of the counter is
+taken by a dataless sync channel, named aclk. The counter issues a handshake
+request over the sync channel, the environment responds with an acknowledge-
+ment completing the handshake and the counter state is updated.
+
+A modulo-16 counter
+
+-- count16a.balsa: modulo 16 counter
+import [balsa.types.basic]
+
+procedure count16 (sync aclk; output count : nibble) is
+  variable count_reg : nibble
+begin
+  loop
+    sync aclk ;
+    count <- count_reg ;
+    count_reg := (count_reg + 1 as nibble)
+  end
+end
+
+This counter interfaces to its environment by means of two channels: the
+dataless aclk channel and the output channel count which carries the current
+count value. The internal register which implements the variable count_reg
+and the output channel are of the predefined type nibble (4 bits wide). After
+count_reg is incremented, the result must be cast back to type nibble. For
+simplicity, issues of initialisation/reset have been ignored. A LARD simulation
+of this circuit will give a harmless warning when uninitialised variables are
+accessed.
+
+Removing auto-assignment
+
+The auto-assignment statement in the example above, although concise and
+expressive, hides the fact that, in most back-ends, an auxiliary variable is cre-
+ated so that the update can be carried out in a race-free manner. By making this
+auxiliary variable explicit, advantage may be taken of its visibility to overlap
+its update with other activity as shown in the example below.
+
+-- count16b.balsa: write-back overlaps output assignment
+import [balsa.types.basic]
+
+procedure count16 (sync aclk; output count : nibble) is
+  variable count_reg, tmp : nibble
+begin
+  loop
+    sync aclk;
+    tmp := (count_reg + 1 as nibble) ||
+    count <- count_reg;
+    count_reg := tmp
+  end
+end
+
+In this example, the transfer of the count register to the output channel is
+overlapped with the incrementing of the auxiliary shadow register. There is
+some slight area overhead involved in parallelisation and any potential speed-
+up may be minimal in this case, but the principle of making trade-offs at the
+level of the source code is illustrated.
+
+A modulo-10 counter
+
+The basic counter description above can easily be modified to produce a
+modulo-10 counter. A simple test is required to detect when the internal regis-
+ter reaches its maximum value and then to reset it to zero.
+
+-- count10a.balsa: an asynchronous decade counter
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 9
+
+procedure count10(sync aclk; output count: C_size) is
+  variable count_reg : C_size
+  variable tmp : C_size
+begin
+  loop
+    sync aclk;
+    if count_reg /= max_count then
+      tmp := (count_reg + 1 as C_size)
+    else
+      tmp := 0
+    end || count <- count_reg ;
+    count_reg := tmp
+  end -- loop
+end -- begin
+
+A loadable up/down decade counter
+
+This example describes a loadable up/down decade counter. It introduces
+many of the language features discussed earlier in the chapter. The counter
+requires two control bits, one to determine the direction of count, and the other
+to determine whether the counter should load or increment/decrement on the
+next operation. There are several valid design options; in this example,
+count10b below, the control bits and the data to be loaded are bundled together
+in a single channel, in_sigs.
+
+-- count10b.balsa: an asynchronous up/down decade counter
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 9
+
+type dir is enumeration down, up end
+type mode is enumeration load, count end
+type In_bundle is record
+  data : C_size ;
+  mode : mode;
+  dir : dir
+end
+
+procedure updown10 (
+  input in_sigs: In_bundle;
+  output count: C_size
+) is
+  variable count_reg : C_size
+  variable tmp : In_bundle
+begin
+  loop
+    in_sigs -> tmp; -- read control+data bundle
+    if tmp.mode = count then
+      case tmp.dir of
+        down then -- counting down
+          if count_reg /= 0 then
+            tmp.data := (count_reg - 1 as C_size)
+          else
+            tmp.data := max_count
+          end -- if
+      | up then -- counting up
+          if count_reg /= max_count then
+            tmp.data := (count_reg + 1 as C_size)
+          else
+            tmp.data := 0
+          end -- if
+      end -- case tmp.dir
+    end; -- if
+    count <- tmp.data || count_reg:= tmp.data
+  end -- loop
+end
+
+The example above illustrates the use of if ... then ... else and case
+control constructs as well as the use of record structures and enumerated types.
+The use of symbolic values within enumerated types makes the code more
+readable. Test harnesses which can be generated automatically by the Balsa
+system can also read the symbolic enumerated values. For example, here is a
+test file which initialises the counter to 8, counts up, testing that the counter
+wraps round to zero and then counts down allowing the user to check that the
+counter correctly wraps to 9.
+
+{8, load, up}      load counter with 8
+{0, count, up}     count to 9
+{0, count, up}     count & wrap to 0
+{0, count, up}     count to 1
+{0, count, down}   count down to 0
+{0, count, down}   count down to 9
+{1, load, down}    load counter with 1
+{0, count, down}   count down to 0
+{0, count, down}   count down & wrap to 9
+
+Sharing hardware
+
+In Balsa, every statement instantiates hardware in the resulting circuit. It
+is therefore worth examining descriptions to see if there are any repeated con-
+structs that could either be moved to a common point in the code or replaced by
+shared procedures. In count10b above, the description instantiates two adders:
+one used for incrementing and the other for decrementing. Since these two
+units are not used concurrently, area can be saved by sharing a single adder
+(which adds either +1 or -1 depending on the direction of count) described by
+a shared procedure. The code below illustrates how count10b can be rewritten
+to use a shared procedure. The shared procedure add_sub computes the next
+count value by adding the current count value to a variable, inc, which can
+take values of +1 or -1. Note that to accommodate these values, inc must be
+declared as 2 signed bits.
+The area advantage of the approach is shown by examining the cost of the
+circuit reported by breeze-cost: count10b has a cost of 2141 units, whereas the
+shared procedure version has a cost of only 1760. The relative advantage, of
+course, becomes more pronounced as the size of the counter increases.
+
+-- count10c.balsa: introducing shared procedures
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 9
+
+type dir is enumeration down, up end
+type mode is enumeration load, count end
+type inc is 2 signed bits
+
+type In_bundle is record
+  data : C_size ;
+  mode : mode;
+  dir : dir
+end
+
+procedure updown10 (
+  input in_sigs: In_bundle;
+  output count: C_size
+) is
+  variable count_reg : C_size
+  variable tmp : In_bundle
+  variable inc : inc
+
+  shared add_sub is
+  begin
+    tmp.data:= (count_reg + inc as C_size)
+  end
+
+begin
+  loop
+    in_sigs -> tmp; -- read control+data bundle
+    if tmp.mode = count then
+      case tmp.dir of
+        down then -- counting down
+          if count_reg /= 0 then
+            inc:= -1;
+            add_sub()
+          else
+            tmp.data := max_count
+          end -- if
+      | up then -- counting up
+          if count_reg /= max_count then
+            inc := +1;
+            add_sub()
+          else
+            tmp.data := 0
+          end -- if
+      end -- case tmp.dir
+    end; -- if
+    count <- tmp.data || count_reg:= tmp.data
+  end -- loop
+end
+
+In order to guarantee the correctness of implementations, there are a number
+of minor restrictions on the use of shared procedures:
+
+shared procedures cannot have any arguments;
+
+shared procedures cannot use local channels;
+
+shared procedures using elements of the channel referenced by a select
+statement (see section 10.7 on page 190) must be declared as local within
+the body of that select block.
+
+“while” loop description
+
+An alternative description of the modulo-10 counter employs the while con-
+struct:
+
+-- count10d.balsa: mod-10 counter alternative implementation
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 10
+
+procedure count10(sync aclk; output count : C_size) is
+  variable count_reg : C_size
+begin
+  loop
+    while count_reg < max_count then
+      sync aclk;
+      count <- count_reg;
+      count_reg:= (count_reg + 1 as C_size)
+    end; -- while
+    count_reg:= 0
+  end -- loop
+end
+
+Structural “for” loops
+
+for loops are a potential pitfall for beginners to Balsa. In many program-
+ming languages, while loops and for loops can be used interchangeably. This
+is not the case in Balsa: a for loop implements structural iteration, in other
+words, separate hardware is instantiated for each pass through the loop. The
+following description, which superficially appears very similar to the while
+loop example of count10d previously, appears to be correct: it compiles with-
+out problems and a LARD simulation appears to give the correct behaviour.
+However, examination of the cost reveals an area cost of 11577, a large in-
+crease. It is important to understand why this is the case. The for loop is un-
+rolled at compile time and 10 instances of the circuit to increment the counter
+are created. Each instance of the loop is activated sequentially. The PostScript
+plot of the handshake circuit graph is rather unreadable; setting max_count to
+3 produces a more readable plot.
+
+-- count10e.balsa: beware the "for" construct
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 10
+
+procedure count10(sync aclk; output count: C_size) is
+  variable count_reg : C_size
+begin
+  loop
+    for ; i in 1 .. max_count then
+      sync aclk;
+      count <- count_reg;
+      count_reg:= (count_reg + 1 as C_size)
+    end; -- for ; i
+    count_reg:= 0
+  end -- loop
+end -- begin
+
+If, instead of using the sequential for construct, the parallel for construct
+had been employed (for || ...), the compiler would give an error message
+complaining about read/write conflicts from parallel threads. In this case, all
+instances of the counter circuit would attempt to update the counter register
+at the same time, leading to possible conflicts. A reader who understands the
+resulting potential handshake circuit is well on the way to a good understanding
+of the methodology.
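+
+For reference, the rejected variant can be sketched by changing only the loop
+separator of count10e. The file name count10f.balsa is hypothetical, and the
+description is shown to illustrate the conflict rather than as a working
+design: every unrolled copy of the body reads and writes count_reg in
+parallel, which is the conflict the compiler reports.
+
+-- count10f.balsa (hypothetical): parallel "for", expected to be rejected
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 10
+
+procedure count10(sync aclk; output count: C_size) is
+  variable count_reg : C_size
+begin
+  loop
+    for || i in 1 .. max_count then
+      sync aclk;
+      count <- count_reg;
+      count_reg:= (count_reg + 1 as C_size) -- written by all 10 copies at once
+    end; -- for || i
+    count_reg:= 0
+  end -- loop
+end -- begin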
+
+10.7.
+Selecting channels
+
+The asynchronous circuit described below merges two input channels into
+a single output channel; it may be thought of as a self-selecting multiplexer.
+The select statement chooses between the two input channels a and b by
+waiting for data on either channel to arrive. When a handshake on either a or b
+commences, data is held valid on the input, and the handshake is not completed,
+until the end of the select ... end block.
+This circuit is an example of handshake enclosure and avoids the need for an
+internal latch to be created to store the data from the input channel; a possible
+disadvantage is that, because of the delayed completion of the handshake, the
+input is not released immediately to continue processing independently. In this
+example, data is transferred to the output channel and the input handshake will
+complete as soon as data has been removed from the output channel. An exam-
+ple of a more extended enclosure can be found in the code for the population
+counter on page 197.
+
+-- mux1.balsa: unbuffered Merge
+import [balsa.types.basic]
+
+procedure mux (input a, b :byte; output c :byte) is
+begin
+  loop
+    select a then c <- a -- channel behaves like a variable
+    |      b then c <- b -- ditto
+    end -- select
+  end -- loop
+end
+
+Because of the enclosed nature of the handshake associated with select,
+inputs a and b should be mutually exclusive for the duration of the block of
+code enclosed by the select. In many cases, this is not a difficult obligation
+to satisfy. However, if a and b are truly independent, select can be replaced
+by arbitrate which allows an arbitrated choice to be made. Arbiters are rel-
+atively expensive in terms of speed and may not be possible to implement in
+some technologies. Further, the loss of determinism in circuits with arbitra-
+tion can also introduce testing and design correctness verification problems.
+Designers should therefore not use arbiters unnecessarily.
+
+-- mux2.balsa: unbuffered arbitrated Merge.
+import [balsa.types.basic]
+
+procedure mux (input a, b :byte; output c :byte) is
+begin
+  loop
+    arbitrate a then c <- a -- channel behaves like a variable
+    |         b then c <- b -- ditto
+    end -- arbitrate
+  end -- loop
+end
+
+
+
+Chapter 11
+
+BUILDING LIBRARY COMPONENTS
+
+11.1.
+Parameterised descriptions
+
+Parameterised procedures allow designers to develop a library of commonly
+used components and then to instantiate those structures later with varying
+parameters. A simple example is the specification of a buffer as a library part
+without knowing the width of the buffer. Similarly, a pipeline of buffers can be
+defined in the library without requiring any knowledge of the depth chosen for
+the pipeline when it is instantiated.
+
+11.1.1
+A variable width buffer definition
+
+The example pbuffer1 below defines a single place buffer with a parame-
+terised width.
+
+-- pbuffer1.balsa - parameterised buffer example
+import [balsa.types.basic]
+
+procedure Buffer (
+  parameter X : type ;
+  input i : X; output o : X
+) is
+  variable x : X
+begin
+  loop
+    i -> x ;
+    o <- x
+  end
+end
+
+-- now define a byte wide buffer
+procedure Buffer8 is Buffer (byte)
+
+-- now use the definition
+procedure test1 (input a : byte ; output b : byte) is
+begin
+  Buffer8 (a, b)
+end
+
+-- alternatively
+procedure test2 (input a : byte ; output b : byte) is
+begin
+  Buffer (byte, a, b)
+end
+
+The definition of the single place buffer given earlier on page 161 is modi-
+fied by the addition of the parameter declaration which defines X to be of type
+type. In other words X is identified as being a type to be refined later. Once a
+parameter type has been declared, it can be used in later declarations and state-
+ments: for example, input channel i is defined as being of type X. No hardware
+is generated for the parameterised procedure definition itself.
+Having defined the procedure, it can be used in other procedure definitions.
+Buffer8 defines a byte wide buffer that can be instantiated as required as
+shown, for example, in procedure test1. Alternatively, a concrete realisation
+of the parameterised procedure can be used directly as shown in procedure
+test2.
+
+11.1.2
+Pipelines of variable width and depth
+
+The next example illustrates how multiple parameters to a procedure may
+be specified. The parameterised buffer element is included in a pipeline whose
+depth is also parameterised.
+
+-- pbuffer2.balsa - parameterised pipeline example
+import [balsa.types.basic]
+import [pbuffer1]
+
+-- BufferN: an n-place parameterised, variable width buffer
+procedure BufferN (
+  parameter n : cardinal ;
+  parameter X : type ;
+  input i : X ;
+  output o : X
+) is
+begin
+  if n = 1 then -- single place pipeline
+    Buffer(X, i, o)
+  | n >= 2 then -- parallel evaluation
+    local array 1 .. n-1 of channel c : X
+    begin
+      Buffer(X, i, c[1]) ||     -- first buffer
+      Buffer(X, c[n-1], o) ||   -- last buffer
+      for || i in 1 .. n-2 then
+        Buffer(X, c[i], c[i+1])
+      end -- for || i
+    end
+  else print error, "zero length pipeline specified"
+  end -- if
+end
+
+-- Now define a 4 deep, byte-wide pipeline.
+procedure Buffer4 is BufferN(4, byte)
+
+Buffer is the single place parameterised width buffer of the previous exam-
+ple and this is reused by means of the library statement import[pbuffer1].
+In this code, BufferN is defined in a very similar manner to the example in sec-
+tion 9.4.4 on page 165 except that the number of stages in the pipeline, n, is
+not a constant but is a parameter to the definition of type cardinal. Note that
+this definition includes some error checking. If an attempt is made to build a
+zero length pipeline during a definition, an error message is printed.
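+
+Other widths and depths follow the same pattern. The sketch below is
+illustrative only: the file name pbuffer3.balsa and the names Buffer8x16 and
+test3 are hypothetical, and it assumes that BufferN is available by importing
+pbuffer2 in the same way that pbuffer2 imports pbuffer1.
+
+-- pbuffer3.balsa (hypothetical): further instantiations of BufferN
+import [balsa.types.basic]
+import [pbuffer2]
+
+-- an 8 deep, 16-bit wide pipeline
+procedure Buffer8x16 is BufferN (8, 16 bits)
+
+-- direct use with concrete parameters, in the style of test2
+procedure test3 (input a : byte ; output b : byte) is
+begin
+  BufferN (3, byte, a, b) -- first, middle and last buffer stages
+end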
+
+11.2.
+Recursive definitions
+
+Balsa allows a form of recursion in definitions (as long as the resulting struc-
+tures can be statically determined at compile time). Many structures can be
+described elegantly using this technique which forms a natural extension to
+the powerful parameterisation mechanism. The remainder of this chapter il-
+lustrates recursive parameterisation by means of some interesting (and useful)
+examples.
+
+11.2.1
+An n-way multiplexer
+
+An n-way multiplexer can be constructed from a tree of 2-way multiplexers.
+A recursive definition suggests itself as the natural specification technique: an
+n-way multiplexer can be split into two n/2-way multiplexers connected by
+internal channels to a 2-way multiplexer as shown in figure 11.1 on page 196.
+
+[Figure 11.1. Decomposition of an n-way multiplexer. Before decomposition:
+inputs inp[0] .. inp[n-1] feed a single multiplexer driving out. After
+decomposition: inputs inp[0] .. inp[n/2-1] and inp[n/2] .. inp[n-1] feed two
+multiplexers whose outputs out0 and out1 feed a 2-way multiplexer driving out.]
+
+-- Pmux1.balsa: A recursive parameterised MUX definition
+import [balsa.types.basic]
+
+procedure PMux (
+  parameter X : type;
+  parameter n : cardinal;
+  array n of input inp : X; -- note use of arrayed port
+  output out : X
+) is
+begin
+  if n = 0 then print error,"Parameter n should not be zero"
+  | n = 1 then
+    loop
+      select inp[0] then
+        out <- inp[0]
+      end -- select
+    end -- loop
+  | n = 2 then
+    loop
+      select inp[0] then
+        out <- inp[0]
+      | inp[1] then
+        out <- inp[1]
+      end -- select
+    end -- loop
+  else
+    local -- local block with local definitions
+      channel out0, out1 : X
+      constant mid = n/2
+    begin
+      PMux (X, mid, inp[0..mid-1], out0) ||
+      PMux (X, n-mid, inp[mid..n-1], out1) ||
+      PMux (X, 2, {out0,out1},out)
+    end
+  end -- if
+end
+
+-- Here is a 5-way multiplexer
+procedure PMux5Byte is PMux(byte, 5)
+
+Commentary on the code
+
+The multiplexer is parameterised in terms of the type of the inputs and the
+number of channels n. The code is straightforward. A multiplexer of size
+greater than 2 is decomposed into two multiplexers half the size connected by
+internal channels to a 2-to-1 multiplexer. Notice how the arrayed channels,
+out0 and out1, are specified as a tuple. The recursive decomposition stops
+when the number of inputs is 2 or 1 (specifying a multiplexer with zero inputs
+generates an error). A 1-input multiplexer makes no choice of inputs.
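+
+To make the recursion concrete, the expansion of the 5-way instance PMux5Byte
+can be traced by hand from the definition. The trace below is written as Balsa
+comments and is only an illustration, not compiler output; it assumes that the
+constant division n/2 truncates, so that 5/2 = 2 and 3/2 = 1.
+
+-- Hand expansion of PMux (byte, 5); indices are relative to each call
+--   PMux (byte, 5): mid = 2
+--     PMux (byte, 2, inp[0..1], out0)      -- n = 2: two-way select
+--     PMux (byte, 3, inp[2..4], out1)      -- mid = 1
+--       PMux (byte, 1, inp[0..0], out0)    -- n = 1: single input
+--       PMux (byte, 2, inp[1..2], out1)    -- n = 2: two-way select
+--       PMux (byte, 2, {out0,out1}, out)   -- n = 2: combines the halves
+--     PMux (byte, 2, {out0,out1}, out)     -- n = 2: combines the halves
+
+Under these assumptions the 5-way multiplexer is built from four 2-way
+selects and one 1-input stage.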
+
+A Balsa test harness
+
+The code below illustrates how a simple Balsa program can be used as a test
+harness to generate test values for the multiplexer. The test program is actually
+rather naïve.
+
+-- test_pmux.balsa - A test-harness for Pmux1
+import [balsa.types.basic]
+import [pmux1]
+
+procedure test (output out : byte) is
+  type ttype is sizeof byte + 1 bits
+  array 5 of channel inp : byte
+  variable i : ttype
+begin
+  begin
+    i:= 1;
+    while i <= 0x80 then
+      inp[0] <- (i as byte);
+      inp[1] <- (i+1 as byte);
+      inp[2] <- (i+2 as byte);
+      inp[3] <- (i+3 as byte);
+      inp[4] <- (i+4 as byte);
+      i:= (i + i as ttype)
+    end
+  end ||
+  PMux5Byte(inp, out)
+end
+
+11.2.2
+A population counter
+
+This next example counts the number of bits set in a word. It comes from
+the requirement in an Amulet processor to know the number of registers to be
+restored/saved during LDM/STM (Load/Store Multiple) instructions.
+The approach taken is to partition the problem into two parts. Initially, adjacent
+bits are added together to form an array of 2-bit channels representing the
+numbers of bits that are set in each of the adjacent pairs. The array of 2-bit
+numbers is then added in a recursively defined tree of adders (procedure
+AddTree). The structure of the bit-counter is shown in figure 11.2.
+
+[Figure 11.2. Structure of a bit population counter, PopulationCount (8):
+four single-bit adders sum the adjacent pairs of #i[0] .. #i[7]; AddTree (4,2,4)
+then combines the four 2-bit sums using two AddTree (2,2,3) instances and one
+AddTree (2,3,4) to produce o.]
+
+-- popcount: count the number of bits set in a word
+import [balsa.types.basic]
+
+procedure AddTree (
+  parameter inputCount : cardinal;
+  parameter inputSize : cardinal;
+  parameter outputSize : cardinal;
+  array inputCount of input i : inputSize bits;
+  output o : outputSize bits
+) is
+begin
+  if inputCount = 1 then
+    select i[0] then o <- (i[0] as outputSize bits) end
+  | inputCount = 2 then
+    select i[0], i[1] then
+      o <- (i[0] + i[1] as outputSize bits)
+    end -- select
+  else
+    local
+      constant lowHalfInputCount = inputCount / 2
+      constant highHalfInputCount = inputCount - lowHalfInputCount
+
+      channel lowO, highO : outputSize - 1 bits
+    begin
+      AddTree (lowHalfInputCount, inputSize, outputSize - 1,
+               i[0..lowHalfInputCount-1], lowO) ||
+      AddTree (highHalfInputCount, inputSize, outputSize - 1,
+               i[lowHalfInputCount..inputCount-1], highO) ||
+      AddTree (2, outputSize - 1, outputSize, {lowO, highO}, o)
+    end
+  end -- if
+end
+
+procedure PopulationCount (
+  parameter n : cardinal;
+  input i : n bits;
+  output o : log (n+1) bits
+) is
+begin
+  if n % 2 = 1 then
+    print error, "number of bits must be even"
+  end; -- if
+  loop
+    select i then
+      if n = 1 then
+        o <- i
+      | n = 2 then
+        o <- (#i[0] + #i[1]) -- add bits 0 and 1
+      else
+        local
+          constant pairCount = n - (n / 2)
+          array pairCount of channel addedPairs : 2 bits
+        begin
+          for || c in 0..pairCount-1 then
+            addedPairs[c] <- (#i[c*2] + #i[(c*2)+1])
+          end ||
+          AddTree (pairCount, 2, log (n+1), addedPairs, o)
+        end
+      end -- if
+    end -- select
+  end -- loop
+end
+
+procedure PopCount16 is PopulationCount (16)
+
+Commentary on the code
+
+parameterisation:
+Procedures AddTree and PopulationCount are param-
+eterised. PopulationCount can be used to count the number of bits set in any
+sized word. AddTree is parameterised to allow a recursively defined adder of
+any number of arbitrary width vectors.
+
+enclosed selection:
+The semantics of the enclosed handshake of select al-
+low the contents of the input i to be referred to several times in the body of the
+select block without the need for an internal latch.
+
+avoiding deadlock:
+Note that the formation of the sum of adjacent bits is
+specified by a parallel for loop.
+
+for || c in 0..pairCount-1 then
+addedPairs[c] <- (#i[c*2] + #i[(c*2)+1])
+end
+
+It might be thought that a serial for ; loop could be used at, perhaps, the
+expense of speed. This is not the case: the system will deadlock, illustrating
+why designing asynchronous circuits requires some real understanding of the
+methodology. In this case the adder to which the array of addedPairs is connected
+requires pairs of inputs to be ready before it can complete the addition
+and release its inputs. However, if the sum of adjacent bits is computed seri-
+ally, the next pair will not be computed until the handshake for the previous
+pair has been completed – which is not possible because AddTree is awaiting
+all pairs to become valid: result deadlock!
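+
+The parameter arithmetic of the population counter can be checked by expanding
+PopCount16 by hand. The trace below, written as Balsa comments, is a sketch
+rather than compiler output; it assumes that log returns the ceiling of the
+base-2 logarithm (as noted in table 10.2) and that constant division truncates.
+
+-- Hand expansion of PopulationCount (16)
+--   output width: log (16+1) = 5 bits (values 0 .. 16)
+--   pairCount = 16 - (16 / 2) = 8 channels of 2 bits
+--   AddTree (8, 2, 5)
+--     AddTree (4, 2, 4)       -- low half
+--       AddTree (2, 2, 3)     -- two 2-bit inputs, 3-bit sum
+--       AddTree (2, 2, 3)
+--       AddTree (2, 3, 4)     -- two 3-bit inputs, 4-bit sum
+--     AddTree (4, 2, 4)       -- high half, same structure
+--     AddTree (2, 4, 5)       -- two 4-bit inputs, 5-bit sum
+
+Each level of the tree widens the sum by one bit, so no partial sum can
+overflow: a 4-input subtree sums at most 8, which fits in its 4-bit output.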
+
+11.2.3
+A Balsa shifter
+
+General shifters are an essential element of all microprocessors including
+the Amulet processors. The following description forms the basis of such a
+shifter. It implements only a rotate right function, but it is easily extensible to
+other shift functions.
+The main work of the shifter is local procedure rorBody which recursively
+creates sub-shifters capable of optionally rotating 1, 2, 4, 8 etc. bits. The
+structure of the shifter is shown in figure 11.3.
+
+[Figure 11.3. Structure of a rotate right shifter. ror (8 bits) chains
+rorBody (4), rorBody (2) and rorBody (1) between input i and output o, with the
+distance d supplied to every stage. Within each rorStage a mux controlled by
+#d[log distance] selects either the rotated input (1) or the unrotated
+input (0).]
+
+import [balsa.types.basic]
+
+-- ror: rotate right shifter
+procedure ror (
+  parameter X : type;
+  input d : sizeof X bits;
+  input i : X;
+  output o : X
+) is
+begin
+  loop
+    select d then
+      local
+        constant typeWidth = sizeof X
+
+        procedure rorBody (
+          parameter distance : cardinal;
+          input i : X;
+          output o : X
+        ) is
+        local
+          procedure rorStage (
+            output o : X
+          ) is
+          begin
+            select i then
+              if #d[log distance] then
+                o <- (#i[typeWidth-1..distance] @
+                      #i[distance-1..0] as X) -- shift
+              else
+                o <- i -- don't shift
+              end -- if
+            end -- select
+          end
+          channel c : X
+        begin
+          if distance > 1 then
+            rorStage (c) ||
+            rorBody (distance/2, c, o)
+          else
+            rorStage (o)
+          end -- if
+        end
+      begin
+        rorBody (typeWidth/2, i, o)
+      end
+    end -- select
+  end -- loop
+end
+
+procedure ror32 is ror (32 bits)
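+
+As a worked illustration of how the stages combine, consider the 8-bit
+instance drawn in figure 11.3 with a rotate distance of 5. The instantiation
+name ror8 is hypothetical, and the trace (written as comments) assumes that
+bit 0 of d is the least significant bit and that log 1 evaluates to 0.
+
+-- hypothetical 8-bit instance, following the ror32 pattern
+procedure ror8 is ror (8 bits)
+
+-- Stages created for typeWidth = 8 and the control bit each one tests:
+--   rorBody (4): #d[log 4] = #d[2]   rotate right by 4 if set
+--   rorBody (2): #d[log 2] = #d[1]   rotate right by 2 if set
+--   rorBody (1): #d[log 1] = #d[0]   rotate right by 1 if set
+-- For d = 5 (binary 101) bits 2 and 0 are set, so the input is rotated
+-- right by 4 and then by 1, giving a total rotation of 5 places.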
+
+Testing the shifter
+
+This next code example builds another test routine in Balsa to exercise the
+shifter.
+
+import [balsa.types.basic]
+import [ror]
+
+--test ror32
+procedure test_ror32(output o : 32 bits) is
+  variable i : 5 bits
+  channel shiftchan : 32 bits
+  channel distchan : 5 bits
+begin
+  begin
+    i:= 1;
+    while i < 31 then
+      shiftchan <- 7 || distchan <- i;
+      i:= (i+1 as 5 bits)
+    end -- while
+  end || ror32(distchan, shiftchan, o)
+end
+
+11.2.4
+An arbiter tree
+
+The final example builds a parameterised arbiter. This circuit forms part
+of the DMA controller of chapter 12. The architecture of an 8-input arbiter
+is shown in figure 12.3 on page 212. ArbFunnel is a parameterisable tree
+composed of two elements: ArbHead and ArbTree. Pairs of incoming sync
+requests are arbitrated and combined into single bit decisions by ArbHead el-
+ements. These single bit channels are then arbitrated between by ArbTree
+elements. An ArbTree takes a number of decision bits from each of a number
+of inputs (on the i ports) and produces a rank of 2-input arbiters to reduce the
+problem to half as many inputs each with 1 extra decision bit. Recursive calls
+to ArbTree reduce the number of input channels to one (whose final decision
+value is returned on port o).
+
+-- ArbHead: 2 way arbcall with channel no. output
+import [balsa.types.basic]
+procedure ArbHead (
+  sync i0, i1;
+  output o : bit
+) is
+begin
+  loop
+    arbitrate i0 then o <- 0
+    |         i1 then o <- 1
+    end -- arbitrate
+  end -- loop
+end -- begin
+
+-- ArbTree: a tree arbcall which outputs a channel number
+-- prepended onto the input channel’s data. (invokes itself
+-- recursively to make the tree)
+
+procedure ArbTree (
+  parameter inputCount : cardinal;
+  parameter depth : cardinal; -- bits to carry from inputs
+  array inputCount of input i : depth bits;
+  output o : (log inputCount) + depth bits
+) is
+  type BitArray is array 1 of bit
+  type BitArray2 is array 2 of bit
+  function AddTopBit (hd : bit; tl : depth bits) =
+    (#tl @ {hd} as depth + 1 bits)
+  function AddTopBit2 (hd : bit; tl : depth + 1 bits) =
+    (#tl @ {hd} as depth + 2 bits)
+  function AddTop2Bits (hd0 : bit; hd1 : bit; tl : depth bits) =
+    (#tl @ {hd0,hd1} as depth + 2 bits)
+begin
+  case inputCount of
+    0, 1 then print error, "Can’t build an ArbTree with fewer than 2 inputs"
+  | 2 then
+    loop
+      arbitrate i[0] -> i0 then o <- AddTopBit (0, i0)
+      |         i[1] -> i1 then o <- AddTopBit (1, i1)
+      end -- arbitrate
+    end -- loop
+  | 3 then
+    local channel lo : 1 + depth bits
+    begin
+      ArbTree (2, depth, i[0 .. 1], lo) ||
+      loop
+        arbitrate lo then o <- AddTopBit2 (0, lo)
+        | i[2] -> i2 then o <- AddTop2Bits (1, 0, i2)
+        end -- arbitrate
+      end -- loop
+    end
+  else
+    local
+      constant halfCount = inputCount / 2
+      constant halfBits = depth + log halfCount
+      channel l, r : halfBits bits
+    begin
+      ArbTree (halfCount, depth, i[0 .. halfCount-1], l) ||
+      ArbTree (inputCount - halfCount, depth,
+               i[halfCount .. inputCount-1], r) ||
+      ArbTree (2, halfBits, {l,r}, o)
+    end -- begin
+  end -- case inputCount
+end
+
+-- ArbFunnel: build a tree arbcall (balanced apart from the last
+--
+channel which is faster than the rest) which produces a channel
+--
+number from an array of sync inputs
+procedure ArbFunnel (
+parameter inputCount : cardinal;
+array inputCount of sync i;
+output o : log inputCount bits
+) is
+constant halfCount = inputCount / 2
+constant oddInputCount = inputCount % 2
+begin
+if inputCount < 2 then
+print error, "can’t build an ArbFunnel with fewer than 2 inputs"
+| inputCount = 2 then
+ArbHead (i[0], i[1], o)
+| inputCount > 2 then
+local
+array halfCount + 1 of channel li : bit
+begin
+for || j in 0 .. halfCount - 1 then
+ArbHead (i[j*2], i[j*2+1], li[j])
+end ||
+if oddInputCount then
+ArbTree (halfCount + 1, 1, li[0 .. halfCount], o) ||
+loop
+select i[inputCount - 1] then li[halfCount] <- 0
+end -- select
+end -- loop
+else
+ArbTree (halfCount, 1, li[0 .. halfCount-1], o)
+end -- if
+end
+end -- if
+end
+
+
+Chapter 12
+
+A SIMPLE DMA CONTROLLER
+
+A simple 4-channel DMA controller is presented as a practical example of a
+reasonably large-scale Balsa design. It is written entirely in Balsa and so can
+be compiled for any of the technologies which the Balsa back-end supports.
+Readers should note that this controller is not the same as the Amulet3i DMA
+controller referred to in chapter 15. A more detailed description of this con-
+troller and the motivation for its design can be found in [8]. A complete listing
+of the code for the controller can be downloaded from [7].
+The simplified controller provides:
+
+4 full address range channels each with independent source, destination
+and count registers.
+
+8 client DMA request inputs with matching acknowledgements.
+
+Peripheral to peripheral, memory to memory and peripheral to memory
+transfers. Each channel has both source and destination client requests
+so “true” peripheral to peripheral transfers can be performed by waiting
+for requests from both parties.
+
+Figure 12.1 shows the programmer’s view of the controller’s register mem-
+ory map. The register bank is split into two parts: the channel registers and the
+global registers.
+
+12.1.
+Global registers
+
+The global registers contain control state pertaining to the state of currently
+active channels and of interrupts signalled by the termination of transfer runs.
+There are 4 global registers:
+
+genCtrl: General control.
+In this controller, the general control register
+only contains one bit: the global enable – gEnable. The global enable is the
+only controller bit reset at power-up. All other controller state bits must be
+initialised before gEnable is set. Using a global enable bit in this way allows
+the initialisation part of the Balsa description to remain small and cheap.
+
+
+[Figure 12.1: register memory map. Channel registers chan[n].src, chan[n].dst,
+chan[n].count and chan[n].ctrl (n = 0..3) occupy offsets 000 to 03C; global
+registers genCtrl, chanStatus, IRQMask and IRQReq occupy offsets 200 to 20C.
+The chan[n].ctrl fields are enable, srcInc, dstInc, countDec, srcDRQ, dstDRQ,
+srcClientNo and dstClientNo.]
+
+Figure 12.1.
+DMA controller programmer’s model.
+
+chanStatus: Channel end-of-run status.
+The chanStatus register contains
+4 bits, one per DMA channel. When set by the DMA controller, a bit in this
+register indicates that the corresponding channel has come to the end of its run
+of transfers.
+
+IRQMask, IRQReq: Interrupt mask and status.
+The IRQMask register
+contains one bit per channel (like chanStatus) with set bits specifying that
+an interrupt should be raised at the end of a transfer run of that channel (when
+the corresponding chanStatus bit becomes set). IRQReq contains the current
+interrupt status for each channel.
+The channel status, IRQ mask and IRQ status bits are kept in global registers
+in order to reduce the number of DMA register reads which must be performed
+by the CPU after receiving an interrupt in order to determine which channel to
+service.
+
+12.2.
+Channel registers
+
+Each channel has 4 registers associated with it in the same way as the
+Amulet3i DMA controller. The two address registers (channel[n].src and
+channel[n].dst) specify the 32-bit source and destination addresses for trans-
+fers. The count register (channel[n].count) is a 32-bit count of remaining
+transfers to perform; transfer runs terminate when the count register is decre-
+mented to zero. The control register (channel[n].ctrl) specifies the updates
+to be performed on the other three registers and the clients to which this chan-
+nel is connected. Writing to the control register has the effect of clearing inter-
+rupts and end-of-run indication on that channel. The control register contains
+8 fields:
+
+enable: Transfer enable.
+If the enable bit is set, this channel should be
+considered for transfers when a new DMA request arrives. Channel enables
+are not cleared on power-up. The genCtrl.gEnable bit can be used to pre-
+vent transfers from occurring whilst the channel enable bits are cleared during
+startup.
+
+srcInc, dstInc, countDec: Increment/decrement control.
+These bits are
+used to enable source, destination and count register update after a transfer.
+Source and destination registers are incremented by 4 after transfers (since
+only word transfers are supported in this version of the controller) if srcInc
+and dstInc (respectively) are set. Note that the bottom 2 bits of these ad-
+dresses are preserved. The count register is decremented by 1 after each trans-
+fer if countDec is set. Resetting either srcInc or dstInc results in the corre-
+sponding address remaining unchanged between transfers. This is useful for
+nominating peripheral (rather than memory) addresses. Resetting countDec
+results in “free-running” transfers.
+
+srcDRQ, dstDRQ: Initial DMA requests.
+Transfers can take place on a
+channel when a pair of DMA requests have been received, one for the source
+client and the other for the destination client (the requests-pending registers).
+The srcDRQ and dstDRQ bits specify the initial states for those two requests.
+Setting both of these bits indicates that the source and destination requests
+should be considered to have already arrived. Resetting one or both of the bits
+specifies that requests from the corresponding {src,dst}ClientNo numbered
+client should trigger a transfer (both client requests are required when both
+control bits are reset).
+
+srcClientNo, dstClientNo: Client to channel mapping.
+These fields spec-
+ify the client numbers from which this channel receives source and destination
+DMA requests. These fields are only of use when either srcDRQ or dstDRQ (or
+both) are reset.
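+
+The eight control fields can be illustrated with a small decoder. This is only
+a sketch: the bit positions (enable at bit 0 up to dstDRQ at bit 5, dstClientNo
+in bits 8:6 and srcClientNo in bits 11:9) are an assumption read off the
+Figure 12.1 labels, and the names ChannelCtrl, decode_ctrl and clients_awaited
+are hypothetical helpers, not part of the Balsa description.
+
```python
from dataclasses import dataclass


@dataclass
class ChannelCtrl:
    enable: int
    src_inc: int
    dst_inc: int
    count_dec: int
    src_drq: int
    dst_drq: int
    dst_client_no: int
    src_client_no: int


def decode_ctrl(word: int) -> ChannelCtrl:
    """Split a 12-bit control word into its 8 fields (bit layout is ASSUMED)."""
    return ChannelCtrl(
        enable=word & 1,
        src_inc=(word >> 1) & 1,
        dst_inc=(word >> 2) & 1,
        count_dec=(word >> 3) & 1,
        src_drq=(word >> 4) & 1,
        dst_drq=(word >> 5) & 1,
        dst_client_no=(word >> 6) & 0b111,  # assumed bits 8:6
        src_client_no=(word >> 9) & 0b111,  # assumed bits 11:9
    )


def clients_awaited(ctrl: ChannelCtrl) -> list:
    """Client numbers whose requests are still needed before a transfer.

    A set srcDRQ/dstDRQ bit means that request is treated as already
    arrived, so only the reset bits contribute a client number.
    """
    awaited = []
    if not ctrl.src_drq:
        awaited.append(ctrl.src_client_no)
    if not ctrl.dst_drq:
        awaited.append(ctrl.dst_client_no)
    return awaited
```
+
+With both srcDRQ and dstDRQ set, clients_awaited returns an empty list, which
+corresponds to the memory to memory case in which no client request is needed.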
+
+12.3.
+DMA controller structure
+
+The structure of the simplified DMA controller is shown in Figure 12.2. The
+simplified DMA controller is composed of 5 units:
+
+
+[Figure 12.2: block diagram. Labels: ARBITER (DMA requests, DRQ), ARBITER,
+Control unit, Transfer Engine, TECommand, TEAck (transfer acknowledge),
+busCommand, busResponse, MARBLE Target I/F, MARBLE Initiator I/F (A, D),
+interrupts to CPU.]
+
+Figure 12.2.
+DMA controller structure.
+
+
+MARBLE target interface
+
+The controller is assumed to be attached to the MARBLE asynchronous
+bus which connects the subsystems in the Amulet3i system-on-chip (see chap-
+ter 15). It is relatively easy to provide an interface to other forms of on-chip
+system connect busses.
+The MARBLE target interface provides a connection to MARBLE through
+which the controller can be programmed. Accesses to the registers from this
+interface are arbitrated with incoming DMA requests and transfer acknowl-
+edgements from the transfer engine. This arbitration and the decoupling of
+transfer engine from control unit allow the DMA controller to avoid potential
+bus access deadlock situations.
+The MARBLE interface used here carries an 8-bit address (8-bit word ad-
+dress, 10-bit byte address) similar to that of the Amulet3i DMA controller.
+This allows the same address mapping of channel registers and the possibility
+of extending the number of channels to 32 without changing the global
+register addresses.
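+
+The address arithmetic behind this claim can be sketched as follows. The
+layout is an assumption taken from the Figure 12.1 labels: 16 bytes per
+channel with registers in the order src, dst, count, ctrl, and the global
+registers starting at byte offset 200 (hex) in the order genCtrl, chanStatus,
+IRQMask, IRQReq. The function decode_register is a hypothetical helper.
+
```python
CHANNEL_REGS = ("src", "dst", "count", "ctrl")               # assumed order
GLOBAL_REGS = ("genCtrl", "chanStatus", "IRQMask", "IRQReq")  # assumed order


def decode_register(byte_addr: int, no_of_channels: int = 4):
    """Map a 10-bit byte address onto a channel or global register."""
    if not 0 <= byte_addr < 0x400 or byte_addr % 4 != 0:
        raise ValueError("expected a word-aligned 10-bit byte address")
    word_addr = byte_addr >> 2  # the 8-bit word address carried by the bus
    if word_addr & 0x80:        # top word-address bit selects the globals
        index = word_addr & 0x7F
        if index >= len(GLOBAL_REGS):
            raise ValueError("no global register at this address")
        return ("global", GLOBAL_REGS[index])
    channel, reg = divmod(word_addr, len(CHANNEL_REGS))
    if channel >= no_of_channels:
        raise ValueError("no such channel")
    return ("channel", channel, CHANNEL_REGS[reg])
```
+
+Under these assumptions, 32 channels of 4 words each fill word addresses 0 to
+127, which is exactly the space below byte offset 200 (hex), so the global
+registers would not need to move.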
+
+MARBLE initiator interface
+
+The initiator interface is used by the DMA controller to perform its transfers.
+Only the address and control bits to this interface are connected to the Balsa
+synthesised controller hardware. The data to and from the initiator interface is
+handled by a latch (shown as the shaded box in Figure 12.2). Only word-wide
+transfers are supported and so this latch is all that is needed to hold data values
+between the read and write bus transactions of a transfer. Supporting different
+transfer widths is relatively easy but has not been implemented in this example
+in order to simplify the code.
+
+Control unit
+
+Each DMA channel has a pair of register bits, the requests-pending bits,
+which record the arrival of requests for that channel’s source and destination
+clients. After marking-up an incoming request, the control unit examines the
+requests-pending registers of each channel in turn to find a channel on which
+to perform a transfer. If a transfer is to be performed, the register contents for
+that channel are forwarded to the transfer engine and the register contents are
+updated to reflect the incremented addresses and decremented count. DMA
+requests are acknowledged straight away when no transfer engine command
+is issued, or just after the command is issued where a transfer command is
+issued to the transfer engine. The acknowledgement of DMA requests does
+not guarantee the timely completion of the related transfer; peripherals must
+observe bus accesses made to themselves for this purpose. The acknowledgement
+serves only to confirm the receipt of the DMA transfer request. A request
+must be removed after an acknowledgement is signalled so that other requests
+can be received through the request arbitration tree to mark-up potential trans-
+fers for other channels.
+
+Transfer engine
+
+The controller’s transfer engine takes commands from the control unit when
+a DMA transfer is due to be performed and performs no DMA request mapping
+or filtering of its own. The only reason for having the transfer engine is to
+prevent the potential bus deadlock situation if an access to the register bank
+is made across MARBLE while the DMA controller is trying to perform a
+transfer. In this situation, control of the bus belongs to the initiator (usually
+the CPU) trying to access the DMA controller. This initiator cannot proceed
+as the DMA controller is engaged in trying to gain the bus for itself. With a
+transfer engine, and the decoupling of DMA request/CPU access from transfer
+operations, the control unit is free to fulfil the initiator’s register request while
+the transfer engine is waiting for the bus to become available.
+After performing a transfer, the transfer engine will signal to the control
+unit to provide a new transfer command; it does this by a handshake on the
+transfer acknowledge channel (marked TEAck in Figure 12.2). This channel
+passes through the control unit’s command arbitration hardware and serves to
+inform the control unit that the transfer engine is free and that the request-
+pending register can be polled to find the next suitable transfer candidate. The
+acknowledgement not only provides the self-looping activation required to per-
+form memory to memory transfers but also allows the looping required to ser-
+vice requests for other types of transfer which are received during the period
+when the transfer engine was busy.
+A flag register, TEBusy, held in the control unit, is used to record the status
+of the transfer engine so that commands are not issued to it while a transfer
+is in progress. This flag is set each time a transfer command is issued to the
+transfer engine and cleared each time a transfer acknowledgement is received
+by the control unit. The request-pending registers are not re-examined (and a
+transfer command issued) if TEBusy is set.
+
+Arbiter tree
+
+The DMA controller receives DMA requests on an array of 8 sync channels
+connected to the input of the ARBITER unit shown in Figure 12.2. This arbiter
+unit is a tree of 2-way arbiter cells that combines these 8 inputs into a single
+DMA request number which it provides to the control unit. DMA requests
+are acknowledged as soon as the control unit has recorded them. Only the
+successful transfer of data between peripherals should be used as an indication
+of the actual completion of a DMA operation. When a transfer is begun (i.e.
+passed from control unit to transfer engine), that transfer’s channel registers
+and requests-pending registers are updated before another arbitrated access to
+the control unit is accepted. As a consequence, a new request on a channel can
+arrive (and be correctly observed) as soon as any transfer-related bus activity
+occurs for that transfer.
+
+12.4.
+The Balsa description
+
+The Balsa description of the DMA controller is composed of 3 parts: the
+arbiter tree, the control unit and the transfer engine. The two MARBLE inter-
+faces sit outside the Balsa block and are controlled through the target (mta and
+mtd) ports (corresponding to command and response ports) and the initiator
+address/control (mia) port. The top level of the DMA controller is:
+
+procedure DMAArb is ArbFunnel (NoOfClients)
+
+procedure dma (
+input mta : MARBLE8bACommand;
+output mtd : MARBLEResponse;
+output mia : MARBLECommandNoData;
+output irq : bit;
+array NoOfClients of sync drq
+) is
+channel DRQClientNo : ClientNo
+channel TECommand : array 2 of Word -- srcAddr, dstAddr
+sync TEAck
+begin
+DMAArb (drq, DRQClientNo) ||
+DMAControl (mta, mtd, DRQClientNo, TECommand, TEAck, irq) ||
+DMATransferEngine (TECommand, TEAck, mia)
+end
+
+Interrupts are signalled by writing a 0 or 1 to the irq port. This interrupt
+value must then be caught by an external latch to generate a bare interrupt
+signal.
+
+12.4.1
+Arbiter tree
+
+DMA requests from the client peripherals arrive on the sync channels drq;
+these channels connect to the request arbiter DMAArb. The procedure declaration
+for DMAArb is given in the top level as a parameterised version of the procedure
+ArbFunnel, which was described in chapter 11 on page 202. Figure 12.3
+shows the organisation of an 8-input ArbFunnel.
+
+
+[Figure 12.3: inputs i[0] to i[7] feed four ArbHead cells in pairs; their
+outputs are merged by an ArbTree over 4, 1 (which contains an ArbTree over
+2, 2) built from three arbiter (ARB.) cells, producing output o. The whole
+structure is an ArbFunnel over 8.]
+
+Figure 12.3.
+8-input arbiter – ArbFunnel.
+
+12.4.2
+Transfer engine
+
+The transfer engine is, like the arbiter unit, quite simple. It exists only as
+a buffer stage between the control unit and the MARBLE initiator interface.
+This function is reflected in the sequencing in the Balsa description and the
+latches used to store the outgoing addresses.
+
+procedure DMATransferEngine (
+input command : array 2 of Word;
+sync ack;
+output busCommand : MARBLECommandNoData
+) is
+variable commandV : array 2 of Word
+begin
+loop
+command -> commandV;
+busCommand <- {commandV[0],read,word};
+busCommand <- {commandV[1],write,word};
+sync ack
+end
+end
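+
+The sequencing in this listing can be mimicked by a short behavioural model.
+It is a sketch only: the generator transfer_engine is a hypothetical stand-in
+that yields the bus commands and the acknowledge in the order the listing
+issues them, and it says nothing about handshake timing.
+
```python
def transfer_engine(commands):
    """Yield the bus activity for each (srcAddr, dstAddr) command in turn."""
    for src_addr, dst_addr in commands:
        yield (src_addr, "read", "word")   # busCommand <- {commandV[0],read,word}
        yield (dst_addr, "write", "word")  # busCommand <- {commandV[1],write,word}
        yield "ack"                        # sync ack, back to the control unit
```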
+
+12.4.3
+Control unit
+
+The bulk of the controller is contained in the control unit. It contains all the
+channel register latch bits and register access multiplexers/demultiplexers. The
+reduced number of channels and single channel type makes this arrangement
+practical. There are in total 445 bits of programmer accessible state. The ports,
+local variables and local channels of the control unit’s Balsa description are:
+
+procedure DMAControl (
+input busCommand : MARBLE8bACommand;
+output busResponse : MARBLEResponse;
+input DRQ : ClientNo;
+output TECommand : array 2 of Word;
+sync TEAck;
+output IRQ : bit
+) is
+-- combined channel registers
+variable channelRegisters :
+array NoOfChannels of ChannelRegister
+variable channelR, channelW : ChannelRegister
+array over ChannelRegType of bit
+variable channelNo : ChannelNo
+variable clientNo : ClientNo
+
+variable TEBusy : bit
+
+variable gEnable : bit
+variable chanStatus : array NoOfChannels of bit
+variable IRQMask, IRQReq : array NoOfChannels of bit
+
+variable requestsPending :
+array NoOfChannels of RequestPair
+
+channel commandSourceC : DMACommandSource
+channel busCommandC : MARBLE8bACommand
+channel DRQC : ClientNo
+variable commandSource : DMACommandSource
+. . .
+
+The ChannelRegister is the combined source, destination, count and con-
+trol registers for one channel. The variable channelRegisters is accessed
+by reading or writing these 108-bit wide registers (32 + 32 + 32 + 12). The
+two registers, channelR and channelW, are used as read and write buffers
+to the channel registers. This allows the partial writes required for CPU ac-
+cess to individual 32-bit words to fragment only these two registers, not all
+of the channel registers. The variables channelNo and clientNo are used to
+hold channel and client numbers between operations. DMA request arrival
+and request mark-up can modify clientNo and channel register accesses and
+ready-to-transfer polling can modify channelNo.
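+
+The figure of 445 bits of programmer accessible state quoted earlier can be
+checked with a little arithmetic. The breakdown below is an inference, not
+something the text spells out: it assumes the state consists of the four
+108-bit channel registers plus gEnable and the three per-channel global
+registers (chanStatus, IRQMask, IRQReq).
+
```python
NO_OF_CHANNELS = 4

# One ChannelRegister: source, destination, count and 12-bit control.
CHANNEL_REGISTER_BITS = 32 + 32 + 32 + 12

channel_bits = NO_OF_CHANNELS * CHANNEL_REGISTER_BITS

# Assumed global state: gEnable (1 bit) plus chanStatus, IRQMask and IRQReq,
# each with one bit per channel.
global_bits = 1 + 3 * NO_OF_CHANNELS

total_state_bits = channel_bits + global_bits
```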
+The three channel declarations are used to communicate with RequestHandler, a
+sub-procedure of DMAControl which arbitrates requests from the arbiter tree,
+MARBLE target interface and transfer engine acknowledge for service by the
+control unit. RequestHandler’s description is fairly uninteresting and so will
+not be discussed.
+The body of the control unit, with the less interesting portions removed, is
+as follows:
+
+begin
+Init ();
+-- RequestHandler is an ArbFunnel
+-- with accompanying data
+RequestHandler (busCommand, DRQ, TEAck, commandSourceC,
+busCommandC, DRQC) ||
+loop
+-- find source of service requests
+commandSourceC -> commandSource;
+case commandSource of
+DRQ then DRQC -> clientNo; MarkUpClientRequest ()
+| bus then
+select busCommandC then
+if (busCommandC.a as RegAddrType).globalNchannel
+then . . . -- global R/W from the CPU
+else -- channel regs
+channelNo :=
+(busCommandC.a as ChannelRegAddr).channelNo;
+ReadChannelRegisters ();
+case busCommandC.rNw of
+. . . -- most of CPU reg. access code omitted
+-- CPU ctrl register write
+| ctrl then channelW.ctrl :=
+(busCommandC.d as ControlRegister) ||
+requestsPending[channelNo] := {0,0} ||
+ClearChanStatus ()
+end;
+WriteChannelRegisters ()
+end
+end
+end
+else -- TEAck
+TEBusy := 0;
+if gEnable then AssessInterrupts () end
+end;
+if gEnable and not TEBusy then
+TryToIssueTransfer ()
+end
+end
+end
+
+A number of procedure calls are made by the control unit body, for example
+AssessInterrupts (). These calls are to shared procedures whose definitions
+follow the local variables in DMAControl’s description. In Balsa, a local
+procedure declared “shared” is instantiated in only one place in the containing
+procedure’s handshake circuit, whereas a normal procedure call places a new copy
+of the procedure’s body at each call location. Calls to a shared procedure are
+combined using a Call component, which makes shared procedures cheaper than
+normal ones.
+
+DMA request handling – MarkUpClientRequest
+
+Incoming DMA requests are marked up in the request pending registers as
+previously described. The procedure MarkUpClientRequest performs this
+operation by testing all the channels’ srcClientNo and dstClientNo con-
+trol bits with clientNo (the client ID of the incoming request) in parallel.
+MarkUpClientRequest’s description is:
+
+shared MarkUpClientRequest is
+begin
+for || i in 0..NoOfChannels-1 then
+if channelRegisters[i].ctrl.srcClientNo = clientNo
+then requestsPending[i].src := 1
+end ||
+if channelRegisters[i].ctrl.dstClientNo = clientNo
+then requestsPending[i].dst := 1
+end
+end
+end
+
+The for || loop in this description performs parallel structural instantiation
+of NoOfChannels copies of the body if statements.
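+
+The effect of the mark-up can be shown with a sequential model. This is a
+sketch under stated assumptions: the dictionaries and the function
+mark_up_client_request are hypothetical, and the Python loop visits channels
+one after another whereas the hardware tests all channels in parallel. The
+result is the same because each channel only touches its own pending bits.
+
```python
def mark_up_client_request(channel_ctrls, requests_pending, client_no):
    """Set the pending bits of every channel that names client_no."""
    for i, ctrl in enumerate(channel_ctrls):
        if ctrl["srcClientNo"] == client_no:
            requests_pending[i]["src"] = 1
        if ctrl["dstClientNo"] == client_no:
            requests_pending[i]["dst"] = 1
    return requests_pending
```
+
+Note that a single incoming request may mark up several channels, and may set
+both bits of one channel when its source and destination client numbers match.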
+
+Register Access – ReadChannelRegisters,
+WriteChannelRegisters
+
+The shared procedures used to access the channel registers are very short.
+They make the only variable-indexed accesses to the channel registers. The
+two procedures are:
+
+shared ReadChannelRegisters is begin
+channelR := channelRegisters[channelNo]
+end
+
+shared WriteChannelRegisters is begin
+channelRegisters[channelNo] := channelW
+end
+
+Notice that no individual word write enables are present and so, in order
+to modify a single word, a whole channel register must be read, modified,
+then written back. The ReadChannelRegisters followed by channelW :=
+channelR in the description of the CPU’s access to the channel registers uses
+this update method.
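+
+The read, modify, write-back sequence can be sketched as below. The function
+write_channel_word and the dictionary representation of a channel register
+are hypothetical; only the three-step order comes from the description above.
+
```python
WORD_MASK = 0xFFFFFFFF


def write_channel_word(channel_registers, channel_no, field, value):
    """Update one 32-bit word of a channel register by read-modify-write."""
    channel_r = dict(channel_registers[channel_no])  # ReadChannelRegisters ()
    channel_w = dict(channel_r)                      # channelW := channelR
    channel_w[field] = value & WORD_MASK             # modify a single word
    channel_registers[channel_no] = channel_w        # WriteChannelRegisters ()
    return channel_w
```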
+
+Channel status and interrupts – ClearChanStatus,
+AssessInterrupts
+
+The interrupt output bit is asserted by AssessInterrupts. Interrupts are
+signalled when the IRQReq register is non-zero and are reassessed each time
+an action which could clear an interrupt occurs. ClearChanStatus is called
+during channel control register updates to clear interrupts and channel status
+(end-of-run) indications. Their descriptions are:
+
+shared AssessInterrupts is begin
+IRQ <- (IRQReq as NoOfChannels bits) /= 0
+end
+
+shared ClearChanStatus is begin
+chanStatus[channelNo] := 0 ||
+IRQReq[channelNo] := 0;
+AssessInterrupts ()
+end
+
+Ready-to-transfer polling – TryToIssueTransfer, IssueTransfer
+
+Whenever the DMA controller is stimulated by its command interfaces, it
+tries to perform a transfer. The requests-pending and ctrl[n].enable bits
+for each channel are examined in turn to determine if that channel is ready
+to transfer. Incrementing the channel number during this search is performed
+using a local channel to allow the incremented value to be accessed in parallel
+from two places. TryToIssueTransfer’s Balsa description is:
+
+shared TryToIssueTransfer is local
+variable foundChannel : bit
+variable newChannelNo : ChannelNo
+begin
+foundChannel := 0 || channelNo := 0;
+
+while not foundChannel then
+-- source and destination requests arrived?
+if requestsPending[channelNo] = {1,1}
+and channelRegisters[channelNo].ctrl.enable then
+ReadChannelRegisters ();
+requestsPending[channelNo] :=
+{channelR.ctrl.srcDRQ, channelR.ctrl.dstDRQ} ||
+foundChannel := 1 ||
+IssueTransfer () ||
+UpdateRegistersAfterTransfer ()
+else
+local
+channel incChanNo : array ChannelNoLen + 1 of bit
+begin
+incChanNo <- (channelNo + 1 as
+array ChannelNoLen + 1 of bit) ||
+select incChanNo then
+foundChannel := incChanNo[ChannelNoLen] ||
+newChannelNo := (incChanNo[0..ChannelNoLen-1]
+as ChannelNo)
+end;
+channelNo := newChannelNo
+end
+end
+end
+end
+
+Notice that if a transfer is taken, the requestsPending bits for that channel
+are re-initialised from the ctrl.{srcDRQ,dstDRQ} control bits for that
+channel. The procedure IssueTransfer actually passes the transfer on to the
+transfer engine. Its definition is:
+
+shared IssueTransfer is begin
+TEBusy := 1 ||
+TECommand <- {channelR.src, channelR.dst}
+end
+
+The interlock formed by checking TEBusy before attempting a transfer, and
+the setting/resetting of TEBusy by transfer initiation/completion ensures that
+no transfer is presented to the transfer engine (deadlocking the control unit)
+while it is occupied. The TEAck communication back to the control unit also
+provides stimuli for re-triggering the DMA controller to perform outstanding
+requests. This re-triggering, combined with the sequential polling of channels,
+allows outstanding requests (received while the transfer engine was busy) to be
+serviced correctly. Notice that a static prioritisation on pre-arrived requests is
+enforced by sequential channel polling.
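+
+The polling order and its static priority can be seen in a small model of the
+search loop in TryToIssueTransfer. It is a sketch, assuming 4 channels (so a
+2-bit channel number); the function try_to_issue_transfer is hypothetical and
+returns the selected channel instead of issuing a command. As in the listing,
+the carry out of the incremented channel number ends the search.
+
```python
CHANNEL_NO_LEN = 2  # assumed: 4 channels need a 2-bit channel number


def try_to_issue_transfer(requests_pending, enables):
    """Return the first ready channel, or None if the search wraps around."""
    channel_no = 0
    found_channel = False
    selected = None
    while not found_channel:
        if requests_pending[channel_no] == (1, 1) and enables[channel_no]:
            selected = channel_no
            found_channel = True
        else:
            inc = channel_no + 1  # held in CHANNEL_NO_LEN + 1 bits
            found_channel = bool(inc >> CHANNEL_NO_LEN)  # carry out stops loop
            channel_no = inc & ((1 << CHANNEL_NO_LEN) - 1)
    return selected
```
+
+Because the search always starts at channel 0, a lower-numbered ready channel
+is always chosen ahead of a higher-numbered one.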
+
+Register increment/decrement –
+UpdateRegistersAfterTransfer
+
+After a transfer has been issued, the registers for that transfer’s channel must
+be updated. Procedure UpdateRegistersAfterTransfer performs this task:
+
+shared UpdateRegistersAfterTransfer is begin
+channelW.ctrl := channelR.ctrl ||
+if channelR.ctrl.srcInc then
+channelW.src := (channelR.src + 1 as Word)
+end ||
+if channelR.ctrl.dstInc then
+channelW.dst := (channelR.dst + 1 as Word)
+end ||
+if channelR.ctrl.countDec then
+channelW.count := (channelR.count - 1 as Word)
+end;
+if channelW.count = 0 then
+chanStatus[channelNo] := 1 ||
+if IRQMask[channelNo] then
+IRQReq[channelNo] := 1
+end ||
+channelW.ctrl.enable := 0
+end;
+WriteChannelRegisters ()
+end
+
+This procedure uses two incrementers and a decrementer to modify the
+channel’s source address, destination address and count respectively. If the
+channel’s transfer count is decremented to zero, the chanStatus bit, interrupt
+status and channel enable are all updated to indicate an end-of-run.
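+
+The update rules can be summarised in a behavioural sketch. Several points are
+assumptions: the function update_after_transfer and its dictionary fields are
+hypothetical, fields that are not updated are taken to keep their previous
+values, and the address step is left as a parameter because the listing adds 1
+while the register description speaks of an increment of 4.
+
```python
WORD_MASK = 0xFFFFFFFF


def update_after_transfer(channel, irq_mask_bit, *, addr_step):
    """Return (new_channel, end_of_run, irq_request) after one transfer."""
    new = dict(channel)  # assumed: untouched fields keep their old values
    if channel["srcInc"]:
        new["src"] = (channel["src"] + addr_step) & WORD_MASK
    if channel["dstInc"]:
        new["dst"] = (channel["dst"] + addr_step) & WORD_MASK
    if channel["countDec"]:
        new["count"] = (channel["count"] - 1) & WORD_MASK
    end_of_run = new["count"] == 0  # tested on the value to be written back
    irq_request = end_of_run and bool(irq_mask_bit)
    if end_of_run:
        new["enable"] = 0  # the channel is disabled at the end of its run
    return new, end_of_run, irq_request
```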
+This concludes the discussion of the more illustrative aspects of the Balsa
+DMA controller description. For further details see [8]; as was noted at
+the beginning of this chapter, a full code listing is available from [7].
+
+
+III
+
+LARGE-SCALE ASYNCHRONOUS DESIGNS
+
+Abstract
+In this final part of the book we describe some large-scale asynchronous VLSI
+designs to illustrate the capabilities of this technology. The first two of these
+designs – the contactless smart card chip developed at Philips and the Viterbi
+decoder developed at the University of Manchester – were designed within EU-
+funded projects in the low-power design initiative that is the sponsor of this book
+series. The third chapter describes aspects of the Amulet microprocessor series,
+again from the University of Manchester, developed in several other EU-funded
+projects which, although outside the low-power design initiative, nevertheless
+had low power as a significant objective.
+The chips described in this part of the book are some of the largest and most
+complex asynchronous designs ever developed. Fully detailed descriptions of them
+are far beyond the scope of this book, but they are included to demonstrate that
+asynchronous design is fully capable of supporting large-scale designs, and they
+show what can be done with skilled and experienced design teams. The descrip-
+tions presented here have been written to give insight into the thinking processes
+that a designer of state-of-the-art asynchronous systems might go through in de-
+veloping such designs.
+
+Keywords:
+asynchronous circuits, large-scale designs
+
+
+
+Chapter 13
+
+DESCALE:*
+a Design Experiment for a Smart Card Application
+consuming Low Energy
+
+Joep Kessels & Ad Peeters
+Philips Research, NL-5656AA Eindhoven, The Netherlands
+{Joep.Kessels, Ad.Peeters}@philips.com
+
+Torsten Kramer
+Kramer-Consulting, D-21079 Hamburg, Germany
+
+Kramer@kramer-consulting.de
+
+Volker Timm
+Philips Semiconductors, D-22529 Hamburg, Germany
+
+Volker.Timm@philips.com
+
+Abstract
+We have designed an asynchronous chip for contactless smart cards. Asyn-
+chronous circuits have two power properties that make them very suitable for
+contactless devices: low average power and small current peaks. The fact that
+asynchronous circuits operate over a wide range of the supply voltage, while
+automatically adapting their speed, has been used to obtain a circuit that is very
+resilient to voltage drops while giving maximum performance for the power be-
+ing received. The asynchronous circuit has been built, tested and evaluated.
+
+Keywords:
+low-power asynchronous circuits, smart cards, contactless devices, DES cryp-
+tography
+
+*Part of the work described in this paper was funded by the European Commission under Esprit TCS/ESD-
+LPD contract 25519 (Design Experiment DESCALE).
+
+
+13.1.
+Introduction
+
+Since their introduction in the eighties, smart cards have been used in a con-
+tinuously growing number of applications, such as banking, telephony (tele-
+phone and SIM cards), access control (Pay-TV), health-care, tickets for public
+transport, electronic signatures, and identification. Currently, most cards have
+contacts and, for that reason, need to be inserted into a reader. For applications
+in which the fast handling of transactions is important, contactless smart cards
+have been introduced requiring only close proximity to a reader (typically sev-
+eral centimeters). The chip on such a card must be extremely power efficient,
+since it is powered by electromagnetic radiation.
+Asynchronous CMOS circuits have the potential for very low power con-
+sumption, since they only dissipate when and where active. However, asyn-
+chronous circuits are difficult to design at the level of gates and registers.
+Therefore the high-level design language Tangram was defined [141] and a
+silicon compiler has been implemented that translates Tangram programs into
+asynchronous circuits.
+The Tangram compiler generates a special class of asynchronous circuits
+called handshake circuits [135, 112]. Handshake circuits are constructed from
+a set of about forty basic components that use handshake signalling for com-
+munication.
+Several chips have been designed in Tangram [136, 144] and if we compare
+these chips to existing clocked implementations, then the asynchronous ver-
+sions are generally about 20% larger in area and consume about 25% of the
+power.
+In order to find out what advantages asynchronous circuits have to offer in
+contactless smart cards, we have designed an asynchronous smart card chip. In
+this chapter, we indicate which properties of asynchronous circuits have been
+exploited and we present the results. The rest of the chapter is organized as fol-
+lows. Section 13.2 presents the Tangram method for designing asynchronous
+circuits. Section 13.3 summarizes the differences in power behaviour between
+synchronous and asynchronous circuits. Section 13.4 first provides some back-
+ground to contactless smart cards, then identifies the power characteristics in
+which contactless devices differ from battery-powered ones, and finally indi-
+cates why asynchronous circuits are very suited for contactless devices. Sec-
+tion 13.5 presents the digital circuit, Section 13.6 the results of the silicon, and
+Section 13.8 the power regulator, which also exploits an asynchronous oppor-
+tunity. We conclude with a summary of the merits of this asynchronous design.
+
+
+13.2.
+VLSI programming of asynchronous circuits
+
+The design flow that is used in the design of the smart card IC reported here
+is based on the programming language Tangram and its associated compiler
+and tool set. An important aspect of this design approach is the transparency
+of the silicon compiler [142], which allows the designer to reason about de-
+sign characteristics such as area, power, performance, and testability, at the
+programming (Tangram) level.
+This section first introduces the Tangram tool set, then briefly describes the
+underlying handshake technology, and finally illustrates VLSI-programming
+techniques using the GCD algorithm presented in Chapter 8.
+
+13.2.1
+The Tangram toolset
+
+Fig. 13.1 shows the Tangram toolset. The designer describes a design in
+Tangram, which is a conventional programming language, similar to C and
+Pascal, extended to include constructs for expressing concurrency and com-
+munication in a way similar to those in the language CSP [58]. In addition to
+this, there are language constructs for expressing hardware-specific issues, like
+sharing of blocks and waiting for clock-edges.
+A compiler translates Tangram programs into so-called handshake circuits,
+which are netlists composed from a library of some 40 handshake components.
+Each handshake component implements a language construct, such as sequenc-
+ing, repetition, communication, and sharing.
+The handshake circuit simulator and corresponding performance analyzer
+give feedback to the designer about aspects such as function, area, timing,
+power, and testability.
+The actual mapping onto a conventional (synchronous) standard-cell library
+is done in two steps. In the first step the component expander uses the com-
+ponent library to generate an abstract netlist consisting of combinational logic,
+registers, and asynchronous cells, such as Muller C-elements. This step also
+determines the data encoding and the handshake protocol; generally a four-
+phase single-rail implementation is generated. In the second step commercial
+synthesis tools and technology mappers are used to generate the standard-cell
+netlist. No dedicated (asynchronous) cells are required in this mapping, be-
+cause all asynchronous cells are decomposed into cells from the standard-cell
+library at hand.
+Similar language-based approaches using handshake circuits as an intermediate
+format are described in [17, 9]. Design approaches in which asynchronous
+details are not hidden from the designer have also proven successful [80, 21,
+83, 30, 64]. A general overview of design methods for asynchronous circuits
+is given in [69] and in Part I of this book.
+
+
+224
+Part III: Large-Scale Asynchronous Designs
+
+Figure 13.1.
+The Tangram toolset: boxes denote tools, ovals denote (design) representations.
+[Diagram labels: Tangram Program, Tangram Compiler, Handshake Circuit, Component Expander, Abstract Netlist, Technology Mapper, Mapped Netlist; Handshake Circuit Simulator, Performance Analyzer; Component Library, Asynchronous Library, Standard-cell Library. Feedback labels: Function, Area, Timing, Power, Test; Timed Traces, Fault Coverage.]
+
+Figure 13.2.
+Handshake channel: abstract figure (top) and implementation (bottom).
+[Diagram labels: Active, Passive, Req, Ack.]
+
+13.2.2 Handshake technology
+
+The design of large-scale asynchronous ICs demands a timing discipline to
+replace the clock regime that is used in conventional VLSI design. We have
+chosen handshake signaling [121] as the asynchronous timing discipline, since
+it supports plug-and-play composition of components into systems, and is also
+easy to implement in VLSI. An alternative to handshaking would be to com-
+pose asynchronous finite-state machines that communicate using fundamental
+mode or burst-mode assumptions [27, 132].
+Fig. 13.2 shows a handshake channel, which is a point-to-point connection
+between an active and a passive partner. In the abstract figure, the solid circle
+indicates the channel’s active side and the open circle its passive side. The
+implementation shows that both partners are connected by two wires: a request
+(Req) and an acknowledge (Ack) wire. A handshake requires the cooperation of
+both partners. It is initiated by the active party, which starts by sending a signal
+via Req, and then waits until a signal arrives via Ack. The passive side waits
+until a request arrives, and then sends an acknowledge. Handshake channels
+can be used not only for synchronization, but also for communication. To that
+end, data can be encoded in the request, the acknowledge, or in both.
+The protocol used in most asynchronous VLSI circuits is a four-phase hand-
+shake, in which the channel starts in a state with both Req and Ack low. The
+active side starts a handshake by making Req high. When this is observed by
+the passive side, it pulls Ack high. After this a return-to-zero cycle follows,
+during which first Req and then Ack go low, thus returning to the initial state.
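The four-phase protocol described above can be sketched as a small simulation. This is only an illustrative model, not part of the Tangram toolset: the function names `active_step`, `passive_step`, and `run_handshake` are hypothetical, and the two wires are modelled as Booleans that change one at a time.

```python
def active_step(req, ack):
    """Active side: decide the next level of Req from the current wire levels."""
    if not req and not ack:
        return True      # idle channel: start a handshake by raising Req
    if req and ack:
        return False     # acknowledge seen: begin return-to-zero by lowering Req
    return req           # otherwise wait for the passive side


def passive_step(req, ack):
    """Passive side: Ack simply follows Req (raise after Req, lower after Req)."""
    return req


def run_handshake():
    """Simulate one complete four-phase handshake and return the (Req, Ack) trace."""
    req = ack = False
    trace = [(req, ack)]
    while True:
        new_req = active_step(req, ack)
        if new_req != req:
            req = new_req
        else:
            new_ack = passive_step(req, ack)
            if new_ack == ack:
                break            # nothing can change any more
            ack = new_ack
        trace.append((req, ack))
        if not req and not ack:
            break                # back in the initial state: handshake complete
    return trace


if __name__ == "__main__":
    for req, ack in run_handshake():
        print(f"Req={int(req)} Ack={int(ack)}")
```

The trace visits the four transitions in order (Req up, Ack up, Req down, Ack down) and ends in the state it started from, which is what allows handshake components to be composed freely.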
+Handshake components interact with their environment using handshake
+channels. One can build handshake components implementing language con-
+structs. Fig. 13.3 shows two examples: the sequencer and the parallel compo-
+nent.
+The sequencer, when activated via a, performs first a handshake via b and
+then via c. It is used to control the sequential execution of commands
+connected to b and c. After receiving a request along a, it sends a request along b,
+waits for the corresponding acknowledge, then sends a request along c, waits
+for the acknowledge on c, and finally signals completion of its operation by
+sending an acknowledge along channel a.
+
+Figure 13.3.
+Handshake components: sequencer (left) and parallel (right).
+[Diagram labels: SEQ and PAR, each with channels a, b, and c.]
+
+Figure 13.4.
+Handshake circuit for while loop.
+[Diagram labels: DO, Guard, Command.]
+
+The parallel component, when activated by a request along a, sends requests
+along channels b and c concurrently, waits until both acknowledges have ar-
+rived, and then sends an acknowledge along channel a.
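As a rough software analogy for the two components just described, one can treat a function call as a complete handshake: the call is the request and the return is the acknowledge. The sketch below makes that assumption; `sequencer`, `parallel`, and `demo` are hypothetical names, and threads stand in for the concurrent activity of real hardware.

```python
import threading


def sequencer(b, c):
    """Handshake on b, then on c; returning models the acknowledge on a."""
    b()
    c()


def parallel(b, c):
    """Start the handshakes on b and c concurrently; acknowledge a after both."""
    threads = [threading.Thread(target=b), threading.Thread(target=c)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()     # wait until both acknowledges have arrived


def demo():
    """Run both components on two logging commands and report what happened."""
    log = []
    lock = threading.Lock()

    def command(name):
        def run():
            with lock:
                log.append(name)
        return run

    sequencer(command("b"), command("c"))
    seq_order = list(log)          # order is fixed: b before c
    log.clear()
    parallel(command("b"), command("c"))
    par_commands = sorted(log)     # order is not fixed, so compare as a set
    return seq_order, par_commands


if __name__ == "__main__":
    print(demo())
```

The sequencer fixes the order of the two commands, whereas the parallel component only guarantees that both have completed before it acknowledges its activation.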
+Components for storage of data (variables) and operation on data (such as
+addition and bit-manipulation) can also be constructed. Tangram programs
+are compiled into handshake circuits (composition of handshake components)
+in a syntax-directed way (see also Chapter 8 on page 123). For instance, the
+compilation of a while loop in Tangram, which is written as
+
+do Guard then Command od
+
+results in the handshake circuit shown in Fig. 13.4. The do-component, when
+activated, collects the value of the guard through a handshake on its active data
+port. When this guard is false, it completes the handshake on its passive port,
+otherwise it activates the command through a handshake on its active control
+port, and after this has completed, re-evaluates the guard to start a new cycle.
+Details about handshake circuits, the compilation from Tangram into this
+intermediate architecture, the four-phase handshake protocols that are applied,
+and the gate-level implementation of handshake components can be found
+in [141, 135, 112].
+
+13.2.3 GCD algorithm
+
+When designing in Tangram, the emphasis for an initial design typically is
+on functional correctness. When this is the only point of attention, the result
+is generally too large, too slow, and too power hungry. The design of a suitable
+datapath and control architecture, targeting some optimization criteria or
+cost function, can be approached in a transformational way. One can use the
+Tangram tool set to evaluate and analyze the design, and based on that, decide
+which transformations to apply. The transparency of the silicon compiler
+(‘What you program is what you get’) helps in predicting the effect of these
+transformations.
+
+int = type [0..255]
+& gcd_gc : main proc (in?chan <<int,int>> & out!chan int).
+begin x,y:var int ff
+| forever do
+    in?<<x,y>>
+  ; do x<>y then
+      if x<y then y:=y-x
+             else x:=x-y
+      fi
+    od
+  ; out!x
+  od
+end
+
+Figure 13.5.
+GCD algorithm in Tangram.
+
+The GCD algorithm is used as an example to illustrate VLSI programming
+based on a transparent silicon compiler (see also the discussion in section 8.3.3
+on page 127). We start with the algorithm of Fig. 13.5, which is functionally
+correct but far from optimal when it comes to implementation cost in VLSI.
+The corresponding handshake circuit is shown in Fig. 13.6.
+The cost report for this GCD design, as presented by the Tangram toolset, is
+the following:
+
+celltype          #occ    area   gate eq. [eqv]     (%)
+--------------------------------------------------------------------
+Delay Matchers      19    2052             38.0    12.5
+Memory              16    3744             69.3    22.8
+C-elements          12    1242             23.0     7.5
+Logic               81    9414            174.3    57.2
+--------------------------------------------------------------------
+Total:             128   16452            304.7   100.0
+
+Figure 13.6.
+Compiled handshake circuit for initial GCD program.
+[Diagram labels: in, out, x, y, mux, dmx, do.]
+
+int = type [0..255]
+& gcd_gc : main proc (in?chan <<int,int>> & out!chan int).
+begin
+  xy : var <<int,int>> ff
+& x = alias xy.0
+& y = alias xy.1
+| forever do
+    in?xy
+  ; do x<>y then
+      if x<y then xy:= <<x,y-x>>
+             else xy:= <<y,x>>
+      fi
+    od
+  ; out!x
+  od
+end
+
+Figure 13.7.
+GCD algorithm in Tangram with optimized control.
+
+An important aspect of the design is that it contains four operators in the
+datapath. We can improve this by changing the Tangram description in such a
+way that only one subtractor is needed, instead of two. A way to achieve this
+is to change the timing behaviour of the algorithm, and use a higher number
+of iterations of the do-loop by either swapping or subtracting the two numbers
+x and y. This requires that x and y are stored in a single flip-flop variable.
+The new Tangram algorithm thus obtained is shown in Fig. 13.7. Its associated
+gate-level cost report has improved from 305 to 274 gate equivalents.
+
+celltype          #occ    area   gate eq. [eqv]     (%)
+--------------------------------------------------------------------
+Delay Matchers      14    1512             28.0    10.2
+Memory              16    3744             69.3    25.3
+C-elements          10    1008             18.7     6.8
+Logic               86    8532            158.0    57.7
+--------------------------------------------------------------------
+Total:             126   14796            274.0   100.0
+
+An additional transformation is to compute x<y and y-x using only one op-
+erator, and to combine the two assignments to variable xy into one assignment,
+so as to further simplify the control, at the price of always requiring the worst-
+case computation time for the conditional expression, even if it involves just a
+swap of x and y. Furthermore, one can allow one additional step in the com-
+putation, so that the termination condition of the loop simplifies from x<>y to
+y<>0. This final implementation is shown in Fig. 13.8; its handshake circuit is
+shown in Fig. 13.9, in which the datapath operations have been represented in
+an abstract way, rather than as separate components.
+The handshake circuit of the optimized design has a simpler structure than
+that of the initial design. The number of logic blocks (and, in the single-rail
+implementation, the number of delay-matching gate chains) has been reduced
+from four to two. This improvement is also apparent in the area statistics given
+below.
+
+int = type [0..255]
+& gcd_gc : main proc (in?chan <<int,int>> & out!chan int).
+begin
+  xy : var <<int,int>> ff
+& x = alias xy.0
+& y = alias xy.1
+& comp: func(). (y-x) cast <<int,bool>>
+| forever do
+    in?xy
+  ; do y<>0 then
+      xy:= if -comp.1 then <<x,comp.0>> else <<y,x>> fi
+    od
+  ; out!x
+  od
+end
+
+Figure 13.8.
+GCD algorithm in Tangram with optimized datapath.
+
+Figure 13.9.
+Compiled handshake circuit for optimized GCD program.
+[Diagram labels: in, out, mux, xy, x, y, do.]
+
+celltype          #occ    area   gate eq. [eqv]     (%)
+--------------------------------------------------------------------
+Delay Matchers      10    1080             20.0     9.4
+Memory              16    3744             69.3    32.7
+C-elements           6     576             10.7     5.0
+Logic               42    6048            112.0    52.8
+--------------------------------------------------------------------
+Total:              74   11448            212.0   100.0
+
+One may observe that area-wise, the design has improved in the control (the
+number of C-elements was reduced from 12 to 6), in the logic (the area for
+combinational logic was reduced from 174 to 112 gate equivalents), and in the
+total number of delay elements required (which was reduced from 19 to 10).
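The trade-off behind the three GCD programs (fewer operators at the price of more loop iterations) can be checked with a software model. The sketch below is a Python paraphrase of Figs. 13.5, 13.7, and 13.8, not generated from the Tangram sources: the function names are hypothetical, the sign test `diff >= 0` stands in for the `-comp.1` sign bit, and inputs are assumed to be positive integers (with a zero input the loops do not terminate).

```python
from math import gcd


def gcd_initial(x, y):
    """Fig. 13.5 style: two subtractors, one loop step per subtraction."""
    steps = 0
    while x != y:
        if x < y:
            y = y - x
        else:
            x = x - y
        steps += 1
    return x, steps


def gcd_one_subtractor(x, y):
    """Fig. 13.7 style: only y-x is computed; otherwise x and y are swapped."""
    steps = 0
    while x != y:
        if x < y:
            x, y = x, y - x
        else:
            x, y = y, x
        steps += 1
    return x, steps


def gcd_shared_operator(x, y):
    """Fig. 13.8 style: one y-x per step provides both the difference and the
    comparison (its sign); the loop runs one extra step until y equals 0."""
    steps = 0
    while y != 0:
        diff = y - x
        if diff >= 0:
            x, y = x, diff
        else:
            x, y = y, x
        steps += 1
    return x, steps


if __name__ == "__main__":
    for variant in (gcd_initial, gcd_one_subtractor, gcd_shared_operator):
        print(variant.__name__, variant(12, 18))
```

For the inputs 12 and 18 all three variants return 6, using 2, 3, and 4 loop iterations respectively. This mirrors the text: each transformation simplifies the hardware while spending more iterations.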
+
+13.3. Opportunities for asynchronous circuits
+
+When the asynchronous circuits generated by the Tangram compiler are
+compared to synchronous ones, three differences stand out, leading to four
+attractive properties of these asynchronous circuits.
+
+1 The subcircuits in a synchronous circuit are clock-driven, whereas they
+are demand-driven in an asynchronous one. This means that the subcir-
+cuits in an asynchronous circuit are only active when and where needed.
+Asynchronous circuits will therefore generally dissipate less power than
+synchronous ones.
+
+2 The operations in a synchronous circuit are synchronized by a central
+clock, whereas they are synchronized by distributed handshakes in an
+asynchronous circuit. Therefore
+
+a) a synchronous circuit shows large current peaks at the clock edges,
+whereas the power consumption of an asynchronous circuit is more
+uniformly distributed over time;
+
+b) the strict periodicity of the current peaks in a synchronous circuit
+leads to higher clock harmonics in the emission spectrum, which
+are absent in the spectrum of an asynchronous design.
+
+3 Synchronous circuits use an external time reference, whereas asynch-
+ronous circuits are self-timed. This means that asynchronous circuits
+operate over a wide range of the supply voltage (for instance, from 1 up
+to 3.3 V) while automatically adapting their speed. This property, called
+automatic performance adaptation, implies that asynchronous circuits
+are resilient to supply voltage variations. It can also be used to reduce
+the power consumption by adaptive voltage scaling, which means adapting
+the supply voltage to the performance required [100]. Adaptive voltage
+scaling techniques are also applied in synchronous circuits, but then
+special measures must be taken to adapt the clock frequency.
+
+Asynchronous circuits also have drawbacks. The most important one is that
+these circuits are unconventional: designers and mainstream tools and libraries
+are all oriented towards synchronous design methods. Additional drawbacks
+of asynchronous circuits come from the fact that they use gates to control reg-
+isters (latches and flip-flops), instead of the relatively straightforward clock-
+distribution network in synchronous circuits. Although this enables low
+power consumption, it also leads to circuits that are typically larger, slower,
+and harder to test. Testability (for fabrication defects) is probably the most
+fundamental issue of these. For a discussion on testing asynchronous circuits
+we refer to [61, 120].
+Property 2.b was the main reason for Philips Semiconductors to design a
+family of asynchronous pager chips [114]. However, as is addressed in the next
+section, it is the other three properties that can be used most advantageously in
+contactless smart card chips.
+
+13.4. Contactless smartcards
+
+Contactless smart cards have a number of advantages when compared to
+contacted ones: they are convenient and fast to use, insensitive to dirt and
+grease and, since their readers have no slots, they are less amenable to vandal-
+ism.
+The communication between a contactless smart card and reader is estab-
+lished through an electromagnetic field, emitted by the reader. The card has a
+coil through which it can retrieve energy from this field. The amount of energy
+that can be retrieved depends on the distance to the reader, the number of turns
+in the coil, and the orientation of the card in the field.
+Fig. 13.10 shows the functional parts of a contactless smartcard consisting
+of a VLSI circuit (in the dotted box) and an external coil. The tuned circuit
+formed by the coil and capacitor C0 is used for three purposes:
+
+receiving power;
+
+receiving the clock frequency (equal to the carrier frequency); and
+
+supporting the communication.
+
+The complete circuit consists of a power supply unit and a digital circuit with
+a buffer capacitor (C1) for storing energy.
+Several standards for contactless smart cards currently exist:
+
+ISO/IEC 10536, which specifies close coupling cards, operating at a distance
+of 1 cm from the reader.
+
+ISO/IEC 14443, for so-called proximity integrated circuit cards (PICCs),
+operating at a distance of up to 10 cm from the reader, typically using 5
+turns in the on-card coil. This standard defines two types, A and B, the
+main characteristics of which are given in Table 13.1.
+
+ISO/IEC 15693 specifies vicinity integrated circuit cards (VICCs), operating
+at some 50 cm from the reader, and typically requiring a coil with
+a few hundreds of turns.
+
+Figure 13.10.
+Contactless smart card.
+[Diagram labels: Digital Circuit, Power Supply Unit, C0, C1.]
+
+Table 13.1.
+Main characteristics of ISO/IEC 14443 standard.
+
+ISO 14443 standard            A (Mifare)      B
+Carrier frequency             13.56 MHz       13.56 MHz
+Throughput (up and down)      106 kbit/sec    106 kbit/sec
+Down link (reader to card)    ASK 100%        ASK 10%
+  encoding                    Miller          NRZ
+Up link (card to reader)      ASK             BPSK
+  frequency                   847.5 kHz       847.5 kHz
+  modulation                  Manchester      NRZ
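The frequencies listed in Table 13.1 are related by simple division of the carrier. The short calculation below illustrates this. The divisors 16 and 128 are taken from the ISO/IEC 14443 standard rather than from this chapter, and the constant names are hypothetical.

```python
CARRIER_HZ = 13.56e6          # carrier frequency from Table 13.1

SUBCARRIER_DIVISOR = 16       # assumption: subcarrier = carrier / 16 (ISO/IEC 14443)
BIT_RATE_DIVISOR = 128        # assumption: bit rate = carrier / 128 (ISO/IEC 14443)

subcarrier_hz = CARRIER_HZ / SUBCARRIER_DIVISOR
bit_rate_bps = CARRIER_HZ / BIT_RATE_DIVISOR

if __name__ == "__main__":
    print(f"up-link subcarrier: {subcarrier_hz / 1e3:.1f} kHz")
    print(f"bit rate:           {bit_rate_bps / 1e3:.1f} kbit/sec")
    print(f"carrier cycles per bit: {BIT_RATE_DIVISOR}")
```

Under these assumptions the subcarrier comes out at exactly 847.5 kHz and the bit rate at about 105.9 kbit/sec, which the table rounds to 106 kbit/sec.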
+
+The Mifare [63] standard (ISO/IEC 14443 type A) has hundreds of millions
+of cards in operation today. Fig. 13.11 shows a Mifare card with both the chip
+and the coil visible. Mifare is a proximity card (it can be used up to 10 cm
+from the reader) supporting two-way communication. Performance is impor-
+tant, since the transaction time must be less than 200 msec. One of the first
+companies to deploy Mifare technology en masse was the Seoul Bus Asso-
+ciation, which has millions of such bus cards in use, generating hundreds of
+millions of transactions per month.
+This chapter describes an asynchronous Mifare smart card IC that was reported
+earlier in [73]. Both synchronous [116] and asynchronous [2] circuits for smart
+cards of the ISO/IEC 14443 type B standard have also been reported. Due to
+the 100% ASK modulation scheme in the type A standard, a Mifare IC is
+exposed to periods during which no power comes in at all, in contrast to the
+type B standard, which is based on 10% ASK modulation.
+Since on average (over time) contactless smart card chips receive only a few
+milliwatts of power, power efficiency is very important. Although low power
+is also important in battery-powered devices, there are two crucial differences
+between these two types of device.
+
+1 To maximize the battery life-time in battery-powered devices one should
+minimize the average power consumption. In contactless devices, how-
+ever, one should in addition minimize the peak power, since the peaks
+must be kept below a certain level, which depends on the incoming
+power and the buffer capacitor.
+
+2 The supply voltage is nearly constant in battery-powered devices whereas
+in contactless ones it may vary over time during a transaction due to fluc-
+tuations in both the incoming and the consumed power.
+
+Figure 13.11.
+Mifare card, showing IC (bottom left) and coil.
+
+In the bullets below, we give some facts about conventional synchronous
+chips for contactless smart cards, which, as we will see later, offer opportuni-
+ties for improvement by using asynchronous circuits.
+
+A synchronous circuit runs at a fixed speed dictated by the clock, despite
+the fact that both the incoming and the effectively consumed power
+vary over time. Synchronous circuits must, therefore, be designed so as
+to allow the most power-hungry operations to be performed when minimum
+power is coming in. Consequently, if too much power is being
+received, that superfluous power is thrown away. If, on the other hand,
+too little power is being received, the supply voltage drops making the
+circuit slower and, as soon as the circuit has become too slow to meet
+the time requirements set by the clock, the transaction must be canceled.
+For this reason contactless smart card chips contain subcircuits that detect
+when the voltage drops below a certain threshold and then abort the
+transaction.
+
+Currently, the performance of the microcontroller in a contactless smart
+card chip is usually not limited by the speed of the circuit, but by the
+RF-power being received.
+
+A synchronous circuit requires a buffer capacitor of several nanofarads
+and the area needed for such a capacitor is of the same order of magnitude
+as the area needed for the microcontroller.
+
+The communication from the smart card to the reader is based on modulating
+the load, which implies that normal functional load fluctuations
+may interfere with the communication.
+
+Figure 13.12.
+Global design of the smart-card circuit.
+[Diagram labels: 80C51 Microcontroller, DES, RSA, UART, RAM, ROM, EEPROM.]
+
+13.5. The digital circuit
+
+We have built the digital circuit shown in Fig. 13.12 that consists of:
+
+an 80C51 microcontroller;
+
+three kinds of low-power memory, the sizes and access times of which
+are given in Table 13.2 (64 bytes can be written simultaneously in one
+write access to the EEPROM);
+
+two encryption coprocessors:
+
+– an RSA converter [119] for public key conversions and
+
+– a triple DES converter [96] for private key conversions;
+
+a UART for the external communication.
+
+The EEPROM contains program parts as well as data such as encryption keys
+and e-money. Both the ROM and the RAM are equipped with matching delay
+lines and for the EEPROM we designed a similar function based on a counter.
+These delay lines have been used to provide all three memories with a hand-
+shake interface, which made it extremely easy to deal with the differences in
+access time as well as variations in both temperature and supply voltage. An
+additional advantage is that the controller automatically runs faster when exe-
+cuting code from ROM than when executing code from EEPROM.
+The circuit is meant to be used in a dual-interface card, which is a card with
+both a contacted and a contactless interface. Apart from the RSA converter,
+which will not be used in contactless operation, all circuits are asynchronous.
+In contactless operation, the average supply voltage will be about 2 V. The
+simulations, however, are done at 3.3 V, which is the voltage at which the
+library has been characterized.
+
+Table 13.2.
+Memory sizes and access times.
+
+Memory type   Size [kbyte]   Access time [ns]
+                             read     write
+RAM                      2     10        10
+ROM                     38     30
+EEPROM                  32    180     4,000
+
+13.5.1 The 80C51 microcontroller
+
+The 80C51 microcontroller is a modified version of the one described in
+[144, 143]. The four most important modifications are described below.
+To deal with the slow memories a prefetch unit has been included in the
+80C51 architecture. At 3.3 V the average instruction execution time in free-
+running mode is about 100 ns provided it takes no time to fetch the code from
+memory. If, however, code were fetched from the EEPROM and the microcontroller
+had to wait during the read accesses, the performance would be drastically
+reduced, since most instructions are one or two bytes long, taking 180 or
+360 ns to fetch. To avoid this performance degradation, a form of instruction
+prefetching has been introduced in which a process running concurrently with
+the 80C51 core fetches code bytes as long as a two-byte FIFO is not full.
+The prefetch unit gives an increase in performance of about 30%. A simplified
+version of the prefetch unit is described in Section 13.5.2.
+We also introduced early write completion, which means that the microcon-
+troller continues execution as soon as it has issued a write access. This has
+been introduced to prevent the microcontroller from waiting during the 4 msec
+it takes to do a write access to the EEPROM (for instance to change the e-
+money), but also to speed up the write accesses to the RAM. To exploit this
+feature when doing a write access to the EEPROM, the corresponding code
+must be in the ROM.
+The controller has been provided with an immediate halt input signal by
+which its execution can be halted within a short time. This provision is nec-
+essary to deal with the fact that the information, which the reader sends to the
+card, is coded by suppressing the carrier during periods of 3 µsec. Since the
+card does not receive any power during these periods, the controller has to be
+halted immediately (only some basic functions continue to operate). In the
+synchronous design this halting function came naturally, since the clock would
+stop during these periods.
+We have introduced a quasi synchronous mode, in which the microcontroller
+is, at instruction level, fully timing compatible with its synchronous counterpart.
+In this mode, the asynchronous microcontroller waits after each instruction
+until the number of clock ticks required by a synchronous version has
+elapsed. This mode is necessary when time-dependent functions are designed
+in software. Since this mode is under software control, the microcontroller can
+easily switch modes depending on the function it is executing. This feature
+was also of great help to demonstrate the guaranteed performance, which is
+the maximum clock rate at which each instruction terminates within the given
+number of clock ticks. For most programs, the free-running performance is
+about twice as high as the guaranteed performance.
+We have compared the asynchronous circuit as compiled from Tangram with
+a synchronous circuit that is functionally equivalent, and which has been com-
+piled from synthesizable VHDL to the same CMOS technology. These ICs
+have a comparable performance.
+The asynchronous microcontroller nicely
+demonstrates the three properties of asynchronous circuits that we want to ex-
+ploit in the design of the smart-card chip:
+
+The average power consumption of the asynchronous 80C51 is about
+three times lower than the power consumption of its synchronous coun-
+terpart when delivering the same performance at the same supply volt-
+age.
+
+Fig. 13.13 shows the current peaks of both the synchronous and the asynchronous
+80C51 at 3.3 V, where the asynchronous version is running in
+quasi synchronous mode, giving a performance that is 2.5 times higher
+than the synchronous design (the synchronous version runs at 10 MHz
+and the asynchronous version at 25 MHz). Despite the fact that the figure
+does not give a fair impression of the average power being consumed,
+it clearly shows that the current peaks of the asynchronous 80C51 are
+about five times smaller than those of the synchronous equivalent.
+
+Figure 13.13.
+Current peaks of 80C51 microcontroller.
+[Plot: current I [mA] against time [µsec], one panel each for the synchronous version and the asynchronous version.]
+
+The performance adaptation property of asynchronous circuits is demon-
+strated in Fig. 13.14, which shows the free-running performance of the
+microcontroller, when executing code from ROM, as a function of the
+supply voltage. As is expected, the performance depends linearly on the
+supply voltage. When the supply voltage goes up from 1.5 to 3.3 V, the
+performance increases from 3 to 8.7 MIPS (about a factor 3). Since the
+ROM containing the program does not function properly when the sup-
+ply voltage is below 1.5 V, we could not measure the performance for
+lower values. We observed, however, that the DES coprocessor, which
+does not need a memory, still functions correctly at a supply voltage
+level as low as 0.5 V.
+
+The figure also shows the supply current as a function of the supply
+voltage. Note that the current increases in this range from 0.7 to 6 mA
+(about a factor 9). Since in CMOS circuits the current is the product
+of the transition rate (performance) and the charge being switched per
+transition (both of which depend linearly on the supply voltage), the
+current increases with the square of the voltage. From this it follows that
+the power, being the product of the current and the voltage, goes up with
+the cube of the voltage.
+
+From this data one can compute the third curve showing the energy
+needed to execute an instruction, which increases with the square of the
+supply voltage from 0.35 to 2.25 nJ.
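The energy-per-instruction figures can be reproduced from the two measured end points quoted above (1.5 V, 0.7 mA, 3 MIPS and 3.3 V, 6 mA, 8.7 MIPS). The sketch below is a plain arithmetic check. The helper name `operating_point` is hypothetical, and it uses the rounded values quoted in the text rather than the original measurement data.

```python
def operating_point(voltage_v, current_ma, mips):
    """Return (power in mW, energy per instruction in nJ) for one supply voltage."""
    power_mw = voltage_v * current_ma      # P = V * I, in milliwatts
    # mW / MIPS = (1e-3 J/s) / (1e6 instructions/s) = 1e-9 J per instruction
    energy_nj = power_mw / mips
    return power_mw, energy_nj


low = operating_point(1.5, 0.7, 3.0)
high = operating_point(3.3, 6.0, 8.7)

if __name__ == "__main__":
    for label, (power_mw, energy_nj) in (("1.5 V", low), ("3.3 V", high)):
        print(f"{label}: {power_mw:.2f} mW, {energy_nj:.2f} nJ per instruction")
```

With these rounded inputs the low end comes out at 0.35 nJ and the high end at roughly 2.28 nJ, close to the 2.25 nJ quoted. The small difference is consistent with rounding of the quoted current and performance values.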
+
+Figure 13.14.
+Measured performance of the asynchronous 80C51 for various supply voltages.
+[Plot: performance, supply current, and energy per instruction as functions of the supply voltage.]
+
+13.5.2 The prefetch unit
+
+Fig. 13.15 gives the Tangram code of a simplified version of the prefetch
+unit. The prefetch unit communicates with the 80C51 core through two chan-
+nels: it receives the address from which to start fetching code bytes via channel
+StartAddress and then it sends these bytes through channel CodeByte. Since
+the prefetch unit plays the passive role in both communications, it can probe
+each channel to see whether the core has started a communication through that
+channel. The state of the prefetch unit consists of a program counter, pc, and
+a two-place buffer, which is implemented by means of an array Buffer, an
+integer count, and two one-bit pointers getptr and putptr.
+
+forever do
+  sel probe(StartAddress)
+      then ( StartAddress?pc
+          || putptr := getptr
+          || count := 0
+          || AbortMemAcc()
+           )
+  or  probe(CodeByte) and (count>0)
+      then CodeByte!Buffer[getptr]
+         ; ( getptr := next(getptr)
+          || count := count-1
+           )
+  or  MemAck
+      then Buffer[putptr] := MemData
+         ; ( putptr := next(putptr)
+          || count := count+1
+          || pc := pc+1
+          || CompleteMemAcc()
+           )
+  les
+; if (count<2) and -MemReq
+  then MemReq := true
+  fi
+od
+
+Figure 13.15.
+Tangram code of a simplified version of the prefetch unit.
+
+The prefetch unit executes an infinite loop and in each step it first executes a
+selection command (denoted by sel ... les), in which it can select among three
+guarded commands, which are separated by the keyword or. Each guarded
+command is of the form
+
+Guard then Command,
+
+in which Guard is a Boolean condition, and Command a command that may
+be executed only if the guard holds. A command is said to be enabled if the
+corresponding guard holds. Executing a selection command implies waiting
+until at least one of the commands is enabled, then selecting such a command
+—in an arbitrated choice— and executing it.
+In the first guarded command, channel StartAddress is probed to find out
+whether the core is sending a new start address. In that case, program counter
+pc is set to the address received, the buffer is flushed, and a possible outstand-
+ing memory access is aborted (by resetting both MemReq and the delay counter).
+All four subcommands in this guarded command are executed in parallel (‘A
+|| B’ means execute commands A and B concurrently, whereas ‘A ; B’ means
+execute A and B sequentially).
+
+The second guarded command takes care of sending the next program byte
+via channel CodeByte to the core if the core is ready to receive that byte and
+the buffer is not empty. The third guarded command gets enabled if MemAck
+goes high indicating that the data signals in a read access are valid. In that
+case the value read from memory is put in the buffer after which the memory
+handshake is completed.
+After each event, if the buffer is not full and no memory access is being
+performed, a next memory access is started. Since (count<2) or -MemReq is
+an invariant of the loop, the last (conditional) command in the loop can be
+simplified to unconditional assignment MemReq:= (count<2).
+Note that the value (pc-count) is equal to the program counter in the core,
+since it is set to the destination address in the case of a jump, increased by
+1 if a code byte is transferred to the core, and kept invariant if a code byte
+is read from memory. Therefore the core does not need to hold the program
+counter and instead, when the information is needed for a relative branch, it can
+retrieve the counter value from the prefetch unit. Clearly, this feature requires
+an extension of the Tangram code shown in Fig. 13.15.
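+
+The bookkeeping argument above can be checked with a small model. The sketch
+below is not the Tangram code: the class PrefetchCounters, its method names and
+the buffer size of 2 (read off the guard count<2) are assumptions made for
+illustration. It only tracks pc and count and shows that pc-count behaves like
+the program counter of the core.
+
```python
from dataclasses import dataclass

BUFFER_SIZE = 2  # assumption: the guard (count<2) means "buffer not full"


@dataclass
class PrefetchCounters:
    """Hypothetical model of the two counters in the prefetch unit."""
    pc: int = 0      # address of the next byte to be read from memory
    count: int = 0   # number of bytes currently held in the buffer

    def jump(self, address):
        # New start address from the core: set pc and flush the buffer.
        self.pc = address
        self.count = 0

    def send_code_byte(self):
        # One buffered byte is transferred to the core.
        if self.count == 0:
            raise ValueError("buffer is empty")
        self.count -= 1

    def memory_read_done(self):
        # One byte read from memory is stored in the buffer.
        if self.count >= BUFFER_SIZE:
            raise ValueError("buffer is full")
        self.count += 1
        self.pc += 1

    def core_pc(self):
        # The value (pc-count) discussed in the text.
        return self.pc - self.count


def demo():
    unit = PrefetchCounters()
    trace = []
    unit.jump(0x100)
    trace.append(unit.core_pc())
    for step in (unit.memory_read_done, unit.memory_read_done,
                 unit.send_code_byte, unit.send_code_byte,
                 unit.memory_read_done):
        step()
        trace.append(unit.core_pc())
    return trace
```
+
+In this model the value only advances when a byte is handed to the core, and
+it is unchanged by memory reads.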
+
+13.5.3
+The DES coprocessor
+
+A transaction may need up to ten single DES conversions, where each con-
+version takes about 10 ms if it is executed in software. Therefore a hardware
+solution is needed, since these conversions would otherwise consume about
+half of the transaction time.
+Fig. 13.16 shows the datapath of the DES coprocessor. The processor sup-
+ports both single- and triple-DES conversions and, for the latter type of con-
+version, contains two keys: a foreground and a background key. Single-DES
+conversions use the foreground key, whereas triple-DES conversions use the
+foreground key for the first and third conversion and the background key for
+the second conversion. The foreground key is stored in register CD0 consisting
+of 56 flipflops (the DES key size is 56 bits), whereas the background key re-
+sides in variable CD1 consisting of 56 latches. The text value resides in variable
+LR consisting of 64 latches (DES words contain 64 bits).
+A single-DES conversion consists of 16 steps and, in each step, the key is
+permuted and a new text value is computed from the old text value and the key.
+In order to have the key return to its original value at the end of a conversion,
+the key makes two basic permutations in 12 of the 16 steps and only one in
+the remaining 4, where 28 basic permutations are needed for a complete cycle.
+The permutations are performed in flipflop register CD0.
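+
+The count of 28 can be checked against the standard DES key schedule. The
+sketch below assumes that a "basic permutation" corresponds to a one-bit left
+rotation of each 28-bit key half, as in the published DES standard; the helper
+names are chosen for illustration only.
+
```python
# Left-shift amounts of the standard DES key schedule, one per step.
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]


def rotate_left_28(half, n):
    """Rotate a 28-bit key half left by n positions."""
    mask = (1 << 28) - 1
    return ((half << n) | (half >> (28 - n))) & mask


def run_schedule(half):
    """Apply all 16 steps of the schedule to one 28-bit key half."""
    for shift in SHIFTS:
        half = rotate_left_28(half, shift)
    return half
```
+
+Twelve steps rotate by two and four steps rotate by one, giving 28 one-bit
+rotations in total, so each 28-bit half is back in its original position.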
+Most of the area is taken by the combinational circuit called DES. Since this
+circuit is also dominant in power dissipation, one should minimize the number
+of transitions at its inputs. For this purpose, we have introduced two latch
+
+
+242
+Part III: Large-Scale Asynchronous Designs
+
+[Block diagram. Labels: CD0(56), CD1(56), cd(56), LR(64), lr(64), Data(8),
+DES, RotMuxCD, LRotMuxLR, Swap, Inp/Enc/Dec, SFRwrite.]
+
+
+Figure 13.16.
+DES coprocessor architecture.
+
+registers: cd for the key and lr for the text. If two basic permutations are done
+in one step, cd hides the effect of the first one from combinational circuit DES.
+In addition, all inputs of combinational circuit DES change only once in each
+step by loading the two registers lr and cd simultaneously and then storing the
+result in register LR as described by the following piece of Tangram text.
+
+( lr:= LR || cd:= CD0 ) ; LR:= DES(lr,cd)
+
+Therefore, latch register lr also serves as a kind of slave register. Latch
+register cd also serves a functional purpose, since the two keys are swapped by
+executing the following three transfers.
+
+cd:= CD0 ; CD0:= CD1 ; CD1:= cd
+
+The size of the DES coprocessor is 3,250 gate equivalents, of which 57%
+is taken by the combinational logic and 35% by latches and flip-flops. Con-
+sequently, the overhead in area due to the asynchronous design style (delay
+matching and C-elements) is marginal at 8%. At 3.3 V, a single-DES conver-
+sion takes 1.25 µs and 12 nJ.
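+
+As a rough worked example, the figures quoted above imply the following
+average power and current for one conversion. This is plain arithmetic on the
+quoted numbers (12 nJ, 1.25 µs, 3.3 V, and 10 ms for a software conversion); the
+variable names are illustrative.
+
```python
ENERGY_J = 12e-9        # energy of one single-DES conversion
TIME_S = 1.25e-6        # duration of one single-DES conversion
SUPPLY_V = 3.3          # supply voltage for these figures
SOFTWARE_TIME_S = 10e-3  # one conversion executed in software

avg_power_w = ENERGY_J / TIME_S          # average power during a conversion
avg_current_a = avg_power_w / SUPPLY_V   # average supply current
speedup = SOFTWARE_TIME_S / TIME_S       # hardware versus software

print(f"{avg_power_w * 1e3:.1f} mW, {avg_current_a * 1e3:.2f} mA, {speedup:.0f}x")
```
+
+This gives about 9.6 mW and about 2.9 mA on average, and a conversion that is
+roughly 8000 times faster than the software version.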
+
+Fig. 13.17 shows the simulated current of the DES coprocessor at 3.3 V
+(the microcontroller is active before and after the DES computation). The real
+current peaks will be much smaller due to a lower supply voltage (the DES
+processor functions properly at a supply voltage as low as 0.5 V) as well as the
+buffer capacitor (the resolution in the simulation is 1 ns).
+
+[Plot: current i [mA] against time t [µs].]
+
+Figure 13.17.
+DES coprocessor current at 3.3 V.
+
+The conversion time, of a few microseconds, is so small that we used the
+handshaking mechanism to obtain synchronization between the microcontroller
+and the coprocessor. After starting the coprocessor, the microcontroller can
+continue executing instructions, and only when reading the result will it be
+held up in a handshake until the result is available. Note that a synchronous
+design would require a form of busy-waiting.
+
+13.6.
+Results
+
+Fig. 13.18 shows the layout of the chip, which is in a five-layer metal,
+0.35 µm technology and has a size of 4.52 × 4.16 = 18 mm2, including the
+bond pads. Many bond pads are only included for measurement and evalua-
+tion purposes. A production chip only needs about 10 bond pads.
+The two horizontal blocks on top form the buffer capacitor (in a production
+chip, the capacitor would only require about one quarter of the area). The
+memories are on the next row, from left to right: two RAMs, one ROM and
+the EEPROM, which is the large block to the right. The asynchronous circuit
+is located in the lower left quadrant, near the centre.
+Table 13.3a gives the area of the blocks constituting the contactless dig-
+ital circuit, which is the asynchronous circuit together with the memories.
+
+Figure 13.18.
+Layout of smart card chip.
+
+The other modules are either synchronous or analog circuits, where the syn-
+chronous modules are not used in contactless operation. From this table it
+follows that the asynchronous logic takes only 12% of the total contactless
+digital circuit.
+
+Table 13.3a.
+Areas of the contactless digital circuit blocks.
+
+Block          Area [mm2]
+RAM            1.2
+ROM            1.0
+EEPROM         5.6
+Async. circ.   1.1
+Total          8.9
+
+Table 13.3b.
+Areas of the asynchronous modules.
+
+Module         Area [GE]
+CPU            7,800
+Pref. Unit     700
+DES            3,250
+UART           2,040
+Interfaces     3,680
+Timer          1,080
+Total          18,550
+
+The sizes of the different asynchronous modules are given in Table 13.3b.
+In the standard cell library used, a gate equivalent (GE) is 54 µm2 with a typical
+layout density of 17,500 gates per mm2.
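+
+The two tables can be cross-checked with a short calculation. The sketch below
+only uses the numbers quoted in Tables 13.3a and 13.3b and the library figures
+above; the variable names are illustrative.
+
```python
MODULES_GE = {
    "CPU": 7800, "Pref. Unit": 700, "DES": 3250,
    "UART": 2040, "Interfaces": 3680, "Timer": 1080,
}
GE_AREA_UM2 = 54            # cell area of one gate equivalent
DENSITY_GE_PER_MM2 = 17500  # typical layout density
TOTAL_DIGITAL_MM2 = 8.9     # Table 13.3a total
ASYNC_MM2 = 1.1             # Table 13.3a asynchronous circuit

total_ge = sum(MODULES_GE.values())
cell_area_mm2 = total_ge * GE_AREA_UM2 / 1e6      # cells only
layout_area_mm2 = total_ge / DENSITY_GE_PER_MM2   # at typical density
async_share = ASYNC_MM2 / TOTAL_DIGITAL_MM2

print(total_ge, round(cell_area_mm2, 2), round(layout_area_mm2, 2),
      f"{async_share:.0%}")
```
+
+The module sizes add up to 18,550 GE, which is about 1.06 mm2 at the typical
+layout density and agrees with the 1.1 mm2 of Table 13.3a and with the quoted
+share of 12%.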
+
+Table 13.3c.
+Power of the contactless digital circuit.
+
+Block          Power
+Core           56%
+ROM            27%
+RAM            17%
+
+Table 13.3d.
+Effect of asynchronous design on power and area at different levels.
+
+Level            Power    Area
+Async. circ.     -70%     +18%
+Async. + Mem.    -60%     +2%
+
+Table 13.3c shows the power dissipation of the digital circuit blocks when
+the controller is executing code from ROM (being the normal situation).
+Table 13.3d shows the effects on power and area of an asynchronous design
+at two different levels. The asynchronous circuit gives a reduction in power
+dissipation of about 70% for 18% additional area. At the level of the contact-
+less digital circuit, however, we obtain a power reduction of 60% for only 2%
+additional area. Note that this analysis does not include the synchronous RSA
+converter and the analog circuits needed in a production chip, such as for in-
+stance the buffer capacitor and the power supply unit. Therefore at chip level
+the relative reduction in power dissipation will be about the same, whereas the
+overhead in area will be reduced even further.
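+
+The step from 18% to 2% additional area can be reproduced from Table 13.3a.
+The sketch assumes that "18% additional area" means the asynchronous circuit
+is 1.18 times the size of an equivalent synchronous circuit, and that the
+memories are unchanged; both are assumptions made for this check.
+
```python
ASYNC_MM2 = 1.1          # asynchronous circuit, Table 13.3a
TOTAL_MM2 = 8.9          # contactless digital circuit, Table 13.3a
CIRCUIT_OVERHEAD = 0.18  # additional area at circuit level

sync_circuit_mm2 = ASYNC_MM2 / (1 + CIRCUIT_OVERHEAD)
extra_mm2 = ASYNC_MM2 - sync_circuit_mm2
sync_total_mm2 = TOTAL_MM2 - extra_mm2
overhead_pct = 100 * extra_mm2 / sync_total_mm2

print(f"{overhead_pct:.1f}% additional area at the digital-circuit level")
```
+
+Under these assumptions the overhead at the level of the contactless digital
+circuit comes out at about 1.9%, in line with the 2% quoted.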
+
+13.7.
+Test
+
+The testing of asynchronous circuits for manufacturing defects is known to
+be a difficult problem [61, 120]. The main problem is that asynchronous cir-
+cuits have many feedback loops that cannot be controlled by an external signal.
+This makes the introduction of scan testing expensive, and forces the designer
+to become involved in the development of a functional test, either in producing
+the patterns directly or in implementing the design-for-test measures.
+A functional test approach was chosen for the chip described in this chapter.
+During test, the microcontroller is connected to an external ROM that contains
+a test program. This program computes a signature, and a circuit is said to be
+defective if the signature is not correct. In addition, current measurements are
+performed to increase the test coverage.
+The functional tests were developed using the test for the 80C51 microcon-
+troller [144] as a starting point. Both this test and its extension were developed
+using the test-evaluation tool described in [152]. This tool evaluates the struc-
+tural test coverage (controllability and observability) of functional traces.
+
+For the datapath logic, the tool can be used to achieve a 100% coverage for
+the stuck-at-input fault model, even though actually achieving this may be a
+real challenge to the test engineer. For the 80C51 microcontroller, however,
+with the inherent controllability and observability of its registers and buses,
+this appears to be feasible.
+In the absence of full-scan, it is known that a 100% coverage for the stuck-
+at-input fault model can only be achieved for the asynchronous control logic
+by using a combination of functional patterns and current measurements in a
+circuit that has been modified so as to be pausable in the middle of carefully
+selected handshakes [120]. Since this modification has not been implemented
+in the smartcard circuit reported here, the fault coverage can never be 100%.
+The exact fault coverage of the traces that were used here is not known,
+as this would demand unrealistic levels of compute power using a verification
+tool, but is estimated to be around 90% for the asynchronous control and data-
+path subsystem.
+
+13.8.
+The power supply unit
+
+Fig. 13.19 shows the power supply unit consisting of a rectifier and a power
+regulator, which are both completely analog circuits. The design of the recti-
+fier is conventional, and of the regulator we discuss only those aspects of the
+behaviour that are relevant to the design of the digital circuit without going
+into the details of its design.
+
+[Block diagram: rectifier (Rect) and power regulator (Pow Reg), with signals
+i0, v0, i1 and v1.]
+
+Figure 13.19.
+Power supply unit.
+
+To avoid interference with the communication, a power regulator has been
+designed that shows an almost constant load at its input. Fig. 13.20 shows
+Spice-level simulation results of such a power regulator when the input volt-
+age V0 is fixed at 5V. On the horizontal axis we have the activity (number of
+transitions per second) of the digital circuit. The input load is almost constant,
+since input current i0 is almost constant over the whole range.
+When the activity is low, output voltage V1 is constant at about 3V. In this
+range, too much power is coming in and the regulator functions as a voltage
+source with output current i1 increasing when the activity increases. The su-
+perfluous power is shunted to ground. However, i1 reaches a saturation point
+when it reaches i0. From this point on, no more power is shunted to ground and
+the regulator starts to function as a current source with output voltage V1 de-
+creasing when the activity increases. The regulator delivers maximum power
+in the middle of the range where both the outgoing voltage and the outgoing
+current are high. Note that these simulation results assume constant incoming
+RF power. The variations in the incoming RF power during a transaction, how-
+ever, are an additional source of fluctuations in V1, since these variations result
+in shifts of the saturation point.
+
+[Plot: currents i0 and i1 (mA) and voltage V1 (V) against activity.]
+
+
+Figure 13.20.
+Power regulator behaviour.
+
+A power source with these characteristics burdens the designer of a syn-
+chronous circuit with the problem of trading off between performance and ro-
+bustness. Going for maximum performance means assuming a supply voltage
+of 3 V in which case a transaction must be aborted if the voltage drops below
+2.5 V, for example. On the other hand, if the designer opts for a more robust
+design by choosing 2 V as the operating voltage, performance is lost when
+the regulator delivers 3 V. Such trade-offs are not needed for an asynchronous
+circuit, since it always automatically gives the maximum performance for the
+power received.
+
+13.9.
+Conclusions
+
+We have designed, built and evaluated an asynchronous chip for contactless
+smart cards in which we have exploited the fact that asynchronous circuits:
+
+use little average power,
+
+show small current peaks, and
+
+operate over a wide range of the supply voltage.
+
+Measurements and simulations showed the following advantages of this de-
+sign when compared to a conventional synchronous one:
+
+The asynchronous circuit gives the maximum performance for the power
+received. This comes mainly from the fact that the asynchronous de-
+sign needs less of what is the main limiting factor for the performance,
+namely power. Compared to a synchronous design, the asynchronous
+circuit needs about 40% of the power for less than 2% additional area.
+In addition, the automatic speed adaptation property of asynchronous
+circuits saves the designer from trading off between performance and
+robustness. Due to this property the asynchronous circuit will give free-
+running instead of guaranteed performance, where the difference be-
+tween the two is about a factor two.
+
+The asynchronous design is more resilient to voltage drops, since it still
+operates correctly for voltages down to 1.5 V.
+
+The current peaks of an asynchronous circuit are less pronounced, mak-
+ing the requirements with respect to the buffer capacitor more modest.
+
+The combination of the power regulator with the asynchronous circuit
+gives little communication interference. In this case, the smaller current
+peaks and the self-adaptation property are of importance.
+
+Acknowledgements
+
+We gratefully acknowledge the other members of the Tangram team: Kees
+van Berkel, Marc Verra and Erwin Woutersen, and we thank Klaus Ully for
+helping us to get the DES converter right.
+
+
+Chapter 14
+
+AN ASYNCHRONOUS VITERBI
+DECODER*
+
+Linda E. M. Brackenbury
+Department of Computer Science, The University of Manchester
+
+lbrackenbury@cs.man.ac.uk
+
+Abstract
+Viterbi decoders are used for decoding data encoded using convolutional for-
+ward error correcting codes. Such codes are used in a large proportion of digital
+transmission and digital recording systems because, even when the transmitted
+signal is subjected to significant noise, the decoder is still able efficiently to de-
+termine the most likely transmitted data.
+This chapter describes a novel Viterbi decoder aimed at being power efficient
+through adopting an asynchronous approach. The new design is based upon se-
+rial unary arithmetic for the computation and storage of the metrics required;
+this arithmetic replaces the add-compare-select parallel arithmetic performed by
+conventional synchronous systems. Like all Viterbi decoders, a history of com-
+putational results is built up over many data bits to determine the data most likely
+to have been transmitted at an earlier time. The identification of a starting point
+to this tracing operation allows the storage requirement to be greatly reduced
+compared with that in conventional decoders where the starting point is random.
+Furthermore, asynchronous operation in the system described enables multiple,
+independent, concurrent tracing operations to be performed which are decoupled
+from the placing of new data in the history memory.
+
+Keywords:
+low-power asynchronous circuits, Viterbi, convolution decoder
+
+14.1.
+Introduction
+
+The PREST (Power REduction for Systems Technology) [1] project was a
+collaborative project where each partner designed a low power alternative to
+
+*The Viterbi design work was supported by the EPSRC/MoD PowerPack project GR/L27930 and the EU
+PREST project EP25242, and this support is gratefully acknowledged.
+
+an industry standard Viterbi decoder. The Manchester University team’s aim
+was to obtain a power-efficient design through the use of asynchronous timing.
+The Viterbi decoder function [148] was chosen as it is a key function in
+digital TV and mobile communications. It detects and corrects transmission
+errors, outputting the data stream which, according to its calculations, is the
+most likely to have been transmitted. Viterbi coding is popular because it can
+deal with a continual data stream and requires no framing information such as
+a block start or finish. Furthermore, the way in which the output stream is con-
+structed means that even if output data is incorrectly indicated, this situation
+will correct itself in time.
+
+14.2.
+The Viterbi decoder
+
+14.2.1
+Convolution encoding
+
+In order to understand the functions performed by the decoder, the way the
+data is encoded needs to be described. This is illustrated in figure 14.1. The
+input data stream enters a shift register which is initially set to zero. The 2-
+bit register stores the previous two input bits and these are combined with the
+current bit in two modulo-2 adders to provide the two binary digits which form
+the encoded output for the current input bit.
+
+[Diagram: input i(t) passes through two 1-bit delays giving i(t−1) and
+i(t−2); two modulo-2 adders (+) produce outputs X and Y.]
+
+Figure 14.1.
+Four-state encoder.
+
+For example, suppose that the input stream comprises 011 with the current
+input bit (‘1’) on the right and the oldest bit (‘0’) on the left, then the encoded
+X-output, which adds all three bits, is 0 and the encoded Y-output,
+which adds the current and oldest bit, is 1; so ‘01’ would be transmitted in this
+case. As each input bit causes two encoded bits to be transmitted, the code is
+referred to as 1/2 rate.
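+
+The encoder just described can be written down in a few lines. This is a
+minimal sketch of the four-state example only; the function name encode_k3 is
+chosen for illustration.
+
```python
def encode_k3(bits):
    """Encode a bit sequence with the four-state, rate-1/2 example encoder.

    X adds (modulo 2) the current bit and both previous bits;
    Y adds the current bit and the oldest bit.
    """
    prev1 = 0  # i(t-1)
    prev2 = 0  # i(t-2)
    encoded = []
    for bit in bits:
        x = bit ^ prev1 ^ prev2
        y = bit ^ prev2
        encoded.append((x, y))
        prev2, prev1 = prev1, bit  # shift the register
    return encoded
```
+
+Feeding in 0, 1, 1 (oldest bit first) ends with the pair (0, 1), the ‘01’ of
+the example in the text.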
+
+
+Chapter 14: An Asynchronous Viterbi Decoder
+251
+
+The use of three input bits (called the constraint length k) means that the
+encoder has 2^(k-1) = 4 possible states, ranging from state S0 when both
+previous bits are zero to S3 when both these bits are ones. So, a current state n will
+become state 2n modulo 4 if the current input bit is a zero and (2n+1) modulo
+4 if it is a one; for example, if the current state is 2, the next state will be 0 or
+1. That is, each state leads to two known states. This state transition informa-
+tion versus time is normally drawn in the form of a trellis diagram as shown in
+figure 14.2 for a four-state system. States are also referred to as nodes, and the
+node-to-node connection network is referred to as a butterfly connection due
+to its predictability, regularity and shape.
+
+[Trellis diagram: states S0 to S3 against time 0 to 4; branches labelled with
+the encoder outputs 00, 01, 10 and 11; input data 1, 1, 0, 1.]
+
+
+Figure 14.2.
+Four-node trellis.
+
+By knowing the starting state of the trellis at any time and the subsequent
+input stream, the route taken by the encoder and the state it reaches can be
+traced. For example, starting at state 0 at time 0, an input pattern of 1 then 1
+then 0 followed by 1 will cause the trellis path to travel from state S0 → S1 →
+S3 → S2 → S1; this route is indicated by the thicker black line on figure 14.2.
+This figure also shows the encoder outputs associated with each path or
+branch on the trellis. For example, moving from state S3 to S2, corresponding
+to a current input of ‘0’, the input stream must be 110 (with the oldest bit
+leftmost), so the encoded X and Y outputs are 0 and 1; this path is labelled
+with 01 to indicate the encoder output in moving from S3 to S2. The other
+paths are calculated in a similar way and labelled appropriately.
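+
+The state transitions and branch labels of the four-node trellis can be
+generated mechanically. The sketch below assumes the state numbering used in
+the text (the least significant state bit is the most recent input bit); the
+function names are illustrative.
+
```python
def next_state(state, bit):
    """State n goes to 2n modulo 4 for input 0 and (2n+1) modulo 4 for input 1."""
    return (2 * state + bit) % 4


def branch_output(state, bit):
    """Encoder output (X, Y) on the branch leaving `state` with input `bit`."""
    prev1 = state & 1         # most recent previous bit
    prev2 = (state >> 1) & 1  # oldest bit
    return (bit ^ prev1 ^ prev2, bit ^ prev2)


def trace_path(bits, start=0):
    """Return the list of states visited, beginning with the start state."""
    path = [start]
    for bit in bits:
        path.append(next_state(path[-1], bit))
    return path
```
+
+For the input 1, 1, 0, 1 from state 0 this gives the states 0, 1, 3, 2, 1, and
+the branch from S3 with input 0 carries the label (0, 1).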
+
+14.2.2
+Decoder principle
+
+The decoder attempts to reconstruct the most likely path taken by the en-
+coder through the trellis. It does this by constructing a trellis and attaching
+weights to the nodes and each possible path at each timeslot; these indicate
+how likely it is that the node and path were the route taken by the encoder.
+Consider again the four-node example above. The encoded output for trans-
+mission resulting from the same input sequence of 1 then 1 then 0 then 1 would
+be 11 followed by 01, 01 and finally 00 (from an initial encoder state of all-
+zeros). Suppose that instead of receiving this sequence, the decoder receives
+corrupted data of 11 followed by 00, 01 and 00. Also assume that node 0 has
+an initialised weight of zero and that the other nodes have a higher initialised
+weight, say 2, at time 0; this corresponds to the encoder starting in state S0.
+
+[Decoding trellis: states S0 to S3 against time 0 to 4; received symbols 11,
+00, 01 and 00; node weights and branch weights for each timeslot.]
+
+
+Figure 14.3.
+Decoding trellis.
+
+For each branch, the distance between each received bit and the bit expected
+by the branch gives the weight of the branch. Where X and Y are encoded as
+single bits, referred to as hard coding, the number of bits difference between
+the received bits and the ideal encoded state for the branch gives the branch
+weight. The weights allocated in each timeslot for the received data are shown
+in figure 14.3. The received bits in the first timeslot are 11. Branches rep-
+resenting an encoded output of 11 are given a weight 0 while the branches
+representing an encoded output of 00 are given a weight of 2, and branches
+with ideal encoded outputs of 01 or 10 are given a weight of 1; these branch
+weights represent the distance between the branch and the received input. The
+branch weights are then added to the node weights to give an overall weight.
+Thus for example, state S2 has a node weight of 2 at time 0 and its branches
+have weights of 0 and 2 giving an overall weight of 2 and 4 going into the
+receiving node at the end of the timeslot. The overall weight indicates how
+likely it is that the route through the encoder corresponded to this sequence;
+the lower the overall weight the more likely this is.
+From the trellis diagram, it can be seen that each state can be reached from
+two other states and that each state will therefore have two incoming overall
+weights. The weight that is chosen is the lower of these two since this rep-
+resents the most likely branch that the encoder would traverse in reaching a
+particular state. So for example, state S1 can be entered from either state S0
+or S2 and at time 1, the overall node plus branch weight from S0 is 0 + 0 = 0
+while from S2 it is 2 + 2 = 4. The weight arising from S0 is chosen as this is
+the lower and therefore more likely route into S1; the weight arising from the
+S2 branch is discarded and the new weight for node S1 at the end of this first
+timeslot is that from S0, i.e. 0.
+This process continues in a similar way for each timeslot with weights given
+to each branch according to the difference between the encoded branch pattern
+and the received bits. This difference is then added to the node weight to give
+an overall weight with the next weight for a node equal to the lower incoming
+weight. This results in the weights given in figure 14.3.
+To form the output from the decoder, a path is traced through its trellis using
+the accumulated history of node weight information. The weights at time 4
+indicate that the encoder is most likely to be in state S1 as this has the lowest
+node weight. Furthermore, this state was most likely to have been reached
+from S2. Continuing tracing backwards from S2 taking the lower overall node
+count, the most likely path taken from S2 is S3, S1 and S0 (initialisation). Note
+that despite having received corrupted data, this is exactly the sequence taken
+by the encoder in sending this data!
+The output data can be ‘rescued’ from the decoder by noting that to reach
+states S0 and S2 the current data input to the encoder is a ‘0’ while to reach
+states S1 and S3, the current encoder input is a ‘1’; that is, the least significant
+state bit indicates the state of the current data. Thus since the optimum states
+the decoder reaches at times 1, 2, 3 and 4 are S1, S3, S2 and S1 respectively, the
+decoder would output a data stream of 1101 as being the most likely input to
+the encoder.
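+
+The worked example can be reproduced with a small hard-decision decoder. This
+is a sketch of the textbook add-compare-select procedure for the four-state
+example, not of the asynchronous design described later; the function names
+and the tie-breaking rule (keep the first candidate) are assumptions.
+
```python
def branch_output(state, bit):
    """Encoder output (X, Y) on the branch leaving `state` with input `bit`."""
    prev1 = state & 1
    prev2 = (state >> 1) & 1
    return (bit ^ prev1 ^ prev2, bit ^ prev2)


def viterbi_decode(received, init_weights):
    """Decode a list of received (X, Y) bit pairs for the four-state code."""
    n_states = 4
    weights = list(init_weights)
    history = []  # per timeslot: winning predecessor of every state
    for rx in received:
        new_weights = [None] * n_states
        winners = [None] * n_states
        for prev in range(n_states):
            for bit in (0, 1):
                nxt = (2 * prev + bit) % n_states
                x, y = branch_output(prev, bit)
                # Node weight plus branch weight (number of differing bits).
                cost = weights[prev] + int(x != rx[0]) + int(y != rx[1])
                if new_weights[nxt] is None or cost < new_weights[nxt]:
                    new_weights[nxt] = cost
                    winners[nxt] = prev
        weights = new_weights
        history.append(winners)
    # Trace back from the state with the lowest final weight.
    state = weights.index(min(weights))
    path = [state]
    for winners in reversed(history):
        state = winners[state]
        path.append(state)
    path.reverse()
    decoded = [s & 1 for s in path[1:]]  # least significant state bit
    return weights, path, decoded


weights, path, decoded = viterbi_decode(
    [(1, 1), (0, 0), (0, 1), (0, 0)],  # corrupted data from the text
    [0, 2, 2, 2],                      # node 0 starts at weight zero
)
```
+
+With the corrupted input from the text the lowest final weight belongs to S1,
+the traced path is S0, S1, S3, S2, S1 and the decoded data is 1101.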
+
+14.3.
+System parameters
+
+In practice the decoder designed is larger and more complicated than indi-
+cated by the simple example given. The encoder uses the current bit and the
+six previous bits in various combinations to provide the two encoded output
+streams; this spreads each input bit out over a longer transmission period, in-
+creasing the probability of error-free reception in the presence of noise. If the
+current bit is bit 0 with the oldest bit in the shift register being bit 6, then the
+encoded X-output is obtained by adding bits 0, 1, 2, 3 and 6 and the encoded
+Y-output from adding bits 0, 2, 3, 5 and 6. Since the constraint length is 7, the
+system has 64 nodes and therefore 128 paths or branches. Thus state n, where
+n is an integer from 0 to 63, leads to states 2n modulo 64 and (2n+1) modulo
+64.
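+
+The larger encoder follows the same pattern as the four-state example. The
+sketch below uses the tap positions given above (bit 0 is the current bit, bit
+6 the oldest); the names X_TAPS, Y_TAPS, encode_k7 and next_states are chosen
+for illustration.
+
```python
X_TAPS = (0, 1, 2, 3, 6)  # bits added (modulo 2) to form the X-output
Y_TAPS = (0, 2, 3, 5, 6)  # bits added (modulo 2) to form the Y-output


def encode_k7(bits):
    """Encode a bit sequence with the constraint-length-7, rate-1/2 encoder."""
    window = [0] * 7  # window[0] is the current bit, window[6] the oldest
    encoded = []
    for bit in bits:
        window = [bit] + window[:6]
        x = sum(window[t] for t in X_TAPS) % 2
        y = sum(window[t] for t in Y_TAPS) % 2
        encoded.append((x, y))
    return encoded


def next_states(n):
    """The two successors of state n in the 64-node trellis."""
    return (2 * n) % 64, (2 * n + 1) % 64
```
+
+A single 1 followed by six 0s produces the X sequence 1111001 and the Y
+sequence 1011011, which simply replays the two tap patterns.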
+Furthermore, the received bits are not hard coded (just straight ones and
+zeros) but soft coded. Three bits are used to represent values, with 100 used
+for a strong zero value and 011 for a strong one value. Noise in transmission
+means that a received character can be indicated by any 3-bit value from 0 to
+7 and in interpreting the received value, it is helpful to regard the number as
+signed. Thus 011 indicates a strong one while 000 denotes received data that
+weakly indicates a one. Similarly, 100 implies a strong zero while 111 denotes
+a probable but weak zero. To make the text easier to follow hereafter, unsigned
+values will be used with the code for a 3-bit perfect zero taken as 000 (i.e.
+value 0) while a 3-bit perfect one is taken as 111 (i.e. value 7).
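+
+One way to relate the two conventions is sketched below. It assumes that the
+received 3-bit code is read as a two's-complement number and that the unsigned
+value is obtained by adding 4; the text does not state this mapping, so it
+should be read as an illustration only.
+
```python
def signed3_to_unsigned(code):
    """Map a 3-bit signed code (0..7 as a bit pattern) to the range 0..7.

    Assumed mapping: 100 (strong zero) -> 0 and 011 (strong one) -> 7.
    """
    signed = code - 8 if code & 0b100 else code  # two's-complement value
    return signed + 4
```
+
+Under this assumption 100 maps to 0, 111 to 3, 000 to 4 and 011 to 7, so the
+ordering from strong zero to strong one is preserved.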
+The interface to the asynchronous decoder is synchronous with the validity of
+the encoded characters on a particular positive clock edge indicated by a Block-
+Valid signal; encoded data is only present when this signal is high. Code rates
+other than the 1/2, primarily described in this chapter, are also possible. These
+are achieved by using the Block-Valid with a Puncture and a Puncture-X-nY
+signal; if Block-Valid is active (high) then a high Puncture signal indicates that
+only one of the X or Y symbols is valid and Puncture-X-nY indicates which.
+Both encoded characters are present if Block-Valid is high and Puncture is low.
+All the code rates originate from the 1/2 rate encoder with data for the other
+code rates then obtained by omitting to send some of these encoded characters.
+In this way, in addition to the 1/2 code rate, the system receives and decodes
+rates of 2/3 (two input bits result in the transmission of three encoded charac-
+ters), 3/4, 5/6, 6/7 and 7/8 (7 input bits defined by 8 encoded characters with
+the remaining 6 not transmitted). As the code rate progressively increases, less
+redundancy is included in the transmitted data resulting in an increased error
+rate from the decoder; a rate of 1/2 will yield the most error-free output.
+The system is expected to be operated from a 90 MHz clock. Since this
+clock is used by all code rates, the code rate not only defines the ratio of the
+number of input bits to the number of transmitted encoded characters but also
+specifies the ratio of the number of clocks containing encoded information (of
+1 or 2 valid characters) to the number of clock cycles. For example, a 3/4 rate
+signifies that three input bits result in four transmitted characters and also that
+three clocks out of four contain some encoded information. Thus in a 3/4 code,
+every fourth clock contains no encoded information (Block-Valid is low). Any
+clock for which Block-Valid is high is said to contain a symbol. Thus with
+a 90 MHz clock rate, a 1/2 rate code which has valid data on alternate clock
+cycles yields a data rate of 45 Msymbols/sec and a 7/8 code rate is equivalent
+to 90 × 7/8 Msymbols/sec = 78.75 Msymbols/sec.
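+
+The symbol rates for all supported code rates follow from the same product.
+The sketch below just multiplies the 90 MHz clock by each code rate; the names
+are illustrative.
+
```python
from fractions import Fraction

CLOCK_MHZ = 90
CODE_RATES = [Fraction(1, 2), Fraction(2, 3), Fraction(3, 4),
              Fraction(5, 6), Fraction(6, 7), Fraction(7, 8)]


def symbol_rate_msym(rate):
    """Symbols per second (in millions) for a given code rate."""
    return float(CLOCK_MHZ * rate)


for rate in CODE_RATES:
    print(f"rate {rate}: {symbol_rate_msym(rate):.2f} Msymbols/sec")
```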
+
+14.4.
+System overview
+
+To perform the decoder computation previously outlined, the Viterbi de-
+coder comprises three units as shown in figure 14.4. The Branch Metric Unit
+(BMU) receives the transmitted data and computes the distance between the
+ideal branch pattern symbols of (0,0), (0,7), (7,0) or (7,7) and the received
+data; these weights are then passed to the Path Metric Unit (PMU). The PMU
+holds the node weights and performs the node plus branch weight calculations,
+
+[Block diagram: Branch Metric Unit, Path Metric Unit and History Unit
+connected by Req/Ack handshakes. Labels: Clock, input from receiver, branch
+metrics, local winners, global winner, decoded output.]
+
+Figure 14.4.
+Decoder units.
+
+selecting the lower overall weight as the next weight for a node in a particular
+timeslot. The computed node weights are then fed back (within the PMU) and
+become the node weights for the next timeslot.
+As well as computing the next slot node weights, the PMU remembers
+whether the winner at a node came from the upper or lower branch leading
+in to it; this is only a single bit per node. On each timeslot, this local winner
+information is passed to the History Unit (HU), called the Survivor Memory
+Unit in synchronous systems. This information enables the HU to keep a his-
+tory of the routes taken through the trellis by each node. In addition to this
+local node information, in our asynchronous design the state with the lowest
+node weight is identified in the PMU and its identity passed to the HU. Locat-
+ing this global winner gives a known starting point in searching back through
+the trellis history to find the best data to output.
+Conventional synchronous designs do not undertake an overall node winner
+identification and consequently start the search at a random point. They need
+to store a relatively large number of timeslots in order to ensure that there is
+sufficient history to make it likely that the correct route through the trellis is
+eventually found. In the asynchronous design, the identification of the overall
+node winner in the PMU was relatively easy to perform and it seemed the
+natural way to proceed. It has had the desirable effect of enabling the amount
+of timeslot history kept in the HU to be reduced and it also reduces the activity
+in the HU, saving power.
+The HU uses both the overall winning node information (the global winner)
+and the local node winners in order to reconstruct the trellis and trace back the
+path from the current overall winner to find the node indicated by the oldest
+timeslot stored; the bit for this node is then output. The HU can be visualised
+as a circular buffer. Once data is output from the oldest slot, this information
+is overwritten with the current (newest) winner data so that the next oldest data
+becomes the oldest data in the next timeslot.
+Figure 14.4 shows the bundled-data interface used between units; four-phase
+signalling is used for the Request and Acknowledge handshake signals. The
+Clock signal is required because the external system to the decoder is syn-
+chronous. Input data is supplied and output data removed on the positive clock
+edge provided that the bits on that clock edge are valid as indicated by Block-
+Valid. In practice, all units contain buffering at their interfaces in order to
+decouple the operation of the units, to cater for local variations in operating
+times, and to satisfy latency requirements imposed by the external system.
+
+14.5.
+The Path Metric Unit (PMU)
+
+14.5.1
+Node pair design in the PMU
+
+The PMU performs the core of the computation in the decoder and is the
+starting point for the design. Conventionally, the computation performed is
+that of Add (node to branch weight), Compare (upper and lower weights to
+a node) and Select (lower weight as next weight for a node). Because of the
+butterfly connection, the branch weights associated with nodes j and j+32 and
+their connections lead to nodes 2j and 2j+1, as shown in figure 14.5 where
+BMa and BMb represent the branch weights; it should be noted that since a
+branch represents ideal convolved characters of (0,0), (0,7), (7,0) or (7,7), it is
+only necessary to compute a total of four weights in any system representing
+their distance from the received soft-coded characters.
+
+[Figure 14.5. Node pair computation: the previous node metrics of nodes j and j+32 are connected to the next node metrics of nodes 2j and 2j+1 through branches weighted BMa and BMb.]
+
+As the logic for this pair of nodes is self-contained and all the logic can be
+similarly partitioned into self-contained pairs of nodes, the basic unit of logic
+design in the PMU is the node pair; this is then replicated the required number
+of times (32 in this system). Furthermore, since 8-bit parallel arithmetic is nor-
+mally used, in a 64-node system this leads to 512 data signals in the butterfly
+connection and 1024 interconnections within the PMU.
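The conventional Add-Compare-Select step described above can be sketched in a few lines of Python. This is an illustrative model only, not the decoder's circuit: the function name, the list layout of the metrics and the assignment of BMa and BMb to particular branches of the butterfly are assumptions made for the example. The local winner is recorded as 0 when the upper parent (node j) wins and 1 when the lower parent (node j+32) wins.

```python
NUM_STATES = 64            # 64-node trellis, as in the text
HALF = NUM_STATES // 2     # 32 node pairs


def acs_timeslot(prev_metrics, bm_a, bm_b):
    """One Add-Compare-Select timeslot over all 32 node pairs.

    prev_metrics: list of 64 node weights from the previous timeslot.
    bm_a, bm_b:   lists of 32 branch weights, one pair per node pair
                  (hypothetical layout chosen for this sketch).
    Returns (next_metrics, local_winners).
    """
    next_metrics = [0] * NUM_STATES
    local_winners = [0] * NUM_STATES
    for j in range(HALF):
        upper, lower = prev_metrics[j], prev_metrics[j + HALF]
        # Assumed labelling: BMa on j -> 2j and j+32 -> 2j+1,
        # BMb on the two crossing branches.
        arrivals = (
            (2 * j,     upper + bm_a[j], lower + bm_b[j]),
            (2 * j + 1, upper + bm_b[j], lower + bm_a[j]),
        )
        for node, via_upper, via_lower in arrivals:
            # Compare and select the lower weight; ties go to the upper branch.
            if via_lower < via_upper:
                next_metrics[node], local_winners[node] = via_lower, 1
            else:
                next_metrics[node], local_winners[node] = via_upper, 0
    return next_metrics, local_winners
```

Each iteration of the outer loop touches only nodes j, j+32, 2j and 2j+1, which is why the logic partitions cleanly into self-contained node pairs.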
+
+Table 14.1.
+Serial unary arithmetic.
+
+number   transition representation
+zero     000000
+one      111111
+two      000001
+three    111101
+four     000101
+five     110101
+six      010101
+
+[Figure 14.6. Six-bit 2-phase dataless FIFO: a chain of six Muller C-gates (stages 1 to 6) with handshake signals Rin/Ain at the input, Rout/Aout at the output, and internal request/acknowledge signals R1 to R5 and A1 to A5.]
+
+In an effort to simplify this routing problem and to achieve an associated
+power reduction from this simplification, serial arithmetic was proposed for
+the asynchronous design; in principle, this would reduce the butterfly to just
+64 wires. The adoption of serial arithmetic significantly impinges upon the
+node pair design and the way weights are stored. A conventional binary rep-
+resentation of values where serial arithmetic was performed on them was not
+a practical option as it would lead to a system throughput appreciably below
+that specified. This led to the idea of indicating values by the number of entries
+occupied in a FIFO buffer, so for example a count of five would require that the
+bottom five stages of the buffer showed that they were full; note that the buffer
+stages don’t need to store any data but merely require a full/empty indication.
+The speed and simplicity of this full/empty dataless FIFO scheme are further
+enhanced by adopting serial unary arithmetic for representing the data in the
+buffer (rather than a ‘1’, say, to represent full and a ‘0’ for empty). This is
+essentially a 2-phase representation for values, so that the number of transi-
+tions considering the full/empty bits as a whole represents the count. This is
+illustrated in table 14.1 for a 6-stage FIFO where the input enters on the left
+hand side (and exits from the right hand side).
+The FIFOs used to hold the path and state metrics are Muller pipelines as
+shown in figure 14.6 (see also figure 2.8 on page 17). The encoding of a metric,
+M, is simply the state of an initially empty Muller pipeline after it has been
+exposed to M 2-phase handshakes on its input. Since a Muller C-gate in the
+technology used has a propagation delay of around 1 nsec the FIFO can transfer
+data in and out at a rate of around 500 MHz.
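As a rough illustration of this encoding, the following Python sketch reproduces the patterns of table 14.1 for a six-stage pipeline. It assumes an initially empty pipeline (all stages at 0) whose output side has not yet been acknowledged, and it reads the pattern with the input on the left; the function names are invented for the example.

```python
STAGES = 6


def encode(metric, stages=STAGES):
    """Stage pattern (input on the left) after `metric` 2-phase events.

    The k-th event to enter comes to rest k stages from the output end
    and leaves that stage at level k % 2; the stages behind the last
    event all carry the level of that last event.
    """
    if not 0 <= metric <= stages:
        raise ValueError("metric does not fit in the FIFO")
    fill = [metric % 2] * (stages - metric)
    held = [k % 2 for k in range(metric, 0, -1)]
    return "".join(str(bit) for bit in fill + held)


def decode(pattern):
    """Count the transitions in the pattern, read against an output level of 0."""
    levels = pattern + "0"
    return sum(1 for x, y in zip(levels, levels[1:]) if x != y)
```

For example, `encode(3)` gives `111101` and `decode("111101")` gives 3, matching the entry for three in table 14.1.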
+Using serial unary arithmetic, the major design component in a node pair
+for adding the node and branch weights and transferring the smaller to be the
+new node weight is an increment-decrement unit. The basic scheme for a node
+pair is illustrated in figure 14.7.
+
+[Figure 14.7. Node pair logic: for node pair n/2, the state metrics of nodes n/2 and n/2+32 (from node pairs n/4 and n/4+16) cross the butterfly network into the Path Metric FIFOs, which are loaded with the branch metrics; Select elements feed the State Metric FIFOs of nodes n and n+1, whose outputs pass through the butterfly network to node pairs n and n+1. The local and global winners are sent to the History Unit.]
+
+The new weights for each state are stored in the State Metric FIFOs on the
+right hand side of figure 14.7. When the global and local winners of these have
+been sent to the HU and acknowledged, the next timeslot commences with the
+parallel loading of the branch weights into the Path Metric FIFOs on the left
+in figure 14.7; these overwrite any existing content in these FIFOs. Parallel
+loading here, rather than serial entry, was selected on the grounds of speed and
+the need to clear the Path Metric FIFOs of any existing count.
+The branch weights loaded are those computed by the BMU. The BMU first
+computes the conventional branch weights based on the difference between the
+two ideal 3-bit characters expected on the trellis branch and the two received
+values. The BMU then translates this to a transition pattern. This is made
+more complicated by the fact that the external environment to the Path Metric
+FIFOs is sometimes in the ‘1’ state, necessitating a 2-phase inversion of the
+computed pattern.
+Once the branch weights are loaded in, the node weights are then added to
+them. The node weights are transferred serially from the State Metric FIFOs
+across the butterfly connection into the Path Metric FIFOs. For each event
+transferred, two Path Metric FIFOs are incremented by one and the State Met-
+ric FIFO decremented by one. The transfer is complete when the feeding State
+Metric FIFO is empty. The Path Metric FIFOs for the node pair can commence
+the comparison and selection of the lower count as the node weight for the re-
+ceiving State Metric FIFO once the receiving State Metric FIFO is empty.
+The simplest way of performing this comparison and selection of the lower
+Path Metric FIFO count is to pair transitions (events) in the upper and lower
+Path Metric FIFOs connected to a receiving State Metric FIFO. Each paired
+event decrements each Path Metric FIFO by one and produces a transition
+which is used to increment the State Metric FIFO. The observant reader will
+note that the pairing of events to produce an output transition is exactly the
+action performed by a two-input Muller C-gate and this in principle is all that
+is required for the Select element shown in figure 14.7.
+The pairing action ceases when the lower count in the two Path Metric FI-
+FOs has decremented to zero. At this point, this Path Metric FIFO is the local
+path winner and the new node weight in the receiving State Metric FIFO is
+complete. The identity of the winning Path Metric FIFO (upper or lower) is
+needed to reconstruct the trellis; this information is buffered in the PMU and
+sent to the HU when all local winners are known and the overall winning node
+has been identified. This completes the actions required in the current timeslot
+and the PMU is then free to commence the next timeslot.
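The event-by-event behaviour of a node described above can be modelled with plain integer counts. The sketch below is a behavioural illustration under stated assumptions, not the circuit: FIFO contents are represented as integers rather than transition patterns, any limit on FIFO depth is ignored, a tie is awarded to the upper FIFO, and the function names are invented for the example.

```python
def state_to_path_transfer(state_count, path_a, path_b):
    """Drain a State Metric FIFO into the two Path Metric FIFOs it feeds.

    Each event transferred decrements the state metric by one and
    increments both path metrics by one (the serial 'Add').
    """
    while state_count > 0:
        state_count -= 1
        path_a += 1
        path_b += 1
    return path_a, path_b


def compare_select(upper, lower):
    """Pair events from the upper and lower Path Metric FIFOs.

    Each paired event decrements both FIFOs and increments the receiving
    State Metric FIFO, which is the action of a two-input Muller C-gate.
    Pairing stops when the smaller count reaches zero.
    Returns (new_state_metric, local_winner) with 0 = upper, 1 = lower.
    """
    new_state = 0
    while upper > 0 and lower > 0:
        upper -= 1
        lower -= 1
        new_state += 1
    local_winner = 0 if upper == 0 else 1   # tie goes to upper (assumption)
    return new_state, local_winner
```

For instance, branch weights of 2 and 0 with an incoming node weight of 3 become path counts of 5 and 3, and pairing a count of 5 against a count of 3 leaves a new node weight of 3.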
+
+14.5.2
+Branch metrics
+
+The proposed scheme is only viable if the numbers that need to be trans-
+ferred between the FIFOs can be kept small. A simulator was written in order
+to establish the minimum values that were consistent with meeting the bit error
+rates specified for the industrial standard device.
+
+[Figure 14.8. Computing the branch metric: a received point lies at distances d00, d01, d10 and d11 from the ideal points (0,0), (0,7), (7,0) and (7,7), with offsets a and b along one axis and c and d along the other.]
+
+Table 14.2.
+Branch metric weight generation.
+
+received 3-bit character:   0  1  2  3  4  5  6  7
+Weight referenced to 0:     0  0  0  0  1  3  5  7
+Weight referenced to 7:     7  5  3  1  0  0  0  0
+
+In the BMU, the distance of the incoming data from the ideal branch
+representations of (0,0), (0,7), (7,0) and (7,7) needs to be computed. This
+calculation is depicted in figure 14.8. The incoming data is assumed to have
+a value of $(a, c)$ which does not correspond to any of the ideal points. The
+squares of the distances $d_{00}$, $d_{01}$, $d_{10}$ and $d_{11}$ are
+$a^2 + c^2$, $a^2 + d^2$, $b^2 + c^2$ and $b^2 + d^2$ respectively. Only the
+relative values of these quantities are of interest.
+Substituting $(7 - a)$ for $b$ and $(7 - c)$ for $d$ in the quadratic expressions gives
+squared distances of $a^2 + c^2$, $a^2 + (7 - c)^2$, $(7 - a)^2 + c^2$ and
+$(7 - a)^2 + (7 - c)^2$.
+Expanding out and subtracting $a^2 + c^2$ gives distance values of $0$, $49 - 14c$,
+$49 - 14a$ and $98 - 14a - 14c$. Dividing by 7, adding $a + c$ and then substituting
+back $b$ for $(7 - a)$ and $d$ for $(7 - c)$ yields the linear metrics $a + c$, $a + d$,
+$b + c$ and $b + d$.
+Thus in this particular system, the Euclidean distance squared is equivalent
+to the Manhattan distance, which is a somewhat surprising and unexpected re-
+sult. It indicates that to use squared distances offers no advantage over using
+the much simpler linear distances; indeed, using squared distances followed
+by scaling to reduce the number size (which is adopted in some systems) in-
+troduces unnecessary circuit complexity and some inaccuracy.
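The equivalence can be checked numerically. The short Python sketch below, whose function names are chosen purely for illustration, verifies that for every received pair (a, c) of 3-bit values the squared distances and the linear metrics differ only by a common offset and a scale factor of 7, so they always rank the four ideal points in the same order.

```python
def squared_distances(a, c):
    """Squared distances to (0,0), (0,7), (7,0) and (7,7)."""
    b, d = 7 - a, 7 - c
    return [a * a + c * c, a * a + d * d, b * b + c * c, b * b + d * d]


def linear_metrics(a, c):
    """Linear (Manhattan) distances to the same four ideal points."""
    b, d = 7 - a, 7 - c
    return [a + c, a + d, b + c, b + d]


def relation_holds(a, c):
    """True if each squared distance, relative to the first, is 7 times
    the corresponding linear metric relative to the first."""
    sq, lin = squared_distances(a, c), linear_metrics(a, c)
    return all(s - sq[0] == 7 * (m - lin[0]) for s, m in zip(sq, lin))


all_pairs_agree = all(relation_holds(a, c) for a in range(8) for c in range(8))
```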
+The linear weights are further minimised by subtracting the x and y distance
+to the nearest ideal point from the branch weights, so the smallest linear metric
+is always made zero. For example, if the incoming soft codes are exactly 7,7
+in figure 14.8 then linear metrics of 14, 7, 7 and 0 are generated. However, if
+the incoming bits are at co-ordinate 5,6 (due to noise) then the metrics become
+11, 6, 8 and 3 which by subtracting 3 from all values (2 for x value and 1
+for y value) become branch metrics of 8, 3, 5 and 0. With the distance to the
+nearest ideal point subtracted in this way, so that it always becomes zero, the
+weights generated for each 3-bit character are reduced to those shown in
+table 14.2.
+The maximum metric is now 14, and this will always arise when the incom-
+ing data coincides with one of the ideal input combinations. This is still too
+large for a system which needs to operate serially at close to 90 MHz. There-
+fore the metric is further scaled non-linearly and, to preserve the relative value
+of the weights, this is performed separately on each of the two incoming 3-bit
+soft values.
+Referring to table 14.2, weights of zero remain zero while weights of 1, 3, 5
+and 7 are scaled to 1, 2, 3 and 4. Clearly, the weights for both 3-bit soft codes
+need to be added to obtain the overall branch metric weight. Thus for example,
+the incoming soft-codes of 7,7 generate weights of 4+4, 4+0, 0+4 and 0 = 8, 4,
+4 and 0, while incoming soft-codes of 5,6 generate weights of 2+3, 2+0, 0+3
+and 0 = 5, 2, 3 and 0. Although the scaling here is non-linear and will therefore
+introduce some inaccuracy, simulation showed that this was not significant in
+relation to the results obtained with and without the scaling.
+Furthermore, simulation also reveals that, unsurprisingly, paths with the
+largest weights are rarely involved in the most likely path to be chosen dur-
+ing the path reconstruction to find the output. Therefore, weights generated
+can be limited or capped and a limit of 6 is used in the BMU. Thus the weights
+actually generated and loaded by the BMU for a 7,7 soft code input are 6, 4, 4
+and 0; however, the weights for a 5,6 input code remain as above.
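The complete weight generation described in this section (table 14.2, the non-linear scaling and the cap of 6) can be summarised in a short Python sketch. The function names and the output ordering, which follows the ideal points (0,0), (0,7), (7,0) and (7,7), are assumptions made for the example rather than details of the BMU circuit.

```python
SCALE = {0: 0, 1: 1, 3: 2, 5: 3, 7: 4}   # non-linear scaling of table 14.2 weights
CAP = 6                                  # limit applied to generated weights


def weight_ref_0(x):
    """Table 14.2 weight of a 3-bit soft value referenced to ideal 0."""
    return max(0, 2 * x - 7)


def weight_ref_7(x):
    """Table 14.2 weight of a 3-bit soft value referenced to ideal 7."""
    return max(0, 7 - 2 * x)


def branch_weights(x, y, cap=CAP):
    """Branch weights for received soft codes (x, y).

    Each character is scaled separately, the two scaled weights are
    added, and the sum is capped. Pass cap=None to see uncapped values.
    """
    x0, x7 = SCALE[weight_ref_0(x)], SCALE[weight_ref_7(x)]
    y0, y7 = SCALE[weight_ref_0(y)], SCALE[weight_ref_7(y)]
    sums = [x0 + y0, x0 + y7, x7 + y0, x7 + y7]
    return sums if cap is None else [min(s, cap) for s in sums]
```

With these definitions, `branch_weights(7, 7)` gives `[6, 4, 4, 0]` and `branch_weights(5, 6)` gives `[5, 2, 3, 0]`, in agreement with the worked values in the text.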
+Six-bit FIFOs are used throughout the PMU and again numbers in the PMU
+are capped at 6. To deal with cases where the serial addition of the node metric
+to a branch weight would exceed this number, logic referred to as the Overflow
+Unit is placed at the input of each Path Metric FIFO. In such cases it receives
+the incoming request handshake but does not pass it to the FIFO; instead it
+returns it to the sending State Metric FIFO as an acknowledge signal.
+
+14.5.3
+Slot timing
+
+The overall or global winner from the PMU in a particular timeslot is the
+node having the lowest state metric count. In the same way that the BMU
+values can be adjusted so that the lowest weight is zero, the state metric values
+can also be decremented so that their minimum value is zero. As a result,
+numbers in the BMU and PMU are guaranteed to range only between zero and
+six. Furthermore if the soft bits contain no noise, which is the situation most of
+the time, then one (and only one) State Metric FIFO will contain a zero count
+indicating the best path through the trellis. This means that in the majority of
+timeslots, there is no need to perform any subtraction on the state metrics.
+Detecting that a count is zero is in itself easy since it is indicated by an all-
+zeros state in the FIFO. This, as well as establishing the local winner, is done
+locally within a node and is timed from the control signals applied to the node;
+each node has a control section which generates the timing signals required by
+the node, and the timing within a node is independent of the timing of all other
+nodes.
+A slot commences with the loading of the BMU branch weights. The node
+timing then passes to the next stage where the state-to-path metric transfer is
+performed. Following this, detecting that the sending State Metric FIFOs and
+the State Metric FIFO to receive the lower path metric count are empty causes
+the generation of a state-to-path metric done signal. The node timing then
+moves on to the next phase which generates a path-to-state metric enable. If
+at the time this signal is activated one of the Path Metric FIFOs for the node
+is empty, then a flip-flop is set indicating that this node is a global winner
+candidate; in this case no transfer to the State Metric FIFO is required and
+the path-to-state metric done signal is generated; this signal is used to clock
+the upper/lower branch (local) winner into a flip-flop and also to set a ‘local
+winner found’ flip-flop.
+If neither Path Metric FIFO is empty then the path-to-state enable signal
+allows the transfer to its State Metric FIFO until one of the Path Metric FIFOs
+becomes empty; at this point, the path-to-state metric done signal is generated,
+setting the ‘local winner’ and the ‘local winner found’ flip-flops only.
+The ‘local winner found’ and ‘global winner found’ signals now progress
+up the PMU logic hierarchy to the top level because all information needed
+to be passed to the HU has to be present before the request out signal to the
+HU is generated by the PMU. Furthermore, when the local and global winner
+data have been assembled for the HU, all nodes need to be informed that the
+slot has ended and the timing can be shifted to the start of the next slot. It
+should therefore be noted that while the timing within the nodes is local, the
+communication of the winner information to the HU and the subsequent release
+of the nodes has to be global.
+
+14.5.4
+Global winner identification
+
+The formation of the ‘all local winners found’ and the global winner identi-
+fication is partitioned across the various levels of the PMU logic hierarchy. At
+the level of a pair of node pairs, the four global winner candidate signals are
+input to a priority function which produces a 2-bit node address and a ‘global
+winner found’ signal if one of the inputs was a candidate. This logic is repeated
+with four such pairs of node pairs so that using the ‘global winner found’ sig-
+nal generated by each pair of node pairs, the two bit address obtained indicates
+which pair of node pairs contains the global winner; these are then combined
+with the two-bit address generated by that pair of node pairs to form a 4-bit
+node address. Finally, this logic is repeated at the top level where there are
+four sets of four pairs of node pairs. Again the ‘global winner found’ signal
+from each set is used in a priority logic function to produce the two-bit address
+of the winning set and this is combined with the four address bits identifying
+the winning node within the winning set; this is the 6-bit node identification
+that is sent to the HU as the global winner.
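The three-level address formation can be pictured as a recursive four-way priority function. The Python sketch below is illustrative only: it assumes that the 64 candidate flags are ordered so that each group of four consecutive nodes forms a pair of node pairs, and that the priority function favours the lowest-numbered input, neither of which is stated in the text.

```python
def priority4(flags):
    """Four-input priority function: (found, 2-bit index of first set flag)."""
    for index, flag in enumerate(flags):
        if flag:
            return True, index
    return False, 0


def find_global_winner(candidates):
    """Combine 'global winner candidate' flags into (found, node address).

    With 64 candidates this applies the priority function at three
    levels (4 nodes, 4 pairs of node pairs, 4 sets), each contributing
    two address bits to the 6-bit node identification.
    """
    if len(candidates) == 1:
        return bool(candidates[0]), 0
    quarter = len(candidates) // 4
    groups = [find_global_winner(candidates[i * quarter:(i + 1) * quarter])
              for i in range(4)]
    found, index = priority4([g[0] for g in groups])
    if not found:
        return False, 0
    # Upper address bits select the group, lower bits come from that group.
    return True, index * quarter + groups[index][1]
```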
+The ‘local winner found’ signals only require combining in NAND or NOR
+gates and this is done at the node pair, four pairs of node pairs and top levels.
+At the top level, all the ‘local winner found’ signals must be true in order to
+generate the request out signal to the HU. Since the global winner identification
+is generated from the node whose local winner is the first to be indicated, the
+global winner is guaranteed to be identified prior to the last local winner being
+found. The acknowledge signal from the HU, in response to the PMU request
+out, causes a reset signal to all nodes which resets the global candidate and the
+‘local winner found’ flip-flops and moves the timing on.
+For the cases where no State Metric FIFO contains zero, this is detected;
+it indicates that noisy data has been received. Here, the global winner can be
+identified by performing a subtraction such that one or more State Metric FI-
+FOs with the smallest count become zero. This could be time consuming and
+a decision was taken not to perform the decrement in the current timeslot. In-
+stead the local winners are sent to the HU in the normal way but a Not-Valid
+signal accompanies the request handshake indicating that while the local win-
+ner information is genuine the global winner identification should be ignored.
+The decrementing of all the state metric weights is performed in the next
+cycle by all the Overflow Units (which precede the Path Metric FIFOs). A
+signal to this unit indicates that decrementing is required. This results in the
+first incoming request from a State Metric FIFO to transfer its count to the
+Path Metric FIFO being ignored. The Overflow Unit sends an acknowledge
+back to its sending State Metric FIFO but does not pass the request on to its
+Path Metric FIFO. This effectively decrements the State Metric FIFOs by one
+by discarding the first item sent by them to the Path Metric FIFOs.
+Only a count of one is decremented in this way on any timeslot. This may
+still leave all state metrics with a non-zero count in them but simulation re-
+vealed that this was highly unlikely. Furthermore, if the Overflow Units were
+used to decrement the state metrics by the smallest count then either consider-
+able logic to determine the size of this count would be needed, or time consum-
+ing logic which decremented all state metrics by one and then tested to see if all
+these metrics were still non-zero (repeating these steps if necessary) would be
+required. Instead the much simpler approach of detecting a zero-valued state
+metric and identifying when all state metric counts are non-zero is used.
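A minimal behavioural sketch of this deferred normalisation is given below in Python. It models the state metrics as a list of integer counts and is not a description of the Overflow Unit logic; the function names and the returned tuple are invented for the example, and the choice of the lowest-numbered zero as winner is an assumption.

```python
def end_of_timeslot(state_metrics):
    """Return (global_winner, valid, decrement_next_slot).

    If some state metric is zero, that node is the global winner.
    Otherwise the winner is flagged as not valid and a decrement by
    one is requested for the following timeslot.
    """
    for node, count in enumerate(state_metrics):
        if count == 0:
            return node, True, False
    return None, False, True


def deferred_decrement(state_metrics):
    """Discard the first event sent by every State Metric FIFO.

    Intended to be called only when all counts are non-zero, so that
    no count can become negative.
    """
    return [count - 1 for count in state_metrics]
```

Because only one count is removed per timeslot, a set of metrics such as 2, 3, 2 would need two successive timeslots before a zero appears, which is the unlikely case mentioned above.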
+In retrospect, it would have been better to have decremented the State Met-
+ric down to zero in the current timeslot. The decrementing has to occur at some
+point and postponing to the next timeslot merely shifts when the operation is
+performed. More importantly, the failure to identify the global winner in the
+case of all State Metric FIFOs holding a non-zero count means that information
+which is in the PMU is not passed to the HU and therefore the HU has less
+information on which to base its decisions as to the data output.
+
+14.6.
+The History Unit (HU)
+
+The global and local winner information from the PMU to the HU is accom-
+panied by a Request handshake signal in the normal way. Having specified the
+interface to the PMU, the design of the asynchronous HU can be decoupled
+from the design of the rest of the system. As previously mentioned, the iden-
+tification of a global winner means that the number of timeslots of local and
+global winner history that need be kept by the HU can be reduced compared
+with systems that need to start the tracing back through the trellis information
+from a random point. A rule of thumb for the minimum number of timeslots
+that need to be stored for determining the correct output is around 5 times the
+constraint length. With a length of seven for the system described, a minimum
+history of 35 timeslots is required and this was confirmed by simulation. On
+this basis, a 65-slot HU was developed.
+
+14.6.1
+Principle of operation
+
+Figure 14.9 illustrates the principle of operation of the HU which, for sim-
+plicity, is shown as having only four states and storing only five time slots
+indicated by the rectangular outline. At each time step, indicated by T1, T2,
+etc., the PMU supplies the HU with the local winner information (an arrow
+points back from each state to the upper/lower winner state in the previous
+time step) and a global winner indicated by a solid circle.
+Consider the situation when the latest data to have been supplied is at T6.
+The global winner at T6 is S3, and following the arrows back, the global winner
+at T2 is S0. The next data output bit is therefore 0 (the least significant bit
+of S0’s state number), and this is output as the buffer window slides forward
+to time step T7. At T7 the received data has been corrupted by noise, and
+the global winner is (erroneously) S3. Following the local winners back, the
+backtrace modifies the global winners in time steps T6 and T5, but the path
+converges with the old path at T4. The next data output (from the global winner
+at T3) is 1 and is still correct. Moving on to T8, the global winner is S0 and,
+tracing back, the global winners at T5, T6 and T7 are changed, with T5 and T6
+resuming their previous correct values. The noise that corrupted the path at T7
+has no lasting effect and the output data stream is error-free.
+
+14.6.2
+History Unit backtrace
+
+The data structure used in the HU is illustrated in table 14.3. Each of the
+65 slots contains 64 bits of local winner information and a 6-bit global winner
+identifier. There is also a ‘global winner valid’ flag which indicates whether or
+not the global winner has been computed. The 65 slots form a circular buffer
+with the start (and end) of the buffer stepping around the slots in sequence.
+
+[Figure 14.9. History Unit backtrace example: states S0 to S3 plotted against time steps T1 to T8, with the data_out bits (0 0 1 0 1 1 1 0) shown along the top.]
+
+Table 14.3.
+History Unit data structure.
+
+slot number   local winner (64 bits)   global winner (6 bits)   valid (1 bit)
+0             L00[63..0]               G00[5..0]                V00
+1             L01[63..0]               G01[5..0]                V01
+...           ...                      ...                      ...
+18            L18[63..0]               G18[5..0]                V18
+19            L19[63..0]               G19[5..0]                V19
+20            L20[63..0]               G20[5..0]                V20    <- head
+21            L21[63..0]               G21[5..0]                V21
+22            L22[63..0]               G22[5..0]                V22
+...           ...                      ...                      ...
+64            L64[63..0]               G64[5..0]                V64
+
+At each step the next output bit is issued from the least significant bit of the
+current head-slot global winner identifier. Then the new local and global win-
+ner information is written into the head slot and the head pointer moves to the
+next slot. The new local and global winner information is used to initiate a
+backtrace, which updates the current optimum path held in the global winner
+memories.
+The trellis arrangement produces a simple arithmetic relationship between
+one state and the next state so that, given a global winner identity in one slot,
+the previous global winner identity is readily computed. The parent identity
+can be derived from the child identity by shifting the child state one place to
+the right and inserting the relevant local winner bit into the most significant bit
+position. For example, if the global winner is node 23 in a slot, then the global
+winner in the previous slot will be node 11 (if the current slot local winner for
+node 23 is 0) or node 11+32 = node 43 (if the local winner for node 23 is 1).
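The parent computation is a one-line shift-and-insert on the 6-bit node number, sketched here in Python. The function names are invented for the example; `is_child` simply applies the same relationship in the other direction, since a child's upper five bits must equal its parent's lower five bits.

```python
def parent(child, local_winner_bit):
    """Parent node: shift the child right one place and insert the
    child's local winner bit as the most significant of the six bits."""
    return (local_winner_bit << 5) | (child >> 1)


def is_child(child, candidate_parent):
    """True if `child` can follow `candidate_parent` in the trellis."""
    return (child >> 1) == (candidate_parent & 0b11111)
```

For node 23 this gives `parent(23, 0) == 11` and `parent(23, 1) == 43`, as in the example above.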
+Where the current global winner is the child of the previous global winner,
+the current winner continues the good path already stored in the global winner
+memories. This makes it unnecessary to search back through the local winner
+information in order to reconstruct the trellis and hence saves power. There-
+fore, when data is received from the PMU, if the incoming global winner is the
+child of the last winner, then it is only necessary to output data from the oldest
+global winner entry and then to overwrite this memory line with the incoming
+local and global winner information.
+However, if sufficient noise is present (or noise has been present and the
+data now switches back to a good stream), then there may be a discontinu-
+ity between the incoming and previous global winner; this is recognised by
+the current global winner not being the child of the previous winner. In this
+case, the global winner memories do not hold a good path and this path is re-
+constructed using the local winner information. Here, the output data is read
+out and the winner information is written in as before. In addition, starting
+from the current global winner, this node identification is used to select its up-
+per/lower branch winner from the current local winner information. The parent
+identity is then constructed as described above. This computed parent identity
+is compared with the global winner identity for the previous slot. If they are
+the same then the backtrace has now converged onto the good path kept in the
+global winner memories and no further action is required. If, however, they are
+not the same then the computed parent identity needs to overwrite the global
+winner in the previous timeslot. The backtrace process now repeats in order to
+construct a good path to the next previous timeslot, and this process continues
+until the computed parent identity does match the stored global winner.
+Backtracing slot by slot thus proceeds until the computed path converges
+with the stored path. The algorithm is shown in a Balsa-like pseudo-code in
+figure 14.10. (Note, however, that Balsa does not have the ‘<<’ and ‘>>’
+shift operators; they are borrowed here from C to improve clarity.) In practice,
+simulation shows that path convergence usually occurs within eight or fewer
+slots. So, although the most recent items may be over-written, the oldest items
+tend to be static and the output data from the oldest slots does not change.
+Overwriting the entire path is a rare occurrence and, in this circumstance, the
+data output from the system is almost certainly erroneous.
+No backtrace is commenced in any slot where the global winner is invalid;
+the global winner entry is marked as invalid but the local winner information is
+written in the normal way. Any subsequent backtrace that encounters an invalid
+global winner will force a mismatch with the incoming computed global
+winner at that slot, so that the computed global winner replaces the invalid
+stored value and the entry is then marked as valid.
+
+loop
+  c := head;                                   -- child starts at head
+  data_out <- Gc[0];                           -- output lsb of Gc
+  Lc := In.local_winners;                      -- update head local winners,
+  Gc := In.global_winner;                      --   global winner, and
+  Vc := In.global_winner_valid;                --   global winner valid bit
+  head := (head + 1) % 65;                     -- step head pointer to next slot
+  if Vc then                                   -- backtrace only from valid head
+    p := (c-1) % 65;                           -- parent slot number
+    while (c /= head                           -- detect buffer wrap-around
+           and (not Vp                         -- over-write invalid parent
+                or Gp /= (Lc[Gc] << 5) + (Gc >> 1)))   -- not converged
+    then
+      Gp := (Lc[Gc] << 5) + (Gc >> 1);         -- update parent
+      Vp := TRUE;                              -- mark as valid
+      c := p;                                  -- next slot
+      p := (c-1) % 65                          -- next parent slot number
+    end -- while
+  end -- if
+end -- loop
+
+Figure 14.10.
+History Unit backtrace sequential algorithm.
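For readers who wish to experiment with the sequential algorithm, the following Python sketch mirrors the pseudo-code of figure 14.10. It is a software model under stated assumptions, not the HU hardware: the class and attribute names are invented, each slot's 64 local winner bits are held in one integer, and all slots start out invalid with zero contents.

```python
SLOTS = 65   # length of the circular buffer


class HistoryUnitModel:
    """Sequential model of the History Unit backtrace."""

    def __init__(self):
        self.local = [0] * SLOTS       # 64 local winner bits per slot
        self.glob = [0] * SLOTS        # 6-bit global winner per slot
        self.valid = [False] * SLOTS   # 'global winner valid' flags
        self.head = 0

    def step(self, local_winners, global_winner, global_valid):
        """Accept one timeslot of winner data; return the output bit."""
        c = self.head
        data_out = self.glob[c] & 1            # lsb of the oldest global winner
        self.local[c] = local_winners
        self.glob[c] = global_winner
        self.valid[c] = global_valid
        self.head = (self.head + 1) % SLOTS
        if global_valid:                       # backtrace only from a valid head
            p = (c - 1) % SLOTS                # parent slot number
            while c != self.head:              # stop on buffer wrap-around
                bit = (self.local[c] >> self.glob[c]) & 1
                computed = (bit << 5) + (self.glob[c] >> 1)
                if self.valid[p] and self.glob[p] == computed:
                    break                      # converged with the stored path
                self.glob[p] = computed        # over-write the parent slot
                self.valid[p] = True
                c = p
                p = (c - 1) % SLOTS
        return data_out
```

Feeding the model a global winner that is the child of the previous one leaves the stored path untouched, while a discontinuity rewrites earlier slots only until the computed parent matches the stored global winner.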
+
+14.6.3
+History Unit implementation
+
+The type of memory used in the HU is the dominant factor in determining
+its implementation. Initially, RAM elements were considered for this storage
+as single and dual-port read elements were present in the available cell library.
+However, their use makes it difficult to keep track of incomplete backtraces
+when a new backtrace needs to be started. In addition, the global and local
+winner memories need to be separate entities but this introduces some ineffi-
+ciency in the address decoding. Furthermore, there are difficulties in providing
+the many specific timed signals required to drive the memory. The RAM tim-
+ings are equivalent to many simple gate delays. Such gates would be used to
+form the reference timing signals, and it is not clear that the gate propagation
+delays due to supply voltage changes vary in the same way as the RAM delays.
+For these reasons, the memory was implemented with flip-flop storage and
+the system is shown in figure 14.11. It comprises 65 lines made up of 64
+slots of replicated storage and one further slot which, on reset, becomes the
+slot holding the head token; the head slot receives the new local and global
+winner information from the PMU. The control block holds the global winner
+identification plus the token handling and backtrace logic.
+
+[Figure 14.11. History Unit implementation: a chain of slots, each with a control block (holding the global winner) and a local winners memory accessed through addr, data and strobe signals; adjacent control blocks are linked by token and evaluate/addr signals, and the external signals are Rin, Ain, data_out and Aout.]
+
+The concurrent asynchronous algorithm is illustrated in Balsa-like pseudo-
+code in figure 14.12, which represents a single stage in the History Unit. The
+complete HU comprises 65 such stages, one of which is initialised to be the
+head of the circular buffer. This can be compared with the sequential algo-
+rithm shown earlier in figure 14.10. The transformation from the sequential
+to the concurrent algorithm is illustrative of high-level asynchronous design
+methodologies even when the design is being carried out manually (as was this
+Viterbi decoder) and not in a high-level language such as Balsa.
+The head slot contains the oldest winner information and determines the 1-
+bit data output from the system. Remembering that odd states signify a ‘1’ in-
+put and even states a ‘0’ input, the head slot outputs the least significant global
+winner bit on the data-out line. This data enters a buffer and the acknowledge
+Aout signifies its acceptance. The head is then free to write the current winner
+information to its memory. The Token signal then passes (leftwards) to the
+adjacent slot which now becomes the new head.
+The parent node of the current global winner is computed as described and
+this is passed (rightwards) to the adjacent slot with an Evaluate signal. The
+computed parent is compared with the stored winner in the previous stage.
+Equivalence results in no further backtracing and the backtrace is said to be
+retired. A mismatch causes overwriting of the global winner and this winner,
+accompanied by Strobe, is used to address (Addr) the local memory. The
+data bit returned on Data is used to compute the parent of this winner which is
+then passed rightwards to the preceding timeslot with an Evaluate signal. This
+process repeats until the backtrace converges with the existing global winners
+and can be retired. A backtrace has to be forcibly retired if it is in danger of
+running into the head slot; an arbiter is used to test for this in the control and it
+is the only place where arbiters need to be used in the system. Fortunately, the
+meeting of the head and backtrace processing is a rare occurrence.
+
+loop
+  arbitrate In then
+    if head then                               -- head...
+      data_out <- G[0];                        --   output next data bit
+      L := In.local_winners ||                 --   update local values
+      G := In.global_winner ||
+      V := In.global_winner_valid;
+      if V then                                --   if global winner is valid
+        addrOut <- (L[G] << 5) + (G >> 1)      --   start backtrace
+      end; -- if V
+      sync tokenOut ||                         --   pass on head token
+      head := false                            --   and clear head Boolean
+    end -- if head
+  | addrIn then                                -- backtrace input?
+    if not head then
+      if addrIn /= G or not V then             -- path converged? If not...
+        G := addrIn ||                         --   update global winner
+        V := true;
+        addrOut <- (L[G] << 5) + (G >> 1)      --   propagate backtrace
+      end -- if ..
+    end -- if not head
+  | tokenIn then                               -- head token arrived
+    head := true                               --   set head Boolean
+  end -- arbitrate
+end -- loop
+
+Figure 14.12.
+History Unit backtrace stage.
+
+It should be noted that, unlike a conventional system, path reconstruction is
+only undertaken if necessary and then for only as long as required; both strate-
+gies save power. Furthermore, the use of asynchronous techniques within the
+HU enables the writing of winner information from the PMU to be indepen-
+dent of and run concurrently with any path reconstruction activity. The use
+of flip-flop storage rather than RAM has resulted in a simpler and more flexi-
+ble design. It also has the distinct advantage of enabling multiple backtraces,
+whose frontiers are all at different slots, to be run concurrently.
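The backtrace rule of figure 14.12 can be sketched in software. The Python below is a minimal illustration, not the hardware design: the slot dictionaries, the function names and the newest-first list ordering are assumptions made for the example, and forced retirement at the head slot is omitted.

```python
NUM_STATES = 64  # 6-bit state numbers, as implied by the shift by 5 in figure 14.12


def parent(local_winners, state):
    """Parent of `state`: the stored local-winner bit supplies the top bit."""
    return (local_winners[state] << 5) + (state >> 1)


def backtrace(slots):
    """Run one backtrace from the newest slot towards older slots.

    `slots` is a list ordered newest first (an assumption of this sketch);
    each slot is a dict with keys "L" (64 local-winner bits), "G" (global
    winner) and "V" (global winner valid). Returns the number of slots whose
    global winner was overwritten before the backtrace retired.
    """
    overwritten = 0
    addr = parent(slots[0]["L"], slots[0]["G"])
    for slot in slots[1:]:
        if slot["V"] and slot["G"] == addr:
            break  # converged with an existing good path: retire
        slot["G"], slot["V"] = addr, True
        overwritten += 1
        addr = parent(slot["L"], addr)
    return overwritten


# Hypothetical history: all local-winner bits are 0, so parent(s) == s >> 1.
zeros = [0] * NUM_STATES
history = [
    {"L": zeros, "G": 40, "V": True},  # newest slot, just written by the head
    {"L": zeros, "G": 21, "V": True},  # stale winner: the parent of 40 is 20
    {"L": zeros, "G": 10, "V": True},  # already consistent: the parent of 20 is 10
]
steps = backtrace(history)
```

In this made-up history only one slot is rewritten before the backtrace meets a matching global winner, which mirrors the point in the text that path reconstruction runs only for as long as it is needed.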
+
+14.7. Results and design evaluation
+
+The asynchronous Viterbi decoder system was implemented as described
+using the industrial partner’s cell library which was designed to operate from
+3.3V. Non-standard elements such as the Muller C-gate were constructed from
+the cell library; the only full-custom element which had to be designed was an
+arbiter.
+Following simulation, the decoder was fabricated on a 0.35 micron CMOS
+process. Results for a 1/2 code rate, random input bit stream with no errors
+show an overall power dissipation of 1333 mW at 45 Msymbols/sec. Of this,
+the dissipation in the PMU dominates at 1233 mW while the HU takes only
+37 mW. The difference (about 60 mW) between these figures and the
+overall consumption is accounted for by the dissipation in the BMU and in
+the small amount of ‘glue’ logic prior to the BMU.
+Errors in the input data to the decoder result in only small variations in the
+dissipation with the overall dissipation falling slightly with an increasing input
+error rate; internally, this decrease comprises a small reduction in the PMU
+dissipation and a smaller rise in the HU dissipation. The results for other code
+rates are a scaled version of those obtained for the 1/2 code rate as might be
+expected. For example, a 3/4 code rate which receives 3 symbols for every 4
+clocks exhibits 1.5 times the dissipation of the 1/2 rate code with its 2 symbols
+every 4 clocks.
+The asynchronous PMU performs approximately the same amount of work
+regardless of the number of errors in the data stream. This results from capping
+numbers at 6. This means that for a good data stream, all nodes have a weight
+of six except the one node on the good (correct) path. Thus the PMU is almost
+permanently ‘saturated’ and practically all work performed relates to paths
+which will never be selected! Errors cause some spread in the node weights
+with the higher weights (4, 5 and 6) predominating, and the slightly smaller
+counts on some nodes account for the slight drop in dissipation in the PMU
+under error conditions.
+The asynchronous PMU dissipation is very high and does not compare well
+with synchronous PMUs using conventional parallel arithmetic to perform the
+add, compare and select operation [15]. In order to understand why this occurs,
+it is necessary to examine the operation and logic used in the asynchronous
+PMU in more detail. With good (i.e. no error) data, 63 nodes have weights
+of 6 and one node has a weight of 0. This translates to PMU activity on each
+timeslot where all State Metric FIFOs but one contain counts of 6 which are
+then transferred to Path Metric FIFOs, whose counts of 6 are in turn removed
+from the Path Metric FIFOs, paired and transferred to the receiving State Met-
+ric FIFOs. Entering or removing a count of 6 from a FIFO involves 21 changes
+of state in the stages. Furthermore, the number of transitions actually involved
+is higher due to (around 5) internal transitions on the elements making up the
+C-gates forming each FIFO stage. Thus, each of 63 nodes experiences around
+650 transitions just on the data path per timeslot. The control and other over-
+heads on the data path can be expected to form (say) an additional 30% of
+logic. This indicates a node activity of around 850 transitions/timeslot and
+overall the PMU can be expected to make a maximum of 54,400 transitions on
+each timeslot.
+Unfortunately, the design of the FIFOs, particularly the Path Metric FIFOs,
+has led to high capacitive loading on the true and inverse C-gate outputs at
+each stage. A dissipation of 1233 mW with 54,400 transitions per slot and an
+energy cost of 5.45 × C joules per transition, where C is the average track plus
+gate loading capacitance (in Farads), indicates an average loading of 92 fF and
+this is confirmed by measurements.
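The arithmetic behind these figures can be checked with a short calculation. This is a sketch under stated assumptions: the energy factor is taken to be half the square of the 3.3 V supply (which gives 5.445, close to the figure quoted), the maximum of 54,400 is read as 850 transitions on each of 64 nodes, and the timeslot rate is assumed to be 45 million per second; none of these readings is stated explicitly in the text.

```python
SUPPLY_V = 3.3               # cell library supply voltage, from the text
NODES = 64                   # ASSUMPTION: the maximum counts all 64 nodes
DATAPATH_TRANSITIONS = 650   # per node per timeslot, from the text
OVERHEAD = 0.30              # control and other overheads, from the text
PMU_POWER_W = 1.233          # PMU dissipation, from the text
TIMESLOTS_PER_S = 45e6       # ASSUMPTION: one timeslot per symbol period

# Node activity: about 845, which the text rounds to 850 transitions/timeslot.
per_node = DATAPATH_TRANSITIONS * (1 + OVERHEAD)

# Whole-PMU activity using the rounded figure.
per_slot = 850 * NODES

# Energy per transition is energy_factor * C joules (ASSUMPTION: 0.5 * V^2).
energy_factor = 0.5 * SUPPLY_V ** 2

# Solve power = transitions/s * energy_factor * C for the load capacitance C.
load_farads = PMU_POWER_W / (per_slot * TIMESLOTS_PER_S * energy_factor)
load_ff = load_farads * 1e15

print(f"{per_slot} transitions/timeslot, average load {load_ff:.1f} fF")
```

Under these assumptions the result is between 92 and 93 fF, in line with the loading quoted above.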
+By contrast, the HU power efficiency is excellent and is the real success
+story in this design. Its dissipation is low and is far smaller than that in any
+other system the designers are aware of. The HU dissipation demonstrates that
+by keeping a good path, very little computing is required to output the data
+stream when there are no errors. Furthermore, when lots of noise is present,
+so that the backtrace process is active with many good paths in the process of
+being reconstructed concurrently, the dissipation in the HU only rises slightly;
+this indicates that accessing the local winner memory in flip-flops and over-
+writing the global winner information is not a power-costly operation.
+The HU dissipation also compares favourably with a (synchronous) system
+built along similar principles to the HU described here but using RAM ele-
+ments from the library instead of flip-flops. Due to the limitations of the RAM
+devices previously mentioned, these introduce additional complexity because
+only one backtrace is performed at any time; it is therefore necessary to keep
+track of the depth reached by incomplete backtraces which are abandoned for a
+new backtrace leaving a partially complete global winner path reconstruction.
+The difference in dissipation between the asynchronous HU using flip-flops
+and the other using RAMs reflects the power cost of accessing the local win-
+ner RAM and the associated significant additional computation involved in the
+backtrace. This points to the power efficiency of storing the HU information
+in a manner best suited to the task.
+
+14.8. Conclusions
+
+As in many asynchronous designs, the system design has had to be ap-
+proached from first principles and has caused a complete rethink about how
+to implement the Viterbi algorithm. This has resulted in a number of novel
+features being incorporated in the PMU and HU units. In the PMU, the deci-
+sion to use serial unary arithmetic has enabled the conventional parallel add,
+compare and select logic to be dispensed with and replaced by dataless FIFOs
+which perform the arithmetic serially.
+While the PMU is an interesting and different design from that convention-
+ally used, its power consumption is not good. Its design illustrates that power
+efficiency at the algorithmic and architecture levels needs to be combined with
+efficient design at the logic, circuit and layout levels to realise the true po-
+tential of a system. This is demonstrated by a synchronous PMU constructed
+along similar architectural principles to those described but implemented using
+a low-power logic family and full custom layout which dissipates only 70 mW
+at 45 Msymbols/sec [15]. It is clear that while a full custom design of the asyn-
+chronous PMU datapath would reduce the current power levels significantly, a
+major revision of the PMU logic for the datapath, paying particular attention to
+loading, is required for a design which has better power efficiency than other
+systems.
+The identification of a global winner is probably the most important advance
+in the PMU design. This has meant that both a good path and a local winner
+history can be kept by the HU, leading to a greatly reduced amount of overall
+storage required to deduce the output data. The use of flip-flop storage has also
+greatly contributed to the power efficiency of this unit and it does demonstrate
+the power advantages of optimising design at all levels in the design hierarchy
+down to and including the logic.
+The HU also illustrates the advantages of asynchronous design in that the
+placing of current information is decoupled from any backtracing operations of
+which there may be many running concurrently. Furthermore, the speed of the
+backtracing is only dependent on the logic required to perform this operation
+and not on any other internal or external system timing. Such a decoupled,
+multiple backtracing activity would clearly be more difficult to organise in the
+context of a synchronous timing environment.
+
+14.8.1 Acknowledgement
+
+As in any large project, a number of people have been engaged in the design
+and implementation of the Viterbi decoder described in this chapter. It is there-
+fore a pleasure to acknowledge the other colleagues in the Amulet group in the
+Computer Science Department at Manchester University involved at all stages
+in this project, namely Mike Cumpstey, Steve Furber and Peter Riocreux. I am
+also grateful to them for comments on the draft of this chapter.
+
+14.8.2 Further reading
+
+Further information on the Viterbi algorithm may be found in [148], [71]
+and [70]. Further information on the PREST project is in [1].
+
+
+Chapter 15
+
+PROCESSORS*
+
+Jim D. Garside
+Department of Computer Science, The University of Manchester
+
+jgarside@cs.man.ac.uk
+
+Abstract
+Computer design becomes ever more complex. Small asynchronous systems
+may be intriguing and even elegant but unless asynchronous logic can not only be
+competitive with ‘conventional’ logic but can show some significant advantages
+it cannot be taken seriously in the commercial world.
+There can be no better way to demonstrate the feasibility of something than
+by doing it. To this end several research groups around the world have been
+putting together real, large, asynchronous systems. These have taken several
+forms, but many groups have chosen to start with microprocessors; a processor
+is a good demonstrator because it is well defined, self-contained and forces a de-
+signer to solve problems which are already well understood. If an asynchronous
+implementation of a microprocessor can compare favourably with a synchronous
+device performing an identical function then the case is proven.
+This chapter describes a number of processors that have been fabricated and
+discusses in some detail some of the solutions employed. The primary source of
+the material is the Amulet series of ARM implementations – because these are
+the most familiar to the author – but other devices are included as appropriate.
+The later parts of the chapter widen the descriptions to include memory systems,
+cacheing and on-chip interconnect, illustrating how a complete asynchronous
+System on Chip (SoC) can be produced.
+
+Keywords:
+low-power asynchronous circuits, processor architecture
+
+*The majority of the work described in this chapter has been made possible by grants from the European
+Union Open Microprocessor systems Initiative (OMI). The primary sources of funding have been OMI-
+MAP (Amulet1), OMI/DE-ARM (Amulet2e) and OMI/ATOM (Amulet3). Without this funding none of
+these devices would have been made and this support is gratefully acknowledged.
+
+
+15.1. An introduction to the Amulet processors
+
+Most of the examples in this chapter are based around the Amulet series
+of microprocessors, developed at the University of Manchester. All of these
+have been asynchronous implementations of the ARM architecture [65] and,
+as such, allow direct comparisons with their synchronous contemporaries. It
+should be noted that the primary intention, as in other ARM designs, was to
+produce power-efficient rather than high-performance processors.
+Brief descriptions of the three fabricated Amulet processors and some other
+notable examples are given below.
+
+15.1.1 Amulet1 (1994)
+
+Figure 15.1. Amulet1.
+
+Amulet1 [158] (figure 15.1) was a feasibility study in asynchronous design,
+using techniques based extensively on Sutherland’s Micropipelines [128]. Al-
+though two-phase signalling was used for communications, standard transpar-
+ent latches were used internally rather than Sutherland’s capture-pass latch (see
+figure 2.11 on page 20). The external two-phase interface proved difficult to
+connect to commodity parts.
+Amulet1 comprised 60,000 transistors in a 1.0 µm, 2-layer metal process and
+ran the ARM6 instruction set (with the exception of the multiply-accumulate
+operation). It achieved about half the instruction throughput of an ARM6
+manufactured on the same process with roughly the same energy efficiency
+(MIPS/W).
+
+
+Figure 15.2. Amulet2e.
+
+15.1.2 Amulet2e (1996)
+
+Amulet2e [44] (figure 15.2) was an ARM7 compatible device with complete
+instruction set compliance. In addition to the CPU it included an asynchronous
+4 KByte cache memory and a flexible external interface making it much easier
+to integrate with commodity parts. A few other optimisations such as (limited)
+result forwarding and branch prediction were added.
+Internally this device used four-phase rather than two-phase handshake pro-
+tocols. It occupied 450,000 transistors (mostly in the cache memory) in a
+0.5 µm 3-layer metal process; although about three times faster than Amulet1 it
+was still only half the performance of a contemporary synchronous chip.
+
+15.1.3 Amulet3i (2000)
+
+Amulet3i [48] was intended as a macrocell for supporting System on Chip
+(SoC) applications rather than a stand-alone device. It is an ARM9 compat-
+ible device comprising around 800,000 transistors in a 0.35 µm 3-layer metal
+process. It comprises an Amulet3 CPU, 8 KBytes of pseudo-dual port RAM,
+16 KBytes of ROM, a powerful DMA controller and an external memory/test
+interface, all based around a MARBLE [4] asynchronous on-chip bus.
+
+
+Figure 15.3. DRACO.
+
+Amulet3i achieves roughly the same performance as a contemporary, syn-
+chronous ARM with an equal or marginally better energy efficiency. It was
+integrated with a number of synchronous peripheral devices, designed by Ha-
+genuk GmbH, to form the DRACO (DECT Radio Communications Controller)
+device (figure 15.3).
+
+15.2. Some other asynchronous microprocessors
+
+Several other groups around the world have also been developing asyn-
+chronous microprocessors over the past decade or so. In this section a (non-
+exhaustive) selection of these are briefly described.
+Caltech has produced two asynchronous processors: the ‘Caltech Asyn-
+chronous Microprocessor’ (1989) [86] was a locally-designed 16-bit RISC
+which was the first single chip asynchronous processor; the ‘MiniMIPS’ [88]
+was an implementation of the R3000 architecture [72]. Both of these proces-
+sors were custom designed using delay-insensitive coding rather than the ‘bun-
+dled data’ used in the Amulets. This, together with a design philosophy aimed
+at speed rather than low power consumption, results in high-performance, high-
+power devices.
+Another MIPS-style microprocessor is the University of Tokyo’s ‘TITAC-2’
+(1997) [130] (figure 15.4) which is an R2000. This was developed in another
+
+
+Figure 15.4. TITAC-2. (Picture courtesy of the University of Tokyo.)
+
+Figure 15.5. ASPRO-216. (Picture courtesy of the TIMA Laboratory, CIS Group, IMAG (Grenoble).)
+
+
+different design style (quasi-delay insensitive). As may be apparent from the
+figure, TITAC-2 employed considerable manual layout.
+‘ASPRO-216’ (1998) [117] from IMAG in Grenoble is slightly different in
+that it is a 16-bit signal processor rather than a general-purpose microprocessor.
+More significantly its design was largely automated and synthesised from a
+CHP (Communicating Hardware Processes) [84, 118] description. This shows
+in the more ‘amorphous’ appearance of the processor, although the chip is
+dominated by its memories (figure 15.5).
+All the devices mentioned above have been research prototypes. Commer-
+cial take up of asynchronous processor technology has been slower; neverthe-
+less there are some significant examples.
+Philips Research Laboratories in Eindhoven have been developing the ‘Tan-
+gram’ [135] circuit synthesis system, primarily aimed at low-performance,
+very low power systems. This has been used to produce an asynchronous im-
+plementation of the 80C51 (1998) [144] which has been deployed in commer-
+cial pager devices where its low power and low EMI properties are particularly
+attractive. It is also intended for use in smartcard applications (see chapter 13
+on page 221).
+Although not strictly a microprocessor the Sharp DDMP (Data-Driven Me-
+dia Processor) (1997) [131] merits inclusion here. Intended for multimedia
+applications this provides a number of parallel processing elements which are
+employed or left idle according to the demand at any time. Asynchronous
+technology was attractive here because of the ease of power management.
+Finally the DRACO device (figure 15.3) was designed specifically for com-
+mercial use although not (yet) marketed due to company reorganisation. As
+a processor in a radio ‘base station’ the low EMI properties of asynchronous
+logic were the reason for the adoption of this technology.
+
+15.3. Processors as design examples
+
+Why build an asynchronous microprocessor? Part of the answer must be the
+various advantages of low power, low EMI etc. claimed for any asynchronous
+design and demonstrated in the commercial devices mentioned above, but why
+is a processor a good demonstration of these features?
+In many ways it is not. A better demonstrator of asynchronous advantages
+may well be an application with a regular structure which is amenable to very
+fine grain pipelining: some signal processing or network switching applica-
+tions have these characteristics. However there is a great deal of appeal in con-
+structing a microprocessor. Firstly, it is a well-defined and self-contained prob-
+lem; it is easy to define what a microprocessor should do and to demonstrate
+that it fulfils that specification. Secondly, it forces an asynchronous designer to
+confront and solve a number of implementation problems which might not oc-
+cur if a ‘tailor made’ demonstration was chosen. Lastly, it is often possible to
+compare the result with contemporary, synchronous devices in order to quan-
+tify and assess the results of the work. Of course, microprocessors are also an
+intensely fast-moving and competitive market in which it is hard to compete,
+even in a familiar technology!
+
+15.4. Processor implementation techniques
+
+15.4.1 Pipelining processors
+
+Pipelining [56] the operation of a device such as a microprocessor is an effi-
+cient way to improve performance. At its simplest, pipelining can subdivide a
+time-consuming operation into a number of faster operations whose execution
+can be overlapped. If done ‘perfectly’ a five-stage pipeline can speed up an
+operation by (almost) five times with only a small hardware cost in pipeline
+latches.
+A typical synchronous pipeline should divide the logic into equally timed
+slices. If there is an imbalance in the partitioning the slowest pipeline stage
+sets the clock period for the whole system; faster logic is slowed down by the
+clock resulting in some time wastage (figure 15.6). This is more pronounced
+if the timing of a particular pipeline stage is, in some way, variable or ‘data
+dependent’. Data dependencies can be quite common in a microprocessor: a
+simple example is an ALU operation, where a ‘move’ operation is faster than
+an addition because the former requires no carry propagation. A
+more ‘exaggerated’ example is a memory access where a cache hit is much
+faster than a cache miss.
+
+Figure 15.6. Synchronous pipeline usage (stages: Fetch, Decode, Evaluate, Transfer, Write; diagram labels: useful work, clock period).
+
+In most such cases the clock must allow for the slowest possible cycle. This
+either slows down the clock or causes the designer to invest considerable
+hardware in speeding up the worst-case operations, for example adding fast
+carry propagation mechanisms. Whilst the latter is a good trade if a high
+proportion of the operations are slow, this is poor economics if the worst-case
+operations are rare.
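The cost of an unbalanced synchronous pipeline can be put into numbers with a small sketch. The stage names follow figure 15.6, but the delays are hypothetical values chosen only for illustration.

```python
# Hypothetical worst-case stage delays in nanoseconds (not taken from the text).
stage_delay_ns = {
    "Fetch": 8.0,
    "Decode": 5.0,
    "Evaluate": 10.0,
    "Transfer": 6.0,
    "Write": 4.0,
}

# The slowest stage sets the clock period for every stage.
clock_ns = max(stage_delay_ns.values())

# Time each stage spends waiting for the clock edge in every cycle.
idle_ns = {name: clock_ns - delay for name, delay in stage_delay_ns.items()}

# Fraction of the total stage-time that is spent on useful work.
utilisation = sum(stage_delay_ns.values()) / (clock_ns * len(stage_delay_ns))

print(f"clock period {clock_ns} ns, utilisation {utilisation:.0%}")
for name, idle in idle_ns.items():
    print(f"{name:9s} idle {idle} ns per cycle")
```

With these made-up delays roughly a third of the available stage-time is idle, and a rare slow operation in any one stage would lengthen the clock period for all of them.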
+
+
+Another possible solution open to the synchronous designer is to allow cer-
+tain slow operations to occupy more than one clock cycle; this is clearly expe-
+dient when a cache miss occurs and the processor must idle until an off-chip
+memory can be read. However multi-cycle operations introduce the need for
+system-wide clock control to stall other parts of the system; even then the tim-
+ing resolution is still limited to whole clock cycles.
+Asynchronous pipelining is conceptually much easier. Not only is all con-
+trol localised, but it is implicitly adaptable to pipeline stages with variable de-
+lays. This means that rare, worst-case operations can be accommodated with-
+out either excessive hardware investment or a significant impact on the overall
+processing time by altering the delay of a pipeline stage dynamically. This,
+combined with the fact that operations can flow at their own rate rather than
+being stacked one to a stage, gives the pipeline considerable ‘elasticity’. In
+such a pipeline a cache miss still occupies a single cycle, although that cycle
+will be a particularly slow one!
+Another clear example of this is visible in the ARM instruction set [65]
+where data processing operations and address calculations may specify an
+operand shift before the ALU operation. In early ARM implementations this
+was contained in a single pipeline stage and hidden by the slower memory ac-
+cess time. To avoid a performance penalty more modern synchronous designs
+have the options of:
+
+an additional shifter pipeline stage (increases latency);
+
+stalling for an extra clock when a shift is required (increases complex-
+ity).
+
+An asynchronous pipeline can simply stretch the execution cycle when re-
+quired. As the additional time is unlikely to be as long as the ALU stage delay
+this is more flexible than either of the synchronous options, and the overall
+impact is small because shift/operate operations are quite rare.
+
+‘Average case’ performance. The elasticity of an asynchronous pipeline has
+led to the myth that an asynchronous pipeline can perform its processing in an
+‘average’ rather than worst case time for each pipeline stage. This is true only
+if the unit under consideration is kept constantly busy. This will not be true in
+the general case: on some occasions the unit will be ready before its operands
+and at other times it will have completed before subsequent stages can accept
+its output; in each case an interlude of idleness is enforced. This effect can be
+reduced by providing fast buffers between processing elements to allow some
+‘play’ in the timing, but true average case behaviour can only be achieved with
+buffers of infinite size. Any buffering increases the pipeline latency and should
+therefore be used with circumspection.
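The effect of buffering on ‘average case’ behaviour can be illustrated with a deliberately simple timing model. Everything here is an assumption made for the sketch: two stages, zero handshake overhead, a FIFO of `buffer_slots` entries between them, and a hypothetical delay pattern alternating between 3 and 1 time units in both stages (average 2, worst case 3).

```python
def finish_time(delays_a, delays_b, buffer_slots):
    """Time at which stage B completes the last item.

    Stage A processes item i, then deposits it. With no buffer it must hold
    the result until B is free; with a buffer it only needs a free slot.
    """
    n = len(delays_a)
    start_b = [0] * n  # time at which B starts item i
    done_b = [0] * n   # time at which B finishes item i
    a_free = 0         # time at which A may start its next item
    for i in range(n):
        a_done = a_free + delays_a[i]
        b_ready = done_b[i - 1] if i > 0 else 0
        if buffer_slots == 0:
            # Direct handshake: A waits until B can accept the item.
            deposit = max(a_done, b_ready)
        else:
            # A slot is free once item (i - buffer_slots) has left the buffer.
            j = i - buffer_slots
            deposit = max(a_done, start_b[j]) if j >= 0 else a_done
        start_b[i] = max(deposit, b_ready)
        done_b[i] = start_b[i] + delays_b[i]
        a_free = deposit
    return done_b[-1]


pattern = [3, 1] * 50  # 100 items, hypothetical delays
unbuffered = finish_time(pattern, pattern, buffer_slots=0)
buffered = finish_time(pattern, pattern, buffer_slots=1)
print(unbuffered, buffered)
```

In this model the unbuffered pipeline runs at close to the worst-case rate of 3 units per item, while a single buffer slot brings it close to the average of 2. A strictly periodic pattern is the easy case; with irregular delays more slots would be needed, which is the point made above about infinite buffers.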
+
+
+In practice an asynchronous pipeline should be balanced in a similar fashion
+to a synchronous pipeline. The difference is that occasional, time consuming
+operations can be accommodated without either pipeline disruption or signif-
+icant extra hardware. An added bonus is that the pipeline latency, especially
+when filling an ‘empty’ pipeline, can be reduced because the first instruction
+is not retarded by the clock at each stage (figure 15.6). The problem then de-
+volves into partitioning the system and solving the resulting problem of internal
+operand dependences.
+
+15.4.2 Asynchronous pipeline architectures
+
+Once an asynchronous pipeline has been developed it is very easy to add
+pipeline stages to a design. However, pipelining indiscriminately can be a Bad
+Thing, at least in a general-purpose processor. The major reason for this is
+the need to resolve dependencies [56], where one operation must gather its
+operands from the results of previous instructions; if many operations are in
+the pipeline simultaneously it is quite likely that any new instructions will be
+forced to wait for some of these to complete. Resolving dependencies in an
+asynchronous environment can be a relatively complex task and is discussed in
+a later section.
+A less obvious consequence is the increase in latency in the pipeline, not
+only due to added latch delays but because, in some circumstances, the pipeline
+needs to drain and refill. Consider a processor pipeline with a fast FIFO buffer
+acting as a prefetch buffer; this initially seems like a good idea as it can help
+reduce stalls due to (for example) occasional slow cycles in the prefetch (e.g.
+cache miss) and execution (e.g. multiplication) stages. However, at least in
+a general purpose processor, this benefit is masked because a single stall can
+fill up the prefetch buffer and, typically shortly thereafter, a branch requires it
+to be drained. This was an architectural error evidenced in Amulet1, which
+suffered a noticeable performance penalty due to its four-stage prefetch buffer.
+Of course in other applications, where pipeline flushes are rare, the ease of
+adding buffering can provide significant gains; however experience has shown
+that for a general purpose CPU a more conventional approach producing a
+reasonably balanced pipeline based around a known cycle time (such as the
+cache read-hit cycle time) is the ‘best’ approach.
+One definite advantage of an asynchronous pipeline is that the pipeline flow
+can be controlled locally. Consider that bane of the RISC architecture, the
+multi-cycle instruction; in a synchronous environment such an operation must
+be able to suspend large parts or all of the processing pipeline, necessitating a
+widespread control network. In an asynchronous environment the system need
+not be aware of operations in other parts of the system; instead a multi-cycle
+operation simply appears as a longer delay, possibly causing a stall if other
+processing elements require interaction with the busy unit(s).
+In the ARM architecture there is one case where this ability is very useful;
+the multiple register load and store operations (LDM/STM) can transfer an
+arbitrary subset of the sixteen current registers to or from memory. Amulet3
+implements this by generating several local ‘instruction’ packets for the exe-
+cution stages for a single input handshake. At this point it is likely that the
+prefetch will fill up the intervening latches and stall, but this is a natural con-
+sequence of the pipeline’s operation.
+There are other examples of this behaviour in Amulet3 (figure 15.7), notably
+the Thumb decoder which ingests 32-bit packets which can contain either one
+ARM instruction or two of the ‘compressed’, 16-bit Thumb instructions. In
+the latter case two output handshakes are generated for each input. This pro-
+vides an advantage over earlier (synchronous) Thumb implementations, which
+fetch instructions sixteen bits at a time, because it reduces the number of in-
+struction fetch memory cycles and, with a slow memory, uses the full available
+bandwidth. The power consumption in the memory system is also reduced
+commensurately.
+Local control is also possible in instructions such as ‘CMP’ (compare) which
+do not need to traverse the entire pipeline length; it is just as easy to remove a
+packet from the pipeline by generating no output handshakes as it is to generate
+extra packets. In Amulet3 comparison operations only affect the processor’s
+flags which reside in the ‘execute’ stage and therefore cause no activity further
+down the pipeline.
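The three cases above (several packets for one input, two for one, and none) share one idea: a stage may produce any number of output handshakes per input handshake. A minimal sketch using Python generators follows; the function names, the packet tuples and the choice of emitting the low halfword first are assumptions for illustration, not details of the Amulet3 design.

```python
def expand_ldm(register_mask):
    """One LDM/STM input yields one transfer packet per selected register."""
    for reg in range(16):
        if register_mask & (1 << reg):
            yield ("transfer", reg)


def split_fetch(word, thumb_mode):
    """One 32-bit fetch yields one ARM instruction or two Thumb instructions."""
    if thumb_mode:
        yield ("thumb", word & 0xFFFF)          # ASSUMPTION: low halfword first
        yield ("thumb", (word >> 16) & 0xFFFF)
    else:
        yield ("arm", word & 0xFFFFFFFF)


def after_execute(opcode):
    """A compare only updates flags, so nothing is passed further down."""
    if opcode != "CMP":
        yield ("writeback", opcode)


ldm_packets = list(expand_ldm(0b1000000000000110))  # registers 1, 2 and 15
thumb_packets = list(split_fetch(0x12345678, thumb_mode=True))
cmp_packets = list(after_execute("CMP"))
```

Each generator consumes one input and the caller simply takes however many outputs appear, much as a downstream stage only sees the handshakes that actually occur.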
+A final benefit of the localised control is that the pipeline operation can
+be regulated by any active stage. Both Amulet2 and Amulet3 have retrofitted a
+‘halt’ (until interrupted) instruction into the ARM instruction set (implemented
+transparently from an instruction which branches to itself). This can be de-
+tected and ‘executed’ anywhere within the pipeline with the same effect of
+causing the processor to stall. Indeed Amulet2 instigates halts in its execution
+unit whereas Amulet3 halts by suspending instruction prefetch, but the over-
+all effect is the same. Halting an asynchronous processor (or part thereof) is
+equivalent to stopping the clock in a synchronous processor and, in a CMOS
+process, can drop the power consumption to near zero. This facility therefore
+makes power management particularly easy and – in many cases – near auto-
+matic.
+
+15.4.3 Determinism and non-determinism
+
+Before examining specific architectural techniques which can be employed
+in an asynchronous processor it is worth considering something of the de-
+sign philosophies employed. The most significant is probably that of non-
+determinism because, unlike a synchronous processor, an asynchronous pro-
+cessor can behave non-deterministically and yet still function correctly.
+
+Figure 15.7. Amulet3 core organisation (blocks: Instruction Memory, Prefetch, Thumb, Decode & Register Read, Execute, Data Interface, FIFO, Reorder Buffer, Register Write, Data Memory; inputs: FIQ, IRQ; paths: branch addresses, forwarding, indirect PC load, store data, addr., load data, skip memory; annotation: may generate additional packets).
+
+An advantage in the analysis and design of a synchronous system is that the
+state in the next cycle can be determined entirely from the current state. This
+may also be true in an asynchronous system, but the timing freedom means
+that this is not the only choice of action. Within a small asynchronous state
+machine it is possible to achieve the same behaviour with internal transitions
+ordered differently (e.g. the inputs to a Muller C-element can change in any
+order) and this is also true on a macroscopic level. The first example used here
+is a processor’s prefetch behaviour, chosen because different philosophies have
+been chosen in different projects.
+All the Amulet processors have had a non-deterministic prefetch depth. This
+is achieved by allowing the prefetch unit to run freely, normally only con-
+strained by the rate at which the instruction memory is able to accept instruc-
+tions. In order to branch the prefetch process is ‘interrupted’ and a new pro-
+gramme counter value inserted; this is an asynchronous process which requires
+arbitration and therefore happens at a non-deterministic point.
+An alternative approach, for example used in the ASPRO-216 processor
+[117], is to prefetch a fixed number of instructions.
+This can be done by
+prompting the prefetch unit to begin a new fetch for each instruction which
+completes. If a branch is required this can be signalled, if not this too is sig-
+nalled. In effect the processing pipeline becomes a ring in which a number of
+tokens are circulated and reused. (See also section 3.8.2 on page 39.)
+Having a deterministic prefetch simplifies certain tasks, notably dealing
+with speculative prefetches and branch delay slots. As it is possible to say
+exactly how many instructions will be prefetched following a branch these can
+be processed or discarded with a simple counting process. However keeping
+tokens flowing around a ring places an extra constraint on the elasticity of the
+pipeline which – in some circumstances – may sacrifice performance.
+With a non-deterministic prefetch depth it is possible to have fetched zero
+or more instructions speculatively, although there will be an upper bound set
+when the pipeline ‘backs up’. In an architecture without delay slots (such
+as ARM) the lower limit is not a problem, but some means other than in-
+struction counting must be provided to indicate that the prefetch stream has
+changed. The Amulet processors do this by ‘colouring’ the prefetch streams.
+To illustrate: imagine the processor prefetches a stream of ‘red’ instructions.
+Eventually, as these are executed, a branch is encountered which causes the
+execution unit to request prefetches starting at a new address and in a different
+colour (‘green’, say). Subsequent red instructions must then be discarded, the
+first green instruction being the next operation completed. A subsequent green
+branch will cause another colour change; because all the former red operations
+must have been flushed at this point it is possible to switch back to red, thus
+only two colours (i.e. a single colour bit) are required.
+
+Figure 15.8. Branch deadlock and its prevention: (a) deadlock; (b) deadlock avoided with extra latch.
+
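The colouring scheme can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the names `Fetched` and `execute` are hypothetical, the stream is presented as a simple list in arrival order, and nothing here reflects the actual Amulet circuitry. It shows how a single colour bit lets the execution unit discard an unknown number of speculatively prefetched instructions.

```python
from dataclasses import dataclass


@dataclass
class Fetched:
    """A prefetched instruction tagged with the colour bit of its stream."""
    name: str
    colour: int            # 0 or 1: the single colour bit
    is_branch: bool = False


def execute(stream):
    """Return the names of the instructions that complete.

    `stream` is the order in which instructions reach the execution unit.
    A taken branch flips the expected colour, so anything still in flight
    from the old stream no longer matches and is discarded.
    """
    expected = 0
    completed = []
    for instr in stream:
        if instr.colour != expected:
            continue               # wrong colour: speculative, discard
        completed.append(instr.name)
        if instr.is_branch:
            expected ^= 1          # the target stream arrives in the other colour
    return completed


stream = [
    Fetched("a", 0),
    Fetched("b_branch", 0, is_branch=True),
    Fetched("c", 0),               # prefetched past the branch
    Fetched("d", 0),               # the number of these is not known in advance
    Fetched("t1", 1),
    Fetched("t2_branch", 1, is_branch=True),
    Fetched("t3", 1),              # stale again
    Fetched("u1", 0),              # safe to reuse the first colour
]
completed = execute(stream)
print(completed)
```

The model relies on the same property as the text: by the time the second branch is taken, every instruction of the original colour has already been flushed, so reusing that colour cannot resurrect a stale instruction.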
+Before leaving this issue there is one other, less obvious, problem with a
+non-deterministic prefetch depth which can cause deadlock if not considered
+in the design. In this architecture the act of branching uses an arbiter to insert
+a token into the processor’s pipeline. If the pipeline is already full – and there
+is nothing to limit this – then the arbiter cannot acknowledge the operation
+and thus the pipeline deadlocks (figure 15.8(a)). Because a branch could occur
+when the pipeline is not full the deadlock is not inevitable, but it could happen
+each time a branch is attempted.
+This problem is easily solved; an extra latch which is normally empty can
+decouple the branch operation from the main pipeline flow (figure 15.8(b)),
+leaving ‘normal’ operation to continue. As a second, valid branch cannot be
+taken until after the first has been accepted and its target fetched and decoded,
+a single latch will always prevent deadlocks here.
+This class of problem is generic. A manifestation of a similar problem be-
+came evident early in the design of Amulet1 where the instruction fetch com-
+peted non-deterministically for the bus with data loads and stores. In a situation
+where the processing pipeline is full it is possible that an instruction fetch can
+occupy the memory bus but be unable to complete because there is no latch
+ready for the result. If a data transfer is pending then it is blocked, resulting
+in the pipeline remaining full (figure 15.9(a)). This can be rectified if the in-
+struction fetch is throttled so that it cannot gain the bus until it is known that
+it can relinquish it again (figure 15.9(b)). The converse of the problem does
+not occur because the data transfer, once started, cannot be stalled, so only
+instruction prefetch requires throttling.
+
+Figure 15.9. Bus contention deadlock and its prevention: (a) deadlock; (b) deadlock avoided by throttling prefetch.
+
+Amulet3, with its separate instruction and data buses, does not exhibit this
+problem within the processor, although the memory system can still suffer an
+analogous deadlock. This is described further in the section on memory.
+If a deterministic, asynchronous solution is preferred this could be imple-
+mented by adding a data transfer phase to every instruction. There is still an
+asynchronous advantage here in that an ‘unwanted’ data access would be very
+fast but there is a price in limiting the adaptability of the processor’s pipeline.
+
+Counterflow Pipeline Processor.
+Although most asynchronous micropro-
+cessor designs have a ‘conventional’ architecture (other than lacking a clock!)
+it may be practicable to implement radically different processor structures, and
+several have been studied. One interesting – and highly non-deterministic –
+idea is the Counterflow Pipeline Processor (CFPP) [126] in which instructions
+flow freely along a pipeline containing processing units towards the register
+bank while operands flow equally freely towards them. This is intended to re-
+duce stalls in waiting for operand dependences between instructions; as well
+as evaluating and carrying its result to completion an instruction can cast the
+result backwards to be picked up as required by subsequent operations.
+Whilst expensive in the number of buses required, the CFPP allows con-
+siderable flexibility in the number and arrangement of functional units (fig-
+ure 15.10). Functional units may be ALUs of full or limited function, mem-
+ory access units, multipliers etc. The only rule is that the operation must be
+attempted by one of the available units, so stalls are only needed if an uncom-
+pleted operation reaches the last appropriate functional unit still without all its
+operands.
+
+
+Figure 15.10. Principle of the Counterflow Pipeline Processor. (Diagram not reproduced. Labels recovered from the figure: Fetch, Instructions, Operands, ALU/... (three stages), Registers, Branch, Results.)
+
+To ensure that an instruction does not miss any of its operands, passing
+in the other direction, a degree of synchronisation is needed at each pipeline
+stage. Because there is no ‘favoured’ direction in the pipeline an arbiter is
+required at each stage to ensure that two packets do not ‘cross over’ without
+checking each other’s contents. Because every movement of both instructions
+and data requires an arbitration, fast, efficient arbiters are essential for such a
+performance architecture.
+Of course deepening the pipeline to accommodate many functional units
+increases the penalty due to branches – which need to propagate ‘backwards’
+to the fetch stage – so good branch prediction is important.
+
+Arbitration and deadlock.
+It is possible to build an asynchronous processor
+which is as deterministic in its operation sequences as its synchronous coun-
+terpart (i.e. deterministic except for such events as interrupts). Alternatively it
+is possible to make a highly non-deterministic processor.
+Each scheme has both advantages and disadvantages. Enforcing synchroni-
+sation could lead to a reduction in performance; for example the memory in-
+terface may not begin a prefetch until told that an instruction does not require
+the memory for a data transfer, with a consequent reduction in available band-
+width. On the other hand the predictability of the system’s behaviour can make
+testing significantly easier by reducing the reachable state space of the system.
+The decision as to what (if anything) should be allowed to be non-deterministic
+is a decision for the designer which must be reviewed in the particular circum-
+stances. However it must be remembered that every non-determinism has an
+associated arbiter (which, theoretically, can require an infinite resolution time)
+and is likely to introduce a potential deadlock which must be identified and
+prevented.
+In the general case a good rule for avoiding deadlock is to examine carefully
+any instances of arbitration for a ‘shared’ resource (such as the bus) and ensure
+that no unit can be granted access until it is certain that it can relinquish it
+again regardless of the behaviour of other parts of the system. Each arbiter
+increases the number of states reachable by the processor and makes the design
+problem harder, but it increases the system’s flexibility. Non-determinism can
+be beneficial if used with caution.
+
+15.4.4 Dependencies
+
+When a processor is pipelined beyond a certain depth it is necessary to in-
+sert interlocks to ensure that dependences between instructions are satisfied
+and programmes execute correctly. Even devices such as the MIPS R3000
+[72] – which was a ‘Microprocessor without Interlocked Pipeline Stages’ – is
+interlocked in the sense that the programmer/compiler could use a clock cy-
+cle count to ensure correct operation; an expedient which is disqualified in an
+asynchronous environment. Similar constraints are applied in the ARM archi-
+tecture.
+
+PC pipeline.
+ARM does not use the clock explicitly in the way MIPS does,
+but there is one aspect of the architecture which is similar. The Programme
+Counter (PC) is available to the programmer in the general-purpose register
+set and when it is read the value obtained is the address of the current in-
+struction plus two instructions. This is a historical consequence of the early
+ARM implementations where there were two clock cycles between generating
+the instruction’s address and executing the operation. Compatibility with this
+must be maintained, even in an asynchronous processor where the prefetch and
+execution are autonomous.
+
+Figure 15.11. PC pipeline. (Diagram not reproduced. Labels recovered from the figure: Fetch, Instruction memory, Decode, Reg. Read, PC pipeline.)
+
+Because the generation and subsequent, possible use of the PC are unsyn-
+chronised in an Amulet processor a method of transmitting the value must be
+found. To do this all Amulet processors have maintained a copy of the PC with
+each fetched instruction (figure 15.11). These flow down the pipeline with the
+instruction and can be read whenever the PC may be needed. The PC may be
+required explicitly as an operand or address, implicitly in branch instructions,
+or to find a return address in the case of exceptions such as interrupts or mem-
+ory aborts. Different Amulet cores have varied the exact value (e.g. PC+8)
+held with this ‘PC pipeline’ in attempts to minimise the later calculation over-
+heads, but in Amulet3 the PC is held without any premodification which allows
+any of the required values PC+2, PC+4, PC+8 to be calculated with a simple
+incrementer.
+It is worth noting that the PC values need not be bundled with the instruction
+directly. The PC pipeline can be a separate, asynchronous pipeline from the
+instruction fetch which can have a different depth, providing that a ‘one in,
+one out’ correspondence is maintained. This is a feature which is exploited in
+Amulet3 to throttle the prefetch unit and prevent instruction fetches causing a
+deadlock; this mechanism is described more fully in section 15.5.2.
+
+Register dependencies.
+The greatest register dependency problems are read-
+after-write dependencies. One of these occurs in the case of a fragment of code
+such as:
+
+    LDR   R1, [R0]       ; Load ...
+    ADD   R2, R1, R3     ; ... and read
+
+In this example it is essential that the value is (at least apparently) loaded
+into R1 before the subsequent instruction uses it. As soon as the execution
+path is pipelined there is a risk that this will not be assured and this uncertainty
+is increased in an asynchronous device where the load could take an arbitrary
+time.
+Three solutions for ensuring register bank dependencies are satisfied are
+given below.
+
+Don’t pipeline.
+
+Lock.
+
+Forward.
+
+The first solution was the approach taken in the earliest, synchronous ARM
+implementations. This involves reading register operands, performing an oper-
+ation and writing the result back in a single cycle so that a subsequent operation
+always has a ‘clean’ register view. This is simple but makes the evaluation cy-
+cle very long and is unacceptable in a high-performance processor.
+A locking approach allows selective pipelining of the instruction execution
+by retarding instructions whose operand registers are not yet valid. This in-
+volves setting a ‘lock’ flag when an instruction is decoded which wishes to
+modify a register and clearing the flag again when the value finally arrives. A
+subsequent instruction can test the locks on its operands and be delayed until
+they are all clear. This mechanism is eminently suited to an asynchronous im-
+plementation because a stalled instruction is simply caused to wait, which can
+be done without recourse to arbitration.
+In practice it is convenient to allow more than one result to be outstanding
+for a single register. Partly this is a consequence of the ARM’s extensive use
+of conditional instructions, such as:
+
+    CMP   R1, R2         ; Set flags
+    MOVNE R0, #-1        ; If R1 ≠ R2
+    MOVEQ R0, #1         ; If R1 = R2
+
+In such a case a single lock flag (on R0 in this instance) is inadequate and
+some form of semaphore is needed. It turns out that the operation of such
+a semaphore is fairly simple to implement in an asynchronous environment
+provided that testing and incrementing are mutually exclusive. At issue time
+an instruction therefore:
+
+attempts to read its operands and waits until they are all unlocked, then
+
+locks its register destination(s) by incrementing the semaphore.
+
+The semaphore is decremented again when the result is returned; this action
+may take the semaphore to zero and, possibly, free up a waiting instruction.
+This can happen at any time.
+The example above illustrates another potential problem in an asynchronous
+system: the two ‘MOV’ operations are mutually exclusive and so only one will
+be executed. As this is not known at issue time both have incremented the
+semaphore and so both must decrement it, otherwise R0 will be permanently
+locked. In general if a speculative operation is begun it must complete – the
+‘write’ operation therefore always takes place, although sometimes the register
+contents are not changed and only the unlocking is performed.
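The semaphore behaviour described above can be sketched as a per-register counter. This is a sequential model with hypothetical names (`RegisterLocks`, `issue`, `write_back`), not the lock FIFO hardware: it assumes that testing and incrementing never overlap, which the text states as the required condition, and it returns `False` where the real pipeline would simply wait.

```python
class RegisterLocks:
    """Counting lock per register: incremented at issue, decremented at write."""

    def __init__(self, n_regs=16):
        self.pending = [0] * n_regs    # outstanding results per register
        self.values = [0] * n_regs

    def issue(self, sources, dests):
        """Read-then-lock. Returns False if the instruction must wait."""
        if any(self.pending[r] for r in sources):
            return False               # an operand is still locked
        for d in dests:
            self.pending[d] += 1       # lock each destination
        return True

    def write_back(self, dest, value, valid=True):
        """Always unlock; only change the register if the result is valid."""
        if valid:
            self.values[dest] = value
        self.pending[dest] -= 1


locks = RegisterLocks()
locks.issue(sources=(1, 2), dests=())        # CMP   R1, R2
locks.issue(sources=(), dests=(0,))          # MOVNE R0, #-1
locks.issue(sources=(), dests=(0,))          # MOVEQ R0, #1
outstanding = locks.pending[0]               # two results pending on R0

reader = dict(sources=(0,), dests=(3,))      # some later instruction reading R0
issued_before = locks.issue(**reader)        # stalls: R0 is locked twice
locks.write_back(0, -1, valid=False)         # MOVNE failed its condition
issued_midway = locks.issue(**reader)        # still stalls: one result pending
locks.write_back(0, 1, valid=True)           # MOVEQ delivers the value
issued_after = locks.issue(**reader)         # R0 is now unlocked
```

Note that the failed `MOVNE` still performs its `write_back`; skipping it would leave the count on R0 at one and lock the register permanently, which is the hazard the text warns about.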
+The principle of semaphores was designed and implemented as the ‘lock
+FIFO’ in Amulet1 and Amulet2. In these processors the semaphore also held
+the destination register addresses to avoid carrying them with the instruction.
+As the instructions flowed (largely) in order, results and destinations could be
+paired at write time.
+The lock FIFO was implemented as an asynchronous pipeline as shown in
+figure 15.12. Because the cells in each latch (horizontal) are transparent latches
+the entries are copied from one to the next, thus ensuring the outputs of the OR
+gates are glitch free. These can then be used to stall any read operations on
+the particular register until the write is complete. The only hazard would be
+if a register was actively locked whilst a read was being attempted; this is
+prevented by a sequencing of read-then-lock by the instruction decoder.
+
+
+Figure 15.12. Lock FIFO. (Diagram not reproduced. It shows rows of decoded register bits held in FIFO latches, a ‘Lock indicator’ output for each register, and the FIFO Control.)
+
+Unlocking can happen safely at any time, relative to both the reading and the
+locking of other registers. The destination address, in its decoded form, is
+already available at the bottom of the data FIFO.
+
+Reorder buffer.
+Although the lock FIFO works successfully it can intro-
+duce inefficiency in that it enforces in-order completion on the instructions
+and stalls each instruction until its operands are available. Therefore it is an
+effective, cheap method to guarantee functionality but is less than ideal for
+high-performance architectures.
+In Amulet3 register dependencies are resolved using an asynchronous re-
+order buffer [66, 51]. Whilst the major incentive for this was to facilitate a less
+intrusive page fault mechanism (see below) this allows instructions to complete
+in an arbitrary order and results can be forwarded at any time. It is therefore a
+significant step towards a complete out-of-order asynchronous processor.
+
+Figure 15.13. Reorder buffer position in Amulet3. (Diagram not reproduced. Labels recovered from the figure: Look-up, ALU, Memory, Registers, Reorder, Read, Allocate, Arrival, Forward, Writeback.)
+
+Table 15.1. Reorder buffer allocation.
+
+Instruction              Slot0   Slot1   Slot2   Slot3
+LDR   R0, [R2]           R0      ?       ?       ?
+MOV   R4, #17            R0      R4      ?       ?
+LDR   R7, [R0+4]!        R0      R4      R0      R7
+ADD   R7, R7, R4         R7      R4      R0      R7
+CMP   R3, R4             R7      R4      R0      R7
+ADDNE R7, R7, R6         R7      R7      R0      R7
+SUB   R1, R7, R0         R7      R7      R1      R7
+
+The reorder buffer is positioned between the various data processing units
+and the register bank (figure 15.13) and crosses several timing domains. An
+instruction first encounters the reorder buffer control at decode time when its
+operands may be discovered and forwarded; any results are allocated space
+at this time. Subsequently an instruction may be subdivided (and, in princi-
+ple, issued at any time) and results can arrive separately and independently.
+Operands can be forwarded any number of times between when they arrive
+and when they are overwritten. Finally an ordered writeback occurs where the
+results are retired and the reorder buffer slots freed for reallocation.
+In the decode phase the operand register numbers are compared with a small
+content addressable memory (CAM) to determine which reorder buffer slots
+may contain appropriate values for forwarding. This list always terminates
+with the register itself, so a value is always available from somewhere. Once
+this map is known the reorder buffer slots can safely be reassigned.
+The assignment process cyclically associates each reorder buffer entry with
+a particular register address and the instruction packet carries forward just the
+reorder buffer ‘slot’ number. Consider the following code fragment:
+
+    LDR   R0, [R2]       ;
+    MOV   R4, #17        ;
+    LDR   R7, [R0+4]!    ;
+    ADD   R7, R7, R4     ;
+    CMP   R3, R4         ;
+    ADDNE R7, R7, R6     ;
+    SUB   R1, R7, R0     ;
+
+Assuming that the reorder buffer has four entries and the next free entry
+is (arbitrarily) 0, the reorder buffer assignment will proceed as shown in ta-
+ble 15.1. In each case the italicized entry is the latest one to have been as-
+signed.
+
+Note:
+
+the second load (LDR) operation uses the ARM’s base writeback mode
+and therefore requires two destinations;
+
+the comparison has no register destinations;
+
+the same register address can appear multiple times in the reorder buffer;
+
+a slot is assigned even if the instruction is conditional (ADDNE) and
+may not produce a valid result.
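The cyclic assignment can be reproduced with a few lines of code. This is a minimal sketch: the function name `allocate` and the representation of each instruction as a (text, destination list) pair are assumptions for illustration, and the model ignores the wait for a slot to become free. The destination order for the base-writeback load (base register first) is taken from table 15.1.

```python
def allocate(program, n_slots=4, next_free=0):
    """Assign reorder buffer slots cyclically, one per destination register.

    Returns the slot map after each instruction and the next free slot.
    """
    slots = [None] * n_slots
    history = []
    for text, dests in program:
        for reg in dests:                       # slots are assigned serially
            slots[next_free] = reg
            next_free = (next_free + 1) % n_slots
        history.append((text, list(slots)))
    return history, next_free


program = [
    ("LDR   R0, [R2]",    ["R0"]),
    ("MOV   R4, #17",     ["R4"]),
    ("LDR   R7, [R0+4]!", ["R0", "R7"]),        # base writeback: two destinations
    ("ADD   R7, R7, R4",  ["R7"]),
    ("CMP   R3, R4",      []),                  # no register destination
    ("ADDNE R7, R7, R6",  ["R7"]),              # slot assigned even if conditional
    ("SUB   R1, R7, R0",  ["R1"]),
]

history, next_free = allocate(program)
for text, slot_map in history:
    print(f"{text:<20}", slot_map)
```

Each printed row corresponds to one row of table 15.1, with `None` standing for a slot that has not yet been assigned.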
+
+The instruction decoder still retains the reorder buffer map prior to the start
+of the instruction. In parallel with the reassignment it proceeds as follows,
+using the final instruction as an example:
+
+The locations of the appropriate registers are examined; these may be in
+the process of being reassigned but cannot be overwritten yet because
+the instruction execution has not begun.
+
+For R7 try: slot 1, slot 0, slot 3, register bank.
+
+For R0 try: slot 2, register bank.
+
+Try to read from each location in the list until a value is obtained.
+
+Note that a list of possibilities is required because the assigned slots need
+not contain a valid value. The most obvious cause of invalidation in an ARM
+is an instruction which fails its condition code check (e.g. the value of R7 in
+slot1), but other conditions – such as a preceding branch – can also result in an
+instruction being abandoned.
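The search lists quoted above follow from scanning the slot map from the most recently assigned slot backwards. The sketch below is illustrative only: `forwarding_slots` and `read_operand` are hypothetical names, an invalid result is modelled as `None`, and every result is assumed to have arrived already (the real mechanism would wait for a pending one).

```python
def forwarding_slots(slot_map, next_free, reg):
    """Slots that may hold `reg`, most recently assigned first."""
    n = len(slot_map)
    candidates = []
    for age in range(1, n + 1):
        slot = (next_free - age) % n            # step back through the ring
        if slot_map[slot] == reg:
            candidates.append(slot)
    return candidates


def read_operand(reg, slot_map, next_free, results, register_bank):
    """Try each candidate slot in turn, falling back to the register bank."""
    for slot in forwarding_slots(slot_map, next_free, reg):
        if results[slot] is not None:           # None models an invalid result
            return results[slot]
    return register_bank[reg]


# Map retained by the decoder before SUB R1, R7, R0 is assigned slot 2.
slot_map = ["R7", "R7", "R0", "R7"]
next_free = 2

r7_slots = forwarding_slots(slot_map, next_free, "R7")
r0_slots = forwarding_slots(slot_map, next_free, "R0")

# Made-up values: slot 1 (the ADDNE result) is invalid, so slot 0 is used.
results = [40, None, 8, 3]
bank = {"R0": 0, "R7": 0}
r7_value = read_operand("R7", slot_map, next_free, results, bank)
```

With the map from table 15.1 this yields slots 1, 0, 3 for R7 and slot 2 for R0, matching the lists in the text, with the register bank as the final fallback in both cases.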
+
+Figure 15.14. Register processes in the Amulet3 decode unit. (Diagram not reproduced. Labels recovered from the figure: Read CAM, Read register, Forward, Assign slot.)
+
+The flow of control through the decode phase is summarised in figure 15.14.
+Note that the forwarding time can vary depending on external factors while the
+slot assignment time depends on the number of slots which are required (from
+zero to two) and (occasionally) may have to wait for a slot to be available. Re-
+order buffer slots are assigned serially, even within a single instruction which
+simplifies the asynchronous implementation. There is a small performance
+impact on more ‘complex’ instructions, but these are relatively rare.
+
+Subsequent to assignment the instruction packets proceed to further stages
+carrying their reorder buffer slot number. Although Amulet3 issues only sin-
+gle instructions in order the ARM instruction set effectively allows two semi-
+independent operations via the internal execution unit and via the external data
+interface (figure 15.15). Each of these may produce a result at any time so each
+has its own port into the reorder buffer. Whilst these inputs are asynchronous
+they are guaranteed to target different slots and can therefore be independent.
+
+Figure 15.15. Sub-instruction routing. (Diagram not reproduced. Panels: Internal operation; Data Load; LDM − first cycle; LDM − subsequent cycles. Each panel shows routing between the Decode, Execute and Data Int. stages.)
+
+The writeback process simply copies results back to the register bank. On its
+arrival each result ‘fills’ a reorder buffer slot. The ‘writeback’ process therefore
+waits until a particular slot is ready, copies it out, and moves to the next one.
+The fact that the slots may become ready in an arbitrary order is not a concern.
+Superimposed on this process is a forwarding mechanism which waits until
+a result is ready and copies it back into the decode stage. This process can hap-
+pen before, during or after the writeback process; in fact the processes are asyn-
+chronous and concurrent. The key is that both processes use non-destructive
+copying and therefore leave the data available for the other as required. A
+result in the reorder buffer remains until it is overwritten.
+Either of the above processes may be required to wait if they need a result
+which has not yet arrived. In order to control this in a non-interacting way there
+are two separate ‘flags’ which indicate the presence of data in a slot. One flag
+is raised to indicate that the slot is ‘full’; this is cleared when the writeback
+process has copied the result to the register bank. This is at the heart of the
+control circuit shown in figure 15.16 (which will be described shortly). The
+second flag (‘Fcol’) is merely changed to indicate the arrival of a result. As the
+state of this flag alternates on each pass through the cyclic reorder buffer it is
+possible for a forwarding request to test to see if a result has arrived without
+changing the flag’s state (table 15.2). This is essential because a value may be
+forwarded zero or more times, the number not being known at issue time.
+
+Figure 15.16. Reorder buffer copy-back circuit. (Diagram not reproduced. Signal labels recovered from the figure: Result_req, Result_ack, WPr_n, WPa_n, Full_n, Fcol, Try, Match, Token_in, Token_out, T_n, T_n+1, Rout_n, Aout, Write back; three C-elements.)
+
+Table 15.2. ‘Fcol’ state as results arrive in a reorder buffer.
+
+Result number    0        1        2        3
+0                0 → 1    0        0        0
+1                1        0 → 1    0        0
+2                1        1        0 → 1    0
+3                1        1        1        0 → 1
+4                1 → 0    1        1        1
+5                0        1 → 0    1        1
+6                0        0        1 → 0    1
+7                0        0        0        1 → 0
+8                0 → 1    0        0        0
+
+As a final embellishment a result returned to the reorder buffer may be in-
+valid (due, for example, to a condition code failure). Each slot is therefore
+accompanied by a flag which can prevent the data being written to the register
+bank (the ‘full’ flag is still cleared) or being forwarded. In the latter case fur-
+ther forwarding may be attempted from a less recent result, culminating in the
+use of the default register value.
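The alternating ‘Fcol’ flag can be modelled in a few lines. This sketch assumes a four-slot buffer and results arriving in order, as in table 15.2; the class and method names are hypothetical, and deriving the expected colour from a pass count is an assumption about how a requester might track it rather than a description of the Amulet3 logic.

```python
class ReorderSlot:
    """One reorder buffer slot, reduced to its forwarding colour bit."""

    def __init__(self):
        self.fcol = 0

    def result_arrives(self):
        self.fcol ^= 1                 # the flag is toggled, never cleared

    def result_present(self, pass_number):
        """Non-destructive test: has the result for this pass arrived?"""
        expected = (pass_number + 1) % 2   # pass 0 expects 1, pass 1 expects 0
        return self.fcol == expected


slots = [ReorderSlot() for _ in range(4)]
states = []
for result in range(9):                # results 0..8, as in table 15.2
    slots[result % 4].result_arrives()
    states.append([s.fcol for s in slots])

# Result 8 (third pass, slot 0) has arrived; result 9 (slot 1) has not.
slot0_ready = slots[0].result_present(pass_number=2)
slot1_ready = slots[1].result_present(pass_number=2)
# Repeating the test does not disturb the flag, so any number of
# forwarding requests can be served from the same slot.
slot0_again = slots[0].result_present(pass_number=2)
```

Because the test only compares the flag with an expected value, forwarding needs no record of how many requests will be made, in contrast to the ‘full’ flag, which is consumed once by the writeback process.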
+The writeback circuit (figure 15.16) is a good example of the working of
+such asynchronous control circuits. It operates as follows.
+
+The arrival of a result (top left) causes the ‘Full’ bit to be set and ac-
+knowledges itself (top right). The request comes from one of a number
+of mutually exclusive sources.
+
+This event also toggles the forwarding colour (‘Fcol’) which allows any
+instructions issued subsequent to this one to use the result.
+
+Note that the input can be on any of the mutually exclusive input chan-
+nels; two are shown here but more are possible.
+
+The input request can be removed at any time leaving the slot marked as
+‘Full’.
+
+When it becomes this slot’s turn to copy back a token is received (bot-
+tom left) which initiates an output request (bottom centre). If the token
+is received first the circuit waits for the result to arrive. The token is
+acknowledged.
+
+This process also allows the ‘Full’ bit to be reset, waiting for the input
+request to fall if it has not already done so.
+
+When the copy back is complete the (broadcast) acknowledge is picked
+up by the circuit to complete the token input handshake and pass the
+token to the next, similar circuit to the right. The next slot cannot attempt
+to output until the four-phase output acknowledge is complete.
+
+These slots are connected in a ring and reset so that there is a single token
+present for the first stage to output after which it runs freely, emptying each
+slot in turn when it contains a result.
+The experimental (and often unsystematic!) way in which Amulet3 was de-
+signed meant that the original state machine was defined and refined to a ver-
+sion close to that shown in figure 15.16 as a schematic and only subsequently
+subjected to a more formal analysis. The analysis was carried out using Petrify
+(which was introduced in section 6.7 on page 102).
+At first this caused problems due to the choices within the system. The
+output channel ‘calls’ the register bank and it was expedient to broadcast the
+output acknowledge to all the copy back circuits and use the (unique) active
+request to sensitize the relevant location. This proved hard to model in a sin-
+gle subcircuit. However on a suggestion from one of the Petrify developers
+(Alex Yakovlev) it proved easier to model the whole system than a single part.
+The resultant signal transition graph (STG) (see figure 15.17) clearly shows the
+four implemented subcircuits which pass control around cyclically. The rest of
+the processor is abstracted in the four small rings which, when an output hand-
+shake has been completed, can reset the circuit to ‘Full’ again at an arbitrary
+time.
+The analysis with Petrify both verified the circuit’s operation and removed
+a redundant transistor in each subcircuit.
+There are two hazards in the asynchronous processes as described here. The
+first is that there is no local way of preventing a slot being overwritten before
+it has been emptied. In Amulet3 this is guaranteed elsewhere by ensuring that
+a slot is never assigned until it has been freed by the copy back process. In
+effect a count of free slots is maintained which is decremented when a slot is
+assigned and incremented when it is released again. Because assignment and
+release are both cyclic and in-order it is not necessary to pass individual slot
+identification, the presence of a token is adequate. This ‘throttle’ is imple-
+mented as a simple, dataless FIFO which also acts as an asynchronous buffer
+between the unsynchronized processes. This is shown in figure 15.18 for a
+system with four reorder buffer slots; in the state shown one result has been
+produced but not yet committed to the register bank, two more are being gen-
+erated and the decoder could issue another instruction which generated a single
+result.
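The throttle amounts to a count of free slots. The following sketch models it sequentially with hypothetical names (`SlotThrottle`, `try_assign`, `release`); in the real design the tokens sit in a dataless asynchronous FIFO and the decoder waits rather than receiving a refusal.

```python
class SlotThrottle:
    """Counts free reorder buffer slots; no slot identities are needed."""

    def __init__(self, n_slots=4):
        self.n_slots = n_slots
        self.free_tokens = n_slots

    def try_assign(self, count=1):
        """Claim `count` slots at decode time; False means the decoder waits."""
        if count > self.free_tokens:
            return False
        self.free_tokens -= count
        return True

    def release(self):
        """Return one token when the copy-back process has emptied a slot."""
        if self.free_tokens >= self.n_slots:
            raise RuntimeError("released more slots than were assigned")
        self.free_tokens += 1


# The state described for figure 15.18: one result awaiting writeback
# and two more being generated, leaving a single free slot.
throttle = SlotThrottle(n_slots=4)
throttle.try_assign(1)
throttle.try_assign(2)
free_now = throttle.free_tokens

two_results = throttle.try_assign(2)   # e.g. a load with base writeback: must wait
one_result = throttle.try_assign(1)    # a single-result instruction can issue
throttle.release()                     # writeback retires the oldest result
after_release = throttle.try_assign(1)
```

A plain counter suffices because assignment and release are both cyclic and in order, so the position of each free slot is implied.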
+The other hazard arises because the forwarding and writeback processes are not
+synchronised: the register value which is read (as a default) could be
+changed during the read process, resulting in ‘random’ data being read. How-
+ever this can only happen if there is a valid value for that register in the reorder
+buffer and therefore it is certain that this value will be forwarded in preference
+to the register contents. Providing that the implementor can ensure that the
+‘random’ data does not cause problems within the circuit, the mechanism is
+secure against this asynchronous interaction.
+Studies of ARM code indicated that for the Amulet3 architecture a reorder
+buffer with five or more entries was unlikely ever to fill [50]. Amulet3 therefore
+implements a four entry buffer; however the mechanism described is extensible
+to any chosen size.
+
+Figure 15.17. An STG for all four reorder buffer token-passing circuits. (Diagram not reproduced. Transitions shown involve the signals Rout1–Rout4, Full1–Full4, T1–T4, WPr1–WPr4, WPa1–WPa4 and Aout.)
+
+Figure 15.18. Token passing throttle on reorder buffer. (Diagram not reproduced. Labels recovered from the figure: Decode, Execute, Memory, Reorder.)
+
+15.4.5 Exceptions
+
+An ‘exception’ is an unexpected event which occurs in the course of running
+a programme. The Amulet processors are compatible with the ARM instruc-
+tion set and, therefore, the types of exceptions and the behaviour when they
+occur are predefined. Ignoring reset, ARM has six types of exception:
+
+Prefetch abort - a memory fault (e.g. page fault) during an instruction
+prefetch;
+
+Data abort - a memory fault during a read or write transfer;
+
+Illegal instruction - an emulator trap;
+
+Software interrupt - a system call (not really an exception, but has similar
+behaviour);
+
+Interrupt - a normal, level sensitive interrupt;
+
+Fast interrupt - similar to normal interrupts, but higher priority.
+
+Of these the majority are quite easy to deal with: software interrupts and
+illegal instructions can be detected at instruction decode time, as can prefetch
+aborts (when there is no instruction to decode); interrupts are imprecise and
+therefore can be inserted anywhere; only data aborts have the ability to cause
+serious problems because they become evident only after the instruction has been
+decoded and started to execute.
+
+Interrupts.
+In any processor an interrupt is an asynchronous event. In one
+sense the arrival of an interrupt can be thought of as the insertion of a ‘call’ in-
+struction in the normal sequence of instruction fetches. At first glance it would
+seem that simply arbitrating an extra ‘instruction’ packet into the instruction
+stream would suffice; however this simplistic view can cause problems.
+The chief problem is that there is some interaction between the interrupt
+‘instruction’ and prefetch stream; the interrupt needs a PC value to synthesise
+its return address, and the interrupting device cannot know what this is. Fur-
+thermore the return address must be valid; if an interrupt is accepted just after
+a branch has been prefetched it could be inserted into code which should not
+be run.
+Amulet3 implements interrupts by ‘hijacking’ rather than inserting instruc-
+tions, the whole operation being performed in the prefetch unit. Although the
+interrupt signals (in this case they are level-sensitive) change asynchronously
+with respect to the prefetch unit the mutual exclusion element is better thought
+of as a synchroniser than as a typical asynchronous arbiter. Figure 15.19 shows
+a method of implementing this. Here any change in the interrupt input will re-
+tard the normal request flow until the synchronised state has been latched; the
+synchronised interrupt signal only changes when done is low.
+
+Figure 15.19. Interrupt synchroniser.
+
+When it is known that an interrupt signal has become active the current PC
+value effectively becomes the address of the interrupt ‘instruction’ and can be
+used to form the return address. This can be sent as an instruction but can save
+time by bypassing the memory. The interrupt can then be disabled to prevent
+further acknowledgement.
+Because this action takes place in the prefetch unit, Amulet3 can treat the
+interrupt entry as a predicted branch and jump directly to the appropriate
+interrupt service routine; in an ARM these routines are at fixed addresses.
+The problem still arises that the interrupt entry may be speculative. If a
+branch is pending the return address sent to the execution stage may be invalid
+– in any case it will be wrongly coloured and therefore discarded! However
+the act of branching updates both the PC and any associated information, in-
+cluding the interrupt enables. As the interrupt has not been serviced the (level-
+sensitive) request will still be active and another attempt to enter the service
+routine will be made. This time the branch target address will be saved and
+there can be no further impediments.
+
+Data aborts.
+Although it solves the register dependency problems, the re-
+order buffer was originally introduced to simplify the implementation of data
+aborts. The ARM architecture specifies that, if an abort occurs, no effects
+from following instructions will be retained. Earlier Amulet processors did not
+speculate on memory operations, relying on a fast ‘go/no go’ decision from
+any memory management unit. Amulet3 allows for more sophisticated (i.e.
+slower!) memory management by outputting memory transfer requests and
+only checking for aborts at the end of the operation. To be of any worth this
+must allow other, speculative operations to take place in parallel, but these
+operations cannot be ‘retired’ until the outcome of the load is known.
+The reorder buffer provides a place for speculative results to be held – and
+forwarded for reuse if necessary – until they can be retired into registers. In the
+(rare) case of a data abort the speculative results can be discarded, leaving the
+register bank intact. The discard can be achieved either by using a colouring
+scheme, tested by the register writeback process, or by marking speculative
+results as invalid using the same flag as an operation which has failed for other
+reasons. For certain reasons of implementational expediency Amulet3 uses
+the latter method, although the asynchronous hazard of invalidating a result
+whilst a forwarding operation is being attempted must be avoided. (This is
+achieved by implementing two validity bits and nullifying only one of them;
+the copy back process uses an AND of these whereas forwarding uses an OR.
+Forwarding is therefore not disturbed, although the result will be discarded
+later.)
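+
+The two-bit validity scheme can be illustrated with a minimal Python sketch. The class and field names (ReorderEntry, valid_a, valid_b) are hypothetical and chosen only for illustration; the sketch models the AND/OR logic described above, not the Amulet3 circuit itself.
+
```python
from dataclasses import dataclass


@dataclass
class ReorderEntry:
    """Hypothetical reorder-buffer entry; names are illustrative only."""

    value: int
    valid_a: bool = True  # cleared when the result is nullified
    valid_b: bool = True  # left set so that forwarding is not disturbed

    def nullify(self) -> None:
        # Only one of the two validity bits is cleared.
        self.valid_a = False

    def may_forward(self) -> bool:
        # Forwarding uses the OR of the two bits.
        return self.valid_a or self.valid_b

    def may_write_back(self) -> bool:
        # The copy-back process uses the AND of the two bits.
        return self.valid_a and self.valid_b


entry = ReorderEntry(value=42)
entry.nullify()
```
+
+After nullify() the entry can still be forwarded, but it will not be copied back into the register bank.
+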
+The reorder buffer accounts only for the register state however; the ARM
+holds other state bits which also require preservation. Two separate mecha-
+nisms are used for these, depending on the frequency of changes.
+The first is the current programme status register (CPSR) which holds the
+processor’s flags, operating mode et alia, the whole of which can be repre-
+sented in ten bits. The flags clearly change frequently during execution and
+there are many dependences on these bits due, for example, to compare-branch
+sequences. When attempting a memory operation Amulet3 simply copies the
+current CPSR state into a history buffer [66] FIFO; successful completion of
+the transaction discards this entry, but an abort can restore the CPSR state to
+that at the start of the failed operation.
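+
+The history-buffer behaviour can be sketched in Python as follows. This is a simplified model under stated assumptions: the class name CpsrHistory is hypothetical, transfers are assumed to complete in issue order, and an abort is assumed to make all younger snapshots irrelevant.
+
```python
from collections import deque


class CpsrHistory:
    """Hypothetical model of a CPSR history FIFO; not the Amulet3 design."""

    def __init__(self, cpsr: int) -> None:
        self.cpsr = cpsr
        self.history = deque()  # CPSR snapshots, oldest first

    def issue_memory_op(self) -> None:
        # Copy the current CPSR state when a memory operation is attempted.
        self.history.append(self.cpsr)

    def complete(self) -> None:
        # Successful completion simply discards the oldest snapshot.
        self.history.popleft()

    def abort(self) -> None:
        # Restore the state held at the start of the failed operation.
        # Assumption: younger snapshots are dropped with the discarded work.
        self.cpsr = self.history.popleft()
        self.history.clear()


state = CpsrHistory(cpsr=0b0000000001)
state.issue_memory_op()    # load issued; flags are snapshotted
state.cpsr = 0b1000000001  # a later instruction changes the flags
state.abort()              # the load aborts; the flags are rolled back
```
+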
+The other non-register state is a set of five saved programme status regis-
+ters (SPSRs) which act as a temporary store for the CPSR in various exception
+entries. These change very rarely and it is uneconomic to enlarge the history
+buffer to encompass them, although – in theory – they could be changed be-
+tween a load being issued and an abort being signalled. The solution here was
+simply to use a semaphore to lock the SPSRs whilst any memory operations
+are outstanding. This delivers the required functionality very cheaply and the
+performance penalty is tiny because SPSRs change so rarely.
+
+15.5.
+Memory – a case study
+
+It seems reasonable that an asynchronous processor should interact with an
+asynchronous memory system. This implies the need for handshake interfaces
+on a range of memory systems, including RAM, ROM and caches. This is the
+subject of the following sections.
+An individual memory array is a very regular structure and – under steady
+voltage, temperature etc. conditions – will produce data in a constant time.
+At first glance this may suggest that there is not much scope for asynchronous
+design within a memory system. However each part of the memory will have
+its own characteristic timing; in some cases even a simple memory will have a
+variable cycle time. An example is a RAM which will typically take longer to
+read from than it will to write to.
+In fact the memory system is one part of a computer which has extremely
+variable timing; even a clocked computer will take different times to service a
+cache hit and a cache miss. An asynchronous system will accommodate such
+cycle time variation quite naturally and is able to exploit many more subtle
+variations which would be padded to fill a clock cycle in a synchronous ma-
+chine.
+
+15.5.1
+Sequential accesses
+
+A static RAM (SRAM) stores data in small flip-flops which have only a
+very weak output drive. To accelerate read access it is normal to use ‘sense
+amplifiers’ to detect small (sub-digital) swings in voltage and produce an early
+digital output. Sense amplifiers, being analogue parts, are quite power-hungry.
+Sense amplification is only useful when there has been enough voltage
+swing for the read bits to be discriminated; it is also only required until the
+bits can be latched. As this period is certainly less than even half a clock cycle
+this is an ideal application for a self-timed system. A delay can be inserted to
+keep the sense amplifiers ‘switched off’ when the read cycle commences and
+only switch them on when they may be useful. An extra (known) bit in the
+RAM may then be discriminated and, when it has been read, used to latch the
+entire RAM line and disable the sense amplifiers. The same signal can be used
+to indicate the completion of the read cycle, possibly returning the RAM array
+to the precharge state in preparation for another cycle (figure 15.20).
+When designing such a circuit the RAM completion is easy to detect but the
+delay before enabling the sense amplifiers is harder to judge. The designer can
+choose this to be short – to ensure maximum speed – or somewhat longer –
+to ensure the memory is ready before the amplifiers are enabled thus ensuring
+minimum power wastage. If the designer errs then either speed or power may
+be compromised slightly, however functionality is retained.
+
+Figure 15.20. Self-timed sense amplifiers.
+
+A typical SRAM array is organised to be roughly square; a 1 Kbyte RAM
+might therefore be organised as (say) 64 × 128 rather than 256 × 32 even
+though the processor requires only 32 bits on a given cycle. This presents
+the RAM designer with two choices:
+
+multiplex 32 sense amplifiers to the required word;
+
+amplify all 128 bits and ignore the unwanted ones.
+
+The first choice appears better when a read is considered in isolation but
+cycles are rarely so arranged; typical access patterns (especially code fetches)
+exhibit considerable sequentiality and this can be exploited in the hardware
+design.
+When using the first option it is possible to delay the RAM precharge and
+provide a subsequent read operation with a shorter read delay. The Amulet2e
+cache [49] uses this technique and is therefore able to provide subsequent ac-
+cesses within a RAM ‘line’ faster than the first such access. This variation
+in access time is much less than a whole ‘cycle’ and therefore would be of
+no interest to a synchronous designer, but it is exploited automatically in an
+asynchronous system.
+The second option given above can latch the entire RAM line after amplifi-
+cation. It can then service subsequent requests from this latch. This frees the
+RAM array to be precharged and – possibly – used for other purposes. This
+technique is exploited in the Amulet3i RAM which is described below.
+
+15.5.2
+The Amulet3i RAM
+
+As shown in figure 15.7 on page 283, the Amulet3 processor has a Harvard
+architecture with separate instruction and data buses. However in the Amulet3i
+SoC the memory model is unified; this implies that the buses must ‘merge’
+somewhere outside the processor core.
+In practice the local buses are merged in two places: once to get onto the
+on-chip MARBLE bus (see below) and once for access to the local, high-speed
+RAM (figure 15.21). In Amulet3i the local RAM is memory-mapped rather
+than forming a cache, although there is no reason why a cache could not have
+been implemented here; cache design is discussed later.
+
+Figure 15.21. Amulet3i local memory.
+
+The local RAM (8 Kbytes) is divided into 1 Kbyte blocks; both buses run
+to each block (figure 15.22). The blocks are interleaved so that – most of the
+time – there need be no interaction between instruction and data fetches. Only
+if there is a conflict with both buses requiring data from the same RAM block
+is there any need for arbitration.
+In the case of a conflict there is no clock on which to control an adjudica-
+tion; access to the block is restricted by a mutual exclusion element (‘mutex’),
+within the Arbiter blocks in figure 15.22, on a ‘first-come, first served’ basis.
+Note that, in general, data and instruction accesses are not synchronised and
+therefore the average wait will be about half the typical RAM access time.
+Collisions are further minimised by using the latch at the output of the sense
+amplifiers (see preceding section) as a form of cache. Here separate latches
+are provided for instruction and data reads, so sequential accesses rarely need
+to compete for the arbitrated resource. In practice this gives performance
+approaching that of a dual-port RAM despite being implemented with standard
+SRAM.
+
+Figure 15.22. Memory block organisation.
+
+The local RAM architecture thus provides memory cycles with two differ-
+ent delays (‘random’ and ‘sequential’) with the potential of an added (variable)
+delay in the rare case of a collision. In Amulet3i this is further complicated
+because the two local buses cycle in different times; the instruction bus is sim-
+plified as a read-only bus and runs noticeably faster than the full-function local
+data bus, which also permits external bus mastery to allow DMA and test ac-
+cess (fig. 20). The implications of these various timings are absorbed by the
+asynchronous nature of the system – for instance it is not necessary to slow the
+instruction fetches down by around 25% to fit a clock period set by the data
+bus.
+The inclusion of arbiters within the memory blocks implies that the access
+patterns are non-deterministic. Care must therefore be taken to ensure that the
+system cannot reach a deadlock state. The only possible deadlock that could
+occur in the memory would occur as follows:
+
+1 A (non-sequential) data transfer needs access to a particular RAM block.
+
+2 This is prevented because an instruction fetch is already using the RAM
+array.
+
+3 The instruction fetch cannot complete because the instruction decoder is
+still busy.
+
+4 The processor pipeline is full and is blocked by the data fetch.
+
+5 Deadlock!
+
+To avoid this it is important not to gain access to the shared resource (the
+RAM array) until it is known that the operation will be able to release it again.
+A data transfer can always do this but provision has to be made restricting the
+generation of instruction fetches until they can be guaranteed to release the
+RAM. In practice the latch following the sense amplifiers forms a convenient,
+dedicated buffer in which to hold the instruction and allow data accesses to
+proceed. In Amulet3 the processor throttles its requests so only a single in-
+struction fetch is outstanding at any given time and this must be removed from
+the ‘I buffer’ before the next address can be sent (figure 15.23).
+
+Figure 15.23. Memory block arbitration and throttling.
+
+The need for arbitration is rare and thus the possibility of discovering a
+deadlock by random simulation even rarer. It is therefore essential to analyse
+such a non-deterministic system thoroughly to ensure that the opportunities for
+deadlock are removed.
+
+15.5.3
+Cache
+
+An asynchronous cache is very similar to an asynchronous RAM; most of the
+design is a combination of the preceding description of asynchronous RAM
+and standard cache design techniques. However in order to be efficient there
+are certain problems, not present in a synchronous cache, which require solu-
+tion.
+The most significant problems are in managing any conflicts between the
+processor/cache interactions and cache/bus interactions. The first of these to
+address is the issue of line fetch.
+A line fetch generally occurs when a cache miss results from an attempted
+access to a cacheable location. The required word and a small number of
+adjacent words (together forming a cache line) are copied from memory into
+the cache. The simplest solution to this is to halt the processor, fetch the entire
+cache line, and allow the processor to proceed as the access is now a cache
+hit. However this requires a processor stall which is considerably longer than
+is strictly necessary.
+A more efficient scheme is to begin fetching the cache line, forward the re-
+quired word to the processor as soon as it arrives (it can often be arranged to be
+the first word fetched) and then allow the processor and line fetch to continue
+independently. Performance is further enhanced by allowing the processor to
+use other parts of the cache whilst the line fetch is proceeding (‘hit-under-
+miss’) and to use the incoming words as soon as they arrive. Unfortunately
+in an asynchronous environment this is difficult because the fetched words are
+arriving with no fixed timing relationship with the processor’s cycles.
+Initial thoughts may suggest arbitration for the cache. However it is possi-
+ble to solve this problem without arbitration while maintaining all the desired
+functions by including a dedicated latch for holding the last fetched line. This
+latch is called the Line Fetch Latch.
+
+Figure 15.24. Control circuit request steering.
+
+The line fetch latch (LFL) (figure 15.24) is actually a set of latches residing
+just outside the true RAM array. It normally holds the last-fetched cache line.
+It has its own tag and comparator which allow it to function much like the other
+cache lines. Note that the LFL holds the only copy of this data. (Incidentally,
+because the LFL is static and requires no sense amplification when it is read it
+can provide faster access in an asynchronous system.)
+When, as a result of a cache miss, a fetch is needed, a line from the RAM is
+selected for rejection. For the moment assume that the cache is write-through
+and therefore the RAM can simply be overwritten. The LFL contents, together
+with its tag, are then copied into the chosen line and the LFL is marked as
+‘empty’. This can happen in parallel with the start of the external access.
+The processor is then assumed to have a cache hit from within the LFL and
+attempts to read the appropriate word; this causes a stall at the synchronisation
+point because the word is empty and – unless the external memory is excep-
+tionally fast – will not have been refilled yet.
+As words arrive they are stored in the LFL and individually flagged as ‘full’.
+As soon as the processor can read a word from the LFL (typically after the
+completion of the first fetch cycle) it can continue. From this time the processor
+can continue in parallel with the remaining words being fetched.
+A subsequent cache cycle could be:
+
+a cache hit: this can proceed independently and without interaction with
+the LFL;
+
+a cache miss: this will cause a stall until the line fetch process is com-
+plete and the fetch process can be repeated;
+
+a LFL hit: this attempts to read the LFL whilst it is being filled. The
+possibilities are that the required word is already present (the processor
+continues) or the word is still pending (the processor must stall until it
+arrives).
+
+Only in the last case is there any interaction between the asynchronous pro-
+cesses. However this interaction is merely a wait which can be implemented
+with a flip-flop and an AND gate (figure 15.25). The potential wait begins
+when the LFL is first emptied (caused by, and therefore synchronised with,
+the processor’s action). The wait can end at an arbitrary time but this merely
+delays a transition; it cannot abort or change an action and can therefore be
+implemented without arbitration or the risk of metastability. This mechanism
+was first implemented in the Amulet2e cache system [49] and was also used in
+TITAC-2 [130].
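+
+The three cases above can be sketched with a small Python model of the line fetch latch. The names are hypothetical, the four words per line are an assumption, and the copy of the old LFL contents into the RAM array is omitted; the point is only that a read of a pending word waits, while every other case proceeds without interaction.
+
```python
class LineFetchLatch:
    """Hypothetical LFL model with one 'full' flag per word."""

    def __init__(self, words_per_line: int = 4) -> None:
        self.tag = None
        self.words = [0] * words_per_line
        self.full = [False] * words_per_line

    def start_fetch(self, tag: int) -> None:
        # The LFL takes the new tag and every word is marked 'empty'.
        self.tag = tag
        self.full = [False] * len(self.words)

    def word_arrived(self, index: int, data: int) -> None:
        # Words are stored as they arrive and individually flagged 'full'.
        self.words[index] = data
        self.full[index] = True

    def read(self, tag: int, index: int):
        """Return ('miss', None), ('stall', None) or ('hit', data)."""
        if tag != self.tag:
            return ("miss", None)
        if not self.full[index]:
            return ("stall", None)  # the processor waits for this word
        return ("hit", self.words[index])


lfl = LineFetchLatch()
lfl.start_fetch(tag=0x40)
before = lfl.read(tag=0x40, index=1)
lfl.word_arrived(index=1, data=0xABCD)
after = lfl.read(tag=0x40, index=1)
```
+
+In this model a stalled read never changes its outcome other than from 'stall' to 'hit', which mirrors the observation that the wait only delays a transition.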
+
+Figure 15.25. Line-fetch latch read synchronisation.
+
+Both these processors used a write-through cache for simplicity. For
+higher performance a copy-back mechanism is needed. This too can be
+provided using an extension of the LFL mechanism. In this case the process of
+line fetching is complicated by the need first to copy the victim line from the
+cache array before overwriting it. The victim line can be placed in a separate
+write buffer together with its address as supplied by its tag field. Note that the
+line fetch is caused by a cache miss, so the rejected line can never be the same
+as the line being fetched (this becomes more important later). The LFL is then
+emptied into the RAM array as before and the refilling begins. The writing
+of the rejected line can be delayed because it is less urgent than satisfying the
+cache miss.
+Each cache line (and the write buffer) also contains a ‘dirty’ flag which is set
+if the cache line has been modified. This can be checked and used to determine
+if the write buffer should be written out (i.e. ‘dirty’ is true) or is already coher-
+ent with the memory; in the latter case the copy-back process can be bypassed.
+This process reduces the write traffic but, with a single entry write buffer,
+does not greatly assist with reducing fetch latency because the write buffer
+must be emptied before a subsequent fetch. However the write buffer can be
+extended, albeit at the cost of introducing an arbiter and the potential pitfalls
+therefrom.
+If a second line fetch is needed before an earlier fetch is complete it is the-
+oretically possible for this to overtake any pending write operations. This is
+also desirable. As the write operation will already be pending – merely wait-
+ing for the bus to become free – it is necessary to determine that the new fetch
+request arrived before the write could begin, which requires arbitration in an
+asynchronous environment. However, once the decision is made the operations
+can proceed as before. Two problems remain however:
+
+if the write buffer becomes full there will be nowhere to evict a line to
+and the system can deadlock;
+
+if the required line is one which has recently been evicted the fetch could
+overtake a pending write and thus lose cache coherency.
+
+The first problem is relatively easy to solve; a simple counter can ensure
+that only a certain amount of overtaking is allowed and that one space in the
+write buffer always remains free (i.e. when the last entry is filled the next bus
+operation must be a write – unless, of course, it is a ‘clean’ line where the write
+can be assumed and bypassed). This can be implemented, for example, as a
+semaphore.
+The second problem is harder, but can be solved by forwarding in a similar
+manner to the forwarding from a processor’s reorder buffer. A line fetch checks
+the addresses of the entries in the write buffer and, if it finds a match, satisfies
+itself from there instead of requesting the memory bus. In such a case the
+‘fetch’ incurs no bus latency and completes much more rapidly, as it is an internal
+operation which can copy an entire cache line in a single step. This is, of
+course, irrelevant to the functioning of other parts of the system as the whole
+is self-timed anyway. Such forwarding does not affect the write process (a re-
+fetched ‘dirty’ line is returned clean) and can take place regardless of whether
+the copy-back is pending, in progress, complete or not needed. The write buffer
+acts, in effect, as a write-through victim cache [56].
+There is one caveat to this process; the line fetch occurs in parallel with
+the eviction of a cache line. It is therefore possible that one entry in the write
+buffer could be changing during the comparison process. As noted above this
+entry is reserved for the freshly evicted line and can safely be excluded from
+the comparison, thus averting any possibilities of a false positive due to signals
+changing.
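+
+The write buffer rules described above can be sketched in Python. This is a sequential model under stated assumptions: the class and method names are hypothetical, the capacity is arbitrary, and the asynchronous hazard of comparing against an entry that is still changing is not modelled.
+
```python
class WriteBuffer:
    """Hypothetical multi-entry write buffer for evicted cache lines."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        self.entries = []  # oldest first: (address, line, dirty)

    def must_write_next(self) -> bool:
        # Once the last entry is filled, the next bus operation must be a write.
        return len(self.entries) >= self.capacity

    def evict(self, address: int, line, dirty: bool) -> None:
        if self.must_write_next():
            raise RuntimeError("no free entry: drain the buffer first")
        self.entries.append((address, list(line), dirty))

    def forward(self, address: int):
        # Satisfy a line fetch from the buffer if the address matches.
        # The newest entry is checked first in case an address appears twice.
        for entry_address, line, _dirty in reversed(self.entries):
            if entry_address == address:
                return list(line)
        return None

    def write_back_oldest(self):
        # Returns (address, line) for the bus, or None when the line is
        # clean and the write can be bypassed.
        address, line, dirty = self.entries.pop(0)
        return (address, line) if dirty else None


wb = WriteBuffer(capacity=2)
wb.evict(0x100, [1, 2, 3, 4], dirty=True)
wb.evict(0x200, [5, 6, 7, 8], dirty=False)
```
+
+Forwarding leaves the entry in place, so the pending copy-back is unaffected by a re-fetch of the same line.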
+
+15.6.
+Larger asynchronous systems
+
+15.6.1
+System-on-Chip (DRACO)
+
+DRACO (DECT Radio Communications Controller) (figure 15.26) is a sys-
+tem on chip based around the Amulet3 processor. In terms of area about half
+of the (7 mm square) device is an asynchronous ‘island’ – hence Amulet3i –
+and the other half comprises synchronous peripherals compiled from VHDL.
+The asynchronous subsystem (figure 15.27) is a computer in itself and was
+developed both with commercial intent and with a view to investigating some
+new techniques. The processor and RAM have already been discussed. Some
+other novel asynchronous features are outlined in this section.
+
+Figure 15.26. DRACO layout.
+
+Figure 15.27. Amulet3i asynchronous subsystem.
+
+15.6.2
+Interconnection
+
+Ideally an asynchronous system should be based around an asynchronous
+bus. Indeed it is arguable that large, fast synchronous systems should also use
+asynchronous interconnection between their synchronous subsystems to alle-
+viate problems with high-speed clock distribution and clock skew. This model
+is sometimes referred to as “GALS” (Globally Asynchronous, Locally Syn-
+chronous) and may represent an early commercial opportunity for the inclusion
+of asynchronous circuits in ‘conventional’ systems.
+
+MARBLE.
+As a step in developing such an interconnection standard,
+Amulet3i contains the first implementation of MARBLE [5], a 32-bit, multi-
+master, on-chip bus which communicates by using handshakes rather than a
+clock. Apart from this the signal definitions, with 32-bit address and data,
+look very similar to a conventional bus. MARBLE separates address and data
+communications, allowing pipelining and interleaving of operations in order to
+increase the available bandwidth when several devices require global access.
+MARBLE is supported by ‘initiator’ and ‘target’ interfaces which can be
+attached to any asynchronous component. These, their address, and the bus
+wiring provide all that is needed for communication between the various com-
+ponents. In Amulet3i there are four initiators and seven targets. For example
+the processor’s two local buses each terminate in a MARBLE initiator and the
+local data bus is also a MARBLE target which allows DMA and test data in
+and out of the RAM from other initiators.
+
+Chain.
+Chain (‘Chip area interconnect’) is currently under development as
+a possible replacement for a conventional bus for on-chip communications.
+Chain is based around narrow, high-speed, point-to-point links forming a net-
+work rather than a bus. The idea is to exploit the potential for fast symbol
+transmission within an asynchronous system while reducing the number of
+long distance wires.
+By using a delay-insensitive coding scheme Chain relieves the chip designer
+of the need to ensure timing closure across the whole chip; it also provides
+tolerance of potential problems such as induced crosstalk on the long inter-
+connection wires. Again the user need only communicate with ‘conventional’
+parallel interfaces.
+
+15.6.3
+Balsa and the DMA controller
+
+The DMA controller is a complex, multi-channel unit which was evolved
+according to an external specification. Whilst the function of the unit is rela-
+tively straightforward, even in the asynchronous domain, the unit is notable for
+being the first real application of Balsa synthesis [11].
+The DMA controller comprises a large set of control registers for the many
+DMA channels and a control unit which chooses an active request and services
+it. The registers were designed as blocks of custom VLSI to optimise their area.
+The control logic was written in Balsa, and modified several times as the
+specification changed. The modifications proved remarkably easy to accommodate
+in this environment.
+Such synthesis is not (yet) suitable for high-performance units, but proved
+extremely useful in such an application where the performance was limited by
+other constraints (such as the bus speed, here) and development time predom-
+inates. Of course, in an asynchronous environment, it is easy to accommodate
+devices in any performance range without affecting the overall system func-
+tionality.
+Part II of this book is an introduction to Balsa and includes a complete
+source listing of a simpler 4-channel DMA controller.
+
+15.6.4
+Calibrated time delays
+
+To be useful in real systems an asynchronous processor must be able to
+interface with currently available commodity parts. Whilst it is possible – and
+perhaps even desirable – to have memory devices with handshake interfaces,
+these are not available ‘off the shelf’. Instead commodity memories rely on
+absolute timing assumptions to guarantee their correct operation and thus a
+system using them must have an accurate timing reference. This is the one
+thing lacking in an asynchronous design.
+Introducing a clock purely to provide memory timing would negate many of
+the advantages of the asynchronous system; it is therefore preferable to retain
+the idea of providing data ‘on demand’, providing an adequately precise delay
+can be provided. This delay need not be a clock; a monostable which can be
+triggered on demand is preferable.
+Amulet1 and Amulet2e relied on an externally supplied delay line to pro-
+vide a bus timing reference. This is a flexible solution in that a short delay
+can be used repeatedly to provide longer timing intervals, providing a flexible,
+programmable interface. For example an area of address space could be set
+up for DRAM devices with (say) 1 delay address set up time, 2 delays RAS
+address hold, etc. The bus interface can then count delays as appropriate.
+This is a reasonable solution but suffers from certain drawbacks:
+
+the on-chip delay for each delay cycle is not factored;
+
+delay lines are not particularly precise;
+
+driving an off-chip delay is power hungry.
+
+A much better solution would be to use an on-chip delay. The chief problem
+with this is that a fixed delay will be imprecise, varying from chip to chip and
+also changing with temperature and supply voltage fluctuations. Any on-board
+delay must therefore be calibratable against an external reference.
+Amulet3i uses such a delay. This comprises a chain of delay elements
+(figure 15.28) which can be ‘shorted’ at a determined point. This can be calibrated
+by counting the number of cycles completed within a known interval and
+adjusted accordingly using the control wires shown. Once calibrated the delay
+will only change slowly (e.g. with temperature drift) unless the external
+conditions change; calibration can therefore be repeated at infrequent intervals under
+software control. The external timing reference can be a long delay such as a
+period of a 32 kHz ‘watch’ oscillator – hardly power expensive!
+
+Figure 15.28. Controllable delay circuit.
+
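+The calibration procedure can be sketched in Python. Everything hardware-specific here is an assumption made for illustration: the number of control settings (16), the 2 ns element delay in the stand-in counter, and the function names. Only the idea of counting delay cycles within one period of a 32 kHz reference comes from the text.
+
```python
REFERENCE_S = 1 / 32768  # one period of a 32 kHz 'watch' oscillator


def calibrate(count_cycles, target_delay_s, settings=16, reference_s=REFERENCE_S):
    """Return the control setting whose measured delay is closest to the target.

    count_cycles(setting) is assumed to return how many delay cycles
    complete within one reference interval at that setting.
    """
    best_setting, best_error = None, float("inf")
    for setting in range(settings):
        cycles = count_cycles(setting)
        if cycles == 0:
            continue  # delay exceeds the reference interval; not measurable
        measured = reference_s / cycles
        error = abs(measured - target_delay_s)
        if error < best_error:
            best_setting, best_error = setting, error
    return best_setting


def fake_count_cycles(setting, element_delay_s=2e-9):
    """Stand-in for the hardware counter: (setting + 1) elements of 2 ns each."""
    return int(REFERENCE_S / ((setting + 1) * element_delay_s))


chosen = calibrate(fake_count_cycles, target_delay_s=10e-9)
```
+
+With the stand-in counter, five elements of 2 ns give the closest match to a 10 ns target, so setting 4 is chosen. A real implementation would repeat this under software control whenever temperature or supply voltage may have drifted.
+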
+15.6.5
+Production test
+
+Figure 15.27 on page 311 includes a ‘test interface controller’ block. The
+DRACO chip was designed as a commercial product, and must therefore be
+testable in production. The test approach adopted in the design of DRACO is
+based upon exploiting the MARBLE bus to access the various on-chip mod-
+ules and their test features. In normal use the external memory interface is a
+MARBLE target, but for test purposes it can be reconfigured as a bus initiator,
+enabling an external tester to control the bus and thereby read and write the
+on-chip memory. Circuits in the test interface controller make this efficient,
+with support for automatic sequential address generation, and so on.
+All of the production tests for the asynchronous subsystem use the exter-
+nal tester to load a test program into the on-chip RAM and then instruct the
+Amulet3 processor to execute the program. The test runs without intervention
+from the tester, which must simply wait a given time before inspecting an on-
+chip memory location to read a pass/fail code. Inevitably the tests run at full
+speed; there is no external timing input to control them.
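+
+The test flow described above (load, start, wait, inspect) can be sketched
+as follows. Everything here is a hypothetical stand-in: the ToyChip class,
+the addresses, the pass/fail codes and the scratch-memory check are invented
+for illustration and do not describe DRACO's actual test programs or
+register map.

```python
import time

RESULT_ADDR = 0x0FFC          # hypothetical location of the pass/fail word
PASS_CODE = 0x600D            # hypothetical 'pass' value
FAIL_CODE = 0x0BAD            # hypothetical 'fail' value
SCRATCH = range(0x100, 0x110)  # hypothetical region exercised by the self-test


class ToyChip:
    """Stand-in for the device under test, with an optional faulty RAM cell."""

    def __init__(self, stuck_at_zero=None):
        self.ram = {}
        self.stuck_at_zero = stuck_at_zero

    def write(self, addr, value):
        self.ram[addr] = 0 if addr == self.stuck_at_zero else value

    def read(self, addr):
        return self.ram.get(addr, 0)

    def _cell_ok(self, addr):
        self.write(addr, 0xA5A5)
        return self.read(addr) == 0xA5A5

    def start(self, entry):
        # Models the processor running the loaded program untimed and
        # leaving a result code in memory for the tester to collect.
        ok = all(self._cell_ok(addr) for addr in SCRATCH)
        self.write(RESULT_ADDR, PASS_CODE if ok else FAIL_CODE)


def production_test(chip, program_words, base=0x0000, wait_s=0.0):
    """Load a test program, start it, wait, then read the pass/fail code."""
    for offset, word in enumerate(program_words):  # sequential addresses
        chip.write(base + offset, word)
    chip.write(RESULT_ADDR, 0)   # clear any stale result
    chip.start(base)
    time.sleep(wait_s)           # the tester only waits; it supplies no timing
    return chip.read(RESULT_ADDR) == PASS_CODE


good_result = production_test(ToyChip(), [0x1, 0x2, 0x3])
bad_result = production_test(ToyChip(stuck_at_zero=0x105), [0x1, 0x2, 0x3])
```

+The tester's only interactions are bus writes, a start command, a fixed wait
+and one final read, mirroring the hands-off flow in the text.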
+Certain on-chip modules are very difficult to test in purely functional mode,
+so additional access routes (via test registers accessible from MARBLE) are
+provided to make the tests more efficient. The calibrated time delay is one
+such circuit, and the processor branch predictor is another. In the latter case,
+the branch predictor is taken off line so that the processor (of which it is a part)
+
+
+Chapter 15: Processors
+
+can manipulate its internal state to run optimised tests on its CAM and RAM
+components.
+Although DRACO is not, at the time of writing, in full-scale production, the
+tests ran without difficulty on prototype sample parts.
+
+15.7. Summary
+
+Devices such as DRACO have demonstrated that it is feasible to build large,
+functional asynchronous systems. Like any prototype system the chip has its
+problems. There are two (known) bugs in the asynchronous half of the system:
+an electrical drive strength problem within the multiplier (which did not show
+up during simulation) and a logic oversight in the prefetch unit which
+falsely indicates a cycle is ‘sequential’ under certain specific conditions (a
+problem if running code from off-chip DRAM). Neither of these is attributable
+to the asynchronous nature of the device (indeed, there are slightly more bugs
+in the synchronous part of the device!) and both are readily fixable. The pro-
+cessor is comparable with an ARM9 manufactured on the same process in
+terms of physical size, performance and energy efficiency; preliminary
+measurements suggest that it is significantly ‘quieter’ in terms of EMI.
+This chapter has presented possible solutions (though certainly not the only
+ones!) to many of the problems facing the designer of complex asynchronous
+processing and memory systems. The majority of the designs described at the
+beginning of the chapter have been produced by academic groups and could be
+classified as “research”; however the complexity of a modern system on chip is
+such that these designs stretch the capability of even a large university group. It
+has been demonstrated that large, functional asynchronous designs are not only
+possible, but can be competitive and have some unique advantages in terms of
+power management and EMI. Asynchronous interconnection may be the only
+solution for large devices, even those with local clocks. Asynchronous chip
+design is ready to move to “development”.
+
+
+
+Epilogue
+
+Asynchronous technology has existed since the first days of digital elec-
+tronics – many of the earliest computers did not employ a central clock signal.
+However, with the development of integrated circuits the need for a straightfor-
+ward design discipline that could scale up rapidly with the available transistor
+resource was pressing, and clocked design became the dominant approach.
+Today, most practising digital designers know very little about asynchronous
+techniques, and what they do know tends to discourage them from venturing
+into the territory. But clocked design is beginning to show signs of stress – its
+ability to scale is waning, and it brings with it growing problems of excessive
+power dissipation and electromagnetic interference.
+During the reign of the clock, a few designers have remained convinced that
+asynchronous techniques have merit, and new techniques have been developed
+that are far better suited to the VLSI era than were the approaches employed on
+early machines. In this book we have tried to illuminate these new techniques
+in a way that is accessible to any practising digital circuit designer, whether or
+not they have had prior exposure to asynchronous circuits.
+In this account of asynchronous design techniques we have had to be selec-
+tive in order not to obscure the principal goal with arcane detail. Much work of
+considerable quality and merit has been omitted, and the reader whose interest
+has been ignited by this book will find that there is a great deal of published
+material available that exposes aspects of asynchronous design that have not
+been touched upon here.
+Although there are commercial examples of VLSI devices based on asyn-
+chronous techniques (a couple of which have been described in this book),
+these are exceptions – most asynchronous development is still taking place in
+research laboratories. If this is to change in the future, where will this change
+first manifest itself?
+The impending demise of clocked design has been forecast for many years
+and still has not happened. If it does happen, it will be for some compelling
+reason, since designers will not lightly cast aside their years of experience in
+one design style in favour of another style that is less proven and less well
+supported by automated tools.
+
+PRINCIPLES OF ASYNCHRONOUS DESIGN
+
+There are many possible reasons for considering asynchronous design, but
+no single ‘killer application’ that makes its use obligatory. Several of the argu-
+ments for adopting asynchronous techniques mentioned at the start of this book
+– low power, low electromagnetic interference, modularity, etc. – are applica-
+ble in their own niches, but only the modularity argument has the potential to
+gain universal adoption. Here a promising approach that will support
+heterogeneous timing environments is GALS (Globally Asynchronous Locally
+Synchronous) system design. An asynchronous on-chip interconnect – a ‘chip area
+network’ such as Chain (described on page 312) – is used to connect clocked
+modules. The modules themselves can be kept small enough for clock skew to
+be well-contained so that straightforward synchronous design techniques work
+well, and different modules can employ different clocks or the same clock with
+different phases. Once this framework is in place, it is then clearly straightfor-
+ward to make individual modules asynchronous on a case-by-case basis.
+Here, perhaps unsurprisingly, we see the need to merge asynchronous tech-
+nology with established synchronous design techniques, so most of the func-
+tional design can be performed using well-understood tools and approaches.
+This evolutionary approach contrasts with the revolutionary attacks described
+in Part III of this book, and represents the most likely scenario for the wide-
+spread adoption of the techniques described in this book in the medium-term
+future.
+In the shorter term, however, the application niches that can benefit from
+asynchronous technology are important and viable. It is our hope in writing
+this book that more designers will come to understand the principles of asyn-
+chronous design and its potential to offer new solutions to old and new prob-
+lems. Clocks are useful but they can become straitjackets. Don’t be afraid to
+think outside the box!
+
+For further information on asynchronous design see the bibliography at the
+end of this book, the Asynchronous Bibliography on the Internet [111], and
+the general information on asynchronous design available at the Asynchronous
+Logic Homepage, also on the Internet [47].
+
+
+References
+
+[1] G. Abouyannis et al. Project PREST EP25242. European Low Power
+Initiative for Electronic System Design (ESDLPD) Third International
+Workshop, pages 5–49, 2000.
+
+[2] A. Abrial, J. Bouvier, M. Renaudin, P. Senn, and P. Vivet. A new con-
+tactless smartcard IC using an on-chip antenna and an asynchronous
+micro-controller.
+IEEE Journal of Solid-State Circuits, 36(7):1101–
+1107, July 2001.
+
+[3] T. Agerwala. Putting Petri nets to work. IEEE Computer, 12(12):85–94,
+December 1979.
+
+[4] W.J. Bainbridge and S.B. Furber. Asynchronous macrocell intercon-
+nect using MARBLE. In Proc. International Symposium on Advanced
+Research in Asynchronous Circuits and Systems (Async’98), pages 122–
+132. IEEE Computer Society Press, April 1998.
+
+[5] W.J. Bainbridge and S.B. Furber.
+MARBLE: An asynchronous on-
+chip macrocell bus. Microprocessors and Microsystems, 24(4):213–222,
+April 2000.
+
+[6] T.S. Balraj and M.J. Foster. Miss Manners: A specialized silicon
+compiler for synchronizers. In Charles E. Leiserson, editor, Advanced
+Research in VLSI, pages 3–20. MIT Press, April 1986.
+
+[7] A. Bardsley. The Balsa web pages.
+http://www.cs.man.ac.uk/amulet/balsa/projects/balsa.
+
+[8] A. Bardsley. Implementing Balsa Handshake Circuits. PhD thesis, De-
+partment of Computer Science, University of Manchester, 2000.
+
+[9] A. Bardsley and D.A. Edwards. Compiling the language Balsa to delay-
+insensitive hardware. In C. D. Kloos and E. Cerny, editors, Hardware
+Description Languages and their Applications (CHDL), pages 89–91,
+April 1997.
+
+[10] A. Bardsley and D.A. Edwards. The Balsa asynchronous circuit synthe-
+sis system. In Forum on Design Languages, September 2000.
+
+[11] A. Bardsley and D.A. Edwards. Synthesising an asynchronous DMA
+controller with Balsa. Journal of Systems Architecture, 46:1309–1319,
+2000.
+
+[12] P.A. Beerel, C.J. Myers, and T.H.-Y. Meng.
+Automatic synthesis of
+gate-level speed-independent circuits. Technical Report CSL-TR-94-
+648, Stanford University, November 1994.
+
+[13] P.A. Beerel, C.J. Myers, and T.H.-Y. Meng. Covering conditions and
+algorithms for the synthesis of speed-independent circuits. IEEE Trans-
+actions on Computer-Aided Design, March 1998.
+
+[14] G. Birtwistle and A. Davis, editors. Proceedings of the Banff VIII Work-
+shop: Asynchronous Digital Circuit Design, Banff, Alberta, Canada,
+August 28–September 3, 1993. Springer Verlag, Workshops in Com-
+puting Science, 1995.
+Contributions from: S.B. Furber, “Comput-
+ing without Clocks: Micropipelining the ARM Processor,” A. Davis,
+“Practical Asynchronous Circuit Design: Methods and Tools,” C.H. van
+Berkel, “VLSI Programming of Asynchronous Circuits for Low Power,”
+J. Ebergen, “Parallel Program and Asynchronous Circuit Design,” A.
+Davis and S. Nowick, “Introductory Survey”.
+
+[15] I. Bogdan, M. Munteau, P.A. Ivey, N.L. Seed, and N. Powell. Power
+reduction techniques for a Viterbi decoder implementation. European
+Low Power Initiative for Electronic System Design (ESDLPD) Third In-
+ternational Workshop, pages 28–48, 2000.
+
+[16] E. Brinksma and T. Bolognesi. Introduction to the ISO specification
+language LOTOS. Computer Networks and ISDN Systems, 14(1), 1987.
+
+[17] E. Brunvand and R.F. Sproull. Translating concurrent programs into
+delay-insensitive circuits.
+In Proc. International Conf. Computer-
+Aided Design (ICCAD), pages 262–265. IEEE Computer Society Press,
+November 1989.
+
+[18] J.A. Brzozowski and C.-J.H. Seger. Asynchronous Circuits. Springer
+Verlag, Monographs in Computer Science, 1994. ISBN: 0-387-94420-6.
+
+[19] S.M. Burns. Performance Analysis and Optimization of Asynchronous
+Circuits. PhD thesis, Computer Science Department, California Institute
+of Technology, 1991. Caltech-CS-TR-91-01.
+
+[20] S.M. Burns. General condition for the decomposition of state holding
+elements. In Proc. International Symposium on Advanced Research in
+Asynchronous Circuits and Systems. IEEE Computer Society Press, March 1996.
+
+[21] S.M. Burns and A.J. Martin. Syntax-directed translation of concurrent
+programs into self-timed circuits. In J. Allen and F. Leighton, editors,
+Advanced Research in VLSI, pages 35–50. MIT Press, 1988.
+
+[22] D.M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems.
+PhD thesis, Stanford University, October 1984.
+
+[23] K.T. Christensen, P. Jensen, P. Korger, and J. Sparsø. The design of an
+asynchronous TinyRISC TR4101 microprocessor core. In Proc. Inter-
+national Symposium on Advanced Research in Asynchronous Circuits
+and Systems, pages 108–119. IEEE Computer Society Press, 1998.
+
+[24] T.-A. Chu. Synthesis of Self-Timed VLSI Circuits from Graph-Theoretic
+Specifications. PhD thesis, MIT Laboratory for Computer Science, June
+1987.
+
+[25] T.-A. Chu and R.K. Roy (editors). Special issue on asynchronous cir-
+cuits and systems. IEEE Design & Test, 11(2), 1994.
+
+[26] T.-A. Chu and L.A. Glasser. Synthesis of self-timed control circuits
+from graphs: An example. In Proc. International Conf. Computer De-
+sign (ICCD), pages 565–571. IEEE Computer Society Press, 1986.
+
+[27] B. Coates, A. Davis, and K. Stevens.
+The Post Office experience:
+Designing a large asynchronous chip. Integration, the VLSI journal,
+15(3):341–366, October 1993.
+
+[28] F. Commoner, A.W. Holt, S. Even, and A. Pnueli.
+Marked directed
+graphs. J. Comput. System Sci., 5(1):511–523, October 1971.
+
+[29] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and
+A. Yakovlev. Petrify: a tool for manipulating concurrent specifications
+and synthesis of asynchronous controllers. In XI Conference on Design
+of Integrated Circuits and Systems, Barcelona, November 1996.
+
+[30] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and
+A. Yakovlev. Petrify: a tool for manipulating concurrent specifications
+and synthesis of asynchronous controllers. IEICE Transactions on In-
+formation and Systems, E80-D(3):315–325, March 1997.
+
+[31] U. Cummings, A. Lines, and A. Martin. An asynchronous pipelined lat-
+tice structure filter. In Proc. International Symposium on Advanced Re-
+search in Asynchronous Circuits and Systems, pages 126–133, Novem-
+ber 1994.
+
+[32] A. Davis. A data-driven machine architecture suitable for VLSI imple-
+mentation. In Proceedings of the First Caltech Conference on VLSI,
+pages 479–494, Pasadena, CA, January 1979.
+
+[33] A. Davis and S.M. Nowick. Asynchronous circuit design: Motivation,
+background, and methods. In G. Birtwistle and A. Davis, editors, Asyn-
+chronous Digital Circuit Design, Workshops in Computing, pages 1–49.
+Springer-Verlag, 1995.
+
+[34] A. Davis and S.M. Nowick. An introduction to asynchronous circuit
+design. Technical Report UUCS-97-013, Department of Computer Sci-
+ence, University of Utah, September 1997.
+
+[35] A. Davis and S.M. Nowick. An introduction to asynchronous circuit
+design. In A. Kent and J. G. Williams, editors, The Encyclopedia of
+Computer Science and Technology, volume 38. Marcel Dekker, New
+York, February 1998.
+
+[36] J.B. Dennis. Data Flow Computation. In Control Flow and Data Flow
+— Concepts of Distributed Programming, International Summer School,
+pages 343–398, Marktoberdorf, West Germany, July 31 – August 12,
+1984. Springer, Berlin.
+
+[37] J.C. Ebergen and R. Berks. Response time properties of linear asyn-
+chronous pipelines. Proceedings of the IEEE, 87(2):308–318, February
+1999.
+
+[38] P.B. Endecott and S.B. Furber.
+Modelling and simulation of asyn-
+chronous systems using the LARD hardware description language. In
+Proceedings of the 12th European Simulation Multiconference, Manch-
+ester, Society for Computer Simulation International, pages 39–43, June
+1994. ISBN 1-56555-148-6.
+
+[39] K.M. Fant and S.A. Brandt. Null Conventional Logic: A complete and
+consistent logic for asynchronous digital circuit synthesis. In Interna-
+tional Conference on Application-specific Systems, Architectures, and
+Processors, pages 261–273, 1996.
+
+[40] R.M. Fuhrer, S.M. Nowick, M. Theobald, N.K. Jha, B. Lin, and L. Plana.
+Minimalist: An environment for the synthesis, verification and testability
+of burst-mode asynchronous machines. Technical Report TR CUCS-020-99,
+Columbia University, NY, July 1999.
+http://www.cs.columbia.edu/~nowick/minimalist.pdf.
+
+[41] S.B. Furber and P. Day. Four-phase micropipeline latch control circuits.
+IEEE Transactions on VLSI Systems, 4(2):247–253, June 1996.
+
+[42] S.B. Furber, P. Day, J.D. Garside, N.C. Paver, S. Temple, and J.V.
+Woods. The design and evaluation of an asynchronous microproces-
+sor. In Proc. Int’l. Conf. Computer Design, pages 217–220, October
+1994.
+
+[43] S.B. Furber, D.A. Edwards, and J.D. Garside. AMULET3: a 100 MIPS
+asynchronous embedded processor. In Proc. International Conf. Com-
+puter Design (ICCD), September 2000.
+
+[44] S.B. Furber, J.D. Garside, P. Riocreux, S. Temple, P. Day, J. Liu, and
+N.C. Paver. AMULET2e: An asynchronous embedded controller. Pro-
+ceedings of the IEEE, 87(2):243–256, February 1999.
+
+[45] S.B. Furber, J.D. Garside, S. Temple, J. Liu, P. Day, and N.C. Paver.
+AMULET2e: An asynchronous embedded controller. In Proc. Interna-
+tional Symposium on Advanced Research in Asynchronous Circuits and
+Systems, pages 290–299. IEEE Computer Society Press, 1997.
+
+[46] EU IST-1999-13515, G3Card – generation 3 smartcard, January 2000.
+
+[47] J.D. Garside. The Asynchronous Logic Homepages.
+http://www.cs.man.ac.uk/async/.
+
+[48] J.D. Garside, W.J. Bainbridge, A. Bardsley, D.A. Edwards, S.B. Furber,
+J. Liu, D.W. Lloyd, S. Mohammadi, J.S. Pepper, O. Petlin, S. Temple,
+and J.V. Woods. AMULET3i – an asynchronous system-on-chip. In
+Proc. International Symposium on Advanced Research in Asynchronous
+Circuits and Systems, pages 162–175. IEEE Computer Society Press,
+April 2000.
+
+[49] J.D. Garside, S. Temple, and R. Mehra. The AMULET2e cache sys-
+tem. In Proc. International Symposium on Advanced Research in Asyn-
+chronous Circuits and Systems (Async’96), pages 208–217. IEEE Com-
+puter Society Press, March 1996.
+
+[50] D.A. Gilbert. Dependency and Exception Handling in an Asynchronous
+Microprocessor. PhD thesis, Department of Computer Science, Univer-
+sity of Manchester, 1997.
+
+[51] D.A. Gilbert and J.D. Garside.
+A result forwarding mechanism for
+asynchronous pipelined systems. In Proc. International Symposium on
+Advanced Research in Asynchronous Circuits and Systems (Async’97),
+pages 2–11. IEEE Computer Society Press, April 1997.
+
+[52] B. Gilchrist, J.H. Pomerene, and S.Y. Wong. Fast carry logic for digital
+computers. IRE Transactions on Electronic Computers, EC-4(4):133–
+136, December 1955.
+
+[53] L.A. Glasser and D.W. Dobberpuhl. The Design and Analysis of VLSI
+Circuits. Addison-Wesley, 1985.
+
+[54] S. Hauck. Asynchronous design methodologies: An overview. Proceed-
+ings of the IEEE, 83(1):69–93, January 1995.
+
+[55] L.G. Heller, W.R. Griffin, J.W. Davis, and N.G. Thoma. Cascode voltage
+switch logic: A differential CMOS logic family.
+Proc. International
+Solid State Circuits Conference, pages 16–17, February 1984.
+
+[56] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantita-
+tive Approach (2nd edition). Morgan Kaufmann, 1996.
+
+[57] C.A.R. Hoare. Communicating sequential processes. Communications
+of the ACM, 21(8):666–677, August 1978.
+
+[58] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall,
+Englewood Cliffs, 1985.
+
+[59] D.A. Huffman.
+The synthesis of sequential switching circuits.
+J.
+Franklin Inst., pages 161–190, 275–303, March/April 1954.
+
+[60] D.A. Huffman. The synthesis of sequential switching circuits. In E. F.
+Moore, editor, Sequential Machines: Selected Papers. Addison-Wesley,
+1964.
+
+[61] H. Hulgaard, S.M. Burns, and G. Borriello. Testing asynchronous cir-
+cuits: A survey. Integration, the VLSI journal, 19(3):111–131, Novem-
+ber 1995.
+
+[62] K. Hwang. Computer Arithmetic: Principles, Architecture, and Design.
+John Wiley & Sons, 1979.
+
+[63] ISO/IEC. Mifare identification cards - contactless integrated circuit(s)
+cards - proximity cards. Standard ISO/IEC Standard 14443 Type A.
+
+[64] H. Jacobson, E. Brunvand, G. Gopalakrishnan, and P. Kudva. High-level
+asynchronous system design using the ACK framework. In Proc. Inter-
+national Symposium on Advanced Research in Asynchronous Circuits
+and Systems, pages 93–103. IEEE Computer Society Press, April 2000.
+
+[65] D. Jaggar. Advanced RISC Machines Architecture Reference Manual.
+Prentice Hall, 1996.
+
+[66] M. Johnson. Superscalar Microprocessor Design. Series in Innovative
+Technology. Prentice Hall, 1991.
+
+[67] S.C. Johnson and S. Mazor. Silicon compiler lets system makers design
+their own VLSI chips. Electronic Design, 32(20):167–181, 1984.
+
+[68] G. Jones. Programming in OCCAM. Prentice-Hall International, 1987.
+
+[69] M.B. Josephs, S.M. Nowick, and C.H. van Berkel. Modeling and de-
+sign of asynchronous circuits. Proceedings of the IEEE, 87(2):234–242,
+February 1999.
+
+[70] G.C. Clark Jr. and J.B. Cain. Error correcting coding for digital com-
+munication. Plenum, 1981.
+
+[71] G.D. Forney Jr. The Viterbi algorithm. Proc. IEEE, 13(3):268–278,
+1973.
+
+[72] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice Hall, 1992.
+
+[73] J. Kessels, T. Kramer, G. den Besten, A. Peeters, and V. Timm. Apply-
+ing asynchronous circuits in contactless smart cards. In Proc. Interna-
+tional Symposium on Advanced Research in Asynchronous Circuits and
+Systems, pages 36–44. IEEE Computer Society Press, April 2000.
+
+[74] J. Kessels, T. Kramer, A. Peeters, and V. Timm. DESCALE: a design ex-
+periment for a smart card application consuming low energy. In R. van
+Leuken, R. Nouta, and A. de Graaf, editors, European Low Power Ini-
+tiative for Electronic System Design, pages 247–262. Delft Institute of
+Microelectronics and Submicron Technology, July 2000.
+
+[75] Z. Kohavi. Switching and Finite Automata Theory. McGraw-Hill, 1978.
+
+[76] A. Kondratyev, J. Cortadella, M. Kishinevsky, L. Lavagno, and
+A. Yakovlev. Logic decomposition of speed-independent circuits. Pro-
+ceedings of the IEEE, 87(2):347–362, February 1999.
+
+[77] J. Liu. Arithmetic and control components for an asynchronous micro-
+processor. PhD thesis, Department of Computer Science, University of
+Manchester, 1997.
+
+[78] D.W. Lloyd. VHDL models of asynchronous handshaking. (Personal
+communication, August 1998).
+
+[79] A.J. Martin.
+The probe: An addition to communication primitives.
+Information Processing Letters, 20(3):125–130, 1985.
+Erratum: IPL
+21(2):107, 1985.
+
+[80] A.J. Martin. Compiling communicating processes into delay-insensitive
+VLSI circuits. Distributed Computing, 1(4):226–234, 1986.
+
+[81] A.J. Martin. Formal program transformations for VLSI circuit synthesis.
+In E.W. Dijkstra, editor, Formal Development of Programs and Proofs,
+UT Year of Programming Series, pages 59–80. Addison-Wesley, 1989.
+
+[82] A.J. Martin. The limitations to delay-insensitivity in asynchronous cir-
+cuits. In W.J. Dally, editor, Advanced Research in VLSI: Proceedings of
+the Sixth MIT Conference, pages 263–278. MIT Press, 1990.
+
+[83] A.J. Martin. Programming in VLSI: From communicating processes
+to delay-insensitive circuits.
+In C.A.R. Hoare, editor, Developments
+in Concurrency and Communication, UT Year of Programming Series,
+pages 1–64. Addison-Wesley, 1990.
+
+[84] A.J. Martin. Synthesis of asynchronous VLSI circuits, 1991.
+
+[85] A.J. Martin. Asynchronous datapaths and the design of an asynchronous
+adder. Formal Methods in System Design, 1(1):119–137, July 1992.
+
+[86] A.J. Martin, S.M. Burns, T. K. Lee, D. Borkovic, and P.J. Hazewindus.
+The design of an asynchronous microprocessor.
+In Charles L. Seitz,
+editor, Advanced Research in VLSI, pages 351–373. MIT Press, 1989.
+
+[87] A.J. Martin, S.M. Burns, T.K. Lee, D. Borkovic, and P.J. Hazewindus.
+The first asynchronous microprocessor: The test results. Computer Ar-
+chitecture News, 17(4):95–98, 1989.
+
+[88] A.J. Martin, A. Lines, R. Manohar, M. Nyström, P. Penzes,
+R. Southworth, U.V. Cummings, and T.-K. Lee. The design of an asynchronous
+MIPS R3000. In Proceedings of the 17th Conference on Advanced Research in
+VLSI, pages 164–181. MIT Press, September 1997.
+
+[89] R. Milner. Communication and Concurrency. Prentice-Hall, 1989.
+
+[90] C.E. Molnar, I.W. Jones, W.S. Coates, and J.K. Lexau. A FIFO ring
+oscillator performance experiment. In Proc. International Symposium
+on Advanced Research in Asynchronous Circuits and Systems, pages
+279–289. IEEE Computer Society Press, April 1997.
+
+[91] C.E. Molnar, I.W. Jones, W.S. Coates, J.K. Lexau, S.M. Fairbanks, and
+I.E. Sutherland. Two FIFO ring performance experiments. Proceedings
+of the IEEE, 87(2):297–307, February 1999.
+
+[92] D.E. Muller. Asynchronous logics and application to information pro-
+cessing. In H. Aiken and W. F. Main, editors, Proc. Symp. on Applica-
+tion of Switching Theory in Space Technology, pages 289–297. Stanford
+University Press, 1963.
+
+[93] D.E. Muller and W.S. Bartky. A theory of asynchronous circuits. In
+Proceedings of an International Symposium on the Theory of Switch-
+ing, Cambridge, April 1957, Part I, pages 204–243. Harvard University
+Press, 1959. The annals of the computation laboratory of Harvard Uni-
+versity, Volume XXIX.
+
+[94] T. Murata. Petri Nets: Properties, Analysis and Applications. Proceed-
+ings of the IEEE, 77(4):541–580, April 1989.
+
+[95] C.J. Myers. Asynchronous Circuit Design. John Wiley & Sons, July
+2001. ISBN: 0-471-41543-X.
+
+[96] National Bureau of Standards. Data encryption standard, January 1977.
+Federal Information Processing Standards Publication 46.
+
+[97] C.D. Nielsen. Evaluation of function blocks for asynchronous design.
+In Proc. European Design Automation Conference (EURO-DAC), pages
+454–459. IEEE Computer Society Press, September 1994.
+
+[98] C.D. Nielsen, J. Staunstrup, and S.R. Jones. Potential performance ad-
+vantages of delay-insensitivity. In M. Sami and J. Calzadilla-Daguerre,
+editors, Proceedings of IFIP workshop on Silicon Architectures for
+Neural Nets, St Paul-de-Vence, France, November 1990. North-Holland,
+Amsterdam, 1991.
+
+[99] L.S. Nielsen. Low-power Asynchronous VLSI Design. PhD thesis, De-
+partment of Information Technology, Technical University of Denmark,
+1997. IT-TR:1997-12.
+
+[100] L.S. Nielsen, C. Niessen, J. Sparsø, and C.H. van Berkel. Low-power
+operation using self-timed circuits and adaptive scaling of the supply
+voltage. IEEE Transactions on VLSI Systems, 2(4):391–397, 1994.
+
+[101] L.S. Nielsen and J. Sparsø. A low-power asynchronous data-path for a
+FIR filter bank. In Proc. International Symposium on Advanced Re-
+search in Asynchronous Circuits and Systems, pages 197–207. IEEE
+Computer Society Press, 1996.
+
+[102] L.S. Nielsen and J. Sparsø.
+An 85 µW asynchronous filter-bank for
+a digital hearing aid. In Proc. IEEE International Solid State Circuits
+Conference, pages 108–109, 1998.
+
+[103] L.S. Nielsen and J. Sparsø. Designing asynchronous circuits for low
+power: An IFIR filter bank for a digital hearing aid. Proceedings of the
+IEEE, 87(2):268–281, February 1999. Special issue on “Asynchronous
+Circuits and Systems” (Invited Paper).
+
+[104] D.C. Noice. A Two-Phase Clocking Discipline for Digital Integrated
+Circuits. PhD thesis, Department of Electrical Engineering, Stanford
+University, February 1983.
+
+[105] S.M. Nowick. Design of a low-latency asynchronous adder using specu-
+lative completion. IEE Proceedings, Computers and Digital Techniques,
+143(5):301–307, September 1996.
+
+[106] S.M. Nowick, M.B. Josephs, and C.H. van Berkel (editors).
+Special
+issue on asynchronous circuits and systems. Proceedings of the IEEE,
+87(2), February 1999.
+
+[107] S.M. Nowick, K.Y. Yun, and P.A. Beerel. Speculative completion for
+the design of high-performance asynchronous dynamic adders. In Proc.
+International Symposium on Advanced Research in Asynchronous Cir-
+cuits and Systems, pages 210–223. IEEE Computer Society Press, April
+1997.
+
+[108] International Standards Organization. LOTOS — a formal description
+technique based on the temporal ordering of observational behaviour.
+ISO IS 8807, 1989.
+
+[109] N.C. Paver, P. Day, C. Farnsworth, D.L. Jackson, W.A. Lien, and J. Liu.
+A low-power, low-noise configurable self-timed DSP. In Proc. Interna-
+tional Symposium on Advanced Research in Asynchronous Circuits and
+Systems, pages 32–42, 1998.
+
+[110] M. Pedersen.
+Design of asynchronous circuits using standard CAD
+tools. Technical Report IT-E 774, Technical University of Denmark,
+Dept. of Information Technology, 1998. (In Danish).
+
+[111] A.M.G. Peeters. The ‘Asynchronous’ Bibliography.
+http://www.win.tue.nl/~wsinap/async.html.
+Corresponding e-mail address: async-bib@win.tue.nl.
+
+[112] A.M.G. Peeters. Single-Rail Handshake Circuits. PhD thesis, Eind-
+hoven University of Technology, June 1996.
+http://www.win.tue.nl/~wsinap/pdf/Peeters96.pdf.
+
+[113] J.L. Peterson. Petri nets. Computing Surveys, 9(3):223–252, September
+1977.
+
+[114] Philips Semiconductors. PCA5007 handshake-technology pager IC data
+sheet. http://www.semiconductors.philips.com/pip/PCA5007H.
+
+[115] J. Rabaey. Digital Integrated Circuits: A Design Perspective. Prentice-
+Hall, 1996.
+
+[116] P. Rakers, L. Connell, T. Collins, and D. Russell. Secure contactless
+smartcard ASIC with DPA protection. IEEE Journal of Solid-State Cir-
+cuits, 36(3):559–565, March 2001.
+
+[117] M. Renaudin, P. Vivet, and F. Robin. ASPRO-216: A standard-cell QDI
+16-bit RISC asynchronous microprocessor. In Proc. International Sym-
+posium on Advanced Research in Asynchronous Circuits and Systems
+(Async’98), pages 22–31. IEEE Computer Society Press, April 1998.
+
+[118] M. Renaudin, P. Vivet, and F. Robin. A design framework for asyn-
+chronous/synchronous circuits based on CHP to HDL translation. In
+Proc. International Symposium on Advanced Research in Asynchronous
+Circuits and Systems, pages 135–144, April 1999.
+
+[119] R. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital
+signatures and public-key crypto systems, June 1978.
+
+[120] M. Roncken. Defect-oriented testability for asynchronous ICs. Pro-
+ceedings of the IEEE, 87(2):363–375, February 1999.
+
+[121] C.L. Seitz. System timing. In C.A. Mead and L.A. Conway, editors,
+Introduction to VLSI Systems, chapter 7. Addison-Wesley, 1980.
+
+[122] N.P. Singh. A design methodology for self-timed systems. Master’s
+thesis, Laboratory for Computer Science, MIT, 1981. MIT/LCS/TR-
+258.
+
+[123] J. Sparsø, C.D. Nielsen, L.S. Nielsen, and J. Staunstrup. Design of self-
+timed multipliers: A comparison. In S. Furber and M. Edwards, editors,
+Asynchronous Design Methodologies, volume A-28 of IFIP Transac-
+tions, pages 165–180. Elsevier Science Publishers, 1993.
+
+[124] J. Sparsø and J. Staunstrup. Delay-insensitive multi-ring structures. IN-
+TEGRATION, the VLSI Journal, 15(3):313–340, October 1993.
+
+[125] J. Sparsø, J. Staunstrup, and M. Dantzer-Sørensen. Design of delay in-
+sensitive circuits using multi-ring structures. In G. Musgrave, editor,
+Proc. of EURO-DAC ’92, European Design Automation Conference,
+Hamburg, Germany, September 7-10, 1992, pages 15–20. IEEE Com-
+puter Society Press, 1992.
+
+[126] R.F. Sproull, I.E. Sutherland, and C.E. Molnar. The counterflow pipeline
+processor architecture. IEEE Design & Test of Computers, 11(3):48–59,
+1994.
+
+[127] L. Stok. Architectural Synthesis and Optimization of Digital Systems.
+PhD thesis, Eindhoven University of Technology, 1991.
+
+[128] I.E. Sutherland.
+Micropipelines.
+Communications of the ACM,
+32(6):720–738, June 1989.
+
+[129] Synopsys, Inc. Synopsys VSS Family Core Programs Manual, 1997.
+
+[130] A. Takamura, M. Kuwako, M. Imai, T. Fujii, M. Ozawa, I. Fukasaku,
+Y. Ueno, and T. Nanya. TITAC-2: An asynchronous 32-bit micropro-
+cessor based on scalable-delay-insensitive model. In Proc. International
+Conf. Computer Design (ICCD’97), pages 288–294. MIT Press, Octo-
+ber 1997.
+
+[131] H. Terada, S. Miyata, and M. Iwata. DDMPs: Self-timed
+super-pipelined data-driven multimedia processors. Proceedings of the IEEE,
+87(2):282–296, February 1999.
+
+[132] M. Theobald and S.M. Nowick. Transformations for the synthesis and
+optimization of asynchronous distributed control. In Proc. ACM/IEEE
+Design Automation Conference, 2001.
+
+[133] S.H. Unger. Asynchronous Sequential Switching Circuits.
+Wiley-Interscience, John Wiley & Sons, Inc., New York, 1969.
+
+[134] C.H. van Berkel. Beware the isochronic fork. INTEGRATION, the VLSI
+journal, 13(3):103–128, 1992.
+
+[135] C.H. van Berkel. Handshake Circuits: an Asynchronous Architecture
+for VLSI Programming, volume 5 of International Series on Parallel
+Computation. Cambridge University Press, 1993.
+
+[136] C.H. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, and
+F. Schalij. Asynchronous circuits for low power: a DCC error corrector.
+IEEE Design & Test, 11(2):22–32, 1994.
+
+[137] C.H. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, and
+F. Schalij. A fully asynchronous low-power error corrector for the DCC
+player. In ISSCC 1994 Digest of Technical Papers, volume 37, pages
+88–89. IEEE, 1994. ISSN 0193-6530.
+
+[138] C.H. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken,
+F. Schalij, and R. van de Wiel. A single-rail re-implementation of a
+DCC error detector using a generic standard-cell library. In 2nd Working
+Conference on Asynchronous Design Methodologies, London, May
+30-31, 1995, pages 72–79, 1995.
+
+[139] C.H. van Berkel, F. Huberts, and A. Peeters. Stretching quasi delay
+insensitivity by means of extended isochronic forks. In Asynchronous
+Design Methodologies, pages 99–106. IEEE Computer Society Press,
+May 1995.
+
+[140] C.H. van Berkel, M.B. Josephs, and S.M. Nowick. Scanning the technology:
+Applications of asynchronous circuits. Proceedings of the
+IEEE, 87(2):223–233, February 1999.
+
+[141] C.H. van Berkel, J. Kessels, M. Roncken, R. Saeijs, and F. Schalij. The
+VLSI-programming language Tangram and its translation into hand-
+shake circuits. In Proc. European Conference on Design Automation
+(EDAC), pages 384–389, 1991.
+
+[142] C.H. van Berkel, C. Niessen, M. Rem, and R. Saeijs. VLSI program-
+ming and silicon compilation. In Proc. International Conf. Computer
+Design (ICCD), pages 150–166, Rye Brook, New York, 1988. IEEE
+Computer Society Press.
+
+[143] H. van Gageldonk. An Asynchronous Low-Power 80C51 Microcontroller.
+PhD thesis, Dept. of Math. and C.S., Eindhoven Univ. of Technology,
+September 1998.
+
+[144] H. van Gageldonk, D. Baumann, C.H. van Berkel, D. Gloor, A. Peeters,
+and G. Stegmann. An asynchronous low-power 80c51 microcontroller.
+In Proc. International Symposium on Advanced Research in Asyn-
+chronous Circuits and Systems, pages 96–107. IEEE Computer Society
+Press, April 1998.
+
+[145] P. Vanbekbergen. Synthesis of Asynchronous Control Circuits from
+Graph-Theoretic Specifications. PhD thesis, Catholic University of Leuven,
+September 1993.
+
+[146] V.I. Varshavsky, M.A. Kishinevsky, V.B. Marakhovsky, V.A. Peschansky,
+L.Y. Rosenblum, A.R. Taubin, and B.S. Tzirlin. Self-timed Control
+of Concurrent Processes. Kluwer Academic Publisher, 1990.
+V.I. Varshavsky Ed., (Russian edition: 1986).
+
+[147] T. Verhoeff. Delay-insensitive codes - an overview. Distributed Com-
+puting, 3(1):1–8, 1988.
+
+[148] A.J. Viterbi. Error bounds for convolutional codes and an asymptotically
+optimum decoding algorithm. In IEEE Transactions on Information
+Theory, volume 13, pages 260–269, 1967.
+
+[149] P. Vivet and M. Renaudin. CHP2VHDL, a CHP to VHDL translator
+- towards asynchronous-design simulation. In L. Lavagno and M.B.
+Josephs, editors, Handouts from the ACiD-WG Workshop on Specification
+models and languages and technology effects of asynchronous design.
+Dipartimento di Elettronica, Politecnico di Torino, Italy, January
+1998.
+
+[150] J.F. Wakerly. Digital Design: Principles and Practices, 3/e. Prentice-
+Hall, 2001.
+
+[151] N. Weste and K. Eshraghian. Principles of CMOS VLSI Design – A Systems
+Perspective, 2nd edition. Addison-Wesley, 1993.
+
+[152] Rik van de Wiel. High-level test evaluation of asynchronous circuits.
+In Asynchronous Design Methodologies, pages 63–71. IEEE Computer
+Society Press, May 1995.
+
+[153] T.E. Williams. Self-Timed Rings and their Application to Division. PhD
+thesis, Department of Electrical Engineering and Computer Science,
+Stanford University, 1991. CSL-TR-91-482.
+
+[154] T.E. Williams. Analyzing and improving latency and throughput in self-
+timed rings and pipelines. In Tau-92: 1992 Workshop on Timing Issues
+in the Specification and Synthesis of Digital Systems. ACM/SIGDA,
+March 1992.
+
+[155] T.E. Williams. Performance of iterative computation in self-timed rings.
+Journal of VLSI Signal Processing, 6(3), October 1993.
+
+[156] T.E. Williams and M.A. Horowitz. A zero-overhead self-timed 160
+ns. 54 bit CMOS divider. IEEE Journal of Solid State Circuits,
+26(11):1651–1661, 1991.
+
+[157] T.E. Williams, N. Patkar, and G. Shen. SPARC64: A 64-b
+64-active-instruction out-of-order-execution MCM processor. IEEE Journal of
+Solid-State Circuits, 30(11):1215–1226, November 1995.
+
+[158] J.V. Woods, P. Day, S.B. Furber, J.D. Garside, N.C. Paver, and S. Tem-
+ple. AMULET1: An asynchronous ARM processor. IEEE Transactions
+on Computers, 46(4):385–398, April 1997.
+
+[159] C. Ykman-Couvreur, B. Lin, and H. de Man. Assassin: A synthesis sys-
+tem for asynchronous control circuits. Technical report, IMEC, Septem-
+ber 1994. User and Tutorial manual.
+
+[160] K.Y. Yun and D.L. Dill. Automatic synthesis of extended burst-mode
+circuits: Part II (automatic synthesis). IEEE Transactions on Computer-
+Aided Design, 18(2):118–132, February 1999.
+
+
+Index
+
+Acknowledgement (or indication), 15
+Activation port, 163
+Active port, 156
+Actual case latency, 65
+Adaptive voltage scaling, 231
+Addition (ripple-carry), 64
+Amulet microprocessors, 274
+Amulet1, 274, 281, 285, 290
+Amulet2, 290
+Amulet2e, 275, 303
+Amulet3, 282, 286, 291, 297, 300
+Amulet3i, 156, 205, 275, 303, 313
+DRACO, 278, 310
+And-or-invert (AOI) gates, 102
+Arbitration, 79, 202, 210–211, 269, 287, 304
+ARM, 274, 280, 297
+ASPRO-216, 278, 284
+Asymmetric delay, 48, 53
+Asynchronous advantages, 3, 231
+Asynchronous disadvantages, 232
+Asynchronous synthesis, 155
+Atomic complex gate, 94, 103
+Automatic performance adaptation, 231
+Average power consumption, 237
+Balsa, 123, 155, 312
+communications, 179
+area cost, 167
+array types, 175
+arrayed channels, 166, 176
+auto-assignment, 178, 184
+channel viewer, 171
+conditional execution, 180
+constants, 166, 174
+data types, 173
+design flow, 159
+DMA controller, 211
+enumerated types, 174
+for loops, 166
+hardware sharing, 187
+looping constructs, 180
+modular compilation, 165
+numeric types, 173
+operators, 181
+parallel composition, 165
+
+parameterised descriptions, 193
+program structure, 181
+record types, 174
+recursive definitions, 195
+simulation, 168
+structural iteration, 180, 189
+test harness, 168, 197
+tools, 159
+Branch colour, 284, 300
+Breeze, 159, 162
+Bubble limited, 49
+Bubble, 30
+Bundled-data, 9, 157, 255
+Burst mode, 86
+input burst, 86
+output burst, 86
+C-element, 14, 58, 92, 257
+asymmetric, 100, 59
+generalized, 100, 103, 105
+implementation, 15
+specification, 15, 92
+Cache, 275, 303–304, 307
+Calibrated time delays, 313
+Caltech, 133, 276
+Capture-pass latch, 19
+Cast, 176
+CCS (calculus of communicating systems), 123
+Channel (or link), 7, 30, 156
+communication in Balsa, 162
+Channel type
+biput, 115
+nonput, 115
+pull, 10, 115
+push, 10, 115
+Chip area interconnect (Chain), 312
+CHP (communicating hardware processes), 123–124, 278
+Circuit templates:
+for statement, 37
+if statement, 36
+while statement, 38
+Classification
+delay-insensitive (DI), 25
+quasi delay-insensitive (QDI), 25
+self-timed, 26
+speed-independent (SI), 25
+Closed circuit, 23
+Codeword (dual-rail), 12
+empty, 12
+intermediate, 12
+valid, 12
+Compatible states, 85
+Complete state coding (CSC), 88
+Completion indication, 65
+Completion
+detection, 21–22, 302
+indication, 62
+strong, 62
+weak, 62
+Complex gates, 104
+Concurrent processes, 123
+Concurrent statements, 123
+Consistent state assignment, 88
+Control limited, 50
+Control logic for transition signaling, 20
+Control-data-flow graphs, 36
+Convolution encoding, 250
+Counterflow Pipeline Processor (CFPP), 286
+CSP (communicating sequential processes), 123, 223
+Cycle time of a ring, 49
+Data dependency, 279
+Data encoding
+bundled-data, 9
+dual-rail, 11
+m-of-n, 14
+one-hot (or 1-of-n), 13
+single-rail, 10
+Data limited, 49
+Data types, 173
+Data validity scheme (4-phase bundled-data)
+broad, 116
+early, 116
+extended early, 116
+late, 116
+Data validity, 156
+Data-flow abstraction, 7
+DCVSL, 70
+Deadlock, 30, 155, 199, 285, 287, 305
+Delay assumptions, 23
+Delay insensitive minterm synthesis (DIMS), 67
+Delay matching, 11, 236
+Delay model
+fixed delay, 83
+inertial delay, 83
+delay time, 83
+reject time, 83
+min-max delay, 83
+transport delay, 83
+unbounded delay, 83
+Delay selection, 66, 305
+
+Delay-insensitive (DI), 12, 17, 25, 156
+codes, 12, 312
+Demultiplexer (DEMUX), 32
+Dependency graph, 52
+DES coprocessor, 241
+Design for test, 314
+Determinism, 282
+Differential logic, 70
+DIMS, 67–68
+DMA controller, 202, 205
+Balsa description, 211
+control unit, 209, 213
+on DRACO, 312
+transfer engine, 210, 212
+DRACO, 276, 278, 310, 315
+Dual-rail carry signals, 65
+Dual-rail encoding, 11
+Dummy environment, 87
+Dynamic wavelength, 49
+Electromagnetic interference (EMI), 278, 315
+emission spectrum, 231
+Electromagnetic radiation (as power source), 222
+Empty word, 12, 29–30
+Environment, 83
+Event, 9
+Exceptions, 297
+Excitation region, 97
+Excited gate/variable, 24
+FIFO, 16, 257
+Finite state machine (using a ring), 35
+Firing (of a gate), 24
+For statement, 37, 166
+Fork, 31
+Forward latency, 47
+Four-phase handshake, 225
+Function block, 31, 60–61
+bundled-data (“speculative completion”), 66
+bundled-data, 18, 65
+dual-rail (DIMS), 67
+dual-rail (Martin’s adder), 71
+dual-rail (null convention logic), 69
+dual-rail (transistor level CMOS), 70
+dual-rail, 22
+hybrid, 73
+strongly indicating, 62
+weakly indicating, 62
+Fundamental mode, 81, 83–84
+Generalized C-element, 103, 105
+Generate (carry), 65
+Globally Asynchronous, Locally Synchronous (GALS), 312
+Greatest common divisor (GCD), 38, 131, 226
+Guarded command, 128, 240
+Guarded repetition, 128
+Halt, 237, 282
+Handshake channel, 115, 156, 225
+biput, 115
+nonput, 115, 129
+pull, 10, 115, 129, 156
+push, 10, 115, 129, 156
+Handshake circuit, 128, 162, 223
+2-place ripple FIFO, 130–131
+2-place shift register, 129
+greatest common divisor (GCD), 132, 226
+Handshake component, 156, 225
+arbiter, 79
+bar, 131
+case, 157
+demultiplexer, 32, 76, 131
+do, 131, 226
+fetch, 157, 163
+fork, 31, 58, 131, 133
+join, 31, 58, 130
+latch, 29, 31, 57
+2-phase bundled-data, 19
+4-phase bundled-data, 18, 106
+4-phase dual-rail, 21
+merge, 32, 58
+multiplexer, 32, 76, 109, 131
+parallel, 225–226
+passivator, 130
+repeater, 129, 163
+sequencer, 129, 163, 225
+transferer, 130
+variable, 130, 163
+Handshake expansion, 133
+Handshake protocol, 7, 9
+2-phase bundled-data, 9, 274–275
+2-phase dual-rail, 13
+4-phase bundled-data, 9, 117, 255
+4-phase dual-rail, 11
+non-return-to-zero (NRZ), 10
+return-to-zero (RTZ), 10
+Handshaking, 7, 155
+Hazard, 297
+dynamic-01, 83
+dynamic-10, 83, 95
+static-0, 83
+static-1, 83, 94
+Huffman, D. A., 84
+Hysteresis, 22, 64
+If statement, 36, 181
+IFIR filter bank, 39
+Indication (or acknowledgement), 15
+of completion, 65
+dependency graphs, 73
+distribution of valid/empty indication, 72
+strong, 62
+weak, 62
+Initial state, 101
+Initialization, 101, 30
+Input free choice, 88
+Input-output mode, 81, 84
+Instruction prefetching, 236
+
+Intermediate codeword, 12
+Interrupts, 299
+Isochronic fork, 26
+Iterative computation (using a ring), 35
+Join, 31
+Kill (carry), 65
+LARD, 159
+Latch (see also: handshake comp.), 18
+Latch controller, 106
+fully-decoupled, 120
+normally opaque, 121
+normally transparent, 121
+semi-decoupled, 120
+simple/un-decoupled, 119
+Latency, 47
+actual case, 65
+Line fetch latch (LFL), 308
+Link (or channel), 7, 30
+Liveness, 88
+Lock FIFO, 290
+Logic decomposition, 94
+Logic thresholds, 27
+LOTOS, 123
+M-of-n threshold gates with hysteresis, 69
+Makefile, 165–166
+MARBLE bus, 209, 304, 312
+Matched delay, 11, 65
+Memory, 302
+Merge, 32
+Metastability, 78
+filter, 78
+mean time between failure, 79
+probability of, 79
+Micropipelines, 19, 156, 274
+Microprocessors
+80C51, 236
+Amulet series, 274
+ASPRO-216, 278, 284
+asynchronous MIPS R3000, 133
+asynchronous MIPS, 39
+CFPP, 286
+MiniMIPS, 276
+TITAC-2, 276
+Minterm, 22, 67
+Modulo-10 counter, 158, 185
+Modulo-16 counter, 183
+Monotonic cover constraint, 97, 99, 103
+Muller C-element, 15
+Muller model of a closed circuit, 23
+Muller pipeline/distributor, 16, 257
+Muller, D., 84
+Multi-cycle instruction, 281
+Multiplexer (MUX), 32, 109
+Mutual exclusion, 58, 77, 300, 304
+mutual exclusion element (MUTEX), 77
+N-way multiplexer, 195
+NCL adder, 70
+Non-determinism, 282, 305
+Non-return-to-zero (NRZ), 10
+Null Convention Logic (NCL), 69
+NULL, 12
+OCCAM, 123
+Occupancy (or static spread), 49
+One-hot encoding, 13
+Operator reduction, 134
+Optimization, 227
+Parallel composition, 164
+Passive port, 156
+Performance parameters:
+cycle time of a ring, 49
+dynamic wavelength, 49
+forward latency, 47
+latency, 47
+period, 48
+reverse latency, 48
+throughput, 49
+Performance
+analysis and optimization, 41
+average case, 280
+Period, 48
+Persistency, 88
+Petri net, 86
+merge, 88
+1-bounded, 88
+controlled choice, 89
+firing, 86
+fork, 88
+input free choice, 88
+join, 88
+liveness, 88
+places, 86
+token, 86
+transition, 86
+Petrify, 102, 296
+Pipeline, 5, 30, 279
+2-phase bundled-data, 19
+4-phase bundled-data, 18
+4-phase dual-rail, 20
+balance, 281
+Place, 86
+Power consumption, 231, 234
+Power efficiency, 250
+Power supply, 246
+Precharged CMOS circuitry, 116
+Prefetch unit (80C51), 239
+Primitive flow table, 85
+Probe, 123, 125
+Process decomposition, 133
+Processors, 274
+Production rule expansion, 134
+Propagate (carry), 65
+Pull channel, 10, 115, 156
+Push channel, 10, 115, 156
+Quasi delay-insensitive (QDI), 25
+
+Quiescent region, 97
+Re-shuffling signal transitions, 102, 112
+Read-after-write data hazard, 40
+Receive, 123, 125
+Reduced flow table, 85
+Register
+dependency, 289
+locking, 40, 289
+Rendezvous, 125
+Reorder buffer, 291
+Reset function, 97
+Reset signal, 163
+Return-to-zero (RTZ), 9–10
+Reverse latency, 48
+Ring, 30, 296
+finite state machine, 35
+iterative computation, 35
+Ripple FIFO, 16
+Self-timed, 26
+Semantics-preserving transformations, 133
+Send, 123, 125
+Sequencer, 225
+Sequencing, 162
+Serial unary arithmetic, 257
+Set function, 97
+Set-Reset implementation, 96
+Shared resource, 77
+Sharing, 187, 223
+Sharp DDMP, 278
+Shift register
+with parallel load, 44
+Signal transition graph (STG), 86, 297
+Signal transition, 9
+Silicon compiler, 124, 223
+Simulation, 168
+Single input change, 84
+Single-place buffer, 161
+Single-rail, 10
+Smart cards, 222, 232
+Spacer, 12
+Speculative completion, 66
+Speed adaptation, 248
+Speed-independent (SI), 23–25, 83
+Stable gate/variable, 23
+Standard C-element, 106
+implementation, 96
+State graph, 85
+Static data-flow structure, 7, 29
+Static data-flow structure
+examples:
+greatest common divisor (GCD), 38
+IFIR filter bank, 39
+MIPS microprocessor, 39
+simple example, 33
+vector multiplier, 40
+Static spread (or occupancy), 49, 120
+Static type checking, 118
+Stuck-at fault model, 27
+Supply voltage variations, 231, 236
+Synchronizer flip-flop, 78
+Synchronous message passing, 123
+Syntax-directed compilation, 128, 155
+Tangram examples:
+2-place ripple FIFO, 127
+2-place shift register, 126
+GCD using guarded repetition, 128
+GCD using while and if statements, 127
+Tangram, 123, 155, 222–223, 278
+Technology mapping, 103, 224
+Test, 27, 245, 314
+IDDQ testing, 28
+halting of circuit, 28, 246
+isochronic forks, 28
+short and open faults, 28
+stuck-at faults, 27
+toggle test, 28
+untestable stuck-at faults, 28
+Throughput, 42, 49
+Thumb decoder, 282
+Time safe, 78
+TITAC-2, 276, 308
+Token, 7, 30, 86
+Transition, 86
+
+Transparent to handshaking, 7, 23, 33, 61
+Two-place buffer, 163
+Unique entry constraint, 97, 99
+Up/down decade counter, 185
+Valid codeword, 12
+Valid data, 12, 29
+Valid token, 30
+Value safe, 78
+Vector multiplier, 40
+Verilog, 124
+VHDL, 124, 155, 237
+Viterbi decoder, 249
+backtrace, 264
+branch metric, 260
+constraint length, 251
+global winner, 262
+History Unit, 264
+Path Metric Unit (PMU), 256
+soft codes, 253
+trellis diagram, 251
+VLSI programming, 128, 223
+VSTGL (Visual STG Lab), 103
+Wave, 16
+crest, 16
+trough, 16
+While statement, 38
+Write-back, 40
+
+
diff --git a/papers/references/Tegmark2000.pdf b/papers/references/Tegmark2000.pdf
new file mode 100644
index 00000000..08fe95ed
--- /dev/null
+++ b/papers/references/Tegmark2000.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e3f207a1170ff3edb0765236fce4fbcaf4c73a9745664e3f35bdc38dafc20aab
+size 365231
diff --git a/papers/references/Tegmark2000.txt b/papers/references/Tegmark2000.txt
new file mode 100644
index 00000000..049d0e7f
--- /dev/null
+++ b/papers/references/Tegmark2000.txt
@@ -0,0 +1,2024 @@
+arXiv:quant-ph/9907009v2  10 Nov 1999
+
+The Importance of Quantum Decoherence in Brain Processes
+
+Max Tegmark
+Institute for Advanced Study, Olden Lane, Princeton, NJ 08540; max@ias.edu
+Dept. of Physics, Univ. of Pennsylvania, Philadelphia, PA 19104
+(Submitted to Phys. Rev. E July 2 1999, accepted October 25)
+
+Based on a calculation of neural decoherence rates, we argue
+that the degrees of freedom of the human brain that
+relate to cognitive processes should be thought of as a classical
+rather than quantum system, i.e., that there is nothing fundamentally
+wrong with the current classical approach to neural
+network simulations. We find that the decoherence timescales
+(∼ 10^-13 to 10^-20 seconds) are typically much shorter than
+the relevant dynamical timescales (∼ 10^-3 to 10^-1 seconds),
+both for regular neuron firing and for kink-like polarization
+excitations in microtubules. This conclusion disagrees with
+suggestions by Penrose and others that the brain acts as a
+quantum computer, and that quantum coherence is related
+to consciousness in a fundamental way.
+
+I. INTRODUCTION
+
+In most current mainstream biophysics research on
+cognitive processes, the brain is modeled as a neural net-
+work obeying classical physics. In contrast, Penrose [1,2],
+and others have argued that quantum mechanics may
+play an essential role, and that successful brain simula-
+tions can only be performed with a quantum computer.
+The main purpose of this paper is to address this issue
+with quantitative decoherence calculations.
+The field of artificial neural networks (for an introduc-
+tion, see, e.g., [4–6]) is currently booming, driven by a
+broad range of applications and improved computing re-
+sources. Although the popular neurological models come
+in various levels of abstraction, none involve effects of
+quantum coherence in any fundamental way. Encouraged
+by successes in modeling memory, learning, visual pro-
+cessing, etc. [7,8], many workers in the field have boldly
+conjectured that a sufficiently complex neural network
+could in principle perform all cognitive processes that we
+associate with consciousness.
+On the other hand, many authors have argued that
+consciousness can only be understood as a quantum effect.
+For instance, Wigner [9] suggested that consciousness
+was linked to the quantum measurement problem (see footnote 1),
+and this idea has been greatly elaborated by Stapp [3].
+There have been numerous suggestions that consciousness
+is a macroquantum effect, involving superconductivity [12],
+superfluidity [13], electromagnetic fields [14],
+Bose condensation [15,16], superfluorescence [17] or some
+other mechanism [18,19]. Perhaps the most concrete one
+is that of Penrose [2], proposing that this takes place
+in microtubules, the ubiquitous hollow cylinders that
+among other things help cells maintain their shapes. It
+has been argued that microtubules can process information
+like a cellular automaton [20], and Penrose suggests
+that they operate as a quantum computer. This idea has
+been further elaborated employing string theory methods
+[21–27].
+
+1 Interestingly, Wigner changed his mind and gave up this
+idea [10] after he became aware of the first paper on decoherence
+in 1970 [11].
+
+The make-or-break issue for all these quantum models
+is whether the relevant degrees of freedom of the
+brain can be sufficiently isolated to retain their quantum
+coherence, and opinions are divided. For instance,
+Stapp has argued that interaction with the environment
+is probably small enough to be unimportant for certain
+neural processes [28], whereas Zeh [29], Zurek [30],
+Scott [31], Hawking [32] and Hepp [33] have conjectured
+that environment-induced decoherence will rapidly destroy
+macrosuperpositions in the brain. It is therefore timely
+to try to settle the issue with detailed calculations of the
+relevant decoherence rates. This is the purpose of the
+present work.
+The rest of this paper is organized as follows. In Sec-
+tion II, we briefly review the open system quantum me-
+chanics necessary for our calculations, and introduce a
+decomposition into three subsystems to place the prob-
+lem in its proper context.
+In Section III, we evaluate
+decoherence rates both for neuron firing and for the mi-
+crotubule processes proposed by Penrose et al., relegating
+some technical details to the Appendix. We conclude in
+Section IV by discussing the implications of our results,
+both for modeling cognitive brain processes and for in-
+corporating them into a quantum-mechanical treatment
+of the rest of the world.
+
+II. SYSTEMS AND SUBSYSTEMS
+
+In this section, we review those aspects of quantum
+mechanics for open systems that are needed for our cal-
+culations, and introduce a classification scheme and a
+subsystem decomposition to place the problem at hand
+in its appropriate context.
+
+A. Notation
+
+Let us first briefly review the quantum mechanics of
+subsystems. The state of an arbitrary quantum system
+is described by its density matrix ρ, which, left in isolation,
+will evolve in time according to the Schrödinger equation
+
+dρ/dt = −i[H, ρ]/ħ.    (1)
+
+It is often useful to view a system as composed of two
+subsystems, so that some of the degrees of freedom correspond
+to the 1st and the rest to the 2nd. The state of
+subsystem i is described by the reduced density matrix
+ρi obtained by tracing (marginalizing) over the degrees
+of freedom of the other: ρ1 ≡ tr_2 ρ, ρ2 ≡ tr_1 ρ. Let us
+decompose the Hamiltonian as
+
+H = H1 + H2 + Hint,    (2)
+
+where the operator H1 affects only the 1st subsystem
+and H2 affects only the 2nd subsystem. The interaction
+Hamiltonian Hint is the remaining nonseparable part, de-
+fined as Hint ≡ H − H1 − H2, so such a decomposition
+is always possible, although it is generally only useful if
+Hint is in some sense small.
+If Hint = 0, i.e., if there is no interaction between
+the two subsystems, then it is easy to show that dρi/dt =
+−i[Hi, ρi]/ħ, i = 1, 2, that is, we can treat each subsystem
+as if the rest of the Universe did not exist, ignoring
+any correlations with the other subsystem that may have
+been present in the full non-separable density matrix ρ.
+It is of course this property that makes density matrices
+so useful in the first place, and that led von Neumann
+to invent them [34]: the full system is assumed to obey
+equation (1) simply because its interactions with the rest
+of the Universe are negligible.
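+
+The statement that each subsystem evolves independently when Hint = 0
+can be checked in a few lines. The following is a sketch of the standard
+argument in the notation of equations (1) and (2); the intermediate steps
+are filled in here and are not part of the original text.
+
+```latex
+\begin{align*}
+\frac{d\rho_1}{dt}
+  &= \operatorname{tr}_2 \frac{d\rho}{dt}
+   = -\frac{i}{\hbar}\,\operatorname{tr}_2\,[H_1 + H_2,\ \rho] \\
+  &= -\frac{i}{\hbar}\Bigl( [H_1,\ \operatorname{tr}_2 \rho]
+      + \operatorname{tr}_2\,[H_2,\ \rho] \Bigr)
+   = -\frac{i}{\hbar}\,[H_1,\ \rho_1].
+\end{align*}
+```
+
+The first term uses the fact that H1 acts only on subsystem 1 and so can be
+pulled outside the partial trace over subsystem 2. The second term vanishes
+because the partial trace over subsystem 2 is cyclic for operators that act
+only on subsystem 2. The same steps with the labels exchanged give the
+equation for ρ2.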
+
+B. Fluctuation, dissipation, communication and decoherence
+
+In practice, the interaction Hint between subsystems
+is usually not zero. This has a number of qualitatively
+different effects:
+
+1. Fluctuation
+
+2. Dissipation
+
+3. Communication
+
+4. Decoherence
+
+The first two involve transfer of energy between the subsystems,
+whereas the last two involve exchange of information.
+The first three occur in classical physics as well;
+only the last one is a purely quantum-mechanical phenomenon.
+For example, consider a tiny colloid grain (subsystem
+1) in a jar of water (subsystem 2). Collisions with water
+molecules will cause fluctuations in the center-of-mass
+position of the colloid (Brownian motion). If its initial velocity
+is high, dissipation (friction) will slow it down to
+a mean speed corresponding to thermal equilibrium with
+the water. The dissipation timescale τdiss, defined as the
+time it would take to lose half of the initial excess energy,
+will in this case be of order τcoll × (M/m), where τcoll is
+the mean-free time between collisions, M is the colloid mass
+and m is the mass of a water molecule. We will define
+communication as exchange of information. The information
+that the two subsystems have about each other,
+measured in bits, is
+
+I12 ≡ S1 + S2 − S,    (3)
+
+where Si ≡ −tr_i ρi log ρi is the entropy of the ith subsystem,
+S ≡ −tr ρ log ρ is the entropy of the total system,
+and the logarithms are base 2. If this mutual informa-
+tion is zero, then the states of the two systems are un-
+correlated and independent, with the density matrix of
+the separable form ρ = ρ1 ⊗ ρ2. If the subsystems start
+out independent, any interaction will at least initially
+increase the subsystem entropies Si, thereby increasing
+the mutual information, since the entropy S of the total
+system always remains constant.
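+
+As a quick illustration of equation (3), consider two qubits. The three
+states below are standard textbook examples chosen here for illustration;
+they do not appear in the original text. Entropies are in bits.
+
+```latex
+\begin{align*}
+\text{product state } |0\rangle|0\rangle
+  :&\quad S_1 = S_2 = 0,\ S = 0 \ \Rightarrow\ I_{12} = 0, \\
+\text{classical mixture } \tfrac12\bigl(|00\rangle\langle 00| + |11\rangle\langle 11|\bigr)
+  :&\quad S_1 = S_2 = 1,\ S = 1 \ \Rightarrow\ I_{12} = 1, \\
+\text{Bell state } \tfrac{1}{\sqrt2}\bigl(|00\rangle + |11\rangle\bigr)
+  :&\quad S_1 = S_2 = 1,\ S = 0 \ \Rightarrow\ I_{12} = 2.
+\end{align*}
+```
+
+In the last case the total state is pure, so all of the subsystem entropy
+reflects correlation between the two qubits, and the mutual information is
+twice that of the perfectly correlated classical mixture.
+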
+The apparent entropy increase of subsystems noted above, which
+is related to the arrow of time and the 2nd law of thermodynamics
+[35], occurs also in classical physics. However,
+quantum mechanics produces a qualitatively new
+effect as well, known as decoherence [11,36,37], suppressing
+off-diagonal elements in the reduced density matrices
+ρi. This effect destroys the ability to observe long-range
+quantum superpositions within the subsystems,
+and is now rather well-understood and uncontroversial
+[30,38–42]; the interested reader is referred to [43] and
+a recent book on decoherence [44] for details.
+For instance, if our colloid was initially in a superposition of
+two locations separated by a centimeter, this macrosuperposition
+would for all practical purposes be destroyed
+by the first collision with a water molecule, i.e., on a
+timescale τdec of order τcoll, with the quantum superposition
+surviving only on scales below the de Broglie
+wavelength of the water molecules [45,46] (see footnote 2). This means
+that τdiss/τdec ∼ M/m in our example, i.e., that decoherence
+is much faster than dissipation for macroscopic objects,
+and this qualitative result has been shown to hold
+quite generally as well (see [43] and references therein).
+Loosely speaking, this is because each microscopic particle
+that scatters off the subsystem carries away only
+a tiny fraction m/M of the total momentum, but essentially
+all of the necessary information.
+
+2 Decoherence picks out a preferred basis in the quantum-mechanical
+Hilbert space, termed the “pointer basis” by
+Zurek [36], in which superpositions are rapidly destroyed and
+classical behavior is approached. This normally includes the
+position basis, which is why we never experience superpositions
+of objects in macroscopically different positions. Decoherence
+is quite generic. Although it has been claimed that
+this preferred basis consists of the maximal set of commuting
+observables that also commute with Hint (the “microstable
+basis” of Omnes [43]), this is in fact merely a sufficient condition,
+not a necessary one. If [Hint, x] = 0 for some observable
+x but [Hint, p] ≠ 0 for its conjugate p, then the interaction
+will indeed cause decoherence for x as advertised. But this
+will happen even if [Hint, x] ≠ 0; all that matters is that
+[Hint, p] ≠ 0, i.e., that the interaction Hamiltonian contains
+(“measures”) x.
+
+[Figure 1: regions labeled QUANTUM SYSTEM, CLASSICAL SYSTEM, NOT INDEPENDENT SYSTEM and IMPOSSIBLE; axes: dissipation time/decoherence time and dynamical time/decoherence time, each running from 0.1 to 100.]
+
+FIG. 1. The qualitative behavior of a subsystem depends on
+the timescales for dynamics, dissipation and decoherence. This
+classification is by necessity quite crude, so the boundaries should
+not be thought of as sharp.
+
+C. Classification of systems
+
+Let us define the dynamical timescale τdyn of a subsystem
+as that which is characteristic of its internal dynamics.
+For a planetary system or an atom, τdyn would be
+the orbital period.
+The qualitative behavior of a system depends on the
+ratio of these timescales, as illustrated in Figure 1. If
+τdyn ≪ τdec, we are dealing with a true quantum system,
+since its superpositions can persist long enough to
+be dynamically important. If τdyn ≫ τdiss, it is hardly
+meaningful to view it as an independent system at all,
+since its internal forces are so weak that they are dwarfed
+by the effects of the surroundings. In the intermediate
+case where τdec ≪ τdyn ≲ τdiss, we have a familiar classical
+system.
+The relation between τdec and τdiss depends only on
+the form of Hint, whereas the question of whether τdyn
+falls between these values depends on the normalization
+of Hint in equation (2). Since τdec ∼ τdiss for microscopic
+(atom-sized) systems and τdec ≪ τdiss for macroscopic
+ones, Figure 1 shows that whereas macroscopic systems
+can behave quantum-mechanically, microscopic ones can
+never behave classically.
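+
+As a rough worked example of this classification, one can insert the
+order-of-magnitude timescales quoted in the abstract. This is only an
+illustrative sketch: it pairs the shortest quoted dynamical timescale with
+the longest quoted decoherence timescale, which is the case most favorable
+to quantum behavior, and it assumes no value for τdiss.
+
+```latex
+\[
+\frac{\tau_{\rm dyn}}{\tau_{\rm dec}}
+  \;\gtrsim\; \frac{10^{-3}\,\mathrm{s}}{10^{-13}\,\mathrm{s}}
+  \;=\; 10^{10} \;\gg\; 1 .
+\]
+```
+
+The condition τdyn ≪ τdec for a true quantum system therefore fails by
+about ten orders of magnitude even in this most favorable case, which
+rules out the quantum regime of Figure 1 for these degrees of freedom.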
+
+D. Three systems: subject, object and environment
+
+Most discussions of quantum statistical mechanics split
+the Universe into two subsystems [47]: the object under
+consideration and everything else (referred to as the en-
+vironment). Since our purpose is to model the observer,
+we need to include a third subsystem as well, the subject.
+As illustrated in Figure 2, we therefore decompose the
+total system into three subsystems:
+
+• The subject consists of the degrees of freedom as-
+sociated with the subjective perceptions of the ob-
+server. This does not include any other degrees of
+freedom associated with the brain or other parts of
+the body.
+
+• The object consists of the degrees of freedom that
+the observer is interested in studying, e.g., the
+pointer position on a measurement apparatus.
+
+• The environment consists of everything else, i.e.,
+all the degrees of freedom that the observer is not
+paying attention to. By definition, these are the
+degrees of freedom that we always perform a partial
+trace over.
+
+[Figure 2: diagram of three subsystems, SUBJECT (Hs), OBJECT (Ho) and ENVIRONMENT (He, always traced over), joined by the interactions Hso (measurement, observation, "wavefunction collapse", willful action), Hoe (object decoherence) and Hse (subject decoherence, finalizing decisions). The diagram also carries the label "(Always zero entropy)".]
+
+FIG. 2. An observer can always decompose the world into three
+subsystems: the degrees of freedom corresponding to her subjective
+perceptions (the subject), the degrees of freedom being studied (the
+object), and everything else (the environment). As indicated, the
+subsystem Hamiltonians Hs, Ho, He and the interaction Hamilto-
+nians Hso, Hoe, Hse can cause qualitatively very different effects,
+which is why it is often useful to study them separately. This paper
+focuses on the interaction Hse.
+
+Note that the first two definitions are very restrictive.
+Whereas the subject would include the entire body of
+the observer in the common way of speaking, only very
+few degrees of freedom qualify as our subject or object.
+For instance, if a physicist is observing a Stern-Gerlach
+apparatus, the vast majority of the ∼ $10^{28}$ degrees of
+freedom in the observer and apparatus are counted
+as environment, not as subject or object.
+The term “perception” is used in a broad sense in item
+1, including thoughts, emotions and any other attributes
+of the subjectively perceived state of the observer.
+The practical usefulness of this decomposition lies in
+that one can often neglect everything except the object
+and its internal dynamics (given by Ho) to first order,
+using simple prescriptions to correct for the interactions
+with the subject and the environment.
+The effects of
+both Hso and Hoe have been extensively studied in the
+literature. Hso involves quantum measurement, and gives
+rise to the usual interpretation of the diagonal elements of
+the object density matrix as probabilities. Hoe produces
+decoherence, selecting a preferred basis and making the
+object act classically if the conditions in Figure 1 are met.
+In contrast, Hse, which causes decoherence directly in
+the subject system, has received relatively little attention.
+It is the focus of the present paper, and the next
+section is devoted to quantitative calculations of decoher-
+ence in brain processes, aimed at determining whether
+the subject system should be classified as classical or
+quantum in the sense of Figure 1.
+We will return to
+Figure 2 and a more detailed discussion of its various
+subsystem interactions in Section IV.
+
+III. DECOHERENCE RATES
+
+In this section, we will make quantitative estimates
+of decoherence rates for neurological processes. We first
+analyze the process of neuron firing, widely assumed to be
+central to cognitive processes. We also analyze electrical
+excitations in microtubules, which Penrose and others
+have suggested may be relevant to conscious thought.
+
+A. Neuron firing
+
+Neurons (see Figure 3) are one of the key building
+blocks of the brain’s information processing system. It is
+widely believed that the complex network of ∼ $10^{11}$ neurons
+with their nonlinear synaptic couplings is in some
+way linked to our subjective perceptions, i.e., to the sub-
+ject degrees of freedom. If this picture is correct, then if
+Hs or Hso puts the subject into a superposition of two
+distinct mental states, some neurons will be in a super-
+position of firing and not firing. How fast does such a
+superposition of a firing and non-firing neuron decohere?
+Let us consider this process in more detail.
+For in-
+troductory reviews of neuron dynamics, the reader is re-
+ferred to, e.g., [48–50].
+Like virtually all animal cells,
+neurons have ATP driven pumps in their membranes
+which push sodium ions out of the cell into the surround-
+ing fluids and potassium ions the other way. The former
+process is slightly more efficient, so the neuron contains a
+slight excess of negative charge in its “resting” state, cor-
+responding to a potential difference U0 ≈ −0.07 V across
+the axon membrane (“axolemma”). There is an inher-
+ent instability in the system, however. If the potential
+becomes substantially less negative, then voltage-gated
+sodium channels in the axon membrane open up, allow-
+ing Na+ ions to come gushing in. This makes the poten-
+tial still less negative, causes still more opening, etc. This
+chain reaction, “firing”, propagates down the axon at a
+speed of up to 100 m/s, changing the potential difference
+to a value U1 that is typically of order +0.03 V [49].
+The axon quickly recovers. After less than ∼ 1 ms, the
+sodium channels close regardless of the voltage, and large
+potassium channels (also voltage gated, but with a time
+delay) open up allowing K+ ions to flow out and restore
+the resting potential U0. The ATP driven pumps quickly
+restore the Na+ and K+ concentrations to their initial
+values, making the neuron ready to fire again if triggered.
+Fast neurons can fire over $10^{3}$ times per second.
+
+[Figure 3: schematic of a neuron (dendrites, cell body, axon with myelin insulation), a section of the axon (length L, diameter d, fraction f not insulated, pulse direction) and a piece of the axon membrane (thickness h, voltage-sensitive gate, Na+ positions marked "here if firing" and "here if not firing").]
+
+FIG. 3. Schematic illustration of a neuron (left), a section of
+the myelinated axon (center) and a piece of its axon membrane
+(right). The axon is typically insulated (myelinated) with small
+bare patches every 0.5 mm or so (so-called Nodes of Ranvier) where
+the voltage-sensitive sodium and potassium gates are concentrated
+[51,52]. If the neuron is in a superposition of firing and not firing,
+then N ∼ $10^{6}$ Na+ ions are in a superposition of being inside and
+outside the cell (right).
+
+Consider a small patch of the membrane, assumed to
+be roughly flat with uniform thickness h as in Figure 3.
+If there is an excess surface density ±σ of charge near
+the inside/outside membrane surfaces, giving a voltage
+differential U across the membrane, then application of
+Gauss’ law tells us that σ = ε0E, where the electric field
+strength in the membrane is E = U/h and ε0 is the vacuum
+permittivity.
+Consider an axon of length L and
+diameter d, with a fraction f of its surface area bare (not
+insulated with myelin). The total active surface area is
+thus A = πdLf, so the total number of Na+ ions that
+migrate in during firing is
+
+\begin{equation}
+N = \frac{A\sigma}{q} = \frac{\pi d L f \epsilon_0 (U_1 - U_0)}{q h}, \tag{4}
+\end{equation}
+
+where q is the ionic charge (q = qe, the absolute value
+of the electron charge). Taking values typical for central
+nervous system axons [52,53], h = 8 nm, d = 10 µm,
+L = 10 cm, f = $10^{-3}$, U0 = −0.07 V and U1 = +0.03 V
+gives N ≈ $10^{6}$ ions, and reasonable variations in our
+parameters can change this number by a few orders of
+magnitude.
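+
+As a rough numerical check of equation (4), one can substitute the values above. This sketch assumes the standard SI values $\epsilon_0 \approx 8.85 \times 10^{-12}$ F/m and $q_e \approx 1.60 \times 10^{-19}$ C, which are not quoted in the text:
+
+\begin{align*}
+A &= \pi d L f \approx \pi\,(10^{-5}\,\text{m})(10^{-1}\,\text{m})(10^{-3}) \approx 3.1 \times 10^{-9}\,\text{m}^2, \\
+\sigma &= \frac{\epsilon_0 (U_1 - U_0)}{h} \approx \frac{(8.85 \times 10^{-12}\,\text{F/m})(0.10\,\text{V})}{8 \times 10^{-9}\,\text{m}} \approx 1.1 \times 10^{-4}\,\text{C/m}^2, \\
+N &= \frac{A\sigma}{q} \approx \frac{3.5 \times 10^{-13}\,\text{C}}{1.60 \times 10^{-19}\,\text{C}} \approx 2 \times 10^{6}.
+\end{align*}
+
+This is consistent with N ∼ $10^{6}$ at the order-of-magnitude level used here.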
+
+B. Neuron decoherence mechanisms
+
+Above we saw that a quantum superposition of the
+neuron states “resting” and “firing” involves of order a
+million ions being in a spatial superposition of inside and
+outside the axon membrane, separated by a distance of
+order h ∼ 10 nm. In this subsection, we will compute the
+timescale on which decoherence destroys such a superpo-
+sition.
+
+In this analysis, the object is the neuron, and the su-
+perposition will be destroyed by any interaction with
+other (environment) degrees of freedom that is sensitive
+to where the ions are located. We will consider the fol-
+lowing three sources of decoherence for the ions:
+
+1. Collisions with other ions
+
+2. Collisions with water molecules
+
+3. Coulomb interactions with more distant ions
+
+There are many more decoherence mechanisms [44–46].
+Exotic candidates such as quantum gravity [54] and
+modified quantum mechanics [55,56] are generally much
+weaker [46]. A number of decoherence effects may be even
+stronger than those listed, e.g., interactions as the ions
+penetrate the membrane — the listed effects will turn out
+to be so strong that we can make our argument by sim-
+ply using them as lower limits on the actual decoherence
+rate.
+Let ρ denote the density matrix for the position r of a
+single Na+ ion. As reviewed in the Appendix, all three
+of the listed processes cause ρ to evolve as
+
+\begin{equation}
+\rho(x, x', t_0 + t) = \rho(x, x', t_0)\, f(x, x', t) \tag{5}
+\end{equation}
+
+for some function f that is independent of the ion state
+ρ and depends only on the interaction Hamiltonian Hint.
+This assumes that we can neglect the motion of the ion
+itself on the decoherence timescale — we will see that
+this condition is met with a broad margin.
+
+1. Ion–ion collisions
+
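+For scattering of environment particles (processes 1
+and 2) that have a typical de Broglie wavelength λ, we
+have [46]
+
+\begin{equation}
+f(x, x', t) = \exp\!\left[-\Lambda t \left(1 - e^{-|x'-x|^2/2\lambda^2}\right)\right]
+\approx
+\begin{cases}
+e^{-|x'-x|^2 \Lambda t/2\lambda^2} & \text{for } |x'-x| \ll \lambda, \\
+e^{-\Lambda t} & \text{for } |x'-x| \gg \lambda.
+\end{cases} \tag{6}
+\end{equation}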
+
+Here Λ is the scattering rate, given by Λ ≡ n⟨σv⟩, where
+n is the density of scatterers, σ is the scattering cross
+section and v is the velocity. The product σv is averaged
+over the velocity distribution, which we take to
+be a thermal (Boltzmann) distribution corresponding
+to T = 37°C ≈ 310 K. The gist of equation (6) is
+that a single collision decoheres the ion down to the
+de Broglie wavelength of the scattering particle.
+The information I12 communicated during the scattering is
+I12 ∼ log2(∆x/λ) bits, where ∆x is the initial spread in
+the position of our particle.
+Since the typical de Broglie wavelength of a Na+ ion
+(mass m ≈ 23mp) or H2O molecule (m ≈ 18mp) is
+
+\begin{equation}
+\lambda = \frac{2\pi\hbar}{\sqrt{3 m k T}} \approx 0.03~\text{nm} \tag{7}
+\end{equation}
+
+at 310 K, far smaller than the membrane thickness
+h ∼ 10 nm over which we need to maintain quantum
+coherence, we are clearly in the |x′ − x| ≫ λ limit of
+equation (6). This means that the spatial superposition
+of an ion decays exponentially on a timescale $\Lambda^{-1}$, of order its mean
+free time between collisions. Since the superposition of
+the neuron states “resting” and “firing” involves N such
+superposed ions, it thus gets destroyed on a timescale
+τ ≡ $(N\Lambda)^{-1}$.
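+
+As a rough check of equation (7), assume the standard values $2\pi\hbar \approx 6.63 \times 10^{-34}$ J s, $k \approx 1.38 \times 10^{-23}$ J/K and $m_p \approx 1.67 \times 10^{-27}$ kg (none of which are quoted in the text). For a Na+ ion at T = 310 K this gives
+
+\begin{align*}
+kT &\approx 4.28 \times 10^{-21}\,\text{J}, \\
+\sqrt{3 m k T} &\approx \sqrt{3\,(3.85 \times 10^{-26}\,\text{kg})(4.28 \times 10^{-21}\,\text{J})} \approx 2.2 \times 10^{-23}\,\text{kg m/s}, \\
+\lambda &\approx \frac{6.63 \times 10^{-34}\,\text{J s}}{2.2 \times 10^{-23}\,\text{kg m/s}} \approx 3.0 \times 10^{-11}\,\text{m} = 0.03\,\text{nm}.
+\end{align*}
+
+The same substitution with m ≈ 18mp gives λ ≈ 0.034 nm for H2O, so h/λ ∼ 300 and the |x′ − x| ≫ λ limit applies with a wide margin.
+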
+Let us now evaluate τ. Coulomb scattering between
+two ions of unit charge gives substantial deflection angles
+(θ ∼ 1) with a cross section of order³
+
+\begin{equation}
+\sigma \sim \left(\frac{g q^2}{m v^2}\right)^2, \tag{9}
+\end{equation}
+
+where v is the relative velocity and g ≡ 1/4πε0 is the
+Coulomb constant. In thermal equilibrium, the kinetic
+energy mv²/2 is of order kT, so v ∼ $\sqrt{kT/m}$. For the
+ion density, let us write n = ηnH2O, where the density
+of water molecules nH2O is about (1 g/cm³)/(18mp) ∼
+$10^{23}$/cm³ and η is the relative concentration of ions (positive
+and negative combined). Typical ion concentrations
+during the resting state are [Na+] =9.2 (120) mmol/l and
+[K+] =140 (2.5) mmol/l inside (outside) the axon mem-
+brane [48], corresponding to total Na+ + K+ concentra-
+tions of η ≈ 0.00027 (0.00022) inside (outside). To be
+conservative, we will simply use η ≈ 0.0002 throughout.
+Ion–ion collisions therefore destroy the superposition on
+a timescale
+
+\begin{equation}
+\tau \sim \frac{1}{N n \sigma v} \sim \frac{\sqrt{m (kT)^3}}{N g^2 q_e^4 n} \sim 10^{-20}~\text{s}. \tag{10}
+\end{equation}
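+
+As a rough check of equation (10), one can insert m ≈ 23mp, N ∼ $10^{6}$ and n = ηnH2O ∼ $2 \times 10^{-4} \times 10^{23}$ cm⁻³ = $2 \times 10^{25}$ m⁻³ from the text. The sketch also assumes the standard values $kT \approx 4.28 \times 10^{-21}$ J at 310 K and $g q_e^2 \approx 2.31 \times 10^{-28}$ J m, which are not quoted in the text. In SI units:
+
+\begin{align*}
+\sqrt{m (kT)^3} &\approx \sqrt{(3.85 \times 10^{-26})(4.28 \times 10^{-21})^3} \approx 5.5 \times 10^{-44}, \\
+N g^2 q_e^4 n &\approx (10^{6})(2.31 \times 10^{-28})^2 (2 \times 10^{25}) \approx 1.1 \times 10^{-24}, \\
+\tau &\sim \frac{5.5 \times 10^{-44}}{1.1 \times 10^{-24}} \approx 5 \times 10^{-20}\,\text{s}.
+\end{align*}
+
+Factors of order unity (such as the π in σ = πb²) have been dropped, so this supports only the order of magnitude τ ∼ $10^{-20}$ s.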
+
+³ If the first ion starts at rest at r1 = (0, 0, 0) and the second
+is incident with r2 = (vt, b, 0), then a very weak scattering
+with deflection angle θ ≪ 1 will leave these trajectories
+roughly unchanged, the radial force F = gq²/|r1 − r2|² merely
+causing a net transverse acceleration [57]
+
+\begin{equation}
+\Delta v_y = \int_{-\infty}^{\infty} \frac{\hat{y} \cdot F}{m}\, dt = \int_{-\infty}^{\infty} \frac{g q^2 b \, dt}{m \left[b^2 + (vt)^2\right]^{3/2}} = \frac{2 g q^2}{m v b}. \tag{8}
+\end{equation}
+
+The approximation breaks down as the deflection angle θ ≈
+∆vy/v approaches unity. This occurs for b ∼ gq²/mv², giving
+σ = πb² as in equation (9).
+
+2. Ion–water collisions
+
+Since H2O molecules are electrically neutral, the cross-
+section is dominated by their electric dipole moment
+p ≈ 1.85 Debye ≈ (0.0385 nm) × qe. We can model this
+dipole as two opposing unit charges separated by a distance
+y ≡ p/qe ≪ b, so summing the two corresponding
+contributions from equation (8) gives a deflection angle
+
+\begin{equation}
+\theta \approx \frac{2 g q_e p}{m v^2 b^2}. \tag{11}
+\end{equation}
+
+This gives a cross section
+
+\begin{equation}
+\sigma = \pi b^2 \sim \frac{g q_e p}{m v^2} \tag{12}
+\end{equation}
+
+for scattering with large (θ ∼ 1) deflections. Although σ
+is smaller than for the case of ion–ion collisions, n is larger
+because the concentration factor η drops out, giving a
+final result
+
+\begin{equation}
+\tau \sim \frac{1}{N n \sigma v} \sim \frac{\sqrt{m k T}}{N g q_e p n} \sim 10^{-20}~\text{s}. \tag{13}
+\end{equation}
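+
+As a rough check of equation (13), take N ∼ $10^{6}$, n = nH2O ∼ $10^{23}$ cm⁻³ = $10^{29}$ m⁻³ and p/qe ≈ 0.0385 nm from the text. The sketch assumes that m is the water-molecule mass 18mp and uses the standard values $kT \approx 4.28 \times 10^{-21}$ J and $g q_e^2 \approx 2.31 \times 10^{-28}$ J m, none of which the text specifies. In SI units:
+
+\begin{align*}
+\sqrt{m k T} &\approx \sqrt{(3.01 \times 10^{-26})(4.28 \times 10^{-21})} \approx 1.1 \times 10^{-23}, \\
+g q_e p &\approx (2.31 \times 10^{-28})(3.85 \times 10^{-11}) \approx 8.9 \times 10^{-39}, \\
+\tau &\sim \frac{1.1 \times 10^{-23}}{(10^{6})(8.9 \times 10^{-39})(10^{29})} \approx 1 \times 10^{-20}\,\text{s}.
+\end{align*}
+
+This agrees with the quoted τ ∼ $10^{-20}$ s to the accuracy of the estimate.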
+
+3. Interactions with distant ions
+
+As shown in the Appendix, long-range interaction with
+a distant (environment) particle gives
+
+\begin{equation}
+f(\mathbf{r}, \mathbf{r}', t) = \hat{p}_2\!\left[M(\mathbf{r}' - \mathbf{r})\, t/\hbar\right], \tag{14}
+\end{equation}
+
+up to a phase factor that is irrelevant for decoherence.
+Here $\hat{p}_2$ is the Fourier transform of p2(r) ≡ ρ2(r, r), the
+probability distribution for the location of the environment
+particle. M is the 3 × 3 Hessian matrix of second
+derivatives of the interaction potential of the two particles
+at their mean separation. A slightly less general formula
+was derived in the seminal paper [45]. For roughly
+thermal states, ρ2 (and thus p2) is likely to be well approximated
+by a Gaussian [58,59]. This gives
+
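+\begin{equation}
+f(\mathbf{r}, \mathbf{r}', t) = \exp\!\left[-\tfrac{1}{2} (\mathbf{r}' - \mathbf{r})^t M^t \Sigma M (\mathbf{r}' - \mathbf{r})\, t^2/\hbar^2\right], \tag{15}
+\end{equation}
+
+where $\Sigma = \langle \mathbf{r}_2 \mathbf{r}_2^t \rangle - \langle \mathbf{r}_2 \rangle \langle \mathbf{r}_2^t \rangle$ is the covariance matrix of
+the location of the environment particle. Coherence
+is destroyed when the exponent becomes of order unity,
+i.e., on a timescale
+
+\begin{equation}
+\tau \equiv \left[(\mathbf{r}' - \mathbf{r})^t M^t \Sigma M (\mathbf{r}' - \mathbf{r})\right]^{-1/2} \hbar. \tag{16}
+\end{equation}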
+
+Assuming a Coulomb potential V = gq²/|r2 − r1| gives
+M = $(3\hat{a}\hat{a}^t - I)\, g q^2/a^3$, where a ≡ r2 − r1 = $a\hat{a}$, $|\hat{a}| = 1$.
+For thermal states, we have the isotropic case Σ =
+(∆x)²I, so equation (16) reduces to
+
+\begin{equation}
+\tau = \frac{\hbar a^3}{g q^2 |\mathbf{r}' - \mathbf{r}| \Delta x} \left(1 + 3\cos^2\theta\right)^{-1/2}, \tag{17}
+\end{equation}
+
+where cos θ ≡ $\hat{a} \cdot (\mathbf{r}' - \mathbf{r})/|\mathbf{r}' - \mathbf{r}|$. To be conservative,
+we take ∆x to be as small as the uncertainty principle
+allows. With the thermal constraint (∆p)²/m ≲ kT on
+the momentum uncertainty, this gives
+
+\begin{equation}
+\Delta x = \frac{\hbar}{2\Delta p} \sim \frac{\hbar}{\sqrt{m k T}}. \tag{18}
+\end{equation}
+
+Substituting this into equation (17) and dividing by the
+number of ions N, we obtain the decoherence timescale
+
+\begin{equation}
+\tau \sim \frac{a^3 \sqrt{m k T}}{N g q^2 |\mathbf{r}' - \mathbf{r}|} \tag{19}
+\end{equation}
+
+caused by a single environment ion a distance a away.
+Each such ion will produce its own suppression factor f,
+so we need to sum the exponent in equation (15) over all
+ions. Since the tidal force M ∝ $a^{-3}$ causes the exponent
+to drop as $a^{-6}$, this sum will generally be dominated by
+the very closest ion, which will typically be a distance
+a ∼ $n^{-1/3}$ away. We are interested in decoherence for
+separations |r′ − r| = h, the membrane thickness, which
+gives
+
+\begin{equation}
+\tau \sim \frac{\sqrt{m k T}}{N g q_e^2 n h} \sim 10^{-19}~\text{s}. \tag{20}
+\end{equation}
+
+The relation between these different estimates is dis-
+cussed in more detail in the Appendix.
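+
+As a rough check of equation (20), take N ∼ $10^{6}$, n ∼ $2 \times 10^{25}$ m⁻³ (η ≈ 0.0002 times nH2O ∼ $10^{23}$ cm⁻³) and h = 8 nm from Section III A. The sketch assumes m ≈ 23mp and the standard values $kT \approx 4.28 \times 10^{-21}$ J and $g q_e^2 \approx 2.31 \times 10^{-28}$ J m, which are not quoted in the text. In SI units:
+
+\begin{align*}
+\sqrt{m k T} &\approx \sqrt{(3.85 \times 10^{-26})(4.28 \times 10^{-21})} \approx 1.3 \times 10^{-23}, \\
+N g q_e^2 n h &\approx (10^{6})(2.31 \times 10^{-28})(2 \times 10^{25})(8 \times 10^{-9}) \approx 3.7 \times 10^{-5}, \\
+\tau &\sim \frac{1.3 \times 10^{-23}}{3.7 \times 10^{-5}} \approx 3.5 \times 10^{-19}\,\text{s}.
+\end{align*}
+
+This is of order $10^{-19}$ s, as quoted.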
+
+C. Microtubules
+
+Microtubules are a major component of the cytoskele-
+ton, the “scaffolding” that helps cells maintain their
+shapes.
+They are hollow cylinders of diameter D =
+24 nm made up of 13 filaments that are strung together
+out of proteins known as tubulin dimers. These dimers
+can make transitions between two states known as α
+and β, corresponding to different electric dipole moments
+along the axis of the tube. It has been argued that micro-
+tubules may have additional functions as well, serving as
+a means of energy and information transfer [20]. A model
+has been presented whereby the dipole-dipole interac-
+tions between nearby dimers can lead to long-range po-
+larization and kink-like excitations that may travel down
+the microtubules at speeds exceeding 1 m/s [60].
+Penrose has gone further and suggested that the dynamics
+of such excitations can make a microtubule act
+like a quantum computer, and that microtubules are the
+site of human consciousness [2]. This idea has been further
+elaborated [21–24] employing methods from string
+theory, with the conclusion that quantum superpositions
+of coherent excitations can persist for as long as a second
+before being destroyed by decoherence. See also [61,62].
+This was hailed as a success for the model, the interpretation
+being that the quantum gravity effect on microtubules
+was identified with the human thought process on
+this same timescale.
+This decoherence timescale τ ∼ 1 s was computed assuming
+that quantum gravity is the main decoherence source.
+Since this quantum gravity model is somewhat contro-
+versial [32] and its effect has been found to be more than
+
+20 orders of magnitude weaker than other decoherence
+sources in some cases [46], it seems prudent to evalu-
+ate other decoherence sources for the microtubule case
+as well, to see whether they are in fact dominant. We
+will now do so.
+Using coordinates where the x-axis is along the tube
+axis, the above-mentioned models all focus on the time-
+evolution of p(x), the average x-component of the electric
+dipole moment of the tubulin dimers at each x. In terms
+of this polarization function p(x), the net charge per unit
+length of tube is −p′(x). The propagating kink-like exci-
+tations [60] are of the form
+
+\begin{equation}
+p(x) =
+\begin{cases}
++p_0 & \text{for } x \ll x_0, \\
+-p_0 & \text{for } x \gg x_0,
+\end{cases} \tag{21}
+\end{equation}
+
+where the kink location x0 propagates with constant
+speed and has a width of order a few tubulin dimers.
+The polarization strength p0 is such that the total charge
+around the kink is Q = $-\int p'(x)\,dx$ = 2p0 ∼ 940qe, due
+to the presence of 18 Ca2+ ions on each of the 13 filaments
+contributing to p0 [60].
+Suppose that such a kink is in two different places
+in superposition, separated by some distance |r′ − r|.
+How rapidly will the superposition be destroyed by de-
+coherence?
+To be conservative, we will ignore colli-
+sions between polarized tubulin dimers and nearby water
+molecules, since it has been argued that these may be in
+some sense ordered and part of the quantum system [24]
+– although this argument is difficult to maintain for the
+water outside the microtubule, which permeates the en-
+tire cell volume. Let us instead apply equation (19), with
+N = Q/qe ∼ $10^{3}$. The distance to the nearest ion will
+generally be less than a = D + $n^{-1/3}$ ∼ 26 nm, where the
+microtubule diameter D = 24 nm dominates over the inter-ion
+separation $n^{-1/3}$ ∼ 2 nm in the fluid surrounding
+the microtubule. Superpositions spanning many tubulin
+dimers (|r′ − r| ≫ D) therefore decohere on a timescale
+
+\begin{equation}
+\tau \sim \frac{D^2 \sqrt{m k T}}{N g q_e^2} \sim 10^{-13}~\text{s} \tag{22}
+\end{equation}
+
+due to the nearest ion alone. This is quite a conservative
+estimate, since the other nD³ ∼ $10^{3}$ ions that are
+merely a small fraction further away will also contribute
+to the decoherence rate, but it is nonetheless 6-7 orders
+of magnitude shorter than the estimates of Mavromatos
+& Nanopoulos [25–27]. We will comment on screening
+effects below.
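+
+As a rough check of equation (22), take D = 24 nm and N ∼ $10^{3}$ from the text. The sketch assumes that m is the mass of a Na+ ion (23mp) and uses the standard values $kT \approx 4.28 \times 10^{-21}$ J and $g q_e^2 \approx 2.31 \times 10^{-28}$ J m, none of which the text specifies. In SI units:
+
+\begin{align*}
+D^2 \sqrt{m k T} &\approx (5.8 \times 10^{-16})(1.3 \times 10^{-23}) \approx 7.4 \times 10^{-39}, \\
+N g q_e^2 &\approx (10^{3})(2.31 \times 10^{-28}) \approx 2.3 \times 10^{-25}, \\
+\tau &\sim \frac{7.4 \times 10^{-39}}{2.3 \times 10^{-25}} \approx 3 \times 10^{-14}\,\text{s}.
+\end{align*}
+
+Under these assumptions the result lies within a factor of a few of the quoted $10^{-13}$ s, which is as close as an order-of-magnitude estimate of this kind can be expected to come.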
+
+1. Decoherence summary
+
+Our decoherence rates are summarized in Table 1. How
+accurate are they likely to be?
+In the calculations above, we generally tried to be conservative,
+erring on the side of underestimating the decoherence
+rate. For instance, we neglected that N potassium
+ions also end up in superposition once the neuron
+firing is quenched, we neglected the contribution of other
+abundant ions such as Cl− to η, and we ignored collisions
+with water molecules in the microtubule case.
+Since we were only interested in order-of-magnitude
+estimates, we made a number of crude approximations,
+e.g., for the cross sections. We neglected screening ef-
+fects because the decoherence rates were dominated by
+the particles closest to the system, i.e., the very same par-
+ticles that are responsible for screening the charge from
+more distant ones.
+
+Table 1. Decoherence timescales.
+
+Object       Environment     τdec
+Neuron       Colliding ion   $10^{-20}$ s
+Neuron       Colliding H2O   $10^{-20}$ s
+Neuron       Nearby ion      $10^{-19}$ s
+Microtubule  Distant ion     $10^{-13}$ s
+
+IV. DISCUSSION
+
+A. The classical nature of brain processes
+
+The calculations above enable us to address the question
+of whether cognitive processes in the brain constitute
+a classical or quantum system in the sense of Figure 1.
+If we take the characteristic dynamical timescale
+for such processes to be τdyn ∼ $10^{-2}$ s – $10^{0}$ s (the apparent
+timescale of, e.g., speech, thought and motor response),
+then a comparison of τdyn with τdec from Table 1
+shows that processes associated with either conventional
+neuron firing or with polarization excitations in microtubules
+fall squarely in the classical category, by a margin
+exceeding ten orders of magnitude. Neuron firing itself
+is also highly classical, since it occurs on a timescale
+τdyn ∼ $10^{-3}$ – $10^{-4}$ s [53]. Even a kink-like microtubule
+excitation is classical by many orders of magnitude, since
+it traverses a short tubule on a timescale τdyn ∼ 5 × $10^{-7}$ s
+[60].
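+
+The margins quoted above can be made explicit by dividing each dynamical timescale by the corresponding entry of Table 1. The pairing below is an illustrative choice: it takes the shortest τdyn and the longest τdec in each case, so that the ratios are lower bounds.
+
+\begin{align*}
+\text{cognitive processes vs. microtubule decoherence:} \quad & \frac{\tau_{\rm dyn}}{\tau_{\rm dec}} \gtrsim \frac{10^{-2}\,\text{s}}{10^{-13}\,\text{s}} = 10^{11}, \\
+\text{neuron firing vs. neuron decoherence:} \quad & \frac{\tau_{\rm dyn}}{\tau_{\rm dec}} \gtrsim \frac{10^{-4}\,\text{s}}{10^{-19}\,\text{s}} = 10^{15}, \\
+\text{kink traversal vs. microtubule decoherence:} \quad & \frac{\tau_{\rm dyn}}{\tau_{\rm dec}} \sim \frac{5 \times 10^{-7}\,\text{s}}{10^{-13}\,\text{s}} = 5 \times 10^{6}.
+\end{align*}
+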
+What about other mechanisms?
+It is worth noting
+that if (as is commonly believed) different neuron fir-
+ing patterns correspond in some way to different con-
+scious perceptions, then consciousness itself cannot be
+of a quantum nature even if there is a yet undiscovered
+physical process in the brain with a very long decoherence
+time. As mentioned above, suggestions for such candidates
+have involved, e.g., superconductivity [12], superfluidity
+[13], electromagnetic fields [14], Bose condensation
+[15,16], superfluorescence [17] and other mechanisms
+[18,19]. The reason is that as soon as such a quantum
+subsystem communicates with the constantly decohering
+neurons to create conscious experience, everything deco-
+heres.
+How extreme a variation in the decoherence rates can
+we obtain by changing our model assumptions? Although
+the rates can be altered by a few orders of magnitude
+by pushing parameters such as the neuron dimensions,
+the myelination fraction or the microtubule kink charge
+to the limits of plausibility, it is clearly impossible to
+change the basic conclusion that τdec ≪ $10^{-3}$ s, i.e., that
+we are dealing with a classical system in the sense of Figure 1.
+Even the tiniest neuron imaginable, with only a
+single ion (N = 1) traversing the cell wall during firing,
+would have τdec ∼ $10^{-14}$ s. Likewise, reducing the effective
+microtubule kink charge to a small fraction of qe
+would not help.
+How are we to understand the above-mentioned claims
+that brain subsystems can be sufficiently isolated to
+exhibit macroquantum behavior?
+It appears that the
+subtle distinction between dissipation and decoherence
+timescales has not always been appreciated.
+
+B. Implications for the subject-object-environment
+decomposition
+
+Let us now discuss the subsystem decomposition of
+Figure 2 in more detail in light of our results. As the
+figure indicates, the virtue of this decomposition into
+subject, object and environment is that the subsystem
+Hamiltonians Hs, Ho, He and the interaction Hamiltoni-
+ans Hso, Hoe, Hse can cause qualitatively very different
+effects. Let us now briefly discuss each of them in turn.
+Most of these processes are schematically illustrated
+in Figure 4 and Figure 5, where for purposes of illus-
+tration, we have shown the extremely simple case where
+both the subject and object have only a single degree of
+freedom that can take on only a few distinct values (3
+for the subject, 2 for the object). For definiteness, we
+denote the three subject states |¨- ⟩, | ¨⌣⟩ and | ¨⌢⟩, and in-
+terpret them as the observer feeling neutral, happy and
+sad, respectively. We denote the two object states |↑⟩
+and |↓⟩, and interpret them as the spin component (“up”
+or “down”) in the z-direction of a spin-1/2 system, say a
+silver atom. The joint system consisting of subject and
+object therefore has only 2 × 3 = 6 basis states: |¨- ↑⟩,
+|¨- ↓⟩, | ¨⌣↑⟩, | ¨⌣↓⟩, | ¨⌢↑⟩, | ¨⌢↓⟩. In Figures 4 and 5, we
+have therefore plotted ρ as a 6 × 6 matrix consisting of
+nine two-by-two blocks.
+
+[Figure 4: sequence of 6 × 6 density matrices showing object evolution (Ho, entropy constant), object decoherence (Hoe, entropy increases) and observation/measurement (Hso, entropy decreases); entries of 1/2 are marked.]
+
+
+FIG. 4. Time evolution of the 6×6 density matrix for the basis
+states |¨- ↑⟩, |¨- ↓⟩, | ¨⌣↑⟩, | ¨⌣↓⟩, | ¨⌢↑⟩, | ¨⌢↓⟩ as the object evolves in
+isolation, then decoheres, then gets observed by the subject. The
+final result is a statistical mixture of the states | ¨⌣↑⟩ and | ¨⌢↓⟩,
+simple zero-entropy states like the one we started with.
+
+1. Effect of Ho: constant entropy
+
+If the object were to evolve during a time interval t
+without interacting with the subject or the environment
+(Hso = Hoe = 0), then according to equation (1) its
+reduced density matrix ρo would evolve into UρoU † with
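+the same entropy, since the time-evolution operator U ≡
+$e^{-iH_o t}$ is unitary.
+Suppose the subject stays in the state |¨- ⟩ and the
+object starts out in the pure state |↑⟩. Let the object
+Hamiltonian Ho correspond to a magnetic field in the
+y-direction causing the spin to precess to the x-direction,
+i.e., to the state (|↑⟩+|↓⟩)/√2. The object density matrix
+ρo then evolves into
+
+\begin{equation}
+\rho_o = U |{\uparrow}\rangle\langle{\uparrow}| U^\dagger
+= \tfrac{1}{2} (|{\uparrow}\rangle + |{\downarrow}\rangle)(\langle{\uparrow}| + \langle{\downarrow}|)
+= \tfrac{1}{2} (|{\uparrow}\rangle\langle{\uparrow}| + |{\uparrow}\rangle\langle{\downarrow}| + |{\downarrow}\rangle\langle{\uparrow}| + |{\downarrow}\rangle\langle{\downarrow}|), \tag{23}
+\end{equation}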
+
+corresponding to the four entries of 1/2 in the second
+matrix of Figure 4.
+This is quite typical of pure quantum time evolution: a
+basis state eventually evolves into a superposition of ba-
+sis states, and the quantum nature of this superposition
+is manifested by off-diagonal elements in ρo. Another familiar
+example of this is the spreading out of the
+wave packet of a free particle.
+
+2. Effect of Hoe: increasing entropy
+
+This was the effect of Ho alone. In contrast, Hoe will
+generally cause decoherence and increase the entropy of
+the object. As discussed in detail in Section III and the
+Appendix, it entangles it with the environment, which
+suppresses the off-diagonal elements of the reduced den-
+sity matrix of the object as illustrated in Figure 4. If Hoe
+couples to the z-component of the spin, this destroys the
+terms |↑⟩⟨↓| and |↓⟩⟨↑|. Complete decoherence therefore
+converts the final state of equation (23) into
+
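+\begin{equation}
+\rho_o = \tfrac{1}{2} (|{\uparrow}\rangle\langle{\uparrow}| + |{\downarrow}\rangle\langle{\downarrow}|), \tag{24}
+\end{equation}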
+
+corresponding to the two entries of 1/2 in the third ma-
+trix of Figure 4.
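+
+The entropy change between equations (23) and (24) can be checked directly. The sketch below assumes that entropy means the von Neumann entropy $S = -\mathrm{tr}\,\rho \log_2 \rho$, measured in bits, and introduces the shorthand $|{+}\rangle$ for the x-direction spin state; neither the formula nor the shorthand is stated in this section.
+
+\begin{align*}
+\text{Eq. (23):} \quad & \rho_o = |{+}\rangle\langle{+}|, \quad |{+}\rangle \equiv (|{\uparrow}\rangle + |{\downarrow}\rangle)/\sqrt{2}, \quad \text{eigenvalues } \{1, 0\}, \quad S = 0, \\
+\text{Eq. (24):} \quad & \rho_o = \tfrac{1}{2} I, \quad \text{eigenvalues } \{\tfrac{1}{2}, \tfrac{1}{2}\}, \quad S = -2 \cdot \tfrac{1}{2} \log_2 \tfrac{1}{2} = 1 \text{ bit}.
+\end{align*}
+
+Complete decoherence in the z-basis therefore raises the object entropy from zero to one bit, the largest value a two-state system can have.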
+
+3. Effect of Hso: decreasing entropy
+
+Whereas Hoe typically causes the apparent entropy of
+the object to increase, Hso typically causes it to decrease.
+
+Figure 4 illustrates the case of an ideal measurement,
+where the subject starts out in the state |¨- ⟩ and Hso is of
+such a form that the subject gets perfectly correlated with the object.
+In the language of Section II, an ideal measurement is a
+type of communication where the mutual information I12
+between the subject and object systems is increased to its
+maximum possible value. Suppose that the measurement
+is caused by Hso becoming large during a time interval so
+brief that we can neglect the effects of Hs and Ho. The
+joint subject+object density matrix ρso then evolves as
+ρso ↦ UρsoU†, where U ≡ $\exp\left[-i\int H_{so}\,dt\right]$. If observing
+|↑⟩ makes the subject happy and |↓⟩ makes the subject
+sad, then we have U|¨-↑⟩ = | ¨⌣↑⟩ and U|¨-↓⟩ = | ¨⌢↓⟩. The
+state given by equation (24) would therefore evolve into
+
+\begin{align}
+\rho_o &= \tfrac{1}{2}\, U (|\ddot{-}\rangle\langle\ddot{-}|) \otimes (|{\uparrow}\rangle\langle{\uparrow}| + |{\downarrow}\rangle\langle{\downarrow}|)\, U^\dagger \tag{25} \\
+&= \tfrac{1}{2} \left(U|\ddot{-}{\uparrow}\rangle\langle\ddot{-}{\uparrow}|U^\dagger + U|\ddot{-}{\downarrow}\rangle\langle\ddot{-}{\downarrow}|U^\dagger\right) \tag{26} \\
+&= \tfrac{1}{2} \left(|\ddot{\smile}{\uparrow}\rangle\langle\ddot{\smile}{\uparrow}| + |\ddot{\frown}{\downarrow}\rangle\langle\ddot{\frown}{\downarrow}|\right), \tag{27}
+\end{align}
+
+as illustrated in Figure 4.
+This final state contains a
+mixture of two subjects, corresponding to definite but
+opposite knowledge of the object state.
+According to
+both of them, the entropy of the object has decreased
+from one bit to zero bits.
+In general, we see that the object decreases its entropy
+when it exchanges information with the subject
+and increases it when it exchanges information with the
+environment.⁴ Loosely speaking, the entropy of an object
+decreases while you look at it and increases while
+you don’t.⁵
+
+⁴ If n bits of information are exchanged with the environment,
+then equation (3) shows that the object entropy will
+increase by this same amount if the environment is in thermal
+equilibrium (with maximal entropy) throughout. If we
+were to know the state of the environment initially (by our
+definition of environment, we do not), then both the object
+and environment entropy will typically increase by n/2 bits.
+⁵ Here and throughout, we are assuming that the total
+system, which is by definition isolated, evolves according to
+the Schrödinger equation (1). Although modifications of the
+Schrödinger equation have been suggested by some authors,
+either in a mathematically explicit form as in [55,56] or verbally
+as a so-called reduction postulate, there is so far no
+experimental evidence suggesting that modifications are necessary.
+The original motivations for such modifications were
+
+1. to be able to interpret the diagonal elements of the
+density matrix as probabilities and
+
+2. to suppress off-diagonal elements of the density matrix.
+
+The subsequent discovery by Everett [64] that the probability
+interpretation automatically appears to hold for almost all
+observers in the final superposition solved problem 1, and is
+discussed in more detail in, e.g., [29,66–74]. The still more
+
+[Figure 5: sequence of 6 × 6 density matrices showing subject evolution (Hs, snap decision) followed by subject decoherence (Hse); entries of 1/2 are marked.]
+
+FIG. 5. Time evolution of the same 6 × 6 density matrix as in
+Figure 4 when the subject evolves in isolation, then decoheres. The
+object remains in the state |↑⟩ the whole time. The final result is
+a statistical mixture of the two states | ¨⌣↑⟩ and | ¨⌢↑⟩.
+
+4. Effect of Hs: the thought process
+
+So far, we have focused on the object and discussed
+effects of its internal dynamics (Ho) and its interactions
+with the environment (Hoe) and subject (Hso). Let us
+now turn to the subject and consider the role played by
+its internal dynamics (Hs) and interactions with the en-
+vironment (Hse).
+In his seminal 1993 book, Stapp [3]
+presents an argument about brain dynamics that can be
+summarized as follows.
+
+1. Since the brain contains ∼ $10^{11}$ synapses connected
+together by neurons in a highly nonlinear fashion,
+there must be a huge number of metastable reverberating
+patterns of pulses into which the brain can
+evolve.
+
+2. Neural network simulations have indicated that the
+metastable state into which a brain does in fact
+evolve depends sensitively on the initial conditions
+in small numbers of synapses.
+
+3. The latter depends on the locations of a small num-
+ber of calcium atoms, which might be expected to
+be in quantum superpositions.
+
+4. Therefore, one would expect the brain to evolve
+into a quantum superposition of many such
+metastable configurations.
+
+5. Moreover, the fatigue characteristics of the synaptic
+junctions will cause any given metastable state
+to become, after a short time, unstable: the
+subject will then be forced to search for a new
+metastable configuration, and will therefore continue
+to evolve into a superposition of increasingly
+disparate states.
+
+(Footnote 5, continued:) recent discovery of decoherence [11,36,37] solved problem 2,
+as well as explaining so-called superselection rules for the first
+time (why for instance the position basis has a special status)
+[44].
+
+If different states (perceptions) of the subject correspond
+to different metastable states of neuron firing patterns, a
+definite perception would eventually evolve into a super-
+position of several subjectively distinguishable percep-
+tions.
+We will follow Stapp in making this assumption about
+Hs. For illustrative purposes, let us assume that this can
+happen even at the level of a single thought or snap de-
+cision where the outcome feels unpredictable to us. Con-
+sider the following experiment: the subject starts out
+with a blank face and counts silently to three, then makes
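+a snap decision on whether to smile or frown. The time-evolution
+operator U ≡ $\exp\left[-i\int H_s\,dt\right]$ will then have
+the property that U|¨- ⟩ = (| ¨⌣⟩ + | ¨⌢⟩)/√2, so the subject
+density matrix ρs will evolve into
+
+\begin{equation}
+\rho_s = U|\ddot{-}\rangle\langle\ddot{-}|U^\dagger
+= \tfrac{1}{2} (|\ddot{\smile}\rangle + |\ddot{\frown}\rangle)(\langle\ddot{\smile}| + \langle\ddot{\frown}|)
+= \tfrac{1}{2} (|\ddot{\smile}\rangle\langle\ddot{\smile}| + |\ddot{\smile}\rangle\langle\ddot{\frown}| + |\ddot{\frown}\rangle\langle\ddot{\smile}| + |\ddot{\frown}\rangle\langle\ddot{\frown}|), \tag{28}
+\end{equation}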
+
+corresponding to the four entries of 1/2 in the second
+matrix in Figure 5.
+
+5. Effect of Hse: subject decoherence
+
+Just as Hoe can decohere the object, Hse can decohere
+the subject. The difference is that whereas the object can
+be either a quantum system (with small Hoe) or a classi-
+cal system (with large Hoe), a human subject always has
+a large interaction with the environment. As we showed
+in Section III, τdec ≪ τdyn for the subject, i.e., the ef-
+fect of Hse is faster than that of Hs by many orders of
+magnitude. This means that we should strictly speaking
+not think of macrosuperpositions such as equation (28)
+as first forming and then decohering as in Figure 5 —
+rather, subject decoherence is so fast that such superpo-
+sitions decohere already during their process of forma-
+tion. Therefore we are never even close to being able to
+perceive superpositions of different perceptions. Reduc-
+ing object decoherence (from Hoe) during measurement
+would make no difference, since decoherence would take
+place in the brain long before the transmission of the ap-
+propriate sensory input through sensory nerves had been
+completed.
+
+C. He and Hsoe
+
+The environment is of course the most complicated sys-
+tem, since it contains the vast majority of the degrees of
+freedom in the total system. It is therefore very fortu-
+nate that we can so often ignore it, considering only those
+limited aspects of it that affect the subject and object.
+For the most general H, there can also be an ugly
+irreducible residual term Hsoe ≡ H − Hs − Ho − He −
+Hso − Hoe − Hse.
+
+D. Implications for modeling cognitive processes
+
+For the neural network community, the implication of
+our result is “business as usual”, i.e., there is no need
+to worry about the fact that current simulations do not
+incorporate effects of quantum coherence. The only rem-
+nant from quantum mechanics is the apparent random-
+ness that we subjectively perceive every time the subject
+system evolves into a superposition as in equation (28),
+but this can be simply modeled by including a random
+number generator in the simulation. In other words, the
+recipe used to prescribe when a given neuron should fire
+and how synaptic coupling strengths should be updated
+may have to involve some classical randomness to cor-
+rectly mimic the behavior of the brain.
+
+1. Hyper-classicality
+
+If a subject system is to be a good model of us, Hso
+and Hse need to meet certain criteria: decoherence and
+communication are necessary, but fluctuation and dissi-
+pation must be kept low enough that the subject does
+not lose its autonomy completely.
+In our study of neural processes, we concluded that the
+subject is not a quantum system, since τdec ≪ τdyn. How-
+ever, since the dissipation time τdiss for neuron firing is
+of the same order as its dynamical timescale, we see that
+in the sense of Figure 1, the subject is not a simple clas-
+sical system either. It is therefore somewhat misleading
+to think of it as simply some classical degrees of freedom
+evolving fairly undisturbed (only interacting enough to
+stay decohered and occasionally communicate with the
+outside world). Rather, the semi-autonomous degrees of
+freedom that constitute the subject are to be found at a
+higher level of complexity, perhaps as metastable global
+patterns of neuron firing.
+These degrees of freedom might be termed “hyper-classical”:
+although there is nothing quantum-mechanical
+about their equations of motion (except that
+they can be stochastic), they may bear little resemblance
+to the underlying classical equations from which they
+were derived.
+Energy conservation and other familiar
+concepts from Hamiltonian dynamics will be irrelevant
+for these more abstract equations, since neurons are en-
+ergy pumped and highly dissipative. Other examples of
+such hyper-classical systems include the time-evolution
+of the memory contents of a regular (highly dissipative)
+digital computer as well as the motion on the screen of
+objects in a computer game.
+
+2. Nature of the subject system
+
+In this paper, we have tacitly assumed that conscious-
+ness is synonymous with certain brain processes. This is
+what Lockwood terms the “identity theory” [66]. It dates
+back to Hobbes (∼1660) and has been espoused by, e.g.,
+Russell, Feigl, Smart, Armstrong, Churchland and Lock-
+wood himself. Let us briefly explore the more specific
+assumption that the subject degrees of freedom are our
+perceptions. In this picture, some of the subject degrees
+of freedom would have to constitute a “world model”,
+with the interaction Hso such that the resulting commu-
+nication keeps these degrees of freedom highly correlated
+with selected properties of the outside world (object +
+environment). Some such properties, i.e.,
+
+• the intensity of the electromagnetic radiation on the retina,
+averaged through three narrow-band filters (color
+vision) and one broad-band filter (black-and-white
+vision),
+
+• the spectrum of air pressure fluctuations in the ears
+(sound),
+
+• the chemical composition of gas in the nose (smell)
+and solutions in the mouth (taste),
+
+• heat and pressure at a variety of skin locations,
+
+• locations of body parts,
+
+are tracked rather continuously, with the corresponding
+mutual information I12 between subject and surround-
+ings remaining fairly constant.
+Persisting correlations
+with properties of the past state of the surroundings
+(memories) further contribute to the mutual information
+I12. Much of I12 is due to correlations with quite subtle
+aspects of the surroundings, e.g., the contents of books.
+The total mutual information I12 between a person and
+the external world is fairly low at birth, gradually grows
+through learning, and falls when we forget. In contrast,
+most inanimate objects have a very small mutual information
+with the rest of the world, books and diskettes being
+notable exceptions.
+The extremely limited selection of properties that the
+subject correlates with has presumably been determined
+by evolutionary utility, since it is known to differ between
+species: birds perceive four primary colors but cats only
+one, bees perceive light polarization, etc. In this picture,
+we should therefore not consider these particular (“classi-
+cal”) aspects of our surroundings to be more fundamental
+than the vast majority that the subject system is uncorrelated
+with. Moreover, our perception of, e.g., space is as
+subjective as our perception of color, just as suggested
+by e.g. [50].
+
+3. The binding problem
+
+One of the motivations for models with quantum co-
+herence in the brain was the so-called binding problem.
+In the words of James [75,76], “the only realities are the
+separate molecules, or at most cells. Their aggregation
+into a ‘brain’ is a fiction of popular speech”. James’ con-
+cern, shared by many after him, was that consciousness
+did not seem to be spatially localized to any one small
+part of the brain, yet subjectively feels like a coherent
+entity. Because of this, Stapp [3] and many others have
+appealed to quantum coherence, arguing that this could
+make consciousness a holistic effect involving the brain
+as a whole.
+However, non-local degrees of freedom can be important
+even in classical physics. For instance, oscillations
+in a guitar string are local in Fourier space, not in real
+space, so in this case the “binding problem” can be solved
+by a simple change of variables. As Eddington remarked
+[77], when observing the ocean we perceive the moving
+waves as objects in their own right because they display a
+certain permanence, even though the water itself is only
+bobbing up and down. Similarly, thoughts are presum-
+ably highly non-local excitation patterns in the neural
+network of our brain, except of a non-linear and much
+more complex nature.
+In short, this author feels that
+there is no binding problem.
+
+4. Outlook
+
+In summary, our decoherence calculations have in-
+dicated that there is nothing fundamentally quantum-
+mechanical about cognitive processes in the brain, supporting
+Hepp’s conjecture [33]. Specifically, the computations
+in the brain appear to be of a classical rather
+than quantum nature, and the argument by Lisewski [78]
+that quantum corrections may be needed for accurate
+modeling of some details, e.g., non-Markovian noise in
+neurons, does of course not change this conclusion. This
+means that although the current state-of-the-art in neu-
+ral network hardware is clearly still very far from be-
+ing able to model and understand cognitive processes as
+complex as those in the brain, there are no quantum me-
+chanical reasons to doubt that this research is on the
+right track.
+
+Acknowledgements:
+The author wishes to thank
+the organizers of the Spaatind-98 and Gausdal-99 win-
+ter schools, where much of this work was done, and
+Mark Alford, Philippe Blanchard, Carlton Caves, Angel-
+ica de Oliveira-Costa, Matthew Donald, Andrei Gruzi-
+nov, Piet Hut, Nick Mavromatos, Henry Stapp, Hans-
+Dieter Zeh and Woitek Zurek for stimulating discussions
+and helpful comments. Support for this work was provided
+by the Sloan Foundation and by NASA through
+grant NAG5-6034 and Hubble Fellowship HF-01084.01-96A
+from STScI, operated by AURA, Inc. under NASA
+contract NAS5-26555.
+
+APPENDIX: DECOHERENCE FORMULAS
+
+The quantitative effect of decoherence from both short
+range interactions (scattering) and long-range interac-
+tions was first derived in a seminal paper by Joos & Zeh
+[45]. Since our application involved scattering between
+particles of comparable mass, we used a generalized ver-
+sion of these results that included the effect of recoil [46].
+In this Appendix, we derive a slightly generalized formula
+for long-range interactions, and briefly comment on the
+relation between these short-range and long-range limit-
+ing cases.
+
+1. Decoherence due to tidal forces
+
+Even if the dissipation and fluctuation caused by Hint
+is dynamically unimportant, H1 and H2 can be neglected
+in equation (2) when calculating the decoherence effect in
+the many cases where the interaction Hamiltonian deco-
+heres the object on a timescale far below the dynamical
+time. In this approximation, we consider two particles
+with an interaction H = Hint = V (r2 − r1) for some
+potential V . According to equation (1), the two-particle
+density matrix ρ therefore evolves as
+
+\begin{equation}
+\rho(r_1, r_1', r_2, r_2', t_0 + t) = \rho(r_1, r_1', r_2, r_2', t_0)\, e^{-i[V(r_2 - r_1) - V(r_2' - r_1')]t/\hbar}. \tag{A1}
+\end{equation}
+
+Following [45], we assume that the two particles are fairly
+localized near their initial average positions
+
+\begin{equation}
+r_i^0 \equiv \langle r_i \rangle_0 = \mathrm{tr}\,[r_i\,\rho_i(t_0)], \tag{A2}
+\end{equation}
+
+i = 1, 2, and approximate the potential by its second
+order Taylor expansion
+
+\begin{equation}
+V(r_2 - r_1) \approx V(a) - F\cdot(x_2 - x_1) + \tfrac{1}{2}(x_2 - x_1)^t M (x_2 - x_1). \tag{A3}
+\end{equation}
+
+Here $F \equiv -\nabla V(a)$ is the average force, $M$ is the Hessian
+matrix $M_{ij} \equiv \partial_i\partial_j V(a)$ and $a \equiv r_2^0 - r_1^0$. We have
+introduced relative coordinates $x_i \equiv r_i - r_i^0$. Assuming that
+the two particles are independent initially as in [45], i.e.,
+that $\rho(t_0)$ takes the separable form
+$\rho(x_1, x_1', x_2, x_2', t_0) = \rho_1(x_1, x_1', t_0)\,\rho_2(x_2, x_2', t_0)$, this gives
+
+\begin{equation}
+\rho_1(x_1, x_1', t_0 + t) = \mathrm{tr}_2\,\rho(t_0 + t) = \int \rho(x_1, x_1', x, x, t_0 + t)\,d^3x = \rho_1(x_1, x_1', t_0)\,f(x_1, x_1', t), \tag{A4}
+\end{equation}
+
+where
+
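+\begin{align}
+f(x_1, x_1', t) &\approx e^{i\varphi(x_1, x_1', t)} \int \rho_2(x_2, x_2, t_0)\, e^{-it(x_1' - x_1)^t M x_2/\hbar}\,d^3x_2 \nonumber\\
+&= e^{i\varphi(x_1, x_1', t)}\,\hat{p}_2[M(x_1' - x_1)t/\hbar]. \tag{A5}
+\end{align}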
+
+Here the phase factor
+
+\begin{equation}
+e^{i\varphi(x, x', t)} \equiv e^{\frac{it}{\hbar}\left[F\cdot(x' - x) + \frac{1}{2}x'^t M x' - \frac{1}{2}x^t M x\right]} \tag{A6}
+\end{equation}
+
+is of no importance for decoherence, since it does not
+suppress the magnitude $|\rho_1(x_1, x_1', t)|$ of the off-diagonal
+elements; it merely causes momentum transfer related
+to fluctuation and dissipation.
+It is the other term
+that causes decoherence. $\hat{p}_2$ is the Fourier transform of
+$p_2(x) \equiv \rho_2(x, x, t_0)$, the probability distribution for the
+location of the environment particle.
+
+2. Properties of the effect
+
+Let us briefly discuss some qualitative features of equation (A5).
+Since $\hat{p}_2(0) = \int p_2(x_2)\,d^3x_2 = \mathrm{tr}\,\rho_2 = 1$,
+$\rho_1(x, x')$ remains unchanged on the diagonal $x = x'$.
+This is because $H_{\rm int}$ is not changing the position of our
+object particle, merely its momentum.
+Since the mean position $\langle x_2 \rangle = \int p_2\,x_2\,d^3x_2 = \mathrm{tr}\,[x_2\rho_2] = 0$
+vanishes (using equation (A2)), we have $\nabla\hat{p}_2(0) = 0$.
+In fact, $|f|$ takes a maximum on the diagonal, and the
+Riemann-Lebesgue Lemma shows that $|f| = |\hat{p}_2| \le 1$
+whenever $x \ne x'$, with equality only for the unphysical
+case where $p_2$ is a delta function, i.e., where the
+location of the environment particle is perfectly known.
+$\partial_i\partial_j|f(0)| = -M\langle x_2 x_2^t \rangle M t^2/2\hbar^2$, so the larger $\langle x_2 x_2^t \rangle$
+is (i.e., the more spread out the environment particle is),
+the closer to the diagonal decoherence will suppress our
+density matrix.
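+
+A worked special case may make this scaling concrete. As assumptions made
+only for illustration (they are not in the original text), take the
+environment particle's position distribution to be an isotropic Gaussian
+of width $\sigma$ centred on its mean position, and use the Fourier
+convention $\hat{p}(k) \equiv \int p(x)\,e^{-ik\cdot x}\,d^3x$:
+
+```latex
+p_2(x) = (2\pi\sigma^2)^{-3/2}\, e^{-|x|^2/2\sigma^2}
+\quad\Rightarrow\quad
+\hat{p}_2(k) = e^{-\sigma^2 |k|^2/2},
+\qquad
+|f(x_1, x_1', t)| = \exp\left[-\frac{\sigma^2 t^2}{2\hbar^2}\,\bigl|M(x_1' - x_1)\bigr|^2\right].
+```
+
+In this sketch the coherence between $x_1$ and $x_1'$ has dropped by a factor
+$e^{-1/2}$ at $t = \hbar/(\sigma\,|M(x_1' - x_1)|)$, so a more spread-out
+environment particle (larger $\sigma$) or a stronger tidal gradient
+(larger $M$) shortens the decoherence time, in line with the statement above.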
+Since M is the shear matrix of the force field −∇V , we
+see that it is tidal forces that are causing the decoherence
+— the average force F simply contributes to the phase
+factor $e^{i\varphi}$. Specifically, the rate at which our object degrees
+of freedom $r_1$ decohere grows with the tidal force
+that it exerts on the environment: if the environment
+particle is spread out with $\langle x_2 x_2^t \rangle$ large, experiencing a
+wide range of forces from the object, object decoherence
+is rapid. In the opposite situation, where the object is
+spread out and the environment is not, the object will
+experience strong classical tidal forces but no decoher-
+ence.
+
+3. Relation between long-range and short-range
+decoherence
+
+Above we derived the effect of decoherence from long-
+range tidal forces. Another interesting case that has been
+solved analytically [45] is that of short-range interactions
+that can be modeled as scattering events. If the scattering
+takes place during short enough a time interval that
+we can neglect the internal dynamics of the object, then
+its reduced density matrix changes as [46]
+
+\begin{equation}
+\rho_1(r, r') \mapsto \rho_1(r, r')\,\hat{p}\left(\frac{r' - r}{\hbar}\right), \tag{A7}
+\end{equation}
+
+where p(q) is the probability distribution for the momen-
+tum transfer q in the collision. This equation generalizes
+the scattering result of [45] by including the effect of re-
+coil. The larger the uncertainty in momentum transfer,
+the stronger the decoherence effect becomes, since widening
+$p$ narrows its Fourier transform $\hat{p}$. Changing the mean
+momentum transfer $\langle q \rangle$ does not affect the decoherence;
+it merely contributes a phase factor just as $F$ did above.
+Typically, the last factor in equation (A7) destroys coherence
+down to scales of order the de Broglie wavelength
+of the scatterer, with directional modulations from the
+angular dependence of the scattering cross section. Gen-
+eralization to a steady flux of scattering particles [46]
+gives equation (6).
+Equation (A7) has striking similarities with the tidal
+force result of equation (A5): in both cases, the density
+matrix gets multiplied by the Fourier transform of a prob-
+ability distribution.
+In fact, up to uninteresting phase
+factors, we can rewrite our equation (A5) in exactly the
+form of equation (A7) by redefining p to be the probabil-
+ity distribution for momentum transfer q = M(x2 −x1)t
+due to tidal forces for a fixed x1, i.e.,
+
+\begin{equation}
+p(q) \equiv p_2(x_2)\,\frac{d^3x_2}{d^3q} = \frac{p_2(x_1 + M^{-1}q/t)}{t^3 \det M}. \tag{A8}
+\end{equation}
+
+Fourier transforming this expression and substituting the
+result into equation (A7), we recover equation (A5) up
+to a phase factor.
+Perhaps the simplest way to understand all these re-
+sults is in terms of Wigner functions [79]. If W(x1, p1) is
+the Wigner phase space distribution for the object parti-
+cle, then any of the momentum-transferring interactions
+that we have considered will take the form
+
+\begin{equation}
+W(x_1, p_1) \mapsto \int W(x_1, p_1 - q)\,p(q, x_1)\,d^3q \tag{A9}
+\end{equation}
+
+for some probability distribution p that may or may not
+depend on x1. Since the density matrix
+
+\begin{equation}
+\rho_1(x_1, x_1') = \int W\left(\frac{x_1 + x_1'}{2}, p\right) e^{-i(x_1 - x_1')\cdot p}\,d^3p \tag{A10}
+\end{equation}
+
+is just the Wigner function Fourier transformed in the
+momentum direction (and rotated by 45°), the convolution
+with $p$ in equation (A9) reduces to a simple multiplication
+with $\hat{p}$ in equation (A7).
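+
+The step from equation (A9) to equation (A7) can be spelled out as follows.
+This is a sketch under assumptions made here for illustration: units with
+$\hbar = 1$, a distribution $p$ that does not depend on $x_1$, and the
+convention $\hat{p}(k) \equiv \int p(q)\,e^{-ik\cdot q}\,d^3q$ (the opposite
+sign convention simply exchanges $x_1 - x_1'$ and $x_1' - x_1$ in the
+argument of $\hat{p}$). Substituting $p' = p - q$:
+
+```latex
+\rho_1(x_1, x_1') \mapsto
+\int\!\!\int W\!\left(\tfrac{x_1 + x_1'}{2},\, p - q\right) p(q)\, e^{-i(x_1 - x_1')\cdot p}\, d^3q\, d^3p
+= \left[\int W\!\left(\tfrac{x_1 + x_1'}{2},\, p'\right) e^{-i(x_1 - x_1')\cdot p'}\, d^3p'\right]
+  \left[\int p(q)\, e^{-i(x_1 - x_1')\cdot q}\, d^3q\right]
+= \rho_1(x_1, x_1')\,\hat{p}(x_1 - x_1').
+```
+
+The first bracket is the original density matrix and the second is the
+Fourier transform of the momentum-transfer distribution, which is the
+multiplicative structure of equation (A7).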
+
+[1] R. Penrose, The Emperor’s New Mind (Oxford, Oxford
+Univ. Press, 1989).
+[2] R. Penrose, in The Large, the Small and the Human
+Mind, ed.
+M. Longair (Cambridge, Cambridge Univ.
+Press, 1997).
+[3] H. P. Stapp, Mind, Matter and Quantum Mechanics
+(Berlin, Springer, 1993).
+[4] D. J. Amit, Modeling Brain Functions (Cambridge, Cam-
+bridge Univ. Press, 1989).
+[5] M. Mézard, G. Parisi, and M. Virasoro, Spin Glass Theory
+and Beyond (Singapore, World Scientific, 1993).
+[6] R. L. Harvey, Neural Network Principles (Englewood
+Cliffs, Prentice Hall, 1994).
+[7] F. H. Eeckman and J. M. Bower, Computation and Neu-
+ral Systems (Boston, Kluwer, 1993).
+[8] D. R. McMillen, G. M. T D’Eleuterio, and J. R. P
+Halperin, Phys. Rev. E 59, 6 (1999).
+[9] E. P. Wigner, in The Scientist Speculates: an Anthology
+of Partly-Baked Ideas, p284-302, ed. I. J. Good (London,
+Heinemann, 1962).
+[10] J. Mehra and A. S. Wightman, The Collected Works of
+E. P. Wigner, Vol. VI, p271 (Berlin, Springer, 1995).
+[11] H. D. Zeh, Found. Phys. 1, 69 (1970).
+[12] E. H. Walker, Mathematical Biosciences 7, 131 (1970).
+[13] L. H. Domash, in Scientific Research on TM, ed. D. W.
+Orme-Johnson and J. T. Farrow (Weggis, Switzerland,
+Maharishi Univ. Press, 1977).
+[14] H. P. Stapp, Phys. Rev. D 28, 1386 (1983).
+[15] I. N. Marshall, New Ideas in Psychology 7, 73 (1989).
+[16] D. Zohar, The Quantum Self (New York, William Mor-
+row, 1990).
+[17] H. Rosu, Metaphysical Review 3, 1, gr-qc/9409007
+(1997).
+[18] L. M. Ricciardi and H. Umezawa, Kibernetik 4, 44
+(1967).
+[19] A. Vitiello, Int. J. Mod. Phys. B9, 973-89 (1996).
+[20] S. R. Hameroff and R. C. Watt, Journal of Theoretical
+Biology 98, 549 (1982).
+S. R. Hameroff, Ultimate Computing: Biomolecular Consciousness
+and Nanotechnology (Amsterdam, North-Holland, 1987).
+[21] D. V. Nanopoulos 1995, hep-ph/9505374
+[22] N. Mavromatos and D. V. Nanopoulos 1995, hep-
+ph/9505401
+[23] N. Mavromatos and D. V. Nanopoulos 1995, quant-
+ph/9510003
+[24] N. Mavromatos and D. V. Nanopoulos 1995, quant-
+ph/9512021
+[25] N. Mavromatos and D. V. Nanopoulos, Int. J. Mod. Phys
+B 12, 517, quant-ph/9708003 (1998).
+[26] N. Mavromatos and D. V. Nanopoulos 1998, quant-
+ph/9802063
+[27] N. Mavromatos 1999, J.
+Bioelectrochemistry & Bioenergetics 48, 273
+[28] H. P. Stapp, Found. Phys. 21, 1451 (1991).
+[29] H. D. Zeh, quant-ph/9908084, Epistemological Letters of
+the Ferdinand-Gonseth Association 63:0 (Biel, Switzer-
+land, 1981)
+[30] W. H. Zurek, Phys. Today 44 (10), 36 (1991).
+[31] A. Scott, J. Consciousness Studies 6, 484 (1996).
+
+[32] S. Hawking, in The Large, the Small and the Human
+Mind, ed.
+M. Longair (Cambridge, Cambridge Univ.
+Press, 1997).
+[33] K. Hepp, in Quantum Future, ed. P. Blanchard and A.
+Jadczyk (Berlin, Springer, 1999).
+[34] J. von Neumann, Mathematische Grundlagen der
+Quanten-Mechanik (Springer, Berlin, 1932).
+[35] H. D. Zeh, The Arrow of Time, 3rd ed. (Springer, Berlin,
+1999).
+[36] W. H. Zurek, Phys. Rev. D 24, 1516 (1981).
+[37] W. H. Zurek, Phys. Rev. D 26, 1862 (1982).
+[38] W. H. Zurek, reprint LAUR 84-2750, in Non-Equilibrium
+Statistical Physics, ed. G. Moore and M. O. Scully (New
+York, Plenum, 1984).
+[39] E. Peres, Am. J. Phys. 54, 688 (1986).
+[40] P. Pearle, Phys. Rev. A 39, 2277 (1989).
+[41] M. R. Gallis and G. N. Fleming, Phys. Rev. A 42, 38
+(1989).
+[42] W. H. Unruh and W. H. Zurek, Phys. Rev. D 40, 1071
+(1989).
+[43] R. Omnès, Phys. Rev. A 56, 3383 (1997).
+[44] D. Giulini, E. Joos, C. Kiefer, J. Kupsch, I. O. Sta-
+matescu, and H. D. Zeh, Decoherence and the Appear-
+ance of a Classical World in Quantum Theory (Springer,
+Berlin, 1996).
+[45] E. Joos and H. D. Zeh, Z. Phys. B 59, 223 (1985).
+[46] M. Tegmark, Found. Phys. Lett. 6, 571 (1993).
+[47] R. P. Feynman, Statistical Mechanics (Reading, Ben-
+jamin, 1972).
+[48] B. Katz, Nerve, Muscle, and Synapse (New York,
+McGraw-Hill, 1966).
+[49] J. P. Schadé and D. H. Ford, Basic Neurology, 2nd ed.
+(Amsterdam, Elsevier, 1973).
+[50] A. G. Cairns-Smith, Evolving the Mind (Cambridge,
+Cambridge Univ. Press, 1996).
+[51] P. Morell and W. T. Norton, Sci. Am. 242, 74 (1980).
+[52] A. Hirano and J. A. Llena, in The Axon, ed. S. G. Wax-
+man, J. D. Kocsis, and P. K. Stys (New York, Oxford
+Univ. Press, 1995).
+[53] J. M. Ritchie, in The Axon, ed. S. G. Waxman, J. D.
+Kocsis, and P. K. Stys (New York, Oxford Univ. Press,
+1995).
+[54] J. Ellis, S. Mohanty, and D. V. Nanopoulos, Phys. Lett. B
+221, 113 (1989).
+[55] P. Pearle, Phys. Rev. D 13, 857 (1976).
+[56] G. C. Ghirardi, A. Rimini, and T. Weber, Phys. Rev. D
+34, 470 (1986).
+[57] J. D. Jackson, Classical Electrodynamics (New York, Wi-
+ley, 1975).
+[58] W. H. Zurek, S. Habib, and J. P. Paz, Phys. Rev. Lett.
+70, 1187 (1993).
+[59] M. Tegmark and H. S. Shapiro, Phys. Rev. E 50, 2538
+(1994).
+[60] M. V. Satarić, J. A. Tuszyński, and R. B. Žakula, Phys.
+Rev. E 48, 589 (1993).
+[61] H. C. Rosu, Phys. Rev. E 55, 2038 (1997).
+[62] H. C. Rosu, Nuovo Cimento D 20, 369 (1998).
+[63] H. P. Stapp 1999,
+Attention, Intention, and Mind in Quantum Physics and
+Quantum Ontology and Mind-Matter Synthesis, available
+at www-physics.lbl.gov/ stapp/stappfiles.html.
+[64] H. Everett III, Rev. Mod. Phys. 29, 454 (1957).
+H. Everett III, The Many-Worlds Interpretation of Quan-
+tum Mechanics, B. S. DeWitt and N. Graham (Prince-
+ton, Princeton Univ. Press, 1986).
+[65] J. A. Wheeler, Rev. Mod. Phys. 29, 463 (1957).
+L. M. Cooper and D. van Vechten, Am. J. Phys 37, 1212
+(1969).
+B. S. DeWitt, Phys. Today 23, 30 (1971).
+[66] M. Lockwood, Mind, Brain and the Quantum (Cam-
+bridge, Blackwell, 1989).
+[67] D. Deutsch The Fabric of Reality (Allen Lane, New York,
+1997).
+[68] D. N. Page A 1995, gr-qc/9507025
+[69] M. J. Donald 1997, quant-ph/9703008
+[70] M. J. Donald 1999, quant-ph/9904001
+[71] L. Vaidman 1996, quant-ph/9609006, Int. Stud. Phil.
+Sci., in press
+[72] T. Sakaguchi 1997, quant-ph/9704039
+[73] M. Tegmark, quant-ph/9709032, Fortschr. Phys. 46, 855
+(1997).
+[74] M. Tegmark, gr-qc/9704009, Annals of Physics 270, 1
+(1998).
+[75] W. James, The Principles of Psychology (New York, Holt,
+1890).
+[76] W. James 1904, in The Writings of William James,
+pp169-183, ed. J. J. McDermott (Chicago, Univ. Chicago
+Press, 1977).
+[77] A. Eddington, Space, Time & Gravitation (Cambridge,
+Cambridge Univ. Press, 1920).
+[78] A. M. Lisewski 1999, quant-ph/9907052
+[79] E. P. Wigner, Phys. Rev. 40, 749 (1932).
+M. Hillery, R. H. O’Connell, M. O. Scully, and E. P. Wigner,
+Phys. Rep. 106, 121 (1984).
+Y. S. Kim and M. E. Noz, Phase Space Picture of Quan-
+tum Mechanics: Group Theoretical Approach (Singapore,
+World Scientific, 1991).
+
diff --git a/papers/references/Tegmark2000_Placeholder.md b/papers/references/Tegmark2000_Placeholder.md
deleted file mode 100644
index a1eb5956..00000000
--- a/papers/references/Tegmark2000_Placeholder.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Importance of quantum decoherence in brain processes (Tegmark 2000)
-
-This paper proves that the decoherence timescale in the brain is ~10^-13 seconds, demonstrating the absolute physical limit of sustained quantum states at biological temperatures.
-Due to copyright and its format, the full PDF is not hosted in this repository.
-
-**Citation:**
-Tegmark, M. (2000). *Phys. Rev. E* **61**, 4194.
diff --git a/papers/references/temp/Amari2016.pdf b/papers/references/temp/Amari2016.pdf
new file mode 100644
index 00000000..eab69df0
--- /dev/null
+++ b/papers/references/temp/Amari2016.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d54c73f85233ad188ddd594aa061cd6ed671c77dd371ef2525e3448098558190
+size 15842558
diff --git a/papers/references/temp/Amari2016.txt b/papers/references/temp/Amari2016.txt
new file mode 100644
index 00000000..6c6084bb
--- /dev/null
+++ b/papers/references/temp/Amari2016.txt
@@ -0,0 +1,37050 @@
+Information Geometry
+
+Edited by Geert Verdoolaege
+
+Printed Edition of the Special Issue Published in Entropy
+
+www.mdpi.com/journal/entropy
+
+
+MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade
+
+
+Special Issue Editor
+
+Geert Verdoolaege
+
+Ghent University
+
+Belgium
+
+Editorial Office
+
+MDPI
+
+St. Alban-Anlage 66
+
+4052 Basel, Switzerland
+
+This is a reprint of articles from the Special Issue published online in the open access journal Entropy
+
+(ISSN 1099-4300) in 2014 (available at: https://www.mdpi.com/journal/entropy/special_issues/information-geometry)
+
+For citation purposes, cite each article independently as indicated on the article page online and as
+
+indicated below:
+
+LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. Journal Name Year, Article Number,
+
+Page Range.
+
+ISBN 978-3-03897-632-5 (Pbk)
+
+ISBN 978-3-03897-633-2 (PDF)
+
+Cover image courtesy of Geert Verdoolaege.
+
+© 2019 by the authors. Articles in this book are Open Access and distributed under the Creative
+
+Commons Attribution (CC BY) license, which allows users to download, copy and build upon
+
+published articles, as long as the author and publisher are properly credited, which ensures maximum
+
+dissemination and a wider impact of our publications.
+
+The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons
+
+license CC BY-NC-ND.
+
+
+Contents
+
+About the Special Issue Editor
+
+Preface to “Information Geometry”
+
+Shun-ichi Amari
+Information Geometry of Positive Measures and Positive-Definite Matrices:
+Decomposable Dually Flat Structure
+Reprinted from: Entropy 2014, 16, 2131–2145, doi:10.3390/e16042131
+
+Harsha K. V. and Subrahamanian Moosath K S
+F-Geometry and Amari’s α-Geometry on a Statistical Manifold
+Reprinted from: Entropy 2014, 16, 2472–2487, doi:10.3390/e16052472
+
+Frank Critchley and Paul Marriott
+Computational Information Geometry in Statistics: Theory and Practice
+Reprinted from: Entropy 2014, 16, 2454–2471, doi:10.3390/e16052454
+
+Paul Vos and Karim Anaya-Izquierdo
+Using Geometry to Select One Dimensional Exponential Families That Are Monotone
+Likelihood Ratio in the Sample Space, Are Weakly Unimodal and Can Be Parametrized by a
+Measure of Central Tendency
+Reprinted from: Entropy 2014, 16, 4088–4100, doi:10.3390/e16074088
+
+Guido Montúfar, Johannes Rauh and Nihat Ay
+On the Fisher Metric of Conditional Probability Polytopes
+Reprinted from: Entropy 2014, 16, 3207–3233, doi:10.3390/e16063207
+
+André Klein
+Matrix Algebraic Properties of the Fisher Information Matrix of Stationary Processes
+Reprinted from: Entropy 2014, 16, 2023–2055, doi:10.3390/e16042023
+
+Keisuke Yano and Fumiyasu Komaki
+Asymptotically Constant-Risk Predictive Densities When the Distributions of Data and Target
+Variables Are Different
+Reprinted from: Entropy 2014, 16, 3026–3048, doi:10.3390/e16063026
+
+Samuel Livingstone and Mark Girolami
+Information-Geometric Markov Chain Monte Carlo Methods Using Diffusions
+Reprinted from: Entropy 2014, 16, 3074–3102, doi:10.3390/e16063074
+
+Hui Zhao and Paul Marriott
+Variational Bayes for Regime-Switching Log-Normal Models
+Reprinted from: Entropy 2014, 16, 3832–3847, doi:10.3390/e16073832
+
+Frank Nielsen, Richard Nock and Shun-ichi Amari
+On Clustering Histograms with k-Means by Using Mixed α-Divergences
+Reprinted from: Entropy 2014, 16, 3273–3301, doi:10.3390/e16063273
+
+Salem Said, Lionel Bombrun and Yannick Berthoumieu
+New Riemannian Priors on the Univariate Normal Model
+Reprinted from: Entropy 2014, 16, 4015–4031, doi:10.3390/e16074015
+
+Luigi Malagò and Giovanni Pistone
+Combinatorial Optimization with Information Geometry: The Newton Method
+Reprinted from: Entropy 2014, 16, 4260–4289, doi:10.3390/e16084260
+
+Domenico Felice, Carlo Cafaro and Stefano Mancini
+Information Geometric Complexity of a Trivariate Gaussian Statistical Model
+Reprinted from: Entropy 2014, 16, 2944–2958, doi:10.3390/e16062944
+
+Alexandre Levada
+Learning from Complex Systems: On the Roles of Entropy and Fisher Information in Pairwise
+Isotropic Gaussian Markov Random Fields
+Reprinted from: Entropy 2014, 16, 1002–1036, doi:10.3390/e16021002
+
+Masatoshi Funabashi
+Network Decomposition and Complexity Measures: An Information Geometrical Approach
+Reprinted from: Entropy 2014, 16, 4132–4167, doi:10.3390/e16074132
+
+Roger Balian
+The Entropy-Based Quantum Metric
+Reprinted from: Entropy 2014, 16, 3878–3888, doi:10.3390/e16073878
+
+Xiaozhao Zhao, Yuexian Hou, Dawei Song and Wenjie Li
+Extending the Extreme Physical Information to Universal Cognitive Models via a Confident
+Information First Principle
+Reprinted from: Entropy 2014, 16, 3670–3688, doi:10.3390/e16073670
+
+
+About the Special Issue Editor
+
+Geert Verdoolaege obtained an M.Sc.
+degree in Theoretical Physics in 1999 and the Ph.D. in
+
+Engineering Physics in 2006, both at Ghent University (UGent, Belgium). His Ph.D. work concerned
+
+applications of Bayesian probability theory to plasma spectroscopy in fusion devices.
+He was a
+
+postdoctoral researcher in the field of computer vision at the University of Antwerp (2007–2008),
+
+working on probabilistic modeling of image textures using information geometry. From 2008 to 2010,
+
+he was with the Department of Data Analysis at UGent, where he worked on modeling and
+
+estimation of brain activity, based on functional magnetic resonance imaging. In 2010, he returned
+
+to the Department of Applied Physics at UGent, first as a postdoctoral assistant and from 2014
+
+onwards, as a part-time assistant professor.
+Since 2013, he has held a cross-appointment as a
+
+researcher at the Laboratory for Plasma Physics of the Royal Military Academy (LPP-ERM/KMS)
+
+in Brussels. His research activities comprise development of data analysis techniques using methods
+
+from probability theory, machine learning and information geometry, and their application to nuclear
+
+fusion experiments. He also teaches a Master course on Continuum Mechanics at Ghent University.
+
+He serves on the editorial board of the multidisciplinary journal Entropy and is a member of the
+
+scientific committees of several conferences (IAEA Technical Meeting on Fusion Data Processing,
+
+Validation and Analysis; International Workshop on Bayesian Inference and Maximum Entropy
+
+Methods in Science and Engineering; Conference on Geometric Science of Information). In addition,
+
+he is a consulting expert in the International Tokamak Physics Activity (ITPA) Transport and
+
+Confinement Topical Group and member of the General Assembly of the European Fusion Education
+
+Network (FuseNet).
+
+Preface to ”Information Geometry”
+
+The mathematical field of information geometry originated from the observation that the Fisher
+
+information can be used to define a Riemannian metric on manifolds of probability distributions.
+
+This led to a geometrical description of probability theory and statistics, which over the years has
+
+developed into a rich mathematical field with a broad range of applications in the data sciences.
+
+Moreover, similar to the concept of entropy, there are various connections to and applications of
+
+information geometry in statistical mechanics, quantum mechanics, and neuroscience.
+
+It has been a pleasure to act as a guest editor for this first Special Issue on information geometry
+
+in the journal Entropy. For me, as a physicist working on the development and application of
+
+data science techniques in the context of nuclear fusion experiments, the interdisciplinary character
+
+of information geometry has always been one of the main reasons for its appeal.
+There are, of
+
+course, many other domains in physics where geometrical notions play a key role, including classical
+
+mechanics, continuum mechanics (which I have been teaching at Ghent University for several years
+
+now), general relativity, and much of modern physics. This interplay between the beautiful and
+
+elegant formalism of differential geometry on the one hand and physics and data science on the
+
+other hand is both fascinating and inspiring. The variety of topics covered by this Special Issue is a
+
+reflection of this cross-fertilization between disciplines.
+
+“Information Geometry I” has been a great success, and although the papers were already published
+
+several years ago, it was decided that it was worthwhile to reprint the collection of papers
+
+in book form. Indeed, even though all papers present original research, many have a strong tutorial
+
+character, and we were honored to receive multiple contributions by authorities in the field. The
+
+papers have been structured according to their main subject area, or field of application, and we
+
+briefly discuss each of them in the following.
+
+We start with two papers related to the foundations of information geometry. We were very
+
+pleased to receive a contribution by one of the founders of the field of information geometry, Prof.
+
+Shun-ichi Amari. In his paper, the dually flat structure of the manifold of positive measures is
+
+discussed, derived from a class of Bregman divergences. These so-called (ρ,τ)-divergences, originally
+
+proposed by J. Zhang, are defined in terms of two monotone, scalar functions (ρ and τ) and form a
+
+unique class of dually flat, decomposable divergences. This is extended to the set of positive-definite
+
+matrices, additionally requiring invariance of the divergence under matrix transformations. It is well
+
+known that such dually flat manifolds have computationally desirable properties in applications to
+
+classification and information retrieval.
+
+Harsha K. V. and Subrahamanian Moosath K. S. introduce F-geometry as a generalization
+
+of α-geometry, based on a representation of a probability density function through a function F.
+
+They then combine this with another function G to define a weighted expectation, from which
+
+an (F,G)-metric and connection are deduced. A condition for two such structures to lead
+
+to dual connections is also derived. However, it was shown by Zhang (J. Zhang, Entropy 17,
+
+pp. 4485–4499, 2015) that this framework is equivalent to the (ρ,τ)-geometry introduced earlier by
+
+him. Although the present paper is slightly different in perspective, it should be read with this
+
+equivalence in mind.
+
+The next four papers deal with applications of information geometry in statistics. The paper
+
+by Frank Critchley and Paul Marriott presents an important research program aimed at rendering
+
+some of the most useful results of information geometry more accessible to statisticians in
+
+practical applications. Indeed, the formalism of differential geometry and tensor algebra can
+
+appear daunting at first sight and may present an obstacle to adoption of many useful results
+
+by practitioners. The paper describes a computational framework that facilitates implementation
+
+of results from information geometry, based on an embedding of various important statistical
+
+models in a (sufficiently large) simplex. Challenges related to extension of the framework to the
+
+infinite–dimensional case are touched upon as well.
+
+In the paper by Paul Vos and Karim Anaya-Izquierdo, the goal is to identify one-dimensional
+
+exponential families enjoying a number of properties that are convenient for statistical modeling,
+
+i.e., parametrization by a measure of central tendency, unimodality, and monotone likelihood ratio.
+
+The basis for the framework is the multinomial distribution, modeled geometrically by the simplex.
+
+The selection of exponential families with desirable properties is then based on a partitioning of the
+
+natural parameter space of the family of multinomial distributions by means of convex cones.
+
+Guido Montúfar and co-workers consider various possibilities to define natural Riemannian
+
+metrics on polytopes of stochastic matrices, which describe the conditional probability distribution
+
+of two categorical random variables. Inspired by the classical result regarding the uniqueness of the
+
+Fisher metric by requiring invariance under Markov morphisms, they define metrics derived from
+
+a natural class of stochastic maps between such polytopes, or, alternatively, through embeddings in
+
+various possible model spaces. They provide recommendations as to which metric to use, depending
+
+on the application.
+
+André Klein, in his article, provides a survey of several matrix algebraic properties of the Fisher
+
+information matrix corresponding to weakly stationary time series. The link with various structured
+
+matrices arising from a number of time series models is demonstrated. A statistical distance measure
+
+is built using the Fisher information matrix in the context of classical and quantum information.
+
+Finally, conditions are obtained for the Fisher information of a stationary process to obey certain
+
+forms of the Stein equation.
+
+We continue with three papers concerning applications of information geometry in Bayesian
+
+inference and simulation. Keisuke Yano and Fumiyasu Komaki, in their paper, construct constant-risk
+
+Bayesian predictive densities using the Kullback-Leibler loss function when the distributions of the
+
+data and the target variable to be predicted are different but have a common unknown parameter.
+
+Specifically, the issue of prior selection is investigated, and several applications are given.
+
+Samuel Livingstone and Mark Girolami provide an introduction to recent enhancements of
+
+Markov chain Monte Carlo simulation techniques inspired by information geometry. They apply
+
+this to the Metropolis–Hastings algorithm driven by Langevin diffusion, gradually transforming the
+
+ingredients to the setting of a Riemannian manifold equipped with a metric similar to the Fisher
+
+information metric. Pointers to various applications are given. The paper is written in a way that also
+
+makes it accessible to practitioners with little background in stochastic processes and geometry.
+
+The paper by Hui Zhao and Paul Marriott concerns Bayesian inference making use of variational
+
+methods for approximating the posterior distribution. In the context of inference for time series
+
+models that switch between different regimes, variational Bayes is shown to be a computationally
+
+attractive alternative to Markov chain Monte Carlo simulations. The geometry related to the
+
+projection of the posterior onto a computationally tractable family of distributions is elucidated by
+
+means of a simple example. This is followed by an application wherein it is shown that variational
+
+Bayes is successful in estimating the regime-switching model, including the number of regimes.
+
+Applications of information geometry in machine learning are represented by the following
+
+three papers. The article by Frank Nielsen and colleagues considers k-means histogram clustering,
+
+with applications to, e.g., information retrieval. Based on the α-divergences as similarity measures,
+
+clustering is performed using either the sided (asymmetric) or symmetrized divergence, or by means
+
+of the interesting notion of a mixed divergence. An important computational advantage is that the
+
+centroids based on the sided and mixed divergences have a closed-form expression. Next, the scheme
+
+is extended to algorithms with optimized initialization of cluster centroids, as well as soft clustering.
+
+Salem Said and co-workers present a class of distributions on the manifold of the univariate
+
+normal model equipped with the Fisher information metric. Expressed in terms of the Fisher-Rao
+
+distance, the distributions are used as priors for modeling the classes in Bayesian classification
+
+problems of normal distributions. Characteristics of this “Gaussian” distribution on the manifold are
+
+discussed, as well as estimation of its parameters and the posterior using the Laplace approximation.
+
+In an application to classification of image textures, the improved performance of these priors over
+
+conjugate priors is demonstrated.
+
+Luigi Malagò and Giovanni Pistone address optimization on manifolds of exponential
+
+distributions on a discrete state space using Newton’s method, which is based on second-order
+
+calculus. In particular, the goal is to find maxima of the expectation of a function with respect to the
+
+distribution (stochastic relaxation). Details of the computation are provided, including calculation
+
+of the Riemannian Hessian. A nonparametric formalism is used, with a view to extension to the
+
+infinite–dimensional case.
+
+The next three papers are related to the role of information geometry in complex systems
+
+research. Domenico Felice and colleagues consider the time-averaged volume explored by geodesics
+
+on a statistical manifold as an indicator of complexity of the entropic dynamics of a system.
+
+The parameters of the model play the role of macrovariables conveying information on the
+
+system’s microstate. Examples are given for the case of univariate, bivariate, and trivariate normal
+
+distributions, providing interesting results depending on correlations between the microvariables.
+
+Alexandre Levada investigates the role of entropy and Fisher information in pairwise isotropic
+
+Gaussian Markov random fields, acting as models for complex systems. Expressions for these
+
+quantities are derived, and the evolution of the Fisher information and entropy is studied as
+
+a function of the inverse temperature of the system. An interesting interpretation is given of
+
+asymmetries between these curves in terms of the arrow of time.
+
+Masatoshi Funabashi presents a framework for measuring statistical dependence between
+
+subsystems of a stochastic model, based on the model’s graph representation.
+A description in
+
+terms of the mixed coordinates of the system is used to quantify the complexity loss incurred by
+
+cutting an edge of the graph. In addition, a complexity measure is defined as a geometric mean of
+
+Kullback–Leibler divergences between decompositions of the system in terms of subsystems with
+
+fewer statistical dependencies. This quantifies the degree to which the system can be decomposed.
+
+The following paper concerns an application to physics, specifically quantum mechanics. Roger
+
+Balian gives an overview of a geometrical framework for measuring information loss in quantum
+
+systems resulting from the mixing of states. A Riemannian metric is defined, based on the von
+
+Neumann entropy, generating a mapping between states and observables. The metric is compared
+
+to other quantum metrics, as well as the Fisher–Rao metric, and various geometrical properties are
+
+derived. Applications are given to quantum information, as well as equilibrium and non-equilibrium
+
+quantum statistical mechanics.
+
+The final paper in the Special Issue is situated at the interface between physics and neuroscience.
+
+Xiaozhao Zhao and colleagues consider the principle of extreme physical information based on the
+
+Fisher information, which has been used before in an attempt to establish an information-theoretical
+
+basis for physical laws. They extend the idea to cognitive systems and aim at narrowing the gap
+
+between the information bound and the data information for such complex systems, by transforming
+
+the model to a simpler one. This is done by means of a dimensionality reduction technique, also based
+
+on the Fisher information. The approach is applied to derive the model for single-layer Boltzmann
+
+machines and interpret their learning algorithms.
+
+We are convinced that the varied collection of papers in this Special Issue will be useful for
+
+scientists who are new to the field, while providing an excellent reference for the more seasoned
+
+researcher. Furthermore, it is worth mentioning that the second Entropy Special Issue in this series,
+
+“Information Geometry II”, will also be published as a book, and that a third Special Issue is being
+
+prepared. We hope that the reader will enjoy browsing and reading through this collection of papers
+
+as much as we enjoyed guest editing this Special Issue “Information Geometry I”.
+
+Finally, I would like to thank the Editor-in-Chief of Entropy, Prof. Dr. Kevin H. Knuth, for
+
+suggesting the opportunity to guest-edit a Special Issue on information geometry.
+Furthermore,
+
+I wish to thank the editorial staff at MDPI for their great help with contacting authors, organizing
+
+paper reviews, and editing the original Special Issue in Entropy, as well as the reprinted version in the
+
+present book.
+
+Geert Verdoolaege
+
+Special Issue Editor
+
+
+entropy
+
+Article
+Information Geometry of Positive Measures and
+Positive-Definite Matrices: Decomposable Dually
+Flat Structure
+
+Shun-ichi Amari
+
+RIKEN Brain Science Institute, Hirosawa 2-1, Wako-shi, Saitama 351-0198, Japan;
+E-Mail: amari@brain.riken.jp; Tel.: +81-48-467-9669; Fax: +81-48-467-9687
+
+Received: 14 February 2014; in revised form: 9 April 2014 / Accepted: 10 April 2014 /
+Published: 14 April 2014
+
+Abstract: Information geometry studies the dually flat structure of a manifold, highlighted by
+the generalized Pythagorean theorem. The present paper studies a class of Bregman divergences
+called the (ρ, τ)-divergence. A (ρ, τ)-divergence generates a dually flat structure in the manifold of
+positive measures, as well as in the manifold of positive-definite matrices. The class is composed of
+decomposable divergences, which are written as a sum of componentwise divergences. Conversely,
+a decomposable dually flat divergence is shown to be a (ρ, τ)-divergence. A (ρ, τ)-divergence is
+determined from two monotone scalar functions, ρ and τ. The class includes the KL-divergence, α-,
+β- and (α, β)-divergences as special cases. The transformation between an affine parameter and its
+dual is easily calculated in the case of a decomposable divergence. Therefore, such a divergence
+is useful for obtaining the center for a cluster of points, which will be applied to classification and
+information retrieval in vision. For the manifold of positive-definite matrices, in addition to the dually
+flatness and decomposability, we require the invariance under linear transformations, in particular
+under orthogonal transformations. This opens a way to define a new class of divergences, called the
+(ρ, τ)-structure in the manifold of positive-definite matrices.
+
+Keywords: information geometry; dually flat structure; decomposable divergence; (ρ, τ)-structure
+
+1. Introduction
+
+Information geometry, which originated from the invariant structure of a manifold of probability
+distributions, consists of a Riemannian metric and dually coupled affine connections with respect to
+the metric [1]. A manifold having a dually flat structure is particularly interesting and important. In
+such a manifold, there are two dually coupled affine coordinate systems and a canonical divergence,
+which is a Bregman divergence. The highlight is given by the generalized Pythagorean theorem and
+projection theorem. Information geometry is useful not only for statistical inference, but also for
+machine learning, pattern recognition, optimization and even for neural networks. It is also related to
+the statistical physics of Tsallis q-entropy [2–4].
+The present paper studies a general and unique class of decomposable divergence functions
+in Rn+, the manifold of n-dimensional positive measures. These are the (ρ, τ)-divergences, introduced
+by Zhang [5,6], from the point of view of “representation duality”. They are Bregman divergences
+generating a dually flat structure. The class includes the well-known Kullback-Leibler divergence,
+α-divergence, β-divergence and (α, β)-divergence [1,7–9] as special cases. The merit of a decomposable
+Bregman divergence is that the θ-η Legendre transformation is computationally tractable, where θ
+and η are two affine coordinate systems coupled by the Legendre transformation. When one uses
+a dually flat divergence to define the center of a cluster of elements, the center is easily given by
+the arithmetic mean of the dual coordinates of the elements [10,11]. However, we need to calculate
+its primal coordinates. This is the θ-η transformation. Hence, our new type of divergence has the
+advantage of calculating θ-coordinates for clustering and related pattern matching problems. The most
+general class of dually flat divergences, not necessarily decomposable, is further given in Rn+. They are
+the (ρ, τ)-divergences.
+
+Entropy 2014, 16, 2131–2145; doi:10.3390/e16042131
+www.mdpi.com/journal/entropy
+
+Positive-definite (PD) matrices appear in many engineering problems, such as convex
+programming, diffusion tensor analysis and multivariate statistical analysis [12–20]. The manifold,
+PDn, of n × n PD matrices forms a cone, and its geometry is by itself an important subject of research.
+If we consider the submanifold consisting of only diagonal matrices, it is equivalent to the manifold
+of positive measures. Hence, PD matrices can be regarded as a generalization of positive measures.
+There are many studies on geometry and divergences of the manifold of positive-definite matrices. We
+introduce a general class of dually flat divergences, the (ρ, τ)-divergence. We analyze the cases when
+a (ρ, τ)-divergence is invariant under the general linear transformations, Gl(n), and invariant under
+the orthogonal transformations, O(n). They not only include many well-known divergences of PD
+matrices, but also give new important divergences.
+The present paper is organized as follows. Section 2 is preliminary, giving a short introduction
+to a dually flat manifold and the Bregman divergence. It also defines the cluster center due to a
+divergence. Section 3 defines the (ρ, τ)-structure in Rn+. It gives dually flat decomposable affine
+coordinates and a related canonical divergence (Bregman divergence). Section 4 is devoted to the
+(ρ, τ)-structure of the manifold, PDn, of PD matrices. We first study the class of divergences that are
+invariant under O(n). We further study a decomposable divergence that is invariant under Gl(n).
+It coincides with the invariant divergence derived from zero-mean Gaussian distributions with PD
+covariance matrices. They not only include various known divergences, but also remarkable new ones.
+Section 5 discusses a general class of non-decomposable flat divergences and miscellaneous topics.
+Section 6 presents the conclusions.
+
+2. Preliminaries to Information Geometry of Divergence
+
+2.1. Dually Flat Manifold
+
+A manifold is said to have the dually flat Riemannian structure, when it has two affine coordinate
+systems $\theta = (\theta^1, \dots, \theta^n)$ and $\eta = (\eta_1, \dots, \eta_n)$ (with respect to two flat affine connections) together
+with two convex functions, ψ(θ) and ϕ(η), such that the two coordinates are connected by the Legendre
+transformations:
+\[ \eta = \nabla\psi(\theta), \qquad \theta = \nabla\varphi(\eta), \tag{1} \]
+
+where ∇ is the gradient operator. The Riemannian metric is given by:
+
+\[ \left(g_{ij}(\theta)\right) = \nabla\nabla\psi(\theta), \qquad \left(g^{ij}(\eta)\right) = \nabla\nabla\varphi(\eta) \tag{2} \]
+
+in the respective coordinate systems. A curve that is linear in the θ-coordinates is called a θ-geodesic,
+and a curve linear in the η-coordinates is called an η-geodesic.
+A dually flat manifold has a unique canonical divergence, which is the Bregman divergence
+defined by the convex functions,
+
+\[ D[P : Q] = \psi(\theta_P) + \varphi(\eta_Q) - \theta_P \cdot \eta_Q, \tag{3} \]
+
+where $\theta_P$ is the θ-coordinates of P, $\eta_Q$ is the η-coordinates of Q and $\theta_P \cdot \eta_Q = \sum_i \theta_P^i\, \eta_{Q,i}$, where $\theta_P^i$
+and $\eta_{Q,i}$ are components of $\theta_P$ and $\eta_Q$, respectively. The Pythagorean and projection theorems hold in
+a dually flat manifold:
+Pythagorean Theorem. Given three points, P, Q, R, when the η-geodesic connecting P and Q is
+orthogonal to the θ-geodesic connecting Q and R with respect to the Riemannian metric,
+
+\[ D[P : Q] + D[Q : R] = D[P : R]. \tag{4} \]
+
+
+Projection Theorem. Given a smooth submanifold, S, let $P_S$ be the minimizer of divergence
+from P to S,
+\[ P_S = \arg\min_{Q \in S} D[P : Q]. \tag{5} \]
+
+Then, PS is the η-geodesic projection of P to S, that is the η-geodesic connecting P and PS is orthogonal
+to S.
+We have the dual of the above theorems where θ- and η-geodesics are exchanged and D[P : Q] is
+replaced by its dual D[Q : P].
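+As a brief check, and assuming only that ϕ is the Legendre dual of ψ, so that $\varphi(\eta_Q) = \theta_Q \cdot \eta_Q - \psi(\theta_Q)$
+with $\eta_Q = \nabla\psi(\theta_Q)$, the canonical divergence of Equation (3) can be rewritten in the familiar Bregman form:
+
+\begin{align*}
+D[P : Q] &= \psi(\theta_P) + \theta_Q \cdot \eta_Q - \psi(\theta_Q) - \theta_P \cdot \eta_Q \\
+&= \psi(\theta_P) - \psi(\theta_Q) - \nabla\psi(\theta_Q) \cdot (\theta_P - \theta_Q).
+\end{align*}
+
+The right-hand side is non-negative by convexity of ψ and vanishes when $\theta_P = \theta_Q$.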
+
+2.2. Decomposable Divergence
+
+A divergence, D[P : Q], is said to be decomposable, when it is written as a sum of
+component-wise divergences,
+
+\[ D[P : Q] = \sum_{i=1}^{n} d\left(\theta_P^i, \theta_Q^i\right), \tag{6} \]
+
+where $\theta_P^i$ and $\theta_Q^i$ are the components of $\theta_P$ and $\theta_Q$ and $d(\theta_P^i, \theta_Q^i)$ is a scalar divergence function.
+An f-divergence:
+
+\[ D_f[P : Q] = \sum_i p_i\, f\!\left(\frac{q_i}{p_i}\right) \tag{7} \]
+
+is a typical example of decomposable divergence in the manifold of probability distributions, where
+P = (p) and Q = (q) are two probability vectors with ∑ pi = ∑ qi = 1. A convex function, ψ(θ), is
+said to be decomposable, when it is written as:
+
+\[ \psi(\theta) = \sum_{i=1}^{n} \tilde{\psi}\left(\theta^i\right) \tag{8} \]
+
+by using a scalar convex function, $\tilde{\psi}(\theta)$. The Bregman divergence derived from a decomposable convex
+function is decomposable.
+When ψ(θ) is a decomposable convex function, its Legendre dual is also decomposable. The
+Legendre transformation is given componentwise as:
+
+\[ \eta_i = \tilde{\psi}'\left(\theta^i\right), \tag{9} \]
+
+where ′ denotes differentiation of a function, so that it is computationally tractable. Its inverse
+transformation is also componentwise,
+\[ \theta^i = \tilde{\varphi}'(\eta_i), \tag{10} \]
+
+where $\tilde{\varphi}$ is the Legendre dual of $\tilde{\psi}$.
+
+2.3. Cluster Center
+
+Consider a cluster of points $P_1, \dots, P_m$ whose θ-coordinates are $\theta_1, \dots, \theta_m$ and η-coordinates
+are $\eta_1, \dots, \eta_m$. The center, R, of the cluster with respect to the divergence, D[P : Q], is defined by:
+
+\[ R = \arg\min_{Q} \sum_{i=1}^{m} D[Q : P_i]. \tag{11} \]
+
+By differentiating $\sum_i D[Q : P_i]$ by θ (the θ-coordinates of Q), we have:
+
+\[ \nabla\psi(\theta_R) = \frac{1}{m} \sum_{i=1}^{m} \eta_i. \tag{12} \]
+
+Hence, the cluster-center theorem due to Banerjee et al. [10] follows; see also [11]:
+
+
+Cluster-Center Theorem. The η-coordinates $\eta_R$ of the cluster center are given by the arithmetic
+average of the η-coordinates of the points in the cluster:
+
+\[ \eta_R = \frac{1}{m} \sum_{i=1}^{m} \eta_i. \tag{13} \]
+
+When we need to obtain the θ-coordinates of the cluster center, they are given by the θ-η transformation
+from $\eta_R$,
+\[ \theta_R = \nabla\varphi(\eta_R). \tag{14} \]
+
+However, in many cases, the transformation is computationally heavy and intractable when the
+dimension of the manifold is large. The transformation is easy in the case of a decomposable divergence.
+This is the motivation for considering a general class of decomposable Bregman divergences.
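+The step from Equation (11) to Equation (12) can be spelled out as follows. This is only a sketch: it uses the
+form of D in Equation (3), writes θ for the θ-coordinates of Q, and assumes ψ is differentiable:
+
+\begin{align*}
+\sum_{i=1}^{m} D[Q : P_i] &= m\,\psi(\theta) + \sum_{i=1}^{m} \varphi(\eta_i) - \theta \cdot \sum_{i=1}^{m} \eta_i, \\
+\nabla_{\theta} \sum_{i=1}^{m} D[Q : P_i] &= m\,\nabla\psi(\theta) - \sum_{i=1}^{m} \eta_i = 0
+\quad\Longrightarrow\quad \nabla\psi(\theta_R) = \frac{1}{m} \sum_{i=1}^{m} \eta_i.
+\end{align*}
+
+Because ψ is convex, this stationary point is the minimizer, and since $\eta_R = \nabla\psi(\theta_R)$ it is exactly the
+arithmetic mean of the dual coordinates.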
+
+3. (ρ, τ) Dually Flat Structure in $\mathbb{R}^n_+$
+
+3.1. (ρ, τ)-Coordinates of $\mathbb{R}^n_+$
+
+Let $\mathbb{R}^n_+$ be the manifold of positive measures over n elements $x_1, \dots, x_n$. A measure (or a weight)
+of $x_i$ is given by:
+\[ \xi_i = m(x_i) > 0 \tag{15} \]
+
+and $\xi = (\xi_1, \dots, \xi_n)$ is a distribution of measures. When $\sum_i \xi_i = 1$ is satisfied, it is a probability
+measure. We write:
+\[ \mathbb{R}^n_+ = \{\xi \mid \xi_i > 0\} \tag{16} \]
+
+and ξ forms a coordinate system of $\mathbb{R}^n_+$.
+Let ρ(ξ) and τ(ξ) be two monotonically increasing differentiable functions. We call:
+
+\[ \theta = \rho(\xi), \qquad \eta = \tau(\xi) \tag{17} \]
+
+the ρ- and τ-representations of positive measure ξ. This is a generalization of the ±α representations [1]
+and was introduced in [5] for a manifold of probability distributions. See also [6].
+By using these functions, we construct new coordinate systems θ and η of $\mathbb{R}^n_+$. They are given, for
+$\theta = (\theta^i)$ and $\eta = (\eta_i)$, by componentwise relations,
+
+\[ \theta^i = \rho(\xi_i), \qquad \eta_i = \tau(\xi_i). \tag{18} \]
+
+They are called the ρ- and τ-representations of $\xi \in \mathbb{R}^n_+$, respectively. We search for convex functions,
+$\psi_{\rho,\tau}(\theta)$ and $\varphi_{\rho,\tau}(\eta)$, which are Legendre duals to each other, such that θ and η are two dually coupled
+affine coordinate systems.
+
+3.2. Convex Functions
+
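+We introduce two scalar functions of θ and η by:
+
+\begin{align}
+\tilde{\psi}_{\rho,\tau}(\theta) &= \int_0^{\rho^{-1}(\theta)} \tau(\xi)\rho'(\xi)\,d\xi, \tag{19} \\
+\tilde{\varphi}_{\rho,\tau}(\eta) &= \int_0^{\tau^{-1}(\eta)} \rho(\xi)\tau'(\xi)\,d\xi. \tag{20}
+\end{align}
+
+Then, the first and second derivatives of $\tilde{\psi}_{\rho,\tau}$ are:
+
+\begin{align}
+\tilde{\psi}'_{\rho,\tau}(\theta) &= \tau(\xi), \tag{21} \\
+\tilde{\psi}''_{\rho,\tau}(\theta) &= \frac{\tau'(\xi)}{\rho'(\xi)}. \tag{22}
+\end{align}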
+
+
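+Since ρ′(ξ) > 0, τ′(ξ) > 0, we see that $\tilde{\psi}_{\rho,\tau}(\theta)$ is a convex function. So is $\tilde{\varphi}_{\rho,\tau}(\eta)$. Moreover, they are
+Legendre duals, because:
+
+\begin{align}
+\tilde{\psi}_{\rho,\tau}(\theta) + \tilde{\varphi}_{\rho,\tau}(\eta) - \theta\eta &= \int_0^{\xi} \tau(\xi)\rho'(\xi)\,d\xi + \int_0^{\xi} \rho(\xi)\tau'(\xi)\,d\xi - \rho(\xi)\tau(\xi) \tag{23} \\
+&= 0. \tag{24}
+\end{align}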
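+
+The cancellation in Equations (23) and (24) is an integration by parts. A short sketch, under the added
+assumption (not stated explicitly in the text) that ρ(s)τ(s) → 0 as s → 0, with s a dummy integration variable:
+
+\begin{align*}
+\int_0^{\xi} \tau(s)\rho'(s)\,ds + \int_0^{\xi} \rho(s)\tau'(s)\,ds
+&= \int_0^{\xi} \frac{d}{ds}\bigl[\rho(s)\tau(s)\bigr]\,ds \\
+&= \rho(\xi)\tau(\xi) - \lim_{s \to 0} \rho(s)\tau(s) = \rho(\xi)\tau(\xi).
+\end{align*}
+
+Subtracting θη = ρ(ξ)τ(ξ) then gives zero, which is the Legendre duality relation.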
+
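+We then define two decomposable convex functions of θ and η by:
+
+\begin{align}
+\psi_{\rho,\tau}(\theta) &= \sum_i \tilde{\psi}_{\rho,\tau}\left(\theta^i\right), \tag{25} \\
+\varphi_{\rho,\tau}(\eta) &= \sum_i \tilde{\varphi}_{\rho,\tau}(\eta_i). \tag{26}
+\end{align}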
+
+They are Legendre duals to each other.
+
+3.3. (ρ, τ)-Divergence
+
+The (ρ, τ)-divergence between two points, $\xi, \xi' \in \mathbb{R}^n_+$, is defined by:
+
+\begin{align}
+D_{\rho,\tau}[\xi : \xi'] &= \psi_{\rho,\tau}(\theta) + \varphi_{\rho,\tau}(\eta') - \theta \cdot \eta' \tag{27} \\
+&= \sum_{i=1}^{n} \left\{ \int_0^{\xi_i} \tau(\xi)\rho'(\xi)\,d\xi + \int_0^{\xi'_i} \rho(\xi)\tau'(\xi)\,d\xi - \rho(\xi_i)\,\tau(\xi'_i) \right\}, \tag{28}
+\end{align}
+
+where θ and η′ are ρ- and τ-representations of ξ and ξ′, respectively.
+The (ρ, τ)-divergence gives a dually flat structure having θ and η as affine and dual affine
+coordinate systems. This is originally due to Zhang [5] and a generalization of our previous results
+concerning the q and deformed exponential families [4]. The transformation between θ and η is simple
+in the (ρ, τ)-structure, because it can be done componentwise,
+
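+\begin{align}
+\theta^i &= \rho\left(\tau^{-1}(\eta_i)\right), \tag{29} \\
+\eta_i &= \tau\left(\rho^{-1}(\theta^i)\right). \tag{30}
+\end{align}
+
+The Riemannian metric is:
+
+\[ g_{ij}(\xi) = \frac{\tau'(\xi_i)}{\rho'(\xi_i)}\,\delta_{ij}, \tag{31} \]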
+
+and hence Euclidean, because the Riemann-Christoffel curvature due to the Levi-Civita connection
+vanishes, too.
+The following theorem is new, characterizing the (ρ, τ)-divergence.
+
+Theorem 1. The (ρ, τ)-divergences form a unique class of divergences in Rn+ that are dually flat and
+decomposable.
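+
+As an illustration of Equation (28), consider the choice ρ(ξ) = ξ and τ(ξ) = log ξ (the same pair that appears
+in Equation (63) for the matrix case). The following is a sketch of the computation; it uses ξ log ξ → 0 as ξ → 0
+and writes s for the integration variable:
+
+\begin{align*}
+\int_0^{\xi_i} \log s\,ds &= \xi_i \log \xi_i - \xi_i, \qquad
+\int_0^{\xi'_i} s \cdot \frac{1}{s}\,ds = \xi'_i, \\
+D_{\rho,\tau}[\xi : \xi'] &= \sum_{i=1}^{n} \left\{ \xi_i \log \xi_i - \xi_i + \xi'_i - \xi_i \log \xi'_i \right\}
+= \sum_{i=1}^{n} \left\{ \xi_i \log \frac{\xi_i}{\xi'_i} - \xi_i + \xi'_i \right\}.
+\end{align*}
+
+This is the KL-divergence extended to positive measures; when $\sum_i \xi_i = \sum_i \xi'_i = 1$, the last two terms cancel
+and the usual KL-divergence between probability vectors remains.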
+
+3.4. Biduality: α-(ρ, τ) Divergence
+
+We have dually flat connections, $(\nabla_{\rho,\tau}, \nabla^{*}_{\rho,\tau})$, represented in terms of covariant derivatives,
+which are derived from $D_{\rho,\tau}$. This is called the representation duality by Zhang [5]. We further have
+the α-(ρ, τ) connections defined by:
+
+\[ \nabla^{(\alpha)}_{\rho,\tau} = \frac{1+\alpha}{2}\,\nabla_{\rho,\tau} + \frac{1-\alpha}{2}\,\nabla^{*}_{\rho,\tau}. \tag{32} \]
+
+The α-(−α) duality is called the reference duality [5]. Therefore, $\nabla^{(\alpha)}_{\rho,\tau}$ possesses the biduality, one
+concerning α and (−α), and the other with respect to ρ and τ.
+
+
+The Riemann-Christoffel curvature of $\nabla^{(\alpha)}_{\rho,\tau}$ is:
+
+\[ R^{(\alpha)}_{\rho,\tau} = \frac{1-\alpha^2}{4}\,R^{(0)}_{\rho,\tau} = 0 \tag{33} \]
+
+for any α. Hence, there exists a unique canonical divergence $D^{(\alpha)}_{\rho,\tau}$ and α-(ρ, τ) affine coordinate systems.
+It is an interesting future problem to obtain their explicit forms.
+
+3.5. Various Examples
+
+As a special case of the (ρ, τ)-divergence, we have the (α, β)-divergence obtained from the
+following power functions,
+
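+\[ \rho(\xi) = \frac{1}{\alpha}\,\xi^{\alpha}, \qquad \tau(\xi) = \frac{1}{\beta}\,\xi^{\beta}. \tag{34} \]
+
+This was introduced by Cichocki, Cruces and Amari in [7,8].
+The affine and dual affine coordinates are:
+
+\[ \theta^i = \frac{1}{\alpha}\,(\xi_i)^{\alpha}, \qquad \eta_i = \frac{1}{\beta}\,(\xi_i)^{\beta} \tag{35} \]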
+
+and the convex functions are:
+
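+\[ \psi(\theta) = c_{\alpha,\beta} \sum_i \left(\theta^i\right)^{\frac{\alpha+\beta}{\alpha}}, \qquad \varphi(\eta) = c_{\beta,\alpha} \sum_i \eta_i^{\frac{\alpha+\beta}{\beta}}, \tag{36} \]
+
+where:
+\[ c_{\alpha,\beta} = \frac{1}{\beta(\alpha+\beta)}\,\alpha^{\frac{\alpha+\beta}{\alpha}}. \tag{37} \]
+
+The induced (α, β)-divergence has a simple form,
+
+\[ D_{\alpha,\beta}[\xi : \xi'] = \frac{1}{\alpha\beta(\alpha+\beta)} \sum_i \left\{ \alpha\,\xi_i^{\alpha+\beta} + \beta\,{\xi'_i}^{\alpha+\beta} - (\alpha+\beta)\,\xi_i^{\alpha}\,{\xi'_i}^{\beta} \right\}, \tag{38} \]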
+for ξ, ξ′ ∈ Rn+. It is defined similarly in the manifold, Sn, of probability distributions, but it is not
+a Bregman divergence in Sn. This is because the total mass constraint ∑ ξi = 1 is not linear in θ- or
+η-coordinates in general.
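+The constant in Equation (37) and the form of Equation (38) can be checked directly from Equations (19), (20)
+and (34). The following is a sketch for a single component, assuming α + β > 0 so that the integrals converge at 0:
+
+\begin{align*}
+\tilde{\psi}(\theta) &= \int_0^{\xi} \frac{1}{\beta}\,s^{\beta}\,s^{\alpha-1}\,ds = \frac{\xi^{\alpha+\beta}}{\beta(\alpha+\beta)},
+\qquad \xi = (\alpha\theta)^{1/\alpha}
+\;\Longrightarrow\; \tilde{\psi}(\theta) = \frac{\alpha^{\frac{\alpha+\beta}{\alpha}}}{\beta(\alpha+\beta)}\,\theta^{\frac{\alpha+\beta}{\alpha}}, \\
+\tilde{\varphi}(\eta') &= \int_0^{\xi'} \frac{1}{\alpha}\,s^{\alpha}\,s^{\beta-1}\,ds = \frac{{\xi'}^{\alpha+\beta}}{\alpha(\alpha+\beta)},
+\qquad \theta\,\eta' = \frac{\xi^{\alpha}\,{\xi'}^{\beta}}{\alpha\beta}.
+\end{align*}
+
+Writing the three terms over the common denominator αβ(α + β) gives
+$\alpha\,\xi^{\alpha+\beta} + \beta\,{\xi'}^{\alpha+\beta} - (\alpha+\beta)\,\xi^{\alpha}{\xi'}^{\beta}$ in the numerator, which is the summand of Equation (38).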
+The α-divergence is a special case of the (α, β)-divergence, so that it is a (ρ, τ)-divergence. By
+putting:
+
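+\[ \rho(\xi) = \frac{2}{1-\alpha}\,\xi^{\frac{1-\alpha}{2}}, \qquad \tau(\xi) = \frac{2}{1+\alpha}\,\xi^{\frac{1+\alpha}{2}}, \tag{39} \]
+
+we have:
+
+\[ D_{\alpha}[\xi : \xi'] = \frac{4}{1-\alpha^2} \sum_i \left\{ \frac{1-\alpha}{2}\,\xi_i + \frac{1+\alpha}{2}\,\xi'_i - \xi_i^{\frac{1-\alpha}{2}} \left(\xi'_i\right)^{\frac{1+\alpha}{2}} \right\}. \tag{40} \]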
+
+The β-divergence [19] is obtained from:
+
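+\[ \rho(\xi) = \xi, \qquad \tau(\xi) = \frac{1}{\beta}\,\xi^{1+\beta}. \tag{41} \]
+
+It is written as:
+
+\[ D_{\beta}[\xi : \xi'] = \frac{1}{\beta(\beta+1)} \sum_i \left\{ \xi_i^{\beta+1} + (\beta+1)\,\xi'_i - \left(\xi'_i\right)^{\beta+1} - (\beta+1)\,\xi_i \left(\xi'_i\right)^{\beta} \right\}. \tag{42} \]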
+
+The β-divergence is special in the sense that it gives a dually flat structure, even in Sn. This is because
+ρ(ξ) is linear in ξ.
+
+
+The classes of α-divergences and β-divergences intersect at the KL-divergence, and their duals are
+different in general. They are the only intersecting points of the two classes.
+When ρ(ξ) = ξ and τ(ξ) = U′(ξ) where U is a convex function, (ρ, τ)-divergence is Eguchi’s
+U-divergence [21].
+Zhang already introduced the (α, β)-divergence in [5], which is not a (ρ, τ)-divergence in Rn+ and is
+different from ours. We regret the confusing naming of the (α, β)-divergence.
+
+4. Invariant, Flat Decomposable Divergences in the Manifold of Positive-Definite Matrices
+
+4.1. Invariant and Decomposable Convex Function
+
+Let P be a positive-definite matrix and ψ(P) be a convex function. Then, a Bregman divergence is
+defined between two positive definite matrices, P and Q, by:
+
+\[ D[P : Q] = \psi(P) - \psi(Q) - \nabla\psi(Q) \cdot (P - Q) \tag{43} \]
+
+where ∇ is the gradient operator with respect to matrix $P = (P_{ij})$, so that ∇ψ(P) is a matrix and the
+inner product of two matrices is defined by:
+
+\[ \nabla\psi(Q) \cdot P = \operatorname{tr}\{\nabla\psi(Q)\,P\}, \tag{44} \]
+
+where tr is the trace of a matrix.
+It induces a dually flat structure to the manifold of positive-definite matrices, where the affine
+coordinate system (θ-coordinates) is Θ = P and the dual affine coordinate system (η-coordinates) is:
+
+\[ H = \nabla\psi(P). \tag{45} \]
+
+A convex function, ψ(P), is said to be invariant under the orthogonal group O(n), when:
+
+\[ \psi(P) = \psi\left(O^{T} P O\right) \tag{46} \]
+
+holds for any orthogonal transformation O, where $O^{T}$ is the transpose of O. An invariant function is
+written as a symmetric function of the n eigenvalues $\lambda_1, \dots, \lambda_n$ of P. See Dhillon and Tropp [12]. When
+an invariant convex function of P is written, by using a convex function, f, of one variable, in the
+additive form:
+\[ \psi(P) = \sum_i f(\lambda_i), \tag{47} \]
+
+it is said to be decomposable. We have:
+
+\[ \psi(P) = \operatorname{tr} f(P). \tag{48} \]
+
+4.2. Invariant, Flat and Decomposable Divergence
+
+A divergence D[P : Q] is said to be invariant under O(n), when it satisfies:
+
+\[ D[P : Q] = D\left[O^{T} P O : O^{T} Q O\right]. \tag{49} \]
+
+When it is derived from a decomposable convex function, ψ(P), it is invariant, flat and decomposable.
+We give well-known examples of decomposable convex functions and the divergences derived
+from them:
+
+
+(1) For $f(\lambda) = \frac{1}{2}\lambda^2$, we have:
+
+\begin{align}
+\psi(P) &= \frac{1}{2} \sum_i \lambda_i^2, \tag{50} \\
+D[P : Q] &= \frac{1}{2}\,\|P - Q\|^2, \tag{51}
+\end{align}
+
+where $\|P\|^2$ is the squared Frobenius norm:
+\[ \|P\|^2 = \sum_{i,j} P_{ij}^2. \tag{52} \]
+
+(2) For $f(\lambda) = -\log\lambda$,
+
+\begin{align}
+\psi(P) &= -\log\left(\det|P|\right), \tag{53} \\
+D[P : Q] &= \operatorname{tr}\left(PQ^{-1}\right) - \log\left(\det\left|PQ^{-1}\right|\right) - n. \tag{54}
+\end{align}
+
+The affine coordinate system is P, and the dual coordinate system is $P^{-1}$. The derived geometry is
+the same as that of multivariate Gaussian probability distributions with mean zero and covariance
+matrix P.
+(3) For $f(\lambda) = \lambda\log\lambda - \lambda$,
+
+\begin{align}
+\psi(P) &= \operatorname{tr}\left(P\log P - P\right), \tag{55} \\
+D[P : Q] &= \operatorname{tr}\left(P\log P - P\log Q - P + Q\right). \tag{56}
+\end{align}
+
+This divergence is used in quantum information theory. The affine coordinate system is P, and
+the dual affine coordinate system is log P; and, ψ(P) is called the negative von Neumann entropy.
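+
+Example (2) can be reproduced from the Bregman form D[P : Q] = ψ(P) − ψ(Q) − ∇ψ(Q) · (P − Q). The following is a
+sketch that relies on the standard identity $\nabla \log\det Q = Q^{-1}$ for symmetric positive-definite Q and on the
+trace inner product of Equation (44):
+
+\begin{align*}
+\psi(P) &= -\log\det P, \qquad \nabla\psi(Q) = -Q^{-1}, \\
+D[P : Q] &= -\log\det P + \log\det Q + \operatorname{tr}\left\{Q^{-1}(P - Q)\right\} \\
+&= \operatorname{tr}\left(PQ^{-1}\right) - \log\det\left(PQ^{-1}\right) - n,
+\end{align*}
+
+where the last line uses $\operatorname{tr}(Q^{-1}Q) = n$, $\operatorname{tr}(Q^{-1}P) = \operatorname{tr}(PQ^{-1})$ and $\det P / \det Q = \det(PQ^{-1})$. The same
+steps with $\nabla\psi(Q) = \log Q$ lead to Equation (56).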
+
+4.3. (ρ, τ)-Structure in Positive Definite Matrices
+
+We extend the (ρ, τ)-structure in the previous section to the matrix case and introduce a general
+dually flat invariant decomposable divergence in the manifold of positive-definite matrices. Let:
+
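+$$\Theta = \rho(P), \qquad H = \tau(P) \tag{57}$$
+
+be ρ- and τ-representations of matrices. We use two functions, $\tilde{\psi}_{\rho,\tau}(\theta)$ and
+$\tilde{\phi}_{\rho,\tau}(\eta)$, defined in Equations (19) and (20), for defining a pair of dually coupled
+invariant and decomposable convex functions,
+
+$$\psi(\Theta) = \operatorname{tr} \tilde{\psi}_{\rho,\tau}\{\Theta\}, \tag{58}$$
+$$\phi(H) = \operatorname{tr} \tilde{\phi}_{\rho,\tau}\{H\}. \tag{59}$$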
+
+They are not convex with respect to P, but are convex with respect to Θ and H, respectively. The
+derived Bregman divergence is:
+
+D[P : Q] = ψ {Θ(P)} + ϕ {H(Q)} − Θ(P) · H(Q).
+(60)
+
+Theorem 2. The (ρ, τ)-divergences form a unique class of invariant, decomposable and dually flat
+divergences in the manifold of positive matrices.
+
+The Euclidean, Gaussian and von Neumann divergences given in Equations (51), (54) and (56) are
+special examples of (ρ, τ)-divergences. They are given, respectively, by:
+
+$$\text{(1)} \quad \rho(\xi) = \tau(\xi) = \xi, \tag{61}$$
+$$\text{(2)} \quad \rho(\xi) = \xi, \quad \tau(\xi) = -\frac{1}{\xi}, \tag{62}$$
+$$\text{(3)} \quad \rho(\xi) = \xi, \quad \tau(\xi) = \log \xi. \tag{63}$$
+
+When ρ and τ are power functions, we have the (α, β)-structure in the manifold of positive-definite
+matrices.
+
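+(4) (α, β)-divergence.
+
+By using the (α, β) power functions given by Equation (34), we have:
+
+$$\psi(\Theta) = \frac{\alpha}{\alpha+\beta} \operatorname{tr} \Theta^{\frac{\alpha+\beta}{\alpha}} = \frac{\alpha}{\alpha+\beta} \operatorname{tr} P^{\alpha+\beta}, \tag{64}$$
+$$\phi(H) = \frac{\beta}{\alpha+\beta} \operatorname{tr} H^{\frac{\alpha+\beta}{\beta}} = \frac{\beta}{\alpha+\beta} \operatorname{tr} P^{\alpha+\beta} \tag{65}$$
+
+so that the (α, β)-divergence of matrices is:
+
+$$D[P : Q] = \operatorname{tr}\left\{ \frac{\alpha}{\alpha+\beta} P^{\alpha+\beta} + \frac{\beta}{\alpha+\beta} Q^{\alpha+\beta} - P^{\alpha} Q^{\beta} \right\}. \tag{66}$$
+
+This is a Bregman divergence, where the affine coordinate system is $\Theta = P^{\alpha}$ and its dual is
+$H = P^{\beta}$.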
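+(5) The α-divergence is derived as:
+
+$$\Theta(P) = \frac{2}{1-\alpha} P^{\frac{1-\alpha}{2}}, \tag{67}$$
+$$\psi(\Theta) = \frac{2}{1+\alpha} P, \tag{68}$$
+$$D_{\alpha}[P : Q] = \frac{4}{1-\alpha^{2}} \operatorname{tr}\left\{ -P^{\frac{1-\alpha}{2}} Q^{\frac{1+\alpha}{2}} + \frac{1-\alpha}{2} P + \frac{1+\alpha}{2} Q \right\}. \tag{69}$$
+
+The affine coordinate system is $\frac{2}{1-\alpha} P^{\frac{1-\alpha}{2}}$, and its dual is $\frac{2}{1+\alpha} P^{\frac{1+\alpha}{2}}$.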
+(6) The β-divergence is derived from Equation (41) as:
+
+$$D_{\beta}[P : Q] = \frac{1}{\beta(\beta+1)} \operatorname{tr}\left\{ P^{\beta+1} + (\beta+1) Q - Q^{\beta+1} - (\beta+1) P Q^{\beta} \right\}. \tag{70}$$
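+
+As a consistency check, items (4) and (5) can be related explicitly. The sketch below writes the exponents of Equation (66) as (a, b) to avoid a clash with the α of Equation (69); the symbols a and b are introduced here for illustration only.
+
+```latex
+% Choose a = (1-\alpha)/2 and b = (1+\alpha)/2, so that a + b = 1 and ab = (1-\alpha^2)/4.
+\begin{aligned}
+D_{a,b}[P : Q] &= \operatorname{tr}\left\{ a P + b Q - P^{a} Q^{b} \right\}, \\
+D_{\alpha}[P : Q] &= \frac{4}{1-\alpha^{2}} \operatorname{tr}\left\{ -P^{\frac{1-\alpha}{2}} Q^{\frac{1+\alpha}{2}} + \frac{1-\alpha}{2} P + \frac{1+\alpha}{2} Q \right\}
+                  = \frac{1}{ab}\, D_{a,b}[P : Q] .
+\end{aligned}
+```
+
+The factor 1/(ab) matches the rescaled coordinates in item (5): $\frac{2}{1-\alpha} P^{\frac{1-\alpha}{2}} = P^{a}/a$ and $\frac{2}{1+\alpha} P^{\frac{1+\alpha}{2}} = P^{b}/b$.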
+
+4.4. Invariance Under Gl(n)
+
+We extend the concept of invariance under the orthogonal group to that under the general linear
+group, Gl(n), that is, the set of invertible matrices L with $\det |L| \neq 0$. This is a stronger condition. A
+divergence is said to be invariant under Gl(n), when:
+
+$$D[P : Q] = D\left[L^{T} P L : L^{T} Q L\right] \tag{71}$$
+
+holds for any L ∈ Gl(n).
+We identify matrix P with the zero-mean Gaussian distribution:
+
+$$p(x, P) = \exp\left\{ -\frac{1}{2} x^{T} P^{-1} x - \frac{1}{2} \log \det |P| - c \right\}, \tag{72}$$
+
+where c is a constant. We know that an invariant divergence belongs to the class of f-divergences in
+the case of a manifold of probability distributions, where the invariance means the geometry does
+not change under a one-to-one mapping of x to y. Moreover, the only invariant flat divergence is the
+KL-divergence [22]. These facts suggest the following conjecture.
+
+Proposition. The invariant, flat and decomposable divergence under Gl(n) is the KL-divergence
+given by:
+
+$$D_{KL}[P : Q] = \operatorname{tr}\left(P Q^{-1}\right) - \log\left(\det \left|P Q^{-1}\right|\right) - n. \tag{73}$$
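+
+That the divergence of Equation (73) is indeed invariant under Gl(n) follows from a short computation; the sketch below only uses the cyclic property of the trace and the multiplicativity of the determinant.
+
+```latex
+% Transform P -> L^T P L and Q -> L^T Q L with L invertible.
+\begin{aligned}
+(L^{T} P L)(L^{T} Q L)^{-1} &= L^{T} P L\, L^{-1} Q^{-1} L^{-T} = L^{T} (P Q^{-1}) L^{-T}, \\
+\operatorname{tr}\{L^{T} (P Q^{-1}) L^{-T}\} &= \operatorname{tr}(P Q^{-1}), \qquad
+\det\{L^{T} (P Q^{-1}) L^{-T}\} = \det(P Q^{-1}), \\
+D_{KL}[L^{T} P L : L^{T} Q L] &= D_{KL}[P : Q] .
+\end{aligned}
+```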
+
+5. Non-Decomposable Divergence
+
+We have focused on flat and decomposable divergences. There are many interesting
+non-decomposable divergences. We first discuss a general class of flat divergences in $\mathbb{R}^n_+$ and then
+touch upon interesting flat and non-flat divergences in the manifold of positive-definite matrices.
+
+5.1. General Class of Flat Divergences in $\mathbb{R}^n_+$
+
+We can describe a general class of flat divergences in $\mathbb{R}^n_+$, which are not necessarily decomposable.
+This is introduced in [23], which studies the conformal structure of general total Bregman divergences
+([11,13]). When $\mathbb{R}^n_+$ is endowed with a dually flat structure, it has a θ-coordinate system given by:
+
+θ = ρ(ξ)
+(74)
+
+which is not necessarily a componentwise function. Any pair of an invertible θ = ρ(ξ) and a convex
+function ψ(θ) defines a dually flat structure and, hence, a Bregman divergence in $\mathbb{R}^n_+$.
+The dual coordinates η = τ(ξ) are given by:
+
+$$\eta = \nabla \psi(\theta) \tag{75}$$
+
+so that we have:
+$$\eta = \tau(\xi) = \nabla \psi\{\rho(\xi)\}. \tag{76}$$
+
+This implies that a pair (ρ, τ) of coordinate systems can define dually coupled affine coordinates and,
+hence, a dually flat structure, when and only when $\eta = \tau\{\rho^{-1}(\theta)\}$ is a gradient of a convex function.
+This is different from the case of decomposable divergence, where any monotone pair of ρ(ξ) and
+τ(ξ) gives a dually flat structure.
+
+5.2. Non-Decomposable Flat Divergence in PDn
+
+Ohara and Eguchi [15,16] introduced the following function:
+
+ψV(P) = V (det |P|) ,
+(77)
+
+where V(ξ) is a monotonically decreasing scalar function. $\psi_V$ is convex when and only when:
+
+$$1 + \frac{V''(\xi)\,\xi^{2}}{V'(\xi)} < \frac{1}{n}. \tag{78}$$
+
+In such a case, we can introduce a dually flat structure to PDn, where P is an affine coordinate system
+with convex $\psi_V(P)$, and the dual affine coordinate system is:
+
+$$H = V'(\det |P|)\, P^{-1}. \tag{79}$$
+
+The derived divergence is:
+
+$$D_V[P : Q] = V(\det |P|) - V(\det |Q|) \tag{80}$$
+$$\qquad\qquad + V'(\det |Q|) \operatorname{tr}\left\{ Q^{-1}(Q - P) \right\}. \tag{81}$$
+
+When V(ξ) = − log ξ, it reduces to the case of Equation (54), which is invariant under Gl(n) and
+decomposable. However, the divergence DV[P : Q] is not decomposable. It is invariant under O(n)
+and more strongly so under SGl(n) ⊂ Gl(n), defined by det |L| = ±1.
+
+5.3. Flat Structure Derived from q-Escort Distribution
+
+A dually flat structure is introduced in the manifold of probability distributions [4] as:
+
+$$\tilde{D}_{\alpha}[p : q] = \frac{1}{1-q}\, \frac{1}{H_q(p)} \left\{ 1 - \sum_{i} p_i^{1-q} q_i^{q} \right\}, \tag{82}$$
+
+where:
+
+$$H_q(p) = \sum_{i} p_i^{q}, \tag{83}$$
+$$q = \frac{1+\alpha}{2}. \tag{84}$$
+
+The dual affine coordinates are the q-escort distribution [4]:
+
+$$\eta_i = \frac{1}{H_q(p)}\, p_i^{q}. \tag{85}$$
+
+The divergence, $\tilde{D}_q$, is flat, but not decomposable.
+We can generalize it to the case of PDn,
+
+$$\tilde{D}_q[P : Q] = \frac{1}{1-q}\, \frac{1}{\operatorname{tr} P^{q}} \left\{ (1-q) \operatorname{tr}(P) + q \operatorname{tr}(Q) - \operatorname{tr}\left(P^{1-q} Q^{q}\right) \right\}. \tag{86}$$
+
+This is flat, but not decomposable.
+
+5.4. γ-Divergence in PDn
+
+The γ-divergence is introduced by Fujisawa and Eguchi [24]. It gives a super-robust estimator. It
+is interesting to generalize it to PDn,
+
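+$$D_{\gamma}[P : Q] = \frac{1}{\gamma(\gamma-1)} \left\{ \log \operatorname{tr} P^{\gamma} - (\gamma-1) \log \operatorname{tr} Q^{\gamma-1} - \gamma \log \operatorname{tr} P Q^{\gamma-1} \right\}. \tag{87}$$
+
+This is neither flat nor decomposable. It is a projective divergence in the sense that, for any c, c′ > 0,
+
+$$D_{\gamma}\left[cP : c'Q\right] = D_{\gamma}[P : Q]. \tag{88}$$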
+
+Therefore, it can be defined in the submanifold of tr P = 1.
+
+6. Concluding Remarks
+
+We have shown that the (ρ, τ)-divergence introduced by Zhang [5] is a general dually flat
+decomposable structure of the manifold of positive measures. We then extended it to the manifold
+of positive-definite matrices, where the criterion of invariance under linear transformations (in
+particular, under orthogonal transformations) was added. The decomposability is useful from the
+computational point of view, because the θ-η transformation is tractable. This is the motivation for
+studying decomposable flat divergences.
+When we treat the manifold of probability distributions, it is a submanifold of the manifold of
+positive measures, where the total sum of measures is restricted to one. This is a nonlinear constraint
+in the θ or η coordinates, so that the manifold is not flat, but curved in general. Hence, our arguments
+hold in this case only when at least one of the ρ and τ functions is linear. The U-divergence [21] and
+β-divergence [19] are such cases. However, for clustering, we can take the average of the η-coordinates
+of member probability distributions in the larger manifold of positive measures and then project it
+to the manifold of probability distributions. This is called the exterior average, and the projection is
+simply a normalization of the result. Therefore, the (ρ, τ)-structure is useful in the case of probability
+distributions. The same situation holds in the case of positive-definite matrices.
+Quantum information theory deals with positive-definite Hermitian matrices of trace one [25,26].
+We need to extend our discussions to the case of complex matrices. The trace one constraint is not
+linear with respect to θ- or η-coordinates, as in the case of probability distributions. Many
+interesting divergence functions have been introduced in the manifold of positive-definite Hermitian
+matrices. It is an interesting future problem to apply our theory to quantum information theory.
+
+Conflicts of Interest: The author declares no conflicts of interest.
+
+References
+
+1.
+Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society and Oxford
+University Press: Rhode Island, RI, USA, 2000.
+2.
+Tsallis, C. Introduction to Nonextensive Statistical Mechanics:
+Approaching a Complex World; Springer:
+Berlin/Heidelberg, Germany, 2009.
+3.
+Naudts, J. Generalized Thermostatistics; Springer: Berlin/Heidelberg, Germany, 2011.
+4.
+Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant,
+dually-flat and conformal geometries. Physica A 2012, 391, 4308–4319.
+5.
+Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195.
+6.
+Zhang, J. Nonparametric information geometry: From divergence function to referential-representational
+biduality on statistical manifolds. Entropy 2013, 15, 5384–5418.
+7.
+Cichocki, A.; Amari, S. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of
+similarities. Entropy 2010, 12, 1532–1568.
+8.
+Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust
+nonnegative matrix factorization. Entropy 2011, 13, 134–170.
+9.
+Minami, M.; Eguchi, S. Robust blind source separation by beta-divergence. Neural Comput. 2002, 14,
+1859–1886.
+10.
+Banerjee, A.; Merugu, S.; Dhillon I.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res.
+2005, 6, 1705–1749.
+11.
+Liu, M.; Vemuri, B.C.; Amari, S.; Nielsen, F. Shape retrieval using hierarchical total Bregman soft clustering.
+IEEE Trans. Pattern Anal. Mach. Learn. 2012, 24, 3192–3212.
+12.
+Dhillon, I.S.; Tropp, J.A. Matrix nearness problems with Bregman divergences. SIAM J. Matrix Anal. Appl.
+2007, 29, 1120–1146.
+13.
+Vemuri, B.C.; Liu, M.; Amari, S.; Nielsen, F. Total Bregman divergence and its applications to DTI analysis.
+IEEE Trans. Med. Imaging 2011, 30, 475–483.
+14.
+Ohara, A.; Suda, N.; Amari, S. Dualistic differential geometry of positive definite matrices and its applications
+to related problems. Linear Algebra Appl. 1996, 247, 31–53.
+15.
+Ohara, A.; Eguchi, S. Group invariance of information geometry on q-Gaussian distributions induced by
+beta-divergence. Entropy 2013, 15, 4732–4747.
+16.
+Ohara, A.; Eguchi, S. Geometry on positive definite matrices induced from V-potential functions. In Geometric
+Science of Information; Nielsen, F., Barbaresco, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2013;
+pp. 621–629.
+
+17.
+Chebbi, Z.; Moakher, M. Means of Hermitian positive-definite matrices based on the
+log-determinant alpha-divergence function. Linear Algebra Appl. 2012, 436, 1872–1889.
+18.
+Tsuda, K.; Ratsch, G.; Warmuth, M.K. Matrix exponentiated gradient updates for on-line learning and
+Bregman projection. J. Mach. Learn. Res. 2005, 6, 995–1018.
+19.
+Nock, R.; Magdalou, B.; Briys, E.; Nielsen, F. Mining matrix data with Bregman matrix divergences for
+portfolio selection. In Matrix Information Geometry; Nielsen, F., Bhatia, R., Eds.; Springer: Berlin/Heidelberg,
+Germany, 2013; Chapter 15, pp. 373–402.
+20.
+Nielsen, F., Bhatia, R., Eds. Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013.
+21.
+Eguchi, S. Information geometry and statistical pattern recognition. Sugaku Expo. 2006, 19, 197–216.
+22.
+Amari, S. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes.
+IEEE Trans. Inf. Theory 2009, 55, 4925–4931.
+23.
+Nock, R.; Nielsen, F.; Amari, S. On conformal divergences and their population minimizers. IEEE Trans. Inf.
+Theory 2014, submitted for publication.
+24.
+Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination.
+J. Multivar. Anal. 2008, 99, 2053–2081.
+25.
+Petz, P. Monotone metrics on matrix spaces. Linear Algebra Appl. 1996, 244, 81–96.
+26.
+Hasegawa, H. α-divergence of the non-commutative information geometry. Rep. Math. Phys. 1993, 33, 87–93.
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+entropy
+
+Article
+F-Geometry and Amari’s α−Geometry on a
+Statistical Manifold
+
+Harsha K. V. * and Subrahamanian Moosath K S *
+
+Indian Institute of Space Science and Technology, Department of Space, Government of India, Valiamala P.O,
+Thiruvananthapuram-695547, Kerala, India
+*
+E-Mails: harsha.11@iist.ac.in (K.V.H.); smoosath@iist.ac.in (K.S.S.M.); Tel.: +91-95-6736-0425 (K.V.H.);
++91-94-9574-3148 (K.S.S.M.).
+
+Received: 13 December 2013; in revised form: 21 April 2014 / Accepted: 25 April 2014 /
+Published: 6 May 2014
+
+Abstract: In this paper, we introduce a geometry called F-geometry on a statistical manifold S using
+an embedding F of S into the space RX of random variables. Amari’s α−geometry is a special
+case of F−geometry. Then using the embedding F and a positive smooth function G, we introduce
+(F, G)−metric and (F, G)−connections that enable one to consider weighted Fisher information
+metric and weighted connections. The necessary and sufficient condition for two (F, G)−connections
+to be dual with respect to the (F, G)−metric is obtained. Then we show that Amari’s 0−connection is
+the only self dual F−connection with respect to the Fisher information metric. Invariance properties
+of the geometric structures are discussed, and we prove that Amari’s α−connections are the only
+F−connections that are invariant under smooth one-to-one transformations of the random variables.
+
+Keywords:
+embedding; Amari’s α−connections;
+F−metric;
+F−connections; (F, G)−metric;
+(F, G)−connections; invariance
+
+1. Introduction
+
+The geometric study of statistical estimation has opened up an interesting new area called
+information geometry. Information geometry has achieved remarkable progress through the works of
+Amari [1,2] and his colleagues [3,4]. In the last few years, many authors have contributed considerably
+to this area [5–9]. Information geometry has a wide variety of applications in other areas of engineering
+and science, such as neural networks, machine learning, biology, mathematical finance, control system
+theory, quantum systems, statistical mechanics, etc.
+A statistical manifold of probability distributions is equipped with a Riemannian metric and a pair
+of dual affine connections [2,4,9]. It was Rao [10] who introduced the idea of using Fisher information
+as a Riemannian metric in the manifold of probability distributions. Chentsov [11] introduced a family
+of affine connections on a statistical manifold defined on finite sets. Amari [2] introduced a family of
+affine connections called α−connections using a one parameter family of functions, the α−embeddings.
+These α−connections are equivalent to those defined by Chentsov. The Fisher information metric and
+these affine connections are characterized by invariance with respect to the sufficient statistic [4,12] and
+play a vital role in the theory of statistical estimation. Zhang [13] generalized Amari’s α−representation
+and using this general representation together with a convex function he defined a family of divergence
+functions from the point of view of representational and referential duality. The Riemannian metric
+and dual connections are defined using these divergence functions.
+In this paper, Amari’s idea of using α−embeddings to define geometric structures is extended to
+a general embedding. This paper is organized as follows. In Section 2, we define an affine connection
+called F−connection and a Riemannian metric called F−metric using a general embedding F of
+a statistical manifold S into the space of random variables. We show that F−metric is the Fisher
+
+Entropy 2014, 16, 2472–2487; doi:10.3390/e16052472
+www.mdpi.com/journal/entropy
+information metric and Amari’s α−geometry is a special case of F−geometry. Further, we introduce
+(F, G)−metric and (F, G)−connections using the embedding F and a positive smooth function G.
+In Section 3, a necessary and sufficient condition for two (F, G)−connections to be dual with
+respect to the (F, G)−metric is derived and we prove that Amari’s 0−connection is the only self
+dual F−connection with respect to the Fisher information metric. Then we prove that the set of all
+positive finite measures on X, for a finite X, has an F−affine manifold structure for any embedding F.
+In Section 4, invariance properties of the geometric structures are discussed. We prove that the
+Fisher information metric and Amari’s α−connections are invariant under both the transformation
+of the parameter and the transformation of the random variable. Further we show that Amari’s
+α−connections are the only F−connections that are invariant under both the transformation of the
+parameter and the transformation of the random variable.
+Let (X, B) be a measurable space, where X is a non-empty subset of R and B is the σ-field of
+subsets of X. Let RX be the space of all real valued measurable functions defined on (X, B). Consider
+an n−dimensional statistical manifold S = {p(x; ξ) / ξ = [ξ1, ..., ξn] ∈ E ⊆ Rn}, with coordinates
+ξ = [ξ1, ..., ξn], defined on X. S is a subset of P(X), the set of all probability measures on X given by
+
+$$P(X) := \left\{ p : X \to \mathbb{R} \;\middle|\; p(x) > 0 \ (\forall\, x \in X);\ \int_X p(x)\,dx = 1 \right\}. \tag{1}$$
+
+The tangent space to S at a point pξ is given by
+
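+$$T_\xi(S) = \left\{ \sum_{i=1}^{n} \alpha^i \partial_i \;\middle|\; \alpha^i \in \mathbb{R} \right\}, \quad \text{where } \partial_i = \frac{\partial}{\partial \xi^i}. \tag{2}$$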
+
+Define ℓ(x; ξ) = log p(x; ξ) and consider the partial derivatives $\{\partial \ell / \partial \xi^i = \partial_i \ell \,;\ i = 1, \ldots, n\}$, which are
+called scores. For the statistical manifold S, the $\partial_i \ell$ are linearly independent functions in x for a fixed ξ.
+Let $T^1_\xi(S)$ be the n-dimensional vector space spanned by the n functions $\{\partial_i \ell \,;\ i = 1, \ldots, n\}$ in x. So
+
+$$T^1_\xi(S) = \left\{ \sum_{i=1}^{n} A^i \partial_i \ell \;\middle|\; A^i \in \mathbb{R} \right\}. \tag{3}$$
+
+Then there is a natural isomorphism between these two vector spaces $T_\xi(S)$ and $T^1_\xi(S)$ given by
+
+$$\partial_i \in T_\xi(S) \longleftrightarrow \partial_i \ell(x; \xi) \in T^1_\xi(S). \tag{4}$$
+
+Obviously, a tangent vector $A = \sum_{i=1}^{n} A^i \partial_i \in T_\xi(S)$ corresponds to a random variable
+$A(x) = \sum_{i=1}^{n} A^i \partial_i \ell(x; \xi) \in T^1_\xi(S)$ having the same components $A^i$. Note that $T_\xi(S)$ is the differentiation
+operator representation of the tangent space, while $T^1_\xi(S)$ is the random variable representation of the
+same tangent space. The space $T^1_\xi(S)$ is called the 1-representation of the tangent space.
+Let A and B be two tangent vectors in $T_\xi(S)$ and A(x) and B(x) be the 1−representations of A and B
+respectively. We can define an inner product on each tangent space $T_\xi(S)$ by
+
+$$g_\xi(A, B) = \langle A, B \rangle_\xi = E_\xi[A(x) B(x)] = \int A(x) B(x)\, p(x; \xi)\, dx. \tag{5}$$
+
+In particular, the inner product of the basis vectors $\partial_i$ and $\partial_j$ is
+
+$$g_{ij}(\xi) = \langle \partial_i, \partial_j \rangle_\xi = E_\xi[\partial_i \ell\, \partial_j \ell] = \int \partial_i \ell(x; \xi)\, \partial_j \ell(x; \xi)\, p(x; \xi)\, dx. \tag{6}$$
+
+Note that g =<, > defines a Riemannian metric on S called the Fisher information metric.
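+
+As a small worked example of the Fisher information metric, consider the Bernoulli family on X = {0, 1}. This is an illustrative assumption: the integral is read as a sum over X (counting measure), and the single parameter ξ ∈ (0, 1) is the probability of x = 1.
+
+```latex
+\begin{aligned}
+p(x; \xi) &= \xi^{x} (1 - \xi)^{1 - x}, \qquad
+\ell(x; \xi) = x \log \xi + (1 - x) \log(1 - \xi), \\
+\partial_\xi \ell(x; \xi) &= \frac{x}{\xi} - \frac{1 - x}{1 - \xi}, \\
+g(\xi) = E_\xi[(\partial_\xi \ell)^{2}] &= \xi \cdot \frac{1}{\xi^{2}} + (1 - \xi) \cdot \frac{1}{(1 - \xi)^{2}} = \frac{1}{\xi (1 - \xi)} .
+\end{aligned}
+```
+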
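+On the Riemannian manifold (S, g = <, >), define $n^3$ functions $\Gamma_{ijk}$ by
+
+$$\Gamma_{ijk}(\xi) = E_\xi[(\partial_i \partial_j \ell(x; \xi))(\partial_k \ell(x; \xi))]. \tag{7}$$
+
+These functions $\Gamma_{ijk}$ uniquely determine an affine connection ∇ on S by
+
+$$\Gamma_{ijk}(\xi) = \langle \nabla_{\partial_i} \partial_j, \partial_k \rangle_\xi. \tag{8}$$
+
+∇ is called the 1−connection or the exponential connection.
+Amari [2] defined a one parameter family of functions called the α−embeddings given by
+
+$$L_\alpha(p) = \begin{cases} \dfrac{2}{1-\alpha}\, p^{\frac{1-\alpha}{2}}, & \alpha \neq 1 \\ \log p, & \alpha = 1 \end{cases} \tag{9}$$
+
+Using these, we can define $n^3$ functions $\Gamma^{\alpha}_{ijk}$ by
+
+$$\Gamma^{\alpha}_{ijk} = \int \partial_i \partial_j L_\alpha(p(x; \xi))\, \partial_k L_{-\alpha}(p(x; \xi))\, dx. \tag{10}$$
+
+These $\Gamma^{\alpha}_{ijk}$ uniquely determine affine connections $\nabla^{\alpha}$ on the statistical manifold S by
+
+$$\Gamma^{\alpha}_{ijk} = \langle \nabla^{\alpha}_{\partial_i} \partial_j, \partial_k \rangle, \tag{11}$$
+
+which are called α−connections.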
+
+2. F−Geometry of a Statistical Manifold
+
+On a statistical manifold S, the Fisher information metric and exponential connection are defined
+using the log embedding. In a similar way, α−connections are defined using a one parameter family of
+functions, the α−embeddings. In general, we can give other geometric structures on S using different
+embeddings of the manifold S into the space of random variables RX.
+Let F : (0, ∞) −→ R be an injective function that is at least twice differentiable. Thus we have
+F′(u) ≠ 0, ∀ u ∈ (0, ∞). F is an embedding of S into RX that takes each p(x; ξ) ↦ F(p(x; ξ)).
+Denote F(p(x; ξ)) by F(x; ξ); then $\partial_i F$ can be written as
+
+$$\partial_i F(x; \xi) = p(x; \xi)\, F'(p(x; \xi))\, \partial_i \ell(p(x; \xi)). \tag{12}$$
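+
+Equation (12) is the chain rule combined with the score identity $\partial_i \ell = \partial_i p / p$; a short derivation, with the arguments (x; ξ) suppressed:
+
+```latex
+\partial_i F(p) = F'(p)\, \partial_i p = F'(p)\, p\, \partial_i \log p = p\, F'(p)\, \partial_i \ell .
+% Special case F = \log: p F'(p) = 1, so \partial_i F = \partial_i \ell, the score itself.
+```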
+
+It is clear that $\partial_i F(x; \xi)$, i = 1, ..., n, are linearly independent functions in x for fixed ξ since
+$\partial_i \ell(p(x; \xi))$, i = 1, ..., n, are linearly independent. Let $T_{F(p_\xi)} F(S)$ be the n-dimensional vector space
+spanned by the n functions $\partial_i F$, i = 1, ..., n, in x for fixed ξ. So
+
+$$T_{F(p_\xi)} F(S) = \left\{ \sum_{i=1}^{n} A^i \partial_i F \;\middle|\; A^i \in \mathbb{R} \right\}. \tag{13}$$
+
+Let the tangent space $T_{F(p_\xi)}(F(S))$ to F(S) at the point $F(p_\xi)$ be denoted by $T^F_\xi(S)$. There is a natural
+isomorphism between the two vector spaces $T_\xi(S)$ and $T^F_\xi(S)$ given by
+
+$$\partial_i \in T_\xi(S) \longleftrightarrow \partial_i F(x; \xi) \in T^F_\xi(S). \tag{14}$$
+
+$T^F_\xi(S)$ is called the F−representation of the tangent space $T_\xi(S)$.
+
+For any $A = \sum_{i=1}^{n} A^i \partial_i \in T_\xi(S)$, the corresponding $A(x) = \sum_{i=1}^{n} A^i \partial_i F \in T^F_\xi(S)$ is called the
+F−representation of the tangent vector A and is denoted by $A^F(x)$. Note that $T^F_\xi(S) \subseteq T_{F(p_\xi)}(\mathbb{R}^X)$.
+Since $\mathbb{R}^X$ is a vector space, its tangent space $T_{F(p_\xi)}(\mathbb{R}^X)$ can be identified with $\mathbb{R}^X$. So $T^F_\xi(S) \subseteq \mathbb{R}^X$.
+
+Definition 1. The F−expectation of a random variable f with respect to the distribution p(x; ξ) is
+defined as
+
+$$E^F_\xi(f) = \int f(x)\, \frac{1}{p\,(F'(p))^{2}}\, dx. \tag{15}$$
+
+We can use this F−expectation to define an inner product in $\mathbb{R}^X$ by
+
+$$\langle f, g \rangle^F_\xi = E^F_\xi[f(x) g(x)], \tag{16}$$
+
+which induces an inner product on $T_\xi(S)$ by
+
+$$\langle A, B \rangle^F_\xi = E^F_\xi[A^F(x) B^F(x)]; \quad A, B \in T_\xi(S). \tag{17}$$
+
+Proposition 1. The induced metric <, >F on S is the Fisher information metric g =<, > on S.
+
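+Proof. For any basis vectors $\partial_i, \partial_j \in T_\xi(S)$,
+
+$$\begin{aligned}
+\langle \partial_i, \partial_j \rangle^F_\xi &= E^F_\xi[\partial_i F\, \partial_j F] \\
+&= \int \partial_i F\, \partial_j F\, \frac{1}{p\,(F'(p))^{2}}\, dx \\
+&= \int (p\, F'(p)\, \partial_i \ell)(p\, F'(p)\, \partial_j \ell)\, \frac{1}{p\,(F'(p))^{2}}\, dx \\
+&= \int \partial_i \ell\, \partial_j \ell\; p(x; \xi)\, dx \\
+&= E_\xi[\partial_i \ell\, \partial_j \ell] \\
+&= g_{ij}(\xi) \\
+&= \langle \partial_i, \partial_j \rangle_\xi.
+\end{aligned} \tag{18}$$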
+
+So the metric <, >F on S induced by the embedding F of S into RX is the Fisher information metric
+g =<, > on S.
+
+We can induce a connection on S using the embedding F.
+Let $\pi^F_{|p_\xi} : \mathbb{R}^X \to T^F_\xi(S)$ be the projection map.
+
+Definition 2. The connection induced by the embedding F on S, the F−connection, is defined as
+
+$$\nabla^F_{\partial_i} \partial_j = \pi^F_{|p_\xi}(\partial_i \partial_j F) = \sum_{n} \sum_{m} g^{mn} \langle \partial_i \partial_j F, \partial_m F \rangle^F_\xi\, \partial_n, \tag{19}$$
+
+where $[g^{mn}(\xi)]$ is the inverse of the Fisher information matrix $G(\xi) = [g_{mn}(\xi)]$. Note that the F−connections
+are symmetric.
+
+Lemma 1. The F−connection and its components can be written in terms of scores as
+
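+$$\nabla^F_{\partial_i} \partial_j = \sum_{n} \sum_{m} g^{mn} E_\xi\left[ \left( \partial_i \partial_j \ell + \left(1 + \frac{p F''(p)}{F'(p)}\right) \partial_i \ell\, \partial_j \ell \right) (\partial_m \ell) \right] \partial_n \tag{20}$$
+
+and
+
+$$\Gamma^F_{ijk}(\xi) = E_\xi\left[ \left( \partial_i \partial_j \ell + \left(1 + \frac{p F''(p)}{F'(p)}\right) \partial_i \ell\, \partial_j \ell \right) (\partial_k \ell) \right]. \tag{21}$$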
+
+Proof. From Equation (12), we have
+
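+$$\partial_i \partial_j F = p F'(p)\, \partial_i \partial_j \ell + \left[p F'(p) + p^{2} F''(p)\right] \partial_i \ell\, \partial_j \ell. \tag{22}$$
+
+Therefore
+
+$$\begin{aligned}
+\langle \partial_i \partial_j F, \partial_m F \rangle^F_\xi &= \int \partial_i \partial_j F\, \partial_m F\, \frac{1}{p\,(F'(p))^{2}}\, dx \\
+&= \int \left( p F'(p)\, \partial_i \partial_j \ell + \left[p F'(p) + p^{2} F''(p)\right] \partial_i \ell\, \partial_j \ell \right) \frac{\partial_m \ell}{F'(p)}\, dx \\
+&= \int \left( \partial_i \partial_j \ell\, \partial_m \ell + \left(1 + \frac{p F''(p)}{F'(p)}\right) \partial_i \ell\, \partial_j \ell\, \partial_m \ell \right) p\, dx \\
+&= E_\xi\left[ \left( \partial_i \partial_j \ell + \left(1 + \frac{p F''(p)}{F'(p)}\right) \partial_i \ell\, \partial_j \ell \right) (\partial_m \ell) \right].
+\end{aligned} \tag{23}$$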
+
+Hence we can write
+
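+$$\nabla^F_{\partial_i} \partial_j = \pi^F_{|p_\xi}(\partial_i \partial_j F) = \sum_{n} \sum_{m} g^{mn} E_\xi\left[ \left( \partial_i \partial_j \ell + \left(1 + \frac{p F''(p)}{F'(p)}\right) \partial_i \ell\, \partial_j \ell \right) (\partial_m \ell) \right] \partial_n. \tag{24}$$
+
+Then we have the Christoffel symbols of the F−connection
+
+$$\Gamma^{n}_{ij} = \sum_{m} g^{mn} E_\xi\left[ \left( \partial_i \partial_j \ell + \left(1 + \frac{p F''(p)}{F'(p)}\right) \partial_i \ell\, \partial_j \ell \right) (\partial_m \ell) \right] \tag{25}$$
+
+and components of the F−connection are given by
+
+$$\Gamma^F_{ijk}(\xi) = \langle \nabla^F_{\partial_i} \partial_j, \partial_k \rangle_\xi = E_\xi\left[ \left( \partial_i \partial_j \ell + \left(1 + \frac{p F''(p)}{F'(p)}\right) \partial_i \ell\, \partial_j \ell \right) (\partial_k \ell) \right]. \tag{26}$$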
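+
+The step from Equation (12) to Equation (22) used in this proof is a product-rule computation that relies on $\partial_j p = p\, \partial_j \ell$; a sketch with the arguments (x; ξ) suppressed:
+
+```latex
+\begin{aligned}
+\partial_i \partial_j F &= \partial_j\left( p\, F'(p)\, \partial_i \ell \right) \\
+&= (\partial_j p)\, F'(p)\, \partial_i \ell + p\, F''(p)\, (\partial_j p)\, \partial_i \ell + p\, F'(p)\, \partial_i \partial_j \ell \\
+&= p\, F'(p)\, \partial_i \partial_j \ell + \left[ p\, F'(p) + p^{2} F''(p) \right] \partial_i \ell\, \partial_j \ell .
+\end{aligned}
+```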
+
+Theorem 1. Amari’s α−geometry is a special case of the F−geometry.
+
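+Proof. Let $F(p) = L_\alpha(p)$, where $L_\alpha(p)$ is the α−embedding of Amari.
+The components $\Gamma^{\alpha}_{ijk}$ of the α−connection are given by
+
+$$\Gamma^{\alpha}_{ijk}(\xi) = \langle \nabla^{\alpha}_{\partial_i} \partial_j, \partial_k \rangle_\xi = E_\xi\left[ \left( \partial_i \partial_j \ell + \frac{1-\alpha}{2}\, \partial_i \ell\, \partial_j \ell \right) (\partial_k \ell) \right]. \tag{27}$$
+
+From Equation (26), when $F(p) = L_\alpha(p)$ we have
+
+$$F'(p) = L'_\alpha(p) = p^{-\frac{1+\alpha}{2}}, \tag{28}$$
+$$F''(p) = L''_\alpha(p) = -\frac{1+\alpha}{2}\, p^{-\frac{3+\alpha}{2}}. \tag{29}$$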
+
+Then we get
+
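+$$1 + \frac{p F''(p)}{F'(p)} = 1 + \frac{p L''_\alpha(p)}{L'_\alpha(p)} = \frac{1-\alpha}{2}. \tag{30}$$
+
+Hence
+
+$$\begin{aligned}
+\Gamma^F_{ijk}(\xi) = \langle \nabla^F_{\partial_i} \partial_j, \partial_k \rangle_\xi &= E_\xi\left[ \left( \partial_i \partial_j \ell + \left(1 + \frac{p F''(p)}{F'(p)}\right) \partial_i \ell\, \partial_j \ell \right) (\partial_k \ell) \right] \\
+&= E_\xi\left[ \left( \partial_i \partial_j \ell + \frac{1-\alpha}{2}\, \partial_i \ell\, \partial_j \ell \right) (\partial_k \ell) \right] \\
+&= \Gamma^{\alpha}_{ijk}(\xi),
+\end{aligned} \tag{31}$$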
+which are the components of the α−connection. Hence F−connection reduces to α−connection.
+Thus we obtain that α−geometry is a special case of F−geometry.
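+
+Two familiar special cases illustrate Theorem 1. The names "exponential" and "mixture" for the resulting connections follow common usage in information geometry; the mixture case is not discussed in this section and is included here only as an illustration.
+
+```latex
+% F(p) = \log p = L_{1}(p): F' = 1/p, F'' = -1/p^2
+1 + \frac{p F''(p)}{F'(p)} = 1 - 1 = 0
+\;\Longrightarrow\; \Gamma^{F}_{ijk} = E_\xi[(\partial_i \partial_j \ell)(\partial_k \ell)]
+\quad \text{(exponential connection, Equation (7))} .
+% F(p) = p = L_{-1}(p): F' = 1, F'' = 0
+1 + \frac{p F''(p)}{F'(p)} = 1
+\;\Longrightarrow\; \Gamma^{F}_{ijk} = E_\xi[(\partial_i \partial_j \ell + \partial_i \ell\, \partial_j \ell)(\partial_k \ell)]
+\quad \text{(mixture connection)} .
+```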
+
+Remark 1. Burbea [14] introduced the concept of weighted Fisher information metric using a positive
+continuous function. We use this idea to define weighted F−metric and weighted F−connections. Let
+G : (0, ∞) −→ R be a positive smooth function and F be an embedding, define (F, G)−expectation of a
+random variable with respect to the distribution pξ as
+
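+$$E^{F,G}_\xi(f) = \int f(x)\, \frac{G(p)}{p\,(F'(p))^{2}}\, dx. \tag{32}$$
+
+Define the (F, G)−metric $\langle , \rangle^{F,G}_\xi$ in $T_{p_\xi}(S)$ by
+
+$$\begin{aligned}
+\langle \partial_i, \partial_j \rangle^{F,G}_\xi &= E^{F,G}_\xi[\partial_i F\, \partial_j F] \\
+&= \int \partial_i F\, \partial_j F\, \frac{G(p)}{p\,(F'(p))^{2}}\, dx \\
+&= \int \partial_i \ell\, \partial_j \ell\; G(p)\, p\, dx \\
+&= E_\xi[G(p)\, \partial_i \ell\, \partial_j \ell].
+\end{aligned} \tag{33}$$
+
+Define the (F, G)−connection as
+
+$$\Gamma^{F,G}_{ijk} = \langle \nabla^{F,G}_{\partial_i} \partial_j, \partial_k \rangle_\xi = E_\xi\left[ \left( \partial_i \partial_j \ell + \left(1 + \frac{p F''(p)}{F'(p)}\right) \partial_i \ell\, \partial_j \ell \right) (\partial_k \ell)\, G(p) \right]. \tag{34}$$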
+
+When G(p) = 1, (F, G)−connection reduces to the F−connection and the metric <, >F,G reduces to the Fisher
+information metric. This is a more general way of defining Riemannian metrics and affine connections on a
+statistical manifold.
+
+3. Dual Affine Connections
+
+Definition 3. Let M be a Riemannian manifold with a Riemannian metric g. Two affine connections, ∇ and
+∇∗ on the tangent bundle are said to be dual connections with respect to the metric g if
+
+$$Z g(X, Y) = g(\nabla_Z X, Y) + g(X, \nabla^{*}_Z Y) \tag{35}$$
+
+holds for any vector fields X, Y, Z on M.
+
+Theorem 2. Let F, H be two embeddings of statistical manifold S into the space RX of random variables. Let G
+be a positive smooth function on (0, ∞). Then the (F, G)−connection $\nabla^{F,G}$ and the (H, G)−connection $\nabla^{H,G}$
+are dual connections with respect to the (F, G)−metric iff the functions F and H satisfy
+
+$$H'(p) = \frac{G(p)}{p F'(p)}. \tag{36}$$
+
+We call such an embedding H a G−dual embedding of F.
+The components of the dual connection $\nabla^{H,G}$ can be written as
+
+$$\Gamma^{H,G}_{ijk} = \int \left( \partial_i \partial_j \ell + \left( \frac{p G'(p)}{G(p)} - \frac{p F''(p)}{F'(p)} \right) \partial_i \ell\, \partial_j \ell \right) \partial_k \ell\; G(p)\, p\, dx. \tag{37}$$
+
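+Proof. That $\nabla^{F,G}$ and $\nabla^{H,G}$ are dual connections with respect to the (F, G)−metric means that
+
+$$\partial_k \langle \partial_i, \partial_j \rangle^{F,G} = \langle \nabla^{F,G}_{\partial_k} \partial_i, \partial_j \rangle^{F,G} + \langle \partial_i, \nabla^{H,G}_{\partial_k} \partial_j \rangle^{F,G} \tag{38}$$
+
+for any basis vectors $\partial_i, \partial_j, \partial_k \in T_\xi(S)$. We have
+
+$$\begin{aligned}
+\partial_k \langle \partial_i, \partial_j \rangle^{F,G} &= \int \partial_k \partial_j \ell\, \partial_i \ell\; p\, G(p)\, dx + \int \partial_k \partial_i \ell\, \partial_j \ell\; p\, G(p)\, dx \\
+&\quad + \int \left(1 + \frac{p G'(p)}{G(p)}\right) \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\, G(p)\, dx
+\end{aligned} \tag{39}$$
+
+and
+
+$$\begin{aligned}
+\langle \nabla^{F,G}_{\partial_k} \partial_i, \partial_j \rangle^{F,G} + \langle \partial_i, \nabla^{H,G}_{\partial_k} \partial_j \rangle^{F,G} &= \int \partial_k \partial_i \ell\, \partial_j \ell\; p\, G(p)\, dx \\
+&\quad + \int \left(1 + \frac{p F''(p)}{F'(p)}\right) \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\, G(p)\, dx \\
+&\quad + \int \left(1 + \frac{p H''(p)}{H'(p)}\right) \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\, G(p)\, dx \\
+&\quad + \int \partial_k \partial_j \ell\, \partial_i \ell\; p\, G(p)\, dx.
+\end{aligned} \tag{40}$$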
+
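+Then the condition (38) holds iff
+
+$$\int \left[ 2 + \frac{p F''(p)}{F'(p)} + \frac{p H''(p)}{H'(p)} \right] \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\, G(p)\, dx = \int \left[ 1 + \frac{p G'(p)}{G(p)} \right] \partial_i \ell\, \partial_j \ell\, \partial_k \ell\; p\, G(p)\, dx \tag{41}$$
+$$\iff 2 + \frac{p F''(p)}{F'(p)} + \frac{p H''(p)}{H'(p)} = 1 + \frac{p G'(p)}{G(p)} \tag{42}$$
+$$\iff 1 + \frac{p H''(p)}{H'(p)} = \frac{p G'(p)}{G(p)} - \frac{p F''(p)}{F'(p)} \tag{43}$$
+$$\iff \frac{H''(p)}{H'(p)} = \frac{G'(p)}{G(p)} - \frac{F''(p)}{F'(p)} - \frac{1}{p} \iff H'(p) = \frac{G(p)}{p F'(p)}. \tag{44}$$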
+
+Hence ∇F,G and ∇H,G are dual connections with respect to the (F, G)−metric iff Equation (36) holds.
+From Equation (43), we can rewrite the components of dual connection ∇H,G as
+
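+$$\Gamma^{H,G}_{ijk} = \int \left( \partial_i \partial_j \ell + \left( \frac{p G'(p)}{G(p)} - \frac{p F''(p)}{F'(p)} \right) \partial_i \ell\, \partial_j \ell \right) \partial_k \ell\; G(p)\, p\, dx. \tag{45}$$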
+
+Corollary 1. Amari’s 0−connection is the only self dual F−connection with respect to the Fisher information
+metric.
+
+Proof. From Theorem 2, for G(p) = 1 the F−connection $\nabla^F$ and the H−connection $\nabla^H$ are dual
+connections with respect to the Fisher information metric iff the functions F and H satisfy
+
+$$H'(p) = \frac{1}{p F'(p)}. \tag{46}$$
+
+Thus the F−connection $\nabla^F$ is self dual iff the embedding F satisfies the condition
+
+$$F'(p) = \frac{1}{p F'(p)} \iff F'(p) = p^{-\frac{1}{2}} \iff F(p) = 2 p^{\frac{1}{2}} = L_0(p). \tag{47}$$
+
+That is, Amari’s 0−connection is the only self dual F−connection with respect to the Fisher
+information metric.
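+
+The duality condition for G(p) = 1 can be illustrated with the α−embeddings. Taking $F = L_\alpha$ and using $L'_\alpha(p) = p^{-\frac{1+\alpha}{2}}$, a short computation (a sketch, fixing H only up to an additive constant) gives:
+
+```latex
+H'(p) = \frac{1}{p\, L'_\alpha(p)} = p^{-1 + \frac{1+\alpha}{2}} = p^{-\frac{1-\alpha}{2}} = L'_{-\alpha}(p)
+\;\Longrightarrow\; H = L_{-\alpha} .
+% Self-duality requires \alpha = -\alpha, i.e. \alpha = 0, in agreement with Corollary 1.
+```
+
+So the dual of the α−connection is the (−α)−connection.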
+
+So far, we have considered the statistical manifold S as a subset of P(X), the set of all probability
+measures on X. Now we relax the condition $\int p(x)\,dx = 1$, and consider S as a subset of $\tilde{P}(X)$, which
+is defined by
+$$\tilde{P}(X) := \left\{ p : X \to \mathbb{R} \;\middle|\; p(x) > 0 \ (\forall\, x \in X);\ \int_X p(x)\,dx < \infty \right\}. \tag{48}$$
+
+Definition 4. Let M be a Riemannian manifold with a Riemannian metric g. Let ∇ be an affine connection on
+M. If there exists a coordinate system [θi] of M such that ∇∂i∂j = 0 then we say that ∇ is flat, or alternatively
+M is flat with respect to ∇, and we call such a coordinate system [θi] an affine coordinate system for ∇.
+
+Definition 5. Let S = {p(x; ξ) / ξ = [ξ1, ..., ξn] ∈ E ⊆ Rn} be an n−dimensional statistical manifold. If
+for some coordinate system [θi]; i = 1, ..., n
+
+$$\partial_i \partial_j F(p(x; \theta)) = 0 \tag{49}$$
+
+then we can see from Equation (19) that [θi] is an F−affine coordinate system and that S = {pθ} is F−flat. We
+call such an S an F−affine manifold.
+The condition (49) is equivalent to the existence of functions $C, F_1, \ldots, F_n$ on X such that
+
+$$F(p(x; \theta)) = C(x) + \sum_{i=1}^{n} \theta^i F_i(x). \tag{50}$$
+
+Theorem 3. For any embedding F, ˜P(X) is an F−affine manifold for finite X.
+
+Proof. Let $X = \{x_1, \ldots, x_n\}$ be a finite set constituted by n elements. Let $F_i : X \to \mathbb{R}$ be the functions
+defined by $F_i(x_j) = \delta_{ij}$ for i, j = 1, ..., n. Let us define n coordinates $[\theta^i]$ by
+
+$$\theta^i = F(p(x_i)). \tag{51}$$
+
+Then we get $F(p(x)) = \sum_{i=1}^{n} \theta^i F_i(x)$. Therefore $\tilde{P}(X)$ is an F−affine manifold for any
+embedding F(p).
+
+Remark 2. Zhang [13] introduced ρ-representation, which is a generalization of α-representation of Amari.
+Zhang’s geometry is defined using this ρ-representation together with a convex function. Zhang also defined the
+ρ-affine family of density functions and discussed its dually flat structure. The F−geometry defined using a
+general F-representation is different from Zhang’s geometry. The metric defined in the F-embedding approach
+is the Fisher information metric and the Riemannian metric defined using the ρ-representation is different from
+the Fisher information metric. The F-connections defined are not in general dually flat and are different from the
+dual connections defined by Zhang.
+
+Remark 3. On a statistical manifold S, we introduced a dualistic structure (g, ∇F, ∇H), where g is the Fisher
+information metric and ∇F, ∇H are the dual connections with respect to the Fisher information metric. Since
+F-connections are symmetric, the manifold S is flat with respect to ∇F iff S is flat with respect to ∇H. Thus if
+S is flat with respect to ∇F, then (S, g, ∇F, ∇H) is a dually flat space. The dually flat spaces are important in
+statistical estimation [4].
+
+4. Invariance of the Geometric Structures
+
+For the statistical manifold S = {p(x; ξ) | ξ ∈ E ⊆ Rn}, the parameters are merely labels attached
+to each point p ∈ S, hence the intrinsic geometric properties should be independent of these labels.
+Consequently, it is natural to consider the invariance properties of the geometric structures under
+suitable transformations of the variables in a statistical manifold. Here we can consider two kinds of
+invariance of the geometric structures: covariance under re-parametrization of the parameter of the
+manifold and invariance under the transformations of the random variable [15]. Now let us investigate
+the invariance properties of the F-geometric structures defined in Section 2.
+
+4.1. Covariance under Re-Parametrization
+
+Let [θi] and [ηj] be two coordinate systems on S, which are related by an invertible transformation
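+η = η(θ). Let us denote $\partial_i = \frac{\partial}{\partial\theta_i}$ and $\tilde\partial_j = \frac{\partial}{\partial\eta_j}$. Let the coordinate expressions of the metric g be given
+by $g_{ij} = \langle \partial_i, \partial_j \rangle$ and $\tilde g_{ij} = \langle \tilde\partial_i, \tilde\partial_j \rangle$. Let the components of the connection ∇ with respect to the
+coordinates [θi] and [ηj] be given by Γijk, ˜Γijk respectively.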
+Then the covariance of the metric g and the connection ∇ under the re-parametrization means,
+
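+\[ \tilde g_{ij} = \sum_{m}\sum_{n} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial\theta_n}{\partial\eta_j}\, g_{mn} \tag{52} \]
+
+\[ \tilde\Gamma_{ijk} = \sum_{m,n,h} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial\theta_n}{\partial\eta_j}\,\frac{\partial\theta_h}{\partial\eta_k}\,\Gamma_{mnh} + \sum_{m,h} \frac{\partial\theta_h}{\partial\eta_k}\,\frac{\partial^2\theta_m}{\partial\eta_i\,\partial\eta_j}\, g_{mh} \tag{53} \]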
+
+Lemma 2. The Fisher information metric g is covariant under re-parametrization.
+
+Proof. The components of the Fisher information metric with respect to the coordinate system [θi] are
+given by
+
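+\[ g_{ij}(\theta) = \langle \partial_i, \partial_j \rangle_\theta = \int \partial_i p(x;\theta)\,\partial_j p(x;\theta)\,\frac{1}{p(x;\theta)}\,dx. \tag{54} \]
+
+Let ˜p(x; η) = p(x; θ(η)). Then the components of the Fisher information metric with respect to the
+coordinate system [ηj] are given by
+
+\[ \tilde g_{ij}(\eta) = \left\langle \frac{\partial}{\partial\eta_i}, \frac{\partial}{\partial\eta_j} \right\rangle_\eta = \int \frac{\partial\tilde p(x;\eta)}{\partial\eta_i}\,\frac{\partial\tilde p(x;\eta)}{\partial\eta_j}\,\frac{1}{\tilde p(x;\eta)}\,dx. \tag{55} \]
+
+Since
+
+\[ \frac{\partial\tilde p(x;\eta)}{\partial\eta_i} = \sum_{m} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial p(x;\theta(\eta))}{\partial\theta_m} \tag{56} \]
+
+we can write
+
+\begin{align*}
+\tilde g_{ij}(\eta) &= \int \frac{\partial\tilde p(x;\eta)}{\partial\eta_i}\,\frac{\partial\tilde p(x;\eta)}{\partial\eta_j}\,\frac{1}{\tilde p(x;\eta)}\,dx \\
+&= \int \sum_{m} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial p(x;\theta)}{\partial\theta_m}\; \sum_{n} \frac{\partial\theta_n}{\partial\eta_j}\,\frac{\partial p(x;\theta)}{\partial\theta_n}\,\frac{1}{p(x;\theta)}\,dx \tag{57} \\
+&= \sum_{m}\sum_{n} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial\theta_n}{\partial\eta_j} \int \partial_m p(x;\theta)\,\partial_n p(x;\theta)\,\frac{1}{p(x;\theta)}\,dx \\
+&= \left[ \sum_{m}\sum_{n} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial\theta_n}{\partial\eta_j}\, g_{mn}(\theta) \right]_{\theta=\theta(\eta)}
+\end{align*}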
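The covariance law for the Fisher metric can also be verified numerically. The sketch below is an illustration under stated assumptions: it uses a three-outcome categorical model with θ = (p1, p2) and the logit-type re-parametrization θ(η) = exp(η)/(1 + Σ exp(η)), and it approximates all derivatives by central differences. The function names are hypothetical and chosen only for this example.

```python
import numpy as np

def p_theta(theta):
    """Categorical model on three outcomes with parameters theta = (p1, p2)."""
    return np.array([theta[0], theta[1], 1.0 - theta[0] - theta[1]])

def theta_of_eta(eta):
    """Assumed invertible re-parametrization theta(eta) (logit-type coordinates)."""
    z = np.exp(eta)
    return z / (1.0 + z.sum())

def jacobian(f, u, h=1e-6):
    """Central-difference Jacobian; column i approximates d f / d u_i."""
    u = np.asarray(u, dtype=float)
    cols = []
    for i in range(u.size):
        e = np.zeros_like(u)
        e[i] = h
        cols.append((f(u + e) - f(u - e)) / (2.0 * h))
    return np.stack(cols, axis=1)

def fisher(prob, u):
    """Fisher metric: sum over x of d_i p(x) d_j p(x) / p(x)."""
    p = prob(u)
    D = jacobian(prob, u)            # D[x, i] = d p(x) / d u_i
    return D.T @ (D / p[:, None])

eta = np.array([0.3, -0.7])
theta = theta_of_eta(eta)

g_theta = fisher(p_theta, theta)                          # metric in theta-coordinates
g_eta = fisher(lambda e: p_theta(theta_of_eta(e)), eta)   # metric in eta-coordinates
J = jacobian(theta_of_eta, eta)                           # J[m, i] = d theta_m / d eta_i
g_eta_transformed = J.T @ g_theta @ J                     # right-hand side of the covariance law

assert np.allclose(g_eta, g_eta_transformed, atol=1e-7)
```

The agreement is up to finite-difference error only; the tolerance `1e-7` is an assumption that suits the step size used here.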
+
+Lemma 3. The F−connection ∇F is covariant under re-parametrization.
+
+Proof. Let the components of ∇F with respect to the coordinates [θi] and [ηj] be given by Γijk,
+˜Γijk respectively.
+Let ˜p(x; η) = p(x; θ(η)). Let us denote log p(x; θ) by ℓ(x; θ) and log ˜p(x; η) by ˜ℓ(x; η).
+The components of the F−connection ∇F with respect to the coordinate system [θi] are given by
+
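+\[ \Gamma_{ijk} = \int \left[ \partial_i\partial_j\ell(x;\theta) + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell(x;\theta)\,\partial_j\ell(x;\theta) \right] \partial_k\ell(x;\theta)\,p(x;\theta)\,dx \tag{58} \]
+
+The components of ∇F with respect to the coordinate system [ηj] are given by
+
+\[ \tilde\Gamma_{ijk} = \int \left[ \frac{\partial^2\tilde\ell(x;\eta)}{\partial\eta_i\,\partial\eta_j} + \left(1 + \frac{\tilde pF''(\tilde p)}{F'(\tilde p)}\right)\frac{\partial\tilde\ell(x;\eta)}{\partial\eta_i}\,\frac{\partial\tilde\ell(x;\eta)}{\partial\eta_j} \right] \frac{\partial\tilde\ell(x;\eta)}{\partial\eta_k}\,\tilde p(x;\eta)\,dx \tag{59} \]
+
+We can write
+
+\[ \frac{\partial\tilde\ell(x;\eta)}{\partial\eta_i} = \sum_{m} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_m} \tag{60} \]
+
+Then
+
+\[ \frac{\partial^2\tilde\ell(x;\eta)}{\partial\eta_i\,\partial\eta_j} = \sum_{m,n} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial\theta_n}{\partial\eta_j}\,\frac{\partial^2\ell(x;\theta(\eta))}{\partial\theta_m\,\partial\theta_n} + \sum_{m} \frac{\partial^2\theta_m}{\partial\eta_i\,\partial\eta_j}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_m} \tag{61} \]
+
+\[ \frac{\partial\tilde\ell(x;\eta)}{\partial\eta_i}\,\frac{\partial\tilde\ell(x;\eta)}{\partial\eta_j} = \sum_{m,n} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial\theta_n}{\partial\eta_j}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_m}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_n} \tag{62} \]
+
+\[ \frac{\partial\tilde\ell(x;\eta)}{\partial\eta_k} = \sum_{h} \frac{\partial\theta_h}{\partial\eta_k}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_h} \tag{63} \]
+
+Hence we get
+
+\begin{align*}
+\tilde\Gamma_{ijk} &= \int \sum_{m,n,h} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial\theta_n}{\partial\eta_j}\,\frac{\partial\theta_h}{\partial\eta_k}\,\frac{\partial^2\ell(x;\theta(\eta))}{\partial\theta_m\,\partial\theta_n}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_h}\,p(x;\theta(\eta))\,dx \\
+&\quad + \int \sum_{m,h} \frac{\partial^2\theta_m}{\partial\eta_i\,\partial\eta_j}\,\frac{\partial\theta_h}{\partial\eta_k}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_m}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_h}\,p(x;\theta(\eta))\,dx \tag{64} \\
+&\quad + \int \left(1 + \frac{pF''(p)}{F'(p)}\right) \sum_{m,n,h} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial\theta_n}{\partial\eta_j}\,\frac{\partial\theta_h}{\partial\eta_k}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_m}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_n}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_h}\,p(x;\theta(\eta))\,dx \\
+&= \sum_{m,n,h} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial\theta_n}{\partial\eta_j}\,\frac{\partial\theta_h}{\partial\eta_k} \int \frac{\partial^2\ell(x;\theta(\eta))}{\partial\theta_m\,\partial\theta_n}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_h}\,p(x;\theta(\eta))\,dx \\
+&\quad + \sum_{m,h} \frac{\partial^2\theta_m}{\partial\eta_i\,\partial\eta_j}\,\frac{\partial\theta_h}{\partial\eta_k} \int \frac{\partial\ell(x;\theta(\eta))}{\partial\theta_m}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_h}\,p(x;\theta(\eta))\,dx \\
+&\quad + \sum_{m,n,h} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial\theta_n}{\partial\eta_j}\,\frac{\partial\theta_h}{\partial\eta_k} \int \left(1 + \frac{pF''(p)}{F'(p)}\right)\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_m}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_n}\,\frac{\partial\ell(x;\theta(\eta))}{\partial\theta_h}\,p(x;\theta(\eta))\,dx \\
+&= \sum_{m,n,h} \frac{\partial\theta_m}{\partial\eta_i}\,\frac{\partial\theta_n}{\partial\eta_j}\,\frac{\partial\theta_h}{\partial\eta_k}\,\Gamma_{mnh} + \sum_{m,h} \frac{\partial\theta_h}{\partial\eta_k}\,\frac{\partial^2\theta_m}{\partial\eta_i\,\partial\eta_j}\,g_{mh}
+\end{align*}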
+
+Hence we have shown that F−connections are covariant under re-parametrization of the parameter.
+Covariance under re-parametrization means that the metric and connections are coordinate
+independent. Hence the F−geometry is coordinate independent.
+
+4.2. Invariance Under the Transformation of the Random Variable
+
+Amari and Nagaoka [4] defined the invariance of Riemannian metric and connections on a
+statistical manifold under a transformation of the random variable as follows,
+
+Definition 6. Let S = {p(x; ξ) | ξ ∈ E ⊆ Rn} be a statistical manifold defined on a sample space X.
+Let x, y be random variables defined on sample spaces X, Y respectively and φ be a transformation
+of x to y. Assume that this transformation induces a model S′ = {q(y; ξ) | ξ ∈ E ⊆ Rn} on Y.
+Let λ : S −→ S′ be a diffeomorphism defined as
+
+\[ \lambda(p_\xi) = q_\xi \tag{65} \]
+
+Let g = ⟨·, ·⟩ and g′ = ⟨·, ·⟩′ be two Riemannian metrics defined on S and S′ respectively. Let ∇, ∇′ be two affine
+connections on S and S′ respectively. Then the invariance properties are given by
+
+\[ \langle X, Y \rangle_p = \langle \lambda_*(X), \lambda_*(Y) \rangle'_{\lambda(p)} \quad \forall\, X, Y \in T_p(S) \tag{66} \]
+
+\[ \lambda_*(\nabla_X Y) = \nabla'_{\lambda_*(X)}\,\lambda_*(Y) \tag{67} \]
+
+where λ∗ is the push forward map associated with the map λ, which is defined by
+
+\[ \lambda_*(X)_{\lambda(p)} = (d\lambda)_p(X) \tag{68} \]
+
+Now we discuss the invariance properties of the F−geometry under suitable transformations
+of the random variable. Let us restrict ourselves to the case of smooth one-to-one transformations
+of the random variable that are in fact statistically interesting. Amari and Nagaoka [4] mentioned a
+transformation, the sufficient statistic of the parameter of the statistical model, which is widely used in
+statistical estimation. In fact, the one-to-one transformations of the random variable are trivial examples
+of sufficient statistics.
+Consider a statistical manifold S = {p(x; ξ) | ξ ∈ E ⊆ Rn} defined on a sample space X. Let φ be
+a smooth one-to-one transformation of the random variable x to y. Then the density function q(y; ξ) of
+the induced model S′ takes the form
+
+\[ q(y;\xi) = p(w(y);\xi)\,w'(y) \tag{69} \]
+
+where w is a function such that x = w(y) and $\phi'(x) = \frac{1}{w'(\phi(x))}$.
+Let us denote log q(y; ξ) by ℓ(qy) and log p(x; ξ) by ℓ(px).
+
+
+Lemma 4. The Fisher information metric and Amari’s α-connections are invariant under smooth one-to-one
+transformations of the random variable.
+
+Proof. Let φ be a smooth one-to-one transformation of the random variable x to y.
+From Equation (69)
+
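+\[ p(x;\xi) = q(\phi(x);\xi)\,\phi'(x) \tag{70} \]
+
+\[ \partial_i\ell(q_y) = \partial_i\ell(p_{w(y)}) \tag{71} \]
+
+\[ \partial_i\ell(q_{\phi(x)}) = \partial_i\ell(p_x) \tag{72} \]
+
+The Fisher information metric g′ on the induced manifold S′ is given by
+
+\begin{align*}
+g'_{ij}(q_\xi) &= \int_Y \partial_i\ell(q_y)\,\partial_j\ell(q_y)\,q(y;\xi)\,dy \\
+&= \int_X \partial_i\ell(q_{\phi(x)})\,\partial_j\ell(q_{\phi(x)})\,q(\phi(x);\xi)\,\phi'(x)\,dx \tag{73} \\
+&= \int_X \partial_i\ell(p_x)\,\partial_j\ell(p_x)\,p(x;\xi)\,dx \\
+&= g_{ij}(p_\xi)
+\end{align*}
+
+which is the Fisher information metric on S.
+The components of Amari’s α−connections on the induced manifold S′ are given by
+
+\begin{align*}
+{\Gamma'}^{\alpha}_{ijk}(q_\xi) &= \int_Y \partial_i\partial_j\ell(q_y)\,\partial_k\ell(q_y)\,q(y;\xi)\,dy + \int_Y \frac{1-\alpha}{2}\,\partial_i\ell(q_y)\,\partial_j\ell(q_y)\,\partial_k\ell(q_y)\,q(y;\xi)\,dy \\
+&= \int_X \partial_i\partial_j\ell(q_{\phi(x)})\,\partial_k\ell(q_{\phi(x)})\,q(\phi(x);\xi)\,\phi'(x)\,dx \\
+&\quad + \int_X \frac{1-\alpha}{2}\,\partial_i\ell(q_{\phi(x)})\,\partial_j\ell(q_{\phi(x)})\,\partial_k\ell(q_{\phi(x)})\,q(\phi(x);\xi)\,\phi'(x)\,dx \tag{74} \\
+&= \int_X \partial_i\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x;\xi)\,dx + \int_X \frac{1-\alpha}{2}\,\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x;\xi)\,dx \\
+&= \Gamma^{\alpha}_{ijk}(p_\xi)
+\end{align*}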
+
+which are the components of Amari’s α−connections on the manifold S. Thus we obtained that
+the Fisher information metric and Amari’s α-connections are invariant under smooth one-to-one
+transformations of the random variable.
+
+Now we prove that α-connections are the only F−connections that are invariant under smooth
+one-to-one transformations of the random variable.
+
+Theorem 4. Amari’s α-connections are the only F−connections that are invariant under smooth one-to-one
+transformations of the random variable.
+
+
+Proof. Let φ be a smooth one-to-one transformation of the random variable x to y.
+The components of the F−connection of the induced manifold S′ are
+
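+\begin{align*}
+{\Gamma'}^{F}_{ijk}(q_\xi) &= \int_Y \left[ \partial_i\partial_j\ell(q_y) + \left(1 + \frac{qF''(q)}{F'(q)}\right)\partial_i\ell(q_y)\,\partial_j\ell(q_y) \right] \partial_k\ell(q_y)\,q(y;\xi)\,dy \\
+&= \int_X \partial_i\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x;\xi)\,dx \tag{75} \\
+&\quad + \int_X \left(1 + \frac{q(\phi(x);\xi)\,F''(q(\phi(x);\xi))}{F'(q(\phi(x);\xi))}\right)\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x;\xi)\,dx
+\end{align*}
+
+and the components of the F−connection of the manifold S are
+
+\begin{align*}
+\Gamma^{F}_{ijk}(p_\xi) &= \int_X \partial_i\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x;\xi)\,dx \\
+&\quad + \int_X \left(1 + \frac{p(x;\xi)\,F''(p(x;\xi))}{F'(p(x;\xi))}\right)\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x;\xi)\,dx. \tag{76}
+\end{align*}
+
+Then by equating the components ${\Gamma'}^{F}_{ijk}(q_\xi)$, $\Gamma^{F}_{ijk}(p_\xi)$ of the F−connection, we get
+
+\begin{align*}
+&\int \frac{q(\phi(x);\xi)\,F''(q(\phi(x);\xi))}{F'(q(\phi(x);\xi))}\,\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x;\xi)\,dx \\
+&\qquad = \int \frac{p(x;\xi)\,F''(p(x;\xi))}{F'(p(x;\xi))}\,\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x;\xi)\,dx \tag{77}
+\end{align*}
+
+Then it follows that the condition for F−connection to be invariant under the transformation φ is
+given by
+
+\[ \frac{pF''(p)}{F'(p)} = k, \tag{78} \]
+
+where k is a real constant.
+Hence it follows from Euler’s homogeneous function theorem that the function F′ is a positive
+homogeneous function in p of degree k. So
+
+\[ F'(\lambda p) = \lambda^{k} F'(p) \quad \text{for } \lambda > 0. \tag{79} \]
+
+Since F′ is a positive homogeneous function in the single variable p, without loss of generality
+we can take,
+
+\[ F'(p) = p^{k}. \tag{80} \]
+
+Therefore
+
+\[ F(p) = \begin{cases} \dfrac{p^{k+1}}{k+1} & k \neq -1 \\ \log p & k = -1 \end{cases} \tag{81} \]
+
+Let
+
+\[ k = -\frac{1+\alpha}{2}, \quad \alpha \in \mathbb{R}. \tag{82} \]
+
+we get
+
+\[ F(p) = \begin{cases} \dfrac{2}{1-\alpha}\,p^{\frac{1-\alpha}{2}} & \alpha \neq 1 \\ \log p & \alpha = 1 \end{cases} \tag{83} \]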
+
+which is nothing but Amari’s α−embeddings Lα(p). Hence we obtain that Amari’s α−connections
+are the only F−connections that are invariant under smooth one-to-one transformations of the
+random variable.
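As a quick sanity check of the invariance condition, the sketch below evaluates p F''(p)/F'(p) for the α-embeddings by finite differences and compares it with the constant k = −(1 + α)/2. The helper names `L_alpha` and `ratio`, the step size, and the sample values of α and p are assumptions made for this illustration.

```python
import math

def L_alpha(p, alpha):
    """Amari's alpha-embedding: (2/(1-alpha)) p^((1-alpha)/2), or log p when alpha = 1."""
    if alpha == 1:
        return math.log(p)
    return 2.0 / (1.0 - alpha) * p ** ((1.0 - alpha) / 2.0)

def ratio(F, p, h=1e-4):
    """Finite-difference estimate of p * F''(p) / F'(p)."""
    d1 = (F(p + h) - F(p - h)) / (2.0 * h)
    d2 = (F(p + h) - 2.0 * F(p) + F(p - h)) / h ** 2
    return p * d2 / d1

# The ratio should not depend on p, and should equal k = -(1 + alpha)/2.
for alpha in (-1.0, 0.0, 0.5, 1.0, 3.0):
    k = -(1.0 + alpha) / 2.0
    for p in (0.2, 0.5, 0.9):
        assert abs(ratio(lambda t: L_alpha(t, alpha), p) - k) < 1e-4
```

This only confirms that the α-embeddings satisfy the condition; the converse direction, that nothing else does, is what the proof above establishes.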
+
+
+Remark 4. In Section 2, we defined (F, G)-connections using a general embedding function F and a positive
+smooth function G. We can show that (F, G)-connection is invariant under smooth one-to-one transformation
+of the random variable when G(p) = c, where c is a real constant and F(p) = Lα(p) (proof is similar to that of
+Theorem 4). The notion of (F, G)−metric and (F, G)−connection provides a more general way of introducing
+geometric structures on a manifold. We were able to show that the Fisher information metric (up to a constant)
+and Amari’s α−connections are the only metric and connections belonging to this class that are invariant under
+both the transformation of the parameter and the one-to-one transformation of the random variable.
+
+5. Conclusions
+
+The Fisher information metric and Amari’s α−connections are widely used in the theory
+of information geometry and have an important role in the theory of statistical estimation.
+Amari’s α−connections are defined using a one parameter family of functions, the α−embeddings.
+We generalized this idea to introduce geometric structures on a statistical manifold S. We considered
+a general embedding function F of S into RX and obtained a geometric structure on S called the
+F−geometry. Amari’s α−geometry is a special case of F−geometry. A more general way of defining
+Riemannian metrics and affine connections on a statistical manifold S is given using a positive
+continuous function G and the embedding F.
+Amari’s α−geometry is the only F−geometry that is invariant under both the transformation of
+the parameter and the random variable or equivalently under the sufficient statistic. We can relax the
+condition of invariance under the sufficient statistic and can consider other statistically significant
+transformations as well, which then gives an F−geometry other than α−geometry that is invariant
+under these statistically significant transformations. We believe that the idea of F−geometry can be
+used in the further development of the geometric theory of q-exponential families. We look forward to
+studying these problems in detail later.
+
+Acknowledgments: We are extremely thankful to Shun-ichi Amari for reading this article and encouraging our
+learning process. We would like to thank the reviewer who mentioned the references [13,16] that are of great
+importance in our future work.
+
+Author Contributions: The authors contributed equally to the presented mathematical framework and the writing
+of the paper.
+
+Conflicts of Interest: The authors declare no conflicts of interest.
+
+References
+
+1. Amari, S. Differential geometry of curved exponential families-curvature and information loss. Ann. Statist. 1982, 10, 357–385.
+2. Amari, S. Differential-Geometrical Methods in Statistics; Lecture Notes in Statistics, Volume 28; Springer-Verlag: New York, NY, USA, 1985.
+3. Amari, S.; Kumon, M. Differential geometry of Edgeworth expansions in curved exponential family. Ann. Inst. Statist. Math. 1983, 35, 1–24.
+4. Amari, S.; Nagaoka, H. Methods of Information Geometry, Translations of Mathematical Monographs; Oxford University Press: Oxford, UK, 2000.
+5. Barndorff-Nielsen, O.E.; Cox, D.R.; Reid, N. The role of differential geometry in statistical theory. Internat. Statist. Rev. 1986, 54, 83–96.
+6. Dawid, A.P. A Discussion to Efron’s paper. Ann. Statist. 1975, 3, 1231–1234.
+7. Efron, B. Defining the curvature of a statistical problem (with applications to second order efficiency). Ann. Statist. 1975, 3, 1189–1242.
+8. Efron, B. The geometry of exponential families. Ann. Statist. 1978, 6, 362–376.
+9. Murray, M.K.; Rice, R.W. Differential Geometry and Statistics; Chapman & Hall: London, UK, 1995.
+10. Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta. Math. Soc. 1945, 37, 81–91.
+11. Chentsov, N.N. Statistical Decision Rules and Optimal Inference; Translated in English, Translation of the Mathematical Monographs; American Mathematical Society: Providence, RI, USA, 1982.
+12. Corcuera, J.M.; Giummole, F. A characterization of monotone and regular divergences. Ann. Inst. Statist. Math. 1998, 50, 433–450.
+13. Zhang, J. Divergence function, duality and convex analysis. Neur. Comput. 2004, 16, 159–195.
+14. Burbea, J. Informative geometry of probability spaces. Expo Math. 1986, 4, 347–378.
+15. Wagenaar, D.A. Information Geometry for Neural Networks. Available online: http://www.danielwagenaar.net/res/papers/98-Wage2.pdf (accessed on 13 December 2013).
+16. Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually flat and conformal geometries. Physica A 2012, 391, 4308–4319.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+
+entropy
+
+Article
+Computational Information Geometry in Statistics:
+Theory and Practice
+
+Frank Critchley 1 and Paul Marriott 2,*
+
+1 Department of Mathematics and Statistics, The Open University, Walton Hall, Milton Keynes,
+Buckinghamshire MK7 6AA, UK; E-Mail: f.critchley@open.ac.uk
+2 Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo,
+ON N2L 3G1, Canada
+* E-Mail: pmarriot@uwaterloo.ca; Tel.: +1-519-888-4567.
+
+Received: 27 March 2014; in revised form: 25 April 2014 / Accepted: 29 April 2014 /
+Published: 2 May 2014
+
+Abstract: A broad view of the nature and potential of computational information geometry in
+statistics is offered.
+This new area suitably extends the manifold-based approach of classical
+information geometry to a simplicial setting, in order to obtain an operational universal model space.
+Additional underlying theory and illustrative real examples are presented. In the infinite-dimensional
+case, challenges inherent in this ambitious overall agenda are highlighted and promising new
+methodologies indicated.
+
+Keywords: information geometry; computational geometry; statistical foundations
+
+1. Introduction
+
+The application of geometry to statistical theory and practice has seen a number of different
+approaches developed. One of the most important can be defined as starting with Efron’s seminal
+paper [? ] on statistical curvature and subsequent landmark references, including the book by Kass
+and Vos [? ]. This approach, a major part of which has been called information geometry, continues
+today, a primary focus being invariant higher-order asymptotic expansions obtained through the use
+of differential geometry. A somewhat representative example of the type of result it generates is taken
+from [? ], where the notation is defined:
+
+Example 1. The bias correction of a first-order efficient estimator, ˆβ, is defined by:
+
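+\[ b^{a}(\beta) = -\frac{1}{2n}\, g^{aa'} \left( g^{bc}\,\Gamma^{(-1)}_{a'bc} + g^{\kappa\lambda}\, h^{(-1)}_{\kappa\lambda a'} \right), \]
+
+and has the property that if ˆβ∗ := ˆβ − b(β) then:
+
+\[ E_\beta(\hat\beta^{*} - \beta) = O(n^{-3/2}). \]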
+
+The strengths usually claimed of such a result are that, for a worker fluent in the language of
+information geometry, it is explicit, insightful as to the underlying structure and of clear utility in
+statistical practice. We agree entirely. However, the overwhelming evidence of the literature is that,
+while the benefits of such inferential improvements are widely acknowledged in principle, in practice,
+the overhead of first becoming fluent in information geometry prevents their routine use. As a result, a
+great number of powerful results of practical importance lie severely underused, locked away behind
+notational and conceptual bars.
+
+Entropy 2014, 16, 2454–2471; doi:10.3390/e16052454
+www.mdpi.com/journal/entropy
+
+This paper proposes that this problem can be addressed computationally by the development
+of what we call computational information geometry. This gives a mathematical and numerical
+computational framework in which the results of information geometry can be encoded as “black-box”
+numerical algorithms, allowing direct access to their power. Essentially, this works by exploiting the
+structural properties of information geometry, which are such that all formulae can be expressed in
+terms of four fundamental building blocks: defined and detailed in Amari [? ], these are the +1 and −1
+geometries, the way that these are connected via the Fisher information and the foundational duality
+theorem. Additionally, computational information geometry enables a range of methodologies and
+insights impossible without it; notably, those deriving from the operational, universal model space,
+which it affords; see, for example, [? ? ? ].
+The paper is structured as follows. Section 2 looks at the case of distributions on a finite number
+of categories where the extended multinomial family provides an exhaustive model underlying the
+corresponding information geometry. Since the aim is to produce a computational theory, a finite
+representation is the ultimate aim, making the results of this section of central importance. The
+paper also emphasises how the simplicial structures introduced here are foundational to a theory of
+computational information geometry. Being intrinsically constructive, a simplicial approach is useful
+both theoretically and computationally. Section 3 looks at how simplicial structures, defined for finite
+dimensions, can be extended to the infinite dimensional case.
+
+2. Finite Discrete Case
+
+2.1. Introduction
+
+This section shows how the results of classical information geometry can be applied in a purely
+computational way. We emphasise that the framework developed here can be implemented in
+a purely algorithmic way, allowing direct access to a powerful information geometric theory of
+practical importance.
+The key tool, as explained in [? ], is the simplex:
+
+\[ \Delta^{k} := \left\{ \pi = (\pi_0, \pi_1, \dots, \pi_k)^{\top} : \pi_i \ge 0,\ \sum_{i=0}^{k} \pi_i = 1 \right\}, \tag{1} \]
+
+with a label associated with each vertex. Here, k is chosen to be sufficiently large, so that any statistical
+model—by which we mean a sample space, a set of probability distributions and selected inference
+problem—can be embedded. The embedding is done in such a way that all the building blocks
+of information geometry (i.e., manifold, affine connections and metric tensor) can be numerically
+computed explicitly. Within such a simplex, we can embed a large class of regular exponential families;
+see [? ] for details. This class includes exponential family random graph models, logistic regression,
+log-linear and other models for categorical data analysis. Furthermore, the multinomial family on k + 1
+categories is naturally identified with the relative interior of this space, int(Δk), while the extended
+family, Equation (??), is a union of distributions with different support sets.
+This paper builds on the theory of information geometry following that introduced by [? ] via
+the affine space construction introduced by [? ] and extended by [? ]. Since this paper concentrates
+on categorical random variables, the following definitions are appropriate. Consider a finite set of
+disjoint categories or bins B = {Bi}i∈A. Any distribution over this finite set of categories is defined
+by a set, {πi}i∈A, which defines the corresponding probabilities. With “mix” connoting mixtures of
+distributions, we have:
+
+Definition 1. The −1-affine space structure over distributions on B := {Bi}i∈A is (Xmix, Vmix, +) where:
+
+\[ X_{mix} = \left\{ \{x_i\}_{i\in A} \,\middle|\, \sum_{i\in A} x_i = 1 \right\}, \qquad V_{mix} = \left\{ \{v_i\}_{i\in A} \,\middle|\, \sum_{i\in A} v_i = 0 \right\} \]
+
+and the addition operator, +, is the usual addition of sequences.
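A small numerical illustration of this affine structure may help. The sketch below assumes three bins and two arbitrary distributions `p` and `q`; the helper names `in_X_mix`, `in_V_mix` and `inside_simplex` are hypothetical. It shows that differences of distributions lie in Vmix, that every point on the line p + t(q − p) lies in Xmix, and that only part of that line consists of genuine distributions.

```python
import numpy as np

def in_X_mix(x, tol=1e-12):
    """Membership in X_mix: components sum to one (signs are unrestricted)."""
    return abs(float(np.sum(x)) - 1.0) < tol

def in_V_mix(v, tol=1e-12):
    """Membership in V_mix: components sum to zero."""
    return abs(float(np.sum(v))) < tol

def inside_simplex(x):
    """A point of X_mix is a distribution only if all components are non-negative."""
    return bool(np.all(x >= 0))

p = np.array([0.5, 0.3, 0.2])   # illustrative distributions on three bins
q = np.array([0.1, 0.1, 0.8])
v = q - p                        # translation vector taking p to q

assert in_V_mix(v)
for t in (-0.5, 0.25, 1.0, 2.0):
    assert in_X_mix(p + t * v)   # the whole -1-affine line stays in X_mix

# Mixtures with 0 <= t <= 1 are distributions; extrapolating can leave the simplex.
assert inside_simplex(p + 0.25 * v)
assert not inside_simplex(p + 2.0 * v)
```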
+
+
+In Definition ??, the space of (discretised) distributions is a −1-convex subspace of the affine
+space, (Xmix, Vmix, +). A similar affine structure for the +1-geometry, once the support has been fixed,
+can be derived from the definitions in [? ].
+
+2.2. Examples
+
+Examples ?? and ?? are used for illustration. The second of these is a moderately high dimensional
+family, where the way that the boundaries of the simplex are attached to the model is of great
+importance for the behaviour of the likelihood and of the maximum likelihood estimate. In general,
+working in a simplex, boundary effects mean that standard first order asymptotic results can fail,
+while the much more flexible higher order methods can be very effective. The other example is a
+continuous curved exponential family, where both higher order asymptotic sampling theory results
+and geometrically-based dimension reduction are described.
+
+Example 2. The paper [? ] models survival times for leukaemia patients. These times, recorded in days, start
+at the time of diagnosis, and there are 43 observations; see [? ] for details. We further assume that the data is
+censored at a fixed value. It was observed that a censored exponential distribution gives a reasonable, but not
+exact, fit. As discussed in [? ], this gives a one-dimensional curved exponential family inside a two-dimensional
+regular exponential family of the form:
+
+\[ \exp\left[ \lambda_1 x + \lambda_2 y - \log\left( \frac{1}{\lambda_2}\left( e^{\lambda_2 t} - 1 \right) + e^{\lambda_1 + \lambda_2 t} \right) \right], \tag{2} \]
+
+where y = min(z, t) and x = I(z ≥ t), and the embedding map is given by (λ1(θ), λ2(θ)) = (− log θ, −θ).
+As shown in [? ], the loss due to discretisation can be made arbitrarily small for all information geometry
+objects. Thus, for example, using this computational approach, it is straightforward to compute the bias
+correction described in Example ??. Each of the terms in the asymptotic bias, i.e., the metric, $g_{ij}$, its inverse, $g^{ij}$,
+the Christoffel symbols, $\Gamma^{(-1)}_{ijk}$, and curvature term, $h^{(-1)}$, can be directly numerically coded as appropriate finite
+difference approximations to derivatives. Thus, “black-box” code can directly calculate the numerical value of
+the asymptotic bias, and this numerical value can then be used by those who are not familiar with information
+geometry. For example this calculation establishes the fact that, with this particular data set, the sample size is
+such that the bias is inferentially unimportant.
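To make the finite-difference idea concrete, here is a minimal sketch, not the authors' code, that discretises a censored exponential model with rate θ into equal-width bins plus a censoring atom, and computes its Fisher information by central differences. The censoring time t = 2, the bin counts and the step size are assumptions for illustration. The comparison value (1 − e^{−θt})/θ² is the standard Fisher information of the continuous censored exponential.

```python
import numpy as np

def bin_probs(theta, t=2.0, m=2000):
    """Probabilities of m equal-width bins on [0, t) plus the censoring atom at t."""
    edges = np.linspace(0.0, t, m + 1)
    surv = np.exp(-theta * edges)                  # P(z >= edge)
    return np.append(surv[:-1] - surv[1:], surv[-1])

def fisher_fd(theta, h=1e-5, **kw):
    """Fisher information sum_i pi_i (d log pi_i / d theta)^2 by central differences."""
    dlog = (np.log(bin_probs(theta + h, **kw)) - np.log(bin_probs(theta - h, **kw))) / (2.0 * h)
    return float(np.sum(bin_probs(theta, **kw) * dlog ** 2))

theta, t = 1.0, 2.0
exact = (1.0 - np.exp(-theta * t)) / theta ** 2    # continuous censored exponential
coarse = fisher_fd(theta, t=t, m=5)                # very coarse discretisation
fine = fisher_fd(theta, t=t, m=2000)               # fine discretisation

# Coarsening loses information; refining the bins recovers the continuous value.
assert coarse < fine
assert abs(fine - exact) < 1e-4 * exact
```

The same pattern (evaluate bin probabilities, difference their logarithms) extends to the Christoffel symbols and curvature terms, at the cost of higher-order differences.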
+
+Figure 1. Undirected graphical model showing the cyclic graph of order four.
+
+Example 3. The paper [? ] discusses an undirected graphical model based on the cyclic graph of order four,
+shown in Figure ??, with binary random variables at each node. Without any constraints, there are 16 possible
+values for the graph, so model space can be thought of as a 15-dimensional simplex, including the relative
+boundary. However, the conditional independence relations encoded by the graph impose linear constraints in the
+natural parameters of the exponential family. Thus, the resultant model is a lower dimensional full exponential
+family and its closure.
+As described in [? ], the four cycle model is a seven dimensional exponential family, which is a +1-affine
+subspace of the +1-affine structure of the 15-dimensional simplex. The model can be written in the form:
+
+\[ \left( \frac{\pi_i \exp\left( \sum_{h=1}^{8} \eta_h v_{hi} \right)}{\sum_{j=0}^{15} \pi_j \exp\left( \sum_{h=1}^{8} \eta_h v_{hj} \right)} \right)_{i=0}^{15} \tag{3} \]
+
+for a given set of linearly independent vectors $\{v_h\}_{h=1}^{8}$. The existence of the maximum likelihood estimate for
+η = (ηh) will depend on how the limit points of Model (??) meet the observed face of Δ15; that is, the span of the
+vertices (bins) having positive counts. Thus, a key computational task is to learn how a full exponential family,
+defined by a representation of the form of (??), is attached to boundary sub-simplices of the high-dimensional
+embedding simplex.
+In order to visualise the geometric aspects of this problem, consider a lower dimensional version. Define
+a two-dimensional full exponential family by the vectors v1 = (1, 2, 3, 4), v2 = (1, 4, 9, −1) and the uniform
+distribution base point, πi, embedded in the three-dimensional simplex. The two-dimensional family is defined
+by the +1-affine space through (0.25, 0.25, 0.25, 0.25) spanned by the space of vectors of the form:
+
+α(1, 2, 3, 4) + β(1, 4, 9, −1) = (α + β, 2α + 4β, 3α + 9β, 4α − β).
+
+Consider directions from the origin obtained by writing α = θβ, giving, for each θ, a one-dimensional, full
+exponential family parameterized by β in the direction β(θ + 1, 2θ + 4, 3θ + 9, 4θ − 1). The aspect of this vector,
+which determines the connection to the boundary, is the rank order of its elements. For example, suppose the first
+component was the maximum and the last the minimum. Then, as β → ±∞, this one-dimensional family will
+be connected to the first and fourth vertex of the embedding four simplex, respectively. Note that changing the
+value of θ changes the rank structure, as illustrated in Figure ??. This plot shows the four element-wise linear
+functions of θ (dashed lines) and the salient overall feature of their rank order; that is, their upper and lower
+envelopes (solid lines). From this analysis of the envelopes of a set of linear functions, it can be seen that the
+function 2θ + 4 is redundant. The consequence of this is shown in Figure ??, which shows a direct computation
+of the two-dimensional family. It is clear that, indeed, only three of the four vertexes have been connected by
+the model.
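The envelope analysis of the lower dimensional example can be reproduced with a few lines of code. This is a sketch under stated assumptions: it scans a finite grid of θ values (the range [−50, 50] and the grid size are arbitrary choices) rather than solving for the envelopes exactly, and the variable names are illustrative.

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0, 4.0])
v2 = np.array([1.0, 4.0, 9.0, -1.0])

# Grid of directions theta; row r holds the components of theta*v1 + v2,
# i.e. (theta + 1, 2*theta + 4, 3*theta + 9, 4*theta - 1) at thetas[r].
thetas = np.linspace(-50.0, 50.0, 10001)
values = np.outer(thetas, v1) + v2

upper = set(np.argmax(values, axis=1).tolist())   # components attaining the maximum
lower = set(np.argmin(values, axis=1).tolist())   # components attaining the minimum
on_envelope = upper | lower

# Components that never reach either envelope are redundant (0-based indices).
redundant = sorted(set(range(4)) - on_envelope)
print("upper envelope:", sorted(upper))
print("lower envelope:", sorted(lower))
print("redundant components:", redundant)
```

On this grid only index 1, the function 2θ + 4, fails to reach either envelope, which matches the observation that the second vertex is not connected to the family.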
+In general, the problem of finding the limit points in full exponential families inside simplex models is a
+problem of finding redundant linear constraints. As shown in [? ], this can be converted, via convex duality, into
+the problem of finding extremal points in a finite dimensional affine space. In the four-cycle model, this technique
+can construct all sub-simplices containing limit points of the four-cycle model. For example, it can be shown
+that all of the 16 vertices are part of the boundary. Once the boundary points have been identified, necessary
+and sufficient conditions for the existence of the maximum likelihood estimate in the +1-parameters can easily be found
+computationally [? ].
+
+Figure 2. The envelope of a set of linear functions. Functions, dashed lines; envelope, solid lines.
+
+Figure 3. Attaching a two-dimensional example to the boundary of the simplex.
+
+2.3. Tensor Analysis and Numerical Stability
+
+One of the most powerful sets of results from classical information geometry is the way that
+geometrically-based tensor analysis is perfect for use in multi-dimensional higher order asymptotic
+analysis; see [? ] or [? ]. The tensorial formulation does, however, present a couple of problems in
+practice. For many, its very tight and efficient notational aspects can obscure rather than enlighten,
+while the resulting formulae tend to have a very large number of terms, making them rather
+cumbersome to work with explicitly. These are not problems at all for the computational approach
+described in this paper. Rather, the clarity of the tensorial approach is ideal for coding, where large
+numbers of additive terms, of course, are easy to deal with.
+Two more fundamental issues, which the global geometric approach of this paper highlights,
+concern numerical stability. The ability to invert the Fisher information matrix is vital in most tensorial
+formulae, and so understanding its spectrum, discussed in Section ??, is vital. Secondly, numerical
+underflow and overflow near boundaries require careful analysis, and so, understanding the way
+that models are attached to the boundaries of the extended multinomial models is equally important.
+The four-cycle model, to which we now return, illustrates computational information geometry doing
+this effectively.
+
+Example 4. The multivariate Edgeworth approximation to the sampling distribution of part of the sufficient
+statistic for the four-cycle model is shown in Figure ??. Using the techniques described above, a point near the
+boundary of the 15-simplex has been selected as the data generation process. For illustration, we focus on the
+marginal distribution of two components of the sufficient statistic, though any number could have been chosen.
+The boundary forces constraints on the range of the sufficient statistics, shown by the dashed line in the plot.
+The points, jittered for clarity, show the distribution computed by simulation. It is typical that such boundary
+constraints prevent standard first order methods from performing well, but the greater flexibility of higher
+order methods can be seen to work well here. As discussed above, methods, such as the multivariate Edgeworth
+expansion, can be strongly exploited in a computational framework, such as ours. Note, the discretization that
+can be observed in the figure is extensively discussed in [? ].
+
+Figure 4. Using the Edgeworth expansion near the boundary of four-cycle model.
+
+2.4. Spectrum of Fisher Information
+
+We focus now on the second numerical issue identified above. In any multinomial, the Fisher
+information matrix and its inverse are explicit. Indeed, the 0-geodesics and the corresponding geodesic
+distance are also explicit; see [? ] or [? ]. However, since the simplex glues together multinomial
+structures with different supports and the computational theory is in high dimensions, it is a fact
+that the Fisher information matrix can be arbitrarily close to being singular. It is therefore of central
+interest that the spectral decomposition of the Fisher information itself has a very nice structure, as
+shown below.
+
+Example 5. Consider a multinomial distribution based on 81 equal width categories on [−5, 5], where the
+probability associated to a bin is proportional to that of the standard normal distribution for that bin. The Fisher
+information for this model is an 80 × 80 matrix, whose spectrum is shown in Figure 5. By inspection, it can
+be seen that there are exponentially small eigenvalues, so that while the matrix is positive definite, it is also
+arbitrarily close to being singular. Furthermore, it can be seen that the spectrum has the shape of a half-normal
+density function and that the eigenvalues seem to come in pairs. These facts are direct consequences of the general
+results below.
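The setting of Example 5 can be reproduced with a short numerical sketch. It builds the 81 bin probabilities, forms diag(π−0) − π−0π−0ᵀ and computes its spectrum. The helper names and the choice of the left-most bin as the omitted bin π0 are assumptions made here for illustration; the source does not say which bin is dropped.

```python
from math import erf, sqrt

import numpy as np


def discretised_normal(n_bins=81, lo=-5.0, hi=5.0):
    """Bin probabilities proportional to standard normal mass on equal-width bins."""
    edges = np.linspace(lo, hi, n_bins + 1)
    cdf = np.array([0.5 * (1.0 + erf(e / sqrt(2.0))) for e in edges])
    mass = np.diff(cdf)
    return mass / mass.sum()


def fisher_information(pi_minus0):
    """diag(pi_{-0}) - pi_{-0} pi_{-0}^T, the Fisher information up to the factor N."""
    return np.diag(pi_minus0) - np.outer(pi_minus0, pi_minus0)


pi = discretised_normal()
pi_minus0 = pi[1:]  # assumption: the omitted bin pi_0 is the left-most one
info = fisher_information(pi_minus0)
eigenvalues = np.linalg.eigvalsh(info)[::-1]  # sorted from largest to smallest

print(info.shape)
print(eigenvalues[:3])   # largest eigenvalues
print(eigenvalues[-3:])  # smallest eigenvalues, positive but very close to zero
```

Plotting `eigenvalues` against their rank should give a picture comparable to the spectrum described in the example, although the exact values depend on the assumed choice of π0.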
+
+Entropy 2014, 16, 2454–2471
+
+With π−0 denoting the vector of all bin probabilities, except π0, we can write the Fisher
+information matrix (in the +1 form) as N times:
+
+\[ I(\pi) := \operatorname{diag}(\pi_{-0}) - \pi_{-0}\pi_{-0}^{T}. \]
+
+This has an explicit spectral decomposition, which can be computed by using interlacing
+eigenvalue results (see, for example, [? ], Chapter 4). In particular, if the diagonal matrices
+$\operatorname{diag}(\pi_1, \dots, \pi_k)$ and $\operatorname{diag}(\lambda_1 I_{m_1} | \cdots | \lambda_g I_{m_g})$ agree up to a row-and-column permutation, where $g > 1$
+and $\lambda_1 > \cdots > \lambda_g > 0$, then $I(\pi)$ has ordered spectrum:
+
+\[ \lambda_1 > \tilde{\lambda}_1 > \cdots > \lambda_g > \tilde{\lambda}_g \ge 0, \tag{4} \]
+
+with $\tilde{\lambda}_g > 0 \iff \pi_0 > 0$, each $\lambda_i$ having multiplicity $m_i - 1$, while each $\tilde{\lambda}_i$ is simple.
+
+[Figure 5 plot: eigenvalues (vertical axis, 0.00 to 0.06) against rank (horizontal axis, 0 to 80).]
+
+Figure 5. Spectrum of the Fisher information matrix of a discretised normal distribution.
+
+We give a complete account of the spectral decomposition (SpD) of I(π). There are four cases to
+consider, the last having the generic spectrum of (4). Without loss of generality, after permutation, assume now
+π1 ≥ · · · ≥ πk. The four cases are:
+
+Case 1 For some l < k, the last k − l elements of π−0 vanish: the sub-case l = 0 ⇐⇒ π0 = 1 ⇐⇒
+I(π) = 0 is trivial. Otherwise, writing π+ = (π1, . . . , πl)T and Π+ = diag(π+), the SpD of:
+
+\[ I(\pi) = \begin{pmatrix} \Pi_+ - \pi_+\pi_+^{T} & 0 \\ 0 & 0 \end{pmatrix} \]
+
+follows at once from that of $\Pi_+ - \pi_+\pi_+^{T}$, given below.
+Case 2 k = 1: this case is trivial.
+Case 3 k > 1, π = λ1k, λ > 0: the SpD of I(π) is:
+
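+\[ \lambda C_k + \lambda(1 - k\lambda) J_k, \]
+
+where $C_k = I_k - J_k$ and $J_k = k^{-1} 1_k 1_k^{T}$. Here, $\lambda$ has multiplicity $k - 1$ and eigenspace $[\operatorname{Span}(1_k)]^{\perp}$,
+while $\tilde{\lambda} := \lambda(1 - k\lambda)$ has multiplicity one and eigenspace $\operatorname{Span}(1_k)$. In particular, since
+$1 - \pi_0 = k\lambda$, it follows that:
+
+\[ I(\pi) \text{ is singular} \iff \pi_0 = 0. \]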
+
+
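+Case 4 $\pi_{-0} = (\lambda_1 1_{m_1}^{T} | \dots | \lambda_g 1_{m_g}^{T})^{T}$, $g > 1$ and $\lambda_1 > \cdots > \lambda_g > 0$:
+
+This is the generic case, having the spectrum of (4) above. Denoting by $O_m$ the zero matrix of
+order $m \times m$ and by $P(\nu)$ the rank one orthogonal projector onto $\operatorname{Span}(\nu)$, ($\nu \neq 0$), the SpD is:
+
+\[ \sum_{i=1,\, m_i>1}^{g} \lambda_i \operatorname{diag}(O_{m_{i-}}, C_{m_i}, O_{m_{i+}}) + \sum_{i=1}^{g} \tilde{\lambda}_i P\left( \left( \frac{\lambda_1}{\tilde{\lambda}_i - \lambda_1} 1_{m_1}^{T}, \dots, \frac{\lambda_g}{\tilde{\lambda}_i - \lambda_g} 1_{m_g}^{T} \right)^{T} \right), \]
+
+where $m_{i-} = \sum\{m_j \mid j < i\}$, $m_{i+} = \sum\{m_j \mid j > i\}$ and the $\tilde{\lambda}_i$ are the zeros of:
+
+\[ h(\tilde{\lambda}) := 1 + \sum_{i=1}^{g} \frac{m_i \lambda_i^{2}}{\tilde{\lambda} - \lambda_i} = \left( 1 - \sum_{i=1}^{g} m_i \lambda_i \right) + \tilde{\lambda} \left( \sum_{i=1}^{g} \frac{m_i \lambda_i}{\tilde{\lambda} - \lambda_i} \right). \]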
+
+In particular, {˜λi : i = 1, · · · , g} are simple eigenvalues satisfying (4) while, whenever mi > 1,
+λi is also an eigenvalue having multiplicity mi − 1. Further, expanding det(I(π)), we again find:
+
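+\[ I(\pi) \text{ is singular} \iff \pi_0 = 0, \]
+
+so that $\tilde{\lambda}_g > 0 \iff \pi_0 > 0$, as claimed. Finally, we note that each $\tilde{\lambda}_i$ ($i < g$) is
+typically (much) closer to $\lambda_i$ than to $\lambda_{i+1}$. For, considering the graph of $x \to 1/x$,
+$h\left((\lambda_i + \lambda_{i+1})/2 + \delta(\lambda_i - \lambda_{i+1})/2\right)$ ($-1 < \delta < +1$) is well-approximated by:
+
+\[ 1 - \frac{2 m_i \lambda_i^{2}}{(\lambda_i - \lambda_{i+1})(1 - \delta)} + \frac{2 m_{i+1} \lambda_{i+1}^{2}}{(\lambda_i - \lambda_{i+1})(1 + \delta)} \]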
+
+whose unique zero δ∗ over (−1, 1) is positive whenever, as will typically be the case, mi = mi+1
+(both will usually be one), while (miλi + mi+1λi+1) < 1/2. Indeed, a straightforward analysis
+shows that, for any mi and mi+1, δ∗ = 1 + O(λi) as λi → 0.
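The generic case can be checked numerically. The sketch below uses hypothetical values for the distinct probabilities and their multiplicities, computes the spectrum of I(π) directly, and compares the simple eigenvalues with the zeros of h and with the interlacing pattern. It is an illustration under these assumed values, not part of the original derivation.

```python
import numpy as np

# Hypothetical illustrative values: distinct probabilities and their multiplicities.
lambdas = np.array([0.20, 0.10, 0.05])
mults = np.array([1, 2, 3])
pi_minus0 = np.repeat(lambdas, mults)  # (0.2, 0.1, 0.1, 0.05, 0.05, 0.05)
pi0 = 1.0 - pi_minus0.sum()            # strictly positive, so I(pi) is nonsingular

info = np.diag(pi_minus0) - np.outer(pi_minus0, pi_minus0)
eigenvalues = np.linalg.eigvalsh(info)[::-1]  # sorted from largest to smallest


def h(x):
    """h(x) = 1 + sum_i m_i * lambda_i**2 / (x - lambda_i)."""
    return 1.0 + np.sum(mults * lambdas**2 / (x - lambdas))


# Eigenvalues equal to some lambda_i are the repeated ones (multiplicity m_i - 1);
# the remaining eigenvalues should be the simple zeros of h.
is_repeated = np.array(
    [np.any(np.isclose(ev, lambdas, atol=1e-10, rtol=0.0)) for ev in eigenvalues]
)
simple = eigenvalues[~is_repeated]

print("pi_0:", pi0)
print("repeated eigenvalues:", eigenvalues[is_repeated])
print("simple eigenvalues:", simple)
print("h at simple eigenvalues:", [h(x) for x in simple])
```

With these values, each λi with mi > 1 should appear mi − 1 times, and each simple eigenvalue should fall strictly between consecutive λi.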
+
+2.5. Total Positivity and Local Mixing
+
+Mixture modelling is an exemplar of a major area of statistics in which computational information
+geometry enables distinctive methodological progress. The −1-convex hull of an exponential family is
+of great interest, mixture models being widely used in many areas of statistical science. In particular,
+they are explored further in [? ]. Here, we simply state the main result, a simple consequence of the
+total positivity of exponential families [? ], that, generically, convex hulls are of maximal dimension. In
+this result, “generic” means that the +1 tangent vector, which defines the exponential family, has
+components that are all distinct.
+
+Theorem 1. The −1-convex hull of an open subset of a generic one-dimensional exponential family is of full
+dimension.
+
+Proof. For any (πi) ∈ Δk with each πi > 0, θ0 < · · · < θk and s0 < · · · < sk, let B = (π(θ0), ..., π(θk))
+have general element:
+\[ \pi_i(\theta_j) := \pi_i \exp[s_i\theta_j - \psi(\theta_j)]. \]
+
+Further, let $\tilde{B} = B - \pi(\theta_0) 1_{k+1}^{T}$, whose general column is $\pi(\theta_j) - \pi(\theta_0)$. Then, it suffices to show that
+$\tilde{B}$ has rank $k$. However, using [? ] (p. 33), $\operatorname{Rank}(\tilde{B}) = \operatorname{Rank}(B) - 1$, so that:
+
+\[ \operatorname{Rank}(\tilde{B}) = k \iff B \text{ is nonsingular} \iff B^{*} \text{ is nonsingular}, \]
+
+where $B^{*} = (\exp[s_i\theta_j])$. It suffices, then, to recall [? ] that $K(x, y) = \exp(xy)$ is strictly totally positive (of
+order $\infty$), so that $\det B^{*} > 0$.
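The two facts used in the proof can be illustrated numerically for k = 3. The values of s, θ and the base distribution below are hypothetical; the sketch forms B∗, B and the centred matrix, then evaluates the determinant and the rank.

```python
import numpy as np

# Hypothetical values: strictly increasing statistics s_i and parameters theta_j.
s = np.array([0.0, 0.7, 1.5, 3.0])
theta = np.array([-1.0, 0.0, 0.5, 1.2])
base = np.full(4, 0.25)  # base distribution with every pi_i > 0

b_star = np.exp(np.outer(s, theta))          # B* = (exp[s_i * theta_j])
unnormalised = base[:, None] * b_star
B = unnormalised / unnormalised.sum(axis=0)  # column j is pi(theta_j)
B_tilde = B - B[:, [0]]                      # general column pi(theta_j) - pi(theta_0)

det_b_star = np.linalg.det(b_star)
rank_b_tilde = np.linalg.matrix_rank(B_tilde)

print("det B*:", det_b_star)
print("rank of centred matrix:", rank_b_tilde)
```

A positive determinant and a rank equal to k are what the theorem predicts for this example; a single numerical case is of course only a sanity check, not a proof.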
+
+3. Infinite Dimensional Structure
+
+This section will start to explore the question of whether the simplex structure, which describes
+the finite dimensional space of distributions, can extend to the infinite dimensional case. We examine
+some of the differences with the finite dimensional case, illustrating them with clear, commonly
+occurring examples.
+
+3.1. Infinite Dimensional Information Geometry: A Review
+
+In the previous sections, the underlying computational space is always finite dimensional. This
+section looks at issues related to an infinite dimensional extension of the theory in that paper. There
+is a great deal of literature concerning infinite dimensional statistical models. The discussion here
+concentrates on information geometric, parametrisation and boundary issues.
+The information geometry theory of Amari [? ] has a geometric foundation, where statistical
+models (typically full and curved exponential families) have a finite dimensional manifold structure.
+When considering the extension to infinite dimensional cases, Amari notes the problem of finding an
+“adequate topology” [? ] (p. 93). There has been very interesting work following up this topological
+challenge. By concentrating on distributions with a common support, the paper [? ] uses the geometry
+of a Banach manifold, where local patches on the manifold are modelled by Banach spaces, via
+the concept of an Orlicz space. This gives a structure that is analogous to an infinite dimensional
+exponential family, with mean and natural parameters and including the ability to define mixed
+parametrisations. One drawback of this Banach structure, as pointed out in [? ], is that the likelihood
+function with finite samples is not continuous on the manifold. Fukumizu uses a reproducing kernel
+Hilbert space structure rather than a Banach manifold, which is a stronger topology. There are strong
+connections between the approach taken in [? ] and the material in Section ??; we note two issues here:
+(1) a focus on the finite nature of the data; and (2) using a Hilbert structure defined by a cumulant
+generating function. The approaches differ in that [? ] uses a manifold approach rather than the
+simplicial complex as the fundamental geometric object. There is also other work that explicitly used
+infinite dimensional Hilbert spaces in statistics, a good reference being [? ].
+In this paper, in contrast to previous authors, a simplicial, rather than a manifold-based, approach
+is taken. This allows distributions with varying support, as well as closures of statistical families to be
+included in the geometry. Another difference in approach is the way in which geometric structures
+are induced by infinite dimensional affine spaces rather than by using an intrinsic geometry. This
+approach was introduced by [? ] and extended by [? ]. Spaces of distributions are convex subsets of
+the affine spaces, and their closure within the affine space is key to the geometry.
+In exponential families, the −1-affine structure is often called the mean parametrisation, and
+using moments as parameters is one very important part of modelling. In the infinite dimensional
+case, the use of moments as a parameter system is related to the classical moment problem—when
+does there exist a (unique) distribution whose moments agree with a given sequence?—which has
+generated a vast literature in its own right; see [? ? ? ]. In general terms, the existence of a solution
+to the moment problem is connected to positivity conditions on moment matrices. Such conditions
+have been used in connection to the infinite dimensional geometry of mixture models [? ]. Uniqueness,
+however, is a much more subtle problem: sufficient conditions can be formulated in terms of the rate
+of growth of the moments [? ]. Counter examples to general uniqueness results include the log-normal
+distribution [? ].
+The geometry of the Fisher information is also much more complex in general spaces of
+distributions than in exponential families. Simple mixture models, including two-component mixtures
+of exponential distributions [? ], can have “infinite” expected Fisher information, which gives rise to
+non-standard inference issues. Similar results on infinitely small (and large) eigenvalues of covariance
+operators are also noted in [? ]. Since the Fisher information is a covariance, the fact that it does not
+exist for certain distributions or that its spectrum can be unbounded above or arbitrarily close to zero
+is not a surprise. However, these observations do need to be taken into account when considering the
+information geometry of infinite dimensional spaces.
+The rest of this section looks at the topology and geometry of the infinite dimensional simplex
+and gives some illustrative examples, which, in particular, show the need for specific Hilbert space
+structures, discussed in the final section.
+
+3.2. Topology
+
+For simplicity and concreteness, in this section, we will be looking at models for real valued
+random variables. In this paper, we restrict attention to the cases where the sample space is R+ or R
+and has been discretised to a countably infinite set of bins, Bi, with i ∈ N or Z, respectively. In the
+finite case, the basic object is the standard simplex, Δk, with k + 1 bins. We generalise this to countable
+unions of such objects. Of these, one is of central importance, denoted by Δemp or simply Δ, because it
+is the smallest object that contains all possible empirical distributions.
+
+Definition 2. For any finite subset of bins, indexed by I ⊂ N or Z, denote
+
+\[ \Delta_I = \left\{ x = (x_i)_{i \in I} : x_i \ge 0,\ \sum_{i \in I} x_i = 1 \right\}. \]
+
+We take the union of all such sets, $\bigcup_{|I| < \infty} \Delta_I$, where $|I|$ denotes the number of elements of the index set. This
+can always be written as:
+
+\[ \Delta = \left\{ x = (x_i)_{i \in \mathbb{Z}} : \sum_{i \in \mathbb{Z}} x_i = 1,\ x_i \ge 0 \text{ and only finitely many } x_i > 0 \right\}. \]
+
+In what follows, it is important to note that for any given statistical inference problem, the sample
+size, n, is always finite, even if we frequently use asymptotic approximations, where n → ∞. Thus, the
+data, as represented by the empirical distribution, naturally lie in the space, Δ. However, many models,
+used in the given inference problem, will have support over all bins, so the models most naturally
+lie in the “boundary” constructed using the closures of the set. These objects are subsets of sequence
+spaces, and the corresponding topologies can be constructed from the Banach spaces, ℓp, p ∈ [1, ∞].
+The following results follow directly from explicit calculations, where we note that in this section, since
+all terms are non-negative, convergence always means absolute convergence. In particular, arbitrary
+rearrangements of series do not affect the existence of limits or their values.
+
+Example 6. Consider the sequence of “uniform distributions” $x^{(n)} = (\tfrac{1}{n}, \dots, \tfrac{1}{n}, 0, \dots)$ as elements of Δ. This
+has an ℓp limit of the zero sequence for p ∈ (1, ∞].
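The behaviour in Example 6 follows from the closed form ‖x(n)‖p = n^(1/p − 1), which tends to zero for p > 1 and equals one for p = 1. A minimal sketch, with helper names chosen here for illustration:

```python
import numpy as np


def uniform_point(n):
    """The finitely supported point x^(n) = (1/n, ..., 1/n); trailing zeros are implicit."""
    return np.full(n, 1.0 / n)


def lp_norm(x, p):
    """l_p norm of a finitely supported sequence; p may be np.inf."""
    if np.isinf(p):
        return float(np.max(np.abs(x)))
    return float(np.sum(np.abs(x) ** p) ** (1.0 / p))


for n in (10, 1000, 100000):
    x = uniform_point(n)
    # The l_1 norm stays at one, while the l_2 and l_inf norms shrink with n.
    print(n, lp_norm(x, 1), lp_norm(x, 2), lp_norm(x, np.inf))
```

The ℓ1 column staying at one is consistent with the zero sequence not being an ℓ1 limit point of Δ.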
+
+Proposition 1. The ℓp extreme points of Δ, for p ∈ (1, ∞], are the zero sequence and the sequences, δi (i ∈ Z),
+with one as the i-th element and zero elsewhere.
+
+For p ∈ [1, ∞], let Δp ⊂ ℓp denote the ℓp closure of Δ.
+
+Theorem 2. (a) Δ1 = {x = (xi)i∈Z : xi ≥ 0, ∑i∈Z xi = 1} .
+(b) Δ∞ = {x = (xi)i∈Z : xi ≥ 0, ∑i∈Z xi ≤ 1} .
+(c) For p ∈ (1, ∞), Δp = Δ∞ = {x = (xi)i∈Z : xi ≥ 0, ∑i∈Z xi ≤ 1} .
+
+Proof. (a) It is immediate that $\{x = (x_i)_{i\in\mathbb{Z}} : x_i \ge 0, \sum_{i\in\mathbb{Z}} x_i = 1\} \subseteq \Delta_1$. Conversely, if $\bar{x}$ is a limit
+point, then all its elements must be non-negative. Finally, if $\sum_{i=1}^{\infty} \bar{x}_i$ is not bounded above by one,
+then there exists $N$, such that $\sum_{i=1}^{N} \bar{x}_i > 1 + \epsilon$ for some $\epsilon > 0$. Hence,
+
+\[ \sum_{i=1}^{\infty} |\bar{x}_i - x_i^{(n)}| \ge \sum_{i=1}^{N} |\bar{x}_i - x_i^{(n)}| \ge \sum_{i=1}^{N} \bar{x}_i - \sum_{i=1}^{N} x_i^{(n)} > \epsilon \]
+
+for all $n$, which contradicts convergence. If $\sum_{i=1}^{\infty} \bar{x}_i < 1 - \epsilon$, then
+
+\[ \sum_{i=1}^{\infty} |\bar{x}_i - x_i^{(n)}| \ge \sum_{i=1}^{\infty} x_i^{(n)} - \sum_{i=1}^{\infty} \bar{x}_i > \epsilon, \]
+
+which again contradicts convergence.
+(b) It is again immediate that $\{x = (x_i)_{i\in\mathbb{Z}} : x_i \ge 0, \sum_{i\in\mathbb{Z}} x_i = 1\} \subseteq \Delta_\infty$. However, by Example 6,
+the zero sequence is also in $\Delta_\infty$, so that $\{x = (x_i)_{i\in\mathbb{Z}} : x_i \ge 0, \sum_{i\in\mathbb{Z}} x_i \le 1\} \subseteq \Delta_\infty$.
+Conversely, by contradiction, it is easy to see that all elements of the closure must have non-negative
+elements. Finally, for any $\bar{x} \in \Delta_\infty$, if $\sum_{i=1}^{\infty} \bar{x}_i$ is not bounded above by one, there exists $N$, such
+that $\sum_{i=1}^{N} \bar{x}_i > 1 + \epsilon$ for some $\epsilon > 0$. For any sequence of points, $x^{(n)}$ in $\Delta$, we have that $\sum_{i=1}^{N} x_i^{(n)} \le 1$,
+so that, for $i = 1, \dots, N$, the maximum value of $|x_i^{(n)} - \bar{x}_i| > \epsilon/N$. Hence, for all sequences, $x^{(n)}$, we
+have $\|x^{(n)} - \bar{x}\|_\infty > \epsilon/N$, which contradicts $\bar{x}$ being in the closure.
+(c) This follows essentially the same argument as (b) by noting in the case where $\sum_{i=1}^{\infty} \bar{x}_i$ is not
+bounded above by one, we have:
+
+\[ \|x^{(n)} - \bar{x}\|_p^p \ge \sum_{i=1}^{N} |\bar{x}_i - x_i^{(n)}|^p \ge N \max_{i=1,\dots,N} |x_i^{(n)} - \bar{x}_i|^p > N^{1-p}\epsilon^p \]
+
+for any sequences, $x^{(n)}$, which contradicts $\bar{x}$ being in the closure.
+
+It is immediate that the spaces, Δ and Δ1, are convex subsets of ℓ1 and that Δ∞ is a convex set in
+ℓ∞.
+
+3.3. Geometry
+
+In the same way as for the finite case, the −1-geometry can be defined using an affine space
+structure using the following definition.
+
+Definition 3. Let I be a countable index set which is a subset of Z. The −1-affine space structure over
+distributions is (Xmix, Vmix, +), where:
+
+\[ X_{mix} = \left\{ x = (x_i) \,\Big|\, \sum_I x_i = 1,\ \sum_I |x_i| < \infty \right\}, \quad V_{mix} = \left\{ v = (v_i) \,\Big|\, \sum_I v_i = 0,\ \sum_I |v_i| < \infty \right\}, \]
+
+and $x + v = (x_i + v_i)$.
+
+In order to define the +1-geometric structure, we also follow the approach used in the finite case.
+Initially, to understand the +1- structure, consider the case where all distributions have a common
+support, i.e., assume πi > 0 for all i. We follow here the approach of [? ].
+
+Definition 4. Consider the set of non-negative measures on N or Z and the equivalence relation defined by:
+
+{ai} ∼ {bi} ⇐⇒ ∃λ > 0 s.t. ∀i ai = λbi.
+
+The equivalence classes of this relation are the points in the +1 geometry.
+These points can be further partitioned into sets with the same support, i.e., supp(⟨a⟩) = {i : ai > 0},
+where this is clearly well-defined.
+
+On sets of +1-points with the same support, we can define the +1-geometry in the same way as
+in the finite case. With “exp” connoting an exponential family distribution, we have:
+
+Definition 5. For a given index set, I, define Xexp to be all +1-points whose support equals I. Then Xexp, together
+with the vector space Vexp = {vi, i ∈ I} and the operation, ⊕, defined by:
+
+\[ \langle x_i \rangle \oplus v_i = \langle x_i \exp(v_i) \rangle, \]
+
+is an affine space. The +1-affine structure is then defined by (Xexp, Vexp, ⊕).
+
+Theorem 3. If $a$ and $b$ lie in $\Delta$ (or $\Delta_1$) and have the same support, then $C(\rho) = \sum_i a_i^{\rho} b_i^{(1-\rho)} < \infty$ for
+$\rho \in [0, 1]$. Hence, $\left( a_i^{\rho} b_i^{(1-\rho)} / C(\rho) \right) \in \Delta$ (or $\Delta_1$).
+
+Proof. Since a, b are absolutely convergent, the sequence, max(ai, bi), is also. Since we have:
+
+\[ 0 \le \min(a_i, b_i) \le a_i^{\rho} b_i^{1-\rho} \le \max(a_i, b_i), \]
+
+it follows that C(ρ) < ∞, and we have the result.
+
+This result shows that sets in Δ1 with the same support are +1-convex, just as the faces in the
+finite case are.
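The +1-convexity in Theorem 3 can be sketched numerically for two distributions with common support. The example below uses a discretised standard normal and a discretised Cauchy, truncated to [−6, 6] with 120 bins so that every bin probability is representable; the truncation, the grid and the function names are assumptions made for illustration.

```python
from math import atan, erf, pi, sqrt

import numpy as np


def bin_probabilities(cdf, edges):
    """Normalised bin masses of a continuous distribution on the given bin edges."""
    values = np.array([cdf(e) for e in edges])
    mass = np.diff(values)
    return mass / mass.sum()


def geometric_path(a, b, rho):
    """Return the point a_i^rho * b_i^(1-rho) / C(rho) and the constant C(rho)."""
    weights = a**rho * b ** (1.0 - rho)
    c = weights.sum()
    return weights / c, c


edges = np.linspace(-6.0, 6.0, 121)  # assumption: truncate to [-6, 6] with 120 bins
a = bin_probabilities(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))), edges)  # normal
b = bin_probabilities(lambda x: 0.5 + atan(x) / pi, edges)                # Cauchy

for rho in np.linspace(0.0, 1.0, 5):
    point, c = geometric_path(a, b, rho)
    print(round(float(rho), 2), c, point.sum())
```

For ρ in [0, 1] the constant C(ρ) is finite and, by Hölder's inequality, does not exceed one, so each normalised point is again a probability vector on the same support.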
+
+3.4. Examples
+
+In order to get a sense of how the +1-geometry works, let us consider a few illustrative examples.
+
+Example 7. If we denote the discretised standard normal density by a and the discretised Cauchy density by b
+and consider the path:
+
+\[ \frac{a_i^{\rho} b_i^{(1-\rho)}}{C(\rho)}, \]
+
+the normalising constant is shown in Figure 6. We see that at ρ = 0 (the Cauchy distribution), we have that
+the derivative of the normalising constant (i.e., the mean of the sufficient statistic) is tending to infinity. At the
+other end (ρ = 1), the model can be extended in the sense that the distribution exists for values greater than one.
+
+Figure 6. Normalising constant for normal-Cauchy exponential mixing example.
+
+Thus, in this example, the path joining the two distributions is an extended, rather than natural,
+exponential family, since we have to include the boundary point where the mean is unbounded.
+
+Example 8. Let us return to Example ??, but now without the censoring. Thus, now, there is a countably
+infinite set of bins, and so, we can investigate its embedding in the infinite simplex. As discussed in [? ], we shall
+discretise the continuous distribution by computing the probabilities associated to bins [ci, ci+1], i = 1, 2, · · · .
+For the exponential model, Exp(θ), the bin probabilities are simply:
+
+πi(θ) = exp(−θci) − exp(−θci+1).
+
+Using this, the model will lie in the infinite simplex on the positive half line with the index set I = N.
+First, consider the case where we have a uniform choice of discretisation, where cn = n × ϵ for some fixed,
+ϵ > 0. In this case, the bin probabilities can be written as an exponential family:
+
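+\[ \pi_n(\theta) = \exp\left( -\theta\epsilon n + \log(1 - e^{-\theta\epsilon}) \right) \]
+
+for θ > 0. This gives a +1-geodesic through {πi(θ0)} in the direction {ϵ × n} of the form:
+
+\[ \pi_n(\theta_0) \exp\left( -\lambda\epsilon n + \log\left( \frac{1 - e^{-(\lambda+\theta_0)\epsilon}}{1 - e^{-\theta_0\epsilon}} \right) \right) \tag{5} \]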
+
+for λ > −θ0. In the case where λ → −θ0, the limiting distribution is the zero measure in Δ∞, and at the
+other extreme, where λ → ∞, the limiting distribution is the atomic distribution in the first bin, a distribution
+with a different support than πi(θ0). However, unlike the finite case, there is no guarantee that, for a given
+“direction”, {ti}, there exists a +1-geodesic starting at {πi(θ0)}, since we require the convergence of the
+normalising constant:
+
+\[ \sum_{i=0}^{\infty} \pi_i(\theta_0) \exp(\lambda t_i) < \infty. \]
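The formulas in Example 8 can be cross-checked numerically. The sketch below compares the difference form of the bin probabilities with the exponential-family form and confirms that the point (5) on the +1-geodesic coincides with the bin probabilities at θ0 + λ. The parameter values, the truncation of the index set and the convention that n starts at zero are assumptions made for illustration.

```python
import numpy as np


def bin_probability(theta, eps, n):
    """pi_n(theta) = exp(-theta * c_n) - exp(-theta * c_{n+1}) with c_n = n * eps."""
    return np.exp(-theta * eps * n) - np.exp(-theta * eps * (n + 1))


def bin_probability_exp_family(theta, eps, n):
    """The same probability written in exponential-family form."""
    return np.exp(-theta * eps * n + np.log(1.0 - np.exp(-theta * eps)))


def geodesic_point(theta0, lam, eps, n):
    """Point (5) on the +1-geodesic through pi(theta0); requires lam > -theta0."""
    log_ratio = np.log(
        (1.0 - np.exp(-(lam + theta0) * eps)) / (1.0 - np.exp(-theta0 * eps))
    )
    return bin_probability(theta0, eps, n) * np.exp(-lam * eps * n + log_ratio)


theta0, eps = 1.0, 0.1  # hypothetical rate and bin width
n = np.arange(0, 2000)  # truncation of the countable index set, for illustration

print(bin_probability(theta0, eps, n).sum())
print(np.max(np.abs(geodesic_point(theta0, 0.5, eps, n) - bin_probability(1.5, eps, n))))
```

Moving along the geodesic by λ is therefore the same as shifting the rate from θ0 to θ0 + λ, which is why the path is only defined for λ > −θ0.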
+
+From this example, we see that the limit points of exponential families can lie in the space, Δ∞,
+but not in Δ1. The next example shows that limits do not have to exist at all.
+
+Example 9. Consider the family whose bin probabilities, πi ∈ Δ∞, are proportional to a discretised standard
+normal with bins of constant width. The exponential family, which is proportional to πi exp(θi), does not have
+an ℓ∞ limit, as it is discretised normal with mean θ. The natural parameter space here is (−∞, ∞).
+
+The last illustrative example is from [? ] and shows that even for simple models, the Fisher
+information for the parameters of interest need not be finite.
+
+Example 10. Let us consider a simple example of a two-component mixture of (discretised) exponential distributions:
+
+\[ (1 - \rho)\pi_n(\theta_0 + \lambda) + \rho\pi_n(\theta_0); \tag{6} \]
+
+the tangent vector in the ρ-direction is:
+
+\[ \pi_n(\theta_0) - \pi_n(\theta_0 + \lambda) = \pi_n(\theta_0) \left( 1 - e^{-\lambda\epsilon n} C \right) \]
+
+for a positive constant, C. The corresponding squared length, with respect to the Fisher information, is:
+
+\[ \sum_{n=0}^{\infty} \left( 1 - e^{-\lambda\epsilon n} C \right)^{2} \pi_n(\theta_0). \]
+
+As an example, consider θ0 = 1; then, this term will be infinite for λ ≤ −0.5.
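The threshold λ ≤ −0.5 can be seen from the terms of the series: for large n they behave like a constant times exp(−(θ0 + 2λ)ϵn), which fails to decay once θ0 + 2λ ≤ 0. The sketch below evaluates the terms for one convergent and one divergent choice of λ. The bin width and the explicit form of C, taken here as the ratio of the normalising factors of the two exponential models, are assumptions for illustration.

```python
import numpy as np


def squared_length_terms(theta0, lam, eps, n):
    """Terms (1 - C * exp(-lam * eps * n))**2 * pi_n(theta0) of the squared length."""
    pi_n = np.exp(-theta0 * eps * n) * (1.0 - np.exp(-theta0 * eps))
    # Assumed form of the positive constant C: ratio of the two normalising factors.
    c = (1.0 - np.exp(-(theta0 + lam) * eps)) / (1.0 - np.exp(-theta0 * eps))
    return (1.0 - c * np.exp(-lam * eps * n)) ** 2 * pi_n


theta0, eps = 1.0, 0.1  # eps is a hypothetical bin width
n = np.arange(0, 2001)

for lam in (-0.4, -0.5):
    terms = squared_length_terms(theta0, lam, eps, n)
    # For lam = -0.4 the terms decay geometrically; for lam = -0.5 they level off,
    # so the partial sums keep growing with the truncation point.
    print(lam, terms[1000], terms[2000], terms.sum())
```

Extending the range of `n` leaves the sum for λ = −0.4 essentially unchanged, while the sum for λ = −0.5 grows roughly linearly in the number of terms.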
+
+3.5. Hilbert Space Structures
+
+Following these examples, we can consider the Hilbert space structure of exponential families
+inside the infinite simplex with the following results.
+
+Definition 6. Define the function, S(·), by $S(\{v_i\}, \pi) = \sup_\theta \{\theta \mid \sum_I \pi_i \exp(\theta v_i) < \infty\}$, the function being
+set to ∞ when the set is unbounded. Furthermore, define for a given $\{\pi_i\} \in \bar{\Delta}_\infty$, the sets:
+
+\[ V(\pi) = \{\{v_i\} \mid S(\{v_i\}, \pi) > 0\}, \quad \text{and} \quad V_c(\pi) = \{\{v_i\} \mid \pm\{v_i\} \in V(\pi)\}. \]
+
+The spaces, Vc(π), correspond to the directions in which the +1-geodesic and, so, the
+corresponding exponential families are well-defined and have particularly “nice” geometric structures.
+
+Theorem 4. For π, define a Hilbert space by:
+
+\[ H(\pi) := \left\{ \{v_i\} \,\Big|\, \sum v_i^{2} \pi_i < \infty \right\} \]
+
+with inner product:
+
+\[ \langle \{v_i\}, \{w_i\} \rangle_{\pi} = \sum v_i w_i \pi_i, \]
+
+and corresponding norm $\|\cdot\|_{\pi}$. Under these conditions:
+(i) Vc(π) is a subspace of H(π), and
+(ii) the set V(π) is a convex cone.
+
+Proof. (i) First, if {vi} ∈ Vc(π), then by definition, the moment generating function:
+
+\[ \sum \exp(\theta v_i)\pi_i, \]
+
+is finite for θ in an open set containing θ = 0. Hence, we have both:
+
+\[ \sum v_i\pi_i < \infty, \quad \text{and} \quad \sum v_i^{2}\pi_i < \infty. \]
+
+Thus, {vi} ∈ H(π). The fact that it is a subspace follows from (ii) below.
+(ii) It is immediate that V(π) is a cone.
+Convexity follows from the Cauchy–Schwarz inequality, since for all $\{v_i\}, \{v_i^{*}\} \in V(\pi)$ and
+λ ∈ [0, 1], it follows that:
+
+\[ \left( \sum \pi_i e^{\frac{\theta}{2}(\lambda v_i + (1-\lambda)v_i^{*})} \right)^{2} = \left( \sum \left( \sqrt{\pi_i}\, e^{\frac{\theta}{2}\lambda v_i} \right) \left( \sqrt{\pi_i}\, e^{\frac{\theta}{2}(1-\lambda)v_i^{*}} \right) \right)^{2} \le \left( \sum \pi_i e^{\theta\lambda v_i} \right) \left( \sum \pi_i e^{\theta(1-\lambda)v_i^{*}} \right), \]
+
+and, so, is finite for a strictly positive value of θ, hence $\{\lambda v_i + (1-\lambda)v_i^{*}\} \in V(\pi)$.
+
+Hence, this result illustrates the point above regarding the existence of “nice” geometric structure
+in the sense of Amari’s information geometry developed for finite dimensional exponential families.
+Infinite dimensional families have a richer structure; for example, they include the possibility of having
+an infinite Fisher information; see Examples ?? and ??.
+
+Acknowledgments: The authors would like to thank Karim Anaya-Izquierdo and Paul Vos for many helpful
+discussions and the UK’s Engineering and Physical Sciences Research Council (EPSRC) for the support of grant
+number EP/E017878/.
+
+Author Contributions: All authors contributed to the conception and design of the study, the collection and
+analysis of the data and the discussion of the results. All authors read and approved the final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1. Efron, B. Defining the curvature of a statistical problem (with applications to second order efficiency). Ann. Stat. 1975, 3, 1189–1242.
+2. Kass, R.E.; Vos, P.W. Geometrical Foundations of Asymptotic Inference; John Wiley & Sons: London, UK, 1997.
+3. Amari, S.-I. Differential-Geometrical Methods in Statistics; Lecture Notes in Statistics; Springer-Verlag Inc.: New York, NY, USA, 1985; Volume 28.
+4. Anaya-Izquierdo, K.; Critchley, F.; Marriott, P.; Vos, P. Computational Information Geometry: Foundations. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8085, pp. 311–318.
+5. Anaya-Izquierdo, K.; Critchley, F.; Marriott, P.; Vos, P. Computational Information Geometry: Mixture Modelling. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8085, pp. 319–326.
+6. Anaya-Izquierdo, K.; Critchley, F.; Marriott, P. When are first order asymptotics adequate? A diagnostic. Stat 2014, 3, 17–22.
+7. Murray, M.K.; Rice, J.W. Differential Geometry and Statistics; Chapman & Hall: London, UK, 1993.
+8. Marriott, P. On the local geometry of mixture models. Biometrika 2002, 89, 95–97.
+9. Hand, D.J.; Daly, F.; Lunn, A.D.; McConway, K.J.; Ostrowski, E. A Handbook of Small Data Sets; Chapman and Hall: London, UK, 1994.
+10. Bryson, M.C.; Siddiqui, M.M. Survival times: Some criteria for aging. J. Am. Stat. Assoc. 1969, 64, 1472–1483.
+11. Marriott, P.; West, S. On the geometry of censored models. Calcutta Stat. Assoc. Bull. 2002, 52, 567–576.
+12. Geiger, D.; Heckerman, D.; King, H.; Meek, C. Stratified exponential families: Graphical models and model selection. Ann. Stat. 2001, 29, 505–529.
+13. Edelsbrunner, H. Algorithms in Combinatorial Geometry; Springer-Verlag: New York, NY, USA, 1987.
+14. Barndorff-Nielsen, O.E.; Cox, D.R. Asymptotic Techniques for Use in Statistics; Chapman & Hall: London, UK, 1989.
+15. McCullagh, P. Tensor Methods in Statistics; Chapman & Hall: London, UK, 1987.
+16. Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985.
+17. Karlin, S. Total Positivity; Stanford University Press: Stanford, CA, USA, 1968; Volume I.
+18. Householder, A.S. The Theory of Matrices in Numerical Analysis; Dover Publications: Dover, DE, USA, 1975.
+19. Pistone, G.; Rogantin, M.P. The exponential statistical manifold: Mean parameters, orthogonality and space transformations. Bernoulli 1999, 5, 571–760.
+20. Fukumizu, K. Infinite dimensional exponential families by reproducing kernel Hilbert spaces. In Proceedings of the 2nd International Symposium on Information Geometry and its Applications, Tokyo, Japan, 12–16 December 2005.
+21. Small, C.G.; McLeish, D.L. Hilbert Space Methods in Probability and Statistical Inference; John Wiley & Sons: London, UK, 1994.
+22. Akhiezer, N.I. The Classical Moment Problem; Hafner: New York, NY, USA, 1965.
+23. Stoyanov, J.M. Counter Examples in Probability; John Wiley & Sons: London, UK, 1987.
+24. Gut, A. On the moment problem. Bernoulli 2002, 8, 407–421.
+25. Lindsay, B.G. Moment matrices: Applications in mixtures. Ann. Stat. 1989, 17, 722–740.
+26. Li, P.; Chen, J.; Marriott, P. Non-finite Fisher information and homogeneity: An EM approach. Biometrika 2009, 96, 411–426.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+Using Geometry to Select One Dimensional
+Exponential Families That Are Monotone Likelihood
+Ratio in the Sample Space, Are Weakly Unimodal and
+Can Be Parametrized by a Measure of
+Central Tendency
+
+Paul Vos 1 and Karim Anaya-Izquierdo 2,*
+
+1 Department of Biostatistics, East Carolina University, Greenville, NC 27858, USA; E-Mail: vosp@ecu.edu
+2 Department of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK
+* E-Mail: kai21@bath.ac.uk; Tel: +44-1225-384644
+
+Received: 30 April 2014; in revised form: 30 June 2014 / Accepted: 14 July 2014 /
+Published: 18 July 2014
+
+Abstract: One dimensional exponential families on finite sample spaces are studied using the
+geometry of the simplex $\Delta^{\circ}_{n-1}$ and that of a transformation $V_{n-1}$ of its interior. This transformation is
+the natural parameter space associated with the family of multinomial distributions. The space Vn−1
+is partitioned into cones that are used to find one dimensional families with desirable properties for
+modeling and inference. These properties include the availability of uniformly most powerful tests
+and estimators that exhibit optimal properties in terms of variability and unbiasedness.
+
+Keywords: simplex; cone; exponential family; monotone likelihood ratio; unimodal; duality
+
+1. Introduction
+
+The motivation for the constructions in this paper begins with a sample from a one dimensional
+space that is discrete. We allow for a continuous sample space but assume that this has been suitably
+discretized into n bins. The simplest underlying structure for the probability assigned to these bins is
+given by the multinomial distribution. The collection of all multinomial distributions can be identified
+with the n − 1 simplex Δn−1. We use the geometry of the simplex along with a transformation of its
+interior $\Delta^{\circ}_{n-1}$ to search for one dimensional subspaces that have good properties for modeling and for
+inference. In particular, we want families that can be parameterized by the mean, have only unimodal
+distributions, have desirable test characteristics (such as providing uniformly most powerful unbiased
+tests) and estimation properties (such as unbiasedness and small variability).
+The boundary of the (n − 1) dimensional simplex Δn−1 can be written as the union of simplexes
+of dimension (n − 2). This process can be repeated on the simplexes of lower dimension until the
+boundary consists of the vertices of the original simplex. This construction has statistical relevance
+to the possible supports for the probability distributions considered on the n bins. We obtain a dual
+decomposition for a transformation Vn−1 (defined in Equation (5) in Section 5) of $\Delta^{\circ}_{n-1}$; it is dual in
+that the result can be obtained by replacing simplexes with cones. The statistical relevance of the
+conical decomposition is to the possible modes for all the distributions on the n bins. Since Vn−1 is
+the natural parameter space for the distributions in $\Delta^{\circ}_{n-1}$, one dimensional exponential families are
+lines in Vn−1 and these can be related to the cones that partition Vn−1. One result is that the limiting
+distribution for any one dimensional exponential family in $\Delta^{\circ}_{n-1}$ is the uniform distribution whose
+support is determined by the cone that contains the limiting values of the line corresponding to the
+exponential family.
+
+Entropy 2014, 16, 4088–4100; doi:10.3390/e16074088
+www.mdpi.com/journal/entropy
+
+While one parameter exponential families can be defined quite generally by choosing a sufficient
+statistic, it can be useful to start with the sufficient statistics from well-known families such as the
+binomial, Poisson, negative binomial, normal, inverse Gaussian, and Gamma distribution. These
+exponential families have good modeling and inferential properties that we try to maintain by limiting
+the extent to which the sufficient statistic is modified. These restrictions lead to considering vectors in
+Vn−1 that lie in a cone. Examples of how to construct these cones are given.
+
+2. Motivating Examples
+
+One dimensional exponential families such as the binomial or Poisson are the workhorses of
+parametric inference because of their excellent statistical properties. However, being one dimensional
+means they do not always fit data very well, so an extension to a two (or higher) dimensional
+exponential family can be pursued in order to preserve the nice inferential structure. An issue
+with such an extension is that, for each extra natural parameter added, we need to choose a new sufficient
+statistic and this choice can substantially change the shape of the corresponding density functions. For
+example, densities can pass from being unimodal to having multiple modes for some parameter values.
+To see this, consider the following examples.
+
+Example 1. Altham [1] considered the so-called multiplicative generalization of the binomial distribution with
+corresponding density
+
+\[ f(x; p, \phi) = \binom{n}{x} p^{x} (1 - p)^{n - x} \phi^{x(x - n)} / C(p, \phi) \tag{1} \]
+
+where C is the normalizing constant and where clearly the binomial is recovered when φ = 1.
+By reparametrizing using θ1 = log(p/(1 − p)) and θ2 = log(φ) this density can be expressed in
+exponential form as
+
+\[ f(x; \theta_1, \theta_2) = h(x) \exp\bigl(\theta_1 x + \theta_2 T(x) - K(\theta_1, \theta_2)\bigr) \tag{2} \]
+
+where T(x) = x(x − n) is the added sufficient statistic and $h(x) = \binom{n}{x}$, where dependence on n has been ignored.
+Note that the same family is obtained if $T(x) = x^2$ is added as a sufficient statistic instead of x(x − n).
+If n = 127 and (θ1, θ2) = (−0.0122, 0.018) then density (2) is bimodal as shown in the left panel of Figure 1.
+The mean μ of this distribution is 50. Also plotted is the corresponding binomial density with the same mean
+or equivalently with θ1 = log(50/(127 − 50)) = −0.4318 and θ2 = 0.
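+
+The step from density (1) to the exponential form (2) can be spelled out as a short check. This is only a sketch that follows from the definitions of θ1 and θ2 above; the explicit expression for K is not stated in the text and is derived here for illustration.
+
+```latex
+\begin{align*}
+\binom{n}{x} p^{x}(1-p)^{n-x}\phi^{x(x-n)} / C(p,\phi)
+ &= \binom{n}{x}\exp\Bigl(x\log\tfrac{p}{1-p} + x(x-n)\log\phi + n\log(1-p) - \log C(p,\phi)\Bigr),\\
+K(\theta_1,\theta_2) &= \log C(p,\phi) - n\log(1-p),\\
+\theta_1 x + \theta_2\,x(x-n) &= (\theta_1 - n\theta_2)\,x + \theta_2\,x^{2}.
+\end{align*}
+```
+
+The last line indicates why using $x^2$ in place of x(x − n) gives the same family: the two choices differ only by a linear shift of the first natural parameter. For the binomial with mean 50 and n = 127 we have p = 50/127, so that θ1 = log(50/77) ≈ −0.4318.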
+
+Figure 1. Binomial density (thick in both panels). Multiplicative binomial density (left panel and thin)
+and double binomial density (right panel and thin). All densities have the same mean μ = 50 and
+n = 127. Variance of the multiplicative and double binomial densities is equal.
+
+Entropy 2014, 16, 4088–4100
+
+As explained by Lovison [2], this distribution has the feature of being under- or over-dispersed with
+respect to the binomial depending on θ2 being negative or positive, respectively. Furthermore, using the mixed
+parametrization (μ, θ2) (see [3] for details) it is easy to see that this distribution can be parametrized so that one
+parameter controls dispersion independently of the mean. In fact, for a fixed mean μ, as θ2 → ∞ f (x; θ1, θ2)
+tends to a two point distribution (with support points at the extremes x = 0 and x = n), and it tends to a degenerate
+distribution on x = μ when θ2 → −∞.
+
+Example 2. Double exponential families [4] are two parameter exponential families that extend standard
+unidimensional exponential families such as the binomial and the Poisson. Similar to the multiplicative binomial
+in Example 1, the extra parameter involved in double exponential families controls the variance independently of
+the mean. The density for the so-called double binomial family can be written in the form (2) with
+
+\[ T(x) = x \log\left(\frac{x}{n}\right) + (n - x) \log\left(1 - \frac{x}{n}\right), \]
+
+$h(x) = \binom{n}{x}$ and with the particular restriction that θ2 < 1 (see [4] for details). The range θ2 < 0 generates
+underdispersion and θ2 ∈ [0, 1) generates overdispersion with respect to the binomial. As shown on the right
+panel of Figure 1, the double binomial density can also be multimodal where the double binomial density shown
+has the same mean and variance as the multiplicative binomial shown in the left panel.
+
+These examples show that while extending exponential families can lead to useful modeling
+properties such as overdispersion, the extension can also result in distributions that are not suitable
+for modeling. We are interested in the relationship between geometric properties of one dimensional
+families and the modeling properties of their distributions.
+
+3. Sample Space and Distribution-valued Random Variables
+
+We consider first the general case where the sample space for a single observation X1 consists of
+n bins
+Sn = {B1, B2, . . . , Bn−1, Bn} .
+
+We consider the space of all probability distributions P on this sample space Sn. Each probability
+distribution in P is defined by the n-tuple p whose ith component is
+
+pi = Pr(Bi)
+
+so that P can be identified with the n − 1 simplex
+
+Δn−1 = {p ∈ Rn : pi ≥ 0 ∀i, 1′p = 1}
+
+where 1 in 1′p is the vector 1 ∈ Rn each of whose components is 1. We will slightly abuse the notation
+by using p to name a point in Δn−1, and hence in Rn, as well as the corresponding distribution in P.
+The sample space for a random sample of size N from a distribution p0 ∈ Δn−1 is
+
+\[ \mathcal{X}_n^N = \{ x : x \text{ is an } n \text{ vector of nonnegative integers that sum to } N \}. \]
+
+There is a simple relationship between $\mathcal{X}_n^N$ and the simplex that we obtain by dividing each component
+of x by N. Although the sample space $\mathcal{X}_n^N$ can be viewed as formed by compositional data, we will
+follow a different approach to handle this kind of data compared with the classical approach described
+by Aitchison [5] because the data we consider have additional structure.
+In Figure 2 the sample space for the sample of size N = 10 is displayed using open circles. The
+vertices correspond to the case where all 10 values fall in a single bin. The other points correspond
+to the less extreme cases. Let p0 be any point in Δn−1. By mapping the multinomial random variable
+of counts X to Δn−1, we obtain the random distribution $\widehat{P} = X/N$ whose values are multinomial
+distributions each having number of cases N and probability vector X/N. Identifying $\mathcal{X}_n^N$-valued
+random variables with distribution-valued random variables provides a natural means for comparing
+data with probability models using the Kullback–Leibler (KL) divergence.
+We can compare distributions in Δn−1 using the KL divergence $D : P \times P \mapsto \mathbb{R}$
+
+D(p1, p2) = ∑ p1 log (p1/p2) = H(p1, p2) − H(p1)
+
+where H(p1, p2) = − ∑ p1 log(p2) and H(p1) = H(p1, p1) is the entropy of p1. Note that the arguments
+to D and H are distributions while the logarithm and ratios are defined on points in Rn. Following Wu
+and Vos [6], the variance of the random distribution $\widehat{P}$ is defined to be
+
+\[ \mathrm{Var}_{p_0}(\widehat{P}) = \min_{p \in \Delta_{n-1}} E_{p_0} D(\widehat{P}, p) \]
+
+and its mean is defined to be
+
+\[ E_{p_0}(\widehat{P}) = \arg\min_{p \in \Delta_{n-1}} E_{p_0} D(\widehat{P}, p). \]
+
+Note that the expectations on the right hand side of the equations above are for real-valued random
+variables while the expectation on the left hand side of the second equation is for a distribution-valued
+random variable.
+
+Figure 2. Simplex for n = 3 bins and sample space for N = 10 observations.
+
+It is not difficult to show that $E_{p_0} \widehat{P} = p_0$ so that $\widehat{P}$ can be considered an unbiased estimator for
+p0. Details are in [6], which also shows that the KL risk can be decomposed into bias-squared and
+variance terms:
+
+\[ E_{p_0} D(\widehat{P}, q) = D(p_0, q) + \mathrm{Var}_{p_0}(\widehat{P}). \]
+
+The distributional variance is related to the entropy
+
+\[ \mathrm{Var}_{p_0}(\widehat{P}) = E_{p_0} D(\widehat{P}, p_0) = H(p_0) - E_{p_0} H(\widehat{P}). \]
+
+Note that for N = 1, $H(\widehat{P}) = 0$ so that for a single observation the random distribution $\widehat{P}$ taking values
+on the vertices of Δn−1 has variance equal to the entropy of p0.
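+
+The decomposition of the KL risk can be traced in a few lines. The following is a sketch rather than the argument of [6]; it assumes only that the componentwise expectation of the random distribution equals p0 and the convention 0 log 0 = 0, and it writes the random distribution as $\widehat{P}$.
+
+```latex
+\begin{align*}
+E_{p_0} D(\widehat{P}, q)
+ &= E_{p_0}\Bigl[\sum \widehat{P}\log\widehat{P}\Bigr] - \sum E_{p_0}[\widehat{P}]\,\log q \\
+ &= -E_{p_0} H(\widehat{P}) + H(p_0, q) \\
+ &= \bigl(H(p_0, q) - H(p_0)\bigr) + \bigl(H(p_0) - E_{p_0} H(\widehat{P})\bigr) \\
+ &= D(p_0, q) + \bigl(H(p_0) - E_{p_0} H(\widehat{P})\bigr).
+\end{align*}
+```
+
+Since D(p0, q) ≥ 0 with equality only at q = p0, the minimum over q is attained at p0, which gives both the mean p0 and the variance H(p0) − E H of the random distribution. As a small instance, for N = 1, two bins and p0 = (1/2, 1/2), the variance equals H(p0) = log 2.
+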
+For inference, p0 is unknown but we specify a subspace M ⊂ Δn−1 that contains p0, or at
+least has distributions that are not too different from p0. Estimates can be obtained by choosing a
+parameterization for M, say θ, and then considering real-valued functions $\hat{\theta}$ and evaluating these in
+terms of bias and variance. Bias and variance are useful descriptions when θ describes a feature of
+the distribution that is of inherent interest. However, if θ is simply a parameterization, or if there are
+other features that are also of interest, then these quantities are less useful. For inference regarding the
+distribution p0 we can use a distribution-valued estimator $\widehat{P}_M$ where the subscript indicates that the
+estimator is defined to account for the fact that p0 ∈ M.
+We will not pursue the details of distribution-valued estimators here; we mention these only
+because all the subspaces we consider will be exponential families and in this case the maximum
+likelihood estimator has important properties in terms of distribution variance and distribution bias:
+when M is an exponential family, the maximum likelihood estimator is distribution unbiased, and it
+uniquely minimizes the distribution variance among the class of all distribution unbiased estimators.
+Furthermore, when p0 ∉ M then the maximum likelihood estimator is the unique unbiased minimum
+distribution variance estimator of the distribution in M that is closest (in terms of KL) to p0. Extensions
+of one dimensional exponential families that do not result in exponential families will not enjoy these
+properties of maximum likelihood estimation. Details of these results that hold for sample spaces more
+general than Sn are in [7].
+
+4. Simplices Δs
+
+One dimensional exponential families on Sn are curves in Δn−1 whose properties will depend on
+their location within various subspaces of Δn−1. An important collection of subspaces will be indexed
+by the subsets of Sn. For notational convenience we take Bi to be the integer i. Using integers is suggestive
+of an ordering and a scale structure but at this point these are only being used to indicate distinct bins.
+For each s ⊂ Sn,
+
+\[ \Delta_s = \{ p \in \mathbb{R}^n : p_i \ge 0 \ \forall i \in s,\ p_i = 0 \ \forall i \in s^c,\ 1'p = 1 \} \]
+
+where $s^c = \{ i \in S_n : i \notin s \}$. Note that $\Delta_{S_n} = \Delta_{n-1}$. The interior of $\Delta_s$ is
+
+\[ \Delta_s^\circ = \{ p \in \Delta_s : p_i > 0 \ \forall i \in s \}. \]
+
+As probability distributions in P, $\Delta_s^\circ$ corresponds to the set of all distributions having support s. There
+is a simple and obvious relationship between the dimension of $\Delta_s$, $|\Delta_s|$, and the cardinality of s, |s|,
+which holds for all nonempty s ⊂ Sn
+
+\[ |\Delta_s| + 1 = |s|. \]
+
+The boundary of $\Delta_s$ is defined as
+
+\[ \partial\Delta_s = \{ p \in \Delta_s : p \notin \Delta_s^\circ \} \]
+
+so that
+
+\[ \Delta_s = \Delta_s^\circ \uplus \partial\Delta_s \]
+
+where ⊎ indicates the sets in the union are disjoint. The boundary $\partial\Delta_s$ can be written as the union of
+all simplices of dimension one less than that of $\Delta_s$
+
+\[ \partial\Delta_s = \bigcup_{s' : s' \subset s,\ |s'| = |s| - 1} \Delta_{s'} \tag{3} \]
+
+This boundary property for $\Delta_s$ holds because the simplex Sn consists of all possible subsets. Each
+nonempty s ⊂ Sn specifies one of the possible supports for distribution P ∈ Pn
+
+\[ \Delta_s = \bigcup_{s' : s' \subset s} \Delta_{s'}^\circ \tag{4} \]
+
+where we set Δ∅ = ∅.
+
+5. Cones Λs
+
+The set of all nonempty subsets of the sample space provides a partition of Δn−1 based on the
+support of the distributions in P. The elements in the partition are simplices whose dimension is one
+less than the cardinality of the indexing set. In most cases we will consider models having support
+Sn, that is, models corresponding to $\Delta_{n-1}^\circ$. If we use subsets s to define the mode rather than support,
+we obtain a partition of P◦, the distributions in P having support Sn. This partition can be expressed
+using convex cones in an n − 1 dimensional plane Vn−1. The dimension of each cone is n minus the
+cardinality of the indexing set and the relationship between interiors of cones and their boundaries is
+analogous to that for simplices expressed in Equations (3) and (4).
+Let
+
+\[ V_{n-1} = \{ v \in \mathbb{R}^n : 1'v = 0 \} \tag{5} \]
+
+be the subspace of Rn of dimension n − 1 of all vectors that sum to zero. For each nonempty s ⊂ Sn
+define
+
+\[ \Lambda_s = \{ v \in V_{n-1} : v_i \ge v_j \ \forall i \in s,\ \forall j \in S_n \}. \]
+
+It is easily checked that $\Lambda_s$ is a convex cone
+
+\[ v_1, v_2 \in \Lambda_s \implies a_1 v_1 + a_2 v_2 \in \Lambda_s \quad \forall a_1, a_2 \in [0, \infty). \]
+
+The dimension of $\Lambda_s$ is $|\Lambda_s| = n - |s|$ since each point $j \in s^c$ provides a basis vector $b_j$ whose ith
+component is 1 if i ∈ s or i = j and is zero otherwise and $|s^c| = n - |s|$. The interior of $\Lambda_s$ is
+
+\[ \Lambda_s^\circ = \{ v \in \Lambda_s : v_i > v_j \ \forall i \in s,\ \forall j \in s^c \}, \]
+
+the boundary is
+
+\[ \partial\Lambda_s = \{ v \in \Lambda_s : v \notin \Lambda_s^\circ \}, \]
+
+so that
+
+\[ \Lambda_s = \Lambda_s^\circ \uplus \partial\Lambda_s \]
+
+by definition. Note $\Lambda_{S_n} = \Lambda_{S_n}^\circ = \{0\} \subset V_{n-1} \subset \mathbb{R}^n$ where the first equality holds because the conditions
+in the definition of $\Lambda_s^\circ$ hold vacuously since $j \in S_n^c = \emptyset$ adds no restriction. Likewise, we can extend
+the definition of $\Lambda_s$ to include s = ∅ and since i ∈ ∅ adds no restriction
+
+\[ \Lambda_\emptyset = \Lambda_\emptyset^\circ = V_{n-1}. \]
+
+Note that Λ∅ depends on the cardinality of the set Sn. Since we are considering n fixed, we will not
+show this dependence in the notation.
+Corresponding to Equation (3) we have for all nonempty s that the boundary of the cone Λs is the
+union of all cones having dimension one less than the dimension of Λs
+
+\[ \partial\Lambda_s = \bigcup_{s' : s \subset s',\ |s'| = |s| + 1} \Lambda_{s'}. \tag{6} \]
+
+Corresponding to Equation (4) we have
+
+\[ \Lambda_s = \bigcup_{s' : s \subset s'} \Lambda_{s'}^\circ \tag{7} \]
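+
+For n = 3 bins the cones and the relations (6) and (7) can be written out explicitly. The following worked case is an illustration under the definitions above, with s = {1} chosen arbitrarily; it is not taken from the original text.
+
+```latex
+\begin{align*}
+\Lambda_{\{1\}} &= \{ v \in V_2 : v_1 \ge v_2,\ v_1 \ge v_3 \}, & |\Lambda_{\{1\}}| &= 3 - 1 = 2,\\
+\Lambda_{\{1,2\}} &= \{ a(1, 1, -2) : a \ge 0 \}, & |\Lambda_{\{1,2\}}| &= 3 - 2 = 1,\\
+\Lambda_{\{1,3\}} &= \{ a(1, -2, 1) : a \ge 0 \}, & |\Lambda_{\{1,3\}}| &= 3 - 2 = 1,\\
+\Lambda_{\{1,2,3\}} &= \{ 0 \}, & |\Lambda_{\{1,2,3\}}| &= 3 - 3 = 0,\\
+\partial\Lambda_{\{1\}} &= \Lambda_{\{1,2\}} \cup \Lambda_{\{1,3\}}, \\
+\Lambda_{\{1\}} &= \Lambda^\circ_{\{1\}} \cup \Lambda^\circ_{\{1,2\}} \cup \Lambda^\circ_{\{1,3\}} \cup \Lambda^\circ_{\{1,2,3\}}.
+\end{align*}
+```
+
+A point of the two dimensional cone leaves its interior exactly when its largest component is tied with another one, which is why the boundary consists of the two rays on which bins 1 and 2, or bins 1 and 3, share the maximum.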
+
+The relationship between the simplices Δ and cones Λ is more easily seen if we suppress the
+sets that index these objects. Let Δ and Δ∗ be any two simplices and let Λ and Λ∗ be any two convex
+cones. We only consider cones and simplices that correspond to a nonempty subset of Sn. Then
+Equations (6) and (7) for the convex cones are obtained by simply replacing Δ in Equations (3) and (4)
+with Λ:
+
+\[ \partial\Delta = \bigcup_{\Delta_* : |\Delta_*| = |\Delta| - 1} \Delta_*, \qquad \partial\Lambda = \bigcup_{\Lambda_* : |\Lambda_*| = |\Lambda| - 1} \Lambda_* \tag{8} \]
+
+\[ \Delta = \bigcup_{\Delta_* \subset \Delta} \Delta_*^\circ, \qquad \Lambda = \bigcup_{\Lambda_* \subset \Lambda} \Lambda_*^\circ \tag{9} \]
+
+Equation (9) also holds for the empty set since Δ∅ = ∅ and Λ∅ = Vn−1.
+
+6. Vn−1 and P◦
+
+There is a natural bijection φ between Vn−1 and $\Delta_{n-1}^\circ$ defined by
+
+\[ \varphi(p) = \log(p) - m(p) 1 \]
+
+where log(p) is the vector with ith component log(pi) and m(p) is defined so that 1′φ(p) = 0. The
+inverse is
+
+\[ \phi(v) = k^{-1}(v) \exp(v) \]
+
+where exp(v) is the vector with ith component exp(vi) and k(v) is defined so that 1′ϕ(v) = 1, that is, k(v) = 1′ exp(v).
+Each cone $\Lambda_s^\circ$ in the partition
+
+\[ V_{n-1} = \bigcup \Lambda_s^\circ \]
+
+corresponds to one of the $2^n - 1$ possible modes for any distribution having support Sn since vi > vj if
+and only if ϕi(v) > ϕj(v).
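+
+A small numerical instance may help to see the two maps at work. The distribution p = (1/2, 1/4, 1/4) on n = 3 bins is an arbitrary choice made for this sketch, and k(v) is taken to be 1′ exp(v), in agreement with the normalization in Equation (10).
+
+```latex
+\begin{align*}
+\log p &= (-\log 2,\ -2\log 2,\ -2\log 2), \qquad m(p) = -\tfrac{5}{3}\log 2,\\
+\varphi(p) &= \bigl(\tfrac{2}{3}\log 2,\ -\tfrac{1}{3}\log 2,\ -\tfrac{1}{3}\log 2\bigr) \in \Lambda^\circ_{\{1\}},\\
+\exp(\varphi(p)) &= 2^{-1/3}\,(2,\ 1,\ 1), \qquad k(\varphi(p)) = 4 \cdot 2^{-1/3},\\
+\phi(\varphi(p)) &= \tfrac{1}{4}\,(2,\ 1,\ 1) = p.
+\end{align*}
+```
+
+The components of φ(p) sum to zero, the largest component sits in bin 1, which is the mode of p, and applying the inverse map recovers p.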
+
+7. Vn−1 and Exponential Families in P◦
+
+We define a line by a pair of vectors v0, v1 ∈ Vn−1 with v1 ≠ 0
+
+\[ \ell = \ell(t) = \{ v \in V_{n-1} : v = v_0 + t v_1,\ t \in \mathbb{R} \} \]
+
+Note that v0 and v1 are not unique. Applying the inverse transformation ϕ to points in ℓ gives
+probability densities
+
+\[ \phi(v_0 + t v_1) = \frac{\exp(v_0 + t v_1)}{1' \exp(v_0 + t v_1)} \tag{10} \]
+
+which have the exponential family form with t playing the role of the natural parameter. Therefore,
+the space Vn−1 is easily recognized as the natural parameter space for the distributions $\Delta_{n-1}^\circ$ so that
+each line ℓ in Vn−1 corresponds to a one dimensional exponential family.
+For each line ℓ(t) there is a value $t_{\max}$ such that $\{\ell(t) : t \ge t_{\max}\}$ is contained in one of the cones
+$\Lambda_s^\circ$ where s is the subset of Sn with the property that $v_1^i \ge v_1^j$ for all i ∈ s for vectors $v_1 \in \Lambda_x^\circ$. For each
+line ℓ(t) there is a value $t_{\min}$ such that $\{\ell(t) : t \le t_{\min}\}$ is contained in one of the cones $\Lambda_{s'}^\circ$ where s′ is
+the subset of Sn with the property that $v_1^i \le v_1^j$ for all i ∈ s′ for vectors $v_1 \in \Lambda_x^\circ$. The cones $\Lambda_s^\circ$ and $\Lambda_{s'}^\circ$
+are disjoint and will be called the extremal cones for ℓ. There is at least one other cone $\Lambda_{s''}^\circ$ such that
+$\ell \cap \Lambda_{s''}^\circ \ne \emptyset$.
+Any one dimensional exponential family ℓ(t) can be described by an ordered sequence of
+disjoint cones
+
+\[ \left( \Lambda_{s_1}^\circ, \Lambda_{s_2}^\circ, \ldots, \Lambda_{s_k}^\circ \right) \]
+
+where k = k(ℓ) will depend on the family. These are simply the cones that are traversed by ℓ(t)
+between its extremal cones. We take $\Lambda_{s_k}^\circ$ to be the cone that contains ℓ(t) for all sufficiently large t.
+Equation (6) for cones means that
+
+\[ \partial\Lambda_{s_i} \subset \Lambda_{s_j} \quad \text{for } j = i + 1 \text{ or } j = i - 1 \]
+
+The ordered sequence of cones provides an ordered sequence of unique subsets of Sn
+
+(s1, s2, . . . , sk)
+
+that we call the modal profile for ℓ as these are the modes realized by the exponential family ℓ(t) between
+its extremal cones that have modes s1 and sk.
+Each point on a line ℓ(t) in Vn−1 corresponds to a distribution having support Sn. As t goes to −∞
+(+∞) ϕ(ℓ(t)) goes to a distribution having support s1 (sk). In fact, these are the uniform distribution
+on these supports. For every s ⊂ Sn other than ∅ and Sn, the uniform distribution on s is a limiting
+distribution for some one dimensional exponential family in P◦.
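+
+The limiting behaviour can be checked on a concrete line for n = 3. The choice v0 = 0 and v1 = (1, 1, −2) is an assumption made only for this illustration.
+
+```latex
+\begin{align*}
+\phi(t v_1) &= \frac{(e^{t},\ e^{t},\ e^{-2t})}{2e^{t} + e^{-2t}},\\
+\lim_{t \to +\infty} \phi(t v_1) &= \bigl(\tfrac12,\ \tfrac12,\ 0\bigr) \quad \text{(uniform on } \{1,2\}\text{)},\\
+\lim_{t \to -\infty} \phi(t v_1) &= (0,\ 0,\ 1) \quad \text{(uniform on } \{3\}\text{)}.
+\end{align*}
+```
+
+For t < 0 the largest component is the third one, at t = 0 all components are equal, and for t > 0 the first two components are tied for the maximum, so under the definitions above the modal profile of this line is ({3}, {1, 2, 3}, {1, 2}).
+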
+Figure 3 shows Vn−1 for the two dimensional simplex shown in Figure 2. The three rays are the
+one dimensional cones and the spaces between these cones are the two dimensional cones. The origin
+is the zero dimensional cone. The sample values on the boundary of Δ2 are not in V2. Note that the
+one dimensional cones are line segments in Δ2.
+
+Figure 3. V2 for n = 3 bins and sample space for N = 10 observations that are in the interior of Δ2.
+
+8. Ordered Bins and the Monotone Likelihood Ratio Property
+
+Let the bins be ordered and assign the first n integers to the bins to reflect this ordering. We seek
+to define exponential families that have a modal profile of the form
+
+\[ (\{1\}, \{1, 2\}, \{2\}, \{2, 3\}, \ldots, \{n - 1, n\}, \{n\}) \tag{11} \]
+
+or a contiguous sub-collection of this profile. Extensions to three or more contiguous modes are clearly
+possible but not discussed here.
+From the definition of modal profile, it follows that a family with modal profile (11) will have the
+property that the mode is a non-decreasing function of t. In addition to this property for the mode, we
+want the likelihood ratio for any two members of the family to provide the same ordering structure
+as that of the bins. A family that satisfies this condition is said to have the monotone likelihood ratio
+property with respect to x where x takes the values of the bin labels: 1, 2, . . . , n. Let pθ1 and pθ2 be
+two distributions in a one dimensional family parameterized by θ and let pθ2/pθ1 be the n-vector with
+components $p_{\theta_2}^j / p_{\theta_1}^j$ for 1 ≤ j ≤ n. This family has monotone likelihood ratio if for all θ1 < θ2 and
+j < j′
+
+\[ \frac{p_{\theta_2}^{j}}{p_{\theta_1}^{j}} < \frac{p_{\theta_2}^{j'}}{p_{\theta_1}^{j'}}. \]
+
+A family with this property avoids the problem situation where in general the data in the higher
+numbered bins are evidence for pθ2 but in going from a particular bin, say j0 to j0 + 1, the likelihood
+ratio actually decreases. Exponential families such as the binomial and Poisson have this monotone
+likelihood ratio property for the bin labels. The monotone likelihood ratio property can be extended
+to allow for likelihood ratios that are monotone in some function of x. An important advantage of
+families with the monotone likelihood ratio property is the existence of uniformly most powerful tests.
+To ensure that our exponential families have the monotone likelihood ratio property we consider
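+vectors in the cone Λ↑ ⊂ Λn
+
+\[ \Lambda_\uparrow = \{ v : v_i < v_j,\ i < j \}. \]
+
+From Equation (10), the exponential family indexed by θ is k(θ) exp(v0 + θv1), where k(θ) here denotes
+the normalizing factor 1/(1′ exp(v0 + θv1)), which gives
+
+\[ \frac{p_{\theta_2}^{j}}{p_{\theta_1}^{j}} = \frac{k(\theta_2)}{k(\theta_1)} \exp\left( (\theta_2 - \theta_1)\, v_1^{j} \right) \]
+
+so that the likelihood ratio is monotone in j if v1 ∈ Λ↑.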
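+
+The monotonicity claim can be made explicit by comparing the likelihood ratio at two bins j < j′. This short check is a sketch that uses only the expression for the likelihood ratio above; the factor k(θ2)/k(θ1) does not depend on the bin and cancels.
+
+```latex
+\begin{align*}
+\frac{p_{\theta_2}^{j'} / p_{\theta_1}^{j'}}{p_{\theta_2}^{j} / p_{\theta_1}^{j}}
+ &= \exp\bigl( (\theta_2 - \theta_1)\,(v_1^{j'} - v_1^{j}) \bigr) > 1
+ \iff v_1^{j'} > v_1^{j} \qquad (\theta_1 < \theta_2).
+\end{align*}
+```
+
+For instance, with the centered linear direction whose jth component is j − (n + 1)/2, the ratio equals exp((θ2 − θ1)(j′ − j)), which exceeds 1 whenever j < j′ and θ1 < θ2.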
+
+9. Selecting Vectors in Λ↑
+
+In order to choose n-dimensional vectors v ∈ Λ↑ we will consider a set of infinite dimensional
+vectors f. Let $\bar{f} : \mathbb{R} \mapsto \mathbb{R}$ and consider $f = \bar{f}|_{\mathbb{Z}}$ where Z is the set of integers. The function f is
+represented by a doubly infinite sequence
+
+\[ f = \ldots, f_{j-1}, f_j, f_{j+1}, \ldots \]
+
+and we denote the set of all such functions as
+
+\[ F = \{ f : f_j \in \mathbb{R} \ \forall j \in \mathbb{Z} \}. \]
+
+While it is not necessary to consider functions $\bar{f}$ to define f, these functions are useful to describe
+properties of f, which can be thought of as a discretized version of $\bar{f}$.
+Define the gradient of f as the function ∇f whose jth component is
+
+\[ (\nabla f)_j = f_j - f_{j-1}. \]
+
+The simplest functions in F are the constant functions
+
+\[ F_0 = \{ f \in F : f_j = f_{j'} \ \forall j, j' \in \mathbb{Z} \}. \]
+
+The next simplest functions are those whose gradient is constant. We call these first order functions
+and denote the set of these as
+
+\[ F_1 = \{ f \in F : \nabla f \in F_0 \}. \]
+
+Functions in F1 are such that the change from one bin to the next is the same for all bins. That is,
+these functions describe constant change. We can write the functions in F1 explicitly as
+
+\[ F_1 = \{ f \in F : f_j = aj + b,\ a, b \in \mathbb{R} \} \]
+
+which shows that each f ∈ F1 is the discretized version of a function $\bar{f}$ whose graph is a line in R × R.
+We obtain a vector v from f by defining the jth component of v as
+
+\[ v_j = f_j - \frac{1}{n} \sum_{i=1}^{n} f_i. \]
+
+From this definition we see that the intercept b of f does not affect v and that the slope is a scaling
+factor so that the restriction to first order functions results in a single direction in Λ↑. This direction
+defines the one dimensional cone defined by the vector with vj = j − (n + 1)/2.
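+
+The first order direction can be verified directly. The computation below is a sketch that assumes the centering subtracts the average of the first n components of f, which is what makes the components of v sum to zero as membership in Vn−1 requires.
+
+```latex
+\begin{align*}
+\frac{1}{n}\sum_{i=1}^{n} f_i &= \frac{1}{n}\sum_{i=1}^{n} (ai + b) = a\,\frac{n+1}{2} + b,\\
+v_j &= (aj + b) - \Bigl(a\,\frac{n+1}{2} + b\Bigr) = a\Bigl(j - \frac{n+1}{2}\Bigr),\\
+\sum_{j=1}^{n} v_j &= a\Bigl(\frac{n(n+1)}{2} - n\,\frac{n+1}{2}\Bigr) = 0.
+\end{align*}
+```
+
+The intercept b cancels and the slope a only rescales the vector, so for a > 0 every first order function gives the same direction, with components increasing in j.
+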
+Additional directions can be obtained from the second order functions
+
+F2 = { f ∈ F : ∇ f ∈ F1} .
+
+If f ∈ F2 then $(\nabla^2 f)_j = a$ for some a ∈ R and for all j ∈ Z. Using the fact that
+
+\[ (\nabla^2 f)_j = (\nabla(\nabla f))_j = (f_j - f_{j-1}) - (f_{j-1} - f_{j-2}) = f_j + f_{j-2} - 2 f_{j-1} \]
+
+the second order functions can be written explicitly as
+
+\[ F_2 = \left\{ f \in F : f_j = \frac{a}{2}\, j(j + 1) + bj + c,\ a, b, c \in \mathbb{R} \right\}. \]
+
+In order for the vector v obtained from f ∈ F2 to be in Λ↑ we need $(\nabla f)_j \ge 0$ for j = 1, 2, . . . , n.
+With $f_j = (a/2) j(j + 1) + bj + c$ we have $(\nabla f)_j = aj + b$ so that for a > 0 we require b ≥ −a and for
+a < 0 we require b ≥ −an. Since we are concerned with the direction rather than the magnitude we
+can take a = ±1 and the value of c is chosen so the sum of the components is zero.
+The second order vectors in Λ↑ consist of the cone defined by the vectors v20 and v21 having
+components defined by
+
+\[ (n - 1)(v_{20})_j = \frac{1}{2}\, j(j + 1) - j - c_{20} \]
+
+\[ (n - 1)(v_{21})_j = -\frac{1}{2}\, j(j + 1) + nj - c_{21} \]
+
+Notice that this cone contains v1 since v1 is proportional to v20 + v21. Many discrete one dimensional
+exponential families (e.g., binomial, negative binomial, and Poisson) use the vector v1. Furthermore,
+many continuous one dimensional exponential families use the continuous function f used to define
+v1: normal with σ known, and the gamma and inverse Gaussian distributions with known shape
+parameter (the shape parameter is the non-scale parameter). The cone defined by v20 and v21 allows us
+to perturb the v1 direction to obtain related exponential families that we would expect to have similar
+properties. Figure 4 shows v20 and v21 as well as v1 = 0.5v20 + 0.5v21.
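+
+The centering constants and the relation between the three vectors can be worked out explicitly. This is a sketch under the assumption that c20 and c21 are chosen so that the components for j = 1, . . . , n sum to zero; the closed forms are not stated in the text.
+
+```latex
+\begin{align*}
+\sum_{j=1}^{n} \Bigl(\tfrac12 j(j+1) - j\Bigr) &= \frac{n(n+1)(n+2)}{6} - \frac{n(n+1)}{2} = \frac{n(n^2-1)}{6}
+ &&\Rightarrow\quad c_{20} = \frac{n^2-1}{6},\\
+\sum_{j=1}^{n} \Bigl(-\tfrac12 j(j+1) + nj\Bigr) &= -\frac{n(n+1)(n+2)}{6} + \frac{n^2(n+1)}{2} = \frac{n(n^2-1)}{3}
+ &&\Rightarrow\quad c_{21} = \frac{n^2-1}{3},\\
+(n-1)\bigl((v_{20})_j + (v_{21})_j\bigr) &= (n-1)\,j - \frac{n^2-1}{2} = (n-1)\Bigl(j - \frac{n+1}{2}\Bigr).
+\end{align*}
+```
+
+With this normalization the sum of the two vectors has components j − (n + 1)/2, so any equal-weight combination of them points in the first order direction. The increments of the first vector are proportional to j − 1 and those of the second to n − j, so each vanishes at one end of the range, which is consistent with the two vectors spanning the extreme rays of the second order cone.
+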
+Other vectors can be used to define cones around v1. Looking at common exponential families we
+see that log(x) and $x^{-1}$ are sufficient statistics so that these suggest taking $\bar{f}(x) = \log(x)$ or $\bar{f}(x) = 1/x$.
+These can be further generalized to $\bar{f}(x; \lambda)$, which can be the power family or some other family of
+transformations. The vectors $v_{f0}$ and $v_{f1}$ are defined using the discretized f with the constraints that
+$v_{f0}, v_{f1} \in \Lambda_\uparrow$ and $0.5 v_{f0} + 0.5 v_{f1} = v_1$.
+
+An exponential family with sufficient statistic x can be modified by choosing a function $\bar{f}(x)$ and
+0 ≤ α ≤ 1 where α = 0.5 corresponds to the original exponential family and other values perturb this
+direction. We denote this vector as $v_{f\alpha}$ so that $v_0 + t v_{f\alpha}$ is the natural parameter of the modified family.
+Figure 4 shows the components of the vectors v20 and v21.
+
+Figure 4. Components of the vectors v20 and v21 for n = 128 bins.
+
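+Since v0 is common to each exponential family with natural parameter $\ell(t) = v_0 + t v_{f\alpha}$, the
+monotone likelihood ratio property will hold even if v0 ∉ Λ↑. Initial choices for v0 are suggested by
+the Poisson, binomial, and negative binomial distributions:
+
+\[ (v_{\text{Poisson}})_j = -\log \Gamma(j) + c, \qquad v_{\text{Poisson}} \notin \Lambda_\uparrow \]
+
+\[ (v_{\text{binomial}})_j = \log \Gamma(n) - \log \Gamma(j) - \log \Gamma(n - j) + c, \qquad v_{\text{binomial}} \notin \Lambda_\uparrow \]
+
+\[ (v_{\text{neg.bin.}})_j = \log \Gamma(j + r) - \log \Gamma(j) + c, \qquad v_{\text{neg.bin.}} \in \Lambda_\uparrow \]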
+
+where c is a constant chosen so that the components sum to 1, n is the number of bins, and r is a
+positive real constant.
+
+Author Contributions: This paper was initiated by the first author but all sections reflect a collaborative effort.
+Both authors have read and approved the final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1. Altham, P.M.E. Two Generalizations of the Binomial Distribution. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1978, 27, 162–167.
+2. Lovison, G. An alternative representation of Altham’s multiplicative-binomial distribution. Stat. Probab. Lett. 1998, 36, 415–420.
+3. Brown, L. Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory; IMS Lecture Notes; Institute of Mathematical Statistics: Hayward, CA, USA, 1986.
+4. Efron, B. Double Exponential Families and Their Use in Generalized Linear Regression. J. Am. Stat. Assoc. 1986, 81, 709–721.
+5. Aitchison, J. The Statistical Analysis of Compositional Data; Chapman and Hall: London, UK, 1986.
+6. Wu, Q.; Vos, P. Decomposition of Kullback–Leibler risk and unbiasedness for parameter-free estimators. J. Stat. Plan. Inference 2012, 142, 1525–1536.
+7. Vos, P.; Wu, Q. Maximum Likelihood Estimators Uniformly Minimize Distribution Variance among Distribution Unbiased Estimators in Exponential Families. Bernoulli 2014, submitted.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+
+entropy
+
+Article
+On the Fisher Metric of Conditional Probability
+Polytopes
+
+Guido Montúfar 1,*, Johannes Rauh 1 and Nihat Ay 1,2,3
+
+1 Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig 04103, Germany; E-Mails:
+jrauh@mis.mpg.de (J.R.); nay@mis.mpg.de (N.A.)
+2 Department of Mathematics and Computer Science, Leipzig University, PF 10 09 20, Leipzig 04009, Germany
+3 Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
+* E-Mail: montufar@mis.mpg.de; Tel.: +49-341-9959-521.
+
+Received: 31 March 2014; in revised form: 18 May 2014 / Accepted: 29 May 2014 /
+Published: 6 June 2014
+
+Abstract: We consider three different approaches to define natural Riemannian metrics on polytopes
+of stochastic matrices. First, we define a natural class of stochastic maps between these polytopes and
+give a metric characterization of Chentsov type in terms of invariance with respect to these maps.
+Second, we consider the Fisher metric defined on arbitrary polytopes through their embeddings as
+exponential families in the probability simplex. We show that these metrics can also be characterized
+by an invariance principle with respect to morphisms of exponential families. Third, we consider
+the Fisher metric resulting from embedding the polytope of stochastic matrices in a simplex of joint
+distributions by specifying a marginal distribution. All three approaches result in slight variations
+of products of Fisher metrics. This is consistent with the nature of polytopes of stochastic matrices,
+which are Cartesian products of probability simplices. The first approach yields a scaled product of
+Fisher metrics; the second, a product of Fisher metrics; and the third, a product of Fisher metrics
+scaled by the marginal distribution.
+
+Keywords: Fisher information metric; information geometry; convex support polytope; conditional
+model; Markov morphism; isometric embedding; natural gradient
+
+1. Introduction
+
+The Riemannian structure of a function’s domain has a crucial impact on the performance of
+gradient optimization methods, especially in the presence of plateaus and local maxima. The natural
+gradient [1] gives the steepest increase direction of functions on a Riemannian space. For example,
+artificial neural networks can often be trained by following some function’s gradient on a space of
+probabilities. In this context, it has been observed that following the natural gradient with respect to
+the Fisher information metric, instead of the Euclidean metric, can significantly alleviate the plateau
+problem [1,2]. The Fisher information metric, which is also called Shahshahani metric [3] in biological
+contexts, is broadly recognized as the natural metric of probability spaces. An important argument
+was given by Chentsov [4], who showed that the Fisher information metric is the only metric on
+probability spaces for which certain natural statistical embeddings, called Markov morphisms, are
+isometries. More generally, Chentsov’s Theorem characterizes the Fisher metric and α-connections of
+statistical manifolds uniquely (up to a multiplicative constant) by requiring invariance with respect
+to Markov morphisms. Campbell [5] gave another proof that characterizes invariant metrics on the
+set of non-normalized positive measures, which restrict to the Fisher metric in the case of probability
+measures (up to a multiplicative constant). In this paper, we explore ways of defining distinguished
+Riemannian metrics on spaces of stochastic matrices.
+
+Entropy 2014, 16, 3207–3233; doi:10.3390/e16063207
+www.mdpi.com/journal/entropy
+
+In learning theory, when modeling the policy of a system, it is often preferred to consider
+stochastic matrices instead of joint probability distributions. For example, in robotics applications,
+policies are optimized over a parametric set of stochastic matrices by following the gradient of a
+reward function [6,7]. The set of stochastic matrices can be parametrized in many ways, e.g., in terms
+of feedforward neural networks, Boltzmann machines [8] or projections of exponential families [9].
+The information geometry of policy models plays an important role in these applications and has
+been studied by Kakade [2], Peters and co-workers [10–12], and Bagnell and Schneider [13], among
+others. A stochastic matrix is a tuple of probability distributions, and therefore, the space of stochastic
+matrices is a Cartesian product of probability simplices. Accordingly, in applications, usually a product
+metric is considered, with the usual Fisher metric on each factor. On the other hand, Lebanon [14]
+takes an axiomatic approach, following the ideas of Chentsov and Campbell, and characterizes a class
+of invariant metrics of positive matrices that restricts to the product of Fisher metrics in the case of
+stochastic matrices. We will consider three different approaches discussed in the following.
+In the first part, we take another look at Lebanon’s approach for characterizing a distinguished
+metric on polytopes of stochastic matrices. However, since the maps considered by Lebanon do not
+map stochastic matrices to stochastic matrices, we will use different maps. We show that the product
+of Fisher metrics can be characterized by an invariance principle with respect to natural maps between
+stochastic matrices.
+In the second part, we consider an approach that allows us to define Riemannian structures
+on arbitrary polytopes. Any polytope can be identified with an exponential family by using the
+coordinates of the polytope vertices as observables. The inverse of the moment map then defines
+an embedding of the polytope in a probability simplex. This embedding can be used to pull back
+geometric structures from the probability simplex to the polytope, including Riemannian metrics,
+affine connections, divergences, etc. This approach has been considered in [9] as a way to define
+low-dimensional families of conditional probability distributions. More general embeddings can be
+defined by identifying each exponential family with a point configuration, B, together with a weight
+function, ν. Given B and ν, the corresponding exponential family defines geometric structures on the
+set (conv B)◦, which is the relative interior of the convex support of the exponential family. Moreover,
+we can define natural morphisms between weighted point configurations as surjective maps between
+the point sets, which are compatible with the weight functions. As it turns out, the Fisher metric on
+(conv B)◦ can be characterized by invariance under these maps.
+In the third part, we return to stochastic matrices. We study natural embeddings of conditional
+distributions in probability simplices as joint distributions with a fixed marginal. These embeddings
+define a Fisher metric equal to a weighted product of Fisher metrics. This result corresponds to the
+definitions commonly used in robotics applications.
+All three approaches give very similar results. In all cases, the identified metric is a product
+metric. This is a sensible result, since the set of k × m stochastic matrices is a Cartesian product of
+probability simplices $\Delta_{m-1} \times \cdots \times \Delta_{m-1} = \Delta_{m-1}^k$, which suggests using the product metric of the
+Fisher metrics defined on the factor simplices, Δm−1. Indeed, this is the result obtained from our second
+approach. The first approach yields that same result with an additional scaling factor of 1/k. The two
+approaches differ only when stochastic matrices of different sizes are compared. The third approach
+yields a product of Fisher metrics scaled by the marginal distribution that defines the embedding.
+Which metric to use depends on the concrete problem and whether a natural marginal distribution
+is defined and known. In Section 7, we do a case study using a reward function that is given as an
+expectation value over a joint distribution. In this simple example, the weighted product metric gives
+the best asymptotic rate of convergence, under the assumption that the weights are optimally chosen.
+In Section 8, we sum up our findings.
+The paper is organized as follows. Section 2 contains basic definitions concerning the
+Fisher metric and concepts of differential geometry. In Section 3, we discuss the theorems of Chentsov,
+Campbell and Lebanon, which characterize natural geometric structures on the probability simplex,
+Entropy 2014, 16, 3207–3233
+
+on the set of positive measures and on the cone of positive matrices, respectively. In Section 4, we
+study metrics on polytopes of stochastic matrices, which are invariant under natural embeddings. In
+Section 5, we define a Riemannian structure for polytopes, which generalizes the Fisher information
+metric of probability simplices and conditional models in a natural way. In Section 6, we study a class of
+weighted product metrics. In Section 7, we study the gradient flow with respect to an expectation value.
+Section 8 contains concluding remarks. In Appendix A, we investigate restrictions on the parameters
+of the metrics characterized in Sections 3 and 4 that make them positive definite. Appendix B contains
+the proofs of the results from Section 4.
+
+2. Preliminaries
+
+We will consider the simplex of probability distributions on $[m] := \{1, \ldots, m\}$, $m \geq 2$, which is
+given by $\Delta_{m-1} := \{(p_i)_i \in \mathbb{R}^m : p_i \geq 0, \sum_i p_i = 1\}$. The relative interior of $\Delta_{m-1}$ consists of all strictly
+positive probability distributions on $[m]$, and will be denoted $\Delta^\circ_{m-1}$. This is a subset of $\mathbb{R}^m_+$, the cone
+of strictly positive vectors. The set of $k \times m$ row-stochastic matrices is given by
+$\Delta^k_{m-1} := \{(K_{ij})_{ij} \in \mathbb{R}^{k \times m} : (K_{ij})_j \in \Delta_{m-1} \text{ for all } i \in [k]\}$ and is equal to the Cartesian product $\times_{i \in [k]} \Delta_{m-1}$. The relative
+interior $(\Delta^k_{m-1})^\circ$ is a subset of $\mathbb{R}^{k \times m}_+$, the cone of strictly positive matrices.
+Given two random variables X and Y taking values in the finite sets [k] and [m], respectively, the
+conditional probability distribution of Y given X is the stochastic matrix $K = (P(y|x))_{x \in [k], y \in [m]}$ with
+rows $(P(y|x))_{y \in [m]} \in \Delta_{m-1}$ for all $x \in [k]$. Therefore, the polytope of stochastic matrices $\Delta^k_{m-1}$ is called
+a conditional polytope.
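+The tangent space of $\mathbb{R}^n_+$ at a point $p \in \mathbb{R}^n_+$, denoted by $T_p\mathbb{R}^n_+$, is the real vector space spanned
+by the vectors $\partial_1, \ldots, \partial_n$ of partial derivatives with respect to the n components. The tangent space of
+$\Delta^\circ_{n-1}$ at a point $p \in \Delta^\circ_{n-1} \subset \mathbb{R}^n_+$ is the subspace $T_p\Delta^\circ_{n-1} \subset T_p\mathbb{R}^n_+$ consisting of the vectors:
+
+\[ u = \sum_i u_i \partial_i \in T_p\mathbb{R}^n_+ \quad \text{with} \quad \sum_i u_i = 0. \tag{1} \]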
+
+The Fisher metric on the positive probability simplex $\Delta^\circ_{n-1}$ is the Riemannian metric given by:
+
+\[ g^{(n)}_p(u, v) = \sum_{i=1}^n \frac{u_i v_i}{p_i}, \quad \text{for all } u, v \in T_p\Delta^\circ_{n-1}. \tag{2} \]
+
+The same formula (2) also defines a Riemannian metric on $\mathbb{R}^n_+$, which we will denote by the same
+symbol. This, however, is not the only way in which the Fisher metric can be extended from $\Delta^\circ_{n-1}$
+to $\mathbb{R}^n_+$. We will discuss other extensions in the next section (see Campbell’s Theorem, Theorem 2).
+Consider a smoothly parametrized family of probability distributions $M = \{(p(x; \theta))_{x \in [n]} : \theta \in \Omega\} \subseteq \Delta^\circ_{n-1}$,
+where $\Omega \subseteq \mathbb{R}^d$ is open. Then, $g^{(n)}$ induces a Riemannian metric on M. Denote by
+$\partial_{\theta_i} = \frac{\partial}{\partial \theta_i}$ the tangent vector corresponding to the partial derivative with respect to $\theta_i$, for all $i \in [d]$.
+Then, the Fisher matrix has coordinates:
+
+\[ g^M_\theta(\partial_{\theta_i}, \partial_{\theta_j}) = \sum_{x \in [n]} p(x; \theta) \frac{\partial \log p(x; \theta)}{\partial \theta_i} \frac{\partial \log p(x; \theta)}{\partial \theta_j}, \quad \text{for all } i, j \in [d], \text{ for all } \theta \in \Omega. \tag{3} \]
+
+Here, it is not necessary to assume that the parameters $\theta_i$ are independent. In particular, the dimension
+of M may be smaller than d, in which case the matrix is not positive definite. If the map $\Omega \to M$, $\theta \mapsto p(\cdot; \theta)$
+is an embedding (i.e., a smooth injective map that is a diffeomorphism onto its image), then $g^M_\theta$
+defines a Riemannian metric on $\Omega$, which corresponds to the pull-back of $g^{(n)}$.
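+As a quick sanity check of Equation (3), consider the one-parameter family on [2] given by
+p(1; θ) = θ and p(2; θ) = 1 − θ with θ ∈ (0, 1). This family is an illustrative choice made here and is not
+taken from the text above. Evaluating Equation (3) term by term gives:
+
+\[ g^M_\theta(\partial_\theta, \partial_\theta) = \theta \left( \frac{1}{\theta} \right)^2 + (1 - \theta) \left( \frac{-1}{1 - \theta} \right)^2 = \frac{1}{\theta} + \frac{1}{1 - \theta} = \frac{1}{\theta (1 - \theta)}. \]
+
+The same value results from Equation (2) at p = (θ, 1 − θ) with the tangent vector u = v = (1, −1), as one
+would expect from a pull-back.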
+Consider an embedding $f : E \to E'$. The pull-back of a metric $g'$ on $E'$ through f is defined as:
+
+\[ (f^* g')_p(u, v) := g'_{f(p)}(f_* u, f_* v), \quad \text{for all } u, v \in T_pE, \tag{4} \]
+
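+where $f_*$ denotes the push-forward of $T_pE$ through f, which in coordinates is given by:
+
+\[ f_* : T_pE \to T_{f(p)}E'; \quad \sum_i u_i \partial_{\theta_i} \mapsto \sum_j \sum_i u_i \frac{\partial f_j(p)}{\partial \theta_i} \partial_{\theta'_j}, \tag{5} \]
+
+where $\{\partial_{\theta_i}\}_i$ spans $T_pE$ and $\{\partial_{\theta'_j}\}_j$ spans $T_{f(p)}E'$.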
+
+An embedding f : E → E′ of two Riemannian manifolds (E, g) and (E′, g′) is an isometry iff:
+
+\[ g_p(u, v) = (f^* g')_p(u, v), \quad \text{for all } p \in E \text{ and } u, v \in T_pE. \tag{6} \]
+
+In this case, we say that the metric g is invariant with respect to f (and g′).
+
+3. The Results of Campbell and Lebanon
+
+One of the theoretical motivations for using the Fisher metric is provided by Chentsov’s
+characterization [4], which states that the Fisher metric is uniquely specified, up to a multiplicative
+constant, by an invariance principle under a class of stochastic maps, called Markov morphisms. Later,
+Campbell [5] considered the characterization problem on the space $\mathbb{R}^n_+$ instead of $\Delta^\circ_{n-1}$. This simplifies
+the computations, since $\mathbb{R}^n_+$ has a more symmetric parametrization.
+
+Definition 1. Let $2 \leq m \leq n$. A (row) stochastic partition matrix (or just row-partition matrix) is a matrix
+$Q \in \mathbb{R}^{m \times n}$ of non-negative entries, which satisfies $\sum_{j \in A_{i'}} Q_{ij} = \delta_{ii'}$ for an m block partition $\{A_1, \ldots, A_m\}$ of
+$[n]$. The linear map defined by:
+
+\[ \mathbb{R}^m_+ \to \mathbb{R}^n_+; \quad p \mapsto p \cdot Q \tag{7} \]
+
+is called a congruent embedding by a Markov mapping of $\mathbb{R}^m_+$ to $\mathbb{R}^n_+$ or just a Markov map, for short.
+
+An example of a 3 × 5 row-partition matrix is:
+
+\[ Q = \begin{pmatrix} 1/2 & 0 & 1/2 & 0 & 0 \\ 0 & 1/3 & 0 & 2/3 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \tag{8} \]
+
+Markov maps preserve the 1-norm and restrict to embeddings $\Delta^\circ_{m-1} \to \Delta^\circ_{n-1}$.
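+To make the isometry property concrete, the following worked computation applies the 3 × 5 matrix Q of
+Equation (8), with rows (1/2, 0, 1/2, 0, 0), (0, 1/3, 0, 2/3, 0) and (0, 0, 0, 0, 1), to a point p = (p1, p2, p3)
+and a tangent vector u = (u1, u2, u3). It is only an illustrative check of the Fisher norm from Equation (2),
+not part of the original argument:
+
+\[ p \cdot Q = \left( \tfrac{p_1}{2}, \tfrac{p_2}{3}, \tfrac{p_1}{2}, \tfrac{2 p_2}{3}, p_3 \right), \qquad u \cdot Q = \left( \tfrac{u_1}{2}, \tfrac{u_2}{3}, \tfrac{u_1}{2}, \tfrac{2 u_2}{3}, u_3 \right), \]
+
+\[ \sum_{j=1}^{5} \frac{(u \cdot Q)_j^2}{(p \cdot Q)_j} = 2 \cdot \frac{u_1^2 / 4}{p_1 / 2} + \frac{u_2^2 / 9}{p_2 / 3} + \frac{4 u_2^2 / 9}{2 p_2 / 3} + \frac{u_3^2}{p_3} = \frac{u_1^2}{p_1} + \frac{u_2^2}{p_2} + \frac{u_3^2}{p_3}. \]
+
+The entries of p · Q sum to p1 + p2 + p3, so the 1-norm is preserved as well.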
+
+Theorem 1 (Chentsov’s Theorem.).
+
+• Let $g^{(m)}$ be a Riemannian metric on $\Delta^\circ_{m-1}$ for $m \in \{2, 3, \ldots\}$. Let this sequence of metrics have the
+property that every congruent embedding by a Markov mapping is an isometry. Then, there is a constant
+C > 0 that satisfies:
+
+\[ g^{(m)}_p(u, v) = C \sum_i \frac{u_i v_i}{p_i}. \tag{9} \]
+
+• Conversely, for any C > 0, the metrics given by Equation (9) define a sequence of Riemannian metrics
+under which every congruent embedding by a Markov mapping is an isometry.
+
+The main result in Campbell’s work [5] is the following variant of Chentsov’s Theorem.
+
+Theorem 2 (Campbell’s Theorem.).
+
+• Let $g^{(m)}$ be a Riemannian metric on $\mathbb{R}^m_+$ for $m \in \{2, 3, \ldots\}$. Let this sequence of metrics have the
+property that every embedding by a Markov mapping is an isometry. Then:
+
+\[ g^{(m)}_p(\partial_i, \partial_j) = A(|p|) + \delta_{ij} C(|p|) \frac{|p|}{p_i}, \tag{10} \]
+
+where $|p| = \sum_{i=1}^m p_i$, $\delta_{ij}$ is the Kronecker delta, and A and C are $C^\infty$ functions on $\mathbb{R}_+$ satisfying
+C(α) > 0 and A(α) + C(α) > 0 for all α > 0.
+• Conversely, if A and C are C∞ functions on R+ satisfying C(α) > 0, A(α) + C(α) > 0 for all α > 0,
+then Equation (10) defines a sequence of Riemannian metrics under which every embedding by a Markov
+mapping is an isometry.
+
+The metrics from Campbell’s Theorem also define metrics on the probability simplices $\Delta^\circ_{m-1}$ for
+$m = 2, 3, \ldots$. Since the tangent vectors $v = \sum_i v_i \partial_i \in T_p\Delta^\circ_{m-1}$ satisfy $\sum_i v_i = 0$, for any two vectors
+$u, v \in T_p\Delta^\circ_{m-1}$, also $\sum_i \sum_j A u_i v_j = 0$ for any A. In this case, the choice of A is immaterial, and the
+metric becomes Chentsov’s metric.
+
+Remark 1. Observe that Chentsov’s Theorem is not a direct implication of Campbell’s Theorem. However, it
+can be deduced from it by the following arguments. Suppose that we have a family of Riemannian simplices
+$(\Delta^\circ_{m-1}, g^{(m)})$ for $m \in \{2, 3, \ldots\}$, and suppose that they are isometric with respect to Markov maps. If we can
+extend every $g^{(m)}$ to a Riemannian metric $\tilde{g}^{(m)}$ on $\mathbb{R}^m_+$ in such a way that the resulting spaces $(\mathbb{R}^m_+, \tilde{g}^{(m)})$ are
+still isometric with respect to Markov maps, then Campbell’s Theorem implies that $g^{(m)}$ is a multiple of the
+Fisher metric. Such metric extensions can be defined as follows. Consider the diffeomorphism:
+
+\[ \Delta^\circ_{m-1} \times \mathbb{R}_+ \cong \mathbb{R}^m_+, \quad (p, r) \mapsto r \cdot p. \tag{11} \]
+
+Any tangent vector $u \in T_{(p,r)}\mathbb{R}^m_+$ can be written uniquely as $u = u_p + u_r \partial_r$, where $u_p$ is tangent to $r\Delta^\circ_{m-1}$.
+Since each Markov map f preserves the one-norm | · |, its push-forward $f_*$ maps the tangent vector $\partial_r \in T_{(p,r)}\mathbb{R}^m_+$
+to the corresponding tangent vector $\partial_r \in T_{f(p,r)}\mathbb{R}^m_+$; that is, $f_* u = f_* u_p + u_r \partial_r$. Therefore,
+
+\[ \tilde{g}^{(m)}_{(p,r)}(u, v) := g^{(m)}_p(u_p, v_p) + u_r v_r \tag{12} \]
+
+is a metric on $\mathbb{R}^m_+$ that is invariant under f.
+
+In what follows, we will focus on positive matrices. In order to define a natural Riemannian
+metric, we can use the identification $\mathbb{R}^{k \times m}_+ \cong \mathbb{R}^{km}_+$ and apply Campbell’s Theorem. This leads to
+metrics of the form:
+
+\[ g^{(k,m)}_M(\partial_{ij}, \partial_{kl}) = A(|M|) + \delta_{ik} \delta_{jl} \frac{C(|M|)}{M_{ij}}, \tag{13} \]
+
+where $\partial_{ij} = \frac{\partial}{\partial M_{ij}}$ and $|M| = \sum_{ij} M_{ij}$. However, a disadvantage of this approach is that the action of
+general Markov maps on $\mathbb{R}^{km}_+$ has no natural interpretation in terms of the matrix structure. Therefore,
+Lebanon [14] considered a special class of Markov maps defined as follows.
+
+Definition 2. Consider a k × l row-partition matrix R and a collection of m × n row-partition matrices
+$Q = \{Q^{(1)}, \ldots, Q^{(k)}\}$. The map:
+
+\[ \mathbb{R}^{k \times m}_+ \to \mathbb{R}^{l \times n}_+; \quad M \mapsto R^\top (M \otimes Q) \tag{14} \]
+
+is called a congruent embedding by a Markov morphism of $\mathbb{R}^{k \times m}_+$ to $\mathbb{R}^{l \times n}_+$ in [15]. We will refer to such an
+embedding as a Lebanon map. Here, the row product $M \otimes Q$ is defined by:
+
+\[ (M \otimes Q)_{ab} = (M \cdot Q^{(a)})_{ab}, \quad \text{for all } a \in [k], b \in [n]; \tag{15} \]
+
+that is, the a-th row of M is multiplied by the matrix $Q^{(a)}$.
+
+In a Lebanon map, each row of the input matrix M is mapped by an individual Markov mapping
+$Q^{(i)}$, and each resulting row is copied and scaled by an entry of R. This kind of map preserves the
+sum of all matrix entries. Therefore, with the identification $\mathbb{R}^{k \times m}_+ \cong \mathbb{R}^{km}_+$, each Lebanon map restricts
+to a map $\Delta^\circ_{mk-1} \to \Delta^\circ_{nl-1}$. The set $\Delta^\circ_{mk-1}$ can be identified with the set of joint distributions of two
+random variables. Lebanon maps can be regarded as special Markov maps that incorporate the product
+structure present in the set of joint probability distributions of a pair of random variables. In Section 4,
+we will give an interpretation of these maps.
+Contrary to what is stated in [15], a Lebanon map does not map $(\Delta^k_{m-1})^\circ$ to $(\Delta^l_{n-1})^\circ$, unless k = l.
+Therefore, later, we will provide a characterization for the metrics on $(\Delta^k_{m-1})^\circ$ in terms of invariance
+under other maps (which are neither Markov nor Lebanon maps).
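+A small example shows how row-stochasticity is lost. The matrices below are chosen here purely for
+illustration and do not come from [15]. Take k = 2, l = 3, m = n = 2, let both matrices of the collection Q
+be the 2 × 2 identity, and let R be the 2 × 3 row-partition matrix with blocks {1, 2} and {3}:
+
+\[ R = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad M = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad R^\top (M \otimes Q) = \begin{pmatrix} a/2 & b/2 \\ a/2 & b/2 \\ c & d \end{pmatrix}. \]
+
+The sum of all entries is still a + b + c + d. However, if M is row-stochastic (a + b = c + d = 1), then the
+rows of the image sum to 1/2, 1/2 and 1, so the image is not a stochastic matrix.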
+The main result in Lebanon’s work [15, Theorems 1 and 2] is the following.
+
+Theorem 3 (Lebanon’s Theorem.).
+
+• For each k ≥ 1, m ≥ 2, let $g^{(k,m)}$ be a Riemannian metric on $\mathbb{R}^{k \times m}_+$ in such a way that every Lebanon
+map is an isometry. Then:
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = A(|M|) + \delta_{ac} \left( \frac{B(|M|)}{|M_a|} + \delta_{bd} \frac{C(|M|)}{M_{ab}} \right) \tag{16} \]
+
+for some differentiable functions $A, B, C \in C^\infty(\mathbb{R}_+)$.
+• Conversely, let $\{(\mathbb{R}^{k \times m}_+, g^{(k,m)})\}$ be a sequence of Riemannian manifolds, with metrics $g^{(k,m)}$ of the
+form (16) for some $A, B, C \in C^\infty(\mathbb{R}_+)$. Then, every Lebanon map is an isometry.
+
+Lebanon does not study the question under which assumptions on A, B, C ∈ C∞(R+) the
+formula (16) does indeed define a Riemannian metric. This question has the following simple answer,
+which we will prove in Appendix A:
+
+Proposition 1. The matrix (16) is positive definite if and only if C(|M|) > 0, B(|M|) + C(|M|) > 0 and
+A(|M|) + B(|M|) + C(|M|) > 0.
+
+The class of metrics (16) is larger than the class of metrics (13) derived in Campbell’s Theorem.
+The reason is that Campbell’s metrics are invariant with respect to a larger class of embeddings.
+The special case with A(|M|) = 0, B(|M|) = 0 and C(|M|) = 1 is called the product Fisher metric,
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \delta_{ac} \delta_{bd} \frac{1}{M_{ab}}. \tag{17} \]
+
+Furthermore, if we restrict to $(\Delta^k_{m-1})^\circ$, the functions A and B do not play any role. In this case |M| = k,
+and we obtain the scaled product Fisher metric:
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \delta_{ac} \delta_{bd} \frac{C(k)}{M_{ab}}, \tag{18} \]
+
+where $C(k) : \mathbb{N} \to \mathbb{R}_+$ is a positive function. As mentioned before, Lebanon’s Theorem does not give a
+characterization of invariant metrics of stochastic matrices, since Lebanon maps do not preserve the
+stochasticity of the matrices. However, Lebanon maps are natural maps on the set $\Delta^\circ_{mk-1}$ of positive
+joint distributions. In the same way as Chentsov’s Theorem can be derived from Campbell’s Theorem
+(see Remark 1), we obtain the following Corollary:
+
+Corollary 1.
+
+• Let $\{(\Delta^\circ_{km-1}, g^{(k,m)}) : k \geq 1, m \geq 2\}$ be a double sequence of Riemannian manifolds with the property
+that every Lebanon map is an isometry. Then:
+
+\[ g^{(k,m)}_P(u, v) = B \sum_a \sum_{b,c} \frac{u_{ab} v_{ac}}{|P_a|} + C \sum_a \sum_b \frac{u_{ab} v_{ab}}{P_{ab}}, \quad \text{for each } P \in \Delta^\circ_{km-1}, \tag{19} \]
+
+for some constants $B, C \in \mathbb{R}$ with C > 0 and B + C > 0, where $|P_a| = \sum_b P_{ab}$.
+• Conversely, let $\{(\Delta^\circ_{km-1}, g^{(k,m)})\}$ be a sequence of Riemannian manifolds with metrics $g^{(k,m)}$ of the form
+of Equation (19) for some $B, C \in \mathbb{R}$. Then, every Lebanon map is an isometry.
+
+Observe that these metrics agree with (a multiple of) the Fisher metric only if B = 0. The case B = 0
+can also be characterized; note that Lebanon maps do not treat the two random variables symmetrically.
+Switching the two random variables corresponds to transposing the joint distribution matrix P. When
+exchanging the role of the two random variables, the Lebanon map becomes P ↦ (P⊤ ⊗ Q)⊤R. We
+call such a map a dual Lebanon map. If we require invariance under both Lebanon maps and their
+duals in Theorem 3 or Corollary 1, the statements remain true with the additional restriction that B = 0
+(as a function or constant, respectively).
+
+4. Invariance Metric Characterizations for Conditional Polytopes
+
+According to Chentsov’s Theorem (Theorem 1), a natural metric on the probability simplex can
+be characterized by requiring the isometry of natural embeddings. Lebanon follows this axiomatic
+approach to characterize metrics on products of positive measures (Theorem 3). However, the maps
+considered by Lebanon dissolve the row-normalization of conditional distributions. In general, they
+do not map conditional polytopes to conditional polytopes. Therefore, we will consider a slight
+modification of Lebanon maps, in order to obtain maps between conditional polytopes.
+
+4.1. Stochastic Embeddings of Conditional Polytopes
+
+A matrix of conditional distributions P(Y|X) in $\Delta^k_{m-1}$ can be regarded as the equivalence class
+of all joint probability distributions $P(X, Y) \in \Delta_{km-1}$ with conditional distribution P(Y|X). Which
+Markov maps of probability simplices are compatible with this equivalence relation? The most obvious
+examples are permutations (relabelings) of the state spaces of X and Y.
+In information theory, stochastic matrices are also viewed as channels. For any distribution of X,
+the stochastic matrix gives us a joint distribution of the pair (X, Y) and, hence, a marginal distribution
+of Y. If we input a distribution of X into the channel, the stochastic matrix determines what the
+distribution of the output Y will be.
+Channels can be combined, provided the cardinalities of the state spaces fit together. If we
+take the output Y of the first channel P(Y|X) and feed it into another channel P(Y′|Y) then we
+obtain a combined channel P(Y′|X). The composition of channels corresponds to ordinary matrix
+multiplication. If the first channel is described by the stochastic matrix K and the second channel by Q,
+then the combined channel is described by K · Q. Observe that in this case, the joint distribution P
+(considered as a normalized matrix P ∈ Δkm−1) is transformed similarly; that is, the joint distribution
+of the pair (X, Y′) is given by P · Q.
+More general maps result from compositions where the choice of the second channel depends on
+the input of the first channel. In other words, we have a first channel that takes as input X and gives
+as output Y, and we have another channel that takes as input (X, Y) and gives as output Y′; we are
+interested in the resulting channel from X to Y′. The second channel can be described by a collection
+of stochastic matrices Q = {Q(i)}i. If K describes the first channel, then the combined channel is
+described by the row product K ⊗ Q (see Definition 2). Again, the joint distribution of (X, Y′) arises in
+a similar way as P ⊗ Q.
+
+We can also consider transformations of the first random variable X. Suppose that we use X as the
+input to a channel described by a stochastic matrix R. In this case, the joint distribution of the output
+X′ of the channel and Y is described by $R^\top P$. However, in general, there is not much that we can say
+about the conditional distribution of Y given X′. The result depends in an essential way on the original
+distribution of X. However, this is not true in the special case that the channel is “not mixing”, that is,
+in the case that R is a stochastic partition matrix. In this case, the conditional distribution P(Y|X′) is
+described by $\bar{R}^\top K$, where $\bar{R}$ is the corresponding partition indicator matrix, in which all non-zero entries
+of R are replaced by one. In other words, each state of X corresponds to several states of X′, and the
+corresponding row of K is copied a corresponding number of times.
+To sum up, if we combine the transformations due to Q and R, then the joint probability
+distribution transforms as $P \mapsto R^\top (P \otimes Q)$ and the conditional transforms as $K \mapsto \bar{R}^\top (K \otimes Q)$.
+In particular, for the joint distribution, we obtain the definition of a Lebanon map. Figure 1 illustrates
+the situation.
+
+[Figure 1: diagram relating the variables X, Y, X′ and Y′, with arrows labeled K, K′, R and Q.
+Joint distributions: $P' = R^\top (P \otimes Q)$. Conditional distributions: $K' = \bar{R}^\top (K \otimes Q)$.]
+
+Figure 1. An interpretation for Lebanon maps and conditional embeddings. The variable X′ is
+computed from X by R, and Y′ is computed from X and Y by Q.
+
+Finally, we will also consider the special case where the partition of R (and $\bar{R}$) is homogeneous,
+i.e., such that all blocks have the same size. For example, this describes the case where there is a third
+random variable Z that is independent of Y given X. In this case, the conditional distribution satisfies
+P(Y|X) = P(Y|X, Z), and R describes the conditional distribution of (X, Z) given X.
+
+Definition 3. A (row) partition indicator matrix is a matrix $\bar{R} \in \{0, 1\}^{k \times l}$ that satisfies:
+
+\[ \bar{R}_{ij} = \begin{cases} 1, & \text{if } j \in A_i, \\ 0, & \text{else,} \end{cases} \tag{20} \]
+
+for a k block partition $\{A_1, \ldots, A_k\}$ of $[l]$.
+
+For example, the 3 × 5 partition indicator matrix corresponding to Equation (8) is:
+
+\[ \bar{R} = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \tag{21} \]
+
+Definition 4. Consider a k × l partition indicator matrix $\bar{R}$ and a collection of m × n stochastic partition
+matrices $Q = \{Q^{(i)}\}_{i=1}^k$. We call the map:
+
+\[ f : \mathbb{R}^{k \times m}_+ \to \mathbb{R}^{l \times n}_+; \quad M \mapsto \bar{R}^\top (M \otimes Q) \tag{22} \]
+
+a conditional embedding of $\mathbb{R}^{k \times m}_+$ in $\mathbb{R}^{l \times n}_+$. We denote the set of all such maps by $\hat{F}^{l,n}_{k,m}$. If $\bar{R}$ is the partition
+indicator matrix of a homogeneous partition (with partition blocks of equal cardinality), then we call f a
+homogeneous conditional embedding. We denote the set of all such homogeneous conditional embeddings by
+$F^{l,n}_{k,m}$ and assume that l is a multiple of k.
+
+Conditional embeddings preserve the 1-norm of the matrix rows; that is, the elements of $\hat{F}^{l,n}_{k,m}$
+map $(\Delta^k_{m-1})^\circ$ to $(\Delta^l_{n-1})^\circ$. On the other hand, they do not preserve the 1-norm of the entire matrix.
+Conditional embeddings are Markov maps only when k = l, in which case they are also Lebanon
+maps.
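+The following small example illustrates both statements. It is a sketch under assumptions made here: k = 2,
+l = 3, m = n = 2, both matrices of the collection Q are the 2 × 2 identity, and the partition indicator matrix
+(written $\bar{R}$ below) has blocks {1, 2} and {3}:
+
+\[ \bar{R} = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad M = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad \bar{R}^\top (M \otimes Q) = \begin{pmatrix} a & b \\ a & b \\ c & d \end{pmatrix}. \]
+
+Each row of the image is a copy of a row of M, so row sums are preserved and stochastic matrices are mapped
+to stochastic matrices. The sum of all entries, however, changes from a + b + c + d to 2(a + b) + (c + d).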
+
+4.2. Invariance Characterization
+
+Considering the conditional embeddings discussed in the previous section, we obtain the
+following metric characterization.
+
+Theorem 4.
+
+• Let $g^{(k,m)}$ denote a metric on $\mathbb{R}^{k \times m}_+$ for each k ≥ 1 and m ≥ 2. If every homogeneous conditional
+embedding $f \in F^{l,n}_{k,m}$ is an isometry with respect to these metrics, then:
+
+\[ g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2} + \delta_{ac} \left( \frac{k B}{k^2} + \delta_{bd} \frac{|M|}{M_{ab}} \frac{C}{k^2} \right), \quad \text{for all } M \in \mathbb{R}^{k \times m}_+, \tag{23} \]
+
+for some constants $A, B, C \in \mathbb{R}$, where $\partial_{ab} = \frac{\partial}{\partial M_{ab}}$ and $|M| = \sum_{ab} M_{ab}$.
+• Conversely, given the metrics defined by Equation (23) for any non-degenerate choice of constants
+$A, B, C \in \mathbb{R}$, each homogeneous conditional embedding $f \in F^{l,n}_{k,m}$, k ≤ l, m ≤ n is an isometry.
+• Moreover, the tensors $g^{(k,m)}$ from Equation (23) are positive-definite for all k ≥ 1 and m ≥ 2 if and only
+if C > 0, B + C > 0 and A + B + C > 0.
+
+The proof of Theorem 4 is similar to the proof of the Theorems of Chentsov, Campbell and
+Lebanon. Due to its technical nature, we defer it to Appendix B.
+Now, for the restriction of the metric $g^{(k,m)}$ to $(\Delta^k_{m-1})^\circ$, we have the following. In this case,
+|M| = k. Since tangent vectors $v = \sum_{ab} v_{ab} \partial_{ab} \in T_M(\Delta^k_{m-1})^\circ$ satisfy $\sum_b v_{ab} = 0$ for all a, the constants
+A and B become immaterial, and the metric can be written as:
+
+\[ g^{(k,m)}_M(u, v) = \sum_{ab} \frac{|M| u_{ab} v_{ab}}{M_{ab}} \frac{C}{k^2} = \sum_{ab} \frac{u_{ab} v_{ab}}{M_{ab}} \frac{C}{k}, \quad \text{for all } u, v \in T_M(\Delta^k_{m-1})^\circ. \tag{24} \]
+
+This metric is a specialization of the metric (18) derived by Lebanon (Theorem 3).
+The statement of Theorem 4 becomes false if we consider general conditional embeddings instead
+of homogeneous ones:
+
+Theorem 5. There is no family of metrics $g^{(k,m)}$ on $\mathbb{R}^{k \times m}_+$ (or on $(\Delta^k_{m-1})^\circ$) for each k ≥ 1 and m ≥ 2, for
+which every conditional embedding $f \in \hat{F}^{l,n}_{k,m}$ is an isometry.
+
+This negative result will become clearer from the perspective of Section 6: as we will show in
+Theorem 7, although there are no metrics that are invariant under all conditional embeddings, there are
+families of metrics (depending on a parameter, ρ) that transform covariantly (that is, in a well-defined
+manner) with respect to the conditional embeddings. We defer the proof of Theorem 5 to Appendix B.
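+The role of homogeneity can be illustrated with the restricted metric of Equation (24). The computation
+below is only a consistency check for that specific family, under the simplifying assumption (made here) that
+all matrices in the collection Q are identity matrices, so that f merely copies rows; it is not the proof of
+Theorem 5. If each row of K is copied r = l/k times, then for tangent vectors u, v at K:
+
+\[ g^{(l,m)}_{f(K)}(f_* u, f_* v) = \frac{C}{l} \sum_{a=1}^{k} r \sum_b \frac{u_{ab} v_{ab}}{K_{ab}} = \frac{C}{k} \sum_{ab} \frac{u_{ab} v_{ab}}{K_{ab}} = g^{(k,m)}_K(u, v). \]
+
+If instead row a is copied $r_a$ times with unequal $r_a$ (a non-homogeneous partition with $l = \sum_a r_a$), the
+left-hand side becomes $\frac{C}{l} \sum_a r_a \sum_b u_{ab} v_{ab} / K_{ab}$, which weights the rows unequally and therefore cannot
+agree with the right-hand side for all u and v.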
+
+5. The Fisher Metric on Polytopes and Point Configurations
+
+In the previous section, we obtained distinguished Riemannian metrics on $\mathbb{R}^{k \times m}_+$ and $(\Delta^k_{m-1})^\circ$ by
+postulating invariance under natural maps. In this section, we take another viewpoint based on general
+considerations about Riemannian metrics on arbitrary polytopes. This is achieved by embedding each
+polytope in a probability simplex as an exponential family. We first recall the necessary background.
+In Section 5.2, we then present our general results, and in Section 5.3, we discuss the special case of
+conditional polytopes.
+
+5.1. Exponential Families and Polytopes
+
+Let X be a finite set and A ∈ Rd×X a matrix with columns ax indexed by x ∈ X . It will be
+convenient to consider the rows Ai, i ∈ [d] of A as functions Ai : X → R. Finally, let ν: X → R+. The
+exponential family EA,ν is the set of probability distributions on X given by:
+
+\[ p(x; \theta) = \exp(\theta^\top a_x + \log(\nu(x)) - \log(Z(\theta))), \quad \text{for all } x \in \mathcal{X}, \text{ for all } \theta \in \mathbb{R}^d, \tag{25} \]
+
+with the normalization function Z(θ) = ∑x′∈X exp(θ⊤ax′ + log(ν(x′))). The functions Ai are called
+the observables and ν the reference measure of the exponential family. When the reference measure ν
+is constant, ν(x) = 1 for all x ∈ X , we omit the subscript and write EA.
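+A direct calculation shows that the Fisher information matrix of $\mathcal{E}_{A,\nu}$ at a point $\theta \in \mathbb{R}^d$ has
+coordinates:
+
+\[ g^{\mathcal{E}_{A,\nu}}_\theta(\partial_{\theta_i}, \partial_{\theta_j}) = \mathrm{cov}_\theta(A_i, A_j), \quad \text{for all } i, j \in [d]. \tag{26} \]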
+
+Here, covθ denotes the covariance computed with respect to the probability distribution p(·; θ).
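+The convex support of $\mathcal{E}_{A,\nu}$ is defined as:
+
+\[ \operatorname{conv} A := \operatorname{conv}\{a_x : x \in \mathcal{X}\} = \{\mathbb{E}_p[A] : p \in \Delta_{|\mathcal{X}|-1}\} = \{\mathbb{E}_p[A] : p \in \overline{\mathcal{E}_{A,\nu}}\}, \tag{27} \]
+
+where conv S is the set of all convex combinations of points in S. The moment map $\mu : p \in \Delta_{n-1} \mapsto A \cdot p \in \mathbb{R}^d$
+restricts to a homeomorphism $\overline{\mathcal{E}_{A,\nu}} \to \operatorname{conv} A$; see [16]. Here, $\overline{\mathcal{E}_{A,\nu}}$ denotes the Euclidean
+closure of $\mathcal{E}_{A,\nu}$. The inverse of μ will be denoted by $\mu^{-1} : \operatorname{conv} A \to \overline{\mathcal{E}_{A,\nu}} \subseteq \Delta_{n-1}$. This gives a natural
+embedding of the polytope conv A in the probability simplex $\Delta_{|\mathcal{X}|-1}$. Note that the convex support is
+independent of the reference measure ν. See [17] for more details.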
+
+5.2. Invariance Fisher Metric Characterizations for Polytopes
+
+Let $P \subseteq \mathbb{R}^d$ be a polytope with n vertices $a_1, \ldots, a_n$. Let $A = (a_1, \ldots, a_n)$ be the matrix with
+columns $a_i \in \mathbb{R}^d$ for all $i \in [n]$. Then, $\mathcal{E}_A \subseteq \Delta^\circ_{n-1}$ is an exponential family with convex support P. We
+will also denote this exponential family by $\mathcal{E}_P$. We can use the inverse of the moment map, $\mu^{-1}$, to pull
+back geometric structures on $\Delta^\circ_{n-1}$ to the relative interior $P^\circ$ of P.
+
+Definition 5. The Fisher metric on $P^\circ$ is the pull-back of the Fisher metric on $\mathcal{E}_A \subseteq \Delta^\circ_{n-1}$ by $\mu^{-1}$.
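+For orientation, here is the simplest instance of Definition 5, worked out as an illustrative sketch; the
+choice P = [0, 1] and the coordinate η are assumptions made here. The polytope P = [0, 1] has vertices
+$a_1 = 0$ and $a_2 = 1$, so A = (0, 1), and with a constant reference measure Equation (25) gives:
+
+\[ p(1; \theta) = \frac{1}{1 + e^\theta}, \qquad p(2; \theta) = \frac{e^\theta}{1 + e^\theta}, \qquad \eta := \mu(p) = A \cdot p = p(2; \theta) \in (0, 1). \]
+
+The inverse moment map is $\mu^{-1}(\eta) = (1 - \eta, \eta)$, so the tangent vector $\partial_\eta$ is mapped to (−1, 1), and
+Equation (2) yields:
+
+\[ g_\eta(\partial_\eta, \partial_\eta) = \frac{1}{1 - \eta} + \frac{1}{\eta} = \frac{1}{\eta (1 - \eta)}. \]
+
+In the θ coordinate, Equation (26) gives $\mathrm{cov}_\theta(A, A) = \eta (1 - \eta)$, which agrees with this expression after the
+change of variables $d\eta / d\theta = \eta (1 - \eta)$.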
+
+Some obvious questions are: Why is this a natural construction? Which maps between polytopes
+are isometries between their Fisher metrics? Can we find a characterization of Chentsov type for
+this metric?
+Affine maps are natural maps between polytopes. However, in order to obtain isometries, we
+need to impose some additional constraints. Consider two polytopes $P \subseteq \mathbb{R}^d$, $P' \subseteq \mathbb{R}^{d'}$ and an affine
+map $\phi : \mathbb{R}^d \to \mathbb{R}^{d'}$ that satisfies $\phi(P) \subseteq P'$. A natural condition in the context of exponential families is
+that φ restricts to a bijection between the set vert(P) of vertices of P and the set vert(P′) of vertices of P′.
+In this case, $\mathcal{E}_{P'} \subseteq \mathcal{E}_P \subseteq \Delta^\circ_{n-1}$. Moreover, the moment map μ′ of P′ factorizes through the moment
+map μ of P: $\mu' = \phi \circ \mu$. Let $\phi^{-1} = \mu \circ \mu'^{-1}$. Then, the following diagram commutes:
+
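+\[ \begin{array}{ccccc} P^\circ & \xrightarrow{\ \mu^{-1}\ } & \mathcal{E}_P & \subseteq & \Delta^\circ_{n-1} \\ {\scriptstyle \phi^{-1}} \big\uparrow & & \big\uparrow {\scriptstyle \subseteq} & & \\ P'^\circ & \xrightarrow{\ \mu'^{-1}\ } & \mathcal{E}_{P'} & & \end{array} \tag{28} \]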
+
+It follows that φ−1 is an isometry from P′◦ to its image in P◦. Observe that the inverse moment map
+itself arises in this way: In the diagram (28), if P is equal to Δn−1, then the upper moment map μ−1 is
+the identity map, and φ−1 equals the inverse moment map μ′−1 of P′.
+The constraint of mapping vertices to vertices bijectively is very restrictive. In order to consider
+a larger class of affine maps, we need to generalize our construction from polytopes to weighted
+point configurations.
+
+Definition 6. A weighted point configuration is a pair (A, ν) consisting of a matrix A ∈ Rd×n with columns
+a1, . . . , an and a positive weight function ν : {1, . . . , n} → R+ assigning a weight to each column ai. The pair
+(A, ν) defines the exponential family EA,ν.
+The (A, ν)-Fisher metric on (conv A)◦ is the pull-back of the Fisher metric on $\Delta^\circ_{n-1}$ through the inverse
+of the moment map.
+
+We recover Definition 5 as follows. For a polytope P, let A be the point configuration consisting
+of the vertices of P. Moreover, let ν be a constant function. Then, EP = EA,ν, and the two definitions of
+the Fisher metric on P◦ coincide.
+The following are natural maps between weighted point configurations:
+
+Definition 7. Let $(A, \nu)$, $(A', \nu')$ be two weighted point configurations with $A = (a_i)_i \in \mathbb{R}^{d \times n}$ and
+$A' = (a'_j)_j \in \mathbb{R}^{d' \times n'}$. A morphism $(A, \nu) \to (A', \nu')$ is a pair $(\phi, \sigma)$ consisting of an affine map $\phi : \mathbb{R}^d \to \mathbb{R}^{d'}$ and a
+surjective map $\sigma : \{1, \ldots, n\} \to \{1, \ldots, n'\}$ with $\phi(a_i) = a'_{\sigma(i)}$ and $\nu'(a'_j) = \alpha \sum_{i : \sigma(i) = j} \nu(a_i)$, where α > 0 is
+a constant that does not depend on j.
+
+Consider a morphism $(\phi, \sigma) : (A, \nu) \to (A', \nu')$. For each $j \in [n']$, let $A_j = \{i : \phi(a_i) = a'_j\}$. Then,
+$(A_1, \ldots, A_{n'})$ is a partition of $[n]$. Define a matrix $Q \in \mathbb{R}^{n' \times n}$ by:
+
+\[ Q_{ji} = \begin{cases} \dfrac{\nu(i)}{\sum_{i' \in A_j} \nu(i')}, & \text{if } i \in A_j, \\ 0, & \text{else.} \end{cases} \tag{29} \]
+
+Then, Q is a Markov mapping, and the following diagram commutes:
+
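+\[ \begin{array}{ccccc} (\operatorname{conv} A)^\circ & \xrightarrow{\ \mu^{-1}\ } & \mathcal{E}_{A,\nu} & \subseteq & \Delta^\circ_{n-1} \\ {\scriptstyle \phi^{-1}} \big\uparrow & & & & \big\uparrow {\scriptstyle Q} \\ (\operatorname{conv} A')^\circ & \xrightarrow{\ \mu'^{-1}\ } & \mathcal{E}_{A',\nu'} & \subseteq & \Delta^\circ_{n'-1} \end{array} \tag{30} \]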
+
+By Chentsov’s Theorem (Theorem 1), Q is an isometric embedding. It follows that φ−1 also induces an
+isometric embedding. This shows the first part of the following Theorem:
+
+Theorem 6.
+
+• Let $(\phi, \sigma) : (A, \nu) \to (A', \nu')$ be a morphism of weighted point configurations. Then,
+$\phi^{-1} : (\operatorname{conv} A')^\circ \to (\operatorname{conv} A)^\circ$ is an isometric embedding with respect to the Fisher metrics on $(\operatorname{conv} A)^\circ$
+and $(\operatorname{conv} A')^\circ$.
+• Let $g_{A,\nu}$ be a Riemannian metric on $(\operatorname{conv} A)^\circ$ for each weighted point configuration (A, ν). If every
+morphism $(\phi, \sigma) : (A, \nu) \to (A', \nu')$ of weighted point configurations induces an isometric embedding
+$\phi^{-1} : (\operatorname{conv} A')^\circ \to (\operatorname{conv} A)^\circ$, then there exists a constant $\alpha \in \mathbb{R}_+$ such that $g_{A,\nu}$ is equal to α times
+the (A, ν)-Fisher metric.
+
+Proof. The first statement follows from the discussion before the Theorem. For the second statement,
+we show that under the given assumptions, all Markov maps are isometric embeddings. By Chentsov’s
+Theorem (Theorem 1), this implies that the metrics gP agree with the Fisher metric whenever P is
+a simplex. The statement then follows from the two facts that the metric on P◦ or (conv A)◦ is the
+pull-back of the Fisher metric through the inverse of the moment map and that μ−1 is itself a morphism.
+Observe that $\Delta_{n-1} = \operatorname{conv} I_n = \operatorname{conv}\{e_1, \ldots, e_n\}$ is a polytope, and $\Delta^\circ_{n-1}$ is the corresponding
+exponential family. Consider a Markov embedding $Q : \Delta^\circ_{n'-1} \to \Delta^\circ_{n-1}$, $p \mapsto p \cdot Q$. Let $\nu(i) = \sum_j Q_{ji}$
+be the value of the unique non-zero entry of Q in the i-th column. This defines a morphism and an
+embedding as follows:
+Let A be the matrix that arises from Q by replacing each non-zero entry by one. We define φ
+as the linear map represented by the matrix A, and define $\sigma : [n] \to [n']$ by σ(j) = i if and only
+if $a_j = e_i$, that is, σ(j) indicates the row i in which the j-th column of A is non-zero. Then, (φ, σ)
+is a morphism $(I_n, \nu) \to (I_{n'}, 1)$, and by assumption, the inverse $\phi^{-1}$ is an isometric embedding
+$\Delta^\circ_{n'-1} \to \Delta^\circ_{n-1}$. However, $\phi^{-1}$ is equal to the Markov map Q. This shows that all Markov maps are
+isometric embeddings, and so, by Chentsov’s Theorem, the statement holds true on the simplices.
+
+Theorem 6 defines a natural metric on $(\Delta^k_{m-1})^\circ$ that we want to discuss in more detail next.
+
+5.3. Independence Models and Conditional Polytopes
+
+Consider k random variables with finite state spaces $[n_1], \ldots, [n_k]$. The independence model
+consists of all joint distributions $p \in \Delta_{\prod_{i \in [k]} n_i - 1}$ of these variables that factorize as:
+
+\[ p(x_1, \ldots, x_k) = \prod_{i \in [k]} p_i(x_i), \quad \text{for all } x_1 \in [n_1], \ldots, x_k \in [n_k], \tag{31} \]
+
+where $p_i \in \Delta_{n_i - 1}$ for all $i \in [k]$. Assuming fixed $n_1, \ldots, n_k$, we denote the independence model by $\mathcal{E}_k$.
+It is the Euclidean closure of an exponential family (with observables of the form $\delta_{i y_i}$). The convex
+support of $\mathcal{E}_k$ is equal to the product of simplices $P_k := \Delta_{n_1 - 1} \times \cdots \times \Delta_{n_k - 1}$. The parametrization (31)
+corresponds to the inverse of the moment map.
+We can write any tangent vector $u \in T_{(p_1, \ldots, p_k)} P^\circ_k$ of this open product of simplices as a linear
+combination $u = \sum_{i \in [k]} \sum_{x_i \in [n_i]} u_{i x_i} \partial_{i, x_i}$, where $\sum_{x_i \in [n_i]} u_{i x_i} = 0$ for all $i \in [k]$. Given two such tangent
+vectors, the Fisher metric is given by:
+
+\[ g^{P_k}_{(p_1, \ldots, p_k)}(u, v) = \sum_{i \in [k]} \sum_{x_i \in [n_i]} \frac{u_{i x_i} v_{i x_i}}{p_i(x_i)}. \tag{32} \]
+
+Just as the convex support of the independence model is the Cartesian product of probability
+simplices, the Fisher metric on the independence model is the product metric of the Fisher metrics
+on the probability simplices of the individual variables. If $n_1 = \cdots = n_k =: n$, then $P_k = \Delta^k_{n-1}$ can be
+identified with the set of k × n stochastic matrices.
+The Fisher metric on the product of simplices is equal to the product of the Fisher metrics on the
+factors. More generally, if $P = Q_1 \times Q_2$ is a Cartesian product, then the Fisher metric on $P^\circ$ is equal
+to the product of the Fisher metrics on $Q_1^\circ$ and $Q_2^\circ$. In fact, in this case, the inverse of the moment
+map of P can be expressed in terms of the two moment map inverses $\mu_1^{-1} : Q_1 \to \mathcal{E}_{Q_1} \subseteq \Delta_{m_1 - 1}$ and
+$\mu_2^{-1} : Q_2 \to \mathcal{E}_{Q_2} \subseteq \Delta_{m_2 - 1}$ and the moment map $\tilde{\mu}$ of the independence model $\Delta_{m_1 - 1} \times \Delta_{m_2 - 1}$, by:
+
+\[ \mu^{-1}(q_1, q_2) = \tilde{\mu}^{-1}(\mu_1^{-1}(q_1), \mu_2^{-1}(q_2)). \tag{33} \]
+
+Therefore, the pull-back by μ−1 factorizes through the pull-back by ˜μ−1, and since the independence
+model carries a product metric, the product of polytopes also carries a product metric.
+
+Entropy 2014, 16, 3207–3233
+
+Let us compare the metric $g^{(k,m)}_K$ from Equation (24) with the Fisher metric $g^{P_k}_{(K_1,\dots,K_k)}$ from
+Equation (32) on the product of simplices $P^\circ = (\Delta^k_{m-1})^\circ$. In both cases, the metric is a product
+metric; that is, it has the form:
+
+\[
+g = g_1 + \cdots + g_k, \tag{34}
+\]
+
+where $g_i$ is a metric on the i-th factor $\Delta^\circ_{m-1}$. For $g^{\Delta^k_{m-1}}_K$, $g_i$ is equal to the Fisher metric on $\Delta^\circ_{m-1}$. However,
+for $g^{(k,m)}_K$, $g_i$ is equal to 1/k times the Fisher metric on $\Delta^\circ_{m-1}$. Since this factor only depends on k, it
+only plays a role if stochastic matrices of different sizes are compared. The additional factor of 1/k can
+be interpreted as the uniform distribution on k elements. This is related to another more general class
+of Riemannian metrics that are used in applications; namely, given a function $K \in \Delta^k_{m-1} \mapsto \rho_K \in \mathbb{R}^k_+$,
+it is common to use product metrics with $g_i$ equal to $\rho_K(i)$ times the Fisher metric on $\Delta^\circ_{m-1}$. When K
+has the interpretation of a channel or when K describes the policy by which a system reacts to some
+sensor values, a natural possibility is to let $\rho_K$ be the stationary distribution of the channel input or of
+the sensor values, respectively. We will discuss this approach in Section 6.
+
+6. Weighted Product Metrics for Conditional Models
+
+In this section, we consider metrics on spaces of stochastic matrices defined as weighted sums
+of the Fisher metrics on the spaces of the matrix rows, similar to Equation (34). This kind of metric
+was used initially by Amari [1] in order to define a natural gradient in the supervised learning context.
+Later, in the context of reinforcement learning, Kakade [2] defined a natural policy gradient based on
+this kind of metric, which has been further developed by Peters et al. [10]. Related applications within
+unsupervised learning have been pursued by Zahedi et al. [18].
+Consider the following weighted product Fisher metric:
+
+\[
+g^{\rho,m}_K = \sum_a \rho_K(a)\, g^{(m),a}_{K_a}, \quad \text{for all } K \in (\Delta^k_{m-1})^\circ, \tag{35}
+\]
+
+where $g^{(m),a}_{K_a}$ denotes the Fisher metric of $\Delta^\circ_{m-1}$ at the a-th row of K and $\rho_K \in \Delta^\circ_{k-1}$ is a probability
+distribution over a associated with each $K \in (\Delta^k_{m-1})^\circ$. For example, the distribution $\rho_K$ could be the
+stationary distribution of sensor values observed by an agent when operating under a policy described
+by K.
+In the following, we will try to illuminate the properties of polytope embeddings that yield the
+metric (35) as the pull-back of the Fisher information metric on a probability simplex. We will focus on
+the case that ρK = ρ is independent of K.
+There are two direct ways of embedding $\Delta^k_{m-1}$ in a probability simplex. In Section 5, we used
+the inverse of the moment map of an exponential family, possibly with some reference measure. This
+embedding is illustrated in the left panel of Figure 2. If we are given a fixed probability distribution
+$\rho \in \Delta^\circ_{k-1}$, there is a second natural embedding $\psi_\rho : \Delta^k_{m-1} \to \Delta_{k\cdot m-1}$ defined as follows:
+
+\[
+\psi_\rho(K)(x, y) = \rho(x) K_{x,y} \quad \text{for all } x \in [k],\ y \in [m]. \tag{36}
+\]
+
+If ρ is the distribution of a random variable X and $K \in \Delta^k_{m-1}$ is the stochastic matrix describing the
+conditional distribution of another variable Y given X, then $\psi_\rho(K)$ is the joint distribution of X and Y.
+Note that $\psi_\rho$ is an affine embedding. See the right panel of Figure 2 for an illustration.
+The pull-back of the Fisher metric on $\Delta^\circ_{km-1}$ through $\psi_\rho$ is given by:
+
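+\[
+\begin{aligned}
+g^{(km)}_{\psi_\rho(K)}(\psi_{\rho*}u, \psi_{\rho*}v)
+&= \sum_{a,b} \sum_{c,d} \sum_{i,j} \rho(i) K_{ij}\, u_{ab} \frac{\partial \log \rho(i) K_{ij}}{\partial K_{ab}}\, v_{cd} \frac{\partial \log \rho(i) K_{ij}}{\partial K_{cd}} \\
+&= \sum_i \rho(i) \sum_j \frac{u_{ij} v_{ij}}{K_{ij}} = \sum_i \rho(i)\, g^{i}_{K_i}(u_i, v_i) = g^{\rho,m}_K(u, v).
+\end{aligned} \tag{37}
+\]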
+
+This recovers the weighted sum of Fisher metrics from Equation (35).
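
The identity in Equation (37) can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly drawn ρ, K and tangent vectors; the function names (`joint_embedding`, `fisher_metric`, `weighted_product_metric`, `random_tangent`) are illustrative and do not appear in the paper. It uses the fact that ψρ is affine, so its push-forward simply scales the i-th row of a tangent vector by ρ(i).

```python
import numpy as np

def joint_embedding(rho, K):
    """psi_rho(K)(x, y) = rho(x) * K[x, y], as in Equation (36)."""
    return rho[:, None] * K

def fisher_metric(p, u, v):
    """Fisher metric on the open simplex: sum over entries of u * v / p."""
    return float(np.sum(u * v / p))

def weighted_product_metric(rho, K, u, v):
    """Weighted product metric of Equation (35): sum_i rho_i sum_j u_ij v_ij / K_ij."""
    return float(np.sum(rho[:, None] * u * v / K))

def random_tangent(rng, k, m):
    """Random k x m array whose rows sum to zero (a tangent vector at K)."""
    u = rng.normal(size=(k, m))
    return u - u.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
k, m = 3, 4
rho = rng.dirichlet(np.ones(k))          # interior point of the (k-1)-simplex
K = rng.dirichlet(np.ones(m), size=k)    # k x m stochastic matrix with positive entries
u, v = random_tangent(rng, k, m), random_tangent(rng, k, m)

# Push-forward through the affine map psi_rho scales row i by rho_i.
lhs = fisher_metric(joint_embedding(rho, K), rho[:, None] * u, rho[:, None] * v)
rhs = weighted_product_metric(rho, K, u, v)
print(lhs, rhs)
```

The two printed values should agree up to floating-point error, which is the content of Equation (37) for this particular sample.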
+
+[Figure 2: two panels. Left panel: $\Delta^k_{m-1}$ mapped onto $E_k$ inside $\Delta_{m^k-1}$. Right panel: the images $\psi_{(1/4,3/4)}\Delta^k_{m-1}$ and $\psi_{(1/2,1/2)}\Delta^k_{m-1}$ inside $\Delta_{k\cdot m-1}$.]
+
+Figure 2. An illustration of different embeddings of the conditional polytope $\Delta^k_{m-1}$ in a probability
+simplex. The left panel shows an embedding in $\Delta_{m^k-1}$ by the inverse of the moment map μ of the
+independence model. The right panel shows an affine embedding in $\Delta_{k\cdot m-1}$ as a set of joint probability
+distributions for two different specifications of marginals.
+
+Are there natural maps that leave the metrics $g^{\rho,m}$ invariant? Let us reconsider the stochastic
+embeddings from Definition 4. Let $R$ be a k × l indicator partition matrix and $\overline{R}$ a stochastic partition
+matrix with the same block structure as $R$. Observe that to each indicator partition matrix $R$ there are
+many compatible stochastic partition matrices $\overline{R}$, but the indicator partition matrix $R$ for any stochastic
+partition matrix $\overline{R}$ is unique. Furthermore, let $Q = \{Q^{(a)}\}_{a\in[k]}$ be a collection of stochastic partition
+matrices. The corresponding conditional embedding f maps $K \in \Delta^k_{m-1}$ to $f(K) := R^\top(K \otimes Q) \in \Delta^l_{n-1}$.
+Let $\rho \in \Delta^\circ_{k-1}$. Suppose that K describes the conditional distribution of Y given X and that $P = \psi_\rho(K)$
+describes the joint distribution of Y and X. As explained in Section 4.1, the matrix $\overline{f}(P) := \overline{R}^\top(P \otimes Q)$
+describes the joint distribution of a pair of random variables (X′, Y′), and the conditional distribution
+of Y′ given X′ is given by f(K). In this situation, the marginal distribution of X′ is given by $\rho' = \rho\overline{R}$.
+Therefore, the following diagram commutes:
+
+\[
+\begin{array}{ccc}
+(\Delta^k_{m-1})^\circ & \xrightarrow{\ \psi_\rho\ } & \Delta^\circ_{mk-1} \\
+\downarrow f & & \downarrow \overline{f} \\
+(\Delta^l_{n-1})^\circ & \xrightarrow{\ \psi_{\rho'}\ } & \Delta^\circ_{nl-1}
+\end{array} \tag{38}
+\]
+
+The preceding discussion implies the first statement of the following result:
+
+Theorem 7.
+
+• For any k ≥ 1 and m ≥ 2 and any $\rho \in \Delta^\circ_{k-1}$, the Riemannian metric $g^{\rho,m}$ on $(\Delta^k_{m-1})^\circ$ satisfies:
+
+\[
+g^{\rho,m} = f^*(g^{\rho',n}), \quad \text{for } \rho' = \rho\overline{R}, \tag{39}
+\]
+
+for any conditional embedding $f : K \mapsto R^\top(K \otimes Q)$.
+• Conversely, suppose that for any k ≥ 1 and m ≥ 2 and any $\rho \in \Delta^\circ_{k-1}$, there is a Riemannian metric
+$g^{(\rho,m)}$ on $(\Delta^k_{m-1})^\circ$, such that Equation (39) holds for all conditional embeddings, and suppose that $g^{(\rho,m)}$
+depends continuously on ρ. Then, there is a constant A > 0 that satisfies $g^{(\rho,m)} = A g^{\rho,m}$.
+
+Proof. The first statement follows from the commutative diagram (38). For the second statement,
+denote by $\rho_k$ the uniform distribution on a set of k elements. If $f : K \mapsto R^\top(K \otimes Q)$ is a homogeneous
+conditional embedding of $\Delta^k_{m-1}$ in $\Delta^l_{n-1}$, then $\overline{R} = \frac{k}{l} R$ is a stochastic partition matrix corresponding
+to the partition indicator matrix $R$. Observe that $\rho_l = \rho_k\overline{R}$. Therefore, the family of Riemannian
+metrics $g^{(\rho_k,m)}$ on $\Delta^k_{m-1}$ satisfies the assumptions of Theorem 4. Therefore, there is a constant A > 0
+for which $g^{(\rho_k,m)}$ equals A/k times the product Fisher metric. This proves the statement for uniform
+distributions ρ.
+A general distribution $\rho \in \Delta^\circ_{k-1}$ can be approximated by a distribution with rational probabilities.
+Since $g^{(\rho,m)}$ is assumed to be continuous, it suffices to prove the statement for rational ρ. In this case,
+there exists a stochastic partition matrix $\overline{R}$ for which $\rho' := \rho\overline{R}$ is a uniform distribution, and so, $g^{(\rho',n)}$
+is of the desired form. Equation (39) shows that $g^{(\rho,m)}$ is also of the desired form.
+
+7. Gradient Fields and Replicator Equations
+
+In this section, we use gradient fields in order to compare Riemannian metrics on the space
+$(\Delta^k_{n-1})^\circ$.
+
+7.1. Replicator Equations
+
+We start with gradient fields on the simplex $\Delta^\circ_{n-1}$. A Riemannian metric g on $\Delta^\circ_{n-1}$ allows us to
+consider gradient fields of differentiable functions $F : \Delta^\circ_{n-1} \to \mathbb{R}$. To be more precise, consider the
+differential $d_pF : T_p\Delta^\circ_{n-1} \to \mathbb{R}$ of F in p. It is a linear form on $T_p\Delta^\circ_{n-1}$, which maps each tangent vector
+u to $d_pF(u) = \frac{\partial F}{\partial u}(p) \in \mathbb{R}$. Using the map $u \mapsto g_p(u, \cdot)$, this linear form can be identified with a tangent
+vector in $T_p\Delta^\circ_{n-1}$, which we denote by $\operatorname{grad}_pF$. If we choose the Fisher metric $g^{(n)}$ as the Riemannian
+metric, we obtain the gradient in the following way. First consider a differentiable extension of F to the
+positive cone $\mathbb{R}^n_+$, which we will denote by the same symbol F. With the partial derivatives $\partial_iF$ of F,
+the Fisher gradient of F on the simplex $\Delta^\circ_{n-1}$ is given as:
+
+\[
+(\operatorname{grad}_pF)_i = p_i \Big( \partial_iF(p) - \sum_{j=1}^n p_j\, \partial_jF(p) \Big), \quad i \in [n]. \tag{40}
+\]
+
+Note that the expression on the right-hand side of Equation (40) does not depend on the particular
+differentiable extension of F to $\mathbb{R}^n_+$. The corresponding differential equation is well known in theoretical
+biology as the replicator equation (see [19,20]):
+
+\[
+\dot p_i = p_i \Big( \partial_iF(p) - \sum_{j=1}^n p_j\, \partial_jF(p) \Big), \quad i \in [n]. \tag{41}
+\]
+
+We now apply this gradient formula to functions that have the structure of an expectation value.
+Given real numbers $F_i$, $i \in [n]$, referred to as fitness values, we consider the mean fitness:
+
+\[
+\bar F(p) := \sum_{i=1}^n p_i F_i. \tag{42}
+\]
+
+Replacing the $p_i$ by any positive real numbers leads to a differentiable extension of $\bar F$, also denoted
+by $\bar F$. We have $\partial_i\bar F = F_i$, which leads to the following replicator equation:
+
+\[
+\dot p_i = p_i \big( F_i - \bar F(p) \big), \quad i \in [n]. \tag{43}
+\]
+
+This equation has the solution:
+
+\[
+p_i(t) = \frac{p_i(0)\, e^{tF_i}}{\sum_{j=1}^n p_j(0)\, e^{tF_j}}, \quad i \in [n]. \tag{44}
+\]
+
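+The mean fitness does not decrease along this solution of the gradient field, and it increases strictly
+unless all fitness values $F_i$ coincide. The rate of increase can be easily calculated:
+
+\[
+\frac{d}{dt} \bar F\big(p(t)\big) = \sum_{i=1}^n \dot p_i(t)\, F_i = \sum_{i=1}^n p_i \big( F_i - \bar F(p) \big) F_i = \sum_{i=1}^n p_i \big( F_i - \bar F(p) \big)^2 = \operatorname{var}_p(F) \ge 0. \tag{45}
+\]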
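
As an illustration of Equations (44) and (45), the following sketch evaluates the closed-form replicator solution and compares a finite-difference derivative of the mean fitness with the fitness variance. It assumes NumPy; the fitness values, the initial point and the function names are made up for the example.

```python
import numpy as np

def replicator_solution(p0, F, t):
    """Closed-form solution of Equation (44), shifted for numerical stability."""
    z = t * F
    w = p0 * np.exp(z - z.max())
    return w / w.sum()

def mean_fitness(p, F):
    """Mean fitness of Equation (42)."""
    return float(p @ F)

def fitness_variance(p, F):
    """Variance of the fitness values under p, the right-hand side of Equation (45)."""
    return float(p @ (F - p @ F) ** 2)

# Hypothetical example data.
p0 = np.array([0.2, 0.3, 0.5])
F = np.array([1.0, 3.0, 2.0])

t, h = 0.7, 1e-5
# Central finite difference of the mean fitness along the trajectory.
derivative = (mean_fitness(replicator_solution(p0, F, t + h), F)
              - mean_fitness(replicator_solution(p0, F, t - h), F)) / (2 * h)
variance = fitness_variance(replicator_solution(p0, F, t), F)
print(derivative, variance)

# For large t the mass concentrates on the index with the largest fitness.
print(replicator_solution(p0, F, 50.0))
```

Subtracting the maximum exponent before exponentiating does not change the normalized result; it only avoids overflow for large |t|.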
+
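+As limit points of this solution, we obtain:
+
+\[
+\lim_{t\to-\infty} p_i(t) =
+\begin{cases}
+\dfrac{p_i(0)}{\sum_{j\in\operatorname{argmin} F} p_j(0)}, & \text{if } i \in \operatorname{argmin} F, \\
+0, & \text{otherwise},
+\end{cases}
+\quad i \in [n], \tag{46}
+\]
+
+and:
+
+\[
+\lim_{t\to+\infty} p_i(t) =
+\begin{cases}
+\dfrac{p_i(0)}{\sum_{j\in\operatorname{argmax} F} p_j(0)}, & \text{if } i \in \operatorname{argmax} F, \\
+0, & \text{otherwise},
+\end{cases}
+\quad i \in [n]. \tag{47}
+\]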
+
+7.2. Extension of the Replicator Equations to Stochastic Matrices
+
+Now, we come to the corresponding considerations of gradient fields in the context of stochastic
+matrices $K \in (\Delta^k_{n-1})^\circ$. We consider a function:
+
+\[
+K \mapsto F(K) = F(K_{11}, \dots, K_{1n}; K_{21}, \dots, K_{2n}; \dots; K_{k1}, \dots, K_{kn}). \tag{48}
+\]
+
+One way to deal with this is to consider for each $i \in [k]$ the corresponding replicator equation:
+
+\[
+\dot K_{ij} = K_{ij} \Big( \partial_{ij}F(K) - \sum_{j'=1}^n K_{ij'}\, \partial_{ij'}F(K) \Big), \quad j \in [n]. \tag{49}
+\]
+
+This is the gradient field that one obtains by using the product Fisher metric on $(\Delta^k_{n-1})^\circ$
+(Equation (17)):
+
+\[
+g^{(k,m)}_K(u, v) = \sum_{ij} \frac{1}{K_{ij}}\, u_{ij} v_{ij}. \tag{50}
+\]
+
+If we replace the metric by the weighted product Fisher metric considered by Kakade (Equation (35)),
+
+\[
+g^{\rho,m}_K(u, v) = \sum_{ij} \frac{\rho_i}{K_{ij}}\, u_{ij} v_{ij}, \tag{51}
+\]
+
+then we obtain
+
+\[
+\dot K_{ij} = \frac{K_{ij}}{\rho_i} \Big( \partial_{ij}F(K) - \sum_{j'=1}^n K_{ij'}\, \partial_{ij'}F(K) \Big), \quad j \in [n]. \tag{52}
+\]
+
+7.3. The Example of Mean Fitness
+
+Next, we want to study how the gradient flows with respect to different metrics compare. We
+restrict to the class of metrics $g^{\rho,m}$ (Equation (35)), where $\rho \in \Delta^\circ_{k-1}$ is a probability distribution. In
+principle, one could drop the normalization condition $\sum_i \rho_i = 1$ and allow arbitrary coefficients $\rho_i$.
+However, it is clear that the rate of convergence can always be increased by scaling all values $\rho_i$ with a
+common positive factor. Therefore, some normalization condition is needed for ρ.
+With a probability distribution $p \in \Delta^\circ_{k-1}$ and fitness values $F_{ij}$, let us consider again the example
+of an expectation value function:
+
+\[
+\bar F(K) = \sum_{i=1}^k p_i \sum_{j=1}^n K_{ij} F_{ij}. \tag{53}
+\]
+
+With $\partial_{ij}\bar F(K) = p_i F_{ij}$, this leads to:
+
+\[
+\dot K_{ij} = \frac{p_i}{\rho_i} K_{ij} \Big( F_{ij} - \sum_{j'=1}^n K_{ij'} F_{ij'} \Big), \quad j \in [n]. \tag{54}
+\]
+
+The corresponding solutions are given by:
+
+\[
+K_{ij}(t) = \frac{K_{ij}(0)\, e^{t \frac{p_i}{\rho_i} F_{ij}}}{\sum_{j'=1}^n K_{ij'}(0)\, e^{t \frac{p_i}{\rho_i} F_{ij'}}}, \quad i \in [k],\ j \in [n]. \tag{55}
+\]
+
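+Since $\operatorname{argmax}(\frac{p_i}{\rho_i} F_{i\cdot})$ and $\operatorname{argmin}(\frac{p_i}{\rho_i} F_{i\cdot})$ are independent of $\rho_i > 0$, the limit points are given
+independently of the chosen ρ as:
+
+\[
+\lim_{t\to-\infty} K_{ij}(t) =
+\begin{cases}
+\dfrac{K_{ij}(0)}{\sum_{j'\in\operatorname{argmin} F_{i\cdot}} K_{ij'}(0)}, & \text{if } j \in \operatorname{argmin} F_{i\cdot}, \\
+0, & \text{otherwise},
+\end{cases}
+\quad i \in [k], \tag{56}
+\]
+
+and:
+
+\[
+\lim_{t\to+\infty} K_{ij}(t) =
+\begin{cases}
+\dfrac{K_{ij}(0)}{\sum_{j'\in\operatorname{argmax} F_{i\cdot}} K_{ij'}(0)}, & \text{if } j \in \operatorname{argmax} F_{i\cdot}, \\
+0, & \text{otherwise},
+\end{cases}
+\quad i \in [k]. \tag{57}
+\]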
+
+This is consistent with the fact that the critical points of gradient fields are independent of the chosen
+Riemannian metric. However, the speed of convergence does depend on the metric:
+For each i, let $G_i = \max_j F_{ij}$ and $g_i = \max_{j\notin\operatorname{argmax} F_{i\cdot}} F_{ij}$ be the largest and second-largest values
+in the i-th row of $F_{ij}$, respectively. Then, as $t \to \infty$,
+
+\[
+K_{ij}(t) =
+\begin{cases}
+1 - O\big(\exp(-\frac{p_i}{\rho_i}(G_i - g_i)t)\big), & \text{if } j \in \operatorname{argmax} F_{i\cdot}, \\
+O\big(\exp(-\frac{p_i}{\rho_i}(G_i - F_{ij})t)\big), & \text{otherwise}.
+\end{cases} \tag{58}
+\]
+
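+Therefore,
+
+\[
+\begin{aligned}
+\bar F(K(t)) &= \sum_i p_i \Big( \sum_{j\in\operatorname{argmax} F_{i\cdot}} F_{ij} + O\big(\exp(-\tfrac{p_i}{\rho_i}(G_i - g_i)t)\big) \Big) \\
+&= \sum_i p_i \sum_{j\in\operatorname{argmax} F_{i\cdot}} F_{ij} + O\Big(\exp\big(-\inf_i \big\{\tfrac{p_i}{\rho_i}(G_i - g_i)\big\}\, t\big)\Big).
+\end{aligned} \tag{59}
+\]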
+
+Thus, in the long run, the rate of convergence is given by $\inf_i\{\frac{p_i}{\rho_i}(G_i - g_i)\}$, which depends on the
+parameter ρ of the metric. As a result, in this case study, the optimal choice of $\rho_i$, i.e., with the largest
+convergence rate, can be computed if the numbers $G_i$ and $g_i$ are known.
+Consider, for example, the case that the differences $G_i - g_i$ are of comparable sizes for all i. Then,
+we need to find the choice of ρ that maximizes $\inf_i\{\frac{p_i}{\rho_i}\}$. Clearly, $\inf_i\{\frac{p_i}{\rho_i}\} \le 1$ (since there is always an
+index i with $p_i \le \rho_i$). Equality is attained for the choice $\rho_i = p_i$. Thus, we recover the choice of Kakade.
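
The dependence of the asymptotic rate on ρ can be made concrete with a small numerical sketch. It assumes NumPy, a hypothetical fitness table F in which every row has at least two distinct values, and equal gaps $G_i - g_i$; the helper names are illustrative only.

```python
import numpy as np

def second_largest_gap(F):
    """Per-row gap G_i - g_i between the largest and second-largest distinct values."""
    gaps = []
    for row in F:
        G = row.max()
        rest = row[row < G]          # assumes each row has at least two distinct values
        gaps.append(G - rest.max())
    return np.array(gaps)

def convergence_rate(p, rho, F):
    """Asymptotic rate inf_i (p_i / rho_i) (G_i - g_i) from Equation (59)."""
    return float(np.min(p / rho * second_largest_gap(F)))

# Hypothetical data: three rows, all gaps equal to 1.
F = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 1.0]])
p = np.array([0.2, 0.3, 0.5])

rate_kakade = convergence_rate(p, p, F)                  # rho = p
rate_uniform = convergence_rate(p, np.full(3, 1 / 3), F) # rho uniform
print(rate_kakade, rate_uniform)

# No normalized rho sampled here beats the choice rho = p.
rng = np.random.default_rng(1)
best_sampled = max(convergence_rate(p, rng.dirichlet(np.ones(3)), F) for _ in range(200))
print(best_sampled)
```

With equal gaps, the rate reduces to $\inf_i p_i/\rho_i$ times the common gap, so the sampled rates stay below the value obtained for ρ = p, in line with the argument above.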
+
+8. Conclusions
+
+So, which Riemannian metric should one use in practice on the set of stochastic matrices, $(\Delta^k_{m-1})^\circ$?
+The results provided in this manuscript give different answers, depending on the approach. In all
+cases, the characterized Riemannian metrics are products of Fisher metrics with suitable factor weights.
+Theorem 4 suggests using a factor weight proportional to 1/k, and Theorem 6 suggests using a
+constant weight independent of k. In many cases, it is possible to work within a single conditional
+polytope $(\Delta^k_{m-1})^\circ$ and a fixed k, and then, these two results are basically equivalent. On the other
+hand, Theorem 7 gives an answer that allows arbitrary factor weights ρ.
+Which metric performs best obviously depends on the concrete application. The first observation
+is that in order to use the metric gρ,m of Theorem 7, it is necessary to know ρ. If the problem at
+hand suggests a natural marginal distribution ρ, then it is natural to make use of it and choose the
+metric gρ,m. Even if ρ is not known at the beginning, a learning system might try to learn it to improve
+its performance.
+On the other hand, there may be situations where there is no natural choice of the weights ρ.
+Observe that ρ breaks the symmetry of permuting the rows of a stochastic matrix. This is also expressed
+by the structural difference between Theorems 4 and 6 on the one side and Theorem 7 on the other.
+While the first two Theorems provide an invariance metric characterization, Theorem 7 provides a
+“covariance” classification; that is, the metrics gρ,m are not invariant under conditional embeddings,
+but they transform in a controlled manner. This again illustrates that the choice of a metric should
+depend on which mappings are natural to consider, e.g., which mappings describe the symmetries of a
+given problem.
+For example, consider a utility function of the form F = ∑i ρi ∑j KijFij. Row permutations do not
+leave gρ,m invariant (for a general ρ), but they are not symmetries of the utility function F, either, and
+hence, they are not very natural mappings to consider. However, row permutations transform the
+metric gρ,m and the utility function in a controlled manner; in such a way that the two transformations
+match. Therefore, in this case, it is natural to use gρ,m. On the other hand, when studying problems
+that are symmetric under all row permutations, it is more natural to use the invariant metric g(k,m).
+
+Appendix A Conditions for Positive Definiteness
+
+Equation (16) in Lebanon’s Theorem 3 defines a Riemannian metric whenever it defines a
+positive-definite quadratic form. The next Proposition gives necessary and sufficient conditions for
+this to be the case.
+
+Proposition A1. For each pair k ≥ 1 and m ≥ 2, consider the tensor on $\mathbb{R}^{k\times m}_+$ defined by:
+
+\[
+g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = A(|M|) + \delta_{ac} \Big( \frac{|M|}{|M_a|} B(|M|) + \delta_{bd} \frac{|M|}{M_{ab}} C(|M|) \Big) \tag{A1}
+\]
+
+for some differentiable functions $A, B, C \in C^\infty(\mathbb{R}_+)$. The tensor $g^{(k,m)}$ defines a Riemannian metric for all k
+and m if and only if C(α) > 0, B(α) + C(α) > 0 and A(α) + B(α) + C(α) > 0 for all $\alpha \in \mathbb{R}_+$.
+
+Proof. The tensors are Riemannian metrics when:
+
+\[
+g^{(k,m)}_M(V, V) = A(|M|) \Big( \sum_{ab} V_{ab} \Big)^2 + B(|M|) \sum_a \frac{|M|}{|M_a|} \Big( \sum_b V_{ab} \Big)^2 + C(|M|) \sum_{ab} \frac{|M|}{M_{ab}} V_{ab}^2 \tag{A2}
+\]
+
+is strictly positive for all non-zero $V \in \mathbb{R}^{k\times m}$, for all $M \in \mathbb{R}^{k\times m}_+$.
+We can derive necessary conditions on the functions A, B, C from some basic observations.
+Choosing $V = \partial_{ab}$ in Equation (A2) shows that $A(|M|) + \frac{|M|}{|M_a|} B(|M|) + \frac{|M|}{M_{ab}} C(|M|)$ has to be positive
+for all $a \in [k]$, $b \in [m]$, for all $M \in \mathbb{R}^{k\times m}_+$. Since $M_{ab}$ can be arbitrarily small for fixed |M| and $|M_a|$, we
+see that C has to be non-negative. Since we can choose $|M_a| \approx M_{ab} \ll |M|$ for a fixed |M|, we find
+that B + C has to be non-negative. Further, since we can choose $M_{ab} \approx |M_a| \approx |M|$ for a given |M|,
+we find that A + B + C has to be non-negative. This shows that the quadratic form is positive definite
+only if C ≥ 0, B + C ≥ 0, A + B + C ≥ 0. Since the cone of positive definite matrices is open, these
+inequalities have to be strictly satisfied. In the following, we study sufficient conditions.
+For any given $M \in \mathbb{R}^{k\times m}_+$, we can write Equation (A2) as a product $V^\top G V$, for all $V \in \mathbb{R}^{km}$, where
+$G = G_A + G_B + G_C \in \mathbb{R}^{km\times km}$ is the sum of a matrix $G_A$ with all entries equal to A(|M|), a block
+diagonal matrix $G_B$ whose a-th block has all entries equal to $\frac{|M|}{|M_a|} B(|M|)$, and a diagonal matrix $G_C$ with
+diagonal entries equal to $\frac{|M|}{M_{ab}} C(|M|)$. The matrix G is symmetric, and by Sylvester’s criterion,
+it is positive definite iff all its leading principal minors are positive. We can evaluate the minors using
+Sylvester’s determinant Theorem. That Theorem states that for any invertible m × m matrix X, an
+m × n matrix Y and an n × m matrix Z, one has the equality $\det(X + YZ) = \det(X) \det(I_n + ZX^{-1}Y)$.
+Let us consider a leading square block G′, consisting of all entries $G_{ab,cd}$ of G with row-index pairs
+(a, b) satisfying $b \in [m]$ for all a < a′ and b ≤ b′ for a = a′ for some a′ ≤ k and b′ ≤ m; and the same
+restriction for the column index pairs. The corresponding block $G'_A + G'_B$ can be written as the rank-a′
+matrix YZ, with Y consisting of columns $1_a$ for all a ≤ a′ and Z consisting of rows $A + 1_a \frac{|M|}{|M_a|} B$ for all
+a ≤ a′. Hence, the determinant of G′ is equal to:
+
+\[
+\det(G') = \det(G'_C) \cdot \det(I_{a'} + Z G'^{-1}_C Y). \tag{A3}
+\]
+
+Since $G'_C$ is diagonal, the first term is just:
+
+\[
+\det(G'_C) = \Big( \prod_{a<a'} \prod_b \frac{|M|}{M_{ab}} C \Big) \Big( \prod_{b\le b'} \frac{|M|}{M_{a'b}} C \Big). \tag{A4}
+\]
+
+The matrix in the second term of Equation (A3) is given by:
+
+\[
+I_{a'} + Z G'^{-1}_C Y = \frac{1}{C}
+\begin{pmatrix}
+C + B & & & \\
+& \ddots & & \\
+& & C + B & \\
+& & & C + \frac{\sum_{b\le b'} M_{a'b}}{|M_{a'}|} B
+\end{pmatrix}
++ \frac{1}{C}
+\begin{pmatrix}
+\frac{|M_1|}{|M|} A & \cdots & \frac{|M_{a'-1}|}{|M|} A & \frac{\sum_{b\le b'} M_{a'b}}{|M|} A \\
+\vdots & & \vdots & \vdots \\
+\frac{|M_1|}{|M|} A & \cdots & \frac{|M_{a'-1}|}{|M|} A & \frac{\sum_{b\le b'} M_{a'b}}{|M|} A
+\end{pmatrix}. \tag{A5}
+\]
+
+By Sylvester’s determinant Theorem, we have:
+
+\[
+\begin{aligned}
+\det(I_{a'} + Z G'^{-1}_C Y)
+&= C^{-a'} (C + B)^{a'-1} \Big( C + \frac{\sum_{b\le b'} M_{a'b}}{|M_{a'}|} B \Big) \Bigg( 1 + \sum_{a<a'} \frac{\frac{|M_a|}{|M|} A}{C + B} + \frac{\frac{\sum_{b\le b'} M_{a'b}}{|M|} A}{C + \frac{\sum_{b\le b'} M_{a'b}}{|M_{a'}|} B} \Bigg) \\
+&= \Big( \prod_a \frac{C + B_a}{C} \Big) \Big( 1 + \sum_a \frac{A_a}{C + B_a} \Big),
+\end{aligned} \tag{A6}
+\]
+
+where $A_a = \frac{|M_a|}{|M|} A$ for a < a′ and $A_{a'} = \frac{\sum_{b\le b'} M_{a'b}}{|M|} A$, and $B_a = B$ for a < a′ and $B_{a'} = \frac{\sum_{b\le b'} M_{a'b}}{|M_{a'}|} B$.
+This shows that the matrix G is positive definite for all M if and only if C > 0, C + B > 0 and
+$\big( 1 + \sum_{a\le a'} \frac{A_a}{C + B_a} \big) > 0$ for all a′ and b′. The latter inequality is satisfied whenever A + B + C > 0. This
+completes the proof.
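
The conditions of Proposition A1 can be illustrated numerically for constant A, B, C. The sketch below is an assumption-laden check rather than part of the proof: it assumes NumPy, treats A, B, C as fixed numbers (i.e., their values at a given |M|), builds the matrix $G = G_A + G_B + G_C$ described above for a randomly drawn positive M, and inspects its smallest eigenvalue. The function names are illustrative.

```python
import numpy as np

def metric_matrix(M, A, B, C):
    """Matrix G = G_A + G_B + G_C of the quadratic form in Equation (A2).

    Rows and columns are indexed by pairs (a, b) in row-major order.
    """
    k, m = M.shape
    total = M.sum()
    row_sums = M.sum(axis=1)
    G = np.full((k * m, k * m), float(A))               # G_A: all entries A
    for a in range(k):
        block = slice(a * m, (a + 1) * m)
        G[block, block] += B * total / row_sums[a]      # G_B: block diagonal
    G[np.diag_indices(k * m)] += C * total / M.reshape(-1)  # G_C: diagonal
    return G

def satisfies_conditions(A, B, C):
    """The three inequalities of Proposition A1."""
    return C > 0 and B + C > 0 and A + B + C > 0

def is_positive_definite(G):
    """Positive definiteness via the smallest eigenvalue of the symmetric matrix G."""
    return bool(np.linalg.eigvalsh(G).min() > 0)

rng = np.random.default_rng(2)
M = rng.uniform(0.5, 1.5, size=(3, 4))   # hypothetical positive matrix

good = (-0.4, -0.5, 1.0)   # A + B + C = 0.1 > 0, all conditions hold
bad = (-0.6, -0.5, 1.0)    # A + B + C = -0.1 < 0, last condition fails
print(satisfies_conditions(*good), is_positive_definite(metric_matrix(M, *good)))
print(satisfies_conditions(*bad), is_positive_definite(metric_matrix(M, *bad)))
```

For the failing triple, the choice V = M makes the quadratic form equal to $(A + B + C)|M|^2 < 0$, which is why the second matrix cannot be positive definite for any M.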
+
+Appendix B Proofs of the Invariance Characterization
+
+The following Lemma follows directly from the Definition and contains all the technical details
+we need for the proofs.
+
+Lemma A1. The push-forward $f_* : T_M\mathbb{R}^{k\times m}_+ \to T_{f(M)}\mathbb{R}^{l\times n}_+$ of a map $f \in \hat F^{l,n}_{k,m}$ is given by:
+
+\[
+f_*(\partial_{ab}) = \sum_{i=1}^l \sum_{j=1}^n R_{ai} Q^{(a)}_{bj}\, \partial'_{ij}, \tag{A7}
+\]
+
+and the pull-back of a metric $g^{(l,n)}$ on $\mathbb{R}^{l\times n}_+$ through f is given by:
+
+\[
+(f^*g^{(l,n)})_M(\partial_{ab}, \partial_{cd}) = g^{(l,n)}_{f(M)}(f_*\partial_{ab}, f_*\partial_{cd}) = \sum_{i=1}^l \sum_{j=1}^n \sum_{s=1}^l \sum_{t=1}^n R_{ai} R_{cs} Q^{(a)}_{bj} Q^{(c)}_{dt}\, g^{(l,n)}_{f(M)}(\partial'_{ij}, \partial'_{st}). \tag{A8}
+\]
+
+Proof of Theorem 4. We follow the strategy of [5,14]. The idea is to consider subclasses of maps from
+the class $F^{l,n}_{k,m}$ and to evaluate their push-forward and pull-back maps together with the isometry
+requirement. This yields restrictions on the possible metrics, eventually fully characterizing them.
+
+First. Consider the maps $h_{\pi,\sigma} \in F^{k,m}_{k,m}$, resulting from permutation matrices $Q^{(a)} = P_{\pi_a}$, $\pi_a : [m] \to [m]$
+for all $a \in [k]$, and $R = P_\sigma$, $\sigma : [k] \to [k]$. Requiring isometry yields:
+
+\[
+(h_{\pi,\sigma})_*(\partial_{ab}) = \partial'_{\sigma(a)\,\pi_a(b)} \tag{A9}
+\]
+
+\[
+g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = g^{(k,m)}_{h_{\pi,\sigma}(M)}(\partial_{\sigma(a)\,\pi_a(b)}, \partial_{\sigma(c)\,\pi_c(d)}). \tag{A10}
+\]
+
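+Second. Consider the maps $r_{zw} \in F^{kz,mw}_{k,m}$ defined by $Q^{(1)} = \cdots = Q^{(k)} \in \mathbb{R}^{m\times mw}$ and $R \in \mathbb{R}^{k\times kz}$
+being uniform. In this case, for some permutations π and σ,
+
+\[
+(r_{zw})_*(\partial_{ab}) = \frac{1}{w} \sum_{i=1}^z \sum_{j=1}^w \partial'_{\sigma^{(a)}(i)\,\pi^{(b)}(j)} \tag{A11}
+\]
+
+\[
+(r_{zw}^*g^{(kz,mw)})_M(\partial_{ab}, \partial_{cd}) = \frac{1}{w^2} \sum_{i=1}^z \sum_{j=1}^w \sum_{s=1}^z \sum_{t=1}^w g^{(kz,mw)}_{r_{zw}(M)}\big(\partial'_{\sigma^{(a)}(i)\,\pi^{(b)}(j)}, \partial'_{\sigma^{(c)}(s)\,\pi^{(d)}(t)}\big). \tag{A12}
+\]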
+
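+Third. For a rational matrix $M = \frac{1}{Z}\tilde M$ with $\tilde M \in \mathbb{N}^{k\times m}$ and row-sum $|\tilde M_a| = N \in \mathbb{N}$ for all $a \in [k]$,
+consider the map $v_M \in F^{zk,N}_{k,m}$ that maps M to a constant matrix. In this case, $R \in \mathbb{R}^{k\times kz}$ and $Q^{(a)}$ has
+the b-th row with $\tilde M_{ab}$ entries with value $\frac{1}{\tilde M_{ab}}$ at positions $\pi^{(ab)}([\tilde M_{ab}]) \subseteq [N]$, and:
+
+\[
+(v_M)_*(\partial_{ab}) = \frac{1}{\tilde M_{ab}} \sum_{i=1}^z \sum_{j=1}^{\tilde M_{ab}} \partial'_{\sigma^{(a)}(i)\,\pi^{(ab)}(j)} \tag{A13}
+\]
+
+\[
+(v_M^*g^{(kz,N)})_M(\partial_{ab}, \partial_{cd}) = \frac{1}{\tilde M_{ab}} \frac{1}{\tilde M_{cd}} \sum_{i=1}^z \sum_{j=1}^{\tilde M_{ab}} \sum_{s=1}^z \sum_{t=1}^{\tilde M_{cd}} g^{(kz,N)}_{v_M(M)}\big(\partial'_{\sigma^{(a)}(i)\,\pi^{(ab)}(j)}, \partial'_{\sigma^{(c)}(s)\,\pi^{(cd)}(t)}\big). \tag{A14}
+\]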
+
+Step 1: a ≠ c. Consider a constant matrix M = U. Then:
+
+\[
+g^{(k,m)}_U(\partial_{a_1b_1}, \partial_{c_1d_1}) = g^{(k,m)}_{h_{\pi,\sigma}(U)}(\partial_{a_2b_2}, \partial_{c_2d_2}) = g^{(k,m)}_U(\partial_{a_2b_2}, \partial_{c_2d_2}). \tag{A15}
+\]
+
+This implies that $g^{(k,m)}_U(\partial_{ab}, \partial_{cd}) = \hat A(k, m)$ when a ≠ c.
+Using the second type of map, we get:
+
+\[
+\hat A(k, m) = \frac{z^2 w^2}{w^2} \hat A(kz, mw), \tag{A16}
+\]
+
+which implies $g^{(k,m)}_U(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2}$, when a ≠ c. Considering a rational matrix M and the map $v_M$
+yields:
+
+\[
+g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2}. \tag{A17}
+\]
+
+Step 2: a = c and b ≠ d. By similar arguments as in Step 1, $g^{(k,m)}_U(\partial_{ab}, \partial_{ad}) = \hat B(k, m)$. Evaluating the map
+$r_{zw}$ yields:
+
+\[
+\hat B(k, m) = \frac{z w^2}{w^2} \hat B(kz, mw) + \frac{z(z-1) w^2}{w^2} \frac{A}{(kz)^2} = z \hat B(kz, mw) + \frac{z-1}{z} \frac{A}{k^2}, \tag{A18}
+\]
+
+and therefore,
+
+\[
+\frac{1}{z} \Big( \hat B(k, m) - \frac{A}{k^2} \Big) = \hat B(kz, mw) - \frac{A}{(kz)^2}, \tag{A19}
+\]
+
+which implies that $\big( \hat B(k, m) - \frac{A}{k^2} \big)$ is independent of m and scales with the inverse of k, such that it
+can be written as $\frac{B}{k}$. Rearranging the terms yields $g^{(k,m)}_U(\partial_{ab}, \partial_{ad}) = \frac{A}{k^2} + \frac{B}{k}$, for b ≠ d.
+For a rational matrix M, the pull-back through $v_M$ then shows:
+
+\[
+g^{(k,m)}_M(\partial_{ab}, \partial_{ad}) = z \frac{\tilde M_{ab} \tilde M_{ad}}{\tilde M_{ab} \tilde M_{ad}} \Big( \frac{A}{(kz)^2} + \frac{B}{kz} \Big) + z(z-1) \frac{\tilde M_{ab} \tilde M_{ad}}{\tilde M_{ab} \tilde M_{ad}} \frac{A}{(kz)^2} = \frac{A}{k^2} + \frac{B}{k}. \tag{A20}
+\]
+
+Step 3: a = c and b = d. In this case, $g^{(k,m)}_U(\partial_{a_1b_1}, \partial_{a_1b_1}) = g^{(k,m)}_U(\partial_{a_2b_2}, \partial_{a_2b_2}) = \hat C(k, m)$, and:
+
+\[
+\begin{aligned}
+\hat C(k, m) &= \frac{1}{w^2} zw\, \hat C(kz, mw) + \frac{1}{w^2} zw(w-1) \Big( \frac{A}{(kz)^2} + \frac{B}{kz} \Big) + \frac{1}{w^2} w^2 z(z-1) \frac{A}{(kz)^2} \\
+&= \frac{z}{w} \hat C(kz, mw) + \Big( 1 - \frac{1}{zw} \Big) \frac{A}{k^2} + \Big( 1 - \frac{z}{zw} \Big) \frac{B}{k},
+\end{aligned} \tag{A21}
+\]
+
+which implies:
+
+\[
+\frac{k}{m} \Big( \hat C(k, m) - \frac{A}{k^2} - \frac{B}{k} \Big) = \frac{kz}{mw} \Big( \hat C(kz, mw) - \frac{A}{(kz)^2} - \frac{B}{kz} \Big), \tag{A22}
+\]
+
+such that the left-hand side is a constant C, and $g^{(k,m)}_U(\partial_{ab}, \partial_{ab}) = \frac{A}{k^2} + \frac{B}{k} + \frac{m}{k} C$. Now, for a rational
+matrix M, pulling back through $v_M$ gives:
+
+\[
+\begin{aligned}
+g^{(k,m)}_M(\partial_{ab}, \partial_{ab}) &= \frac{1}{\tilde M_{ab}^2} \tilde M_{ab} \Big( \frac{A}{k^2} + \frac{B}{k} + \frac{|\tilde M_a|}{k} C \Big) + \frac{1}{\tilde M_{ab}^2} \tilde M_{ab} (\tilde M_{ab} - 1) \Big( \frac{A}{k^2} + \frac{B}{k} \Big) + 0 \\
+&= \frac{A}{k^2} + \frac{B}{k} + \frac{|\tilde M_a|}{\tilde M_{ab}\, k} C \\
+&= \frac{A}{k^2} + k \frac{B}{k^2} + \frac{|M|}{M_{ab}} \frac{C}{k^2}.
+\end{aligned} \tag{A23}
+\]
+
+Summarizing, we found:
+
+\[
+g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2} + \delta_{ac} \Big( k \frac{B}{k^2} + \delta_{bd} \frac{|M|}{M_{ab}} \frac{C}{k^2} \Big), \tag{A24}
+\]
+
+which proves the first statement. The second statement follows by plugging Equation (23) into
+Equation (A8). Finally, the statement about the positive-definiteness is a direct consequence of
+Proposition A1.
+
+Proof of Theorem 5. Suppose, contrary to the claim, that a family of metrics $g^{(k,m)}_M$ exists, which is
+invariant with respect to any conditional embedding. By Theorem 4, these metrics are of the form
+of Equation (23). To prove the claim, we only need to show that A, B and C vanish. In the following,
+we study conditional embeddings where Q consists of identity matrices and evaluate the isometry
+requirement $(f^*g^{(l,n)})_M(\partial_{ab}, \partial_{cd}) = g^{(k,m)}_M(\partial_{ab}, \partial_{cd})$.
+Step 1: In the case a ≠ c, we obtain from the invariance requirement and Equation (A8), that:
+
+\[
+\frac{A}{k^2} = |R_a||R_c| \frac{A}{l^2}. \tag{A25}
+\]
+
+Observe that:
+
+\[
+\frac{1}{k} \sum_{i=1}^k |R_i| = \frac{1}{k} |R| = \frac{l}{k}. \tag{A26}
+\]
+
+In fact, $|R_i|$ is the cardinality of the i-th block of the partition belonging to R. Therefore, if we choose R
+to be the partition indicator matrix of a partition that is not homogeneous and in which $|R_a| > l/k$
+and $|R_c| > l/k$, then Equation (A25) implies that A = 0.
+Step 2: In the case a = c and b ≠ d, we obtain from invariance and Equation (A8), that:
+
+\[
+\frac{B}{k} = \sum_{i=1}^l \sum_{s=1}^l R_{ai} R_{as} \delta_{is} \frac{B}{l} = |R_a| \frac{B}{l}. \tag{A27}
+\]
+
+Again, we may choose $R_a$ in such a way that $|R_a| \neq \frac{l}{k}$ and find that B = 0.
+
+Step 3: Finally, in the case a = c and b = d, we obtain from invariance and Equation (A8), that:
+
+\[
+\frac{C|M|}{k^2 M_{ab}} = \sum_{i=1}^l \sum_{s=1}^l R_{ai} R_{as} \delta_{i,s} \frac{C|R^\top M|}{l^2 (R^\top M)_{ib}} = |R_a| \frac{C|R^\top M|}{l^2 M_{ab}}. \tag{A28}
+\]
+
+If we choose $R_a$, such that $|R_a| \neq \frac{|M|}{|R^\top M|}$, then we see that C = 0. Therefore, $g^{(k,m)}$ is the zero-tensor,
+which is not a metric.
+
+Acknowledgments
+
+The authors are grateful to Keyan Zahedi for discussions related to policy gradient methods in
+robotics applications. Guido Montúfar thanks the Santa Fe Institute for hosting him during the initial
+work on this article. Johannes Rauh acknowledges support by the VW Foundation. This work was
+supported in part by the DFG Priority Program, Autonomous Learning (DFG-SPP 1527).
+
+Author Contributions
+
+All authors contributed to the design of the research. The research was carried out by all authors,
+with main contributions by Guido Montúfar and Johannes Rauh. The manuscript was written by
+Guido Montúfar, Johannes Rauh and Nihat Ay. All authors read and approved the final manuscript.
+
+Conflicts of Interest
+
+The authors declare no conflict of interest.
+
+References
+
+1. Amari, S. Natural gradient works efficiently in learning. Neur. Comput. 1998, 10, 251–276.
+2. Kakade, S. A Natural Policy Gradient. In Advances in Neural Information Processing Systems 14; MIT Press: Cambridge, MA, USA, 2001; pp. 1531–1538.
+3. Shahshahani, S. A New Mathematical Framework for the Study of Linkage and Selection; American Mathematical Society: Providence, RI, USA, 1979.
+4. Chentsov, N. Statistical Decision Rules and Optimal Inference; American Mathematical Society: Providence, RI, USA, 1982.
+5. Campbell, L. An extended Čencov characterization of the information metric. Proc. Am. Math. Soc. 1986, 98, 135–141.
+6. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12; MIT Press: Cambridge, MA, USA, 2000; pp. 1057–1063.
+7. Marbach, P.; Tsitsiklis, J. Simulation-based optimization of Markov reward processes. IEEE Trans. Autom. Control 2001, 46, 191–209.
+8. Montúfar, G.; Ay, N.; Zahedi, K. Expressive power of conditional restricted Boltzmann machines for sensorimotor control. 2014, arXiv:1402.3346.
+9. Ay, N.; Montúfar, G.; Rauh, J. Selection Criteria for Neuromanifolds of Stochastic Dynamics. In Advances in Cognitive Neurodynamics (III); Yamaguchi, Y., Ed.; Springer-Verlag: Dordrecht, The Netherlands, 2013; pp. 147–154.
+10. Peters, J.; Schaal, S. Natural Actor-Critic. Neurocomputing 2008, 71, 1180–1190.
+11. Peters, J.; Schaal, S. Policy Gradient Methods for Robotics. In Proceedings of the IEEE International Conference on Intelligent Robotics Systems (IROS 2006), Beijing, China, 9–15 October 2006.
+12. Peters, J.; Vijayakumar, S.; Schaal, S. Reinforcement learning for humanoid robotics. In Proceedings of the third IEEE-RAS international conference on humanoid robots, Karlsruhe, Germany, 29–30 September 2003; pp. 1–20.
+13. Bagnell, J.A.; Schneider, J. Covariant policy search. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9–15, 2003; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2003; pp. 1019–1024.
+14. Lebanon, G. Axiomatic geometry of conditional models. IEEE Trans. Inform. Theor. 2005, 51, 1283–1294.
+15. Lebanon, G. An Extended Čencov-Campbell Characterization of Conditional Information Geometry. In Proceedings of the 20th Conference in Uncertainty in Artificial Intelligence (UAI 04), Banff, AL, Canada, 7–11 July 2004; Chickering, D.M., Halpern, J.Y., Eds.; AUAI Press: Arlington, VA, USA, 2004; pp. 341–345.
+16. Barndorff-Nielsen, O. Information and Exponential Families: In statistical Theory; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1978.
+17. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Institute of Mathematical Statistics: Hayward, CA, USA, 1986.
+18. Zahedi, K.; Ay, N.; Der, R. Higher coordination with less control—A result of information maximization in the sensorimotor loop. Adapt. Behav. 2010, 18.
+19. Hofbauer, J.; Sigmund, K. Evolutionary Games and Population Dynamics; Cambridge University Press: Cambridge, United Kingdom, 1998.
+20. Ay, N.; Erb, I. On a notion of linear replicator equations. J. Dyn. Differ. Equ. 2005, 17, 427–451.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+
+entropy
+
+Article
+Matrix Algebraic Properties of the Fisher Information Matrix of Stationary Processes
+
+André Klein
+
+Rothschild Blv. 123 Apt.7, 65271 Tel Aviv, Israel; A.A.B.Klein@uva.nl or klein@contact.uva.nl; Tel.: 972.5.25594723
+
+Received: 12 February 2014; in revised form: 11 March 2014 / Accepted: 24 March 2014 /
+Published: 8 April 2014
+
+Abstract: This survey paper summarizes results from a series of papers on matrix algebraic
+properties of the Fisher information matrix (FIM) of stationary processes. The FIM is an ingredient
+of the Cramér-Rao inequality and belongs to the basics of asymptotic estimation theory in
+mathematical statistics. The FIM is interconnected with the Sylvester, Bezout and tensor Sylvester
+matrices. Through these interconnections it is shown that the FIM of scalar and multiple stationary
+processes fulfills the resultant matrix property. A statistical distance measure involving entries of
+the FIM is presented. In quantum information, a different statistical distance measure is set forth;
+it is related to the Fisher information, but considers the information about one parameter in a
+particular measurement procedure. The FIM of scalar stationary processes is also interconnected
+with the solutions of appropriate Stein equations, and conditions for the FIM to verify certain Stein
+equations are formulated. The presence of Vandermonde matrices is also emphasized.
+
+MSC Classification: 15A23, 15A24, 15B99, 60G10, 62B10, 62M20.
+
+Keywords: Bezout matrix; Sylvester matrix; tensor Sylvester matrix; Stein equation; Vandermonde
+matrix; stationary process; matrix resultant; Fisher information matrix
+
+1. Introduction
+
+In this survey paper, a summary of results derived and described in a series of papers, is presented.
+It concerns some matrix algebraic properties of the Fisher information matrix (abbreviated as FIM) of
+stationary processes. An essential property emphasized in this paper concerns the matrix resultant
+property of the FIM of stationary processes. To be more explicit, consider the coefficients of two monic
+polynomials p(z) and q(z) of finite degree, as the entries of a matrix such that the matrix becomes
+singular if and only if the polynomials p(z) and q(z) have at least one common root. Such a matrix is
+called a resultant matrix and its determinant is called the resultant. The Sylvester, Bezout and tensor
+Sylvester matrices have such a property and are extensively studied in the literature, see e.g., [1,3]. The
+FIM associated with various stationary processes will be expressed by these matrices. The derived
+interconnections are obtained by developing the necessary factorizations of the FIM in terms of the
+Sylvester, Bezout and tensor Sylvester matrices. These factored forms of the FIM enable us to show that
+the FIM of scalar and multiple stationary processes fulfills the resultant matrix property. Consequently,
+the singularity conditions of the appropriate Fisher information matrices and of the Sylvester, Bezout and
+tensor Sylvester matrices coincide; these results are described in [4,6].
+A statistical distance measure involving entries of the FIM is presented and is based on [7]. In
+quantum information, a statistical distance measure is set forth, see [8,10], and is related to the Fisher
+information, but there the information about one parameter in a particular measurement procedure is
+considered. This leads to a challenging question: can the existing distance
+measure in quantum information be developed at the matrix level?
+
+Entropy 2014, 16, 2023–2055; doi:10.3390/e16042023
+www.mdpi.com/journal/entropy
+
+The matrix Stein equation, see e.g., [11], is associated with the Fisher information matrices of
+scalar stationary processes through the solutions of the appropriate Stein equations. Conditions for the
+Fisher information matrices or associated matrices to verify certain Stein equations are formulated
+and proved in this paper. The presence of Vandermonde matrices is also emphasized. The general
+and more detailed results are set forth in [12] and [13]. In this survey paper it is shown that the FIMs of
+linear stationary processes form a class of structured matrices. Note that in [14], the authors emphasize
+that statistical problems related to stationary processes have been treated successfully with the aid
+of Toeplitz forms. This paper is organized as follows. The various stationary processes, considered
+in this paper, are presented in Section 2, the Fisher information matrices of the stationary processes
+are displayed in Section 3. Section 3 sets forth the interconnections between the Fisher information
+matrices and the Sylvester, Bezout, tensor Sylvester matrices, and solutions to Stein equations. A
+statistical distance measure is expressed in terms of entries of a FIM.
+
+2. The Linear Stationary Processes
+
+In this section we display the class of linear stationary processes whose corresponding Fisher
+information matrix shall be investigated in a matrix algebraic context. But first some basic definitions
+are set forth, see e.g., [15].
+
+If a random variable X is indexed to time, usually denoted by t, the observations {X_t, t ∈ T} are
+called a time series, where T is a time index set (for example, T = Z, the integer set).
+
+2.1. Definition 2.1
+
+A stochastic process is a family of random variables {Xt, t ∈ T } defined on a probability space {Ω, F, ℘}.
+
+2.2. Definition 2.2
+
+The Autocovariance function. If {X_t, t ∈ T} is a process such that Var(X_t) < ∞ (variance) for each t, then
+the autocovariance function γ_X(·, ·) of {X_t} is defined by
+γ_X(r, s) = Cov(X_r, X_s) = E[(X_r − E X_r)(X_s − E X_s)], r, s ∈ Z, where E represents the expected value.
+
+2.3. Definition 2.3
+
+Stationarity. The time series {X_t, t ∈ Z}, with the index set Z = {0, ±1, ±2, . . .}, is said to be stationary if
+
+(i) E |X_t|^2 < ∞,
+(ii) E (X_t) = m for all t ∈ Z, where m is the constant average or mean,
+(iii) γ_X(r, s) = γ_X(r + t, s + t) for all r, s, t ∈ Z.
+
+For Gaussian processes, as assumed in Section 3, it can be concluded from Definition 2.3 that the joint probability distributions of the random variables
+{X_{t_1}, X_{t_2}, . . . , X_{t_n}} and {X_{t_1+k}, X_{t_2+k}, . . . , X_{t_n+k}} are the same for arbitrary times t_1, t_2, . . . , t_n for all n and
+all lags or leads k = 0, ±1, ±2, . . .. The probability distribution of observations of such a stationary process
+is invariant with respect to shifts in time. In the next section the linear stationary processes that will be
+considered throughout this paper are presented.
+
+2.4. The Vector ARMAX or VARMAX Process
+
+We display one of the most general linear stationary process called the multivariate autoregressive,
+moving average and exogenous process, the VARMAX process. To be more specific, consider the
+vector difference equation representation of a linear system {y(t), t ∈ Z}, of order (p, r, q),
+
+\[ \sum_{j=0}^{p} A_j \, y(t-j) = \sum_{j=0}^{r} C_j \, x(t-j) + \sum_{j=0}^{q} B_j \, \epsilon(t-j), \qquad t \in \mathbb{Z} \tag{1} \]
+
+where y(t) are the observable outputs, x(t) the observable inputs and ϵ(t) the unobservable errors; all
+are n-dimensional. The acronym VARMAX stands for vector autoregressive-moving average with
+exogenous variables. The left side of (1) is the autoregressive part, the second term on the right
+is the moving average part and x(t) is exogenous. If x(t) does not occur the system is said to be
+(V)ARMA. Besides exogenous, the input x(t) is also named the control variable, depending on the field
+of application: in econometrics and time series analysis, e.g., [15], and in signal processing and control,
+e.g., [16,17]. The matrix coefficients A_j ∈ R^{n×n}, C_j ∈ R^{n×n} and B_j ∈ R^{n×n} are the associated parameter
+matrices. We have the property A_0 ≡ B_0 ≡ C_0 ≡ I_n.
+Equation (1) can compactly be written as
+
+\[ A(z)\, y(t) = C(z)\, x(t) + B(z)\, \epsilon(t) \tag{2} \]
+
+where
+
+\[ A(z) = \sum_{j=0}^{p} A_j z^j; \qquad C(z) = \sum_{j=0}^{r} C_j z^j; \qquad B(z) = \sum_{j=0}^{q} B_j z^j \]
+
+we use z to denote the backward shift operator, for example z x_t = x_{t−1}. The matrix polynomials
+A(z), B(z) and C(z) are the associated autoregressive, moving average and
+exogenous matrix polynomials, of order p, q and r respectively. Hence the process described
+by (2) is denoted as a VARMAX(p, r, q) process. Here z ∈ C with a duplicate use of z as an operator and
+as a complex variable, which is usual in the signal processing and time series literature, e.g., [15,16,18].
+The assumptions Det(A(z)) ≠ 0 for |z| ≤ 1 and Det(B(z)) ≠ 0 for |z| < 1, z ∈
+C, are imposed so that the VARMAX(p, r, q) process (2) has exactly one stationary solution, and the
+condition Det(B(z)) ≠ 0 implies the invertibility condition, see e.g., [15] for more details. Under these
+assumptions, the eigenvalues of the matrix polynomials A(z) and B(z) lie outside the unit circle. The
+eigenvalues of a matrix polynomial Y(z) are the roots of the equation Det(Y(z)) = 0, where Det(X) is the
+determinant of X. The VARMAX(p, r, q) stationary process (2) is thoroughly discussed in [15,18,19].
+The error {ϵ(t), t ∈ Z} is a collection of uncorrelated zero mean n-dimensional random variables
+each having positive definite covariance matrix Σ and we assume, for all s, t, E_ϑ{x(s) ϵ^T(t)} = 0, where
+X^T denotes the transposition of matrix X and E_ϑ represents the expected value under the parameter
+ϑ. The matrix ϑ represents all the VARMAX(p, r, q) parameters, with the total number of parameters
+being n^2(p + q + r). For different purposes which will be specified in the next sections, two choices of
+the parameter structure are considered. First, the parameter vector ϑ ∈ R^{n^2(p+q+r)×1} is defined by
+
+\[ \vartheta = \operatorname{vec}\{A_1, A_2, \ldots, A_p, C_1, C_2, \ldots, C_r, B_1, B_2, \ldots, B_q\} \tag{3} \]
+
+The vec operator transforms a matrix into a vector by stacking the columns of the matrix one
+underneath the other according to vec X = col(col(X_{ij})_{i=1}^{n})_{j=1}^{n}, see e.g., [2,20]. A different choice
+is set forth, when the parameter matrix ϑ ∈ R^{n×n(p+q+r)} is of the form
+
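+\[ \vartheta = (\vartheta_1 \; \vartheta_2 \; \ldots \; \vartheta_p \;\; \vartheta_{p+1} \; \vartheta_{p+2} \; \ldots \; \vartheta_{p+r} \;\; \vartheta_{p+r+1} \; \vartheta_{p+r+2} \; \ldots \; \vartheta_{p+r+q}) \tag{4} \]
+
+\[ \phantom{\vartheta} = (A_1 \; A_2 \; \ldots \; A_p \;\; C_1 \; C_2 \; \ldots \; C_r \;\; B_1 \; B_2 \; \ldots \; B_q) \tag{5} \]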
+
+Representation (5) of the parameter matrix has been used in [21]. The estimation of the matrices A_1,
+A_2, . . ., A_p, C_1, C_2, . . ., C_r, B_1, B_2, . . ., B_q and Σ has received considerable attention in the time series
+and statistical signal processing literature, see e.g., [15,17,19]. In [19], the authors study the asymptotic
+properties of maximum likelihood estimates of the coefficients of VARMAX(p, r, q) processes, stored in
+a (ℓ × 1) vector ϑ, where ℓ = n^2(p + q + r).
+Before describing the control-exogenous variable x(t) used in this survey paper, we shall present
+the different special cases of the model described in (1) and (2).
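+
+As a small illustration of the vec operator in (3), consider the following sketch. The numerical entries are hypothetical and chosen only for this example; they are not taken from the cited papers. Take n = 2, p = 1 and q = r = 0, so that ϑ = vec A_1 has n^2(p + q + r) = 4 entries:
+
+```latex
+\[
+A_1 = \begin{pmatrix} 0.5 & 0.1 \\ -0.2 & 0.3 \end{pmatrix}
+\quad \Longrightarrow \quad
+\vartheta = \operatorname{vec} A_1 = \begin{pmatrix} 0.5 & -0.2 & 0.1 & 0.3 \end{pmatrix}^{\top}
+\]
+```
+
+that is, the first column of A_1 is stacked on top of the second column.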
+
+2.5. The Vector ARMA or VARMA Process
+
+When the process (2) does not contain the control process x(t) it yields
+
+\[ A(z)\, y(t) = B(z)\, \epsilon(t) \tag{6} \]
+
+which is a vector autoregressive and moving average process, VARMA(p, q) process, see e.g., [15].
+The matrix ϑ now represents all the VARMA parameters, with the total number of parameters being
+n^2(p + q). The VARMA(p, q) version of the parameter vector ϑ defined in (3) is then given by
+
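+\[ \vartheta = \operatorname{vec}\{A_1, A_2, \ldots, A_p, B_1, B_2, \ldots, B_q\} \tag{7} \]
+
+The VARMA equivalent of the parameter matrix (4) is then the n × n(p + q) parameter matrix
+
+\[ \vartheta = (\vartheta_1 \; \vartheta_2 \; \ldots \; \vartheta_p \;\; \vartheta_{p+1} \; \vartheta_{p+2} \; \ldots \; \vartheta_{p+q}) = (A_1 \; A_2 \; \ldots \; A_p \;\; B_1 \; B_2 \; \ldots \; B_q) \tag{8} \]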
+
+A description of the input variable x(t) in (2) follows. Generally, one can assume either that x(t) is
+non-stochastic or that x(t) is stochastic. In the latter case, we assume E_ϑ{x(s) ϵ^T(t)} = 0, for all s, t, and
+that statistical inference is performed conditionally on the values taken by x(t). In this case it can
+be interpreted as constant, see [22] for a detailed exposition. However, in the papers referred to in this
+survey, like in [21] and [23], the observed input variable x(t) is assumed to be a stationary VARMA
+process, of the form
+
+\[ \alpha(z)\, x(t) = \beta(z)\, \eta(t) \tag{9} \]
+
+where α(z) and β(z) are the autoregressive and moving average polynomials of appropriate degree
+and {η(t), t ∈ Z} is a collection of uncorrelated zero mean n-dimensional random variables each having
+positive definite covariance matrix Ω. The spectral density of the VARMA process x(t) is R_x(·)/2π,
+for a definition see e.g., [15,16], where
+
+\[ R_x(e^{i\omega}) = \alpha^{-1}(e^{i\omega})\, \beta(e^{i\omega})\, \Omega\, \beta^{*}(e^{i\omega})\, \alpha^{-*}(e^{i\omega}), \qquad \omega \in [-\pi, \pi] \tag{10} \]
+
+where i is the imaginary unit with the property i^2 = −1, ω is the frequency, the spectral density R_x(e^{iω})
+is Hermitian, and we further have R_x(e^{iω}) ≥ 0 and ∫_{−π}^{π} R_x(e^{iω}) dω < ∞. As mentioned above, the
+basic assumption that x(t) and ϵ(t) are independent or at least uncorrelated processes, which corresponds
+geometrically to orthogonal processes, holds, and X^* is the complex conjugate transpose of matrix X.
+
+2.6. The ARMAX and ARMA Processes
+
+The scalar equivalents of the VARMAX(p, r, q) and VARMA(p, q) processes, given by (2) and (6)
+respectively, shall now be displayed, to obtain for the ARMAX(p, r, q) process
+
+\[ a(z)\, y(t) = c(z)\, x(t) + b(z)\, \epsilon(t) \tag{11} \]
+
+and for the ARMA(p, q) process
+
+\[ a(z)\, y(t) = b(z)\, \epsilon(t) \tag{12} \]
+
+popularized in, among others, the Box-Jenkins type of time series analysis, see e.g., [15]. Here a(z),
+b(z) and c(z) are respectively the scalar autoregressive, moving average and exogenous
+polynomials, with corresponding scalar coefficients a_j, b_j and c_j,
+
+\[ a(z) = \sum_{j=0}^{p} a_j z^j; \qquad c(z) = \sum_{j=0}^{r} c_j z^j; \qquad b(z) = \sum_{j=0}^{q} b_j z^j \tag{13} \]
+
+Note that as in the multiple case, a_0 = b_0 = 1. The parameter vector ϑ for the processes (11) and (12)
+is then
+
+\[ \vartheta = \{a_1, a_2, \ldots, a_p, c_1, c_2, \ldots, c_r, b_1, b_2, \ldots, b_q\} \tag{14} \]
+
+and
+
+\[ \vartheta = \{a_1, a_2, \ldots, a_p, b_1, b_2, \ldots, b_q\} \tag{15} \]
+
+respectively.
+In the next section the matrix algebraic properties of the Fisher information matrix of the stationary
+processes (2), (6), (11) and (12) will be verified. Interconnections with various known structured
+matrices like the Sylvester resultant matrix, the Bezout matrix and Vandermonde matrix are set forth.
+The Fisher information matrix of the various stationary processes is also expressed in terms of the
+unique solutions to the appropriate Stein equations.
+
+3. Structured Matrix Properties of the Asymptotic Fisher Information Matrix of
+Stationary Processes
+
+The Fisher information is an ingredient of the Cramér-Rao inequality, also called by some
+the Cauchy-Schwarz inequality in mathematical statistics, and belongs to the basics of asymptotic
+estimation theory in mathematical statistics. The Cramér-Rao theorem [24] is therefore considered.
+When assuming that the estimators of ϑ, defined in the previous sections, are asymptotically unbiased,
+the inverse of the asymptotic information matrix yields the Cramér-Rao bound, and provided that the
+estimators are asymptotically efficient, the asymptotic covariance matrix then verifies the inequality
+
+\[ \operatorname{Cov}(\hat{\vartheta}) \succeq I^{-1}(\hat{\vartheta}) \]
+
+here I(ϑ̂) is the FIM and Cov(ϑ̂) is the covariance of ϑ̂, the unbiased estimator of ϑ; for a detailed
+fundamental statistical analysis, see [25,26]. The inverse of the FIM equals the Cramér-Rao lower bound, and the
+subject of the FIM is also of interest in the control theory and signal processing literature, see e.g., [27].
+Its quantum analog was introduced immediately after the foundation of mathematical quantum
+estimation theory in the 1960’s, see [28,29] for a rigorous exposition of the subject. More specifically, the
+Fisher information is also emphasized in the context of quantum information theory, see e.g., [30,31].
+It is clear that the Cramér-Rao inequality takes a lot of attention because it is located on the highly
+exciting boundary of statistics, information and quantum theory and more recently matrix theory. In
+the next sections, the Fisher information matrices of linear stationary processes will be presented and
+its role as a new class of structured matrices will be the subject of study.
+When time series models are the subject, using (2) for all t ∈ Z to determine the residual ϵ(t) or
+ϵ_t(ϑ), to emphasize the dependency on the parameter vector ϑ, and assuming that x(t) is stochastic and
+that (y(t), x(t)) is a Gaussian stationary process, the asymptotic FIM F(ϑ) is defined by the following
+(ℓ × ℓ) matrix which does not depend on t
+
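+\[ F(\vartheta) = E\left\{ \left( \frac{\partial \epsilon_t(\vartheta)}{\partial \vartheta^{\top}} \right)^{\top} \Sigma^{-1} \left( \frac{\partial \epsilon_t(\vartheta)}{\partial \vartheta^{\top}} \right) \right\} \tag{16} \]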
+
+where ∂(·)/∂ϑ^T denotes the (v × ℓ) matrix of derivatives with respect to ϑ^T of any (v × 1) column vector
+(·), and ℓ is the total number of parameters. The derivative with respect to ϑ^T is used for obtaining
+the appropriate dimensions. Equality (16) is used for computing the FIM of the various time series
+processes presented in the previous sections and appropriate definitions of the derivatives are used,
+especially for the multivariate processes (2) and (6), see [21,22].
+
+3.1. The Fisher Information Matrix of an ARMA(p, q) Process
+
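+In this section, the focus is on the FIM of the ARMA process (12). When ϑ is given in (15), the
+derivatives in (16) are at the scalar level
+
+\[ \frac{\partial \epsilon_t(\vartheta)}{\partial a_j} = \frac{1}{a(z)}\, \epsilon_{t-j} \quad \text{for } j = 1, \ldots, p \qquad \text{and} \qquad \frac{\partial \epsilon_t(\vartheta)}{\partial b_k} = -\frac{1}{b(z)}\, \epsilon_{t-k} \quad \text{for } k = 1, \ldots, q \]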
+
+when combined for all j and k, the FIM of the ARMA process (12) with the variance of the noise process
+ϵt(ϑ) equal to one, yields the block decomposition, see [32]
+
+\[ F(\vartheta) = \begin{pmatrix} F_{aa}(\vartheta) & F_{ab}(\vartheta) \\ F_{ba}(\vartheta) & F_{bb}(\vartheta) \end{pmatrix} \tag{17} \]
+
+The expressions of the different blocks of the matrix F(ϑ) are
+
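+\[ F_{aa}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, u_p^{\top}(z^{-1})}{a(z)\, a(z^{-1})} \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, v_p^{\top}(z)}{a(z)\, \hat{a}(z)}\, dz \tag{18} \]
+
+\[ F_{ab}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, u_q^{\top}(z^{-1})}{a(z)\, b(z^{-1})} \frac{dz}{z} = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, v_q^{\top}(z)}{a(z)\, \hat{b}(z)}\, dz \tag{19} \]
+
+\[ F_{ba}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, u_p^{\top}(z^{-1})}{a(z^{-1})\, b(z)} \frac{dz}{z} = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, v_p^{\top}(z)}{\hat{a}(z)\, b(z)}\, dz \tag{20} \]
+
+\[ F_{bb}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, u_q^{\top}(z^{-1})}{b(z)\, b(z^{-1})} \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, v_q^{\top}(z)}{b(z)\, \hat{b}(z)}\, dz \tag{21} \]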
+
+where the integration above and everywhere below is counterclockwise around the unit circle. The
+reciprocal monic polynomials â(z) and b̂(z) are defined as â(z) = z^p a(z^{−1}) and b̂(z) = z^q b(z^{−1}), and ϑ
+= (a_1, . . . , a_p, b_1, . . . , b_q)^T was introduced in (15). For each positive integer k we have u_k(z) = (1, z, z^2,
+. . . , z^{k−1})^T and v_k(z) = z^{k−1} u_k(z^{−1}). The stability condition of the ARMA(p, q) process
+implies that all the roots of the monic polynomials a(z) and b(z) lie outside the unit circle. Consequently,
+the roots of the polynomials â(z) and b̂(z) lie within the unit circle and will be used as the poles for
+computing the integrals (18)–(21) when Cauchy’s residue theorem is applied. Notice that the FIM F(ϑ)
+is symmetric block Toeplitz so that F_{ab}(ϑ) = F_{ba}^T(ϑ) and the integrands in (18)–(21) are Hermitian.
+The computation of the integral expressions (18)–(21) is easily implementable by using the standard
+residue theorem. The algorithms displayed in [33] and [22] are suited for numerical computations of,
+among others, the FIM of an ARMA(p, q) process.
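+
+As a hedged worked example, not taken from the cited papers, the residue computation can be carried out by hand for an ARMA(1, 1) process. Assume a(z) = 1 + a_1 z and b(z) = 1 + b_1 z with |a_1| < 1 and |b_1| < 1, so that u_1(z) = v_1(z) = 1, â(z) = z + a_1 and b̂(z) = z + b_1. Each integrand in (18)–(21) then has a single pole inside the unit circle, at z = −a_1 or z = −b_1:
+
+```latex
+\[
+F_{aa}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{dz}{(1 + a_1 z)(z + a_1)} = \frac{1}{1 - a_1^2},
+\qquad
+F_{bb}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{dz}{(1 + b_1 z)(z + b_1)} = \frac{1}{1 - b_1^2},
+\]
+\[
+F_{ab}(\vartheta) = F_{ba}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{dz}{(1 + a_1 z)(z + b_1)} = -\frac{1}{1 - a_1 b_1},
+\]
+\[
+\operatorname{Det} F(\vartheta) = \frac{1}{(1 - a_1^2)(1 - b_1^2)} - \frac{1}{(1 - a_1 b_1)^2}
+= \frac{(a_1 - b_1)^2}{(1 - a_1^2)(1 - b_1^2)(1 - a_1 b_1)^2}
+\]
+```
+
+In this special case the FIM is singular exactly when a_1 = b_1, that is, when â(z) and b̂(z) share the root z = −a_1.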
+
+3.2. The Sylvester Resultant Matrix - The Fisher Information Matrix
+
+The resultant property of a matrix is considered. To show that the FIM F(ϑ) has the matrix
+resultant property means to show that the matrix F(ϑ) becomes singular if and only if the appropriate
+scalar monic polynomials â(z) and b̂(z) have at least one common zero. To illustrate the subject, the
+following known property of two polynomials is set forth. The greatest common divisor (frequently
+abbreviated as GCD) of two polynomials is a polynomial, of the highest possible degree, that is a factor
+of both original polynomials; the roots of the GCD of two polynomials are the common roots of
+the two polynomials. Consider the coefficients of two monic polynomials p(z) and q(z) of finite degree,
+as the entries of a matrix such that the matrix becomes singular if and only if the polynomials p(z) and
+q(z) have at least one common root. Such a matrix is called a resultant matrix and its determinant is
+called the resultant. Therefore we present the known (p + q) × (p + q) Sylvester resultant matrix of the
+polynomials a and b, see e.g., [2], to obtain
+
+\[ S(a, b) = \begin{pmatrix}
+1 & a_1 & \cdots & a_p & 0 & \cdots & 0 \\
+0 & \ddots & \ddots & & \ddots & \ddots & \vdots \\
+\vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\
+0 & \cdots & 0 & 1 & a_1 & \cdots & a_p \\
+1 & b_1 & \cdots & b_q & 0 & \cdots & 0 \\
+0 & \ddots & \ddots & & \ddots & \ddots & \vdots \\
+\vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\
+0 & \cdots & 0 & 1 & b_1 & \cdots & b_q
+\end{pmatrix} \tag{22} \]
+
+Consider the p × (p + q) and q × (p + q) upper and lower submatrices S_p(b) and S_q(−a) = −S_q(a) of the Sylvester
+resultant matrix S(b, −a) such that
+
+\[ S(b, -a) = \begin{pmatrix} S_p(b) \\ -S_q(a) \end{pmatrix} \tag{23} \]
+
+The matrix S(a, b) becomes singular in the presence of one or more common zeros of the monic polynomials â(z)
+and b̂(z); this property is assessed by the following equalities
+
+\[ R(a, b) = \prod_{\substack{i = 1, \ldots, p \\ j = 1, \ldots, q}} (\alpha_i - \beta_j), \qquad R(b, a) = (-1)^{pq} \prod_{\substack{i = 1, \ldots, p \\ j = 1, \ldots, q}} (\alpha_i - \beta_j) \tag{24} \]
+
+and
+
+\[ R(b, -a) = (-1)^{q} \prod_{\substack{i = 1, \ldots, p \\ j = 1, \ldots, q}} (\beta_j - \alpha_i), \qquad R(-b, a) = (-1)^{p} \prod_{\substack{i = 1, \ldots, p \\ j = 1, \ldots, q}} (\beta_j - \alpha_i) \tag{25} \]
+
+where R(a, b) is the resultant of â(z) and b̂(z), and is equal to Det S(a, b).
+The string of equalities in (24) and (25)
+holds since R(b, a) = (−1)^{pq} R(a, b), R(b, −a) = (−1)^q R(b, a), and R(−b, a) = (−1)^p R(b, a), see [34]. The
+zeros of the scalar monic polynomials â(z) and b̂(z) are α_i and β_j respectively and are assumed to be
+distinct. By this is meant that, when we have factors (z − α_i)^{n_{α_i}} and (z − β_j)^{n_{β_j}} with the powers n_{α_i} and n_{β_j} both
+greater than one, only the distinct roots will be considered, free from the corresponding powers.
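+
+The sign conventions in (24) and (25) can be checked by hand in the smallest case. The following sketch assumes p = q = 1, so that â(z) = z + a_1 has the zero α_1 = −a_1 and b̂(z) = z + b_1 has the zero β_1 = −b_1; this special case is chosen here for illustration only:
+
+```latex
+\[
+S(a, b) = \begin{pmatrix} 1 & a_1 \\ 1 & b_1 \end{pmatrix}, \qquad
+\operatorname{Det} S(a, b) = b_1 - a_1 = \alpha_1 - \beta_1 = R(a, b),
+\]
+\[
+S(b, -a) = \begin{pmatrix} 1 & b_1 \\ -1 & -a_1 \end{pmatrix}, \qquad
+\operatorname{Det} S(b, -a) = b_1 - a_1 = (-1)^{1} (\beta_1 - \alpha_1) = R(b, -a)
+\]
+```
+
+Both determinants vanish exactly when a_1 = b_1, that is, when the two polynomials have a common zero.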
+
+The key property of the classical Sylvester resultant matrix S(a, b) is that its null space provides a
+complete description of the common zeros of the polynomials involved. In particular, in the scalar
+case the polynomials â(z) and b̂(z) are coprime if and only if S(a, b) is non-singular. This
+key property of the classical Sylvester resultant matrix S(a, b) is made precise by the well known theorem on
+resultants, which states
+
+\[ \dim \operatorname{Ker} S(a, b) = \nu(a, b) \tag{26} \]
+
+where ν(a, b) is the number of common roots of the polynomials â(z) and b̂(z), counting
+multiplicities, see e.g., [3]. The dimension of a subspace V is represented by dim(V), and Ker(X)
+is the null space or kernel of the matrix X, denoted by Null or Ker. The null space of an n × n matrix A
+with coefficients in a field K (typically the field of the real numbers or of the complex numbers) is the
+set Ker A = {x ∈ K^n : Ax = 0}, see e.g., [1,2,20].
+
+In order to prove that the FIM F (ϑ) fulfills the resultant matrix property, the following
+factorization is derived, Lemma 2.1 in [5],
+
+\[ F(\vartheta) = S(b, -a)\, \wp(\vartheta)\, S^{\top}(b, -a) \tag{27} \]
+
+where the matrix ℘(ϑ) ∈ R^{(p+q)×(p+q)} admits the form
+
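+\[ \wp(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+q}(z)\, u_{p+q}^{\top}(z^{-1})}{a(z)\, b(z)\, a(z^{-1})\, b(z^{-1})} \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+q}(z)\, v_{p+q}^{\top}(z)}{a(z)\, b(z)\, \hat{a}(z)\, \hat{b}(z)}\, dz \tag{28} \]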
+
+It is proved in [5] that the symmetric matrix ℘(ϑ) fulfills the property ℘(ϑ) ≻ O. The factorization (27)
+allows us to show the matrix resultant property of the FIM; Corollary 2.2 in [5] states:
+The FIM of an ARMA(p, q) process with polynomials a(z) and b(z) of order p, q respectively
+becomes singular if and only if the polynomials â(z) and b̂(z) have at least one common root.
+From Corollary 2.2 in [5] it can be concluded that the FIM of an ARMA(p, q) process and the Sylvester
+resultant matrix S(−b, a) have the same singularity property. By virtue of (26) and (27) we will specify the dimension of
+the null space of the FIM F(ϑ); this is set forth in the following lemma.
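+
+The factorization (27) can be verified by hand for an ARMA(1, 1) process. The following is a sketch under the assumptions a(z) = 1 + a_1 z, b(z) = 1 + b_1 z, |a_1| < 1, |b_1| < 1 and a_1 ≠ b_1 (so that the poles z = −a_1 and z = −b_1 in (28) are simple); it is an illustration added here, not a computation quoted from [5]. With u_2(z) = (1, z)^T and v_2(z) = (z, 1)^T, the residue theorem applied to (28) gives
+
+```latex
+\[
+\wp(\vartheta) = \frac{1}{(1 - a_1^2)(1 - b_1^2)(1 - a_1 b_1)}
+\begin{pmatrix} 1 + a_1 b_1 & -(a_1 + b_1) \\ -(a_1 + b_1) & 1 + a_1 b_1 \end{pmatrix},
+\qquad
+\operatorname{Det} \wp(\vartheta) = \frac{1}{(1 - a_1^2)(1 - b_1^2)(1 - a_1 b_1)^2} > 0,
+\]
+\[
+S(b, -a) = \begin{pmatrix} 1 & b_1 \\ -1 & -a_1 \end{pmatrix}
+\quad \Longrightarrow \quad
+S(b, -a)\, \wp(\vartheta)\, S^{\top}(b, -a) =
+\begin{pmatrix} \dfrac{1}{1 - a_1^2} & -\dfrac{1}{1 - a_1 b_1} \\[2ex] -\dfrac{1}{1 - a_1 b_1} & \dfrac{1}{1 - b_1^2} \end{pmatrix}
+\]
+```
+
+The right-hand side agrees with a direct residue evaluation of (18)–(21) for p = q = 1, and Det F(ϑ) = (b_1 − a_1)^2 Det ℘(ϑ), so the singularity of F(ϑ) is carried entirely by the Sylvester factor.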
+
+3.2.1. Lemma 3.1
+
+Assume that the polynomials â(z) and b̂(z) have ν(a, b) common roots, counting multiplicities.
+The factorization (27) of the FIM and the property (26) enable us to prove the equality
+
+\[ \dim(\operatorname{Ker} F(\vartheta)) = \dim(\operatorname{Ker} S(b, -a)) = \nu(a, b) \tag{29} \]
+
+Proof
+
+The matrix ℘(ϑ) ∈ R^{(p+q)×(p+q)}, given in (28), fulfills the property of positive definiteness, as
+proved in [5]. This implies that a Cholesky decomposition can be applied to ℘(ϑ), see [35] for more
+details, to obtain ℘(ϑ) = L^T(ϑ)L(ϑ), where L(ϑ) is a real (p+q)×(p+q) upper triangular matrix that is unique if
+its diagonal elements are all positive. Consequently, all its eigenvalues are then positive so that the
+matrix L(ϑ) is invertible. Factorization (27) now admits the representation
+
+\[ F(\vartheta) = S(b, -a)\, L^{\top}(\vartheta)\, L(\vartheta)\, S^{\top}(b, -a) \tag{30} \]
+
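+and taking into account the property that, if A is an m × n matrix, then Ker(A) = Ker(A^T A), yields when
+applied to (30) with A = L(ϑ)S^T(b, −a)
+
+\[ \operatorname{Ker} F(\vartheta) = \operatorname{Ker} S(b, -a)\, L^{\top}(\vartheta)\, L(\vartheta)\, S^{\top}(b, -a) = \operatorname{Ker} L(\vartheta)\, S^{\top}(b, -a) \]
+
+Assume the vector u ∈ Ker L(ϑ)S^T(b, −a), such that L(ϑ)S^T(b, −a)u = 0, and set S^T(b, −a)u = v, so that
+L(ϑ)v = 0. Since the matrix L(ϑ) is invertible, v = 0; this implies S^T(b, −a)u = 0 ⇒ u ∈ Ker S^T(b, −a).
+Conversely, S^T(b, −a)u = 0 clearly implies L(ϑ)S^T(b, −a)u = 0.
+Consequently,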
+
+\[ \operatorname{Ker} F(\vartheta) = \operatorname{Ker} S^{\top}(b, -a) \tag{31} \]
+
+We will now consider the Rank-Nullity Theorem, see e.g., [1]: if A is an m × n matrix, then
+
+\[ \dim(\operatorname{Ker} A) + \dim(\operatorname{Im} A) = n \]
+
+and the property dim(Im A) = dim(Im A^T). When applied to the (p + q) × (p + q) matrix S(b, −a),
+it yields
+
+\[ \dim(\operatorname{Ker} S(b, -a)) = \dim(\operatorname{Ker} S^{\top}(b, -a)) \;\Rightarrow\; \dim(\operatorname{Ker} F(\vartheta)) = \dim(\operatorname{Ker} S(b, -a)) \]
+
+which completes the proof.
+Notice that the dimension of the null space of matrix A is called the nullity of A and the dimension
+of the image of matrix A, dim(Im A), is termed the rank of matrix A. An alternative proof to the one
+developed in Corollary 2.2 in [5] is given in a corollary to Lemma 3.1, reconfirming the resultant
+matrix property of the FIM F(ϑ).
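+
+Lemma 3.1 and equality (31) can be illustrated in the simplest singular case. The sketch below assumes an ARMA(1, 1) process with a common coefficient a_1 = b_1 = c, |c| < 1, so that â(z) and b̂(z) share the single root z = −c and ν(a, b) = 1; the symbol c is introduced here only for this example:
+
+```latex
+\[
+F(\vartheta) = \frac{1}{1 - c^2} \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, \qquad
+S(b, -a) = \begin{pmatrix} 1 & c \\ -1 & -c \end{pmatrix}, \qquad
+S^{\top}(b, -a) = \begin{pmatrix} 1 & -1 \\ c & -c \end{pmatrix},
+\]
+\[
+\operatorname{Ker} F(\vartheta) = \operatorname{Ker} S^{\top}(b, -a) = \operatorname{span}\left\{ \begin{pmatrix} 1 \\ 1 \end{pmatrix} \right\}, \qquad
+\operatorname{Ker} S(b, -a) = \operatorname{span}\left\{ \begin{pmatrix} -c \\ 1 \end{pmatrix} \right\}
+\]
+```
+
+The null spaces of S(b, −a) and of its transpose differ as subspaces, but both have dimension one, in agreement with (29).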
+
+3.2.2. Corollary 3.2
+
+The FIM F(ϑ) of an ARMA(p, q) process becomes singular if and only if the autoregressive and moving
+average polynomials â(z) and b̂(z) have at least one common root.
+
+Proof
+
+By virtue of the equality (31), combined with the property Det S^T(b, −a) = Det S(b, −a) and
+the matrix resultant property of the Sylvester matrix S(b, −a), we have Det S^T(b, −a) = 0 ⇔ Ker S^T(b, −a) ≠ {0}
+if and only if the ARMA(p, q) polynomials â(z) and b̂(z) have at least one common root.
+Equivalently, Det S^T(b, −a) ≠ 0 ⇔ Ker S^T(b, −a) = {0} if and only if the ARMA(p, q) polynomials
+â(z) and b̂(z) have no common roots. Consequently, by virtue of the equality Ker F(ϑ) = Ker S^T(b, −a)
+it can be concluded that the FIM F(ϑ) becomes singular if and only if the ARMA(p, q) polynomials â(z)
+and b̂(z) have at least one common root. This completes the proof.
+
+3.3. The Statistical Distance Measure and the Fisher Information Matrix
+
+In [7] statistical distance measures are studied. Most multivariate statistical techniques are based
+upon the concept of distance. For that purpose a statistical distance measure is considered that is
+a normalized Euclidean distance measure with entries of the FIM as weighting coefficients. The
+measurements x1, x2,. . . , xn are subject to random fluctuations of different magnitudes and have
+therefore different variabilities. It is then important to consider a distance that takes the variability
+of these variables or measurements into account when determining its distance from a fixed point. A
+rotation of the coordinate system through a chosen angle while keeping the scatter of points given
+by the data fixed, is also applied, see [7] for more details. It is shown that when the FIM is positive
+definite, the appropriate statistical distance measure is a metric. In case of a singular FIM of an ARMA
+stationary process, the metric property depends on the rotation angle. The statistical distance measure,
+is based on m parameters unlike a statistical distance measure introduced in quantum information, see
+e.g., [8,9], that is also related to the Fisher information but where the information about one parameter
+in a particular measurement procedure is considered.
+
+The straight-line or Euclidean distance between the stochastic vector x = (x_1 x_2 . . . x_n)^T
+and the fixed vector y = (y_1 y_2 . . . y_n)^T, where x, y ∈ R^n, is given by
+
+\[ d(x, y) = \|x - y\| = \left( \sum_{j=1}^{n} (x_j - y_j)^2 \right)^{1/2} \tag{32} \]
+
+where the metric d(x, y) := ||x − y|| is induced by the standard Euclidean norm || · || on R^n, see
+e.g., [2] for the metric conditions.
+The observations x_1, x_2, . . . , x_n are used to compute maximum likelihood estimates of the
+parameters ϑ_1, ϑ_2, . . . , ϑ_m, where m < n. These estimated parameters are random variables, see
+e.g., [15]. The distance of the estimated vector ϑ ∈ R^m, given in (15), is studied. Entries of the FIM are
+inserted in the distance measure as weighting coefficients. The linear transformation
+
+\[ \tilde{\vartheta} = L_i(\phi)\, \vartheta \tag{33} \]
+
+is applied, where L_i(ϕ) ∈ R^{m×m} is the Givens rotation matrix with rotation angle ϕ, with 0 ≤ ϕ ≤ 2π
+and i ∈ {1, . . . , m − 1}, see e.g., [36], and is given by
+
+\[ L_i(\phi) = \begin{pmatrix}
+I_{i-1} & 0 & 0 & 0 \\
+0 & (\cos(\phi))_{i,i} & (-\sin(\phi))_{i,i+1} & 0 \\
+0 & (\sin(\phi))_{i+1,i} & (\cos(\phi))_{i+1,i+1} & 0 \\
+0 & 0 & 0 & I_{m-i-1}
+\end{pmatrix}, \qquad 0 \le \phi \le 2\pi \tag{34} \]
+
+The following matrix decomposition is applied in order to obtain a transformed FIM
+
+\[ F_{\phi}(\vartheta) = L_i(\phi)\, F(\vartheta)\, L_i^{\top}(\phi) \tag{35} \]
+
+where F_ϕ(ϑ) and F(ϑ) are respectively the transformed and untransformed Fisher information
+matrices. It is straightforward to conclude that by virtue of (35), the transformed and untransformed
+Fisher information matrices F_ϕ(ϑ) and F(ϑ) are similar since the rotation matrix L_i(ϕ) is orthogonal.
+Two matrices A and B are similar if there exists an invertible matrix X such that the equality AX = XB
+holds. As can be seen, the Givens matrix L_i(ϕ) involves only two coordinates that are affected by the
+rotation angle ϕ whereas the other directions, which correspond to eigenvalues of one, are unaffected
+by the rotation matrix.
+
+By virtue of (35) it can be concluded that a positive definite FIM, F(ϑ) ≻ 0, implies a positive
+definite transformed FIM, F_ϕ(ϑ) ≻ 0. Consequently, the elements on the main diagonal of F(ϑ), f_{1,1},
+f_{2,2}, . . . , f_{m,m}, as well as the elements on the main diagonal of F_ϕ(ϑ), f̃_{1,1}, f̃_{2,2}, . . . , f̃_{m,m}, are all
+positive. However, the elements on the main diagonal of a singular FIM of a stationary ARMA process
+are also positive.
+As developed in [7], combining (33) and (35) yields the distance measure of the estimated
+parameters ϑ1, ϑ2, . . . , ϑm accordingly, to obtain
+
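+\[ d^2_{F_{\phi}}(\vartheta) = \sum_{j=1,\, j \neq i, i+1}^{m} \frac{\vartheta_j^2}{f_{j,j}} + \frac{\{\vartheta_i \cos(\phi) - \vartheta_{i+1} \sin(\phi)\}^2}{\tilde{f}_{i,i}(\phi)} + \frac{\{\vartheta_{i+1} \cos(\phi) + \vartheta_i \sin(\phi)\}^2}{\tilde{f}_{i+1,i+1}(\phi)} \tag{36} \]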
+
+where
+
+\[ \tilde{f}_{i,i}(\phi) = f_{i,i} \cos^2(\phi) - f_{i,i+1} \sin(2\phi) + f_{i+1,i+1} \sin^2(\phi) \tag{37} \]
+
+
+\[ \tilde{f}_{i+1,i+1}(\phi) = f_{i+1,i+1} \cos^2(\phi) + f_{i,i+1} \sin(2\phi) + f_{i,i} \sin^2(\phi) \tag{38} \]
+
+and f_{j,l} are entries of the FIM F(ϑ) whereas f̃_{i,i}(ϕ) and f̃_{i+1,i+1}(ϕ) are the transformed components,
+since the rotation affects only the entries i and i+1, as can be seen in matrix L_i(ϕ). In [7], the existence
+of the following inequalities is proved
+
+\[ \tilde{f}_{i,i}(\phi) > 0 \qquad \text{and} \qquad \tilde{f}_{i+1,i+1}(\phi) > 0 \]
+
+this guarantees the metric property of (36).
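+
+The role of the rotation angle can be made concrete in the smallest case. The following sketch assumes m = 2, i = 1 and the particular angle ϕ = π/4, chosen here for illustration only; the singular example reuses an ARMA(1, 1) process with equal coefficients a_1 = b_1 = c, |c| < 1, for which a residue evaluation of (18)–(21) gives f_{1,1} = f_{2,2} = 1/(1 − c^2) and f_{1,2} = −1/(1 − c^2):
+
+```latex
+\[
+\tilde{f}_{1,1}(\pi/4) = \tfrac{1}{2}(f_{1,1} + f_{2,2}) - f_{1,2}, \qquad
+\tilde{f}_{2,2}(\pi/4) = \tfrac{1}{2}(f_{1,1} + f_{2,2}) + f_{1,2}, \qquad
+\tilde{f}_{1,1}(\phi) + \tilde{f}_{2,2}(\phi) = f_{1,1} + f_{2,2},
+\]
+\[
+a_1 = b_1 = c: \qquad \tilde{f}_{1,1}(\pi/4) = \frac{2}{1 - c^2} > 0, \qquad \tilde{f}_{2,2}(\pi/4) = 0
+\]
+```
+
+For a positive definite F(ϑ) both transformed entries are positive, since |f_{1,2}| < (f_{1,1} f_{2,2})^{1/2} ≤ (f_{1,1} + f_{2,2})/2. In the singular example the second entry vanishes at ϕ = π/4, so the corresponding term in (36) is not defined there, which is consistent with the remark that for a singular FIM the metric property depends on the rotation angle.
+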
+For the FIM of an ARMA(p, q) process,
+a combination of (27) and (35) for the ARMA(p, q) parameters given in (15) yields for the
+transformed FIM,
+
+\[ F_{\phi}(\vartheta) = S_{\phi}(-b, a)\, \wp(\vartheta)\, S_{\phi}^{\top}(-b, a) \tag{39} \]
+
+where ℘(ϑ) is given by (28) and the transformed Sylvester resultant matrix is of the form
+
+\[ S_{\phi}(-b, a) = L_i(\phi)\, S(-b, a) \tag{40} \]
+
+Proposition 3.5 in [7], proves that the transformed FIM F ϕ(ϑ) and the transformed Sylvester matrix
+Sφ (−b, a) fulfill the resultant matrix property by using the equalities (40) and (39). The following
+property is then set forth.
+
+3.3.1. Proposition 3.3
+
+The properties
+
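+\[ \operatorname{Ker} F_{\phi}(\vartheta) = \operatorname{Ker} S_{\phi}^{\top}(-b, a) \qquad \text{and} \qquad \operatorname{Ker} S_{\phi}(-b, a) = \operatorname{Ker} S(-b, a) \]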
+
+hold true.
+
+Proof
+
+By virtue of the equalities (39), (40) and the orthogonality property of the rotation matrix Li(ϕ)
+which implies that Ker Li(ϕ) = {0} combined with the same approach as in Lemma 3.1 completes
+the proof.
+A straightforward conclusion from Proposition 3.3 is then
+
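+\[ \dim \operatorname{Ker} F_{\phi}(\vartheta) = \dim \operatorname{Ker} S_{\phi}(-b, a), \qquad \dim \operatorname{Ker} S_{\phi}(-b, a) = \dim \operatorname{Ker} S(-b, a) \]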
+
+In the next section a distance measure introduced in quantum information is discussed.
+
+Statistical Distance Measure - Fisher Information and Quantum Information
+
+In quantum information, the Fisher information, the information about a parameter θ in a
+particular measurement procedure, is expressed in terms of the statistical distance s, see [8,10]. The
+statistical distance used is defined as a measure to distinguish two probability distributions on the basis
+of measurement outcomes, see [37]. The Fisher information and the statistical distance are statistical
+quantities, and generally refer to many measurements as it is the case in this survey. However, in
+the quantum information theory and quantum statistics context, the problem set up is presented as
+follows. There may or may not be a small phase change θ, and the question is whether it is there. In
+that case you can design quantum experiments that will tell you the answer unambiguously in a single
+measurement. The equality derived is of the form
+
+\[
+F(\theta) = \left( \frac{ds}{d\theta} \right)^{2} \tag{41}
+\]
+
+
+that is, the Fisher information is the square of the derivative of the statistical distance s with respect to θ.
+This is in contrast to (36), where the square of the statistical distance measure is expressed in terms of
+entries of a FIM F(ϑ) which is based on information about m parameters estimated from n measurements,
+for m < n. A challenging question can therefore be formulated as follows: can a generalization of
+equality (41) be developed in a quantum information context, but at the matrix level? To be more
+specific, consider many observations or measurements that lead to more than one parameter, such that the
+corresponding Fisher information matrix is interconnected with an appropriate statistical distance matrix,
+a matrix whose entries are scalar distance measures. This question could equally be a challenge to
+algebraic matrix theory and to quantum information.
+
+3.4. The Bezoutian - The Fisher Information Matrix
+
+In this section an additional resultant matrix is presented, namely the Bezout matrix or
+Bezoutian. The notation of Lancaster and Tismenetsky [2] shall be used and the results presented are
+extracted from [38]. Assume the polynomials a and b are given by $a(z) = \sum_{j=0}^{n} a_j z^j$ and
+$b(z) = \sum_{j=0}^{n} b_j z^j$, cf. (13) but where p = q = n, and we further assume a0 = b0 = 1.
+The Bezout matrix B(a, b) of the polynomials a and b is defined by the relation
+
+\[
+a(z)b(w) - a(w)b(z) = (z - w)\,u_n^{\top}(z)\,B(a, b)\,u_n(w)
+\]
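+
+As a small illustration of this defining relation, the sketch below works out the case n = 2. It assumes
+the basic vector u_2(z) = (1, z)^T and the normalization a0 = b0 = 1; the explicit 2 × 2 matrix is derived
+here for illustration only and is not quoted from [38].
+
+```latex
+% Assumed: a(z) = 1 + a_1 z + a_2 z^2,  b(z) = 1 + b_1 z + b_2 z^2,  u_2(z) = (1, z)^\top
+\begin{align*}
+a(z)b(w) - a(w)b(z)
+  &= (a_1 - b_1)(z - w) + (a_2 - b_2)(z^2 - w^2) + (a_2 b_1 - a_1 b_2)\,zw\,(z - w) \\
+  &= (z - w)\left[ (a_1 - b_1) + (a_2 - b_2)(z + w) + (a_2 b_1 - a_1 b_2)\,zw \right]
+\end{align*}
+% Reading off the coefficients of 1, z, w and zw in the bracket:
+\[
+B(a, b) = \begin{pmatrix} a_1 - b_1 & a_2 - b_2 \\ a_2 - b_2 & a_2 b_1 - a_1 b_2 \end{pmatrix},
+\qquad
+\det B(a, b) = (a_1 - b_1)(a_2 b_1 - a_1 b_2) - (a_2 - b_2)^2
+\]
+% Check with a common zero of \hat a and \hat b:
+% \hat a(z) = (z-1)(z-2), \hat b(z) = (z-1)(z-3), i.e. a_1 = -3, a_2 = 2, b_1 = -4, b_2 = 3,
+% gives \det B(a, b) = (1)(1) - (-1)^2 = 0.
+```
+
+The vanishing determinant in the check is consistent with the singularity property of the Bezoutian
+discussed in this section.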
+
+This matrix is often referred to as the Bezoutian. We will display a decomposition of the Bezout matrix
+B(a, b) developed in [38]. For that purpose the matrix Uϕ and its inverse Tϕ are presented, where ϕ is a
+given complex number, to obtain
+
+\[
+U_{\phi} =
+\begin{pmatrix}
+1 & 0 & \cdots & \cdots & 0 \\
+-\phi & 1 & \cdots & \cdots & 0 \\
+0 & \ddots & \ddots & & \vdots \\
+\vdots & \ddots & \ddots & \ddots & \vdots \\
+0 & \cdots & 0 & -\phi & 1
+\end{pmatrix},
+\qquad
+T_{\phi} =
+\begin{pmatrix}
+1 & 0 & \cdots & \cdots & 0 \\
+\phi & 1 & \cdots & \cdots & 0 \\
+\phi^{2} & \ddots & \ddots & & \vdots \\
+\vdots & \ddots & \ddots & \ddots & \vdots \\
+\phi^{n-1} & \cdots & \phi^{2} & \phi & 1
+\end{pmatrix}
+\]
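+
+That Tϕ is indeed the inverse of Uϕ can be checked directly in a small case. The sketch below takes
+n = 3, a size chosen here purely for illustration.
+
+```latex
+% Assumed size n = 3; the same pattern holds for general n.
+\[
+U_{\phi} T_{\phi} =
+\begin{pmatrix} 1 & 0 & 0 \\ -\phi & 1 & 0 \\ 0 & -\phi & 1 \end{pmatrix}
+\begin{pmatrix} 1 & 0 & 0 \\ \phi & 1 & 0 \\ \phi^{2} & \phi & 1 \end{pmatrix}
+=
+\begin{pmatrix} 1 & 0 & 0 \\ -\phi + \phi & 1 & 0 \\ -\phi^{2} + \phi^{2} & -\phi + \phi & 1 \end{pmatrix}
+= I_3
+\]
+```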
+
+Let (1 − α1z) and (1 − β1z) be factors of a(z) and b(z) respectively, where α1 and β1 are zeros of â(z)
+and b̂(z). Consider the factored form of the nth order polynomials a(z) and b(z), a(z) = (1 − α1z)a−1(z)
+and b(z) = (1 − β1z)b−1(z) respectively. Proceeding this way for α2, . . . , αn yields the
+recursion a−(k−1)(z) = (1 − αkz)a−k(z), and equivalently for the polynomials b−k(z), with a0(z) = a(z) and
+b0(z) = b(z). Proposition 3.1 in [38] is presented.
+The following non-symmetric decomposition of the Bezoutian is derived, considering the
+notations above
+
+\[
+B(a, b) = U_{\alpha_1}
+\begin{pmatrix} B(a_{-1}, b_{-1}) & 0 \\ 0 & 0 \end{pmatrix}
+U_{\beta_1}^{\top} + (\beta_1 - \alpha_1)\, b_{\beta_1} a_{\alpha_1}^{\top} \tag{42}
+\]
+
+with $a_{\alpha_1}$ such that $a_{\alpha_1}^{\top} u_n(z) = a_{-1}(z)$, and similarly for $b_{\beta_1}$. Iteration gives the
+following expansion for the Bezout matrix
+
+\[
+B(a, b) = \sum_{k=1}^{n} (\beta_k - \alpha_k)\, U_{\alpha_1} \cdots U_{\alpha_{k-1}} U_{\beta_{k+1}} \cdots U_{\beta_n}\, e_1^{n} (e_1^{n})^{\top}\, U_{\beta_1}^{\top} \cdots U_{\beta_{k-1}}^{\top} U_{\alpha_{k+1}}^{\top} \cdots U_{\alpha_n}^{\top}
+\]
+
+where $e_1^{n}$ is the first unit standard basis column vector in R^n; by ej we denote the jth coordinate vector,
+ej = (0, . . . , 1, . . . , 0)^T, with all its components equal to 0 except the jth component, which equals 1.
+The following corollaries to Proposition 3.1 in [38] are now presented.
+
+Corollary 3.2 in [38] states the following. Let ϕ be a common zero of the polynomials â(z) and b̂(z). Then
+a(z) = (1 − ϕz)a−1(z) and b(z) = (1 − ϕz)b−1(z) and
+
+\[
+B(a, b) = U_{\phi}
+\begin{pmatrix} B(a_{-1}, b_{-1}) & 0 \\ 0 & 0 \end{pmatrix}
+U_{\phi}^{\top}
+\]
+
+This is a direct consequence of (42), from which it can be concluded that the Bezoutian B(a, b) is
+non-singular if and only if the polynomials a(z) and b(z) have no common factors. A similar conclusion
+is drawn for the FIM in (27), so that the matrices F(ϑ) and B(a, b) have the same singularity property.
+Related to Corollary 3.2 in [38], we now give a description of the kernel or nullspace of
+the Bezout matrix.
+Corollary 3.3 in [38] is now presented. Let ϕ1, . . ., ϕm be all the common zeros of the polynomials
+â(z) and b̂(z), with multiplicities n1, . . . , nm. Let ℓ be the last unit standard basis column vector in R^n
+and put
+
+\[
+w_k^{j} = \left( T_{\phi_k}^{j} J^{j-1} \right)^{\top} \ell
+\]
+
+for k = 1, . . . , m and j = 1, . . . , nk, where J denotes the forward n × n shift matrix, Jij = 1 if i = j + 1.
+Consequently, the subspace Ker B(a, b) is the linear span of the vectors $w_k^{j}$.
+An alternative representation to (27), involving the Bezoutian B(b, a) and derived in
+Proposition 5.1 in [38], is of the form
+
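+\[
+F(\vartheta) = M^{-1}(b, a)\, H(\vartheta)\, M^{-\top}(b, a) \tag{43}
+\]
+
+where
+
+\[
+H(\vartheta) =
+\begin{pmatrix} I & 0 \\ 0 & B(b, a) \end{pmatrix}
+Q(\vartheta)
+\begin{pmatrix} I & 0 \\ 0 & B(b, a) \end{pmatrix}
+\quad \text{and} \quad
+M(b, a) =
+\begin{pmatrix} P & 0 \\ P S(\hat{a}) P & P S(\hat{b}) P \end{pmatrix}
+\tag{44}
+\]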
+
+and
+
+\[
+P =
+\begin{pmatrix}
+0 & \cdots & 0 & 1 \\
+\vdots & & 1 & 0 \\
+0 & \iddots & & \vdots \\
+1 & 0 & \cdots & 0
+\end{pmatrix},
+\qquad
+S(\hat{a}) =
+\begin{pmatrix}
+a_{n-1} & a_{n-2} & \cdots & a_0 \\
+a_{n-2} & \iddots & a_0 & 0 \\
+\vdots & \iddots & \iddots & \vdots \\
+a_0 & 0 & \cdots & 0
+\end{pmatrix}
+\quad \text{and} \quad Q(\vartheta) \succ 0
+\]
+
+The matrix S(â) is the symmetrizer of the polynomial â(z), in this paper a0 = 1, see [2] and P is a
+permutation matrix. In [38] it is shown that the matrix Q(ϑ) is the unique solution to an appropriate
+Stein equation and is strictly positive definite. However, in the next section an explicit form of the Stein
+solution Q(ϑ) is developed. Some comments concerning the property summarized in Corollary 5.2
+in [38] follow.
+
+The matrix H(ϑ) is non-singular if and only if the polynomials a(z) and b(z) have no common
+factors. The proof is straightforward: since the matrix Q(ϑ) is non-singular, the matrix H(ϑ) is
+non-singular exactly when the Bezoutian B(b, a) is non-singular, and this is fulfilled if and
+only if the polynomials a(z) and b(z) have no common factors.
+
+The matrix M(b, a) is non-singular if a0 ≠ 0 and b0 ≠ 0, which is the case since we have a0 = b0 = 1.
+From (43) it can be concluded that the FIM F(ϑ) is non-singular only when the matrix H(ϑ)
+is non-singular or, by virtue of (44), when the Bezoutian B(b, a) is non-singular. Consequently, the
+singularity conditions of the Bezoutian B(b, a), the FIM F(ϑ) and the Sylvester resultant matrix
+S(b, −a) are equivalent. By virtue of (29), proved in Lemma 3.1, and the equality
+dim (Ker S(a, b)) = dim (Ker B(a, b)), proved in Theorem 21.11 in [1], it can be concluded that
+
+\[
+\dim\left(\operatorname{Ker} \mathcal{S}(b, -a)\right) = \dim\left(\operatorname{Ker} F(\vartheta)\right) = \dim\left(\operatorname{Ker} B(b, a)\right) = \nu(a, b)
+\]
+
+3.5. The Stein Equation - The Fisher Information Matrix of an ARMA(p, q) Process
+
+In [12], a link between the FIM of an ARMA process and an appropriate solution of a Stein
+equation is set forth. In this survey paper we shall present some of the results and confront some
+results displayed in the previous sections. However, alternative proofs will be given to some results
+obtained in [12,38].
+The Stein matrix equation is now set forth. Let A ∈ Cm×m, B ∈ Cn×n and Γ ∈ Cn×m and consider
+the Stein equation
+
+S − BSA⊤ = Γ
+(45)
+
+It has a unique solution if and only if λμ ≠ 1 for any λ ∈ σ(A) and μ ∈ σ(B), where the spectrum of a
+matrix D is σ(D) = {λ ∈ C: det(λIm − D) = 0}, the set of eigenvalues of D. The unique solution is given
+in the next theorem [11].
+
+3.5.1. Theorem 3.4
+
+Let A and B be such that there is a single closed contour C with σ(B) inside C and, for each non-zero
+w ∈ σ(A), w−1 outside C. Then for an arbitrary Γ the Stein equation (45) has a unique solution S
+
+\[
+S = \frac{1}{2\pi i} \oint_{C} (\lambda I_n - B)^{-1}\, \Gamma\, (I_m - \lambda A)^{-\top}\, d\lambda \tag{46}
+\]
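+
+To see how the contour formula (46) reproduces the solution of (45), the sketch below considers the
+scalar case m = n = 1. The lower-case symbols a, b, γ, s stand for the 1 × 1 matrices A, B, Γ, S and are
+introduced here for illustration only; the contour is assumed to enclose b and to exclude 1/a.
+
+```latex
+% Scalar Stein equation: s - b s a = \gamma, solvable whenever ab \neq 1
+\[
+s\,(1 - ab) = \gamma \quad \Longrightarrow \quad s = \frac{\gamma}{1 - ab}
+\]
+% Contour formula (46) in the scalar case: the only pole inside C is \lambda = b
+\[
+\frac{1}{2\pi i} \oint_{C} \frac{\gamma}{(\lambda - b)(1 - \lambda a)}\, d\lambda
+= \operatorname*{Res}_{\lambda = b} \frac{\gamma}{(\lambda - b)(1 - \lambda a)}
+= \frac{\gamma}{1 - ab}
+\]
+```
+
+Both routes give the same value, and the uniqueness condition λμ ≠ 1 reduces to ab ≠ 1 in this case.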
+
+In this section an interconnection between the representation (27) of the FIM F(ϑ) and an appropriate
+solution to a Stein equation of the form (45), as developed in [12], is set forth. The distinct roots of
+the polynomials â(z) and b̂(z) are denoted by α1, α2, . . . , αp and β1, β2, . . . , βq respectively, such
+that the non-singularity of the FIM F(ϑ) is guaranteed. The following representation of the integral
+expression (28) is obtained when Cauchy's residue theorem is applied, equation (4.8) in [12]
+
+\[
+\wp(\vartheta) = U(\vartheta)\, D(\vartheta)\, \hat{U}(\vartheta) \tag{47}
+\]
+
+where
+
+\[
+U(\vartheta) = \{ u_{p+q}(\alpha_1), u_{p+q}(\alpha_2), \ldots, u_{p+q}(\alpha_p), u_{p+q}(\beta_1), u_{p+q}(\beta_2), \ldots, u_{p+q}(\beta_q) \}
+\]
+
+\[
+D(\vartheta) = \operatorname{diag} \left\{ \frac{1}{\hat{a}(z; \alpha_i)\, \hat{b}(\alpha_i)\, a(\alpha_i)\, b(\alpha_i)},\ \frac{1}{\hat{a}(\beta_j)\, \hat{b}(z; \beta_j)\, a(\beta_j)\, b(\beta_j)} \right\},
+\quad i = 1, \ldots, p \ \text{and} \ j = 1, \ldots, q
+\]
+
+and
+
+\[
+\hat{U}(\vartheta) = \{ v_{p+q}(\alpha_1), v_{p+q}(\alpha_2), \ldots, v_{p+q}(\alpha_p), v_{p+q}(\beta_1), v_{p+q}(\beta_2), \ldots, v_{p+q}(\beta_q) \}^{\top}
+\]
+
+The polynomial p(·; β) is defined accordingly, p(z; β) = p(z)/(z − β), and D(ϑ) is a (p + q) × (p + q) diagonal
+matrix. The matrices U(ϑ) and Û(ϑ) in (47) are the (p + q) × (p + q) Vandermonde matrices Vαβ and
+V̂αβ respectively, given by
+
+
+\[
+V_{\alpha\beta} =
+\begin{pmatrix}
+1 & \alpha_1 & \alpha_1^{2} & \cdots & \alpha_1^{p+q-1} \\
+1 & \alpha_2 & \alpha_2^{2} & \cdots & \alpha_2^{p+q-1} \\
+\vdots & \vdots & \vdots & & \vdots \\
+1 & \alpha_p & \alpha_p^{2} & \cdots & \alpha_p^{p+q-1} \\
+1 & \beta_1 & \beta_1^{2} & \cdots & \beta_1^{p+q-1} \\
+1 & \beta_2 & \beta_2^{2} & \cdots & \beta_2^{p+q-1} \\
+\vdots & \vdots & \vdots & & \vdots \\
+1 & \beta_q & \beta_q^{2} & \cdots & \beta_q^{p+q-1}
+\end{pmatrix}
+\quad \text{and} \quad
+\hat{V}_{\alpha\beta} =
+\begin{pmatrix}
+\alpha_1^{p+q-1} & \alpha_1^{p+q-2} & \cdots & \alpha_1 & 1 \\
+\alpha_2^{p+q-1} & \alpha_2^{p+q-2} & \cdots & \alpha_2 & 1 \\
+\vdots & \vdots & & \vdots & \vdots \\
+\alpha_p^{p+q-1} & \alpha_p^{p+q-2} & \cdots & \alpha_p & 1 \\
+\beta_1^{p+q-1} & \beta_1^{p+q-2} & \cdots & \beta_1 & 1 \\
+\beta_2^{p+q-1} & \beta_2^{p+q-2} & \cdots & \beta_2 & 1 \\
+\vdots & \vdots & & \vdots & \vdots \\
+\beta_q^{p+q-1} & \beta_q^{p+q-2} & \cdots & \beta_q & 1
+\end{pmatrix}
+\]
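+It is clear that the (p + q) × (p + q) Vandermonde matrices Vαβ and V̂αβ are nonsingular when αi ≠ αj,
+βk ≠ βh and αi ≠ βk for all i, j = 1, . . . , p and k, h = 1, . . . , q. A rigorous systematic evaluation of the
+Vandermonde determinants Det Vαβ and Det V̂αβ yields
+
+\[
+\operatorname{Det} V_{\alpha\beta} = (-1)^{(p+q)(p+q-1)/2}\, \Phi(\alpha_i, \beta_k)
+\]
+
+where
+
+\[
+\Phi(\alpha_i, \beta_k) = \prod_{1 \le i < j \le p} (\alpha_i - \alpha_j) \prod_{1 \le k < h \le q} (\beta_k - \beta_h) \prod_{\substack{m = 1, \ldots, p \\ n = 1, \ldots, q}} (\alpha_m - \beta_n)
+\]
+
+Since $V_{\alpha\beta} = P \hat{V}_{\alpha\beta}^{\top}$ and given the configuration of the permutation matrix P, this leads to the equalities
+$\operatorname{Det} \hat{V}_{\alpha\beta}^{\top} = \operatorname{Det} P \, \operatorname{Det} V_{\alpha\beta}$ and $\operatorname{Det} P = (-1)^{(p+q)(p+q-1)/2}$, so that
+
+\[
+\operatorname{Det} \hat{V}_{\alpha\beta} = (-1)^{(p+q)(p+q-1)}\, \Phi(\alpha_i, \beta_k) \ \Rightarrow \ |\operatorname{Det} V_{\alpha\beta}| = |\operatorname{Det} \hat{V}_{\alpha\beta}|
+\]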
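+
+The sign conventions in these determinant formulas can be checked in the smallest case. The sketch
+below assumes p = q = 1, so that Φ reduces to the single cross factor α1 − β1 (the two products over
+pairs within one family are empty); this specialization is added here for illustration.
+
+```latex
+% Assumed: p = q = 1, hence (p+q)(p+q-1)/2 = 1 and \Phi(\alpha_1, \beta_1) = \alpha_1 - \beta_1
+\[
+\operatorname{Det} V_{\alpha\beta} = \det \begin{pmatrix} 1 & \alpha_1 \\ 1 & \beta_1 \end{pmatrix}
+= \beta_1 - \alpha_1 = (-1)^{1}\, \Phi(\alpha_1, \beta_1)
+\]
+\[
+\operatorname{Det} \hat{V}_{\alpha\beta} = \det \begin{pmatrix} \alpha_1 & 1 \\ \beta_1 & 1 \end{pmatrix}
+= \alpha_1 - \beta_1 = (-1)^{2}\, \Phi(\alpha_1, \beta_1)
+\]
+```
+
+Both determinants have absolute value |α1 − β1| and vanish exactly when α1 = β1.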
+
+We shall now introduce an appropriate Stein equation of the form (45) such that an interconnection with
+℘(ϑ) in (47) can be verified. Therefore the following (p + q) × (p + q) companion matrix is introduced,
+
+\[
+C_g =
+\begin{pmatrix}
+0 & 1 & \cdots & 0 \\
+\vdots & \ddots & \ddots & \vdots \\
+0 & \cdots & 0 & 1 \\
+-g_{p+q} & -g_{p+q-1} & \cdots & -g_1
+\end{pmatrix}
+\tag{48}
+\]
+
+where the entries gi are given by $z^{p+q} + \sum_{i=1}^{p+q} g_i(\vartheta) z^{p+q-i} = \hat{a}(z)\hat{b}(z) = \hat{g}(z, \vartheta)$ and ĝ(ϑ) is the vector
+ĝ(ϑ) = (gp+q(ϑ), gp+q−1(ϑ), . . . , g1(ϑ))^T. Likewise, g(z, ϑ) = a(z)b(z) and g(ϑ) = (g1(ϑ), g2(ϑ),
+. . . , gp+q(ϑ))^T; for the properties of a companion matrix see e.g., [36], [2]. Since all
+the roots of the polynomials â(z) and b̂(z) are distinct and lie within the unit circle, the
+products αiβj ≠ 1, αiαj ≠ 1 and βiβj ≠ 1 hold for all i = 1, 2, . . . , p and j = 1, 2, . . . , q. Consequently,
+the uniqueness condition of the solution of an appropriate Stein equation is verified. The following
+Stein equation and its solution, according to (45) and (46), are now presented
+
+\[
+S - C_g S C_g^{\top} = \Gamma \quad \text{and} \quad S = \frac{1}{2\pi i} \oint_{|z|=1} (zI_{p+q} - C_g)^{-1}\, \Gamma\, (I_{p+q} - zC_g)^{-\top}\, dz
+\]
+
+where the closed contour is now the unit circle |z| = 1 and the matrix Γ is of size (p + q) × (p + q). A
+more explicit expression of the solution S is of the form
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1} \frac{\operatorname{adj}(zI_{p+q} - C_g)\, \Gamma\, \operatorname{adj}(I_{p+q} - zC_g)^{\top}}{a(z)\, b(z)\, \hat{a}(z)\, \hat{b}(z)}\, dz \tag{49}
+\]
+
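+where adj(X) = X−1 Det(X) is the adjoint (adjugate) of the matrix X. When Cauchy's residue theorem is
+applied to the solution S in (49), the following factored form of S is derived, equation (4.9) in [12]
+
+\[
+S = (C_1, C_2)\, (I_{p+q} \otimes \Gamma)\, (D(\vartheta) \otimes I_{p+q})\, (C_3, C_4)^{\top} \tag{50}
+\]
+
+where
+
+\begin{align*}
+C_1 &= \operatorname{adj}(\alpha_1 I_{p+q} - C_g),\ \operatorname{adj}(\alpha_2 I_{p+q} - C_g),\ \ldots,\ \operatorname{adj}(\alpha_p I_{p+q} - C_g) \\
+C_2 &= \operatorname{adj}(\beta_1 I_{p+q} - C_g),\ \operatorname{adj}(\beta_2 I_{p+q} - C_g),\ \ldots,\ \operatorname{adj}(\beta_q I_{p+q} - C_g) \\
+C_3 &= \operatorname{adj}(I_{p+q} - \alpha_1 C_g),\ \operatorname{adj}(I_{p+q} - \alpha_2 C_g),\ \ldots,\ \operatorname{adj}(I_{p+q} - \alpha_p C_g) \\
+C_4 &= \operatorname{adj}(I_{p+q} - \beta_1 C_g),\ \operatorname{adj}(I_{p+q} - \beta_2 C_g),\ \ldots,\ \operatorname{adj}(I_{p+q} - \beta_q C_g)
+\end{align*}
+
+and D(ϑ) is given in (47). The following matrix rule is applied
+
+\[
+(A \otimes B)(C \otimes D) = AC \otimes BD
+\]
+
+where the operator ⊗ is the tensor (Kronecker) product of two matrices, see e.g., [2], [20].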
+Combining (47) and (50) and taking the assumption αi ≠ αj, βk ≠ βh and αi ≠ βk into account
+implies that the inverses of the (p + q) × (p + q) Vandermonde matrices Vαβ and V̂αβ exist, as Lemma
+4.2 in [12] states.
+The following equality holds true
+
+\[
+S = (C_1, C_2) \left( V_{\alpha\beta}^{-1}\, \wp(\vartheta)\, \hat{V}_{\alpha\beta}^{-1} \otimes \Gamma \right) (C_3, C_4)^{\top}
+\]
+
+or
+
+\[
+S = (C_1, C_2) \left( V_{\alpha\beta}^{-1}\, \mathcal{S}^{-1}(b, -a)\, F(\vartheta)\, \mathcal{S}^{-\top}(b, -a)\, \hat{V}_{\alpha\beta}^{-1} \otimes \Gamma \right) (C_3, C_4)^{\top} \tag{51}
+\]
+
+Consequently, under the condition αi ≠ αj, βk ≠ βh and αi ≠ βk, and by virtue of (27) and (51),
+an interconnection involving the FIM F(ϑ), a solution S to an appropriate Stein equation, the
+Sylvester matrix S(b, −a) and the Vandermonde matrices Vαβ and V̂αβ is established. It is clear that by
+using the expression (43), the Bezoutian B(a, b) can be inserted in equality (51).
+We will formulate a Stein equation when the matrix $\Gamma = e_{p+q} e_{p+q}^{\top}$,
+
+\[
+S - C_g S C_g^{\top} = e_{p+q} e_{p+q}^{\top} \tag{52}
+\]
+
+where $e_{p+q}$ is the last standard basis column vector in R^{p+q}, and $e_i^{m}$ is the i-th unit standard basis
+column vector in R^m, with all its components equal to 0 except the i-th component, which equals 1. The
+next lemma is formulated.
+
+3.5.2. Lemma 3.5
+
+The symmetric matrix ℘(ϑ) defined in (28) fulfills the Stein Equation (52).
+
+Proof
+
+The unique solution of (52) is, according to (46),
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1} (zI_{p+q} - C_g)^{-1}\, e_{p+q} e_{p+q}^{\top}\, (I_{p+q} - zC_g)^{-\top}\, dz
+\]
+
+or, written more explicitly,
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1} \frac{\operatorname{adj}(zI_{p+q} - C_g)\, e_{p+q} e_{p+q}^{\top}\, \operatorname{adj}(I_{p+q} - zC_g)^{\top}}{a(z)\, b(z)\, \hat{a}(z)\, \hat{b}(z)}\, dz
+\]
+
+Using the property of the companion matrix Cg, standard computation shows that the last column
+of adj(zIp+q − Cg) is the basic vector up+q(z) and consequently the last column of adj(Ip+q − zCg)
+is the basic vector $v_{p+q}(z) = z^{p+q-1} u_{p+q}(z^{-1})$. This implies that $\operatorname{adj}(zI_{p+q} - C_g)\, e_{p+q} = u_{p+q}(z)$ and
+$e_{p+q}^{\top} \operatorname{adj}(I_{p+q} - zC_g)^{\top} = v_{p+q}^{\top}(z)$, or
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+q}(z)\, v_{p+q}^{\top}(z)}{a(z)\, b(z)\, \hat{a}(z)\, \hat{b}(z)}\, dz = \wp(\vartheta)
+\]
+
+Consequently, the solution S to the Stein equation (52) coincides with the matrix ℘(ϑ) defined in (28).
+
+The Stein equation that is verified by the FIM F(ϑ) will be considered. For that purpose we
+display the following p × p and q × q companion matrices Ca and Cb of the form
+
+\[
+C_a =
+\begin{pmatrix}
+-a_1 & -a_2 & \cdots & \cdots & -a_p \\
+1 & 0 & \cdots & \cdots & 0 \\
+0 & \ddots & \ddots & & \vdots \\
+\vdots & \ddots & \ddots & \ddots & \vdots \\
+0 & \cdots & 0 & 1 & 0
+\end{pmatrix},
+\qquad
+C_b =
+\begin{pmatrix}
+-b_1 & -b_2 & \cdots & \cdots & -b_q \\
+1 & 0 & \cdots & \cdots & 0 \\
+0 & \ddots & \ddots & & \vdots \\
+\vdots & \ddots & \ddots & \ddots & \vdots \\
+0 & \cdots & 0 & 1 & 0
+\end{pmatrix}
+\]
+
+respectively. Introduce the (p + q) × (p + q) matrix K(ϑ) and the (p + q) × 1 vector B,
+
+\[
+K(\vartheta) = \begin{pmatrix} C_a & O \\ O & C_b \end{pmatrix},
+\qquad
+B = \begin{pmatrix} e_p^{1} \\ -e_q^{1} \end{pmatrix}
+\]
+
+where $e_p^{1}$ and $e_q^{1}$ are the first standard basis column vectors in R^p and R^q respectively.
+
+Consider the Stein equation
+
+\[
+S - K(\vartheta)\, S\, K^{\top}(\vartheta) = B B^{\top} \tag{53}
+\]
+
+followed by the theorem.
+
+3.5.3. Theorem 3.6
+
+The Fisher information matrix F(ϑ) in (17) coincides with the solution to the Stein equation (53).
+
+Proof
+
+The eigenvalues of the companion matrices Ca and Cb are respectively the zeros of the
+polynomials â(z) and b̂(z), which are in absolute value smaller than one. This implies that the unique
+solution of the Stein equation (53) exists and is given by
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1} (zI_{p+q} - K(\vartheta))^{-1}\, B B^{\top}\, (I_{p+q} - zK(\vartheta))^{-\top}\, dz
+\]
+
+Developing this integral expression in a more explicit form yields
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1}
+\begin{pmatrix} \dfrac{\operatorname{adj}(zI_p - C_a)}{\hat{a}(z)} & O \\ O & \dfrac{\operatorname{adj}(zI_q - C_b)}{\hat{b}(z)} \end{pmatrix}
+\begin{pmatrix} e_p^{1} \\ -e_q^{1} \end{pmatrix}
+\left\{
+\begin{pmatrix} \dfrac{\operatorname{adj}(I_p - zC_a)}{a(z)} & O \\ O & \dfrac{\operatorname{adj}(I_q - zC_b)}{b(z)} \end{pmatrix}
+\begin{pmatrix} e_p^{1} \\ -e_q^{1} \end{pmatrix}
+\right\}^{\top} dz
+\]
+
+Considering the form of the companion matrices Ca and Cb leads, through straightforward
+computation, to the conclusion that the first column of adj(zIp − Ca) is the basic vector vp(z) and
+consequently the first column of adj(Ip − zCa) is the basic vector up(z). Equivalently for the companion
+matrix Cb, this yields
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1}
+\begin{pmatrix} \dfrac{v_p(z)}{\hat{a}(z)} \\[2ex] -\dfrac{v_q(z)}{\hat{b}(z)} \end{pmatrix}
+\begin{pmatrix} \dfrac{u_p^{\top}(z)}{a(z)} & -\dfrac{u_q^{\top}(z)}{b(z)} \end{pmatrix}
+dz \tag{54}
+\]
+
+To obtain a representation equivalent to the FIM F(ϑ) in (17), the transpose of the solution (54) to the
+Stein equation (53) is required,
+
+\[
+S^{\top} = \frac{1}{2\pi i} \oint_{|z|=1}
+\begin{pmatrix}
+\dfrac{u_p(z)\, v_p^{\top}(z)}{a(z)\, \hat{a}(z)} & -\dfrac{u_p(z)\, v_q^{\top}(z)}{a(z)\, \hat{b}(z)} \\[2ex]
+-\dfrac{u_q(z)\, v_p^{\top}(z)}{\hat{a}(z)\, b(z)} & \dfrac{u_q(z)\, v_q^{\top}(z)}{b(z)\, \hat{b}(z)}
+\end{pmatrix}
+dz = F(\vartheta) \tag{55}
+\]
+
+or
+
+\[
+S^{\top} = \frac{1}{2\pi i} \oint_{|z|=1} (I_{p+q} - zK(\vartheta))^{-1}\, B B^{\top}\, (zI_{p+q} - K(\vartheta))^{-\top}\, dz = F(\vartheta)
+\]
+
+The symmetry property of the FIM F(ϑ) leads to S = F(ϑ). From the representation (55) it can be
+concluded that the solution S of the Stein equation (53) coincides with the symmetric block Toeplitz
+FIM F(ϑ) given in (17). This completes the proof.
+It is straightforward to verify that the submatrix (1,2) in (55) is the complex conjugate transpose
+of the submatrix (2,1), whereas each submatrix on the main diagonal is Hermitian; consequently,
+the integrand is Hermitian. This implies that when the standard residue theorem is applied, it yields
+F(ϑ) = F^T(ϑ).
+An Illustrative Example of Theorem 3.6
+To illustrate Theorem 3.6, the case of an ARMA(2, 2) process is considered. We will use the
+
+representation (17) for computing the FIM F (ϑ) of an ARMA(2, 2) process. The autoregressive and
+moving average polynomials are of degree two or p = q = 2 and the ARMA(2, 2) process is described by,
+
+y(t)a(z) = b(z)e(t)
+(56)
+
+where y(t) is the stationary process driven by white noise e(t), a(z) = 1 + a1z + a2z^2 and
+b(z) = 1 + b1z + b2z^2, and the parameter vector is ϑ = (a1, a2, b1, b2)^T. The condition is imposed that
+the zeros of the polynomials
+
+\[
+\hat{a}(z) = z^{2} a(z^{-1}) = z^{2} + a_1 z + a_2 \quad \text{and} \quad \hat{b}(z) = z^{2} b(z^{-1}) = z^{2} + b_1 z + b_2
+\]
+
+are in absolute value smaller than one. The FIM F(ϑ) of the ARMA(2, 2) process (56) is of
+the form
+
+
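+\[
+F(\vartheta) = \begin{pmatrix} F_{aa}(\vartheta) & F_{ab}(\vartheta) \\ F_{ab}^{\top}(\vartheta) & F_{bb}(\vartheta) \end{pmatrix} \tag{57}
+\]
+
+where
+
+\[
+F_{aa}(\vartheta) = \frac{1}{(1 - a_2)\left[ (1 + a_2)^{2} - a_1^{2} \right]}
+\begin{pmatrix} 1 + a_2 & -a_1 \\ -a_1 & 1 + a_2 \end{pmatrix}
+\]
+
+\[
+F_{bb}(\vartheta) = \frac{1}{(1 - b_2)\left[ (1 + b_2)^{2} - b_1^{2} \right]}
+\begin{pmatrix} 1 + b_2 & -b_1 \\ -b_1 & 1 + b_2 \end{pmatrix}
+\]
+
+\[
+F_{ab}(\vartheta) = \frac{1}{(a_2 b_2 - 1)^{2} + (a_2 b_1 - a_1)(b_1 - a_1 b_2)}
+\begin{pmatrix} a_2 b_2 - 1 & a_1 - a_2 b_1 \\ b_1 - a_1 b_2 & a_2 b_2 - 1 \end{pmatrix}
+\]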
+
+The submatrices F aa(ϑ) and F bb(ϑ) are symmetric and Toeplitz whereas F ab(ϑ) is Toeplitz. One can
+assert that without any loss of generality, the property, symmetric block Toeplitz, holds for the class
+of Fisher information matrices of stationary ARMA(p, q) processes, where p and q are arbitrary, finite
+integers that represent the degrees of the autoregressive and moving average polynomials, respectively.
+
+The appropriate companion matrices Ca, Cb and the 4 × 4 matrices K(ϑ) and BB^T are
+
+\[
+C_a = \begin{pmatrix} -a_1 & -a_2 \\ 1 & 0 \end{pmatrix},
+\quad
+C_b = \begin{pmatrix} -b_1 & -b_2 \\ 1 & 0 \end{pmatrix},
+\quad
+K(\vartheta) =
+\begin{pmatrix}
+-a_1 & -a_2 & 0 & 0 \\
+1 & 0 & 0 & 0 \\
+0 & 0 & -b_1 & -b_2 \\
+0 & 0 & 1 & 0
+\end{pmatrix}
+\quad \text{and} \quad
+B B^{\top} =
+\begin{pmatrix}
+1 & 0 & -1 & 0 \\
+0 & 0 & 0 & 0 \\
+-1 & 0 & 1 & 0 \\
+0 & 0 & 0 & 0
+\end{pmatrix}
+\tag{58}
+\]
+
+where $B = (1, 0, -1, 0)^{\top}$. It can be verified that the Stein equation
+
+\[
+F(\vartheta) - K(\vartheta)\, F(\vartheta)\, K^{\top}(\vartheta) = B B^{\top}
+\]
+
+holds true when F(ϑ) is of the form (57) and the matrices K(ϑ) and BB^T are given in (58).
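+
+The same verification can be carried out by hand in the smallest case. The sketch below assumes
+p = q = 1, with the ARMA(1, 1) Fisher information matrix obtained by setting a2 = b2 = 0 in the blocks
+of (57) and keeping their leading entries; the companion matrices then reduce to the scalars −a1 and −b1,
+so that K(ϑ) = diag(−a1, −b1) and B = (1, −1)^T. This specialization is an illustration added here and is
+not part of the ARMA(2, 2) example.
+
+```latex
+% Assumed ARMA(1,1) quantities (p = q = 1), with |a_1| < 1 and |b_1| < 1
+\[
+F(\vartheta) = \begin{pmatrix} \dfrac{1}{1 - a_1^{2}} & -\dfrac{1}{1 - a_1 b_1} \\[2ex] -\dfrac{1}{1 - a_1 b_1} & \dfrac{1}{1 - b_1^{2}} \end{pmatrix},
+\qquad
+K(\vartheta) = \begin{pmatrix} -a_1 & 0 \\ 0 & -b_1 \end{pmatrix},
+\qquad
+B = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
+\]
+% K is diagonal, so K F K^\top scales entry (i,j) of F by the product of the i-th and j-th diagonal entries
+\[
+K F K^{\top} = \begin{pmatrix} \dfrac{a_1^{2}}{1 - a_1^{2}} & -\dfrac{a_1 b_1}{1 - a_1 b_1} \\[2ex] -\dfrac{a_1 b_1}{1 - a_1 b_1} & \dfrac{b_1^{2}}{1 - b_1^{2}} \end{pmatrix}
+\]
+\[
+F - K F K^{\top} = \begin{pmatrix} \dfrac{1 - a_1^{2}}{1 - a_1^{2}} & -\dfrac{1 - a_1 b_1}{1 - a_1 b_1} \\[2ex] -\dfrac{1 - a_1 b_1}{1 - a_1 b_1} & \dfrac{1 - b_1^{2}}{1 - b_1^{2}} \end{pmatrix}
+= \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} = B B^{\top}
+\]
+```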
+
+3.5.4. Some Additional Results
+
+In Proposition 5.1 in [38], it is proved that the matrix Q(ϑ) in (44) fulfills the Stein equation (59) and
+that Q(ϑ) ≻ 0. It states that when $e_P = (e_1^{\top} P,\ 0)^{\top} = (e_n,\ 0_n)^{\top} \in \mathbb{R}^{2n}$, where e1 is the first unit standard basis
+column vector in R^n and en is the last or n-th unit standard basis column vector in R^n, the following
+Stein equation admits the form
+
+\[
+Q(\vartheta) = F_N(\vartheta)\, Q(\vartheta)\, F_N^{\top}(\vartheta) + e_P e_P^{\top} \tag{59}
+\]
+
+where
+
+\[
+F_N(\vartheta) = \begin{pmatrix} \hat{C}_a & 0 \\ e_1 e_1^{\top} & C_b \end{pmatrix},
+\qquad
+\hat{C}_a =
+\begin{pmatrix}
+0 & 1 & 0 & \cdots & 0 \\
+0 & 0 & 1 & \cdots & 0 \\
+\vdots & \vdots & \ddots & \ddots & \vdots \\
+0 & \cdots & \cdots & 0 & 1 \\
+-a_p & -a_{p-1} & \cdots & \cdots & -a_1
+\end{pmatrix}
+\]
+
+A corollary to Proposition 5.1 in [38] will be set forth, in which the involvement of various Vandermonde
+matrices in the explicit solution to (59) is confirmed. For that purpose the following Vandermonde matrices
+are displayed,
+
+
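+\[
+V_{\alpha} =
+\begin{pmatrix}
+1 & 1 & \cdots & 1 \\
+\alpha_1 & \alpha_2 & \cdots & \alpha_n \\
+\alpha_1^{2} & \alpha_2^{2} & \cdots & \alpha_n^{2} \\
+\vdots & \vdots & & \vdots \\
+\alpha_1^{n-1} & \alpha_2^{n-1} & \cdots & \alpha_n^{n-1}
+\end{pmatrix},
+\quad
+\hat{V}_{\alpha} =
+\begin{pmatrix}
+\alpha_1^{n-1} & \alpha_1^{n-2} & \cdots & 1 \\
+\alpha_2^{n-1} & \alpha_2^{n-2} & \cdots & 1 \\
+\alpha_3^{n-1} & \alpha_3^{n-2} & \cdots & 1 \\
+\vdots & \vdots & & \vdots \\
+\alpha_n^{n-1} & \alpha_n^{n-2} & \cdots & 1
+\end{pmatrix},
+\quad
+\hat{V}_{\alpha\beta} = \begin{pmatrix} \hat{V}_{\alpha} \\ \hat{V}_{\beta} \end{pmatrix}
+\quad \text{and} \quad
+V_{\alpha\beta} = \begin{pmatrix} V_{\alpha} & V_{\beta} \end{pmatrix}
+\tag{60}
+\]
+
+where V̂β and Vβ have the same configuration as V̂α and Vα respectively. A corollary to Proposition
+5.1 in [38] is now formulated.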
+
+3.5.5. Corollary 3.7
+
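+An explicit expression of the solution to the Stein equation (59) is of the form
+
+\[
+Q(\vartheta) =
+\begin{pmatrix}
+V_{\alpha} D_{11}(\vartheta) \hat{V}_{\alpha} & V_{\alpha} D_{12}(\vartheta) V_{\alpha}^{\top} \\
+\hat{V}_{\alpha\beta}^{\top} D_{21}(\vartheta) \hat{V}_{\alpha\beta} & \hat{V}_{\alpha\beta}^{\top} D_{22}(\vartheta) V_{\alpha\beta}^{\top}
+\end{pmatrix}
+\tag{61}
+\]
+
+where the n × n and 2n × 2n diagonal matrices Dkl(ϑ) shall be specified in the proof.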
+
+Proof
+
+The condition of a unique solution of the Stein equation (59) is guaranteed since the eigenvalues of the
+companion matrices Ĉa and Cb, given respectively by the zeros of the polynomials â(z) and b̂(z),
+are in absolute value smaller than one. Consequently, the unique solution to the Stein equation (59)
+exists and is given by
+
+\[
+Q(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} (zI_{2n} - F_N(\vartheta))^{-1}\, e_P e_P^{\top}\, (I_{2n} - zF_N(\vartheta))^{-\top}\, dz \tag{62}
+\]
+
+In order to proceed, the following matrix property is displayed,
+
+\[
+\begin{pmatrix} A & O \\ B & C \end{pmatrix}^{-1}
+=
+\begin{pmatrix} A^{-1} & O \\ -C^{-1} B A^{-1} & C^{-1} \end{pmatrix}
+\]
+
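+When applied to (62), it yields
+
+\begin{align*}
+Q(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1}
+&\begin{pmatrix}
+\dfrac{\operatorname{adj}(zI_p - \hat{C}_a)}{\hat{a}(z)} & O \\[2ex]
+\dfrac{\operatorname{adj}(zI_q - C_b)\, e_1 e_1^{\top}\, \operatorname{adj}(zI_p - \hat{C}_a)}{\hat{a}(z)\, \hat{b}(z)} & \dfrac{\operatorname{adj}(zI_q - C_b)}{\hat{b}(z)}
+\end{pmatrix}
+\begin{pmatrix} e_n \\ 0 \end{pmatrix} \\
+&\times
+\left\{
+\begin{pmatrix}
+\dfrac{\operatorname{adj}(I_n - z\hat{C}_a)}{a(z)} & O \\[2ex]
+\dfrac{\operatorname{adj}(I_n - zC_b)\, e_1 e_1^{\top}\, \operatorname{adj}(I_p - z\hat{C}_a)}{a(z)\, b(z)} & \dfrac{\operatorname{adj}(I_n - zC_b)}{b(z)}
+\end{pmatrix}
+\begin{pmatrix} e_n \\ 0 \end{pmatrix}
+\right\}^{\top} dz
+\end{align*}
+
+Considering that the last column vectors of the matrices adj(zIp − Ĉa) and adj(In − zĈa) are the
+vectors un(z) and vn(z) respectively, it then yields
+
+\begin{align*}
+Q(\vartheta) &= \frac{1}{2\pi i} \oint_{|z|=1}
+\begin{pmatrix} \dfrac{u_n(z)}{\hat{a}(z)} \\[2ex] \dfrac{v_n(z)}{\hat{a}(z)\, \hat{b}(z)} \end{pmatrix}
+\begin{pmatrix} \dfrac{v_n^{\top}(z)}{a(z)} & \dfrac{z^{n-1} u_n^{\top}(z)}{a(z)\, b(z)} \end{pmatrix}
+dz \\
+&= \frac{1}{2\pi i} \oint_{|z|=1}
+\begin{pmatrix}
+\dfrac{u_n(z)\, v_n^{\top}(z)}{a(z)\, \hat{a}(z)} & \dfrac{z^{n-1} u_n(z)\, u_n^{\top}(z)}{\hat{a}(z)\, a(z)\, b(z)} \\[2ex]
+\dfrac{v_n(z)\, v_n^{\top}(z)}{\hat{a}(z)\, \hat{b}(z)\, a(z)} & \dfrac{z^{n-1} v_n(z)\, u_n^{\top}(z)}{\hat{a}(z)\, \hat{b}(z)\, a(z)\, b(z)}
+\end{pmatrix}
+dz
+= \begin{pmatrix} Q_{11}(\vartheta) & Q_{12}(\vartheta) \\ Q_{21}(\vartheta) & Q_{22}(\vartheta) \end{pmatrix}
+\end{align*}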
+
+
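+Applying the standard residue theorem leads, for the respective submatrices, to
+
+\begin{align*}
+Q_{11}(\vartheta) &= \{ u_n(\alpha_1), \ldots, u_n(\alpha_n) \}\, D_{11}(\vartheta)\, \{ v_n(\alpha_1), \ldots, v_n(\alpha_n) \}^{\top} \\
+Q_{12}(\vartheta) &= \{ u_n(\alpha_1), \ldots, u_n(\alpha_n) \}\, D_{12}(\vartheta)\, \{ u_n(\alpha_1), \ldots, u_n(\alpha_n) \}^{\top} \\
+Q_{21}(\vartheta) &= \{ v_n(\alpha_1), \ldots, v_n(\alpha_n), v_n(\beta_1), \ldots, v_n(\beta_n) \}\, D_{21}(\vartheta)\, \{ v_n(\alpha_1), \ldots, v_n(\alpha_n), v_n(\beta_1), \ldots, v_n(\beta_n) \}^{\top} \\
+Q_{22}(\vartheta) &= \{ v_n(\alpha_1), \ldots, v_n(\alpha_n), v_n(\beta_1), \ldots, v_n(\beta_n) \}\, D_{22}(\vartheta)\, \{ u_n(\alpha_1), \ldots, u_n(\alpha_n), u_n(\beta_1), \ldots, u_n(\beta_n) \}^{\top}
+\end{align*}
+
+where the n × n diagonal matrices are
+
+\[
+D_{11}(\vartheta) = \operatorname{diag} \left\{ \frac{1}{a(\alpha_i)\, \hat{a}(z; \alpha_i)} \right\},
+\qquad
+D_{12}(\vartheta) = \operatorname{diag} \left\{ \frac{\alpha_i^{n-1}}{a(\alpha_i)\, b(\alpha_i)\, \hat{a}(z; \alpha_i)} \right\}
+\quad \text{for } i = 1, \ldots, n
+\]
+
+and the 2n × 2n diagonal matrices are
+
+\[
+D_{21}(\vartheta) = \operatorname{diag} \left\{ \frac{1}{a(\alpha_i)\, \hat{b}(\alpha_i)\, \hat{a}(z; \alpha_i)},\ \frac{1}{\hat{a}(\beta_j)\, a(\beta_j)\, \hat{b}(z; \beta_j)} \right\}
+\quad \text{for } i, j = 1, \ldots, n
+\]
+
+\[
+D_{22}(\vartheta) = \operatorname{diag} \left\{ \frac{\alpha_i^{n-1}}{a(\alpha_i)\, b(\alpha_i)\, \hat{b}(\alpha_i)\, \hat{a}(z; \alpha_i)},\ \frac{\beta_j^{n-1}}{\hat{a}(\beta_j)\, a(\beta_j)\, b(\beta_j)\, \hat{b}(z; \beta_j)} \right\}
+\quad \text{for } i, j = 1, \ldots, n
+\]
+
+Since the first and third matrices in Q11(ϑ), Q12(ϑ), Q21(ϑ) and Q22(ϑ) are the appropriate
+Vandermonde matrices displayed in (60), it can be concluded that the representation (61) is verified.
+This completes the proof.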
+In this section an explicit form of the solution Q(ϑ), expressed in terms of various Vandermonde
+matrices, is displayed. Also, an interconnection between the Fisher information F(ϑ) and appropriate
+solutions to Stein equations and related matrices is presented. Proofs are given that the Stein
+equations are verified by the FIM F(ϑ) and the associated matrix ℘(ϑ). These are alternatives to the
+proofs developed in [38]. The presence of various forms of Vandermonde matrices is also emphasized.
+
+In the next section some matrix properties of the FIM of an ARMAX process are presented.
+
+3.6. The Fisher Information Matrix of an ARMAX(p, r, q) Process
+
+The FIM of the ARMAX process (11) is set forth according to [4]. The derivatives in the
+corresponding representation (16) are
+
+\[
+\frac{\partial e_t(\vartheta)}{\partial a_j} = \frac{c(z)}{a(z)\, b(z)}\, x(t - j) + \frac{1}{a(z)}\, e(t - j),
+\qquad
+\frac{\partial e_t(\vartheta)}{\partial c_l} = -\frac{1}{b(z)}\, x(t - l)
+\qquad \text{and} \qquad
+\frac{\partial e_t(\vartheta)}{\partial b_k} = -\frac{1}{b(z)}\, e(t - k)
+\]
+
+where j = 1, . . . , p, l = 1, . . . , r and k = 1, . . . , q. Combining all j, l and k yields the
+(p + r + q) × (p + r + q) FIM
+
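+\[
+G(\vartheta) =
+\begin{pmatrix}
+G_{aa}(\vartheta) & G_{ac}(\vartheta) & G_{ab}(\vartheta) \\
+G_{ac}^{\top}(\vartheta) & G_{cc}(\vartheta) & G_{cb}(\vartheta) \\
+G_{ab}^{\top}(\vartheta) & G_{cb}^{\top}(\vartheta) & G_{bb}(\vartheta)
+\end{pmatrix}
+\tag{63}
+\]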
+
+where the submatrices of G (ϑ) are given by
+
+
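+\begin{align*}
+G_{aa}(\vartheta) &= \frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_p(z)\, u_p^{\top}(z^{-1})\, c(z)\, c(z^{-1})}{a(z)\, a(z^{-1})\, b(z)\, b(z^{-1})}\, \frac{dz}{z}
++ \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, u_p^{\top}(z^{-1})}{a(z)\, a(z^{-1})}\, \frac{dz}{z} \\
+&= \frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_p(z)\, v_p^{\top}(z)\, c(z)\, \hat{c}(z)}{a(z)\, \hat{a}(z)\, b(z)\, \hat{b}(z)\, z^{r-q}}\, dz
++ \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, v_p^{\top}(z)}{a(z)\, \hat{a}(z)}\, dz \\
+G_{ab}(\vartheta) &= -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, u_q^{\top}(z^{-1})}{a(z)\, b(z^{-1})}\, \frac{dz}{z}
+= -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, v_q^{\top}(z)}{a(z)\, \hat{b}(z)}\, dz \\
+G_{ac}(\vartheta) &= -\frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_p(z)\, u_r^{\top}(z^{-1})\, c(z)}{a(z)\, b(z)\, b(z^{-1})}\, \frac{dz}{z}
+= -\frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_p(z)\, v_r^{\top}(z)\, c(z)}{a(z)\, b(z)\, \hat{b}(z)\, z^{r-q}}\, dz \\
+G_{cc}(\vartheta) &= \frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_r(z)\, u_r^{\top}(z^{-1})}{b(z)\, b(z^{-1})}\, \frac{dz}{z}
+= \frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_r(z)\, v_r^{\top}(z)}{b(z)\, \hat{b}(z)\, z^{r-q}}\, dz \\
+G_{bb}(\vartheta) &= \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, u_q^{\top}(z^{-1})}{b(z)\, b(z^{-1})}\, \frac{dz}{z}
+= \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, v_q^{\top}(z)}{b(z)\, \hat{b}(z)}\, dz,
+\quad \text{and} \quad G_{cb}(\vartheta) = O
+\end{align*}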
+
+where Rx(z) is the spectral density of the process x(t) and is defined in (10).
+Let K(z) = a(z)a(z−1)b(z)b(z−1). Combining all the expressions in (63) leads to the following
+representation of G(ϑ) as the sum of two matrices
+
+\[
+\frac{1}{2\pi i} \oint_{|z|=1} \frac{R_x(z)}{K(z)}
+\begin{pmatrix} c(z)\, u_p(z) \\ -a(z)\, u_r(z) \\ O \end{pmatrix}
+\begin{pmatrix} c(z)\, u_p(z) \\ -a(z)\, u_r(z) \\ O \end{pmatrix}^{*}
+\frac{dz}{z}
++ \frac{1}{2\pi i} \oint_{|z|=1} \frac{1}{K(z)}
+\begin{pmatrix} b(z)\, u_p(z) \\ O \\ -a(z)\, u_q(z) \end{pmatrix}
+\begin{pmatrix} b(z)\, u_p(z) \\ O \\ -a(z)\, u_q(z) \end{pmatrix}^{*}
+\frac{dz}{z}
+\tag{64}
+\]
+
+where (X)* is the complex conjugate transpose of the matrix X ∈ C^{m×n}. As in (23), we set forth
+
+\[
+\mathcal{S}(-c, a) = \begin{pmatrix} -\mathcal{S}_p(c) \\ \mathcal{S}_r(a) \end{pmatrix}
+\]
+
+where Sp(c) is formed by the top p rows of S(−c, a). In a similar way we decompose
+
+\[
+\mathcal{S}(-b, a) = \begin{pmatrix} -\mathcal{S}_p(b) \\ \mathcal{S}_q(a) \end{pmatrix}
+\]
+
+The representation (64) can be expressed by the appropriate block representations of the Sylvester
+resultant matrices, to obtain
+
+\[
+G(\vartheta) =
+\begin{pmatrix} -\mathcal{S}_p(c) \\ \mathcal{S}_r(a) \\ O \end{pmatrix}
+W(\vartheta)
+\begin{pmatrix} -\mathcal{S}_p(c) \\ \mathcal{S}_r(a) \\ O \end{pmatrix}^{\top}
++
+\begin{pmatrix} -\mathcal{S}_p(b) \\ O \\ \mathcal{S}_q(a) \end{pmatrix}
+\wp(\vartheta)
+\begin{pmatrix} -\mathcal{S}_p(b) \\ O \\ \mathcal{S}_q(a) \end{pmatrix}^{\top}
+\tag{65}
+\]
+
+where the matrix ℘(ϑ) is given in (28) and the matrix W(ϑ) ∈ R^{(p+r)×(p+r)} is of the form
+
+\[
+W(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_{p+r}(z)\, u_{p+r}^{\top}(z^{-1})}{a(z)\, a(z^{-1})\, b(z)\, b(z^{-1})}\, \frac{dz}{z}
+= \frac{1}{2\pi i} \oint_{|z|=1} R_x(z)\, \frac{u_{p+r}(z)\, v_{p+r}^{\top}(z)}{a(z)\, b(z)\, \hat{a}(z)\, \hat{b}(z)}\, dz
+\tag{66}
+\]
+
+It is shown in [4] that W(ϑ) ≻ O. As can be seen in (65), the ARMAX part is explained by the first
+term, whereas the ARMA part is described by the second term; the combination of both terms
+summarizes the Fisher information of an ARMAX(p, r, q) process. The FIM G(ϑ) in the form (65) allows
+us to prove the following property, Theorem 3.1 in [4]: the FIM G(ϑ) of the ARMAX(p, r, q) process
+with polynomials a(z), c(z) and b(z) of order p, r, q respectively becomes singular if and only if these
+polynomials have at least one common root. Consequently, the class of resultant matrices is extended
+by the FIM G(ϑ).
+
+3.7. The Stein Equation - The Fisher Information Matrix of an ARMAX(p, r, q) Process
+
+In Lemma 3.5 it is proved that the matrix ℘(ϑ) in (28) fulfills the Stein equation (52). We will now
+consider the conditions under which the matrix W(ϑ) in (66) verifies an appropriate Stein equation. For
+that purpose we consider the spectral density to be of the form Rx(z) = 1/(h(z)h(z−1)). The degree of the
+polynomial h(z) is ℓ and we assume the distinct roots of the polynomial h(z) to lie outside the unit
+circle; consequently, the roots of the polynomial ĥ(z) lie within the unit circle. We therefore rewrite
+W(ϑ) accordingly
+
+\[
+W(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+r}(z)\, u_{p+r}^{\top}(z^{-1})}{h(z)\, h(z^{-1})\, a(z)\, a(z^{-1})\, b(z)\, b(z^{-1})}\, \frac{dz}{z}
+\]
+
+We consider a companion matrix of the form (48) with size p + q + ℓ; it is denoted by Cf and the
+entries fi are given by $z^{p+q+\ell} + \sum_{i=1}^{p+q+\ell} f_i(\vartheta) z^{p+q+\ell-i} = \hat{a}(z)\hat{b}(z)\hat{h}(z) = \hat{f}(z, \vartheta)$, and f̂(ϑ) is the vector
+f̂(ϑ) = (fp+q+ℓ(ϑ), fp+q+ℓ−1(ϑ), . . . , f1(ϑ))^T. Likewise, f(z, ϑ) = a(z)b(z)h(z) and
+f(ϑ) = (f1(ϑ), f2(ϑ), . . . , fp+q+ℓ(ϑ))^T. The properties Det(zIp+q+ℓ − Cf) = â(z)b̂(z)ĥ(z) and
+Det(Ip+q+ℓ − zCf) = a(z)b(z)h(z) hold, and we assume
+
+\[
+r = q + \ell \quad \text{or} \quad p + q + \ell = p + r \quad \text{and} \quad r > q \tag{67}
+\]
+W(ϑ) is then of the form
+
+\[
+W(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+r}(z)\, v_{p+r}^{\top}(z)}{h(z)\, \hat{h}(z)\, a(z)\, \hat{a}(z)\, b(z)\, \hat{b}(z)}\, dz \tag{68}
+\]
+
+We will formulate a Stein equation for the matrix $\Gamma = e_{p+r} e_{p+r}^{\top}$, which is of the form
+
+\[
+S - C_f S C_f^{\top} = e_{p+r} e_{p+r}^{\top} \tag{69}
+\]
+
+where ep+r is the last standard basis column vector in R^{p+r}. The next lemma is formulated.
+
+3.7.1. Lemma 3.8
+
+The matrix W(ϑ) given in (68) fulfills the Stein equation (69).
+
+Proof
+
+The unique solution of (69) is assured since the product of any two eigenvalues of Cf is different
+from one; the solution is of the form
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1} (zI_{p+r} - C_f)^{-1}\, e_{p+r} e_{p+r}^{\top}\, (I_{p+r} - zC_f)^{-\top}\, dz
+\]
+
+or
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1} \frac{\operatorname{adj}(zI_{p+r} - C_f)\, e_{p+r} e_{p+r}^{\top}\, \operatorname{adj}(I_{p+r} - zC_f)^{\top}}{\hat{a}(z)\, \hat{b}(z)\, \hat{h}(z)\, a(z)\, b(z)\, h(z)}\, dz
+\]
+
+Taking the property of the companion matrix Cf into account implies that the last column vector of
+adj(zIp+r − Cf) is the basic vector up+r(z) and, consequently, the last column of adj(Ip+r − zCf) is the basic
+vector vp+r(z). This yields
+
+\[
+S = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+r}(z)\, v_{p+r}^{\top}(z)}{\hat{a}(z)\, \hat{b}(z)\, \hat{h}(z)\, a(z)\, b(z)\, h(z)}\, dz = W(\vartheta)
+\]
+
+Consequently, the matrix W(ϑ) defined in (68) verifies the Stein equation (69). This completes the proof.
+
+The matrices ℘(ϑ) and W(ϑ) in (65) verify, under specific conditions, appropriate Stein equations,
+as has been shown in Lemma 3.5 and Lemma 3.8, respectively. We will now confirm the presence of
+Vandermonde matrices by applying the standard residue theorem to W(ϑ) in (68), to obtain
+
+\[
+W(\vartheta) = V_{\alpha\beta\xi}\, R(\vartheta)\, \hat{V}_{\alpha\beta\xi} \tag{70}
+\]
+
+The (p + r) × (p + r) diagonal matrix R(ϑ) is of the form
+
+\[
+R(\vartheta) = \operatorname{diag} \left\{ \frac{1}{\hat{a}(z; \alpha_i)\, \hat{b}(\alpha_i)\, \hat{h}(\alpha_i)\, \phi(\alpha_i)},\ \frac{1}{\hat{a}(\beta_j)\, \hat{b}(z; \beta_j)\, \hat{h}(\beta_j)\, \phi(\beta_j)},\ \frac{1}{\hat{a}(\xi_k)\, \hat{b}(\xi_k)\, \hat{h}(z; \xi_k)\, \phi(\xi_k)} \right\}
+\]
+
+where ϕ(z) = a(z)b(z)h(z) and i = 1, . . . , p, j = 1, . . . , q and k = 1, . . . , ℓ, whereas the (p + r) × (p + r)
+matrices Vαβξ and V̂αβξ are of the form
+
+\[
+V_{\alpha\beta\xi} =
+\begin{pmatrix}
+1 & \alpha_1 & \alpha_1^2 & \cdots & \alpha_1^{p+r-1} \\
+\vdots & \vdots & \vdots & & \vdots \\
+1 & \alpha_p & \alpha_p^2 & \cdots & \alpha_p^{p+r-1} \\
+1 & \beta_1 & \beta_1^2 & \cdots & \beta_1^{p+r-1} \\
+\vdots & \vdots & \vdots & & \vdots \\
+1 & \beta_q & \beta_q^2 & \cdots & \beta_q^{p+r-1} \\
+1 & \xi_1 & \xi_1^2 & \cdots & \xi_1^{p+r-1} \\
+\vdots & \vdots & \vdots & & \vdots \\
+1 & \xi_\ell & \xi_\ell^2 & \cdots & \xi_\ell^{p+r-1}
+\end{pmatrix}^{\top},
+\qquad
+\hat V_{\alpha\beta\xi} =
+\begin{pmatrix}
+\alpha_1^{p+r-1} & \alpha_1^{p+r-2} & \cdots & \alpha_1 & 1 \\
+\vdots & \vdots & & \vdots & \vdots \\
+\alpha_p^{p+r-1} & \alpha_p^{p+r-2} & \cdots & \alpha_p & 1 \\
+\beta_1^{p+r-1} & \beta_1^{p+r-2} & \cdots & \beta_1 & 1 \\
+\vdots & \vdots & & \vdots & \vdots \\
+\beta_q^{p+r-1} & \beta_q^{p+r-2} & \cdots & \beta_q & 1 \\
+\xi_1^{p+r-1} & \xi_1^{p+r-2} & \cdots & \xi_1 & 1 \\
+\vdots & \vdots & & \vdots & \vdots \\
+\xi_\ell^{p+r-1} & \xi_\ell^{p+r-2} & \cdots & \xi_\ell & 1
+\end{pmatrix}
+\]
+
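+The $(p+r)\times(p+r)$ Vandermonde matrices $V_{\alpha\beta\xi}$ and $\hat V_{\alpha\beta\xi}$ are nonsingular when $\alpha_i \neq \alpha_j$, $\beta_k \neq \beta_h$,
+$\xi_m \neq \xi_n$, $\alpha_i \neq \beta_k$, $\alpha_i \neq \xi_m$, $\beta_k \neq \xi_m$ for all $i, j = 1,\ldots,p$, $k, h = 1,\ldots,q$ and $m, n = 1,\ldots,\ell$. The
+Vandermonde determinants $\operatorname{Det}V_{\alpha\beta\xi}$ and $\operatorname{Det}\hat V_{\alpha\beta\xi}$ are
+
+\[ \operatorname{Det}V_{\alpha\beta\xi} = (-1)^{(p+r)(p+r-1)/2}\,\Psi(\alpha_i,\beta_k,\xi_m) \]
+
+where
+
+\[
+\Psi(\alpha_i,\beta_k,\xi_m) =
+\prod_{1\le i<j\le p}(\alpha_i-\alpha_j)
+\prod_{1\le k<h\le q}(\beta_k-\beta_h)
+\prod_{1\le m<n\le \ell}(\xi_m-\xi_n)
+\prod_{\substack{r=1,\ldots,p\\ s=1,\ldots,q}}(\alpha_r-\beta_s)
+\prod_{\substack{r=1,\ldots,p\\ w=1,\ldots,\ell}}(\alpha_r-\xi_w)
+\prod_{\substack{s=1,\ldots,q\\ w=1,\ldots,\ell}}(\beta_s-\xi_w)
+\]
+
+Like for the Vandermonde matrices $V_{\alpha\beta}$ and $\hat V_{\alpha\beta}^{\top}$,
+
+\[ \operatorname{Det}\hat V_{\alpha\beta\xi} = (-1)^{(p+r)(p+r-1)}\,\Psi(\alpha_i,\beta_k,\xi_m) \ \Rightarrow\ |\operatorname{Det}V_{\alpha\beta\xi}| = |\operatorname{Det}\hat V_{\alpha\beta\xi}| \]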
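+
+To make the sign convention concrete, a minimal check with two generic nodes $x_1, x_2$ (an illustrative case added here, not one treated in the source) gives
+
+\[ \operatorname{Det}\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \end{pmatrix}^{\!\top} = x_2 - x_1 = (-1)^{2(2-1)/2}\,(x_1 - x_2), \]
+
+which matches the general pattern $(-1)^{n(n-1)/2}\prod_{i<j}(x_i - x_j)$ with $n = 2$, and shows that the determinant vanishes exactly when the two nodes coincide.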
+
+Equation (70) is the ARMAX equivalent to (47). A combination of both equations generates a new representation
+of the FIM $G(\vartheta)$, which is set forth in the following lemma.
+
+3.7.2. Lemma 3.9
+
+Assume that the conditions (67) hold and consider the representations of $\wp(\vartheta)$ and $\mathcal{P}(\vartheta)$ in (47) and (70),
+respectively. This leads to an alternative form to (65), given by
+
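+\[
+G(\vartheta) =
+\begin{pmatrix} -S_p(c) \\ S_r(a) \\ O \end{pmatrix}
+V_{\alpha\beta\xi}\,R(\vartheta)\,\hat V_{\alpha\beta\xi}
+\begin{pmatrix} -S_p(c) \\ S_r(a) \\ O \end{pmatrix}^{\top}
+\; + \;
+\begin{pmatrix} -S_p(b) \\ O \\ S_q(a) \end{pmatrix}
+V_{\alpha\beta}\,D(\vartheta)\,\hat V_{\alpha\beta}
+\begin{pmatrix} -S_p(b) \\ O \\ S_q(a) \end{pmatrix}^{\top}
+\]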
+
+In Lemma 3.9, the FIM $G(\vartheta)$ is expressed by submatrices of two Sylvester matrices and various
+Vandermonde matrices. Both types of matrices become singular if and only if the appropriate
+polynomials have at least one common root.
+
+3.8. The Fisher Information Matrix of a Vector ARMA(p, q) Process
+
+The process (5) is summarized as
+
+\[ A(z)y(t) = B(z)\varepsilon(t) \]
+
+and we assume that $\{y(t), t \in \mathbb{N}\}$ is a zero-mean Gaussian time series and $\{\varepsilon(t), t \in \mathbb{N}\}$ is an $n$-dimensional
+vector random variable, such that $E_\vartheta\{\varepsilon(t)\} = 0$ and $E_\vartheta\{\varepsilon(t)\varepsilon^{\top}(t)\} = \Sigma$, and the parameter vector $\vartheta$ is of the form (7). In [6] it is shown that
+representation (16) for the $n^2(p+q)\times n^2(p+q)$ asymptotic FIM of the VARMA process (6) is
+
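+\[ F(\vartheta) = E_\vartheta\left\{ \left(\frac{\partial\varepsilon}{\partial\vartheta^{\top}}\right)^{\!\top} \Sigma^{-1} \left(\frac{\partial\varepsilon}{\partial\vartheta^{\top}}\right) \right\} \tag{71} \]
+
+where $\partial\varepsilon/\partial\vartheta^{\top}$ is of size $n\times n^2(p+q)$ and for convenience $t$ is omitted from $\varepsilon(t)$. Using the differential
+rules outlined in [6] yields
+
+\[ \frac{\partial\varepsilon}{\partial\vartheta^{\top}} = \left\{ (A^{-1}(z)B(z)\varepsilon)^{\top} \otimes B^{-1}(z) \right\} \frac{\partial\operatorname{vec}A(z)}{\partial\vartheta^{\top}} - \left( \varepsilon^{\top} \otimes B^{-1}(z) \right) \frac{\partial\operatorname{vec}B(z)}{\partial\vartheta^{\top}} \tag{72} \]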
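+
+A short sketch of how (72) arises may help. It assumes the standard identity $\operatorname{vec}(XYZ) = (Z^{\top}\otimes X)\operatorname{vec}Y$ and the inverted form $\varepsilon = B^{-1}(z)A(z)y$, neither of which is spelled out in this section, and it treats the operators formally. Differentiating and substituting $y = A^{-1}(z)B(z)\varepsilon$ gives
+
+\[ d\varepsilon = B^{-1}(dA)\,y - B^{-1}(dB)\,B^{-1}A\,y = B^{-1}(dA)\,A^{-1}B\,\varepsilon - B^{-1}(dB)\,\varepsilon, \]
+
+and applying the vec identity to each term (with $d\varepsilon$ a column vector) yields
+
+\[ d\varepsilon = \left\{ (A^{-1}B\,\varepsilon)^{\top}\otimes B^{-1} \right\} d\operatorname{vec}A - \left( \varepsilon^{\top}\otimes B^{-1} \right) d\operatorname{vec}B, \]
+
+in agreement with the two terms of (72).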
+
+The substitution of representation (72) of $\partial\varepsilon/\partial\vartheta^{\top}$ in (71) yields the FIM of a VARMA process. The
+purpose is to construct a factorization of the FIM $F(\vartheta)$ that is a multiple variant of the
+factorization (27), so that a multiple resultant matrix property can be proved for $F(\vartheta)$. As illustrated
+in [6], the multiple version of the Sylvester resultant matrix (22) does not fulfill the multiple resultant
+matrix property: even when the matrix polynomials $A(z)$ and $B(z)$ have a common zero or
+a common eigenvalue, the multiple Sylvester matrix is not necessarily singular. This has also been
+illustrated in [3]. In order to consider a multiple equivalent to the resultant matrix $S(-b, a)$, Gohberg
+and Lerer set forth the $n^2(p+q)\times n^2(p+q)$ tensor Sylvester matrix
+
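+\[
+S^{\otimes}(-B, A) :=
+\begin{pmatrix}
+(-I_n)\otimes I_n & (-B_1)\otimes I_n & \cdots & (-B_q)\otimes I_n & O_{n^2\times n^2} & \cdots & O_{n^2\times n^2} \\
+O_{n^2\times n^2} & \ddots & \ddots & & \ddots & \ddots & \vdots \\
+\vdots & \ddots & \ddots & \ddots & & \ddots & O_{n^2\times n^2} \\
+O_{n^2\times n^2} & \cdots & O_{n^2\times n^2} & (-I_n)\otimes I_n & (-B_1)\otimes I_n & \cdots & (-B_q)\otimes I_n \\
+I_n\otimes I_n & I_n\otimes A_1 & \cdots & I_n\otimes A_p & O_{n^2\times n^2} & \cdots & O_{n^2\times n^2} \\
+O_{n^2\times n^2} & \ddots & \ddots & & \ddots & \ddots & \vdots \\
+\vdots & \ddots & \ddots & \ddots & & \ddots & O_{n^2\times n^2} \\
+O_{n^2\times n^2} & \cdots & O_{n^2\times n^2} & I_n\otimes I_n & I_n\otimes A_1 & \cdots & I_n\otimes A_p
+\end{pmatrix}
+\tag{73}
+\]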
+
+In [3], the authors prove that the tensor Sylvester matrix S⊗ (−B,A) fulfills the multiple resultant
+property, it becomes singular if and only if the appropriate matrix polynomials A(z) and B(z) have at
+least one common zero. In Proposition 2.2 in [6], the following factorized form of the Fisher information
+F(ϑ) is developed
+
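+\[ F(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \Phi(z)\,\Theta(z)\,\Phi^{*}(z)\,\frac{dz}{z} \tag{74} \]
+
+where
+
+\[
+\Phi(z) =
+\begin{pmatrix}
+I_p\otimes A^{-1}(z)\otimes I_n & O_{pn^2\times qn^2} \\
+O_{qn^2\times pn^2} & I_q\otimes I_n\otimes A^{-1}(z)
+\end{pmatrix}
+S^{\otimes}(-B, A)\,\bigl(u_{p+q}(z)\otimes I_{n^2}\bigr)
+\]
+
+and
+
+\[ \Theta(z) = \Sigma\otimes\sigma(z), \qquad \sigma(z) = B^{-\top}(z)\,\Sigma^{-1}B^{-1}(z^{-1}) \tag{75} \]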
+
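+In order to obtain a multiple variant of (27), the following matrix is introduced,
+
+\[ M(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \Lambda(z)\,\mathcal{J}(z)\,\Lambda^{*}(z)\,\frac{dz}{z} = S^{\otimes}(-B, A)\,P(\vartheta)\,\bigl(S^{\otimes}(-B, A)\bigr)^{\top} \tag{76} \]
+
+where
+
+\[
+\mathcal{J}(z) = \Phi(z)\,\Theta(z)\,\Phi^{*}(z)
+\quad\text{and}\quad
+\Lambda(z) =
+\begin{pmatrix}
+I_p\otimes A(z)\otimes I_n & O_{pn^2\times qn^2} \\
+O_{qn^2\times pn^2} & I_q\otimes I_n\otimes A(z)
+\end{pmatrix}
+\]
+
+and the matrix $P(\vartheta)$ is a multiple variant of the matrix $\wp(\vartheta)$ in (28); it is of the form
+
+\[ P(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \bigl(u_{p+q}(z)\otimes I_{n^2}\bigr)\,\Theta(z)\,\bigl(u_{p+q}(z)\otimes I_{n^2}\bigr)^{*}\,\frac{dz}{z} \tag{77} \]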
+
+In Lemma 2.3 in [6], it is proved that the matrix M(ϑ) in (76) becomes singular if and only if the matrix
+polynomials A(z) and B(z) have at least one common eigenvalue-zero. The proof is a multiple equivalent
+of the proof of Corollary 2.2 in [5], since the equality (76) is a multiple version of (27). Consequently,
+
+the matrix M(ϑ) like the tensor Sylvester matrix S⊗ (−B,A), fulfills the multiple resultant matrix
+property. Since the matrix M(ϑ) is derived from the FIM F(ϑ), this enables us to prove that the matrix
+F(ϑ) fulfills the multiple resultant matrix property by showing that it becomes singular if and only if
+the matrix M(ϑ) is singular, this is done in Proposition 2.4 in [6]. Consequently, it can be concluded
+
+from [6] that the FIM of a VARMA process F(ϑ) and the tensor Sylvester matrix S⊗ (−B,A) have the
+same singularity conditions. The FIM of a VARMA process F(ϑ) can therefore be added to the class of
+multiple resultant matrices.
+
+A brief summary of the contribution of [6] follows. In order to show that the FIM of a VARMA
+process $F(\vartheta)$ is a multiple resultant matrix, two new representations of the FIM are derived. To
+construct such representations, appropriate matrix differential rules are applied. The newly obtained
+representations are expressed in terms of the multiple Sylvester matrix and the tensor Sylvester matrix.
+The representation of the FIM expressed by the tensor Sylvester matrix is used to prove that the FIM
+becomes singular if and only if the autoregressive and moving average matrix polynomials have
+at least one common eigenvalue. It then follows that the FIM and the tensor Sylvester matrix have
+equivalent singularity conditions. In a numerical example it is shown, however, that the FIM fails to
+detect common eigenvalues due to some kind of numerical instability. The tensor Sylvester matrix
+reveals them clearly, proving the usefulness of the results derived in this paper.
+
+3.9. The Fisher Information Matrix of a Vector ARMAX(p, r, q) Process
+
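+The $n^2(p+q+r)\times n^2(p+q+r)$ asymptotic FIM of the VARMAX($p, r, q$) process (2)
+
+\[ A(z)y(t) = C(z)x(t) + B(z)\varepsilon(t) \]
+
+is displayed according to [23] and is an extension of the FIM of the VARMA($p, q$) process (6).
+Representation (16) of the FIM of the VARMAX($p, r, q$) process is then
+
+\[ G(\vartheta) = E_\vartheta\left\{ \left(\frac{\partial\varepsilon}{\partial\vartheta^{\top}}\right)^{\!\top} \Sigma^{-1} \left(\frac{\partial\varepsilon}{\partial\vartheta^{\top}}\right) \right\} \]
+
+where
+
+\[
+\begin{aligned}
+\frac{\partial\varepsilon}{\partial\vartheta^{\top}} ={}& \left\{ (A^{-1}(z)C(z)x)^{\top}\otimes B^{-1}(z) \right\} \frac{\partial\operatorname{vec}A(z)}{\partial\vartheta^{\top}}
+ + \left\{ (A^{-1}(z)B(z)\varepsilon)^{\top}\otimes B^{-1}(z) \right\} \frac{\partial\operatorname{vec}A(z)}{\partial\vartheta^{\top}} \\
+& - \left\{ x^{\top}\otimes B^{-1}(z) \right\} \frac{\partial\operatorname{vec}C(z)}{\partial\vartheta^{\top}}
+ - \left( \varepsilon^{\top}\otimes B^{-1}(z) \right) \frac{\partial\operatorname{vec}B(z)}{\partial\vartheta^{\top}}
+\end{aligned}
+\tag{78}
+\]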
+
+To obtain the term $\partial\varepsilon/\partial\vartheta^{\top}$, of size $n\times n^2(p+q+r)$, the same differential rules are applied as for the
+VARMA($p, q$) process. In Proposition 2.3 in [23], the representation of the FIM of a VARMAX process
+is expressed in terms of tensor Sylvester matrices. This is obtained when $\partial\varepsilon/\partial\vartheta^{\top}$ in (78) is substituted
+in (16), to obtain
+
+\[ G(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \Phi_x(z)\,\Theta(z)\,\Phi_x^{*}(z)\,\frac{dz}{z} + \frac{1}{2\pi i}\oint_{|z|=1} \Lambda_x(z)\,\Psi(z)\,\Lambda_x^{*}(z)\,\frac{dz}{z} \tag{79} \]
+
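+The matrices in (79) are of the form
+
+\[
+\Phi_x(z) =
+\begin{pmatrix}
+I_p\otimes A^{-1}(z)\otimes I_n & O_{pn^2\times rn^2} & O_{pn^2\times qn^2} \\
+O_{rn^2\times pn^2} & O_{rn^2\times rn^2} & O_{rn^2\times qn^2} \\
+O_{qn^2\times pn^2} & O_{qn^2\times rn^2} & I_q\otimes I_n\otimes A^{-1}(z)
+\end{pmatrix}
+\begin{pmatrix} -S^{\otimes}_p(B) \\ O_{rn^2\times n^2(p+q)} \\ S^{\otimes}_q(A) \end{pmatrix}
+\bigl(u_{p+q}(z)\otimes I_{n^2}\bigr)
+\]
+
+\[
+\Lambda_x(z) =
+\begin{pmatrix}
+I_p\otimes A^{-1}(z)\otimes I_n & O_{pn^2\times rn^2} & O_{pn^2\times qn^2} \\
+O_{rn^2\times pn^2} & I_r\otimes I_n\otimes A^{-1}(z) & O_{rn^2\times qn^2} \\
+O_{qn^2\times pn^2} & O_{qn^2\times rn^2} & O_{qn^2\times qn^2}
+\end{pmatrix}
+\begin{pmatrix} -S^{\otimes}_p(C) \\ S^{\otimes}_r(A) \\ O_{qn^2\times n^2(p+r)} \end{pmatrix}
+\bigl(u_{p+r}(z)\otimes I_{n^2}\bigr)
+\]
+
+\[
+S^{\otimes}_{p,q}(-B, A) = \begin{pmatrix} -S^{\otimes}_p(B) \\ S^{\otimes}_q(A) \end{pmatrix},
+\qquad
+S^{\otimes}_{p,r}(-C, A) = \begin{pmatrix} -S^{\otimes}_p(C) \\ S^{\otimes}_r(A) \end{pmatrix}
+\tag{80}
+\]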
+
+Additionally, we have $\Psi(z) = R_x(z)\otimes\sigma(z)$, where the Hermitian spectral density matrix $R_x(z)$ is defined
+in (10), whereas the matrix polynomials $\Theta(z)$ and $\sigma(z)$ are presented in (75). In (80), we have the $pn^2\times(p+q)n^2$
+and $qn^2\times(p+q)n^2$ submatrices $S^{\otimes}_p(-B)$ and $S^{\otimes}_q(A)$ of the tensor Sylvester resultant
+matrix $S^{\otimes}_{p,q}(-B, A)$, whereas the matrices $S^{\otimes}_p(-C)$ and $S^{\otimes}_r(A)$ are the upper and lower blocks of the
+$(p+r)n^2\times(p+r)n^2$ tensor Sylvester resultant matrix $S^{\otimes}_{p,r}(-C, A)$. As for the FIM of the VARMA($p, q$)
+process, the objective is to construct a multiple version of (65). This is done in [23], to obtain
+
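+\[
+\begin{aligned}
+M_x(\vartheta) ={}& \frac{1}{2\pi i}\oint_{|z|=1} L(z)\,\mathcal{A}(z)\,L^{*}(z)\,\frac{dz}{z} + \frac{1}{2\pi i}\oint_{|z|=1} W(z)\,\mathcal{B}(z)\,W^{*}(z)\,\frac{dz}{z} \\
+={}& \begin{pmatrix} -S^{\otimes}_p(B) \\ O_{rn^2\times n^2(p+q)} \\ S^{\otimes}_q(A) \end{pmatrix} P(\vartheta) \begin{pmatrix} -S^{\otimes}_p(B) \\ O_{rn^2\times n^2(p+q)} \\ S^{\otimes}_q(A) \end{pmatrix}^{\top}
+ + \begin{pmatrix} -S^{\otimes}_p(C) \\ S^{\otimes}_r(A) \\ O_{qn^2\times n^2(p+r)} \end{pmatrix} T(\vartheta) \begin{pmatrix} -S^{\otimes}_p(C) \\ S^{\otimes}_r(A) \\ O_{qn^2\times n^2(p+r)} \end{pmatrix}^{\top}
+\end{aligned}
+\tag{81}
+\]
+
+The matrices involved are of the form
+
+\[
+L(z) =
+\begin{pmatrix}
+I_p\otimes A(z)\otimes I_n & O_{pn^2\times rn^2} & O_{pn^2\times qn^2} \\
+O_{rn^2\times pn^2} & O_{rn^2\times rn^2} & O_{rn^2\times qn^2} \\
+O_{qn^2\times pn^2} & O_{qn^2\times rn^2} & I_q\otimes I_n\otimes A(z)
+\end{pmatrix}
+\quad\text{and}\quad
+\mathcal{A}(z) := \Phi_x(z)\,\Theta(z)\,\Phi_x^{*}(z)
+\]
+
+\[
+W(z) =
+\begin{pmatrix}
+I_p\otimes A(z)\otimes I_n & O_{pn^2\times rn^2} & O_{pn^2\times qn^2} \\
+O_{rn^2\times pn^2} & I_r\otimes I_n\otimes A(z) & O_{rn^2\times qn^2} \\
+O_{qn^2\times pn^2} & O_{qn^2\times rn^2} & O_{qn^2\times qn^2}
+\end{pmatrix}
+\quad\text{and}\quad
+\mathcal{B}(z) := \Lambda_x(z)\,\Psi(z)\,\Lambda_x^{*}(z)
+\]
+
+\[ T(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \bigl(u_{p+r}(z)\otimes I_{n^2}\bigr)\,\Psi(z)\,\bigl(u_{p+r}(z)\otimes I_{n^2}\bigr)^{*}\,\frac{dz}{z} \]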
+
+and $P(\vartheta)$ is given in (77). Note that the matrices $\Phi_x(z)$, $\Lambda_x(z)$, $L(z)$ and $P(z)$ are the corrected versions of
+the corresponding matrices in [23].
+A parallel between the scalar and multiple structures is straightforward. This is best illustrated
+by comparing the representations (27) and (28) with (76) and (77), respectively, that is, by confronting the FIM
+for scalar and vector ARMA($p, q$) processes. The FIM of the scalar ARMAX($p, r, q$) process contains
+an ARMA($p, q$) part. This is confirmed by (65), through the presence of the matrix $\wp(\vartheta)$, which is
+originally displayed in (28). The multiple resultant matrices $M(\vartheta)$ and $M_x(\vartheta)$ derived from the FIM
+of the VARMA($p, q$) and VARMAX($p, r, q$) processes, respectively, both contain $P(\vartheta)$, whereas the first
+matrix term of the matrices $\Phi(z)$ and $\Phi_x(z)$, which are of different size, consists of the same nonzero
+submatrices. To summarize, in [23] compact forms of the FIM of a VARMAX process expressed in
+terms of multiple and tensor Sylvester matrices are developed. The tensor Sylvester matrices allow
+us to investigate the multiple resultant matrix property of the FIM of VARMAX($p, r, q$) processes.
+However, since no proof of the multiple resultant matrix property of the FIM $G(\vartheta)$ has been given yet,
+a conjecture is justified. The conjecture states that the FIM $G(\vartheta)$ of a VARMAX($p, r, q$)
+process becomes singular if and only if the matrix polynomials $A(z)$, $B(z)$ and $C(z)$ have at least one
+common eigenvalue. A multiple equivalent to Theorem 3.1 in [4], combined with Proposition 2.4
+in [6] but based on the representations (79) and (81), can be envisaged to formulate a proof, which will
+be a subject for future study.
+
+4. Conclusions
+
+In this survey paper, matrix algebraic properties of the FIM of stationary processes are discussed.
+The presented material is a summary of papers where several matrix structural aspects of the FIM
+are investigated. The FIM of scalar and multiple processes like the (V)ARMA(X) are set forth with
+appropriate factorized forms involving (tensor) Sylvester matrices. These representations enable us to
+prove the resultant matrix property of the corresponding FIM. This has been done for (V)ARMA(p,
+q) and ARMAX(p, r, q) processes in the papers [4,6]. The development of the stages that lead to the
+appropriate factorized form of the FIM G(ϑ) (79) is set forth in [23]. However, there is no proof
+yet that confirms the multiple resultant matrix property of the FIM G(ϑ) of a VARMAX(p, r, q) process.
+This justifies the consideration of a conjecture, which is formulated in the previous section and can be a
+subject for future study.
+The statistical distance measure derived in [7] involves entries of the FIM. This distance measure
+can be a challenge to its quantum information counterpart (41), because (36) involves information
+about m parameters estimated from n measurements, whereas in quantum information, as in
+e.g., [8,10], the information about one parameter in a particular measurement procedure is considered
+for establishing an interconnection with the appropriate statistical distance measure. A possible
+approach, combining matrix algebra and quantum information, for developing a statistical distance
+measure in quantum information or quantum statistics at the matrix level can be a subject of
+future research. Some results concerning interconnections between the FIM of ARMA(X) models
+and appropriate solutions to Stein matrix equations are discussed; the material is extracted from the
+papers [12] and [13]. However, in this paper, some alternative and new proofs that emphasize the
+conditions under which the FIM fulfills appropriate Stein equations are set forth. The presence of
+various types of Vandermonde matrices is also emphasized when an explicit expansion of the FIM is
+computed. These Vandermonde matrices are inserted in interconnections with appropriate solutions to
+Stein equations. This explains why, when the matrix algebraic structures of the FIM of stationary processes
+are investigated, the involvement of structured matrices like the (tensor) Sylvester, Bezoutian and
+Vandermonde matrices is essential.
+
+Acknowledgments: The author thanks a perceptive reviewer for his comments which significantly improved the
+quality and presentation of the paper.
+
+Conflicts of Interest: The authors have declared no conflict of interest.
+
+References
+
+1.
+Dym, H. Linear Algebra in Action; American Mathematical Society: Providence, RI, USA, 2006; Volume 78.
+2.
+Lancaster, P.; Tismenetsky, M. The Theory of Matrices with Applications, 2nd ed; Academic Press: Orlando, FL,
+USA, 1985.
+3.
+Gohberg, I.; Lerer, L. Resultants of matrix polynomials. Bull. Am. Math. Soc 1976, 82, 565–567.
+4.
+Klein, A.; Spreij, P. On Fisher’s information matrix of an ARMAX process and Sylvester’s resultant matrices.
+Linear Algebra Appl 1996, 237/238, 579–590.
+5.
+Klein, A.; Spreij, P. On Fisher’s information matrix of an ARMA process. In Stochastic Differential and Difference
+Equations; Csiszar, I., Michaletzky, Gy., Eds.; Birkhäuser: Boston, MA, USA, 1997; Progress in Systems and
+Control Theory; Volume 23, pp. 273–284.
+6.
+Klein, A.; Mélard, G.; Spreij, P. On the Resultant Property of the Fisher Information Matrix of a Vector ARMA
+process. Linear Algebra Appl 2005, 403, 291–313.
+7.
+Klein, A.; Spreij, P. Transformed Statistical Distance Measures and the Fisher Information Matrix.
+Linear Algebra Appl 2012, 437, 692–712.
+8.
+Braunstein, S.L.; Caves, C.M. Statistical Distance and the Geometry of Quantum States. Phys. Rev. Lett 1994,
+72, 3439–3443.
+9.
+Jones, P.J.; Kok, P. Geometric derivation of the quantum speed limit. Phys. Rev. A 2010, 82, 022107.
+10.
+Kok, P. Tutorial: Statistical distance and Fisher information; Oxford: UK, 2006.
+11.
+Lancaster, P.; Rodman, L. Algebraic Riccati Equations; Clarendon Press: Oxford, UK, 1995.
+12.
+Klein, A.; Spreij, P. On Stein’s equation, Vandermonde matrices and Fisher’s information matrix of time
+series processes. Part I: The autoregressive moving average process. Linear Algebra Appl 2001, 329, 9–47.
+13.
+Klein, A.; Spreij, P. On the solution of Stein’s equation and Fisher’s information matrix of an ARMAX process.
+Linear Algebra Appl 2005, 396, 1–34.
+14.
+Grenander, U.; Szeg˝o, G.P. Toeplitz Forms and Their Applications; University of California Press: New York, NY,
+USA, 1958.
+15.
+Brockwell, P.J.; Davis, R.A. Time Series: Theory and Methods, 2nd ed; Springer Verlag: Berlin, Germany;
+New York, NY, USA, 1991.
+16.
+Caines, P. Linear Stochastic Systems; John Wiley and Sons: New York, NY, USA, 1988.
+17.
+Ljung, L.; Söderström, T. Theory and Practice of Recursive Identification; M.I.T. Press: Cambridge, MA, USA, 1983.
+18.
+Hannan, E.J.; Deistler, M. The Statistical Theory of Linear Systems; John Wiley and Sons: New York, NY, USA, 1988.
+19.
+Hannan, E.J.; Dunsmuir, W.T.M.; Deistler, M. Estimation of vector Armax models. J. Multivar. Anal 1980, 10,
+275–295.
+20.
+Horn, R.A.; Johnson, C.R. Topics in Matrix Analysis; Cambridge University Press: New York, NY, USA, 1995.
+21.
+Klein, A.; Spreij, P. Matrix differential calculus applied to multiple stationary time series and an extended
+Whittle formula for information matrices. Linear Algebra Appl 2009, 430, 674–691.
+
+22.
+Klein, A.; Mélard, G. An algorithm for the exact Fisher information matrix of vector ARMAX time series.
+Linear Algebra Its Appl 2014, 446, 1–24.
+23.
+Klein, A.; Spreij, P. Tensor Sylvester matrices and the Fisher information matrix of VARMAX processes.
+Linear Algebra Appl 2010, 432, 1975–1989.
+24.
+Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta
+Math. Soc 1945, 37, 81–91.
+25.
+Ibragimov, I.A.; Has’minski˘ı, R.Z. Statistical Estimation. In Asymptotic Theory; Springer-Verlag: New York,
+NY, USA, 1981.
+26.
+Lehmann, E.L. Theory of Point Estimation; Wiley: New York, NY, USA, 1983.
+27.
+Friedlander, B. On the computation of the Cramér-Rao bound for ARMA parameter estimation. IEEE Trans.
+Acoust. Speech Signal Process 1984, 32, 721–727.
+28.
+Holevo, A.S. Probabilistic and Statistical Aspects of Quantum Theory, 2nd ed; Edizioni Della Normale, SNS Pisa:
+Pisa, Italy, 2011.
+29.
+Petz, D. Quantum Information Theory and Quantum Statistics; Springer-Verlag: Berlin Heidelberg, Germany,
+2008.
+30.
+Barndorff-Nielsen, O.E.; Gill, R.D. Fisher Information in quantum statistics. J. Phys. A 2000, 30, 4481–4490.
+31.
+Luo, S. Wigner-Yanase skew information vs. quantum Fisher information. Proc. Amer. Math. Soc 2004, 132,
+885–890.
+32.
+Klein, A.; Mélard, G. On algorithms for computing the covariance matrix of estimates in autoregressive
+moving average processes. Comput. Stat. Q 1989, 5, 1–9.
+33.
+Klein, A.; Mélard, G. An algorithm for computing the asymptotic Fisher information matrix for seasonal
+SISO models. J. Time Ser. Anal 2004, 25, 627–648.
+34.
+Bistritz, Y.; Lifshitz, A. Bounds for resultants of univariate and bivariate polynomials. Linear Algebra Appl
+2010, 432, 1995–2005.
+35.
+Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: New York, NY, USA, 1996.
+36.
+Golub, G.H.; van Loan, C.F. Matrix Computations, 3rd ed; Johns Hopkins University Press: Baltimore, USA, 1996.
+37.
+Kullback, S. Information Theory and Statistics; John Wiley and Sons: New York, NY, USA, 1959.
+38.
+Klein, A.; Spreij, P. The Bezoutian, state space realizations and Fisher’s information matrix of an ARMA
+process. Linear Algebra Appl 2006, 416, 160–174.
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+entropy
+
+Article
+Asymptotically Constant-Risk Predictive Densities
+When the Distributions of Data and Target Variables
+Are Different
+
+Keisuke Yano 1,* and Fumiyasu Komaki 1,2
+
+1 Department of Mathematical Informatics, Graduate School of Information Science and Technology,
+The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan; E-Mail:
+komaki@mist.i.u-tokyo.ac.jp
+2 RIKEN Brain Science Institute, 2-1 Hirosawa, Wako City, Saitama 351-0198, Japan
+*
+E-Mail: Keisuke_Yano@mist.i.u-tokyo.ac.jp; Tel.: +81-3-5841-6909.
+
+Received: 28 March 2014; in revised form: 9 May 2014 / Accepted: 22 May 2014 /
+Published: 28 May 2014
+
+Abstract: We investigate the asymptotic construction of constant-risk Bayesian predictive densities
+under the Kullback–Leibler risk when the distributions of data and target variables are different and
+have a common unknown parameter. It is known that the Kullback–Leibler risk is asymptotically
+equal to a trace of the product of two matrices: the inverse of the Fisher information matrix for the
+data and the Fisher information matrix for the target variables. We assume that the trace has a unique
+maximum point with respect to the parameter. We construct asymptotically constant-risk Bayesian
+predictive densities using a prior depending on the sample size. Further, we apply the theory to the
+subminimax estimator problem and the prediction based on the binary regression model.
+
+Keywords:
+Bayesian prediction; Fisher information; Kullback–Leibler divergence; minimax;
+predictive metric; subminimax estimator
+
+1. Introduction
+
+Let $x^{(N)} = (x_1, \cdots, x_N)$ be $N$ independent data distributed according to a probability density,
+$p(x|\theta)$, that belongs to a $d$-dimensional parametric model, $\{p(x|\theta) : \theta \in \Theta\}$, where $\theta = (\theta^1, \cdots, \theta^d)$
+is an unknown $d$-dimensional parameter and $\Theta$ is the parameter space. Let $y$ be a target variable
+distributed according to a probability density, $q(y|\theta)$, that belongs to a $d$-dimensional parametric
+model, $\{q(y|\theta) : \theta \in \Theta\}$, with the same parameter, $\theta$. Here, we assume that the distributions of the
+data and the target variables, $p(x|\theta)$ and $q(y|\theta)$, are different. For simplicity, we assume that the data
+and the target variables are independent, given $\theta$.
+We construct predictive densities for target variables based on the data. We measure
+the performance of the predictive density, $\hat q(y; x^{(N)})$, by the Kullback–Leibler divergence,
+$D(q(\cdot|\theta), \hat q(\cdot; x^{(N)}))$, from the true density, $q(y|\theta)$, to the predictive density, $\hat q(y; x^{(N)})$:
+
+\[ D(q(\cdot|\theta), \hat q(\cdot; x^{(N)})) = \int q(y|\theta)\log\frac{q(y|\theta)}{\hat q(y; x^{(N)})}\,dy. \]
+
+Then, the risk function, $R(\theta, \hat q(y; x^{(N)}))$, of the predictive density, $\hat q(y; x^{(N)})$, is given by:
+
+\[
+\begin{aligned}
+R(\theta, \hat q(y; x^{(N)})) &= \int p(x^{(N)}|\theta)\,D(q(\cdot|\theta), \hat q(\cdot; x^{(N)}))\,dx^{(N)} \\
+&= \int p(x^{(N)}|\theta)\int q(y|\theta)\log\frac{q(y|\theta)}{\hat q(y; x^{(N)})}\,dy\,dx^{(N)}.
+\end{aligned}
+\]
+
+Entropy 2014, 16, 3026–3048; doi:10.3390/e16063026
+www.mdpi.com/journal/entropy
+
+For the construction of predictive densities, we consider the Bayesian predictive density defined
+by:
+
+\[ \hat q_\pi(y|x^{(N)}) = \frac{\int q(y|\theta)\,p(x^{(N)}|\theta)\,\pi(\theta; N)\,d\theta}{\int p(x^{(N)}|\theta)\,\pi(\theta; N)\,d\theta}, \]
+
+where π(θ; N) is a prior density for θ, possibly depending on the sample size, N. Aitchison [1]
+showed that, for a given prior density, π(θ; N), the Bayesian predictive density, ˆqπ(y|x(N)), is a Bayes
+solution under the Kullback–Leibler risk. Based on the asymptotics as the sample size goes to infinity,
+Komaki [2] and Hartigan [3] showed its superiority over any plug-in predictive density, q(y| ˆθ), with
+any estimator, ˆθ. However, there remains a problem of prior selection for constructing better Bayesian
+predictive densities. Thus, a prior, π(θ; N), must be chosen based on an optimality criterion for actual
+applications.
+Among various criteria, we focus on a criterion of constructing minimax predictive densities
+under the Kullback–Leibler risk. For simplicity, we refer to the priors generating minimax predictive
+densities as minimax priors. Minimax priors have been previously studied in various predictive
+settings; see [4–8]. When the simultaneous distributions of the target variables and the data belong to
+the submodel of the multinomial distributions, Komaki [7] shows that minimax priors are given as
+latent information priors maximizing the conditional mutual information between target variables and
+the parameter given the data. However, the explicit forms of latent information priors are difficult to
+obtain, and we need asymptotic methods, because they require the maximization on the space of the
+probability measures on Θ.
+Except for [7], these studies on minimax priors are based on the assumption that the distributions,
+p(x|θ) and q(y|θ), are identical. Let us consider the prediction based on the logistic regression model
+where the covariates of the data and the target variables are not identical. In this predictive setting, the
+assumption that the distributions, p(x|θ) and q(y|θ), are identical is no longer valid.
+We focus on the minimax priors in predictions where the distributions, p(x|θ) and q(y|θ), are
+different and have a common unknown parameter. Such a predictive setting has traditionally been
+considered in statistical prediction and experiment design. It has recently been studied in statistical
+learning theory; for example, see [9]. Predictive densities where the distributions, p(x|θ) and q(y|θ),
+are different and have a common unknown parameter are studied by [10–13].
+Let $g^X_{ij}(\theta)$ be the $(i, j)$-component of the Fisher information matrix of the distribution, $p(x|\theta)$, and
+let $g^Y_{ij}(\theta)$ be the $(i, j)$-component of the Fisher information matrix of the distribution, $q(y|\theta)$. Let $g^{X,ij}(\theta)$
+and $g^{Y,ij}(\theta)$ denote the $(i, j)$-components of their inverse matrices. We adopt Einstein's summation
+convention: if the same index appears twice in any one term, it implies summation over that index
+from one to $d$. For the asymptotics below, we assume that the prior densities, $\pi(\theta; N)$, are smooth.
+In the asymptotics as the sample size $N$ goes to infinity, we construct the asymptotically
+constant-risk prior, $\pi(\theta; N)$, in the sense that the asymptotic risk:
+
+\[ R(\theta, \hat q_\pi(y|x^{(N)})) = \frac{1}{N}R_1(\theta, \hat q_\pi(y|x^{(N)})) + \frac{1}{N\sqrt{N}}R_2(\theta, \hat q_\pi(y|x^{(N)})) + O(N^{-2}) \]
+
+is constant up to $O(N^{-2})$. Since the proper prior with the constant risk is a minimax prior for any finite
+sample size, the asymptotically constant-risk prior relates to the minimax prior; in Section 4, we verify
+that the asymptotically constant-risk prior agrees with the exact minimax prior in binomial examples.
+When we use the prior, $\pi(\theta)$, independent of the sample size, $N$, it is known that the $N^{-1}$-order
+term, $R_1(\theta, \hat q_\pi(y|x^{(N)}))$, of the Kullback–Leibler risk is equal to the trace, $g^{X,ij}(\theta)g^Y_{ij}(\theta)$. If the trace
+does not depend on the parameter, $\theta$, the construction of the asymptotically constant-risk prior is
+parallel to [6]; see also [13].
+However, we consider the settings where there exists a unique maximum point of the trace,
+$g^{X,ij}(\theta)g^Y_{ij}(\theta)$; for example, these settings appear in predictions based on the binary regression model,
+where the covariates of the data and the target variables are not identical. In these settings, there do
+not exist asymptotically constant-risk priors among the priors independent of the sample size, $N$.
+The reason is as follows: we consider the prior, $\pi(\theta)$, independent of the sample size, $N$. Then, the
+Kullback–Leibler risk of the Bayesian predictive density is expanded as:
+
+\[ R(\theta, \hat q_\pi(y|x^{(N)})) = \frac{1}{2N}\,g^Y_{ij}(\theta)\,g^{X,ij}(\theta) + O(N^{-2}). \]
+
+Since, in our settings, the first-order term, $g^Y_{ij}(\theta)g^{X,ij}(\theta)$, is not constant, the prior independent of the
+sample size, $N$, is not an asymptotically constant-risk prior.
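+As a hypothetical illustration (not an example taken from the source), suppose the data are Bernoulli with success probability $\theta$ and the target variable is Gaussian with mean $\theta$ and unit variance. Then the two Fisher informations and the leading risk term are
+
+\[ g^X_{\theta\theta}(\theta) = \frac{1}{\theta(1-\theta)}, \qquad g^Y_{\theta\theta}(\theta) = 1, \qquad \frac{1}{2N}\,g^Y_{\theta\theta}(\theta)\,g^{X,\theta\theta}(\theta) = \frac{\theta(1-\theta)}{2N}, \]
+
+so the trace has its unique maximum at $\theta = 1/2$, and the first-order risk is not constant in $\theta$ for any prior that is independent of $N$.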
+When there exists a unique maximum point of the trace, $g^{X,ij}(\theta)g^Y_{ij}(\theta)$, we construct the
+asymptotically constant-risk prior, $\pi(\theta; N)$, up to $O(N^{-2})$, by making the prior dependent on the
+sample size, $N$, as:
+
+\[ \frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \{f(\theta)\}^{\sqrt{N}}\,h(\theta), \]
+
+where $f(\theta)$ and $h(\theta)$ are scalar functions of $\theta$ independent of $N$ and $|g^X(\theta)|$ denotes the determinant
+of the Fisher information matrix, $g^X(\theta)$.
+The key idea is that, if a specific parameter point carries more risk than the other parameter
+points, then more prior weight should be concentrated on that point.
+Further, we clarify the subminimax estimator problem based on the mean squared error from
+the viewpoint of the prediction where the distributions of data and target variables are different and
+have a common unknown parameter. We obtain the improvement achieved by the minimax estimator
+over the subminimax estimators up to $O(N^{-2})$. The subminimax estimator problem [14,15] is the
+problem that, at first glance, there seem to exist estimators that asymptotically dominate the minimax
+estimator. However, the relationship between such subminimax estimator problems and prediction
+has not been investigated, and further, in general, the improvement by the minimax estimator over
+the subminimax estimators has not been investigated.
+
+2. Information Geometrical Notations
+
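+In this section, we prepare the information geometrical notations; see [16] for details. We
+abbreviate $\partial/\partial\theta^i$ to $\partial_i$, where the indices, $i, j, k, \ldots$, run from one to $d$. Similarly, we abbreviate $\partial^2/\partial\theta^i\partial\theta^j$,
+$\partial^3/\partial\theta^i\partial\theta^j\partial\theta^k$ and $\partial^4/\partial\theta^i\partial\theta^j\partial\theta^k\partial\theta^l$ to $\partial_{ij}$, $\partial_{ijk}$ and $\partial_{ijkl}$, respectively. We denote the expectations of the
+random variables, $X$, $Y$ and $X^{(N)}$, by $E_X[\cdot]$, $E_Y[\cdot]$ and $E_{X^{(N)}}[\cdot]$, respectively. We denote their probability
+densities by $p(x|\theta)$, $q(y|\theta)$ and $p(x^{(N)}|\theta)$, respectively.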
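+We define the predictive metric proposed by Komaki [13] as:
+
+\[ \mathring{g}_{ij}(\theta) = g^X_{ik}(\theta)\,g^{Y,kl}(\theta)\,g^X_{lj}(\theta). \]
+
+When the parameter is one-dimensional, $g_{\theta\theta}(\theta)$ denotes the Fisher information and $g^{\theta\theta}(\theta)$ denotes its
+inverse. Let $\overset{e}{\Gamma}{}^X_{ij,k}(\theta)$ and $\overset{m}{\Gamma}{}^X_{ij,k}(\theta)$ be the quantities given by:
+
+\[ \overset{e}{\Gamma}{}^X_{ij,k}(\theta) := E_X[\partial_{ij}\log p(x|\theta)\,\partial_k\log p(x|\theta)] \]
+
+and:
+
+\[ \overset{m}{\Gamma}{}^X_{ij,k}(\theta) := \int \frac{1}{p(x|\theta)}\,[\partial_{ij}p(x|\theta)\,\partial_k p(x|\theta)]\,dx. \]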
+
+Using these quantities, the e-connection and m-connection coefficients with respect to the parameter, $\theta$,
+for the model, $\{p(x|\theta) : \theta \in \Theta\}$, are given by:
+
+\[ \overset{e}{\Gamma}{}^{X,k}_{ij}(\theta) := g^{X,lk}(\theta)\,\overset{e}{\Gamma}{}^X_{ij,l}(\theta) \]
+
+and:
+
+\[ \overset{m}{\Gamma}{}^{X,k}_{ij}(\theta) := g^{X,kl}(\theta)\,\overset{m}{\Gamma}{}^X_{ij,l}(\theta), \]
+
+respectively.
+The (0, 3)-tensor, $T^X_{ijk}(\theta)$, is defined by:
+
+\[ T^X_{ijk}(\theta) := E_X[\partial_i\log p(x|\theta)\,\partial_j\log p(x|\theta)\,\partial_k\log p(x|\theta)]. \]
+
+The tensor, $T^X_{ijk}(\theta)$, also produces a (0, 1)-tensor:
+
+\[ T^X_i(\theta) := T^X_{ijk}(\theta)\,g^{X,jk}(\theta). \]
+
+In the same manner, the information geometrical quantities, $\overset{e}{\Gamma}{}^Y_{ij,l}(\theta)$, $\overset{m}{\Gamma}{}^Y_{ij,l}(\theta)$ and $T^Y_{ijk}(\theta)$, are
+defined for the model, $\{q(y|\theta) : \theta \in \Theta\}$.
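+Let $M^k_{ij}(\theta)$ be a (1, 2)-tensor defined by:
+
+\[ M^k_{ij}(\theta) := \overset{m}{\Gamma}{}^{Y,k}_{ij}(\theta) - \overset{m}{\Gamma}{}^{X,k}_{ij}(\theta). \]
+
+For a derivative, $(\partial_1 v(\theta), \cdots, \partial_d v(\theta))$, of the scalar function, $v(\theta)$, the e-covariant derivative is
+given by:
+
+\[ \overset{e}{\nabla}_i\partial_j v(\theta) := \partial_{ij}v(\theta) - \overset{e}{\Gamma}{}^{X,k}_{ij}(\theta)\,\partial_k v(\theta). \]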
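+
+For orientation, the one-dimensional case makes these definitions explicit. This reduction is a sketch under the assumption $d = 1$ and is not stated in this form in the source. Writing $g^X_{\theta\theta}$ and $g^Y_{\theta\theta}$ for the two Fisher informations,
+
+\[ \mathring{g}_{\theta\theta} = \frac{(g^X_{\theta\theta})^2}{g^Y_{\theta\theta}}, \qquad \mathring{g}^{\theta\theta} = \frac{g^Y_{\theta\theta}}{(g^X_{\theta\theta})^2}, \qquad g^{X,\theta\theta}\,g^Y_{\theta\theta} = \frac{g^Y_{\theta\theta}}{g^X_{\theta\theta}}, \]
+
+so the predictive metric coincides with the Fisher information of the data whenever $g^X_{\theta\theta} = g^Y_{\theta\theta}$, and the trace is then identically one.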
+
+3. Asymptotically Constant-Risk Priors When the Distributions of Data and Target Variables
+Are Different
+
+In this section, we consider the settings where the trace, $g^{X,ij}(\theta) g^{Y}_{ij}(\theta)$, has a unique maximum
+point. We construct the asymptotically constant-risk prior under the Kullback–Leibler risk in the sense
+that the asymptotic risk up to O(N−2) is constant. We find asymptotically constant-risk priors up to
+O(N−2) in two steps: first, expand the Kullback–Leibler risks of Bayesian predictive densities; second,
+find the prior having an asymptotically constant risk using this expansion.
+From now on, we assume the following two conditions for the prior, π(θ; N):
+
+(C1) The prior, π(θ; N), has the form:
+
+\[ \frac{\pi(\theta; N)}{|g^{X}(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log f(\theta) + \log h(\theta)\}, \]
+
+where f (θ) and h(θ) are smooth scalar functions of θ independent of N.
+(C2) The unique maximum point of the scalar function, f (θ), is equal to the unique maximum point
+of the trace, $g^{X,ij}(\theta) g^{Y}_{ij}(\theta)$.
+
+Based on Conditions (C1) and (C2), we expand the Kullback–Leibler risk of a Bayesian predictive
+density up to O(N−2).
+
+Theorem 1. The Kullback–Leibler risk of a Bayesian predictive density based on the prior, π(θ; N), satisfying
+Condition (C1), is expanded as:
+
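+\[
+\begin{aligned}
+R(\theta, \hat{q}_{\pi}(y|x^{(N)}))
+={}& \frac{1}{2N} g^{Y}_{ij}(\theta) g^{X,ij}(\theta) + \frac{1}{2N} \mathring{g}^{ij}(\theta)\, \partial_i \log f(\theta)\, \partial_j \log f(\theta) \\
+& - \frac{1}{N\sqrt{N}} T^{Y}_{ijk}(\theta) g^{X,ij}(\theta) g^{X,kl}(\theta)\, \partial_l \log f(\theta) \\
+& + \frac{1}{N\sqrt{N}} \mathring{g}^{ij}(\theta) \overset{e}{\nabla}_i \partial_j \log f(\theta)
+  + \frac{1}{N\sqrt{N}} \mathring{g}^{ij}(\theta) g^{X,kl}(\theta) \bigl( \overset{e}{\nabla}_i \partial_k \log f(\theta) \bigr) \partial_j \log f(\theta)\, \partial_l \log f(\theta) \\
+& - \frac{1}{3N\sqrt{N}} T^{Y}_{ijk}(\theta) g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta)\, \partial_s \log f(\theta)\, \partial_t \log f(\theta)\, \partial_u \log f(\theta) \\
+& + \frac{1}{2N\sqrt{N}} g^{Y}_{kl}(\theta) M^{l}_{ij}(\theta) g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta)\, \partial_s \log f(\theta)\, \partial_t \log f(\theta)\, \partial_u \log f(\theta) \\
+& + \frac{1}{2N\sqrt{N}} g^{X,ij}(\theta) g^{Y}_{kl}(\theta) g^{X,kl}(\theta) M^{m}_{ij}(\theta)\, \partial_m \log f(\theta)
+  + \frac{1}{N\sqrt{N}} \mathring{g}^{ij}(\theta) M^{k}_{ij}(\theta)\, \partial_k \log f(\theta) \\
+& + \frac{1}{2N\sqrt{N}} \mathring{g}^{ij}(\theta) T^{X}_{i}(\theta)\, \partial_j \log f(\theta)
+  + \frac{1}{2N\sqrt{N}} g^{X,im}(\theta) g^{Y}_{ij}(\theta) g^{X,kl}(\theta) M^{j}_{kl}(\theta)\, \partial_m \log f(\theta) \\
+& + \frac{1}{N\sqrt{N}} \mathring{g}^{ij}(\theta)\, \partial_i \log f(\theta)\, \partial_j \log h(\theta) + O(N^{-2}).
+\end{aligned}
+\tag{1}
+\]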
+
+The proof is given in the Appendix. The first term in (1) shows that the precision of the
+estimation is determined by the geometric quantity of the data, $g^{X,ij}(\theta)$, while the metric of the parameter
+is determined by the geometric quantity of the target variables, $g^{Y}_{ij}(\theta)$. Note that each term in (1) is
+invariant under reparametrization.
+
+Remark 1. For the subsequent theorem, it is important that at the point, $\theta_f$, maximizing the scalar function,
+log f (θ), the risk $R(\theta_f, \hat{q}_{\pi}(y|x^{(N)}))$ is given by:
+
+\[ R(\theta_f, \hat{q}_{\pi}(y|x^{(N)})) = \frac{1}{2N} \sup_{\theta \in \Theta} \{ g^{X,ij}(\theta) g^{Y}_{ij}(\theta) \} + \frac{1}{N\sqrt{N}} \mathring{g}^{ij}(\theta_f)\, \partial_{ij} \log f(\theta_f) + O(N^{-2}). \tag{2} \]
+
+The $N^{-3/2}$-order term of this risk is common whenever we use the same scalar function, log f (θ). This term is
+negative because of the definition of the point, $\theta_f$. Under Condition (C2), $\theta_f$ is equal to the unique maximum
+point, $\theta_{\max}$, of the trace, $g^{X,ij}(\theta) g^{Y}_{ij}(\theta)$.
+
+Based on (1) and (2), we construct asymptotically constant-risk priors using the solutions of the
+partial differential equations.
+
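+Theorem 2. Suppose that the scalar functions, $\log \tilde{f}(\theta)$ and $\log \tilde{h}(\theta)$, satisfy the following conditions:
+
+(A1) $\log \tilde{f}(\theta)$ is the solution of the eikonal equation given by:
+
+\[ \mathring{g}^{ij}(\theta)\, \partial_i \log \tilde{f}(\theta)\, \partial_j \log \tilde{f}(\theta) = g^{X,ij}(\theta_{\max}) g^{Y}_{ij}(\theta_{\max}) - g^{X,ij}(\theta) g^{Y}_{ij}(\theta), \tag{3} \]
+
+where $\theta_{\max}$ is the unique maximum point of the scalar function, $g^{X,ij}(\theta) g^{Y}_{ij}(\theta)$.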
+
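+(A2) $\log \tilde{h}(\theta)$ is the solution of the first-order linear partial differential equation given by:
+
+\[
+\begin{aligned}
+\mathring{g}^{ij}(\theta)\, \partial_i \log \tilde{f}(\theta)\, \partial_j \log \tilde{h}(\theta)
+={}& - \mathring{g}^{ij}(\theta) \overset{e}{\nabla}_i \partial_j \log \tilde{f}(\theta)
+  - \mathring{g}^{ij}(\theta) g^{X,kl}(\theta) \bigl( \overset{e}{\nabla}_i \partial_k \log \tilde{f}(\theta) \bigr) \partial_j \log \tilde{f}(\theta)\, \partial_l \log \tilde{f}(\theta) \\
+& + T^{Y}_{ijk}(\theta) g^{X,ij}(\theta) g^{X,kl}(\theta)\, \partial_l \log \tilde{f}(\theta) \\
+& + \frac{1}{3} T^{Y}_{ijk}(\theta) g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta)\, \partial_s \log \tilde{f}(\theta)\, \partial_t \log \tilde{f}(\theta)\, \partial_u \log \tilde{f}(\theta) \\
+& - \frac{1}{2} g^{Y}_{kl}(\theta) M^{l}_{ij}(\theta) g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta)\, \partial_s \log \tilde{f}(\theta)\, \partial_t \log \tilde{f}(\theta)\, \partial_u \log \tilde{f}(\theta) \\
+& - \frac{1}{2} g^{X,ij}(\theta) g^{Y}_{kl}(\theta) g^{X,kl}(\theta) M^{m}_{ij}(\theta)\, \partial_m \log \tilde{f}(\theta)
+  - \mathring{g}^{ij}(\theta) M^{k}_{ij}(\theta)\, \partial_k \log \tilde{f}(\theta) \\
+& - \frac{1}{2} \mathring{g}^{ij}(\theta) T^{X}_{i}(\theta)\, \partial_j \log \tilde{f}(\theta)
+  - \frac{1}{2} g^{X,im}(\theta) g^{Y}_{ij}(\theta) g^{X,kl}(\theta) M^{j}_{kl}(\theta)\, \partial_m \log \tilde{f}(\theta) \\
+& + \mathring{g}^{ij}(\theta_{\max})\, \partial_{ij} \log \tilde{f}(\theta_{\max}).
+\end{aligned}
+\tag{4}
+\]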
+
+Let π(θ; N) be the prior that is constructed as:
+
+\[ \frac{\pi(\theta; N)}{|g^{X}(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log \tilde{f}(\theta) + \log \tilde{h}(\theta)\}. \]
+
+Further, suppose that log ˜f (θ) satisfies Condition (C2).
+Then, the Bayesian predictive density based on the prior, π(θ; N), has the asymptotically smallest constant
+risk up to O(N−2) among all priors with the form (C1).
+
+Proof. First, we consider the prior, φ(θ; N), constructed as:
+
+\[ \frac{\varphi(\theta; N)}{|g^{X}(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log \tilde{f}(\theta)\}. \]
+
+From Theorem 1, the Kullback–Leibler risk, $R(\theta, \hat{q}_{\varphi}(y|x^{(N)}))$, based on the prior, φ(θ; N), is given by:
+
+\[ R(\theta, \hat{q}_{\varphi}(y|x^{(N)})) = \frac{1}{2N} g^{X,ij}(\theta_{\max}) g^{Y}_{ij}(\theta_{\max}) + o(N^{-1}). \tag{5} \]
+
+This is constant up to $o(N^{-1})$.
+Suppose that there exists another prior, ϕ(θ; N), constructed as:
+
+\[ \frac{\phi(\theta; N)}{|g^{X}(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log f(\theta)\}, \]
+
+and the Bayesian predictive density based on the prior, ϕ(θ; N), has the asymptotically constant risk:
+
+\[ R(\theta, \hat{q}_{\phi}(y|x^{(N)})) = \frac{k}{2N} + o(N^{-1}). \]
+
+From Theorem 1, the prior ϕ(θ; N) must satisfy the equation:
+
+\[ \mathring{g}^{ij}(\theta)\, \partial_i \log f(\theta)\, \partial_j \log f(\theta) = k - g^{X,ij}(\theta) g^{Y}_{ij}(\theta). \]
+
+The left-hand side of the above equation is non-negative, because the matrix, $\mathring{g}^{ij}(\theta)$, is positive-definite.
+Hence, the infimum of the constant, k, is equal to $g^{X,ij}(\theta_{\max}) g^{Y}_{ij}(\theta_{\max})$. From (5), the $N^{-1}$-order term
+of the risk based on the prior, φ(θ; N), achieves the infimum, $g^{X,ij}(\theta_{\max}) g^{Y}_{ij}(\theta_{\max})$. Thus, the Bayesian
+predictive density based on the prior, φ(θ; N), has the asymptotically smallest constant risk up to
+o(N−1).
+Second, we consider the prior, π(θ; N), constructed as:
+
+\[ \frac{\pi(\theta; N)}{|g^{X}(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log \tilde{f}(\theta) + \log \tilde{h}(\theta)\}. \]
+
+The above argument ensures that the prior, π(θ; N), has the asymptotically smallest constant risk up
+to o(N−1). Thus, we only have to check if the N−3/2-order term of the risk is the smallest constant.
+From (2), the N−3/2-order term of the risk at the point, θmax, is unchanged by the choice of the
+scalar function, log h(θ). In other words, the constant N−3/2-order term must agree with the quantity,
+˚gij(θmax)∂ij log ˜f (θmax). From Theorem 1, if we choose the prior, π(θ; N), the N−3/2-order term of the
+risk is the smallest constant, and it agrees with the quantity, ˚gij(θmax)∂ij log ˜f (θmax). Thus, the prior,
+π(θ; N), has the asymptotically smallest constant risk up to O(N−2).
+
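+Remark 2. In Theorem 2, we choose $\log \tilde{f}(\theta)$, satisfying Condition (C2) among the solutions of (A1). We
+consider the model with a one-dimensional parameter, θ. There are four possibilities to the solutions of (A1):
+
+\[
+\sqrt{\mathring{g}^{\theta\theta}(\theta)}\, \partial_\theta \log \tilde{f}(\theta) =
+\begin{cases}
+\pm \sqrt{g^{X,\theta\theta}(\theta_{\max}) g^{Y}_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta) g^{Y}_{\theta\theta}(\theta)} & \text{if } \theta \le \theta_{\max}, \\
+\pm \sqrt{g^{X,\theta\theta}(\theta_{\max}) g^{Y}_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta) g^{Y}_{\theta\theta}(\theta)} & \text{if } \theta \ge \theta_{\max},
+\end{cases}
+\]
+
+where the double-sign corresponds. From the concavity around $\theta_{\max}$ as suggested by (C2), we choose $\log \tilde{f}(\theta)$
+as the solution of the following equation:
+
+\[
+\sqrt{\mathring{g}^{\theta\theta}(\theta)}\, \partial_\theta \log \tilde{f}(\theta) =
+\begin{cases}
+\sqrt{g^{X,\theta\theta}(\theta_{\max}) g^{Y}_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta) g^{Y}_{\theta\theta}(\theta)} & \text{if } \theta \le \theta_{\max}, \\
+- \sqrt{g^{X,\theta\theta}(\theta_{\max}) g^{Y}_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta) g^{Y}_{\theta\theta}(\theta)} & \text{if } \theta \ge \theta_{\max}.
+\end{cases}
+\tag{6}
+\]
+
+Integrating both sides of Equation (6), the unique function, $\log \tilde{f}(\theta)$, is obtained.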
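+
+The integration of Equation (6) can be sketched numerically. The example below assumes the binomial setting used later in the paper (inverse Fisher information θ(1 − θ) for the data, unit Fisher information for the target, maximum point 1/2) and integrates the right-hand side with the trapezoidal rule, normalizing log f to vanish at 1/2. The function names, the step count and the test points are illustrative choices rather than part of the paper.

```python
import math


def dlogf(theta):
    """Derivative of log f implied by Eq. (6) in the assumed binomial setting."""
    trace = theta * (1 - theta)            # g^{X,tt}(theta) * g^Y_{tt}(theta), with g^Y = 1
    gap = max(0.25 - trace, 0.0)           # trace at theta_max = 1/2 minus trace at theta
    sign = 1.0 if theta <= 0.5 else -1.0   # sign choice of Eq. (6)
    sqrt_g_ring = theta * (1 - theta)      # square root of (g^{X,tt})^2 * g^Y
    return sign * math.sqrt(gap) / sqrt_g_ring


def logf(theta, steps=20000):
    """Trapezoidal integral of dlogf from 1/2 to theta, so that logf(1/2) = 0."""
    h = (theta - 0.5) / steps
    total = 0.5 * (dlogf(0.5) + dlogf(theta))
    for i in range(1, steps):
        total += dlogf(0.5 + i * h)
    return total * h


def closed_form(theta):
    """(1/2) log{theta (1 - theta)}, shifted to vanish at theta = 1/2."""
    return 0.5 * math.log(theta * (1 - theta)) - 0.5 * math.log(0.25)


for theta in (0.2, 0.4, 0.8):
    print(theta, logf(theta), closed_form(theta))
```

+Up to the additive constant fixed by the normalization, the numerical integral reproduces (1/2) log{θ(1 − θ)}, the solution quoted for the binomial example.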
+
+Remark 3. Compare the Kullback–Leibler risk based on the asymptotically constant-risk prior, π(θ; N), with
+that based on a prior, λ(θ), independent of the sample size, N. From Theorems 1 and 2, the
+Kullback–Leibler risk based on the asymptotically constant-risk prior, π(θ; N), is given as:
+
+\[ R(\theta, \hat{q}_{\pi}(y|x^{(N)})) = \frac{1}{2N} g^{X,ij}(\theta_{\max}) g^{Y}_{ij}(\theta_{\max}) + \frac{1}{N\sqrt{N}} \mathring{g}^{ij}(\theta_{\max})\, \partial_{ij} \log \tilde{f}(\theta_{\max}) + O(N^{-2}). \tag{7} \]
+
+In contrast, the Kullback–Leibler risk based on the prior, λ(θ), is given as:
+
+\[ R(\theta, \hat{q}_{\lambda}(y|x^{(N)})) = \frac{1}{2N} g^{X,ij}(\theta) g^{Y}_{ij}(\theta) + O(N^{-2}). \tag{8} \]
+
+The $N^{-1}$-order term in (8) is not greater than the $N^{-1}$-order term in (7); however, (8) has no $N^{-3/2}$-order term,
+whereas the $N^{-3/2}$-order term in (7) is negative. Thus, the maximum of the risk based on the asymptotically
+constant-risk prior, π(θ; N), is smaller than that of the risk based on the prior, λ(θ). This result is consistent
+with the minimaxity of selecting the prior that constructs the predictive density with the smallest maximum of
+the risk.
+
+4. Subminimax Estimator Problem Based on the Mean Squared Error
+
+In this section, we refer to the subminimax estimator problem based on the mean squared error,
+from the viewpoint of the prediction where the distributions of data and target variables are different
+
+and have a common unknown parameter. First, we give a brief review of subminimax estimator
+problem through the binomial example.
+
+Example 1. Let us consider the binomial estimation based on the mean squared error, $R_{\mathrm{MSE}}(\theta, \hat{\theta})$. For any
+finite sample size, N, the Bayes estimator, $\hat{\theta}_{\pi}$, based on the Beta prior, $\pi(\theta; N) \propto \theta^{\sqrt{N}/2 - 1} (1 - \theta)^{\sqrt{N}/2 - 1}$, is
+minimax under the mean squared error. The mean squared error of the minimax Bayes estimator, $\hat{\theta}_{\pi}$, is given by:
+
+\[ R_{\mathrm{MSE}}(\theta, \hat{\theta}_{\pi}) = \frac{N}{4(\sqrt{N} + N)^2} = \frac{1}{4N} - \frac{1}{2N\sqrt{N}} + O(N^{-2}). \tag{9} \]
+
+In contrast, the mean squared error of the maximum likelihood estimator, $\hat{\theta}_{\mathrm{MLE}}$, is given by:
+
+\[ R_{\mathrm{MSE}}(\theta, \hat{\theta}_{\mathrm{MLE}}) = \frac{\theta(1 - \theta)}{N}. \]
+
+We compare the two estimators, ˆθπ and ˆθMLE. In the comparison of the N−1-order terms of the mean
+squared errors, it seems that the maximum likelihood estimator, ˆθMLE, dominates the minimax Bayes estimator,
+ˆθπ. In other words, the N−1-order term of RMSE(θ, ˆθMLE) is not greater than that of RMSE(θ, ˆθπ) for every
+θ ∈ Θ, and the equality holds when θ = 1/2. This seeming paradox is known as the subminimax estimator
+problem; see [14,17,18] for details. See also [15] for the conditions that such problems do not occur in estimation.
+However, this paradox does not mean the inferiority of the minimax Bayes estimator. This is because,
+although the mean squared error of the minimax Bayes estimator, ˆθπ, has the negative N−3/2-order term, the
+mean squared error of the maximum likelihood estimator, ˆθMLE, does not have the N−3/2-order term. Hence, in
+comparison to the mean squared errors up to O(N−2), the maximum of the mean squared error, RMSE(θ, ˆθπ), is
+below the maximum of the mean squared error, RMSE(θ, ˆθMLE).
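+
+The comparison in Example 1 can be checked exactly for a finite sample size. The sketch below sums over the binomial distribution to obtain the mean squared error of the posterior-mean estimator under the Beta(√N/2, √N/2) prior and of the maximum likelihood estimator. The function names and the choice N = 100 are illustrative assumptions, and the Bayes estimator is taken to be the posterior mean, which is the Bayes estimator under squared error loss.

```python
import math


def exact_mse(estimator, n, theta):
    """Exact mean squared error of estimator(x, n) when x ~ Bin(n, theta)."""
    return sum(
        math.comb(n, x) * theta ** x * (1 - theta) ** (n - x)
        * (estimator(x, n) - theta) ** 2
        for x in range(n + 1)
    )


def minimax_bayes(x, n):
    """Posterior mean under the Beta(sqrt(n)/2, sqrt(n)/2) prior."""
    root = math.sqrt(n)
    return (x + root / 2) / (n + root)


def mle(x, n):
    """Maximum likelihood estimator x / n."""
    return x / n


n = 100
constant_risk = n / (4 * (math.sqrt(n) + n) ** 2)   # left-hand side of Eq. (9)
for theta in (0.1, 0.3, 0.5):
    print(theta, exact_mse(minimax_bayes, n, theta), exact_mse(mle, n, theta))
print("constant risk:", constant_risk)
```

+For every θ the Bayes estimator attains the same risk N/{4(√N + N)²}, while the risk of the maximum likelihood estimator varies with θ and exceeds that constant at θ = 1/2.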
+
+Next, we construct the asymptotically constant-risk prior in the estimation based on the mean
+squared error when the subminimax estimator problem occurs, from the viewpoint of the prediction.
+We consider the priors, π(θ; N), satisfying (C1). From Lemma A3 in the Appendix, the mean squared
+error of the Bayes estimator, ˆθπ, is equal to the Kullback–Leibler risk of the ˆθπ-plugin predictive density,
+q(y| ˆθπ), by assuming that the target variable, y, is a d-dimensional Gaussian random variable with the
+mean vector, θ, and unit variance. Note that $g^{Y}_{ij}(\theta) = \delta_{ij}$, $\overset{m}{\Gamma}{}^{Y}_{ij,k} = 0$ and $\overset{e}{\Gamma}{}^{Y}_{ij,k} = 0$ for i, j, k = 1, · · · , d.
+Thus, if $g^{Y}_{ij}(\theta) g^{X,ij}(\theta) = \sum_{i=1}^{d} g^{X,ii}(\theta)$ has a unique maximum point, we obtain the asymptotically
+constant-risk prior, π(θ; N), up to O(N−2) from Lemma A2 in the Appendix and Theorem 2.
+Finally, we compare the mean squared error of the asymptotically constant-risk Bayes estimator,
+ˆθπ, with that of the maximum likelihood estimator, ˆθMLE. The mean squared error of the asymptotically
+constant-risk Bayes estimator, ˆθπ, is given as:
+
+\[ R_{\mathrm{MSE}}(\theta, \hat{\theta}_{\pi}) = \frac{1}{N} \sum_{i=1}^{d} g^{X,ii}(\theta_{\max}) + \frac{2}{N\sqrt{N}} \sum_{k=1}^{d} g^{X,ik}(\theta_{\max}) g^{X,jk}(\theta_{\max})\, \partial_{ij} \log \tilde{f}(\theta_{\max}) + O(N^{-2}). \]
+
+In contrast, the mean squared error of the maximum likelihood estimator, $\hat{\theta}_{\mathrm{MLE}}$, is given as:
+
+\[ R_{\mathrm{MSE}}(\theta, \hat{\theta}_{\mathrm{MLE}}) = \frac{1}{N} \sum_{i=1}^{d} g^{X,ii}(\theta) + O(N^{-2}). \]
+
+See [16,19].
+Thus, the maximum of the mean squared error of the asymptotically constant-risk Bayes estimator
+is smaller than that of the maximum likelihood estimator, with an improvement of order $N^{-3/2}$ proportional to the Hessian
+of the scalar function, $\log \tilde{f}(\theta)$, at $\theta_{\max}$. In the prediction where the trace, $g^{X,ij}(\theta) g^{Y}_{ij}(\theta)$, has a unique
+maximum point, the same improvement holds (Remark 3).
+
+Example 2. Using the above results, we consider the binomial estimation based on the mean squared error from
+the viewpoint of the prediction. The geometrical quantities to be used are given by:
+
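+\[ g^{X}_{\theta\theta}(\theta) = \frac{1}{\theta(1 - \theta)}, \quad g^{Y}_{\theta\theta}(\theta) = 1, \quad \overset{m}{\Gamma}{}^{X}_{\theta\theta,\theta}(\theta) = 0, \quad \overset{m}{\Gamma}{}^{Y}_{\theta\theta,\theta}(\theta) = 0, \]
+\[ \overset{e}{\Gamma}{}^{X}_{\theta\theta,\theta}(\theta) = -\frac{1 - 2\theta}{\theta^2 (1 - \theta)^2}, \quad \overset{e}{\Gamma}{}^{Y}_{\theta\theta,\theta}(\theta) = 0, \quad T^{X}_{\theta\theta\theta}(\theta) = \frac{1 - 2\theta}{\theta^2 (1 - \theta)^2}, \quad \text{and} \quad T^{Y}_{\theta\theta\theta}(\theta) = 0, \]
+
+respectively. Since $\overset{m}{\Gamma}{}^{X,\theta}_{\theta\theta}$, $\overset{m}{\Gamma}{}^{Y,\theta}_{\theta\theta}$ and $T^{Y}_{\theta\theta\theta}$ vanish, the asymptotically constant-risk prior in the estimation is
+identical to the asymptotically constant-risk prior in the prediction; compare Theorem 1 with the expansion of
+$g^{Y}_{ij}(\theta) E_{X^{(N)}}[(\hat{\theta}^{i}_{\pi} - \theta^{i})(\hat{\theta}^{j}_{\pi} - \theta^{j})]$ in Lemma A2 in the Appendix.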
+In this example, Equation (3) is given by:
+
+\[ \theta^2 (1 - \theta)^2 \{\partial_\theta \log \tilde{f}(\theta)\}^2 = \frac{1}{4} - \theta(1 - \theta), \]
+
+and the solution, $\log \tilde{f}(\theta)$, is (1/2) log{θ(1 − θ)}. Here, the second-order derivative of the function, $\log \tilde{f}(\theta)$,
+is given by:
+
+\[ \partial_{\theta\theta} \log \tilde{f}(\theta) = -\frac{1 - 2\theta + 2\theta^2}{2\theta^2 (1 - \theta)^2}. \]
+
+From this, Equation (4) is given by:
+
+\[ \frac{1}{2} \theta(1 - \theta)(1 - 2\theta)\, \partial_\theta \log \tilde{h}(\theta) + \theta^2 - \theta = -\frac{1}{4}, \]
+
+and the solution, $\log \tilde{h}(\theta)$, is −(1/2) log{θ(1 − θ)}. Hence, the asymptotically constant-risk prior, π(θ; N), is a
+Beta prior with the parameters, $\alpha = \sqrt{N}/2$ and $\beta = \sqrt{N}/2$. Note that the asymptotically constant-risk prior
+coincides with the exact minimax prior. Since $\theta_{\max} = 1/2$, $g^{X,\theta\theta}(\theta_{\max}) = 1/4$ and $g^{X,\theta\theta}(\theta_{\max})\, \partial_{\theta\theta} \log \tilde{f}(\theta_{\max}) = -1$, the
+mean squared error of the asymptotically constant-risk Bayes estimator, $\hat{\theta}_{\pi}$, agrees with (9) up to O(N−2).
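+
+A quick numerical check of Example 2 is sketched below: it confirms by finite differences that (1/2) log{θ(1 − θ)} satisfies the eikonal equation of this example, and that the resulting $N^{-3/2}$ coefficient of the mean squared error matches the −1/2 appearing in (9). The helper names `d1` and `d2`, the step sizes and the test points are illustrative assumptions.

```python
import math


def logf(theta):
    """Candidate solution (1/2) log{theta (1 - theta)} of the eikonal equation."""
    return 0.5 * math.log(theta * (1 - theta))


def d1(fn, x, h=1e-5):
    """Central finite-difference first derivative."""
    return (fn(x + h) - fn(x - h)) / (2 * h)


def d2(fn, x, h=1e-4):
    """Central finite-difference second derivative."""
    return (fn(x + h) - 2 * fn(x) + fn(x - h)) / h ** 2


def eikonal_residual(theta):
    """theta^2 (1-theta)^2 (d log f)^2 minus {1/4 - theta (1 - theta)}."""
    lhs = theta ** 2 * (1 - theta) ** 2 * d1(logf, theta) ** 2
    rhs = 0.25 - theta * (1 - theta)
    return lhs - rhs


g_inv_at_max = 0.25                                   # theta (1 - theta) at theta = 1/2
coefficient = 2 * g_inv_at_max ** 2 * d2(logf, 0.5)   # N^{-3/2} coefficient of the MSE

for theta in (0.1, 0.3, 0.5, 0.7):
    print(theta, eikonal_residual(theta))
print("N^{-3/2} coefficient:", coefficient)
```

+The residuals vanish up to discretization error, and the coefficient is close to −1/2, consistent with the expansion 1/(4N) − 1/(2N√N) in (9).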
+
+5. Application to the Prediction of the Binary Regression Model under the Covariate Shift
+
+In this section, we construct asymptotically constant-risk priors in the prediction based on the
+binary regression model under the covariate shift; see [10].
+We consider that we predict a binary response variable, y, based on the binary response variables,
+x(N). We assume that the target variable, y, and the data, x(N), follow the logistic regression models
+with the same parameter, β, given by:
+
+\[ \log \frac{\Pi_x}{1 - \Pi_x} = \alpha + z\beta \]
+
+and:
+
+\[ \log \frac{\Pi_y}{1 - \Pi_y} = \tilde{\alpha} + \tilde{z}\beta, \]
+
+where Πx is the success probability of the data and Πy is the success probability of the target variable.
+Let α and ˜α denote known constant terms, and let β denote the common unknown parameter. Further,
+we assume that the covariates, z and ˜z, are different.
+Using the parameter θ = Πx, we convert this predictive setting to binomial prediction where the
+data, x, and the target variable, y, are distributed according to:
+
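+\[
+p(x|\theta) :=
+\begin{cases}
+\theta & \text{if } x = 1, \\
+1 - \theta & \text{if } x = 0,
+\end{cases}
+\]
+
+and:
+
+\[
+q(y|\theta) :=
+\begin{cases}
+\dfrac{e^{\tilde{\alpha} - \tilde{z} z^{-1} \alpha}\, \theta^{\tilde{z} z^{-1}}}{(1 - \theta)^{\tilde{z} z^{-1}} + e^{\tilde{\alpha} - \tilde{z} z^{-1} \alpha}\, \theta^{\tilde{z} z^{-1}}} & \text{if } y = 1, \\[2ex]
+\dfrac{(1 - \theta)^{\tilde{z} z^{-1}}}{(1 - \theta)^{\tilde{z} z^{-1}} + e^{\tilde{\alpha} - \tilde{z} z^{-1} \alpha}\, \theta^{\tilde{z} z^{-1}}} & \text{if } y = 0,
+\end{cases}
+\]
+
+respectively. We obtain the Fisher information for x and y as:
+
+\[ g^{X}_{\theta\theta}(\theta) = \frac{1}{\theta(1 - \theta)} \]
+
+and:
+
+\[ g^{Y}_{\theta\theta}(\theta) = \left( \frac{\tilde{z}}{z} \right)^{2} \frac{e^{-\tilde{\alpha} + \tilde{z} z^{-1} \alpha}\, (1 - \theta)^{\tilde{z} z^{-1} - 2}\, \theta^{\tilde{z} z^{-1} - 2}}{\left\{ \theta^{\tilde{z} z^{-1}} + e^{-\tilde{\alpha} + \tilde{z} z^{-1} \alpha} (1 - \theta)^{\tilde{z} z^{-1}} \right\}^{2}}, \]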
+respectively.
+For simplicity, we consider the setting where z = 1, ˜z = 2 and α = ˜α = 0. The geometrical
+quantities for the model, {p(x|θ) : θ ∈ Θ}, are given by:
+
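+\[ g^{X}_{\theta\theta}(\theta) = \frac{1}{\theta(1 - \theta)}, \quad \overset{m}{\Gamma}{}^{X}_{\theta\theta,\theta}(\theta) = 0, \quad \overset{e}{\Gamma}{}^{X}_{\theta\theta,\theta}(\theta) = -\frac{1 - 2\theta}{\theta^2 (1 - \theta)^2}, \quad \text{and} \quad T^{X}_{\theta\theta\theta}(\theta) = \frac{1 - 2\theta}{\theta^2 (1 - \theta)^2}, \]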
+
+respectively. In the same manner, the geometrical quantities for the model, {q(y|θ) : θ ∈ Θ}, are
+given by:
+
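+\[ g^{Y}_{\theta\theta}(\theta) = \frac{4}{\{(1 - \theta)^2 + \theta^2\}^2}, \quad \overset{m}{\Gamma}{}^{Y}_{\theta\theta,\theta}(\theta) = \frac{4 (1 - 2\theta)(1 + 2\theta - 2\theta^2)}{\theta(1 - \theta)\{(1 - \theta)^2 + \theta^2\}^3}, \]
+\[ \overset{e}{\Gamma}{}^{Y}_{\theta\theta,\theta}(\theta) = -\frac{4 (1 - 2\theta)}{\theta(1 - \theta)\{(1 - \theta)^2 + \theta^2\}^2}, \quad \text{and} \quad T^{Y}_{\theta\theta\theta}(\theta) = \frac{8 (1 - 2\theta)}{\theta(1 - \theta)\{(1 - \theta)^2 + \theta^2\}^3}, \]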
+
+respectively.
+Using these quantities, Equation (3) is given by:
+
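+\[ \frac{4 \theta^2 (1 - \theta)^2}{\{\theta^2 + (1 - \theta)^2\}^2} \bigl( \partial_\theta \log \tilde{f}(\theta) \bigr)^2 = 4 - \frac{4 \theta(1 - \theta)}{\{\theta^2 + (1 - \theta)^2\}^2}. \]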
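+
+The Fisher information of the target variable in this setting (z = 1, z̃ = 2, α = α̃ = 0) can be checked numerically, together with the location and value of the maximum of the trace. The sketch below assumes the success probability θ²/{θ² + (1 − θ)²} stated in the figure captions; the function names, the finite-difference step and the grid are illustrative choices.

```python
def pi_y(theta):
    """Success probability of the target variable: theta^2 / {theta^2 + (1-theta)^2}."""
    return theta ** 2 / (theta ** 2 + (1 - theta) ** 2)


def fisher_y_numeric(theta, h=1e-6):
    """Fisher information of Bernoulli(pi_y(theta)) via a central difference."""
    dpi = (pi_y(theta + h) - pi_y(theta - h)) / (2 * h)
    p = pi_y(theta)
    return dpi ** 2 / (p * (1 - p))


def fisher_y_closed(theta):
    """Closed form 4 / {(1 - theta)^2 + theta^2}^2 quoted in the text."""
    return 4 / ((1 - theta) ** 2 + theta ** 2) ** 2


def trace(theta):
    """Inverse Fisher information of the data times Fisher information of the target."""
    return theta * (1 - theta) * fisher_y_closed(theta)


grid = [i / 1000 for i in range(1, 1000)]
theta_max = max(grid, key=trace)
print("theta_max:", theta_max, "trace at theta_max:", trace(theta_max))
for theta in (0.2, 0.5, 0.7):
    print(theta, fisher_y_numeric(theta), fisher_y_closed(theta))
```

+On this grid the trace is maximized at θ = 1/2 with value 4, which gives the leading risk term 4/(2N) = 2/N and the constant 4 on the right-hand side of the eikonal equation above.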
+
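+By noting that the maximum point of $g^{X,\theta\theta}(\theta) g^{Y}_{\theta\theta}(\theta)$ is 1/2, the solution, $\log \tilde{f}(\theta)$, of this equation is
+given by:
+
+\[ \log \tilde{f}(\theta) = 2\sqrt{1 - \theta + \theta^2} + \log\{\theta(1 - \theta)\} - \log\bigl(2 - \theta + 2\sqrt{1 - \theta + \theta^2}\bigr) - \log\bigl(1 + \theta + 2\sqrt{1 - \theta + \theta^2}\bigr). \]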
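+
+The closed-form solution above can be checked against the eikonal equation of this example by finite differences. The sketch below also evaluates the second derivative at θ = 1/2, which, multiplied by the inverse predictive metric there (equal to one in this setting), should give the $N^{-3/2}$ coefficient of the risk. The helper names, step sizes and test points are illustrative assumptions.

```python
import math


def logf(theta):
    """Closed-form log f quoted in the text for z = 1, z~ = 2, alpha = alpha~ = 0."""
    s = math.sqrt(1 - theta + theta ** 2)
    return (2 * s + math.log(theta * (1 - theta))
            - math.log(2 - theta + 2 * s)
            - math.log(1 + theta + 2 * s))


def d1(fn, x, h=1e-5):
    """Central finite-difference first derivative."""
    return (fn(x + h) - fn(x - h)) / (2 * h)


def d2(fn, x, h=1e-4):
    """Central finite-difference second derivative."""
    return (fn(x + h) - 2 * fn(x) + fn(x - h)) / h ** 2


def eikonal_residual(theta):
    """Left-hand side minus right-hand side of the eikonal equation of this example."""
    d = theta ** 2 + (1 - theta) ** 2
    g_ring_inv = 4 * theta ** 2 * (1 - theta) ** 2 / d ** 2
    rhs = 4 - 4 * theta * (1 - theta) / d ** 2
    return g_ring_inv * d1(logf, theta) ** 2 - rhs


hessian_at_max = d2(logf, 0.5)
for theta in (0.1, 0.25, 0.5, 0.8):
    print(theta, eikonal_residual(theta))
print("second derivative at 1/2:", hessian_at_max, "vs", -4 * math.sqrt(3))
```

+The second derivative at 1/2 is close to −4√3, and the derivative is positive below 1/2 and negative above it, in line with the sign choice of Equation (6).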
+
+Using this solution, we obtain the solution of Equation (4) given by:
+
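+\[
+\begin{aligned}
+\log \tilde{h}(\theta) = \frac{1}{6} \Bigl[
+& -\frac{1}{1 - \theta} - \frac{1}{\theta} - 12\theta(1 - \theta) - 12\sqrt{3}\sqrt{1 - \theta + \theta^2} \\
+& + (3 - 6\sqrt{3})\{\log \theta + \log(1 - \theta)\} - 3\log(1 - \theta + \theta^2) + 10\log\{(1 - \theta)^2 + \theta^2\} \\
+& - 6\log\bigl(\sqrt{3} + 2\sqrt{1 - \theta + \theta^2}\bigr) + 6\sqrt{3}\log\bigl\{1 + (1 - \theta) + 2\sqrt{1 - \theta + \theta^2}\bigr\} \\
+& + 6\sqrt{3}\log\bigl\{1 + \theta + 2\sqrt{1 - \theta + \theta^2}\bigr\} \Bigr].
+\end{aligned}
+\]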
+
+The asymptotically constant-risk priors for the different sample sizes are shown in Figure 1. The prior
+weight is found to be more concentrated to 1/2 as the sample size, N, grows.
+In this example, we obtain the Kullback–Leibler risk of the Bayesian predictive density based on
+the asymptotically constant-risk prior, π(θ; N), as:
+
+\[ R(\theta, \hat{q}_{\pi}(y|x^{(N)})) = \frac{2}{N} - \frac{4\sqrt{3}}{N\sqrt{N}} + O(N^{-2}). \]
+
+We compare this value with the Bayes risk calculated using the Monte Carlo simulation; see Figure 2.
+As the sample size, N, grows, the difference appears negligible. Further, we compare this value with
+the risk itself calculated by the Monte Carlo simulation; see Figure 3. As the sample size, N, grows, the
+risk becomes more constant.
+
+Figure 1. Asymptotically constant-risk prior in the prediction where the data are distributed according
+to the binomial distribution, Bin(N, θ), and the target variable is distributed according to the binomial
+distribution, Bin(1, θ2/(θ2 + (1 − θ)2)).
+
+
+Figure 2. Bayes risk based on the asymptotically constant-risk prior in the prediction where the data
+are distributed according to the binomial distribution, Bin(N, θ), and the target variable is distributed
+according to the binomial distribution, Bin(1, θ2/(θ2 + (1 − θ)2)).
+
+Figure 3. Comparison of the Kullback–Leibler risk calculated using the Monte Carlo simulations and
+the asymptotic risk, $2/N - 4\sqrt{3}/(N\sqrt{N})$, in the prediction where the data are distributed according
+to the binomial distribution, Bin(N, θ), and the target variable is distributed according to the binomial
+distribution, Bin(1, θ2/(θ2 + (1 − θ)2)).
+
+6. Discussion and Conclusions
+
+We have considered the setting where the quantity, $g^{X,ij}(\theta) g^{Y}_{ij}(\theta)$ (the trace of the product of the
+inverse Fisher information matrix, $g^{X,ij}(\theta)$, and the Fisher information matrix, $g^{Y}_{ij}(\theta)$), has a unique
+maximum point, and we have investigated the asymptotically constant-risk prior in the sense that the
+asymptotic risk is constant up to O(N−2).
+In Section 3, we have considered the prior depending on the sample size, N, and constructed
+the asymptotically constant-risk prior using Equations (3) and (4). In Section 4, we have clarified the
+relationship between the subminimax estimator problem based on the mean squared error and the
+prediction where the distributions of data and target variables are different. In Section 5, we have
+constructed the asymptotically constant-risk prior in the prediction based on the logistic regression
+model under the covariate shift.
+We have assumed that the trace, gX,ij(θ)gY
+ij(θ), is finite. However, the trace may diverge in the
+non-compact parameter space; for example, it diverges under the predictive setting, where the
+distribution, q(y|θ), of the target variable is the Poisson distribution and the data distribution, p(x|θ),
+is the exponential distribution, with Θ equivalent to R. Therefore, for our future work, in such a
+setting, we should adopt criteria other than minimaxity.
+
+Acknowledgments: The authors thank the referees for their helpful comments. This research was partially
+supported by a Grant-in-Aid for Scientific Research (23650144, 26280005).
+
+Author Contributions: Both authors contributed to the research and writing of this paper. Both authors read and
+approved the final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+Appendix A
+
+We prove Theorem 1. First, we introduce some lemmas for the proof. The expansion proceeds in
+six steps (the first five are stated as lemmas): the first is to
+expand the MAP estimator; the second is to calculate its bias and mean squared error; the third
+is to expand the Kullback–Leibler risk of the $\hat{\theta}_{\pi}$-plugin predictive density, $q(y|\hat{\theta}_{\pi})$; the fourth is to
+expand the Bayesian predictive density based on the prior π(θ; N); the fifth is to expand the Bayesian
+estimator minimizing the Bayes risk; and the last is to prove Theorem 1 using these lemmas.
+We use some additional notations for the expansion. Let $\hat{\theta}_{\pi}$ be the maximum point of the
+scalar function $\log p(x^{(N)}|\theta) + \log\{\pi(\theta; N)/|g^{X}(\theta)|^{1/2}\}$. Let $l(\theta|x^{(N)})$ denote the log likelihood of
+the data, $x^{(N)}$. Let $l_{ij}(\theta|x^{(N)})$, $l_{ijk}(\theta|x^{(N)})$ and $l_{ijkl}(\theta|x^{(N)})$ be the derivatives of order 2, 3 and 4 of the
+log likelihood, $l(\theta|x^{(N)})$. Let $H_{ij}(\theta|x^{(N)})$ denote the quantity, $l_{ij}(\theta|x^{(N)}) + N g^{X}_{ij}(\theta)$. Let $\tilde{l}_{i}(\theta|x^{(N)})$ and
+$\tilde{H}_{ij}(\theta|x^{(N)})$ denote $(1/\sqrt{N})\, l_{i}(\theta|x^{(N)})$ and $(1/\sqrt{N})\, H_{ij}(\theta|x^{(N)})$, respectively. In addition, the brackets ( )
+denote symmetrization: for any two tensors, $a_{ij}$ and $b_{ij}$, $a_{i(j} b_{k)l}$ denotes $(a_{ij} b_{kl} + a_{ik} b_{jl})/2$.
+
+Lemma A1. Let ˆθπ be the maximum point of log p(x(N)|θ) + log{π(θ; N)/|gX(θ)|1/2}. Then, the i-th
+component of this estimator ˆθπ is expanded as follows:
+
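+\[
+\begin{aligned}
+\hat{\theta}^{i}_{\pi}
+={}& \theta^{i} + \frac{1}{\sqrt{N}} g^{X,ik}(\theta)\, \tilde{l}_{k}(\theta|x^{(N)}) + \frac{1}{\sqrt{N}} g^{X,ik}(\theta)\, \partial_k \log f(\theta) \\
+& + \frac{1}{N} g^{X,ik}(\theta) \tilde{H}_{km}(\theta|x^{(N)}) g^{X,mr}(\theta)\, \tilde{l}_{r}(\theta|x^{(N)}) \\
+& + \frac{1}{2N} g^{X,ik}(\theta) L^{X}_{kmr}(\theta) g^{X,mq}(\theta) g^{X,rs}(\theta)\, \tilde{l}_{q}(\theta|x^{(N)})\, \tilde{l}_{s}(\theta|x^{(N)}) \\
+& + \frac{1}{N} g^{X,ik}(\theta) \tilde{H}_{km}(\theta|x^{(N)}) g^{X,mr}(\theta)\, \partial_r \log f(\theta) \\
+& + \frac{1}{N} g^{X,ik}(\theta) L^{X}_{kmr}(\theta) g^{X,mq}(\theta) g^{X,rs}(\theta)\, \tilde{l}_{q}(\theta|x^{(N)})\, \partial_s \log f(\theta) \\
+& + \frac{1}{2N} g^{X,ik}(\theta) L^{X}_{kmr}(\theta) g^{X,mq}(\theta) g^{X,rs}(\theta)\, \partial_q \log f(\theta)\, \partial_s \log f(\theta) \\
+& + \frac{1}{N} g^{X,ik}(\theta) g^{X,mq}(\theta)\, \partial_{km} \log f(\theta)\, \tilde{l}_{q}(\theta|x^{(N)}) \\
+& + \frac{1}{N} g^{X,ik}(\theta) g^{X,mq}(\theta)\, \partial_{km} \log f(\theta)\, \partial_q \log f(\theta) \\
+& + \frac{1}{N} g^{X,ik}(\theta)\, \partial_k \log h(\theta) + O_P(N^{-3/2}).
+\end{aligned}
+\tag{A1}
+\]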
+
+Proof. By the definition of ˆθπ, we get the equation given by:
+
+\[ \partial_i \log p(x^{(N)}|\hat{\theta}_{\pi}) + \partial_i \log \frac{\pi(\hat{\theta}_{\pi}; N)}{|g^{X}(\hat{\theta}_{\pi})|^{1/2}} = 0. \]
+
+From our assumption that prior π(θ; N) has the form given by:
+
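+\[ \frac{\pi(\theta; N)}{|g^{X}(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log f(\theta) + \log h(\theta)\}, \]
+
+we rewrite this equation as:
+
+\[ \partial_i \log p(x^{(N)}|\hat{\theta}_{\pi}) + \sqrt{N}\, \partial_i \log f(\hat{\theta}_{\pi}) + \partial_i \log h(\hat{\theta}_{\pi}) = 0. \]
+
+By applying Taylor expansion around θ to this new equation, we derive the following expansion:
+
+\[
+\begin{aligned}
+& \partial_i \log p(x^{(N)}|\theta) + \{\partial_{ij} \log p(x^{(N)}|\theta)\}(\hat{\theta}^{j}_{\pi} - \theta^{j})
+ + \frac{1}{2} \{\partial_{ijk} \log p(x^{(N)}|\theta)\}(\hat{\theta}^{j}_{\pi} - \theta^{j})(\hat{\theta}^{k}_{\pi} - \theta^{k}) \\
+& \quad + \sqrt{N}\, \partial_i \log f(\theta) + \sqrt{N} \{\partial_{ij} \log f(\theta)\}(\hat{\theta}^{j}_{\pi} - \theta^{j}) + \partial_i \log h(\theta) + o_P(1) = 0.
+\end{aligned}
+\]
+
+From the law of large numbers and the central limit theorem, we rewrite the above expansion as:
+
+\[
+\begin{aligned}
+N g^{X}_{ij}(\theta)(\hat{\theta}^{j}_{\pi} - \theta^{j})
+={}& \partial_i \log p(x^{(N)}|\theta) + \sqrt{N}\, \partial_i \log f(\theta) + H_{ij}(\theta|x^{(N)})(\hat{\theta}^{j}_{\pi} - \theta^{j}) \\
+& + \frac{N}{2} L^{X}_{ijk}(\theta)(\hat{\theta}^{j}_{\pi} - \theta^{j})(\hat{\theta}^{k}_{\pi} - \theta^{k}) + \sqrt{N}\, \partial_{ij} \log f(\theta)(\hat{\theta}^{j}_{\pi} - \theta^{j}) \\
+& + \partial_i \log h(\theta) + o_P(1).
+\end{aligned}
+\tag{A2}
+\]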
+
+By substituting the deviation, ˆθπ − θ, recursively into Expansion (A2), we obtain Expansion (A1).
+
+Lemma A2. Let ˆθπ be the maximum point of log p(x(N)|θ) + log{π(θ; N)/|gX(θ)|1/2}. Then, the i-th
+component of the bias of the estimator, ˆθπ, is given by:
+
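+\[
+\begin{aligned}
+E_{X^{(N)}}[\hat{\theta}^{i}_{\pi}]
+={}& \theta^{i} + \frac{1}{\sqrt{N}} g^{X,ik}(\theta)\, \partial_k \log f(\theta) - \frac{1}{2N} \overset{m}{\Gamma}{}^{X,i}(\theta) \\
+& + \frac{1}{2N} g^{X,ik}(\theta) g^{X,mq}(\theta) g^{X,rs}(\theta) L^{X}_{kmr}(\theta)\, \partial_q \log f(\theta)\, \partial_s \log f(\theta) \\
+& + \frac{1}{N} g^{X,ik}(\theta) g^{X,mq}(\theta)\, \partial_{km} \log f(\theta)\, \partial_q \log f(\theta) \\
+& + \frac{1}{N} g^{X,ik}(\theta)\, \partial_k \log h(\theta) + O(N^{-3/2}).
+\end{aligned}
+\tag{A3}
+\]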
+
+The (i, j)-component of the mean squared error of ˆθπ is given by:
+
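+\[
+\begin{aligned}
+E_{X^{(N)}}[(\hat{\theta}^{i}_{\pi} - \theta^{i})(\hat{\theta}^{j}_{\pi} - \theta^{j})]
+={}& \frac{1}{N} g^{X,ij}(\theta) + \frac{1}{N} g^{X,ik}(\theta) g^{X,jl}(\theta)\, \partial_k \log f(\theta)\, \partial_l \log f(\theta) \\
+& - \frac{1}{N\sqrt{N}} g^{X,k(i}(\theta) \overset{m}{\Gamma}{}^{X,j)}(\theta)\, \partial_k \log f(\theta) + \frac{2}{N\sqrt{N}} g^{X,k(i}(\theta) g^{X,j)l}(\theta)\, \partial_{kl} \log f(\theta) \\
+& + \frac{2}{N\sqrt{N}} g^{X,k(i}(\theta)\, \partial_k g^{X,j)l}(\theta)\, \partial_l \log f(\theta) \\
+& + \frac{1}{N\sqrt{N}} g^{X,k(i}(\theta) g^{X,j)l}(\theta) g^{X,nr}(\theta) g^{X,pt}(\theta) L^{X}_{lrt}(\theta)\, \partial_k \log f(\theta)\, \partial_n \log f(\theta)\, \partial_p \log f(\theta) \\
+& + \frac{2}{N\sqrt{N}} g^{X,k(i}(\theta) g^{X,j)l}(\theta) g^{X,nr}(\theta)\, \partial_{ln} \log f(\theta)\, \partial_r \log f(\theta)\, \partial_k \log f(\theta) \\
+& + \frac{2}{N\sqrt{N}} g^{X,k(i}(\theta) g^{X,j)l}(\theta)\, \partial_k \log f(\theta)\, \partial_l \log h(\theta) + O(N^{-2}),
+\end{aligned}
+\tag{A4}
+\]
+
+where $g^{X,k(i}(\theta) \overset{m}{\Gamma}{}^{X,j)}(\theta)$ denotes $(1/2)\{g^{X,ki}(\theta) \overset{m}{\Gamma}{}^{X,j}(\theta) + g^{X,kj}(\theta) \overset{m}{\Gamma}{}^{X,i}(\theta)\}$ and $g^{X,k(i}(\theta)\, \partial_k g^{X,j)l}(\theta)$
+denotes $(1/2)\{g^{X,ki}(\theta)\, \partial_k g^{X,jl}(\theta) + g^{X,kj}(\theta)\, \partial_k g^{X,il}(\theta)\}$. The (i, j, k)-component of the mean of the third
+power of the deviation, $\hat{\theta}_{\pi} - \theta$, is given by:
+
+\[
+\begin{aligned}
+E_{X^{(N)}}[(\hat{\theta}^{i}_{\pi} - \theta^{i})(\hat{\theta}^{j}_{\pi} - \theta^{j})(\hat{\theta}^{k}_{\pi} - \theta^{k})]
+={}& \frac{1}{N\sqrt{N}} g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta)\, \partial_s \log f(\theta)\, \partial_t \log f(\theta)\, \partial_u \log f(\theta) \\
+& + \frac{3}{N\sqrt{N}} g^{X,(ij}(\theta) g^{X,k)l}(\theta)\, \partial_l \log f(\theta) + O(N^{-2}).
+\end{aligned}
+\tag{A5}
+\]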
+
+Proof. First, using Lemma A1, we determine the i-th component of the bias of ˆθπ given by:
+
+EX(N)[ ˆθi
+π − θi]
+
+=
+1
+√
+
+N
+gX,ik∂k log f (θ)
+
+− 1
+
+2N
+
+m
+Γ X,i(θ) + 1
+
+2N gX,ik(θ)gX,mq(θ)gX,rs(θ)LX
+kmr(θ)∂q log f (θ)∂s log f (θ)
+
++ 1
+
+N gX,ik(θ)gX,mq(θ)∂km log f (θ)∂q log f (θ) + 1
+
+N gX,ik(θ)∂k log h(θ) + O(N−3/2).
+
+Second, consider the following relationship:
+
+EX(N)
+
+��
+ˆθi
+π − θi −
+1
+√
+
+N
+gX,ik(θ)˜lk(θ|x(N)) −
+1
+√
+
+N
+gX,ik(θ)∂k log f (θ)
+�
+
+×
+�
+ˆθj
+π − θj −
+1
+√
+
+N
+gX,jl(θ)˜ll(θ|xN) −
+1
+√
+
+N
+gX,jl(θ)∂l log f (θ)
+��
+
+= EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)] + 1
+
+N gX,ij(θ) + 1
+
+N gX,ik(θ)gX,jl(θ)∂k log f (θ)∂l log f (θ)
+
+− 1
+√
+
+N
+gX,ki(θ)EX(N)[( ˆθj
+π − θj)˜lk(θ|x(N))] −
+1
+√
+
+N
+gX,kj(θ)EX(N)[( ˆθi
+π − θi)˜lk(θ|x(N))]
+
+− 1
+√
+
+N
+gX,ki(θ)EX(N)[( ˆθj
+π − θj)∂k log f (θ)] −
+1
+√
+
+N
+gX,kj(θ)EX(N)[( ˆθi
+π − θi)∂k log f (θ)]. (A6)
+
+By differentiating the j-th component of the bias, EX(N)[ ˆθj
+π − θj], we obtain the equation given by:
+
+1
+N ∂kEX(N)[ ˆθj
+π − θj]
+=
+− 1
+
+N δj
+k +
+1
+√
+
+N
+EX(N)[( ˆθj
+π − θj)˜lk(θ|xN)],
+(A7)
+
+where δi
+j denotes the delta function: if the upper and the lower indices agree, then the value of
+this function is one and otherwise zero. Equation (A7) has been used by [2,16,19]. By substituting
+Equations (A7) and (A3) into Relationship (A6), we obtain the (i, j)-component of the mean squared
+error of ˆθπ given by:
+
+EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)]
+
+= 1
+
+N gX,ij(θ) + 1
+
+N gX,ik(θ)gX,jl(θ)∂k log f (θ)∂l log f (θ)
+
+−
+1
+
+N
+√
+
+N
+gX,k(i(θ)
+m
+Γ X,j)(θ)∂k log f (θ) +
+2
+
+N
+√
+
+N
+gX,k(i(θ)gX,j)l(θ)∂kl log f (θ)
+
++
+2
+
+N
+√
+
+N
+gX,k(i(θ)∂kgX,j)l(θ)∂l log f (θ)
+
++
+1
+
+N
+√
+
+N
+gX,k(i(θ)gX,j)l(θ)gX,nr(θ)gX,pt(θ)LX
+lrt(θ)∂k log f (θ)∂n log f (θ)∂p log f (θ)
+
++
+2
+
+N
+√
+
+N
+gX,k(i(θ)gX,j)l(θ)gX,nr(θ)∂ln log f (θ)∂r log f (θ)∂k log f (θ)
+
++
+2
+
+N
+√
+
+N
+gX,k(i(θ)gX,j)l(θ)∂k log f (θ)∂l log h(θ) + O(N−2).
+
+Finally, by taking the expectation of the third power of the deviation, ˆθi
+π − θi, we obtain the
+following expansion:
+
+EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk)]
+
+=
+1
+
+N
+√
+
+N
+gX,is(θ)gX,jt(θ)gX,ku(θ)∂s log f (θ)∂t log f (θ)∂u log f (θ)
+
++
+3
+
+N
+√
+
+N
+gX,(ij(θ)gX,k)l(θ)∂l log f (θ) + O(N−2).
+
+Lemma A3. Let $\hat{\theta}_{\pi}$ be the maximum point of $\log p(x^{(N)}|\theta) + \log\{\pi(\theta; N)/|g^{X}(\theta)|^{1/2}\}$.
+The Kullback–Leibler risk of the plug-in predictive density, $q(y|\hat{\theta}_{\pi})$, with the estimator, $\hat{\theta}_{\pi}$, is
+expanded as follows:
+
+R(θ, q(y| ˆθπ))
+
+=
+1
+2N gY
+ij(θ)gX,ij(θ) + 1
+
+2N ˚gij(θ)∂i log f (θ)∂j log f (θ) +
+1
+
+N
+√
+
+N
+˚gij(θ)
+� e
+∇i∂j log f (θ)
+�
+
++
+1
+
+N
+√
+
+N
+˚gij(θ)gX,kl(θ)
+� e
+∇i∂k log f (θ)
+�
+∂j log f (θ)∂l log f (θ)
+
+−
+1
+
+3N
+√
+
+N
+TY
+ijk(θ)gX,is(θ)gX,jt(θ)gX,ku(θ)∂s log f (θ)∂t log f (θ)∂u log f (θ)
+
++
+1
+
+2N
+√
+
+N
+gY
+kl(θ)Ml
+ijgX,is(θ)gX,jt(θ)gX,ku(θ)∂s log f (θ)∂t log f (θ)∂u log f (θ)
+
++
+1
+
+2N
+√
+
+N
+gX,ij(θ)gY
+kl(θ)gX,kl(θ)Mm
+ij ∂m log f (θ) −
+1
+
+N
+√
+
+N
+TY
+ijk(θ)gX,ij(θ)gX,kl(θ)∂l log f (θ)
+
++
+1
+
+N
+√
+
+N
+˚gij(θ)Mk
+ij∂k log f (θ) +
+1
+
+N
+√
+
+N
+˚gij(θ)∂i log f (θ)∂j log h(θ) + O(N−2).
+(A8)
+
+Proof. By applying the Taylor expansion, the Kullback–Leibler risk, R(θ, q(y| ˆθπ)), is expanded as:
+
+Ex(N)[D(q(·|θ), q(·| ˆθπ))]
+
+= EX(N)
+
+��
+q(y|θ)
+�
+−li(θ|y) ˜θi
+π − 1
+
+2lij(θ|y)( ˆθi
+π − θi)( ˆθj
+π − θj)
+
+−1
+
+6lijk(θ|y)( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk) + OP(N−2)
+�
+dy
+�
+
+= 1
+
+2 gY
+ij(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)] − 1
+
+6 LY
+ijk(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk)] + O(N−2)
+
+= 1
+
+2 gY
+ij(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)]
+
++
+
+⎧
+⎪
+⎨
+
+⎪
+⎩
+
+3
+2
+
+m
+Γ Y
+(ij,k)(θ) − 1
+
+3TY
+ijk(θ)
+
+⎫
+⎪
+⎬
+
+⎪
+⎭
+EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk)] + O(N−2)
+
+= 1
+
+2 gY
+ij(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)] − 1
+
+3TY
+ijk(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk)]
+
++1
+
+2
+
+⎧
+⎪
+⎨
+
+⎪
+⎩
+gY
+kl(θ)
+
+m
+Γ Y,l
+ij (θ) − gY
+kl(θ)
+
+m
+Γ X,l
+ij
+(θ)
+
+⎫
+⎪
+⎬
+
+⎪
+⎭
+EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk)]
+
++1
+
+2 gY
+kl(θ)
+
+m
+Γ X,l
+ij
+(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk] + O(N−2),
+(A9)
+
+where
+
+e
+Γ Y
+(ij,k) denotes (1/3){
+
+e
+Γ Y
+ij,k +
+
+e
+Γ Y
+jk,i +
+
+e
+Γ Y
+ki,j}.
+
+By the definition of the predictive metric, ˚gij(θ) = gX
+ik(θ)gY,kl(θ)gX
+lj (θ), by Expansions (A4) and
+
+(A5) and by the relationship LX
+ijk(θ) = −
+
+e
+Γ X
+ij,k(θ) −
+e X
+jk,i(θ) −
+e X
+ki,j(θ) − TX
+ijk(θ), the last two terms of the
+above expansion (A9) are expanded as:
+
+1
+2 gY
+ij(θ)EX(N)[ ˆ(θ
+i
+π − θi)( ˆθj
+π − θj)] + 1
+
+2 gY
+kl(θ)
+
+m
+Γ X,l
+ij
+(θ)EX(N)[( ˆθi
+π − θi)( ˆθj
+π − θj)( ˆθk
+π − θk)]
+
+=
+1
+2N gY
+ij(θ)gX,ij(θ) + 1
+
+2N ˚gij(θ)∂i log f (θ)∂j log f (θ)
+
++
+1
+
+N
+√
+
+N
+˚gij(θ)
+
+⎧
+⎪
+⎨
+
+⎪
+⎩
+∂ij log f (θ) −
+
+e
+Γ X,k
+ij
+(θ)∂k log f (θ)
+
+⎫
+⎪
+⎬
+
+⎪
+⎭
+
++
+1
+
+N
+√
+
+N
+˚gij(θ)gX,kl(θ)
+�
+∂ik log f (θ) −
+e X,m
+ik
+∂m log f (θ)
+�
+∂j log f (θ)∂l log f (θ)
+
++
+1
+
+N
+√
+
+N
+˚gij(θ)∂i log f (θ)∂j log h(θ) + O(N−2).
+(A10)
+
+By substituting Expansion (A10) into Expansion (A9), Expansion (A8) is obtained.
+
+Note that Expansion (A8) is invariant up to O(N−2) under the reparametrization, so that each
+term of this expansion is a scalar function of θ.
+
+Lemma A4. Let ˆθπ be the maximum point of log p(x(N)|θ) + log{π(θ; N)/|gX(θ)|1/2}. The Bayesian
+predictive density based on the prior, π(θ; N), is expanded as:
+
+ˆqπ(y|x(N))
+=
+q(y| ˆθπ) + 1
+
+N gX,ij( ˆθπ)
+
+⎧
+⎪
+⎨
+
+⎪
+⎩
+∂i log |gX( ˆθπ)|
+1
+2 −
+
+e
+Γ X,k
+ik
+( ˆθπ)
+
+⎫
+⎪
+⎬
+
+⎪
+⎭
+∂jq(y| ˆθπ)
+
++ 1
+
+2N gX,ij( ˆθπ)
+
+⎧
+⎨
+
+⎩∂ijq(y| ˆθπ) −
+
+m
+Γ X,k
+ij
+( ˆθπ)∂kq(y| ˆθπ)
+
+⎫
+⎬
+
+⎭ + OP(N−3/2).
+(A11)
+
+Proof. Let ˜θπ denote ˆθπ − θ. First, using a Taylor expansion twice, we expand the posterior density,
+π(θ|x(N)), as:
+
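+\[
+\begin{aligned}
+\pi(\theta|x^{(N)}) ={}& |g^X(\hat\theta_\pi)|^{1/2}\,
+  \frac{\pi(\hat\theta_\pi)}{|g^X(\hat\theta_\pi)|^{1/2}}\,
+  p(x^{(N)}|\hat\theta_\pi)
+  \exp\left( -\frac{1}{2} \{-l_{ij}(\hat\theta_\pi|x^{(N)})\}\, \tilde\theta^i_\pi \tilde\theta^j_\pi \right) \\
+& \times \left[ 1 - \{\partial_i \log |g^X(\hat\theta_\pi)|^{1/2}\}\, \tilde\theta^i_\pi
+  + \frac{1}{2} \left( \frac{\partial_{ij} |g^X(\hat\theta_\pi)|^{1/2}}{|g^X(\hat\theta_\pi)|^{1/2}} \right)
+  \tilde\theta^i_\pi \tilde\theta^j_\pi + O_P(N^{-3/2}) \right] \\
+& \times \Bigg[ 1 + \frac{1}{2} \{\sqrt{N}\, \partial_{ij} \log f(\hat\theta_\pi)\}\, \tilde\theta^i_\pi \tilde\theta^j_\pi
+  - \frac{1}{6} \{l_{ijk}(\hat\theta_\pi|x^{(N)})\}\, \tilde\theta^i_\pi \tilde\theta^j_\pi \tilde\theta^k_\pi
+  + \frac{1}{2} \{\log h(\hat\theta_\pi)\}\, \tilde\theta^i_\pi \tilde\theta^j_\pi \\
+& \qquad - \frac{1}{6} \{\sqrt{N}\, \partial_{ijk} \log f(\hat\theta_\pi)\}\, \tilde\theta^i_\pi \tilde\theta^j_\pi \tilde\theta^k_\pi
+  + \frac{1}{24}\, l_{ijkl}(\hat\theta_\pi|x^{(N)})\, \tilde\theta^i_\pi \tilde\theta^j_\pi \tilde\theta^k_\pi \tilde\theta^l_\pi \\
+& \qquad + \frac{1}{2}
+  \left( \frac{1}{2} \{\sqrt{N}\, \partial_{ij} \log f(\hat\theta_\pi)\}\, \tilde\theta^i_\pi \tilde\theta^j_\pi
+  - \frac{1}{6}\, l_{ijk}(\hat\theta_\pi|x^{(N)})\, \tilde\theta^i_\pi \tilde\theta^j_\pi \tilde\theta^k_\pi \right) \\
+& \qquad \quad \times
+  \left( \frac{1}{2} \{\sqrt{N}\, \partial_{ij} \log f(\hat\theta_\pi)\}\, \tilde\theta^i_\pi \tilde\theta^j_\pi
+  - \frac{1}{6}\, l_{ijk}(\hat\theta_\pi|x^{(N)})\, \tilde\theta^i_\pi \tilde\theta^j_\pi \tilde\theta^k_\pi \right)
+  + O_P(N^{-3/2}) \Bigg] \\
+& \times \left( \int p(x^{(N)}|\theta)\, \frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}}\, |g^X(\theta)|^{1/2}\, d\theta \right)^{-1}.
+\end{aligned}
+\]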
+
+
+
+Entropy 2014, 16, 3026–3048
+
+We denote the N−1/2-order,
+N−1-order and N−3/2-order terms by (N−1/2)a0( ˜θπ; ˆθπ),
+(N−1)a1( ˜θπ; ˆθπ) and (N−3/2)a2( ˜θπ; ˆθπ), respectively. Then, this expansion is rewritten as:
+
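+\[
+\begin{aligned}
+\pi(\theta|x^{(N)}) ={}& |g^X(\hat\theta_\pi)|^{1/2}\,
+  \frac{\pi(\hat\theta_\pi)}{|g^X(\hat\theta_\pi)|^{1/2}}\,
+  p(x^{(N)}|\hat\theta_\pi)
+  \exp\left( -\frac{1}{2} \{-l_{ij}(\hat\theta_\pi|x^{(N)})\}\, \tilde\theta^i_\pi \tilde\theta^j_\pi \right) \\
+& \times \left[ 1 + \frac{1}{\sqrt{N}}\, a_0(\tilde\theta_\pi; \hat\theta_\pi)
+  + \frac{1}{N}\, a_1(\tilde\theta_\pi; \hat\theta_\pi)
+  + \frac{1}{N\sqrt{N}}\, a_2(\tilde\theta_\pi; \hat\theta_\pi) + O_P(N^{-2}) \right] \\
+& \times \left( \int p(x^{(N)}|\theta)\, \frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}}\, |g^X(\theta)|^{1/2}\, d\theta \right)^{-1}.
+\end{aligned}
+\]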
+
+To make the expansion easier to see, the following notations are used. Let φ(η; −lij( ˆθπ|x(N))) be
+the probability density function of the d-dimensional normal distribution with the precision matrix
+whose (i, j)-component is −lij( ˆθπ|x(N)). Let η = (η1, · · · , ηd) be a d-dimensional random vector
+distributed according to the normal density, φ(η; −lij( ˆθπ|x(N))). The notations, ¯a0( ˆθπ), ¯a1( ˆθπ), ¯a2( ˆθπ)
+and ˆωij( ˆθπ), denote the expectations of a0(η; ˆθπ), a1(η; ˆθπ), a2(η; ˆθπ) and ηiηj, respectively.
+Using the above notations, we get the following posterior expansion:
+
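+\[
+\begin{aligned}
+\pi(\theta|x^{(N)}) ={}& \phi(\hat\theta_\pi; -l_{ij}(\hat\theta_\pi|x^{(N)})) \\
+& \times \Bigg[ 1 + \frac{1}{\sqrt{N}} \{a_0(\tilde\theta_\pi; \hat\theta_\pi) - \bar{a}_0(\hat\theta_\pi)\}
+  + \frac{1}{N} \{a_1(\tilde\theta_\pi; \hat\theta_\pi) - \bar{a}_1(\hat\theta_\pi)\} \\
+& \qquad - \frac{1}{N}\, \bar{a}_0(\hat\theta_\pi) \{a_0(\tilde\theta_\pi; \hat\theta_\pi) - \bar{a}_0(\hat\theta_\pi)\}
+  + \frac{1}{N\sqrt{N}} \{a_2(\tilde\theta_\pi; \hat\theta_\pi) - \bar{a}_2(\hat\theta_\pi)\} \\
+& \qquad - \frac{1}{N\sqrt{N}}\, \bar{a}_0(\hat\theta_\pi) \{a_1(\tilde\theta_\pi; \hat\theta_\pi) - \bar{a}_1(\hat\theta_\pi)\}
+  - \frac{1}{N\sqrt{N}}\, \bar{a}_1(\hat\theta_\pi) \{a_0(\tilde\theta_\pi; \hat\theta_\pi) - \bar{a}_0(\hat\theta_\pi)\} \\
+& \qquad + \frac{1}{N\sqrt{N}}\, \bar{a}_0^2(\hat\theta_\pi) \{a_1(\tilde\theta_\pi; \hat\theta_\pi) - \bar{a}_1(\hat\theta_\pi)\}
+  + O_P(N^{-2}) \Bigg].
+\end{aligned}
+\tag{A12}
+\]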
+
+Second, using (A12), the Bayesian predictive density, ˆqπ(y|x(N)), based on the prior, π(θ; N), is
+expanded as:
+
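+\[
+\begin{aligned}
+\hat{q}_\pi(y|x^{(N)})
+={}& \int q(y|\hat\theta_\pi) \left[ 1 - \{\partial_i \log q(y|\hat\theta_\pi)\}\, \tilde\theta^i_\pi
+  + \frac{1}{2} \frac{\partial_{ij} q(y|\hat\theta_\pi)}{q(y|\hat\theta_\pi)}\, \tilde\theta^i_\pi \tilde\theta^j_\pi
+  + o_P(N^{-1}) \right] \pi(\theta|x^{(N)})\, d\theta \\
+={}& \int q(y|\hat\theta_\pi) \Bigg[ 1 + \{\partial_i \log |g^X(\hat\theta_\pi)|^{1/2}\} \{\partial_j \log q(y|\hat\theta_\pi)\}\,
+  \tilde\theta^i_\pi \tilde\theta^j_\pi \\
+& \qquad + \frac{1}{6} \{\partial_{ijk} \log p(x^{(N)}|\hat\theta_\pi) + \sqrt{N}\, \partial_{ijk} \log f(\hat\theta_\pi)\}
+  \{\partial_l \log q(y|\hat\theta_\pi)\}\, \tilde\theta^i_\pi \tilde\theta^j_\pi \tilde\theta^k_\pi \tilde\theta^l_\pi \\
+& \qquad + \frac{1}{2} \frac{\partial_{ij} q(y|\hat\theta_\pi)}{q(y|\hat\theta_\pi)}\, \tilde\theta^i_\pi \tilde\theta^j_\pi
+  + o_P(N^{-1}) \Bigg] \phi(\tilde\theta_\pi; -l_{ij}(\hat\theta_\pi|x^{(N)}))\, d\tilde\theta_\pi \\
+={}& q(y|\hat\theta_\pi) + \hat\omega^{ij}(\hat\theta_\pi) \{\partial_i \log |g^X(\hat\theta_\pi)|^{1/2}\}\, \partial_j q(y|\hat\theta_\pi)
+  + \frac{1}{2}\, \hat\omega^{ik}(\hat\theta_\pi)\, \hat\omega^{jl}(\hat\theta_\pi)\, l_{ijk}(\hat\theta_\pi|x^{(N)})\, \partial_l q(y|\hat\theta_\pi) \\
+& + \frac{1}{2}\, \hat\omega^{ij}(\hat\theta_\pi)\, \partial_{ij} q(y|\hat\theta_\pi) + O_P(N^{-3/2}).
+\end{aligned}
+\tag{A13}
+\]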
+
+Here, the following two equations hold:
+
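+\[
+-l_{ij}(\hat\theta_\pi|x^{(N)}) = N g^X_{ij}(\hat\theta_\pi) - \sqrt{N}\, \tilde{H}_{ij}(\hat\theta_\pi|x^{(N)}) + O_P(1),
+\tag{A14}
+\]
+
+\[
+l_{ijk}(\hat\theta_\pi|x^{(N)}) = -2N\, \overset{e}{\Gamma}{}^{X}_{ij,k}(\hat\theta_\pi)
+  - N\, \overset{m}{\Gamma}{}^{X}_{ik,j}(\hat\theta_\pi) + \sqrt{N}\, \tilde{H}_{ijk}(\hat\theta|x^{(N)}).
+\tag{A15}
+\]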
+
+By combining Equation (A14) with the Sherman–Morrison–Woodbury formula, the following
+expansion is obtained:
+
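+\[
+\hat\omega^{ij}(\hat\theta_\pi) = \frac{1}{N}\, g^{X,ij}(\hat\theta_\pi)
+  + \frac{1}{N\sqrt{N}}\, g^{X,ik}(\hat\theta_\pi)\, g^{X,jl}(\hat\theta_\pi)\, H_{kl}(\hat\theta_\pi|x^{(N)}) + O_P(N^{-2}).
+\tag{A16}
+\]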
+
+By substituting Equations (A14), (A15) and (A16) into Expansion (A13), Expansion (A11) is
+obtained.
+
+Note that the integration of Expansion (A11) is one up to OP(N−2). Further, Expansion (A11) is
+similar to the expansion in [2]. However, the estimator that is the center of the expansion is different,
+because of the dependence of the prior on the sample size.
+
+Lemma A5. The Bayesian estimator, ˆθopt, minimizing the Bayes risk, ∫ R(θ, q(y| ˆθ))dπ(θ; N), among plug-in
+predictive densities is given by:
+
+\[
+\hat\theta^i_{\mathrm{opt}} = \hat\theta^i_\pi + \frac{1}{2N}\, g^{X,ij}(\hat\theta_\pi)\, T^X_j(\hat\theta_\pi)
+  + \frac{1}{2N}\, g^{X,jk}(\hat\theta_\pi)
+  \left\{ \overset{m}{\Gamma}{}^{Y,i}_{jk}(\hat\theta_\pi) - \overset{m}{\Gamma}{}^{X,i}_{jk}(\hat\theta_\pi) \right\}
+  + O_P(N^{-3/2}).
+\tag{A17}
+\]
+
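+Proof. The Bayes risk, ∫ R(θ, q(y| ˆθ))dπ(θ; N), is decomposed as:
+
+\[
+\begin{aligned}
+\int R(\theta, q(y|\hat\theta))\, d\pi(\theta; N)
+={}& \int \pi(\theta; N) \int p(x^{(N)}|\theta) \int q(y|\theta)
+  \log \frac{q(y|\theta)}{\hat{q}_\pi(y|x^{(N)})}\, dy\, dx^{(N)}\, d\theta \\
+& + \int \pi(\theta; N) \int p(x^{(N)}|\theta) \int q(y|\theta)
+  \log \frac{\hat{q}_\pi(y|x^{(N)})}{q(y|\hat\theta)}\, dy\, dx^{(N)}\, d\theta.
+\end{aligned}
+\]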
+
+The first term of this decomposition is not dependent on ˆθ. From Fubini’s theorem and Lemma A4, the
+proof is completed.
+
+Using these lemmas, we prove Theorem 1. First, we find that the Kullback–Leibler risk of the
+plug-in predictive density with the estimator, ˆθopt, defined in Lemma A5, is given by:
+
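+\[
+\begin{aligned}
+R(\theta, q(y|\hat\theta_{\mathrm{opt}})) ={}& R(\theta, q(y|\hat\theta_\pi))
+  + \frac{1}{2N\sqrt{N}}\, \mathring{g}^{ij}(\theta)\, T^X_i(\theta)\, \partial_j \log f(\theta) \\
+& + \frac{1}{2N\sqrt{N}}\, g^{X,im}(\theta)\, g^{Y}_{ij}(\theta)\, g^{X,kl}(\theta)
+  \left\{ \overset{m}{\Gamma}{}^{Y,j}_{kl}(\theta) - \overset{m}{\Gamma}{}^{X,j}_{kl}(\theta) \right\}
+  \partial_m \log f(\theta).
+\end{aligned}
+\tag{A18}
+\]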
+
+Using Expansion (A18) and Lemma A3, we expand the Kullback–Leibler risk, R(θ, ˆqπ(y|x(N))).
+Here, the risk, R(θ, ˆqπ(y|x(N))), is equal to the risk, R(θ, q(y| ˆθopt)), up to O(N−2), because we expand
+the Bayesian predictive density, ˆqπ(y|x(N)) as:
+
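+\[
+\hat{q}_\pi(y|x^{(N)}) = q(y|\hat\theta_{\mathrm{opt}}) + \frac{1}{2N}\, g^{X,ij}(\hat\theta_\pi)
+  \left\{ \partial_{ij} q(y|\hat\theta_\pi) - \overset{m}{\Gamma}{}^{Y,k}_{ij}(\hat\theta_\pi)\, \partial_k q(y|\hat\theta_\pi) \right\}
+  + O_P(N^{-3/2}).
+\tag{A19}
+\]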
+
+Thus, we obtain Expansion (1).
+
+References
+
+1. Aitchison, J. Goodness of prediction fit. Biometrika 1975, 62, 547–554.
+2. Komaki, F. On asymptotic properties of predictive distributions. Biometrika 1996, 83, 299–313.
+3. Hartigan, J. The maximum likelihood prior. Ann. Stat. 1998, 26, 2083–2103.
+4. Bernardo, J. Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. B 1979, 41, 113–147.
+5. Clarke, B.; Barron, A. Jeffreys prior is asymptotically least favorable under entropy risk. J. Stat. Plan.
+Inference 1994, 41, 37–60.
+6. Aslan, M. Asymptotically minimax Bayes predictive densities. Ann. Stat. 2006, 34, 2921–2938.
+7. Komaki, F. Bayesian predictive densities based on latent information priors. J. Stat. Plan. Inference 2011, 141,
+3705–3715.
+8. Komaki, F. Asymptotically minimax Bayesian predictive densities for multinomial models. Electron. J. Stat.
+2012, 6, 934–957.
+9. Kanamori, T.; Shimodaira, H. Active learning algorithm using the maximum weighted log-likelihood
+estimator. J. Stat. Plan. Inference 2003, 116, 149–162.
+10. Shimodaira, H. Improving predictive inference under covariate shift by weighting the
+log-likelihood function. J. Stat. Plan. Inference 2000, 90, 227–244.
+11. Fushiki, T.; Komaki, F.; Aihara, K. On parametric bootstrapping and Bayesian prediction. Scand. J. Stat. 2004,
+31, 403–416.
+12. Suzuki, T.; Komaki, F. On prior selection and covariate shift of β-Bayesian prediction under α-divergence
+risk. Commun. Stat. Theory 2010, 39, 1655–1673.
+13. Komaki, F. Asymptotic properties of Bayesian predictive densities when the distributions of data and target
+variables are different. Bayesian Anal. 2014, submitted for publication.
+14. Hodges, J.L.; Lehmann, E.L. Some problems in minimax point estimation. Ann. Math. Stat. 1950, 21, 182–197.
+15. Ghosh, M.N. Uniform approximation of minimax point estimates. Ann. Math. Stat. 1964, 35, 1031–1047.
+16. Amari, S. Differential-Geometrical Methods in Statistics; Springer: New York, NY, USA, 1985.
+17. Robbins, H. Asymptotically Subminimax solutions of Compound Statistical Decision Problems.
+In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA,
+USA, 31 July–12 August 1950; University of California Press: Oakland, CA, USA, 1950; pp. 131–148.
+18. Frank, P.; Kiefer, J. Almost subminimax and biased minimax procedures. Ann. Math. Stat. 1951, 22, 465–468.
+19. Efron, B. Defining curvature of a statistical problem (with applications to second order efficiency). Ann. Stat.
+1975, 3, 1189–1242.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+
+entropy
+
+Article
+Information-Geometric Markov Chain Monte Carlo
+Methods Using Diffusions
+
+Samuel Livingstone 1,* and Mark Girolami 2
+
+1 Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, UK
+2 Department of Statistics, University of Warwick, Coventry CV4 7AL, UK; E-Mail: m.girolami@warwick.ac.uk
+* E-Mail: samuel.livingstone@ucl.ac.uk; Tel.: +44-20-7679-1872.
+
+Received: 29 March 2014; in revised form: 23 May 2014 / Accepted: 28 May 2014 /
+Published: 3 June 2014
+
+Abstract: Recent work incorporating geometric ideas in Markov chain Monte Carlo is reviewed
+in order to highlight these advances and their possible application in a range of domains beyond
+statistics. A full exposition of Markov chains and their use in Monte Carlo simulation for statistical
+inference and molecular dynamics is provided, with particular emphasis on methods based on
+Langevin diffusions. After this, geometric concepts in Markov chain Monte Carlo are introduced.
+A full derivation of the Langevin diffusion on a Riemannian manifold is given, together with
+a discussion of the appropriate Riemannian metric choice for different problems. A survey of
+applications is provided, and some open questions are discussed.
+
+Keywords: information geometry; Markov chain Monte Carlo; Bayesian inference; computational
+statistics; machine learning; statistical mechanics; diffusions
+
+1. Introduction
+
+There are three objectives to this article. The first is to introduce geometric concepts that have
+recently been employed in Monte Carlo methods based on Markov chains [1] to a wider audience.
+The second is to clarify what a “diffusion on a manifold” is, and how this relates to a diffusion
+defined on Euclidean space. Finally, we review the state-of-the-art in the field and suggest avenues for
+further research.
+The connections between some Monte Carlo methods commonly used in statistics, physics
+and application domains, such as econometrics, and ideas from both Riemannian and information
+geometry [2,3] were highlighted by Girolami and Calderhead [1] and the potential benefits
+demonstrated empirically. Two Markov chain Monte Carlo methods were introduced, the manifold
+Metropolis-adjusted Langevin algorithm and Riemannian manifold Hamiltonian Monte Carlo. Here,
+we focus on the former for two reasons. First, the intuition for why geometric ideas can improve
+standard algorithms is the same in both cases. Second, the foundations of the methods are quite
+different, and since the focus of the article is on using geometric ideas to improve performance,
+we considered a detailed description of both to be unnecessary. It should be noted, however, that
+impressive empirical evidence exists for using Hamiltonian methods in some scenarios (e.g., [4]). We
+refer interested readers to [5,6].
+We take an expository approach, providing a review of some necessary preliminaries from
+Markov chain Monte Carlo, diffusion processes and Riemannian geometry. We assume only a minimal
+familiarity with measure-theoretic probability. More informed readers may prefer to skip these
+sections. We then provide a full derivation of the Langevin diffusion on a Riemannian manifold and
+offer some intuition for how to think about such a process. We conclude Section 4 by presenting the
+Metropolis-adjusted Langevin algorithm on a Riemannian manifold.
+A key challenge in the geometric approach is which manifold to choose. We discuss this in
+Section 4.4 and review some candidates that have been suggested in the literature, along with the
+
+Entropy 2014, 16, 3074–3102; doi:10.3390/e16063074
+www.mdpi.com/journal/entropy
+reasoning for each. Rather than provide a simulation study here, we instead reference studies where
+the methods we describe have been applied in Section 5. In Section 6, we discuss several open
+questions, which we feel could be interesting areas of further research and of interest to both theorists
+and practitioners.
+Throughout, π(·) will refer to an n-dimensional probability distribution and π(x) its density with
+respect to the Lebesgue measure.
+
+2. Markov Chain Monte Carlo
+
+Markov chain Monte Carlo (MCMC) is a set of methods for drawing samples from a distribution,
+π(·), defined on a measurable space (X , B), whose density is only known up to some proportionality
+constant. Although the i-th sample is dependent on the (i − 1)-th, the Ergodic Theorem ensures that
+for an appropriately constructed Markov chain with invariant distribution π(·), long-run averages are
+consistent estimators for expectations under π(·). As a result, MCMC methods have proven useful
+in Bayesian statistical inference, where often, the posterior density π(x|y) ∝ f (y|x)π0(x) for some
+parameter, x (where f (y|x) denotes the likelihood for data y and π0(x) the prior density), is only
+known up to a constant [7]. Here, we briefly introduce some concepts from general state space Markov
+chain theory together with a short overview of MCMC methods. The exposition follows [8].
+
+2.1. Markov Chain Preliminaries
+
+A time-homogeneous Markov chain, {Xm}m∈N, is a collection of random variables, Xm, each of
+which is defined on a measurable space (X , B), such that:
+P[Xm ∈ A|X0 = x0, ..., Xm−1 = xm−1] = P[Xm ∈ A|Xm−1 = xm−1],
+(1)
+
+for any A ∈ B. We define the transition kernel P(xm−1, A) = P[Xm ∈ A|Xm−1 = xm−1] for the chain to
+be a map for which P(x, ·) defines a distribution over (X , B) for any x ∈ X , and P(·, A) is measurable
+for any A ∈ B. Intuitively, P defines a map from points to distributions in X . Similarly, we define the
+m-step transition kernel to be:
+
+Pm(x0, A) = P[Xm ∈ A|X0 = x0].
+(2)
+
+We call a distribution π(·) invariant for {Xm}m∈N if:
+
+\[
+\pi(A) = \int_{\mathcal{X}} P(x, A)\, \pi(dx) \tag{3}
+\]
+
+for all A ∈ B. If P(x, ·) admits a density, p(x′|x), this can be equivalently written:
+
+\[
+\pi(x') = \int_{\mathcal{X}} \pi(x)\, p(x'|x)\, dx. \tag{4}
+\]
+
+The connotation of Equations (3) and (4) is that if Xm ∼ π(·), then Xm+s ∼ π(·) for any s ∈ N. In this
+instance, we say the chain is “at stationarity”. Of interest to us will be Markov chains for which there
+is a unique invariant distribution, which is also the limiting distribution for the chain, meaning that for
+any x0 ∈ X for which π(x0) > 0:
+\[
+\lim_{m \to \infty} P^m(x_0, A) = \pi(A) \tag{5}
+\]
+
+for any A ∈ B. Certain conditions are required for Equation (5) to hold, but for all Markov chains
+presented here, these are satisfied (though, see [8]).
+A useful condition, which is sufficient (though not necessary) for π(·) to be an invariant
+distribution, is reversibility, which can be shown by the relation:
+
+π(x)p(x′|x) = π(x′)p(x|x′).
+(6)
+
+Integrating over both sides with respect to x, we recover Equation (4). In other words, a chain is
+reversible if, at stationarity, the probability that xi ∈ A and xi+1 ∈ B is equal to the probability that
+xi+1 ∈ A and xi ∈ B. The relation (6) will be the primary tool used to construct Markov chains with a
+desired invariant distribution in the next section.
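+
+As an illustration of how reversibility implies invariance, the sketch below builds a small finite-state chain
+and checks the discrete analogues of Equations (6) and (3) numerically. The three-state target and the uniform
+proposal over the other states are assumptions made only for this example; they do not appear in the article.
+
```python
from typing import List


def metropolis_kernel(pi: List[float]) -> List[List[float]]:
    """Transition matrix of a Metropolis-type chain on states 0..k-1 targeting pi.

    Assumed (hypothetical) proposal: jump to one of the other k-1 states
    uniformly at random, which is symmetric in (x, y).
    """
    k = len(pi)
    P = [[0.0] * k for _ in range(k)]
    for x in range(k):
        for y in range(k):
            if x != y:
                # propose y with prob. 1/(k-1), accept with prob. min(1, pi[y]/pi[x])
                P[x][y] = min(1.0, pi[y] / pi[x]) / (k - 1)
        # rejected proposals leave the chain at x (the diagonal entry is still 0 here)
        P[x][x] = 1.0 - sum(P[x])
    return P


def is_reversible(pi: List[float], P: List[List[float]], tol: float = 1e-12) -> bool:
    """Check pi(x) P(x, y) = pi(y) P(y, x), the discrete form of Equation (6)."""
    k = len(pi)
    return all(abs(pi[x] * P[x][y] - pi[y] * P[y][x]) <= tol
               for x in range(k) for y in range(k))


def one_step(pi: List[float], P: List[List[float]]) -> List[float]:
    """Distribution after one step when started from pi (discrete Equation (3))."""
    k = len(pi)
    return [sum(pi[x] * P[x][y] for x in range(k)) for y in range(k)]


pi = [0.2, 0.3, 0.5]
P = metropolis_kernel(pi)
print(is_reversible(pi, P))   # detailed balance holds
print(one_step(pi, P))        # matches pi up to rounding
```
+
+Summing the detailed balance relation over x is the discrete counterpart of the integration step described above.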
+
+2.1.1. Monte Carlo Estimates from Markov Chains
+
+Of most interest here are estimators constructed from a Markov chain. The Ergodic Theorem
+states that for any chain, {Xm}m∈N, satisfying Equation (5) and any g ∈ L1(π), we have that:
+
+\[
+\lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} g(X_i) = E_\pi[g(X)] \tag{7}
+\]
+
+with probability one [7]. This is a Markov chain analogue to the Law of large numbers.
+The efficiency of estimators of the form ˆtm = ∑i g(Xi)/m can be assessed through the
+autocorrelation between elements in the chain. We will assess the efficiency of ˆtm relative to estimators
+¯tm = ∑i g(Zi)/m, where {Zi}i∈N is a sequence of independent random variables, each having
+distribution π(·). Provided Varπ[g(Zi)] < ∞, then Var[¯tm] = Varπ[g(Zi)]/m. We now seek a similar
+result for estimators of the form, ˆtm.
+It follows directly from the Kipnis–Varadhan Theorem [9] that an estimator, ˆtm, from a reversible
+Markov chain for which X0 ∼ π(·) satisfies:
+
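+\[
+\lim_{m \to \infty} \frac{\mathrm{Var}[\hat{t}_m]}{\mathrm{Var}[\bar{t}_m]} = 1 + 2 \sum_{i=1}^{\infty} \rho^{(0,i)} = \tau, \tag{8}
+\]
+
+provided that $\sum_{i=1}^{\infty} i\,|\rho^{(0,i)}| < \infty$, where $\rho^{(0,i)} = \mathrm{Corr}_\pi[g(X_0), g(X_i)]$. We will refer to the constant, τ, as
+the autocorrelation time for the chain.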
+Equation (8) implies that for large enough m, Var[ˆtm] ≈ τVar[¯tm]. In practical applications, the
+sum in Equation (8) is truncated to the first p − 1 realisations of the chain, where p is the first instance
+at which |ρ(0,p)| < ϵ for some ϵ > 0. For example, in the Convergence Diagnosis and Output Analysis
+for MCMC (CODA) package within the R statistical software, ϵ = 0.05 [10,11].
+Another commonly used measure of efficiency is the effective sample size m_eff = m/τ, which
+gives the number of independent samples from π(·) needed to give an equally efficient estimate for
+Eπ[g(X)]. Clearly, minimising τ is equivalent to maximising m_eff.
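+
+The truncation rule described above can be sketched as follows. This is only an illustration under stated
+assumptions: the function names are hypothetical, g is taken to be the identity, and the test chain is a
+stationary Gaussian AR(1) process (a reversible chain with N(0, 1) marginals) rather than MCMC output. For that
+process the lag-i autocorrelation is φ^i, so the untruncated sum in Equation (8) gives τ = (1 + φ)/(1 − φ).
+
```python
import math
import random
from typing import List


def autocorr(xs: List[float], lag: int) -> float:
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var


def autocorr_time(xs: List[float], eps: float = 0.05) -> float:
    """Estimate tau = 1 + 2 * sum_i rho_i, stopping at the first lag with |rho| < eps."""
    tau = 1.0
    for lag in range(1, len(xs)):
        rho = autocorr(xs, lag)
        if abs(rho) < eps:
            break
        tau += 2.0 * rho
    return tau


def ar1_chain(m: int, phi: float, seed: int = 0) -> List[float]:
    """Stationary Gaussian AR(1) chain with N(0, 1) marginals (illustrative only)."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)
    chain = []
    for _ in range(m):
        x = phi * x + math.sqrt(1.0 - phi ** 2) * rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


chain = ar1_chain(m=20000, phi=0.5)
tau = autocorr_time(chain)       # theory without truncation: (1 + 0.5) / (1 - 0.5) = 3
m_eff = len(chain) / tau         # effective sample size
print(tau, m_eff)
```
+
+Because the sum is truncated, the estimate will typically sit slightly below the theoretical value.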
+The measures arising from Equation (8) give some intuition for what sort of Markov chain gives
+rise to efficient estimators. However, in practice, the chain will never be at stationarity. Therefore, we
+also assess Markov chains according to how far away they are from this point. For this, we need to
+measure how close Pm(x0, ·) is to π(·), which requires a notion of distance between probability
+distributions.
+Although there are several appropriate choices [12], a common option in the Markov chain
+literature is the total variation distance:
+
+\[
+\|\mu(\cdot) - \nu(\cdot)\|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|, \tag{9}
+\]
+
+which informally gives the largest possible difference between the probabilities of a single event in
+B according to μ(·) and ν(·). If both distributions admit densities, Equation (9) can be written (see
+Appendix A):
+
+\[
+\|\mu(\cdot) - \nu(\cdot)\|_{TV} = \frac{1}{2} \int_{\mathcal{X}} |\mu(x) - \nu(x)|\, dx, \tag{10}
+\]
+
+which is proportional to the L1 distance between μ(x) and ν(x). The metric takes values in [0, 1], with
+∥ · ∥TV = 1 for distributions with disjoint supports, and ∥μ(·) − ν(·)∥TV = 0 implies μ(·) ≡ ν(·).
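+
+As a numerical check of Equation (10), the sketch below approximates the integral for two unit-variance normal
+densities and compares it with the closed form 2Φ(|δ|/2) − 1, where δ is the difference in means. The choice of
+distributions, the midpoint rule and the function names are assumptions made for illustration.
+
```python
import math


def normal_pdf(x: float, mean: float) -> float:
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def tv_numeric(shift: float, n: int = 20000, half_width: float = 12.0) -> float:
    """Midpoint-rule approximation of Equation (10) for N(0, 1) versus N(shift, 1)."""
    lo = min(0.0, shift) - half_width
    hi = max(0.0, shift) + half_width
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += abs(normal_pdf(x, 0.0) - normal_pdf(x, shift))
    return 0.5 * total * h


def tv_exact(shift: float) -> float:
    """Closed form 2 * Phi(|shift| / 2) - 1, written with the error function."""
    return math.erf(abs(shift) / (2.0 * math.sqrt(2.0)))


for shift in (0.0, 1.0, 20.0):
    print(shift, tv_numeric(shift), tv_exact(shift))
```
+
+The distance is zero for identical densities and approaches one as the two densities separate.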
+
+Typically, for an unbounded X , the distance ∥Pm(x0, ·) − π(·)∥TV will depend on x0 for any finite
+m. Therefore, bounds on the distance are often sought via some inequality of the form:
+
+∥Pm(x0, ·) − π(·)∥TV ≤ MV(x0) f (m),
+(11)
+
+for some M < ∞, where V : X → [1, ∞) depends on x0 and is called a drift function, and f : N → [0, ∞)
+depends on the number of iterations, m (and is often defined, such that f (0) = 1).
+A Markov chain is called geometrically ergodic if f (m) = r^m in Equation (11) for some 0 < r < 1.
+If in addition to this, V is bounded above, the chain is called uniformly ergodic. Intuitively, if either
+condition holds, then the distribution of Xm will converge to π(·) geometrically quickly as m grows,
+and in the uniform case, this rate is independent of x0. As well as providing some (often qualitative
+if M and r are unknown) bounds on the convergence rate of a Markov chain, geometric ergodicity
+implies that a central limit theorem exists for estimators of the form, ˆtm. For more detail on this,
+see [13,14].
+In practice several approximate methods also exist to assess whether a chain is close enough to
+stationarity for long-run averages to provide suitable estimators (e.g., [15]). The MCMC practitioner
+also uses a variety of visual aids to judge whether an estimate from the chain will be appropriate for
+his or her needs.
+
+2.2. Markov Chain Monte Carlo
+
+Now that we have introduced Markov chains, we turn to simulating them. The objective here
+is to devise a method for generating a Markov chain, which has a desired limiting distribution, π(·).
+In addition, we would strive for the convergence rate to be as fast as possible and the effective sample
+size to be suitably large relative to the number of iterations. Of course, the computational cost of
+performing an iteration is also an important practical consideration. Ideally, any method would
+also require limited problem-specific alterations, so that practitioners are able to use it with as little
+knowledge of the inner workings as is practical.
+Although other methods exist for constructing chains with a desired limiting distribution, a
+popular choice is the Metropolis–Hastings algorithm [7]. At iteration i, a sample is drawn from some
+candidate transition kernel, Q(xi−1, ·), and then either accepted or rejected (in which case, the state of
+the chain remains xi−1). We focus here on the case where Q(xi−1, ·) admits a density, q(x′|xi−1), for all
+xi−1 ∈ X (though, see [8]). In this case, a single step is shown below (the wedge notation a ∧ b denotes
+the minimum of a and b). The “acceptance rate”, α(xi−1, x′), governs the behaviour of the chain, so
+that, when it is close to one, then many proposed moves are accepted, and the current value in the
+chain is constantly changing. If it is on average close to zero, then many proposals are rejected, so
+that the chain will remain in the same place for many iterations. However, α ≈ 1 is typically not ideal,
+often resulting in a large autocorrelation time (see below). The challenge in practice is to find the right
+acceptance rate to balance these two extremes.
+
+Algorithm 1 Metropolis–Hastings, single iteration.
+
+Require: xi−1
+  Draw X′ ∼ Q(xi−1, ·)
+  Draw Z ∼ U[0, 1]
+  Set α(xi−1, x′) ← 1 ∧ [π(x′)q(xi−1|x′)] / [π(xi−1)q(x′|xi−1)]
+  if z < α(xi−1, x′) then
+    Set xi ← x′
+  else
+    Set xi ← xi−1
+  end if
+
+Combining the “proposal” and “acceptance” steps, the transition kernel for the resulting Markov
+chain is:
+
+\[
+P(x, A) = r(x)\, \delta_x(A) + \int_A \alpha(x, x')\, q(x'|x)\, dx', \tag{12}
+\]
+
+for any A ∈ B, where:
+
+\[
+r(x) = 1 - \int_{\mathcal{X}} \alpha(x, x')\, q(x'|x)\, dx'
+\]
+
+is the average probability that a draw from Q(x, ·) will be rejected, and δx(A) = 1 if x ∈ A and zero,
+otherwise. A Markov chain defined in this way will have π(·) as an invariant distribution, since the
+chain is reversible for π(·). We note here that:
+
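+\[
+\begin{aligned}
+\pi(x_{i-1})\, q(x_i|x_{i-1})\, \alpha(x_{i-1}, x_i) &= \pi(x_{i-1})\, q(x_i|x_{i-1}) \wedge \pi(x_i)\, q(x_{i-1}|x_i) \\
+&= \alpha(x_i, x_{i-1})\, q(x_{i-1}|x_i)\, \pi(x_i)
+\end{aligned}
+\]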
+
+in the case that the proposed move is accepted and that if the proposed move is rejected, then xi = xi−1;
+so the chain is reversible for π(·). It can be shown that π(·) is also the limiting distribution for the
+chain [7].
+The convergence rate and autocorrelation time of a chain produced by the algorithm are dependent
+on both the choice of proposal, Q(xi−1, ·), and the target distribution, π(·). For simple forms of the
+latter, less consideration is required when choosing the former. A broad objective among researchers
+in the field is to find classes of proposal kernels that produce chains that converge and mix quickly for
+a large class of target distributions. We first review a simple choice before discussing one that is more
+sophisticated, and that will be the focus of the rest of the article.
+
+2.3. Random Walk Proposals
+
+An extremely simple choice for Q(x, ·) is one for which:
+
+q(x′|x) = q(∥x′ − x∥)
+(13)
+
+where ∥ · ∥ denotes some appropriate norm on X , meaning the proposal is symmetric. In this case, the
+acceptance rate reduces to:
+
+\[
+\alpha(x, x') = 1 \wedge \frac{\pi(x')}{\pi(x)}. \tag{14}
+\]
+
+In addition to simplifying calculations, Equation (14) strengthens the intuition for the method, since
+proposed moves with higher density under π(·) will always be accepted. A typical choice for Q(x, ·)
+is N (x, λ2Σ), where the matrix, Σ, is often chosen in an attempt to match the correlation structure
+of π(·) or simply taken as the identity [16]. The tuning parameter, λ, is the only other user-specific
+input required.
+Much research has been conducted into properties of the random walk Metropolis algorithm
+(RWM). It has been shown that the optimal acceptance rate for proposals tends to 0.234 as the
+dimension, n, of the state space, X , tends to ∞ for a wide class of targets (e.g., [17,18]). The intuition
+for an optimal acceptance rate is to find the right balance between the distance of proposed moves
+and the chances of acceptance. Increasing the former will reduce the autocorrelation in the chain if the
+proposal is accepted, but if it is rejected, the chain will not move at all, so autocorrelation will be high.
+Random walk proposals are sometimes referred to as blind (e.g., [19]), as no information about π(·)
+is used when generating proposals, so typically, very large moves will result in a very low chance of
+acceptance, while small moves will be accepted, but result in very high autocorrelation for the chain.
+Figure 1 demonstrates this in the simple case where π(·) is a one-dimensional N (0, 1^2) distribution.
+
+Figure 1. These traceplots show the evolution of three RWM Markov chains for which π(·) is a N (0, 1^2)
+distribution, with different choices for λ.
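+
+The effect of λ on a random walk Metropolis chain can be reproduced with a short sketch. The standard normal
+target, the three values of λ and the function names are assumptions made for illustration and are not taken
+from the figure.
+
```python
import math
import random
from typing import Callable, List, Tuple


def rwm(log_pi: Callable[[float], float], x0: float, lam: float, m: int,
        seed: int = 0) -> Tuple[List[float], float]:
    """Random walk Metropolis in one dimension; returns the chain and acceptance rate."""
    rng = random.Random(seed)
    x, chain, accepted = x0, [], 0
    for _ in range(m):
        x_new = x + lam * rng.gauss(0.0, 1.0)   # symmetric N(x, lam^2) proposal
        log_ratio = log_pi(x_new) - log_pi(x)   # Equation (14) on the log scale
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = x_new
            accepted += 1
        chain.append(x)
    return chain, accepted / m


def log_std_normal(x: float) -> float:
    """Log-density of N(0, 1), up to an additive constant."""
    return -0.5 * x * x


# Small, moderate and very large proposal scales.
rates = {lam: rwm(log_std_normal, 0.0, lam, 20000)[1] for lam in (0.1, 2.4, 50.0)}
print(rates)
```
+
+A very small λ gives an acceptance rate near one but tiny moves, while a very large λ gives mostly rejections;
+both extremes produce highly autocorrelated chains.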
+
+Several authors have also shown that for certain classes of π(·), the tuning parameter, λ, should
+be chosen, such that λ^2 ∝ n^{−1}, so that α ↛ 0 as n → ∞ [20]. Because of this, we say that algorithm
+efficiency “scales” O(n^{−1}) as the dimension n of π(·) increases.
+Ergodicity results for a Markov chain constructed using the RWM algorithm also exist [21–23].
+At least exponentially light tails are a necessity for π(x) for geometric ergodicity, which means that
+π(x)/e^{−∥x∥} → c as ∥x∥ → ∞, for some constant, c. For super-exponential tails (where π(x) → 0 at
+a faster-than-exponential rate), additional conditions are required [21,23]. We demonstrate with
+a simple example why heavy-tailed forms of π(x) pose difficulties here (where π(x) → 0 at a rate
+slower than e^{−∥x∥}).
+
+Example: Take π(x) ∝ 1/(1 + x^2), so that π(·) is a Cauchy distribution. Then, if X′ ∼ N (x, λ^2), the ratio
+π(x′)/π(x) = (1 + x^2)/(1 + (x′)^2) → 1 as |x| → ∞. Therefore, if x0 is far away from zero, the Markov
+chain will dissolve into a random walk, with almost every proposal being accepted.
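+
+The limit in this example can be checked by Monte Carlo. The sketch below averages the acceptance probability
+1 ∧ π(x′)/π(x) over proposals X′ ∼ N(x, λ^2) for increasingly distant x; the value λ = 1, the sample size and
+the function name are assumptions made for illustration.
+
```python
import random


def mean_acceptance(x: float, lam: float = 1.0, n: int = 20000, seed: int = 0) -> float:
    """Average of min(1, pi(x')/pi(x)) for a Cauchy target and N(x, lam^2) proposals."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x_new = x + lam * rng.gauss(0.0, 1.0)
        ratio = (1.0 + x * x) / (1.0 + x_new * x_new)  # pi(x') / pi(x)
        total += min(1.0, ratio)
    return total / n


for x in (2.0, 20.0, 200.0, 1e4):
    print(x, mean_acceptance(x))
```
+
+Far from the origin nearly every proposal is accepted, so the chain receives almost no pull back towards the
+bulk of the distribution.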
+
+It should be noted that starting the chain from at or near zero can also cause problems in the
+above example, as the tails of the distribution may not be explored. See [7] for more detail here.
+Ergodicity results for the RWM also exist for specific classes of the statistical model. Conditions for
+geometric ergodicity in the case of generalised linear mixed models are given in [24], while spherically
+constrained target densities are discussed in [25]. In [26], the authors provide necessary conditions
+for the geometric convergence of RWM algorithms, which are related to the existence of exponential
+moments for π(·) and P(x, ·). Weaker forms of ergodicity and corresponding conditions are also
+discussed in the paper.
+In the remainder of the article, we will primarily discuss another approach to choosing Q, which
+has been shown empirically [1] and, in some cases, theoretically [20] to be superior to the RWM
+algorithm, though it should be noted that random walk proposals are still widely used in practice and
+are often sufficient for more straightforward problems [16].
+
+3. Diffusions
+
+In MCMC, we are concerned with discrete time processes. However, often, there are benefits
+to first considering a continuous time process with the properties we desire. For example, some
+continuous time processes can be specified via a form of differential equation. In this section, we
+derive a choice for a Metropolis–Hastings proposal kernel based on approximations to diffusions,
+
+those continuous-time n-dimensional Markov processes (Xt)t≥0 for which any sample path t ↦ Xt(ω)
+is a continuous function with probability one. For any fixed t, we assume Xt is a random variable
+taking values on the measurable space (X , B) as before. The motivation for this section is to define a
+class of diffusions for which π(·) is the invariant distribution. First, we provide some preliminaries,
+followed by an introduction to our main object of study, the Langevin diffusion.
+
+3.1. Preliminaries
+
+We focus on the class of time-homogeneous Itô diffusions, whose dynamics are governed by a
+stochastic differential equation of the form:
+
+dXt = b(Xt)dt + σ(Xt)dBt, X0 = x0,
+(15)
+
+where (Bt)t≥0 is a standard Brownian motion and the drift vector, b, and volatility matrix, σ, are
+Lipschitz continuous [27]. Since E[Bt+△t − Bt|Bt = bt] = 0 for any △t ≥ 0, informally, we can see that:
+
+E[Xt+△t − Xt|Xt = xt] = b(xt)△t + o(△t),
+(16)
+
+implying that the drift dictates how the mean of the process changes over a small time interval, and if
+we define the process (Mt)t≥0 through the relation:
+
+\[
+M_t = X_t - \int_0^t b(X_s)\, ds \tag{17}
+\]
+
+then we have:
+
+E[(Mt+△t − Mt)(Mt+△t − Mt)T|Mt = mt, Xt = xt] = σ(xt)σ(xt)T△t + o(△t),
+(18)
+
+giving the stochastic part of the relationship between Xt+△t and Xt for small enough △t; see, e.g., [28].
+While Equation (15) is often a suitable description of an Itô diffusion, it can also be characterised
+through an infinitesimal generator, A, which describes how functions of the process are expected to
+evolve. We define this partial differential operator through its action on a function, f ∈ C0(X ), as:
+
+\[
+\mathcal{A} f(X_t) = \lim_{\Delta t \to 0} \frac{E[f(X_{t+\Delta t}) | X_t = x_t] - f(x_t)}{\Delta t}, \tag{19}
+\]
+though A can be associated with the drift and volatility of (Xt)t≥0 by the relation:
+
+\[
+\mathcal{A} f(x) = \sum_i b_i(x) \frac{\partial f}{\partial x_i}(x)
+  + \frac{1}{2} \sum_{i,j} V_{ij}(x) \frac{\partial^2 f}{\partial x_i \partial x_j}(x), \tag{20}
+\]
+
+where Vij(x) denotes the component in row i and column j of σ(x)σ(x)T [27].
+As in the discrete case, we can describe the transition kernel of a continuous time Markov process,
+Pt(x0, ·). In the case of an Itô diffusion, Pt(x0, ·) admits a density, pt(x|x0), which, in fact, varies
+smoothly as a function of t. The Fokker–Planck equation describes this variation in terms of the drift
+and volatility and is given by:
+
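+\[
+\frac{\partial}{\partial t} p_t(x|x_0) = -\sum_i \frac{\partial}{\partial x_i} [b_i(x)\, p_t(x|x_0)]
+  + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j} [V_{ij}(x)\, p_t(x|x_0)]. \tag{21}
+\]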
+
+Although, typically, the form of Pt(x0, ·) is unknown, the expectation and variance of Xt ∼ Pt(x0, ·)
+are given by the integral equations:
+
+\[
+E[X_t | X_0 = x_0] = x_0 + E\left[ \int_0^t b(X_s)\, ds \right],
+\]
+
+\[
+E[(X_t - E[X_t])(X_t - E[X_t])^T | X_0 = x_0] = E\left[ \int_0^t \sigma(X_s)\sigma(X_s)^T\, ds \right],
+\]
+
+where the second of these is a result of the Itô isometry [27]. Continuing the analogy, a natural question
+is whether a diffusion process has an invariant distribution, π(·), and whether:
+
+\[
+\lim_{t \to \infty} P^t(x_0, A) = \pi(A) \tag{22}
+\]
+
+for any A ∈ B and any x0 ∈ X , in some sense. For a large class of diffusions (which we confine
+ourselves to), this is, in fact, the case. Specifically, in the case of positive Harris recurrent diffusions
+with invariant distribution π(·), all compact sets must be small for some skeleton chain; see [29] for
+details. In addition, Equation (21) provides a means of finding π(·), given b and σ. Setting the left-hand
+side of Equation (21) to zero gives:
+
+\[
+\sum_i \frac{\partial}{\partial x_i} [b_i(x)\, \pi(x)]
+  = \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j} [V_{ij}(x)\, \pi(x)], \tag{23}
+\]
+
+which can be solved to find π(·).
+
+3.2. Langevin Diffusions
+
+Given Equation (23), our goal becomes clearer: find drift and volatility terms, so that the resulting
+dynamics describe a diffusion, which converges to some user-defined invariant distribution, π(·). This
+process can then be used as a basis for choosing Q in a Metropolis–Hastings algorithm. The Langevin
+diffusion, first used to describe the dynamics of molecular systems [30], is such a process, given by the
+solution to the stochastic differential equation:
+
+\[
+dX_t = \frac{1}{2} \nabla \log \pi(X_t)\, dt + dB_t, \quad X_0 = x_0. \tag{24}
+\]
+
+Since Vij(x) = 1{i=j}, it is clear that
+
+\[
+\frac{1}{2} \frac{\partial}{\partial x_i} [\log \pi(x)]\, \pi(x) = \frac{1}{2} \frac{\partial}{\partial x_i} \pi(x), \quad \forall i, \tag{25}
+\]
+
+which is a sufficient condition for Equation (23) to hold. Therefore, for any case in which π(x) is
+suitably regular, so that ∇ log π(x) is well-defined and the derivatives in Equation (23) exist, we can
+use (24) to construct a diffusion, which has invariant distribution, π(·).
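+
+To see Equation (24) in action, the sketch below simulates it with a simple Euler–Maruyama scheme for a
+standard normal target, so that ∇ log π(x) = −x. The step size h, the run length and the function name are
+assumptions made for illustration. For this linear case the discretised chain has stationary variance
+1/(1 − h/4) rather than exactly 1, which shows that a time-discretised diffusion is generally only
+approximately invariant for π(·).
+
```python
import math
import random
from typing import Callable, List


def euler_maruyama_langevin(grad_log_pi: Callable[[float], float], x0: float,
                            h: float, m: int, seed: int = 0) -> List[float]:
    """Simulate dX = (1/2) grad log pi(X) dt + dB with time step h (one dimension)."""
    rng = random.Random(seed)
    x, path = x0, []
    for _ in range(m):
        x = x + 0.5 * h * grad_log_pi(x) + math.sqrt(h) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


h = 0.1  # assumed step size
path = euler_maruyama_langevin(lambda x: -x, x0=0.0, h=h, m=200000)
kept = path[1000:]  # discard an initial transient
mean = sum(kept) / len(kept)
var = sum((x - mean) ** 2 for x in kept) / len(kept)
print(mean, var, 1.0 / (1.0 - h / 4.0))  # sample moments and the discretised variance
```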
+Roberts and Tweedie [31] give sufficient conditions on π(·) under which a diffusion, (Xt)t≥0,
+with dynamics given by Equation (24), will be ergodic, meaning:
+
+∥Pt(x0, ·) − π(·)∥TV → 0
+(26)
+
+as t → ∞, for any x0 ∈ X .
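+
+As a minimal sketch of Equation (24), the Langevin diffusion can be simulated with an Euler–Maruyama scheme. The standard normal target, the step size h and the function name simulate_langevin are illustrative assumptions and do not come from the text. For this target, ∇ log π(x) = −x, so the long-run sample variance should be close to one, up to discretisation bias and Monte Carlo error.
+
```python
import math
import random


def simulate_langevin(grad_log_pi, x0, h, n_steps, seed=0):
    """Euler-Maruyama discretisation of dX_t = 0.5 * grad log pi(X_t) dt + dB_t.

    No Metropolis-Hastings correction is applied here, so the path only
    approximately targets pi (the bias shrinks as h -> 0).
    """
    rng = random.Random(seed)
    x = x0
    path = []
    for _ in range(n_steps):
        # Deterministic drift step plus a Gaussian increment with variance h.
        x = x + 0.5 * h * grad_log_pi(x) + math.sqrt(h) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


# Illustrative target (an assumption): standard normal, grad log pi(x) = -x.
path = simulate_langevin(lambda x: -x, x0=3.0, h=0.01, n_steps=200_000)
burned = path[20_000:]  # discard the initial transient
mean = sum(burned) / len(burned)
var = sum((x - mean) ** 2 for x in burned) / len(burned)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```
+
+Because no accept/reject step is used, the discretised path carries a small bias for any fixed h; this is the gap that the Metropolis adjustment closes.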
+
+3.3. Metropolis-Adjusted Langevin Algorithm
+
+We can use Langevin diffusions as a basis for MCMC in many ways, but a popular
+variant is known as the Metropolis-adjusted Langevin algorithm (MALA), whereby Q(x, ·) is
+
+Entropy 2014, 16, 3074–3102
+
+constructed through a Euler–Maruyama discretisation of (24) and used as a candidate kernel in
+a Metropolis–Hastings algorithm. The resulting Q is:
+
+\[ Q(x, \cdot) \equiv \mathcal{N}\left(x + \frac{\lambda^2}{2}\nabla \log \pi(x),\; \lambda^2 I\right), \tag{27} \]
+
+where λ is again a tuning parameter.
+Before we discuss the theoretical properties of the approach, we first offer an intuition for the
+dynamics. From Equation (27), it can be seen that Langevin-type proposals comprise a deterministic
+shift towards a local mode of π(x), combined with some random additive Gaussian noise, with
+variance λ² for each component. The relative weights of the deterministic and random parts are fixed,
+given as they are by the parameter, λ. Typically, if λ ≫ λ²/2 (that is, if λ is small), then the random part of the proposal will
+dominate and vice versa in the opposite case, though this also depends on the form of ∇ log π(x) [31].
+Again, since this is a Metropolis–Hastings method, choosing λ is a balance between proposing
+large enough jumps and ensuring that a reasonable proportion are accepted. It has been shown that in
+the limit, as n → ∞, the optimal acceptance rate for the algorithm is 0.574 [20] for forms of π(·), which
+either have independent and identically distributed components or whose components only differ
+by some scaling factor [20]. In these cases, as n → ∞, the parameter, λ, must be ∝ n^{−1/3}, so we say
+the algorithm efficiency scales O(n^{−1/3}). Note that these results compare favourably with the O(n^{−1})
+scaling of the random walk algorithm.
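+
+The proposal in Equation (27), combined with the usual Metropolis–Hastings accept/reject step, can be sketched as follows in one dimension. The function name mala, the standard normal target and the value λ = 1 are assumptions made for illustration only.
+
```python
import math
import random


def mala(log_pi, grad_log_pi, x0, lam, n_steps, seed=0):
    """One-dimensional MALA using the proposal N(x + lam^2/2 * grad log pi(x), lam^2)."""
    rng = random.Random(seed)

    def proposal_mean(x):
        return x + 0.5 * lam**2 * grad_log_pi(x)

    def log_q(x_to, x_from):
        # Log proposal density up to a constant (the variance lam^2 is fixed,
        # so the normalising constant cancels in the acceptance ratio).
        return -((x_to - proposal_mean(x_from)) ** 2) / (2.0 * lam**2)

    x = x0
    samples = []
    n_accepted = 0
    for _ in range(n_steps):
        x_new = proposal_mean(x) + lam * rng.gauss(0.0, 1.0)
        log_alpha = (log_pi(x_new) + log_q(x, x_new)) - (log_pi(x) + log_q(x_new, x))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = x_new
            n_accepted += 1
        samples.append(x)
    return samples, n_accepted / n_steps


# Illustrative target (an assumption): standard normal.
samples, acc_rate = mala(lambda x: -0.5 * x * x, lambda x: -x, x0=0.0, lam=1.0, n_steps=20_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"acceptance rate = {acc_rate:.2f}, mean = {mean:.3f}, variance = {var:.3f}")
```
+
+In practice one would rerun such a sampler for several values of lam and monitor the acceptance rate; the asymptotic guidance quoted above concerns the high-dimensional limit, not this one-dimensional toy case.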
+Convergence properties of the method have also been established. Roberts and Tweedie [31]
+highlight some cases in which MALA is either geometrically ergodic or not. Typically, results are
+based on the tail behaviour of π(x). If these tails are heavier than exponential, then the method is
+typically not geometrically ergodic and similarly if the tails are lighter than Gaussian. However, in the
+in-between case, the converse is true. We again offer two simple examples for intuition here.
+
+Example: Take π(x) ∝ 1/(1 + x²) as in the previous example. Then, ∇ log π(x) = −2x/(1 + x²) → 0
+as |x| → ∞. Therefore, if x0 is far away from zero, then the MALA will be approximately equal to the RWM
+algorithm and, so, will also dissolve into a random walk.
+
+Example: Take π(x) ∝ e^{−x⁴}. Then, ∇ log π(x) = −4x³ and, by Equation (27), X′ ∼ N(x − 2λ²x³, λ²). Therefore, for any fixed
+λ, there exists c > 0, such that, for |x0| > c, we have |2λ²x³| ≫ |x| and |x − 2λ²x³| ≫ λ, suggesting that
+MALA proposals will quickly spiral further and further away from any neighbourhood of zero, and hence, nearly
+all will be rejected.
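+
+The two examples can be checked numerically by evaluating the deterministic shift of the proposal mean in Equation (27). The helper names and the choice λ = 1 below are illustrative assumptions.
+
```python
def mala_proposal_mean(x, grad_log_pi, lam):
    """Mean of the MALA proposal in Equation (27): x + lam^2/2 * grad log pi(x)."""
    return x + 0.5 * lam**2 * grad_log_pi(x)


def grad_log_cauchy(x):
    # pi(x) proportional to 1 / (1 + x^2)  =>  d/dx log pi(x) = -2x / (1 + x^2)
    return -2.0 * x / (1.0 + x * x)


def grad_log_quartic(x):
    # pi(x) proportional to exp(-x^4)  =>  d/dx log pi(x) = -4 x^3
    return -4.0 * x**3


lam = 1.0
for x in (1.0, 10.0, 100.0):
    shift_heavy = mala_proposal_mean(x, grad_log_cauchy, lam) - x
    shift_light = mala_proposal_mean(x, grad_log_quartic, lam) - x
    print(f"x = {x:6.1f}: heavy-tailed shift = {shift_heavy:+.4f}, light-tailed shift = {shift_light:+.1f}")
```
+
+For the heavy-tailed target the shift vanishes far from the mode, while for the light-tailed target it overshoots the mode by orders of magnitude, matching the two failure modes described above.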
+
+For cases where there is a strong correlation between elements of x or each element has a different
+marginal variance, the MALA can also be “pre-conditioned” in a similar way to the RWM, so that the
+covariance structure of proposals more accurately reflects that of π(x) [32]. In this case, proposals take
+the form:
+
+\[ Q(x, \cdot) \equiv \mathcal{N}\left(x + \frac{\lambda^2}{2}\Sigma\nabla \log \pi(x),\; \lambda^2 \Sigma\right), \tag{28} \]
+
+where λ is again a tuning parameter. It can be shown that provided Σ is a constant matrix, π(x) is still
+the invariant distribution for the diffusion on which Equation (28) is based [33].
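+
+A draw from the pre-conditioned proposal in Equation (28) can be sketched with NumPy as below. The function name, the Gaussian target and the choice of Σ equal to the target covariance are assumptions for illustration; the accept/reject step is omitted.
+
```python
import numpy as np


def preconditioned_mala_proposal(x, grad_log_pi, lam, sigma, rng):
    """Draw from N(x + lam^2/2 * Sigma grad log pi(x), lam^2 * Sigma), as in Equation (28)."""
    mean = x + 0.5 * lam**2 * sigma @ grad_log_pi(x)
    # If Sigma = L L^T, then mean + lam * L z with z ~ N(0, I) has covariance lam^2 * Sigma.
    chol = np.linalg.cholesky(sigma)
    z = rng.standard_normal(x.shape[0])
    return mean + lam * chol @ z, mean


# Illustrative target (an assumption): zero-mean Gaussian with covariance `target_cov`.
target_cov = np.array([[1.0, 0.9], [0.9, 1.0]])
precision = np.linalg.inv(target_cov)
grad = lambda x: -precision @ x

rng = np.random.default_rng(0)
x = np.array([2.0, -1.0])
proposal, mean = preconditioned_mala_proposal(x, grad, lam=1.0, sigma=target_cov, rng=rng)
print("proposal mean:", mean)
```
+
+With Σ matched to the target covariance, the drift Σ∇ log π(x) reduces to −x, so the proposal mean moves straight towards the mode regardless of the correlation between components.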
+
+4. Geometric Concepts in Markov Chain Monte Carlo
+
+Ideas from information geometry have been successfully applied to statistics from as early as [34].
+More widely, other geometric ideas have also been applied, offering new insight into common problems
+(e.g., [35,36]). A survey is given in [37]. In this section, we suggest why some ideas from differential
+geometry may be beneficial for sampling methods based on Markov chains. We then review what is
+meant by a “diffusion on a manifold”, before turning to the specific case of Equation (24). After this,
+we discuss what can be learned from work in information geometry in this context.
+
+4.1. Manifolds and Markov Chains
+
+We often make assumptions in MCMC about the properties of the space, X , in which our
+Markov chains evolve. Often X = Rn or a simple re-parametrisation would make it so. However,
+here, Rn = {(a1, ..., an) : ai ∈ (−∞, ∞) ∀i}. The additional assumption that is often made is that Rn is
+Euclidean, an inner product space with the induced distance metric:
+
+\[ d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}. \tag{29} \]
+
+For sampling methods based on Markov chains that explore the space locally, like the RWM and
+MALA, it may be advantageous to instead impose a different metric structure on the space, X , so that
+some points are drawn closer together and others pushed further apart. Intuitively, one can picture
+distances in the space being defined, such that if the current position in the chain is far from an area
+of X , which is “likely to occur” under π(·), then the distance to such a typical set could be reduced.
+Similarly, once this region is reached, the space could be “stretched” or “warped”, so that it is explored
+as efficiently as possible.
+While the idea is attractive, it is far from a constructive definition. We only have the pre-requisite
+that (X , d) must be a metric space. However, as Langevin dynamics use gradient information, we will
+require (X , d) to be a space on which we can do differential calculus. Riemannian manifolds are an
+appropriate choice, therefore, as the rules of differentiation are well understood for functions defined
+on them [38,39], while we are still free to define a more local notion of distance than Euclidean. In this
+section, we write Rn to denote the Euclidean vector space.
+
+4.2. Preliminaries
+
+We do not provide a full overview of Riemannian geometry here [38–40]. We simply note that
+for our purposes, we can consider an n-dimensional Riemannian manifold (henceforth, manifold)
+to be an n-dimensional metric space, in which distances are defined in a specific way. We also only
+consider manifolds for which a global coordinate chart exists, meaning that a mapping r : Rn → M
+exists, which is both differentiable and invertible and for which the inverse is also differentiable (a
+diffeomorphism). Although this restricts the class of manifolds available (the sphere, for example, is
+not in this class), it is again suitable for our needs and avoids the practical challenges of switching
+between coordinate patches. The connection with Rn defined through r is crucial for making sense of
+differentiability in M. We say a function f : M → R is “differentiable” if ( f ◦ r) : Rn → R is [39].
+As has been stated, Equation (29) can be induced via a Euclidean inner product, which we denote
+⟨·, ·⟩. However, it will aid intuition to think of distances in Rn via curves:
+
+γ : [0, 1] → Rn.
+(30)
+
+We could think of the distance between two points x, y ∈ Rn as the minimum length among all
+curves that pass through x and y. If γ(0) = x and γ(1) = y, the length is defined as:
+
+\[ L(\gamma) = \int_0^1 \sqrt{\langle \gamma'(t), \gamma'(t) \rangle}\, dt, \tag{31} \]
+
+giving the metric:
+d(x, y) = inf {L(γ) : γ(0) = x, γ(1) = y} .
+(32)
+
+In Rn, the curve with a minimum length will be a straight line, so that Equation (32) agrees with
+Equation (29). More generally, we call a solution to Equation (32) a geodesic [38].
+
+In a vector space, metric properties can always be induced through an inner product (which also
+gives a notion of orthogonality). Such a space can be thought of as “flat”, since for any two points, y
+and z, the straight line ay + (1 − a)z, a ∈ [0, 1] is also contained in the space. In general, manifolds do
+not have vector space structure globally, but do so at the infinitesimal level. As such, we can think
+of them as “curved”. We cannot always define an inner product, but we can still define distances
+through (32). We define a curve on a manifold, M, as γM : [0, 1] → M. At each point γM(t) = p ∈ M,
+the velocity vector, γ′M(t), lies in an n-dimensional vector space, which touches M at p. These are
+known as tangent spaces, denoted TpM, which can be thought of as local linear approximations to M.
+We can define an inner product on each as gp : TpM × TpM → R, which allows us to define a generalisation
+of (31) as:
+
+\[ L(\gamma_M) = \int_0^1 \sqrt{g_p(\gamma_M'(t), \gamma_M'(t))}\, dt, \tag{33} \]
+
+and provides a means to define a distance metric on the manifold as d(x, y) =
+inf {L(γM) : γM(0) = x, γM(1) = y}. We emphasise the difference between this distance metric on M
+and gp, which is called a Riemannian metric or metric tensor and which defines an inner product on
+TpM.
+
+Embeddings and Local Coordinates
+
+So far, we have introduced manifolds as abstract objects. In fact, they can also be considered as
+objects that are embedded in some higher-dimensional Euclidean space. A simple example is any
+two-dimensional surface, such as the unit sphere, lying in R3. If a manifold is embedded in this way,
+then metric properties can be induced from the ambient Euclidean space.
+We seek to make these ideas more concrete through an example, the graph of a function, f (x1, x2),
+of two variables, x1 and x2. The resulting map, r, is:
+
+r : R2 → M
+(34)
+
+r(x1, x2) = (x1, x2, f (x1, x2)).
+(35)
+
+We can see that M is embedded in R3, but that any point can be identified using only two coordinates,
+x1 and x2. In this case, each TpM is a plane, and therefore, a two-dimensional subspace of R3, so: (i) it
+inherits the Euclidean inner product, ⟨·, ·⟩; and (ii) any vector, v ∈ TpM, can be expressed as a linear
+combination of any two linearly independent basis vectors (a canonical choice is the partial derivatives
+∂r/∂x1 := r1 and ∂r/∂x2 := r2, evaluated at x = r⁻¹(p) ∈ R2). The resulting inner product, gp(v, w), between
+two vectors, v, w ∈ TpM, can be induced from the Euclidean inner product as:
+
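+\begin{align*}
+\langle v, w \rangle &= \langle v_1 r_1(x) + v_2 r_2(x),\; w_1 r_1(x) + w_2 r_2(x) \rangle \\
+&= v_1 w_1 \langle r_1(x), r_1(x) \rangle + v_1 w_2 \langle r_1(x), r_2(x) \rangle + v_2 w_1 \langle r_2(x), r_1(x) \rangle + v_2 w_2 \langle r_2(x), r_2(x) \rangle \\
+&= v^T G(x) w,
+\end{align*}
+
+where:
+
+\[ G(x) = \begin{pmatrix} \langle r_1(x), r_1(x) \rangle & \langle r_1(x), r_2(x) \rangle \\ \langle r_1(x), r_2(x) \rangle & \langle r_2(x), r_2(x) \rangle \end{pmatrix} \tag{36} \]
+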
+
+and we use vi, wi to denote the components of v and w. To write (33) using this notation, we define the
+curve, x(t) ∈ R2, corresponding to γM(t) ∈ M as x = (r⁻¹ ◦ γM) : [0, 1] → R2. Equation (33) can then
+be written:
+
+\[ L(\gamma_M) = \int_0^1 \sqrt{x'(t)^T G(x(t))\, x'(t)}\, dt, \tag{37} \]
+
+which can be used in (32) as before.
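+
+As a numerical sanity check of Equation (37), the sketch below computes the length of a curve in two ways: from the local coordinates via G(x), and directly in the ambient space R3. The surface f(x1, x2) = sin(x1) + 1, the curve x(t) = (t, t) and all function names are assumptions chosen for illustration.
+
```python
import math


def f(x1, x2):
    # Illustrative surface (an assumption): the graph of sin(x1) + 1.
    return math.sin(x1) + 1.0


def metric(x1, x2):
    """G(x) for r(x1, x2) = (x1, x2, f(x1, x2)) with f = sin(x1) + 1.

    Here r1 = (1, 0, cos(x1)) and r2 = (0, 1, 0), so G has entries <r_i, r_j>.
    """
    c = math.cos(x1)
    return [[1.0 + c * c, 0.0], [0.0, 1.0]]


def length_local(n=20_000):
    """Midpoint rule for the integral of sqrt(x'(t)^T G(x(t)) x'(t)) along x(t) = (t, t)."""
    total = 0.0
    for k in range(n):
        t = (k + 0.5) / n
        g = metric(t, t)
        v = (1.0, 1.0)  # x'(t)
        quad = sum(v[i] * g[i][j] * v[j] for i in range(2) for j in range(2))
        total += math.sqrt(quad) / n
    return total


def length_embedded(n=20_000):
    """Polyline length of the embedded curve (t, t, f(t, t)) in R^3."""
    total = 0.0
    prev = (0.0, 0.0, f(0.0, 0.0))
    for k in range(1, n + 1):
        t = k / n
        cur = (t, t, f(t, t))
        total += math.dist(prev, cur)
        prev = cur
    return total


print(length_local(), length_embedded())
```
+
+The two numbers agree up to quadrature error, which illustrates that the two-dimensional local coordinates and G(x) are sufficient to recover lengths measured on the embedded surface.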
+
+The key point is that, although we have started with an object embedded in R3, we can compute
+the Riemannian metric, gp(v, w) (and, hence, distances in M), using only the two-dimensional “local”
+coordinates (x1, x2). We also need not have explicit knowledge of the mapping, r, only the components
+of the positive definite matrix, G(x). The Nash embedding theorem [41] in essence enables us to define
+manifolds by the reverse process: simply choose the matrix, G(x), so that we define a metric space
+with suitable distance properties, and some object embedded in some higher-dimensional Euclidean
+space will exist for which these metric properties can be induced as above. Therefore, to define our new
+space, we simply choose an appropriate matrix-valued map, G(x) (we discuss this choice in Section
+4.4). If G(x) does not depend on x, then M has a vector space structure and can be thought of as “flat”.
+Trivially, G(x) = I gives Euclidean n-space.
+We can also define volumes on a Riemannian manifold in local coordinates. Following standard
+coordinate transformation rules, we can see that for the above example, the area element, dx, in R2
+will change according to a Jacobian J = |(Dr)^T(Dr)|^{1/2}, where Dr = ∂(p1, p2, p3)/∂(x1, x2). This
+reduces to J = |G(x)|^{1/2}, which is also the case for more general manifolds [38]. We therefore define
+the Riemannian volume measure on a manifold, M, in local coordinates as:
+
+\[ \mathrm{Vol}_M(dx) = |G(x)|^{\frac{1}{2}}\, dx. \tag{38} \]
+
+If G(x) = I, then this reduces to the Lebesgue measure.
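+
+The volume element in Equation (38) can be compared with the classical surface-area element |r1 × r2| from vector calculus. The sketch below does this for the assumed surface r(x1, x2) = (x1, x2, sin(x1) + 1) over the unit square; the surface, the integration domain and the function names are illustrative choices.
+
```python
import math


def sqrt_det_metric(x1, x2):
    # For r(x1, x2) = (x1, x2, sin(x1) + 1): G = diag(1 + cos^2(x1), 1).
    return math.sqrt(1.0 + math.cos(x1) ** 2)


def cross_norm(x1, x2):
    # |r1 x r2| with r1 = (1, 0, cos(x1)) and r2 = (0, 1, 0), the classical area element.
    c = math.cos(x1)
    r1, r2 = (1.0, 0.0, c), (0.0, 1.0, 0.0)
    cross = (
        r1[1] * r2[2] - r1[2] * r2[1],
        r1[2] * r2[0] - r1[0] * r2[2],
        r1[0] * r2[1] - r1[1] * r2[0],
    )
    return math.sqrt(sum(ci * ci for ci in cross))


def area(density, n=200):
    """Midpoint rule for the integral of `density` over the unit square [0, 1]^2."""
    h = 1.0 / n
    return sum(
        density((i + 0.5) * h, (j + 0.5) * h) * h * h
        for i in range(n)
        for j in range(n)
    )


print(area(sqrt_det_metric), area(cross_norm))
```
+
+Both densities give the same area, and both exceed the Lebesgue area of the unit square because the surface is not flat.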
+
+4.3. Diffusions on Manifolds
+
+By a “diffusion on a manifold” in local coordinates, we actually mean a diffusion defined on
+Euclidean space. For example, a realisation of Brownian motion on the surface, S ⊂ R3, defined
+in Figure 2 through r(x1, x2) = (x1, x2, sin(x1) + 1) will be a sample path, which is defined on S
+and “looks locally” like Brownian motion in a neighbourhood of any point, p ∈ S. However, the
+pre-image of this sample path (through r⁻¹) will not be a realisation of a Brownian motion defined on
+R2, owing to the nonlinearity of the mapping. Therefore, to define “Brownian motion on S”, we define
+some diffusion (Xt)t≥0 that takes values in R2, for which the process (r(Xt))t≥0 “looks locally” like a
+Brownian motion (and lies on S). See [42] for more intuition here.
+Our goal, therefore, is to define a diffusion on Euclidean space, which, when mapped onto a
+manifold through r, becomes the Langevin diffusion described in (24) by the above procedure. Such a
+diffusion takes the form:
+
+\[ dX_t = \frac{1}{2}\tilde{\nabla} \log \tilde{\pi}(X_t)\,dt + d\tilde{B}_t, \tag{39} \]
+
+where those objects marked with a tilde must be defined appropriately. The next few paragraphs
+are technical, and readers aiming to simply grasp the key points may wish to skip to the end of
+this Subsection.
+We turn first to ( ˜Bt)t≥0, which we use to denote Brownian motion on a manifold. Intuitively,
+we may think of a construction based on embedded manifolds, by setting ˜B0 = p ∈ M, and for
+each increment sampling some random vector in the tangent space TpM, and then moving along the
+manifold in the prescribed direction for an infinitesimal period of time before re-sampling another
+velocity vector from the next tangent space [42]. In fact, we can define such a construction using
+Stratonovich calculus and show that the infinitesimal generator can be written using only local
+coordinates [28]. Here, we instead take the approach of generalising the generator directly from
+Euclidean space to the local coordinates of a manifold, arriving at the same result. We then deduce the
+stochastic differential equation describing ( ˜Bt)t≥0 in Itô form using (20).
+For a standard Brownian motion on Rn, A = △/2, where △ denotes the Laplace operator:
+
+\[ \triangle f = \sum_i \frac{\partial^2 f}{\partial x_i^2} = \mathrm{div}(\nabla f). \tag{40} \]
+
+Substituting A = △/2 into (20) trivially gives bi(x) = 0 ∀i, Vij(x) = 1{i=j}, as required. The Laplacian,
+△ f (x), is the divergence of the gradient vector field of some function, f ∈ C²(Rn), and its value at
+x ∈ Rn can be thought of as comparing the average value of f in some neighbourhood of x with f (x) [43].
+
+Figure 2. A two-dimensional manifold (surface) embedded in R3 through r(x1, x2) = (x1, x2, sin(x1) + 1),
+parametrised by the local coordinates, x1 and x2. The distance between points A and B is given by
+the length of the curve γ(t) = (t, t, sin(t) + 1).
+
+To define a Brownian motion on any manifold, the gradient and divergence must be generalised.
+We provide a full derivation in Appendix B, which shows that the gradient operator on a manifold can
+be written in local coordinates as ∇M = G⁻¹(x)∇. Combining with the operator, divM, we can define
+a generalisation of the Laplace operator, known as the Laplace–Beltrami operator (e.g., [44,45]), as:
+
+\[ \triangle_{LB} f = \mathrm{div}_M(\nabla_M f) = |G(x)|^{-\frac{1}{2}} \sum_{i=1}^n \frac{\partial}{\partial x_i}\left( |G(x)|^{\frac{1}{2}} \sum_{j=1}^n \{G^{-1}(x)\}_{ij} \frac{\partial f}{\partial x_j} \right), \tag{41} \]
+
+for some f ∈ C₀²(M).
+The generator of a Brownian motion on M is △LB/2 [44]. Using (20), the resulting diffusion has
+dynamics given by:
+
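+\begin{align*}
+d\tilde{B}_t &= \Omega(X_t)\,dt + \sqrt{G^{-1}(X_t)}\,dB_t, \\
+\Omega_i(X_t) &= \frac{1}{2}|G(X_t)|^{-\frac{1}{2}} \sum_{j=1}^n \frac{\partial}{\partial x_j}\left( |G(X_t)|^{\frac{1}{2}} \{G^{-1}(X_t)\}_{ij} \right).
+\end{align*}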
+
+Those familiar with the Itô formula will not be surprised by the additional drift term, Ω(Xt). As Itô
+integrals do not follow the chain rule of ordinary calculus, non-linear mappings of martingales, such
+as (Bt)t≥0, typically result in drift terms being added to the dynamics (e.g., [27]).
+To define ˜∇, we simply note that this is again the gradient operator on a general manifold, so
+˜∇ = G⁻¹(x)∇. For the density, ˜π(x), we note that this density will now implicitly be defined with
+respect to the volume measure, |G(x)|^{1/2} dx, on the manifold. Therefore, to ensure the diffusion (39) has
+the correct invariant density with respect to the Lebesgue measure, we define:
+
+\[ \tilde{\pi}(x) = \pi(x)|G(x)|^{-\frac{1}{2}}. \tag{42} \]
+
+Putting these three elements together, Equation (39) becomes:
+
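+\[ dX_t = \frac{1}{2}G^{-1}(X_t)\nabla \log\left( \pi(X_t)|G(X_t)|^{-\frac{1}{2}} \right)dt + \Omega(X_t)\,dt + \sqrt{G^{-1}(X_t)}\,dB_t, \]
+
+which, upon simplification, becomes:
+
+\[ dX_t = \frac{1}{2}G^{-1}(X_t)\nabla \log \pi(X_t)\,dt + \Lambda(X_t)\,dt + \sqrt{G^{-1}(X_t)}\,dB_t, \tag{43} \]
+
+\[ \Lambda_i(X_t) = \frac{1}{2}\sum_j \frac{\partial}{\partial x_j}\{G^{-1}(X_t)\}_{ij}. \tag{44} \]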
+
+It can be shown that this diffusion has invariant Lebesgue density π(x), as required [33]. Intuitively,
+when a set is mapped onto the manifold, distances are changed by a factor, $\sqrt{G(x)}$. Therefore, to end
+up with the initial distances, they must first be changed by a factor of $\sqrt{G^{-1}(x)}$ before the mapping,
+which explains the volatility term in Equation (43).
+The resulting Metropolis–Hastings proposal kernel for this “MALA on a manifold” was clarified
+in [33] and is given by:
+
+\[ Q(x, \cdot) \equiv \mathcal{N}\left(x + \frac{\lambda^2}{2}G^{-1}(x)\nabla \log \pi(x) + \lambda^2 \Lambda(x),\; \lambda^2 G^{-1}(x)\right), \tag{45} \]
+
+where λ² is a tuning parameter. The nonlinear drift term here is slightly different to that reported
+in [1,32], for reasons discussed in [33].
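+
+A one-dimensional sketch of a Metropolis–Hastings sampler built on the proposal in Equation (45) is given below. In one dimension, G(x) is a positive scalar g(x), G⁻¹(x) = 1/g(x) and Λ(x) = (1/2) d/dx [1/g(x)]. The metric g(x) = 1 + x², the standard normal target and the function name are arbitrary assumptions made to keep the example self-contained; they are not recommendations from the text.
+
```python
import math
import random


def manifold_mala_1d(log_pi, grad_log_pi, g, dg, x0, lam, n_steps, seed=0):
    """Metropolis-Hastings with the position-dependent proposal of Equation (45) in one dimension.

    g(x) is the scalar metric G(x) > 0 and dg(x) its derivative, so that
    G^{-1}(x) = 1 / g(x) and Lambda(x) = 0.5 * d/dx [1 / g(x)] = -0.5 * dg(x) / g(x)^2.
    """
    rng = random.Random(seed)

    def mean_var(x):
        g_inv = 1.0 / g(x)
        lambda_term = -0.5 * dg(x) / g(x) ** 2
        mean = x + 0.5 * lam**2 * g_inv * grad_log_pi(x) + lam**2 * lambda_term
        return mean, lam**2 * g_inv

    def log_q(x_to, x_from):
        # The variance depends on x_from, so the normalising constant must be kept.
        m, v = mean_var(x_from)
        return -0.5 * math.log(2.0 * math.pi * v) - (x_to - m) ** 2 / (2.0 * v)

    x, samples, n_acc = x0, [], 0
    for _ in range(n_steps):
        m, v = mean_var(x)
        x_new = m + math.sqrt(v) * rng.gauss(0.0, 1.0)
        log_alpha = (log_pi(x_new) + log_q(x, x_new)) - (log_pi(x) + log_q(x_new, x))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x, n_acc = x_new, n_acc + 1
        samples.append(x)
    return samples, n_acc / n_steps


# Illustrative choices (assumptions): standard normal target, metric g(x) = 1 + x^2.
samples, acc = manifold_mala_1d(
    log_pi=lambda x: -0.5 * x * x,
    grad_log_pi=lambda x: -x,
    g=lambda x: 1.0 + x * x,
    dg=lambda x: 2.0 * x,
    x0=0.0, lam=1.0, n_steps=50_000,
)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"acceptance rate = {acc:.2f}, mean = {mean:.3f}, variance = {var:.3f}")
```
+
+Unlike the fixed-variance case, the proposal's normalising constant depends on the current position and therefore does not cancel in the acceptance ratio, which is why log_q keeps the log-variance term.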
+
+4.4. Choosing a Metric
+
+We now turn to the question of which metric structure to put on the manifold, or equivalently,
+how to choose G(x). In this section, we sometimes switch notation slightly, denoting the target density,
+π(x|y), as some of the discussion is directed towards Bayesian inference, where π(·) is the posterior
+distribution for some parameter, x, after observing some data, y. The problem statement is: what is an
+appropriate choice of distance between points in the sample space of a given probability distribution?
+A related (but distinct) question is how to define a distance between two probability distributions
+from the same parametric family, but with different parameters. This has been a key theme in
+information geometry, explored by Rao [46] and others [2] for many years.
+Although generic
+measures of distance between distributions (such as total variation) are often appropriate, based on
+information-theoretic principles, one can deduce that for a given parametric family, {px(y) : x ∈ X },
+it is in some sense natural to consider this “space of distributions” to be a manifold, where the Fisher
+information is the matrix, G(x) (with the α = 0 connection employed; see [2] for details).
+Because of this, Girolami and Calderhead [1] proposed a variant of the Fisher metric for geometric
+Markov chain Monte Carlo, as:
+
+\[ \{G(x)\}_{ij} = \mathbb{E}_{y|x}\left[ -\frac{\partial^2}{\partial x_i \partial x_j} \log f(y|x) \right] - \frac{\partial^2}{\partial x_i \partial x_j} \log \pi_0(x), \tag{46} \]
+
+where π(x|y) ∝ f (y|x)π0(x) is the target density, f denotes the likelihood and π0 the prior. The metric
+is tailored to Bayesian problems, which are a common use for MCMC, so the Fisher information is
+combined with the negative Hessian of the log-prior. One can also view this metric as the expected
+negative Hessian of the log target, since this naturally reduces to (46).
+The motivation for a Hessian-style metric can also be understood from studying MCMC proposals.
+From (45) and by the same logic as for general pre-conditioning methods [32], the objective is to choose
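+G⁻¹(x) to match the covariance structure of π(x|y) locally. If the target density were Gaussian with
+covariance matrix, Σ, then:
+
+\[ -\frac{\partial^2}{\partial x_i \partial x_j} \log \pi(x|y) = \{\Sigma^{-1}\}_{ij}, \tag{47} \]
+
+that is, the negative Hessian equals the precision matrix, Σ⁻¹, and so G⁻¹(x) = Σ.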
+
+In the non-Gaussian case, the negative Hessian is no longer constant, but we can imagine that it
+matches the correlation structure of π(x|y) locally at least. Such ideas have been discussed in the
+geostatistics literature previously [47]. One problem with simply using (47) to define a metric is that
+unless π(x|y) is log-concave, the negative Hessian will not be globally positive-definite, although
+Petra et al. [48] conjecture that it may be appropriate for use in some realistic scenarios and suggest
+some computationally efficient approximation procedures [48].
+
+Example: Take π(x) ∝ 1/(1 + x²), and set G(x) = −∂² log π(x)/∂x². Then, G⁻¹(x) = (1 + x²)²/(2 −
+2x²), which is negative if x² > 1, so unusable as a proposal variance.
+
+Girolami and Calderhead [1] use the Fisher metric in part to counteract this problem. Taking
+expectations over the data ensures that the likelihood contribution to G(x) in (46) will be positive
+(semi-)definite globally (e.g., [49]); so, provided a log-concave prior is chosen, then (46) should
+be a suitable choice for G(x). Indeed, Girolami and Calderhead [1] provide several examples in
+which geometric MCMC methods using this Fisher metric perform better than their “non-geometric”
+counterparts.
+Betancourt [50] also starts from the viewpoint that the Hessian (47) is an appropriate choice for
+G(x) and defines a mapping from the set of n × n matrices to the set of positive-definite n × n matrices
+by taking a “smooth” absolute value of the eigenvalues of the Hessian. This is done in a way such that
+derivatives of G(x) are still computable, which inspired the name, SoftAbs metric. For a fixed
+value of x, the negative Hessian, H(x), is first computed and, then, decomposed into U^T D U, where
+D is the diagonal matrix of eigenvalues. Each diagonal element of D is then altered by the mapping
+tα : R → R, given by:
+tα(λi) = λi coth(αλi),
+(48)
+
+where α is a tuning parameter (typically chosen to be as large as possible while eigenvalues remain
+non-zero numerically). The function, tα, acts as a smooth absolute value function, but also uplifts eigenvalues
+that are close to zero to ≈ 1/α. It should be noted that while the Fisher metric is only defined for
+models in which a likelihood is present and for which the expectation is tractable, the SoftAbs metric
+can be found for any target distribution, π(·).
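+
+The eigenvalue mapping in Equation (48) can be sketched with NumPy as follows. This is an illustration of the mapping only, under the assumption that the negative Hessian is supplied as a symmetric array; the function name softabs_metric and the test matrix are hypothetical. Note that numpy.linalg.eigh returns the decomposition as V diag(λ) V^T, so V plays the role of U^T in the notation above.
+
```python
import numpy as np


def softabs_metric(neg_hessian, alpha):
    """Map a symmetric matrix to a positive-definite one via t_alpha(l) = l * coth(alpha * l)."""
    eigvals, eigvecs = np.linalg.eigh(neg_hessian)  # neg_hessian = V diag(eigvals) V^T
    tiny = np.abs(eigvals) < 1e-12
    safe = np.where(tiny, 1.0, eigvals)  # placeholder avoids 0/0 in the branch below
    # l * coth(alpha * l) = l / tanh(alpha * l), with limiting value 1/alpha as l -> 0.
    soft = np.where(tiny, 1.0 / alpha, safe / np.tanh(alpha * safe))
    return eigvecs @ np.diag(soft) @ eigvecs.T


# Illustrative indefinite "negative Hessian" (an assumption), with eigenvalues -2, 0 and 3.
H = np.diag([-2.0, 0.0, 3.0])
G = softabs_metric(H, alpha=10.0)
print(np.round(np.diag(G), 6))
```
+
+The negative eigenvalue is reflected to a positive one, the zero eigenvalue is lifted to 1/α, and the large positive eigenvalue is left essentially unchanged, so the resulting G is positive definite.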
+Many authors (e.g., [1,48]) have noted that for many problems, the terms involving derivatives
+of G(x) are often small, and so, it is not always worth the computational effort of evaluating them.
+Girolami and Calderhead [1] propose the simplified manifold MALA, in which proposals are of
+the form:
+
+\[ Q(x, \cdot) \equiv \mathcal{N}\left(x + \frac{\lambda^2}{2}G^{-1}(x)\nabla \log \pi(x),\; \lambda^2 G^{-1}(x)\right). \tag{49} \]
+
+
+Using this method means derivatives of G(x) are no longer needed, so more pragmatic ways of
+regularising the Hessian are possible. One simple approach would be to take the absolute values
+of each eigenvalue, giving G(x) = U^T |D| U, where H(x) = U^T D U is the negative Hessian and |D|
+is a diagonal matrix with {|D|}ii = |λi| (this approach may fall into difficulties if eigenvalues are
+numerically zero). Another would be choosing G(x) as the “nearest” positive-definite matrix to the
+negative Hessian, according to some distance metric on the set of n × n matrices. The problem has, in
+fact, been well-studied in mathematical finance, in the context of finding correlations using incomplete
+data sets [51], and tackled using distances induced by the Frobenius norm. Approximate solution
+algorithms are discussed in Higham [51]. It is not clear to us at present whether small changes to
+the Hessian would result in large changes to the corresponding positive definite matrix under a
+given distance or, indeed, whether given a distance metric on the space of matrices, there is always a
+well-defined unique “nearest” positive definite matrix. Below, we provide two simple examples
+showing how a “Hessian-style metric” can alleviate some of the difficulties associated with both
+heavy-tailed and light-tailed target densities.
+
+Example: Take π(x) ∝ 1/(1 + x²), and set G(x) = | − ∂² log π(x)/∂x²|. Then, G⁻¹(x)∇ log π(x) =
+−x(1 + x²)/|1 − x²|, which no longer tends to zero as |x| → ∞, suggesting a manifold variant of MALA with
+a Hessian-style metric may avoid some of the pitfalls of the standard algorithm. Note that the drift may become
+very large if |x| ≈ 1, but since this event occurs with probability zero, we do not see it as a major cause for
+concern.
+
+Example: Take π(x) ∝ e^{−x⁴}, and set G(x) = | − ∂² log π(x)/∂x²|. Then, G⁻¹(x)∇ log π(x) = −x/3,
+which is O(x), so alleviating the problem of spiralling proposals for light-tailed targets demonstrated by MALA
+in an earlier example.
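+
+The drifts quoted in these two examples can be reproduced numerically. In the sketch below, the scalar metric is taken to be the absolute value of the negative second derivative of log π, as in the examples; the helper names are illustrative assumptions.
+
```python
def hessian_metric_drift(grad_log_pi, neg_hessian, x):
    """G^{-1}(x) * grad log pi(x) with the scalar metric G(x) = |negative Hessian of log pi|."""
    return grad_log_pi(x) / abs(neg_hessian(x))


# Heavy-tailed target: pi(x) proportional to 1 / (1 + x^2).
cauchy_grad = lambda x: -2.0 * x / (1.0 + x * x)
cauchy_neg_hess = lambda x: (2.0 - 2.0 * x * x) / (1.0 + x * x) ** 2

# Light-tailed target: pi(x) proportional to exp(-x^4).
quartic_grad = lambda x: -4.0 * x**3
quartic_neg_hess = lambda x: 12.0 * x * x

for x in (3.0, 30.0, 300.0):
    heavy = hessian_metric_drift(cauchy_grad, cauchy_neg_hess, x)
    light = hessian_metric_drift(quartic_grad, quartic_neg_hess, x)
    print(f"x = {x:6.1f}: heavy-tailed drift = {heavy:10.3f}, light-tailed drift = {light:8.3f}")
```
+
+In both cases the drift grows roughly linearly in x, in contrast with the vanishing drift (heavy tails) and cubic drift (light tails) of the standard MALA.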
+
+Other choices for G(x) have been proposed, which are not based on the Hessian. These have the
+advantage that gradients need not be computed (either analytically or using computational methods).
+Sejdinovic et al. [52] propose a Metropolis–Hastings method, which can be viewed as a geometric
+variant of the RWM, where the choice for G(x) is based on mapping samples to an appropriate
+feature space, and performing principal component analysis on the resulting features to choose a local
+covariance structure for proposals.
+If we consider the RWM with Gaussian proposals to be a Euler–Maruyama discretisation of
+Brownian motion on a manifold, then proposals will take the form Q(x, ·) ≡ N(x + λ²Ω(x), λ²G⁻¹(x)).
+If we assume (like in the simplified manifold MALA) that Ω(x) ≈ 0, then we have proposals centred
+at the current point in the Markov chain with a local covariance structure (the full Hastings acceptance
+rate must now be used as q(x′|x) ≠ q(x|x′) in general).
+As no gradient information is needed, the Sejdinovic et al. metric can be used in conjunction with
+the pseudo-marginal MCMC algorithm, so that π(x|y) need not be known exactly. Examples from the
+article demonstrate the power of the approach [52].
+An important property of any Riemannian metric is how it transforms under coordinate change
+(e.g., [2]). The Fisher information metric commonly studied in information geometry is an example of a
+“coordinate invariant” choice for G(x). If we consider two parametrisations for a statistical model given
+by x and z = t(x), computing the Fisher information under x and then transforming this matrix using
+the Jacobian for the mapping, t, will give the same result as computing the Fisher information under z.
+It should be noted that because of either the prior contribution in (46) or the nonlinear transformations
+applied in other cases, none of the metrics we have reviewed here have this property, which means
+that we have no principled way of understanding how G(x) will relate to G(z). It is intuitive, however,
+that using information from all of π(x), rather than only the likelihood contribution, f (y|x), would
+seem sensible when trying to sample from π(·).
+
+5. Survey of Applications
+
+Rather than conduct our own simulation study, we instead highlight some cases in the literature
+where geometric MCMC methods have been used with success.
+Martin et al. [53] consider Bayesian inference for a statistical inverse problem, in which a surface
+explosion causes seismic waves to travel down into the ground (the subsurface medium). Often,
+the properties of the subsurface vary with distance from ground level or because of obstacles in the
+medium, in which case, a fraction of the waves will scatter off these boundaries and be reflected
+back up to ground level at later times. The observations here are the initial explosion and the waves,
+which return to the surface, together with return times. The challenge is to infer the properties of the
+subsurface medium from this data. The authors construct a likelihood based on the wave equation for
+the data and perform Bayesian inference using a variant of the manifold MALA. Figures are provided
+showing the local correlations present in the posterior and, therefore, highlighting the need for an
+algorithm that can navigate the high density region efficiently. Several methods are compared in the
+paper, but the variant of MALA that incorporates a local correlation structure is shown to be the most
+efficient, particularly as the dimension of the problem increases [53].
+
+Calderhead and Girolami [54] dealt with two models for biological phenomena based on nonlinear
+dynamical systems. A model of circadian control in the Arabidopsis thaliana plant comprised a system
+of six nonlinear differential equations, with twenty two parameters to be inferred. Another model
+for cell signalling consisted of a system of six nonlinear differential equations with eight parameters,
+with inference complicated by the fact that observations of the model are not recorded directly [54].
+The resulting inference was performed using RWM, MALA and geometric methods, with the results
+highlighting the benefits of taking the latter approach. The simplified variant of MALA on a manifold
+is reported to have produced the most efficient inferences overall, in terms of effective sample size per
+unit of computational time.
+Stathopoulos and Girolami [55] considered the problem of inferring parameters in Markov jump
+processes. In the paper, a linear noise approximation is shown, which can make inference in such
+models more straightforward, enabling an approximate likelihood to be computed. Models based on
+chemical reaction dynamics are considered; one such from chemical kinetics contained four unknown
+parameters; another from gene expression consisted of seven. Inference was performed using the
+RWM, the simplified manifold MALA and Hamiltonian methods, with the MALA reported as most
+efficient according to the chosen diagnostics. The authors note that the simplified manifold method is
+both conceptually simple and able to account for local correlations, making it an attractive choice for
+inference [55].
+Konukoglu et al. [56] designed a method for personalising a generic model for a physiological
+process to a specific patient, using clinical data. The personalisation took the form of patient-specific
+parameter inference. The authors highlight some of the difficulties of this task in general, including the
+complexity of the models and the relative sparsity of the datasets, which often result in a parameter
+identifiability issue [56]. The example discussed in the paper is the Eikonal-diffusion model describing
+electrical activity in cardiac tissue, which results in a likelihood for the data based on a nonlinear partial
+differential equation, combined with observation noise [56]. A method for inference was developed by
+first approximating the likelihood using a spectral representation and then using geometric MCMC
+methods on the resulting approximate posterior. The method was first evaluated on synthetic data
+and then repeated on clinical data taken from a study for ventricular tachycardia radio-frequency
+ablation [56].
+
+6. Discussion
+
+The geometric viewpoint is not necessary to understand manifold variants of the MALA. Indeed,
+several authors [32,33] have discussed these algorithms without considering them to be “geometric”,
+rather simply Metropolis–Hastings methods in which proposal kernels have a position-dependent
+covariance structure. We do not claim that the geometric view is the only one that should be taken.
+Our goal is merely to point out that such position-dependent methods can often be viewed as methods
+defined on a manifold and that studying the structure of the manifold itself may lead to new insights on
+the methods. For example, taking the geometric viewpoint and noting the connection with information
+geometry enabled Girolami and Calderhead to adopt the Fisher metric for calculations [1]. We list here
+a few open questions on which the geometric viewpoint may help shed some insight.
+Computationally-minded readers will have noted that using position-dependent covariance
+matrices adds a significant computational overhead in practice, with additional O(n^3) matrix inversions
+required at each step of the corresponding Metropolis–Hastings algorithms. Clearly, there will be
+many problems for which the matrix, G(x), does not change very much, and therefore, choosing
+a constant covariance G^{-1}(x) = Σ may result in a more efficient algorithm overall. Geometrically,
+this would correspond to a manifold with scalar curvature close to zero everywhere. It may be that
+geometric ideas could be used to understand whether the manifold is flat enough that a constant choice
+of G(x) is sufficient. To make sense of this truly would require a relationship between curvature, an
+inherently local property and more global statements about the manifold. Many results in differential
+geometry, beginning with the celebrated Gauss–Bonnet theorem, have previously related global and
+
+Entropy 2014, 16, 3074–3102
+
+local properties in this way [57]. It is unknown to the authors whether results exist relating the
+curvature of a manifold to some global property, but this is an interesting avenue for further research.
+A related question is when to choose the simplified manifold MALA over the full method.
+Problems in which the term, ∥Λ(x)∥, is sufficiently large to warrant calculation correspond to those for
+which the manifold has very high curvature in many places; so again, making some global statement
+related to curvature could help here.
+Although there is a reasonably intuitive argument for why the Hessian is an appropriate starting
+point for G(x), the lack of positive-definiteness may be seen as a cause for concern by some. After
+all, it could be argued that if the curvature is not positive-definite in a region, then it cannot be a
+reasonable approximation to the local covariance structure there. Many statistical models used to describe
+natural phenomena are characterised by distributions with heavy tails or multiple modes, for which
+this is the case. In addition, for target densities of the form π(x) ∝ e^{−|x|}, the Hessian is everywhere
+equal to zero! The attempts to force positive-definiteness we have described will typically result in
+small moves being proposed in such regions of the sample space, which may not be an optimal strategy.
+Much work in information geometry has centred on the geometry of Hessian structures [58], and some
+insights from this field may help to better understand the question of choosing an appropriate metric.
+In addition, the field of pseudo-Riemannian geometry deals with forms of G(x), which need not be
+positive-definite [39]; so again, understanding could be gained from here.
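+As a small numerical illustration of the zero-Hessian remark above, the following sketch uses central
+finite differences to estimate the second derivative of log π(x) = −|x| at a few points away from the
+origin (where log π is not differentiable) and contrasts it with a standard Gaussian. The helper name
+second_derivative, the step size and the test points are illustrative assumptions, not part of the paper.

```python
def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def log_laplace(x):
    # log pi(x) = -|x|, up to an additive constant
    return -abs(x)

def log_gauss(x):
    # log pi(x) = -x^2 / 2, up to an additive constant
    return -0.5 * x * x

# Points chosen away from the origin, where -|x| has a kink
points = [-3.0, -1.0, 0.5, 2.0]
laplace_curv = [second_derivative(log_laplace, x) for x in points]
gauss_curv = [second_derivative(log_gauss, x) for x in points]

print("Laplace:", laplace_curv)   # numerically zero: no curvature information
print("Gaussian:", gauss_curv)    # close to -1 at every point
```

+With zero curvature, a Hessian-based G(x) carries no scale information for such a target, which is why
+some alternative or regularised choice of metric is needed in these regions.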
+Some recent work in high-dimensional inference has centred on defining MCMC methods for
+which efficiency scales O(1) with respect to the dimension, n, of π(·) [19,59]. In the case where X
+takes values in some infinite-dimensional function space, this can be done provided a Gaussian prior
+measure is defined for X. A striking result from infinite-dimensional probability spaces is that two
+different probability measures defined over some infinite-dimensional space tend
+to have disjoint supports [60]. The key challenge for MCMC is to define transition kernels for which
+proposed moves are inside the support for π(·). A straightforward approach is to define proposals for
+which the prior is invariant, since the likelihood contribution to the posterior typically will not alter
+its support from that of the prior [19]. However, the posterior may still look very different from the
+prior, as noted in [61], so this proposal mechanism, though O(1), can still result in slow exploration.
+Understanding the geometry of the support and defining methods that incorporate the likelihood term,
+but also respect this geometry, so as to ensure proposals remain in the support of π(·), is an intriguing
+research proposition.
+The methods reviewed in this paper are based on first order Langevin diffusions. Algorithms
+have also been developed that are based on second order Langevin diffusions, in which a stochastic
+differential equation governs the behaviour of the velocity of a process [62,63]. A natural extension to
+the work of Girolami and Calderhead [1] and Xifara et al. [33] would be to map such diffusions onto
+a manifold and derive Metropolis–Hastings proposal kernels based on the resulting dynamics. The
+resulting scheme would be a generalisation of [63], though the most appropriate discretisation scheme
+for a second order process to facilitate sampling is unclear and perhaps a question worthy of further
+exploration.
+We have focused primarily here on the sample space X = R^n and on defining an appropriate
+manifold on which to construct Markov chains. In some inference problems, however, the sample
+space is a pre-defined manifold, for example the set of n × n rotation matrices, commonly found in the
+field of directional statistics [64]. Such manifolds are often not globally mappable to Euclidean n-space.
+Methods have been devised for sampling from such spaces [65,66]. In order to use the methods
+described here for such problems, an appropriate approach for switching between coordinate patches
+at the relevant time would need to be devised, which could be an interesting area of further study.
+Alongside these geometric problems, we can also discuss geometric MCMC methods from a
+statistical perspective. The last example given in the previous section hinted that the manifold MALA
+may cope better with target distributions with heavy tails. In fact, Latuszynski et al. [67] have shown
+that, in one dimension, the manifold MALA is geometrically ergodic for a class of targets of the
+form π(x) ∝ exp(−|x|^β) for any choice of β ≠ 1. This incorporates cases where tails are heavier
+than exponential and lighter than Gaussian, two scenarios under which geometric ergodicity fails for
+the MALA.
+Finding optimal acceptance rates and scaling of λ with dimension are two other related challenges.
+In this case, the picture is more complex. Traditional results have been shown for Metropolis–Hastings
+methods in the case where target distributions are independent and identically distributed or have some
+other suitable symmetry and regularity in the shape of π(·). Manifold methods are, however,
+specifically tailored to scenarios in which this is not the case, scenarios in which there is a high
+correlation between components of x, which changes depending on the value of x. It is less clear how
+to proceed with finding relevant results that can serve as guidelines to practitioners here. Indeed,
+Sherlock [18] notes that a requirement for optimal acceptance rate results for the RWM to be appropriate
+is that the curvature of π(x) does not change too much, yet this is the very scenario in which we would
+want to use a manifold method.
+
+Acknowledgments: We thank the two reviewers for helpful comments and suggestions. Samuel Livingstone is
+funded by a PhD Scholarship from Xerox Research Centre Europe. Mark Girolami is funded by an Engineering
+and Physical Sciences Research Council Established Career Research Fellowship, EP/J016934/1, and a Royal
+Society Wolfson Research Merit Award.
+
+Author Contributions: The article was written by Samuel Livingstone under the guidance of Mark Girolami. All authors
+have read and approved the final manuscript.
+
+Appendix
+
+Appendix A. Total Variation Distance
+
+We show how to obtain (10) from (9). Denoting two probability distributions, μ(·) and ν(·), and
+associated densities, μ(x) and ν(x), we have:
+
+\[ \|\mu(\cdot) - \nu(\cdot)\|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|. \]
+
+Define the set B = {x ∈ X : μ(x) > ν(x)}. To see that B ∈ ℬ, note that B = ∪_{q∈ℚ} ({x ∈ X : μ(x) >
+q} ∩ {x ∈ X : ν(x) < q}), and the result follows from properties of ℬ (e.g., [68]). Now, for any A ∈ ℬ:
+
+\[ \mu(A) - \nu(A) \le \mu(A \cap B) - \nu(A \cap B) \le \mu(B) - \nu(B), \]
+
+and similarly:
+
+\[ \nu(A) - \mu(A) \le \nu(B^c) - \mu(B^c), \]
+
+so the supremum will be attained either at B or B^c. However, since μ(X) = ν(X) = 1, then:
+
+\[ [\mu(B) - \nu(B)] - [\nu(B^c) - \mu(B^c)] = 0, \]
+
+so that
+
+\[ |\mu(B) - \nu(B)| = |\mu(B^c) - \nu(B^c)|. \]
+
+Using these facts gives an alternative characterisation of the total variation distance as:
+
+\[ \|\mu(\cdot) - \nu(\cdot)\|_{TV} = \frac{1}{2}\left( |\mu(B) - \nu(B)| + |\mu(B^c) - \nu(B^c)| \right) = \frac{1}{2} \int_{\mathcal{X}} |\mu(x) - \nu(x)| \, dx \]
+
+as required.
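+The equivalence of the two characterisations can be checked numerically in a discrete analogue, where
+the integral becomes a sum over a finite sample space and the supremum can be taken by brute force over
+all subsets. The sketch below is only an illustration under that assumption; the function names and the
+two probability vectors are hypothetical.

```python
from itertools import combinations

def tv_sup(mu, nu):
    """Total variation as the supremum of |mu(A) - nu(A)| over all subsets A."""
    n = len(mu)
    best = 0.0
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            diff = abs(sum(mu[i] for i in subset) - sum(nu[i] for i in subset))
            best = max(best, diff)
    return best

def tv_half_l1(mu, nu):
    """Total variation as half the L1 distance between the mass functions."""
    return 0.5 * sum(abs(m - v) for m, v in zip(mu, nu))

# Two hypothetical distributions on a four-point space
mu = [0.5, 0.2, 0.2, 0.1]
nu = [0.1, 0.3, 0.2, 0.4]

tv_a = tv_sup(mu, nu)       # supremum is attained at B = {x : mu(x) > nu(x)} = {0}
tv_b = tv_half_l1(mu, nu)   # half of (0.4 + 0.1 + 0.0 + 0.3)
print(tv_a, tv_b)
```

+The brute-force supremum is exponential in the size of the space and is only meant as a check; the
+half-L1 form is the one that is practical to compute.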
+
+Appendix B. Gradient and Divergence Operators on a Riemannian Manifold
+
+The gradient of a function on R^n is the unique vector field, such that, for any unit vector, u:
+
+\[ \langle \nabla f(x), u \rangle = D_u[f(x)] = \lim_{h \to 0} \left\{ \frac{f(x + hu) - f(x)}{h} \right\}, \tag{A1} \]
+
+the directional derivative of f along u at x ∈ R^n.
+On a manifold, the gradient operator, ∇_M, can still be defined, such that the inner product
+g_p(∇_M f(x), u) = D_u[f(x)]. Setting ∇_M = G(x)^{-1}∇ gives:
+
+\[ g_p(\nabla_M f(x), u) = (G^{-1}(x) \nabla f(x))^T G(x) u = \langle \nabla f(x), u \rangle, \]
+
+which is equal to the directional derivative along u as required.
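+A quick numerical check of this identity is sketched below for a two-dimensional example. The metric
+matrix G, the Euclidean gradient and the direction u are arbitrary illustrative values (assumptions, not
+from the paper), and the small linear-algebra helpers are written out so that the example needs no
+external libraries.

```python
def mat_vec(A, x):
    """Matrix-vector product for lists of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(x, y):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(x, y))

def inv2(A):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical symmetric positive-definite metric G(x) at some point x
G = [[2.0, 0.5], [0.5, 1.0]]
grad_f = [1.0, -3.0]   # Euclidean gradient of f at x (illustrative values)
u = [0.6, 0.8]         # a unit direction

grad_M = mat_vec(inv2(G), grad_f)   # manifold gradient G^{-1} grad f
lhs = dot(grad_M, mat_vec(G, u))    # g_p(grad_M f, u) = grad_M^T G u
rhs = dot(grad_f, u)                # <grad f, u>, the directional derivative
print(lhs, rhs)
```

+The two values agree because G(x) is symmetric, so the factors G^{-1}(x) and G(x) cancel inside the
+inner product.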
+The divergence of some vector field, v, at a point, x ∈ R^n, is the net outward flow generated by
+v through some small neighbourhood of x. Mathematically, the divergence of v(x) ∈ R^3 is given by
+∑_i ∂v^i/∂x^i. On a more general manifold, the divergence is also a sum of derivatives, but here, they
+are covariant derivatives. A short introduction is provided in Appendix C. Here, we simply state
+that the covariant derivative of a vector field, v, at a point p ∈ M is the orthogonal projection of the
+directional derivative onto the tangent space, T_pM. Intuitively, a vector field on a manifold is a field
+of vectors, each of which lies in the tangent space at a point, p ∈ M. It only makes sense therefore to
+discuss how vector fields change along the manifold or in the direction of vectors, which also lie in the
+tangent space. Although the idea seems simple, the covariant derivative has some attractive geometric
+properties; notably, it can be completely written in local coordinates and so does not depend on
+knowledge of an embedding in some ambient space.
+The divergence of a vector field, v, defined on a manifold, M, at the point, p ∈ M, is defined as:
+
+\[ \mathrm{div}_M(v) = \sum_{i=1}^{n} D^c_{e_i}[v^i], \]
+
+where e_i denotes the i-th basis vector for the tangent space, T_pM, at p ∈ M, and v^i denotes the i-th
+coefficient. This can be written in local coordinates (see Appendix C) as:
+
+\[ \mathrm{div}_M(v) = |G(x)|^{-\frac{1}{2}} \sum_{i=1}^{n} \frac{\partial}{\partial x^i} \left( |G(x)|^{\frac{1}{2}} v^i \right), \]
+
+and can be combined with ∇M to form the Laplace–Beltrami operator (41).
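+As a hedged worked example of the local-coordinate formula (not taken from the paper), consider the
+plane in polar coordinates x = (r, θ), for which the metric matrix is G(x) = diag(1, r^2), so that
+|G(x)|^{1/2} = r. Writing v^r and v^θ for the components of v in the coordinate basis, the formula gives
+the familiar polar form of the divergence:

```latex
\[
\mathrm{div}_M(v)
  = \frac{1}{r}\left[ \frac{\partial}{\partial r}\left( r\, v^r \right)
      + \frac{\partial}{\partial \theta}\left( r\, v^\theta \right) \right]
  = \frac{1}{r}\frac{\partial (r\, v^r)}{\partial r}
      + \frac{\partial v^\theta}{\partial \theta}.
\]
```

+When G(x) is the identity, |G(x)| = 1 and the expression reduces to the usual sum of partial derivatives.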
+
+Appendix C. Vector Fields and the Covariant Derivative
+
+Here, we provide a short introduction to vector fields and differentiation on a smooth manifold;
+see [38,39]. The following geometric notation is used here: (i) vector components are indexed with a
+superscript, e.g., v = (v^1, ..., v^n); and (ii) repeated subscripts and superscripts are summed over, e.g.,
+v^i e_i = ∑_i v^i e_i (known as the Einstein summation convention).
+For any smooth manifold, M, the set of all tangent vectors to points on M is known as the tangent
+bundle and denoted TM.
+A C^r vector field defined on M is a mapping that assigns to each point, p ∈ M, a tangent vector,
+v(p) ∈ T_pM. In addition, the components of v(p) in any basis for T_pM must also be C^r [38]. We
+will denote the set of all vector fields on M as Γ(TM). For some vector field, v ∈ Γ(TM), at any
+point, p ∈ M, the vector, v(p) ∈ T_pM, can be written as a linear combination of some n basis vectors
+{e_1, ..., e_n} as v = v^i e_i. To understand how v will change in a particular direction along M, it only
+makes sense, therefore, to consider derivatives along vectors in T_pM. Two other things must be
+considered when defining a derivative along a manifold: (i) how the components, v^i, along each basis
+vector will change; and (ii) how each basis vector, e_i, itself will change. For the usual directional
+derivative on R^n, the basis vectors do not change, as the tangent space is the same at each point, but
+for a more general manifold, this is no longer the case: the e_i's are referred to as a “local” basis for each
+T_pM.
+The covariant derivative, D^c, is defined so as to account for these shortcomings. When considering
+differentiation along a vector, u* ∉ T_pM, u* is simply projected onto the tangent space. The derivative
+with respect to any u ∈ T_pM can now be decomposed into a linear combination of derivatives of basis
+vectors and vector components:
+
+\[ D^c_u[v] = D^c_{u^i e_i}[v^j e_j], \tag{A2} \]
+
+where the argument, p, has been dropped, but is implied for both components and local basis vectors.
+The operator, D^c_u[v], is defined to be linear in both u and v and to satisfy the product rule [38]; so,
+Equation (A2) can be decomposed into:
+
+\[ D^c_u[v] = u^i \left( D^c_{e_i}[v^j] e_j + v^j D^c_{e_i}[e_j] \right). \tag{A3} \]
+
+The operator, D^c, need, therefore, only be defined along the direction of basis vectors e_i and for vector
+component v^i and basis vector e_i arguments.
+For components v^i, D^c_{e_j}[v^i] is defined as simply the partial derivative ∂_j v^i := ∂v^i/∂x^j. The
+directional derivative of some basis vector e_i along some e_j is best understood through the example of
+a regular surface Σ ⊂ R^3. Here, D_{e_j}[e_i] will be a vector, w ∈ R^3. Taking the basis for this space at the
+point, p, as {e_1, e_2, n̂}, where n̂ denotes the unit normal to T_pΣ, we can write w = αe_1 + βe_2 + κn̂. The
+covariant derivative, D^c_{e_j}[e_i], is simply the projection of w onto T_pΣ, given by w* = αe_1 + βe_2. More
+generally, at some point, p, in a smooth manifold, M, the covariant derivative D^c_{e_j}[e_i] = Γ^k_{ji} e_k (with
+upper and lower indices summed over). The coefficients, Γ^k_{ji}, are known as the Christoffel symbols: Γ^k_{ji}
+denotes the coefficient of the k-th basis vector when taking the derivative of the i-th with respect to the
+j-th. If a Riemannian metric, g, is chosen for M, then they can be expressed completely as a function
+of g (or in local coordinates as a function of the matrix, G). Using these definitions, Equation (A3) can
+be re-written as:
+
+\[ D^c_u[v] = u^i \left( \partial_i v^k + v^j \Gamma^k_{ij} \right) e_k. \tag{A4} \]
+
+The divergence of a vector field, v ∈ Γ(TM), at the point, p ∈ M, is given by:
+
+\[ \mathrm{div}_M(v) = D^c_{e_i}[v^i], \tag{A5} \]
+
+where, again, repeated indices are summed over. If M = R^n, this reduces to the usual sum of partial
+derivatives, ∂_i v^i. On a more general manifold, M, the equivalent expression is:
+
+\[ D^c_{e_i}[v^i] = \partial_i v^i + v^i \Gamma^j_{ij}, \tag{A6} \]
+
+where, again, repeated indices are summed. As has been previously stated, if a metric, g, and coordinate
+chart is chosen for M, the Christoffel symbols can be written in terms of the matrix, G(x). In this
+case [69]:
+
+\[ \Gamma^j_{ij} = |G(x)|^{-\frac{1}{2}} \partial_i \left( |G(x)|^{\frac{1}{2}} \right), \tag{A7} \]
+
+so Equation (A6) becomes:
+
+\[ D^c_{e_i}[v^i] = |G(x)|^{-\frac{1}{2}} \partial_i \left( |G(x)|^{\frac{1}{2}} v^i \right), \tag{A8} \]
+
+where v = v(x).
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1.
+Girolami, M.; Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat.
+Soc. Ser. B 2011, 73, 123–214.
+2.
+Amari, S.I.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI,
+USA, 2007; Volume 191.
+3.
+Marriott, P.; Salmon, M. Applications of Differential Geometry to Econometrics; Cambridge University Press:
+Cambridge, UK, 2000.
+4.
+Betancourt, M.; Girolami, M. Hamiltonian Monte Carlo for Hierarchical Models. 2013, arXiv: 1312.0906.
+5.
+Neal, R. MCMC using Hamiltonian Dynamics. In Handbook of Markov Chain Monte Carlo; Chapman and
+Hall/CRC: Boca Raton, FL, USA, 2011; pp. 113–162.
+6.
+Betancourt, M.; Stein, L.C. The Geometry of Hamiltonian Monte Carlo. 2011, arXiv: 1112.4118.
+7.
+Robert, C.P.; Casella, G. Monte Carlo Statistical Methods; Springer: New York, NY, USA, 2004; Volume 319.
+8.
+Tierney, L. Markov chains for exploring posterior distributions. Ann. Stat. 1994, 22, 1701–1728.
+9.
+Kipnis, C.; Varadhan, S. Central limit theorem for additive functionals of reversible Markov processes and
+applications to simple exclusions. Commun. Math. Phys. 1986, 104, 1–19.
+10.
+R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing:
+Vienna, Austria, 2012.
+11.
+Plummer, M.; Best, N.; Cowles, K.; Vines, K. CODA: Convergence diagnosis and output analysis for MCMC.
+R. News 2006, 6, 7–11.
+12.
+Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435.
+13.
+Jones, G.L.; Hobert, J.P. Honest exploration of intractable probability distributions via Markov chain Monte
+Carlo. Stat. Sci. 2001, 16, 312–334.
+14.
+Jones, G.L. On the Markov chain central limit theorem. Probab. Surv. 2004, 1, 299–320.
+15.
+Gelman, A.; Rubin, D.B. Inference from iterative simulation using multiple sequences. Stat. Sci. 1992, 7,
+457–472.
+16.
+Sherlock, C.; Fearnhead, P.; Roberts, G.O. The random walk Metropolis: Linking theory and practice through
+a case study. Stat. Sci. 2010, 25, 172–190.
+17.
+Sherlock, C.; Roberts, G. Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal
+targets. Bernoulli 2009, 15, 774–798.
+18.
+Sherlock, C. Optimal scaling of the random walk Metropolis: General criteria for the 0.234 acceptance rule. J.
+Appl. Probab. 2013, 50, 1–15.
+19.
+Beskos, A.; Kalogeropoulos, K.; Pazos, E. Advanced MCMC methods for sampling on diffusion pathspace.
+Stoch. Processes Appl. 2013, 123, 1415–1453.
+20.
+Roberts, G.O.; Rosenthal, J.S. Optimal scaling for various Metropolis–Hastings algorithms. Stat. Sci. 2001,
+16, 351–367.
+21.
+Roberts, G.O.; Tweedie, R.L. Geometric convergence and central limit theorems for multidimensional
+Hastings and Metropolis algorithms. Biometrika 1996, 83, 95–110.
+22.
+Mengersen, K.L.; Tweedie, R.L. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Stat.
+1996, 24, 101–121.
+23.
+Jarner, S.F.; Hansen, E. Geometric ergodicity of Metropolis algorithms.
+Stoch. Processes Appl. 2000,
+85, 341–361.
+24.
+Christensen, O.F.; Møller, J.; Waagepetersen, R.P. Geometric ergodicity of Metropolis–Hastings algorithms
+for conditional simulation in generalized linear mixed models.
+Methodol. Comput. Appl. Probab. 2001,
+3, 309–327.
+25.
+Neal, P.; Roberts, G. Optimal scaling for random walk Metropolis on spherically constrained target densities.
+Methodol. Comput. Appl. Probab. 2008, 10, 277–297.
+26.
+Jarner, S.F.; Tweedie, R.L. Necessary conditions for geometric and polynomial ergodicity of
+random-walk-type Markov chains. Bernoulli 2003, 9, 559–578.
+27.
+Øksendal, B. Stochastic Differential Equations; Springer: New York, NY, USA, 2003.
+
+28.
+Rogers, L.C.G.; Williams, D. Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus; Cambridge
+University Press: Cambridge, UK, 2000; Volume 2.
+29.
+Meyn, S.P.; Tweedie, R.L. Stability of Markovian processes III: Foster–Lyapunov criteria for continuous-time
+processes. Adv. Appl. Probab. 1993, 25, 518–548.
+30.
+Coffey, W.; Kalmykov, Y.P.; Waldron, J.T. The Langevin Equation: with Applications to Stochastic Problems in
+Physics, Chemistry, and Electrical Engineering; World Scientific: Singapore, Singapore, 2004; Volume 14.
+31.
+Roberts, G.O.; Tweedie, R.L.
+Exponential convergence of Langevin distributions and their discrete
+approximations. Bernoulli 1996, 2, 341–363.
+32.
+Roberts, G.O.; Stramer, O. Langevin diffusions and Metropolis–Hastings algorithms. Methodol. Comput.
+Appl. Probab. 2002, 4, 337–357.
+33.
+Xifara, T.; Sherlock, C.; Livingstone, S.; Byrne, S.; Girolami, M.
+Langevin diffusions and the
+Metropolis-adjusted Langevin algorithm. Stat. Probab. Lett. 2013, 91, 14–19.
+34.
+Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. Ser. A
+Math. Phys. Sci. 1946, 186, 453–461.
+35.
+Critchley, F.; Marriott, P.; Salmon, M. Preferred point geometry and statistical manifolds. Ann. Stat. 1993,
+21, 1197–1224.
+36.
+Marriott, P. On the local geometry of mixture models. Biometrika 2002, 89, 77–93.
+37.
+Barndorff-Nielsen, O.; Cox, D.; Reid, N. The role of differential geometry in statistical theory. Int. Stat. Rev.
+1986, 54, 83–96.
+38.
+Boothby, W.M. An Introduction to Differentiable Manifolds and Riemannian Geometry; Academic Press: San
+Diego, CA, USA, 1986; Volume 120.
+39.
+Lee, J.M. Smooth Manifolds; Springer: New York, NY, USA, 2003.
+40.
+Do Carmo, M.P. Riemannian Geometry; Springer: New York, NY, USA, 1992.
+41.
+Nash, J.F., Jr. The imbedding problem for Riemannian manifolds. In The Essential John Nash; Princeton
+University Press: Princeton, NJ, USA, 2002; p. 151.
+42.
+Manton, J.H. A Primer on Stochastic Differential Geometry for Signal Processing. 2013, arXiv: 1302.0430.
+43.
+Stewart, J. Multivariable Calculus; Cengage Learning: Boston, MA, USA, 2011.
+44.
+Hsu, E.P. Stochastic Analysis on Manifolds; American Mathematical Society: Providence, RI, USA, 2002;
+Volume 38.
+45.
+Kent, J. Time-reversible diffusions. Adv. Appl. Probab. 1978, 10, 819–835.
+46.
+Radhakrishna Rao, C. Information and accuracy attainable in the estimation of statistical parameters. Bull.
+Calcutta Math. Soc. 1945, 37, 81–91.
+47.
+Christensen, O.F.; Roberts, G.O.; Sköld, M. Robust Markov chain Monte Carlo methods for spatial generalized
+linear mixed models. J. Comput. Graph. Stat. 2006, 15, 1–17.
+48.
+Petra, N.; Martin, J.; Stadler, G.; Ghattas, O. A computational framework for infinite-dimensional Bayesian
+inverse problems: Part II. Stochastic Newton MCMC with application to ice sheet flow inverse problems.
+2013, arXiv: 1308.6221.
+49.
+Pawitan, Y. In All Likelihood: Statistical Modelling and Inference Using Likelihood; Oxford University Press:
+Oxford, UK, 2001.
+50.
+Betancourt, M. A General Metric for Riemannian Manifold Hamiltonian Monte Carlo. In Geometric Science of
+Information; Springer: New York, NY, USA, 2013; pp. 327–334.
+51.
+Higham, N.J. Computing the nearest correlation matrix—a problem from finance. IMA J. Numer. Anal. 2002,
+22, 329–343.
+52.
+Sejdinovic, D.; Garcia, M.L.; Strathmann, H.; Andrieu, C.; Gretton, A. Kernel Adaptive Metropolis–Hastings.
+2013, arXiv: 1307.5302.
+53.
+Martin, J.; Wilcox, L.C.; Burstedde, C.; Ghattas, O. A stochastic Newton MCMC method for large-scale
+statistical inverse problems with application to seismic inversion. SIAM J. Sci. Comput. 2012,
+34, A1460–A1487.
+54.
+Calderhead, B.; Girolami, M. Statistical analysis of nonlinear dynamical systems using differential geometric
+sampling methods. Interface Focus 2011, 1, 821–835.
+55.
+Stathopoulos, V.; Girolami, M.A. Markov chain Monte Carlo inference for Markov jump processes via the
+linear noise approximation. Philos. Trans. R. Soc. A 2013, 371, 20110541.
+
+56.
+Konukoglu, E.; Relan, J.; Cilingir, U.; Menze, B.H.; Chinchapatnam, P.; Jadidi, A.; Cochet, H.; Hocini, M.;
+Delingette, H.; Jaïs, P.; et al. Efficient probabilistic model personalization integrating uncertainty on data and
+parameters: Application to eikonal-diffusion models in cardiac electrophysiology. Prog. Biophys. Mol. Biol.
+2011, 107, 134–146.
+57.
+Do Carmo, M.P. Differential Geometry of Curves and Surfaces; Prentice-Hall: Englewood Cliffs,
+NJ, USA, 1976; Volume 2.
+58.
+Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, Singapore, 2007; Volume 1.
+59.
+Cotter, S.; Roberts, G.; Stuart, A.; White, D. MCMC methods for functions: Modifying old algorithms to
+make them faster. Stat. Sci. 2013, 28, 424–446.
+60.
+Da Prato, G.; Zabczyk, J. Stochastic Equations in Infinite Dimensions; Cambridge University Press: Cambridge,
+UK, 2008.
+61.
+Law, K.J. Proposals which speed up function-space MCMC. J. Comput. Appl. Math. 2014, 262, 127–138.
+62.
+Ottobre, M.; Pillai, N.S.; Pinski, F.J.; Stuart, A.M. A Function Space HMC Algorithm With Second Order
+Langevin Diffusion Limit. 2013, arXiv: 1308.0543.
+63.
+Horowitz, A.M. A generalized guided Monte Carlo algorithm. Phys. Lett. B 1991, 268, 247–252.
+64.
+Mardia, K.V.; Jupp, P.E. Directional Statistics; Wiley: New York, NY, USA, 2009; Volume 494.
+65.
+Byrne, S.; Girolami, M. Geodesic Monte Carlo on embedded manifolds. Scand. J. Stat. 2013, 40, 825–845.
+66.
+Diaconis, P.; Holmes, S.; Shahshahani, M. Sampling from a manifold. In Advances in Modern Statistical Theory
+and Applications: A Festschrift in Honor of Morris L. Eaton; Institute of Mathematical Statistics: Washington,
+DC, USA, 2013; pp. 102–125.
+67.
+Latuszynski, K.; Roberts, G.O.; Thiery, A.; Wolny, K. Discussion on “Riemann manifold Langevin and
+Hamiltonian Monte Carlo methods” (by Girolami, M. and Calderhead, B.). J. R. Stat. Soc. Ser. B 2011,
+73, 188–189.
+68.
+Capinski, M.; Kopp, P.E. Measure, Integral and Probability; Springer: New York, NY, USA, 2004.
+69.
+Schutz, B.F. Geometrical Methods of Mathematical Physics; Cambridge University Press: Cambridge, UK, 1984.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+entropy
+
+Article
+Variational Bayes for Regime-Switching
+Log-Normal Models
+
+Hui Zhao and Paul Marriott *
+
+University of Waterloo, 200 University Avenue West, Waterloo, ON N2L 3G1, Canada; E-Mail: h6zhao@uwaterloo.ca
+* E-Mail: pmarriot@uwaterloo.ca; Tel.: +1-519-888-4567.
+
+Received: 14 April 2014; in revised form: 12 June 2014 / Accepted: 7 July 2014 /
+Published: 14 July 2014
+
+Abstract: The power of projection using divergence functions is a major theme in information
+geometry. One version of this is the variational Bayes (VB) method. This paper looks at VB in the
+context of other projection-based methods in information geometry. It also describes how to apply
+VB to the regime-switching log-normal model and how it provides a computationally fast solution to
+quantify the uncertainty in the model specification. The results show that the method can recover
+the model structure exactly, gives reasonable point estimates and is computationally very efficient.
+The potential problems of the method in quantifying the parameter uncertainty are discussed.
+
+Keywords: information geometry; variational Bayes; regime-switching log-normal model; model
+selection; covariance estimation
+
+1. Introduction
+
+While, in principle, the calculation of the posterior distribution is mathematically straightforward,
+in practice, the computation of many of its features, such as posterior densities, normalizing constants
+and posterior moments, is a major challenge in Bayesian analysis. Such computations typically
+involve high dimensional integrals, which often have no analytical or tractable forms. The variational
+Bayes (VB) method was developed to generate tractable approximations to these quantities. This
+method provides analytic approximations to the posterior distribution by minimizing the Kullback–Leibler
+(KL) divergence from the approximations to the actual posterior and has been demonstrated to
+be computationally very fast.
+VB gains its computational advantages by making simplifying assumptions about the posterior
+dependence structure.
+For example, in the simplest form, it assumes posterior independence
+between selected sets of parameters. Under these assumptions, the resultant approximate posterior
+is either known analytically or can be computed by a simple iteration algorithm similar to the
+Expectation-Maximization (EM) algorithm. In this paper, we show that, as well as having advantages
+of computational speed, the VB algorithm does an excellent job of model selection, in particular in
+finding the appropriate number of regimes.
+While the simplification in the dependence gives computational advantages, it also comes at
+a cost. For example, we also found that the posterior variance may be underestimated. In [1], we
+propose a novel method to compute the true posterior covariance matrix by only using the information
+obtained from VB approximations.
+The use of projections to particular families is, of course, not new to information geometry (IG).
+In [2], we find the famous Pythagorean results concerning projection using α-divergences to α-families,
+and other important results on projections based on divergences can be found in [3] and [4] (Chapter 7).
+
+Entropy 2014, 16, 3832–3847; doi:10.3390/e16073832
+www.mdpi.com/journal/entropy
+
+1.1. Variational Bayes
+
+Suppose that, in a Bayesian inference problem, we use q(τ) to approximate the posterior p(τ|y),
+where y is the data and τ = {τ_1, ..., τ_p} the model parameter vector. The KL divergence between
+them is defined as,
+
+\[ \mathrm{KL}[q(\tau) \,\|\, p(\tau|y)] = \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)} \, d\tau, \tag{1} \]
+
+provided the integral exists. We want to balance two things, having the discrepancy between p and q
+small, while keeping q tractable. Hence, we want to seek q(τ), which minimizes Equation (1), while
+keeping q(τ) in an analytically tractable form. First, note that the evaluation of Equation (1) requires
+p(τ|y), which may be unavailable, since in the general Bayesian problem, its normalizing constant is
+one of the main intractable integrals. However, we note that:
+
+\[ \mathrm{KL}[q(\tau) \,\|\, p(\tau|y)] = \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)\,p(y)} \, d\tau + \log p(y) = -\int q(\tau) \log \frac{p(\tau, y)}{q(\tau)} \, d\tau + \log p(y). \tag{2} \]
+
+Thus, minimizing Equation (1) is equivalent to maximizing the first term of the right-hand side of
+Equation (2). The key computational point is that, often, the term p(τ, y) is available even when the
+full posterior p(τ, y) / ∫ p(τ, y) dτ is not.
+
+Definition 1. Let F(q) = ∫ q(τ) log [p(τ, y)/q(τ)] dτ and:
+
+\[ \hat{q} = \arg\max_{q \in Q} F(q), \tag{3} \]
+
+where Q is a predetermined set of probability density functions over the parameter space. Then q̂ is called the
+variational approximation or variational posterior distribution, and functions of q̂ (such as mean, variance, etc.)
+are called variational parameters.
+
+Some of the power of Definition 1 comes when we assume that all elements of Q have tractable
+posteriors. In that case, all variational parameters will then also be tractable when the optimization
+can be achieved. A prime example of a choice for Q is the set of all densities that factorize as
+
+\[ q(\tau) = \prod_{i=1}^{d} q_i(\tau_i). \]
+
+This reduces the computational problem from computing a high dimensional integral to one of
+computing a number of one-dimensional ones. Furthermore, as we see in the example of this paper, it
+is often the case that the variational families are standard exponential families (since they are often
+‘maximum entropy models’ in some sense), and the optimisation problem (3) can be solved by simple
+iterative methods with very fast convergence.
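+One common way to carry out such an iteration for a factorized family is coordinate ascent, in which each
+factor is updated in turn as q_i(τ_i) ∝ exp(E[log p(τ, y)]), with the expectation taken over the other
+factors. The sketch below applies this to a toy discrete joint on a 2 × 3 grid. It is a minimal
+illustration under these assumptions; the table P and all names are hypothetical, and it is not the
+algorithm derived for the RSLN model in this paper.

```python
import math

# Hypothetical unnormalised joint p(tau1, tau2, y) on a 2 x 3 grid (y held fixed).
P = [[0.20, 0.05, 0.05],
     [0.05, 0.10, 0.25]]

def normalise(w):
    """Scale non-negative weights so that they sum to one."""
    s = sum(w)
    return [x / s for x in w]

def free_energy(q1, q2):
    """F(q) = sum of q log(p / q) for the factorised q(tau) = q1(tau1) q2(tau2)."""
    total = 0.0
    for a, qa in enumerate(q1):
        for b, qb in enumerate(q2):
            total += qa * qb * math.log(P[a][b] / (qa * qb))
    return total

q1 = [0.5, 0.5]
q2 = [1 / 3, 1 / 3, 1 / 3]
history = [free_energy(q1, q2)]
for _ in range(50):
    # Update q1 with q2 held fixed: q1(a) proportional to exp(E_{q2}[log p(a, tau2, y)])
    q1 = normalise([math.exp(sum(qb * math.log(P[a][b]) for b, qb in enumerate(q2)))
                    for a in range(2)])
    history.append(free_energy(q1, q2))
    # Update q2 with q1 held fixed, by the symmetric formula
    q2 = normalise([math.exp(sum(qa * math.log(P[a][b]) for a, qa in enumerate(q1)))
                    for b in range(3)])
    history.append(free_energy(q1, q2))

log_evidence = math.log(sum(sum(row) for row in P))
print(history[0], history[-1], log_evidence)
```

+Each update maximizes F over one factor with the other fixed, so F never decreases along the iteration
+and stays below log p(y); the remaining gap is the KL divergence from the best factorized q found to
+the true posterior.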
+The method builds on the principle of variational free energy
+minimization in physics, which is concerned with finding the maxima and minima of a functional over
+a class of functions, and the method takes its name from this root. Early developments of the method
+can be found in machine learning, especially in applications on neural networks [5,6]. The method
+has been successfully applied in many different disciplines and domains, for example, in independent
+component analysis [7,8], graphical models [9,10], information retrieval [11] and factor analysis [12].
+In the statistical literature, an early application of the variational principle can be found in the
+work of [13] to construct Bayes estimators. In recent years, the method has received more attention
+from both applied and theoretical perspectives, for example [14–18].
+
+1.2. Regime-Switching Models
+
+In this paper, we illustrate the strengths and weaknesses of VB through a detailed case study.
+In particular, we look at a model that is used in finance, risk management and actuarial science, the
+so-called regime-switching log-normal model (RSLN) proposed, in this context, by [19].
+Switching between different states, or regimes, is a common phenomenon in many time series,
+and regime-switching models, originally proposed by [20], have been used to model these switching
+processes. As demonstrated in [21], the maximum likelihood estimate (MLE) does not give a simple
+method to deal with parameter uncertainty; for details of this method, see [21]. The asymptotic
+normality of maximum likelihood estimators may not apply for sample sizes commonly found in
+practice. Hence, to understand parameter uncertainty, [21] considered the RSLN model in a Bayesian
+framework using the Metropolis–Hastings algorithm. Furthermore, model uncertainty, in particular
+selecting the correct number of regimes, is a major issue. Hence, model selection criteria have to be
+used to choose the “best” model. Hardy [19] found that a two-regime RSLN model maximized the
+Bayes information criterion (BIC) [22] for both monthly TSE 300 total return data and S&P 500 total
+return data; however, according to the Akaike information criterion (AIC) [23], a three-regime model
+was the optimal on S&P data. To account for the model uncertainty associated with the number of
+regimes, [24] offered a trans-dimensional model using reversible jump MCMC [25]. We note that BIC
+is not necessarily ideal for model selection with state space models [26], while it is still commonly used
+in the literature.
+MCMC methods make possible the computation of all posterior quantities; however there are a
+number of practical issues associated with their implementation. A primary concern is determining
+that the generated chain has, in fact, “converged”. In practice, MCMC practitioners have to resort
+to convergence diagnostic techniques. Furthermore, the computational cost can be a concern. Other
+implementational issues include the difficulty of making good initialisation choices, implementing the
+MCMC algorithm in one long chain or several shorter chains in parallel, etc. Detailed discussions can
+be found in [27].
+One of the main contributions of this paper is to apply the variational Bayes (VB) method to the
+RSLN model and present a solution to quantify the uncertainty in model specification. The VB method
+is a technique that provides analytical approximations to the posterior quantities, and in practice, it
+has been demonstrated to be a much faster alternative to MCMC methods.
+
+2. Variational Bayes and Informational Geometry
+
+In this section, we explore the relationship between VB and IG, in particular the statistical
+properties of divergence-based projections onto exponential families. Here, we used the IG of [2], in
+particular the ±1 dual affine parameters for exponential families. One of the most striking results
+from [2] is the Pythagorean property of these dual affine coordinate systems. This is illustrated in
+Figure 1, which shows a schematic representing a model space containing the distribution f0(x) and
+an exponential family f (x; θ).
+
+
+[Figure 1: schematic of the distribution f0(x), the exponential family f(x; θ), and the −1-geodesic and +1-geodesic projections between them.]
+
+Figure 1. Projections onto an exponential family.
+
+The Pythagorean result comes from using the KL divergence to project onto the exponential family
+f (x; θ) = ν(x) exp {s(x)θ − ψ(θ)}, i.e.,
+
+$$\min_{\theta} \int -\log\left(\frac{f(x;\theta)}{f_0(x)}\right) f_0(x)\,dx.$$
+
+All distributions that project to the same point form a −1-flat space defined by all distributions f (x)
+with the same mean, i.e.,
+$$E_{\hat{\theta}}(s(x)) = E_{f(x)}(s(x)),$$
+
+and further, it is Fisher orthogonal to the +1-flat family f (x; θ). The statistical interpretation of this
+concerns the behaviour of a model f (x, θ) when the data generation process does not lie in the model.
+In contrast to this, we have the VB method, which uses the reverse KL divergence for the projection,
+i.e.,
+
+$$\min_{\theta} \int \log\left(\frac{f(x;\theta)}{f_0(x)}\right) f(x;\theta)\,dx.$$
+
+This results in a Fisher orthogonal projection, shown in Figure 1, but now using a +1-flat family.
+This does not have the property that the mean of s(x) is constant, but as we shall see, it does have nice
+computational properties when used in the context of Bayesian analysis.
+In order to investigate the information geometry of VB, we consider two examples. The first,
+in Section 3.1, is selected to maximally illustrate the underlying geometric issues and to get some
+understanding of the quality of the VB approximation. The second, in Section 3.2, shows an important
+real-world application from actuarial science and is illustrated with simulated and real data.
+
+3. Applications of Variational Bayes
+
+3.1. Geometric Foundation
+
+We consider the simplest model that shows dependence. Let X1, X2 be two binary random
+variables, with distribution π := (π00, π10, π01, π11), where P(X1 = i, X2 = j) = πij, i, j ∈ {0, 1}.
+Further, let the marginal distributions be denoted by π1 = P(X1 = 1), π2 = P(X2 = 1). We want to
+consider the geometry of the VB projection from a general distribution to the family of independent
+distributions. This represents the way that VB gains its computational advantages by simplifying the
+posterior dependence structure.
+The model space is illustrated in Figure 2, where π is represented by a point in the three simplex,
+and the independence surface, where π00π11 = π10π01, is also shown.
+
+
+[Figure 2: three-dimensional plot with axes x, y and z showing the simplex of distributions and the independence surface.]
+
+Figure 2. Space of distributions with independence surface: marginal probability and dependence.
+
+Both the interior of the simplex and independence surface are exponential families, and it is
+convenient to use the natural parameters for the interior of the simplex:
+
+$$\xi_1 = \log\frac{\pi_{10}}{\pi_{00}}, \quad \xi_2 = \log\frac{\pi_{01}}{\pi_{00}}, \quad \xi_3 = \log\frac{\pi_{11}\pi_{00}}{\pi_{10}\pi_{01}},$$
+
+where the independence surface is given by ξ3 = 0. The independence surface can also be
+parameterised by the marginal distributions π1, π2 or the corresponding natural parameters
+ξ_i^{ind} := log(πi/(1 − πi)). Any distribution π, represented in natural parameters by (ξ1, ξ2, ξ3), has its VB
+approximation defined implicitly by the simultaneous equations:
+
+$$\xi_1^{ind}(\pi_1) = \xi_1 + \xi_3 \pi_2, \tag{4}$$
+
+$$\xi_2^{ind}(\pi_2) = \xi_2 + \xi_3 \pi_1. \tag{5}$$
+
+These can be solved, as is typical with VB methods, by iterating updated estimates of π1 and π2
+across the two equations. We show this in a realistic example in the following section.
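+
+The alternating iteration over Equations (4) and (5) can be sketched in a few lines. This is only an illustration, not the authors' code: the function names, the starting point (0.5, 0.5), the tolerance and the iteration cap are all assumptions introduced here. It uses the fact that ξ_i^{ind} is the log-odds of πi, so each equation can be inverted with the logistic function.

```python
import math


def sigmoid(x):
    """Inverse of the log-odds map p -> log(p / (1 - p))."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Natural parameter xi_ind = log(p / (1 - p)) of a binary marginal."""
    return math.log(p / (1.0 - p))


def vb_independent_approx(xi1, xi2, xi3, tol=1e-12, max_iter=10_000):
    """Iterate Equations (4) and (5) until the marginals stop changing.

    Returns (pi1, pi2), the marginals of the independent approximation.
    Starting values, tol and max_iter are illustrative assumptions.
    """
    pi1, pi2 = 0.5, 0.5
    for _ in range(max_iter):
        new_pi1 = sigmoid(xi1 + xi3 * pi2)      # Equation (4)
        new_pi2 = sigmoid(xi2 + xi3 * new_pi1)  # Equation (5)
        converged = abs(new_pi1 - pi1) < tol and abs(new_pi2 - pi2) < tol
        pi1, pi2 = new_pi1, new_pi2
        if converged:
            break
    return pi1, pi2
```

+When ξ3 = 0 the distribution already lies on the independence surface, and the iteration returns its own marginals after one pass.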
+Having seen the VB solution in this simple model, we can investigate the quality of the
+approximation. If we were using the forward KL projection, as proposed by [2], then the mean would
+be preserved by the approximation, while, of course, the variance structure would be distorted. In the case of
+using the reverse KL projection, as used by VB, the mean will not be preserved, but in this example,
+we can investigate the distortion explicitly. Let (ξ1(α), ξ2(α), ξ3(α)) be a +1-geodesic, which cuts
+the independence surface orthogonally and is parameterised by α, where α = 0 corresponds to the
+independence surface. In this example, all such geodesics can be computed explicitly. Figure 3 shows
+the distortion associated with the VB approximation. In the left-hand panel, we show the mean, which
+is the marginal probability, P(X1 = 1), for all points on the orthogonal geodesic. We see, as expected,
+that this is not constant, but it is locally constant at α = 0, showing that the distortion of the mean can
+be small near the independence surface. The right-hand panel shows the dependence, as measured
+by the log-odds, for points on the geodesic. As expected, the VB does not preserve the dependence
+structure; indeed, it is designed to exploit the simplification of the dependence structure.
+
+
+Figure 3. Distortion implied by variational Bayes (VB) approximation.
+
+3.2. Variational Bayes for the RSLN Model
+
+The regime-switching log-normal model [19] with a fixed finite number, K, of regimes can be
+described as a bivariate discrete time process with the observed data sequence w_{1:T} = {w_t}_{t=1}^T and
+the unobserved regime sequence S_{1:T} = {S_t}_{t=1}^T, where S_t ∈ {1, …, K} and T is the number of
+observations. The logarithm of w_t, denoted by y_t = log w_t, is assumed normally distributed, having
+mean μ_i and variance σ_i^2, both dependent on the hidden regime S_t. The sequence S_{1:T} is assumed
+to follow a first order Markov chain having transition probabilities A = (a_ij) with the probabilities
+π = (π_i)_{i=1}^K to start the first regime.
+The RSLN model is a special case of more general state-space models, which were studied in
+detail by [28]. In this paper, we use this model and simulated and real data to illustrate the VB method
+in practice. We also calibrate its performance by referring to [24], which used MCMC methods to fit the
+same model to the same data. Here, we are regarding the MCMC analysis as a form of “gold standard”,
+but with the cost of being orders-of-magnitude slower than VB in computational time.
+In the Bayesian framework, we use a symmetric Dirichlet prior for π, that is, p(π) =
+Dir(π; C^π/K, …, C^π/K), for C^π > 0. Let a_i denote the i-th row vector of A. The prior for A is
+chosen as p(A) = ∏_{i=1}^K p(a_i) = ∏_{i=1}^K Dir(a_i; C^A/K, …, C^A/K), for C^A > 0, and the prior distribution for
+{(μ_i, σ_i^2)}_{i=1}^K is chosen to be normal-inverse gamma, p({μ_i, σ_i^2}_{i=1}^K) = ∏_{i=1}^K N(μ_i | σ_i^2; γ, σ_i^2/η^2) IG(σ_i^2; α, β).
+In the above setting, C^π, C^A, γ, η^2, α and β are hyper-parameters. Thus, the joint posterior distribution
+of π, A, {μ_i, σ_i^2}_{i=1}^K and S_{1:T} is p(π, A, {μ_i, σ_i^2}_{i=1}^K, S_{1:T} | y_{1:T}) and is proportional to:
+
+$$p(S_1 \mid \pi) \prod_{t=1}^{T-1} p(S_{t+1} \mid S_t; A) \prod_{t=1}^{T} p(y_t \mid S_t; \{\mu_i, \sigma_i^2\}_{i=1}^K)\, p(\pi)\, p(A)\, p(\{\mu_i, \sigma_i^2\}_{i=1}^K). \tag{6}$$
+
+This posterior distribution and its corresponding marginal posterior distributions are analytically
+intractable. In VB, we seek an approximation of Equation (6), denoted by q(π, A, {μ_i, σ_i^2}_{i=1}^K, S_{1:T}),
+for which we want to balance two things: keeping the discrepancy between Equation (6) and q small,
+while keeping q tractable. In general, there are two ways to choose q. The first is to specify a particular
+distributional family for q, for example the multivariate normal distribution. The other is to choose
+q with a simpler dependency structure than that of Equation (6); for example, we choose q that
+factorizes as:
+
+$$q(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^K, S_{1:T}) = q(\pi) \prod_{i=1}^{K} q(a_i) \prod_{i=1}^{K} q(\mu_i \mid \sigma_i^2)\, q(\sigma_i^2)\, q(S_{1:T}). \tag{7}$$
+
+
+The Kullback–Leibler (KL) divergence [29] can be used as the measure of dissimilarity between
+Equations (6) and (7). For succinctness, we denote τ = (π, A, {μ_i, σ_i^2}_{i=1}^K, S_{1:T}); thus the KL divergence
+is defined as:
+
+$$\mathrm{KL}(q(\tau) \,\|\, p(\tau \mid y)) = \int q(\tau) \log\frac{q(\tau)}{p(\tau \mid y)}\,d\tau. \tag{8}$$
+
+Note that the evaluation of Equation (8) requires p(τ|y), which is unavailable. However, we note that:
+
+$$\mathrm{KL}(q(\tau) \,\|\, p(\tau \mid y)) = \log p(y) - \int q(\tau) \log\frac{p(\tau, y)}{q(\tau)}\,d\tau.$$
+
+Given the factorization Equation (7), this can be written as:
+
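+$$\mathrm{KL}(q(\tau) \,\|\, p(\tau \mid y)) = \log p(y) - \int \sum_{S_{1:T}} q(\pi)\, q(A) \prod_{i=1}^{K} q(\mu_i \mid \sigma_i^2)\, q(\sigma_i^2)\, q(S_{1:T}) \log\frac{p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^K, S_{1:T}, y_{1:T})}{q(\pi)\, q(A) \prod_{i=1}^{K} q(\mu_i \mid \sigma_i^2)\, q(\sigma_i^2)\, q(S_{1:T})}\, d\pi\, dA\, d\{\mu_i, \sigma_i^2\}_{i=1}^K.$$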
+
+Consider first the q(π) term. The right-hand side can be rearranged as:
+
+$$\mathrm{KL}\left( q(\pi) \,\middle\|\, \frac{\exp\left\{ \int \sum_{S_{1:T}} q(S_{1:T})\, q(A) \prod_{i=1}^{K} q(\mu_i \mid \sigma_i^2)\, q(\sigma_i^2) \log p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^K, S_{1:T}, y_{1:T})\, dA\, d\{\mu_i, \sigma_i^2\}_{i=1}^K \right\}}{Z_\pi} \right) + K_\pi, \tag{9}$$
+
+where:
+
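+$$K_\pi = \int \sum_{S_{1:T}} q(S_{1:T})\, q(A) \prod_{i=1}^{K} q(\mu_i \mid \sigma_i^2)\, q(\sigma_i^2) \log\left[ q(S_{1:T})\, q(A) \prod_{i=1}^{K} q(\mu_i \mid \sigma_i^2)\, q(\sigma_i^2) \right] dA\, d\{\mu_i, \sigma_i^2\}_{i=1}^K - \log Z_\pi + \log p(y),$$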
+
+and Z_π is a normalizing term. The first term of Equation (9) is the only term that depends on q(π).
+Thus, the minimum value of KL(q(τ) || p(τ|y)) is achieved when this term equals zero. Hence, we
+obtain:
+
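+$$q(\pi) = \frac{\exp\left\{ \int \sum_{S_{1:T}} q(S_{1:T})\, q(A) \prod_{i=1}^{K} q(\mu_i \mid \sigma_i^2)\, q(\sigma_i^2) \log p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^K, S_{1:T}, y_{1:T})\, dA\, d\{\mu_i, \sigma_i^2\}_{i=1}^K \right\}}{Z_\pi}. \tag{10}$$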
+
+Given the joint distribution p(π, A, {μ_i, σ_i^2}_{i=1}^K, S_{1:T}, y_{1:T}) in the form of Equation (6), the
+straightforward evaluation of Equation (10) results in:
+
+$$q(\pi) \propto \prod_{i=1}^{K} \pi_i^{C^\pi/K + w_i^s - 1} = \mathrm{Dir}(\pi; w_1^\pi, \ldots, w_K^\pi); \quad w_i^\pi = \frac{C^\pi}{K} + w_i^s, \quad w_i^s = E_{q(S_{1:T})}[S_{1,i}], \tag{11}$$
+
+where S_{1,i} = 1 if the process is in state i at time 1, and zero otherwise.
+
+
+Similarly, we can rearrange Equation (9) with respect to {q(a_i)}_{i=1}^K, {q(μ_i|σ_i^2)}_{i=1}^K, {q(σ_i^2)}_{i=1}^K and
+q(S_{1:T}), respectively, and using the same arguments, we obtain:
+
+$$q(A) = \prod_{i=1}^{K} \mathrm{Dir}(a_i; w_{i1}^A, \ldots, w_{iK}^A); \quad w_{ij}^A = \frac{C^A}{K} + v_{ij}^s, \tag{12}$$
+
+$$q(\mu_i \mid \sigma_i^2) = N\left(\gamma'_i, \frac{\sigma_i^2}{\kappa_i}\right), \quad \gamma'_i = \frac{\eta^2\gamma + p_i^s}{\eta^2 + q_i^s}, \quad \kappa_i = \eta^2 + q_i^s, \tag{13}$$
+
+$$q(\sigma_i^2) = \mathrm{IG}(\alpha'_i, \beta'_i), \quad \alpha'_i = \alpha + \frac{q_i^s}{2}, \quad \beta'_i = \beta + \frac{r_i^s}{2} + \frac{\eta^2}{2}(\gamma'_i - \gamma)^2, \tag{14}$$
+
+$$q(S_{1:T}) = \frac{\prod_{i=1}^{K} (\pi_i^*)^{S_{1,i}} \prod_{t=1}^{T-1}\prod_{i=1}^{K}\prod_{j=1}^{K} (a_{ij}^*)^{S_{t,i}S_{t+1,j}} \prod_{t=1}^{T}\prod_{i=1}^{K} (\theta_{i,t}^*)^{S_{t,i}}}{\tilde{Z}}, \tag{15}$$
+
+where S_{t,i} = 1 if the process is in state i at time t, and zero otherwise, and with
+
+$$\pi_i^* = \exp\{E_{q(\pi)}[\log \pi_i]\}, \quad a_{ij}^* = \exp\{E_{q(A)}[\log a_{ij}]\}, \quad \theta_{i,t}^* = \exp\{E_{q(\mu_i \mid \sigma_i^2) q(\sigma_i^2)}[\log \phi_i(y_t)]\},$$
+
+$$v_{ij}^s = \sum_{t=1}^{T-1} E_{q(S_{1:T})}[S_{t,i}S_{t+1,j}], \quad p_i^s = \sum_{t=1}^{T} E_{q(S_{1:T})}[S_{t,i}]\, y_t, \quad q_i^s = \sum_{t=1}^{T} E_{q(S_{1:T})}[S_{t,i}], \quad r_i^s = \sum_{t=1}^{T} (\gamma'_i - y_t)^2 E_{q(S_{1:T})}[S_{t,i}].$$
+
+Here, ψ is the digamma function, φ is the normal
+density function and the exact functional forms used in the updates are shown in Algorithm 1.
+
+Algorithm 1. Variational Bayes algorithm for the regime-switching log-normal (RSLN) model.
+
+Initialize w_i^{s(0)}, v_ij^{s(0)}, p_i^{s(0)}, q_i^{s(0)} and r_i^{s(0)} at step 0
+while w_i^{π(t−1)}, w_ij^{A(t−1)}, γ′_i^{(t−1)}, α′_i^{(t−1)}, β′_i^{(t−1)}, π*_i^{(t−1)}, a*_ij^{(t−1)} and θ*_{i,t}^{(t−1)} do not converge do
+
+1. Compute w_i^{π(t)}, w_ij^{A(t)}, γ′_i^{(t)}, κ_i^{(t)}, α′_i^{(t)} and β′_i^{(t)} at step t by:
+
+$$w_i^{\pi(t)} = \frac{C^\pi}{K} + w_i^{s(t-1)}, \quad w_{ij}^{A(t)} = \frac{C^A}{K} + v_{ij}^{s(t-1)}, \quad \gamma_i'^{(t)} = \frac{\eta^2\gamma + p_i^{s(t-1)}}{\eta^2 + q_i^{s(t-1)}},$$
+
+$$\kappa_i^{(t)} = \eta^2 + q_i^{s(t-1)}, \quad \alpha_i'^{(t)} = \alpha + \frac{q_i^{s(t-1)}}{2}, \quad \beta_i'^{(t)} = \beta + \frac{r_i^{s(t-1)}}{2} + \frac{\eta^2}{2}\left(\gamma_i'^{(t)} - \gamma\right)^2.$$
+
+2. Compute π*_i^{(t)}, θ*_{i,t}^{(t)} and a*_ij^{(t)} at step t by:
+
+$$\pi_i^{*(t)} = \exp\left\{\psi\left(w_i^{\pi(t)}\right) - \psi\left(\sum_{i} w_i^{\pi(t)}\right)\right\}, \quad a_{ij}^{*(t)} = \exp\left\{\psi\left(w_{ij}^{A(t)}\right) - \psi\left(\sum_{j} w_{ij}^{A(t)}\right)\right\},$$
+
+$$\theta_{i,t}^{*(t)} = \exp\left\{ -\frac{1}{2}\log 2\pi - \frac{1}{2}\left(\log \beta_i'^{(t)} - \psi\left(\alpha_i'^{(t)}\right)\right) - \frac{1}{2}\left[ \left(y_t - \gamma_i'^{(t)}\right)^2 \frac{\alpha_i'^{(t)}}{\beta_i'^{(t)}} + \frac{1}{\kappa_i^{(t)}} \right] \right\}.$$
+
+3. Compute w_i^{s(t)}, v_ij^{s(t)}, p_i^{s(t)}, q_i^{s(t)} and r_i^{s(t)} at step t by:
+
+$$w_i^{s(t)} = E_{q^{(t)}(S_{1:T})}[S_{1,i}], \quad v_{ij}^{s(t)} = \sum_{t=1}^{T-1} E_{q^{(t)}(S_{1:T})}[S_{t,i}S_{t+1,j}], \quad p_i^{s(t)} = \sum_{t=1}^{T} E_{q^{(t)}(S_{1:T})}[S_{t,i}]\, y_t,$$
+
+$$q_i^{s(t)} = \sum_{t=1}^{T} E_{q^{(t)}(S_{1:T})}[S_{t,i}], \quad r_i^{s(t)} = \sum_{t=1}^{T} \left(\gamma_i'^{(t)} - y_t\right)^2 E_{q^{(t)}(S_{1:T})}[S_{t,i}].$$
+
+t ⇐ t + 1
+end while
+
+The VB method proceeds, as was shown with the simple Equations (4) and (5), by iteratively
+updating the variational parameters to solve a set of simultaneous equations. In this example, the
+update equations for the variables π, A, {μ_i, σ_i^2}_{i=1}^K, S_{1:T} are given explicitly by Algorithm 1. For the
+initialisation, we choose symmetric values for most of the parameters and choose random values for
+others, as appropriate. For this example, this worked very satisfactorily, although we note that, for more
+general state space models, [28] states that finding good initial values can be non-trivial.
+
+3.3. Interpretation of Results
+
+First, all approximating distributions above turn out to lie in well-known parametric families.
+The only unknown quantities are the parameters of these distributions, which are often called the
+variational parameters.
+The evaluation of the parameters of q(π), q(A), q(μ_i|σ_i^2) and q(σ_i^2) requires knowledge of q(S_{1:T}),
+and also, the evaluation of π*_i, a*_ij and θ*_{i,t} requires knowledge of q(π), q(A), q(μ_i|σ_i^2) and q(σ_i^2). This
+structure leads to an iterative updating scheme, described in Algorithm 1.
+The main computational effort in Algorithm 1 is computing E_{q(S_{1:T})}[S_{t,i}] and E_{q(S_{1:T})}[S_{t,i} S_{t+1,j}],
+which have no simple tractable forms. We note that the distributional form of q(S_{1:T}) has a very
+similar structure to the conditional distribution p(S_{1:T}|Y_{1:T}, τ), for which the forward-backward
+algorithm [30] is commonly used to compute E_{p(S_{1:T}|Y_{1:T},τ)}[S_{t,i}|Y_{1:T}, τ] and E_{p(S_{1:T}|Y_{1:T},τ)}[S_{t,i} S_{t+1,j}|Y_{1:T}, τ].
+Therefore, we also use the forward-backward algorithm to compute E_{q(S_{1:T})}[S_{t,i}] and E_{q(S_{1:T})}[S_{t,i} S_{t+1,j}].
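+
+As an illustration of this step, the following is a minimal sketch of a scaled forward-backward pass applied to the weights in Equation (15). It is not the authors' implementation: the function name and the list-of-lists layout (init[i] for π*_i, trans[i][j] for a*_ij, emis[t][i] for θ*_{i,t}) are assumptions made here. The weights need not be normalised, because the scaling constants absorb the normaliser.

```python
def forward_backward(init, trans, emis):
    """Expected indicators under q(S_{1:T}) of the form in Equation (15).

    init[i]     : weight pi*_i of starting in state i
    trans[i][j] : weight a*_ij of moving from state i to state j
    emis[t][i]  : weight theta*_{i,t} of observation t under state i
    Returns (gamma, xi) with gamma[t][i] = E[S_{t,i}] and
    xi[t][i][j] = E[S_{t,i} S_{t+1,j}] for t = 0, ..., T-2.
    """
    T, K = len(emis), len(init)

    # Forward pass: alpha[t] is rescaled to sum to one at every step.
    alpha, scale = [], []
    cur = [init[i] * emis[0][i] for i in range(K)]
    for t in range(T):
        if t > 0:
            cur = [emis[t][j] * sum(alpha[t - 1][i] * trans[i][j] for i in range(K))
                   for j in range(K)]
        c = sum(cur)
        scale.append(c)
        alpha.append([a / c for a in cur])

    # Backward pass, reusing the forward scaling constants.
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(K):
            beta[t][i] = sum(trans[i][j] * emis[t + 1][j] * beta[t + 1][j]
                             for j in range(K)) / scale[t + 1]

    gamma = [[alpha[t][i] * beta[t][i] for i in range(K)] for t in range(T)]
    xi = [[[alpha[t][i] * trans[i][j] * emis[t + 1][j] * beta[t + 1][j] / scale[t + 1]
            for j in range(K)] for i in range(K)] for t in range(T - 1)]
    return gamma, xi
```

+Summing gamma[t][i] over t gives q_i^s, and summing xi[t][i][j] over t gives v_ij^s, which are the quantities needed in step 3 of Algorithm 1.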
+
+Since the conditional distribution q(μ_i|σ_i^2) is N(μ_i|σ_i^2; γ′_i, σ_i^2/κ_i), the marginal distribution of μ_i
+is the location-scale t distribution, denoted as t_{2α′_i}(μ_i; γ′_i, κ_i/(β′_i/α′_i)), where the density function of t_ν(x; μ, λ)
+is defined as
+
+$$p(x \mid \nu, \mu, \lambda) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left(1 + \frac{\lambda(x-\mu)^2}{\nu}\right)^{-\frac{\nu+1}{2}},$$
+
+for x, μ ∈ (−∞, +∞) and ν, λ > 0.
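+
+For readers who want to evaluate this density, a small sketch follows. The function names are hypothetical and are not taken from the paper. The second helper uses the standard result that a location-scale t distribution with ν > 2 degrees of freedom and precision λ has variance ν/((ν − 2)λ); applied to the t distributions reported for μ1 and μ2 in Table 5, it gives standard deviations close to the tabulated 0.00165 and 0.00889.

```python
import math


def t_density(x, nu, mu, lam):
    """Density of the location-scale t distribution t_nu(x; mu, lam).

    Computed on the log scale with lgamma to stay stable for large nu.
    """
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                + 0.5 * math.log(lam / (math.pi * nu)))
    log_kernel = -(nu + 1) / 2 * math.log1p(lam * (x - mu) ** 2 / nu)
    return math.exp(log_norm + log_kernel)


def t_sd(nu, lam):
    """Standard deviation of t_nu(x; mu, lam); requires nu > 2."""
    return math.sqrt(nu / ((nu - 2) * lam))


# Parameters of the marginal distributions of mu_1 and mu_2 in Table 5.
sd_mu1 = t_sd(454.61, 370778.19)
sd_mu2 = t_sd(80.39, 12987.55)
```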
+
+4. Numerical Studies
+
+4.1. Simulated Data
+
+In this section, we apply the VB solutions to four sets of simulated data, which were used in [24].
+Through these simulated studies, we will test the performance of VB on detecting the number of
+regimes and compare it with those of the BIC and the MCMC methods [24]. For this paper, we present
+only an initial study with a relatively small number of datasets. The results are highly promising,
+but more extensive studies are needed to draw comprehensive conclusions. Furthermore, see [28] for
+general results on VB in hidden state space models.
+To estimate the number of regimes, we construct a matrix, called the relative magnitude matrix
+(RMM), defined as A′ = (â′_ij), where â′_ij = w_ij^A / w_0^A, w_0^A = ∑_{i=1}^K ∑_{j=1}^K w_ij^A and w_ij^A is the parameter of q(A).
+Our model selection procedure is to fit a VB with a large number of regimes and to examine the rows
+and columns in the RMM. If the values of the entries in the i-th row and the i-th column of A′ are
+all equal to (C^A/K) / (T − 1 + C^A × K), then we will declare the regime i nonexistent. This method is validated by the
+following observations. It can be shown that the parameter v_ij^s in w_ij^A is equal to the number of times
+the process leaves regime i and enters regime j. Therefore, for the i-th regime, the values of zero for
+all of v_ji^s and v_ij^s with j = 1, …, K indicate that there is no transition process entering or leaving regime
+i.
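+
+The selection rule above can be sketched as follows. This is an illustrative reading of the rule, not the authors' code: the function names and the relative tolerance used to decide that an entry is "equal to" the baseline are assumptions, and the small example uses made-up transition counts for K = 3 and T = 11.

```python
import math


def relative_magnitude_matrix(w_A):
    """Divide every Dirichlet parameter w_ij^A by their grand total w_0^A."""
    total = sum(sum(row) for row in w_A)
    return [[w / total for w in row] for row in w_A]


def nonexistent_regimes(w_A, C_A, T, rel_tol=1e-6):
    """Indices i whose whole row and column of the RMM sit at the baseline.

    The baseline is (C_A / K) / (T - 1 + C_A * K); rel_tol is an
    illustrative threshold for treating an entry as equal to it.
    """
    K = len(w_A)
    baseline = (C_A / K) / (T - 1 + C_A * K)
    rmm = relative_magnitude_matrix(w_A)
    empty = []
    for i in range(K):
        entries = rmm[i] + [rmm[j][i] for j in range(K)]
        if all(math.isclose(a, baseline, rel_tol=rel_tol) for a in entries):
            empty.append(i)
    return empty


# Hypothetical example: expected transition counts v_ij^s that sum to T - 1 = 10,
# with no transitions into or out of the middle regime.
C_A, T = 0.3, 11
v = [[6.0, 0.0, 2.0],
     [0.0, 0.0, 0.0],
     [1.0, 0.0, 1.0]]
w_A = [[C_A / 3 + v_ij for v_ij in row] for row in v]
unused = nonexistent_regimes(w_A, C_A, T)
```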
+Table 1 specifies the parameters for the four cases, and we generate 671 observations for each
+case (equal to the number of months from January 1956 to September 2011). The parameters used in
+Case 1 are identical to the maximum likelihood estimates for TSX monthly return data from 1956 to
+1999 [19]. Case 2 only has one regime present. Case 3 is similar to Case 1, but the two regimes have the
+same mean. Case 4 adds a third regime. For each case, we use MLE to fit a one-regime, two-regime,
+three-regime and four-regime RSLN model and report the corresponding BIC and log-likelihood scores.
+We then misspecify the number of regimes and run a four-regime VB algorithm.
+
+
+Table 1. Parameters of the simulated data.
+
+Case   Regime 1 (μ1, σ1)   Regime 2 (μ2, σ2)   Regime 3 (μ3, σ3)   Transition probability matrix (rows)
+1      (0.012, 0.035)      (−0.016, 0.078)     -                   [0.963, 0.037]; [0.210, 0.790]
+2      (0.014, 0.050)      -                   -                   -
+3      (0.000, 0.035)      (0.000, 0.078)      -                   [0.963, 0.037]; [0.210, 0.790]
+4      (0.012, 0.035)      (−0.016, 0.078)     (0.04, 0.01)        [0.953, 0.037, 0.01]; [0.210, 0.780, 0.01]; [0.80, 0.190, 0.01]
+
+Table 2 shows the number of iterations that VB takes to converge in each case and the
+corresponding computational time (on a MacBook, 2 GHz processor). On average, VB converges after
+about a hundred iterations and takes about one minute. On the same computer, a 10^4-iteration reversible jump
+MCMC (RJMCMC) run takes about 10 h to finish. Using diagnostics, this seemed to be enough for
+convergence, while not being an “unfair” comparison in terms of time with VB. We can see that
+computational efficiency is a very attractive feature of the VB method. The results of the BIC with
+the log-likelihood (in parentheses), the relative magnitude matrices and the posterior probabilities
+for the models with different numbers of regimes estimated by MCMC (cited from Hartman and
+Heaton [24]) are given in Table 3. In Case 1, the BIC favors the two-regime model. The posterior
+probability estimated by MCMC for the one-regime model is the largest, but there is still a large
+probability for the two-regime model. Note that the prior specification for the number of regimes
+can affect these numbers and is always an issue with these forms of multidimensional MCMC. The
+relative magnitude matrix clearly shows that there are only two regimes whose â′_ij are not negligible.
+This implies VB removes excess transition and emission processes and discovers the exact number of
+hidden regimes. In Case 2 and Case 3, both VB and the BIC select the correct number of regimes,
+and the posterior probability for the one-regime model estimated by MCMC is still the largest. In Case
+4, VB does not detect the third regime. The transition probability to this regime is only 0.01, and the
+means and standard deviations of Regime 1 make the rare data from Regime 3 easily merged within
+the data from Regime 1. From Table 3, it is clear that for all of the cases, the log-likelihood always
+increases as the number of regimes increases.
+
+Table 2. Computational efficiency of VB.
+
+                          Case 1    Case 2    Case 3    Case 4
+Iterations to converge    62        182       132       94
+Computational time [s]    27.161    80.842    58.510    45.044
+
+
+Table 3. The estimated number of regimes by VB, BIC and MCMC. For each case, the four rows of the last column together form the 4 × 4 relative magnitude matrix of the four-regime VB fit.
+
+Case   No. of regimes   MLE: BIC (log-likelihood)   RJMCMC posterior probability   VB relative magnitude matrix (rows)
+1      1                1,108.875 (1,115.384)       0.647                          0.14357  0.00004  0.00004  0.03153
+       2                1,158.227 (1,174.499)       0.214                          0.00004  0.00004  0.00004  0.00004
+       3                1,156.370 (1,182.405)       0.088                          0.00004  0.00004  0.00004  0.00004
+       4                1,153.150 (1,188.948)       <0.052                         0.03018  0.00004  0.00004  0.79428
+
+2      1                1,045.448 (1,051.957)       0.864                          0.99944  0.00004  0.00004  0.00004
+       2                1,038.360 (1,054.632)       0.109                          0.00004  0.00004  0.00004  0.00004
+       3                1,030.733 (1,056.768)       0.020                          0.00004  0.00004  0.00004  0.00004
+       4                1,026.882 (1,062.680)       <0.006                         0.00004  0.00004  0.00004  0.00004
+
+3      1                1,110.903 (1,117.411)       0.629                          0.11322  0.00004  0.00004  0.02647
+       2                1,139.214 (1,155.486)       0.221                          0.00004  0.00004  0.00004  0.00004
+       3                1,131.904 (1,157.719)       0.098                          0.00004  0.00004  0.00004  0.00004
+       4                1,121.921 (1,157.940)       <0.052                         0.02659  0.00004  0.00004  0.83327
+
+4      1                1,044.819 (1,051.328)       0.641                          0.22643  0.00004  0.00004  0.05518
+       2                1,092.610 (1,108.881)       0.203                          0.00004  0.00004  0.00004  0.00004
+       3                1,087.435 (1,113.470)       0.094                          0.00004  0.00004  0.00004  0.00004
+       4                1,080.240 (1,116.038)       <0.06                          0.05377  0.00004  0.00004  0.66417
+
+4.2. Real Data
+
+In this section, we apply the VB solution to the TSX monthly total return index in the period from
+January 1956 to December 1999 (528 observations in total and studied in [19,21]).
+A four-regime VB is implemented first. VB converges after 100 iterations in about 34.284 s (on a
+MacBook, 2 GHz processor). The relative magnitude matrix, given in Table 4, clearly shows that VB
+identifies two regimes. This matches both the BIC- and AIC-based results [19]. Based on these results,
+we then fit a two-regime VB, which converges after 83 iterations in about 14.241 s. Table 5 gives the
+marginal distributions for all of the parameters. Figure 4 presents the corresponding density functions,
+where we can see that all of the plots show a symmetric and bell-shaped pattern.
+
+Table 4. Estimations of the number of regimes for TSX data.
+
+Relative magnitude matrix (R. M. M.), January 1956–December 1999:
+
+0.11496  0.00005  0.00005  0.02803
+0.00005  0.00005  0.00005  0.00005
+0.00005  0.00005  0.00005  0.00005
+0.02853  0.00005  0.00005  0.82791
+
+Table 5. The marginal distributions of the parameters estimated by VB.
+
+Parameter   Distribution                      Mean               s.d.
+μ1          t_{454.61}(0.0123, 370778.19)     0.0123             0.00165
+σ1^2        IG(227.30, 0.28)                  0.00122 (0.0349)   0.00008
+μ2          t_{80.39}(−0.0161, 12987.55)      −0.0161            0.00889
+σ2^2        IG(40.20, 0.24)                   0.00603 (0.0777)   0.00098
+p1,2        Beta(15.21, 434.78)               0.0338             0.00851
+p2,1        Beta(15.00, 61.21)                0.1969             0.04525
+
+Transition probability matrix (rows): [0.9662, 0.0338]; [0.1969, 0.8031]
+
+Figure 4. The VB marginal distributions of the parameters. (a) μ2 (left) and μ1 (right); (b) σ1^2 (left) and
+σ2^2 (right); (c) p1,2 (left) and p2,1 (right).
+
+Table 6 (the upper part) gives the maximum likelihood estimates (cited from [19]), mean
+parameters computed by the MCMC method (cited from [21]) and mean parameters computed
+by VB. It clearly shows that the point estimates by VB are very close to those by MLE and MCMC.
+The numbers in parentheses in Table 6 are the standard deviations computed by the three methods,
+respectively. It is worth noting that all of the standard deviations estimated by VB are smaller than those from the
+MLE or MCMC methods. In fact, other researchers also report the underestimation of posterior
+variance in other VB applications, for example [31,32]. In the paper [1], we look at some diagnostic
+methods that can assess how well the VB approximates the true posterior, particularly with regard to
+its covariance structure. The methods proposed also allow us to generate simple corrections when the
+approximation error is large.
+
+Table 6. Estimates and standard deviations (in parentheses) by VB, MLE and MCMC.
+
+Method   μ1                 σ1                 p1,2               μ2                  σ2                 p2,1
+VB       0.0123 (0.00165)   0.0349 (0.00008)   0.0338 (0.00851)   −0.0161 (0.00889)   0.0777 (0.00098)   0.1969 (0.04525)
+MLE      0.0123 (0.002)     0.0347 (0.001)     0.0371 (0.012)     −0.0157 (0.010)     0.0778 (0.009)     0.2101 (0.086)
+MCMC     0.0122 (0.002)     0.0351 (0.002)     0.0334 (0.012)     −0.0164 (0.010)     0.0804 (0.009)     0.2058 (0.065)
+
+5. Conclusions
+
+Variational Bayes can be thought of in terms of information geometry as a projection-based
+approximation technique; it provides a framework to approximate posteriors. We applied this method
+to the regime-switching log-normal model and provide solutions to account for both model uncertainty
+and parameter uncertainty. The numerical results show that our method can recover exactly the
+number of regimes and gives reasonable point estimates. The VB method is also demonstrated to be
+very computationally efficient.
+The application to the TSX monthly total return index data for the period from January 1956 to
+December 1999 confirms similar results in the literature on the number of regimes.
+
+Author Contributions
+
+The article was written by Hui Zhao under the guidance of Paul Marriott. All authors have read
+and approved the final manuscript.
+
+
+Conflicts of Interest
+
+The authors declare no conflict of interest.
+
+References
+
+1.
+Zhao, H.; Marriott, P. Diagnostics for variational bayes approximations. 2013, arXiv:1309.5117.
+2.
+Amari, S.-I. Differential-Geometrical Methods in Statistics; Springer: New York, NY, USA, 1990.
+3.
+Eguchi, S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat.
+1983, 11, 793–803.
+4.
+Kass, R.; Vos, P. Geometrical Foundations of Asymptotic Inference; Wiley: New York, NY, USA, 1997.
+5.
+Hinton, G.E.; van Camp, D. Keeping neural networks simple by minimizing the description length of the
+weights. In Proceedings of the 6th ACM Conference on Computational Learning Theory, Santa Cruz, CA,
+USA, 26–28 July 1993; ACM: New York, NY, USA, 1993.
+6.
+MacKay, D. Developments in Probabilistic Modelling with Neural Networks—Ensemble Learning. In Neural
+Networks: Artifical Intelligence and Industrial Applications; Springer: London, UK, 1995; pp. 191–198.
+7.
+Attias, H. Independent Factor Analysis. Neur. Comput. 1999, 11, 803–851.
+8.
+Lappalainen, H. Ensemble Learning For Independent Component Analysis. In Proceedings of the First
+International Workshop on Independent Component Analysis, Aussois, France, 11–15 January 1999; pp.
+7–12.
+9.
+Beal, M.; Ghahramani, Z. The variational Bayesian EM algorithm for incomplete data: With application to
+scoring graphical model structures. Bayesian Stat. 2003, 7, 453–463.
+10.
+Winn, J. Variational Message Passing and its Applications. Ph.D. Thesis, Department of Physics, University of
+Cambridge, Cambridge, UK, 2003.
+11.
+Blei, D.M.; Ng, A.Y.; Jordan, M.I.; Lafferty, J. Latent Dirichlet allocation. J. Mach. Learn. Res.
+2003, 3,
+993–1022.
+12.
+Ghahramani, Z.; Beal, M.J. A Variational Inference for Bayesian Mixtures of Factor Analysers. Adv. Neur. Inf.
+Process. Syst. 2000, 12, 449–455.
+13.
+Haff, L.R. The Variational Form of Certain Bayes Estimators. Ann. Stat. 1991, 19, 1163–1190.
+14.
+Faes, C.; Ormerod, J.T.; Wand, M.P. Variational Bayesian Inference for Parametric and Nonparametric
+Regression With Missing Data. J. Am. Stat. Assoc. 2011, 106, 959–971.
+15.
+McGrory, C.; Titterington, D.; Reeves, R.; Pettitt, A.N. Variational Bayes for estimating the parameters of a
+hidden Potts model. Stat. Comput. 2009, 19, 329–340.
+16.
+Ormerod, J.T.; Wand, M.P. Gaussian Variational Approximate Inference for Generalized Linear Mixed
+Models. J. Comput. Graph. Stat. 2011, 21, 1–16.
+17.
+Hall, P.; Humphreys, K.; Titterington, D.M. On the Adequacy of Variational Lower Bound Functions for
+Likelihood-Based Inference in Markovian Models with Missing Values. J. R. Stat. Soc. Ser. B 2002, 64,
+549–564.
+18.
+Wang, B.; Titterington, M. Convergence Properties of a general algorithm for calculating variational Bayesian
+estimates for a normal mixture model. Bayesian Anal. 2006, 1, 625–650.
+19.
+Hardy, M.R. A Regime-Switching Model of Long-Term Stock Returns. N. Am. Actuar. J. 2001, 5, 41–53.
+20.
+Hamilton, J.D. A New Approach to the Economic Analysis of Nonstationary Time Series and the Business
+Cycle. Econometrica 1989, 57, 357–384.
+21.
+Hardy, M.R. Bayesian Risk Management for Equity-Linked Insurance. Scand. Actuar. J. 2002, 2002, 185–211.
+22.
+Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
+23.
+Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
+24.
+Hartman, B.M.; Heaton, M.J. Accounting for regime and parameter uncertainty in regime-switching models.
+Insur. Math. Econ. 2011, 49, 429–437.
+25.
+Green, P.J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination.
+Biometrika 1995, 82, 711–732.
+26.
+Watanabe, S. Algebraic Geometry and Statistical Learning Theory; Cambridge University Press: Cambridge, UK, 2009.
+27.
+Brooks, S.P. Markov Chain Monte Carlo Method and Its Application. J. R. Stat. Soc. Ser. D 1998, 47, 69–100.
+28.
+Ghahramani, Z.; Hinton, G.E. Variational learning for switching state-space models. Neur. Comput. 1998, 12,
+831–864.
+29.
+Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
+30.
+Baum, L.E.; Petrie, T.; Soules, G.; Weiss, N. A maximization technique occurring in the statistical analysis of
+probabilistic functions of markov chains. Ann. Math. Stat. 1970, 41, 164–171.
+31.
+Rue, H.; Martino, S.; Chopin, N. Approximate Bayesian inference for latent Gaussian models by using
+integrated nested Laplace approximations. J. R. Stat. Soc. Ser. B 2009, 71, 319–392.
+32.
+Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
+
+c⃝ 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+entropy
+
+Article
+On Clustering Histograms with k-Means by Using
+Mixed α-Divergences
+
+Frank Nielsen 1,2,*, Richard Nock 3 and Shun-ichi Amari 4
+
+1 Sony Computer Science Laboratories, Inc, Tokyo 141-0022, Japan
+2 École Polytechnique, 91128 Palaiseau Cedex, France
+3 NICTA and The Australian National University, Locked Bag 9013, Alexandria NSW 1435, Australia
+4 RIKEN Brain Science Institute, 2-1 Hirosawa Wako City, Saitama 351-0198, Japan; E-Mail: amari@brain.riken.jp
+*
+E-Mail: Frank.Nielsen@acm.org; Tel.:+81-3-5448-4380.
+
+Received: 15 May 2014; in revised form: 10 June 2014 / Accepted: 13 June 2014 /
+Published: 17 June 2014
+
+Abstract: Clustering sets of histograms has become popular thanks to the success of the generic
+method of bag-of-X used in text categorization and in visual categorization applications. In this paper,
+we investigate the use of a parametric family of distortion measures, called the α-divergences, for
+clustering histograms. Since it usually makes sense to deal with symmetric divergences in information
+retrieval systems, we symmetrize the α-divergences using the concept of mixed divergences. First,
+we present a novel extension of k-means clustering to mixed divergences. Second, we extend the
+k-means++ seeding to mixed α-divergences and report a guaranteed probabilistic bound. Finally, we
+describe a soft clustering technique for mixed α-divergences.
+
+Keywords: bag-of-X; α-divergence; Jeffreys divergence; centroid; k-means clustering; k-means seeding
+
+1. Introduction: Motivation and Background
+
+1.1. Clustering Histograms in the Bag-of-Word Modeling Paradigm
+
+A common task of information retrieval (IR) systems is to classify documents into categories.
+Given a training set of documents labeled with categories, one asks to classify new incoming documents.
+Text categorisation [1,2] proceeds by first defining a dictionary of words from a corpus. It then
+models each document by a word count yielding a word distribution histogram per document (see
+the University of California, Irvine, UCI, machine learning repository for such data-sets [3]). The
+importance of the words in the dictionary can be weighted by the term frequency-inverse document
+frequency [2] (tf-idf), which takes into account both the frequency of the words in a given document and
+the frequency of the words in all documents: namely, the tf-idf weight for a given word in a
+given document is the product of the frequency of that word in the document times the logarithm of
+the ratio of the number of documents divided by the document frequency of the word [2]. Defining a
+proper distance between histograms allows one to:
+
+• Classify a new on-line document: We first calculate its word distribution histogram signature and
+seek the labeled document that has the most similar histogram, to deduce its category tag.
+• Find the initial set of categories: we cluster all document histograms and assign a category
+per cluster.
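+
+As a small worked illustration of the tf-idf weight described above (the numbers are hypothetical, and a
+natural logarithm is assumed, since the text does not fix the base): a word with relative frequency 0.05 in a
+document, occurring in 10 documents of a corpus of 1000 documents, would receive the weight
+
+```latex
+\[
+\mathrm{tfidf} = \mathrm{tf} \times \log\frac{N}{\mathrm{df}}
+  = 0.05 \times \log\frac{1000}{10}
+  = 0.05 \times \log 100
+  \approx 0.05 \times 4.605 \approx 0.230 .
+\]
+```
+
+Under the same assumptions, a word present in every document (df = N) gets weight zero, whatever its
+frequency in the document.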
+
+This text classification method based on the representation of the bag-of-words (BoWs) has also
+been instrumental in computer vision for efficient object categorization [4] and recognition in natural
+images [5]. This paradigm is called bag-of-features [6] (BoFs) in the general case. It first requires one
+to create a dictionary of “visual words” by quantizing keypoints (e.g., affine invariant descriptors of
+image patches) of the training database. Quantization is performed using the k-means [7–9] algorithm
+
+Entropy 2014, 16, 3273–3301; doi:10.3390/e16063273
+www.mdpi.com/journal/entropy
+
+that partitions n data X = {x1, ..., xn} into k pairwise disjoint clusters C1, ..., Ck, where each data
+element belongs to the closest cluster center (i.e., the cluster prototype). From a given initialization,
+batched k-means first assigns data points to their closest centers and then updates the cluster centers
+and reiterates this process until convergence is met to a local minimum (not necessarily the global
+minimum) after a provably finite number of steps. Csurka et al. [4] used the squared Euclidean
+distance for building the visual vocabulary. Depending on the chosen features, other distances
+have proven useful. For example, the symmetrized Kullback–Leibler (KL) divergence was shown to
+perform experimentally better than the Euclidean or squared Euclidean distances for a compressed
+histogram of gradient descriptors [10] (CHoGs), even if it is not a metric distance, since it fails to
+satisfy the triangular inequality. To summarize, k-means histogram clustering with respect to the
+symmetrized KL (called Jeffreys divergence J) can be used to quantize both visual words and document
+categories. Nowadays, the seminal bag-of-word method has been generalized fruitfully to various
+settings using the generic bag-of-X paradigm, like the bag-of-textons [6], the bag-of-readers [11], etc.
+Bag-of-X represents each data item (e.g., document, image, etc.) as a histogram of codeword count indices.
+Furthermore, the semantic space [12] paradigm has been recently explored to overcome two drawbacks
+of the bag-of-X paradigms: the high-dimensionality of the histograms (number of bins) and difficult
+human interpretation of the codewords due to the lack of semantic information. In semantic space,
+modeling relies on semantic multinomials that are discrete frequency histograms; see [12].
+In summary, clustering histograms with respect to symmetric distances (like the symmetrized KL
+divergence) is playing an increasing role. It turns out that the symmetrized KL divergence belongs to a
+1-parameter family of divergences, called symmetrized α-divergences, or Jeffreys α-divergence [13].
+
+1.2. Contributions
+
+Since divergences D(p : q) are usually asymmetric distortion measures between two objects
+p and q, one often has to consider two kinds of centroids obtained by carrying out the minimization
+process either on the left argument or on the right argument of the divergences; see [14]. In theory,
+it is enough to consider only one type of centroid, say the right centroid, since the left centroid with
+respect to a divergence D(p : q) is equivalent to the right centroid with respect to the mirror divergence
+D′(p : q) = D(q : p).
+In this paper, we consider mixed divergences [15] that allow one to handle in a unified way the
+arithmetic symmetrization $S(p, q) = \frac{1}{2}(D(p : q) + D(q : p))$ of a given divergence D(p : q) with both
+the sided divergences: D(p : q) and its mirror divergence D′(p : q). The mixed α-divergence is the
+mixed divergence obtained for the α-divergence. We term α-clustering the clustering with respect
+to α-divergences and mixed α-clustering the clustering w.r.t. mixed α-divergences [16]. Our main
+contributions are to extend the celebrated batched k-means [7–9] algorithm to mixed divergences
+by associating two dual centroids per cluster and to generalize the probabilistically guaranteed
+good seeding of k-means++ [17] to mixed α-divergences. The mixed α-seedings provide guaranteed
+probabilistic clustering bounds by picking seeds from the data and do not require explicitly
+computing centroids. This yields a fast clustering technique in practice, even when cluster
+centers are not available in closed form. We also consider clustering histograms by explicitly building
+the symmetrized α-centroids and end up with a variational k-means when the centroids are not
+available in closed form. Finally, we investigate soft mixed α-clustering and discuss topics related to
+α-clustering. Note that clustering with respect to non-symmetrized α-divergences has been recently
+investigated independently in [18] and proven useful in several applications.
+
+1.3. Outline of the Paper
+
+The paper is organized as follows: Section 2 introduces the notion of mixed divergences, presents
+an extension of k-means to mixed divergences and recalls some properties of α-divergences. Section 3
+describes the α-seeding techniques and reports a probabilistically-guaranteed bound on the clustering
+quality. Section 4 investigates the various sided/symmetrized/mixed calculations of the α-centroids.
+Section 5 presents the soft α-clustering with respect to α-mixed divergences. Finally, Section 6
+summarises the contributions, discusses related topics and hints at further perspectives. The paper is
+followed by two appendices. Appendix B studies several properties of α-divergences that are used to
+derive the guaranteed probabilistic performance of the α-seeding. Appendix C proves that α-sided
+centroids are quasi-arithmetic means for the power generator functions.
+
+2. Mixed Centroid-Based k-Means Clustering
+
+2.1. Divergences, Centroids and k-Means
+
+Consider a set H of n histograms $h_1, \ldots, h_n$, each with d bins, with all positive real-valued bins:
+$h_j^i > 0$, ∀ 1 ≤ i ≤ d, 1 ≤ j ≤ n. A histogram h is called a frequency histogram when its bins sum up
+to one: $w(h) = w_h = \sum_i h^i = 1$. Otherwise, it is called a positive histogram that can eventually be
+normalized to a frequency histogram:
+
+\[ \tilde{h} \doteq \frac{h}{w(h)}. \tag{1} \]
+
+The frequency histograms belong to the (d − 1)-dimensional open probability simplex $\Delta_d$:
+
+\[ \Delta_d \doteq \left\{ (x^1, \ldots, x^d) \in \mathbb{R}^d \;\middle|\; \forall i,\ x^i > 0, \text{ and } \sum_{i=1}^d x^i = 1 \right\}. \tag{2} \]
+
+That is, although frequency histograms have d bins, the constraint that those bin values should
+sum up to one yields d − 1 degrees of freedom. In probability theory, frequency or count
+histograms model either discrete multinomial probabilities or discrete positive measures (also called
+positive arrays [19]).
+The celebrated k-means clustering [8,9] is one of the most famous clustering techniques that has
+been generalized in many ways [20,21]. In information geometry [22], a divergence D(p : q) is a
+smooth ($C^3$ differentiable) dissimilarity measure that is not necessarily symmetric (D(p : q) ≠ D(q : p),
+hence the notation “:” instead of the classical “,” reserved for metric distances), but is non-negative and
+satisfies the separability property: D(p : q) = 0 iff p = q. More precisely, let
+$\partial_i D(x : y) = \frac{\partial}{\partial x^i} D(x : y)$ and $\partial_{,i} D(x : y) = \frac{\partial}{\partial y^i} D(x : y)$.
+Then, we require $\partial_i D(x : x) = \partial_{,i} D(x : x) = 0$ and $-\partial_i \partial_{,j} D(x : y)$ positive
+definite for defining a divergence. For a distance function D(· : ·), we denote by D(x : H) the weighted
+average distance of x to a set of weighted histograms:
+
+\[ D(x : H) \doteq \sum_{j=1}^n w_j D(x : h_j). \tag{3} \]
+
+An important class of divergences on frequency histograms is the f-divergences [23–25] defined for a
+convex generator f (with f(1) = f′(1) = 0 and f″(1) = 1):
+
+\[ I_f(p : q) \doteq \sum_{i=1}^d q^i f\!\left( \frac{p^i}{q^i} \right). \]
+Those divergences preserve information monotonicity [19] under any arbitrary transition probability
+(Markov morphisms). f-divergences can be extended to positive arrays [19].
+The k-means algorithm on a set of weighted histograms can be tailored to any divergence as
+follows: First, we initialize the k cluster centers C = {c1, ..., ck} (say, by picking up randomly arbitrary
+distinct seeds). Then, we iteratively repeat until convergence the following two steps:
+
+• Assignment: Assign each histogram $h_j$ to its closest cluster center:
+
+\[ l(h_j) \doteq \arg\min_{l=1}^{k} D(h_j : c_l). \]
+
+This yields a partition of the histogram set $H = \cup_{l=1}^k A_l$, where $A_l$ denotes the set of histograms
+of the l-th cluster: $A_l = \{h_j \mid l(h_j) = l\}$.
+• Center relocation: Update the cluster centers by taking their centroids:
+
+\[ c_l \doteq \arg\min_x \sum_{h_j \in A_l} w_j D(h_j : x). \]
+
+Throughout this paper, centroid shall be understood in the broader sense of a barycenter when
+weights are non-uniform.
+
+2.2. Mixed Divergences and Mixed k-Means Clustering
+
+Since divergences are potentially asymmetric, we can define two-sided k-means or always consider
+a right-sided k-means, but then define another sided divergence D′(p : q) = D(q : p). We can
+also consider the symmetrized k-means with respect to the symmetrized divergence: S(p, q) =
+D(p : q) + D(q : p). Eventually, we may skew the symmetrization with a parameter λ ∈ [0, 1]:
+Sλ(p, q) = λD(p : q) + (1 − λ)D(q : p) (and consider other averaging schemes instead of the arithmetic
+mean).
+In order to handle those sided and symmetrized k-means under the same framework, let us
+introduce the notion of mixed divergences [15] as follows:
+
+Definition 1 (Mixed divergence).
+
+\[ M_\lambda(p : q : r) \doteq \lambda D(p : q) + (1 - \lambda) D(q : r), \tag{4} \]
+
+for λ ∈ [0, 1].
+
+A mixed divergence includes the sided divergences for λ ∈ {0, 1} and the symmetrized (arithmetic
+mean) divergence for λ = 1/2.
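+
+Written out from Equation (4), these special cases read as follows (a sketch for orientation only; taking
+r = p in the last line is our choice, made to expose the arithmetic symmetrization):
+
+```latex
+\begin{align*}
+M_1(p : q : r) &= D(p : q), \\
+M_0(p : q : r) &= D(q : r), \\
+M_{1/2}(p : q : p) &= \tfrac{1}{2}\bigl( D(p : q) + D(q : p) \bigr).
+\end{align*}
+```
+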
+We generalize k-means clustering to mixed k-means clustering [15] by considering two centers
+per cluster (for the special cases of λ = 0, 1, it is enough to consider only one). Algorithm 1 sketches
+the generic mixed k-means algorithm. Note that a simple initialization consists of choosing randomly
+the k distinct seeds from the dataset with li = ri.
+
+Algorithm 1: Mixed divergence-based k-means clustering.
+
+Input: Weighted histogram set H, divergence D(· : ·), integer k > 0, real λ ∈ [0, 1];
+Initialize left-sided/right-sided seeds $C = \{(l_i, r_i)\}_{i=1}^k$;
+repeat
+    // Assignment
+    for i = 1, 2, ..., k do
+        $C_i \leftarrow \{h \in H : i = \arg\min_j M_\lambda(l_j : h : r_j)\}$;
+    // Dual-sided centroid relocation
+    for i = 1, 2, ..., k do
+        $r_i \leftarrow \arg\min_x D(C_i : x) = \sum_{h \in C_i} w_h D(h : x)$;
+        $l_i \leftarrow \arg\min_x D(x : C_i) = \sum_{h \in C_i} w_h D(x : h)$;
+until convergence;
+Output: Partition of H into k clusters following C;
+
+Notice that the mixed k-means clustering is different from the k-means clustering with respect to
+the symmetrized divergences Sλ that considers only one centroid per cluster.
+
+2.3. Sided, Symmetrized and Mixed α-Divergences
+
+For α ≠ ±1, we define the family of α-divergences [26] on positive arrays [27] as:
+
+\[ D_\alpha(p : q) \doteq \sum_{i=1}^d \frac{4}{1 - \alpha^2} \left( \frac{1 - \alpha}{2} p^i + \frac{1 + \alpha}{2} q^i - (p^i)^{\frac{1-\alpha}{2}} (q^i)^{\frac{1+\alpha}{2}} \right) = D_{-\alpha}(q : p), \quad \alpha \in \mathbb{R} \setminus \{-1, 1\}, \tag{5} \]
+
+with the limit cases $D_{-1}(p : q) = \mathrm{KL}(p : q)$ and $D_1(p : q) = \mathrm{KL}(q : p)$, where KL is the extended
+Kullback–Leibler divergence:
+
+\[ \mathrm{KL}(p : q) \doteq \sum_{i=1}^d \left( p^i \log \frac{p^i}{q^i} + q^i - p^i \right). \tag{6} \]
+
+Divergence $D_0$ is the squared Hellinger symmetric distance (scaled by a multiplicative factor of
+four) extended to positive arrays:
+
+\[ D_0(p : q) = 2 \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 \mathrm{d}x = 4 H^2(p, q), \tag{7} \]
+
+with the Hellinger distance:
+
+\[ H(p, q) = \sqrt{\frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 \mathrm{d}x}. \tag{8} \]
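+
+As a quick numerical check of Equations (5) and (7), consider two hypothetical positive arrays with d = 2,
+p = (1, 4) and q = (4, 1), and read the integral of Equation (7) as a sum over bins (both are assumptions
+made for illustration). The two expressions then give the same value:
+
+```latex
+\begin{align*}
+D_0(p : q) &= 4 \sum_{i=1}^{2} \left( \tfrac{1}{2} p^i + \tfrac{1}{2} q^i - \sqrt{p^i q^i} \right)
+            = 4 \left[ (0.5 + 2 - 2) + (2 + 0.5 - 2) \right] = 4, \\
+D_0(p : q) &= 2 \sum_{i=1}^{2} \left( \sqrt{p^i} - \sqrt{q^i} \right)^2
+            = 2 \left[ (1 - 2)^2 + (2 - 1)^2 \right] = 4 .
+\end{align*}
+```
+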
+
+Note that α-divergences are defined for the full range of α values: α ∈ R. Observe that
+α-divergences of Equation (5) are homogeneous of degree one: $D_\alpha(\lambda p : \lambda q) = \lambda D_\alpha(p : q)$ for
+λ > 0.
+When histograms p and q are both frequency histograms, we have:
+
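+\[ D_\alpha(\tilde{p} : \tilde{q}) = \frac{4}{1 - \alpha^2} \left( 1 - \sum_{i=1}^d (\tilde{p}^i)^{\frac{1-\alpha}{2}} (\tilde{q}^i)^{\frac{1+\alpha}{2}} \right) = D_{-\alpha}(\tilde{q} : \tilde{p}), \quad \alpha \in \mathbb{R} \setminus \{-1, 1\}, \tag{9} \]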
+
+and the extended Kullback–Leibler divergence reduces to the traditional Kullback–Leibler
+divergence: $\mathrm{KL}(\tilde{p} : \tilde{q}) = \sum_{i=1}^d \tilde{p}^i \log \frac{\tilde{p}^i}{\tilde{q}^i}$.
+The Kullback–Leibler divergence between frequency histograms $\tilde{p}$ and $\tilde{q}$ (α = ±1) is interpreted
+as the cross-entropy minus the Shannon entropy:
+
+\[ \mathrm{KL}(\tilde{p} : \tilde{q}) \doteq H^\times(\tilde{p} : \tilde{q}) - H(\tilde{p}). \]
+
+Often, $\tilde{p}$ denotes the true model (hidden by nature), and $\tilde{q}$ is the estimated model from observations.
+However, in information retrieval, both $\tilde{p}$ and $\tilde{q}$ play the same symmetrical role, and we prefer to deal
+with a symmetric divergence.
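+The Pearson and Neyman χ² distances are obtained for α = −3 and α = 3, respectively:
+
+\[ D_3(\tilde{p} : \tilde{q}) = \frac{1}{2} \sum_i \frac{(\tilde{q}^i - \tilde{p}^i)^2}{\tilde{p}^i}, \tag{10} \]
+
+\[ D_{-3}(\tilde{p} : \tilde{q}) = \frac{1}{2} \sum_i \frac{(\tilde{q}^i - \tilde{p}^i)^2}{\tilde{q}^i}. \tag{11} \]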
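+
+For the reader's convenience, Equation (10) can be recovered from Equation (9) with α = 3 as sketched
+below (our own step-by-step check, not part of the original text); it only uses the fact that frequency
+histograms sum to one:
+
+```latex
+\begin{align*}
+D_3(\tilde{p} : \tilde{q})
+  &= \frac{4}{1 - 9} \left( 1 - \sum_i (\tilde{p}^i)^{-1} (\tilde{q}^i)^{2} \right)
+   = \frac{1}{2} \left( \sum_i \frac{(\tilde{q}^i)^2}{\tilde{p}^i} - 1 \right), \\
+\sum_i \frac{(\tilde{q}^i - \tilde{p}^i)^2}{\tilde{p}^i}
+  &= \sum_i \frac{(\tilde{q}^i)^2}{\tilde{p}^i} - 2 \sum_i \tilde{q}^i + \sum_i \tilde{p}^i
+   = \sum_i \frac{(\tilde{q}^i)^2}{\tilde{p}^i} - 1 .
+\end{align*}
+```
+
+Combining the two lines gives Equation (10); Equation (11) follows in the same way with the roles of the
+two histograms exchanged.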
+
+The α-divergences belong to the class of Csiszár f-divergences with the following generator:
+
+\[ f(t) = \begin{cases} \frac{4}{1 - \alpha^2} \left( 1 - t^{(1+\alpha)/2} \right), & \text{if } \alpha \neq \pm 1, \\ t \ln t, & \text{if } \alpha = 1, \\ -\ln t, & \text{if } \alpha = -1. \end{cases} \tag{12} \]
+
+Remark 1. Historically, the α-divergences have been introduced by Chernoff [28,29] in the context of hypothesis
+testing. In Bayesian binary hypothesis testing, we are asked to decide whether an observation belongs to one
+class or the other class, based on priors $w_1$ and $w_2$ and class-conditional probabilities $p_1$ and $p_2$. The average
+expected error of the best decision rule, the maximum a posteriori (MAP) rule, is called the probability of error, denoted by
+$P_e$. When prior probabilities are identical ($w_1 = w_2 = \frac{1}{2}$), we have $P_e(p_1, p_2) = \frac{1}{2} \int \min(p_1(x), p_2(x)) \mathrm{d}x$.
+Let $S(p, q) = \int \min(p(x), q(x)) \mathrm{d}x$ denote the intersection similarity measure, with 0 < S ≤ 1 (generalizing
+the histogram intersection distance often used in computer vision [30]). S is bounded by the α-Chernoff affinity
+coefficient:
+
+\[ S(p, q) \leq C_\beta(p, q) = \int p^\beta(x) q^{1-\beta}(x) \mathrm{d}x, \]
+
+for all β ∈ [0, 1]. We can convert the affinity coefficient $0 < C_\beta \leq 1$ into a divergence $D_\beta$ by simply taking
+$D_\beta = 1 - C_\beta$. Since the absolute value of divergences does not matter, we can rescale appropriately the
+divergence. One nice rescaling is by multiplying by $\frac{1}{\beta(1-\beta)}$: $D_\beta = \frac{1}{\beta(1-\beta)}(1 - C_\beta)$. This makes the
+parameterized divergence coincide with the fundamental Kullback–Leibler divergence for the limit values β ∈ {0, 1}.
+Last, choosing $\beta = \frac{1-\alpha}{2}$ yields the well-known expression of the α-divergences.
+
+Interestingly, the α-divergences can be interpreted as a generalized α-Kullback–Leibler
+divergence [26] with deformed logarithms.
+Next, we introduce the mixed α-divergence of a histogram x to two histograms p and q as follows:
+
+Definition 2 (Mixed α-divergence). The mixed α-divergence of a histogram x to two histograms p and q is
+defined by:
+
+\begin{align}
+M_{\lambda,\alpha}(p : x : q) &= \lambda D_\alpha(p : x) + (1 - \lambda) D_\alpha(x : q), \nonumber \\
+&= \lambda D_{-\alpha}(x : p) + (1 - \lambda) D_{-\alpha}(q : x), \nonumber \\
+&= M_{1-\lambda,-\alpha}(q : x : p). \tag{13}
+\end{align}
+
+The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:
+
+\[ S_\alpha(p, q) = M_{\frac{1}{2},\alpha}(q : p : q) = M_{\frac{1}{2},\alpha}(p : q : p). \]
+
+The skew symmetrized α-divergence is defined by:
+
+\[ S_{\lambda,\alpha}(p : q) = \lambda D_\alpha(p : q) + (1 - \lambda) D_\alpha(q : p). \]
+
+2.4. Notations and Hard/Soft Clusterings
+
+Throughout the paper, superscript index i denotes the histogram bin numbers and subscript
+index j the histogram numbers. Index l is used to iterate on the clusters. The left-sided, right-sided
+and symmetrized histogram positive and frequency α-centroids are denoted by $l_\alpha, r_\alpha, s_\alpha$ and $\tilde{l}_\alpha, \tilde{r}_\alpha, \tilde{s}_\alpha$,
+respectively.
+In this paper, we investigate the following kinds of clusterings for sets of histograms:
+
+Hard clustering. Each histogram belongs to exactly one cluster:
+
+• k-means with respect to mixed divergences Mλ,α.
+• k-means with respect to symmetrized divergences Sλ,α.
+• Randomized seeding for mixed/symmetrized k-means by extending k-means++ with
+guaranteed probabilistic bounds for α-divergences.
+
+Soft clustering. Each histogram belongs to all clusters according to some weight distribution:
+the soft mixed α-clustering.
+
+3. Coupled k-Means++ α-Seeding
+
+It is well-known that the Lloyd k-means clustering algorithm monotonically decreases the loss
+function and stops after a finite number of iterations at a local optimum. Optimizing globally
+the k-means loss is NP-hard [17] when d > 1 and k > 1. In practice, the performance of the
+k-means algorithm heavily relies on the initialization. A breakthrough was obtained by the k-means++
+seeding [17], which guarantees in expectation a good starting partition. We extend this scheme to
+the coupled α-clustering. However, although k-means++ has proven popular and is
+often used in practice with very good results, it has been recently pointed out that “worst case”
+configurations exist, even in small dimensions, on which the algorithm cannot significantly beat
+its expected approximability with a high probability [31]. Still, the expected approximability ratio,
+roughly in O(log(k)), is very good, as long as the number of clusters is not too large.
+
+Algorithm 2: Mixed α-seeding; MAS(H, k, λ, α)
+
+Input: Weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ R;
+Let $C \leftarrow \{(h_j, h_j)\}$, with $h_j$ picked with uniform probability;
+for i = 2, 3, ..., k do
+    Pick at random histogram h ∈ H with probability:
+
+    \[ \pi_H(h) \doteq \frac{w_h M_{\lambda,\alpha}(c_h : h : c_h)}{\sum_{y \in H} w_y M_{\lambda,\alpha}(c_y : y : c_y)}, \tag{14} \]
+
+    // where $(c_h, c_h) \doteq \arg\min_{(z,z) \in C} M_{\lambda,\alpha}(z : h : z)$;
+    C ← C ∪ {(h, h)};
+Output: Set of initial cluster centers C;
+
+Algorithm 2 provides our adaptation of k-means++ seeding [15,17]. It works for all three of our
+sided/symmetrized and mixed clustering settings:
+
+• Pick λ = 1 for the left-sided centroid initialization,
+• Pick λ = 0 for the right-sided centroid initialization (a left-sided initialization for −α),
+• with arbitrary λ, for the λ-Jα (skew Jeffreys) centroids or mixed λ centroids. Indeed, the
+initialization is the same (see the MAS procedure in Algorithm 2).
+
+Our proof follows and generalizes the proof described for the case of mixed Bregman seeding [15]
+(Lemma 2). In fact, our proof is more precise, as it quantifies the expected potential with respect to the
+optimum only, whereas in [15], the optimal potential is averaged with a dual optimal potential, which
+depends on the optimal centers, but may be larger than the optimum sought.
+
+Theorem 1. Let $C_{\lambda,\alpha}$ denote for short the cost function related to the clustering type chosen (left-, right-, skew
+Jeffreys or mixed) in MAS and $C^{\mathrm{opt}}_{\lambda,\alpha}$ denote the optimal related clustering in k clusters, for λ
+∈ [0, 1] and α ∈ (−1, 1). Then, on average, with respect to distribution (14), the initial clustering of
+MAS satisfies:
+
+\[ E_\pi[C_{\lambda,\alpha}] \leq 4 \begin{cases} f(\lambda) g(k) h^2(\alpha) C^{\mathrm{opt}}_{\lambda,\alpha} & \text{if } \lambda \in (0, 1), \\ g(k) z(\alpha) h^4(\alpha) C^{\mathrm{opt}}_{\lambda,\alpha} & \text{otherwise.} \end{cases} \tag{15} \]
+
+Here, $f(\lambda) = \max\left\{ \frac{1-\lambda}{\lambda}, \frac{\lambda}{1-\lambda} \right\}$, $g(k) = 2(2 + \log k)$, $z(\alpha) = \left( \frac{1+|\alpha|}{1-|\alpha|} \right)^{\frac{8|\alpha|^2}{(1-|\alpha|)^2}}$, $h(\alpha) = \max_i p_i^{|\alpha|} / \min_i p_i^{|\alpha|}$;
+the min is defined on strictly positive coordinates, and π denotes the picking distribution of Algorithm 2.
+
+Remark 2. The bound is particularly good when λ is close to 1/2, and in particular for the α-Jeffreys clustering,
+as in these cases, the only additional penalty compared to the Euclidean case [17] is h²(α), a penalty that relies
+on an optimal triangle inequality for α-divergences that we provide in Lemma A6 below.
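+
+To give a feel for the constants in Equation (15), here is a hedged numerical illustration for λ = 1/2 and
+k = 8 (values chosen by us; the logarithm is assumed to be natural, which the text does not state):
+
+```latex
+\begin{align*}
+f(\tfrac{1}{2}) &= \max\{1, 1\} = 1, \qquad
+g(8) = 2 (2 + \log 8) \approx 2 (2 + 2.079) \approx 8.159, \\
+E_\pi[C_{1/2,\alpha}] &\leq 4 \, f(\tfrac{1}{2}) \, g(8) \, h^2(\alpha) \, C^{\mathrm{opt}}_{1/2,\alpha}
+  \approx 32.64 \, h^2(\alpha) \, C^{\mathrm{opt}}_{1/2,\alpha} .
+\end{align*}
+```
+
+Under these assumptions, the factor 4 g(k) = 8(2 + log k) is the same function of k for every λ ∈ (0, 1);
+only f(λ) and h²(α) change with the divergence parameters.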
+
+Remark 3. This guaranteed initialization is particularly useful for α-Jeffreys clustering, as there is no closed
+form solution for the centroids (except when α = ±1, see [32]).
+
+Algorithm 3: Mixed α-hard clustering: MAhC(H, k, λ, α)
+
+Input: Weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ R;
+Let $C = \{(l_i, r_i)\}_{i=1}^k \leftarrow$ MAS(H, k, λ, α);
+repeat
+    // Assignment
+    for i = 1, 2, ..., k do
+        $A_i \leftarrow \{h \in H : i = \arg\min_j M_{\lambda,\alpha}(l_j : h : r_j)\}$;
+    // Centroid relocation
+    for i = 1, 2, ..., k do
+        $r_i \leftarrow \left( \sum_{h \in A_i} w_h h^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}}$;
+        $l_i \leftarrow \left( \sum_{h \in A_i} w_h h^{\frac{1+\alpha}{2}} \right)^{\frac{2}{1+\alpha}}$;
+until convergence;
+Output: Partition of H in k clusters following C;
+
+Algorithm 3 presents the general hard mixed k-means clustering, which can be adapted also to
+left- (λ = 1) and right- (λ = 0) α-clustering.
+For skew Jeffreys centers, since the centroids are not available in closed form [32], we adopt a
+variational approach of k-means by updating iteratively the centroid in each cluster (thus improving
+the overall loss function without computing the optimal centroids that would eventually require
+infinitely many iterations).
+
+4. Sided, Symmetrized and Mixed α-Centroids
+
+The k-means clustering requires assigning data elements to their closest cluster center and
+then updating those cluster centers by taking their centroids. This section investigates the centroid
+computations for the sided, symmetrized and mixed α-divergences.
+Note that the mixed α-seeding presented in Section 3 does not require computing centroids and,
+yet, guarantees probabilistically a good clustering partition.
+Since mixed α-divergences are f-divergences, we start with the generic f-centroids.
+
+4.1. Csiszár f-Centroids
+
+The centroids induced by f-divergences of a set of positive measures (that relaxes the
+normalisation constraint) have been studied by Ben-Tal et al. [33]. Those entropic centroids are
+shown to be unique, since f-divergences are convex statistical distances in both arguments. Let $E_f$
+denote the energy to minimize when considering f-divergences:
+
+\begin{align}
+E_f &\doteq \min_{x \in X} I_f(H : x) = \min_{x \in X} \sum_{j=1}^n w_j I_f(h_j : x), \tag{16} \\
+&= \min_{x \in X} \sum_{j=1}^n w_j \sum_{i=1}^d h_j^i f\!\left( \frac{x^i}{h_j^i} \right). \tag{17}
+\end{align}
+
+When the domain is the open probability simplex $X = \Delta_d$, we get a constrained optimisation
+problem to solve. We transform this constrained minimisation problem (i.e., $x \in \Delta_d$) into an equivalent
+unconstrained minimisation problem by using the Lagrange multiplier, γ:
+
+\[ \min_{x \in \mathbb{R}^d} \sum_{j=1}^n w_j I_f(h_j : x) + \gamma \left( \sum_{i=1}^d x^i - 1 \right). \tag{18} \]
+
+Taking the derivatives according to $x^i$, we get:
+
+\[ \forall i \in \{1, \ldots, d\}, \quad \sum_{j=1}^n w_j f'\!\left( \frac{x^i}{h_j^i} \right) + \gamma = 0. \tag{19} \]
+
+We now consider this equation for α-divergences and symmetrized α-divergences, both
+f-divergences.
+
+4.2. Sided Positive and Frequency α-Centroids
+
+The positive sided α-centroids for a set of weighted histograms were reported in [34] using the
+representation Bregman divergence. We summarise the results in the following theorem:
+
+Theorem 2 (Sided positive α-centroids [34]). The left-sided $l_\alpha$ and right-sided $r_\alpha$ positive weighted α-centroid
+coordinates of a set of n positive histograms $h_1, \ldots, h_n$ are weighted α-means:
+
+\[ r_\alpha^i = f_\alpha^{-1}\!\left( \sum_{j=1}^n w_j f_\alpha(h_j^i) \right), \qquad l_\alpha^i = r_{-\alpha}^i, \]
+
+with $f_\alpha(x) = \begin{cases} x^{\frac{1-\alpha}{2}} & \alpha \neq 1, \\ \log x & \alpha = 1. \end{cases}$
+
+Furthermore, the frequency-sided α-centroids are simply the normalized-sided α-centroids.
+
+Theorem 3 (Sided frequency α-centroids [16]). The coordinates of the sided frequency α-centroids of a set of
+n weighted frequency histograms are the normalised weighted α-means.
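+
+A small worked example may help to see how the weighted α-means of Theorem 2 behave (hypothetical data,
+chosen by us: a single bin with values $h_1 = 1$ and $h_2 = 4$ and uniform weights $w_1 = w_2 = 1/2$):
+
+```latex
+\begin{align*}
+r_{-1} &= \tfrac{1}{2} \cdot 1 + \tfrac{1}{2} \cdot 4 = 2.5
+  && \text{(arithmetic mean, } f_{-1}(x) = x \text{)}, \\
+r_{0} &= \left( \tfrac{1}{2} \sqrt{1} + \tfrac{1}{2} \sqrt{4} \right)^{2} = 1.5^2 = 2.25
+  && \text{(} f_{0}(x) = \sqrt{x} \text{)}, \\
+r_{1} &= \exp\!\left( \tfrac{1}{2} \log 1 + \tfrac{1}{2} \log 4 \right) = 2
+  && \text{(geometric mean, } f_{1}(x) = \log x \text{)}.
+\end{align*}
+```
+
+In this example the right-sided centroid coordinate decreases from the arithmetic to the geometric mean as
+α goes from −1 to 1, and the left-sided values follow from $l_\alpha^i = r_{-\alpha}^i$.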
+
+Table 1 summarizes the results concerning the sided positive and frequency α-centroids.
+
+Table 1. Positive and frequency α-centroids: the frequency α-centroids are normalized positive
+α-centroids, where w(h) denotes the cumulative sum of the histogram bins. The arithmetic mean
+is obtained for $r_{-1} = l_1$ and the geometric mean for $r_1 = l_{-1}$.
+
+\begin{tabular}{lll}
+ & Positive centroid & Frequency centroid \\
+Right-sided centroid & $r_\alpha^i = \begin{cases} \left( \sum_{j=1}^n w_j (h_j^i)^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}} & \alpha \neq 1 \\ r_1^i = \prod_{j=1}^n (h_j^i)^{w_j} & \alpha = 1 \end{cases}$ & $\tilde{r}_\alpha^i = \frac{r_\alpha^i}{w(r_\alpha)}$ \\
+Left-sided centroid & $l_\alpha^i = r_{-\alpha}^i = \begin{cases} \left( \sum_{j=1}^n w_j (h_j^i)^{\frac{1+\alpha}{2}} \right)^{\frac{2}{1+\alpha}} & \alpha \neq -1 \\ l_{-1}^i = \prod_{j=1}^n (h_j^i)^{w_j} & \alpha = -1 \end{cases}$ & $\tilde{l}_\alpha^i = \tilde{r}_{-\alpha}^i = \frac{r_{-\alpha}^i}{w(r_{-\alpha})}$ \\
+\end{tabular}
+
+4.3. Mixed α-Centroids
+
+The mixed α-centroids for a set of n weighted histograms are defined as the minimizers of:
+
+\[ \sum_j w_j M_{\lambda,\alpha}(l : h_j : r). \tag{20} \]
+
+We state the theorem generalizing [15]:
+
+Theorem 4. The two mixed α-centroids are the left-sided and right-sided α-centroids.
+
+Figure 1 depicts a clustering result obtained with our α-clustering software. We remark that the clusters
+found are all approximately subclusters of the “distinct” clusters that appear on the figure. When
+those distinct clusters are actually the optimal clusters—which is likely to be the case when they are
+separated by large minimal distance to other clusters—this is clearly a desirable qualitative property
+as long as the number of experimental clusters is not too large compared to the number of optimal
+clusters. We remark also that in the experiment displayed, there is no closed form solution for the
+cluster centers.
+
+Figure 1. Snapshot of the α-clustering software. Here, n = 800 frequency histograms of three bins with
+k = 8, and α = 0.7 and λ = 1/2.
+
+4.4. Symmetrized Jeffreys-Type α-Centroids
+
+The Kullback–Leibler divergence can be symmetrized in various ways: Jeffreys divergence,
+Jensen–Shannon divergence and Chernoff information, just to mention a few. Here, we consider the
+following symmetrization of α-divergences extending Jeffreys J-divergence:
+
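+\begin{align}
+S_\alpha(p, q) &= \frac{1}{2} \left( D_\alpha(p : q) + D_\alpha(q : p) \right) = S_{-\alpha}(p, q), \nonumber \\
+&= M_{\frac{1}{2},\alpha}(p : q : p). \tag{21}
+\end{align}
+
+For α = ±1, we get half of Jeffreys divergence:
+
+\[ S_{\pm 1}(p, q) = \frac{1}{2} \sum_{i=1}^d (p^i - q^i) \log \frac{p^i}{q^i}. \]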
+
+In particular, when p and q are frequency histograms, we have for α ≠ ±1:
+
+\[ J_\alpha(\tilde{p} : \tilde{q}) = \frac{8}{1 - \alpha^2} \left( 1 - \sum_{i=1}^d H_{\frac{1-\alpha}{2}}(\tilde{p}^i, \tilde{q}^i) \right), \tag{22} \]
+
+where $H_{\frac{1-\alpha}{2}}(a, b)$ is a symmetric Heinz mean [35,36]:
+
+\[ H_\beta(a, b) = \frac{a^\beta b^{1-\beta} + a^{1-\beta} b^\beta}{2}. \]
+
+Heinz means interpolate the arithmetic and geometric means and satisfy the inequality:
+
+\[ \sqrt{ab} = H_{\frac{1}{2}}(a, b) \leq H_\alpha(a, b) \leq H_0(a, b) = \frac{a + b}{2}. \]
+
+(Another interesting property of Heinz means is the integral representation of the logarithmic mean:
+$L(x, y) = \frac{x - y}{\log x - \log y} = \int_0^1 H_\beta(x, y) \mathrm{d}\beta$. This allows one to prove easily that $\sqrt{xy} \leq L(x, y) \leq \frac{x+y}{2}$.)
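+
+As a numerical illustration of these inequalities (values a = 1, b = 4 and β = 1/4 are chosen by us, and
+natural logarithms are used), the Heinz and logarithmic means evaluate to:
+
+```latex
+\begin{align*}
+H_{1/2}(1, 4) &= \sqrt{1 \cdot 4} = 2, \\
+H_{1/4}(1, 4) &= \frac{1^{1/4} \, 4^{3/4} + 1^{3/4} \, 4^{1/4}}{2}
+               = \frac{2\sqrt{2} + \sqrt{2}}{2} \approx 2.121, \\
+H_{0}(1, 4) &= \frac{1 + 4}{2} = 2.5, \\
+L(4, 1) &= \frac{4 - 1}{\log 4 - \log 1} = \frac{3}{\log 4} \approx 2.164 .
+\end{align*}
+```
+
+Both chains of inequalities hold on this example: 2 ≤ 2.121 ≤ 2.5 for the Heinz means and
+2 ≤ 2.164 ≤ 2.5 for the logarithmic mean.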
+The Jα-divergence is a Csiszár f-divergence [24,25].
+Observe that it is enough to consider α ∈ [0, ∞) and that the symmetrized α-divergence for
+positive and frequency histograms coincide only for α = ±1.
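+As α → ±1, $S_\alpha(p, q)$ tends to the Jeffreys divergence (up to the factor 1/2 of Equation (21)):
+
+\[ J(p, q) = \mathrm{KL}(p : q) + \mathrm{KL}(q : p) = \sum_{i=1}^d (p^i - q^i)(\log p^i - \log q^i). \tag{23} \]
+
+The Jeffreys divergence writes mathematically the same for frequency histograms:
+
+\[ J(\tilde{p}, \tilde{q}) = \mathrm{KL}(\tilde{p} : \tilde{q}) + \mathrm{KL}(\tilde{q} : \tilde{p}) = \sum_{i=1}^d (\tilde{p}^i - \tilde{q}^i)(\log \tilde{p}^i - \log \tilde{q}^i). \tag{24} \]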
+
+We state the results reported in [32]:
+
+Theorem 5 (Jeffreys positive centroid [32]). The Jeffreys positive centroid $c = (c^1, \ldots, c^d)$ of a set $\{h_1, \ldots, h_n\}$
+of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W
+analytic function:
+
+\[ c^i = \frac{a^i}{W\!\left( \frac{a^i}{g^i} e \right)}, \]
+
+where $a^i = \sum_{j=1}^n \pi_j h_j^i$ denotes the coordinate-wise arithmetic weighted means and $g^i = \prod_{j=1}^n (h_j^i)^{\pi_j}$ the
+coordinate-wise geometric weighted means.
+
+The Lambert analytic function W [37] (positive branch) is defined by W(x)eW(x) = x for x ≥ 0.
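+
+A short sanity check of the formula of Theorem 5, using only the defining identity of W (this is our own
+sketch, not taken from [32]; u is an auxiliary symbol introduced here): the centroid coordinate always lies
+between the geometric and the arithmetic means.
+
+```latex
+\begin{align*}
+u &\doteq W\!\left( \frac{a^i}{g^i} e \right)
+  \;\Longrightarrow\; u e^{u} = \frac{a^i}{g^i} e
+  \;\Longrightarrow\; c^i = \frac{a^i}{u} = g^i e^{u - 1}, \\
+a^i \geq g^i &\;\Longrightarrow\; \frac{a^i}{g^i} e \geq e
+  \;\Longrightarrow\; u \geq W(e) = 1
+  \;\Longrightarrow\; g^i \leq c^i \leq a^i .
+\end{align*}
+```
+
+In particular, for a single histogram (n = 1) we have $a^i = g^i = h_1^i$, hence u = 1 and $c^i = h_1^i$, as
+one would expect.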
+
+Theorem 6 (Jeffreys frequency centroid [32]). Let $\tilde{c}$ denote the Jeffreys frequency centroid and $\tilde{c}' = \frac{c}{w_c}$ the
+normalised Jeffreys positive centroid. Then, the approximation factor $\alpha_{\tilde{c}'} = \frac{S_1(\tilde{c}', \tilde{H})}{S_1(\tilde{c}, \tilde{H})}$ is such that $1 \leq \alpha_{\tilde{c}'} \leq \frac{1}{w_c}$
+(with $w_c \leq 1$).
+
+Therefore, we shall consider α ̸= ±1 in the remainder.
+We state the following lemma generalizing the former results in [38] that were tailored to the
+symmetrized Kullback–Leibler divergence or the symmetrized Bregman divergence [14]:
+
+Lemma 1 (Reduction property). Computing the symmetrized Jα-centroid of a set of n weighted histograms amounts to
+computing the symmetrized α-centroid for the weighted α-mean and −α-mean:
+
+\[ \min_x J_\alpha(x, H) = \min_x \left( D_\alpha(x : r_\alpha) + D_\alpha(l_\alpha : x) \right). \]
+
+Proof. It follows that the minimization problem $\min_x S_\alpha(x, H) = \sum_{j=1}^{n} w_j S_\alpha(x, h_j)$ reduces to the
+following minimization:
+
+\[ \min_x \sum_{i=1}^{d} x^i - (x^i)^{\frac{1+\alpha}{2}} \bar{h}^i_{\alpha} - (x^i)^{\frac{1-\alpha}{2}} \bar{h}^i_{-\alpha}. \tag{25} \]
+
+This is equivalent to minimizing:
+
+\begin{align*}
+&\equiv \sum_{i=1}^{d} x^i - (x^i)^{\frac{1+\alpha}{2}} \left( (\bar{h}^i_{\alpha})^{\frac{2}{1-\alpha}} \right)^{\frac{1-\alpha}{2}} - (x^i)^{\frac{1-\alpha}{2}} \left( (\bar{h}^i_{-\alpha})^{\frac{2}{1+\alpha}} \right)^{\frac{1+\alpha}{2}}, \\
+&\equiv \sum_{i=1}^{d} x^i - (x^i)^{\frac{1+\alpha}{2}} (r^i_{\alpha})^{\frac{1-\alpha}{2}} - (x^i)^{\frac{1-\alpha}{2}} (l^i_{\alpha})^{\frac{1+\alpha}{2}} \\
+&\equiv D_\alpha(x : r_\alpha) + D_\alpha(l_\alpha : x).
+\end{align*}
+
+Note that for α = ±1, the lemma states that the minimization problem is equivalent to minimizing
+KL(a : x) + KL(x : g) with respect to x, where $a = l_1$ and $g = r_1$ denote the arithmetic and geometric
+means, respectively.
+
+The lemma states that the optimization problem with n weighted histograms is equivalent to the
+optimization with only two equally weighted histograms.
+Computing the positive symmetrized α-centroid is equivalent to computing a representational symmetrized
+Bregman centroid [14,34].
+The frequency symmetrized α-centroid is obtained by solving the following minimization problem:
+
+\[ \min_{\tilde{x} \in \Delta_d} \sum_j w_j S_\alpha(\tilde{x}, \tilde{h}_j). \]
+
+Instead of seeking $\tilde{x}$ in the probability simplex, we can optimize on the unconstrained
+domain $\mathbb{R}^{d-1}$ by using a reparameterization. Indeed, frequency histograms belong to the exponential
+families [39] (multinomials).
+Exponential families also include many other continuous distributions, like the Gaussian, Beta or
+Dirichlet distributions. It turns out the α-divergences can be computed in closed-form for members of
+the same exponential family:
+
+
+Lemma 2. The α-divergence for distributions belonging to the same exponential families amounts to computing
+a divergence on the corresponding natural parameters:
+
+\[ A_\alpha(p : q) = \frac{4}{1 - \alpha^2} \left( 1 - e^{-J_F^{\left(\frac{1-\alpha}{2}\right)}(\theta_p : \theta_q)} \right), \]
+
+where $J_F^{\beta}(\theta_1 : \theta_2) = \beta F(\theta_1) + (1 - \beta) F(\theta_2) - F(\beta\theta_1 + (1 - \beta)\theta_2)$ is a skewed Jensen divergence defined for
+the log-normaliser F of the family.
+
+The proof follows from the fact that $\int p^{\alpha}(x)\, q^{1-\alpha}(x)\,\mathrm{d}x = e^{-J_F^{(\alpha)}(\theta_p : \theta_q)}$; see [40].
+
+First, we convert a frequency histogram $\tilde{h}$ to its natural parameter θ with $\theta^i = \log \frac{\tilde{h}^i}{\tilde{h}^d}$; see [39].
+The log-normaliser is a non-separable convex function $F(\theta) = \log(1 + \sum_i e^{\theta^i})$. To convert back a
+multinomial to a frequency histogram with d bins, we first set $\tilde{h}^d = \frac{1}{1 + \sum_{l=1}^{d-1} e^{\theta^l}}$ and then retrieve the
+other bin values as $\tilde{h}^i = \tilde{h}^d e^{\theta^i}$.
+The centroids with respect to skewed Jensen divergences have been investigated in [13,40].
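The conversion between a frequency histogram and its natural parameter, and the skewed Jensen divergence of Lemma 2, can be sketched as follows. This is a minimal illustration assuming strictly positive bins that sum to one, with the last bin used as the reference; the function names are hypothetical.

```python
import math

def to_natural(freq):
    """theta^i = log(h^i / h^d) for i = 1..d-1 (the last bin is the reference)."""
    last = freq[-1]
    return [math.log(x / last) for x in freq[:-1]]

def log_normalizer(theta):
    """F(theta) = log(1 + sum_i exp(theta^i))."""
    return math.log1p(sum(math.exp(t) for t in theta))

def to_frequency(theta):
    """Invert to_natural: recover the d bins from the d-1 natural parameters."""
    last = 1.0 / (1.0 + sum(math.exp(t) for t in theta))
    return [last * math.exp(t) for t in theta] + [last]

def skew_jensen(theta1, theta2, beta):
    """J_F^beta(theta1 : theta2) for the log-normalizer F above."""
    mix = [beta * a + (1.0 - beta) * b for a, b in zip(theta1, theta2)]
    return (beta * log_normalizer(theta1)
            + (1.0 - beta) * log_normalizer(theta2)
            - log_normalizer(mix))
```

With these helpers, the identity used in the proof of Lemma 2 can be checked numerically on discrete histograms: the sum over bins of $p^{\beta} q^{1-\beta}$ should match `math.exp(-skew_jensen(to_natural(p), to_natural(q), beta))`.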
+
+Remark 4. Note that for the special case of α = 0 (squared Hellinger centroid), the sided and symmetrized
+centroids coincide. In that case, the coordinates $s_0^i$ of the squared Hellinger centroid are:
+
+\[ s_0^i = \left( \sum_{j=1}^{n} w_j \sqrt{h_j^i} \right)^2, \quad 1 \leq i \leq d. \]
+
+Remark 5. The symmetrized positive α-centroids can be solved in special cases (α = ±3, α = ±1 corresponding
+to the symmetrized χ2 and Jeffreys positive centroids). For frequency centroids, when dealing with binary
+histograms (d = 2), we have only one degree of freedom and can solve the binary frequency centroids. Binary
+histograms (and mixtures thereof) are used in computer vision and pattern recognition [41].
+
+Remark 6. Since α-divergences are Csiszár f-divergences and f-divergences can always be symmetrized by
+taking generator $s(t) = f(t) + t f(\frac{1}{t})$, we deduce that symmetrized α-divergences Sα are f-divergences for
+the generator:
+
+\[ f(t) = -\log((1 - \alpha) + \alpha t) - t \log\left( (1 - \alpha) + \frac{\alpha}{t} \right). \tag{26} \]
+
+Hence, Sα divergences are convex in both arguments, and the sα centroids are unique.
+
+5. Soft Mixed α-Clustering
+
+Algorithm 4 reports the general clustering with soft membership, which can be adapted to left
+(λinit = 1), right (λinit = 0) or mixed clustering. We have not considered a weighted histogram set in
+order not to overload the notation and because the extension is straightforward.
+Again, for skew Jeffreys centers, we shall adopt a variational approach. Notice that the soft
+clustering approach learns all parameters, including λ (if not constrained to zero or one) and α ∈ R.
+This is not the case for Matsuyama’s α-expectation maximization (EM) algorithm [42] in which α is
+fixed beforehand (and, thus, not learned).
+Assuming we model the prior for histograms by:
+
+\[ p_{\lambda,\alpha,j}(h_i) \propto \lambda \exp(-D_\alpha(l_j : h_i)) + (1 - \lambda) \exp(-D_\alpha(h_i : r_j)), \tag{27} \]
+
+
+the negative log-likelihood involves the α-depending quantity:
+
+\[ \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) \log p_{\lambda,\alpha,j}(h_i) \geq \sum_{j=1}^{k} \sum_{i=1}^{m} M_{\lambda,\alpha}(l_j : h_i : r_j)\, p(j|h_i), \tag{28} \]
+
+because of the concavity of the logarithm function. Therefore, the maximization step for α
+involves finding:
+
+\[ \arg\max_{\alpha} \sum_{j=1}^{k} \sum_{i=1}^{m} M_{\lambda,\alpha}(l_j : h_i : r_j)\, p(j|h_i). \tag{29} \]
+
+Algorithm 4: Mixed α-soft clustering; MAsC(H, k, λ, α)
+
+Input: Histogram set H with |H| = m, integer k > 0, real λ ← λinit ∈ [0, 1], real α ∈ R;
+Let $C = \{(l_i, r_i)\}_{i=1}^{k}$ ← MAS(H, k, λ, α);
+repeat
+    //Expectation
+    for i = 1, 2, ..., m do
+        for j = 1, 2, ..., k do
+            $p(j|h_i) = \frac{\pi_j \exp(-M_{\lambda,\alpha}(l_j : h_i : r_j))}{\sum_{j'} \pi_{j'} \exp(-M_{\lambda,\alpha}(l_{j'} : h_i : r_{j'}))}$;
+    //Maximization
+    for j = 1, 2, ..., k do
+        $\pi_j \leftarrow \frac{1}{m} \sum_i p(j|h_i)$;
+        $l_j \leftarrow \left( \frac{1}{\sum_i p(j|h_i)} \sum_i p(j|h_i)\, h_i^{\frac{1+\alpha}{2}} \right)^{\frac{2}{1+\alpha}}$;
+        $r_j \leftarrow \left( \frac{1}{\sum_i p(j|h_i)} \sum_i p(j|h_i)\, h_i^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}}$;
+    //Alpha - Lambda
+    $\alpha \leftarrow \alpha - \eta_1 \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) \frac{\partial}{\partial\alpha} M_{\lambda,\alpha}(l_j : h_i : r_j)$;
+    if λinit ≠ 0, 1 then
+        $\lambda \leftarrow \lambda - \eta_2 \left( \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) D_\alpha(l_j : h_i) - \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) D_\alpha(h_i : r_j) \right)$;
+    //for some small η1, η2; ensure that λ ∈ [0, 1].
+until convergence;
+Output: Soft clustering of H according to k densities p(j|.) following C;
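The expectation and maximization steps of Algorithm 4 can be sketched as follows, for fixed α ≠ ±1 and fixed λ (the gradient updates of α and λ are left out). This is only an illustrative sketch: the function names are hypothetical, the α-divergence is written in the form that appears in the expression of $R_\alpha$ in the appendix on sided α-centroids, and the shift by the smallest mixed divergence is an added numerical safeguard that is not part of Algorithm 4.

```python
import math

def alpha_div(p, q, alpha):
    """D_alpha(p : q) for positive arrays, alpha != +/-1."""
    c = 4.0 / (1.0 - alpha ** 2)
    return c * sum(0.5 * (1 - alpha) * a + 0.5 * (1 + alpha) * b
                   - a ** (0.5 * (1 - alpha)) * b ** (0.5 * (1 + alpha))
                   for a, b in zip(p, q))

def mixed_div(l, h, r, lam, alpha):
    """M_{lam,alpha}(l : h : r) = lam * D(l : h) + (1 - lam) * D(h : r)."""
    return lam * alpha_div(l, h, alpha) + (1 - lam) * alpha_div(h, r, alpha)

def e_step(H, centers, pis, lam, alpha):
    """Posterior p(j | h_i) for every histogram h_i and cluster j."""
    post = []
    for h in H:
        m_vals = [mixed_div(l, h, r, lam, alpha) for l, r in centers]
        shift = min(m_vals)  # safeguard against underflow (not in Algorithm 4)
        scores = [pi * math.exp(-(m - shift)) for pi, m in zip(pis, m_vals)]
        total = sum(scores)
        post.append([s / total for s in scores])
    return post

def power_mean(H, resp, exponent):
    """Responsibility-weighted power mean, coordinate by coordinate."""
    tot = sum(resp)
    d = len(H[0])
    return [(sum(w * h[i] ** exponent for w, h in zip(resp, H)) / tot) ** (1.0 / exponent)
            for i in range(d)]

def m_step(H, post, alpha):
    """Update the mixture weights pi_j and the dual centers (l_j, r_j)."""
    m, k = len(H), len(post[0])
    pis, centers = [], []
    for j in range(k):
        resp = [post[i][j] for i in range(m)]
        pis.append(sum(resp) / m)
        centers.append((power_mean(H, resp, 0.5 * (1 + alpha)),
                        power_mean(H, resp, 0.5 * (1 - alpha))))
    return pis, centers
```

Alternating `e_step` and `m_step` until the posteriors stabilise gives the soft partition for a given pair (λ, α).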
+
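+No closed-form solutions are known, so we compute the gradient update in Algorithm 4 with:
+
+\[ \frac{\partial M_{\lambda,\alpha}(l_j : h_i : r_j)}{\partial\alpha} = \lambda \frac{\partial D_\alpha(l_j : h_i)}{\partial\alpha} + (1 - \lambda) \frac{\partial D_\alpha(h_i : r_j)}{\partial\alpha}, \tag{30} \]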
+
+
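+\[ \frac{\partial D_\alpha(p : q)}{\partial\alpha} = \frac{2}{(1 - \alpha)^2} \times \left( q - \left( \frac{1 - \alpha}{1 + \alpha} \right)^2 p + p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}} \left( \frac{4\alpha}{1 - \alpha^2} - \ln\left( \frac{q}{p} \right) \right) \right). \tag{31} \]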
+
+The update in λ is easier as:
+
+\[ \frac{\partial M_{\lambda,\alpha}(l_j : h_i : r_j)}{\partial\lambda} = D_\alpha(l_j : h_i) - D_\alpha(h_i : r_j). \tag{32} \]
+
+Maximizing the likelihood in λ would imply choosing λ ∈ {0, 1} (a hard choice for left/right centers),
+yet we prefer the soft update for the parameter, like for α.
+
+6. Conclusions
+
+The family of α-divergences plays a fundamental role in information geometry: These statistical
+distortion measures are the canonical divergences of dual spaces on probability distribution manifolds
+with constant curvature $\kappa = \frac{1-\alpha^2}{4}$ and the canonical divergences of dually flat manifolds for positive
+distribution manifolds [19].
+In this work, we have presented three techniques for clustering (positive or frequency) histograms
+using k-means:
+
+(1) Sided left or right α-centroid k-means,
+(2) Symmetrized Jeffreys-type α-centroid (variational) k-means, and
+(3) Coupled k-means with respect to mixed α-divergences relying on dual α-centroids.
+
+Sided and mixed dual centroids are always available in closed-forms and are therefore highly
+attractive from the standpoint of implementation. Symmetrized Jeffreys centroids are in general not
+available in closed-form and require one to implement a variational k-means by updating incrementally
+the cluster centroids in order to monotonically decrease the k-means loss function. From the clustering
+standpoint, this appears not to be a problem when guaranteed expected approximations to the optimal
+clustering are enough.
+Indeed, we also presented and analyzed an extension of k-means++ [17] for seeding those k-means
+algorithms. The mixed α-seeding initializations do not require one to calculate centroids and behave
+like a discrete k-means by picking the seeds among the data. We reported guaranteed probabilistic
+clustering bounds. Thus, it yields a fast hard/soft data partitioning technique with respect to mixed or
+symmetrized α-divergences. Recently, the advantage of clustering using α-divergences by tuning α in
+applications has been demonstrated in [18]. We thus expect the computationally fast mixed α-seeding
+with guaranteed performance to be useful in a growing number of applications.
+
+Acknowledgments
+
+NICTA is funded by the Australian Government as represented by the Department of Broadband,
+Communication and the Digital Economy and the Australian Research Council through the ICT Centre
+of Excellence program.
+
+Author Contributions
+All authors contributed equally to the design of the research. The research was carried out by all
+authors. Frank Nielsen and Richard Nock wrote the paper. Frank Nielsen implemented the algorithms
+and performed experiments. All authors have read and approved the final manuscript.
+
+Conflicts of Interest
+The authors declare no conflict of interest.
+
+
+Appendix Proof Sketch of Theorem 1
+
+We give here the key results allowing one to obtain the proof of the Theorem, following the proof
+scheme of [15]. To avoid overloading the notation, weights are considered uniform. The extension to
+non-uniform weights is immediate, as it boils down to duplicating histograms in the histogram set and
+does not change the approximation result.
+Let A ⊆ H be an arbitrary cluster of Copt. Let us define UA and πA as the uniform and biased
+distributions conditioned to A. The key to the proof is to relate the expected potential of A under UA
+and πA to its contribution to the optimal potential.
+
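+Lemma A1. Let A ⊆ H be an arbitrary cluster of $C_{\mathrm{opt}}$. Then:
+
+\[ E_{c \sim U_A}[M_{\lambda,\alpha}(A, c)] = M_{\mathrm{opt},\lambda,\alpha}(A) + M_{\mathrm{opt},\lambda,-\alpha}(A) = E_{c \sim U_A}[M_{\lambda,-\alpha}(A, c)], \]
+
+where $U_A$ is the uniform distribution over A.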
+
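+Proof. α-coordinates have the property that for any subset A ⊆ H, $(1/|A|) \sum_{p \in A} u_\alpha(p) = u_\alpha(r_{\alpha,A})$.
+Hence, we have:
+
+\begin{align*}
+\forall c \in A, \quad \sum_{p \in A} D_\alpha(p : c) &= \sum_{p \in A} D_{\phi_\alpha}(u_\alpha(p) : u_\alpha(c)) \\
+&= \sum_{p \in A} D_{\phi_\alpha}(u_\alpha(p) : u_\alpha(r_{\alpha,A})) + |A|\, D_{\phi_\alpha}(u_\alpha(r_{\alpha,A}) : u_\alpha(c)) \\
+&= \sum_{p \in A} D_\alpha(p : r_{\alpha,A}) + |A|\, D_\alpha(r_{\alpha,A} : c). \tag{A1}
+\end{align*}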
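Identity (A1) can be checked numerically. The sketch below assumes uniform weights and writes the α-divergence in the form that appears in the expression of $R_\alpha$ in the appendix on sided α-centroids; the function names are hypothetical and the data are arbitrary.

```python
def alpha_div(p, q, alpha):
    """D_alpha(p : q) for positive arrays, alpha != +/-1."""
    c = 4.0 / (1.0 - alpha ** 2)
    return c * sum(0.5 * (1 - alpha) * a + 0.5 * (1 + alpha) * b
                   - a ** (0.5 * (1 - alpha)) * b ** (0.5 * (1 + alpha))
                   for a, b in zip(p, q))

def right_centroid(points, alpha):
    """Uniformly weighted right-sided alpha-centroid r_{alpha,A} (a power mean)."""
    n, d = len(points), len(points[0])
    e = 0.5 * (1 - alpha)
    return [(sum(p[i] ** e for p in points) / n) ** (1.0 / e) for i in range(d)]

def a1_gap(points, c, alpha):
    """Left-hand side minus right-hand side of (A1); should vanish up to rounding."""
    r = right_centroid(points, alpha)
    lhs = sum(alpha_div(p, c, alpha) for p in points)
    rhs = (sum(alpha_div(p, r, alpha) for p in points)
           + len(points) * alpha_div(r, c, alpha))
    return lhs - rhs
```

For instance, with `A = [[1.0, 2.0], [3.0, 0.5], [2.0, 2.0]]`, the value of `a1_gap(A, c, 0.3)` is zero up to floating-point error for every `c` in `A`. The same helper also lets one check the duality $D_\alpha(p : q) = D_{-\alpha}(q : p)$.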
+
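+Because $D_\alpha(p : q) = D_{-\alpha}(q : p)$ and $l_\alpha = r_{-\alpha}$, we obtain:
+
+\begin{align*}
+\forall c \in A, \quad \sum_{p \in A} D_\alpha(c : p) &= \sum_{p \in A} D_{-\alpha}(p : c) \\
+&= \sum_{p \in A} D_{-\alpha}(p : r_{-\alpha,A}) + |A|\, D_{-\alpha}(r_{-\alpha,A} : c) \\
+&= \sum_{p \in A} D_\alpha(l_{\alpha,A} : p) + |A|\, D_\alpha(c : l_{\alpha,A}). \tag{A2}
+\end{align*}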
+
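+It now follows from (A1) and (A2) that:
+
+\begin{align*}
+E_{c \sim U_A}[M_{\lambda,\alpha}(A, c)] &= \frac{1}{|A|} \sum_{c \in A} \sum_{p \in A} \left\{ \lambda D_\alpha(c : p) + (1 - \lambda) D_\alpha(p : c) \right\} \\
+&= (1 - \lambda) \sum_{p \in A} D_\alpha(p : r_{\alpha,A}) + (1 - \lambda) \sum_{p \in A} D_\alpha(r_{\alpha,A} : p) \\
+&\quad + \lambda \sum_{p \in A} D_\alpha(l_{\alpha,A} : p) + \lambda \sum_{p \in A} D_\alpha(p : l_{\alpha,A}) \\
+&= (1 - \lambda) M_{\mathrm{opt},0,\alpha}(A) + \lambda M_{\mathrm{opt},1,\alpha}(A) \\
+&\quad + (1 - \lambda) M_{\mathrm{opt},0,-\alpha}(A) + \lambda M_{\mathrm{opt},1,-\alpha}(A) \\
+&= M_{\mathrm{opt},\lambda,\alpha}(A) + M_{\mathrm{opt},\lambda,-\alpha}(A). \tag{A3}
+\end{align*}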
+
+
+This gives the left-hand side equality of the Lemma. The right-hand side follows from the fact that
+Ec∼UA[Mλ,−α(A, c)] = Mopt,1−λ,α(A) + Mopt,1−λ,−α(A).
+
+Instead of Mopt,λ,α(A) + Mopt,λ,−α(A), we want a term depending solely on Mopt,λ,α(A) as it is
+the “true” optimum. We now give two lemmata that shall be useful in obtaining this upper bound.
+The first is of independent interest, as it shows that any α-divergence is a scaled, squared Hellinger
+distance between geometric means of points.
+
+Lemma A2. For any p, q and α ≠ 1, there exists r ∈ [p, q], such that
+$(1 - \alpha)^2 D_\alpha(p : q) = D_0(p^{1-\alpha} r^{\alpha} : q^{1-\alpha} r^{\alpha})$.
+
+Proof. By the definition of Bregman divergences, for any x, y, there exists some z ∈ [x, y], such that:
+
+\[ D_{\phi_\alpha}(x : y) = \frac{1}{2}(x - y)^2 \phi''_\alpha(z) = \frac{1}{2}(x - y)^2 \left( 1 + \frac{1 - \alpha}{2} z \right)^{\frac{2\alpha}{1-\alpha}}, \]
+
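+and since $u_\alpha$ is continuous and strictly increasing, for any p, q, there exists some r ∈ [p, q], such that:
+
+\begin{align*}
+D_\alpha(p : q) &= D_{\phi_\alpha}(u_\alpha(p) : u_\alpha(q)) \\
+&= \frac{1}{2}(u_\alpha(p) - u_\alpha(q))^2 \left( 1 + \frac{1 - \alpha}{2} u_\alpha(r) \right)^{\frac{2\alpha}{1-\alpha}} \\
+&= \frac{2}{(1 - \alpha)^2} \left( p^{\frac{1-\alpha}{2}} - q^{\frac{1-\alpha}{2}} \right)^2 r^{\alpha} \\
+&= \frac{2}{(1 - \alpha)^2} \left( p^{1-\alpha} + q^{1-\alpha} - 2(pq)^{\frac{1-\alpha}{2}} \right) r^{\alpha} \\
+&= \frac{1}{(1 - \alpha)^2} D_0(p^{1-\alpha} r^{\alpha} : q^{1-\alpha} r^{\alpha}).
+\end{align*}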
+
+Lemma A3. Let discrete random variable x take non-negative values $x_1, x_2, ..., x_m$ with uniform probabilities.
+Then, for any β > −1, we have $\mathrm{var}(x^{1+\beta}/u^{\beta}) \leq \mathrm{var}(x)$, with $u \doteq (1 + \beta)^{\beta} \max_i x_i$.
+
+Proof. First, ∀β > −1, remark that for any x, function $f(x) = x(u^{\beta} - x^{\beta})$ is increasing for
+$x \leq u/(1 + \beta)^{\beta}$. Hence, assuming that the $x_i$'s are put in non-increasing order without loss of generality, we
+have $f(x_i) \geq f(x_j)$, and so, $x_i(u^{\beta} - x_i^{\beta}) \geq x_j(u^{\beta} - x_j^{\beta})$, ∀i ≥ j, as long as $x_i \leq u/(1 + \beta)^{\beta}$. Choosing
+$u = x_1(1 + \beta)^{\beta}$ yields, after reordering and putting the exponent, $(x_i^{1+\beta} - x_j^{1+\beta})^2 \leq (x_i u^{\beta} - x_j u^{\beta})^2$.
+Hence:
+
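+\begin{align*}
+\frac{1}{m} \sum_i x_i^{2(1+\beta)} - \left( \frac{1}{m} \sum_i x_i^{1+\beta} \right)^2 &= \frac{1}{2m^2} \sum_{i,j} (x_i^{1+\beta} - x_j^{1+\beta})^2 \\
+&\leq \frac{1}{2m^2} \sum_{i,j} (x_i u^{\beta} - x_j u^{\beta})^2 \\
+&= \frac{u^{2\beta}}{2m^2} \sum_{i,j} (x_i - x_j)^2 \\
+&= u^{2\beta} \left( \frac{1}{m} \sum_i x_i^2 - \left( \frac{1}{m} \sum_i x_i \right)^2 \right).
+\end{align*}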
+
+Dividing by u2β the leftmost and rightmost terms and using the fact that var(λx) = λ2var(x) yields
+the statement of the Lemma.
+
+We are now ready to upper bound Mopt,λ,−α(A) as a function of Mopt,λ,α(A).
+
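+Lemma A4. For any cluster A of $C_{\mathrm{opt}}$,
+
+\[ M_{\mathrm{opt},\lambda,-\alpha}(A) \leq M_{\mathrm{opt},\lambda,\alpha}(A) \times \begin{cases} f(\lambda) & \text{if } \lambda \in (0, 1) \\ z(\alpha) h^2(\alpha) & \text{otherwise} \end{cases}, \]
+
+where z(α), f(λ) and h(α) are defined in Theorem 1.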
+
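+Proof. The case λ ≠ 0, 1 is immediate, as we have by definition:
+
+\begin{align*}
+M_{\mathrm{opt},\lambda,-\alpha}(A) &= \sum_{p \in A} \lambda D_{-\alpha}(l_{-\alpha,A} : p) + (1 - \lambda) D_{-\alpha}(p : r_{-\alpha,A}) \\
+&= \sum_{p \in A} \lambda D_\alpha(p : l_{-\alpha,A}) + (1 - \lambda) D_\alpha(r_{-\alpha,A} : p) \\
+&= \sum_{p \in A} \lambda D_\alpha(p : r_{\alpha,A}) + (1 - \lambda) D_\alpha(l_{\alpha,A} : p) \\
+&\leq \max\left\{ \frac{1 - \lambda}{\lambda}, \frac{\lambda}{1 - \lambda} \right\} M_{\mathrm{opt},\lambda,\alpha}(A) \\
+&= f(\lambda) M_{\mathrm{opt},\lambda,\alpha}(A).
+\end{align*}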
+
+Suppose now that λ = 0 and α ≥ 0. Because
+$M_{\mathrm{opt},0,-\alpha}(A) = \sum_{p \in A} D_{-\alpha}(p : r_{-\alpha,A}) = \sum_{p \in A} D_\alpha(l_{\alpha,A} : p) = M_{\mathrm{opt},1,\alpha}(A)$,
+what we wish to do is upper bound $\sum_{p \in A} D_\alpha(l_{\alpha,A} : p) = M_{\mathrm{opt},1,\alpha}(A)$
+as a function of $\sum_{p \in A} D_\alpha(p : r_{\alpha,A}) = M_{\mathrm{opt},0,\alpha}(A)$. We use Lemmata A2 and A3 in the following
+derivations, using r(p) to refer to the r in Lemma A2, assuming α ≥ 0. We also denote by $\mathrm{var}_A(f(p))$
+the variance, under the uniform distribution over A, of discrete random variable f(p), for p ∈ A. We
+have:
+
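+\begin{align*}
+\sum_{p \in A} D_\alpha(l_{\alpha,A} : p) &= \sum_{p \in A} D_{-\alpha}(p : l_{\alpha,A}) \\
+&= \frac{1}{(1 + \alpha)^2} \sum_{p \in A} r(p)^{-\alpha} D_0(p^{1+\alpha} : l_{\alpha,A}^{1+\alpha}) \\
+&\leq \frac{1}{(1 + \alpha)^2 \min_A p^{\alpha}} \sum_{p \in A} D_0(p^{1+\alpha} : l_{\alpha,A}^{1+\alpha}) \\
+&= \frac{1}{(1 + \alpha)^2 \min_A p^{\alpha}} \sum_{p \in A} \left( p^{1+\alpha} + l_{\alpha,A}^{1+\alpha} - 2 p^{\frac{1+\alpha}{2}} l_{\alpha,A}^{\frac{1+\alpha}{2}} \right) \\
+&= \frac{|A|}{(1 + \alpha)^2 \min_A p^{\alpha}} \left( \frac{1}{|A|} \sum_{p \in A} p^{1+\alpha} - \left( \frac{1}{|A|} \sum_{p \in A} p^{\frac{1+\alpha}{2}} \right)^2 \right) \\
+&= \frac{|A|\, \mathrm{var}_A(p^{\frac{1+\alpha}{2}})}{(1 + \alpha)^2 \min_A p^{\alpha}}. \tag{A4}
+\end{align*}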
+
+We have used the expression of left centroid $l_{\alpha,A}^{1+\alpha}$ to simplify the expressions. Now, picking $x_i = p_i^{\frac{1-\alpha}{2}}$,
+β = 2α/(1 − α) and $u = \left( \frac{1+\alpha}{1-\alpha} \right)^{\frac{2\alpha}{1-\alpha}} \max_A p^{\frac{1-\alpha}{2}}$ in Lemma A3 yields:
+
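+\begin{align*}
+\mathrm{var}_A(p^{\frac{1+\alpha}{2}}) &= u^{2\beta}\, \mathrm{var}_A(p^{\frac{1+\alpha}{2}}/u^{\beta}) \\
+&= u^{2\beta}\, \mathrm{var}_A\left( p^{\frac{1-\alpha}{2}} p^{\alpha}/u^{\beta} \right) \\
+&= u^{2\beta}\, \mathrm{var}(x^{1+\beta}/u^{\beta}) \\
+&\leq u^{2\beta}\, \mathrm{var}(x) \\
+&= u^{2\beta}\, \mathrm{var}_A\left( p^{\frac{1-\alpha}{2}} \right) \\
+&= \left( \frac{1 + \alpha}{1 - \alpha} \right)^{\frac{8\alpha^2}{(1-\alpha)^2}} \max_A p^{2\alpha}\, \mathrm{var}_A\left( p^{\frac{1-\alpha}{2}} \right). \tag{A5}
+\end{align*}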
+
+
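+Plugging this in (A4) yields:
+
+\begin{align*}
+\sum_{p \in A} D_\alpha(l_{\alpha,A} : p) &\leq \frac{\left( \frac{1+\alpha}{1-\alpha} \right)^{\frac{8\alpha^2}{(1-\alpha)^2}} |A| \max_A p^{2\alpha}\, \mathrm{var}_A\left( p^{\frac{1-\alpha}{2}} \right)}{(1 + \alpha)^2 \min_A p^{\alpha}} \\
+&= \left( \frac{1+\alpha}{1-\alpha} \right)^{\frac{8\alpha^2}{(1-\alpha)^2} - 2} \left( \frac{\max_A p}{\min_A p} \right)^{2\alpha} \times \frac{|A| \min_A p^{\alpha}\, \mathrm{var}_A(p^{\frac{1-\alpha}{2}})}{(1 - \alpha)^2} \\
+&= \left( \frac{1+\alpha}{1-\alpha} \right)^{\frac{8\alpha^2}{(1-\alpha)^2} - 2} \left( \frac{\max_A p}{\min_A p} \right)^{2\alpha} \times \frac{\min_A p^{\alpha}}{(1 - \alpha)^2} \sum_{p \in A} D_0(p^{1-\alpha} : r_{\alpha,A}^{1-\alpha}) \tag{A6} \\
+&\leq \left( \frac{1+\alpha}{1-\alpha} \right)^{\frac{8\alpha^2}{(1-\alpha)^2} - 2} \left( \frac{\max_A p}{\min_A p} \right)^{2\alpha} \times \frac{1}{(1 - \alpha)^2} \sum_{p \in A} r(p)^{\alpha} D_0(p^{1-\alpha} : r_{\alpha,A}^{1-\alpha}) \\
+&= \left( \frac{1+\alpha}{1-\alpha} \right)^{\frac{8\alpha^2}{(1-\alpha)^2} - 2} \left( \frac{\max_A p}{\min_A p} \right)^{2\alpha} \times \sum_{p \in A} D_\alpha(p : r_{\alpha,A}) \\
+&\leq z(\alpha) \left( \frac{\max_A p}{\min_A p} \right)^{2\alpha} \times \sum_{p \in A} D_\alpha(p : r_{\alpha,A}). \tag{A7}
+\end{align*}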
+
+Here, (A6) follows backwards the path of derivations that led to (A4). The cases λ = 1 or α < 0 are
+obtained using the same chains of derivations, which completes the proof of Lemma A4.
+
+Lemma A4 can be directly used to refine the bound of Lemma A1 in the uniform distribution. We
+give the Lemma for the biased distribution, directly integrating the refinement of the bound.
+
+Lemma A5. Let A be an arbitrary cluster of $C_{\mathrm{opt}}$ and C an arbitrary clustering. If we add a random couple
+(c, c) to C, chosen from A with π as in Algorithm 2, then:
+
+\[ E_{c \sim \pi_A}[M_{\lambda,\alpha}(A, c)] \leq 4 \begin{cases} f(\lambda) h^2(\alpha) M_{\mathrm{opt},\lambda,\alpha}(A) & \text{if } \lambda \in (0, 1) \\ z(\alpha) h^4(\alpha) M_{\mathrm{opt},\lambda,\alpha}(A) & \text{otherwise} \end{cases}, \tag{A8} \]
+
+where f(λ) and h(α) are defined in Theorem 1.
+
+Proof. The proof essentially follows the proof of Lemma 3 in [15]. To complete it, we need a triangle
+inequality involving α-divergences. We give it here.
+
+Lemma A6. For any p, q, r and α, we have:
+
+\[ \sqrt{D_\alpha(p : q)} \leq \left( \frac{\max_i \{p^i, q^i, r^i\}}{\min_i \{p^i, q^i, r^i\}} \right)^{|\alpha|} \left( \sqrt{D_\alpha(p : r)} + \sqrt{D_\alpha(r : q)} \right) \tag{A9} \]
+
+(where the min is over strictly positive values)
+
+Remark: take α = 0; we find the triangle inequality for the squared Hellinger distance.
+
+Proof. Using the proof of Lemma 2 in [15] for Bregman divergence $D_{\phi_\alpha}$, we get:
+
+\[ \sqrt{D_{\phi_\alpha}(x : z)} \leq \rho(\alpha) \left( \sqrt{D_{\phi_\alpha}(x : y)} + \sqrt{D_{\phi_\alpha}(y : z)} \right), \tag{A10} \]
+
+where:
+
+\[ \rho(\alpha) = \max_{u,v} \frac{\left( 1 + \frac{1-\alpha}{2} u \right)^{\frac{2\alpha}{1-\alpha}}}{\left( 1 + \frac{1-\alpha}{2} v \right)^{\frac{2\alpha}{1-\alpha}}}. \tag{A11} \]
+
+Taking $x = u_\alpha(p)$, $y = u_\alpha(q)$, $z = u_\alpha(r)$ yields $\rho(\alpha) = \max_{s,t \in \{p^i, q^i, r^i\}} (s/t)^{|\alpha|}$ and the statement of
+Lemma A6.
+
+The rest of the proof of Lemma A5 follows the proof of Lemma 3 in [15].
+
+We now have all of the ingredients for our proof, and it remains to use Lemma 4 in [15] to complete the
+proof of Theorem 1.
+
+Appendix Properties of α-Divergences
+
+For positive arrays p and q, the α-divergence $D_\alpha(p : q)$ can be defined as an equivalent
+representational Bregman divergence [19,34] $B_{\phi_\alpha}(u_\alpha(p) : u_\alpha(q))$ over the $(u_\alpha, v_\alpha)$-structure [43] with:
+
+\[ \phi_\alpha(x) \doteq \frac{2}{1 + \alpha} \left( 1 + \frac{1 - \alpha}{2} x \right)^{\frac{2}{1-\alpha}}, \tag{A12} \]
+
+\[ u_\alpha(p) \doteq \frac{2}{1 - \alpha} \left( p^{\frac{1-\alpha}{2}} - 1 \right), \tag{A13} \]
+
+\[ v_\alpha(p) \doteq \frac{2}{1 + \alpha}\, p^{\frac{1+\alpha}{2}}, \tag{A14} \]
+
+where we assume that α ≠ ±1. Otherwise, for α = ±1, we compute $D_\alpha(p : q)$ by taking the sided
+Kullback–Leibler divergence extended to positive arrays.
+In the proof of Theorem 1, we have used two properties of α-divergences of independent interest:
+
+• any α-divergence can be explained as a scaled squared Hellinger distance between geometric
+means of its arguments and a point that belongs to their segment (Lemma A2);
+• any α-divergence satisfies a generalized triangle inequality (Lemma A6). Notice that this Lemma
+is optimal in the sense that for α = 0, it is possible to recover the triangle inequality of the
+Hellinger distance.
+
+The following lemma shows how to bound the mixed divergence as a function of an α-divergence.
+
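+Lemma A7. For any positive arrays l, h, r and α ≠ ±1, define $\eta \doteq \lambda(1 - \alpha)/(1 - \alpha(2\lambda - 1)) \in [0, 1]$, $g_\eta$
+with $g_\eta^i \doteq (l^i)^{\eta} (r^i)^{1-\eta}$ and $a_\eta$ with $a_\eta^i \doteq \eta l^i + (1 - \eta) r^i$. Then, we have:
+
+\[ M_{\lambda,\alpha}(l : h : r) \leq \frac{1 - \alpha^2(2\lambda - 1)^2}{1 - \alpha^2} D_{\alpha(2\lambda-1)}(g_\eta : h) + \frac{2(1 - \alpha(2\lambda - 1))}{1 - \alpha^2} \sum_i \left( a_\eta^i - g_\eta^i \right). \]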
+
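+Proof. For every index i, we have:
+
+\begin{align*}
+M_{\lambda,\alpha}(l^i : h^i : r^i) &= \lambda D_\alpha(l^i : h^i) + (1 - \lambda) D_\alpha(h^i : r^i) \\
+&= \frac{4}{1 - \alpha^2} \left( \frac{\lambda(1 - \alpha)}{2} l^i + \frac{(1 - \lambda)(1 + \alpha)}{2} r^i + \frac{1 + \alpha(2\lambda - 1)}{2} h^i \right. \tag{A15} \\
+&\qquad \left. - \lambda (l^i)^{\frac{1-\alpha}{2}} (h^i)^{\frac{1+\alpha}{2}} - (1 - \lambda)(r^i)^{\frac{1+\alpha}{2}} (h^i)^{\frac{1-\alpha}{2}} \right). \tag{A16}
+\end{align*}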
+
+
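+The arithmetic-geometric-harmonic (AGH) inequality implies:
+
+\begin{align*}
+\lambda (l^i)^{\frac{1-\alpha}{2}} (h^i)^{\frac{1+\alpha}{2}} + (1 - \lambda)(r^i)^{\frac{1+\alpha}{2}} (h^i)^{\frac{1-\alpha}{2}}
+&\geq (l^i)^{\frac{\lambda(1-\alpha)}{2}} (r^i)^{\frac{(1-\lambda)(1+\alpha)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}} \\
+&= \left( (l^i)^{\frac{\lambda(1-\alpha)}{1-\alpha(2\lambda-1)}} (r^i)^{\frac{(1-\lambda)(1+\alpha)}{1-\alpha(2\lambda-1)}} \right)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}} \\
+&= \left( (l^i)^{\eta} (r^i)^{1-\eta} \right)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}} \\
+&= (g_\eta^i)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}}.
+\end{align*}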
+
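+It follows that (A16) yields:
+
+\begin{align*}
+M_{\lambda,\alpha}(l^i : h^i : r^i) &\leq \frac{4}{1 - \alpha^2} \left( \frac{1 - \alpha(2\lambda - 1)}{2} \left( \eta l^i + (1 - \eta) r^i \right) + \right. \tag{A17} \\
+&\qquad \left. \frac{1 + \alpha(2\lambda - 1)}{2} h^i - (g_\eta^i)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}} \right) \\
+&= \frac{1 - \alpha^2(2\lambda - 1)^2}{1 - \alpha^2} D_{\alpha(2\lambda-1)}(g_\eta^i : h^i) + \frac{2(1 - \alpha(2\lambda - 1))}{1 - \alpha^2} \left( a_\eta^i - g_\eta^i \right), \tag{A18}
+\end{align*}
+
+out of which we get the statement of the Lemma.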
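The bound of Lemma A7 can be checked numerically. The following sketch restricts itself to |α| < 1, an assumption made here so that the prefactor 4/(1 − α²) is positive; the function names are hypothetical and the α-divergence is written in the form appearing in (A15) and (A16).

```python
def alpha_div(p, q, alpha):
    """D_alpha(p : q) for positive arrays, alpha != +/-1."""
    c = 4.0 / (1.0 - alpha ** 2)
    return c * sum(0.5 * (1 - alpha) * a + 0.5 * (1 + alpha) * b
                   - a ** (0.5 * (1 - alpha)) * b ** (0.5 * (1 + alpha))
                   for a, b in zip(p, q))

def mixed_div(l, h, r, lam, alpha):
    """M_{lam,alpha}(l : h : r) = lam * D(l : h) + (1 - lam) * D(h : r)."""
    return lam * alpha_div(l, h, alpha) + (1 - lam) * alpha_div(h, r, alpha)

def lemma_a7_bound(l, h, r, lam, alpha):
    """Right-hand side of Lemma A7 (sketch restricted to |alpha| < 1)."""
    a2 = alpha * (2 * lam - 1)                            # skew alpha(2 lam - 1)
    eta = lam * (1 - alpha) / (1 - a2)
    g = [x ** eta * y ** (1 - eta) for x, y in zip(l, r)]  # geometric mix g_eta
    a = [eta * x + (1 - eta) * y for x, y in zip(l, r)]    # arithmetic mix a_eta
    first = (1 - a2 ** 2) / (1 - alpha ** 2) * alpha_div(g, h, a2)
    second = 2 * (1 - a2) / (1 - alpha ** 2) * sum(x - y for x, y in zip(a, g))
    return first + second
```

For λ ∈ {0, 1} the two sides coincide, since then $g_\eta$ and $a_\eta$ both reduce to r or to l; for intermediate λ the gap comes from the two arithmetic-geometric steps in the proof.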
+
+Appendix Sided α-Centroids
+
+For the sake of completeness, we prove the following theorem:
+
+Theorem A1 (Sided positive α-centroids [34]). The left-sided $l_\alpha$ and right-sided $r_\alpha$ positive weighted
+α-centroid coordinates of a set of n positive histograms $h_1, ..., h_n$ are weighted α-means:
+
+\[ r_\alpha^i = f_\alpha^{-1}\left( \sum_{j=1}^{n} w_j f_\alpha(h_j^i) \right), \quad l_\alpha^i = r_{-\alpha}^i \]
+
+with:
+
+\[ f_\alpha(x) = \begin{cases} x^{\frac{1-\alpha}{2}} & \alpha \neq \pm 1, \\ \log x & \alpha = 1. \end{cases} \]
+
+Proof. We distinguish three cases: α ≠ ±1, α = −1 and α = 1.
+First, consider the general case α ≠ ±1. We have to minimize:
+
+\[ R_\alpha(x, H) = \frac{4}{1 - \alpha^2} \sum_{j=1}^{n} w_j \times \sum_{i=1}^{d} \left( \frac{1 - \alpha}{2} h_j^i + \frac{1 + \alpha}{2} x^i - (h_j^i)^{\frac{1-\alpha}{2}} (x^i)^{\frac{1+\alpha}{2}} \right). \]
+
+Removing all additive terms independent of $x^i$ and the overall constant multiplicative factor $\frac{4}{1-\alpha^2} \neq 0$,
+we get the following equivalent minimisation problem:
+
+\[ R'_\alpha(x, H) = \sum_{i=1}^{d} \frac{1 + \alpha}{2} x^i - (x^i)^{\frac{1+\alpha}{2}} \underbrace{\left( \sum_{j=1}^{n} w_j (h_j^i)^{\frac{1-\alpha}{2}} \right)}_{\bar{h}_\alpha^i}, \tag{A19} \]
+
+where $\bar{h}_\alpha^i$ denotes the following aggregation term:
+
+\[ \bar{h}_\alpha^i = \sum_{j=1}^{n} w_j (h_j^i)^{\frac{1-\alpha}{2}}. \]
+
+Setting coordinate-wise the derivative to zero of Equation (A19) (i.e., $\nabla_x R'(x, H) = 0$), we get:
+
+\[ \frac{1 + \alpha}{2} - \frac{1 + \alpha}{2} (x^i)^{\frac{\alpha-1}{2}} \bar{h}_\alpha^i = 0 \]
+
+Thus, we find that the coordinates of the right-sided α-centroids are:
+
+\[ c_\alpha^i = (\bar{h}_\alpha^i)^{\frac{2}{1-\alpha}} = \left( \sum_{j=1}^{n} w_j (h_j^i)^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}} = \hat{h}_\alpha^i. \]
+
+We recognise the expression of a quasi-arithmetic mean for the strictly monotonic generator $f_\alpha(x)$:
+
+\[ r_\alpha^i = f_\alpha^{-1}\left( \sum_{j=1}^{n} w_j f_\alpha(h_j^i) \right), \tag{A20} \]
+
+with:
+
+\[ f_\alpha(x) = x^{\frac{1-\alpha}{2}}, \quad f_\alpha^{-1}(x) = x^{\frac{2}{1-\alpha}}, \quad \alpha \neq \pm 1. \]
+
+Therefore, we conclude that the coordinates of the positive α-centroid are the weighted α-means of
+the histogram coordinates (for α ̸= ±1). Quasi-arithmetic means are also called in the literature
+quasi-linear means or f-means.
+When α = −1, we search for the right-sided extended Kullback–Leibler divergence centroid by
+minimising:
+
+\[ R_{-1}(x; \tilde{H}) = \sum_{j=1}^{n} w_j \sum_{i=1}^{d} h_j^i \log \frac{h_j^i}{x^i} + x^i - h_j^i. \]
+
+It is equivalent to minimizing:
+
+\[ R'_{-1}(x; \tilde{H}) = \sum_{i=1}^{d} x^i - \underbrace{\left( \sum_{j=1}^{n} w_j h_j^i \right)}_{a^i} \log x^i, \]
+
+where a denotes the arithmetic mean. Solving coordinate-wise, we get $c^i = a^i = \sum_{j=1}^{n} w_j h_j^i$.
+When α = 1, the right-sided reverse extended KL centroid is a left-sided extended KL centroid. The
+minimisation problem is:
+
+\[ R_1(x; \tilde{H}) = \sum_{j=1}^{n} w_j \sum_{i=1}^{d} x^i \log \frac{x^i}{h_j^i} + h_j^i - x^i. \]
+
+Since $\sum_j w_j = 1$, we solve coordinate-wise and find $\log x^i = \sum_j w_j \log h_j^i$. That is, $r_1^i$ is the geometric
+mean:
+
+\[ r_1^i = \prod_{j=1}^{n} (h_j^i)^{w_j}. \]
+
+Both the arithmetic mean and the geometric mean are power means in the limit case (and hence
+quasi-arithmetic means). Thus,
+
+\[ r_\alpha^i = f_\alpha^{-1}\left( \sum_{j=1}^{n} w_j f_\alpha(h_j^i) \right), \tag{A21} \]
+
+with:
+
+\[ f_\alpha(x) = \begin{cases} x^{\frac{1-\alpha}{2}} & \alpha \neq \pm 1, \\ \log x & \alpha = 1. \end{cases} \]
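The closed forms of Theorem A1 translate directly into code. The sketch below assumes positive histograms and weights summing to one; the function names are hypothetical. The case α = −1 needs no special treatment because the exponent (1 − α)/2 then equals 1, which gives the arithmetic mean.

```python
import math

def right_alpha_centroid(histograms, weights, alpha):
    """Right-sided alpha-centroid: weighted quasi-arithmetic mean with generator f_alpha."""
    d = len(histograms[0])
    if alpha == 1:
        # limit case f_alpha(x) = log x: weighted geometric mean
        return [math.exp(sum(w * math.log(h[i]) for w, h in zip(weights, histograms)))
                for i in range(d)]
    e = 0.5 * (1.0 - alpha)  # f_alpha(x) = x**e, with inverse x**(1/e)
    return [sum(w * h[i] ** e for w, h in zip(weights, histograms)) ** (1.0 / e)
            for i in range(d)]

def left_alpha_centroid(histograms, weights, alpha):
    """Left-sided alpha-centroid, using l_alpha = r_{-alpha}."""
    return right_alpha_centroid(histograms, weights, -alpha)
```

With α = 0 the two sided centroids coincide and reduce to the squared Hellinger centroid of Remark 4.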
+
+References
+
+1.
+Baker, L.D.; McCallum, A.K. Distributional clustering of words for text classification. In Proceedings of the
+21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,
+Melbourne, Australia, 24–28 August 1998; ACM: New York, NY, USA, 1998; pp. 96–103.
+2.
+Bigi, B. Using Kullback–Leibler distance for text categorization.
+In Proceedings of the 25th European
+conference on IR research (ECIR), Pisa, Italy, 14–16 April 2003; Springer-Verlag: Berlin/Heidelberg, Germany,
+2003; ECIR’03, pp. 305–319.
+3.
+Bag of Words Data Set. Available online: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words (accessed
+on 17 June 2014).
+4.
+Csurka, G.; Bray, C.; Dance, C.; Fan, L. Visual Categorization with Bags of Keypoints; Workshop on Statistical
+Learning in Computer Vision (ECCV); Xerox Research Centre Europe: Meylan, France, 2004, pp. 1–22.
+5.
+Jégou, H.; Douze, M.; Schmid, C. Improving Bag-of-Features for Large Scale Image Search. Int. J. Comput.
+Vis. 2010, 87, 316–336.
+6.
+Yu, Z.; Li, A.; Au, O.; Xu, C. Bag of textons for image segmentation via soft clustering and convex shift. In
+Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI,
+USA, 16–21 June 2012; pp. 781–788.
+7.
+Steinhaus, H. Sur la division des corp matériels en parties. Bull. Acad. Polon. Sci. 1956, 1, 801–804. (in
+French)
+8.
+Lloyd, S.P. Least Squares Quantization in PCM; Technical Report RR-5497; Bell Laboratories: Murray Hill, NJ,
+USA, 1957.
+9.
+Lloyd, S.P. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
+10.
+Chandrasekhar, V.; Takacs, G.; Chen, D.M.; Tsai, S.S.; Reznik, Y.A.; Grzeszczuk, R.; Girod, B. Compressed
+histogram of gradients: A low-bitrate descriptor. Int. J. Comput. Vis. 2012, 96, 384–399.
+11.
+Nock, R.; Nielsen, F.; Briys, E. Non-linear book manifolds: Learning from associations the dynamic geometry
+of digital libraries. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, New
+York, NY, USA, 2013; pp. 313–322.
+12.
+Kwitt, R.; Vasconcelos, N.; Rasiwasia, N.; Uhl, A.; Davis, B.C.; Häfner, M.; Wrba, F. Endoscopic image
+analysis in semantic space. Med. Image Anal. 2012, 16, 1415–1422.
+13.
+Nielsen, F. A family of statistical symmetric divergences based on Jensen’s inequality. 2010, arXiv:1009.4004.
+14.
+Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904.
+15.
+Nock, R.; Luosto, P.; Kivinen, J. Mixed Bregman clustering with approximation guarantees. In Proceedings of
+the European Conference on Machine Learning and Knowledge Discovery in Databases, Antwerp, Belgium,
+15–19 September 2008; Springer-Verlag: Berlin/Heidelberg, Germany, 2008; pp. 154–169.
+16.
+Amari, S. Integration of Stochastic Models by Minimizing α-Divergence. Neural Comput. 2007, 19, 2780–2796.
+17.
+Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth
+Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), New Orleans, LA, USA, 7–9 January 2007;
+Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2007; pp. 1027–1035.
+18.
+Olszewski, D.; Ster, B. Asymmetric clustering using the alpha-beta divergence. Pattern Recognit. 2014,
+47, 2031–2041.
+19.
+Amari, S. Alpha-divergence is unique, belonging to both f-divergence and Bregman divergence classes.
+IEEE Trans. Inf. Theory 2009, 55, 4925–4931.
+20.
+Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res.
+2005, 6, 1705–1749.
+21.
+Teboulle, M. A unified continuous optimization framework for center-based clustering methods. J. Mach.
+Learn. Res. 2007, 8, 65–102.
+22.
+Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: Oxford, UK, 2000.
+23.
+Morimoto, T. Markov Processes and the H-theorem. J. Phys. Soc. Jpn. 1963, 18, 328–331.
+
+
+24.
+Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat.
+Soc. Ser. B 1966, 28, 131–142.
+25.
+Csiszár, I. Information-type measures of difference of probability distributions and indirect observation.
+Studi. Sci. Math. Hung. 1967, 2, 229–318.
+26.
+Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust
+nonnegative matrix factorization. Entropy 2011, 13, 134–170.
+27.
+Zhu, H.; Rohwer, R. Measurements of generalisation based on information geometry. In Mathematics of
+Neural Networks; Operations Research/Computer Science Interfaces Series; Ellacott, S., Mason, J., Anderson,
+I., Eds.; Springer: New York, NY, USA, 1997; Volume 8, pp. 394–398.
+28.
+Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations.
+Ann. Math. Stat. 1952, 23, 493–507.
+29.
+Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett.
+2013, 20, 269–272.
+30.
+Wu, J.; Rehg, J. Beyond the euclidean distance: creating effective visual codebooks using the histogram
+intersection kernel. In Proceedings of 2009 IEEE 12th International Conference on Computer Vision, Kyoto,
+Japan, 29 September–2 October 2009; pp. 630–637.
+31.
+Bhattacharya, A.; Jaiswal, R.; Ailon, N. A tight lower bound instance for k-means++ in constant dimension.
+In Theory and Applications of Models of Computation; Lecture Notes in Computer Science; Gopal, T., Agrawal,
+M., Li, A., Cooper, S., Eds.; Springer International Publishing: New York, NY, USA, 2014; Volume 8402, pp.
+7–22.
+32.
+Nielsen, F. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight
+approximation for frequency histograms. IEEE Signal Process. Lett. 2013, 20, 657–660.
+33.
+Ben-Tal, A.; Charnes, A.; Teboulle, M. Entropic means. J. Math. Anal. Appl. 1989, 139, 537–551.
+34.
+Nielsen, F.; Nock, R. The dual Voronoi diagrams with respect to representational Bregman divergences. In
+Proceedings of International Symposium on Voronoi Diagrams (ISVD), Copenhagen, Denmark, 23–26 June
+2009; pp. 71–78.
+35.
+Heinz, E. Beiträge zur Störungstheorie der Spektralzerlegung. Math. Anna. 1951, 123, 415–438. (in German)
+36.
+Besenyei, A. On the invariance equation for Heinz means. Math. Inequal. Appl. 2012, 15, 973–979.
+37.
+Barry, D.A.; Culligan-Hensley, P.J.; Barry, S.J. Real values of the W-function. ACM Trans. Math. Softw. 1995,
+21, 161–171.
+38.
+Veldhuis, R.N.J. The centroid of the symmetrical Kullback–Leibler distance. IEEE Signal Process. Lett. 2002,
+9, 96–99.
+39.
+Nielsen, F.; Garcia, V. Statistical exponential families: A digest with flash cards. 2009, arXiv.org: 0911.4863.
+40.
+Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466.
+41.
+Romberg, S.; Lienhart, R. Bundle min-hashing for logo recognition.
+In Proceedings of the 3rd ACM
+Conference on International Conference on Multimedia Retrieval, Dallas, TX, USA, 16–19 April 2013; ACM:
+New York, NY, USA, 2013; pp. 113–120.
+42.
+Matsuyama, Y. The alpha-EM algorithm: Surrogate likelihood maximization using alpha-logarithmic
+information measures. IEEE Trans. Inf. Theory 2003, 49, 692–706.
+43.
+Amari, S.I. New developments of information geometry (26): Information geometry of convex programming
+and game theory. In Mathematical Sciences (suurikagaku); Number 605; The Science Company: Denver, CO,
+USA, 2013; pp. 65–74. (In Japanese)
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+
+entropy
+
+Article
+New Riemannian Priors on the Univariate
+Normal Model
+
+Salem Said *, Lionel Bombrun and Yannick Berthoumieu
+
+Groupe Signal et Image, CNRS Laboratoire IMS, Institut Polytechnique de Bordeaux, Université de Bordeaux,
+UMR 5218, Talence, 33405, France; E-Mails: lionel.bombrun@u-bordeaux.fr (L.B.);
+yannick.berthoumieu@u-bordeaux.fr (Y.B.)
+*
+E-Mail: salem.said@u-bordeaux.fr; Tel.:+33-(0)5-4000-6185.
+
+Received: 17 April 2014; in revised form: 23 June 2014 / Accepted: 9 July 2014 /
+Published: 17 July 2014
+
+Abstract: The current paper introduces new prior distributions on the univariate normal model, with
+the aim of applying them to the classification of univariate normal populations. These new prior
+distributions are entirely based on the Riemannian geometry of the univariate normal model, so that
+they can be thought of as “Riemannian priors”. Precisely, if {pθ; θ ∈ Θ} is any parametrization of the
+univariate normal model, the paper considers prior distributions G( ¯θ, γ) with hyperparameters ¯θ ∈ Θ
+and γ > 0, whose density with respect to Riemannian volume is proportional to exp(−d²(θ, ¯θ)/2γ²),
+where d²(θ, ¯θ) is the square of Rao’s Riemannian distance. The distributions G( ¯θ, γ) are termed
+Gaussian distributions on the univariate normal model. The motivation for considering a distribution
+G( ¯θ, γ) is that this distribution gives a geometric representation of a class or cluster of univariate
+normal populations. Indeed, G( ¯θ, γ) has a unique mode ¯θ (precisely, ¯θ is the unique Riemannian
+center of mass of G( ¯θ, γ), as shown in the paper), and its dispersion away from ¯θ is given by γ.
+Therefore, one thinks of members of the class represented by G( ¯θ, γ) as being centered around ¯θ
+and lying within a typical distance determined by γ. The paper defines rigorously the Gaussian
+distributions G( ¯θ, γ) and describes an algorithm for computing maximum likelihood estimates of
+their hyperparameters. Based on this algorithm and on the Laplace approximation, it describes
+how the distributions G( ¯θ, γ) can be used as prior distributions for Bayesian classification of
+large univariate normal populations.
+In a concrete application to texture image classification,
+it is shown that this leads to an improvement in performance over the use of conjugate priors.
+
+Keywords: Fisher information; Riemannian metric; prior distribution; univariate normal distribution;
+image classification
+
+1. Introduction
+
+In this paper, a new class of prior distributions is introduced on the univariate normal model. The
+new prior distributions, which will be called Gaussian distributions, are based on the Riemannian
+geometry of the univariate normal model. The paper introduces these new distributions, uncovers
+some of their fundamental properties and applies them to the problem of the classification of univariate
+normal populations. It shows that, in the context of a real-life application to texture image classification,
+the use of these new prior distributions leads to improved performance in comparison with the use of
+more standard conjugate priors.
+To motivate the introduction of the new prior distributions, considered in the following, recall
+some general facts on the Riemannian geometry of parametric models.
+In information geometry [1], it is well known that a parametric model {pθ; θ ∈ Θ}, where Θ ⊂ R^p,
+can be equipped with a Riemannian geometry, determined by Fisher’s information matrix, say I(θ).
+
+Entropy 2014, 16, 4015–4031; doi:10.3390/e16074015
+www.mdpi.com/journal/entropy
+
+Indeed, assuming I(θ) is strictly positive definite, for each θ ∈ Θ, a Riemannian metric on Θ is defined
+by:
+
+\[ ds^2(\theta) = \sum_{i,j=1}^{p} I_{ij}(\theta)\, d\theta_i\, d\theta_j \tag{1} \]
+
+The fact that the length element Equation (1) is invariant to any change of parametrization was realized
+by Rao [2], who was the first to propose the application of Riemannian geometry in statistics.
+Once the Riemannian metric Equation (1) is introduced, the whole machinery of Riemannian
+geometry becomes available for application to statistical problems relevant to the parametric model
+{pθ; θ ∈ Θ}. This includes the notion of Riemannian distance between two distributions, pθ and pθ′,
+which is known as Rao’s distance, say d(θ, θ′), the notion of Riemannian volume, which is exactly the
+same as Jeffreys prior [3], and the notion of Riemannian gradient, which can be used in numerical
+optimization and coincides with the so-called natural gradient of Amari [4].
+It is quite natural to apply Rao’s distance to the problem of classifying populations that belong to
+the parametric model {pθ; θ ∈ Θ}. In the case where this parametric model is the univariate normal
+model, this approach to classification is implemented in [5]. For more general parametric models,
+beyond the univariate normal model, similar applications of Rao’s distance to problems of image
+segmentation and statistical tests can be found in [6–8].
+The idea of [5] is quite elegant. In general, it requires that some classes {SL; L = 1, . . . , C} (based
+on a learning sequence) have been identified with “centers” ¯θL ∈ Θ. Then, in order to assign a test
+population, given by the parameter θt, to a class L∗, it is proposed to choose the L∗ that minimizes the squared Rao’s
+distance d²(θt, ¯θL) over L = 1, . . . , C. In the specific context of the classification of univariate normal
+populations [5], this leads to the introduction of hyperbolic Voronoi diagrams.
+The present paper is also concerned with the case where the parametric model {pθ; θ ∈ Θ} is a
+univariate normal model. It starts from the idea that a class SL should be identified not only with
+a center ¯θL, as in [5], but also with a kind of “variance”, say γ², which will be called a dispersion
+parameter. Accordingly, assigning a test population given by the parameter θt to a class L should be
+based on a tradeoff between the square of Rao’s distance d²(θt, ¯θL) and the dispersion parameter γ².
+Of course, this idea has a strong Bayesian flavor. It proposes to give more “confidence” to classes
+that have a smaller dispersion parameter. Thus, in order to implement it, in a concrete way, the paper
+starts by introducing prior distributions on the univariate normal model, which it calls Gaussian
+distributions. By definition, a Gaussian distribution G( ¯θ, γ²) has a probability density function, with
+respect to Riemannian volume, given by:
+
+\[ p(\theta \mid \bar{\theta}, \gamma) \propto \exp\left( -\frac{d^2(\theta, \bar{\theta})}{2\gamma^2} \right) \tag{2} \]
+
+Given this definition of a Gaussian distribution (which is developed in detail in Section 3),
+classification of univariate normal populations can be carried out by associating to each class SL of
+univariate normal populations a Gaussian distribution G( ¯θL, γ_L²) and by assigning any test population
+with parameter θt to the class L∗, which maximizes the likelihood p(θt| ¯θL, γL), over L = 1, . . . , C.
+The present paper develops in a rigorous way the general approach to the classification of
+univariate normal populations, which has just been described. It proceeds as follows.
+Section 2, which is basically self-contained, provides the concepts, regarding the Riemannian
+geometry of the univariate normal model, which will be used throughout the paper.
+Section 3 introduces Gaussian distributions on the univariate normal model and uncovers some
+of their general properties. In particular, Section 3.2 gives a Riemannian gradient descent
+algorithm for computing maximum likelihood estimates of the parameters ¯θ and γ of a Gaussian
+distribution.
+Section 4 states the general approach to classification of univariate normal populations proposed
+in this paper. It deals with two problems: (i) given a class S of univariate normal populations Si, how
+to fit a Gaussian distribution G(¯z, γ) to this class; and (ii) given a test univariate normal population St
+and a set of classes {SL, L = 1, . . . , C}, how to assign St to a suitable class SL∗.
+In the present paper, the chosen approach for resolving these two problems is marginalized
+likelihood estimation, in the asymptotic framework where each univariate normal population contains
+a large number of data points. In this asymptotic framework, the Laplace approximation plays a
+major role [9]. In particular, it reduces the first problem, of fitting a Gaussian distribution to a class
+of univariate normal populations, to the problem of maximum likelihood estimation, covered in
+Section 3.2.
+The final result of Section 4 is the decision rule Equation (37). This generalizes the one developed
+in [5] and already explained above, by taking into account the dispersion parameter γ, in addition to
+the center ¯θ, for each class.
+In Section 5, the formalism of Section 4 is applied to texture image classification, using the
+VisTeX image database [10]. This database is used to compare the performance obtained using
+Gaussian distributions, as in Section 4, to that obtained using conjugate prior distributions. It is
+shown that Gaussian distributions, proposed in the current paper, lead to a significant improvement
+in performance.
+Before going on, it should be noted that probability density functions of the form (2), on general
+Riemannian manifolds, were considered by Pennec in [11]. However, they were not specifically used
+as prior distributions, but rather as a representation of uncertainty in medical image analysis and
+directional or shape statistics.
+
+2. Riemannian Geometry of the Univariate Normal Model
+
+The current section presents in a self-contained way the results on the Riemannian geometry of
+the univariate normal model, which are required for the remainder of the paper. Section 2.1 recalls
+the fact that the univariate normal model can be reparametrized, so that its Riemannian geometry is
+essentially the same as that of the Poincaré upper half plane. Section 2.2 uses this fact to give analytic
+formulas for distance, geodesics and integration on the univariate normal model. Finally, Section 2.3
+presents, in general form, the Riemannian gradient descent algorithm.
+
+2.1. Derivation of the Fisher Metric
+
+This paper considers the Riemannian geometry of the univariate normal model, as based on the
+Fisher metric (1). To be precise, the univariate normal model has a two-dimensional parameter space
+Θ = {θ = (μ, σ)|μ ∈ R , σ > 0}, and is given by:
+
+\[ p_\theta(x) = |2\pi\sigma^2|^{-1/2} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \tag{3} \]
+
+where each pθ is a probability density function with respect to the Lebesgue measure on R. The Fisher
+information matrix, obtained from Equation (3), is the following:
+
+\[ I(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\ 0 & \dfrac{2}{\sigma^2} \end{pmatrix} \]
+
+As in [12], this expression can be made more symmetric by introducing the parametrization z = (x, y),
+where x = μ/√2 and y = σ. This yields the Fisher information matrix:
+
+\[ I(z) = 2 \times \begin{pmatrix} \dfrac{1}{y^2} & 0 \\ 0 & \dfrac{1}{y^2} \end{pmatrix} \]
+
+It is suitable to drop the factor two in this expression and introduce the following Riemannian metric
+for the univariate normal model,
+
+\[ ds^2(z) = \frac{dx^2 + dy^2}{y^2} \tag{4} \]
+
+This is essentially the same as the Fisher metric (up to the factor two) and will be considered throughout
+the following. The resulting Rao’s distance and Riemannian geometry are given in the following
+paragraph.
+
+2.2. Distance, Geodesics and Volume
+
+The Riemannian metric (4), obtained in the last paragraph, happens to be a very well-known
+object in differential geometry. Precisely, the parameter space H = {z = (x, y)|y > 0} equipped with
+the metric (4) is known as the Poincaré upper half plane and is a basic model of a two-dimensional
+hyperbolic space [13].
+Rao’s distance between two points z1 = (x1, y1) and z2 = (x2, y2) in H can be expressed as follows
+(for results in the present paragraph, see [13], or any suitable reference on hyperbolic geometry),
+
+\[ d(z_1, z_2) = \operatorname{acosh}\left( 1 + \frac{(x_1 - x_2)^2 + (y_1 - y_2)^2}{2 y_1 y_2} \right) \tag{5} \]
+
+where acosh denotes the inverse hyperbolic cosine.
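+
+As a minimal sketch (not part of the original paper), Equation (5) can be evaluated directly in Python. The function names `to_half_plane` and `rao_distance` are hypothetical, chosen for illustration, and points of H are represented as (x, y) tuples with x = μ/√2 and y = σ, as in Section 2.1.
+
```python
import math


def to_half_plane(mu, sigma):
    """Map a normal population (mu, sigma) to z = (x, y), x = mu / sqrt(2), y = sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (mu / math.sqrt(2.0), sigma)


def rao_distance(z1, z2):
    """Rao's distance of Equation (5) between two points of the upper half plane."""
    (x1, y1), (x2, y2) = z1, z2
    arg = 1.0 + ((x1 - x2) ** 2 + (y1 - y2) ** 2) / (2.0 * y1 * y2)
    return math.acosh(arg)


# Two populations with the same mean and standard deviations 1 and e^2.
za = to_half_plane(0.0, 1.0)
zb = to_half_plane(0.0, math.exp(2.0))
d_ab = rao_distance(za, zb)
```
+
+For two points on the y-axis, Equation (5) reduces to the absolute difference of the logarithms of their y-coordinates, so `d_ab` should equal 2 up to rounding.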
+Starting from z1, in any given direction, it is possible to draw a unique geodesic ray γ : R+ → H.
+This is a curve having the property that γ(0) = z1 and, for any t ∈ R+, if γ(t) = z2 then d(z1, z2) = t.
+In other words, the length of γ between z1 and z2 is equal to the distance between z1 and z2.
+The equation of a geodesic ray starting from z ∈ H is conveniently written down in complex
+notation (that is, by treating points of H as complex numbers). To begin, consider the case of z = i
+(which stands for x = 0 and y = 1). The geodesic in the direction making an angle ψ with the y-axis is
+the curve,
+
+\[ \gamma_i(t) = \frac{e^{t/2}\cos(\psi/2)\, i - e^{-t/2}\sin(\psi/2)}{e^{t/2}\sin(\psi/2)\, i + e^{-t/2}\cos(\psi/2)} \tag{6} \]
+
+In particular, ψ = 0 gives γi(t) = e^t i and ψ = π gives γi(t) = e^{−t} i. If ψ is not a multiple of π, γi(t)
+traces out a portion of a circle, which becomes parallel to the y-axis in the limit t → ∞. For a general starting
+point z, the geodesic ray in the direction making an angle ψ with the y-axis can be written:
+
+\[ \gamma_z(t, \psi) = x + y\, \gamma_i(t/y, \psi) \tag{7} \]
+
+where z = (x, y) and γi(t, ψ) is given by Equation (6). A more detailed treatment of Rao’s distance (5)
+and of geodesics in the Poincaré upper half plane, along with applications in image clustering, can be
+found in [5].
+The Riemannian volume (or area, since H is of dimension 2) element corresponding to the
+Riemannian metric (4) is dA(z) = dxdy/y2. Accordingly, the integral of a function f : H → R with
+respect to dA is given by:
+
+\[ \int_H f(z)\, dA(z) = \int_0^{+\infty} \int_{-\infty}^{+\infty} \frac{f(x, y)}{y^2}\, dx\, dy \tag{8} \]
+In many cases, the analytic computation of this integral can be greatly simplified by using polar
+coordinates (r, ϕ) defined with respect to some “origin” ¯z ∈ H. Polar coordinates (r, ϕ) map to the
+point z(r, ϕ) given by:
+
+\[ z(r, \varphi) = \gamma_{\bar{z}}\left( r, \frac{\pi}{2} - \varphi \right) \tag{9} \]
+
+where the right-hand side is defined according to Equation (7). The polar coordinates (r, ϕ) do indeed
+define a global coordinate system of H, in the sense that the map that takes a complex number
+re^{iϕ} to the point z(r, ϕ) in H is a diffeomorphism. The standard notation from differential geometry is:
+
+\[ \exp_{\bar{z}}\left( r e^{i\varphi} \right) = z(r, \varphi) \tag{10} \]
+
+In these coordinates, the Riemannian metric (4) takes on the form:
+
+\[ ds^2(z) = dr^2 + \sinh^2(r)\, d\varphi^2 \tag{11} \]
+
+The integral Equation (8) can be computed in polar coordinates using the formula [13],
+
+\[ \int_H f(z)\, dA(z) = \int_0^{2\pi} \int_0^{+\infty} (f \circ \exp_{\bar{z}})\left( r e^{i\varphi} \right) \sinh(r)\, dr\, d\varphi \tag{12} \]
+
+where exp¯z was defined in Equation (10) and ◦ denotes composition. This is particularly useful when
+f ◦ exp¯z does not depend on ϕ.
+
+2.3. Riemannian Gradient Descent
+
+In this paper, the problem of minimizing, or maximizing, a differentiable function f : H → R will
+play a central role. A popular way of handling the minimization of a differentiable function defined on
+a Riemannian manifold (such as H) is through Riemannian gradient descent [14].
+Here, the definition of Riemannian gradient is reviewed, and a generic description of Riemannian
+gradient descent is provided. The Riemannian gradient of f is here defined as a mapping ∇ f : H → C
+with the following property:
+
+\[ \frac{1}{y^2} \times \operatorname{Re}\{ \nabla f(z)\, h^* \} = \operatorname{Re}\{ df(z)\, h^* \} \tag{13} \]
+
+for any complex number h, where Re denotes the real part, ∗ denotes conjugation and d f is the
+“derivative”, d f = (∂ f /∂x) + (∂ f /∂y) i. For example, if f (z) = y, then d f = i, and it follows from Equation (13) that
+∇ f (z) = y² i.
+Riemannian gradient descent consists in following the direction of −∇ f at each step, with the
+length of the step (in other words, the step size) being determined by the user. The generic algorithm
+is, up to some variations, the following:
+
+INPUT       ˆz ∈ H                       % Initial guess
+WHILE       ∥∇ f (ˆz)∥ > ε               % ε ≈ 0 machine precision
+            ˆz ← expˆz (−λ∇ f (ˆz))      % λ > 0 step size, depends on ˆz
+END WHILE
+OUTPUT      ˆz                           % near critical point of f
+
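+Here, in the condition for the while loop, ∥∇ f (zk)∥ is the Riemannian norm of the gradient
+∇ f (zk), where zk = (xk, yk) denotes the current estimate ˆz. In other words,
+
+\[ \|\nabla f(z_k)\|^2 = \frac{1}{y_k^2} \times \operatorname{Re}\{ \nabla f(z_k)\, \nabla f(z_k)^* \} \]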
+
+Just like a classical gradient descent algorithm, the above Riemannian gradient descent consists in
+following the direction of the negative gradient −∇ f (ˆz), in order to define a new estimate. This is
+repeated as long as the gradient is appreciably nonzero, in the sense of the loop condition.
+The generic algorithm described above has no guarantee of convergence. Convergence and
+behavior near limit points depend on the function f, on the initialization of the algorithm and on the
+step sizes λ. For these aspects, the reader may consult [14] (Chapter 4).
+
+3. Riemannian Prior on the Univariate Normal Model
+
+The current section introduces new prior distributions on the univariate normal model. These may
+be referred to as “Riemannian priors”, since they are entirely based on the Riemannian geometry of
+this model, and will also be called “Gaussian distributions”, when viewed as probability distributions
+on the Poincaré half plane.
+Here, Section 3.1 defines in a rigorous way Gaussian distributions on H (based on the intuitive
+Formula (2)). A Gaussian distribution G(¯z, γ) has two parameters, ¯z ∈ H, called the center of mass, and
+γ > 0, called the dispersion parameter. Section 3.2 uses the Riemannian gradient descent algorithm of
+Section 2.3 to provide an algorithm for computing maximum likelihood estimates of ¯z and γ. Finally,
+Section 3.3 proves that ¯z is the Riemannian center of mass or Karcher mean of the distribution G(¯z, γ)
+(historically, it is more correct to speak of the “Fréchet mean”, since this concept was proposed by
+Fréchet in 1948 [15]), and that γ is uniquely related to the mean squared Rao’s distance from ¯z.
+The reader may wish to note that the results of Section 3.3 are not used in the following, so this
+paragraph may be skipped on a first reading.
+
+3.1. Gaussian Distributions on H
+
+A Gaussian distribution G(¯z, γ) on H is a probability distribution with the following probability
+density function:
+
+\[ p(z \mid \bar{z}, \gamma) = \frac{1}{Z(\gamma)} \exp\left( -\frac{d^2(z, \bar{z})}{2\gamma^2} \right) \tag{14} \]
+
+Here, ¯z ∈ H is called the center of mass and γ > 0 the dispersion parameter of the distribution G(¯z, γ).
+The squared distance d2(z, ¯z) refers to Rao’s distance (5). The probability density function (14) is
+understood with respect to the Riemannian volume element dA(z). In other words, the normalization
+constant Z(γ) is given by:
+
+\[ Z(\gamma) = \int_H f(z)\, dA(z), \qquad f(z) = \exp\left( -\frac{d^2(z, \bar{z})}{2\gamma^2} \right) \]
+
+Using polar coordinates, as in Equation (12), it is possible to calculate this integral explicitly. To do so,
+let (r, ϕ) be polar coordinates whose origin is ¯z. Then, d²(z, ¯z) = r² when z = z(r, ϕ), as in Equation (9). It follows that:
+
+\[ (f \circ \exp_{\bar{z}})(r, \varphi) = \exp\left( -\frac{r^2}{2\gamma^2} \right) \tag{15} \]
+
+According to Equation (12), the integral Z(γ) reduces to:
+
+\[ Z(\gamma) = \int_0^{2\pi} \int_0^{+\infty} \exp\left( -\frac{r^2}{2\gamma^2} \right) \sinh(r)\, dr\, d\varphi \]
+
+which is readily calculated,
+
+\[ Z(\gamma) = 2\pi \times \sqrt{\frac{\pi}{2}}\, \gamma \times e^{\gamma^2/2} \times \operatorname{erf}\left( \frac{\gamma}{\sqrt{2}} \right) \tag{16} \]
+
+where erf denotes the error function. Formula (16) completes the definition of the Gaussian distribution
+G(¯z, γ). This definition is the same as suggested in [11], with the difference that, in the present work, it
+has been possible to compute exactly the normalization constant Z(γ).
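+
+The closed form (16) can be checked numerically against the polar integral it comes from. The following is only a sketch under stated assumptions: the function names are hypothetical, the radial integral is truncated at an assumed cutoff γ² + 12γ (beyond which the integrand is negligible), and a composite Simpson rule stands in for whatever quadrature one prefers.
+
```python
import math


def z_closed_form(gamma):
    """Normalization constant Z(gamma) from Equation (16)."""
    return (2.0 * math.pi * math.sqrt(math.pi / 2.0) * gamma
            * math.exp(gamma ** 2 / 2.0) * math.erf(gamma / math.sqrt(2.0)))


def z_numeric(gamma, n=20000):
    """Z(gamma) from the polar integral, using a composite Simpson rule (n even)."""
    r_max = gamma ** 2 + 12.0 * gamma  # assumed cutoff for the radial integral
    h = r_max / n

    def integrand(r):
        return math.exp(-r * r / (2.0 * gamma ** 2)) * math.sinh(r)

    total = integrand(0.0) + integrand(r_max)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * integrand(k * h)
    # The angular integral contributes a factor 2*pi.
    return 2.0 * math.pi * total * h / 3.0


relative_errors = [abs(z_numeric(g) / z_closed_form(g) - 1.0) for g in (0.5, 1.0, 2.0)]
```
+
+The cutoff is chosen because the integrand behaves like a Gaussian bump centered at r = γ² with width γ; a different cutoff or quadrature rule would serve equally well.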
+It is noteworthy that the normalization constant Z(γ) depends only on γ and not on ¯z. This shows
+that the shape of the probability density function (14) does not depend on ¯z, which only plays the role
+of a location parameter. At a deeper mathematical level, this reflects the fact that H is a homogeneous
+Riemannian space [13].
+
+The probability density function (14) bears a clear resemblance to the usual Gaussian (or normal)
+probability density function. Indeed, both are proportional to the exponential of minus the “squared
+distance”, but in one case, the distance is interpreted as Euclidean distance and, in the other (that of
+Equation (14)), as Rao’s distance.
+
+3.2. Maximum Likelihood Estimation of ¯z and γ
+
+Consider the problem of computing maximum likelihood estimates of the parameters ¯z and γ of
+the Gaussian distribution G(¯z, γ), based on independent samples {zi; i = 1, . . . , N} from this distribution. Given
+the expression (14) of the density p(z|¯z, γ), the log-likelihood function ℓ(¯z, γ) can be written,
+
+\[ \ell(\bar{z}, \gamma) = -N \log\{Z(\gamma)\} - \frac{1}{2\gamma^2} \sum_{i=1}^{N} d^2(z_i, \bar{z}) \tag{17} \]
+
+Since ¯z only appears in the second term, the maximum likelihood estimate of ¯z, say ˆz, can be computed
+first. It is given by the minimization problem:
+
+\[ \hat{z} = \operatorname{argmin}_{z \in H} \; \frac{1}{2} \sum_{i=1}^{N} d^2(z_i, z) \tag{18} \]
+
+In other words, the maximum likelihood estimate ˆz minimizes the sum of squared Rao distances to the
+samples zi. This exhibits ˆz as the Riemannian center of mass, also called the Karcher or the Fréchet
+mean [16], of the samples zi.
+The notion of Riemannian center of mass is currently a widely popular one in signal and image
+processing, with applications ranging from blind source separation and radar signal processing [17,18]
+to shape and motion analysis [19,20]. The definition of Gaussian distributions, proposed in the
+present paper, shows how the notion of Riemannian center of mass is related to maximum likelihood
+estimation, thereby giving it a statistical foundation.
+An original result, due to Cartan and cited in [16], states that ˆz, as defined in
+Equation (18), exists and is unique, since H, with the Riemannian metric (4), has constant negative
+curvature. Here, ˆz is computed using Riemannian gradient descent, as described in Section 2.3. The
+cost function f to be minimized is given by (the factor N^{−1} is conventional),
+
+\[ f(z) = \frac{1}{2N} \sum_{i=1}^{N} d^2(z_i, z) \tag{19} \]
+
+Its Riemannian gradient ∇ f (z) is easily found by noting the following fact. Let fi(z) = (1/2)d²(z, zi).
+Then, the Riemannian gradient of this function is (see [21] (page 407)),
+
+\[ \nabla f_i(z) = -\log_z(z_i) \tag{20} \]
+
+where logz : H → C is the inverse of expz : C → H, so that −∇ fi(z) points from z towards zi. It follows from Equation (20) that,
+
+\[ \nabla f(z) = -\frac{1}{N} \sum_{i=1}^{N} \log_z(z_i) \tag{21} \]
+
+The analytic expression of logz, for any z ∈ H, will be given below (see Equation (23)).
+Here, the gradient descent algorithm for computing ˆz is described. This algorithm uses a constant
+step size λ, which is fixed manually.
+
+Once the maximum likelihood estimate ˆz has been computed, using the gradient descent
+algorithm, the maximum likelihood estimate of γ, say ˆγ, is found by solving the equation:
+
+\[ F(\gamma) = \frac{1}{N} \sum_{i=1}^{N} d^2(z_i, \hat{z}) \quad \text{where} \quad F(\gamma) = \gamma^3 \times \frac{d}{d\gamma} \log\{Z(\gamma)\} \tag{22} \]
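+
+Equation (22) is a scalar equation in γ, so it can be solved by any one-dimensional root finder. A minimal sketch follows, assuming (as an assumption of this sketch, not a result of the paper) that F is increasing in γ so that bisection applies. Differentiating the logarithm of Equation (16) gives d/dγ log Z(γ) = 1/γ + γ + √(2/π) exp(−γ²/2)/erf(γ/√2), which is what the hypothetical function `F` below implements; `solve_gamma` is likewise a name chosen for illustration.
+
```python
import math


def F(gamma):
    """F(gamma) = gamma^3 * d/dgamma log Z(gamma), with Z taken from Equation (16)."""
    ratio = (math.sqrt(2.0 / math.pi) * math.exp(-gamma ** 2 / 2.0)
             / math.erf(gamma / math.sqrt(2.0)))
    return gamma ** 2 + gamma ** 4 + gamma ** 3 * ratio


def solve_gamma(mean_sq_dist, lo=1e-6, hi=1.0, iterations=200):
    """Bisection for F(gamma) = mean_sq_dist (assumes F is increasing)."""
    if mean_sq_dist <= 0.0:
        raise ValueError("mean squared distance must be positive")
    while F(hi) < mean_sq_dist:  # enlarge the bracket until it contains the root
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if F(mid) < mean_sq_dist:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Round trip: recover a known dispersion parameter from its own F value.
gamma_hat = solve_gamma(F(0.7))
```
+
+For small γ the function behaves like 2γ², which matches the mean squared distance of a two-dimensional Euclidean Gaussian and gives a quick sanity check of the implementation.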
+
+The gradient descent algorithm for computing ˆz is the following,
+
+INPUT       {z1, . . . , zN}             % N independent samples from G(¯z, γ)
+            ˆz ∈ H                       % Initial guess
+WHILE       ∥∇ f (ˆz)∥ > ε               % ε ≈ 0 machine precision
+            ˆz ← expˆz (−λ∇ f (ˆz))      % ∇ f (ˆz) given by Equation (21)
+                                         % step size λ is constant
+END WHILE
+OUTPUT      ˆz                           % near Riemannian center of mass
+
+Application of Formula (21) requires computation of logˆz(zi) for i = 1, . . . , N. Fortunately, this
+can be done analytically as follows. In general, for ˆz = ( ¯x, ¯y),
+
+\[ \log_{\hat{z}}(z) = \bar{y}\, \log_i\left( \frac{z - \bar{x}}{\bar{y}} \right) \tag{23} \]
+
+where logi is found by inverting Equation (6). Precisely,
+
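+\[ \log_i(z) = r e^{i\varphi} \tag{24} \]
+
+where, for z = (x, y) with x ≠ 0,
+
+\[ r = \operatorname{acosh}\left( 1 + \frac{x^2 + (y - 1)^2}{2y} \right) \]
+
+and:
+
+\[ \cos(\varphi) = \frac{x}{y \sinh(r)}, \qquad \sin(\varphi) = \frac{\cosh(r) - y^{-1}}{\sinh(r)} \]
+
+and, for z = (0, y),
+
+\[ \log_i(z) = \ln(y)\, i \]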
+
+with ln denoting the natural logarithm.
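+
+The pieces above can be assembled into a short Python sketch of the center-of-mass computation. This is an illustration under stated assumptions, not the authors' implementation: points of H are Python complex numbers x + iy, tangent vectors are complex numbers in the same coordinates (Riemannian norm |h|/y), `exp_i` evaluates Equation (6) with ψ = π/2 − ϕ, `exp_map` is taken to be the inverse of Equation (23), and all function names, the step size 0.5 and the iteration cap are hypothetical choices. Each update moves along the average of logz(zi), which is the direction in which the cost (19) decreases.
+
```python
import cmath
import math


def log_i(w):
    """Equation (24): logarithm map at the base point i."""
    x, y = w.real, w.imag
    if x == 0.0:
        return complex(0.0, math.log(y))
    r = math.acosh(1.0 + (x * x + (y - 1.0) ** 2) / (2.0 * y))
    s = math.sinh(r)
    if s == 0.0:  # w coincides with i up to rounding
        return 0j
    return r * complex(x / (y * s), (math.cosh(r) - 1.0 / y) / s)


def log_map(z, w):
    """Equation (23): log_z(w) for z = x + iy."""
    return z.imag * log_i((w - z.real) / z.imag)


def exp_i(h):
    """Equation (6) evaluated at t = |h|, psi = pi/2 - arg(h)."""
    r = abs(h)
    if r == 0.0:
        return 1j
    psi = math.pi / 2.0 - cmath.phase(h)
    c, s = math.cos(psi / 2.0), math.sin(psi / 2.0)
    grow, decay = math.exp(r / 2.0), math.exp(-r / 2.0)
    return (grow * c * 1j - decay * s) / (grow * s * 1j + decay * c)


def exp_map(z, h):
    """Inverse of log_map: exp_z(h) = x + y * exp_i(h / y)."""
    return z.real + z.imag * exp_i(h / z.imag)


def karcher_mean(samples, step=0.5, eps=1e-10, max_iter=1000):
    """Constant-step Riemannian gradient descent on the cost (19)."""
    z = samples[0]  # initial guess (assumption: the first sample)
    for _ in range(max_iter):
        # Average of log_z(z_i): the descent direction for the cost (19).
        direction = sum(log_map(z, w) for w in samples) / len(samples)
        if abs(direction) / z.imag <= eps:  # Riemannian norm of the direction
            break
        z = exp_map(z, step * direction)
    return z


# Two points placed symmetrically about i on the y-axis.
mean = karcher_mean([complex(0.0, math.e), complex(0.0, 1.0 / math.e)])
```
+
+For the two symmetric points in the usage example, the center of mass should be the point i, which gives a simple check of both maps. As noted in Section 2.3, convergence is not guaranteed for an arbitrary step size, so the constant step used here may need tuning on other data.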
+
+3.3. Significance of ¯z and γ
+
+The parameters ¯z and γ of a Gaussian distribution G(¯z, γ) have been called the center of mass
+and the dispersion parameter. In the present paragraph, it is proven that,
+
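+\[ \bar{z} = \operatorname{argmin}_{z \in H} \; \frac{1}{2} \int_H d^2(z', z)\, p(z' \mid \bar{z}, \gamma)\, dA(z') \tag{25} \]
+
+and also that:
+
+\[ F(\gamma) = \int_H d^2(z', \bar{z})\, p(z' \mid \bar{z}, \gamma)\, dA(z') \tag{26} \]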
+
+where F(γ) was defined in Equation (22) and p(z′|¯z, γ) is the probability density function of G(¯z, γ),
+given in Equation (14).
+
+Note that Equations (25) and (26) are asymptotic versions of Equations (18) and (22). Indeed,
+Equations (25) and (26) can be written:
+
+\[ \bar{z} = \operatorname{argmin}_{z \in H} \; \frac{1}{2} E_{\bar{z},\gamma}\, d^2(z', z), \qquad F(\gamma) = E_{\bar{z},\gamma}\, d^2(z, \bar{z}) \tag{27} \]
+
+where E¯z,γ denotes the expectation with respect to G(¯z, γ), and the expectation is carried out on the
+variable z′ in the first formula. Now, these two formulae are the same as Equations (18) and (22), but
+with expectation instead of empirical mean.
+Note, moreover, that Equations (25) and (26) can be interpreted as follows. If z′ is distributed
+according to the Gaussian distribution G(¯z, γ), then Equation (25) states that ¯z is the unique point, out
+of all z ∈ H, which minimizes the expectation of squared Rao’s distance to z′. Moreover, Equation (26)
+states that the expectation of squared Rao’s distance between ¯z and z′ is equal to F(γ), so F(γ) is the
+least possible expected squared Rao’s distance between a point z ∈ H and z′. This interpretation
+justifies calling ¯z the center of mass of G(¯z, γ) and shows that γ is uniquely related to the expected
+dispersion, as measured by squared Rao’s distance, away from ¯z.
+In order to prove Equation (25), consider the log-likelihood function,
+
+\[ \ell(\bar{z}, \gamma; z) = -\log\{Z(\gamma)\} - \frac{1}{2\gamma^2}\, d^2(z, \bar{z}) \tag{28} \]
+
+Let fz(¯z) = (1/2)d2(z, ¯z). The score function, with respect to ¯z is, by definition,
+
+∇¯zℓ(¯z, γ; z) = ∇ fz(¯z)
+(29)
+
+where ∇¯z indicates that the Riemannian gradient (defined in Equation (13) of Section 2.3) is taken with respect to
+the variable ¯z. Under certain regularity conditions, which are here easily verified, the expectation of
+the score function is identically zero,
+E¯z,γ∇ fz(¯z) = 0
+(30)
+
+Let f (z) be defined by:
+
+\[ f(z) = E_{\bar{z},\gamma}\, f_{z'}(z) = \frac{1}{2} E_{\bar{z},\gamma}\, d^2(z', z) \]
+
+with the expectation carried out on the variable z′. Clearly, f (z) is the expression to be minimized
+in Equation (25) (or in the first formula in Equation (27), which is just the same). By interchanging
+Riemannian gradient and expectation,
+
+∇ f (¯z) = E¯z,γ∇ fz(¯z) = 0
+
+where the last equality follows from Equation (30).
+It has just been proved that ¯z is a stationary point of f (a point where the gradient is zero).
+Theorem 2.1 in [16] states that the function f has one and only one stationary point, which is moreover a
+global minimizer. This concludes the proof of Equation (25).
+The proof of Equation (26) follows exactly the same method, defining the score function with
+respect to γ and noting that its expectation is identically zero.
+
+4. Classification of Univariate Normal Populations
+
+The previous section studied Gaussian distributions on H, “as they stand”, focusing on the
+fundamental issue of maximum likelihood estimation of their parameters. The present Section
+considers the use of Gaussian distributions as prior distributions on the univariate normal model.
+The main motivation behind the introduction of Gaussian distributions is that a Gaussian
+distribution G(¯z, γ) can be used to give a geometric representation of a cluster or class of univariate
+normal populations. Recall that each point (x, y) ∈ H is identified with a univariate normal population
+with mean μ = √2 x and standard deviation σ = y. The idea is that populations belonging to the same
+cluster, represented by G(¯z, γ), should be viewed as centered on ¯z and lying within a typical distance
+determined by γ.
+In the remainder of this Section, it is shown how the maximum likelihood estimation algorithm of
+Section 3.2 can be used to fit the hyperparameters ¯z and γ to data consisting of a class S = {Si; i =
+1, . . . , K} of univariate normal populations. This is then applied to the problem of the classification
+of univariate normal populations. The whole development is based on marginalized likelihood
+estimation, as follows.
+Assume each population Si contains Ni points, Si = {sj; j = 1, . . . , Ni}, and the points sj, in any
+class, are drawn from a univariate normal distribution with mean μ and standard deviation σ. The
+focus will be on the asymptotic case where the number Ni of points in each population Si is large.
+In order to fit the hyperparameters ¯z and γ to the data S, assume moreover that the distribution
+of z = (x, y), where (x, y) = (μ/√2, σ), is a Gaussian distribution G(¯z, γ). Then, the distribution of S
+can be written in integral form:
+
+\[ p(S \mid \bar{z}, \gamma) = \prod_{i=1}^{K} \int_H p(S_i \mid z)\, p(z \mid \bar{z}, \gamma)\, dA(z) \tag{31} \]
+
+where p(z|¯z, γ) is the probability density of a Gaussian distribution G(¯z, γ), defined in Equation (14).
+Moreover, expressing p(Si|z) as a product of univariate normal distributions p(sj|z), it follows,
+
+\[ p(S \mid \bar{z}, \gamma) = \prod_{i=1}^{K} \int_H \prod_{j=1}^{N_i} p(s_j \mid z)\, p(z \mid \bar{z}, \gamma)\, dA(z) \tag{32} \]
+
+This expression, given the data S, is to be maximized over (¯z, γ). Using the Laplace approximation,
+this task is reduced to the maximum likelihood estimation problem, addressed in Section 3.2.
+The Laplace approximation will here be applied in its “basic form” [9], that is, up to terms of
+order N_i^{−1}. To do so, write each of the integrals in Equation (32), using Equation (8) of Section 2.2.
+These integrals then take on the form:
+
+\[ \int_0^{+\infty} \int_{-\infty}^{+\infty} \prod_{j=1}^{N_i} |2\pi y^2|^{-1/2} \exp\left( -\frac{\left( s_j - \sqrt{2}\, x \right)^2}{2y^2} \right) \times p(z \mid \bar{z}, \gamma) \times \frac{1}{y^2}\, dx\, dy \tag{33} \]
+
+where the univariate normal distribution p(sj|z) has been replaced by its full expression. Now, the
+product over j in this expression can be written as exp [−Ni h(x, y)], where:
+
+\[ h(x, y) = \frac{1}{2} \ln\left( 2\pi y^2 \right) + \frac{B_i^2 + V_i^2}{2y^2} \]
+
+Here, B_i² and V_i² are the squared empirical bias and the empirical variance, within population Si,
+
+\[ B_i = \hat{S}_i - \sqrt{2}\, x, \qquad V_i^2 = N_i^{-1} \sum_{j=1}^{N_i} (\hat{S}_i - s_j)^2 \]
+
+where ˆSi is the empirical mean of the population, ˆSi = N_i^{−1} ∑_{j=1}^{N_i} sj.
+The expression h(x, y) is minimized when x = ˆxi and y = ˆyi, where ˆzi = ( ˆxi, ˆyi) is the couple of
+maximum likelihood estimates of the parameters (x, y), based on the population Si.
+
+According to the Laplace approximation, the integral Equation (33) is equal to:
+
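+\[ 2\pi \left| \partial^2 h(\hat{x}_i, \hat{y}_i) \right|^{-1/2} \times \exp\left[ -N_i h(\hat{x}_i, \hat{y}_i) \right] \times p(\hat{z}_i \mid \bar{z}, \gamma) \times \frac{1}{\hat{y}_i^2} + O(N_i^{-1}) \]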
+
+where ∂²h( ˆxi, ˆyi) is the matrix of second derivatives of h, and | · | denotes the determinant. Now, since
+h is essentially (up to sign) the logarithm of p(sj|z), a direct calculation shows that ∂²h( ˆxi, ˆyi) is the same as the
+Fisher information matrix derived in Section 2.1 (where it was denoted I(z)). Thus, the first factor in
+the above expression is 2π ˆyi², and cancels out with the last factor.
+Finally, the Laplace approximation of the integral Equation (33) reads:
+
+\[ 2\pi \times \exp\left[ -N_i h(\hat{x}_i, \hat{y}_i) \right] \times p(\hat{z}_i \mid \bar{z}, \gamma) + O(N_i^{-1}) \]
+
+and the resulting approximation of the distribution of S, as given by Equation (32), can be written:
+
+\[ p(S \mid \bar{z}, \gamma) \approx \prod_{i=1}^{K} \alpha \times p(\hat{z}_i \mid \bar{z}, \gamma) \tag{34} \]
+
+where α is a constant, which does not depend either on the data or on the parameters, and p(ˆzi|¯z, γ)
+has the expression (14).
+Accepting this expression to give the distribution of the data S, conditionally on the
+hyperparameters (¯z, γ), the task of estimating these hyperparameters becomes the same as the
+maximum likelihood estimation problem, described in Section 3.2.
+In conclusion, if one assumes the populations Si belong to a single cluster or class S and wishes
+to fit the hyperparameters ¯z and γ of a Gaussian distribution representing this cluster, it is enough to
+start by computing the maximum likelihood estimates ˆxi and ˆyi for each population Si and then to
+consider these as input to the maximum likelihood estimation algorithm described in Section 3.2.
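+
+The first step of this recipe, turning each population Si into a point ˆzi of H, is simple enough to sketch in Python. The function name `population_to_point` is hypothetical; the only facts used are that the maximum likelihood estimates for a normal population are the empirical mean and the empirical standard deviation (dividing by Ni rather than Ni − 1), together with the parametrization x = μ/√2, y = σ of Section 2.1.
+
```python
import math


def population_to_point(samples):
    """Return (x_hat, y_hat) = (mean / sqrt(2), ML standard deviation)."""
    n = len(samples)
    mean = sum(samples) / n
    # Maximum likelihood variance: divide by n, not n - 1.
    variance = sum((s - mean) ** 2 for s in samples) / n
    return (mean / math.sqrt(2.0), math.sqrt(variance))


# Toy population with empirical mean 4 and ML variance 5.
z_hat = population_to_point([1.0, 3.0, 5.0, 7.0])
```
+
+The resulting points ˆzi, one per population in the class, are what would be passed to the maximum likelihood estimation procedure of Section 3.2.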
+The same reasoning just carried out, using the Laplace approximation, can be generalized to
+the problem of classification of univariate normal populations. Indeed, assume that classes {SL, L =
+1, . . . , C}, each containing some number KL of univariate normal populations, have been identified
+based on some training sequence. Using the Laplace approximation and the maximum likelihood
+estimation approach of Section 3.2, to each one of these classes, it is possible to fit hyperparameters
+(¯zL, γL) of a Gaussian distribution G(¯zL, γL) on H.
+For a test population St, the maximum likelihood rule, for deciding which of the classes SL this
+test population St belongs to, requires finding the following maximum:
+
+\[ L^* = \operatorname{argmax}_L \; p(S_t \mid \bar{z}_L, \gamma_L) \tag{35} \]
+
+and assigning the test population St to the class with label L∗. If the number of points Nt in the
+population St is large, the Laplace approximation, in the same way used above, approximates the
+maximum in Equation (35) by:
+\[ L^* = \operatorname{argmax}_L \; p(\hat{z}_t \mid \bar{z}_L, \gamma_L) \tag{36} \]
+
+where ˆzt = ( ˆxt, ˆyt) is the couple of maximum likelihood estimates computed based on the test
+population St and where p(ˆzt|¯zL, γL) is given by Equation (14). Now, writing out Equation (14), the
+decision rule becomes:
+
+\[ L^* = \operatorname{argmax}_L \left\{ -\log Z(\gamma_L) - \frac{1}{2\gamma_L^2}\, d^2(\hat z_t, \bar z_L) \right\} \tag{37} \]
+
+Entropy 2014, 16, 4015–4031
+
+Under the homoscedasticity assumption, that all of the γL are equal, this decision rule essentially
+becomes the same as the one proposed in [5], which requires St to be assigned to the “nearest” cluster,
+in terms of Rao’s distance. Indeed, if all the γL are equal, then Equation (37) is the same as,
+
+\[ L^* = \operatorname{argmin}_L \; d^2(\hat z_t, \bar z_L) \tag{38} \]
+
+This decision rule is expected to be less efficient than the one proposed in Equation (37), which also takes
+into account the uncertainty associated with each cluster, as measured by its dispersion parameter γL.
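+
+As a minimal sketch of the nearest-cluster rule in Equation (38), the following Python fragment assumes that d is the hyperbolic distance of the Poincaré half plane (Rao's distance may differ from it by a constant factor, which does not change the minimizer). The function names, the class centers and the test point are hypothetical and chosen only for illustration.
+
```python
import math

def poincare_distance(z1, z2):
    """Hyperbolic distance between z1 = (x1, y1) and z2 = (x2, y2), with y1, y2 > 0."""
    (x1, y1), (x2, y2) = z1, z2
    return math.acosh(1.0 + ((x1 - x2) ** 2 + (y1 - y2) ** 2) / (2.0 * y1 * y2))

def nearest_class(z_test, centers):
    """Return the label L minimizing d^2(z_test, centers[L]), as in Equation (38)."""
    return min(centers, key=lambda label: poincare_distance(z_test, centers[label]) ** 2)

# Hypothetical class centers (x, y) in the half plane, for illustration only.
centers = {"A": (0.0, 1.0), "B": (3.0, 1.0), "C": (0.0, 5.0)}
label = nearest_class((0.2, 1.1), centers)
```
+
+Implementing the full rule of Equation (37) would additionally require the normalizing factor Z(γ), which is defined earlier in the paper and is not reproduced in this sketch.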
+
+5. Application to Image Classification
+
+In this section, the framework proposed in Section 4, for classification of univariate normal
+populations, is applied to texture image classification using Gabor filters. Several authors have found
+that Gabor energy features are well-suited texture descriptors. In the following, consider 24 Gabor
+energy sub-bands that are the result of three scales and eight orientations. Hence, each texture image
+can be decomposed as the collection of those 24 sub-bands. For more information concerning the
+implementation, the interested reader is referred to [22].
+Starting from the VisTex database of 40 images [10] (these are displayed in Figure 1), each image
+was divided into 16 non-overlapping subimages of 128 × 128 pixels each. A training sequence was
+formed by choosing randomly eight subimages out of each image. To each subimage in the training
+sequence, a bank of 24 Gabor filters was applied. The result of applying a Gabor filter with scale s
+and orientation o to a subimage i belonging to an image L is a univariate normal population Si,s,o of
+128 × 128 points (one point for each pixel, after the filter is applied).
+
+Figure 1. Forty images of the VisTex database.
+
+These populations Si,s,o (called sub-bands) are considered independent, each one of them
+univariate normal with mean μi,s,o = √2 xi,s,o, standard deviation σi,s,o = yi,s,o and with zi,s,o =
+(xi,s,o, yi,s,o). The couple of maximum likelihood estimates for these parameters is denoted ˆzi,s,o =
+( ˆxi,s,o, ˆyi,s,o). An image L (recall, there are 40 images) contains, in each sub-band, eight populations
+Si,s,o, with which hyperparameters ¯zL,s,o and γL,s,o are associated, by applying the maximum likelihood
+estimation algorithm of Section 3.2 to the inputs ˆzi,s,o.
+If St is a test subimage, then one should begin by applying the 24 Gabor filters to it, obtaining
+independent univariate normal populations St,s,o, and then compute for each population the couple
+of maximum likelihood estimates ˆzt,s,o = ( ˆxt,s,o, ˆyt,s,o). The decision rule Equation (37) of Section 4
+requires that St should be assigned to the image L∗, which realizes the maximum:
+
+\[ L^* = \operatorname{argmax}_L \sum_{s,o} \left\{ -\log Z(\gamma_{L,s,o}) - \frac{1}{2\gamma_{L,s,o}^2}\, d^2(\hat z_{t,s,o}, \bar z_{L,s,o}) \right\} \tag{39} \]
+
+When considering the homoscedasticity assumption, i.e., γL,s,o = γs,o for all L, this decision rule
+becomes:
+\[ L^* = \operatorname{argmin}_L \sum_{s,o} d^2(\hat z_{t,s,o}, \bar z_{L,s,o}) \tag{40} \]
+
+For this concrete application, to the VisTex database, it is pertinent to compare the rate of successful
+classification (or overall accuracy) obtained using the Riemannian prior, based on the framework
+of Section 4, to that obtained using a more classical conjugate prior, i.e., a normal-inverse gamma
+distribution of the mean μ = √2 x and the standard deviation σ = y. This conjugate prior is given by:
+
+\[ p(\mu \mid \sigma, \mu_p, \kappa_p) = \frac{\sqrt{\kappa_p}}{\sigma\sqrt{2\pi}} \exp\left( -\frac{\kappa_p}{2\sigma^2} (\mu - \mu_p)^2 \right) \]
+
+with an inverse gamma prior, on σ2,
+
+\[ p(\sigma^2 \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \left( \sigma^2 \right)^{-(\alpha+1)} \exp\left( -\frac{\beta}{\sigma^2} \right) \tag{41} \]
+
+Using this conjugate prior, instead of a Riemannian prior, and following the procedure of applying the
+Laplace approximation, a different decision rule is obtained, where L∗ is taken to be the maximum of
+the following expression:
+
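+\[ \sum_{s,o} \left\{ \frac{\ln \kappa_{p,L,s,o}}{2} - \frac{\kappa_{p,L,s,o}}{2\hat y_{t,s,o}^2} \left( \sqrt{2}\,\hat x_{t,s,o} - \mu_{p,L,s,o} \right)^2 + \alpha_{L,s,o} \ln \beta_{L,s,o} - \ln \Gamma(\alpha_{L,s,o}) - 2(\alpha_{L,s,o} + 1) \ln \hat y_{t,s,o} - \frac{\beta_{L,s,o}}{\hat y_{t,s,o}^2} \right\} \tag{42} \]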
+
+where, as in Equation (39), ˆxt,s,o and ˆyt,s,o are the maximum likelihood estimates computed for the
+population St,s,o.
+Both the Riemannian and conjugate priors have been applied to the VisTex database, with half of
+the database used for training and half for testing. In the course of 100 Monte Carlo runs, a significant
+gain of about 3% is observed with the Riemannian prior compared to the conjugate prior. This is
+summarized in the following table.
+
+Prior Model                                                      Overall Accuracy
+Riemannian prior, Equation (39)                                  71.88% ± 2.16%
+Riemannian prior, homoscedasticity assumption, Equation (40)     69.06% ± 1.96%
+Conjugate prior, Equation (42)                                   68.73% ± 2.92%
+
+Recall that the overall accuracy is the ratio of the number of successfully classified subimages
+to the total number of subimages. The table shows that the use of a Riemannian prior, even under a
+homoscedasticity assumption, yields significant improvement upon the use of a conjugate prior.
+
+6. Conclusions
+
+Motivated by the problem of the classification of univariate normal populations, this paper
+introduced a new class of prior distributions on the univariate normal model. With the univariate
+normal model viewed as the Poincaré half plane H, these new prior distributions, called Gaussian
+distributions, were meant to reflect the geometric picture (in terms of Rao’s distance) that a cluster or
+class of univariate normal populations can be represented as having a center ¯z ∈ H and a “variance”
+or dispersion γ^2. Precisely, a Gaussian distribution G(¯z, γ) has a probability density function p(z),
+with respect to Riemannian volume of the Poincaré half plane, which is proportional to exp(−d^2(z, ¯z)/(2γ^2)).
+Using Gaussian distributions as prior distributions in the problem of the classification of univariate
+normal populations was shown to lead to a new, more general and efficient decision rule. This
+decision rule was implemented in a real-world application to texture image classification, where it
+led to significant improvement in performance, in comparison to decision rules obtained by using
+conjugate priors.
+The general approach proposed in this paper contains several simplifications and approximations,
+which could be improved upon in future work. First, it is possible to use different prior distributions,
+which are more geometrically rich than Gaussian distributions, to represent classes of univariate normal
+populations. For example, it may be helpful to replace Gaussian distributions that are “isotropic”, in
+the sense of having a scalar dispersion parameter γ, by non-isotropic distributions, with a dispersion
+matrix Γ (a 2 × 2 symmetric positive definite matrix). Another possibility would be to represent
+each class of univariate normal populations by a finite mixture of Gaussian distributions, instead of
+representing it by a single Gaussian distribution.
+These variants, which would allow classes with a more complex geometric structure to be taken
+into account, can be integrated in the general framework proposed in the paper, based on: (i) fitting
+each class to a prior distribution (Gaussian non-isotropic, mixture of Gaussians); and (ii) choosing, for
+a test population, the most adequate class, based on a decision rule. These two steps can be realized as
+above, through the Laplace approximation and maximum likelihood estimation, or through alternative
+techniques, based on Markov chain Monte Carlo stochastic optimization.
+In addition to generalizing the approach of this paper and improving its performance, a further
+important objective for future work will be to extend it to other parametric models, beyond univariate
+normal models. Indeed, there is an increasing number of parametric models (generalized Gaussian,
+elliptical models, etc.), whose Riemannian geometry is becoming well understood and where the
+present approach may be helpful.
+
+Author Contributions
+
+Salem Said carried out the mathematical development and specified the algorithms appearing in
+Sections 2, 3 and 4. Lionel Bombrun carried out all numerical simulations and proposed the theoretical
+development of Section 4. Yannick Berthoumieu devised the main idea of the paper, that is, the use of
+Riemannian priors as a geometric representation of a class or cluster of univariate normal populations.
+All authors have read and approved the final manuscript.
+
+Conflicts of Interest
+
+The authors declare no conflict of interest.
+
+References
+
+1. Amari, S.I.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000.
+2. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
+3. Kass, R.E. The geometry of asymptotic inference. Stat. Sci. 1989, 4, 188–234.
+4. Amari, S.I. Natural gradient works efficiently in learning. Neur. Comput. 1998, 10, 251–276.
+5. Nielsen, F.; Nock, R. Hyperbolic Voronoi diagrams made easy. 2009, arXiv:0903.3287.
+6. Lenglet, C.; Rousson, M.; Deriche, R.; Faugeras, O. Statistics on the manifold of multivariate normal distributions: Theory and application to diffusion tensor MRI processing. J. Math. Imaging Vis. 2006, 25, 423–444.
+7. Verdoolaege, G.; Scheunders, P. On the geometry of multivariate generalized Gaussian models. J. Math. Imaging Vis. 2012, 43, 180–193.
+
+8. Berkane, M.; Oden, K. Geodesic estimation in elliptical distributions. J. Multival. Anal. 1997, 63, 35–46.
+9. Erdélyi, A. Asymptotic Expansions; Dover Books: Mineola, NY, USA, 2010.
+10. MIT Vision and Modeling Group. Vision Texture. Available online: http://vismod.media.mit.edu/pub/VisTex (accessed on 10 June 2014).
+11. Pennec, X. Intrinsic statistics on Riemannian manifold: Basic tools for geometric measurements. J. Math. Imaging Vis. 2006, 25, 127–154.
+12. Atkinson, C.; Mitchell, A.F.S. Rao’s distance measure. Sankhya Ser. A 1981, 43, 345–365.
+13. Gallot, S.; Hulin, D.; Lafontaine, J. Riemannian Geometry; Springer-Verlag: Berlin, Germany, 2004.
+14. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2006.
+15. Fréchet, M. Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l’I.H.P. 1948, 10, 215–310. (In French)
+16. Afsari, B. Riemannian Lp center of mass: Existence, uniqueness and convexity. Proc. Am. Math. Soc. 2011, 139, 655–673.
+17. Manton, J.H. A centroid (Karcher mean) approach to the joint approximate diagonalisation problem: The real symmetric case. Digit. Sign. Process. 2006, 16, 468–478.
+18. Arnaudon, M.; Barbaresco, F. Riemannian medians and means with applications to RADAR signal processing. IEEE J. Sel. Top. Sign. Process. 2013, 7, 595–604.
+19. Le, H. On the consistency of procrustean mean shapes. Adv. Appl. Prob. 1998, 30, 53–63.
+20. Turaga, P.; Veeraraghavan, A.; Chellappa, R. Statistical Analysis on Stiefel and Grassmann Manifolds with Applications in Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; doi: 10.1109/CVPR.2008.4587733.
+21. Chavel, I. Riemannian Geometry: A Modern Introduction; Cambridge University Press: Cambridge, UK, 2008.
+22. Grigorescu, S.E.; Petkov, N.; Kruizinga, P. Comparison of texture features based on Gabor filter. IEEE Trans. Image Process. 2002, 11, 1160–1167.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+entropy
+
+Article
+Combinatorial Optimization with Information
+Geometry: The Newton Method
+
+Luigi Malagò 1 and Giovanni Pistone 2,*
+
+1 Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico, 39/41, 20135 Milano, Italy;
+E-Mail: malago@di.unimi.it
+2 de Castro Statistics, Collegio Carlo Alberto, Via Real Collegio 30, 10024 Moncalieri, Italy
+*
+E-Mail: giovanni.pistone@carloalberto.org; Tel.: +39-011-670-5033; Fax: +39-011-670-5082.
+
+Received: 31 March 2014; in revised form: 10 July 2014 / Accepted: 11 July 2014 /
+Published: 28 July 2014
+
+Abstract: We discuss the use of the Newton method in the computation of max(p ↦ Ep [ f ]), where
+p belongs to a statistical exponential family on a finite state space. In a number of papers, the authors
+have applied first order search methods based on information geometry. Second order methods
+have been widely used in optimization on manifolds, e.g., matrix manifolds, but appear to be new
+in statistical manifolds. These methods require the computation of the Riemannian Hessian in a
+statistical manifold. We use a non-parametric formulation of information geometry in view of further
+applications in the continuous state space cases, where the construction of a proper Riemannian
+structure is still an open problem.
+
+Keywords: statistical manifold; Riemannian Hessian; combinatorial optimization; Newton method
+
+1. Introduction
+
+In this paper, statistical exponential families [1] are thought of as differentiable manifolds along
+the approach called information geometry [2] or the exponential statistical manifold [3]. Specifically,
+our aim is to discuss optimization on statistical manifolds using the Newton method, as is suggested
+in ([4] (Ch. 5 and 6)); see also the monograph [5]. This method is based on classical Riemannian
+geometry [6], but here, we put our emphasis on coordinate-free differential geometry; see [7,8].
+We mainly refer to the above-mentioned references [2,4], with one notable exception in the
+description of the tangent space. Our manifold will be an exponential family EV of positive densities,
+V being a vector space of sufficient statistics. Given a one-dimensional statistical model p(t) ∈ EV,
+t ∈ I, we define its velocity at time t to be its Fisher score s(t) = (d/dt) ln p(t) [9]. The Fisher score s(t)
+is a random variable with zero expectation with respect to p(t), Ep(t) [s(t)] = 0. Because of that, the
+tangent space at p ∈ EV is a vector space of random variables with zero expectation at p. A vector field
+is a mapping from p to a random variable V(p), such that for all p ∈ E, the random variable V(p) is
+centered at p, Ep [V(p)] = 0. In other words, each point of the manifold has a different tangent space,
+and this tangent space can be used as a non-parametric model space of the manifold. In this formalism,
+a vector field is a mapping from densities to centered random variables, that is, it is what in statistics is
+called a pivot of the statistical model. To avoid confusion with the product of random variables, we
+do not use the standard notation for the action of a vector field on a real function. This approach is
+possibly unusual in differential geometry, but it is fully natural from the statistical point of view, where
+the Fisher score has a central place. Moreover, this approach scales nicely from the finite state space to
+the general state space; see the discussion in [9] and the review in [3].
+A complete construction of the geometric framework based on the idea of using the Fisher scores
+as elements of the tangent bundle has been actually worked out. In this paper, we go on by considering
+a second order geometry based on the non-parametric settings.
+
+Entropy 2014, 16, 4260–4289; doi:10.3390/e16084260
+www.mdpi.com/journal/entropy
+
+Our main motivation for such a geometrical construction is its application to combinatorial
+optimization using exponential families, whose first order version was developed in [10–14]. We give
+here an illustration of the methods in the following toy example.
+Consider the function f (x1, x2) = a0 + a1x1 + a2x2 + a12x1x2, with x1, x2 = ±1, a0, a1, a2, a12 ∈ R.
+The function f is a real random variable on the sample space Ω = {+1, −1}2 with the uniform
+probability λ.
+Note that the coordinate mappings X1, X2 of Ω generate an orthonormal basis
+1, X1, X2, X1X2 of L2(Ω, λ) and that f is the general form of a real random variable on such a space. Let
+P> be the open simplex of positive densities on (Ω, λ), and let EV be a statistical model, i.e., a subset
+of P>. The relaxed mapping F: EV → R,
+
+F(p) = Ep [ f ] = a0 + a1 Ep [X1] + a2 Ep [X2] + a12 Ep [X1X2] ,
+(1)
+
+is strictly bounded by the maximum of f, F(p) = Ep [ f ] < maxx∈Ω f (x), unless f is constant. We are
+looking for a sequence pn, n ∈ N, such that Epn [ f ] → maxx∈Ω f (x) as n → ∞. The existence of such a
+sequence is a nontrivial condition for the model E. Precisely, the closure of EV must contain a density,
+whose support is contained in the set of maxima {x ∈ Ω| f (x) = max f }. This condition is satisfied by
+the independence model, V = Span {X1, X2}, where we can write:
+
+F(η1, η2) = a0 + a1η1 + a2η2 + a12η1η2,
+ηi = Ep [Xi] ,
+(2)
+
+See Figure 1.
+The gradient of Equation (2) has components ∂1F = a1 + a12η2, ∂2F = a2 + a12η1, and the flow
+along the gradient produces increasing values for F; however, the gradient flow does not converge to
+the maximum of F; see the dotted line in Figure 2. However, one can follow the suggestion by [15] and
+use a modified gradient (the “natural” gradient) flow that produces better results in our problem; see
+Figure 3. Full details on this example are given in Section 2.5.2.
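+
+The toy example can be explored numerically with the following minimal sketch. It assumes a0 = 0 (the text fixes only a1 = 1, a2 = 2, a12 = 3 in the figures) and assumes that, for the independence model, the natural gradient flow reads dηi/dt = (1 − ηi^2) ∂iF in the expectation parameters, since Var(Xi) = 1 − ηi^2 for ±1-valued Xi. The step size and the number of Euler steps are arbitrary choices.
+
```python
import itertools

# a1, a2, a12 as in the figure captions; a0 = 0 is an assumption of this sketch.
a0, a1, a2, a12 = 0.0, 1.0, 2.0, 3.0

def f(x1, x2):
    """Objective on the sample space {+1, -1}^2."""
    return a0 + a1 * x1 + a2 * x2 + a12 * x1 * x2

def F(eta1, eta2):
    """Relaxed function of Equation (2) on the independence model."""
    return a0 + a1 * eta1 + a2 * eta2 + a12 * eta1 * eta2

def grad_F(eta1, eta2):
    """Euclidean gradient of F in the expectation parameters."""
    return a1 + a12 * eta2, a2 + a12 * eta1

def natural_gradient_flow(eta1, eta2, dt=0.01, steps=5000):
    """Explicit Euler integration of d(eta_i)/dt = (1 - eta_i^2) * dF/d(eta_i)."""
    for _ in range(steps):
        g1, g2 = grad_F(eta1, eta2)
        eta1 += dt * (1.0 - eta1 ** 2) * g1
        eta2 += dt * (1.0 - eta2 ** 2) * g2
    return eta1, eta2

f_max = max(f(x1, x2) for x1, x2 in itertools.product((-1, 1), repeat=2))
eta_final = natural_gradient_flow(-0.25, -0.25)  # starting point used in Figure 3
```
+
+With these values, the maximum of f is attained at (+1, +1), and the discretized natural gradient flow started at (−1/4, −1/4) stays inside the square [−1, +1]^2 and approaches that vertex.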
+
+[Figure: surface plot over the expectation parameters η1 and η2, each ranging over [−1, 1].]
+
+Figure 1. Relaxation of the Function (2) on the independence model. a1 = 1, a2 = 2, a12 = 3.
+
+[Figure: plot over the expectation parameters η1 and η2, each ranging over [−2, 2].]
+
+Figure 2. Gradient flow of the Function (2). The domain has been increased to include values outside
+the square [−1, +1]^2.
+
+[Figure: "Gradient vs Natural gradient", plotted over η1 and η2, each ranging over [−1, 1].]
+
+Figure 3. Gradient flow (blue line) and natural gradient flow (black line) for the Function (2), starting
+at (−1/4, −1/4).
+
+In combinatorial optimization, the values of the function f are assumed to be available at each
+point, and the curve of steepest ascent of the relaxed function is learned through a simulation procedure
+based on exponential statistical models.
+
+In this paper, we introduce, in Section 2, the geometry of exponential families and its first order
+calculus. The second order calculus and the Hessian are discussed in Section 3. Finally, in Section 4,
+we apply the formalism to the discussion of the Newton method in the context of the maximization of
+the relaxed function.
+
+2. Models on a Finite State Space
+
+We consider here the exponential statistical manifold on the set of positive densities on a measure
+space (Ω, μ) with Ω finite and counting measure μ. The setup we describe below is not strictly required
+in the finite case, because in such a case, other approaches are possible, but it provides a mathematical
+formalism that has its own pros and that scales naturally to the infinite case.
+We provide below a schematic presentation of our formalism as an introduction to this section.
+
+• Two different exponential families can actually be the same statistical model, as the set of
+densities in the two exponential families are actually equal. This fact is due to both the
+arbitrariness of the reference density and the fact that sufficient statistics are actually a vector
+basis of the vector space generated by the sufficient statistics. In a non-parametric approach,
+we can refer directly to the vector space of centered log-densities, while the change of reference
+density is geometrically interpreted as a change of chart. The set of all possible such charts
+defines a manifold.
+• We make a specific interpretation of the tangent bundle as the vector space of Fisher’s scores at
+each density and use such tangent spaces as the space of coordinates. This produces a different
+tangent space/space of coordinates at each density, and different tangent spaces are mapped
+one onto another by a proper parallel transport, which is nothing else than the re-centering of
+random variables.
+• If a basis is chosen, a parametrization is given, and such a parametrization is, in fact, a new
+chart, whose values are real vectors. In the real parametrization, the natural scalar product in
+each scores space is given by Fisher’s information matrix.
+• Riemannian gradients are defined in the usual way. It is customary in information geometry to
+call “natural gradient” the real coordinate presentation of the Riemannian gradient. The natural
+gradient is computed by applying the inverse of the Fisher information matrix to the Euclidean
+gradient. It seems that there are three gradients involved, but they all represent the same object
+when correctly understood.
+• The classical notion of expectation parameters for exponential families carries on as another
+chart on the statistical manifold, which gives rise to a further presentation of a geometrical
+object.
+• While the statistical manifold is unique, there are at least three relevant connections as structures
+on the vector bundles of the manifold: one relating to the exponential charts, one relating to the
+expectation charts and one depending on the Riemannian structure.
+
+2.1. Exponential Families As Manifolds
+
+On the finite sample space Ω, #Ω = n, let a set of random variables B = {X1, . . . , Xm} be
+given, such that ∑j αjXj is constant if, and only if, the αj’s are zero, or, equivalently, such that X0 =
+1, X1, . . . , Xm are affinely independent. The condition implies, necessarily, the linear independence of
+B. A common choice is to take a set of linearly independent and μ-centered random variables.
+We write V = Span {X1, . . . , Xm} and define the following exponential family of positive densities
+p ∈ P>:
+\[ \mathcal{E}_{\mathcal{V}} = \left\{ q \in \mathcal{P}_> \;\middle|\; q \propto e^{V} p,\ V \in \mathcal{V} \right\}. \tag{3} \]
+
+Given any couple p, q ∈ EV, there exists a unique set of parameters θ = θp(q), such that:
+
+\[ q = \exp\left( \sum_{j} \theta_j\, {}^{e}\mathbb{U}_p X_j - \psi_p(\theta) \right) \cdot p \tag{4} \]
+
+212
+
+
+Entropy 2014, 16, 4260–4289
+
+where eUp is the centering at p, that is,
+
+\[ {}^{e}\mathbb{U}_p \colon \mathcal{V} \ni U \mapsto U - E_p[U] \in {}^{e}\mathbb{U}_p \mathcal{V}. \tag{5} \]
+
+The linear mapping eUp is one-to-one on V, and eUpXj, j = 1, . . . , m, is a basis of eUpV. We view
+each choice of a specific reference p as providing a chart centered at p on the exponential family EV,
+namely:
+
+\[ \sigma_p \colon \exp\left( \sum_{j} \theta_j\, {}^{e}\mathbb{U}_p X_j - \psi_p(\theta) \right) \cdot p \mapsto \theta. \tag{6} \]
+
+If:
+
+\[ U = {}^{e}\mathbb{U}_p U + E_p[U] = \sum_{j=1}^{m} \theta_j\, {}^{e}\mathbb{U}_p X_j + E_p[U], \tag{7} \]
+
+then:
+
+\[ E_p\left[ U\, {}^{e}\mathbb{U}_p X_i \right] = \sum_{j=1}^{m} \theta_j\, E_p\left[ {}^{e}\mathbb{U}_p X_i\; {}^{e}\mathbb{U}_p X_j \right], \tag{8} \]
+
+so that $\theta = I_B^{-1}(p)\, E_p[U\, {}^{e}\mathbb{U}_p X]$, where:
+
+\[ I_B(p) = \left[ \operatorname{Cov}_p(X_i, X_j) \right]_{ij} = E_p[X X'] - E_p[X]\, E_p[X'] \tag{9} \]
+
+is the Fisher information matrix of the basis B = {X1, . . . , Xm}.
+The mappings:
+\[ \sigma_p \colon \mathcal{E}_{\mathcal{V}} \ni q \mapsto U \mapsto \theta \in \mathbb{R}^m \tag{10} \]
+
+where:
+
+\[ s_p \colon q \mapsto U = \log\frac{q}{p} - E_p\left[ \log\frac{q}{p} \right], \tag{11} \]
+
+\[ \sigma_p \colon q \mapsto \theta = I_B^{-1}(p)\, E_p\left[ U\, {}^{e}\mathbb{U}_p X \right] = I_B^{-1}(p)\, E_p\left[ \log\frac{q}{p}\; {}^{e}\mathbb{U}_p X \right], \tag{12} \]
+
+are global charts in the non-parametric and parametric coordinates, respectively.
+Notice that
+Equation (12) provides the regression coefficients of the least squares estimate on eUpV of the
+log-likelihood.
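+
+The parametric chart of Equation (12) can be checked numerically on a small sample space. The following is a minimal sketch using NumPy; the sample space {+1, −1}^2, the reference density p and the coordinates θ are hypothetical choices, and the helper names are not taken from the paper.
+
```python
import numpy as np

# Sample space {+1, -1}^2; the columns of X are the sufficient statistics X1, X2.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)

def normalize(w):
    """Turn positive weights into a probability density on the sample space."""
    return w / w.sum()

def sigma(p, q, X):
    """Chart of Equation (12): theta = I_B(p)^{-1} E_p[log(q/p) * (X centered at p)]."""
    Xc = X - p @ X                       # centering at p, Equation (5)
    fisher = Xc.T @ (p[:, None] * Xc)    # I_B(p) = Cov_p(X, X), Equation (9)
    rhs = Xc.T @ (p * np.log(q / p))     # E_p[log(q/p) * centered X]
    return np.linalg.solve(fisher, rhs)

p = normalize(np.exp(X @ np.array([0.3, -0.2])))   # hypothetical reference density
theta = np.array([0.5, -1.0])                      # hypothetical coordinates
q = normalize(np.exp(X @ theta) * p)               # q proportional to exp(theta . X) * p
theta_recovered = sigma(p, q, X)
```
+
+Because log(q/p) differs from ∑j θj eUpXj only by a constant, and centered variables have zero expectation at p, the regression recovers θ up to rounding.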
+We denote by ep : Rm → EV the inverse of σp, i.e.,
+
+\[ e_p(\theta) = \exp\left( \sum_{j=1}^{m} \theta_j\, {}^{e}\mathbb{U}_p X_j - \psi_p(\theta) \right) \cdot p, \tag{13} \]
+
+so that the representation of the divergence q ↦ D (p ∥q) in the chart σp is ψp:
+
+\[ \psi_p(\theta) = \log E_p\left[ e^{\sum_{j=1}^{m} \theta_j\, {}^{e}\mathbb{U}_p X_j} \right] = E_p\left[ \log\frac{p}{e_p(\theta)} \right] = D\left( p \,\|\, e_p(\theta) \right). \tag{14} \]
+
+The mapping IB : p ↦ Covp (X, X) ∈ Rm×m is represented in the chart centered at p by:
+
+\[ I_{B,p}(\theta) = I_B(e_p(\theta)) = \left[ \operatorname{Cov}_{e_p(\theta)}(X_i, X_j) \right]_{i,j} = \operatorname{Hess} \psi_p(\theta), \tag{15} \]
+
+See [1].
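+
+The two identities ψp(θ) = D(p ∥ ep(θ)) and Hess ψp(θ) = Cov at ep(θ) can be verified numerically. The following is a minimal sketch with hypothetical values for p and θ; the Hessian is approximated by central finite differences, so the comparison holds only up to discretization error.
+
```python
import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
p = np.exp(X @ np.array([0.3, -0.2]))   # hypothetical reference density
p /= p.sum()
Xc = X - p @ X                          # sufficient statistics centered at p

def psi(theta):
    """Cumulant function psi_p(theta) = log E_p[exp(theta . centered X)]."""
    return np.log(p @ np.exp(Xc @ theta))

def e_p(theta):
    """Inverse chart of Equation (13)."""
    return np.exp(Xc @ theta - psi(theta)) * p

def hessian_fd(fun, theta, h=1e-3):
    """Central finite-difference approximation of the Hessian of fun at theta."""
    m = theta.size
    basis = np.eye(m) * h
    H = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            ei, ej = basis[i], basis[j]
            H[i, j] = (fun(theta + ei + ej) - fun(theta + ei - ej)
                       - fun(theta - ei + ej) + fun(theta - ei - ej)) / (4 * h * h)
    return H

theta = np.array([0.5, -1.0])           # hypothetical coordinates
q = e_p(theta)
kl = p @ np.log(p / q)                  # D(p || e_p(theta))
Xq = X - q @ X
cov_q = Xq.T @ (q[:, None] * Xq)        # Cov_q(X, X) with q = e_p(theta)
hess = hessian_fd(psi, theta)
```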
+
+2.2. Change of Chart
+
+Fix p, ¯p ∈ EV; then, we can express p in the chart centered at ¯p,
+
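+\[ p = \exp\left( \bar U - k_{\bar p}(\bar U) \right) \cdot \bar p, \qquad \bar U \in {}^{e}\mathbb{U}_{\bar p} \mathcal{V}, \qquad k_{\bar p}(\bar U) = \log E_{\bar p}\left[ e^{\bar U} \right]. \tag{16} \]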
+
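+In coordinates, $\bar U = \sum_{j=1}^{m} \bar\theta_j\, {}^{e}\mathbb{U}_{\bar p} X_j$.
+For all q ∈ EV, $q = \exp(U - k_p(U)) \cdot p$ with $U \in {}^{e}\mathbb{U}_p \mathcal{V}$ and $k_p(U) = \log E_p[e^{U}]$; in coordinates,
+$U = \sum_{j=1}^{m} \theta_j\, {}^{e}\mathbb{U}_p X_j$. We can write: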
+
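+\[ \begin{aligned}
+q &= \exp\left( U - k_p(U) \right) \cdot p \\
+  &= \exp\left( U - k_p(U) \right) \exp\left( \bar U - k_{\bar p}(\bar U) \right) \cdot \bar p \\
+  &= \exp\left( U - k_p(U) + \bar U - k_{\bar p}(\bar U) \right) \cdot \bar p \\
+  &= \exp\left( \left( (U + \bar U) - E_{\bar p}[U] \right) - \left( k_p(U) + k_{\bar p}(\bar U) - E_{\bar p}[U] \right) \right) \cdot \bar p,
+\end{aligned} \tag{17} \]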
+
+hence, the non-parametric coordinate of q in the chart centered at ¯p is U + ¯U − E ¯p [U] = eU ¯p(U) + ¯U.
+From Equation (12):
+
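+\[ \sigma_{\bar p}(q) = I_B^{-1}(\bar p)\, E_{\bar p}\left[ \left( {}^{e}\mathbb{U}_{\bar p} U + \bar U \right) {}^{e}\mathbb{U}_{\bar p} X \right] = \theta + \bar\theta. \tag{18} \]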
+
+This provides the change of charts σ¯p ◦ σp^−1 : θ ↦ θ + ¯θ. This atlas of charts defines the affine
+manifold (EV, (σp)). This fact has deep consequences that we do not discuss here, e.g., our manifold is
+an instance of a Hessian manifold [16].
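+
+The affine change of charts θ ↦ θ + ¯θ can be checked with a minimal numerical sketch. The three densities below are hypothetical members of the exponential family generated by X1, X2 on {+1, −1}^2, and the helper names are illustrative only.
+
```python
import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)

def density(natural):
    """Member of the exponential family with the given natural parameters."""
    w = np.exp(X @ np.asarray(natural, dtype=float))
    return w / w.sum()

def sigma(ref, q):
    """Chart centered at ref, Equation (12)."""
    Xc = X - ref @ X
    fisher = Xc.T @ (ref[:, None] * Xc)
    return np.linalg.solve(fisher, Xc.T @ (ref * np.log(q / ref)))

p_bar = density([0.1, 0.4])    # hypothetical reference density of the second chart
p = density([-0.3, 0.2])       # hypothetical reference density of the first chart
q = density([0.7, -0.5])       # hypothetical point of the manifold

theta = sigma(p, q)            # coordinates of q in the chart centered at p
theta_bar = sigma(p_bar, p)    # coordinates of p in the chart centered at p_bar
theta_new = sigma(p_bar, q)    # coordinates of q in the chart centered at p_bar
```
+
+In this sketch, theta_new agrees with theta + theta_bar up to rounding, in line with Equation (18).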
+
+2.3. Tangent Bundle
+
+The space of Fisher scores at p is eUpV, and it is identified with the tangent space of the manifold
+at p, TpEV; see the discussion in [3,9]. Let us check the consistency of this statement with our
+θ-parametrization.
+Let:
+
+\[ q(\tau) = \exp\left( \sum_{j=1}^{m} \theta_j(\tau)\, {}^{e}\mathbb{U}_{q(0)} X_j - \psi_{q(0)}(\theta(\tau)) \right) \cdot q(0), \tag{19} \]
+
+τ ∈ I, I an open interval containing zero, a curve in EV. In the chart centered at q(0), we have from
+Equation (12):
+
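+\[ \begin{aligned}
+\sigma_{q(0)}(q(\tau)) &= I_B^{-1}(q(0))\, E_{q(0)}\left[ \log\frac{q(\tau)}{q(0)}\; {}^{e}\mathbb{U}_{q(0)} X \right] \\
+&= I_B^{-1}(q(0))\, E_{q(0)}\left[ \left( \sum_{j=1}^{m} \theta_j(\tau)\, {}^{e}\mathbb{U}_{q(0)} X_j - \psi_{q(0)}(\theta(\tau)) \right) {}^{e}\mathbb{U}_{q(0)} X \right] \\
+&= I_B^{-1}(q(0)) \sum_{j=1}^{m} \theta_j(\tau)\, E_{q(0)}\left[ {}^{e}\mathbb{U}_{q(0)} X_j\; {}^{e}\mathbb{U}_{q(0)} X \right] \\
+&= I_B^{-1}(q(0))\, E_{q(0)}\left[ {}^{e}\mathbb{U}_{q(0)} X\; {}^{e}\mathbb{U}_{q(0)} X' \right] \theta(\tau) \\
+&= \theta(\tau).
+\end{aligned} \tag{20} \]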
+
+The vector space eUpV is represented by the coordinates in the base eUpB. The tangent bundle
+TEV as a manifold is defined by the charts (σp, ˙σp) on the domain:
+
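+\[ T\mathcal{E}_{\mathcal{V}} = \left\{ (p, v) \;\middle|\; p \in \mathcal{E}_{\mathcal{V}},\ v \in T_p\mathcal{E}_{\mathcal{V}} \right\} \tag{21} \]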
+
+with:
+\[ (\sigma_p, \dot\sigma_p) \colon (q, V) \mapsto \left( I_B^{-1}(p)\, E_p\left[ \log\frac{q}{p}\; {}^{e}\mathbb{U}_p X \right],\ I_B^{-1}(p)\, E_p\left[ V\, {}^{e}\mathbb{U}_p X \right] \right). \tag{22} \]
+
+The dot notation ˙σp for the charts on the tangent spaces is justified by the computation in Equation (23)
+below:
+
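+\[ \left. \frac{d}{d\tau} \sigma_{q(0)}(q(\tau)) \right|_{\tau=0} = I_B^{-1}(q(0))\, E_{q(0)}\left[ \left. \frac{d}{d\tau} \log q(\tau) \right|_{\tau=0} {}^{e}\mathbb{U}_{q(0)} X \right] = I_B^{-1}(q(0))\, E_{q(0)}\left[ \delta q(0)\; {}^{e}\mathbb{U}_{q(0)} X \right] = \dot\sigma_{q(0)}(\delta q(0)). \tag{23} \]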
+
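+The velocity at τ = 0 is $\delta q(0) = \left. \frac{d}{d\tau} \log q(\tau) \right|_{\tau=0} \in T_{q(0)}\mathcal{E}_{\mathcal{V}}$ and:
+
+\[ \begin{aligned}
+\left. \frac{d}{d\tau} \theta(\tau) \right|_{\tau=0} &= I_B^{-1}(q(0))\, E_{q(0)}\left[ \left. \frac{d}{d\tau} \log q(\tau) \right|_{\tau=0} {}^{e}\mathbb{U}_{q(0)} X \right] \\
+&= I_B^{-1}(q(0))\, E_{q(0)}\left[ \delta q(0)\; {}^{e}\mathbb{U}_{q(0)} X \right],
+\end{aligned} \tag{24} \]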
+
+which is consistent with both the definition of tangent space as set of Fisher scores and with the chart
+of the tangent bundle as defined in Equation (22).
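+The velocity at a generic τ is $\delta q(\tau) = \frac{d}{d\tau} \log q(\tau) \in T_{q(\tau)}\mathcal{E}_{\mathcal{V}}$ and has coordinates at p:
+
+\[ \begin{aligned}
+\frac{d}{d\tau} \theta(\tau) &= I_B^{-1}(q(0))\, E_{q(0)}\left[ \frac{d}{d\tau} \log q(\tau)\; {}^{e}\mathbb{U}_{q(0)} X \right] \\
+&= I_B^{-1}(q(0))\, E_{q(0)}\left[ \delta q(\tau)\; {}^{e}\mathbb{U}_{q(0)} X \right].
+\end{aligned} \tag{25} \]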
+
+If V, W are vector fields on TEV, i.e., V(p), W(p) ∈ TpEV = eUpV, p ∈ EV, we define a Riemannian
+metric g(V, W) by:
+\[ g(V, W)(p) = g_p(V(p), W(p)) = E_p[V(p)\, W(p)]. \tag{26} \]
+
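+In coordinates at p, $V(p) = \sum_j \dot\sigma_p^{j}(V)\, {}^{e}\mathbb{U}_p X_j$ and $W(p) = \sum_j \dot\sigma_p^{j}(W)\, {}^{e}\mathbb{U}_p X_j$, so that:
+
+\[ g_p(V(p), W(p)) = \dot\sigma_p(V)'\, I_B(p)\, \dot\sigma_p(W). \tag{27} \]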
+
+2.4. Gradients
+
+Given a function φ: EV → R, let φp = φ ◦ ep, with ep = σp^−1, be its representation in the chart centered
+at p:
+
+[Commutative diagram (28): ep : Rm → EV, φ : EV → R and the composite φp = φ ◦ ep : Rm → R.]
+
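+The derivative of θ ↦ φp(θ) at θ = 0 along α ∈ Rm is:
+
+\[ \nabla\phi_p(0)\,\alpha = \nabla\phi_p(0)\, I_B^{-1}(p)\, I_B(p)\,\alpha = \left( I_B^{-1}(p)\, \nabla\phi_p(0)' \right)' I_B(p)\,\alpha = g_p\left( I_B^{-1}(p)\, \nabla\phi_p(0)',\ \alpha \right). \tag{29} \]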
+
+The mapping ∇̃φ: p ↦ IB(p)^−1 (∇φp(0))′ ∈ Rm that appears in Equation (29) is Amari’s natural
+gradient of φ: EV; see [15]. It is a standard notion in Riemannian geometry; cf. [4] (p. 46).
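+
+As an illustration of the natural gradient, the following minimal sketch takes φ(p) = Ep [ f ] with the toy function f = X1 + 2 X2 + 3 X1X2 on {+1, −1}^2 and a hypothetical reference density p in the independence model. It approximates the Euclidean gradient ∇φp(0) by central finite differences, compares it with Covp(X, f) (a standard identity for exponential families, assumed here rather than derived), and applies the inverse Fisher matrix.
+
```python
import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
f = 1.0 * X[:, 0] + 2.0 * X[:, 1] + 3.0 * X[:, 0] * X[:, 1]   # toy objective

p = np.exp(X @ np.array([0.3, -0.2]))   # hypothetical reference density
p /= p.sum()
Xc = X - p @ X                          # sufficient statistics centered at p

def phi_p(theta):
    """phi(e_p(theta)) = expectation of f at the density e_p(theta)."""
    w = np.exp(Xc @ theta) * p
    return (w / w.sum()) @ f

h = 1e-5
euclidean = np.array([(phi_p(h * e) - phi_p(-h * e)) / (2 * h) for e in np.eye(2)])
covariance = Xc.T @ (p * f)                      # Cov_p(X, f)
fisher = Xc.T @ (p[:, None] * Xc)                # I_B(p)
natural = np.linalg.solve(fisher, euclidean)     # I_B(p)^{-1} times Euclidean gradient

eta = p @ X                                      # expectation parameters of p
expected = np.array([1.0 + 3.0 * eta[1], 2.0 + 3.0 * eta[0]])
```
+
+For this independence model, the natural gradient coincides with the partial derivatives a1 + a12 η2 and a2 + a12 η1 of the relaxed function in the expectation parameters.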
+
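+More generally, the derivative of θ ↦ φp(θ) at θ along α ∈ Rm is:
+
+\[ \nabla\phi_p(\theta)\,\alpha = \nabla\phi_p(\theta)\, I_B^{-1}(e_p(\theta))\, I_B(e_p(\theta))\,\alpha = \left( I_B^{-1}(e_p(\theta))\, \nabla\phi_p(\theta)' \right)' I_B(e_p(\theta))\,\alpha = g_{e_p(\theta)}\left( I_B^{-1}(e_p(\theta))\, \nabla\phi_p(\theta)',\ \alpha \right). \tag{30} \]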
+Let us compare ∇φq(0) and ∇φp(θ) when q = ep(θ). As φp = φ ◦ ep and φq = φ ◦ eq, we have the
+change of charts:
+φq = φ ◦ eq = φ ◦ ep ◦ σp ◦ eq = φp ◦ σp ◦ eq,
+(31)
+
+hence ∇φq(0) = ∇φp(σp(q))J(σp ◦ eq)(0), where J(σp ◦ eq) is the Jacobian of σp ◦ eq. As σp ◦ eq(θ) =
+θ + σp(q), we have J(σp ◦ eq) = Id, and in conclusion, ∇φep(θ)(0) = ∇φp(θ). For all p ∈ EV and
+θ ∈ Rm,
+\[ \widetilde\nabla\phi(e_p(\theta)) = I_B^{-1}(e_p(\theta))\, \nabla\phi_p(\theta). \tag{32} \]
+
+Alternatively, for all q, p ∈ EV, ∇̃φ: EV → Rm is defined by:
+
+\[ \widetilde\nabla\phi(q) = I_B^{-1}(q)\, \nabla\phi_p(\sigma_p(q)). \tag{33} \]
+
+The Riemannian gradient of φ: EV is the vector field ∇φ, such that DYφ = g(∇φ, Y). Note that
+the Riemannian gradient takes values in the tangent bundle, while the natural gradient takes values in
+Rm. We compute the Riemannian gradient at p as follows. If y = ˙σp(Y(p)),
+
+\[ D_Y\phi(p) = d\phi_p(0)\, y = g_p\left( \widetilde\nabla\phi(p),\ y \right) = E_p\left[ \nabla\phi(p)\, Y(p) \right], \tag{34} \]
+
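+hence ∇̃φ(p) = IB(p)^−1 ∇φp(0)′ is the representation in the chart centered at p of the vector field
+∇φ: EV. Explicitly, we have (see Equation (22)),
+
+\[ \widetilde\nabla\phi(p) = I_B^{-1}(p)\, (\nabla\phi_p(0))' = I_B^{-1}(p)\, E_p\left[ \nabla\phi(p)\; {}^{e}\mathbb{U}_p X \right], \tag{35} \]
+
+\[ \nabla\phi(p) = \sum_{j} \left( \widetilde\nabla\phi(p) \right)_j\, {}^{e}\mathbb{U}_p X_j \tag{36} \]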
+The Euclidean gradient ∇φp(θ) is sometimes called the “vanilla gradient.” It is equal to the
+covariance between the Riemannian gradient ∇φ(p) and the basis X, (∇φp(0))′ = Ep [∇φ(p) eUpX].
+We summarize in a display the relations between our three gradients: Euclidean ∇φp(0), natural
+∇̃φ(p) and Riemannian ∇φ(p).
+
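+[Diagram (37): the chart (σp, ˙σp) maps TEV to R2m and covers σp : EV → Rm through the projections π : TEV → EV and π1 : R2m → Rm; on each fiber, ˙σp : TpEV → Rm. The Riemannian gradient ∇φ(p), the Euclidean gradient ∇φp(0) and the natural gradient are related through IB(p) by ˙σp ◦ ∇φ(p) = IB^−1 ∇φp(0) = ∇̃φ(p).]
+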
+In the following, we shall frequently use the fact that the representation of the gradient vector
+field ∇φ in a generic chart centered at p is:
+
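+\[ (\nabla\phi)_p(\theta) = \dot\sigma_p\left( \nabla\phi(e_p(\theta)) \right) = (\widetilde\nabla\phi)(e_p(\theta)) = I_{B,p}^{-1}(\theta)\, \nabla\phi_p(\theta). \tag{38} \]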
+
+It should be noted that the leftmost term (∇φ)p(θ) is the presentation of the gradient in the charts
+of the tangent bundle, while in the rightmost term, ∇φp(θ) denotes the Euclidean gradient of the
+presentation of the function φ in the charts of the manifold.
+
+2.4.1. Expectation Parameters
+
+As ψp is strictly convex, the gradient mapping θ ↦ (∇ψp(θ))′ is a homeomorphism from the
+space of parameters Rm to the interior of the convex set generated by the image of eUpX; see [1]. The
+function μp : EV defined by:
+
+\[ \mu_p(q) = E_q\left[ {}^{e}\mathbb{U}_p X \right] = E_q[X] - E_p[X] = (\nabla\psi_p(\theta))', \qquad \theta = \sigma_p(q) \tag{39} \]
+
+is a chart for all p ∈ EV. The value of the inverse q = Lp(μ) is characterized as the unique q ∈ EV, such
+that μ = Eq [eUpX], i.e., the maximum likelihood estimator.
+Let us compute the change of chart from p to ¯p:
+
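+\[ \mu_{\bar p} \circ \mu_p^{-1}(\eta) = \bar\eta = \eta + E_p[X] - E_{\bar p}[X]. \tag{40} \]
+
+In fact, $\mu = E_{L_p(\mu)}[{}^{e}\mathbb{U}_p X]$ and $\bar\mu = \mu_{\bar p}(L_p(\mu)) = E_{L_p(\mu)}[{}^{e}\mathbb{U}_{\bar p} X]$.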
+We do not discuss here the rich theory started in [2] about the duality between σp and μp. We
+limit ourselves to the computation of the Riemannian gradient in the expectation parameters. If φ: EV,
+
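+\[ \phi_p(\theta) = \phi \circ e_p(\theta) = \phi \circ L_p \circ \mu_p \circ e_p(\theta) = (\phi \circ L_p) \circ (\nabla\psi_p)(\theta), \tag{41} \]
+
+because $\mu_p \circ e_p(\theta) = E_{e_p(\theta)}[{}^{e}\mathbb{U}_p X] = \nabla\psi_p(\theta)$, hence:
+
+\[ \nabla\phi_p(\theta) = \nabla(\phi \circ L_p)(\nabla\psi_p(\theta))\, \operatorname{Hess} \psi_p(\theta), \tag{42} \]
+
+\[ \widetilde\nabla\phi(p) = I_B(p)^{-1} \left( \nabla(\phi \circ L_p)(0)\, \operatorname{Hess} \psi_p(0) \right)' = \left( \nabla(\phi \circ L_p)(0) \right)', \tag{43} \]
+
+\[ \nabla\phi(p) = \nabla(\phi \circ L_p)(0)\; {}^{e}\mathbb{U}_p X, \tag{44} \]
+
+that is, the natural gradient ∇̃φ at p = Lp(μ) is equal to the Euclidean gradient of μ ↦ φ ◦ Lp(μ) at
+μ = 0.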
+
+2.4.2. Vector Fields
+
+If V is a vector field of TEV and φ: EV is a real function, then we define the action of V on φ, ∇Vφ,
+to be the real function:
+\[ \nabla_V\phi\colon \mathcal{E}_V \ni p \mapsto \nabla_V\phi(p) = \nabla\phi_p(0)\,\dot\sigma_p(V(p)). \tag{45} \]
+
+We prefer to avoid the standard notation Vφ, because in our setting, V(p) is a random variable, and
+the product V(p)φ(p) is otherwise defined as the ordinary product.
+Let us represent ∇Vφ in the chart centered at p:
+
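+\[ (\nabla_V\phi)_p(\theta) = \nabla_V\phi(e_p(\theta)) = \nabla\phi_{e_p(\theta)}(0)\,\dot\sigma_{e_p(\theta)}\bigl(V(e_p(\theta))\bigr) = \nabla\phi_p(\theta)\,V_p(\theta), \tag{46} \]
+
+where we have used the equality $\nabla\phi_{e_p(\theta)}(0) = \nabla\phi_p(\theta)$ and $V_p(\theta) = \dot\sigma_{e_p(\theta)}\bigl(V(e_p(\theta))\bigr)$.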
+If W is a vector field, we can compute ∇W∇Vφ at p as:
+
+\[ \nabla_W\nabla_V\phi(p) = \nabla(\nabla_V\phi)_p(0)\,\dot\sigma_p(W(p)) = V_p(0)'\operatorname{Hess}\phi_p(0)\,W_p(0) + \nabla\phi_p(0)\,JV_p(0)\,W_p(0), \tag{47} \]
+
+where J denotes the Jacobian matrix.
+The Lie bracket [W, V]φ (see [7] (§4.2), [8] (V, §1), [4] (Section 5.3.1)) is given by:
+
+\[ [W, V]\phi(p) = \nabla_W\nabla_V\phi(p) - \nabla_V\nabla_W\phi(p) = \nabla\phi_p(0)\bigl(JV_p(0)\,W_p(0) - JW_p(0)\,V_p(0)\bigr), \tag{48} \]
+
+because of Equation (47) and the symmetry of the Hessian.
+
+The flow of the smooth vector field V : EV is a family of curves γ(t, p), p ∈ EV, t ∈ Jp, Jp open real
+interval containing zero, such that for all p ∈ EV and t ∈ Jp,
+
+γ(0, p) = p,
+(49)
+
+δγ(t, p) = V(γ(t, p)).
+(50)
+
+As uniqueness holds in Equation (50) (see [8] (VI, §1) or [7] (§4.1)), we have the semi-group property
+γ(s + t, p) = γ(s, γ(t, p)), and Equation (50) is equivalent to δγ(0, p) = V(γ(0, p)), p ∈ EV.
+If a flow of V is available, we have an interpretation of ∇Vφ as a derivative of φ along γ(t, p),
+
+\[ \frac{d}{dt}\phi(\gamma(t, p))\Big|_{t=0} = \nabla\phi_p(\sigma_p(\gamma(t, p)))\,\frac{d}{dt}\sigma_p(\gamma(t, p))\Big|_{t=0} = \nabla\phi_p(0)\,V(p) = \nabla_V\phi(p). \tag{51} \]
+
+2.5. Examples
+
+The following examples are intended to show how the formalism of gradients is usable in
+performing basic computations.
+
+2.5.1. Expectation
+
+Let f be any random variable, and define F: EV by F(p) = Ep [ f ]. In the chart centered at p, we
+have:
+
+\[ F_p(\theta) = \int f \exp\Bigl( \sum_j \theta_j\,{}^{e}U_p X_j - \psi_p(\theta) \Bigr) \cdot p \, d\mu \tag{52} \]
+
+and the Euclidean gradient:
+∇Fp(0) = Covp ( f, X) ∈ (Rm)′.
+(53)
+
+The natural gradient is:
+
+\[ \widetilde{\nabla}F(p) = \operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p(X, f) \in \mathbb{R}^m, \tag{54} \]
+
+and the Riemannian gradient is:
+
+\[ \nabla F(p) = (\widetilde{\nabla}F(p))'\,{}^{e}U_p X = \operatorname{Cov}_p(f, X)\operatorname{Cov}_p(X, X)^{-1}\,{}^{e}U_p X \in T_p\mathcal{E}_V. \tag{55} \]
+
+From Equation (55), it follows that ∇F(p) is the L2(p)-projection of f onto eUpV, while the components of $\widetilde{\nabla}F(p)$ in
+Equation (54) are the coordinates of the projection. Let us consider the family of curves:
+
+\[ \gamma(t, p) = \exp\Bigl( \sum_{j=1}^{m} t\,(\widetilde{\nabla}F(p))_j\,{}^{e}U_p X_j - \psi_p(t\,\widetilde{\nabla}F(p)) \Bigr) \cdot p, \qquad t \in \mathbb{R}. \tag{56} \]
+
+The velocity is:
+
+\[ \delta\gamma(t, p) = \frac{d}{dt}\Bigl( \sum_{j=1}^{m} t\,(\widetilde{\nabla}F(p))_j\,{}^{e}U_p X_j - \psi_p(t\,\widetilde{\nabla}F(p)) \Bigr) = \nabla F(p) - E_{\gamma(t,p)}[\nabla F(p)], \tag{57} \]
+
+which is different from ∇F(γ(t, p)), unless f ∈ V ⊕ R. Then, γ is not, in general, the flow of ∇F, but it
+is a local approximation, as δγ(0, p) = ∇F(p).
+These computations are the basis of model-based methods in combinatorial optimization; see [10–14].
+
+2.5.2. Binary Independent Variables
+
+Here, we present, in full generality, the toy example of the Introduction; see [17] for more
+information on the application to combinatorial optimization. Our example is a very special case of
+Ising exactly solvable models [18], our aim being here to explore the geometric framework.
+Let Ω = {+1, −1}m with counting measure μ, and let the space V be generated by the coordinate
+projections B = {X1, . . . , Xd}. Note that we use here the coding +1, −1 (from physics) instead of
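+the coding 0, 1, which is more common in combinatorial optimization. The exponential family is
+\[ \mathcal{E}_V = \Bigl\{ \exp\Bigl( \sum_{j=1}^{m} \theta_j X_j - \psi_\lambda(\theta) \Bigr) \cdot 2^{-m} \Bigr\}, \]
+with $\lambda(x) = 2^{-m}$ for x ∈ Ω being the uniform density. The
+independence of the sufficient statistics Xj under all distributions in EV implies:
+
+\[ \psi_\lambda(\theta) = \sum_{j=1}^{m} \psi(\theta_j), \qquad \psi(\theta) = \log(\cosh(\theta)). \tag{58} \]
+
+We have:
+
+\[ \nabla\psi_\lambda(\theta) = [\tanh(\theta_j) : j = 1, \dots, d] = \eta_\lambda(\theta), \tag{59} \]
+
+\[ \operatorname{Hess}\psi_\lambda(\theta) = \operatorname{diag}\bigl( \cosh^{-2}(\theta_j) : j = 1, \dots, d \bigr) = \operatorname{diag}\bigl( e^{-2\psi(\theta_j)} : j = 1, \dots, d \bigr) = I_{B,\lambda}(\theta), \tag{60} \]
+
+\[ I_{B,\lambda}(\theta)^{-1} = \operatorname{diag}\bigl( \cosh^{2}(\theta_j) : j = 1, \dots, d \bigr) = \operatorname{diag}\bigl( e^{2\psi(\theta_j)} : j = 1, \dots, d \bigr). \tag{61} \]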
+
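+The quadratic function $f(X) = a_0 + \sum_j a_j X_j + \sum_{\{i,j\}} a_{i,j} X_i X_j$ has expected value at $p = e_\lambda(\theta)$, i.e.,
+relaxed value, equal to:
+
+\[ F(p) = F_\lambda(\theta) = E_\theta[f(X)] = a_0 + \sum_j a_j \tanh(\theta_j) + \sum_{\{i,j\}} a_{i,j} \tanh(\theta_i)\tanh(\theta_j), \tag{62} \]
+
+and covariance with Xk ∈ B equal to:
+
+\begin{align*}
+\operatorname{Cov}_\theta(f(X), X_k) &= \sum_j a_j \operatorname{Cov}_\theta(X_j, X_k) + \sum_{\{i,j\}} a_{i,j} \operatorname{Cov}_\theta(X_i X_j, X_k) \\
+&= a_k \operatorname{Var}_\theta(X_k) + \sum_{i \neq k} a_{i,k}\, E_\theta[X_i] \operatorname{Var}_\theta(X_k) \\
+&= \cosh^{-2}(\theta_k)\Bigl( a_k + \sum_{i \neq k} a_{i,k} \tanh(\theta_i) \Bigr). \tag{63}
+\end{align*}
+
+In the computation, we have used the independence and the special algebra of ±1, which implies
+$X_i^2 = 1$, so that $\operatorname{Cov}_\theta(X_i X_j, X_k) = 0$ if i, j ≠ k, otherwise $\operatorname{Cov}_\theta(X_i X_k, X_k) = E_\theta[X_i] - E_\theta[X_i]\,E_\theta[X_k]^2$;
+see [13].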
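+
+The covariance identity quoted above can be checked in two lines. The following is only a sketch for $i \neq k$, using independence and $X_k^2 = 1$; the shorthand $\eta_i = E_\theta[X_i] = \tanh(\theta_i)$ is the expectation parameter of Equation (59).
+
+```latex
+\begin{align*}
+\operatorname{Cov}_\theta(X_i X_k, X_k)
+  &= E_\theta[X_i X_k^2] - E_\theta[X_i X_k]\,E_\theta[X_k]
+   = \eta_i - \eta_i \eta_k^2 \\
+  &= \eta_i (1 - \eta_k^2)
+   = E_\theta[X_i]\,\operatorname{Var}_\theta(X_k)
+   = \tanh(\theta_i)\cosh^{-2}(\theta_k).
+\end{align*}
+```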
+
+The Euclidean gradient, the natural gradient and the Riemannian gradient are, respectively,
+
+\[ \nabla F_\lambda(\theta) = \Bigl[ \cosh^{-2}(\theta_j)\Bigl( a_j + \sum_{i \neq j} a_{i,j}\tanh(\theta_i) \Bigr) : j = 1, \dots, d \Bigr], \tag{64} \]
+
+\[ \widetilde{\nabla}F(e_\lambda(\theta)) = \Bigl[ a_j + \sum_{i \neq j} a_{i,j}\tanh(\theta_i) : j = 1, \dots, d \Bigr], \tag{65} \]
+
+\[ \nabla F(e_\lambda(\theta)) = \sum_{j=1}^{m} \Bigl( a_j + \sum_{i \neq j} a_{i,j}\,E_\theta[X_i] \Bigr)\bigl( X_j - E_\theta[X_j] \bigr). \tag{66} \]
+
+The (natural) gradient flow equations are:
+
+\[ \dot\theta_j(t) = a_j + \sum_{i \neq j} a_{i,j}\tanh(\theta_i(t)), \qquad j = 1, \dots, d. \tag{67} \]
+
+Equations (64)–(66) are usable in practice if the aj’s and the ai,j’s are estimable. Otherwise, one
+can use Equation (63) and the following forms of the gradients:
+
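+\[ \nabla F_\lambda(\theta) = \bigl[ \operatorname{Cov}_\theta(X_j, f(X)) : j = 1, \dots, d \bigr], \tag{68} \]
+
+\[ \widetilde{\nabla}F(e_\lambda(\theta)) = \bigl[ \cosh^{2}(\theta_j)\operatorname{Cov}_\theta(f(X), X_j) : j = 1, \dots, d \bigr], \tag{69} \]
+
+in which case, the gradient flow equations are:
+
+\[ \dot\theta_j(t) = \cosh^{2}(\theta_j)\operatorname{Cov}_\theta(f(X), X_j), \qquad j = 1, \dots, d. \tag{70} \]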
+
+Let us study the relaxed function in the expectation parameters ηj = ηj(θ), j = 1, . . . , d,
+
+\[ F_\lambda(\eta) = a_0 + \sum_j a_j \eta_j + \sum_{\{i,j\}} a_{i,j}\,\eta_i \eta_j, \qquad \eta \in \,]-1, +1[^{m}. \tag{71} \]
+
+The Euclidean gradient with respect to η has components:
+
+\[ \partial_j F_\lambda(\eta) = a_j + \sum_{i \neq j} a_{i,j}\,\eta_i, \tag{72} \]
+
+which are equal to the components of the natural gradient; see Section 2.4.1. As:
+
+\[ \dot\eta_j(t) = \frac{d}{dt}\tanh(\theta_j(t)) = \cosh^{-2}(\theta_j(t))\,\dot\theta_j(t) = \bigl( 1 - \eta_j(t)^2 \bigr)\dot\theta_j(t), \qquad j = 1, \dots, m, \tag{73} \]
+
+the gradient flow expressed in the η-parameters has equations:
+
+\[ \dot\eta_j(t) = \bigl( 1 - \eta_j(t)^2 \bigr)\Bigl( a_j + \sum_{i \neq j} a_{i,j}\,\eta_i(t) \Bigr), \qquad j = 1, \dots, d. \tag{74} \]
+
+Alternatively, in vector form,
+
+\[ \dot\eta(t) = \operatorname{diag}\bigl( 1 - \eta_j(t)^2 : j = 1, \dots, d \bigr)\,(a + A\eta(t)), \tag{75} \]
+
+where $a = [a_j : j = 1, \dots, d]^t$ and $A_{i,j} = 0$ if i = j, $A_{i,j} = a_{i,j}$ otherwise. The matrix A is symmetric with zero
+diagonal, and it has the meaning of the adjacency matrix of the (weighted) interaction graph. We do
+not know a closed-form solution of Equation (74). An example of a numerical solution is shown in
+Figure 3.
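+
+Although no closed form is known in general, the decoupled case is explicit. As an illustration only (the assumption $a_{i,j} = 0$ for all pairs is ours and is not a case treated in the text), Equation (67) reduces to $\dot\theta_j = a_j$, and the flow in both parametrizations can be written down and checked against Equation (74):
+
+```latex
+\begin{align*}
+\theta_j(t) &= \theta_j(0) + a_j t, \\
+\eta_j(t)   &= \tanh\bigl(\theta_j(0) + a_j t\bigr), \\
+\dot\eta_j(t) &= a_j \cosh^{-2}\bigl(\theta_j(0) + a_j t\bigr)
+              = \bigl(1 - \eta_j(t)^2\bigr)\, a_j .
+\end{align*}
+```
+
+In this special case each $\eta_j(t)$ with $a_j \neq 0$ tends monotonically to $\operatorname{sign}(a_j)$ as $t \to \infty$, that is, the flow approaches a vertex of the cube $[-1, 1]^d$.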
+
+2.5.3. Escort Probabilities
+
+For a given a > 0, consider the function $C^{(a)}$ : EV defined by $C^{(a)}(p) = \int p^a \, d\mu$. We have:
+
+\[ C^{(a)}_p(\theta) = \int \exp\Bigl( a \sum_{j=1}^{m} \theta_j\,{}^{e}U_p X_j - a\,\psi_p(\theta) \Bigr)\, p^a \, d\mu \tag{76} \]
+
+and:
+
+\[ dC^{(a)}_p(0)\,\alpha = \int a \Bigl( \sum_{j=1}^{m} \alpha_j\,{}^{e}U_p X_j \Bigr) p^a \, d\mu = \sum_{j=1}^{m} \alpha_j \int a\,{}^{e}U_p X_j\, p^a \, d\mu = \sum_{j=1}^{m} \alpha_j \operatorname{Cov}_p\bigl( X_j, a p^{a-1} \bigr), \tag{77} \]
+
+that is, the Euclidean gradient is $\nabla C^{(a)}_p(0) = \operatorname{Cov}_p(a p^{a-1}, X)$ (row vector). The natural gradient is
+computed from Equation (35) as:
+
+\[ \widetilde{\nabla}C^{(a)}(p) = I_B^{-1}(p)\,(\nabla C^{(a)}_p(0))' = \operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p\bigl( X, a p^{a-1} \bigr), \tag{78} \]
+
+while the Riemannian gradient follows from Equation (36):
+
+\[ \nabla C^{(a)}(p) = \operatorname{Cov}_p\bigl( a p^{a-1}, X \bigr)\operatorname{Cov}_p(X, X)^{-1}\,{}^{e}U_p X. \tag{79} \]
+
+Note that the Riemannian gradient is the orthogonal projection of the random variable $a p^{a-1}$ onto
+the tangent space TpEV = eUpV.
+The probability density $p^a / C^{(a)}(p)$ is called the escort density in the literature on non-extensive
+statistical mechanics; see, e.g., [19] (Section 7.4).
+We now compute the tangent mapping of $\mathcal{E}_V \ni p \mapsto p^a / C^{(a)}(p) \in \mathcal{P}_>$. Let us extend the basis
+X1, . . . , Xm to a basis X1, . . . , Xn, n ≥ m, whose exponential family is full, i.e., equal to P>. The
+non-parametric coordinate of $q = \bigl( \exp\bigl( \sum_{j=1}^{m} \theta_j\,{}^{e}U_p X_j - \psi_p(\theta) \bigr)\, p \bigr)^a / C^{(a)}_p(\theta)$ in the chart centered at
+$\bar p = p^a / C^{(a)}_p(0)$ is the $\bar p$-centering of the random variable:
+
+\begin{align*}
+\log\Bigl( \frac{q}{\bar p} \Bigr) &= \log\left( \frac{\bigl( \exp\bigl( \sum_{j=1}^{m} \theta_j\,{}^{e}U_p X_j - \psi_p(\theta) \bigr)\, p \bigr)^a / C^{(a)}_p(\theta)}{p^a / C^{(a)}_p(0)} \right) \\
+&= a \sum_{j=1}^{m} \theta_j\,{}^{e}U_p X_j - a\,\psi_p(\theta) + \ln C^{(a)}_p(0) - \ln C^{(a)}_p(\theta), \tag{80}
+\end{align*}
+
+that is,
+
+\[ v = a \sum_{j=1}^{m} \theta_j\,{}^{e}U_{\bar p} X_j. \tag{81} \]
+
+The coordinates of v in the basis $ {}^{e}U_{\bar p} X_1, \dots, {}^{e}U_{\bar p} X_n$ are $(a\theta_1, \dots, a\theta_m, 0, \dots, 0)$, and the Jacobian of
+$\theta \mapsto (a\theta, 0_{n-m})$ is the m × n matrix $[a I_m \mid 0_{m \times (n-m)}]$.
+
+2.5.4. Polarization Measure
+
+The polarization measure has been introduced in Economics by [20]. Here, we consider the
+qualitative version of [21]. If π is a distribution on a finite set, the probability that in three independent
+samples from π there are exactly two equal is $3 \sum_j \pi_j^2 (1 - \pi_j)$. If p ∈ EV, define:
+
+\[ G(p) = \int p^2 (1 - p) \, d\mu = C^{(2)}(p) - C^{(3)}(p), \tag{82} \]
+
+where $C^{(2)}$ and $C^{(3)}$ are defined as in Example 2.5.3.
+From Equation (78), we find the natural gradient:
+
+\[ \widetilde{\nabla}G(p) = \operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p\bigl( X, 2p - 3p^2 \bigr). \tag{83} \]
+
+Note that $\widetilde{\nabla}G(p) = 0$ if p is constant; see Figure 4.
+
+Figure 4. Normalized polarization.
+
+3. Second Order Calculus
+
+In this section, we turn to considering second order calculus, in particular Hessians, in order to
+prepare the discussion of the Newton method for the relaxed optimization of Section 4.
+
+3.1. Metric Derivative (Levi–Civita connection)
+
+Let V, W : EV be vector fields, that is, V(p), W(p) ∈ TpEV = eUpV, p ∈ EV. Consider the real
+function R = g(V, W): EV → R, whose value at p ∈ EV is R(p) = gp(V(p), W(p)) = Ep [V(p)W(p)].
+Assuming smoothness, we want to compute the derivative of R along the vector field Y : EV, that is,
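+$(D_Y R)(p) = dR_p(0)\alpha$, with $\alpha = \dot\sigma_p(Y(p))$. The expression of R in the chart centered at p is, according
+to Equation (27),
+
+\[ \theta \mapsto R_p(\theta) = \dot\sigma_p(V(e_p(\theta)))'\, I_B(e_p(\theta))\, \dot\sigma_p(W(e_p(\theta))) = V_p(\theta)'\, I_{B,p}(\theta)\, W_p(\theta), \tag{84} \]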
+
+where Vp and Wp are the presentation in the chart of the vector fields V and W, respectively.
+
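+The i-th component $\partial_i R_p(\theta)$ of the Euclidean gradient $\nabla R_p(\theta)$ is:
+
+\begin{align*}
+\partial_i R_p(\theta) &= \partial_i\bigl( V_p(\theta)'\, I_{B,p}(\theta)\, W_p(\theta) \bigr) \\
+&= \partial_i V_p(\theta)'\, I_{B,p}(\theta)\, W_p(\theta) + V_p(\theta)'\, \partial_i I_{B,p}(\theta)\, W_p(\theta) + V_p(\theta)'\, I_{B,p}(\theta)\, \partial_i W_p(\theta) \\
+&= \Bigl( \partial_i V_p(\theta) + \tfrac12 I_{B,p}^{-1}(\theta)\, \partial_i I_{B,p}(\theta)\, V_p(\theta) \Bigr)' I_{B,p}(\theta)\, W_p(\theta) \\
+&\quad + V_p(\theta)'\, I_{B,p}(\theta) \Bigl( \partial_i W_p(\theta) + \tfrac12 I_{B,p}^{-1}(\theta)\, \partial_i I_{B,p}(\theta)\, W_p(\theta) \Bigr), \tag{85}
+\end{align*}
+
+so that the derivative at θ along $\alpha = \dot\sigma_{e_p(\theta)}(Y(e_p(\theta)))$ is:
+
+\begin{align*}
+dR_p(\theta)\alpha &= \Bigl( dV_p(\theta)\alpha + \tfrac12 I_{B,p}^{-1}(\theta) \bigl( dI_{B,p}(\theta)\alpha \bigr) V_p(\theta) \Bigr)' I_{B,p}(\theta)\, W_p(\theta) \\
+&\quad + V_p(\theta)'\, I_{B,p}(\theta) \Bigl( dW_p(\theta)\alpha + \tfrac12 I_{B,p}^{-1}(\theta) \bigl( dI_{B,p}(\theta)\alpha \bigr) W_p(\theta) \Bigr). \tag{86}
+\end{align*}
+
+Proposition 1. If we define $D_Y V$ to be the vector field on EV, whose value at $q = e_p(\theta)$ has coordinates
+centered at p given by:
+
+\[ \dot\sigma_p(D_Y V(q)) = dV_p(\theta)\alpha + \tfrac12 I_B^{-1}(p) \bigl( dI_{B,p}(\theta)\alpha \bigr) V_p(\theta), \qquad \alpha = \dot\sigma_p(Y(q)), \tag{87} \]
+
+then:
+\[ D_Y g(V, W) = g(D_Y V, W) + g(V, D_Y W), \tag{88} \]
+
+i.e., Equation (87) is a metric covariant derivative; see [6] (Ch. 2 §3), [8] (VIII §4), [4] (§5.3.2).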
+
+The metric derivative Equation (87) could be computed from the flow of the vector field Y. Let
+$(t, p) \mapsto \gamma(t, p)$ be the flow of the vector field Y, i.e., δγ(t, p) = Y(γ(t, p)) and γ(0, p) = p. Using
+Equation (23), we have:
+
+\begin{align*}
+\frac{d}{dt}\dot\sigma(V(\gamma(t, p)))\Big|_{t=0} &= \frac{d}{dt} V_p(\sigma_p(\gamma(t, p)))\Big|_{t=0} = dV_p(\sigma_p(\gamma(t, p)))\, \frac{d}{dt}\sigma_p(\gamma(t, p))\Big|_{t=0} \\
+&= dV_p(0)\, \dot\sigma_p(\delta\gamma(0, p)) = dV_p(0)\, \dot\sigma_p(Y(p)), \tag{89}
+\end{align*}
+
+and:
+
+\[ \frac{d}{dt} I_V(\gamma(t, p))\Big|_{t=0} = \frac{d}{dt} I_{B,p}(\sigma_p\gamma(t, p))\Big|_{t=0} = dI_{B,p}(0)\, \dot\sigma_p(\delta\gamma(0, p)) = dI_{B,p}(0)\, \dot\sigma_p(Y(p))\, V_p(0), \tag{90} \]
+
+so that:
+
+\[ \dot\sigma(D_Y V(p)) = \frac{d}{dt}\dot\sigma V(\gamma(t, p))\Big|_{t=0} + \tfrac12 I_V^{-1}(p)\, \frac{d}{dt} I_V(\gamma(t, p))\Big|_{t=0}. \tag{91} \]
+
+Let us check the symmetry of the metric covariant derivative to show that it is actually the unique
+Riemannian or Levi–Civita affine connection; see [6] (Th. 3.6).
+The Lie bracket of the vector fields V and W is the vector field [V, W], whose coordinates are:
+
+[V, W]p(θ) = dVp(0)˙σp(W(p)) − dWp(0) ˙σp(V(p)).
+(92)
+
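+As the ij entry of $\partial_k I_{B,p}(0)$ is $\partial_k \partial_i \partial_j \psi_p(0)$, the symmetry $(dI_{B,p}(0)\alpha)\beta = (dI_{B,p}(0)\beta)\alpha$ holds,
+and we have:
+
+\begin{align*}
+\dot\sigma_p\bigl( D_W V(p) - D_V W(p) \bigr) &= dV_p(0)\, \dot\sigma_p(W(p)) + \tfrac12 I_B^{-1}(p) \bigl( dI_{B,p}(0)\, \dot\sigma_p(W(p)) \bigr) V_p(0) \\
+&\quad - dW_p(0)\, \dot\sigma_p(V(p)) - \tfrac12 I_B^{-1}(p) \bigl( dI_{B,p}(0)\, \dot\sigma_p(V(p)) \bigr) W_p(0) \\
+&= \dot\sigma[V, W](p). \tag{93}
+\end{align*}
+
+The term $\Gamma^k(p) = \tfrac12 I_p^{-1}(0)\, \partial_k I_{B,p}(0)$ of Equation (87) is sometimes referred to as the Christoffel
+matrix, but we do not use this terminology in this paper. As:
+
+\[ I_{B,p}(\theta) = I_B(e_p(\theta)) = \bigl[ \operatorname{Cov}_{e_p(\theta)}(X_i, X_j) \bigr]_{i,j=1,\dots,m} = \bigl[ \partial_i \partial_j \psi_p(\theta) \bigr]_{i,j=1,\dots,m}, \tag{94} \]
+
+we have $\partial_k I_B(e_p(\theta)) = [\partial_i \partial_j \partial_k \psi_p(\theta)]_{i,j=1,\dots,m} = \bigl[ \operatorname{Cov}_{e_p(\theta)}(X_i, X_j, X_k) \bigr]_{i,j=1,\dots,m}$ and:
+
+\[ \Gamma^k(p) = \tfrac12 \bigl[ \operatorname{Cov}_p(X_i, X_j) \bigr]^{-1}_{i,j=1,\dots,m} \bigl[ \operatorname{Cov}_p(X_i, X_j, X_k) \bigr]_{i,j=1,\dots,m}. \tag{95} \]
+
+If V, W are vector fields of TEV, we have:
+
+\[ \Gamma(p, V, W) = \tfrac12 I_B^{-1}(p) \operatorname{Cov}_p(X, V, W) = \tfrac12 I_B^{-1}(p)\, E_p\bigl[ {}^{e}U_p X\, V W \bigr], \tag{96} \]
+
+which is the projection of V(p)W(p)/2 on eUpV.
+Notice also that:
+
+\[ (dI_p^{-1}(0)\alpha)\, I_{B,p}(0) = -I_p^{-1}(0) \bigl( dI_{B,p}(0)\alpha \bigr) I_p^{-1}(0)\, I_{B,p}(0) = -I_p^{-1}(0) \bigl( dI_{B,p}(0)\alpha \bigr). \tag{97} \]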
+
+3.2. Acceleration
+
+Let p(t), t ∈ I, be a smooth curve in EV. Then, the velocity $\delta p(t) = \frac{d}{dt}\log(p(t))$ is a vector field
+V(p(t)) = δp(t), defined on the support p(I) of the curve. As the curve is the flow of the velocity
+field, we can compute the metric derivative of the velocity along the velocity itself, $D_{\delta p}\delta p$, from
+Equation (91) with V(p(0)) = δp(0):
+
+\begin{align*}
+\dot\sigma_p(D_{\delta p}\delta p)(p(0)) &= \frac{d}{dt}\dot\sigma_{p(0)}(\delta(p(t)))\Big|_{t=0} + \tfrac12 I_B^{-1}(p(0))\, \frac{d}{dt} I_B(p(t))\Big|_{t=0} \\
+&= \frac{d^2}{dt^2}\sigma_{p(0)}(p(t))\Big|_{t=0} + \tfrac12 I_B^{-1}(p(0))\, \frac{d}{dt} I_B(p(t))\Big|_{t=0}, \tag{98}
+\end{align*}
+
+which can be defined to be the Riemannian acceleration of the curve at t = 0.
+Let us write $\theta(t) = \sigma_p(p(t))$, p = p(0) and:
+
+\[ p(t) = \exp\Bigl( \sum_{j=1}^{m} \theta_j(t)\,{}^{e}U_p X_j - \psi_p(\theta(t)) \Bigr) \cdot p, \tag{99} \]
+
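+so that $\dot\sigma_p(\delta p)(0) = \dot\theta(0)$ and $\frac{d^2}{dt^2}\sigma_p(p(t))\big|_{t=0} = \ddot\theta(0)$. We have:
+
+\[ \frac{d}{dt} I_B(p(t))\Big|_{t=0} = \frac{d}{dt} I_{B,p}(\theta(t))\Big|_{t=0} = \frac{d}{dt}\operatorname{Hess}\psi_p(\theta(t))\Big|_{t=0} = \operatorname{Cov}_p\Bigl( X, X, \sum_{j=1}^{m} \dot\theta_j(0) X_j \Bigr) \tag{100} \]
+
+so that the acceleration at p has coordinates:
+
+\begin{align*}
+&\ddot\theta(0) + \tfrac12 \sum_{i,j=1}^{m} \dot\theta_i(0)\,\dot\theta_j(0) \operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p(X, X_i, X_j) \\
+&\quad = \ddot\theta(0) + \tfrac12 \operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p\Bigl( X, \sum_{i=1}^{m} \dot\theta_i(0) X_i, \sum_{j=1}^{m} \dot\theta_j(0) X_j \Bigr). \tag{101}
+\end{align*}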
+
+A geodesic is a curve whose acceleration is zero at each point. The exponential map is the mapping
+Exp: TEV → EV defined by:
+\[ (p, U) \mapsto \operatorname{Exp}_p U = p(1), \tag{102} \]
+
+where t ↦ p(t) is the geodesic, such that p(0) = p and δp(0) = U, for all U, such that the geodesic
+exists for t = 1.
+The exponential map is a particular retraction, that is, a family of mappings Rp, p ∈ E, from
+the tangent space at p to the manifold; here Rp: TpE → E, such that Rp(0) = p and dRp(0) = Id;
+see [4] (§5.4). It should be noted that exponential manifolds have natural retractions other than Exp, a
+notable one being the exponential family itself. A retraction provides a crucial step in gradient search
+algorithms by mapping a direction of increase of the objective function to a new trial point.
+
+3.2.1. Example: Binary Independent 2.5.2 Continued.
+
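+Let us consider the binary independent model of Section 2.5.2. We have
+
+\[ I_B(e_\lambda(\theta)) = I_{B,\lambda}(\theta) = \operatorname{diag}\bigl( \cosh^{-2}(\theta_j) : j = 1, \dots, d \bigr), \tag{103} \]
+
+and it follows that
+
+\[ \partial_k I_{B,\lambda}(\theta) = \partial_k \operatorname{diag}\bigl( \cosh^{-2}(\theta_j) : j = 1, \dots, d \bigr) = -2\cosh^{-3}(\theta_k)\sinh(\theta_k)\, E_{kk}, \tag{104} \]
+
+where $E_{kk}$ is the d × d matrix with entry one at (k, k), zero otherwise. The k-th Christoffel matrix in
+the second term in the definition of the metric derivative (aka Levi–Civita connection) is:
+
+\[ \Gamma^k_B(e_\lambda(\theta)) = \Gamma^k_\lambda(\theta) = \tfrac12 I_{B,\lambda}^{-1}(\theta)\, \partial_k I_{B,\lambda}(\theta) = -\tanh(\theta_k)\, E_{kk}. \tag{105} \]
+
+In terms of the moments, we have $I_{B,\lambda}(\theta) = \operatorname{Cov}_\theta(X, X') = \operatorname{Hess}\psi_\lambda(\theta)$. As $\partial_k \partial_i \partial_j \psi_\lambda(\theta) =
+\operatorname{Cov}_\theta(X_k, X_i, X_j)$, we can write:
+
+\[ \partial_k I_{B,\lambda}(\theta) = \partial_k \operatorname{diag}\bigl( \operatorname{Var}_\theta(X_j) : j = 1, \dots, d \bigr) = \operatorname{Cov}_\theta(X_k, X_k, X_k)\, E_{kk} \tag{106} \]
+
+and:
+
+\begin{align*}
+\Gamma^k_\lambda(\theta) &= \tfrac12 \operatorname{Cov}_\theta(X_k, X_k)^{-1}\operatorname{Cov}_\theta(X_k, X_k, X_k)\, E_{kk} \\
+&= \tfrac12 \bigl( 1 - (\eta_k)^2 \bigr)^{-1}\bigl( -2\eta_k + 2(\eta_k)^3 \bigr) E_{kk} = -\eta_k E_{kk}. \tag{107}
+\end{align*}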
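+
+The third cumulant used in Equation (107) follows from $X_k^2 = 1$. A short check, given as a sketch and writing $\eta_k = E_\theta[X_k]$:
+
+```latex
+\begin{align*}
+\operatorname{Cov}_\theta(X_k, X_k, X_k)
+  &= E_\theta\bigl[(X_k - \eta_k)^3\bigr]
+   = E_\theta[X_k^3] - 3\eta_k E_\theta[X_k^2] + 3\eta_k^2 E_\theta[X_k] - \eta_k^3 \\
+  &= \eta_k - 3\eta_k + 3\eta_k^3 - \eta_k^3
+   = -2\eta_k\,(1 - \eta_k^2), \\
+\tfrac12\,(1 - \eta_k^2)^{-1}\bigl(-2\eta_k (1 - \eta_k^2)\bigr)
+  &= -\eta_k = -\tanh(\theta_k),
+\end{align*}
+```
+
+in agreement with Equation (105).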
+
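+The equations for the geodesics starting from θ(0) with velocity $\dot\theta(0) = u$ are:
+
+\[ \ddot\theta_k(t) + \sum_{i,j=1}^{m} \Gamma^k_{ij}(\theta(t))\, \dot\theta_i(t)\, \dot\theta_j(t) = \ddot\theta_k(t) - \tanh(\theta_k(t))\,(\dot\theta_k(t))^2 = 0, \qquad k = 1, \dots, d. \tag{108} \]
+
+The ordinary differential equation:
+
+\[ \ddot\theta - \tanh(\theta)\,\dot\theta^2 = 0 \tag{109} \]
+
+has the closed form solution:
+
+\[ \theta(t) = \operatorname{gd}^{-1}\Bigl( \operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\, t \Bigr) = \tanh^{-1}\Bigl( \sin\Bigl( \operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\, t \Bigr) \Bigr) \tag{110} \]
+
+for all t, such that:
+
+\[ -\pi/2 < \operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\, t < \pi/2, \tag{111} \]
+
+where gd: R →] − π/2, +π/2[ is the Gudermannian function, that is, gd′(x) = 1/ cosh x, gd(0) = 0;
+in closed form, gd(x) = arcsin(tanh(x)). In fact, if θ is a solution of Equation (109), then:
+
+\[ \frac{d}{dt}\operatorname{gd}(\theta(t)) = \frac{\dot\theta(t)}{\cosh(\theta(t))} \tag{112} \]
+
+\[ \frac{d^2}{dt^2}\operatorname{gd}(\theta(t)) = -\frac{\sinh(\theta(t))\,(\dot\theta(t))^2}{\cosh^2(\theta(t))} + \frac{\ddot\theta(t)}{\cosh(\theta(t))} = \frac{1}{\cosh(\theta(t))}\bigl( \ddot\theta(t) - \tanh(\theta(t))\,(\dot\theta(t))^2 \bigr) = 0, \tag{113} \]
+
+so that $t \mapsto \operatorname{gd}(\theta(t))$ coincides (where it is defined) with an affine function characterized by the initial
+conditions.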
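+
+As a concrete instance of Equation (110), take $\theta(0) = 0$ and $\dot\theta(0) = u \neq 0$ (an illustrative choice of initial data, not one singled out in the text). Since gd(0) = 0 and cosh(0) = 1, the solution and a direct check of Equation (109) read:
+
+```latex
+\begin{align*}
+\theta(t) &= \tanh^{-1}\bigl(\sin(ut)\bigr), \qquad |t| < \frac{\pi}{2|u|}, \\
+\dot\theta(t) &= \frac{u\cos(ut)}{1 - \sin^2(ut)} = \frac{u}{\cos(ut)}, \qquad
+\ddot\theta(t) = \frac{u^2 \sin(ut)}{\cos^2(ut)}, \\
+\ddot\theta(t) - \tanh(\theta(t))\,\dot\theta(t)^2
+  &= \frac{u^2 \sin(ut)}{\cos^2(ut)} - \sin(ut)\,\frac{u^2}{\cos^2(ut)} = 0 .
+\end{align*}
+```
+
+In the expectation parameter the same curve is simply $\eta(t) = \tanh(\theta(t)) = \sin(ut)$.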
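+In particular, at t = 1, the geodesic Equation (110) defines the Riemannian exponential
+Exp: TEV → EV.
+If (p, U) ∈ TEV, that is, p ∈ EV and U ∈ TpEV, then $\sigma_\lambda(p) = \theta(0)$ and
+$U = \sum_j u_j\,{}^{e}U_p X_j$, $\dot\sigma_\lambda(U) = u$. If:
+
+\[ -\pi/2 < \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} < \pi/2, \tag{114} \]
+
+then we can take $\dot\theta(0) = u$ and t = 1, so that:
+
+\begin{align*}
+\operatorname{Exp}_p \colon U \overset{\dot\sigma_\lambda}{\longmapsto} u &\mapsto \Bigl[ \operatorname{gd}^{-1}\Bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \Bigr) : j = 1, \dots, d \Bigr] \\
+&\overset{e_\lambda}{\longmapsto} \prod_{j=1}^{m} \exp\Bigl( \operatorname{gd}^{-1}\Bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \Bigr) X_j - \psi\Bigl( \operatorname{gd}^{-1}\Bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \Bigr) \Bigr) \Bigr)\, 2^{-m}. \tag{115}
+\end{align*}
+
+We have:
+
+\[ \exp\bigl( \operatorname{gd}^{-1}(v) \bigr) = \exp\bigl( \tanh^{-1}(\sin(v)) \bigr) = \sqrt{\frac{1 + \sin v}{1 - \sin v}} \tag{116} \]
+
+and:
+
+\[ \psi\bigl( \operatorname{gd}^{-1}(v) \bigr) = \log\cosh\bigl( \tanh^{-1}(\sin v) \bigr) = \log\Bigl( \frac{1}{\cos v} \Bigr), \tag{117} \]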
+
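+hence $u \mapsto \operatorname{Exp}_p\bigl( \sum_{j=1}^{d} u_j\,{}^{e}U_p X_j \bigr)$ is given for:
+
+\[ u \in \mathop{\times}_{j=1}^{d} \bigl] \cosh(\theta_j)(-\pi/2 - \operatorname{gd}(\theta_j)),\ \cosh(\theta_j)(\pi/2 - \operatorname{gd}(\theta_j)) \bigr[, \tag{118} \]
+
+by:
+
+\begin{align*}
+\operatorname{Exp}_\theta(u) &= \prod_{j=1}^{m} \cos\Bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \Bigr) \left( \frac{1 + \sin\bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \bigr)}{1 - \sin\bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \bigr)} \right)^{X_j/2} 2^{-m} \\
+&= \prod_{j=1}^{m} \Bigl( 1 + \sin\Bigl( \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} \Bigr) X_j \Bigr)\, 2^{-m} \in \mathcal{E}_V. \tag{119}
+\end{align*}
+
+The expectation parameters are:
+
+\[ \eta_i(t) = E_{\theta=0}\Bigl[ X_i \prod_{j=1}^{m} \Bigl( 1 + \sin\Bigl( \operatorname{gd}(\theta_j) + \frac{t u_j}{\cosh(\theta_j)} \Bigr) X_j \Bigr) \Bigr] = \sin\Bigl( \operatorname{gd}(\theta_i) + \frac{t u_i}{\cosh(\theta_i)} \Bigr), \tag{120} \]
+
+and:
+\[ \operatorname{gd}(\theta_j) = \arcsin(\eta_j), \qquad \cosh(\theta_j) = \frac{1}{\bigl( 1 - (\eta_j)^2 \bigr)^{1/2}}, \tag{121} \]
+
+so that the exponential in terms of the expectation parameters is:
+
+\[ \operatorname{Exp}_\eta(u) = \Bigl[ \sin\Bigl( \arcsin \eta_j + \bigl( 1 - (\eta_j)^2 \bigr)^{1/2} u_j \Bigr) : j = 1, \dots, m \Bigr]. \tag{122} \]
+
+The inverse of the Riemannian exponential provides a notion of translation between two elements
+of the exponential model, which is a particular parametrization of the model:
+
+\[ \overrightarrow{\eta_1 \eta_2} = \operatorname{Exp}^{-1}_{\eta_1} \eta_2 = \Bigl[ \bigl( 1 - (\eta_1^j)^2 \bigr)^{-1/2}\bigl( \arcsin \eta_2^j - \arcsin \eta_1^j \bigr) : j = 1, \dots, m \Bigr] \tag{123} \]
+
+In particular, at θ = 0, we have the geodesic:
+
+\[ t \mapsto \prod_{j=1}^{d} \bigl( 1 + \sin(t u_j) X_j \bigr)\, 2^{-m}, \qquad |t| < \frac{\pi}{2 \max_j |u_j|} \tag{124} \]
+
+Some geodesic curves are shown in Figure 5.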
+
+[Figure 5 plot: axes η1 and η2, each from −1.0 to 1.0; panel title: Expectation parameters.]
+
+Figure 5. Geodesics from η = (0.75, 0.75).
+
+3.3. Riemannian Hessian
+
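+Let φ: EV → R with Riemannian gradient $\nabla\phi(p) = \sum_i (\widetilde{\nabla}\phi)_i(p)\,{}^{e}U_p X_i$, $\widetilde{\nabla}\phi(p) = I_B^{-1}(p)\,\nabla\phi_p(0)$.
+The Riemannian Hessian of φ is the metric derivative of the gradient ∇φ along the vector field Y, that
+is, $\operatorname{Hess}_Y \phi = D_Y \nabla\phi$; see [6] (Ch. 6, Ex. 11), [4] (§5.5). In the following, we denote by the symbol Hess,
+without a subscript, the ordinary Hessian matrix.
+From Equation (87), we have the coordinates of $\operatorname{Hess}_Y \phi(p)$. Given a generic tangent vector α, we
+compute from Equation (38):
+
+\begin{align*}
+d(\nabla\phi)_p(\theta)\alpha\big|_{\theta=0} &= d\bigl( I_{B,p}^{-1}(\theta)\,\nabla\phi_p(\theta) \bigr)\alpha\big|_{\theta=0} = (dI_{B,p}^{-1}(0)\alpha)\,\nabla\phi_p(0) + I_{B,p}^{-1}(0)\operatorname{Hess}\phi_p(0)\alpha \\
+&= -I_B^{-1}(p)\,(dI_{B,p}(0)\alpha)\,\widetilde{\nabla}\phi(p) + I_B^{-1}(p)\operatorname{Hess}\phi_p(0)\alpha \tag{125}
+\end{align*}
+
+and, upon substitution of $(\nabla\phi)_p$ for $V_p$ in Equation (87),
+
+\begin{align*}
+\dot\sigma_p(\operatorname{Hess}_Y \phi(p)) &= d(\nabla\phi)_p(0)\alpha + \tfrac12 I_B^{-1}(p)\bigl( dI_{B,p}(0)\alpha \bigr)(\nabla\phi)_p(0), \qquad \alpha = \dot\sigma_p(Y(p)) \\
+&= -I_B^{-1}(p)\,(dI_{B,p}(0)\alpha)\,\widetilde{\nabla}\phi(p) + I_B^{-1}(p)\operatorname{Hess}\phi_p(0)\alpha + \tfrac12 I_B^{-1}(p)\bigl( dI_{B,p}(0)\alpha \bigr)\widetilde{\nabla}\phi(p) \\
+&= I_B^{-1}(p)\operatorname{Hess}\phi_p(0)\alpha - \tfrac12 I_B^{-1}(p)\bigl( dI_{B,p}(0)\alpha \bigr)\widetilde{\nabla}\phi(p) \\
+&= I_B^{-1}(p)\Bigl( \operatorname{Hess}\phi_p(0)\alpha - \tfrac12 \bigl( dI_{B,p}(0)\alpha \bigr)\widetilde{\nabla}\phi(p) \Bigr) \tag{126}
+\end{align*}
+
+$\operatorname{Hess}_Y \phi$ is characterized by knowing the value of $g(\operatorname{Hess}_Y \phi, X)$: EV for all vector fields X. We have
+from Equation (126), with $\alpha = \dot\sigma_p(Y(p))$ and $\beta = \dot\sigma_p(X(p))$,
+
+\[ g_p(\operatorname{Hess}_{Y(p)} \phi(p), X(p)) = \beta' \operatorname{Hess}\phi_p(0)\alpha - \tfrac12 \beta' \bigl( dI_{B,p}(0)\alpha \bigr)\widetilde{\nabla}\phi(p). \tag{127} \]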
+
+This is the presentation of the Riemannian Hessian as a bi-linear form on TEV; see the comments
+in [4] (Prop. 5.5.2-3). Note that the Riemannian Hessian is positive definite if:
+
+\[ \alpha' \operatorname{Hess}\phi_p(0)\alpha \geq \tfrac12 \alpha' \bigl( dI_{B,p}(0)\alpha \bigr)\widetilde{\nabla}\phi(p), \qquad \alpha \in \mathbb{R}^m. \tag{128} \]
+
+4. Application to Combinatorial Optimization
+
+We conclude our paper by showing how the geometric method applies to the problem of finding
+the maximum of the expected value of a function.
+
+4.1. Hessian of a Relaxed Function
+
+Here is a key example of a vector field. Let f be any bounded random variable, and define the
+relaxed function to be φ(p) = Ep [ f ], p ∈ P>. Define F(p) to be the projection of f, as an element of
+L2(p), onto TpEV = eUpV, i.e., F(p) is the element of eUpV, such that:
+
+\[ E_p[(f - F(p))\,v] = 0, \qquad v \in {}^{e}U_p V \tag{129} \]
+
+In the basis eUpB, we have $F(p) = \sum_i \hat f_{p,i}\,{}^{e}U_p X_i$ and:
+
+\[ \operatorname{Cov}_p(f, X_j) = \sum_i \hat f_{p,i}\, E_p\bigl[ {}^{e}U_p X_i\,{}^{e}U_p X_j \bigr], \qquad j = 1, \dots, m, \tag{130} \]
+
+so that $\hat f_p = I_B^{-1}(p)\operatorname{Cov}_p(X, f)$ and
+
+\[ F(p) = \hat f_p'\,{}^{e}U_p X = \operatorname{Cov}_p(f, X)\, I_B^{-1}(p)\,{}^{e}U_p X. \tag{131} \]
+
+Let us compute the gradient of the relaxed function φ = E· [ f ] : EV. We have $\phi_p(\theta) = E_{e_p(\theta)}[f]$,
+and from the properties of exponential families, the Euclidean gradient is $\nabla\phi_p(0) = \operatorname{Cov}_p(f, X)$. It
+follows that the natural gradient is:
+
+\[ \widetilde{\nabla}\phi_p(0) = I_B^{-1}(p)\operatorname{Cov}_p(X, f) = \hat f, \tag{132} \]
+
+and the Riemannian gradient is ∇φ(p) = F(p).
+From the properties of exponential families, we have:
+
+\[ \operatorname{Hess}\phi_p(0) = \operatorname{Cov}_p(X, X, f), \]
+
+so that, in this case, Equation (127), when written in terms of the moments, is:
+
+\[ \beta' \operatorname{Cov}_p(X, X, f)\,\alpha - \tfrac12 \beta' \operatorname{Cov}_p(X, X, \alpha \cdot X)\operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p(X, f). \tag{133} \]
+
+4.1.1. Example: Binary Independent 2.5.2 and 3.2.1 Continued
+
+We list below the computation of the Hessian in the case of two binary independent variables.
+Computations were done with Sage [22], which allows both the reduction $x_i^2 = 1$ in the ring of
+polynomials and the simplifications in the symbolic ring of parameters.
+
+\[ \operatorname{Cov}_\eta(X, f) = \begin{pmatrix} -(\eta_1^2 - 1)a_1 - (\eta_1^2\eta_2 - \eta_2)a_{12} \\ -(\eta_2^2 - 1)a_2 - (\eta_1\eta_2^2 - \eta_1)a_{12} \end{pmatrix} = \begin{pmatrix} -(\eta_1 - 1)(\eta_1 + 1)(a_{12}\eta_2 + a_1) \\ -(\eta_2 - 1)(\eta_2 + 1)(a_{12}\eta_1 + a_2) \end{pmatrix} \tag{134} \]
+
+\[ \operatorname{Cov}_\eta(X, X) = \begin{pmatrix} -\eta_1^2 + 1 & 0 \\ 0 & -\eta_2^2 + 1 \end{pmatrix} = \begin{pmatrix} -(\eta_1 - 1)(\eta_1 + 1) & 0 \\ 0 & -(\eta_2 - 1)(\eta_2 + 1) \end{pmatrix} \tag{135} \]
+
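+\[ \operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, f) = \begin{pmatrix} a_{12}\eta_2 + a_1 \\ a_{12}\eta_1 + a_2 \end{pmatrix} = \nabla F(\eta) \tag{136} \]
+
+\begin{align*}
+\operatorname{Cov}_\eta(X, X, f) &= \begin{pmatrix} 2(\eta_1^3 - \eta_1)a_1 + 2(\eta_1^3\eta_2 - \eta_1\eta_2)a_{12} & (\eta_1^2\eta_2^2 - \eta_1^2 - \eta_2^2 + 1)a_{12} \\ (\eta_1^2\eta_2^2 - \eta_1^2 - \eta_2^2 + 1)a_{12} & 2(\eta_1\eta_2^3 - \eta_1\eta_2)a_{12} + 2(\eta_2^3 - \eta_2)a_2 \end{pmatrix} \\
+&= \begin{pmatrix} 2(\eta_1 - 1)(\eta_1 + 1)(a_{12}\eta_2 + a_1)\eta_1 & (\eta_2 - 1)(\eta_2 + 1)(\eta_1 - 1)(\eta_1 + 1)a_{12} \\ (\eta_2 - 1)(\eta_2 + 1)(\eta_1 - 1)(\eta_1 + 1)a_{12} & 2(\eta_2 - 1)(\eta_2 + 1)(a_{12}\eta_1 + a_2)\eta_2 \end{pmatrix} \tag{137}
+\end{align*}
+
+\[ \operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, X, f) = \begin{pmatrix} -2(a_{12}\eta_2 + a_1)\eta_1 & -a_{12}\eta_2^2 + a_{12} \\ -a_{12}\eta_1^2 + a_{12} & -2(a_{12}\eta_1 + a_2)\eta_2 \end{pmatrix} \tag{138} \]
+
+\[ \operatorname{Cov}_\eta(X, X, \nabla F(\eta)) = \begin{pmatrix} 2(a_{12}\eta_2 + a_1)(\eta_1 + 1)(\eta_1 - 1)\eta_1 & 0 \\ 0 & 2(a_{12}\eta_1 + a_2)(\eta_2 + 1)(\eta_2 - 1)\eta_2 \end{pmatrix} \tag{139} \]
+
+\[ \operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, X, \nabla F(\eta)) = \begin{pmatrix} -2(a_{12}\eta_2 + a_1)\eta_1 & 0 \\ 0 & -2(a_{12}\eta_1 + a_2)\eta_2 \end{pmatrix} \tag{140} \]
+
+The Riemannian Hessian as a matrix in the basis of the tangent space is:
+
+\begin{align*}
+\operatorname{Hess} F(\eta) &= \operatorname{Cov}_\eta(X, X)^{-1}\Bigl( \operatorname{Cov}_\eta(X, X, f) - \tfrac12 \operatorname{Cov}_\eta(X, X, \nabla F(\eta)) \Bigr) \\
+&= \begin{pmatrix} -(a_{12}\eta_2 + a_1)\eta_1 & -a_{12}(\eta_2 + 1)(\eta_2 - 1) \\ -a_{12}(\eta_1 + 1)(\eta_1 - 1) & -(a_{12}\eta_1 + a_2)\eta_2 \end{pmatrix} \tag{141}
+\end{align*}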
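+
+The entries of Equation (141) can be recovered by hand from Equations (135), (137) and (139). The following is a sketch for the first row only; the shorthand $g_1 = a_{12}\eta_2 + a_1$ is ours and is not notation from the text.
+
+```latex
+\begin{align*}
+\operatorname{Hess} F(\eta)_{11}
+  &= \frac{2(\eta_1^2 - 1)\,g_1\eta_1 - \tfrac12 \cdot 2(\eta_1^2 - 1)\,g_1\eta_1}{1 - \eta_1^2}
+   = -g_1\eta_1, \\
+\operatorname{Hess} F(\eta)_{12}
+  &= \frac{(\eta_1^2 - 1)(\eta_2^2 - 1)\,a_{12} - 0}{1 - \eta_1^2}
+   = -a_{12}(\eta_2 + 1)(\eta_2 - 1).
+\end{align*}
+```
+
+The second row follows by exchanging the indices 1 and 2.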
+
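+As a check, let us compute the Riemannian Hessian as a natural Hessian in the Riemannian
+parameters, $\operatorname{Hess} \phi \circ \operatorname{Exp}_p(u)\big|_{u=0}$; see [4] (Prop. 5.5.4). We have:
+
+\begin{align*}
+F \circ \operatorname{Exp}_\eta(u) &= a_{12} \sin\Bigl( \sqrt{-\eta_1^2 + 1}\, u_1 + \arcsin(\eta_1) \Bigr) \sin\Bigl( \sqrt{-\eta_2^2 + 1}\, u_2 + \arcsin(\eta_2) \Bigr) \\
+&\quad + a_1 \sin\Bigl( \sqrt{-\eta_1^2 + 1}\, u_1 + \arcsin(\eta_1) \Bigr) + a_2 \sin\Bigl( \sqrt{-\eta_2^2 + 1}\, u_2 + \arcsin(\eta_2) \Bigr) \tag{142}
+\end{align*}
+
+and:
+
+\begin{align*}
+\operatorname{Hess} F \circ \operatorname{Exp}_\eta(u)\big|_{u=0} &= \begin{pmatrix} (\eta_1^2 - 1)a_{12}\eta_1\eta_2 + (\eta_1^2 - 1)a_1\eta_1 & (\eta_1^2 - 1)(\eta_2^2 - 1)a_{12} \\ (\eta_1^2 - 1)(\eta_2^2 - 1)a_{12} & (\eta_2^2 - 1)a_{12}\eta_1\eta_2 + (\eta_2^2 - 1)a_2\eta_2 \end{pmatrix} \\
+&= \begin{pmatrix} (a_{12}\eta_2 + a_1)(\eta_1 + 1)(\eta_1 - 1)\eta_1 & a_{12}(\eta_1 + 1)(\eta_1 - 1)(\eta_2 + 1)(\eta_2 - 1) \\ a_{12}(\eta_1 + 1)(\eta_1 - 1)(\eta_2 + 1)(\eta_2 - 1) & (a_{12}\eta_1 + a_2)(\eta_2 + 1)(\eta_2 - 1)\eta_2 \end{pmatrix}. \tag{143}
+\end{align*}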
+
+Note the presence of the factor Covη (X, X).
+
+4.2. Newton Method
+
+The Newton method is an iterative method that generates a sequence of points pt, with t = 0, 1, . . . ,
+that converges towards a stationary point $\hat p$ of a function F(p) = Ep [ f ], p ∈ EV, that is, a critical point of the
+vector field $p \mapsto \nabla F(p)$, $\nabla F(\hat p) = 0$. Here, we follow [4] (Ch. 5–6), and in particular Algorithm 5 on
+Page 113.
+Let ∇F be a gradient field. We reproduce in our case the basic derivation of the Newton method
+in the following. Note that, in this section, we use the notation Hess •[α] to denote Hessα •. Using
+the definition of metric derivative, we have for a geodesic curve [0, 1] ∋ t �→ p(t) ∈ EV connecting
+p = p(0) to ˆp = p(1) that:
+
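+\[ \frac{d}{dt}\, g_{p(t)}\bigl( \nabla F(p(t)), \delta p(t) \bigr) = g_{p(t)}\bigl( \operatorname{Hess} F(p(t))[\delta p(t)], \delta p(t) \bigr) \tag{144} \]
+
+hence the increment from p to $\hat p$ is:
+
+\[ g_{\hat p}\bigl( \nabla F(\hat p), \delta p(1) \bigr) - g_p\bigl( \nabla F(p), \delta p(0) \bigr) = \int_0^1 g_{p(t)}\bigl( \operatorname{Hess} F(p(t))[\delta p(t)], \delta p(t) \bigr)\, dt. \tag{145} \]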
+
+Now, we assume that ∇F( ˆp) = 0 and that in Equation (145), the integral is approximated by the
+initial value of the integrand, that is to say, the Hessian is approximately constant on the geodesic from
+p to ˆp; we obtain:
+− gp (∇F(p), δp(0)) = gp (Hess F(p)[δp(0)], δp(0)) + ϵ.
+(146)
+
+If we can solve the Newton equation:
+
+Hess F(p(t))[u] = −∇F(p)
+(147)
+
+then u is approximately equal to the initial velocity of the geodesic connecting p to ˆp, that is, ˆp =
+Expp(u).
+The particular structure of the exponential manifold suggests at least two natural retractions
+that could be used to move from u to ˆp.
+Namely, we have the Riemannian exponential
+(θt, θt+1) �→ Expθt(θt+1 − θt) and the e-retraction coming from the exponential family itself and
+defined by (θt, θt+1) �→ eθt(θt+1 − θt), with θt+1 − θt = ut.
+In the θ parameters, with the e-retraction, the Newton method generates a sequence (θt) according
+to the following updating rule:
+
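+\[ \theta_{t+1} = \theta_t - \lambda \operatorname{Hess} F(\theta_t)^{-1}\, \widetilde{\nabla}F(\theta_t) \tag{148} \]
+
+where λ > 0 is an extra parameter intended to control the step size and, in turn, the convergence to $\hat\theta$;
+see [5].
+We can rewrite Equation (148) in terms of covariances as:
+
+\[ \theta_{t+1} = \theta_t - \lambda \Bigl( \operatorname{Cov}_{\theta_t}(X, X, f) - \tfrac12 \operatorname{Cov}_{\theta_t}(X, X, \widetilde{\nabla}F(\theta_t)) \Bigr)^{-1} \widetilde{\nabla}F(\theta_t). \tag{149} \]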
+
+4.3. Example: Binary Independent
+
+In the η parameters, the Newton step is:
+
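+\[ u = -\operatorname{Hess} F(\eta)^{-1}\nabla F(\eta) = \begin{pmatrix} \dfrac{a_{12}^2\eta_1 + a_{12}a_2 + (a_1a_{12}\eta_1 + a_1a_2)\eta_2}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} \\[2ex] \dfrac{a_1a_2\eta_1 + a_1a_{12} + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} \end{pmatrix} \tag{150} \]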
+
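+and the new η in the Riemannian retraction is:
+
+\[ \operatorname{Exp}_\eta(u) = \begin{pmatrix} \sin\Bigl( \dfrac{\bigl( a_{12}^2\eta_1 + a_{12}a_2 + (a_1a_{12}\eta_1 + a_1a_2)\eta_2 \bigr)\sqrt{-\eta_1^2 + 1}}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} + \arcsin(\eta_1) \Bigr) \\[2ex] \sin\Bigl( \dfrac{\bigl( a_1a_2\eta_1 + a_1a_{12} + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2 \bigr)\sqrt{-\eta_2^2 + 1}}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} + \arcsin(\eta_2) \Bigr) \end{pmatrix}. \tag{151} \]
+
+In Figure 6, we represented the vector field associated with the Newton step in the η parameters,
+with λ = 0.05, using the Riemannian retraction, for the case a1 = 1, a2 = 2 and a12 = 3, with:
+
+\[ \operatorname{Exp}_\eta(u) = \begin{pmatrix} \sin\Bigl( \lambda\, \dfrac{\sqrt{-\eta_1^2 + 1}\,\bigl( (3\eta_1 + 2)\eta_2 + 9\eta_1 + 6 \bigr)}{3(2\eta_1 + 3)\eta_2^2 + 9\eta_1^2 + (3\eta_1^2 + 2\eta_1)\eta_2 - 9} + \arcsin(\eta_1) \Bigr) \\[2ex] \sin\Bigl( \lambda\, \dfrac{\bigl( 3(2\eta_1 + 3)\eta_2 + 2\eta_1 + 3 \bigr)\sqrt{-\eta_2^2 + 1}}{3(2\eta_1 + 3)\eta_2^2 + 9\eta_1^2 + (3\eta_1^2 + 2\eta_1)\eta_2 - 9} + \arcsin(\eta_2) \Bigr) \end{pmatrix}. \tag{152} \]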
+The red dotted lines in the figure separate the basins of attraction of the vector field and
+correspond to the solutions of the explicit equation in η for which the Newton step u is not defined.
+This vector field can be compared to that in Figure 7, associated with the Newton step for F(η) using
+the Euclidean geometry. In the Euclidean geometry, F(η) is a quadratic function with one saddle point,
+so that from any η, the Newton step points in the direction of the critical point. This makes the Newton
+step unsuitable for an optimization algorithm. On the other hand, in the Riemannian geometry, the
+vertices of the polytope are critical points for F(η), and they determine the presence of multiple basins
+of attraction, as expected.
+
+[Figure 6 plot: axes η1 and η2, each from −1.0 to 1.0; panel title: Expectation parameters.]
+
+Figure 6. The Newton step in the η parameters, Riemannian retraction, λ = 0.05. The red dotted lines
+identify the different basins of attraction and correspond to the points for which the Newton step is not
+defined; cf. Equation (150). The instability close to the critical lines is represented by the longer arrows.
+
+[Figure 7 plot: axes η1 and η2, each from −1.0 to 1.0; panel title: Expectation parameters.]
+
+Figure 7. The Newton step in the η parameters, Euclidean geometry, λ = 0.05.
+
+[Figure 8 plot: axes θ1 and θ2, each from −2 to 2; panel title: Natural parameters.]
+
+Figure 8. The Newton step in the θ parameters, exponential retraction, λ = 0.015. The red dotted
+lines identify the different basins of attraction and correspond to the points for which the Newton step
+is not defined. The instability along the critical lines, which identifies the basins of attraction, is not
+represented.
+
+[Figure 9 plot: axes θ1 and θ2, each from −2 to 2; panel title: Natural parameters.]
+
+Figure 9. The Newton step in the θ parameters, Euclidean geometry, λ = 0.15. The red dotted lines
+identify the different basins of attraction and correspond to the points for which the Newton step is
+not defined. The instability along the critical lines, which identifies the basins of attraction, is not
+represented.
+
+Figure 8 shows the Newton step in the θ parameters based on the e-retraction of Equation (149),
+while Figure 9 represents the Newton step evaluated with respect to the Euclidean geometry. A
+comparison of the two vector fields shows that, differently from the η parameters, the number of
+basins of attraction is the same in the two geometries; however, the scale of the vectors is different.
+In particular, notice how on the plateau, for diverging θ, the Newton step in the Euclidean geometry
+vanishes, while in the Riemannian geometry, it gets larger. This behavior suggests better convergence
+properties for an optimization algorithm based on the Newton step evaluated using the proper
+Riemannian geometry. In the θ parameters, the boundaries of the basins of attraction represented by
+the red dotted lines have been computed numerically and correspond to the values of θ for which the
+update step is not defined.
+Finally, notice that in both the η and θ parameters, the step is not always in the direction of descent
+for the function, a common behavior of the Newton method, which converges to the critical points.
+
+5. Discussion and Conclusions
+
+In this paper, we introduced second-order calculus over a statistical manifold, following the
+approach described in [4], which has been adapted to the special case of exponential statistical
+models [2,3]. By defining the Riemannian Hessian and using the notion of retraction, we developed
+the proper machinery necessary for the definition of the updating rule of the Newton method for the
+optimization of a function defined over an exponential family.
+The examples discussed in the paper show that by taking into account the proper Riemannian
+geometry of a statistical exponential family, the vector fields associated with the Newton step in the
+different parametrizations change profoundly. Not only new basins of attraction associated with local
+and global minima appear, as for the expectation parameters, but also the magnitude of the Newton
+step is affected, as over the plateau in the natural parameters. Such differences are expected to have a
+strong impact on the performance of an optimization algorithm based on the Newton step, from both
+the point of view of achievable convergence and the speed of convergence to the optimum.
+
+The Newton method is a popular second order optimization technique based on the computation
+of the Hessian of the function to be optimized and is well known for its super-linear convergence
+properties. However, the use of the Newton method poses a number of issues in practice.
+First of all, as the examples in Figures 6 and 8 show, the Newton step does not always point
+in the direction of the natural gradient, and the algorithm may not converge to a (local) optimum
+of the function. Such behavior is not unexpected; indeed the Newton method tends to converge to
+critical points of the function to be optimized, which include local minima, local maxima and saddle
+points. In order to obtain a direction of ascent for the function to be optimized, the Hessian must
+be negative-definite, i.e., its eigenvalues must be strictly negative, which is not guaranteed in the
+general case. Another important remark is related to the computational complexity associated with
+the evaluation of the Hessian, compared to the (natural) gradient. Indeed, to obtain the Newton step d,
+Christoffel matrices have to be evaluated, together with the third order covariances between sufficient
+statistics and the function, and the Hessian has to be inverted. Finally, notice that when the Hessian is
+close to being non-invertible, numerical problems may arise in the computation of the Newton step,
+and the algorithm may become unstable and diverge.
+In the literature, different methods have been proposed to overcome these issues. Among them,
+we mention quasi-Newton methods, where the update vector is obtained using a modified Hessian,
+which has been made negative-definite, for instance, by adding a proper correction matrix.
+This paper represents the first step in the design of an algorithm based on the Newton method
+for the optimization over a statistical model. The authors are working on the computational aspects
+related to the implementation of the method, and a new paper with experimental results is in progress.
+
+Acknowledgments: Luigi Malagò was supported by the Xerox University Affairs Committee Award and by
+de Castro Statistics, Collegio Carlo Alberto, Moncalieri. Giovanni Pistone is supported by de Castro Statistics,
+Collegio Carlo Alberto, Moncalieri, and is a member of GNAMPA–INdAM, Roma.
+
+Author Contributions
+
+All authors contributed to the design of the research. The research was carried out by all authors.
+The study of the Hessian and of the Newton method in statistical manifolds was originally suggested
+by Luigi Malagò. The manuscript was written by Luigi Malagò and Giovanni Pistone. All authors
+have read and approved the final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory;
+Number 9 in IMS Lecture Notes. Monograph Series; Institute of Mathematical Statistics: Hayward, CA, USA,
+1986; p. 283.
+2. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI,
+USA, 2000; p. 206.
+3. Pistone, G. Nonparametric Information Geometry. In Geometric Science of Information, Proceedings of the
+First International Conference, GSI 2013, Paris, France, 28–30 August 2013; Nielsen, F., Barbaresco, F., Eds.;
+Lecture Notes in Computer Science, Volume 8085; Springer: Berlin/Heidelberg, Germany, 2013; pp. 5–36.
+4. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University
+Press: Princeton, NJ, USA, 2008; pp. xvi+224.
+5. Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer Series in Operations Research and Financial
+Engineering; Springer: New York, NY, USA, 2006; pp. xxii+664.
+6. Do Carmo, M.P. Riemannian geometry; Mathematics: Theory & Applications; Birkhäuser Boston Inc.: Boston,
+MA, USA, 1992; pp. xiv+300.
+7. Abraham, R.; Marsden, J.E.; Ratiu, T. Manifolds, Tensor Analysis, and Applications, 2nd ed.; Applied
+Mathematical Sciences, Volume 75; Springer: New York, NY, USA, 1988; pp. x+654.
+8. Lang, S. Differential and Riemannian Manifolds, 3rd ed.; Graduate Texts in Mathematics; Springer: New York,
+NY, USA, 1995; pp. xiv+364.
+9. Pistone, G. Algebraic varieties vs. differentiable manifolds in statistical models. In Algebraic and Geometric
+Methods in Statistics; Gibilisco, P., Riccomagno, E., Rogantin, M.P., Wynn, H.P., Eds.; Cambridge University
+Press: Cambridge, UK, 2010.
+10. Malagò, L.; Matteucci, M.; Dal Seno, B. An information geometry perspective on estimation of distribution
+algorithms: Boundary analysis. In Proceedings of the 2008 GECCO Conference Companion On Genetic and
+Evolutionary Computation (GECCO ’08); ACM: New York, NY, USA, 2008; pp. 2081–2088.
+11. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Relaxation as a Unifying Approach in 0/1 Programming. In
+Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity,
+Sparsity & Polyhedra (DISCML), Whistler Resort & Spa, BC, Canada, 11–12 December 2009.
+12. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Natural Gradient Descent by Estimation of Empirical
+Covariances. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), New Orleans, LA,
+USA, 5–8 June 2011; pp. 949–956.
+13. Malagò, L.; Matteucci, M.; Pistone, G. Towards the geometry of estimation of distribution algorithms based
+on the exponential family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms
+(FOGA ’11), Schwarzenberg, Austria, 5–8 January 2011; ACM: New York, NY, USA, 2011; pp. 230–242.
+14. Malagò, L.; Matteucci, M.; Pistone, G. Natural gradient, fitness modelling and model selection: A unifying
+perspective. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Cancun, Mexico,
+20–23 June 2013; pp. 486–493.
+15. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276.
+16. Shima, H. The Geometry of Hessian Structures; World Scientific Publishing Co. Pte. Ltd.: Hackensack, NJ, USA,
+2007; pp. xiv+246.
+17. Malagò, L. On the Geometry of Optimization Based on the Exponential Family Relaxation. Ph.D. Thesis,
+Politecnico di Milano, Milano, Italy, 2012.
+18. Gallavotti, G. Statistical Mechanics: A Short Treatise; Texts and Monographs in Physics; Springer: Berlin,
+Germany, 1999; pp. xiv+339.
+19. Naudts, J. Generalised exponential families and associated entropy functions. Entropy 2008, 10, 131–149.
+20. Esteban, J.; Ray, D. On the Measurement of Polarization. Econometrica 1994, 62, 819–851.
+21. Montalvo, J.; Reynal-Querol, M. Ethnic polarization, potential conflict, and civil wars. Am. Econ. Rev. 2005,
+796–816.
+22. Stein, W. et al. Sage Mathematics Software (Version 6.0). The Sage Development Team, 2013. Available
+online: http://www.sagemath.org (accessed on 27 March 2014).
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+entropy
+
+Article
+Information Geometric Complexity of a Trivariate
+Gaussian Statistical Model
+
+Domenico Felice 1,2,*, Carlo Cafaro 3 and Stefano Mancini 1,2
+
+1 School of Science and Technology, University of Camerino, I-62032 Camerino, Italy; E-Mail:
+stefano.mancini@unicam.it
+2 INFN-Sezione di Perugia, Via A. Pascoli, I-06123 Perugia, Italy
+3 Department of Mathematics, Clarkson University, Potsdam, 13699 NY, USA; E-Mail: carlocafaro2000@yahoo.it
+* E-Mail: domenico.felice@unicam.it
+
+Received: 1 April 2014; in revised form: 21 May 2014 / Accepted: 22 May 2014 /
+Published: 26 May 2014
+
+Abstract: We evaluate the information geometric complexity of entropic motion on low-dimensional
+Gaussian statistical manifolds in order to quantify how difficult it is to make macroscopic predictions
+about systems in the presence of limited information. Specifically, we observe that the complexity of
+such entropic inferences not only depends on the amount of available pieces of information but also
+on the manner in which such pieces are correlated. Finally, we uncover that, for certain correlational
+structures, the impossibility of reaching the most favorable configuration from an entropic inference
+viewpoint seems to lead to an information geometric analog of the well-known frustration effect that
+occurs in statistical physics.
+
+Keywords: probability theory; Riemannian geometry; complexity
+
+1. Introduction
+
+One of the main efforts in physics is modeling and predicting natural phenomena using relevant
+information about the system under consideration. Theoretical physics has had a general measure of
+the uncertainty associated with the behavior of a probabilistic process for more than 100 years: the
+Shannon entropy [1]. The Shannon information theory was applied to dynamical systems and became
+successful in describing their unpredictability [2].
+Along a similar avenue we may set Entropic Dynamics [3] which makes use of inductive inference
+(Maximum Entropy Methods [4]) and Information Geometry [5]. This is clearly remarkable given
+that microscopic dynamics can be far removed from the phenomena of interest, such as in complex
+biological or ecological systems. Extension of ED to temporally-complex dynamical systems on curved
+statistical manifolds led to relevant measures of chaoticity [6]. In particular, an information geometric
+approach to chaos (IGAC) has been pursued studying chaos in informational geodesic flows describing
+physical, biological or chemical systems. It is the information geometric analogue of conventional
+geometrodynamical approaches [7] where the classical configuration space is being replaced by a
+statistical manifold with the additional possibility of considering chaotic dynamics arising from non
+conformally flat metrics. Within this framework, it seems natural to consider as a complexity measure
+the (time average) statistical volume explored by geodesic flows, namely an Information Geometry
+Complexity (IGC).
+This quantity might help uncover connections between microscopic dynamics and experimentally
+observable macroscopic dynamics, which is a fundamental issue in physics [8]. An interesting
+manifestation of such a relationship appears in the study of the effects of microscopic external
+noise (noise imposed on the microscopic variables of the system) on the observed collective motion
+(macroscopic variables) of a globally coupled map [9]. These effects are quantified in terms of the
+complexity of the collective motion. Furthermore, it turns out that noise at a microscopic level reduces
+the complexity of the macroscopic motion, which in turn is characterized by the number of effective
+degrees of freedom of the system.
+
+Entropy 2014, 16, 2944–2958; doi:10.3390/e16062944
+www.mdpi.com/journal/entropy
+
+The investigation of the macroscopic behavior of complex systems in terms of the underlying
+statistical structure of its microscopic degrees of freedom also reveals effects due to the presence of
+microcorrelations [10]. In this article we first show which macro-states should be considered in a
+Gaussian statistical model in order to have a reduction in time of the Information Geometry Complexity.
+Then, dealing with correlated bivariate and trivariate Gaussian statistical models, the ratio between
+the IGC in the presence and in the absence of microcorrelations is explicitly computed, finding an
+intriguing, even though not yet deeply understood, connection with the phenomenon of geometric
+frustration [11].
+The layout of the article is as follows. In Section 2 we introduce a general statistical model
+discussing its geometry and describing both its dynamics and information geometry complexity. In
+Section 3, Gaussian statistical models (up to a trivariate model) are considered. There, we compute
+the asymptotic temporal behaviors of their IGCs. Finally, in Section 4 we draw our conclusions by
+outlining our findings and proposing possible further investigations.
+
+2. Statistical Models and Information Geometry Complexity
+
+Given n real-valued random variables X1, . . . , Xn defined on the sample space Ω with joint
+probability density p : Rn → R satisfying the conditions
+
+\[ p(x) \geq 0 \ \ (\forall x \in \mathbb{R}^n) \quad \text{and} \quad \int_{\mathbb{R}^n} dx \, p(x) = 1, \tag{1} \]
+
+let us consider a family P of such distributions and suppose that they can be parametrized using m
+real-valued variables (θ1, . . . , θm) so that
+
+\[ \mathcal{P} = \{ p_\theta = p(x|\theta) \mid \theta = (\theta^1, \ldots, \theta^m) \in \Theta \}, \tag{2} \]
+
+where Θ ⊆ Rm is the parameter space and the mapping θ → pθ is injective. In such a way, P is an
+m-dimensional statistical model on Rn.
+The mapping ϕ : P → Rm defined by ϕ(pθ) = θ allows us to consider ϕ = [θi] as a coordinate
+system for P. Assuming parametrizations which are C∞, we can turn P into a C∞ differentiable
+manifold (thus, P is called statistical manifold) [5].
+The values x1, . . . , xn taken by the random variables define the micro-state of the system, while the
+values θ1, . . . , θm taken by parameters define the macro-state of the system.
+Let P = {pθ|θ ∈ Θ} be an m-dimensional statistical model. Given a point θ, the Fisher information
+matrix of P in θ is the m × m matrix G(θ) = [gij], where the (i, j) entry is defined by
+
+\[ g_{ij}(\theta) := \int_{\mathbb{R}^n} dx \, p(x|\theta) \, \partial_i \log p(x|\theta) \, \partial_j \log p(x|\theta), \tag{3} \]
+
+with ∂i standing for ∂/∂θi. The matrix G(θ) is symmetric, positive semidefinite and determines a
+Riemannian metric on the parameter space Θ [5]. Hence, it is possible to define a Riemannian statistical
+manifold M := (Θ, g), where g = gijdθi ⊗ dθj (i, j = 1, . . . , m) is the metric whose components gij are
+given by Equation (3) (throughout the paper we use the Einstein sum convention).
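+
+As a minimal worked illustration of Equation (3), which is not part of the original text, consider the
+one-parameter family of unit-variance Gaussians with θ = μ. The sketch below assumes only that family
+and the definition (3):
+
+```latex
+% Illustrative sketch: Fisher information of N(\mu, 1) with the single parameter \theta = \mu
+\begin{align*}
+\log p(x|\mu) &= -\tfrac{1}{2}(x-\mu)^2 - \tfrac{1}{2}\log(2\pi), \\
+\partial_\mu \log p(x|\mu) &= x - \mu, \\
+g_{\mu\mu}(\mu) &= \int_{\mathbb{R}} dx \, p(x|\mu) \, (x-\mu)^2 = 1 .
+\end{align*}
+```
+
+The last integral is the variance of the distribution, so in this example the metric is the flat metric on the real line.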
+Given the Riemannian manifold M = (Θ, g), it is well known that there exists only one
+linear connection ∇ (the Levi-Civita connection) on M that is compatible with the metric g and
+symmetric [12]. We remark that the manifold M has one chart, since Θ is an open set of Rm, and the
+Levi-Civita connection is uniquely defined by means of the Christoffel coefficients
+
+\[ \Gamma^k_{ij} = \frac{1}{2} g^{kl} \left( \frac{\partial g_{lj}}{\partial \theta^i} + \frac{\partial g_{il}}{\partial \theta^j} - \frac{\partial g_{ij}}{\partial \theta^l} \right), \qquad (i, j, k = 1, \ldots, m) \tag{4} \]
+
+where g^{kl} is the (k, l) entry of the inverse of the Fisher matrix G(θ).
+The idea of curvature is the fundamental tool to understand the geometry of the manifold
+M = (Θ, g). Actually, it is the basic geometric invariant and the intrinsic way to obtain it is by
+means of geodesics. It is well known that, given any point θ ∈ M and any vector v tangent to
+M at θ, there is a unique geodesic starting at θ with initial tangent vector v. Indeed, within the
+considered coordinate system, the geodesics are solutions of the following nonlinear second order
+coupled ordinary differential equations [12]
+
+\[ \frac{d^2\theta^k}{d\tau^2} + \Gamma^k_{ij} \frac{d\theta^i}{d\tau} \frac{d\theta^j}{d\tau} = 0, \tag{5} \]
+
+with τ denoting the time.
+The recipe to compute some curvatures at a point θ ∈ M is the following: first, select a
+2-dimensional subspace Π of the tangent space to M at θ; second, follow the geodesics through
+θ whose initial tangent vectors lie in Π and consider the 2-dimensional submanifold SΠ swept out
+by them, which inherits a Riemannian metric from M; finally, compute the Gaussian curvature of SΠ at θ,
+which can be obtained from its Riemannian metric as stated in the Theorema Egregium [13]. The number
+K(Π) found in this manner is called the sectional curvature of M at θ associated with the plane Π. In
+terms of local coordinates, to compute the sectional curvature we need the curvature tensor,
+
+\[ R^h_{ijk} = \frac{\partial \Gamma^h_{jk}}{\partial \theta^i} - \frac{\partial \Gamma^h_{ik}}{\partial \theta^j} + \Gamma^l_{jk} \Gamma^h_{il} - \Gamma^l_{ik} \Gamma^h_{jl}. \tag{6} \]
+
+For any basis (ξ, η) for a 2-plane Π ⊂ TθM, the sectional curvature at θ ∈ M is given by [12]
+
+\[ K(\xi, \eta) = \frac{R(\xi, \eta, \eta, \xi)}{|\xi|^2 |\eta|^2 - \langle \xi, \eta \rangle^2}, \tag{7} \]
+
+where R is the Riemann curvature tensor, which is written in coordinates as R = R_{ijkl} dθ^i ⊗ dθ^j ⊗ dθ^k ⊗ dθ^l
+with R_{ijkl} = g_{lh} R^h_{ijk}, and ⟨·, ·⟩ is the inner product defined by the metric g.
+The sectional curvature is directly related to the topology of the manifold; along this direction
+the Cartan-Hadamard Theorem [13] is enlightening by stating that any complete, simply connected
+n-dimensional manifold with non positive sectional curvature is diffeomorphic to Rn.
+We can consider upon the statistical manifold M = (Θ, g) the macro-variables θ as accessible
+information and then derive the information dynamical Equation (5) from a standard principle of least
+action of Jacobi type [3]. The geodesic Equations (5) describe a reversible dynamics whose solution is
+the trajectory between an initial and a final macrostate θinitial and θfinal, respectively. The trajectory can
+be equally traversed in both directions [10]. Actually, an equation relating instability with geometry
+exists, and it raises the hope that some global information about the average degree of instability (chaos)
+of the dynamics is encoded in global properties of the statistical manifolds [7]. That this might
+happen is shown by the special case of constant-curvature manifolds, for which the Jacobi-Levi-Civita
+equation simplifies to [7]
+
+\[ \frac{d^2 J^i}{d\tau^2} + K J^i = 0, \tag{8} \]
+
+where K is the constant sectional curvature of the manifold (see Equation (7)) and J is the geodesic
+deviation vector field. On a positively curved manifold, the norm of the separating vector J does not
+grow, whereas on a negatively curved manifold, the norm of J grows exponentially in time; if the
+manifold is compact, so that its geodesics are sooner or later obliged to fold, this provides an example of
+chaotic geodesic motion [14].
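+
+The dependence on the sign of K can be made explicit by solving Equation (8) directly. The following is a
+sketch of the standard solutions of this linear constant-coefficient equation; the initial data J^i(0) and
+\dot{J}^i(0) are symbols introduced here for illustration and do not appear in the original text:
+
+```latex
+% Illustrative sketch: solutions of d^2 J^i / d\tau^2 + K J^i = 0 for constant K
+\begin{align*}
+K > 0: \quad & J^i(\tau) = J^i(0) \cos(\sqrt{K}\,\tau) + \frac{\dot{J}^i(0)}{\sqrt{K}} \sin(\sqrt{K}\,\tau), \\
+K = 0: \quad & J^i(\tau) = J^i(0) + \dot{J}^i(0)\,\tau, \\
+K < 0: \quad & J^i(\tau) = J^i(0) \cosh(\sqrt{|K|}\,\tau) + \frac{\dot{J}^i(0)}{\sqrt{|K|}} \sinh(\sqrt{|K|}\,\tau).
+\end{align*}
+```
+
+For K > 0 the components of J oscillate and stay bounded, while for K < 0 they generically grow like exp(√|K| τ).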
+
+Taking into consideration these facts, we single out as suitable indicator of dynamical (temporal)
+complexity, the information geometric complexity defined as the average dynamical statistical
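+volume [15]
+
+\[ \widetilde{\mathrm{vol}} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau) \right] := \frac{1}{\tau} \int_0^\tau d\tau' \, \mathrm{vol} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau') \right], \tag{9} \]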
+
+where
+
+\[ \mathrm{vol} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau') \right] := \int_{\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau')} \sqrt{\det(G(\theta))} \, d\theta, \tag{10} \]
+
+with G(θ) the information matrix whose components are given by Equation (3). The integration space
+D_Θ^{(geodesic)}(τ′) is defined as follows
+
+\[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau') := \left\{ \theta = (\theta^1, \ldots, \theta^m) : \theta^k(0) \leq \theta^k \leq \theta^k(\tau') \right\}, \tag{11} \]
+
+where θk ≡ θk(s) with 0 ≤ s ≤ τ′ such that θk(s) satisfies (5). The quantity vol[D_Θ^{(geodesic)}(τ′)] is the
+volume of the effective parameter space explored by the system at time τ′. The temporal average
+has been introduced in order to average out the possibly very complex fine details of the entropic
+dynamical description of the system’s complexity dynamics.
+Relevant properties, concerning complexity of geodesic paths on curved statistical manifolds, of
+the quantity (10) compared to the Jacobi vector field are discussed in [16].
+
+3. The Gaussian Statistical Model
+
+In the following we devote our attention to a Gaussian statistical model P whose elements are
+multivariate normal joint distributions for n real-valued variables X1, . . . , Xn given by
+
+\[ p(x|\theta) = \frac{1}{\sqrt{(2\pi)^n \det C}} \exp \left[ -\frac{1}{2} (x - \mu)^t C^{-1} (x - \mu) \right], \tag{12} \]
+
+where μ = (E(X1), . . . , E(Xn)) is the n-dimensional mean vector and C denotes the n × n covariance
+matrix with entries cij = E(XiXj) − E(Xi)E(Xj), i, j = 1, . . . , n. Since μ is an n-dimensional real vector
+and C is an n × n symmetric matrix, the model involves n + n(n+1)/2 parameters.
+Moreover, C is a symmetric, positive definite matrix, hence the parameter space is given by
+
+\[ \Theta := \{ (\mu, C) \mid \mu \in \mathbb{R}^n, \ C \in \mathbb{R}^{n \times n}, \ C > 0 \}. \tag{13} \]
+
+Hereafter we consider the statistical model given by Equation (12) when the covariance matrix C has
+only the variances σ_i² = E(X_i²) − (E(X_i))² as parameters. In fact, we assume that the off-diagonal entry
+(i, j) of the covariance matrix C equals ρσiσj with ρ ∈ R quantifying the degree of correlation.
+We may further notice that the function fij(x) := ∂i log p(x|θ)∂j log p(x|θ), when p(x|θ) is given
+by Equation (12), is a polynomial in the variables xi (i = 1, . . . , n) whose degree is not greater than four.
+Indeed, we have that
+
+\[ \partial_i \log p(x|\theta) = \frac{1}{p(x|\theta)} \partial_i p(x|\theta) = \partial_i \log \frac{1}{\sqrt{(2\pi)^n \det C}} + \partial_i \left[ -\frac{1}{2} (x - \mu)^t C^{-1} (x - \mu) \right], \tag{14} \]
+
+and, therefore, the differentiation does not affect variables xi. With this in mind, in order to compute
+the integral in (3), we can use the following formula [17]
+
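+\[ \frac{1}{\sqrt{(2\pi)^n \det C}} \int dx \, f_{ij}(x) \exp \left[ -\frac{1}{2} (x - \mu)^t C^{-1} (x - \mu) \right] = \exp \left[ \frac{1}{2} \sum_{h,k=1}^{n} c_{hk} \frac{\partial}{\partial x_h} \frac{\partial}{\partial x_k} \right] f_{ij} \Big|_{x=\mu}, \tag{15} \]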
+
+where the exponential denotes the power series over its argument (the differential operator).
+
+3.1. The monovariate Gaussian Statistical Model
+
+We now start to apply the concepts of the previous section to a Gaussian statistical model of
+Equation (12) for n = 1. In this case, the dimension of the statistical Riemannian manifold M = (Θ, g)
+is at most two. Indeed, to describe elements of the statistical model P given by Equation (12), we
+basically need the mean μ = E(X) and variance σ2 = E(X − μ)2. We deal separately with the
+cases when the monovariate model has only μ as macro-variable (Case 1), when σ is the unique
+macro-variable (Case 2), and finally when both μ and σ are macro-variables (Case 3).
+
+3.1.1. Case 1
+
+Consider the monovariate model with only μ as macro-variable by setting σ = 1. In this case
+the manifold M is trivially the real flat straight line, since μ ∈ (−∞, +∞). Indeed, the integral
+in (3) is equal to 1 when the distribution p(x|θ) reads as p(x|μ) = exp[−(x − μ)²/2]/√(2π), so the metric
+is g = dμ². Furthermore, from Equations (4) and (5) the information dynamics is described by
+the geodesic μ(τ) = A1τ + A2, where A1, A2 ∈ R. Hence, the volume of Equation (10) is
+vol[D_Θ^{(geodesic)}(τ′)] = ∫ dμ = A1τ + A2; since this quantity must be positive we assume A1, A2 > 0.
+Finally, the asymptotic behavior of the IGC (9) is
+
+\[ \widetilde{\mathrm{vol}} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau) \right] \approx \left( \frac{A_1}{2} \right) \tau. \tag{16} \]
+
+This shows that the complexity linearly increases in time meaning that acquiring information about μ
+and updating it, is not enough to increase our knowledge about the micro state of the system.
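+
+The step from the volume to Equation (16) can be spelled out as follows. This is a sketch that assumes the
+volume at the intermediate time τ′ is A1τ′ + A2, i.e., that the geodesic is evaluated at the integration
+variable of Equation (9):
+
+```latex
+% Illustrative sketch: time average (9) of the linear volume A_1 \tau' + A_2
+\begin{align*}
+\widetilde{\mathrm{vol}} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau) \right]
+ &= \frac{1}{\tau} \int_0^\tau d\tau' \, (A_1 \tau' + A_2)
+ = \frac{A_1}{2}\,\tau + A_2
+ \approx \frac{A_1}{2}\,\tau \qquad (\tau \to \infty).
+\end{align*}
+```
+
+The constant A2 is subleading at large τ, which is why only A1 appears in the asymptotic expression.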
+
+3.1.2. Case 2
+
+Consider now the monovariate Gaussian statistical model of Equation (12) when μ = E(X) = 0
+and the macro-variable is only σ. In this case the probability distribution function reads
+p(x|σ) = exp[−x²/(2σ²)]/(√(2π) σ), while the Fisher–Rao metric becomes g = (2/σ²) dσ². Noting that
+the manifold is flat in this case as well, we derive the information dynamics by means of Equations (4) and (5) and we
+obtain the geodesic σ(τ) = A1 exp(A2τ). The volume in Equation (10) is then
+
+\[ \mathrm{vol} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau') \right] = \int \frac{\sqrt{2}}{\sigma} \, d\sigma = \sqrt{2} \log \left[ A_1 \exp(A_2 \tau) \right]. \tag{17} \]
+
+Again, to have positive volume we have to assume A1, A2 > 0. Finally, the (asymptotic) IGC (9)
+becomes
+
+\[ \widetilde{\mathrm{vol}} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau) \right] \approx \left( \frac{\sqrt{2} A_2}{2} \right) \tau. \tag{18} \]
+
+This shows that also in this case the complexity linearly increases in time meaning that acquiring
+information about σ and updating it, is not enough to increase our knowledge about the micro-state of
+the system.
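+
+The geodesic and the linear growth in this case can be checked by hand. The sketch below applies
+Equations (4), (5) and (9) to the one-dimensional metric with the single component g_{σσ} = 2/σ²; it assumes
+that the volume at the intermediate time τ′ is √2 log[A1 exp(A2τ′)]:
+
+```latex
+% Illustrative sketch: Christoffel coefficient, geodesic and time-averaged volume for g = (2/\sigma^2) d\sigma^2
+\begin{align*}
+\Gamma^{\sigma}_{\sigma\sigma} &= \frac{1}{2} g^{\sigma\sigma} \, \partial_\sigma g_{\sigma\sigma}
+ = \frac{1}{2} \cdot \frac{\sigma^2}{2} \cdot \left( -\frac{4}{\sigma^3} \right) = -\frac{1}{\sigma}, \\
+\frac{d^2\sigma}{d\tau^2} - \frac{1}{\sigma} \left( \frac{d\sigma}{d\tau} \right)^2 = 0
+ \ &\Longleftrightarrow \ \frac{d^2}{d\tau^2} \log \sigma = 0
+ \ \Longrightarrow \ \sigma(\tau) = A_1 \exp(A_2 \tau), \\
+\frac{1}{\tau} \int_0^\tau d\tau' \, \sqrt{2} \left( \log A_1 + A_2 \tau' \right)
+ &= \sqrt{2} \log A_1 + \frac{\sqrt{2} A_2}{2}\,\tau
+ \approx \frac{\sqrt{2} A_2}{2}\,\tau \qquad (\tau \to \infty).
+\end{align*}
+```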
+
+3.1.3. Case 3
+
+The take-home message of the previous cases is that we have to account for both mean μ and
+variance σ as macro-variables to look for possible non-increasing complexity. Hence, consider the
+probability distribution function given by
+
+\[ p(x|\mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp \left[ -\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right]. \tag{19} \]
+
+The dimension of the Riemannian manifold M = (Θ, g) is two, where the parameter space Θ is given
+by Θ = {(μ, σ)|μ ∈ (−∞, +∞), σ > 0} and the Fisher–Rao metric reads as g = (1/σ²) dμ² + (2/σ²) dσ². Here,
+the sectional curvature given by Equation (7) is a negative function and, despite the fact that it is not
+constant, we expect a decreasing behavior in time of the IGC. Thanks to Equation (4), we find that the
+only non-vanishing Christoffel coefficients are Γ^1_{12} = −1/σ, Γ^2_{11} = 1/(2σ) and Γ^2_{22} = −1/σ. Substituting them
+into Equation (5) we derive the following geodesic equations
+
+\[ \begin{cases} \dfrac{d^2\mu(\tau)}{d\tau^2} - \dfrac{2}{\sigma} \dfrac{d\sigma}{d\tau} \dfrac{d\mu}{d\tau} = 0, \\[2ex] \dfrac{d^2\sigma(\tau)}{d\tau^2} - \dfrac{1}{\sigma} \left( \dfrac{d\sigma}{d\tau} \right)^2 + \dfrac{1}{2\sigma} \left( \dfrac{d\mu}{d\tau} \right)^2 = 0. \end{cases} \tag{20} \]
+
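+The integration of the above coupled differential equations is non-trivial. We follow the method
+described in [10] and arrive at
+
+\[ \sigma(\tau) = \frac{2\sigma_0 \exp \left( \frac{\sigma_0 |A_1|}{\sqrt{2}} \tau \right)}{1 + \exp \left( \frac{2\sigma_0 |A_1|}{\sqrt{2}} \tau \right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 \sqrt{2} A_1}{|A_1| \left[ 1 + \exp \left( \frac{2\sigma_0 |A_1|}{\sqrt{2}} \tau \right) \right]}, \tag{21} \]
+
+where σ0 and A1 are real constants. Then, using (21), the volume of Equation (10) is
+
+\[ \mathrm{vol} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau') \right] = \int\!\!\int \frac{\sqrt{2}}{\sigma^2} \, d\sigma \, d\mu = \frac{\sqrt{2} A_1}{|A_1|} \exp \left( -\frac{\sigma_0 |A_1|}{\sqrt{2}} \tau \right). \tag{22} \]
+
+Since the last quantity must be positive, we assume A1 > 0. Finally, employing the above expression
+in Equation (9) we arrive at
+
+\[ \widetilde{\mathrm{vol}} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau) \right] \approx \left( \frac{2}{\sigma_0 A_1} \right) \frac{1}{\tau}. \tag{23} \]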
+
+We can now see a reduction in time of the complexity meaning that acquiring information about both
+μ and σ and updating them allows us to increase our knowledge about the micro state of the system.
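+
+The 1/τ decay in Equation (23) follows from averaging the exponentially decaying volume of Equation (22).
+The sketch below assumes A1 > 0, evaluates the volume at the integration variable τ′ of Equation (9), and
+introduces the shorthand k = σ0A1/√2, which is not a symbol of the original text:
+
+```latex
+% Illustrative sketch: time average (9) of the volume (22), with k := \sigma_0 A_1 / \sqrt{2} > 0
+\begin{align*}
+\widetilde{\mathrm{vol}} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau) \right]
+ &= \frac{1}{\tau} \int_0^\tau d\tau' \, \sqrt{2} \, e^{-k \tau'}
+ = \frac{\sqrt{2}}{k \tau} \left( 1 - e^{-k\tau} \right)
+ \approx \frac{\sqrt{2}}{k} \, \frac{1}{\tau}
+ = \frac{2}{\sigma_0 A_1} \, \frac{1}{\tau} \qquad (\tau \to \infty).
+\end{align*}
+```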
+Hence, comparing Equations (16), (18) and (23) we conclude that entropic inference on a
+Gaussian distributed micro-variable is carried out in a more efficient manner when both its mean and
+its variance are available in the form of information constraints. Macroscopic predictions when only
+one of these pieces of information is available are more complex.
+
+3.2. Bivariate Gaussian Statistical Model
+
+Consider now the Gaussian statistical model P of the Equation (12) when n = 2. In this case
+the dimension of the Riemannian manifold M = (Θ, g) is at most four. From the analysis of the
+monovariate Gaussian model in Section 3.1 we have understood that both mean and variance should
+be considered. Hence the minimal assumption is to consider E(X1) = E(X2) = μ and E(X1 − μ)2 =
+E(X2 − μ)2 = σ2. Furthermore, in this case we have also to take into account the possible presence of
+(micro) correlations, which appear at the level of macro-states as off-diagonal terms in the covariance
+matrix. In short, this implies considering the following probability distribution function
+
+\[ p(x_1, x_2|\mu, \sigma) = \frac{1}{2\pi\sigma^2 \sqrt{1 - \rho^2}} \exp \left\{ -\frac{1}{2\sigma^2 (1 - \rho^2)} \left[ (x_1 - \mu)^2 - 2\rho (x_1 - \mu)(x_2 - \mu) + (x_2 - \mu)^2 \right] \right\}, \tag{24} \]
+
+where ρ ∈ (−1, 1).
+Thanks to Equation (15) we compute the Fisher information matrix G and find g = g11dμ² +
+g22dσ² with
+
+\[ g_{11} = \frac{2}{\sigma^2 (\rho + 1)}; \qquad g_{22} = \frac{4}{\sigma^2}. \tag{25} \]
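+
+The components in Equation (25) can be cross-checked with the standard closed form of the Fisher
+information of a multivariate normal distribution. This is a different route from Equation (15) and is used
+here only as a sketch; the symbols \mathbf{1} (the vector of ones), m(θ) (the mean vector) and I_2 (the
+2 × 2 identity) are notation introduced for this check:
+
+```latex
+% Illustrative sketch: g_{ab} = \partial_a m^t C^{-1} \partial_b m + (1/2) tr(C^{-1} \partial_a C \, C^{-1} \partial_b C)
+% with m = \mu \mathbf{1}, \ \mathbf{1} = (1, 1)^t, \ C = \sigma^2 [[1, \rho], [\rho, 1]]
+\begin{align*}
+C^{-1} &= \frac{1}{\sigma^2 (1 - \rho^2)} \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}, \\
+g_{11} &= \mathbf{1}^t C^{-1} \mathbf{1} = \frac{2 - 2\rho}{\sigma^2 (1 - \rho^2)} = \frac{2}{\sigma^2 (1 + \rho)}, \\
+\partial_\sigma C &= \frac{2}{\sigma}\, C
+ \ \Longrightarrow \
+ g_{22} = \frac{1}{2} \operatorname{tr} \left( \frac{4}{\sigma^2}\, I_2 \right) = \frac{4}{\sigma^2}, \\
+g_{12} &= 0 \quad \text{since } \partial_\sigma m = 0 \text{ and } \partial_\mu C = 0 .
+\end{align*}
+```
+
+The same argument with n variables gives g22 = 2n/σ², which also reproduces the value 2/σ² of the monovariate case.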
+
+The only non-trivial Christoffel coefficients (4) are Γ^1_{12} = −1/σ, Γ^2_{11} = 1/[2σ(ρ + 1)] and Γ^2_{22} = −1/σ. In this case
+as well, the sectional curvature (Equation (7)) of the manifold M is a negative function and so we may
+expect a decreasing asymptotic behavior for the IGC. From Equation (5) it follows that the geodesic
+equations are
+
+\[ \begin{cases} \dfrac{d^2\mu(\tau)}{d\tau^2} - \dfrac{2}{\sigma} \dfrac{d\sigma}{d\tau} \dfrac{d\mu}{d\tau} = 0, \\[2ex] \dfrac{d^2\sigma(\tau)}{d\tau^2} - \dfrac{1}{\sigma} \left( \dfrac{d\sigma}{d\tau} \right)^2 + \dfrac{1}{2(1 + \rho)\sigma} \left( \dfrac{d\mu}{d\tau} \right)^2 = 0, \end{cases} \tag{26} \]
+
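+whose solutions are
+
+\[ \sigma(\tau) = \frac{2\sigma_0 \exp \left( \frac{\sigma_0 |A_1|}{\sqrt{2(1+\rho)}} \tau \right)}{1 + \exp \left( \frac{2\sigma_0 |A_1|}{\sqrt{2(1+\rho)}} \tau \right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 \sqrt{2(1+\rho)} A_1}{|A_1| \left[ 1 + \exp \left( \frac{2\sigma_0 |A_1|}{\sqrt{2(1+\rho)}} \tau \right) \right]}. \tag{27} \]
+
+Using (27) in Equation (10) gives the volume
+
+\[ \mathrm{vol} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau') \right] = \int\!\!\int \frac{2\sqrt{2}}{\sqrt{1+\rho}\, \sigma^2} \, d\sigma \, d\mu = \frac{4 A_1}{|A_1|} \exp \left( -\frac{\sigma_0 |A_1|}{\sqrt{2(1+\rho)}} \tau \right). \tag{28} \]
+
+To have it positive we have to assume A1 > 0. Finally, employing (28) in (9) leads to the IGC
+
+\[ \widetilde{\mathrm{vol}} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau) \right] \approx \left( \frac{4\sqrt{2}}{\sigma_0 A_1} \right) \frac{\sqrt{1+\rho}}{\tau}, \tag{29} \]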
+
+with ρ ∈ (−1, 1). We may compare the asymptotic expressions of the IGCs in the presence and in the
+absence of correlations, obtaining
+
+\[ R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau) \right]}{\widetilde{\mathrm{vol}} \left[ \mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau) \right]_{\rho=0}} = \sqrt{1+\rho}, \tag{30} \]
+
+where “strong” stands for the fully connected lattice underlying the micro-variables. The ratio
+R^{strong}_{bivariate}(ρ) is a monotonically increasing function of ρ.
+While the temporal behavior of the IGC (29) is similar to the IGC in (23), here correlations play
+a fundamental role. From Equation (30), we conclude that entropic inference on two Gaussian
+distributed micro-variables on a fully connected lattice is carried out in a more efficient manner when
+the two micro-variables are negatively correlated. Instead, when such micro-variables are positively
+correlated, macroscopic predictions become more complex than in the absence of correlations.
+Intuitively, this is due to the fact that for anticorrelated variables, an increase in one variable
+implies a decrease in the other one (different directional change): variables become more distant, thus
+more distinguishable in the Fisher–Rao information metric sense. Similarly, for positively correlated
+variables, an increase or decrease in one variable always predicts the same directional change for the
+second variable: variables do not become more distant, and thus not more distinguishable, in the Fisher–Rao
+information metric sense. This may lead us to guess that in the presence of anticorrelations, motion on
+curved statistical manifolds via the Maximum Entropy updating methods becomes less complex.
+
+3.3. Trivariate Gaussian Statistical Model
+
+In this section we consider a Gaussian statistical model P of Equation (12) with n = 3.
+In this case as well, in order to understand the asymptotic behavior of the IGC in the presence of
+correlations between the micro-states, we make the minimal assumption that, given the random vector
+$X = (X_1, X_2, X_3)$ distributed according to a trivariate Gaussian, $E(X_1) = E(X_2) = E(X_3) = \mu$
+and $E(X_1 - \mu)^2 = E(X_2 - \mu)^2 = E(X_3 - \mu)^2 = \sigma^2$. Therefore, the space of the parameters of P is
+given by $\Theta = \{(\mu, \sigma) \mid \mu \in \mathbb{R},\ \sigma > 0\}$.
+The manifold M = (Θ, g) changes its metric structure depending on the number of correlations
+between micro-variables, namely, one, two, or three. The covariance matrices corresponding to these
+cases read, modulo the congruence via a permutation matrix [17],
+
+\[
+C_1 = \sigma^2 \begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
+C_2 = \sigma^2 \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & 0 \\ \rho & 0 & 1 \end{pmatrix}, \quad
+C_3 = \sigma^2 \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix}. \tag{31}
+\]
+
+3.3.1. Case 1
+
+First, we consider the trivariate Gaussian statistical model of Equation (12) when C ≡ C1. Then,
+proceeding as in Section 3.2, we have $g = g_{11}\,d\mu^2 + g_{22}\,d\sigma^2$, where $g_{11} = \frac{3+\rho}{(1+\rho)\sigma^2}$ and $g_{22} = \frac{6}{\sigma^2}$. Also in
+this case we find that the sectional curvature of Equation (7) is a negative function. Hence, as stated
+in Section 2, we may expect a decreasing (in time) behavior of the information geometry complexity.
+Furthermore, we obtain the geodesics
+
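+\[
+\sigma(\tau) = \frac{2\sigma_0 \exp\left[\sigma_0\sqrt{A(\rho)}\,\tau\right]}{1 + \exp\left[2\sigma_0\sqrt{A(\rho)}\,\tau\right]}, \qquad
+\mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}}\, \frac{1}{1 + \exp\left[2\sigma_0\sqrt{A(\rho)}\,\tau\right]}, \tag{32}
+\]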
+
+where $A(\rho) = \frac{A_1^2 (3+\rho)}{6(1+\rho)}$ and $A_1 \in \mathbb{R}$. We remark that $A(\rho) > 0$ for all $\rho \in (-1, 1)$. Then, the volume (10)
+becomes
+
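+\[
+\mathrm{vol}\left[\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau')\right] = \int\!\!\int \sqrt{\frac{6(3+\rho)}{1+\rho}}\, \frac{1}{\sigma^2}\, d\sigma\, d\mu = \frac{6 A_1}{|A_1|} \exp\left[-\sigma_0\sqrt{A(\rho)}\,\tau'\right], \tag{33}
+\]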
+
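+requiring $A_1 > 0$ for its positivity. Finally, using (33) in (9) we arrive at the asymptotic behavior of
+the IGC
+
+\[
+\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau)\right] \approx \left[\frac{6\sqrt{6}}{\sigma_0 A_1}\right] \sqrt{\frac{1+\rho}{3+\rho}}\; \frac{1}{\tau}. \tag{34}
+\]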
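+
+As a check on the step from (33) to (34), suppose (this is an assumption, since Equation (9) is not reproduced in this section) that the averaged volume is the time average of the instantaneous volume over $[0, \tau]$. With $A_1 > 0$, that reading gives
+
+\[
+\frac{1}{\tau}\int_0^{\tau} 6\, e^{-\sigma_0\sqrt{A(\rho)}\,\tau'}\, d\tau' = \frac{6\left(1 - e^{-\sigma_0\sqrt{A(\rho)}\,\tau}\right)}{\sigma_0\sqrt{A(\rho)}\,\tau} \approx \frac{6}{\sigma_0\sqrt{A(\rho)}}\,\frac{1}{\tau} \qquad (\tau \to \infty),
+\]
+
+and inserting $\sqrt{A(\rho)} = A_1\sqrt{(3+\rho)/\left(6(1+\rho)\right)}$ reproduces the prefactor $\frac{6\sqrt{6}}{\sigma_0 A_1}\sqrt{\frac{1+\rho}{3+\rho}}$ of (34).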
+
+Comparing (34) in the presence and in the absence of correlations yields
+
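+\[
+R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau)\right]}{\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau)\right]_{\rho=0}} = \sqrt{3}\,\sqrt{\frac{1+\rho}{3+\rho}}, \tag{35}
+\]
+
+where “weak” stands for a low degree of connection in the lattice underlying the micro-variables.
+Notice that $R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho)$ is a monotonically increasing function of the argument $\rho \in (-1, 1)$.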
+
+3.3.2. Case 2
+
+When the trivariate Gaussian statistical model of Equation (12) has C ≡ C2, the condition C > 0
+constrains the correlation coefficient to be $\rho \in \left(-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right)$. Proceeding again as in Section 3.2, we
+have $g = g_{11}\,d\mu^2 + g_{22}\,d\sigma^2$, where $g_{11} = \frac{3-4\rho}{(1-2\rho^2)\sigma^2}$ and $g_{22} = \frac{6}{\sigma^2}$. The sectional curvature of Equation (7)
+is a negative function as well, and so we may apply the arguments of Section 2, expecting the complexity
+to decrease in time. Furthermore, we obtain the geodesics
+
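+\[
+\sigma(\tau) = \frac{2\sigma_0 \exp\left[\sigma_0\sqrt{A(\rho)}\,\tau\right]}{1 + \exp\left[2\sigma_0\sqrt{A(\rho)}\,\tau\right]}, \qquad
+\mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}}\, \frac{1}{1 + \exp\left[2\sigma_0\sqrt{A(\rho)}\,\tau\right]}, \tag{36}
+\]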
+
+
+where $A(\rho) = \frac{A_1^2 (3-4\rho)}{6(1-2\rho^2)}$ and $A_1 \in \mathbb{R}$. We remark that $A(\rho) > 0$ for all $\rho \in \left(-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right)$. Then, the
+volume (10) becomes
+
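+\[
+\mathrm{vol}\left[\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau')\right] = \int\!\!\int \sqrt{\frac{6(3-4\rho)}{1-2\rho^2}}\, \frac{1}{\sigma^2}\, d\sigma\, d\mu = \frac{6 A_1}{|A_1|} \exp\left[-\sigma_0\sqrt{A(\rho)}\,\tau'\right]. \tag{37}
+\]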
+
+We have to set $A_1 > 0$ for the positivity of the volume (37), and using it in (9) we arrive at the
+asymptotic behavior of the IGC
+
+\[
+\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau)\right] \approx \left[\frac{6\sqrt{6}}{\sigma_0 A_1}\right] \sqrt{\frac{1-2\rho^2}{3-4\rho}}\; \frac{1}{\tau}. \tag{38}
+\]
+
+Then, comparing (38) in the presence and in the absence of correlations yields
+
+\[
+R^{\mathrm{mildly\ weak}}_{\mathrm{trivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau)\right]}{\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau)\right]_{\rho=0}} = \sqrt{3}\,\sqrt{\frac{1-2\rho^2}{3-4\rho}}, \tag{39}
+\]
+
+where “mildly weak” stands for a lattice (underlying micro-variables) neither fully connected nor with
+minimal connection.
+This is a function of the argument $\rho \in \left(-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right)$ that attains the maximum $\sqrt{\frac{3}{2}}$ at $\rho = \frac{1}{2}$, while
+at the extrema of the interval $\left(-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right)$ it tends to zero.
+
+3.3.3. Case 3
+
+Last, we consider the trivariate Gaussian statistical model of Equation (12) when C ≡ C3. In
+this case, the condition C > 0 requires the correlation coefficient to be $\rho \in \left(-\frac{1}{2}, 1\right)$. Proceeding again
+as in Section 3.2, we have $g = g_{11}\,d\mu^2 + g_{22}\,d\sigma^2$, where $g_{11} = \frac{3}{(1+2\rho)\sigma^2}$ and $g_{22} = \frac{6}{\sigma^2}$. We find that the
+sectional curvature of Equation (7) is a negative function; hence, we may expect a decreasing (in time)
+behavior of the complexity. The geodesics are
+
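+\[
+\sigma(\tau) = \frac{2\sigma_0 \exp\left[\sigma_0\sqrt{A(\rho)}\,\tau\right]}{1 + \exp\left[2\sigma_0\sqrt{A(\rho)}\,\tau\right]}, \qquad
+\mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}}\, \frac{1}{1 + \exp\left[2\sigma_0\sqrt{A(\rho)}\,\tau\right]}, \tag{40}
+\]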
+
+where $A(\rho) = \frac{A_1^2}{2(1+2\rho)}$ and $A_1 \in \mathbb{R}$. We note that $A(\rho) > 0$ for all $\rho \in \left(-\frac{1}{2}, 1\right)$. Using (40), we compute
+
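+\[
+\mathrm{vol}\left[\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau')\right] = \int\!\!\int \frac{3\sqrt{2}}{\sqrt{1+2\rho}}\, \frac{1}{\sigma^2}\, d\sigma\, d\mu = \frac{6\sqrt{2}\, A_1}{|A_1|} \exp\left[-\sigma_0\sqrt{A(\rho)}\,\tau'\right]. \tag{41}
+\]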
+
+Also in this case we need to assume $A_1 > 0$ to have positive volume. Finally, substituting Equation (41)
+into Equation (9), the asymptotic behavior of the IGC is
+
+\[
+\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau)\right] \approx \left[\frac{12}{\sigma_0 A_1}\right] \sqrt{1+2\rho}\; \frac{1}{\tau}. \tag{42}
+\]
+
+The comparison of (42) in the presence and in the absence of correlations yields
+
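+\[
+R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau)\right]}{\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_{\Theta}(\tau)\right]_{\rho=0}} = \sqrt{1+2\rho}, \tag{43}
+\]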
+
+
+where “strong” stands for a fully connected lattice underlying the (three) micro-variables. We remark that
+the latter ratio is a monotonically increasing function of the argument $\rho \in \left(-\frac{1}{2}, 1\right)$.
+
+The behaviors of R(ρ) of Equations (30), (35), (39) and (43) are reported in Figure 1.
+
+[Figure 1: plot of R(ρ) (vertical axis) against ρ (horizontal axis), with ρpeak marked.]
+
+Figure 1. Ratio R(ρ) of volumes vs. degree of correlations ρ. The solid line refers to $R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho)$; the dotted line
+refers to $R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho)$; the dashed line refers to $R^{\mathrm{mildly\ weak}}_{\mathrm{trivariate}}(\rho)$; the dash-dotted line refers to $R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho)$.
+
+The non-monotonic behavior of the ratio $R^{\mathrm{mildly\ weak}}_{\mathrm{trivariate}}(\rho)$ in Equation (39) corresponds to the
+information geometric complexities for the mildly weak connected three-dimensional lattice.
+Interestingly, the growth stops at a critical value $\rho_{\mathrm{peak}} = \frac{1}{2}$, at which $R^{\mathrm{mildly\ weak}}_{\mathrm{trivariate}}(\rho_{\mathrm{peak}}) = R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho_{\mathrm{peak}})$. From
+Equation (43), we conclude that entropic inferences on three Gaussian distributed micro-variables on
+a fully connected lattice are carried out more efficiently when the micro-variables are
+negatively correlated. Instead, when such micro-variables are positively correlated, macroscopic
+predictions become more complex than in the absence of correlations.
+Furthermore, the ratio $R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho)$ of the information geometric complexities for this fully connected three-dimensional
+lattice increases in a monotonic fashion. These conclusions are similar to those presented for the
+bivariate case. However, there is a key feature of the IGC to emphasize when passing from the
+two-dimensional to the three-dimensional manifolds associated with fully connected lattices: the
+effects of negative correlations and positive correlations are amplified with respect to the respective
+absence-of-correlations scenarios,
+
+\[
+\frac{R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho)}{R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho)} = \sqrt{\frac{1+2\rho}{1+\rho}}, \tag{44}
+\]
+
+where $\rho \in \left(-\frac{1}{2}, 1\right)$.
+Specifically, carrying out entropic inferences on the higher-dimensional manifold in the presence
+of anti-correlations, that is for $\rho \in \left(-\frac{1}{2}, 0\right)$, is less complex than on the lower-dimensional manifold, as is
+evident from Equation (44). The converse is true in the presence of positive correlations, that is for
+ρ ∈ (0, 1).
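+
+The four closed-form ratios in Equations (30), (35), (39) and (43) are easy to evaluate numerically. The following is a minimal Python sketch, not part of the original paper; the function names are illustrative choices, and each function is only meaningful on the ρ-interval quoted for the corresponding case.

```python
import math


def r_strong_bivariate(rho: float) -> float:
    """Equation (30): fully connected bivariate lattice, rho in (-1, 1)."""
    return math.sqrt(1.0 + rho)


def r_weak_trivariate(rho: float) -> float:
    """Equation (35): one correlation among three variables, rho in (-1, 1)."""
    return math.sqrt(3.0 * (1.0 + rho) / (3.0 + rho))


def r_mildly_weak_trivariate(rho: float) -> float:
    """Equation (39): two correlations, rho in (-sqrt(2)/2, sqrt(2)/2)."""
    return math.sqrt(3.0 * (1.0 - 2.0 * rho ** 2) / (3.0 - 4.0 * rho))


def r_strong_trivariate(rho: float) -> float:
    """Equation (43): fully connected trivariate lattice, rho in (-1/2, 1)."""
    return math.sqrt(1.0 + 2.0 * rho)


if __name__ == "__main__":
    # Tabulate the ratios on a few values lying inside every admissible interval.
    for rho in (-0.4, -0.2, 0.0, 0.25, 0.5, 0.6):
        print(
            f"rho={rho:+.2f}",
            f"strong_bi={r_strong_bivariate(rho):.4f}",
            f"weak_tri={r_weak_trivariate(rho):.4f}",
            f"mildly_weak_tri={r_mildly_weak_trivariate(rho):.4f}",
            f"strong_tri={r_strong_trivariate(rho):.4f}",
        )
```

+All four ratios equal one at ρ = 0 by construction, and the mildly weak ratio can be checked to peak at ρ = 1/2, where it coincides with the strong bivariate ratio.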
+
+4. Concluding Remarks
+
+In summary, we considered low dimensional Gaussian statistical models (up to a trivariate model)
+and have investigated their dynamical (temporal) complexity. This has been quantified by the volume
+of geodesics for parameters characterizing the probability distribution functions. To the best of our
+knowledge, there is no dynamic measure of complexity of geodesic paths on curved statistical manifolds
+that could be compared to our IGC. However, it could be worthwhile to understand the connection, if
+any, between our IGC and the complexity of paths of dynamic systems introduced in [20]. Specifically,
+according to the Alekseev-Brudno theorem in the algorithmic theory of dynamical systems [21], a way
+to predict each new segment of chaotic trajectory is obtained by adding information proportional to the
+length of this segment and independent of the full previous length of trajectory. This means that this
+information cannot be extracted from observation of the previous motion, even an infinitely long one!
+If the instability is a power law, then the required information per unit time is inversely proportional
+to the full previous length of the trajectory and, asymptotically, the prediction becomes possible.
+For the sake of completeness, we also point out that the relevance of volumes in quantifying the
+static model complexity of statistical models was already pointed out in [22] and [23]: complexity is
+related to the volume of a model in the space of distributions regarded as a Riemannian manifold
+of distributions with a natural metric defined by the Fisher–Rao metric tensor. Finally, we would
+like to point out that two of the Authors have recently associated Gaussian statistical models to
+networks [17]. Specifically, it is assumed that random variables are located on the vertices of the
+network while correlations between random variables are regarded as weighted edges of the network.
+Within this framework, a static network complexity measure has been proposed as the volume of the
+corresponding statistical manifold. We emphasize that such a static measure could be, in principle,
+applied to time-dependent networks by accommodating time-varying weights on the edges [24]. This
+requires the consideration of a time-sequence of different statistical manifolds. Thus, we could follow
+the time-evolution of a network complexity through the time evolution of the volumes of the associated
+manifolds.
+In this work we uncover that in order to have a reduction in time of the complexity one has to
+consider both mean and variance as macro-variables. This leads to different topological structures of
+the parameter space in (13); in particular, we have to consider at least a 2-dimensional manifold in
+order to have effects such as a power law decay of the complexity. Hence, the minimal hypothesis in a
+multivariate Gaussian model consists in considering all mean values equal and all covariances equal.
+In such a case, however, the complexity shows interesting features depending on the correlation among
+micro-variables (as summarized in Figure 1). For a trivariate model with only two correlations, the
+information geometric complexity ratio exhibits a non-monotonic behavior in ρ (correlation parameter),
+taking zero value at the extrema of the range of ρ. In contrast, for closed configurations (bivariate
+and trivariate models with all micro-variables correlated with each other), the complexity ratio exhibits a
+monotonic behavior in terms of the correlation parameter. The fact that in such a case this ratio cannot
+be zero at the extrema of the range of ρ is reminiscent of the geometric frustration phenomena that
+occur in the presence of loops [11].
+Specifically, recall that a geometrically frustrated system cannot simultaneously minimize all
+interactions because of geometric constraints [11,18]. For example, geometric frustration can occur
+in an Ising model which is an array of spins (for instance, atoms that can take states ±1) that are
+magnetically coupled to each other. If one spin is, say, in the +1 state then it is energetically favorable
+for its immediate neighbors to be in the same state in the case of a ferromagnetic model. On the
+contrary, in antiferromagnetic systems, nearest neighbor spins want to align in opposite directions.
+This rule can be easily satisfied on a square. However, due to geometrical frustration, it is not possible
+to satisfy it on a triangle: for an antiferromagnetic triangular Ising model, any three neighboring spins
+are frustrated. Geometric frustration in triangular Ising models can be observed by considering spin
+configurations with total spin J = ±1 and analyzing the fluctuations in energy of the spin system as
+a function of temperature. There is no peak at all in the standard deviation of the energy in the case
+J = −1, and a monotonic behavior is recorded. This indicates that the antiferromagnetic system does
+not have a phase transition to a state with long-range order. Instead, in the case J = +1, a peak in
+the energy fluctuations emerges. This significant change in the behavior of energy fluctuations as a
+function of temperature in triangular configurations of spin systems is a signature of the presence of
+frustrated interactions in the system [19].
+
+
+In this article, we observe a significant change in the behavior of the information geometric
+complexity ratios as a function of the correlation coefficient in the trivariate Gaussian statistical models.
+Specifically, in the fully connected trivariate case, no peak arises and a monotonic behavior in ρ of
+the information geometric complexity ratio is observed. In the mildly weak connected trivariate
+case, instead, a peak in the information geometric complexity ratio is recorded at ρpeak ≥ 0. This
+dramatic disparity of behavior can be ascribed to the fact that when carrying out statistical inferences
+with positively correlated Gaussian random variables, the maximum entropy favorable scenario is
+incompatible with these working hypotheses. Thus, the system appears frustrated.
+These considerations lead us to conclude that we have uncovered a very interesting information
+geometric resemblance of the more standard geometric frustration effect in Ising spin models. However,
+for a conclusive claim of the existence of an information geometric analog of the frustration effect, we
+feel we have to further deepen our understanding. A forthcoming research project along these lines
+will be a detailed investigation of both arbitrary triangular and square configurations of correlated
+Gaussian random variables where we take into consideration both the presence of different intensities
+and signs of pairwise interactions (ρij ≠ ρik if j ≠ k, ∀i).
+
+Acknowledgments: Domenico Felice and Stefano Mancini acknowledge the financial support of the Future
+and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the
+European Commission, under the FET-Open grant agreement TOPDRIM, number FP7-ICT-318121.
+
+Author Contributions: The authors have equally contributed to the paper. All authors read and approved the
+final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1. Feldman, D.F.; Crutchfield, J.P. Measures of Statistical Complexity: Why? Phys. Lett. A 1998, 238, 244–252.
+2. Kolmogorov, A.N. A new metric invariant of transitive dynamical systems and of automorphism of Lebesgue spaces. Doklady Akademii Nauk SSSR 1958, 119, 861–864.
+3. Caticha, A. Entropic Dynamics. Bayesian Inference and Maximum Entropy Methods in Science and Engineering, the 22nd International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Moscow, Idaho, 3–7 August 2002; Fry, R.L., Ed.; American Institute of Physics: College Park, MD, USA, 2002; Volume 617, p. 302.
+4. Caticha, A.; Preuss, R. Maximum entropy and Bayesian data analysis: Entropic prior distributions. Phys. Rev. E 2004, 70, 046127.
+5. Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: New York, NY, USA, 2000.
+6. Cafaro, C. Works on an information geometrodynamical approach to chaos. Chaos Solitons Fractals 2009, 41, 886–891.
+7. Pettini, M. Geometry and Topology in Hamiltonian Dynamics and Statistical Mechanics; Springer-Verlag: Berlin/Heidelberg, Germany, 2007.
+8. Lebowitz, J.L. Microscopic Dynamics and Macroscopic Laws. Ann. N. Y. Acad. Sci. 1981, 373, 220–233.
+9. Shibata, T.; Chawanya, T.; Kaneko, K. Noiseless Collective Motion out of Noisy Chaos. Phys. Rev. Lett. 1999, 82, doi: http://dx.doi.org/10.1103/PhysRevLett.82.4424.
+10. Ali, S.A.; Cafaro, C.; Kim, D.-H.; Mancini, S. The effect of the microscopic correlations on the information geometric complexity of Gaussian statistical models. Physica A 2010, 389, 3117–3127.
+11. Sadoc, J.F.; Mosseri, R. Geometrical Frustration; Cambridge University Press: Cambridge, UK, 2006.
+12. Lee, J.M. Riemannian Manifolds: An Introduction to Curvature; Springer: New York, NY, USA, 1997.
+13. Do Carmo, M.P. Riemannian Geometry; Springer: New York, NY, USA, 1992.
+14. Cafaro, C.; Ali, S.A. Jacobi fields on statistical manifolds of negative curvature. Physica D 2007, 234, 70–80.
+15. Cafaro, C.; Giffin, A.; Ali, S.A.; Kim, D.-H. Reexamination of an information geometric construction of entropic indicators of complexity. Appl. Math. Comput. 2010, 217, 2944–2951.
+16. Cafaro, C.; Mancini, S. Quantifying the complexity of geodesic paths on curved statistical manifolds through information geometric entropies and Jacobi fields. Physica D 2011, 240, 607–618.
+17. Felice, D.; Mancini, S.; Pettini, M. Quantifying Networks Complexity from Information Geometry Viewpoint. J. Math. Phys. 2014, 55, 043505.
+18. Moessner, R.; Ramirez, A.P. Geometrical Frustration. Phys. Today 2006, 59, 24–29.
+19. MacKay, D.J.C. Information Theory, Inference, and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003.
+20. Brudno, A.A. The complexity of the trajectories of a dynamical system. Uspekhi Mat. Nauk 1978, 33, 207–208.
+21. Alekseev, V.M.; Yacobson, M.V. Symbolic dynamics and hyperbolic dynamic systems. Phys. Rep. 1981, 75, 290–325.
+22. Myung, J.; Balasubramanian, V.; Pitt, M.A. Counting probability distributions: differential geometry and model selection. Proc. Natl. Acad. Sci. USA 2000, 97, 11170.
+23. Rodriguez, C.C. The volume of bitnets. AIP Conf. Proc. 2004, 735, 555–564.
+24. Motter, A.E.; Albert, R. Networks in motion. Phys. Today 2012, 65, 43–48.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+
+entropy
+
+Article
+Learning from Complex Systems: On the Roles of
+Entropy and Fisher Information in Pairwise Isotropic
+Gaussian Markov Random Fields
+
+Alexandre Levada
+
+Computing Department, Federal University of São Carlos, Rod. Washington Luiz, km 235, São Carlos, SP, Brazil;
+E-Mail: alexandre@dc.ufscar.br
+
+Received: 4 December 2013 / Accepted: 30 January 2014 / Published: 18 February 2014
+
+Abstract: Markov random field models are powerful tools for the study of complex systems.
+However, little is known about how the interactions between the elements of such systems are
+encoded, especially from an information-theoretic perspective. In this paper, our goal is to
+shed light on the connection between Fisher information, Shannon entropy, information geometry and
+the behavior of complex systems modeled by isotropic pairwise Gaussian Markov random fields.
+We propose analytical expressions to compute local and global versions of these measures using
+Besag’s pseudo-likelihood function, characterizing the system’s behavior through its Fisher curve, a
+parametric trajectory across the information space that provides a geometric representation for the
+study of complex systems in which temperature deviates from infinity. Computational experiments
+show how the proposed tools can be useful in extracting relevant information from complex patterns.
+The obtained results quantify and support our main conclusion: in terms of information,
+moving towards higher entropy states (A → B) is different from moving towards lower entropy states
+(B → A), since the Fisher curves are not the same, given a natural orientation (the direction of time).
+
+Keywords: Markov random fields; information theory; Fisher information; entropy; maximum
+pseudo-likelihood estimation
+
+1. Introduction
+
+With the increasing value of information in modern society and the massive volume of digital
+data that is available, there is an urgent need for developing novel methodologies for data filtering and
+analysis in complex systems. In this scenario, the notion of what is informative or not is a top priority.
+Sometimes, patterns that at first may appear to be locally irrelevant may turn out to be extremely
+informative in a more global perspective. In complex systems, this is a direct consequence of the
+intricate non-linear relationship between the pieces of data along different locations and scales.
+Within this context, information theoretic measures play a fundamental role in a huge variety of
+applications, since they represent statistical knowledge in a systematic, elegant and formal framework.
+Since the first works of Shannon [1], and later with many other generalizations [2–4], the concept of
+entropy has been adapted and successfully applied to almost every field of science, among which we
+can cite physics [5], mathematics [6–8], economics [9] and, fundamentally, information theory [10–12].
+Similarly, the concept of Fisher information [13,14] has been shown to reveal important properties of
+statistical procedures, from lower bounds on estimation methods [15–17] to information geometry [18,19].
+Roughly speaking, Fisher information can be thought of as the likelihood analog of entropy, which is a
+probability-based measure of uncertainty.
+In general, classical statistical inference is focused on capturing information about location and
+dispersion of unknown parameters of a given family of distributions and studying how this information
+is related to uncertainty in estimation procedures. In typical situations, an exponential family of
+distributions and an independence hypothesis (independent random variables) are often assumed, giving
+the likelihood function a series of desirable mathematical properties [15–17].
+
+Entropy 2014, 16, 1002–1036; doi:10.3390/e16021002
+www.mdpi.com/journal/entropy
+
+Although mathematically convenient for many problems, in complex systems modeling,
+independence assumption is not reasonable, because much of the information is somehow encoded
+in the relations between the random variables [20,21]. In order to overcome this limitation, Markov
+random field (MRF) models appear to be a natural generalization of the classical approach by the
+replacement of the independence assumption by a more realistic conditional independence assumption.
+Basically, in every MRF, knowledge of a finite-support neighborhood around a given variable isolates it
+from all the remaining variables. A further simplification consists in considering a pairwise interaction
+model, constraining the size of the maximum clique to be two (in other words, the model captures
+only binary relationships). Moreover, if the MRF model is isotropic, which means that the parameter
+controlling the interactions between neighboring variables is invariant to change in the directions,
+all the information regarding the spatial dependence structure of the system is conveyed by a single
+parameter, from now on denoted by β (or simply, the inverse temperature).
+In this paper, we assume an isotropic pairwise Gaussian Markov random field (GMRF) model [22,23],
+also known as an auto-normal model or a conditional auto-regressive model [24,25]. Basically, the question
+that motivated this work and that we are trying to elucidate here is: What kind of information is encoded
+by the β parameter in such a model? We want to know how this parameter, and as a consequence, the
+whole spatial dependence structure of a complex system modeled by a Gaussian Markov random field, is
+related to both local and global information theoretic measures, more precisely the observed and expected
+Fisher information, as well as self-information and Shannon entropy.
+In searching for answers for our fundamental question, investigations led us to an exact expression
+for the asymptotic variance of the maximum pseudo-likelihood (MPL) estimator of β in an isotropic
+pairwise GMRF model, suggesting that asymptotic efficiency is not granted. In the context of statistical
+data analysis, Fisher information plays a central role in providing tools and insights for modeling
+the interactions between complex systems and their components. The advantage of MRF models
+over the traditional statistical ones is that MRFs take into account the dependence between pieces of
+information as a function of the system’s temperature, which may even vary over time. Briefly
+speaking, this investigation aims to explore ways to measure and quantify distances between complex
+systems operating in different thermodynamical conditions. By analyzing and comparing the behavior
+of local patterns observed throughout the system (defined over a regular 2D lattice), it is possible
+to measure how informative those patterns are for a given inverse temperature, β (which
+encodes the expected global behavior).
+In summary, our idea is to describe the behavior of a complex system in terms of information
+as its temperature deviates from infinity (when the particles are statistically independent) to a lower
+bound. The obtained results suggest that, in the beginning, when the temperature is infinite and the
+information equilibrium prevails, the information is somehow spread along the system. However,
+when temperature is low and this equilibrium condition does not hold anymore, we have a more
+sparse representation in terms of information, since this information is concentrated in the boundaries
+of the regions that define a smooth global configuration. In the vast remainder of this “universe”, due
+to this smooth constraint, the strong alignment between the particles prevails, which is exactly the
+expected global behavior for temperatures below a critical value (making the majority of the interaction
+patterns along the system uninformative).
+The remainder of the paper is organized as follows: Section 2 discusses a technique for the
+estimation of the inverse temperature parameter, called the maximum pseudo-likelihood (MPL)
+approach, and provides derivations for the observed Fisher information in an isotropic pairwise GMRF
+model. Intuitive interpretations for the two versions of this local measure are discussed. In Section 3,
+we derive analytical expressions for the computation of the expected Fisher information, which allows
+us to assign a global information measure for a given system configuration. Similarly, in Section 4, an
+expression for the global entropy of a system modeled by a GMRF is shown. The results suggest a
+connection between maximum pseudo-likelihood and minimum entropy criteria in the estimation of
+the inverse temperature parameter on GMRFs. Section 5 discusses the uncertainty in the estimation
+of this important parameter by defining an expression for the asymptotic variance of its maximum
+pseudo-likelihood estimator in terms of both forms of Fisher information. In Section 6, the definition
+of the Fisher curve of a system as a parametric trajectory in the information space is proposed. Section
+7 shows the experimental setup. Computational simulations with both Markov chain Monte Carlo
+algorithms and some real data were conducted, showing how the proposed tools can be used to extract
+relevant information from complex systems. Finally, Section 8 presents our conclusions, final remarks
+and possibilities for future works.
+
+2. Fisher Information in Isotropic Pairwise GMRFs
+
+The remarkable Hammersley–Clifford theorem [26] states the equivalence between Gibbs random
+fields (GRF) and Markov random fields (MRF), which implies that any MRF can be defined either in
+terms of a global (joint Gibbs distribution) or a local (set of local conditional density functions) model.
+For our purposes, we will choose the latter representation.
+
+Definition 1. An isotropic pairwise Gaussian Markov random field regarding a local neighborhood system,
+$\eta_i$, defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ is completely characterized by a set of n local conditional density
+functions $p(x_i \mid \eta_i, \vec{\theta})$, given by:
+
+\[
+p\left(x_i \mid \eta_i, \vec{\theta}\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2 \right\} \tag{1}
+\]
+
+with $\vec{\theta} = (\mu, \sigma^2, \beta)$, where $\mu$ and $\sigma^2$ are the expected value and the variance of the random variables,
+and β = 1/T is the parameter that controls the interaction between the variables (inverse temperature).
+Note that, for β = 0, the model degenerates to the usual Gaussian distribution. From an information
+geometry perspective [18,19], this means that we are constrained to a sub-manifold within the
+Riemannian manifold of probability distributions, where the natural Riemannian metric (tensor)
+is given by the Fisher information. It has been shown that the geometric structure of exponential
+family distributions exhibits constant curvature. However, little is known about information geometry
+on more general statistical models, such as GMRFs. For β > 0, some degree of correlation between
+the observations is expected, making the interactions grow stronger. Typical choices for ηi are the
+first and second order non-causal neighborhood systems, defined by the sets of four and eight nearest
+neighbors, respectively.
+
+2.1. Maximum Pseudo-Likelihood Estimation
+
+Maximum likelihood estimation is intractable in MRF parameter estimation, due to the existence
+of the partition function in the joint Gibbs distribution. An alternative, proposed by Besag [24], is
+maximum pseudo-likelihood estimation, which is based on the conditional independence principle.
+The pseudo-likelihood function is defined as the product of the local conditional density functions (LCDFs) for all the n variables of the
+system, modeled as a random field.
+
+Definition 2. Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood
+system, $\eta_i$. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the set corresponding to the observations at time
+t, the pseudo-likelihood function of the model is defined by:
+
+\[
+L\left(\vec{\theta}; X^{(t)}\right) = \prod_{i=1}^{n} p\left(x_i \mid \eta_i, \vec{\theta}\right) \tag{2}
+\]
+
+
+Note that the pseudo-likelihood function is a function of the parameters. For better mathematical
+tractability, it is usual to take the logarithm of $L(\vec{\theta}; X^{(t)})$. Plugging Equation (1) into Equation (2) and
+taking the logarithm leads to:
+
+\[
+\log L\left(\vec{\theta}; X^{(t)}\right) = -\frac{n}{2} \log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2 \tag{3}
+\]
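+
+As an illustration of Equation (3), the following Python sketch evaluates the log pseudo-likelihood on a regular 2D lattice. It is not taken from the paper: the function names are hypothetical, the neighborhood is assumed to be the first-order (four nearest neighbors) system, and periodic (toroidal) boundaries are assumed so that every site has exactly four neighbors.

```python
import numpy as np


def neighbor_sum(field: np.ndarray) -> np.ndarray:
    """Sum over the four nearest neighbors of every site.

    Assumption: periodic boundaries, implemented by wrapping with np.roll.
    """
    return (
        np.roll(field, 1, axis=0)
        + np.roll(field, -1, axis=0)
        + np.roll(field, 1, axis=1)
        + np.roll(field, -1, axis=1)
    )


def log_pseudo_likelihood(x: np.ndarray, mu: float, sigma2: float, beta: float) -> float:
    """Equation (3) for a 2D field x with parameters (mu, sigma2, beta)."""
    n = x.size
    centered = x - mu
    # Residual x_i - mu - beta * sum_{j in eta_i} (x_j - mu) at every site.
    residual = centered - beta * neighbor_sum(centered)
    return float(
        -0.5 * n * np.log(2.0 * np.pi * sigma2)
        - np.sum(residual ** 2) / (2.0 * sigma2)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(loc=0.0, scale=1.0, size=(16, 16))
    print(log_pseudo_likelihood(sample, mu=0.0, sigma2=1.0, beta=0.05))
```

+For β = 0 the residual reduces to x_i − μ, so the function returns the ordinary Gaussian log-likelihood of independent observations, in line with the remark after Definition 1.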
+
+By differentiating Equation (3) with respect to each parameter and properly solving the
+pseudo-likelihood equations, we obtain the following maximum pseudo-likelihood estimators for the
+parameters, μ, σ² and β:
+
+\[
+\hat{\beta}_{MPL} = \frac{\displaystyle\sum_{i=1}^{n} \left[ \left(x_i - \mu\right) \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]}{\displaystyle\sum_{i=1}^{n} \left[ \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2} \tag{4}
+\]
+
+\[
+\hat{\mu}_{MPL} = \frac{1}{n\left(1 - k\beta\right)} \sum_{i=1}^{n} \left( x_i - \beta \sum_{j \in \eta_i} x_j \right) \tag{5}
+\]
+
+\[
+\hat{\sigma}^2_{MPL} = \frac{1}{n} \sum_{i=1}^{n} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2 \tag{6}
+\]
+
+where k denotes the cardinality of the non-causal neighborhood set ηi. Note that if β = 0, the MPL
+estimators of both μ and σ2 become the widely known sample mean and sample variance.
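+
+The closed-form estimators (4)–(6) translate directly into array operations. The sketch below is illustrative only and not the author's implementation: the function names are hypothetical, a first-order neighborhood (k = 4) is assumed, and periodic boundaries are assumed so that k is the same at every site. Each estimator takes the other parameters as given, exactly as the formulas are written; `beta_mpl` divides by zero on a constant field.

```python
import numpy as np


def neighbor_sum(field: np.ndarray) -> np.ndarray:
    """Sum over the four nearest neighbors (assumed periodic boundaries)."""
    return (
        np.roll(field, 1, axis=0)
        + np.roll(field, -1, axis=0)
        + np.roll(field, 1, axis=1)
        + np.roll(field, -1, axis=1)
    )


def beta_mpl(x: np.ndarray, mu: float) -> float:
    """Equation (4): MPL estimator of the inverse temperature, given mu."""
    centered = x - mu
    s = neighbor_sum(centered)
    return float(np.sum(centered * s) / np.sum(s ** 2))


def mu_mpl(x: np.ndarray, beta: float, k: int = 4) -> float:
    """Equation (5): MPL estimator of the mean, given beta (requires k * beta != 1)."""
    return float(np.sum(x - beta * neighbor_sum(x)) / (x.size * (1.0 - k * beta)))


def sigma2_mpl(x: np.ndarray, mu: float, beta: float) -> float:
    """Equation (6): MPL estimator of the variance, given mu and beta."""
    centered = x - mu
    return float(np.mean((centered - beta * neighbor_sum(centered)) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sample = rng.normal(size=(32, 32))  # independent noise, so beta should be near 0
    m = sample.mean()
    b = beta_mpl(sample, m)
    print("beta:", b, "mu:", mu_mpl(sample, b), "sigma2:", sigma2_mpl(sample, m, b))
```

+With β = 0, `mu_mpl` and `sigma2_mpl` return the sample mean and the (biased) sample variance, matching the remark above.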
+Since the cardinality of the neighborhood system, k = |ηi|, is spatially invariant (we are assuming
+a regular neighborhood system) and each variable is dependent on a fixed number of neighbors on a
+lattice, ˆβMPL can be rewritten in terms of cross-covariances:
+
+ˆβMPL =
+∑
+j∈ηi
+ˆσij
+
+∑
+j∈ηi ∑
+k∈ηi
+ˆσjk
+(7)
+
+where σij denotes the sample covariance between the central variable, xi, and xj ∈ ηi. Similarly, σjk
+denotes the sample covariance between two variables belonging to the neighborhood system, ηi (the
+definition of the neighborhood system, ηi, does not include the the location, si).
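+
+As a quick check of the remark that the estimators reduce to familiar quantities, setting $\beta = 0$ in
+Equations (5) and (6) gives the following (a worked substitution; $\bar{x}$ is a shorthand for the sample mean
+introduced here for illustration only):
+
+\begin{equation*}
+\hat{\mu}_{MPL}\Big|_{\beta = 0} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}, \qquad \hat{\sigma}^2_{MPL}\Big|_{\beta = 0} = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \mu \right)^2
+\end{equation*}
+
+so that, once $\mu$ is replaced by $\bar{x}$, the second expression is the usual (biased, $1/n$) sample variance.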
+
+2.2. Fisher Information of Spatial Dependence Parameters
+
+Basically, Fisher information measures the amount of information a sample conveys about
+an unknown parameter. It can be thought of as the likelihood analog of entropy, which is a
+probability-based measure of uncertainty. Often, when we are dealing with independent and identically
+distributed (i.i.d.) random variables, the computation of the global Fisher information present in
+a random sample $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ is quite straightforward, since each observation, $x_i$,
+$i = 1, 2, \ldots, n$, brings exactly the same amount of information (when we are dealing with independent
+samples, the superscript, $t$, is usually suppressed, since the underlying dependence structure does
+not change through time). However, this is not true for spatial dependence parameters in MRFs,
+since different configuration patterns ($x_i \cup \eta_i$) provide distinct contributions to the local observed
+Fisher information, which can be used to derive a reasonable approximation to the global Fisher
+information [27].
+
+
+2.3. The Information Equality
+
+It is widely known from statistical inference theory that, under certain regularity conditions,
+information equality holds in the case of independent observations in the exponential family [15–17].
+In other words, we can compute the Fisher information of a random sample regarding a parameter of
+interest, θ, by:
+
+\begin{equation}
+I\left(\theta; X^{(t)}\right) = E\left[ \left( \frac{\partial}{\partial\theta} \log L\left(\theta; X^{(t)}\right) \right)^2 \right] = -E\left[ \frac{\partial^2}{\partial\theta^2} \log L\left(\theta; X^{(t)}\right) \right] \tag{8}
+\end{equation}
+
+where $L\left(\theta; X^{(t)}\right)$ denotes the likelihood function at a time instant, $t$. In our investigations, to avoid the
+joint Gibbs distribution, often intractable due to the presence of the partition function (global Gibbs
+field), we replace the usual likelihood function by Besag’s pseudo-likelihood function, and then, we
+work with the local model instead (local Markov field).
+However, given the intrinsic spatial dependence structure of Gaussian Markov random field
+models, information equilibrium is not a natural condition. As we will discuss later, in general,
+information equality fails. Thus, in a GMRF model, we have to consider two kinds of Fisher
+information, from now on denoted by Type I (due to the first derivative of the pseudo-likelihood
+function) and Type II (due to the second derivative of the pseudo-likelihood function). Eventually,
+when certain conditions are satisfied, these two values of information will converge to a unique bound.
+Essentially, $\beta$ is the parameter responsible for controlling whether both forms of information converge or
+diverge. Knowing the role of $\beta$ (inverse temperature) in a GMRF model, it is expected that for $\beta = 0$
+(or $T \to \infty$), information equilibrium prevails. In fact, we will see in the following sections that as $\beta$
+deviates from zero (and long-term correlations start to emerge), the divergence between the two kinds
+of information increases.
+In terms of information geometry, it has been shown that the geometric structure of the exponential
+family of distributions is basically given by the Fisher information matrix, which is the natural
+Riemannian metric (metric tensor) [18,19]. So, when the inverse temperature parameter is zero, the
+geometric structure of the model is a surface, since the parametric space is 2D ($\mu$ and $\sigma^2$). However,
+as the inverse temperature parameter starts to increase, the original surface is gradually transformed
+into a 3D Riemannian manifold, equipped with a novel metric tensor (the $3 \times 3$ Fisher information
+matrix for $\mu$, $\sigma^2$ and $\beta$). In this context, by measuring the Fisher information regarding the inverse
+temperature parameter along an interval ranging from $\beta_{MIN} = A = 0$ to $\beta_{MAX} = B$, we are essentially
+trying to capture part of the deformation in the geometric structure of the model. In this paper, we
+focus on the computation of this measure. In future works we expect to derive the complete Fisher
+information matrix in order to completely characterize the transformations in the metric tensor.
+
+2.4. Observed Fisher Information
+
+In order to quantify the amount of information conveyed by a local configuration pattern in a
+complex system, the concept of observed Fisher information must be defined.
+
+Definition 3. Consider an MRF defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system, $\eta_i$. The
+Type I local observed Fisher information for the observation, $x_i$, regarding the spatial dependence parameter, $\beta$, is
+defined in terms of its local conditional density function as:
+
+\begin{equation}
+\phi_\beta(x_i) = \left[ \frac{\partial}{\partial\beta} \log p\left(x_i \mid \eta_i, \vec{\theta}\right) \right]^2 \tag{9}
+\end{equation}
+
+
+Hence, for an isotropic pairwise GMRF model, the Type I local observed Fisher information
+regarding β for the observation, xi, is given by:
+
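+\begin{align}
+\phi_\beta(x_i) &= \frac{1}{\sigma^4} \left\{ \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left( x_j - \mu \right) \right] \left[ \sum_{j \in \eta_i} \left( x_j - \mu \right) \right] \right\}^2 \notag \\
+&= \frac{1}{\sigma^4} \left[ \sum_{j \in \eta_i} \left( x_i - \mu \right) \left( x_j - \mu \right) - \beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left( x_j - \mu \right) \left( x_k - \mu \right) \right]^2 \tag{10}
+\end{align}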
+
+Definition 4. Consider an MRF defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system, $\eta_i$. The
+Type II local observed Fisher information for the observation, $x_i$, regarding the spatial dependence parameter, $\beta$,
+is defined in terms of its local conditional density function as:
+
+\begin{equation}
+\psi_\beta(x_i) = -\frac{\partial^2}{\partial\beta^2} \log p\left(x_i \mid \eta_i, \vec{\theta}\right) \tag{11}
+\end{equation}
+
+In case of an isotropic pairwise GMRF model, the Type II local observed Fisher information
+regarding β for the observation, xi, is given by:
+
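+\begin{equation}
+\psi_\beta(x_i) = \frac{1}{\sigma^2} \left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \tag{12}
+\end{equation}
+
+Note that $\psi_\beta(x_i)$ does not depend on $x_i$, only on the neighborhood system, $\eta_i$.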
+Therefore, we have two local measures, φβ(xi) and ψβ(xi), that can be assigned to every element of
+a system modeled by an isotropic pairwise GMRF. In the following, we will discuss some interpretations
+for what is being measured with the proposed tools and how to define global versions for these
+measures by means of the expected Fisher information.
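+
+Both closed forms follow from two differentiations of the local log-density. The steps can be sketched as
+follows, assuming the Gaussian local conditional density implied by Equation (3) and writing
+$S_i = \sum_{j \in \eta_i} (x_j - \mu)$ (a shorthand introduced here for illustration only):
+
+\begin{align*}
+\log p\left(x_i \mid \eta_i, \vec{\theta}\right) &= -\frac{1}{2} \log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \left( x_i - \mu - \beta S_i \right)^2 \\
+\frac{\partial}{\partial\beta} \log p\left(x_i \mid \eta_i, \vec{\theta}\right) &= \frac{1}{\sigma^2} \left( x_i - \mu - \beta S_i \right) S_i \\
+\frac{\partial^2}{\partial\beta^2} \log p\left(x_i \mid \eta_i, \vec{\theta}\right) &= -\frac{1}{\sigma^2} S_i^2
+\end{align*}
+
+Squaring the first derivative reproduces Equation (10), and negating the second derivative gives the Type II
+quantity of Equation (12), $S_i^2 / \sigma^2$, which involves the neighbors only.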
+
+2.5. The Role of Fisher Information in GMRF Models
+
+At this point, a relevant issue is the interpretation of these Fisher information measures in a
+complex system modeled by an isotropic pairwise GMRF. Roughly speaking, φβ(xi) is the quadratic
+rate of change of the logarithm of the local likelihood function at xi, given a global value of β. As
+this global value of β determines what would be the expected global behavior (if β is large, a high
+degree of correlation among the observations is expected and if β is close to zero, the observations are
+independent), it is reasonable to admit that configuration patterns showing values of $\phi_\beta(x_i)$ close to
+zero are more likely to be observed throughout the field, since their likelihood values are high (close to
+the maximum local likelihood condition). In other words, these patterns are more “aligned” with what is
+considered to be the expected global behavior, and therefore, they convey little information about the
+spatial dependence structure (these samples are not informative, since they are expected to exist in a
+system operating at that particular value of inverse temperature).
+Now, let us move on to configuration patterns showing high values of $\phi_\beta(x_i)$. Those samples
+can be considered landmarks, because they convey a large amount of information about the global
+spatial dependence structure. Roughly speaking, those points are very informative, since they are
+not expected to exist for that particular value of $\beta$ (which guides the expected global behavior of the
+system). Therefore, Type I local observed Fisher information minimization in GMRFs can be a useful
+tool in producing novel configuration patterns that are more likely to exist given the chosen value of
+inverse temperature. Basically, $\phi_\beta(x_i)$ tells us how informative a given pattern is for that specific global
+behavior (represented by a single parameter in an isotropic pairwise GMRF model). In summary, this
+measure quantifies the degree of agreement between an observation, $x_i$, and the configuration defined
+by its neighborhood system for a given $\beta$.
+As we will see later in the experiments section, typical informative patterns (those showing
+high values of φβ(xi)) in an organized system are located at the boundaries of the regions defining
+homogeneous areas (since these boundary samples show an unexpected behavior for large β, which is:
+there is no strong agreement between xi and its neighbors).
+Let us analyze the Type II local observed Fisher information, ψβ(xi). Informally speaking, this
+measure can be interpreted as a curvature measure, that is, how curved is the local likelihood function
+at xi. Thus, patterns showing low values of ψβ(xi) tend to have a nearly flat local likelihood function.
+This means that we are dealing with a pattern that could have been observed for a variety of β values
+(a large set of β values have approximately the same likelihood). An implication of this fact is that
+in a system dominated by this kind of patterns (patterns for which ψβ(xi) is close to zero), small
+perturbations may cause a sharp change in β (and, therefore, in the expected global behavior). In other
+words, these patterns are more susceptible to changes once they do not have a “stable” configuration
+(it raises our uncertainty about the true value of β).
+On the other hand, if the global configuration is mostly composed of patterns exhibiting large
+values of $\psi_\beta(x_i)$, changes in the global structure are unlikely to happen (uncertainty on $\beta$ is sufficiently
+small). Basically, $\psi_\beta(x_i)$ measures the degree of agreement or dependence among the observations
+belonging to the same neighborhood system. If at a given $x_i$, the observations belonging to $\eta_i$ are
+totally symmetric around the mean value, $\psi_\beta(x_i)$ would be zero. This is reasonable to expect in this
+situation, as there is no information about the induced spatial dependence structure (this means that
+there is no contextual information available at this point). Notice that the role of $\psi_\beta(x_i)$ is not the same
+as that of $\phi_\beta(x_i)$. Actually, these two measures are almost inversely related, since if at $x_i$ the value of $\phi_\beta(x_i)$
+is high (it is a landmark or boundary pattern), then it is expected that $\psi_\beta(x_i)$ will be low (in decision
+boundaries or edges, the uncertainty about $\beta$ is higher, causing $\psi_\beta(x_i)$ to be small). In fact, we will
+observe this behavior in some computational experiments conducted in later sections of the paper.
+It is important to mention that these rather informal arguments define the basis for understanding
+the meaning of the asymptotic variance of maximum pseudo-likelihood estimators, as we will discuss
+ahead. In summary, ψβ(xi) is a measure of how sure or confident we are about the local spatial
+dependence structure (at a given point, xi), since a high average curvature is desired for predicting the
+system’s global behavior in a reasonable manner (reducing the uncertainty of β estimation).
+
+3. Expected Fisher Information
+
+In order to avoid the use of approximations in the computation of the global Fisher information
+in an isotropic pairwise GMRF, in this section, we provide an exact expression for $\hat{\phi}_\beta$ and $\hat{\psi}_\beta$ as Type
+I and Type II expected Fisher information. One advantage of using the expected Fisher information
+instead of its global observed counterpart is the faster computing time. As we will see, instead of
+computing a single local measure for each observation, $x_i \in X$, and then taking the average, both
+$\Phi_\beta$ and $\Psi_\beta$ expressions depend only on the covariance matrix of the configuration patterns observed
+along the random field.
+
+3.1. The Type I Expected Fisher Information
+
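+Recall that the Type I expected Fisher information, from now on denoted by $\Phi_\beta$, is given by:
+
+\begin{equation}
+\Phi_\beta = E\left[ \left( \frac{\partial}{\partial\beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right)^2 \right] \tag{13}
+\end{equation}
+
+The Type II expected Fisher information, from now on denoted by $\Psi_\beta$, is given by:
+
+\begin{equation}
+\Psi_\beta = -E\left[ \frac{\partial^2}{\partial\beta^2} \log L\left(\vec{\theta}; X^{(t)}\right) \right] \tag{14}
+\end{equation}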
+
+
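+We first proceed to the definition of $\Phi_\beta$. Plugging Equation (3) in Equation (13), and after some
+algebra, we obtain the following expression, which is composed of four main terms:
+
+\begin{align}
+\Phi_\beta &= \frac{1}{\sigma^4} E\left\{ \left[ \sum_{s=1}^{n} \left( x_s - \mu - \beta \sum_{j \in \eta_s} \left( x_j - \mu \right) \right) \left( \sum_{j \in \eta_s} \left( x_j - \mu \right) \right) \right]^2 \right\} \tag{15} \\
+&= \frac{1}{\sigma^4} E\left\{ \sum_{s=1}^{n} \sum_{r=1}^{n} \left[ \left( x_s - \mu - \beta \sum_{j \in \eta_s} \left( x_j - \mu \right) \right) \left( x_r - \mu - \beta \sum_{k \in \eta_r} \left( x_k - \mu \right) \right) \right] \right. \notag \\
+&\qquad \left. \times \left[ \left( \sum_{j \in \eta_s} \left( x_j - \mu \right) \right) \left( \sum_{k \in \eta_r} \left( x_k - \mu \right) \right) \right] \right\} \notag \\
+&= \frac{1}{\sigma^4} E\left\{ \sum_{s=1}^{n} \sum_{r=1}^{n} \left[ \left( x_s - \mu \right) \left( x_r - \mu \right) - \beta \sum_{k \in \eta_r} \left( x_s - \mu \right) \left( x_k - \mu \right) - \beta \sum_{j \in \eta_s} \left( x_r - \mu \right) \left( x_j - \mu \right) \right. \right. \notag \\
+&\qquad \left. \left. + \beta^2 \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \left[ \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \right\} \notag \\
+&= \frac{1}{\sigma^4} \sum_{s=1}^{n} \sum_{r=1}^{n} \left\{ \sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ \left( x_s - \mu \right) \left( x_r - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \right. \notag \\
+&\qquad - \beta \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ \left( x_s - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \left( x_l - \mu \right) \right] \notag \\
+&\qquad - \beta \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ \left( x_r - \mu \right) \left( x_m - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \notag \\
+&\qquad \left. + \beta^2 \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ \left( x_m - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \left( x_l - \mu \right) \right] \right\} \notag
+\end{align}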
+
+Hence, the expression for $\Phi_\beta$ is composed of four main terms, each one of them involving
+a summation of higher-order cross-moments. According to Isserlis’ theorem [28], for zero-mean, jointly normally
+distributed random variables, we can compute higher order moments in terms of the covariance
+matrix through the following identity:
+
+\begin{equation}
+E\left[ X_1 X_2 X_3 X_4 \right] = E\left[ X_1 X_2 \right] E\left[ X_3 X_4 \right] + E\left[ X_1 X_3 \right] E\left[ X_2 X_4 \right] + E\left[ X_2 X_3 \right] E\left[ X_1 X_4 \right] \tag{16}
+\end{equation}
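+
+A simple sanity check of this identity, included here only as an illustration, is the degenerate case in which
+all four variables coincide, $X_1 = X_2 = X_3 = X_4 = X$ with $X \sim N(0, \sigma^2)$. Each of the three pairings
+then contributes $\sigma^2 \cdot \sigma^2$, which recovers the well-known fourth moment of a zero-mean Gaussian:
+
+\begin{equation*}
+E\left[ X^4 \right] = E\left[ X^2 \right] E\left[ X^2 \right] + E\left[ X^2 \right] E\left[ X^2 \right] + E\left[ X^2 \right] E\left[ X^2 \right] = 3\sigma^4
+\end{equation*}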
+
+Then, the first term of Equation (15) is reduced to:
+
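+\begin{align}
+&\sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ \left( x_s - \mu \right) \left( x_r - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \tag{17} \\
+&\quad = \sum_{j \in \eta_s} \sum_{k \in \eta_r} \Big\{ E\left[ \left( x_s - \mu \right) \left( x_r - \mu \right) \right] E\left[ \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \notag \\
+&\qquad\qquad + E\left[ \left( x_s - \mu \right) \left( x_j - \mu \right) \right] E\left[ \left( x_r - \mu \right) \left( x_k - \mu \right) \right] \notag \\
+&\qquad\qquad + E\left[ \left( x_r - \mu \right) \left( x_j - \mu \right) \right] E\left[ \left( x_s - \mu \right) \left( x_k - \mu \right) \right] \Big\} \notag \\
+&\quad = \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( \sigma_{sr} \sigma_{jk} + \sigma_{sj} \sigma_{rk} + \sigma_{rj} \sigma_{sk} \right) \notag
+\end{align}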
+
+
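+where $\sigma_{sr}$ denotes the covariance between variables $x_s$ and $x_r$ (note that in an MRF, we have $\sigma_{sr} = 0$ if
+$x_r \notin \eta_s$). We now proceed to the expansion of the second main term of Equation (15). Similarly, by
+applying Isserlis’ identity, we have:
+
+\begin{equation}
+\sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ \left( x_s - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \left( x_l - \mu \right) \right] = \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left( \sigma_{sj} \sigma_{kl} + \sigma_{sk} \sigma_{jl} + \sigma_{jk} \sigma_{sl} \right) \tag{18}
+\end{equation}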
+
+The third term of Equation (15) can be rewritten as:
+
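+\begin{align}
+&\sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ \left( x_r - \mu \right) \left( x_m - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \tag{19} \\
+&\quad = \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( \sigma_{rm} \sigma_{jk} + \sigma_{rj} \sigma_{mk} + \sigma_{mj} \sigma_{rk} \right) \notag
+\end{align}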
+
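+Finally, the fourth term is:
+
+\begin{align}
+&\sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ \left( x_m - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \left( x_l - \mu \right) \right] \tag{20} \\
+&\quad = \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left( \sigma_{mj} \sigma_{kl} + \sigma_{mk} \sigma_{jl} + \sigma_{ml} \sigma_{jk} \right) \notag
+\end{align}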
+
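+Therefore, by combining Equations (17)–(20), we have the complete expression for $\Phi_\beta$,
+the Type I expected Fisher information for an isotropic pairwise GMRF model regarding the inverse
+temperature parameter, as:
+
+\begin{align}
+\Phi_\beta = \frac{1}{\sigma^4} \sum_{s=1}^{n} \sum_{r=1}^{n} &\left\{ \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( \sigma_{sr} \sigma_{jk} + \sigma_{sj} \sigma_{rk} + \sigma_{rj} \sigma_{sk} \right) \right. \tag{21} \\
+&- \beta \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left( \sigma_{sj} \sigma_{kl} + \sigma_{sk} \sigma_{jl} + \sigma_{jk} \sigma_{sl} \right) \notag \\
+&- \beta \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( \sigma_{rm} \sigma_{jk} + \sigma_{rj} \sigma_{mk} + \sigma_{mj} \sigma_{rk} \right) \notag \\
+&\left. + \beta^2 \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left( \sigma_{mj} \sigma_{kl} + \sigma_{mk} \sigma_{jl} + \sigma_{ml} \sigma_{jk} \right) \right\} \notag
+\end{align}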
+
+However, since we are interested in studying how the spatial correlations change as the system evolves,
+we need to estimate a value for $\Phi_\beta$ given a single global state $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$. Hence, to
+compute $\Phi_\beta$ from a single static configuration $X^{(t)}$ (a photograph of the system at a given moment),
+we consider $n = 1$ in the previous equation, which means, among other things, that $s = r$ (which
+implies $\eta_s = \eta_r$) and that observations belonging to different local neighborhoods are independent
+from each other (as we are dealing with a pairwise interaction Markovian process, it does not make
+sense to model the interactions between variables that are far away from each other in the lattice).
+Before proceeding, we would like to clarify some points regarding the estimation of the $\beta$
+parameter and the computation of the expected Fisher information in the isotropic pairwise GMRF
+model. Basically, there are two main possibilities: (1) the parameter is spatially-invariant, which
+means that we have a unique value, $\hat{\beta}^{(t)}$, for a global configuration of the system, $X^{(t)}$ (this is our
+assumption); or (2) the parameter is spatially-variant, which means that we have a set of $\hat{\beta}_s$ values,
+for $s = 1, 2, \ldots, n$, each one of them estimated from $X_s = \{x_s^{(1)}, x_s^{(2)}, \ldots, x_s^{(t)}\}$ (we are observing the
+outcomes of a random pattern along time in a fixed position of the lattice). When we are dealing with
+the first model ($\beta$ is spatially-invariant), all possible observation patterns (samples) are extracted from
+the global configuration by a sliding window (with the shape of the neighborhood system) that moves
+through the lattice at a fixed time instant, $t$. In this case, we are interested in studying the spatial
+correlations, not the temporal ones. In other words, we would like to investigate how the spatial
+structure of a GMRF model is related to Fisher information (this is exactly the scenario described
+above, for which $n = 1$). Our motivation here is to characterize, via information-theoretic measures,
+the behavior of the system as it evolves from states of minimum entropy to states of maximum entropy
+(and vice versa) by providing a geometrical tool based on the definition of the Fisher curve, which will
+be introduced in the following sections.
+Therefore, in our case ($n = 1$), Equation (21) is further simplified for practical usage. By unifying
+$s$ and $r$ to a unique index, $i$, we have a final expression for $\Phi_\beta$ in terms of the local covariances between
+the random variables in a given neighborhood system (i.e., for the eight nearest neighbors):
+
+\begin{align}
+\Phi_\beta = \frac{1}{\sigma^4} &\left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left( \sigma^2 \sigma_{jk} + 2 \sigma_{ij} \sigma_{ik} \right) - 2\beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \left( \sigma_{ij} \sigma_{kl} + \sigma_{ik} \sigma_{jl} + \sigma_{il} \sigma_{jk} \right) \right. \tag{22} \\
+&\left. + \beta^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \sum_{m \in \eta_i} \left( \sigma_{jk} \sigma_{lm} + \sigma_{jl} \sigma_{km} + \sigma_{jm} \sigma_{kl} \right) \right] \notag
+\end{align}
+
+Note that we have two types of covariances in the definition of $\Phi_\beta$ for an isotropic pairwise GMRF: (1)
+covariances between the central variable, $x_i$, and a neighboring variable, $x_j$, denoted by $\sigma_{ij}$, for $j \in \eta_i$;
+and (2) covariances between two neighboring variables, $x_j$ and $x_k$, denoted by $\sigma_{jk}$, for $j, k \in \eta_i$. In the next sections, we
+will see how to compute the value of $\Psi_\beta$ directly from the covariance matrix of the local patterns.
+
+3.2. The Type II Expected Fisher Information
+
+Following the same methodology of replacing the likelihood function by the pseudo-likelihood
+function of the GMRF model, a closed form expression for Ψβ is developed. Plugging Equation (3)
+into Equation (14) leads us to:
+
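+\begin{align}
+\Psi_\beta &= \frac{1}{\sigma^2} \sum_{i=1}^{n} E\left\{ \left[ \sum_{x_j \in \eta_i} \left( x_j - \mu \right) \right]^2 \right\} \tag{23} \\
+&= \frac{1}{\sigma^2} \sum_{i=1}^{n} E\left[ \sum_{x_j \in \eta_i} \sum_{x_k \in \eta_i} \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \notag \\
+&= \frac{1}{\sigma^2} \sum_{i=1}^{n} \left[ \sum_{x_j \in \eta_i} \sum_{x_k \in \eta_i} E\left[ \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \right] \notag \\
+&= \frac{1}{\sigma^2} \sum_{i=1}^{n} \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \notag
+\end{align}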
+
+Note that unlike $\Phi_\beta$, $\Psi_\beta$ does not depend explicitly on $\beta$ (inverse temperature). As we have seen
+before, $\Phi_\beta$ is a quadratic function of the spatial dependence parameter.
+In order to simplify the notation and also to make computations easier, the expressions for $\Phi_\beta$
+and $\Psi_\beta$ can be rewritten in a matrix-vector form. Let $\Sigma_p$ be the covariance matrix of the random vectors
+$\vec{p}_i$, $i = 1, 2, \ldots, n$, obtained by lexicographic ordering of the local configuration patterns $x_i \cup \eta_i$. Thus,
+considering a neighborhood system, $\eta_i$, of size $K$, we have $\Sigma_p$ given by a $(K + 1) \times (K + 1)$ symmetric
+matrix (for $K + 1$ odd, i.e., $K = 4, 8, 12, \ldots$):
+
+\begin{equation*}
+\Sigma_p =
+\begin{pmatrix}
+\sigma_{1,1} & \cdots & \sigma_{1,K/2} & \sigma_{1,(K/2)+1} & \sigma_{1,(K/2)+2} & \cdots & \sigma_{1,K+1} \\
+\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
+\sigma_{K/2,1} & \cdots & \sigma_{K/2,K/2} & \sigma_{K/2,(K/2)+1} & \sigma_{K/2,(K/2)+2} & \cdots & \sigma_{K/2,K+1} \\
+\sigma_{(K/2)+1,1} & \cdots & \sigma_{(K/2)+1,K/2} & \sigma_{(K/2)+1,(K/2)+1} & \sigma_{(K/2)+1,(K/2)+2} & \cdots & \sigma_{(K/2)+1,K+1} \\
+\sigma_{(K/2)+2,1} & \cdots & \sigma_{(K/2)+2,K/2} & \sigma_{(K/2)+2,(K/2)+1} & \sigma_{(K/2)+2,(K/2)+2} & \cdots & \sigma_{(K/2)+2,K+1} \\
+\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
+\sigma_{K+1,1} & \cdots & \sigma_{K+1,K/2} & \sigma_{K+1,(K/2)+1} & \sigma_{K+1,(K/2)+2} & \cdots & \sigma_{K+1,K+1}
+\end{pmatrix}
+\end{equation*}
+
+Let $\Sigma_p^-$ be the submatrix of dimensions $K \times K$ obtained by removing the central row and central
+column of $\Sigma_p$ (the covariances between $x_i$ and each one of its neighbors, $x_j$). Then, for $K + 1$ odd, we
+have:
+
+\begin{equation}
+\Sigma_p^- =
+\begin{pmatrix}
+\sigma_{1,1} & \cdots & \sigma_{1,K/2} & \sigma_{1,(K/2)+2} & \cdots & \sigma_{1,K+1} \\
+\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
+\sigma_{K/2,1} & \cdots & \sigma_{K/2,K/2} & \sigma_{K/2,(K/2)+2} & \cdots & \sigma_{K/2,K+1} \\
+\sigma_{(K/2)+2,1} & \cdots & \sigma_{(K/2)+2,K/2} & \sigma_{(K/2)+2,(K/2)+2} & \cdots & \sigma_{(K/2)+2,K+1} \\
+\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
+\sigma_{K+1,1} & \cdots & \sigma_{K+1,K/2} & \sigma_{K+1,(K/2)+2} & \cdots & \sigma_{K+1,K+1}
+\end{pmatrix}
+\tag{24}
+\end{equation}
+
+Thus, $\Sigma_p^-$ is a matrix that stores only the covariances among the neighboring variables. Furthermore,
+let $\vec{\rho}$ be the vector of dimensions $K \times 1$ formed by all the elements of the central row of $\Sigma_p$, excluding
+the middle one (which is actually a variance), that is:
+
+\begin{equation}
+\vec{\rho} = \begin{bmatrix} \sigma_{(K/2)+1,1} & \cdots & \sigma_{(K/2)+1,K/2} & \sigma_{(K/2)+1,(K/2)+2} & \cdots & \sigma_{(K/2)+1,K+1} \end{bmatrix} \tag{25}
+\end{equation}
+
+Therefore, we can rewrite Equation (23) (for n = 1) using Kronecker products. The following definition
+provides a fast way to compute Φβ exploring these tensor products.
+
+Definition 5. Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood
+system, $\eta_i$, of size $K$ (usual choices for $K$ are even values: four, eight, 12, 20 or 24). Assuming that
+$X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time $t$ and $\vec{\rho}$ and $\Sigma_p^-$ are defined as in
+Equations (25) and (24), the Type I expected Fisher information, $\Phi_\beta$, for this state, $X^{(t)}$, is:
+
+\begin{equation}
+\Phi_\beta = \frac{1}{\sigma^4} \left[ \sigma^2 \left\| \Sigma_p^- \right\|_+ + 2 \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+ - 6\beta \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \right] \tag{26}
+\end{equation}
+
+where $\|A\|_+$ denotes the summation of all the entries of the matrix, $A$ (not to be confused with a matrix
+norm) and $\otimes$ denotes the Kronecker (tensor) product. From an information geometry perspective,
+the presence of tensor products indicates the intrinsic differential geometry of a manifold in the form
+of the Riemann curvature tensor [18]. Note that all the necessary information for computing the
+Fisher information is somehow encoded in the covariance matrix of the local configuration patterns,
+($x_i \cup \eta_i$), $i = 1, 2, \ldots, n$, as would be expected in the case of Gaussian variables (second-order statistics).
+The same procedure is applied to the Type II expected Fisher information.
+
+Definition 6. Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood
+system, $\eta_i$, of size $K$ (usual choices for $K$ are four, eight, 12, 20 or 24). Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$
+denotes the global configuration of the system at time $t$ and $\Sigma_p^-$ is defined as in Equation (24), the Type II expected
+Fisher information, $\Psi_\beta$, for this state, $X^{(t)}$, is given by:
+
+\begin{equation}
+\Psi_\beta = \frac{1}{\sigma^2} \left\| \Sigma_p^- \right\|_+ \tag{27}
+\end{equation}
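+
+A simplification that may help when evaluating Equations (26) and (27) numerically: the sum of all entries
+of a Kronecker product equals the product of the entry sums of its factors, $\|A \otimes B\|_+ = \|A\|_+ \|B\|_+$.
+Writing $a = \|\vec{\rho}\|_+$ and $b = \|\Sigma_p^-\|_+$ (shorthands introduced here for illustration, not part of the
+original notation), the two definitions can be sketched in scalar form as:
+
+\begin{align*}
+\left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+ &= a^2, \qquad \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^- \right\|_+ = ab, \qquad \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ = b^2 \\
+\Phi_\beta &= \frac{1}{\sigma^4} \left[ \sigma^2 b + 2a^2 - 6\beta ab + 3\beta^2 b^2 \right], \qquad \Psi_\beta = \frac{b}{\sigma^2}
+\end{align*}
+
+Under this reading, no Kronecker product needs to be formed explicitly; only the two entry sums $a$ and $b$
+are required.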
+
+3.3. Information Equilibrium in GMRF Models
+
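+From the definition of both $\Phi_\beta$ and $\Psi_\beta$, a natural question that arises is: under what
+conditions do we have $\Phi_\beta = \Psi_\beta$ in an isotropic pairwise GMRF model? As we can see from
+Equations (26) and (27), the difference between $\Phi_\beta$ and $\Psi_\beta$, from now on denoted by $\Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right)$,
+is simply:
+
+\begin{equation}
+\Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right) = \frac{1}{\sigma^4} \left[ 2 \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+ - 6\beta \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \right] \tag{28}
+\end{equation}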
+
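+Then, intuitively, the condition for information equality is achieved when $\Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right) = 0$. As
+$\Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right)$ is a simple quadratic function of the inverse temperature parameter, $\beta$, we can easily find
+that the value, $\beta^*$, for which $\Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right) = 0$, is:
+
+\begin{equation}
+\beta^* = \frac{\left\| \vec{\rho}^{\,T} \otimes \Sigma_p^- \right\|_+}{\left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+} \pm \frac{\sqrt{3}}{3} \, \frac{\sqrt{3 \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^- \right\|_+^2 - 2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+}}{\left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+} \tag{29}
+\end{equation}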
+
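+provided that $3 \left\| \vec{\rho}^{\,T} \otimes \Sigma_p^- \right\|_+^2 \geq 2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+$ and $\left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \neq 0$.
+Note that if $\left\| \vec{\rho} \otimes \vec{\rho}^{\,T} \right\|_+ = 0$, then one solution for the above equation is $\beta^* = 0$. In other words, when
+$\sigma_{ij} = 0$, $\forall j \in \eta_i$ (no correlation between $x_i$ and its neighbors, $x_j$), information equilibrium is achieved for
+$\beta^* = 0$, which in this case, is the maximum pseudo-likelihood estimate of $\beta$, since in this matrix-vector
+notation, $\hat{\beta}_{MPL}$ is given by:
+
+\begin{equation}
+\hat{\beta}_{MPL} = \frac{\displaystyle \sum_{j \in \eta_i} \hat{\sigma}_{ij}}{\displaystyle \sum_{j \in \eta_i} \sum_{k \in \eta_i} \hat{\sigma}_{jk}} = \frac{\left\| \vec{\rho} \right\|_+}{\left\| \Sigma_p^- \right\|_+} \tag{30}
+\end{equation}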
+
+In the isotropic pairwise GMRF model, if $\beta = 0$, then we have $\|\vec{\rho}\|_+ = 0$, and as a consequence,
+$\Phi_\beta = \Psi_\beta$. However, the opposite is not necessarily true, that is, we may observe that $\Phi_\beta = \Psi_\beta$ for a
+non-zero $\beta$. One example is for $\beta^*$, a solution of $\Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right) = 0$.
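+
+To make the roots of Equation (29) concrete, one may use the fact that the sum of entries of a Kronecker
+product factorizes, $\|A \otimes B\|_+ = \|A\|_+ \|B\|_+$. With the shorthands $a = \|\vec{\rho}\|_+$ and $b = \|\Sigma_p^-\|_+ > 0$
+(introduced here for illustration only), a sketch of the resulting simplification is:
+
+\begin{align*}
+\Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right) &= \frac{1}{\sigma^4} \left( 2a^2 - 6\beta ab + 3\beta^2 b^2 \right) \\
+\beta^* &= \frac{a}{b} \left( 1 \pm \frac{\sqrt{3}}{3} \right) = \hat{\beta}_{MPL} \left( 1 \pm \frac{\sqrt{3}}{3} \right) \\
+\Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right) \Big|_{\beta = \hat{\beta}_{MPL}} &= \frac{1}{\sigma^4} \left( 2a^2 - 6a^2 + 3a^2 \right) = -\frac{a^2}{\sigma^4}
+\end{align*}
+
+Under this factorization, the discriminant condition attached to Equation (29) reduces to $3a^2b^2 \geq 2a^2b^2$,
+which always holds, and the two roots lie on either side of the maximum pseudo-likelihood estimate.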
+
+4. Entropy in Isotropic Pairwise GMRFs
+
+Our definition of entropy is done by repeating the same process employed to derive Φβ and Ψβ.
+Knowing that the entropy of random variable x is defined by the expected value of self-information,
+given by −log p(x), it can be thought of as a probability-based counterpart to the Fisher information.
+
+Definition 7. Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood
+system, $\eta_i$. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time $t$,
+then the entropy, $H_\beta$, for this state $X^{(t)}$ is given by:
+
+\begin{align}
+H_\beta &= -E\left[ \log L\left(\vec{\theta}; X^{(t)}\right) \right] = -E\left[ \log \prod_{i=1}^{n} p\left(x_i \mid \eta_i, \vec{\theta}\right) \right] \tag{31} \\
+&= \frac{n}{2} \log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} E\left\{ \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2 \right\} \notag \\
+&= \frac{n}{2} \log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left\{ E\left[ \left( x_i - \mu \right)^2 \right] - 2\beta E\left[ \sum_{j \in \eta_i} \left( x_i - \mu \right) \left( x_j - \mu \right) \right] + \beta^2 E\left\{ \left[ \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2 \right\} \right\} \notag
+\end{align}
+
+After some algebra, the expression for Hβ becomes:
+
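+\begin{align}
+H_\beta &= \frac{n}{2} \log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ \sigma^2 - 2\beta \sum_{j \in \eta_i} \sigma_{ij} + \beta^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \right] \tag{32} \\
+&= \left[ \frac{n}{2} \log\left(2\pi\sigma^2\right) + \frac{n}{2} \right] - \frac{\beta}{\sigma^2} \sum_{i=1}^{n} \left[ \sum_{j \in \eta_i} \sigma_{ij} \right] + \frac{\beta^2}{2\sigma^2} \sum_{i=1}^{n} \left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \right] \notag
+\end{align}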
+
+Using the same matrix-vector notation introduced in the previous sections, we can further simplify the
+expression for Hβ (considering n = 1).
+
+Definition 8. Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood
+system, $\eta_i$. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time $t$
+and $\vec{\rho}$ and $\Sigma_p^-$ are defined as in Equations (25) and (24), the entropy, $H_\beta$, for this state, $X^{(t)}$, is given by:
+
+\begin{equation}
+H_\beta = H_G - \left[ \frac{\beta}{\sigma^2} \left\| \vec{\rho} \right\|_+ - \frac{\beta^2}{2\sigma^2} \left\| \Sigma_p^- \right\|_+ \right] = H_G - \left[ \frac{\beta}{\sigma^2} \left\| \vec{\rho} \right\|_+ - \frac{\beta^2}{2} \Psi_\beta \right] \tag{33}
+\end{equation}
+
+where $H_G$ denotes the entropy of a Gaussian random variable with variance $\sigma^2$ and $\Psi_\beta$ is the Type II
+expected Fisher information.
+Note that Shannon entropy is a quadratic function of the spatial dependence parameter, $\beta$.
+Since the coefficient of the quadratic term is non-negative ($\Psi_\beta$ is the Type II expected Fisher
+information), entropy is a convex function of $\beta$. Furthermore, as expected, when $\beta = 0$ and there is no
+induced spatial dependence in the system, the resulting expression for $H_\beta$ is the usual entropy of a
+Gaussian random variable, $H_G$. Thus, there is a value, $\hat{\beta}_{MH}$, for the inverse temperature parameter,
+which minimizes the entropy of the system. In fact, $\hat{\beta}_{MH}$ is given by:
+
+\begin{equation}
+\frac{\partial H_\beta}{\partial\beta} = \frac{\beta}{\sigma^2} \left\| \Sigma_p^- \right\|_+ - \frac{1}{\sigma^2} \left\| \vec{\rho} \right\|_+ = 0 \tag{34}
+\end{equation}
+
+\begin{equation*}
+\hat{\beta}_{MH} = \frac{\left\| \vec{\rho} \right\|_+}{\left\| \Sigma_p^- \right\|_+} = \hat{\beta}_{MPL}
+\end{equation*}
+
+showing that the maximum pseudo-likelihood and the minimum-entropy estimates are equivalent
+in an isotropic pairwise GMRF model. Moreover, using the derived equations, we see a relationship
+between $\Phi_\beta$, $\Psi_\beta$ and $H_\beta$:
+
+\begin{align}
+\Phi_\beta - \Psi_\beta &= \Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right) \tag{35} \\
+\frac{\partial^2 H_\beta}{\partial\beta^2} &= \Psi_\beta \notag
+\end{align}
+
+where the functional $\Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right)$ that represents the difference between $\Phi_\beta$ and $\Psi_\beta$ is defined by
+Equation (28). These equations relate the entropy and one form of Fisher information ($\Psi_\beta$) in GMRF
+models, showing that $\Psi_\beta$ can be roughly viewed as the curvature of $H_\beta$. In this sense, in a hypothetical
+information equilibrium condition $\Psi_\beta = \Phi_\beta = 0$, the entropy’s curvature would be null ($H_\beta$ would
+never change). These results suggest that an increase in the value of $\Psi_\beta$, which means stability (a
+measure of agreement between the neighboring observations of a given point), contributes to the curvature
+and, therefore, to inducing a change in the entropy of the system. In this context, the analysis of the
+Fisher information could bring us insights into predicting the entropy of a system.
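+
+As a worked consequence of Equations (33) and (34), the minimizer can be substituted back into the entropy
+to obtain its minimum value. Using the shorthands $a = \|\vec{\rho}\|_+$ and $b = \|\Sigma_p^-\|_+ > 0$ (introduced here for
+illustration only), so that $\hat{\beta}_{MH} = a/b$:
+
+\begin{equation*}
+H_\beta \Big|_{\beta = \hat{\beta}_{MH}} = H_G - \frac{1}{\sigma^2} \cdot \frac{a}{b} \cdot a + \frac{1}{2\sigma^2} \cdot \frac{a^2}{b^2} \cdot b = H_G - \frac{a^2}{2\sigma^2 b}
+\end{equation*}
+
+Under these assumptions, the minimum entropy never exceeds $H_G$, and the gap grows with the squared sum
+of the covariances between the central variable and its neighbors.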
+
+5. Asymptotic Variance of MPL Estimators
+
+It is known from the statistical inference literature that unbiasedness is a property that is not
+granted by maximum likelihood estimation, nor by maximum pseudo-likelihood (MPL) estimation.
+Actually, there is no universal method that guarantees the existence of unbiased estimators for a fixed
+n-sized sample. Often, in the exponential family of distributions, maximum likelihood estimators
+(MLEs) coincide with the UMVU (uniformly minimum variance unbiased) estimators, because MLEs
+are functions of complete sufficient statistics. There is an important result in statistical inference that
+shows that if the MLE is unique, then it is a function of sufficient statistics. We could enumerate
+and make a huge list of several properties that make maximum likelihood estimation a reference
+method [15–17]. One of the most important properties concerns the asymptotic behavior of MLEs:
+when we make the sample size grow infinitely (n → ∞), MLEs become asymptotically unbiased and
+efficient. Unfortunately, there is no result showing that the same occurs in maximum pseudo-likelihood
+estimation. The objective of this section is to propose a closed expression for the asymptotic variance
+of the maximum pseudo-likelihood estimator of $\beta$ in an isotropic pairwise GMRF model. Unsurprisingly, this
+variance is completely defined as a function of both forms of expected Fisher information, $\Psi_\beta$ and $\Phi_\beta$,
+since for general values of the inverse temperature parameter, the information equality condition fails.
+
+5.1. The Asymptotic Variance of the Inverse Temperature Parameter
+
+In mathematical statistics, asymptotic evaluations uncover several fundamental properties of
+inference methods, providing a powerful and general tool for studying and characterizing the behavior
+of estimators. In this section, our objective is to derive an expression for the asymptotic variance
+of the maximum pseudo-likelihood estimator of the inverse temperature parameter (β) in isotropic
+pairwise GMRF models. It is known from the statistical inference literature that both maximum
+likelihood and maximum pseudo-likelihood estimators share two important properties: consistency
+and asymptotic normality [29,30]. It is possible, therefore, to completely characterize their behaviors
+in the limiting case. In other words, the asymptotic distribution of β̂_MPL is normal, centered around
+the real parameter value (since consistency means that the estimator is asymptotically unbiased),
+with the asymptotic variance representing the uncertainty about how far we are from the mean (real
+value). From a statistical perspective, β̂_MPL ≈ N(β, υβ), where υβ denotes the asymptotic variance
+
+
+Entropy 2014, 16, 1002–1036
+
+of the maximum pseudo-likelihood estimator. It is known that the asymptotic covariance matrix of
+maximum pseudo-likelihood estimators is given by [31]:
+
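+C(\vec{\theta}) = H^{-1}(\vec{\theta}) \, J(\vec{\theta}) \, H^{-1}(\vec{\theta})
+(36)
+
+with:
+
+H(\vec{\theta}) = E_\beta\left[ \nabla^2 \log L\left(\vec{\theta}; X^{(t)}\right) \right]
+(37)
+
+J(\vec{\theta}) = \mathrm{Var}_\beta\left[ \nabla \log L\left(\vec{\theta}; X^{(t)}\right) \right]
+(38)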
+
+where H and J are built, respectively, from the Hessian (second derivatives) and the Jacobian (first
+derivatives) of the logarithm of the pseudo-likelihood function. Thus, considering the parameter of interest, β, we have the following
+definition for its asymptotic variance, υβ (the derivatives are taken with respect to β):
+
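+\upsilon_\beta = \frac{ \mathrm{Var}_\beta\left[ \frac{\partial}{\partial\beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right] }{ E_\beta^2\left[ \frac{\partial^2}{\partial\beta^2} \log L\left(\vec{\theta}; X^{(t)}\right) \right] }
+= \frac{ E_\beta\left[ \left( \frac{\partial}{\partial\beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right)^2 \right] - E_\beta^2\left[ \frac{\partial}{\partial\beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right] }{ E_\beta^2\left[ \frac{\partial^2}{\partial\beta^2} \log L\left(\vec{\theta}; X^{(t)}\right) \right] }
+(39)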
+
+However, note that the expected value of the first derivative of \log L(\vec{\theta}; X^{(t)}) with respect to β is zero:
+
+E\left[ \frac{\partial}{\partial\beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right] = \frac{1}{\sigma^2} \sum_{i=1}^{n} \left\{ E[x_i - \mu] - \beta \sum_{j \in \eta_i} E[x_j - \mu] \right\} = 0
+(40)
+
+Therefore, the second term of the numerator of Equation (39) vanishes and the final expression for the
+asymptotic variance of the inverse temperature parameter is given as the ratio between Φβ and Ψβ²:
+
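+\upsilon_\beta = \frac{1}{ \left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \right]^2 } \Bigg[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left( \sigma^2 \sigma_{jk} + 2 \sigma_{ij} \sigma_{ik} \right)
+  - 2\beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \left( \sigma_{ij} \sigma_{kl} + \sigma_{ik} \sigma_{jl} + \sigma_{il} \sigma_{jk} \right)
+  + \beta^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \sum_{m \in \eta_i} \left( \sigma_{jk} \sigma_{lm} + \sigma_{jl} \sigma_{km} + \sigma_{jm} \sigma_{kl} \right) \Bigg]
+(41)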
+
+This derivation leads us to another definition concerning an isotropic pairwise GMRF.
+
+Definition 9. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi. Assuming that X^(t) = {x_1^(t), x_2^(t), . . . , x_n^(t)} denotes the global configuration of the system at time t,
+and ⃗ρ and Σ_p^− are defined as in Equations (25) and (24), the asymptotic variance of the maximum pseudo-likelihood
+estimator of the inverse temperature parameter, β, is given by (using the same matrix-vector notation from the
+previous sections):
+
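+\upsilon_\beta = \frac{ \sigma^2 \left\| \Sigma_p^- \right\|_+ + 2 \left\| \vec{\rho} \otimes \vec{\rho}^T \right\|_+ - 6\beta \left\| \vec{\rho}^T \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ }{ \left\| \Sigma_p^- \right\|_+^2 }
+= \frac{\sigma^2}{ \left\| \Sigma_p^- \right\|_+ } + \frac{ \sigma^4 \Delta_\beta\left( \vec{\rho}, \Sigma_p^- \right) }{ \left\| \Sigma_p^- \right\|_+^2 } = \frac{1}{\Psi_\beta} + \frac{1}{\Psi_\beta^2} \left( \Phi_\beta - \Psi_\beta \right)
+(42)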
+
+
+Note that when information equilibrium prevails, that is Φβ = Ψβ, the asymptotic variance is
+given by the inverse of the expected Fisher information. More generally, this equation
+indicates that the uncertainty in the estimation of the inverse temperature parameter is minimized when
+Ψβ is maximized. Essentially, this means that on average, the local pseudo-likelihood functions are not
+flat, that is, small changes in the local configuration patterns along the system cannot cause abrupt
+changes in the expected global behavior (the global spatial dependence structure is not susceptible to
+sharp changes). To reach this condition, there must be a reasonable degree of agreement between the
+neighboring elements throughout the system, a behavior that is usually associated with low temperature
+states (β is above a critical value and there is a visible induced spatial dependence structure).
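
The closed form in Equation (42) can be evaluated directly once Φβ and Ψβ are known. The following is a minimal Python sketch of that arithmetic; the function name and the numeric inputs are illustrative assumptions, not values taken from the paper.

```python
def asymptotic_variance(phi: float, psi: float) -> float:
    """Asymptotic variance of the MPL estimator of beta, as in Equation (42):
    v = 1/psi + (phi - psi)/psi**2, which is algebraically phi/psi**2."""
    if psi <= 0:
        raise ValueError("psi must be positive")
    return 1.0 / psi + (phi - psi) / psi ** 2


# Information equilibrium (phi == psi): the variance reduces to 1/psi.
v_eq = asymptotic_variance(8.0, 8.0)

# Hypothetical inputs: for a fixed phi, a larger psi gives a smaller variance.
v_a = asymptotic_variance(12.0, 20.0)
v_b = asymptotic_variance(12.0, 60.0)
```

The second form, Φβ/Ψβ², makes the dependence on Ψβ explicit: the variance shrinks quadratically as Ψβ grows.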
+
+6. The Fisher Curve of a System
+
+With the definition of Φβ, Ψβ and Hβ, we have the necessary tools to compute three important
+information-theoretic measures of a global configuration of the system. Our idea is that we can study
+the behavior of a complex system by constructing a parametric curve in this information-theoretic
+space as a function of the inverse temperature parameter, β. Our expectation is that the resulting
+trajectory provides a geometrical interpretation of how the system moves from an initial configuration,
+A (with a low entropy value for instance), to a desired final configuration, B (with a greater value
+of entropy, for instance), since the Fisher information plays an important role in providing a natural
+metric to the Riemannian manifolds of statistical models [18,19]. We will call the path from global State
+A to global State B the Fisher curve (from A to B) of the system, denoted by ⃗F_A^B(β). Instead of using
+time as the parameter to build the curve, ⃗F, we parametrize ⃗F by the inverse temperature parameter, β.
+
+Definition 10. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
+system, ηi, and X(β1), X(β2), . . . , X(βn) be a sequence of outcomes (global configurations) produced by different
+values of βi (inverse temperature parameters) for which A = βMIN = β1 < β2 < · · · < βn = βMAX = B.
+The system’s Fisher curve from A to B is defined as the function ⃗F : ℜ → ℜ^3 that maps each configuration,
+X(βi), to a point (Φβi, Ψβi, Hβi) from the information space, that is:
+
+\vec{F}_A^B(\beta) = \left( \Phi_\beta, \Psi_\beta, H_\beta \right), \qquad \beta = A, \ldots, B
+(43)
+
+where Φβ, Ψβ and Hβ denote the Type I expected Fisher information, the Type II expected Fisher
+information and the Shannon entropy of the global configuration, X(β), defined by:
+
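+\Phi_\beta = \frac{1}{\sigma^4} \left[ \sigma^2 \left\| \Sigma_p^- \right\|_+ + 2 \left\| \vec{\rho} \otimes \vec{\rho}^T \right\|_+ - 6\beta \left\| \vec{\rho}^T \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \right]
+(44)
+
+\Psi_\beta = \frac{1}{\sigma^2} \left\| \Sigma_p^- \right\|_+
+(45)
+
+H_\beta = \frac{1}{2} \left[ \log\left( 2\pi\sigma^2 \right) + 1 \right] - \left[ \frac{\beta}{\sigma^2} \left\| \vec{\rho} \right\|_+ - \frac{\beta^2}{2} \Psi_\beta \right]
+(46)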
+
+In the next sections, we show some computational experiments that illustrate the effectiveness of
+the proposed tools in measuring the information encoded in complex systems. We want to investigate
+what happens to the Fisher curve as the inverse temperature parameter is modified in order to control
+the system’s global behavior. Our main conclusion, which is supported by experimental analysis, is
+that ⃗F_A^B(β) ≠ ⃗F_B^A(β). In other words, in terms of information, moving towards higher entropy states
+is not the same as moving towards lower entropy states, since the Fisher curves that represent the
+trajectory between the initial State A and the final State B are significantly different.
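
As a rough illustration of Equations (44) to (46), the sketch below computes one point of the Fisher curve in Python. It assumes that ‖·‖+ denotes the sum of all entries (which is what the comparison of Equations (41) and (42) suggests), so that the norm of a Kronecker product factors into a product of sums. The vector ρ and the matrix Σ_p^− come from Equations (25) and (24), which are outside this section, so here they are simply passed in by the caller; the function name and the example inputs are hypothetical.

```python
import math


def fisher_point(rho, sigma_p, sigma2, beta):
    """Return (phi, psi, h) for one value of beta, following Equations (44)-(46).

    rho     : list of covariances between the central site and its neighbors
    sigma_p : list of lists, covariance matrix of the neighbors
    Assumption: ||A (x) B||_+ = ||A||_+ * ||B||_+ with ||.||_+ the sum of entries.
    """
    s_rho = sum(rho)
    s_sig = sum(sum(row) for row in sigma_p)
    phi = (sigma2 * s_sig
           + 2.0 * s_rho ** 2
           - 6.0 * beta * s_rho * s_sig
           + 3.0 * beta ** 2 * s_sig ** 2) / sigma2 ** 2
    psi = s_sig / sigma2
    h = 0.5 * (math.log(2.0 * math.pi * sigma2) + 1.0) \
        - (beta / sigma2 * s_rho - beta ** 2 / 2.0 * psi)
    return phi, psi, h


# Hypothetical input: eight uncorrelated neighbors, each with variance 5, and beta = 0.
sigma2 = 5.0
rho = [0.0] * 8
sigma_p = [[sigma2 if r == c else 0.0 for c in range(8)] for r in range(8)]
phi0, psi0, h0 = fisher_point(rho, sigma_p, sigma2, beta=0.0)
```

With these inputs, Φβ and Ψβ both equal 8 and Hβ reduces to the entropy of a single Gaussian variable with variance σ² = 5, which is the information equilibrium case. Evaluating the function over an increasing sequence of β values (with ρ and Σ_p^− re-estimated from each configuration) would trace the curve.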
+
+
+7. Computational Simulations
+
+This section discusses some numerical experiments proposed to illustrate some applications of
+the derived tools in both simulations and real data. Our computational investigations were divided
+into two main sets of experiments:
+
+(1) Local analysis: analysis of the local and global versions of the measures (φβ, ψβ, Φβ, Ψβ and Hβ),
+considering a fixed inverse temperature parameter;
+(2) Global analysis: analysis of the global versions of the measures (Φβ, Ψβ and Hβ) along Markov
+chain Monte Carlo (MCMC) simulations in which the inverse temperature parameter is modified
+to control the expected global behavior.
+
+7.1. Learning from Spatial Data with Local Information-Theoretic Measures
+
+First, in order to illustrate a simple application of both forms of local observed Fisher
+information, φβ and ψβ, we performed an experiment using some synthetic images generated by
+the Metropolis–Hastings algorithm. The basic idea of this simulation process is to start at an initial
+configuration in which temperature is infinite (or β = 0). This basic initial condition is randomly
+chosen, and after a fixed number of steps, the algorithm produces a configuration that is considered to
+be a valid outcome of an isotropic pairwise GMRF model. Figure 1 shows an example of the initial
+condition and the resulting system configuration after 1,000 iterations considering a second order
+neighborhood system (eight nearest neighbors). The model parameters were chosen as: μ = 0, σ2 = 5
+and β = 0.8.
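
The simulation procedure described above can be sketched in Python as follows. This is not the author's implementation: it assumes that each site has the local conditional density N(μ + β Σ_j (x_j − μ), σ²) over its eight nearest neighbors, uses a symmetric random-walk proposal, and wraps the lattice around at the borders. The function names, the lattice size, the proposal step and β = 0.1 are all illustrative choices.

```python
import math
import random


def neighbor_sum(x, i, j, mu):
    """Sum of (x_j - mu) over the eight nearest neighbors of site (i, j).

    Toroidal (wrap-around) boundaries are an assumption of this sketch.
    """
    n, m = len(x), len(x[0])
    total = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di != 0 or dj != 0:
                total += x[(i + di) % n][(j + dj) % m] - mu
    return total


def log_local_density(xi, s, mu, sigma2, beta):
    """Log of the assumed local conditional density, up to an additive constant."""
    return -((xi - mu - beta * s) ** 2) / (2.0 * sigma2)


def metropolis_sweep(x, mu, sigma2, beta, rng, step=1.0):
    """Visit every site once; return the number of accepted proposals."""
    accepted = 0
    for i in range(len(x)):
        for j in range(len(x[0])):
            s = neighbor_sum(x, i, j, mu)
            proposal = x[i][j] + rng.gauss(0.0, step)
            log_ratio = (log_local_density(proposal, s, mu, sigma2, beta)
                         - log_local_density(x[i][j], s, mu, sigma2, beta))
            # Symmetric proposal, so the acceptance test needs only the density ratio.
            if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
                x[i][j] = proposal
                accepted += 1
    return accepted


rng = random.Random(0)
mu, sigma2, beta = 0.0, 5.0, 0.1  # beta = 0.1 is an illustrative value
# Random initial condition, i.e. the infinite-temperature (beta = 0) case.
field = [[rng.gauss(mu, math.sqrt(sigma2)) for _ in range(16)] for _ in range(16)]
for _ in range(20):
    n_accepted = metropolis_sweep(field, mu, sigma2, beta, rng)
```

Sites are updated in place, so later sites in a sweep already see the new values of earlier ones.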
+
+
+Figure 1. Example of Gaussian Markov random field (GMRF) model outputs. The values of the inverse
+temperature parameter, β, in the left and right configurations are zero and 0.8, respectively.
+
+Three Fisher information maps were generated from both initial and resulting configurations.
+The first map was obtained by calculating the value, φβ(xi), for every point of the system, that is for
+i = 1, 2, . . . , n. Similarly, the second one was obtained by using ψβ(xi). The last information map was
+built by using the ratio between φβ(xi) and ψβ(xi), motivated by the fact that boundaries are often
+composed of patterns that are not expected to be “aligned” to the global behavior (and, therefore, show
+high values of φβ(xi)) and also are somewhat unstable (show low values of ψβ(xi)). We will call
+this measure, Lβ(xi) = φβ(xi)/ψβ(xi), the local L-information, since it is defined in terms of the first
+two derivatives of the logarithm of the local pseudo-likelihood function. Figure 2 shows the obtained
+information maps as images. Note that while φβ has a strong response for boundaries (the edges are
+light), ψβ has a weak one (so the edges are dark), evidence in favor of considering L-information in
+boundary detection procedures. Note also that in the initial condition, when the temperature is infinite,
+the informative patterns are almost uniformly spread all over the system, while the final configuration
+shows a more sparse representation in terms of information. Figure 3 shows the distribution of local
+L-information for both systems’ configurations depicted in Figure 1.
+
+Figure 2. Fisher information maps. The first row shows the information maps of the system when the
+temperature is infinite (β = 0). The second row shows the same maps when the temperature is low
+(β = 0.8). The first and second columns show information maps that were generated by computing
+φβ(xi) and ψβ(xi) for each observation in the lattice. The third column was produced by computing the
+local L-information, that is the ratio between both local information measures. In terms of information,
+low temperature configurations are more sparse, since most local patterns are uninformative, due to
+the strong alignment of the particles throughout the system, which is the expected global behavior for
+β above a certain critical value.
+
+
+Figure 3. Distribution of local L-information. When the temperature is infinite, the information
+is spread along the system. For low temperature configurations, the number of local patterns with
+zero information content significantly increases, that is the system is more sparse in terms of Fisher
+information.
+
+7.2. Analyzing Dynamical Systems with Global Information-Theoretic Measures
+
+In order to study the behavior of a complex system that evolves from an initial State A to
+another State B, we use the Metropolis–Hastings algorithm, an MCMC simulation method, to generate
+a sequence of valid isotropic pairwise GMRF model outcomes for different values of the inverse
+temperature parameter, β. This process is an attempt to perform a random walk on the state space of
+the system, that is, in the space of all possible global configurations in order to analyze the behavior of
+the proposed global measures: entropy and both forms of Fisher information. The main purpose of
+this experiment is to observe what happens to Φβ, Ψβ and Hβ when the system evolves from a random
+initial state to other global configurations. In other words, we want to investigate the Fisher curve of
+the system in order to characterize its behavior in terms of information. Basically, the idea is to use the
+Fisher curve as a kind of signature for the expected behavior of any system modeled by an isotropic
+pairwise GMRF, making it possible to gain insights into the understanding of large complex systems.
+
+
+To simulate a system in which we can control the inverse temperature parameter, we define an
+updating rule for β based on fixed increments. In summary, we start with a minimum value βMIN
+(when βMIN = 0, the temperature of the system is infinite). Then, the value of β at iteration t is
+defined as the value of β at iteration t − 1 plus a small increment (Δβ), until it reaches a pre-defined upper bound,
+βMAX. The process is then repeated with negative increments, −Δβ, until the inverse temperature
+reaches its minimum value, βMIN, again. This process continues for a fixed number of iterations, NMAX,
+during an MCMC simulation. As a result of this approach, a sequence of GMRF samples is produced.
+We use this sequence to calculate Φβ, Ψβ and Hβ and define the Fisher curve ⃗F, for β = βMIN, . . . , βMAX.
+Figure 4 shows some of the system’s configurations along an MCMC simulation. In this experiment, the
+parameters were defined as: βMIN = 0, Δβ = 0.001, βMAX = 0.15, NMAX = 1,000, μ = 0, σ2 = 5
+and ηi = {(i − 1, j − 1), (i − 1, j), (i − 1, j + 1), (i, j − 1), (i, j + 1), (i + 1, j − 1), (i + 1, j), (i + 1, j + 1)}.
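
The updating rule for β can be sketched in a few lines of Python. This is a minimal illustration under two assumptions that the text does not state: the schedule starts at βMIN at iteration 0, and the peak value βMAX is visited once per cycle. The function name is hypothetical.

```python
def beta_schedule(beta_min, beta_max, delta, n_max):
    """Triangular sweep: beta_min up to beta_max in steps of delta, then back down,
    repeated until n_max values have been produced."""
    steps = round((beta_max - beta_min) / delta)
    if steps < 1:
        raise ValueError("beta_max must exceed beta_min by at least one increment")
    period = 2 * steps
    schedule = []
    for t in range(n_max):
        k = t % period
        level = k if k <= steps else period - k
        # Multiplying an integer level by delta avoids accumulating rounding error.
        schedule.append(beta_min + level * delta)
    return schedule


# Parameter values reported for this experiment.
betas = beta_schedule(beta_min=0.0, beta_max=0.15, delta=0.001, n_max=1000)
```

With these parameters one full cycle (up and back down) takes 300 iterations, so 1,000 iterations cover a little more than three cycles.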
+
+Figure 4. Global configurations along a Markov chain Monte Carlo (MCMC) simulation. Evolution of
+the system as the inverse temperature parameter, β, is modified to control the expected global behavior.
+
+A plot of both forms of the expected Fisher information, Φβ and Ψβ, for each iteration of
+the MCMC simulation is shown in Figure 5.
+The graph produced by this experiment shows
+some interesting results. First of all, regarding upper and lower bounds on these measures, it is
+possible to note that when there is no induced spatial dependence structure (β ≈ 0), we have an
+information equilibrium condition (Φβ = Ψβ and the information equality holds). In this condition,
+the observations are practically independent in the sense that all local configuration patterns convey
+approximately the same amount of information. Thus, it is hard to find and separate the two categories
+of patterns we know: the informative and the non-informative ones. Since they all behave in a similar
+manner, there is no informative pattern to highlight. Moreover, in this information equilibrium
+situation, Ψβ reaches its lower bound (in this simulation, we observed that in the equilibrium
+Φβ ≈ Ψβ ≈ 8), indicating that this condition emerges when the system is most susceptible to a
+change in the expected global behavior, since the uncertainty about β is maximum at this moment. In
+other words, modification in the behavior of a small subset of local patterns may guide the system to a
+totally different stable configuration in the future.
+The results also show that the difference between Φβ and Ψβ is maximum when the system
+operates with large values of β, that is, when organization emerges and there is a strong dependence
+structure among the random variables (the global configuration shows clear visible clusters and
+boundaries between them). In such states, it is expected that the majority of patterns be aligned to the
+global behavior, which causes the appearance of few, but highly informative patterns: those connecting
+elements from different regions (boundaries). Besides that, the results suggest that it takes more time
+for the system to go from the information equilibrium state to organization than the opposite. We
+will see how this fact becomes clear by analyzing the Fisher curve along Markov chain Monte Carlo
+(MCMC) simulations. Finally, the results also suggest that both Φβ and Ψβ are bounded above by a
+value possibly related to the size of the neighborhood system.
+
+Figure 5. Evolution of Fisher information along an MCMC simulation. As the difference between Φβ
+and Ψβ is maximized (*), the uncertainty about the real inverse temperature parameter is minimized
+and the number of informative patterns increases. In the information equilibrium condition (**), it is
+hard to find informative patterns, since there is no induced spatial dependence structure.
+
+Figure 6 shows the real parameter values used to generate the GMRF outputs (blue line), the
+maximum pseudo-likelihood estimates used to calculate Φβ and Ψβ (red line) and also a plot of the
+asymptotic variances (uncertainty about the inverse temperature) along the entire MCMC simulation.
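
For reference, a maximum pseudo-likelihood estimate of β can be computed in closed form if one assumes the local conditional density N(μ + β s_i, σ²), with s_i = Σ_{j∈ηi} (x_j − μ). Setting the derivative of the log pseudo-likelihood to zero then gives β̂ = Σ_i (x_i − μ) s_i / Σ_i s_i². The Python sketch below implements that formula; the assumed conditional density, the wrap-around borders and the function names are assumptions of this illustration, since the estimator itself is defined outside this section.

```python
def neighbor_sum(x, i, j, mu):
    """Sum of (x_j - mu) over the eight nearest neighbors (wrap-around borders assumed)."""
    n, m = len(x), len(x[0])
    total = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di != 0 or dj != 0:
                total += x[(i + di) % n][(j + dj) % m] - mu
    return total


def mpl_beta(x, mu):
    """Closed-form maximizer of the log pseudo-likelihood with respect to beta."""
    num = 0.0
    den = 0.0
    for i in range(len(x)):
        for j in range(len(x[0])):
            s = neighbor_sum(x, i, j, mu)
            num += (x[i][j] - mu) * s
            den += s * s
    if den == 0.0:
        raise ValueError("all neighborhood sums are zero; beta is not identifiable")
    return num / den


# Sanity check on a constant field: every site equals the mean of its 8 neighbors,
# so the estimate is 1/8.
constant_field = [[3.0] * 4 for _ in range(4)]
beta_hat = mpl_beta(constant_field, mu=1.0)
```

Note that σ² cancels out of the estimate, so it does not appear as an argument.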
+
+Figure 6. Real and estimated inverse temperatures along the MCMC simulation. The system’s global
+behavior is controlled by the real inverse temperature parameter values (blue line), used to generate
+the GMRF outputs. The maximum pseudo-likelihood estimate is used to compute both Φβ and Ψβ.
+Note that the uncertainty about the inverse temperature increases as β → 0 and the system approaches
+the information equilibrium condition.
+
+We now proceed to the analysis of the Shannon entropy of the system along the simulation.
+Despite showing a behavior similar to Ψβ, the range of values for entropy is significantly smaller. In
+this simulation, we observed that 0 ≤ Hβ ≤ 4.5, 0 ≤ Φβ ≤ 18 and 8 ≤ Ψβ ≤ 61. An interesting point is
+that knowledge of Φβ and Ψβ allows us to infer the entropy of the system. For example, looking at
+Figures 5 and 7, we can see that Φβ and Ψβ start to diverge (t ≈ 80) a little earlier than the entropy
+in a GMRF model begins to grow (t ≈ 120). Therefore, in an isotropic pairwise GMRF model, if the
+system is close to the information equilibrium condition, then Hβ is low, since there is little variability
+in the observed configuration patterns. When the difference between Φβ and Ψβ is large, Hβ increases.
+
+Figure 7. Evolution of Shannon entropy along an MCMC simulation. Hβ starts to grow when the
+system leaves the equilibrium condition, where the entropy in the isotropic pairwise GMRF model is
+identical to the entropy of a simple Gaussian random variable (since β → 0).
+
+Another interesting global information-theoretic measure is L-information, from now on denoted
+by Lβ, since it conveys all the information about the likelihood function (in a GMRF model, only the
+first two derivatives of L(⃗θ; X(t)) are not null). Lβ is defined as the ratio between the two forms of
+expected Fisher information, Φβ and Ψβ. A nice property about this measure is that 0 ≤ Lβ ≤ 1. With
+this single measurement, it is possible to gain insights about the global system behavior. Figure 8 shows
+that a value close to one indicates a system approximating the information equilibrium condition, while
+a value close to zero indicates a system close to the maximum entropy condition (a stable configuration
+with boundaries and informative patterns).
+
+Figure 8. Evolution of L-information along an MCMC simulation. When Lβ approaches one, the
+system tends to the information equilibrium condition. For values close to zero, the system tends to the
+maximum entropy condition.
+
+To investigate the intrinsic non-linear connection between Φβ, Ψβ and Hβ in a complex system
+modeled by an isotropic pairwise GMRF model, we now analyze its Fisher curves. The first curve,
+which is a planar one, is defined as ⃗F(β) = (Φβ, Ψβ), for A = βmin to B = βmax and shows how Fisher
+information changes when the inverse temperature of the system is modified to control the global
+behavior. Figure 9 shows the results. In the first image, the blue piece of the curve is the path from
+A to B, that is, ⃗F(β)_A^B, and the red piece is the inverse path (from B to A), that is, ⃗F(β)_B^A. We must
+emphasize that ⃗F(β)_A^B is the trajectory from a lower entropy global configuration to a higher entropy
+global configuration. On the other hand, when the system moves from B to A, we are moving towards
+entropy minimization. To make this clear, the second image of Figure 9 illustrates the same Fisher
+curve as before, but now in three dimensions, that is, ⃗F(β) = (Φβ, Ψβ, Hβ). For comparison purposes,
+Figure 10 shows the Fisher curves for another MCMC simulation with different parameter settings.
+Note that the shapes of the curves are quite similar to those in Figure 9.
+
+[Figure 9 panels: "2D Fisher curve for a GMRF model" (axes PHI and PSI; points A and B and the equilibrium line are marked) and "3D Fisher curve for a GMRF model" (axes PHI, PSI and H; points A and B are marked).]
+
+Figure 9. 2D and 3D Fisher curves of a complex system along an MCMC simulation. The graph shows
+a parametric curve obtained by varying the β parameter from βMIN to βMAX and back. Note that, from
+a differential geometry perspective, as the divergence between Φβ and Ψβ increases, the torsion of the
+parametric curve becomes evident (the curve leaves the plane of constant entropy).
+
+[Figure 10 panels: "2D Fisher curve for an isotropic pairwise GMRF model" (axes PHI and PSI; points A and B and the equilibrium line are marked) and "3D Fisher curve for an isotropic pairwise GMRF model" (axes PHI, PSI and H; points A and B are marked).]
+
+Figure 10. 2D and 3D Fisher curves along another MCMC simulation. The graph shows a parametric
+curve obtained by varying the β parameter from βMIN to βMAX and back. Note that, from a geometrical
+perspective, the properties of these curves are essentially the same as the ones from the previous
+simulation.
+
+We can see that the majority of points along the Fisher curve are concentrated around two regions
+of high curvature: (A) around the information equilibrium condition (an absence of short-term and
+long-term correlations, since β = 0); and (B) around the maximum entropy value, where the divergence
+between the information values is maximal (self-organization emerges, since β is greater than a
+critical value, βc). The points that lie in the middle of the path connecting these two regions represent
+the system undergoing a phase transition. Its properties change rapidly and in an asymmetric way,
+since ⃗F(β)_A^B ≠ ⃗F(β)_B^A for a given natural orientation.
+By now, some observations can be highlighted. First, the natural orientation of the Fisher curve
+defines the direction of time. The natural A–B path (increase in entropy) is given by the blue curve and
+the natural B–A path (decrease in entropy) is given by the red curve. In other words, the only possible
+way to walk from A to B (increase Hβ) by the red path or to walk from B to A (decrease Hβ) by the
+blue path would be moving back in time (by running the recorded simulation backwards).
+We believe that a possible explanation for this fact could be that the deformation process that takes the
+original geometric structure (with constant curvature) of the usual Gaussian model (A) to the novel
+geometric structure of the isotropic pairwise GMRF model (B) is not reversible. In other words, the
+way the model is "curved" is not simply the reversal of the "flattening" process (when it is restored to its
+constant curvature form). Thus, even the basic notion of time seems to be deeply connected with the
+relationship between entropy and Fisher information in a complex system: in the natural orientation
+(forward in time), it seems that the divergence between Φβ and Ψβ is the cause of an increase in
+the entropy, and the decrease of entropy is the cause of the convergence of Φβ and Ψβ. During the
+experimental analysis, we repeated the MCMC simulations with different parameter settings, and
+the observed behavior for Fisher information and entropy was the same. Figure 11 shows the graphs
+of Φβ, Ψβ and Hβ for another recorded MCMC simulation. The results indicate that in the natural
+orientation (in the direction of time), an increase in Ψβ seems to be a trigger for an increase in the
+entropy and a decrease in the entropy seems to be a trigger for a decrease in Ψβ. Roughly speaking,
+Ψβ “pushes Hβ up” and Hβ “pushes Ψβ down”.
+
+Figure 11. Relations between entropy and Fisher information. When a system modeled by an isotropic
+pairwise GMRF evolves in the natural orientation (forward in time), two rules that relate Fisher
+information and entropy can be observed: (1) an increase in Ψβ is the cause of an increase in Hβ (the
+increase in Hβ is a consequence of the increase in Ψβ); (2) a decrease in Hβ is the cause of a decrease in
+Ψβ (the decrease in Ψβ is a consequence of the decrease in Hβ). In other words, when moving towards
+higher entropy states, changes in Fisher information precede changes in entropy (Ψβ “pushes Hβ
+up”). When moving towards lower entropy states, changes in entropy precede changes in Fisher
+information (Hβ “pushes Ψβ down”).
+
+In summary, the central idea discussed here is that while entropy provides a measure
+of order/disorder of the system at a given configuration, X(t), Fisher information links these
+thermodynamical states through a path (Fisher curve). Thus, Fisher information is a powerful
+mathematical tool in the study of complex and dynamical systems, since it establishes how these
+different thermodynamical states are related along the evolution of the inverse temperature. Instead of
+knowing whether the entropy, Hβ, is increasing or decreasing, with Fisher information, it is possible to
+know how and why this change is happening.
+
+7.2.1. The Effect of Induced Perturbations in the System
+
+To test whether a system can recover part of its original configuration after a perturbation is
+induced, we conducted another computational experiment. During a stable simulation, two kinds of
+perturbations were induced in the system: (1) the value of the inverse temperature parameter was set
+to zero for the next consecutive two iterations; (2) the value of the inverse temperature parameter was
+set to the equilibrium value, β∗ (the solution of Equation 28), for the next consecutive two iterations.
+We should mention that in both cases, the original value of β is recovered after these two iterations
+are completed.
+When the system is disturbed by setting β to zero, the simulations indicate that the system is
+not successful in recovering components from its previous stable configuration (note that Φβ and Ψβ
+clearly touch one another in the graph). When the same perturbation is induced, but using the smallest
+of the two β∗ values (minimum solution of Equation 28), after a short period of turbulence, the system
+can recover parts (components, clusters) of its previous stable state. This behavior suggests that this
+softer perturbation is not enough to remove all the information encoded within the spatial dependence
+structure of the system, preserving some of the long-term correlations in the data (stronger bonds) and slightly
+remodeling the large clusters present in the system. Figures 12 and 13 illustrate the results.
+
+7.3. Considerations and Final Remarks
+
+The goal of this section is to summarize the main results obtained in this paper, focusing on the
+interpretation of the Fisher curve of a system modeled by a GMRF. First, our system is initialized with a
+random configuration, simulating that in the moment of its creation, the temperature is infinite (β = 0).
+We observe two important things at this moment: (1) there is a perfect symmetry in information, since
+the equilibrium condition prevails, that is, Φβ = Ψβ; (2) the entropy of the system is minimal. By a
+mere convention, we name this initial state of minimal entropy, A.
+By reducing the global temperature (β increases), this “universe” deviates from this initial
+condition. As the system drifts away from the initial condition, we clearly see a break in the
+symmetry of information (Φβ diverges from Ψβ), which apparently is the cause for an increase in the
+system’s entropy, since this symmetry break seems to precede an increase in the entropy, H. This is a
+fundamental symmetry break, since other forms of ruptures that will happen in the future and will give
+rise to several properties of the system, including the basic notion of time as an irreversible process,
+follow from this first one. During this first stage of evolution, the system evolves to the condition of
+maximum entropy, named B.
+Hence, after the break in the information equilibrium condition, there is a significant increase in
+the entropy as the system continues to evolve. This stage lasts while the temperature of the system is
+further reduced or kept established. When the temperature starts to increase (β decreases), another
+form of symmetry break takes place. By moving towards the initial condition (A) from B, changes in
+the entropy seem to precede changes in Fisher information (when moving from A to B, we observe
+exactly the opposite). Moreover, the variations in entropy and Fisher information towards A are
+not symmetric with the variations observed when we move towards B, a direct consequence of that
+first fundamental break of the information equilibrium condition. By continuing this process of
+increasing the temperature of the system until infinity (β is approaching zero), we take our system to a
+configuration that is equivalent to the initial condition, that is, where information equilibrium prevails.
+This fundamental symmetry break becomes evident when we look at the Fisher curve of the
+system. We clearly see that the path from the state of minimum entropy, A, to the state of maximum
+entropy, B, defined by the curve, ⃗F_A^B (the blue trajectory in Figure 9), is not the same as the path from B
+to A, defined by the curve, ⃗F_B^A (the red trajectory in Figure 9). An implication of this behavior is that if
+the system is moving along the arrow of time, then we are moving through the Fisher curve in the
+clockwise orientation. Thus, the only way to go from A to B along ⃗F_B^A (the red path) is going back in
+time.
+Therefore, if that first fundamental symmetry break did not exist, or even if it had happened, but
+all the posterior evolution of Φβ, Ψβ and Hβ were absolutely symmetric (i.e., the variations in these
+measures were exactly the same when moving from A to B and when moving from B to A), what we
+
+Figure 12. Disturbing the system to induce changes. Variation in Φβ and Ψβ after the system is
+disturbed by an abrupt change in the value of β. In the first image, the inverse temperature is set
+to zero. Note that Φβ and Ψβ touch one another, indicating that no residual information is kept, as
+if the simulation had been restarted from a random configuration. In the second image, the inverse
+temperature is set to the equilibrium value, β∗. The results suggest that this kind of perturbation is not
+enough to remove all the information within the spatial dependence structure, allowing the system to
+recover a significant part of its original configuration after a short stabilization period.
+
+Figure 13. The sequence of outputs along the MCMC simulation before and after the system is
+disturbed. The first row (when β is set to zero) shows that the system evolved to a different stable
+configuration after the perturbation. The second row (when β is set to β∗) indicates that the system
+was able to recover a significant part of its previous stable configuration.
+
+would actually see is that $\vec{F}_A^B = \vec{F}_B^A$. As a consequence, decreasing/increasing the system’s temperature
+would be like moving towards the future/past. In fact, the basic notion of time in that system would
+be compromised, since time would be a perfectly reversible process (just like a spatial dimension,
+in which we can move in both directions). In other words, we would not be able to distinguish whether the
+system is moving forward or backward in time.
+
+8. Conclusions
+
+The definition of what is information in a complex system is a fundamental concept in
+the study of many problems. In this paper, we discussed the roles of two important statistical
+measures in isotropic pairwise Markov random fields composed of Gaussian variables: Shannon
+entropy and Fisher information. By using the pseudo-likelihood function of the GMRF model, we
+derived analytical expressions for these measures. The definition of a Fisher curve as a geometric
+representation for the study and analysis of complex systems allowed us to reveal the intrinsic
+non-linear relation between these information-theoretic measures and gain insights about the behavior
+of such systems. Computational experiments demonstrate the effectiveness of the proposed tools
+in decoding information from the underlying spatial dependence structure of a Gaussian-Markov
+random field. Typical informative patterns in a complex system are located at the boundaries of
+the clusters. One of the main conclusions of this scientific investigation concerns the notion of time
+in a complex system. The obtained results suggest that the relationship between Fisher information
+and entropy determines whether the system is moving forward or backward in time. Apparently,
+in the natural orientation (when the system is evolving forward in time), when β is growing, that
+is, the temperature of the system is decreasing, an increase in Fisher information leads to an increase
+in the system’s entropy, and when β is decreasing, that is, the temperature of the system is increasing,
+a decrease in the system’s entropy leads to a decrease in the Fisher information. In future work,
+we expect to completely characterize the metric tensor that represents the geometric structure of the
+isotropic pairwise GMRF model by specifying all the elements of the Fisher information matrix. Future
+investigations should also include the definition and analysis of the proposed tools in other Markov
+random field models, such as the Ising and Potts pairwise interaction models. In addition, a topic of
+interest concerns the investigation of minimum and maximum information paths in graphs to explore
+intrinsic similarity measures between objects belonging to a common surface or manifold in $\mathbb{R}^n$. We
+believe this study could bring benefits to some pattern recognition and data analysis computational
+applications.
+
+Acknowledgments: The author would like to thank CNPQ (Brazilian Council for Research and Development) for
+the financial support through research grant number 475054/2011-3.
+
+Conflicts of Interest: The author declares no conflict of interest.
+
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+entropy
+
+Article
+Network Decomposition and Complexity Measures:
+An Information Geometrical Approach
+
+Masatoshi Funabashi
+
+Sony Computer Science Laboratories, inc. Takanawa muse bldg. 3F, 3-14-13, Higashi Gotanda, Shinagawa-ku,
+Tokyo 141-0022, Japan; E-Mail: masa_funabashi@csl.sony.co.jp; Tel.: +81-3-5448-4380; Fax: +81-3-5448-4273
+
+Received: 28 March 2014; in revised form: 24 June 2014 / Accepted: 14 July 2014 /
+Published: 23 July 2014
+
+Abstract: We consider the graph representation of a stochastic model with n binary variables, and
+develop an information-theoretical framework to measure the degree of statistical association existing
+between subsystems, as well as that represented by each edge of the graph. In addition,
+we consider novel measures of complexity with respect to the system decompositionability, by
+introducing the geometric product of Kullback–Leibler (KL-) divergences. The novel complexity
+measures satisfy the boundary condition of vanishing in the limits of completely random and completely
+ordered states, and also in the presence of an independent subsystem of any size. Such complexity measures
+based on geometric means are relevant to the heterogeneity of dependencies between subsystems,
+and to the amount of information propagation shared across the entire system.
+
+Keywords: information geometry; complexity measure; complex network; system decompositionability;
+geometric mean
+
+1. Introduction
+
+Complex systems sciences emphasize the importance of non-linear interactions that cannot be
+easily approximated linearly. In other words, the degrees of non-linear interactions are the source of
+complexity. The classical reductionist approach generally decomposes a system into its components
+with linear interactions, and tries to evaluate whether the properties of the whole system can still
+be reproduced. If this decomposition destroys too much information to reproduce the
+properties of the whole system, the plausibility of such reductionism is lost. Conversely, if we can evaluate
+how much information is ignored by the decomposition, we can estimate how much of the complexity of the
+whole system is lost. This gives us a way to measure the complexity of a system with respect to the
+system decomposition.
+In stochastic systems described as a set of joint distributions, the interaction can basically be
+expressed as the statistical association between the variables. The simplest reductionist approach is to
+separate the whole system into subsets of variables, and assume independence between them.
+If such a decomposition does not affect the system’s properties, the isolated subsystem is independent
+of the rest. On the other hand, if the decomposition loses too much information, then the subsystem
+lies inside a larger subsystem with strong internal dependencies and cannot be easily separated.
+Stochastic models have often been represented with graphs, and
+treated under the name of complex networks [1–3]. Generally, the nodes represent the variables and
+the weights on the edges are the statistical association between them. However, if we consider the
+information contained in the different orders of dependencies among variables, a graph with a single
+kind of edge is not sufficient to express the whole information of the system [4]. An edge of a graph
+with n nodes contains the information of statistical association up to the n-th order dependencies
+among n variables. If we try to decompose the system independently by cutting this information, we
+have to consider what it means to cut an edge of the graph from the information-theoretical point
+of view.
+
+Entropy 2014, 16, 4132–4167; doi:10.3390/e16074132
+
+Indeed, analysis of the degree of dependency between variables has produced many definitions
+of complexity in stochastic models [5], mostly studied from an information-theoretical
+perspective. Beginning with the seminal works of Lempel and Ziv (e.g., [6]), the computation-oriented definition
+of complexity takes a deterministic formalization and measures the information necessary to reproduce
+a given symbolic sequence exactly, which is classified as algorithmic complexity [7–9].
+On the other hand, the statistical approach to complexity, namely statistical complexity, assumes
+some stochastic model as its theoretical basis, and refers to the structure of the information source on it in a
+measure-theoretic way [10–12].
+One of the most classical statistical complexities is the mutual information between two stochastic
+variables; its generalized form, which measures dependence between n variables, has been proposed (e.g., [13])
+and explored in relation to statistical models and theories by several authors [14–16].
+We should also recall that complexity is not necessarily conditioned only by information theory,
+but is also motivated by the organization of living systems, such as brain activity. The TSE complexity
+further extends generalized mutual information into a biological context, where complexity
+exists as the heterogeneity between different system hierarchies [17]. These statistical complexities
+are all based on the boundary condition of vanishing in the limits of completely random and completely
+ordered states [18].
+A complexity measure is usually a projection from the system’s variables to a one-dimensional
+quantity, composed to express the degree of the characteristic that we define to be important in
+what “complexity” means. Since a complexity measure is always a many-to-one association, it both
+compresses information to classify systems from simple to complex and loses resolution
+of the system’s phase space. If the system has n variables, we generally need n independent complexity
+measures to completely characterize the system with real-value resolution. The problem of
+defining a complexity measure lies in balancing the compression of information on the
+system’s complexity, with theoretical support, against keeping the resolution of system identification
+high enough to avoid trivial classification. The latter criterion becomes more important as
+the system size becomes larger. A better complexity measure is therefore a set of indices, with as
+few members as possible, that characterizes the major features related to the complexity of the system.
+In this sense, the ensemble of complexity measures is also analogous to the feature space of a support
+vector machine. A non-trivial set of complexity measures needs to be mutually complementary in
+parameter space for the best possible discrimination of different systems.
+In this paper, we first consider a stochastic system with binary variables and theoretically
+develop a way to measure the information between subsystems that is consistent with the information
+represented by the edges of the graph representation.
+Next, we focus in particular on the generalized mutual information as a starting point of the argument,
+and further consider incorporating network heterogeneity into novel measures of complexity with
+respect to the system’s decompositionability. This approach will be shown to be complementary to
+TSE complexity as the difference between arithmetic and geometric means of information.
+
+2. System Decomposition
+
+Let us consider a stochastic system with $n$ binary variables $x = (x_1, \cdots, x_n)$, where
+$x_i \in \{0, 1\}$ $(1 \le i \le n)$. We denote the joint distribution of $x$ by $p(x)$. We define the decomposition $p^{\mathrm{dec}}(x)$
+of $p(x)$ into two subsystems $y_1 = (x^1_1, \cdots, x^1_{n_1})$ and $y_2 = (x^2_1, \cdots, x^2_{n_2})$
+($n_1 + n_2 = n$, $y_1 \cup y_2 = x$, $y_1 \cap y_2 = \phi$) as follows:
+
+\[
+p^{\mathrm{dec}}(x) = p(y_1)\,p(y_2), \tag{1}
+\]
+
+where $p(y_1)$ and $p(y_2)$ are the joint distributions of $y_1$ and $y_2$, respectively. For simplicity, hereafter
+we denote the system decomposition using the smallest subscript of variables in each subsystem. For
+example, in case $n = 4$, $y_1 = (x_1, x_3)$ and $y_2 = (x_2, x_4)$, we describe the decomposed system $p^{\mathrm{dec}}(x)$
+as $\langle 1212 \rangle$. The system decomposition cuts all statistical association between the two
+subsystems, which is expressed by imposing independence between them.
+We will further consider Equation (1) in terms of the graph representation. We define
+the undirected graph Γ := (V, E) of the system p(x), whose vertices V = {x1, · · · , xn} and edges
+E = V × V represent the variables and the statistical association, respectively. To express the system,
+we set the value of each vertex as the value of the corresponding variable, and the weight of each edge
+as the degree of dependency between the connected variables.
+There is, however, a problem with a representation that uses a single kind of edge. The
+statistical association among variables is not only between two variables, but can be independently
+defined among multiple variables up to the n-th order. Therefore, the exact definition of the weight
+of the edges remains unclear. To clarify this problem, we consider the hierarchical marginal
+distributions $\boldsymbol{\eta}$ as another coordinate system of p(x) as follows:
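+
+\[
+\boldsymbol{\eta} = (\boldsymbol{\eta}^1; \boldsymbol{\eta}^2; \cdots; \boldsymbol{\eta}^n), \tag{2}
+\]
+
+where
+
+\[
+\begin{aligned}
+\boldsymbol{\eta}^1 &= (\eta_1, \cdots, \eta_i, \cdots, \eta_n), \quad (1 < i < n), \\
+\boldsymbol{\eta}^2 &= (\eta_{1,2}, \cdots, \eta_{i,j}, \cdots, \eta_{n-1,n}), \quad (1 < i < j < n), \\
+&\;\;\vdots \\
+\boldsymbol{\eta}^n &= \eta_{1,2,\cdots,n},
+\end{aligned}
+\tag{3}
+\]
+
+and
+
+\[
+\begin{aligned}
+\eta_1 &= \sum_{i_2,\cdots,i_n \in \{0,1\}} p(1, i_2, \cdots, i_n), \\
+&\;\;\vdots \\
+\eta_n &= \sum_{i_1,\cdots,i_{n-1} \in \{0,1\}} p(i_1, \cdots, i_{n-1}, 1), \\
+\eta_{1,2} &= \sum_{i_3,\cdots,i_n \in \{0,1\}} p(1, 1, i_3, \cdots, i_n), \\
+&\;\;\vdots \\
+\eta_{n-1,n} &= \sum_{i_1,\cdots,i_{n-2} \in \{0,1\}} p(i_1, \cdots, i_{n-2}, 1, 1), \\
+&\;\;\vdots \\
+\eta_{1,2,\cdots,n} &= p(1, 1, \cdots, 1).
+\end{aligned}
+\tag{4}
+\]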
+
+Since the definition of $\boldsymbol{\eta}$ is a linear transformation of $p(x)$, both coordinates have
+$\sum_{k=1}^{n} {}_nC_k$ degrees of freedom.
+The subcoordinates $\boldsymbol{\eta}^1$ are simply the set of marginal distributions of each variable. The
+subcoordinates $\boldsymbol{\eta}^k$ $(1 < k \le n)$ include the statistical association among $k$ variables that cannot
+be expressed with coordinates of order less than $k$. This means that different statistical
+associations exist independently in each order among the corresponding sets of variables. The
+statistical association represented by the weight of a graph edge $\{x_i, x_j\}$ is therefore the superposition
+of the different dependencies defined on every subset of $x$ including $x_i$ and $x_j$.
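+As an illustration (a worked case added here for concreteness, not taken from the original text), for $n = 2$ the definitions in Equation (4) reduce to
+\[
+\eta_1 = p(1,0) + p(1,1), \qquad \eta_2 = p(0,1) + p(1,1), \qquad \eta_{1,2} = p(1,1),
+\]
+so that there are ${}_2C_1 + {}_2C_2 = 3$ coordinates, matching the three free parameters of a joint distribution over two binary variables. Conversely, $p(1,0) = \eta_1 - \eta_{1,2}$, $p(0,1) = \eta_2 - \eta_{1,2}$ and $p(0,0) = 1 - \eta_1 - \eta_2 + \eta_{1,2}$.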
+To measure the degree of statistical association in each order, information geometry established
+the following setting [19]. We first define other coordinates $\boldsymbol{\theta} = (\boldsymbol{\theta}^1; \boldsymbol{\theta}^2; \cdots; \boldsymbol{\theta}^n)$ that are the dual
+coordinates of $\boldsymbol{\eta}$ with respect to the Legendre transformation of the exponential family’s potential
+function $\psi(\boldsymbol{\theta})$ to its conjugate potential $\phi(\boldsymbol{\eta})$ as follows:
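+
+\[
+\begin{aligned}
+\boldsymbol{\theta}^1 &= (\theta_1, \cdots, \theta_n), \\
+\boldsymbol{\theta}^2 &= (\theta_{1,2}, \cdots, \theta_{n-1,n}), \\
+&\;\;\vdots \\
+\boldsymbol{\theta}^n &= \theta_{1,2,\cdots,n},
+\end{aligned}
+\tag{5}
+\]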
+
+where
+
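+\[
+\begin{aligned}
+\psi(\boldsymbol{\theta}) &= \log \frac{1}{p(0, \cdots, 0)}, \\
+\phi(\boldsymbol{\eta}) &= \sum_i \theta_i \eta_i + \sum_{i<j} \theta_{i,j} \eta_{i,j} + \cdots + \theta_{1,2,\cdots,n}\,\eta_{1,2,\cdots,n} - \psi(\boldsymbol{\theta}), \\
+\theta_i &= \frac{\partial \phi(\boldsymbol{\eta})}{\partial \eta_i}, \quad (1 \le i \le n), \\
+\theta_{i,j} &= \frac{\partial \phi(\boldsymbol{\eta})}{\partial \eta_{i,j}}, \quad (1 \le i < j \le n), \\
+&\;\;\vdots \\
+\theta_{1,2,\cdots,n} &= \frac{\partial \phi(\boldsymbol{\eta})}{\partial \eta_{1,2,\cdots,n}}.
+\end{aligned}
+\tag{6}
+\]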
+
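+Note that $\boldsymbol{\eta}$ can be inversely derived from $\boldsymbol{\theta}$, following the Legendre transformation between $\phi(\boldsymbol{\eta})$ and
+$\psi(\boldsymbol{\theta})$:
+
+\[
+\begin{aligned}
+\eta_i &= \frac{\partial \psi(\boldsymbol{\theta})}{\partial \theta_i}, \quad (1 \le i \le n), \\
+\eta_{i,j} &= \frac{\partial \psi(\boldsymbol{\theta})}{\partial \theta_{i,j}}, \quad (1 \le i < j \le n), \\
+&\;\;\vdots \\
+\eta_{1,2,\cdots,n} &= \frac{\partial \psi(\boldsymbol{\theta})}{\partial \theta_{1,2,\cdots,n}}.
+\end{aligned}
+\tag{7}
+\]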
+
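+Using the coordinates $\boldsymbol{\theta}$, the system is described in the form of the exponential family as follows:
+
+\[
+\log p(x) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{i,j} x_i x_j + \cdots + \theta_{1,2,\cdots,n}\, x_1 x_2 \cdots x_n - \psi(\boldsymbol{\theta}). \tag{8}
+\]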
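+
+For instance (an illustrative case added here, assuming $n = 2$ and strictly positive probabilities), writing $\log p(x) = \theta_1 x_1 + \theta_2 x_2 + \theta_{1,2} x_1 x_2 - \psi$ and evaluating it at the four states gives
+\[
+\theta_1 = \log\frac{p(1,0)}{p(0,0)}, \qquad \theta_2 = \log\frac{p(0,1)}{p(0,0)}, \qquad \theta_{1,2} = \log\frac{p(1,1)\,p(0,0)}{p(1,0)\,p(0,1)}, \qquad \psi = -\log p(0,0),
+\]
+so $\theta_{1,2}$ is the log odds ratio, which vanishes exactly when $x_1$ and $x_2$ are independent.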
+
+Information geometry revealed that the exponential family of probability distributions forms a
+manifold with a dual-flat structure. More precisely, the coordinates $\boldsymbol{\theta}$ form a flat manifold with respect
+to the Fisher information matrix as the Riemannian metric and the α-connection with α = 1. Dually to $\boldsymbol{\theta}$,
+the coordinates $\boldsymbol{\eta}$ are flat with respect to the same metric but the α-connection with α = −1. It is known that
+$\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ are orthogonal to each other with respect to the Fisher information matrix. This structure gives
+us a way to decompose the degree of statistical association among variables into separate elements of
+arbitrary orders. We define the so-called k-cut mixture coordinates $\boldsymbol{\zeta}^k$ as follows [14]:
+
+\begin{align}
+\boldsymbol{\zeta}^k &= (\boldsymbol{\eta}^{k-}; \boldsymbol{\theta}^{k+}), \tag{9} \\
+\boldsymbol{\eta}^{k-} &= (\boldsymbol{\eta}^1, \cdots, \boldsymbol{\eta}^k), \tag{10} \\
+\boldsymbol{\theta}^{k+} &= (\boldsymbol{\theta}^{k+1}, \cdots, \boldsymbol{\theta}^n). \tag{11}
+\end{align}
+
+We also define the k-cut mixture coordinates $\boldsymbol{\zeta}^k_0 = (\boldsymbol{\eta}^{k-}; 0, \cdots, 0)$ with no dependency above the
+k-th order. We denote the systems specified with $\boldsymbol{\zeta}^k$ and $\boldsymbol{\zeta}^k_0$ as $p(x, \boldsymbol{\zeta}^k)$ and $p(x, \boldsymbol{\zeta}^k_0)$, respectively.
+Then the degree of statistical association above the k-th order in the system can be
+measured by the Kullback-Leibler (KL-) divergence $D[p(x, \boldsymbol{\zeta}) : p(x, \boldsymbol{\zeta}^k_0)]$:
+
+\[
+2N \cdot D[p(x, \boldsymbol{\zeta}) : p(x, \boldsymbol{\zeta}^k_0)] \sim \chi^2\Big(\sum_{i=k+1}^{n} {}_nC_i\Big), \tag{12}
+\]
+
+where $D[\cdot : \cdot]$ is the KL-divergence from the first system to the second one.
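+As a quick check of the degrees of freedom in Equation (12) (an illustrative count, not part of the original text), take $n = 3$ and $k = 1$: the coordinates set to zero are the three second-order and the single third-order $\theta$ coordinates, so
+\[
+\sum_{i=2}^{3} {}_3C_i = {}_3C_2 + {}_3C_3 = 3 + 1 = 4 .
+\]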
+Here, the decomposition is performed according to the orders of statistical association, which does
+not spatially distinguish the vertices. If we define the weight of an edge $\{x_i, x_j\}$ with the KL-divergence,
+the above k-cut coordinates $\boldsymbol{\zeta}^k$ are not appropriate for measuring the information represented in each
+edge. We need to set up other mixture coordinates so as to separate only the information existing
+between $x_i$ and $x_j$, regardless of its order.
+Let us return to the definition of the system decomposition and consider it in the dual-flat
+coordinates $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$.
+
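+Proposition 1. The independence between the two decomposed systems $y_1 = (x^1_1, \cdots, x^1_{n_1})$ and
+$y_2 = (x^2_1, \cdots, x^2_{n_2})$ can be expressed on the new coordinates $\boldsymbol{\eta}^{\mathrm{dec}}$ as follows:
+
+\[
+\begin{aligned}
+\eta^{\mathrm{dec}}_{i} &= \eta_i, \quad (1 \le i \le n), \\
+\eta^{\mathrm{dec}}_{i,j} &=
+\begin{cases}
+\eta_{i,j}, & \text{if } \{x_i, x_j\} \subseteq y_1 \text{ or } \subseteq y_2, \\
+\eta_i \eta_j, & \text{else},
+\end{cases}
+\quad (1 \le i < j \le n), \\
+\eta^{\mathrm{dec}}_{i,j,k} &=
+\begin{cases}
+\eta_{i,j,k}, & \text{if } \{x_i, x_j, x_k\} \subseteq y_1 \text{ or } \subseteq y_2, \\
+\eta_{i,j}\,\eta_k, & \text{else if } \{x_i, x_j\} \subseteq y_1 \text{ or } \subseteq y_2, \\
+\eta_i\,\eta_{j,k}, & \text{else if } \{x_j, x_k\} \subseteq y_1 \text{ or } \subseteq y_2, \\
+\eta_j\,\eta_{i,k}, & \text{else (if } \{x_i, x_k\} \subseteq y_1 \text{ or } \subseteq y_2),
+\end{cases}
+\quad (1 \le i < j < k \le n), \\
+&\;\;\vdots \\
+\eta^{\mathrm{dec}}_{1,2,\cdots,n} &= \eta_{s[i,k_1,\cdots,k_{n_1-1}]}\,\eta_{s[j,l_1,\cdots,l_{n_2-1}]}, \quad (x_i \in y_1,\ x_j \in y_2),
+\end{aligned}
+\tag{13}
+\]
+
+where $s[\cdots]$ is the ascending sort of the internal sequence.
+Then the corresponding dual coordinates $\boldsymbol{\theta}^{\mathrm{dec}}$ take 0 elements as follows:
+
+\[
+\begin{aligned}
+\theta^{\mathrm{dec}}_{i,j} &= 0, \quad (1 \le i < j \le n), && \text{if } \{x_i, x_j\} \cap y_1 \neq \phi \text{ and } \{x_i, x_j\} \cap y_2 \neq \phi, \\
+\theta^{\mathrm{dec}}_{i,j,k} &= 0, \quad (1 \le i < j < k \le n), && \text{if } \{x_i, x_j, x_k\} \cap y_1 \neq \phi \text{ and } \{x_i, x_j, x_k\} \cap y_2 \neq \phi, \\
+&\;\;\vdots \\
+\theta^{\mathrm{dec}}_{1,2,\cdots,n} &= 0.
+\end{aligned}
+\tag{14}
+\]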
+
+Proof. For simplicity, we show the cases of n = 2 and n = 3 for the first node separation.
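+For $n = 2$, the above-defined $\boldsymbol{\eta}^{\mathrm{dec}}$ for the system decomposition $\langle 12 \rangle$ gives its dual coordinates
+$\boldsymbol{\theta}^{\mathrm{dec}}$ as follows:
+
+\[
+\begin{aligned}
+\theta^{\mathrm{dec}}_{1} &= \log \frac{\eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{1,2}}{1 - \eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{2} + \eta^{\mathrm{dec}}_{1,2}} = \log \frac{\eta_1}{1 - \eta_1}, \\
+\theta^{\mathrm{dec}}_{2} &= \log \frac{\eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{1,2}}{1 - \eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{2} + \eta^{\mathrm{dec}}_{1,2}} = \log \frac{\eta_2}{1 - \eta_2}, \\
+\theta^{\mathrm{dec}}_{1,2} &= \log \frac{\eta^{\mathrm{dec}}_{1,2}\,(1 - \eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{2} + \eta^{\mathrm{dec}}_{1,2})}{(\eta^{\mathrm{dec}}_{1} - \eta^{\mathrm{dec}}_{1,2})(\eta^{\mathrm{dec}}_{2} - \eta^{\mathrm{dec}}_{1,2})} = 0,
+\end{aligned}
+\tag{15}
+\]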
+
+which means the first and second node is independent.
+For n = 3, the above defined jdec for the system decomposition < 122 > give its dual coordinates
+`dec as follows:
+
+θdec
+1
+=
+log
+ηdec
+1
+− ηdec
+1,2 − ηdec
+1,3 + ηdec
+1,2,3
+
+1 − ηdec
+1
+− ηdec
+2
+− ηdec
+3
++ ηdec
+1,2 + ηdec
+1,3 + ηdec
+2,3 − ηdec
+1,2,3
+= log
+η1
+
+1 − η1
+,
+
+θdec
+2
+=
+log
+ηdec
+2
+− ηdec
+1,2 − ηdec
+1,3 + ηdec
+1,2,3
+
+1 − ηdec
+1
+− ηdec
+2
+− ηdec
+3
++ ηdec
+1,2 + ηdec
+1,3 + ηdec
+2,3 − ηdec
+1,2,3
+= log
+η2 − η2,3
+
+1 − η2 − η3 + η2,3
+,
+
+θdec
+3
+=
+log
+ηdec
+3
+− ηdec
+1,3 − ηdec
+2,3 + ηdec
+1,2,3
+
+1 − ηdec
+1
+− ηdec
+2
+− ηdec
+3
++ ηdec
+1,2 + ηdec
+1,3 + ηdec
+2,3 − ηdec
+1,2,3
+= log
+η3 − η2,3
+
+1 − η2 − η3 + η2,3
+,
+
+(16)
+
+θdec
+1,2
+=
+log
+(ηdec
+1,2 − ηdec
+1,2,3)(1 − ηdec
+1
+− ηdec
+2
+− ηdec
+3
++ ηdec
+1,2 + ηdec
+1,3 + ηdec
+2,3 − ηdec
+1,2,3)
+
+(ηdec
+1
+− ηdec
+1,2 − ηdec
+1,3 + ηdec
+1,2,3)(ηdec
+2
+− ηdec
+1,2 − ηdec
+2,3 + ηdec
+1,2,3)
+
+=
+0,
+
+θdec
+1,3
+=
+log
+(ηdec
+1,3 − ηdec
+1,2,3)(1 − ηdec
+1
+− ηdec
+2
+− ηdec
+3
++ ηdec
+1,2 + ηdec
+1,3 + ηdec
+2,3 − ηdec
+1,2,3)
+
+(ηdec
+1
+− ηdec
+1,2 − ηdec
+1,3 + ηdec
+1,2,3)(ηdec
+3
+− ηdec
+1,3 − ηdec
+2,3 + ηdec
+1,2,3)
+
+=
+0,
+
+θdec
+2,3
+=
+log
+(ηdec
+2,3 − ηdec
+1,2,3)(1 − ηdec
+1
+− ηdec
+2
+− ηdec
+3
++ ηdec
+1,2 + ηdec
+1,3 + ηdec
+2,3 − ηdec
+1,2,3)
+
+(ηdec
+2
+− ηdec
+1,2 − ηdec
+2,3 + ηdec
+1,2,3)(ηdec
+3
+− ηdec
+1,3 − ηdec
+2,3 + ηdec
+1,2,3)
+
+=
+log η2,3(1 − η2 − η3 + η2,3)
+
+(η2 − η2,3)(η3 − η2,3) ,
+
+(17)
+
+θdec
+1,2,3
+=
+log
+
+�
+ηdec
+1,2,3
+
+(ηdec
+1,2 − ηdec
+1,2,3)(ηdec
+1,3 − ηdec
+1,2,3)(ηdec
+2,3 − ηdec
+1,2,3)
+
+×
+(ηdec
+1
+− ηdec
+1,2 − ηdec
+1,3 + ηdec
+1,2,3)(ηdec
+2
+− ηdec
+1,2 − ηdec
+2,3 + ηdec
+1,2,3)(ηdec
+3
+− ηdec
+1,3 − ηdec
+2,3 + ηdec
+1,2,3)
+
+(1 − ηdec
+1
+− ηdec
+2
+− ηdec
+3
++ ηdec
+1,2 + ηdec
+1,3 + ηdec
+2,3 − ηdec
+1,2,3)
+
+�
+
+=
+0,
+
+(18)
+
+which means the first node is independent of the other nodes.
+The generalization is possible with the use of a recurrence formula between system sizes $n$ and
+$n + 1$, according to the symmetry of the model and the Legendre transformation between the $\boldsymbol{\eta}^{\mathrm{dec}}$ and $\boldsymbol{\theta}^{\mathrm{dec}}$
+coordinates.
+A numerical proof can be obtained by directly computing the 0 elements of $\boldsymbol{\theta}^{\mathrm{dec}}$ from $\boldsymbol{\eta}^{\mathrm{dec}}$.
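+The $n = 2$ case can also be verified by direct substitution (a short check added for clarity). With $\eta^{\mathrm{dec}}_{1,2} = \eta_1\eta_2$,
+\[
+\theta^{\mathrm{dec}}_{1,2} = \log\frac{\eta_1\eta_2\,(1-\eta_1)(1-\eta_2)}{\eta_1(1-\eta_2)\,\eta_2(1-\eta_1)} = \log 1 = 0,
+\]
+where we used $1 - \eta_1 - \eta_2 + \eta_1\eta_2 = (1-\eta_1)(1-\eta_2)$.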
+
+The definition of $\boldsymbol{\eta}^{\mathrm{dec}}$ decomposes the hierarchical marginal distributions $\boldsymbol{\eta}$ into the
+products of the subsystems’ marginal distributions whenever the subscripts traverse the two subsystems.
+Therefore, only the statistical associations between the two subsystems are set to be independent, while
+the internal dependencies of each subsystem remain unchanged. This is analytically equivalent to
+composing other mixture coordinates $\boldsymbol{\xi}$, namely the $\langle \cdots \rangle$-cut coordinates, with a proper description
+of the system decomposition by $\langle \cdots \rangle$. The coordinates $\boldsymbol{\xi}$ consist of the $\boldsymbol{\eta}$ coordinates whose subscripts
+do not traverse the decomposed subsystems, and the $\boldsymbol{\theta}$ coordinates whose subscripts traverse
+them.
+For simplicity, we only describe here the case $n = 4$ and the decomposition $\langle 1133 \rangle$ (the
+first and second nodes form one subsystem, and the third and fourth nodes form the other). The system $p(x)$ is expressed
+with the $\langle 1133 \rangle$-cut coordinates $\boldsymbol{\xi}$ as
+
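+\[
+\begin{aligned}
+\xi_1 &= \eta_1, \\
+&\;\;\vdots \\
+\xi_4 &= \eta_4, \\
+\xi_{1,2} &= \eta_{1,2}, \\
+\xi_{1,3} &= \theta_{1,3}, \\
+\xi_{1,4} &= \theta_{1,4}, \\
+\xi_{2,3} &= \theta_{2,3}, \\
+\xi_{2,4} &= \theta_{2,4}, \\
+\xi_{3,4} &= \eta_{3,4}, \\
+\xi_{1,2,3} &= \theta_{1,2,3}, \\
+&\;\;\vdots \\
+\xi_{2,3,4} &= \theta_{2,3,4}, \\
+\xi_{1,2,3,4} &= \theta_{1,2,3,4}.
+\end{aligned}
+\tag{19}
+\]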
+
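+The decomposed system with no statistical association between the two subsystems has the
+following coordinates $\boldsymbol{\xi}^{\mathrm{dec}}$, which is, in any decomposition, equivalent to setting all $\theta$ in $\boldsymbol{\xi}$ to 0:
+
+\[
+\begin{aligned}
+\xi^{\mathrm{dec}}_1 &= \eta_1, \\
+&\;\;\vdots \\
+\xi^{\mathrm{dec}}_4 &= \eta_4, \\
+\xi^{\mathrm{dec}}_{1,2} &= \eta_{1,2}, \\
+\xi^{\mathrm{dec}}_{1,3} &= 0, \\
+\xi^{\mathrm{dec}}_{1,4} &= 0, \\
+\xi^{\mathrm{dec}}_{2,3} &= 0, \\
+\xi^{\mathrm{dec}}_{2,4} &= 0, \\
+\xi^{\mathrm{dec}}_{3,4} &= \eta_{3,4}, \\
+\xi^{\mathrm{dec}}_{1,2,3} &= 0, \\
+&\;\;\vdots \\
+\xi^{\mathrm{dec}}_{2,3,4} &= 0, \\
+\xi^{\mathrm{dec}}_{1,2,3,4} &= 0.
+\end{aligned}
+\tag{20}
+\]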
+
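+This is analytically equivalent to the definition of the decomposition (13)–(14) in the case of $\langle 1133 \rangle$.
+Therefore, the KL-divergence $D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{\mathrm{dec}})]$ measures the information lost by the system
+decomposition. The following asymptotic agreement with the χ2 test also holds.
+
+Proposition 2.
+
+\[
+2N \cdot D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{\mathrm{dec}})] \sim \chi^2(\sharp\theta(\boldsymbol{\xi})), \tag{21}
+\]
+
+where $\sharp\theta(\boldsymbol{\xi})$ is the number of $\boldsymbol{\theta}$ coordinates appearing in the $\boldsymbol{\xi}$ coordinates.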
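+For example (a count added here for illustration), in the $\langle 1133 \rangle$ decomposition with $n = 4$, the $\theta$ coordinates appearing in Equation (19) are the four cross pairs $\theta_{1,3}, \theta_{1,4}, \theta_{2,3}, \theta_{2,4}$, the four third-order terms and the single fourth-order term, so
+\[
+\sharp\theta(\boldsymbol{\xi}) = 4 + 4 + 1 = 9 = (2^4 - 1) - (2^2 - 1) - (2^2 - 1),
+\]
+that is, the 15 coordinates of the full system minus the 3 + 3 coordinates internal to the two subsystems.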
+
+3. Edge Cutting
+
+We further expand the concept of system decomposition to eventually quantify the total amount of
+information expressed by an edge of the graph. Let us consider cutting an edge $\{x_i, x_j\}$ $(1 \le i < j \le n)$
+of the graph with $n$ vertices. Hereafter we call this operation the edge cutting $i - j$. In the same way
+as the system decomposition, the edge cutting corresponds to modifying the $\boldsymbol{\eta}$ coordinates to produce $\boldsymbol{\eta}^{\mathrm{ec}}$
+coordinates as follows:
+
+\[
+\begin{aligned}
+\eta^{\mathrm{ec}}_{i,j} &= \eta_i \eta_j, \\
+\eta^{\mathrm{ec}}_{s[i,j,k_1]} &= \eta_{s[i,k_1]}\,\eta_{s[j,k_1]}, \\
+\eta^{\mathrm{ec}}_{s[i,j,k_1,k_2]} &= \eta_{s[i,k_1,k_2]}\,\eta_{s[j,k_1,k_2]}, \\
+&\;\;\vdots \\
+\eta^{\mathrm{ec}}_{s[i,j,k_1,\cdots,k_{n-2}]} &= \eta_{s[i,k_1,\cdots,k_{n-2}]}\,\eta_{s[j,k_1,\cdots,k_{n-2}]}, \\
+(\{i, j, k_1, \cdots, k_{n-2}\} &= \{1, \cdots, n\}),
+\end{aligned}
+\tag{22}
+\]
+
+and the rest of $\boldsymbol{\eta}^{\mathrm{ec}}$ remains the same as in $\boldsymbol{\eta}$.
+The formation of $\boldsymbol{\eta}^{\mathrm{ec}}$ from $\boldsymbol{\eta}$ consists of replacing the k-th order elements (k ≥ 3) of $\boldsymbol{\eta}$ that include both
+i and j in their subscripts with the product of the (k − 1)-th order $\boldsymbol{\eta}$ elements in the maximum subgraphs (k − 1 vertices)
+each including i or j. This means that all orders of statistical association including the variables $x_i$ and
+$x_j$ are set to be independent only between them. Other relations that do not simultaneously include $x_i$
+and $x_j$ remain unchanged.
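+As a small illustration (added here, assuming $n = 3$ and the edge cutting $1 - 2$, and following Equation (22) as written), the only modified coordinates are
+\[
+\eta^{\mathrm{ec}}_{1,2} = \eta_1\eta_2, \qquad \eta^{\mathrm{ec}}_{1,2,3} = \eta_{1,3}\,\eta_{2,3},
+\]
+while $\eta_1, \eta_2, \eta_3, \eta_{1,3}$ and $\eta_{2,3}$ are left unchanged.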
+Certain combinations of edge cuttings coincide with system decompositions. For example, in case
+n = 4, the edge cuttings 1 − 2, 1 − 3, and 1 − 4 are equivalent to the system decomposition < 1222 >.
+We define the $i - j$-cut mixture coordinates $\boldsymbol{\xi}$ for the orthogonal decomposition of the statistical
+association represented by the edge $\{x_i, x_j\}$. Although the actual calculation can be performed with the $\boldsymbol{\eta}$
+coordinates alone, this generalization is necessary to have a geometrical definition of the orthogonality. For
+simplicity, we only describe $\boldsymbol{\xi}$ in the case of $n = 4$:
+
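+\[
+\begin{aligned}
+\xi_1 &= \eta_1, \\
+&\;\;\vdots \\
+\xi_4 &= \eta_4, \\
+\xi_{1,2} &= \theta_{1,2}, \\
+\xi_{1,3} &= \eta_{1,3}, \\
+\xi_{1,4} &= \eta_{1,4}, \\
+\xi_{2,3} &= \eta_{2,3}, \\
+\xi_{2,4} &= \eta_{2,4}, \\
+\xi_{3,4} &= \eta_{3,4}, \\
+\xi_{1,2,3} &= \theta_{1,2,3}, \\
+\xi_{1,2,4} &= \theta_{1,2,4}, \\
+\xi_{1,3,4} &= \eta_{1,3,4}, \\
+\xi_{2,3,4} &= \eta_{2,3,4}, \\
+\xi_{1,2,3,4} &= \theta_{1,2,3,4},
+\end{aligned}
+\tag{23}
+\]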
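+where orthogonality between the elements of $\boldsymbol{\eta}$ and $\boldsymbol{\theta}$ holds with respect to the Fisher information
+matrix.
+Calculating the dual coordinates $\boldsymbol{\theta}^{\mathrm{ec}}$ of $\boldsymbol{\eta}^{\mathrm{ec}}$, we can define the coordinates $\boldsymbol{\xi}^{\mathrm{ec}}$ of the system after
+the edge cutting 1 − 2 as follows:
+
+\[
+\begin{aligned}
+\xi^{\mathrm{ec}}_1 &= \eta_1, \\
+&\;\;\vdots \\
+\xi^{\mathrm{ec}}_4 &= \eta_4, \\
+\xi^{\mathrm{ec}}_{1,2} &= \theta^{\mathrm{ec}}_{1,2}, \\
+\xi^{\mathrm{ec}}_{1,3} &= \eta_{1,3}, \\
+\xi^{\mathrm{ec}}_{1,4} &= \eta_{1,4}, \\
+\xi^{\mathrm{ec}}_{2,3} &= \eta_{2,3}, \\
+\xi^{\mathrm{ec}}_{2,4} &= \eta_{2,4}, \\
+\xi^{\mathrm{ec}}_{3,4} &= \eta_{3,4}, \\
+\xi^{\mathrm{ec}}_{1,2,3} &= \theta^{\mathrm{ec}}_{1,2,3}, \\
+\xi^{\mathrm{ec}}_{1,2,4} &= \theta^{\mathrm{ec}}_{1,2,4}, \\
+\xi^{\mathrm{ec}}_{1,3,4} &= \eta_{1,3,4}, \\
+\xi^{\mathrm{ec}}_{2,3,4} &= \eta_{2,3,4}, \\
+\xi^{\mathrm{ec}}_{1,2,3,4} &= \theta^{\mathrm{ec}}_{1,2,3,4}.
+\end{aligned}
+\]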
+
+Note that the edge cutting cannot be defined simply by setting the corresponding elements of $\boldsymbol{\theta}^{\mathrm{ec}}$ to 0.
+The KL-divergence $D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{\mathrm{ec}})]$ then represents the total amount of information
+represented by the edge 1 − 2.
+The following asymptotic agreement with the χ2 test also holds:
+
+Proposition 3.
+
+\[
+2N \cdot D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{\mathrm{ec}})] \sim \chi^2\Big(1 + \sum_{k=1}^{n-2} {}_{n-2}C_k\Big). \tag{24}
+\]
+
+We call this χ2 value, or the KL-divergence itself, the edge information of edge 1 − 2.
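+The degrees of freedom in Equation (24) can be written in closed form (a remark added here for illustration): since $\sum_{k=0}^{n-2} {}_{n-2}C_k = 2^{n-2}$,
+\[
+1 + \sum_{k=1}^{n-2} {}_{n-2}C_k = 2^{n-2}.
+\]
+For $n = 4$ this gives $1 + 2 + 1 = 4$, which matches the four $\theta$ coordinates $\theta_{1,2}, \theta_{1,2,3}, \theta_{1,2,4}, \theta_{1,2,3,4}$ in Equation (23).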
+
+4. Generalized Mutual Information as Complexity with Respect to the Total System
+Decomposition
+In previous sections, we have introduced a measure of complexity in terms of system
+decomposition, by measuring the KL-divergence between a given system and its independently
+decomposed subsystems.
+We consider here the total system decomposition, and measure the
+informational distance I between the system and the totally decomposed system, in which all elements
+are mutually independent.
+
+\[
+I := \sum_{i=1}^{n} H(x_i) - H(x_1, \cdots, x_n),
+\tag{25}
+\]
+
+where
+
+\[
+H(x) := -\sum_{x} p(x) \log p(x).
+\tag{26}
+\]
+
+This quantity is the generalization of mutual information, and is named in various ways such as
+generalized mutual information, integration, complexity, multi-information, etc. according to different
+authors. For simplicity, we call I the “multi-information”, following [15]. This quantity can be
+interpreted as a measure of complexity that sums up the order-wise statistical association existing in
+each subset of components with information geometrical formalization [14].
+For simplicity, we denote the multi-information I of n-dimensional stochastic binary variables as
+follows, using the notation of the system decomposition:
+
+I = D[< 111 · · · 1 >:< 123 · · · n >].
+(27)
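+
+As an illustration of Equations (25) and (26), the following minimal Python sketch computes the
+multi-information of a small binary system. The function names and the two example distributions are
+hypothetical and are not taken from the paper; natural logarithms are assumed, so values are in nats.
+

```python
import itertools
import math


def entropy(probs):
    """Shannon entropy -sum p log p in nats; zero-probability states are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def multi_information(joint):
    """I = sum_i H(x_i) - H(x_1, ..., x_n) for a dict {binary tuple: probability}."""
    n = len(next(iter(joint)))
    joint_entropy = entropy(joint.values())
    marginal_entropies = 0.0
    for i in range(n):
        p_one = sum(p for x, p in joint.items() if x[i] == 1)
        marginal_entropies += entropy([p_one, 1.0 - p_one])
    return marginal_entropies - joint_entropy


# Hypothetical example 1: three independent fair bits (totally decomposed system).
independent = {x: 1.0 / 8.0 for x in itertools.product((0, 1), repeat=3)}
# Hypothetical example 2: three perfectly correlated bits.
copies = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

print("independent:", multi_information(independent))
print("copies:     ", multi_information(copies))
```

+
+In this sketch the independent system gives I = 0 up to rounding, while the perfectly correlated one gives
+I = 2 log 2, since the three marginal entropies sum to 3 log 2 and the joint entropy is log 2.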
+
+5. Rectangle-Bias Complexity
+
+The multi-information contains some degrees of freedom in case n > 2. That is, we can define a
+set of distributions {p(x)|I = const.} with different parameters but the same I value. This fact can be
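+clearly explained with the use of information geometry. From the Pythagorean relation, we obtain the
+following in case of n = 3:
+
+\[
+\begin{aligned}
+D[\langle 111 \rangle : \langle 113 \rangle] + D[\langle 113 \rangle : \langle 123 \rangle] &= D[\langle 111 \rangle : \langle 123 \rangle], \\
+D[\langle 111 \rangle : \langle 121 \rangle] + D[\langle 121 \rangle : \langle 123 \rangle] &= D[\langle 111 \rangle : \langle 123 \rangle], \\
+D[\langle 111 \rangle : \langle 122 \rangle] + D[\langle 122 \rangle : \langle 123 \rangle] &= D[\langle 111 \rangle : \langle 123 \rangle].
+\end{aligned}
+\tag{28}
+\]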
+
+Using these relations, we can schematically represent the decomposed systems on a circle diagram
+with diameter √I. This representation is based on the analogous algebra between the Pythagorean relation
+of KL-divergence and that of Euclidean geometry, where the circumferential angle of a semi-circular
+arc is always π/2.
+Figure 1 represents two different cases with the same I value in case n = 3. The distance between
+two systems in the same diagram corresponds to the square root value of KL-divergence between
+them. Clearly the left and right figures represent different dependencies between nodes, although
+they both have the same I value. Such geometrical variation is possible because the dual coordinates
+have more degrees of freedom than the given constraint removes. There exist 7 degrees of freedom in η
+or θ coordinates for n = 3, while the only constraint is the invariance of I value, which only reduces 1
+degree of freedom. The remaining 6 degrees of freedom can then be deployed to produce geometrical
+variation in the circle diagram. Regarding system decomposition, the system in the left figure cannot be
+decomposed without losing much information, while in the right figure there exists a
+relatively easy decomposition < 122 >, which loses less information than any decomposition in the left
+figure. We call this degree of ease of decomposition, measured by the information lost, the system
+decompositionability. In this sense, the left system is more complex although the 2 systems both have the
+same I value. In particular, in case D[< 111 >:< 122 >] = D[< 111 >:< 113 >] = D[< 111 >:< 121 >],
+the system does not have any easiest way of decomposition, and any isolation of a node loses a significant
+amount of information.
+To further incorporate such geometrical structure reflecting system decompositionability into a
+measure of complexity, we consider a mathematical way to distinguish between these two figures.
+Although the total sum of KL-divergence along the sequence of system decomposition is always
+identical to I by the Pythagorean relation, their product can vary according to the geometrical composition
+in the circle diagram. This is analogous to the isoperimetric inequality for rectangles, where the
+square gives the maximum area among rectangles of constant perimeter.
+We provisionally propose a new measure of complexity as follows, namely rectangle-bias
+complexity Cr:
+
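+\[
+C_r = \frac{1}{|SD| - 2} \sum_{\langle \cdots \rangle \in SD} D[\langle 11 \cdots 1 \rangle : \langle \cdots \rangle] \cdot D[\langle \cdots \rangle : \langle 12 \cdots n \rangle],
+\tag{29}
+\]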
+
+where SD is the set of possible system decomposition in n binary variables, and |SD| is the element
+number of SD. For example, SD = {< 111 >, < 122 >, < 121 >, < 113 >, < 123 >} and |SD| = 5
+for n = 3. This measure distinguishes between the two systems in Figure 1, and gives larger value
+for the left figure. It also gives maximum value in case D[< 111 >:< 122 >] = D[< 111 >:< 113 >] =
+D[< 111 >:< 121 >].
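+
+The rectangle-bias complexity for n = 3 can be sketched numerically as follows. This is a minimal
+illustration, assuming that each decomposed system is the product of the block marginals of the original
+distribution; the helper names and the coupling values of the example distribution are hypothetical.
+

```python
import itertools
import math


def marginal(joint, idx):
    """Marginal distribution over the variable positions listed in idx."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def decomposed(joint, blocks):
    """Product of block marginals (assumed form of the decomposed system)."""
    margs = [marginal(joint, b) for b in blocks]
    return {
        x: math.prod(m[tuple(x[i] for i in b)] for m, b in zip(margs, blocks))
        for x in joint
    }


def kl(p, q):
    """KL-divergence D[p : q] in nats."""
    return sum(pv * math.log(pv / q[x]) for x, pv in p.items() if pv > 0.0)


# System decompositions for n = 3, written as blocks of 0-based node indices.
SD = {
    "<111>": [(0, 1, 2)],
    "<122>": [(0,), (1, 2)],
    "<121>": [(0, 2), (1,)],
    "<113>": [(0, 1), (2,)],
    "<123>": [(0,), (1,), (2,)],
}


def rectangle_bias(joint):
    """Return (Cr, {name: (D[<111>:sd], D[sd:<123>])}) following Eq. (29)."""
    total = decomposed(joint, SD["<123>"])
    terms = {}
    for name, blocks in SD.items():
        part = decomposed(joint, blocks)
        terms[name] = (kl(joint, part), kl(part, total))
    cr = sum(a * b for a, b in terms.values()) / (len(SD) - 2)
    return cr, terms


# Hypothetical example: strong x0-x1 coupling, weak x1-x2 coupling.
weights = {
    x: math.exp(1.5 * (x[0] == x[1]) + 0.3 * (x[1] == x[2]))
    for x in itertools.product((0, 1), repeat=3)
}
z = sum(weights.values())
joint = {x: w / z for x, w in weights.items()}

cr, terms = rectangle_bias(joint)
for name, (a, b) in terms.items():
    print(name, round(a, 6), round(b, 6), "sum:", round(a + b, 6))
print("Cr =", round(cr, 6))
```

+
+Under this assumption the two factors of every term add up to the multi-information I, as in the
+Pythagorean relations of Equation (28); the terms for < 111 > and < 123 > vanish, which is consistent with
+the normalization by |SD| − 2.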
+
+Figure 1. Circle diagrams of system decomposition in 3-node network. Both systems have the same value of
+multi-information I, which is expressed as the identical diameter length of the circles. Two variations are
+shown, where the left system is more complex (Cr high) in the sense that any system decomposition
+loses more information than the easiest one (< 122 >) in the right figure (Cr low).
+
+6. Complementarity between Complexities Defined with Arithmetic and Geometric Means
+
+We evaluate the possibilities and limits of the rectangle-bias complexity Cr in comparison with other
+proposed measures of complexity.
+Interest in measuring network heterogeneity has developed toward the incorporation
+of multi-scale characteristics into complexity measures. The TSE complexity is motivated by the
+structure of the functional differentiation of brain activity, and measures the difference of neural
+integration between all sizes of subsystems and the whole system [17]. The biologically motivated TSE
+complexity has also been investigated from a theoretical point of view, to further attribute desirable properties
+as a universal complexity measure independent of system size [20]. The hierarchical structure of
+the exponential family in information geometry also leads to the order-wise description of statistical
+association, which can be regarded as a multi-scale complexity measure [14]. The relation between
+the order-wise dependencies and the TSE complexity is theoretically investigated to establish the
+order-wise component correspondence between them [15].
+These indices of network heterogeneity, however, all depend on the arithmetic mean of the
+component-wise information theoretical measure. We show that these arithmetic means still fail to
+measure certain modularity based on the statistical independence between subsystems.
+Figure 2 presents simplified cases that complexity measures with arithmetic means fail to
+distinguish. We consider two systems with different heterogeneity but identical multi-information
+I. Here, the multi-information cannot reflect the network heterogeneity. The TSE complexity and its
+information geometrical correspondence in [15] has a sensitivity to measure the network heterogeneity,
+but since the arithmetic mean is taken over all subsystems, they do not distinguish the component-wise
+break of symmetry between different scales. The renormalized TSE complexity with respect to the
+multi-information I still has the same insensitivity. Even by incorporating the information of each
+subsystem scale, the arithmetic mean can balance out between the scale-wise variations, and a large
+range of the heterogeneity in different scale can realize the same value of these complexities. For the
+application in neuroscience, the assumption of a model with simple parametric heterogeneity and the
+comparison of TSE complexity between different I values alleviate this limitation [17].
+
+Figure 2. Schematic examples of stochastic systems with identical multi-information I where complexity
+measures with arithmetic mean fail to distinguish. (a): Example 1 of stochastic system with
+homogeneous mean of edge information and symmetric fluctuation of its heterogeneity; (b):
+Example 2 of heterogeneous stochastic system with bimodal edge information distribution and
+identical multi-information I and complexity based on arithmetic mean as example 1; (c): schematic
+representation of the distribution of statistical association (edge information) in upper network; (d):
+schematic representation of the distribution of statistical association (edge information) in upper
+network.
+
+In contrast to complexities with arithmetic mean, the rectangle-bias complexity Cr is related to
+the geometrical mean. The Cr can distinguish the two systems in Figure 2, giving relatively high Cr
+value to the left system and low value to the right one.
+This does not mean, however, that the Cr has a finer resolution than other complexity
+measures. The constant conditions of complexity measures are the constraints on ∑_{k=1}^{n} nCk degrees of
+freedom in model parameter space, which define different geometrical composition of corresponding
+submanifolds. We basically need ∑_{k=1}^{n} nCk independent measures to assure the real-value resolution
+of network feature characterization. Complexities with arithmetic and geometric means are just giving
+complementary information on network heterogeneity, or different constant-complexity submanifolds
+structure in statistical manifold as depicted in Figure 3. Therefore, it is also possible to construct a
+class of systems that has identical I and Cr values but different TSE complexity. Complexity measures
+should be utilized in combination, with respect to the non-linear separation capacity of network
+features of interest.
+
+Figure 3. Schematic representation of complementarity between complexity measures based on arithmetic
+mean (Ca) and geometric mean (Cg) of informational distance. An example of the n − 1 dimensional
+constant-complexity submanifolds with respect to Ca = const. and Cg = const. conditions are depicted
+with yellow and orange surface, respectively. The dimension of the whole statistical manifold S is the
+parameter number n.
+
+7. Cuboid-Bias Complexity with Respect to System Decompositionability
+
+We consider the extension of Cr to general system size n. The n ≥ 4 situation differs from
+n ≤ 3 in the existence of a hierarchical structure between system decompositions.
+Figure 4 shows the hierarchy of the system decompositions in case n = 4. Such hierarchical
+structure between system decompositions is not homogeneous with respect to the number of
+subsystems, and depends on the isomorphic types of decomposed systems. This fact produces a certain
+difference of meaning in complexity between the KL-divergences when considering the system
+decompositionability.
+
+Figure 4. Hierarchy of system decomposition for 4 nodes network (n = 4). Possible sequences of Seq =
+{SD1(is) → SD2(is) → SD3(is) → SD4(is)|1 ≤ is ≤ |Seq| = 18} are connected with the lines.
+
+A simple example in 4 nodes network is shown in Figure 5.
+
+Figure 5. Hierarchical effect of sequential system decomposition on cuboid volume and rectangle surface
+on circle graph. We consider increasing the diameter of the green circle from the dotted to the dashed one
+without changing those of the red and blue circles, which has a different effect on the change of
+D[< 1222 >:< 1233 >] and D[< 1133 >:< 1134 >] according to the hierarchical structure of the
+decomposition sequences.
+
+We consider the modification of 2 KL-divergences in the figure, D[< 1111 >:< 1222 >] and
+D[< 1111 >:< 1133 >] from the diameter of green dotted circle to the dashed one.
+The joint distribution P(x1, x2, x3, x4) of a discrete distribution with 4 binary variables
+(x1, x2, x3, x4) (x1, x2, x3, x4 ∈ {0, 1}) has 2^4 − 1 = 15 parameters, which define the dual-flat
+coordinates of the statistical manifold in information geometry.
+On the other hand, the possible system decompositions for n = 4 are as follows:
+
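+\[
+\begin{aligned}
+SD := \{ & \langle 1111 \rangle, \langle 1114 \rangle, \langle 1131 \rangle, \langle 1211 \rangle, \langle 1222 \rangle, \\
+& \langle 1133 \rangle, \langle 1212 \rangle, \langle 1221 \rangle, \langle 1134 \rangle, \langle 1214 \rangle, \\
+& \langle 1231 \rangle, \langle 1224 \rangle, \langle 1232 \rangle, \langle 1233 \rangle, \langle 1234 \rangle \}.
+\end{aligned}
+\tag{31}
+\]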
+
+Since the number of possible system decompositions is 15, and each is associated with the
+modification of different sets of P(x1, x2, x3, x4) parameters, the system decompositions and
+KL-divergences between them can be defined independently. This also holds even under the constant
+condition of I value or other complexity measures except the ones imposing dependency between
+system decompositions.
+This means that we can independently modify the diameter of green dotted circle in Figure 5,
+without changing the diameters of the red and blue circles, which define the system decompositions
+< 1233 > and < 1134 > in the sub-hierarchy of < 1222 > and < 1133 >, respectively. Other
+KL-divergences can also be maintained as given constant values for the same reason.
+The rectangle-biased complexity Cr increases its value with such modification, but does not reflect
+the heterogeneity of KL-divergences according to the hierarchy of system decompositions. If we
+consider the system decompositionability as the mean facility to decompose the given system into
+its finest components with respect to the “all” possible system decompositions, such hierarchical
+difference also has a meaning in the definition of complexity.
+The effect of modifying the diameter of the green dotted circle is different between the
+decomposition sequences < 1111 >→< 1222 >→< 1233 >→< 1234 > and < 1111 >→<
+1133 >→< 1134 >→< 1234 >. The decrease of the KL-divergence D[< 1222 >:< 1233 >]
+is less than D[< 1133 >:< 1134 >] since the diameter of the red dotted circle is larger than the
+blue one in Figure 5. This means that the effect of changing the same amount of KL-divergences
+in D[< 1111 >:< 1222 >] and D[< 1111 >:< 1133 >] produces larger effect on the sequence
+< 1111 >→< 1133 >→< 1134 >→< 1234 > than < 1111 >→< 1222 >→< 1233 >→< 1234 >, if
+compared at the sequence level. The rectangle-biased complexity Cr does not reflect such characteristics
+since it does not distinguish between the hierarchical structure between the diameters of the green, red
+and blue dotted circles.
+To incorporate such hierarchical effect in a complexity measure with geometric mean, we have
+the natural expansion of the rectangle-biased complexity Cr as the cuboid-bias complexity Cc, which is
+defined as follows:
+
+\[
+C_c := \frac{1}{|Seq|} \sum_{i_s=1}^{|Seq|} \prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)],
+\tag{32}
+\]
+
+where Seq represents the possible sequences of hierarchical system decompositions as follows:
+
+Seq = {SD1(is) → SD2(is) → · · · SDi(is) · · · → SDn(is)|1 ≤ is ≤ |Seq|}.
+(33)
+
+The elements SDi(is) of Seq correspond to the system decompositions, which are aligned according to
+the hierarchy with the following algorithmic procedure (based on [15]):
+
+(1) Initialization: Set the initial sets of system decomposition of all sequences in Seq as the whole
+system SD1(is) :=< 111 · · · 1 > (1 ≤ is ≤ |Seq|).
+(2) Step i → i + 1: If the system decomposition is the total system decomposition (SDi(is) :=<
+123 · · · n >), then stop. Otherwise, choose a non-decomposed subsystem SSi(is) of the system
+decomposition SDi(is), and further divide it into two independent subsystems SS^1_i(is) and
+SS^2_i(is), different for each is. SDi+1(is) is then defined as a system decomposition of the total system
+that further separates independently the subsystems SS^1_i(is) and SS^2_i(is), in addition to the previous
+decomposition SDi(is).
+(3) Go to the next step i + 1 → i + 2.
+
+The value of |Seq| corresponds to the number of different sequences generated by this algorithm. For
+example, |Seq| = 3 and |Seq| = 18 holds for n = 3 and n = 4, respectively. The general analytical form
+|Seq|n of |Seq| with system size n is obtained as the following recurrence formula:
+
+\[
+|Seq|_n = \sum_{i=1}^{\lfloor n/2 \rfloor} {}_{n}C_{i} \, |Seq|_{n-i} \, |Seq|_{i},
+\tag{34}
+\]
+
+where ⌊·⌋ is a floor function and with formal definition of |Seq|1 := 1.
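+
+The algorithmic procedure above can be sketched as a direct enumeration, which gives a way to check the
+quoted values |Seq| = 3 and |Seq| = 18. This is only an illustrative sketch: it counts sequences by following
+steps (1)–(3) literally rather than by evaluating Equation (34), and the function name `count_sequences` is
+hypothetical. A state is represented only by the sorted sizes of its blocks, assuming that blocks of equal
+size are still distinct subsystems.
+

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_sequences(sizes):
    """Count decomposition sequences from a partition with the given block sizes
    down to the total decomposition (all blocks of size 1)."""
    if all(s == 1 for s in sizes):
        return 1
    total = 0
    for k, m in enumerate(sizes):  # choose a non-decomposed block
        if m < 2:
            continue
        rest = sizes[:k] + sizes[k + 1:]
        for a in range(1, m // 2 + 1):  # split it into parts of size a and m - a
            ways = comb(m, a)
            if 2 * a == m:
                ways //= 2  # the two halves of an equal split are unordered
            total += ways * count_sequences(tuple(sorted(rest + (a, m - a))))
    return total


for n in (2, 3, 4):
    print(n, count_sequences((n,)))
```

+
+For n = 3 and n = 4 this enumeration reproduces the values 3 and 18 stated in the text.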
+The products of KL-divergences according to the hierarchical sequences of system decompositions
+in Equation (32) is related to the volume of n − 1-dimensional cuboids in the circle diagram. An
+example in case of n = 4 is presented in Figure 5, where two cuboids with 3 orthogonal edges of the
+different decomposition sequences < 1111 >→< 1222 >→< 1233 >→< 1234 > and < 1111 >→<
+1133 >→< 1134 >→< 1234 > are depicted, whose cuboid volumes are
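+
+\[
+\sqrt{D[\langle 1111 \rangle : \langle 1222 \rangle] \, D[\langle 1222 \rangle : \langle 1233 \rangle] \, D[\langle 1233 \rangle : \langle 1234 \rangle]},
+\tag{35}
+\]
+
+and
+
+\[
+\sqrt{D[\langle 1111 \rangle : \langle 1133 \rangle] \, D[\langle 1133 \rangle : \langle 1134 \rangle] \, D[\langle 1134 \rangle : \langle 1234 \rangle]},
+\tag{36}
+\]
+
+respectively.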
+
+In the same way as Cr, we took in the definition of Cc the arithmetic average of cuboid volumes
+so as to renormalize the combinatorial increase of the decomposition paths (|Seq|) according to the
+system size n.
+Note that, on the other hand, we did not renormalize the rectangle-bias complexity Cr and the
+cuboid-bias complexity Cc by taking the exact geometric mean of each product of KL-divergences,
+such as $\sqrt[n-1]{\prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)]}$. This is for further accessibility to theoretical
+analysis such as the variational method (see the “Further Consideration” section), and does not change the
+qualitative behavior of Cr and Cc since the power root is a monotonically increasing function. This
+treatment can be interpreted as taking the (n − 1)-th power of the geometric means for the hierarchical
+sequences of KL-divergences.
+A more comprehensive example on the utility of the cuboid-bias complexity Cc with respect to
+the rectangle-biased one Cr is shown in Figure 6. We consider the 6 nodes networks (n = 6) with the
+same I and Cr values but different heterogeneity. The system in the top left figure has a circularly
+connected structure with medium intensity, while that of the top right figure has strongly connected 3
+subsystems. These systems have qualitatively five different ways of system decomposition that are
+the basic generators of all hierarchical sequences Seq = {SD1(is) → · · · → SD5(is)|1 ≤ is ≤ |Seq|} for
+these networks. The five basic system decompositions are shown with the numbers ①, ②, ②′, ③ and
+④ in the top figures.
+The circle diagrams of these systems are depicted in the middle figures. To suppose the same
+constant value of Cr in both systems, the following condition is satisfied in the middle right figure:
+D[< 111111 >: ②] < D[< 111111 >: ① in Middle Left figure] < D[< 111111 >: ①] <
+D[< 111111 >: ② in Middle Left figure] < D[< 111111 >: ③] < D[< 111111 >: ④]. Furthermore, the
+total surface of right triangles sharing the circle diameter as hypotenuse in the middle left and the
+middle right figures are conditioned to be identical, therefore the rectangle-bias complexity Cr fails to
+distinguish.
+On the other hand, under the same condition, the cuboid-bias complexity Cc distinguishes between
+these two systems and gives a higher value to the left one. The volumes of the 5-dimensional cuboids of the
+decomposition sequence $\langle 111111 \rangle \xrightarrow{①\,②\,②'\,③\,④} \langle 123456 \rangle$ are schematically shown in the bottom
+figures, maintaining the quantitative difference between KL-divergences. Since the multi-information
+I is identical between the two systems, so is the value of the KL-divergence D[< 111111 >:< 123456 >],
+which is the sum of the KL-divergences along the sequence $\langle 111111 \rangle \xrightarrow{①\,②\,②'\,③\,④} \langle 123456 \rangle$
+from the Pythagorean theorem. This means that the inequality between the cuboid volumes can be
+represented as the isoperimetric inequality of a high-dimensional cuboid. As a consequence, the left
+system has a quantitatively higher value of Cc than the right one. The cuboid-bias complexity Cc is also
+sensitive to such heterogeneity.
+
+Figure 6. Meaning of taking geometric mean over the sequence of system decomposition in cuboid-bias complexity
+Cc. (a): Example of 6-node network with circularly connected structure with medium intensity. Edge
+width is proportional to edge information; (b): Example of 6-node network with strongly connected
+3 subsystems. Edge width is proportional to edge information. The multi-information I of the two
+systems in Top figures are conditioned to be identical; The dotted lines schematically represent possible
+system decompositions. (c,d): Circle diagrams of each system decomposition in upper networks; The
+total surface of right triangles sharing the circle diameter as hypotenuse in (c) and (d) are conditioned to
+be identical, therefore the rectangle-bias complexity Cr fails to distinguish. (e,f): 5-dimensional cuboids
+of upper networks (Figure 6a,b) whose edges are the root of KL-divergences for the strain of system
+decomposition $\langle 111111 \rangle \xrightarrow{①\,②\,②'\,③\,④} \langle 123456 \rangle$. Only the first 3-dimensional part is shown with
+solid line, and the remaining 2-dimensional part is represented with dotted line. The volume of
+cuboid in (e) is larger than the one in (f), according to the isoperimetric inequality of high-dimensional
+cuboid. The total squared length of each side is identical between two cuboids, which represents
+multi-information I = D[< 111111 >:< 123456 >].
+
+8. Regularized Cuboid-Bias Complexity with Respect to Generalized Mutual Information
+
+We further consider the geometrical composition of system decompositions in the circle diagram
+and argue for the necessity of renormalizing the cuboid-bias complexity Cc with the multi-information I,
+which gives another measure of complexity, namely the “regularized cuboid-bias complexity C^R_c.”
+We consider the situation in actual data where the multi-information I varies. Figure 7 shows
+the n = 3 cases where the Cc fails to distinguish. Both the blue and red systems are supposed to have
+the same Cc value by adjusting the red system to have relatively smaller values of KL-divergences
+D[< 111 >:< 122 >] and D[< 113 >:< 123 >] than the blue one. Such conditioning is possible since
+the KL-divergences are independent parameters with each other.
+
+Figure 7. Examples of the 3-node systems with identical cuboid-bias complexity Cc but different
+multi-information I on circle graph. (a): System with smaller I but larger C^R_c; (b): System with larger I but
+smaller C^R_c; (c): Superposition of the above two systems. The regularized cuboid-bias complexity C^R_c
+distinguishes between the blue and red systems.
+
+Although the Cc value is identical, the two systems have different geometrical composition of
+system decompositions in the circle diagram. The red system has relatively easier way of decomposition
+< 111 >→< 122 > if renormalized with the total system decomposition < 111 >→< 123 >. This
+relative decompositionability with respect to the renormalization with the multi-information I can
+be clearly understood by superimposing the circle diagram of the two systems and comparing the
+angles between each and total decomposition paths (bottom figure). The red system has larger angle
+between the decomposition paths < 111 >→< 122 > and < 111 >→< 123 > than any others in the
+blue system, which represents the relative facility of the decomposition under renormalization with I.
+In this term, the paths < 111 >→< 121 > in the red and blue system do not change its relative facility,
+and the paths < 111 >→< 113 > are easier in the blue system.
+To express the system decompositionability based on these geometrical compositions in a
+comprehensive manner, we define the regularized cuboid-bias complexity C^R_c as follows:
+
+\[
+C^R_c := \frac{1}{|Seq|} \sum_{i_s=1}^{|Seq|} \prod_{i=1}^{n-1} \frac{D[SD_i(i_s) : SD_{i+1}(i_s)]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]}
+:= \frac{C_c}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}}
+:= \frac{C_c}{I^{n-1}}.
+\tag{37}
+\]
+
+The red system then has a quantitatively smaller C^R_c value than the blue system in Figure 7.
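+
+Equations (32) and (37) can be sketched numerically for a small system by enumerating every
+decomposition sequence. The following Python sketch is illustrative only: it assumes that each decomposed
+system is the product of the block marginals of the original distribution, and the helper names and the
+4-node example with its coupling values are hypothetical.
+

```python
import itertools
import math


def marginal(joint, idx):
    """Marginal distribution over the variable positions listed in idx."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def decomposed(joint, blocks):
    """Product of block marginals (assumed form of the decomposed system)."""
    margs = [marginal(joint, b) for b in blocks]
    return {x: math.prod(m[tuple(x[i] for i in b)] for m, b in zip(margs, blocks))
            for x in joint}


def kl(p, q):
    """KL-divergence D[p : q] in nats."""
    return sum(pv * math.log(pv / q[x]) for x, pv in p.items() if pv > 0.0)


def splits(block):
    """All unordered splits of a block into two non-empty parts."""
    first, rest = block[0], block[1:]
    for r in range(len(rest) + 1):
        for extra in itertools.combinations(rest, r):
            right = tuple(i for i in rest if i not in extra)
            if right:
                yield (first,) + extra, right


def sequences(partition):
    """Yield every decomposition sequence from `partition` down to singletons."""
    if all(len(b) == 1 for b in partition):
        yield [partition]
        return
    for k, block in enumerate(partition):
        if len(block) < 2:
            continue
        for left, right in splits(block):
            nxt = partition[:k] + (left, right) + partition[k + 1:]
            for tail in sequences(nxt):
                yield [partition] + tail


def cuboid_bias(joint):
    """Return (Cc, regularized Cc, multi-information I, number of sequences)."""
    n = len(next(iter(joint)))
    products = []
    for seq in sequences((tuple(range(n)),)):
        dists = [decomposed(joint, part) for part in seq]
        products.append(math.prod(kl(a, b) for a, b in zip(dists, dists[1:])))
    cc = sum(products) / len(products)
    multi_info = kl(joint, decomposed(joint, tuple((i,) for i in range(n))))
    return cc, cc / multi_info ** (n - 1), multi_info, len(products)


# Hypothetical 4-node ring with unequal coupling strengths (illustrative values).
edges = {(0, 1): 1.2, (1, 2): 0.4, (2, 3): 1.2, (3, 0): 0.4}
weights = {x: math.exp(sum(j * (x[a] == x[b]) for (a, b), j in edges.items()))
           for x in itertools.product((0, 1), repeat=4)}
z = sum(weights.values())
joint = {x: w / z for x, w in weights.items()}

cc, cc_reg, multi_info, n_seq = cuboid_bias(joint)
print("sequences:", n_seq, "I:", multi_info, "Cc:", cc, "regularized Cc:", cc_reg)
```

+
+Because the KL-divergences along each sequence sum to I, the inequality of arithmetic and geometric
+means bounds each normalized product by (1/(n − 1))^(n−1) in this sketch, so the regularized value is
+bounded independently of I.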
+
+9. Modular Complexity with Respect to the Easiest System Decomposition Path
+
+We have so far considered the system decompositionability with respect to all possible
+decomposition sequences.
+This was also a way to avoid the local fluctuation of the network
+heterogeneity being reflected in some specific decomposition paths. On the other hand, the easiest
+decomposition is particularly important when considering the modularity of the system. If there exists
+a hierarchical structure of modularity in different scales with different coherence of the system, the
+KL-divergence and the sequence of the easiest decomposition give much information.
+
+Figure 8 schematically shows a typical example where there exist two levels of modularity. Such
+structure with different scales of statistical coherence appears as functional segregation in neural
+systems [17], and is expected to be observed widely in complex systems.
+The hierarchical topology of the easiest decomposition path reflects these structures. For
+example, in the system of Figure 8, the decompositions between < 1 1 · · · 1 > and
+< 1 1 1 1 5 5 5 5 9 9 9 9 13 13 13 13 > are easier than those inside of the 4-node subsystems. The values of
+KL-divergence also reflect the hierarchy, giving relatively low values for the decomposition between
+the 4-node subsystems, and high values inside of them. By examining the shortest decomposition
+path and associated KL-divergences in possible Seq, one can project the hierarchical structure of the
+modularity existing in the system.
+
+Figure 8. Example of 16-node system < 11 · · · 1 > that has different levels of modularity. The four 4-node
+subsystems < 1111 > (blue blocks) are loosely connected and easy to be decomposed, while inside each
+component (red blocks) is tightly connected. The degree of connection represents statistical dependency
+or edge information between subsystems. Such hierarchical structure can be detected by observing the
+decomposition path of the modular complexity Cm.
+
+For this reason, we define the modular complexity Cm as follows, which is the shortest path
+component of the cuboid-bias complexity Cc:
+
+\[
+C_m := \prod_{i=1}^{n-1} D[SD_i(i_{min}) : SD_{i+1}(i_{min})],
+\tag{38}
+\]
+
+where the index imin of the sequence SD1(imin) → SD2(imin) → · · · → SDn(imin) is chosen as follows:
+
+\[
+i_{min} = \{i_1\} \cap \{i_2\} \cap \cdots \cap \{i_{n-1}\},
+\tag{39}
+\]
+
+where
+
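+\[
+\begin{aligned}
+\{i_1\} &= \operatorname*{argmin}_{i_s} \{ D[SD_1(i_s) : SD_2(i_s)] \mid 1 \le i_s \le |Seq| \}, \\
+\{i_2\} &= \operatorname*{argmin}_{i_1} \{ D[SD_2(i_1) : SD_3(i_1)] \mid i_1 \in \{i_1\} \}, \\
+&\;\;\vdots \\
+\{i_{n-1}\} &= \operatorname*{argmin}_{i_{n-2}} \{ D[SD_{n-1}(i_{n-2}) : SD_n(i_{n-2})] \mid i_{n-2} \in \{i_{n-2}\} \},
+\end{aligned}
+\tag{40}
+\]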
+
+which gives eventually
+
+\[
+i_{min} = i_{n-1}.
+\tag{41}
+\]
+
+This means that beginning from the undecomposed state < 11 · · · 1 >, we continue to choose
+the shortest decomposition path in the next hierarchy of system decomposition. The minimization of
+the path length is guaranteed by the sequential minimization since the geometric mean of isometric
+path division is bounded below by its minimum component. imin is unique if the system is completely
+heterogeneous (i.e., D[SD1(ik) : SD2(ik)] ≠ D[SD1(il) : SD2(il)], 1 ≤ ik < il ≤ |Seq|), otherwise multiple
+decomposition paths that give the same Cm value are possible according to the homogeneity of the
+system. Besides its value, the modular complexity Cm should be utilized with the sequence information
+of the shortest decomposition path to evaluate the modularity structure of a system.
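+
+The sequential minimization of Equations (38)–(41) can be sketched as a greedy search that, at each level,
+picks the split with the smallest KL-divergence. This is a minimal illustration under the assumption that each
+decomposed system is the product of block marginals; the helper names and the modular 4-node example are
+hypothetical, and ties are broken by taking the first minimizer found.
+

```python
import itertools
import math


def marginal(joint, idx):
    """Marginal distribution over the variable positions listed in idx."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def decomposed(joint, blocks):
    """Product of block marginals (assumed form of the decomposed system)."""
    margs = [marginal(joint, b) for b in blocks]
    return {x: math.prod(m[tuple(x[i] for i in b)] for m, b in zip(margs, blocks))
            for x in joint}


def kl(p, q):
    """KL-divergence D[p : q] in nats."""
    return sum(pv * math.log(pv / q[x]) for x, pv in p.items() if pv > 0.0)


def splits(block):
    """All unordered splits of a block into two non-empty parts."""
    first, rest = block[0], block[1:]
    for r in range(len(rest) + 1):
        for extra in itertools.combinations(rest, r):
            right = tuple(i for i in rest if i not in extra)
            if right:
                yield (first,) + extra, right


def easiest_path(joint):
    """Greedy shortest decomposition path; returns (path, list of KL-divergences)."""
    n = len(next(iter(joint)))
    partition = (tuple(range(n)),)
    current = decomposed(joint, partition)
    path, divergences = [partition], []
    while any(len(b) > 1 for b in partition):
        candidates = []
        for k, block in enumerate(partition):
            if len(block) < 2:
                continue
            for left, right in splits(block):
                nxt = partition[:k] + (left, right) + partition[k + 1:]
                dist = decomposed(joint, nxt)
                candidates.append((kl(current, dist), nxt, dist))
        d, partition, current = min(candidates, key=lambda c: c[0])
        path.append(partition)
        divergences.append(d)
    return path, divergences


# Hypothetical modular system: two tightly coupled pairs joined by a weak link.
weights = {
    x: math.exp(2.0 * ((x[0] == x[1]) + (x[2] == x[3])) + 0.1 * (x[1] == x[2]))
    for x in itertools.product((0, 1), repeat=4)
}
z = sum(weights.values())
joint = {x: w / z for x, w in weights.items()}

path, divergences = easiest_path(joint)
c_m = math.prod(divergences)
for part, d in zip(path[1:], divergences):
    print(part, round(d, 6))
print("Cm =", c_m)
```

+
+In this example the first step separates the two tightly coupled pairs, which illustrates why the sequence
+of the shortest path, and not only the value of Cm, carries the information about modularity.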
+The cases where Cm are identical but Cc are different can be composed by varying the system
+decompositions other than in the shortest path SD1(imin) → SD2(imin) → · · · → SDn(imin) without
+modifying the index imin. There exist also inverse examples with identical Cc and different Cm, due to
+the complementarity between Cm and Cc.
+We finally define the regularized modular complexity C^R_m as follows, for the same reason as defining
+C^R_c from Cc:
+
+\[
+C^R_m := \prod_{i=1}^{n-1} \frac{D[SD_i(i_{min}) : SD_{i+1}(i_{min})]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]}
+:= \frac{C_m}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}}
+:= \frac{C_m}{I^{n-1}}.
+\tag{42}
+\]
+
+Proposition 4. The cuboid-bias complexities Cc and C^R_c are bounded by the modular complexities Cm and C^R_m,
+respectively:
+
+\[
+C_c \le C_m,
+\tag{43}
+\]
+
+\[
+C^R_c \le C^R_m.
+\tag{44}
+\]
+
+And they coincide at the maximum values under the given multi-information I:
+
+\[
+\max\{C_m \mid I = \mathrm{const.}\} = \max\{C_c \mid I = \mathrm{const.}\},
+\tag{45}
+\]
+
+\[
+\max\{C^R_m\} = \max\{C^R_c\}.
+\tag{46}
+\]
+
+These relations (43)–(46) are numerically shown in the “Numerical Comparison” section.
+The modular complexities are larger because of the hierarchical dependency of the
+KL-divergence values in decomposition paths. In the shortest decomposition path defining the modular
+complexities, the easier system decompositions have relatively increased values since they incorporate a
+larger number of edge cuttings. Since we eventually cut all edges to obtain < 12 · · · n > at the end of the
+decomposition sequence, collecting the edges with relatively weak edge information and cutting them
+together augments the value of the product of KL-divergences. The modular complexities are then
+the maximum value components among the possible decomposition paths calculated in cuboid-bias
+complexities:
+
+\[
+C_m = \max \left\{ \prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)] \;\middle|\; 1 \le i_s \le |Seq| \right\},
+\tag{47}
+\]
+
+\[
+C^R_m = \max \left\{ \frac{\prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}} \;\middle|\; 1 \le i_s \le |Seq| \right\}.
+\tag{48}
+\]
+
+The difference between the cuboid-bias complexities and the modular complexities is an index of
+the geometrical variation of decomposed systems in the circle graph, which reflects the fluctuation of
+the sequence-wise system decompositionability. If the variation of the system decompositionability for
+each system decomposition is large, accordingly the modular complexities tend to give higher values
+than the cuboid-bias complexities.
+
+10. Numerical Comparison
+
+We numerically investigate the complementarity between the proposed complexities, Cc, C^R_c, Cm,
+and C^R_m. Since the minimum node number giving non-trivial meaning to these measures is n = 4, the
+corresponding dimension of parameter space is ∑_{k=1}^{n} nCk = 15. The constant-complexity submanifolds
+are therefore difficult to visualize due to the high dimensionality. For simplicity, we focus on the
+2-dimensional subspace of this parameter space whose first axis ranges from random to maximum
+dependencies of the system, and whose second axis represents the system decompositionability of
+< 1133 >.
+For this purpose, we introduce the following parameters α and β (0 ≤ α, β ≤ 1) in the η-coordinates
+of the discrete distribution with 4-dimensional binary stochastic variable:
+
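+\[
+\begin{aligned}
+\eta_{1} &= \eta_{0}, \quad \eta_{2} = \eta_{0}, \quad \eta_{3} = \eta_{0}, \quad \eta_{4} = \eta_{0}, \\
+\eta_{1,2} &= \eta_{1}\eta_{2} + \alpha(\eta_{0} - \epsilon - \eta_{1}\eta_{2}), \\
+\eta_{3,4} &= \eta_{3}\eta_{4} + \alpha(\eta_{0} - \epsilon - \eta_{3}\eta_{4}), \\
+\eta_{1,3} &= \eta_{1}\eta_{3} + \alpha\beta(\eta_{0} - \epsilon - \eta_{1}\eta_{3}), \\
+\eta_{1,4} &= \eta_{1}\eta_{4} + \alpha\beta(\eta_{0} - \epsilon - \eta_{1}\eta_{4}), \\
+\eta_{2,3} &= \eta_{2}\eta_{3} + \alpha\beta(\eta_{0} - \epsilon - \eta_{2}\eta_{3}), \\
+\eta_{2,4} &= \eta_{2}\eta_{4} + \alpha\beta(\eta_{0} - \epsilon - \eta_{2}\eta_{4}), \\
+\eta_{1,2,3} &= \eta_{1,2}\eta_{3} + \alpha\beta(\eta_{0} - 2\epsilon - \eta_{1,2}\eta_{3}), \\
+\eta_{1,2,4} &= \eta_{1,2}\eta_{4} + \alpha\beta(\eta_{0} - 2\epsilon - \eta_{1,2}\eta_{4}), \\
+\eta_{1,3,4} &= \eta_{1}\eta_{3,4} + \alpha\beta(\eta_{0} - 2\epsilon - \eta_{1}\eta_{3,4}), \\
+\eta_{2,3,4} &= \eta_{2}\eta_{3,4} + \alpha\beta(\eta_{0} - 2\epsilon - \eta_{2}\eta_{3,4}), \\
+\eta_{1,2,3,4} &= \eta_{1,2}\eta_{3,4} + \alpha\beta(\eta_{0} - 3\epsilon - \eta_{1,2}\eta_{3,4}),
+\end{aligned}
+\tag{49}
+\]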
+
+where α represents the degree of statistical association from random (α = 0) to maximum (α = 1),
+and β controls the system decompositionability of < 1133 >. If β = 1, the system has the maximum
+KL-divergence D[< 1111 >:< 1133 >] under the constraint of the α parameter, and β = 0 gives
+D[< 1111 >:< 1133 >] = 0.
+ϵ is the minimum value of the joint distribution of the 4-dimensional variable, which is defined to be
+greater than 0 to avoid a singularity in the dual-flat coordinates of the statistical manifold. The values
+$\epsilon = 1.0 \times 10^{-10}$ and $\eta_0 = 0.5$ were chosen for the calculation.
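+
+To make the parameterization concrete, the following worked evaluation uses the intermediate point α = β = 0.5, which is chosen here only for illustration and is not a value singled out in the text. With η0 = 0.5, terms of order ϵ are negligible at the displayed precision:
+
+```latex
+\[
+\begin{aligned}
+\eta_{1,2} &= \eta_0^2 + \alpha\,(\eta_0 - \epsilon - \eta_0^2)
+            = 0.25 + 0.5\,(0.25 - \epsilon) \approx 0.375, \\
+\eta_{1,3} &= \eta_0^2 + \alpha\beta\,(\eta_0 - \epsilon - \eta_0^2)
+            = 0.25 + 0.25\,(0.25 - \epsilon) \approx 0.3125, \\
+\eta_{1,2,3} &= \eta_{1,2}\,\eta_3 + \alpha\beta\,(\eta_0 - 2\epsilon - \eta_{1,2}\,\eta_3)
+            \approx 0.1875 + 0.25 \times 0.3125 \approx 0.2656 .
+\end{aligned}
+\]
+```
+
+The within-pair coordinate $\eta_{1,2}$ sits halfway between its independent value 0.25 and its maximum 0.5, while the cross-pair coordinate $\eta_{1,3}$ is pulled only a quarter of the way, which is how β weakens the association between the subsystems {1,2} and {3,4}.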
+
+The system with maximum statistical association under a given η0 corresponds to the α = β = 1
+condition in the given parameters, whose η-coordinates become as follows:
+
+$$
+\begin{aligned}
+\eta_1 &= \cdots = \eta_4 = \eta_0, \\
+\eta_{1,2} &= \cdots = \eta_{3,4} = \eta_0 - \epsilon, \\
+\eta_{1,2,3} &= \cdots = \eta_{2,3,4} = \eta_0 - 2\epsilon, \\
+\eta_{1,2,3,4} &= \eta_0 - 3\epsilon.
+\end{aligned}
+\tag{50}
+$$
+
+On the other hand, the totally decomposed system corresponds to the α = 0 condition, and the
+η-coordinates are:
+
+$$
+\begin{aligned}
+\eta_1 &= \cdots = \eta_4 = \eta_0, \\
+\eta_{1,2} &= \cdots = \eta_{3,4} = \eta_0^2, \\
+\eta_{1,2,3} &= \cdots = \eta_{2,3,4} = \eta_0^3, \\
+\eta_{1,2,3,4} &= \eta_0^4.
+\end{aligned}
+\tag{51}
+$$
+
+Note that the completely deterministic case η0 = 1.0 and α = β = 1 gives I = 0.
+The intuitive meaning of the parameters α and β is also schematically depicted in Figure 9
+(bottom right).
+
+Figure 9. Contour plots of the complexity landscape of I, $C_c$, $C_m$, $C_c^R$, and $C_m^R$ on the α-β plane. (a): Superposition
+of the contour plots of $C_c$ and $C_m$; (b): Superposition of the contour plots of $C_c^R$ and $C_m^R$; (c): Contour plot of I.
+The color of the contour plots corresponds to the color gradient of the 3D plots in Figure 10; (d): Schematic
+representation of the system in different regions of the α-β plane. Edge width represents the degree of edge
+information, and independence is depicted with a dotted line.
+
+Figure 10 shows the landscape of the proposed complexities on the α-β plane. Their contour plots
+are depicted in Figure 9. The proposed complexities differ from one another almost everywhere
+on the α-β plane, except at the intersection lines. Therefore, these measures serve as independent
+features of the system, each with its specific meaning with respect to the system decompositionability.
+The α-β plane shows a section of the actual structure of the complementarity between the proposed
+complexity measures expressed in Figure 3.
+The relations between the cuboid-bias complexities and the modular complexities in
+Equations (43)–(46) are also numerically confirmed. The modular complexities are no smaller
+than the corresponding cuboid-bias complexities, and the two coincide at the parameter α = β = 1, which gives
+the maximum values and dependencies in this parameterization.
+
+Figure 10. Landscape of the complexities I, $C_c$, $C_m$, $C_c^R$, and $C_m^R$ on the α-β plane. (a): Multi-information I; (b):
+Cuboid-bias complexity $C_c$; (c): Modular complexity $C_m$; (d): Regularized cuboid-bias complexity
+$C_c^R$; (e): Regularized modular complexity $C_m^R$. All complexity measures show the complementarity,
+intersecting with each other, and satisfy the boundary conditions of vanishing at α = 0 and β = 0, except the
+multi-information I. Note that the regularized complexities $C_c^R$ and $C_m^R$ show a singularity of convergence at
+α → 0 due to the regularization by an infinitesimal value.
+
+In the general case without the parameterization by α, β and η0, the boundary conditions of $C_c$, $C_c^R$,
+$C_m$ and $C_m^R$ include that of the multi-information I, which vanishes at the completely random or ordered
+state. This is common to other complexity measures such as the LMC complexity, and fits the basic
+intuition that complexity is situated equally far from the completely predictable and
+disordered states [21,22].
+The proposed complexities further incorporate boundary conditions that vanish with the existence
+of a completely independent subsystem of any size. This means that the $C_c$, $C_c^R$, $C_m$ and $C_m^R$ of a system
+become 0 if we add another independent variable. This property does not reflect the intuition of
+complexity defined by the arithmetic average of statistical measures. The proposed complexities
+are better understood in comparison with other complexity measures such as the multi-information
+I, and by interactively changing the system scale to avoid trivial results caused by a small independent
+subsystem. For example, the proposed complexities could be utilized as information criteria
+for model selection problems, especially with an approximate modular structure based on the
+statistical independence of data between subsystems. We maintain that the complementarity principle
+between plural complexity measures of different foundations is the key to understanding complexity
+in a comprehensive manner.
+To characterize the properties of $C_c$, $C_c^R$, $C_m$ and $C_m^R$ in relation to the diverse composition of each
+system decomposition, it is useful to consider the geometry of their contour structure, as compared
+in Figure 9. The contour can be formalized as $C_c$, $C_c^R$, $C_m$, $C_m^R$ = const. for each complexity measure,
+and D[< 11 · · · 1 >: SDi(is)] = const. (1 ≤ i ≤ n − 1, 1 ≤ is ≤ |Seq|) for each system decomposition.
+For that purpose, algebraic geometry can be considered a prominent tool. Algebraic
+geometry investigates the geometrical properties of polynomial equations [23]. The complexities $C_c$, $C_c^R$,
+$C_m$ and $C_m^R$ can be interpreted as polynomial functions by taking each system decomposition as a new
+coordinate, and are therefore directly accessible to algebraic geometry. However, if we want to investigate the
+contour of the complexities on the p parameter space, a logarithmic function appears in the definition of the
+KL-divergence; this is a transcendental function and lies outside the analytical scope of algebraic
+geometry. To introduce compatibility between the p parameter space of information geometry and
+algebraic geometry, it suffices to describe the model by replacing the logarithmic functions with another n
+variables such as q = log p, and to reconsider the intersection between the result from algebraic geometry
+on the coordinates (p, q) and the q = log p condition. The contour of $C_c$, $C_c^R$, $C_m$ and $C_m^R$ is also important
+for exploring the utility of these measures as a potential with which to interpret the dynamics of statistical association
+as geodesics.
+
+11. Further Consideration
+
+11.1. Pythagorean Relations in System Decomposition and Edge Cutting
+
+We further look back at the system decomposition and edge cutting in terms of the Pythagorean
+relation between KL-divergences, which is based on the orthogonality between the θ and η coordinates.
+In system decomposition, the distribution of the decomposed system is analytically obtained from
+the product of the subsystems’ η coordinates, which is equivalent to setting all θdec parameters to 0 in the mixture
+coordinates ξdec. From the consistency of the θdec parameters in ξdec being 0 in all system decompositions,
+we have the Pythagorean relation according to the inclusion relation of system decomposition. For
+example, the following holds:
+
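+$$
+\begin{aligned}
+D[\langle 1111 \rangle : \langle 1234 \rangle] &= D[\langle 1111 \rangle : \langle 1222 \rangle] \\
+&\quad + D[\langle 1222 \rangle : \langle 1233 \rangle] \\
+&\quad + D[\langle 1233 \rangle : \langle 1234 \rangle].
+\end{aligned}
+\tag{52}
+$$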
+
+The proof proceeds in the same way as for the k-cut coordinates isolating k-tuple statistical association between
+variables [14].
+On the other hand, the edge cutting previously defined using the product of the remaining maximum
+cliques’ η coordinates does not coincide with the θec = 0 condition in the mixture coordinates ξec. We
+have defined the ηec values of edge cutting based only on the orthogonal relation between η and θ
+coordinates, by generalizing the rule of system decomposition in ηec coordinates, and did not consider
+the Pythagorean relation between different edge cuttings.
+It is then possible to define another way of edge cutting using the θec = 0 condition in ξec. Indeed,
+in k-cut mixture coordinates, the θk+ = 0 condition is derived from the independence condition of the
+variables in all orders, and the k-tuple statistical association is measured by reestablishing the η parameters
+for the statistical association up to the (k − 1)-tuple order. In the same way, we can set the θdec = 0 condition for
+ξdec of a system decomposition, and reestablish edges with respect to the η parameters, except the one
+in focus for edge cutting.
+As a simple example, consider the system decomposition < 1222 > and the edge cutting 1 − 2 in
+a 4-node graph. We have the mixture coordinates ξdec for the system decomposition as follows:
+
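+$$
+\begin{aligned}
+\xi^{dec}_{1,2} &= \theta^{dec}_{1,2} = 0, \\
+\xi^{dec}_{1,3} &= \theta^{dec}_{1,3} = 0, \\
+\xi^{dec}_{1,4} &= \theta^{dec}_{1,4} = 0, \\
+\xi^{dec}_{1,2,3} &= \theta^{dec}_{1,2,3} = 0, \\
+\xi^{dec}_{1,3,4} &= \theta^{dec}_{1,3,4} = 0, \\
+\xi^{dec}_{1,2,3,4} &= \theta^{dec}_{1,2,3,4} = 0,
+\end{aligned}
+\tag{53}
+$$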
+
+where all the remaining ξdec coordinates are equal to the corresponding η coordinates.
+We then consider the new way of edge cutting 1 − 2 by recovering the statistical association in
+edges 1 − 3 and 1 − 4 from system decomposition < 1222 >, orthogonally to that of edge 1 − 2. The
+new mixture coordinate ξEC changes to the following:
+
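+$$
+\begin{aligned}
+\xi^{EC}_{1,2} &= \theta^{EC}_{1,2} = 0, \\
+\xi^{EC}_{1,3} &= \eta_{1,3}, \\
+\xi^{EC}_{1,4} &= \eta_{1,4}, \\
+\xi^{EC}_{1,2,3} &= \theta^{EC}_{1,2,3} = 0, \\
+\xi^{EC}_{1,3,4} &= \eta_{1,3,4}, \\
+\xi^{EC}_{1,2,3,4} &= \theta^{EC}_{1,2,3,4} = 0,
+\end{aligned}
+\tag{54}
+$$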
+
+and the rest is equivalent to that of η coordinates.
+This new ξEC is also compatible with k-cut coordinates formalization for its simple θEC = 0
+conditions. To obtain ξEC for arbitrary edge cutting i − j, one should take θEC containing i and j in
+its subscript, set them to 0, and combine with η coordinates for the rest of the subscript. For plural
+edge cuttings i − j, · · · , k − l (1 ≤ i, j, k, l ≤ n), it suffices to take θEC containing i and j, ... , k and l in
+its subscript respectively, then set them to 0.
+We finally obtain the Pythagorean relation between edge cuttings. Denoting the general edge
+cutting(s) coordinates as ξi−j,··· ,k−l, the following holds for the example of system decomposition
+< 1222 >:
+
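+$$
+\begin{aligned}
+D[\langle 1111 \rangle : \langle 1222 \rangle] &= D[\langle 1111 \rangle : p(\xi^{1-2})] \\
+&\quad + D[p(\xi^{1-2}) : p(\xi^{1-2,1-3})] \\
+&\quad + D[p(\xi^{1-2,1-3}) : p(\xi^{1-2,1-3,1-4})].
+\end{aligned}
+\tag{55}
+$$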
+
+Despite the consistency with the dual structure between θ and η, we do not generally have an
+analytical solution that determines the ηEC values from the θEC = 0 conditions. A
+numerical algorithm is required to solve the θEC = 0 conditions with respect to the ηEC values, which are in general
+high-degree simultaneous polynomials. Furthermore, the numerical convergence of the solution has to be
+very strict, since a tiny deviation from the conditions can become non-negligible after passing through the fractional
+and logarithmic functions of the θ coordinates.
+On the other hand, the previously defined edge cutting with ξec, which uses the product of the
+subgraphs’ η coordinates, is analytically simple and does not need to consider the recovery of the other edges
+from the system decomposition or the independence hypothesis. We therefore chose the previous way of edge
+cutting for both calculability and clarity of the concept.
+There have been many attempts to approximate complex networks by low-dimensional systems
+with the use of statistical physics and network theory. As a contemporary example, the moment-closure
+approximation provides various ways to abstract essential dynamics, e.g., in discrete adaptive
+networks [24]. Although the approximation relies on several theoretical assumptions, such as the random graph
+approximation, it is difficult to reproduce the dynamics quantitatively even in some of the simplest models.
+This is partly due to the homogeneous treatment of statistics, such as truncation at pair-wise order. The
+edge cutting can offer a complementary view on the evaluation of moment-closure approximations.
+Using the orthogonal decomposition between edge information, one can evaluate which part of the network
+links and which order of statistics contain essential information, which does not necessarily conform to a
+top-down theoretical treatment.
+
+11.2. Complexity of the Systems with Continuous Phase Space
+
+We have developed the concept of system decompositionability based on discrete binary variables.
+One can also apply the same principle to continuous variables.
+For an ergodic map G : X → X in a continuous space X, the KS entropy h(μ, G) is defined as the
+supremum of the entropy rate with respect to all possible system decompositions A, when the invariant
+measure μ exists:
+
+$$ h(\mu, G) = \sup_{A} h(\mu, G, A), \tag{56} $$
+
+where A is the disjoint decomposition of X that consists of non-trivial sets $a_i$, whose total number is
+n(A), defined as
+
+$$ X = \bigcup_{i=1}^{n(A)} a_i, \tag{57} $$
+
+$$ a_i \cap a_j = \emptyset, \quad i \neq j, \quad 1 \le i, j \le n(A), \tag{58} $$
+
+which is the natural extension of system decomposition to continuous space.
+The entropy rate h(μ, G, A) in Equation (56) is defined as
+
+$$ h(\mu, G, A) = \lim_{n \to \infty} \frac{1}{n} H(\mu, A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)), \tag{59} $$
+
+according to the entropy H(μ, A) based on the decomposition $A = \{a_i\}$,
+
+$$ H(\mu, A) = - \sum_{i=1}^{n(A)} \mu(a_i) \ln \mu(a_i), \tag{60} $$
+
+and the product C = A ∨ B as
+
+$$ C = A \vee B = \{ c_i = a_j \cap b_k \mid 1 \le j \le n(A),\ 1 \le k \le n(B) \}. \tag{61} $$
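+
+As a worked illustration of Equations (59)–(61), consider a standard textbook case that is not taken from this article: the doubling map $G(x) = 2x \bmod 1$ on $X = [0, 1)$ with the Lebesgue measure μ, and the two-set decomposition $A = \{[0, 1/2), [1/2, 1)\}$. Each further preimage halves every cell of the refined decomposition, so the entropy grows linearly in n:
+
+```latex
+\[
+\begin{aligned}
+A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)
+  &= \left\{ \left[ \tfrac{k}{2^{n}}, \tfrac{k+1}{2^{n}} \right) \;\middle|\; 0 \le k < 2^{n} \right\}, \\
+H(\mu, A \vee \cdots \vee G^{-n+1}(A))
+  &= - \sum_{k=0}^{2^{n}-1} 2^{-n} \ln 2^{-n} = n \ln 2, \\
+h(\mu, G, A)
+  &= \lim_{n \to \infty} \frac{1}{n}\, n \ln 2 = \ln 2 .
+\end{aligned}
+\]
+```
+
+The same refinement contains $2^n$ elements, so counting elements instead of weighting them by μ gives the same rate, $\frac{1}{n} \ln 2^n = \ln 2$, for this particular decomposition.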
+
+In a more general case, the topological entropy $h_T(G)$ is defined simply by the number of
+elements of the subsystems decomposed by preimages, as follows, requiring neither ergodicity
+nor the existence of an invariant measure μ:
+
+$$ h_T(G) = \sup_{A} \lim_{n \to \infty} \frac{1}{n} \ln n(A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)). \tag{62} $$
+
+Topological entropy takes the maximum value over the possible preimage divisions, in order to
+measure the complexity in terms of the mixing degree of the orbits. For example, if the KS entropy
+is positive, h(μ, G) > 0, the dynamics of G on an invariant set of the invariant measure μ is chaotic for
+almost all initial conditions. As for positive topological entropy, hT(G) > 0, the dynamics
+of G contains chaotic orbits, but not necessarily as an attracting chaotic invariant set, since hT(G) ≥ h(μ, G)
+and the KS entropy can still be zero.
+Although these definitions are useful to characterize the existence of chaotic dynamics, the system
+decompositionability is another property, representing a different aspect of the system complexity. It
+is rather a matter of the existence of independent dynamical components, or of the degree of orbit
+localization between arbitrary system decompositions. We propose the following “geometric topological
+entropy” $h_g(G)$, applying the same principle of taking the geometric product over all hierarchical
+structures of the system decomposition A:
+
+$$ h_g(G) := \prod_{\sigma(A) > 0} \lim_{n \to \infty} \frac{1}{n} \ln n(A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)), \tag{63} $$
+
+where σ(A) > 0 means taking all components of A that have positive Lebesgue measure on X.
+This gives 0 if the preimage of a certain ai ∈ A is ai itself, meaning that there exists a subsystem ai whose
+range is invariant under G, closed by itself. The system X can then be completely divided into ai and the
+rest. This corresponds to the existence of an independent subsystem in the cuboid-bias and modular
+complexities. In case such independent components do not exist, it still reflects the degree of orbit
+localization for all possible system decompositions in a multiplicative manner. The condition σ(A) > 0
+avoids trivial cases such as the existence of an unstable limit cycle, whose Lebesgue measure is 0.
+A typical example giving hg(G) = 0 is a map having independent ergodic components, such
+as the Chirikov-Taylor map with an appropriate parameter [25].
+
+12. Conclusions and Discussion
+
+We have theoretically developed a framework to measure the degree of statistical association
+existing between subsystems, as well as that represented by each edge of the graph representation.
+We then reconsidered the problem of how to define complexity measures in terms of the construction
+of a non-linear feature space. We defined a new type of complexity based on the geometric product of
+KL-divergences representing the degree of system decompositionability. Different complexity measures,
+including the newly proposed ones, are compared on a complementarity basis on the statistical manifold.
+Applications of the presented theory can encompass a wide range of complex systems and data science,
+such as social networks, genetic expression networks, neural activities, ecological databases, and any
+kind of complex network with binary co-occurrence matrix data, e.g., [26–29], databases: [30–34].
+Continuous variables are also accessible by appropriate discretization of the information source with, e.g.,
+the entropy maximization principle.
+In contrast to the arithmetic mean of information over the whole system, the geometric mean has not been
+investigated sufficiently in the analysis of complex networks. In a different field, however, theoretical
+ecology has already pointed out the importance of the geometric mean when considering the long-term
+fitness of a species population in a randomly varying environment [35,36]. Long-term fitness refers
+to the ecological complexity of its survival strategy under large stochastic fluctuation. Here, we can
+find a useful analogy between the growth rate of a population in ecology and the spatio-temporal
+propagation rate of information between subsystems in general. If we take an arbitrary subsystem
+and consider the amount of information it can exchange with all other subsystems, the proposed
+complexity measures with the geometric mean reflect the minimum amount among all possible
+other subsystems, which cannot be distinguished with the arithmetic mean. The propagation rate of a
+population in ecology and the information transmission in a complex network have mathematically
+analogous structures. In population ecology, the variance of the growth rate is crucial for evaluating the
+long-term survival of the population. Even if the arithmetic mean of the growth rate is high, a large variance
+will lead to a low geometric mean even with a small number of exceptionally low fitness situations,
+which ecologically means extinction of an entire species. In a stochastic network, the variance of system
+decompositionability is essential for evaluating the amount of information shared between subsystems, or
+information persistence in the entire network. Even if the multi-information I is high, a large heterogeneity
+of edge information can lead to informational isolation of a certain subsystem, which means extinction
+of its information. If such a subsystem is situated on the transmission pathway, information cannot
+propagate across these nodes. Therefore, the proposed complexity measures $C_c$, $C_c^R$, $C_m$ and $C_m^R$
+generally reflect the minimum information propagation rate spread over the entire system,
+without exception of isolated divisions.
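+
+The effect of a single bad episode on the two means can be seen in a small numerical illustration. The four per-period growth factors below are hypothetical values chosen for this sketch and are not data from the cited ecological studies:
+
+```latex
+\[
+\begin{aligned}
+(x_1, x_2, x_3, x_4) &= (1.5,\ 1.5,\ 1.5,\ 0.1), \\
+\text{arithmetic mean} &= \tfrac{1}{4}\,(1.5 + 1.5 + 1.5 + 0.1) = 1.15 > 1, \\
+\text{geometric mean} &= (1.5 \times 1.5 \times 1.5 \times 0.1)^{1/4} = 0.3375^{1/4} \approx 0.762 < 1 .
+\end{aligned}
+\]
+```
+
+The arithmetic mean suggests growth, whereas the population actually shrinks to about a third of its size over the four periods. In the same way, one nearly vanishing KL-divergence in the product drives the proposed complexities toward 0 even when the remaining factors are large.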
+Some recent studies on adaptive networks focus on the evolution of network topology in response
+to node activity, such as the game-theoretic evolution of strategies [37], opinion dynamics on an evolving
+network [38], epidemic spreading on an adaptive network [39], etc. The analysis of coevolving networks
+of variables and interactions can capture important dynamical features of complex systems. The
+newly proposed complexity measures can complement topological network analysis
+with an analysis of its statistical dynamics. In addition to the topological change of the network model (e.g., linking
+dynamics of game theory, opinion community network structure, contact network of epidemic
+transmission), one can evaluate the emergent statistical association between the variables, which does
+not necessarily coincide with the network topology. An interesting feature of non-linear dynamics is the
+unexpected correlation between distant variables, which is quantified by the Tsallis entropy [40]. The
+complementary relation between concrete interactions and the resulting statistical association can provide a
+twofold methodology to characterize the coevolutionary dynamics of adaptive networks. Such a strategy
+can promote integrated science from laboratory experiments to open-field in natura situations, where
+actual multi-scale problems remain to be solved [41].
+Arithmetic and geometric means can be integrated in a common formula called the generalized
+mean [42]. Therefore, the proposed complexity measures with the geometric mean of KL-divergences are
+an extension of preexisting complexity measures with mixture coordinates. Table 1 summarizes the
+generalization of complexity measures in this article. Based on the k-cut coordinates ζ, the weighted
+sum of KL-divergences representing the k-tuple order of statistical association yields complexity measures
+with a (weighted) arithmetic mean, such as the multi-information I and the TSE complexity. On the other hand,
+we showed that subsystem-wise correlation can also be isolated with the use of mixture coordinates,
+namely the < · · · >-cut coordinates ξ. To quantify the heterogeneity of system decompositionability, we
+generally took a weighted geometric mean of KL-divergences in $C_c$, $C_c^R$, $C_m$ and $C_m^R$. Here, the shortest
+path selection of $C_m$ and $C_m^R$, and the regularization of $C_c^R$ and $C_m^R$ with respect to the multi-information I,
+can be interpreted as the weight function of the geometric mean. This perspective leads to a definition
+of a generalized class of complexity measures based on the mixture coordinates and the generalized
+mean of KL-divergences. Information discrepancy can also be generalized from the KL-divergence to the
+Bregman divergence, providing access to the concept of multiple centroids in large stochastic data
+analysis such as image processing [43]. The blank cells of Table 1 imply the possibility of
+other complexity measures in this class. For example, the weighted geometric mean of KL-divergences
+defined between k-cut coordinates is expected to yield complexity measures that are sensitive to
+the heterogeneity of correlation orders. The weighted arithmetic mean of KL-divergences defined
+between < · · · >-cut coordinates should be sensitive to the mean decompositionability of an arbitrary
+subsystem. Since these measures take analytically different forms on the mixture coordinates and/or mean
+functions, their derivatives do not coincide, and they therefore give independent information about the system on
+the complementary basis on the statistical manifold, as long as the number of complexity measures is
+smaller than the number of degrees of freedom of the system.
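+
+For reference, the generalized (power) mean mentioned above can be written in its standard weighted form. The symbols $x_i$, $w_i$ and $p$ are notation introduced for this sketch, with the $x_i > 0$ standing for the KL-divergences being averaged:
+
+```latex
+\[
+M_p(x_1, \ldots, x_n) = \left( \sum_{i=1}^{n} w_i\, x_i^{\,p} \right)^{1/p},
+\qquad w_i \ge 0, \quad \sum_{i=1}^{n} w_i = 1,
+\]
+\[
+M_1 = \sum_{i=1}^{n} w_i\, x_i \quad \text{(weighted arithmetic mean)},
+\qquad
+\lim_{p \to 0} M_p = \prod_{i=1}^{n} x_i^{\,w_i} \quad \text{(weighted geometric mean)} .
+\]
+```
+
+Since $M_p \le M_q$ whenever $p \le q$, the geometric mean never exceeds the arithmetic mean of the same quantities, and the two coincide only when all $x_i$ carrying positive weight are equal.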
+
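+Table 1. Classification of complexity measures with KL-divergence on mixture coordinates.
+
+\begin{tabular}{l|l|l}
+Mixture coordinates & \multicolumn{2}{c}{Generalized mean of KL-divergence} \\
+ & Arithmetic mean & Geometric mean \\
+\hline
+$k$-cut $\boldsymbol{\zeta}$ & TSE complexity, $I$ & \\
+$\langle \cdots \rangle$-cut $\boldsymbol{\xi}$ & & $C_c$, $C_c^R$, $C_m$, $C_m^R$ \\
+\end{tabular}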
+
+Acknowledgments: This study was partially supported by CNRS, the long-term study abroad support program
+of the University of Tokyo, and the French government (Promotion Simone de Beauvoir).
+
+Conflicts of Interest: The author declares no conflict of interest.
+
+References
+
+1. Boccaletti, S.; Latora, V.; Moreno, Y.; Chavez, M.; Hwang, D.U. Complex Networks: Structure and
+Dynamics. Phys. Rep. 2006, 424, 175–308.
+2. Strogatz, S.H. Exploring Complex Networks. Nature 2001, 410, 268–276.
+3. Wasserman, S.; Faust, K. Social Network Analysis; Cambridge University Press: Cambridge, UK, 1994.
+4. Funabashi, M.; Cointet, J.P.; Chavalarias, D. Complex Network. In Studies in Computational Intelligence;
+Springer: Berlin/Heidelberg, Germany, 2009; Volume 207, pp. 161–172.
+5. Badii, R.; Politi, A. Complexity: Hierarchical Structures and Scaling in Physics; Cambridge University Press:
+Cambridge, UK, 2008.
+6. Lempel, A.; Ziv, J. On the Complexity of Finite Sequences. IEEE Trans. Inf. Theory 1976, 22, 75–81.
+7. Li, M.; Vitanyi, P. Texts in Computer Science. In An Introduction to Kolmogorov Complexity and Its Applications,
+2nd ed.; Springer: Berlin/Heidelberg, Germany, 1997.
+8. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: New York, NY, USA, 2006.
+9. Bennett, C. On the Nature and Origin of Complexity in Discrete, Homogeneous, Locally-Interacting Systems.
+Found. Phys. 1986, 16, 585–592.
+10. Grassberger, P. Toward a Quantitative Theory of Self-Generated Complexity. Int. J. Theor. Phys. 1986,
+25, 907–938.
+11. Crutchfield, J.P.; Feldman, D.P. Regularities Unseen, Randomness Observed: The Entropy Convergence
+Hierarchy. Chaos 2003, 15, 25–54.
+12. Crutchfield, J.P. Inferring Statistical Complexity. Phys. Rev. Lett. 1989, 63, 105–108.
+13. Prichard, D.; Theiler, J. Generalized Redundancies for Time Series Analysis. Physica D 1995, 84, 476–493.
+14. Amari, S. Information Geometry on Hierarchy of Probability Distributions. IEEE Trans. Inf. Theory 2001,
+47, 1701–1711.
+15. Ay, N.; Olbrich, E.; Bertschinger, N.; Jost, J. A Unifying Framework for Complexity Measures of Finite Systems;
+Report 06-08-028; Santa Fe Institute: Santa Fe, NM, USA, 2006.
+16. MacKay, R.S. Nonlinearity in Complexity Science. Nonlinearity 2008, 21, T273–T281.
+17. Tononi, G.; Sporns, O.; Edelman, M. A Measure for Brain Complexity: Relating Functional Segregation and
+Integration in the Nervous System. Proc. Natl. Acad. Sci. USA 1994, 91, 5033.
+18. Feldman, D.P.; Crutchfield, J.P. Measures of Statistical Complexity: Why? Phys. Lett. A 1998, 238, 244–252.
+19. Nakahara, H.; Amari, S. Information-Geometric Measure for Neural Spikes. Neural Comput. 2002, 14, 2269–2316.
+20. Olbrich, E.; Bertschinger, N.; Ay, N.; Jost, J. How Should Complexity Scale with System Size? Eur. Phys. J. B
+2008, 63, 407–415.
+21. Feldman, D.P.; Crutchfield, J.P. Measures of Statistical Complexity: Why? Phys. Lett. A 1998, 238, 244–252.
+22. Lopez-Ruiz, R.; Mancini, H.; Calbet, X. A Statistical Measure of Complexity. Phys. Lett. A 1995, 209, 321–326.
+23. Hodge, W.; Pedoe, D. Methods of Algebraic Geometry; Cambridge Mathematical Library, Cambridge University
+Press: Cambridge, UK, 1994; Volume 1–3.
+24. Demirel, G.; Vazquez, F.; Bohme, G.; Gross, T. Moment-closure Approximations for Discrete Adaptive
+Networks. Physica D 2014, 267, 68–80.
+25. Fraser, G., Ed. The New Physics for the Twenty-First Century; Cambridge University Press: Cambridge, UK,
+2006; p. 335.
+26. Scott, J. Social Network Analysis: A Handbook; SAGE Publications Ltd.: London, UK, 2000.
+27. Geier, F.; Timmer, J.; Fleck, C. Reconstructing Gene-Regulatory Networks from Time Series, Knock-Out Data,
+and Prior Knowledge. BMC Syst. Biol. 2007, 1, doi:10.1186/1752-0509-1-11.
+28. Brown, E.N.; Kass, R.E.; Mitra, P.P. Multiple Neural Spike Train Data Analysis: State-of-the-Art and Future
+Challenges. Nat. Neurosci. 2004, 7, 456–461.
+29. Yee, T.W. The Analysis of Binary Data in Quantitative Plant Ecology. Ph.D. Thesis, The University of
+Auckland, New Zealand, 1993.
+30. Stanford Large Network Dataset Collection. Available online: http://snap.stanford.edu/data/ (accessed on
+19 July 2014).
+31. BioGRID. Available online: http://thebiogrid.org/ (accessed on 19 July 2014).
+32. Neuroscience Information Framework. Available online: http://www.neuinfo.org/ (accessed on
+19 July 2014).
+33. Global Biodiversity Information Facility. Available online: http://www.gbif.org/ (accessed on 19 July 2014).
+34. UCI Network Data Repository. Available online: http://networkdata.ics.uci.edu/index.php (accessed on
+19 July 2014).
+35. Lewontin, R.C.; Cohen, D. On Population Growth in a Randomly Varying Environment. Proc. Natl. Acad.
+Sci. USA 1969, 62, 1056–1060.
+36. Yoshimura, J.; Clark, C.W. Individual Adaptations in Stochastic Environments. Evol. Ecol. 1991, 5, 173–192.
+37. Wu, B.; Zhou, D.; Wang, L. Evolutionary Dynamics on Stochastic Evolving Networks for Multiple-Strategy
+Games. Phys. Rev. E 2011, 84, 046111.
+38. Fu, F.; Wang, L. Coevolutionary Dynamics of Opinions and Networks: From Diversity to Uniformity.
+Phys. Rev. E 2008, 78, 016104.
+39. Gross, T.; D’Lima, C.J.D.; Blasius, B. Epidemic Dynamics on an Adaptive Network. Phys. Rev. Lett.
+2006, 96, 208701.
+40. Tsallis, C. Possible Generalization of Boltzmann-Gibbs Statistics. J. Stat. Phys. 1988, 52, 479–487.
+41. Quintana-Murci, L.; Alcais, A.; Abel, L.; Casanova, J.L. Immunology in natura: Clinical, Epidemiological
+and Evolutionary Genetics of Infectious Diseases. Nat. Immunol. 2007, 8, 1165–1171.
+42. Hardy, G.; Littlewood, J.; Polya, G. Inequalities; Cambridge University Press: Cambridge, UK, 1967; Chapter 3.
+43. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904.
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+Article
+The Entropy-Based Quantum Metric
+
+Roger Balian
+
+Institut de Physique Théorique, CEA/Saclay, F-91191 Gif-sur-Yvette Cedex, France;
+E-Mail: roger@balian.fr
+
+Received: 15 May 2014; in revised form: 25 June 2014 / Accepted: 11 July 2014 /
+Published: 15 July 2014
+
+Abstract: The von Neumann entropy $S(\hat{D})$ generates in the space of quantum density matrices
+$\hat{D}$ the Riemannian metric $ds^2 = -d^2 S(\hat{D})$, which is physically founded and which characterises
+the amount of quantum information lost by mixing $\hat{D}$ and $\hat{D} + d\hat{D}$. A rich geometric structure is
+thereby implemented in quantum mechanics. It includes a canonical mapping between the spaces
+of states and of observables, which involves the Legendre transform of $S(\hat{D})$. The Kubo scalar
+product is recovered within the space of observables. Applications are given to equilibrium and non
+equilibrium quantum statistical mechanics. There the formalism is specialised to the relevant space of
+observables and to the associated reduced states issued from the maximum entropy criterion, which
+result from the exact states through an orthogonal projection. Von Neumann’s entropy specialises
+into a relevant entropy. Comparison is made with other metrics. The Riemannian properties of the
+metric $ds^2 = -d^2 S(\hat{D})$ are derived. The curvature arises from the non-Abelian nature of quantum
+mechanics; its general expression and its explicit form for q-bits are given, as well as geodesics.
+
+Keywords: quantum entropy; metric; q-bit; information; geometry; geodesics; relevant entropy
+
+1. A Physical Metric for Quantum States
+
+Quantum physical quantities pertaining to a given system, termed “observables” $\hat{O}$, behave
+as non-commutative random variables and are elements of a C*-algebra. We will consider below
+systems for which these observables can be represented by n-dimensional Hermitean matrices in a
+finite-dimensional Hilbert space H. In quantum (statistical) mechanics, the “state” of such a system
+encompasses the expectation values of all its observables [1]. It is represented by a density matrix $\hat{D}$,
+which plays the rôle of a probability distribution, and from which one can derive the expectation value
+of $\hat{O}$ in the form
+
+$$ \langle \hat{O} \rangle = \mathrm{Tr}\, \hat{D} \hat{O} = (\hat{D}; \hat{O}) . \tag{1} $$
+
+Density matrices should be Hermitean ($\langle \hat{O} \rangle$ is real for $\hat{O} = \hat{O}^\dagger$), normalised (the expectation
+value of the unit observable is $\mathrm{Tr}\, \hat{D} = 1$) and non-negative (variances $\langle \hat{O}^2 \rangle - \langle \hat{O} \rangle^2$ are
+non-negative). They depend on $n^2 - 1$ real parameters. If we keep aside the multiplicative structure of
+the set of operators and focus on their linear vector space structure, Equation (1) appears as a linear
+mapping of the space of observables onto real numbers. We can therefore regard the observables and
+the density operators $\hat{D}$ as elements of two dual vector spaces, and expectation values (1) appear as
+scalar products.
+It is of interest to define a metric in the space of states. For instance, the distance between an
+exact state $\hat{D}$ and an approximation $\hat{D}_{app}$ would then characterise the quality of this approximation.
+However, all physical quantities come out in the form (1), which lies astride the two dual spaces of
+observables and states. In order to build a metric having physical relevance, we need to rely on another
+meaningful quantity which pertains only to the space of states.
+
+Entropy 2014, 16, 3878–3888; doi:10.3390/e16073878
+www.mdpi.com/journal/entropy
+
+We note at this point that quantum states are probabilistic objects that gather information about
+the considered system. Then, the amount of missing information is measured by von Neumann’s
+entropy
+S( ˆD) ≡ − Tr ˆD ln ˆD .
+(2)
+
+Introduced in the context of quantum measurements, this quantity is identified with the
+thermodynamic entropy when ˆD is an equilibrium state. In non-equilibrium statistical mechanics,
+it encompasses, in the form of “relevant entropy” (see Section 5 below), various entropies defined
+through the maximum entropy criterion. It is also introduced in quantum computation. Alternative
+entropies have been introduced in the literature, but they do not present all the distinctive and natural
+features of von Neumann’s entropy, such as additivity and concavity.
+As S( ˆD) is a concave function, and as it is the sole physically meaningful quantity apart from
+expectation values, it is natural to rely on it for our purpose. We thus define [2] the distance ds between
+two neighbouring density matrices ˆD and ˆD + d ˆD as the square root of
+
+ds2 = −d2S( ˆD) = Tr d ˆDd ln ˆD .
+(3)
+
+This Riemannian metric is of the Hessian form since the metric tensor is generated by taking second
+derivatives of the function S( ˆD) with respect to the n2 − 1 coordinates of ˆD. We may take for such
+coordinates the real and imaginary parts of the matrix elements, or equivalently (Section 6) some linear
+transform of these (keeping aside the norm Tr ˆD = 1).
+
+2. Interpretation in the Context of Quantum Information
+
+The simplest example, related to quantum information theory, is that of a q-bit (two-level system
+or spin 1/2) for which n = 2. Its states, represented by 2 × 2 Hermitean normalised density matrices ˆD,
+can conveniently be parameterised, on the basis of Pauli matrices, by the components rμ = D12 + D21,
+i(D12 − D21), D11 − D22 (μ = 1, 2, 3) of a 3-dimensional vector r lying within the unit Poincaré–Bloch
+sphere (r ≤ 1). From the corresponding entropy
+
+\[ S = \frac{1+r}{2}\ln\frac{2}{1+r} + \frac{1-r}{2}\ln\frac{2}{1-r} , \tag{4} \]
+
+we derive the metric
+
+\[ ds^2 = \frac{1}{1-r^2}\left(\frac{\mathbf{r}\cdot d\mathbf{r}}{r}\right)^2 + \frac{1}{2r}\ln\frac{1+r}{1-r}\,\left|\frac{\mathbf{r}\times d\mathbf{r}}{r}\right|^2 , \tag{5} \]
+
+which is a natural Riemannian metric for q-bits, or more generally for positive 2 × 2 matrices. The
+metric tensor characterizing (5) diverges in the vicinity of pure states r = 1, due to the singularity of
+the entropy (2) for vanishing eigenvalues of ˆD. However, the distance between two arbitrary (even
+pure) states ˆD′ and ˆD′′ measured along a geodesic is always finite. We shall see (Equation (29)) that
+for n = 2 the geodesic distance s between two neighbouring pure states ˆD′ and ˆD′′, represented by
+unit vectors r′ and r′′ making a small angle δϕ ∼ |r′ − r′′|, behaves as δs2 ∼ δϕ2 ln(4√π/δϕ). The
+singularity of the metric tensor manifests itself through this logarithmic factor.
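+As a sketch of how (5) follows from (4) (the shorthand $\chi \equiv \tanh^{-1} r$ is used here only for compactness; it is the quantity that appears in Section 7), note that S depends on $\mathbf{r}$ only through $r = |\mathbf{r}|$, with
+
+\[ \frac{dS}{dr} = -\frac12\ln\frac{1+r}{1-r} = -\chi , \qquad -\frac{\partial^2 S}{\partial r_\mu\,\partial r_\nu} = \frac{\chi}{r}\,\delta_{\mu\nu} + \frac{1}{r}\frac{d}{dr}\!\left(\frac{\chi}{r}\right) r_\mu r_\nu . \]
+
+Splitting $d\mathbf{r}$ into its radial part $(\mathbf{r}\cdot d\mathbf{r})/r$ and its tangential part $|\mathbf{r}\times d\mathbf{r}|/r$, the radial coefficient is $\chi/r + r\,\frac{d}{dr}(\chi/r) = d\chi/dr = 1/(1-r^2)$ and the tangential one is $\chi/r = \frac{1}{2r}\ln\frac{1+r}{1-r}$, which are the two terms of (5).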
+Identifying von Neumann’s entropy to a measure of missing information, we can give a simple
+interpretation to the distance between two states. Indeed, the concavity of entropy expresses that some
+information is lost when two statistical ensembles described by different density operators merge. By
+mixing two equal size populations described by the neighbouring distributions $\hat D' = \hat D + \tfrac12\delta\hat D$ and
+$\hat D'' = \hat D - \tfrac12\delta\hat D$ separated by a distance $\delta s$, we lose an amount of information given by
+
+\[ \Delta S \equiv S(\hat D) - \frac{S(\hat D') + S(\hat D'')}{2} \sim \frac{\delta s^2}{8} , \tag{6} \]
+
+and thereby directly related to the distance δs defined by (3). The proof of this equivalence relies on
+the expansion of the entropies S( ˆD′) and S( ˆD′′) around ˆD, and is valid when Tr δ ˆD2 is negligible
+compared to the smallest eigenvalue of ˆD. If ˆD′ and ˆD′′ are distant, the quantity 8ΔS cannot be
+regarded as the square of a distance that would be generated by a local metric. The equivalence (6) for
+neighbouring states shows that ds2 is the metric that is the best suited to measure losses of information
+by mixing.
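+A short way to see where the factor 1/8 in (6) comes from, sketched under the stated assumption that $\delta\hat D$ is small enough for a second-order expansion to hold: expanding around $\hat D$,
+
+\[ S(\hat D \pm \tfrac12\delta\hat D) \simeq S(\hat D) \pm \tfrac12\, dS + \tfrac12\cdot\tfrac14\, d^2S , \]
+
+where $dS$ and $d^2S$ denote the first and second differentials of S evaluated for the shift $\delta\hat D$. The linear terms cancel in the average of the two entropies, so that $\Delta S \simeq -\tfrac18\, d^2S = \tfrac18\,\delta s^2$ by the definition (3).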
+The singularity of δs2 at the edge of the positivity domain of ˆD may suggest that the result (6)
+holds only within this domain. In fact, this equivalence remains nearly valid even in the limit of
+pure states because ΔS itself involves a similar singularity. Indeed, if the states ˆD′ = |ψ′ >< ψ′|
+and ˆD′′ = |ψ′′ >< ψ′′| are pure and close to each other, the loss of information ΔS behaves as
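+$8\Delta S \sim \delta\varphi^2\ln(4/\delta\varphi)$ where $\delta\varphi^2 \sim 2\,\mathrm{Tr}\,\delta\hat D^2$. This result should be compared to various geodesic
+distances between pure quantum states, which behave as $\delta s^2 \sim \delta\varphi^2\ln(4\sqrt{\pi}/\delta\varphi)$ for the present metric,
+and as $\delta s^2_{\rm BH} = 4\,\delta s^2_{\rm FS} \sim \delta\varphi^2 \sim \mathrm{Tr}(\hat D' - \hat D'')^2$ for the Bures–Helstrom and the quantum Fubini–Study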
+metrics, respectively (see Section 7; these behaviours hold not only for n = 2 but for arbitrary n since
+only the space spanned by |ψ′ > and |ψ′′ > is involved). Thus, among these metrics, only ds2 = −d2S
+can be interpreted in terms of information loss, whether the states ˆD′ and ˆD′′ are pure or mixed.
+At the other extreme, around the most disordered state ˆD = ˆI/n, in the region ∥ n ˆD − ˆI ∥≪ 1,
+the metric becomes Euclidean since ds2 = Tr d ˆDd ln ˆD ∼ n Tr(d ˆD)2 (for n = 2, ds2 = dr2). For a given
+shift d ˆD, the qualitative change of a state ˆD, as measured by the distance ds, gets larger and larger as
+the state ˆD becomes purer and purer, that is, when the information contents of ˆD increases.
+
+3. Geometry of Quantum Statistical Mechanics
+
+A rich geometric structure is generated for both states and observables by von Neumann’s
+entropy through introduction of the metric ds2 = −d2S. Now, this metric (3) supplements the
+algebraic structure of the set of observables and the above duality between the vector spaces of states
+and of observables, with scalar product (1). Accordingly, we can define naturally within the space of
+states scalar products, geodesics, angles, curvatures.
+We can also regard the coordinates of d ˆD and d ln ˆD as covariant and contravariant components
+of the same infinitesimal vector (Section 6). To this aim, let us introduce the mapping
+
+\[ \hat D \equiv \frac{e^{\hat X}}{\mathrm{Tr}\, e^{\hat X}} \tag{7} \]
+
+between ˆD in the space of states and ˆX in the space of observables. The operator ˆX appears as a
+parameterisation of ˆD. (The normalisation of ˆD entails that ˆX, defined within an arbitrary additive
+constant operator X0 ˆI, also depends on n2 − 1 independent real parameters.) The metric (3) can then
+be re-expressed in terms of ˆX in the form
+
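+\[ ds^2 = \mathrm{Tr}\, d\hat D\, d\hat X = \mathrm{Tr}\int_0^1 d\xi\, \hat D\, e^{-\xi\hat X} d\hat X\, e^{\xi\hat X} d\hat X - \left(\mathrm{Tr}\,\hat D\, d\hat X\right)^2 = d^2\ln\mathrm{Tr}\, e^{\hat X} = d^2F , \tag{8} \]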
+
+where we introduced the function
+F( ˆX) ≡ ln Tr e ˆX
+(9)
+
+of the observable ˆX. (The addition of X0 ˆI to ˆX results in the addition of the irrelevant constant X0 to F.)
+This mapping provides us with a natural metric in the space of observables, from which we recover
+the scalar product between d ˆX1 and d ˆX2 in the form of a Kubo correlation in the state ˆD. The metric
+(8) has been quoted in the literature under the names of Bogoliubov–Kubo–Mori.
+
+4. Covariance and Legendre Transformation
+
+We can recover the above geometric mapping (7) between ˆD and ˆX, or between the covariant
+and contravariant coordinates of d ˆD, as the outcome of a Legendre transformation, by considering
+the function F( ˆX). Taking its differential dF = Tr e ˆXd ˆX/ Tr e ˆX, we identify the partial derivatives
+of F( ˆX) with the coordinates of the state ˆD = e ˆX/ Tr e ˆX, so that ˆD appears as conjugate to ˆX in the
+sense of Legendre transformations. Expressing then ˆX as function of ˆD and inserting into F − Tr ˆD ˆX,
+we recognise that the Legendre transform of F( ˆX) is von Neumann’s entropy F − Tr ˆD ˆX = S( ˆD) =
+− Tr ˆD ln ˆD. The conjugation between ˆD and ˆX is embedded in the equations
+
+\[ dF = \mathrm{Tr}\,\hat D\, d\hat X ; \qquad dS = -\mathrm{Tr}\,\hat X\, d\hat D . \tag{10} \]
+
+Legendre transformations are currently used in equilibrium thermodynamics. Let us show that
+they come out in this context directly as a special case of the present general formalism. The entropy of
+thermodynamics is a function of the extensive variables, energy, volume, particle numbers, etc. Let us
+focus for illustration on the energy U, keeping the other extensive variables fixed. The thermodynamic
+entropy S(U), a function of the single variable U, generates the inverse temperature as β = ∂S/∂U.
+Its Legendre transform is the Massieu potential F(β) = S − βU. In order to compare these properties
+with the present formalism, we recall how thermodynamics comes out in the framework of statistical
+mechanics. The thermodynamic entropy S(U) is identified with the von Neumann entropy (2) of the
+Boltzmann–Gibbs canonical equilibrium state ˆD, and the internal energy with U = Tr ˆD ˆH. In the
+relation (7), the operator ˆX reads ˆX = −β ˆH (within an irrelevant additive constant). By letting U or
+β vary, we select within the spaces of states and of observables a one-dimensional subset. In these
+restricted subsets, ˆD is parameterised by the single coordinate U, and the corresponding ˆX by the
+coordinate −β.
+By specialising the general relations (10) to these subsets, we recover the thermodynamic relations
+dF = −Udβ and dS = βdU. We also recover, by restricting the metric (3) or (8) to these subsets, the
+current thermodynamic metric ds2 = −(∂2S/∂U2)dU2 = −dUdβ.
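+As an illustration (a hypothetical two-level system with energies 0 and $\varepsilon$, introduced here only as an example and not taken from the text), the Massieu potential and the internal energy are
+
+\[ F(\beta) = \ln\left(1 + e^{-\beta\varepsilon}\right) , \qquad U = -\frac{\partial F}{\partial\beta} = \frac{\varepsilon}{e^{\beta\varepsilon} + 1} , \]
+
+and the thermodynamic metric becomes
+
+\[ ds^2 = -dU\, d\beta = \frac{\partial^2 F}{\partial\beta^2}\, d\beta^2 = \frac{\varepsilon^2}{4\cosh^2(\beta\varepsilon/2)}\, d\beta^2 , \]
+
+that is, the variance of the energy times $d\beta^2$, in agreement with the form $d^2F$ of (8).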
+More generally, we can consider the Boltzmann–Gibbs states of equilibrium statistical mechanics
+as the points of a manifold embedded in the full space of states. The thermodynamic extensive
+variables, which parameterise these states, are the expectation values of the conserved macroscopic
+observables, that is, they are a subset of the expectation values (1) which parameterise arbitrary
+density operators. Then the standard geometric structure of thermodynamics simply results from the
+restriction of the general metric (3) to this manifold of Boltzmann–Gibbs states. The commutation of
+the conserved observables simplifies the reduced thermodynamic metric, which presents the same
+features as a Fisher metric (see Section 6).
+
+5. Relevant Entropy and Geometry of the Projection Method
+
+The above ideas also extend to non-equilibrium quantum statistical mechanics [2–4]. When
+introducing the metric (3), we indicated that it may be used to estimate the quality of an approximation.
+Let us illustrate this point with the Nakajima–Zwanzig–Mori–Robertson projection method, best
+introduced through maximum entropy. Consider some set { ˆAk} of “relevant observables”, whose
+time-dependent expectation values ak ≡ < ˆAk > = Tr ˆD ˆAk we wish to follow, discarding all other
+variables. The exact state ˆD encodes the variables {ak} that we are interested in, but also the expectation
+values (1) of the other observables that we wish to eliminate. This elimination is performed by
+associating at each time with ˆD a “reduced state” ˆDR which is equivalent to ˆD as regards the set
+ak = Tr ˆDR ˆAk, but which provides no more information than the values {ak}. The former condition
+provides the constraints < ˆAk > = ak, and the latter condition is implemented by means of the
+maximum entropy criterion: one expresses that, within the set of density matrices compatible with
+these constraints, ˆDR is the one which maximises von Neumann’s entropy (2), that is, which contains
+solely the information about the relevant variables ak. The least biased state ˆDR thus defined has the
+form ˆDR = e ˆXR/ Tr e ˆXR, where ˆXR ≡ ∑k λk ˆAk involves the time-dependent Lagrange multipliers λk,
+which are related to the set ak through Tr ˆDR ˆAk = ak.
+
+The von Neumann entropy S( ˆDR) ≡ SR{ak} of this reduced state ˆDR is called the “relevant
+entropy” associated with the considered relevant observables ˆAk. It measures the amount of missing
+information, when only the values {ak} of the relevant variables are given. During its evolution, ˆD
+keeps track of the initial information about all the variables < ˆO > and its entropy S( ˆD) remains
+constant in time. It is therefore smaller than the relevant entropy S( ˆDR) which accounts for the
+loss of information about the irrelevant variables. Depending on the choice of relevant observables
+{ ˆAk}, the corresponding relevant entropies SR{ak} encompass various current entropies, such as the
+non-equilibrium thermodynamic entropy or Boltzmann’s H-entropy.
+The same structure as the one introduced above for the full spaces of observables and states is
+recovered in this context. Here, for arbitrary values of the parameters λk, the exponents ˆXR = ∑k λk ˆAk
+constitute a subspace of the full vector space of observables, and the parameters {λk} appear as the
+coordinates of ˆXR on the basis { ˆAk}. The corresponding states ˆDR, parameterised by the set {ak},
+constitute a subset of the space of states, the manifold R of “reduced states”. (Note that this manifold is
+not a hyperplane, contrary to the space of relevant observables; it is embedded in the full vector space
+of states, but does not constitute a subspace.) By regarding SR{ak} as a function of the coordinates {ak},
+we can define a metric ds2 = −d2SR{ak} on the manifold R, which is the restriction of the metric (3).
+Its alternative expression ds2 = ∑k dakdλk = d2FR{λk}, where FR{λk} ≡ ln Tr exp ∑k λk ˆAk, is a
+restriction of (8). The correspondence between the two parameterisations {ak} and {λk} is again
+implemented by the Legendre transformation which relates SR{ak} and FR{λk}.
+The projection method relies on the mapping ˆD ↦ ˆDR which associates ˆDR to ˆD. It consists
+in replacing the Liouville–von Neumann equation of motion for ˆD by the corresponding dynamical
+equation for ˆDR on the manifold R, or equivalently for the coordinates {ak} or for the coordinates {λk},
+a programme that is in practice achieved through some approximations. This mapping is obviously
+a projection in the sense that ˆD ↦ ˆDR ↦ ˆDR, but moreover the introduction of the metric (3) shows
+that the vector ˆD − ˆDR in the space of states is perpendicular to the manifold R at the point ˆDR.
+This property is readily shown by writing, in this metric, the scalar product Tr d ˆD d ˆX′ of the vector
+d ˆD = ˆD − ˆDR by an arbitrary vector d ˆD′ in the tangent plane of R. The latter is conjugate to any
+combination d ˆX′ of observables ˆAk, and this scalar product vanishes because Tr ˆD ˆAk = Tr ˆDR ˆAk. Thus
+the mapping ˆD ↦ ˆDR appears as an orthogonal projection, so that the relevant state ˆDR associated
+with ˆD may be regarded as its best possible approximation on the manifold R.
+
+6. Properties of the Metric
+
+The metric tensor can be evaluated explicitly in a basis where the matrix ˆD is diagonal. Denoting
+by Di its eigenvalues and by dDij the matrix elements of its variations, we obtain from (3)
+
+\[ ds^2 = \mathrm{Tr}\int_0^\infty d\xi \left(\frac{d\hat D}{\hat D + \xi}\right)^2 = \sum_{ij}\frac{\ln D_i - \ln D_j}{D_i - D_j}\, dD_{ij}\, dD_{ji} . \tag{11} \]
+
+(For Di = Dj, whether or not i = j, the ratio is defined as 1/Di by continuity.) In the same basis, the
+form (8) of the metric reads
+
+\[ ds^2 = \frac{1}{Z}\sum_{ij}\frac{e^{X_i} - e^{X_j}}{X_i - X_j}\, dX_{ij}\, dX_{ji} - \left(\frac{\sum_i e^{X_i}\, dX_{ii}}{Z}\right)^2 , \tag{12} \]
+
+with $Z = \sum_i e^{X_i}$. (For $X_i = X_j$, the ratio is $e^{X_i}$.) The singularity of the metric (11) in the vicinity of
+vanishing eigenvalues of ˆD, in particular near pure states (end of Section 2), is not apparent in the
+representation (12) of this metric, because the mapping from ˆD to ˆX sends the eigenvalue Xi to −∞
+when Di tends to zero.
+Let us compare the expression (11) with the corresponding classical metric, which is obtained
+by starting from Shannon’s entropy instead of von Neumann’s entropy. For discrete probabilities pi,
+we have then $S\{p_i\} = -\sum_i p_i\ln p_i$ and hence the same definition $ds^2 = -d^2S\{p_i\}$ as above of an
+entropy-based metric yields $ds^2 = \sum_i dp_i^2/p_i$, which is identified with the Fisher information metric.
+The present metric thus appears as the extension to quantum statistical mechanics of the Fisher metric
+when the latter is interpreted in terms of entropy. In fact, the terms of (11) which involve the diagonal
+elements i = j of the variations $d\hat D$ reduce to $dD_{ii}^2/D_i$. This result was expected since density matrices
+behave as probability distributions if both $\hat D$ and $d\hat D$ are diagonal.
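+The classical result can be checked directly, as a brief sketch: from $S\{p_i\} = -\sum_i p_i\ln p_i$ one finds
+
+\[ \frac{\partial S}{\partial p_i} = -\ln p_i - 1 , \qquad \frac{\partial^2 S}{\partial p_i\,\partial p_j} = -\frac{\delta_{ij}}{p_i} , \qquad -d^2S = \sum_i\frac{dp_i^2}{p_i} . \]
+
+For instance, for a binary distribution $(p, 1-p)$ with $dp_1 = -dp_2 = dp$, this gives $ds^2 = dp^2/[p(1-p)]$.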
+Let us more generally consider in (11), instead of solely diagonal variations $dD_{ii}$, variations $dD_{ij}$
+with indices i and j such that $|D_i - D_j| \ll D_i + D_j$. The expansion of $D_i$ and $D_j$ around $\tfrac12(D_i + D_j)$
+in the corresponding ratios of (11) yields $(\ln D_i - \ln D_j)/(D_i - D_j) \sim 2/(D_i + D_j)$. The considered
+terms of (11) are therefore the same as in the Bures–Helstrom metric
+
+\[ ds^2_{\rm BH} = \sum_{ij}\frac{2}{D_i + D_j}\, dD_{ij}\, dD_{ji} , \tag{13} \]
+
+introduced long ago as an extension to matrices of the Fisher metric [5]. We thus recover this
+Bures–Helstrom metric as an approximation of the present entropy-based metric ds2 = −d2S( ˆD).
+For n = 2, $ds^2_{\rm BH}$ is obtained from the expression (5) of $ds^2$ by omitting the factor $\tanh^{-1} r / r$ entering
+the second term.
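+To make the size of the neglected terms explicit (a sketch; the symbols m and $\epsilon$ are introduced here for convenience and do not appear in the text), write $D_i = m + \tfrac12\epsilon$ and $D_j = m - \tfrac12\epsilon$ with $m = \tfrac12(D_i + D_j)$. Then
+
+\[ \frac{\ln D_i - \ln D_j}{D_i - D_j} = \frac{2}{\epsilon}\tanh^{-1}\frac{\epsilon}{2m} = \frac{2}{D_i + D_j}\left[1 + \frac{(D_i - D_j)^2}{3\,(D_i + D_j)^2} + \dots\right] , \]
+
+so the entropy-based coefficient exceeds the Bures–Helstrom one by a relative correction of second order in $(D_i - D_j)/(D_i + D_j)$.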
+In order to express the properties of the Riemannian metric (3) in a general form, which will
+exhibit the tensor structure, we use a Liouville representation. There, the observables $\hat O = O_\mu\hat\Omega^\mu$,
+regarded as elements of a vector space, are represented by their coordinates $O_\mu$ on a complete basis
+$\hat\Omega^\mu$ of $n^2$ observables. The space of states is spanned by the dual basis $\hat\Sigma_\mu$, such that
+$\mathrm{Tr}\,\hat\Omega^\nu\hat\Sigma_\mu = \delta^\nu_\mu$, and the states $\hat D = D^\mu\hat\Sigma_\mu$ are represented by their coordinates $D^\mu$.
+Thus, the expectation value (1) is the scalar product $D^\mu O_\mu$. In the matrix representation which appears
+as a special case, $\mu$ denotes a pair of indices i, j, $\hat\Omega^\mu$ stands for $|j\rangle\langle i|$, $\hat\Sigma_\mu$ for $|i\rangle\langle j|$,
+$O_\mu$ denotes the matrix element $O_{ji}$ and $D^\mu$ the element $D_{ij}$. For the q-bit (n = 2) considered in Section 2,
+we have chosen the Pauli operators $\hat\sigma_\mu$ as basis $\hat\Omega^\mu$ for observables, and $\tfrac12\hat\sigma_\mu$ as dual basis $\hat\Sigma_\mu$
+for states, so that the coordinates $D^\mu = \mathrm{Tr}\,\hat D\hat\Omega^\mu$ of $\hat D = \tfrac12(\hat I + r_\mu\hat\sigma_\mu)$ are the components $r_\mu$
+of the vector $\mathbf{r}$. (The unit operator $\hat I$ is kept aside since $\hat D$ is normalised and since constants added
+to $\hat X$ are irrelevant.) The function $F\{X\} = \ln\mathrm{Tr}\, e^{\hat X}$ of the coordinates $X_\mu$ of the observable $\hat X$,
+and the von Neumann entropy $S\{D\}$ as function of the coordinates $D^\mu$ of the state $\hat D$, are related by
+the Legendre transformation $F = S + D^\mu X_\mu$, and the relations (10) are expressed by
+$D^\mu = \partial F/\partial X_\mu$, $X_\mu = -\partial S/\partial D^\mu$. The metric tensor is given by
+
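+\[ g^{\mu\nu} = \frac{\partial^2 F}{\partial X_\mu\,\partial X_\nu} , \qquad g_{\mu\nu} = -\frac{\partial^2 S}{\partial D^\mu\,\partial D^\nu} , \tag{14} \]
+
+and the correspondence issued from (7) between covariant and contravariant infinitesimal variations
+of $\hat X$ and $\hat D$ is implemented as $dD^\mu = g^{\mu\nu}dX_\nu$, $dX_\mu = g_{\mu\nu}dD^\nu$.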
+These expressions exhibit the Hessian nature of the metric. This property simplifies the expression
+of the Christoffel symbol, which reduces to
+
+\[ \Gamma_{\mu\nu\rho} = -\frac12\,\frac{\partial^3 S}{\partial D^\mu\,\partial D^\nu\,\partial D^\rho} , \tag{15} \]
+
+and which provides a parametric representation ˆD(t) of the geodesics in the space of states through
+
+\[ \frac{d^2D^\mu}{dt^2} + g^{\mu\sigma}\Gamma_{\sigma\nu\rho}\,\frac{dD^\nu}{dt}\frac{dD^\rho}{dt} = 0 . \tag{16} \]
+
+Then, the Riemann curvature tensor comes out as
+
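+\[ R_{\mu\rho\,\nu\sigma} = g^{\xi\zeta}\left(\Gamma_{\mu\sigma\xi}\Gamma_{\nu\rho\zeta} - \Gamma_{\mu\nu\xi}\Gamma_{\rho\sigma\zeta}\right) , \tag{17} \]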
+
+the Ricci tensor and the scalar curvature as
+
+\[ R_{\mu\nu} = g^{\rho\sigma}R_{\mu\rho\,\nu\sigma} , \qquad R = g^{\mu\nu}R_{\mu\nu} . \tag{18} \]
+
+We have noted that the classical equivalent of the entropy-based metric ds2 = −d2S is the Fisher
+metric $\sum_i dp_i^2/p_i$, which as regards the curvature is equivalent to a Euclidean metric. While the space of
+classical probabilities is thus flat, the above equations show that the space of quantum states is curved.
+This curvature arises from the non-commutation of the observables; it vanishes for the completely
+disordered state ˆD = ˆI/n. Curvature can thus be used as a measure of the degree of classicality of a
+state.
+
+7. Geometry of the Space of q-Bits
+
+In the illustrative example of a q-bit, the operator ˆX = χμ ˆσμ associated with ˆD is parameterised
+by the 3 components of the vector χμ (μ = 1, 2, 3), related to r by χ = tanh−1 r and χμ/χ = rμ/r. The
+metric tensor given by (5) is expressed as
+
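+\[ g_{\mu\nu} = K r_\mu r_\nu + \frac{\chi}{r}\,\delta_{\mu\nu} , \qquad K \equiv \frac{1}{r}\frac{d}{dr}\frac{\chi}{r} = \frac{1}{r^2}\left(\frac{1}{1-r^2} - \frac{\chi}{r}\right) , \tag{19} \]
+
+\[ g^{\mu\nu} = (1-r^2)\,p^{\mu\nu} + \frac{r}{\chi}\,q^{\mu\nu} . \]
+
+(We have defined $r^\mu = r_\mu$, $\delta^{\mu\nu} = \delta^\mu_\nu = \delta_{\mu\nu}$ so as to introduce the projectors $r^\mu r^\nu/r^2 \equiv p^{\mu\nu} \equiv \delta^{\mu\nu} - q^{\mu\nu}$
+in the Euclidean 3-dimensional space, and thus to simplify the subsequent calculations.) In polar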
+coordinates r = (r, θ, ϕ), the infinitesimal distance takes the form
+
+ds2 = drdχ + rχ(dθ2 + sin2 θdϕ2) .
+(20)
+
+We determine from (15) and (19) the explicit form
+
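+\[ \Gamma_{\mu\nu\rho} = \frac{K}{2}\left(r_\mu\delta_{\nu\rho} + r_\nu\delta_{\mu\rho} + r_\rho\delta_{\mu\nu}\right) + \frac{1}{2r}\frac{dK}{dr}\, r_\mu r_\nu r_\rho \tag{21} \]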
+
+of the Christoffel symbol. By raising its first index with $g^{\mu\nu}$ and using polar coordinates, we obtain
+from (16) the equations of geodesics for n = 2. Within the Poincaré–Bloch sphere the geodesics are
+deduced by rotations from a one-parameter family of curves which lie in the $\theta = \tfrac12\pi$, $|\varphi| \le \tfrac12\pi$
+half-plane and which are symmetric with respect to the ϕ = 0 axis. This family is characterized by the
+equations (where χ = tanh−1 r):
+
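+\[ \frac{d^2r}{dt^2} + \frac{r}{1-r^2}\left(\frac{dr}{dt}\right)^2 - \frac{r}{2}\left[1 + \frac{\chi}{r}\left(1-r^2\right)\right]\left(\frac{d\varphi}{dt}\right)^2 = 0 , \tag{22} \]
+
+\[ \frac{d^2\varphi}{dt^2} + \frac{1}{r}\frac{dr}{dt}\frac{d\varphi}{dt} + \frac{1}{\chi}\frac{d\chi}{dt}\frac{d\varphi}{dt} = 0 , \tag{23} \]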
+
+and the boundary conditions at t = 0:
+
+\[ r(0) = a , \qquad \varphi(0) = 0 , \qquad \frac{dr(0)}{dt} = 0 , \qquad \frac{d\varphi(0)}{dt} = \frac{1}{k} , \qquad k^2 = a\tanh^{-1}a . \tag{24} \]
+
+Equation (23) provides, using the boundary conditions (24):
+
+\[ \frac{d\varphi}{dt} = \frac{k}{r\chi} . \tag{25} \]
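+
+A sketch of this step: Equation (23) can be rewritten as
+
+\[ \frac{1}{r\chi}\frac{d}{dt}\left(r\chi\,\frac{d\varphi}{dt}\right) = 0 , \]
+
+so that $r\chi\, d\varphi/dt$ is constant along a geodesic. Its value at t = 0 follows from (24) as $a\tanh^{-1}a \cdot (1/k) = k^2/k = k$, which gives (25).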
+
+Insertion of (25) into (22) gives rise to an equation for r (t), which can be integrated by regarding t as a
+function of ζ = arcsin r. One obtains:
+
+\[ \left(\frac{dr}{dt}\right)^2 = \left(1-r^2\right)\left(1 - \frac{k^2}{r\chi}\right) . \tag{26} \]
+
+The scale of t has been fixed by relating to r (0) the boundary condition (24) for dϕ (0) /dt, a choice
+which ensures that ds2 = drdχ + rχdϕ2 = dt2, and hence that the parameter t measures the distance
+along geodesics.
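+This normalisation can be verified as a quick sketch: with $d\chi = dr/(1-r^2)$, inserting (25) and (26) gives
+
+\[ \frac{1}{1-r^2}\left(\frac{dr}{dt}\right)^2 + r\chi\left(\frac{d\varphi}{dt}\right)^2 = \left(1 - \frac{k^2}{r\chi}\right) + \frac{k^2}{r\chi} = 1 , \]
+
+that is, $ds^2 = dt^2$ along the curve.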
+For k = 0, we obtain r = |sin t|, ϕ = ±π/2. Thus, the longest geodesics are the diameters of
+the Poincaré–Bloch sphere. We find the value π for their “length”, that is, for the geodesic distance
+between two orthogonal pure states. At the other extreme, when the middle point r = a, ϕ = 0 of
+a geodesic lies close to the surface r = 1 of the sphere, the asymptotic form of the equation (26) is
+solved as
+
+\[ t = \pm 2k\sqrt{\pi}\, e^{-k^2}\,\mathrm{erf}\,\xi , \qquad \xi = \sqrt{\frac12\ln\frac{1-a}{1-r}} , \qquad k^2 = \frac12\ln\frac{2}{1-a} \tag{27} \]
+
+(by taking ξ as variable instead of r). The determination of the explicit equations of such short geodesic
+curves is achieved by integrating (25) into
+
+\[ \varphi = \frac{t}{k} = \pm 2\sqrt{\pi}\, e^{-k^2}\,\mathrm{erf}\,\xi . \tag{28} \]
+
+From (27) and (28) we can determine the geodesic distance between two neighbouring pure states $\hat D' = |\psi'\rangle\langle\psi'|$
+and $\hat D'' = |\psi''\rangle\langle\psi''|$ represented by the points $r_{\max} = 1$, $\varphi_{\max} = \pm\tfrac12\delta\varphi$ with $\delta\varphi$ small.
+At these two points, we have $\xi \to \infty$, $\mathrm{erf}\,\xi = 1$, and this determines k in terms of $\tfrac12\delta\varphi$ through (28).
+The length of the geodesic that joins them, given by (27), is:
+
+\[ \delta s^2 = \delta\varphi^2\ln\frac{4\sqrt{\pi}}{\delta\varphi} , \qquad \delta\varphi = \arccos\left|\langle\psi'|\psi''\rangle\right| . \tag{29} \]
+Thus, in spite of its singularity for r = 1, the present 3-dimensional metric (5) in the space r, θ, ϕ defines
+distances between pure states represented by points on the surface r = 1 of the Poincaré–Bloch sphere.
+However, it should be noted that the presence of the logarithmic factor in (29) forbids such distances
+to be generated by a 2-dimensional metric in the space θ, ϕ. In fact, the distance (29) is measured along
+a geodesic that penetrates the sphere r = 1, because no geodesic is tangent to the surface of this sphere
+nor lies on its surface.
+In contrast, all geodesics produced by the Bures–Helstrom metric are tangent to the surface of the
+sphere, or are its great circles. They are given by Equations (25) and (26), where χ is replaced by r and
+k by a; the solution of these equations provides the ellipses
+
+r cos ϕ = a cos t ,
+r sin ϕ = sin t .
+(30)
+
+Here as above, the largest distance π is reached for orthogonal pure states represented by opposite
+points on the sphere, but now a peculiarity occurs. Whereas the metric ds2 = −d2S produces a single
+geodesic, the diameter joining these two points (with “length” π), the Bures metric produces a double
+infinity of geodesics, the half-ellipses (30) having as long axis this diameter, and having all the same
+“length” π. Other pairs of pure states are joined by geodesics which are arcs of great circles, and their
+Bures distance δsBH = δϕ is identified with the ordinary length of the arc. Here for n = 2 as in the
+general case, the 3-dimensional Bures–Helstrom metric admits a restriction to pure states generated by
+a 2-dimensional metric, which is identified with the quantum Fubini–Study metric, itself defined only
+for pure states by $s_{\rm FS} = \arccos|\langle\psi'|\psi''\rangle| = \tfrac12 s_{\rm BH}$.
+
+Returning to the metric ds2 = −d2S, the Riemann curvature is obtained from (17) as
+
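+\[ \begin{aligned} R^\mu{}_{\rho\,\nu\sigma} = \frac{K}{4}\Big[ & \Big(r^2 + \frac{r}{\chi} - 1\Big)\big(q^\mu_\sigma q_{\nu\rho} - q^\mu_\nu q_{\rho\sigma}\big) + \Big(r^2 - \frac{r}{\chi} + 1\Big)\big(p^\mu_\sigma q_{\nu\rho} - p^\mu_\nu q_{\rho\sigma}\big) \\ & + \frac{r}{\chi}\,\frac{1}{1-r^2}\Big(r^2 - \frac{r}{\chi} + 1\Big)\big(q^\mu_\sigma p_{\nu\rho} - q^\mu_\nu p_{\rho\sigma}\big) \Big] . \end{aligned} \tag{31} \]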
+
+Contracting with $g^{\rho\sigma}$ the indices of (31) as in (18), we finally derive the Ricci curvature
+
+\[ R^\mu_\nu = -\frac{Kr}{2\chi}\left(r^2\,\delta^\mu_\nu + \frac{\chi - r}{\chi}\, p^\mu_\nu\right) , \tag{32} \]
+
+and the scalar curvature
+
+\[ R = -\frac{Kr}{2\chi}\left(3r^2 + \frac{\chi - r}{\chi}\right) . \tag{33} \]
+
+Both are negative in the whole Poincaré sphere. In the limit r → 0, the curvature R vanishes as
+$R \sim -\tfrac{10}{9}\, r^2$, as expected from the general argument of Section 2: a weakly polarised spin behaves
+classically. At the other extreme r → 1, R behaves as $R \sim -2\,[(1-r)\,|\ln(1-r)|]^{-1}$; it diverges, again
+as expected: pure states have the largest quantum nature.
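+
+The small-r behaviour can be checked by a short expansion (a sketch to leading order only). Using $\chi/r = 1 + \tfrac13 r^2 + O(r^4)$ and $1/(1-r^2) = 1 + r^2 + O(r^4)$ in (19) gives $K = \tfrac23 + O(r^2)$, while $r/\chi = 1 + O(r^2)$ and $(\chi - r)/\chi = \tfrac13 r^2 + O(r^4)$. Inserting into (33),
+
+\[ R \simeq -\frac12\cdot\frac23\left(3r^2 + \frac13 r^2\right) = -\frac{10}{9}\, r^2 . \]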
+
+The metric ds2 = −d2S, introduced above in the context of quantum mechanics for mixed states
+(and their pure limit) and information theory, might more generally be useful to characterise distances
+in spaces of positive matrices.
+
+Conflicts of Interest: The author declares no conflict of interest.
+
+References
+
+1. Thirring, W. Quantum Mechanics of Large Systems. In A Course of Mathematical Physics; Volume 4; Springer-Verlag: New York, NY, USA, 1983.
+2. Balian, R.; Alhassid, Y.; Reinhardt, H. Dissipation in many-body systems: A geometric approach based on information theory. Phys. Rep. 1986, 131, 1–146.
+3. Balian, R. Incomplete descriptions and relevant entropies. Am. J. Phys. 1999, 67, 1078–1090.
+4. Balian, R. Information in statistical physics. Stud. Hist. Philos. Mod. Phys. 2005, 36, 323–353.
+5. Bures, D. An extension of Kakutani’s theorem. Trans. Am. Math. Soc. 1969, 135, 199–212.
+
+© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+Article
+Extending the Extreme Physical Information to
+Universal Cognitive Models via a Confident
+Information First Principle
+
+Xiaozhao Zhao 1, Yuexian Hou 1,2,*, Dawei Song 1,3 and Wenjie Li 2
+
+1 School of Computer Science and Technology, Tianjin University, Tianjin 300072, China; E-Mails:
+0.25eye@gmail.com (X.Z.); dawei.song2010@gmail.com (D.S.)
+2 Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China;
+E-Mail: cswjli@comp.polyu.edu.hk
+3 Department of Computing and Communications, The Open University, Milton Keynes MK76AA, UK
+* E-Mail: yxhou@tju.edu.cn; Tel.: +86-022-27406538.
+
+Received: 25 March 2014; in revised form: 6 June 2014 / Accepted: 20 June 2014 /
+Published: 1 July 2014
+
+Abstract: The principle of extreme physical information (EPI) can be used to derive many known
+laws and distributions in theoretical physics by extremizing the physical information loss K, i.e.,
+the difference between the observed Fisher information I and the intrinsic information bound J
+of the physical phenomenon being measured. However, for complex cognitive systems of high
+dimensionality (e.g., human language processing and image recognition), the information bound
+J could be excessively larger than I (J ≫ I), due to insufficient observation, which would lead
+to serious over-fitting problems in the derivation of cognitive models. Moreover, there is a lack
+of an established exact invariance principle that gives rise to the bound information in universal
+cognitive systems. This limits the direct application of EPI. To narrow down the gap between I and
+J, in this paper, we propose a confident-information-first (CIF) principle to lower the information
+bound J by preserving confident parameters and ruling out unreliable or noisy parameters in the
+probability density function being measured. The confidence of each parameter can be assessed
+by its contribution to the expected Fisher information distance between the physical phenomenon
+and its observations. In addition, given a specific parametric representation, this contribution can
+often be directly assessed by the Fisher information, which establishes a connection with the inverse
+variance of any unbiased estimate for the parameter via the Cramér–Rao bound. We then consider
+the dimensionality reduction in the parameter spaces of binary multivariate distributions. We show
+that the single-layer Boltzmann machine without hidden units (SBM) can be derived using the CIF
+principle. An illustrative experiment is conducted to show how the CIF principle improves the
+density estimation performance.
+
+Keywords: information geometry; Boltzmann machine; Fisher information; parametric reduction
+
+1. Introduction
+
+Information has been found to play an increasingly important role in physics. As stated in
+Wheeler [1]: “All things physical are information-theoretic in origin and this is a participatory
+universe...Observer participancy gives rise to information; and information gives rise to physics”.
+Following this viewpoint, Frieden [2] unifies the derivation of physical laws in major fields of physics,
+from the Dirac equation to the Maxwell-Boltzmann velocity dispersion law, using the extreme physical
+information principle (EPI). More specifically, a variety of equations and distributions can be derived by
+extremizing the physical information loss K, i.e., the difference between the observed Fisher information
+I and the intrinsic information bound J of the physical phenomenon being measured.
+
+Entropy 2014, 16, 3670–3688; doi:10.3390/e16073670
+www.mdpi.com/journal/entropy
+
+The first quantity, I, measures the amount of information as a finite scalar implied by the data
+with some suitable measure [2]. It is formally defined as the trace of the Fisher information matrix [3].
+In addition to I, the second quantity, the information bound J, is an invariant that characterizes the
+information that is intrinsic to the physical phenomenon [2]. During the measurement procedure, there
+may be some loss of information, which entails I = κJ, where κ ≤ 1 is called the efficiency coefficient
+of the EPI process in transferring the Fisher information from the phenomenon (specified by J) to
+the output (specified by I). For closed physical systems, in particular, any solution for I attains some
+fraction of J between 1/2 (for classical physics) and one (for quantum physics) [4].
+However, it is usually not the case in cognitive science. For complex cognitive systems (e.g.,
+human language processing and image recognition), the target probability density function (pdf) being
+measured is often of high dimensionality (e.g., thousands of words in a human language vocabulary
+and millions of pixels in an observed image). Thus, it is infeasible for us to obtain a sufficient collection
+of observations, leading to excessive information loss between the observer and nature. Moreover,
+there is a lack of an established exact invariance principle that gives rise to the bound information in
+universal cognitive systems. This limits the direct application of EPI in cognitive systems.
+In terms of statistics and machine learning, the excessive information loss between the observer
+and nature will lead to serious over-fitting problems, since the insufficient observations may not
+provide necessary information to reasonably identify the model and support the estimation of the
+target pdf in complex cognitive systems. Actually, a similar problem is also recognized in statistics and
+machine learning, known as the model selection problem [5]. In general, we would require a complex
+model with a high-dimensional parameter space to sufficiently depict the original high-dimensional
+observations. However, over-fitting usually occurs when the model is excessively complex with
+respect to the given observations. To avoid over-fitting, we would need to adjust the complexity of the
+models to the available amount of observations and, equivalently, to adjust the information bound J
+corresponding to the observed information I.
+In order to derive feasible computational models for cognitive phenomena, we propose a
+confident-information-first (CIF) principle in addition to EPI to narrow down the gap between I and J
+(thus, a reasonable efficiency coefficient κ is implied), as illustrated in Figure 1. However, we do not
+intend to actually derive the distribution laws by solving the differential equations of the extremization
+of the new information loss K′. Instead, we assume that the target distribution belongs to some general
+multivariate binary distribution family and focus on the problem of seeking a proper information
+bound with respect to the constraint of the parametric number and the given observations.
+
+Figure 1. (a) The paradigm of the extreme physical information principle (EPI) to derive physical laws
+by the extremization of the information loss K∗ (K∗ = J/2 for classical physics and K∗ = 0 for quantum
+physics); (b) the paradigm of confident-information-first (CIF) to derive computational models by
+reducing the information loss K′ using a new physical bound J′.
+
+The key to the CIF approach is how to systematically reduce the physical information bound for
+high-dimensional complex systems. As stated in Frieden [2], the information bound J is a functional
+form that depends upon the physical parameters of the system. The information is contained in
+the variations of the observations (often imperfect, due to insufficient sampling, noise and intrinsic
+limitations of the “observer”), and can be further quantified using the Fisher information of system
+parameters (or coordinates) [3] from the estimation theory. Therefore, the physical information bound
+J of a complex system can be reduced by transforming it to a simpler system using some parametric
+reduction approach. Assuming there exists an ideal parametric model S that is general enough to
+represent all system phenomena (which gives the ultimate information bound in Figure 1), our goal is
+to adopt a parametric reduction procedure to derive a lower-dimensional sub-model M (which gives
+the reduced information bound in Figure 1) for a given dataset (usually insufficient or perturbed by
+noise) by reducing the number of free parameters in S.
+Formally speaking, let q(ξ) be the ideal distribution with parameters ξ that describes the physical
+system and q(ξ + Δξ) be the observations of the system with some small fluctuation Δξ in parameters.
+In [6], the averaged information distance I(Δξ) between the distribution and its observations, the
+so-called shift information, is used as a disorder measure of the fluctuated observations to reinterpret
+the EPI principle. More specifically, in the framework of information geometry, this information
+distance could also be assessed using the Fisher information distance induced by the Fisher–Rao
+metric, which can be decomposed into the variation in the direction of each system parameter [7].
+In principle, it is possible to divide system parameters into two categories, i.e., the parameters with
+notable variations and the parameters with negligible variations, according to their contributions to the
+whole information distance. Additionally, the parameters with notable contributions are considered
+to be confident, since they are important for reliably distinguishing the ideal distribution from its
+observation distributions. On the other hand, the parameters with negligible contributions can be
+considered to be unreliable or noisy. Then, the CIF principle can be stated as the parameter selection
+criterion that maximally preserves the Fisher information distance in an expected sense with respect
+to the constraint of the parametric number and the given observations (if available), when projecting
+distributions from the parameter space of S into that of the reduced sub-model M. We call it the
+distance-based CIF. As a result, we could manipulate the information bound of the underlying system
+by preserving the information of confident parameters and ruling out noisy parameters.
+In this paper, the CIF principle is analyzed in the multivariate binary distribution family in the
+mixed-coordinate system [8]. It turns out that, in this problematic configuration, the confidence of
+a parameter can be directly evaluated by its Fisher information, which also establishes a connection
+with the inverse variance of any unbiased estimate for the parameter via the Cramér–Rao bound [3].
+Hence, the CIF principle can also be interpreted as the parameter selection procedure that keeps the
+parameters with reliable estimates and rules out unreliable or noisy parameters. This CIF is called
+the information-based CIF. Note that the definition of confidence in distance-based CIF depends on
+both Fisher information and the scale of fluctuation, and the confidence in the information-based CIF
+(i.e., Fisher information) can be seen as a special case of confidence measure with respect to certain
+coordinate systems. This simplification allows us to further apply the CIF principle to improve existing
+learning algorithms for the Boltzmann machine.
+The paper is organized as follows. In Section 2, we introduce the parametric formulation for
+the general multivariate binary distributions in terms of information geometry (IG) framework [7].
+Then, Section 3 describes the implementation details of the CIF principle. We also give a geometric
+interpretation of CIF by showing that it can maximally preserve the expected information distance (in
+Section 3.2.1), as well as the analysis on the scale of the information distance in each individual system
+parameter (in Section 3.2.2). In Section 4, we demonstrate that a widely used cognitive model, i.e., the
+Boltzmann machine, can be derived using the CIF principle. Additionally, an illustrative experiment is
+conducted to show how the CIF principle can be utilized to improve the density estimation performance
+of the Boltzmann machine in Section 5.
+
+2. The Multivariate Binary Distributions
+
+Similar to EPI, the derivation of CIF depends on the analysis of the physical information bound,
+where the choice of system parameters, also called “Fisher coordinates” in Frieden [2], is crucial.
+Based on information geometry (IG) [7], we introduce some choices of parameterizations for binary
+multivariate distributions (denoted as statistical manifold S) with a given number of variables n, i.e.,
+the open simplex of all probability distributions over binary vectors x ∈ {0, 1}^n.
+
+2.1. Notations for Manifold S
+
+In IG, a family of probability distributions is considered as a differentiable manifold with certain
+parametric coordinate systems. In the case of binary multivariate distributions, four basic coordinate
+systems are often used: p-coordinates, η-coordinates, θ-coordinates and mixed-coordinates [7,9].
+The mixed-coordinates are of vital importance for our analysis.
+For the p-coordinates [p] with n binary variables, the probability distribution over the 2^n states
+of x can be completely specified by any 2^n − 1 positive numbers indicating the probabilities of the
+corresponding exclusive states on the n binary variables. For example, the p-coordinates of n = 2 variables
+could be [p] = (p_{01}, p_{10}, p_{11}). Note that IG requires all probability terms to be positive [7].
+For simplicity, we use the capital letters I, J, . . . to index the coordinate parameters of probabilistic
+distribution. To distinguish the notation of Fisher information (conventionally used in literature,
+e.g., data information I and information bound J in Section 1) from the coordinate indexes, we
+make explicit explanations when necessary from now on. An index I can be regarded as a subset
+of {1, 2, . . . , n}. Additionally, p_I stands for the probability that all variables indicated by I equal
+one and the complementary variables are zero. For example, if I = {1, 2, 4} and n = 4, then
+p_I = p_{1101} = Prob(x_1 = 1, x_2 = 1, x_3 = 0, x_4 = 1). Note that the null set can also be a legal index of
+the p-coordinates, which indicates the probability that all variables are zero, denoted as p_{0...0}.
+Another coordinate system often used in IG is η-coordinates, which is defined by:
+
+η_I = E[X_I] = Prob{∏_{i∈I} x_i = 1}    (1)
+
+where the value of X_I is given by ∏_{i∈I} x_i and the expectation is taken with respect to the probability
+distribution over x. Grouping the coordinates by their orders, the η-coordinate system is denoted
+as [η] = (η^1_i, η^2_{ij}, . . . , η^n_{1,2,...,n}), where the superscript indicates the order number of the corresponding
+parameter. For example, η^2_{ij} denotes the set of all η parameters with the order number two.
+The θ-coordinates (natural coordinates) are defined by:
+
+log p(x) = ∑_{I⊆{1,2,...,n}, I≠∅} θ^I X_I − ψ(θ)    (2)
+
+where ψ(θ) = log(∑_x exp{∑_I θ^I X_I(x)}) is the cumulant generating function, whose value equals
+− log Prob{x_i = 0, ∀i ∈ {1, 2, ..., n}}. The θ-coordinate system is denoted as [θ] = (θ^i_1, θ^{ij}_2, . . . , θ^{1,...,n}_n), where
+the subscript indicates the order number of the corresponding parameter. Note that the order indices
+are located at different positions in [η] and [θ], following the convention in Amari et al. [8].
+The relation between coordinate systems [η] and [θ] is bijective. More formally, they are connected
+by the Legendre transformation:
+
+θ^I = ∂φ(η)/∂η_I,    η_I = ∂ψ(θ)/∂θ^I    (3)
+
+where ψ(θ) is given in Equation (2) and φ(η) = ∑_x p(x; η) log p(x; η) is the negative of entropy. It can
+be shown that ψ(θ) and φ(η) meet the following identity [7]:
+
+ψ(θ) + φ(η) − ∑_I θ^I η_I = 0    (4)
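+
+The coordinate definitions and the identity in Equation (4) can be checked numerically. The sketch below
+is illustrative only: the distribution p is a hypothetical example for n = 2, states and index sets are
+encoded as bitmasks (bit i−1 set when x_i = 1), and the helper names are assumptions introduced here.
+The θ-coordinates are obtained by inverting Equation (2) with inclusion-exclusion over subsets.
+
```python
import math

def popcount(mask):
    return bin(mask).count("1")

def eta_coords(p):
    # eta_I = Prob(all variables in I equal one) = sum of p_K over all K containing I
    m = len(p)
    return {I: sum(p[K] for K in range(m) if K & I == I) for I in range(1, m)}

def theta_coords(p):
    # Inverting Equation (2): theta^I = sum over K subset of I of (-1)^{|I-K|} log p_K
    m = len(p)
    return {
        I: sum((-1) ** popcount(I & ~K) * math.log(p[K])
               for K in range(m) if K & ~I == 0)
        for I in range(1, m)
    }

# Hypothetical distribution over (x_1, x_2); p[mask] is the probability of that state.
p = [0.1, 0.2, 0.3, 0.4]

eta = eta_coords(p)
theta = theta_coords(p)
psi = -math.log(p[0])                         # psi(theta) = -log Prob(all x_i = 0)
phi = sum(pk * math.log(pk) for pk in p)      # negative entropy
residual = psi + phi - sum(theta[I] * eta[I] for I in theta)
print("Legendre identity residual:", residual)
```
+
+The residual should vanish up to floating-point rounding for any strictly positive p.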
+
+Next, we introduce mixed-coordinates, which is important for our derivation of CIF. In general,
+the manifold S of probability distributions could be represented by the l-mixed-coordinates [8]:
+
+[ζ]_l = (η^1_i, η^2_{ij}, . . . , η^l_{i,j,...,k}, θ^{i,j,...,k}_{l+1}, . . . , θ^{1,...,n}_n)    (5)
+
+where the first part consists of η-coordinates with order less than or equal to l (denoted by [η_{l−}]) and the
+second part consists of θ-coordinates with order greater than l (denoted by [θ_{l+}]), l ∈ {1, ..., n − 1}.
+
+2.2. Fisher Information Matrix for Parametric Coordinates
+
+For a general coordinate system [ξ], the i-th row and j-th column element of the Fisher information
+matrix for [ξ] (denoted by G_ξ) is defined as the covariance of the scores of ξ_i and ξ_j [3], i.e.,
+
+g_{ij} = E[∂ log p(x; ξ)/∂ξ_i · ∂ log p(x; ξ)/∂ξ_j]
+
+under the regularity condition for the pdf that the partial derivatives exist. The Fisher information
+measures the amount of information in the data that a statistic carries about the unknown
+parameters [10]. The Fisher information matrix is of vital importance to our analysis, because the
+inverse of Fisher information matrix gives an asymptotically tight lower bound to the covariance
+matrix of any unbiased estimate for the considered parameters [3]. Another important concept related
+to our analysis is the orthogonality defined by Fisher information. Two coordinate parameters ξi and
+ξj are called orthogonal if and only if their Fisher information vanishes, i.e., gij = 0, meaning that their
+influences on the log likelihood function are uncorrelated.
+
+The Fisher information for [θ] can be rewritten as g_{IJ} = ∂^2 ψ(θ)/∂θ^I ∂θ^J, and for [η], it is g^{IJ} = ∂^2 φ(η)/∂η_I ∂η_J [7].
+Let G_θ = (g_{IJ}) and G_η = (g^{IJ}) be the Fisher information matrices for [θ] and [η], respectively. It can be
+shown that G_θ and G_η are mutually inverse matrices, i.e., ∑_J g^{IJ} g_{JK} = δ^I_K, where δ^I_K = 1 if I = K and
+zero otherwise [7]. In order to generally compute G_θ and G_η, we develop the following Propositions 1
+and 2. Note that Proposition 1 is a generalization of Theorem 2 in Amari et al. [8].
+
+Proposition 1. The Fisher information between two parameters θI and θJ in [θ], is given by:
+
+g_{IJ}(θ) = η_{I∪J} − η_I η_J    (6)
+
+Proof. in Appendix A.
+
+Proposition 2. The Fisher information between two parameters ηI and ηJ in [η], is given by:
+
+g^{IJ}(η) = ∑_{K⊆I∩J} (−1)^{|I−K|+|J−K|} · 1/p_K    (7)
+
+where | · | denotes the cardinality operator.
+
+Proof. in Appendix B.
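+
+As a numerical sanity check of Propositions 1 and 2, the following sketch builds G_θ from Equation (6) and
+G_η from Equation (7) and tests that their product is the identity. It is a minimal illustration, not the
+authors' code: the distribution p (n = 3) is hypothetical, index sets are encoded as bitmasks, and the
+function names are assumptions introduced here.
+
```python
import numpy as np

def popcount(mask):
    return bin(mask).count("1")

def eta(p, I):
    # eta_I: total probability of the states in which every variable in I equals one
    return sum(pk for K, pk in enumerate(p) if K & I == I)

def fisher_theta(p):
    # Equation (6): g_{IJ}(theta) = eta_{I union J} - eta_I * eta_J
    idx = range(1, len(p))
    return np.array([[eta(p, I | J) - eta(p, I) * eta(p, J) for J in idx] for I in idx])

def fisher_eta(p):
    # Equation (7): sum over K subset of (I intersect J) of (-1)^{|I-K|+|J-K|} / p_K
    idx = range(1, len(p))

    def entry(I, J):
        common = I & J
        return sum((-1) ** (popcount(I & ~K) + popcount(J & ~K)) / p[K]
                   for K in range(len(p)) if K & ~common == 0)

    return np.array([[entry(I, J) for J in idx] for I in idx])

# Hypothetical strictly positive distribution over 2^3 states, indexed by bitmask.
p = np.array([0.10, 0.05, 0.15, 0.20, 0.05, 0.25, 0.10, 0.10])

G_theta = fisher_theta(p)
G_eta = fisher_eta(p)
max_dev = np.abs(G_theta @ G_eta - np.eye(len(p) - 1)).max()
print("max deviation of G_theta @ G_eta from identity:", max_dev)
```
+
+Both matrices are indexed by the 2^n − 1 non-empty subsets in bitmask order, so rows and columns line up
+without any reordering.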
+
+Based on the Fisher information matrices G_η and G_θ, we can calculate the Fisher information
+matrix G_ζ for the l-mixed-coordinate system [ζ]_l, as follows:
+
+Proposition 3. The Fisher information matrix G_ζ of the l-mixed-coordinates [ζ]_l is given by:
+
+G_ζ = [ A  0 ]
+      [ 0  B ]    (8)
+
+where A = ((G_η^{−1})_{I_η})^{−1}, B = ((G_θ^{−1})_{J_θ})^{−1}, G_η and G_θ are the Fisher information matrices of [η] and [θ],
+respectively, I_η is the index set of the parameters shared by [η] and [ζ]_l, i.e., {η^1_i, ..., η^l_{i,j,...,k}}, and J_θ is the index
+set of the parameters shared by [θ] and [ζ]_l, i.e., {θ^{i,j,...,k}_{l+1}, . . . , θ^{1,...,n}_n}.
+
+Proof. in Appendix C.
+
+3. The General CIF Principle
+
+In this section, we propose the CIF principle to reduce the physical information bound for
+high-dimensionality systems. Given a target distribution q(x) ∈ S, we consider the problem of
+realizing it by a lower-dimensionality submanifold. This is defined as the problem of parametric
+reduction for multivariate binary distributions. The family of multivariate binary distributions has
+been proven to be useful when we deal with discrete data in a variety of applications in statistical
+machine learning and artificial intelligence, such as the Boltzmann machine in neural networks [11,12]
+and the Rasch model in human sciences [13,14].
+Intuitively, if we can construct a coordinate system so that the confidences of its parameters
+entail a natural hierarchy, in which highly confident parameters are significantly distinguished from
+and orthogonal to less confident ones, then we can conveniently implement CIF by keeping
+the highly confident parameters unchanged and setting the less confident parameters to neutral
+values. Therefore, the choice of coordinates (or parametric representations) in CIF is crucial to its
+usage. This strategy is infeasible in terms of p-coordinates, η-coordinates or θ-coordinates, since the
+orthogonality condition cannot hold in these coordinate systems. In this section, we will show that the
+l-mixed-coordinates [ζ]_l meet the requirement of CIF.
+In principle, the confidence of parameters should be assessed according to their contributions to
+the expected information distance between the ideal distribution and its fluctuated observations. This is
+called the distance-based CIF (see Section 1). For some coordinate systems, e.g., the mixed-coordinate
+system [ζ]_l, the confidence of a parameter can also be directly evaluated by its Fisher information. This
+is called the information-based CIF (see Section 1). The information-based CIF (i.e., Fisher information)
+can be seen as an approximation to the distance-based CIF, since it neglects the influence of parameter
+scaling on the expected information distance. However, considering the standard mixed-coordinates
+[ζ]_l for the manifold of multivariate binary distributions, it turns out that both the distance-based CIF and
+the information-based CIF entail the same submanifold M (refer to Section 3.2 for detailed reasons).
+For the purpose of legibility, we will start with the information-based CIF, where the parameter’s
+confidence is simply measured using its Fisher information. After that, we show that the
+information-based CIF leads to an optimal submanifold M, which is also optimal in terms of the
+more rigorous distance-based CIF.
+
+3.1. The Information-Based CIF Principle
+
+In this section, we will show that the l-mixed-coordinates [ζ]_l meet the requirement of the
+information-based CIF. According to Proposition 3 and the following Proposition 4, the confidences of
+coordinate parameters (measured by Fisher information) in [ζ]_l entail a natural hierarchy: the first part
+of highly confident parameters [η_{l−}] is separated from the second part of low-confidence parameters
+[θ_{l+}]. Additionally, those low-confidence parameters [θ_{l+}] have the neutral value of zero.
+
+Proposition 4. The diagonal elements of A are lower bounded by one, and those of B are upper bounded by one.
+
+Proof. in Appendix D.
+
+Moreover, the parameters in [η_{l−}] are orthogonal to the ones in [θ_{l+}], indicating that we could
+estimate these two parts independently [9]. Hence, we can implement the information-based
+CIF for parametric reduction in [ζ]_l by replacing low-confidence parameters with the neutral value
+zero and reconstructing the resulting distribution. It turns out that the submanifold of S
+tailored by the information-based CIF becomes [ζ]_{lt} = (η^1_i, ..., η^l_{ij...k}, 0, . . . , 0). We call [ζ]_{lt} the
+l-tailored-mixed-coordinates.
+To grasp an intuitive picture for the CIF strategy and its significance w.r.t. mixed-coordinates,
+let us consider an example with [p] = (p_{001} = 0.15, p_{010} = 0.1, p_{011} = 0.05, p_{100} = 0.2, p_{101} =
+0.1, p_{110} = 0.05, p_{111} = 0.3). Then, the confidences for coordinates in [η], [θ] and [ζ]_2 are given by the
+diagonal elements of the corresponding Fisher information matrices. Applying the two-tailored CIF in
+mixed-coordinates, the loss ratio of Fisher information is 0.001%, and the ratio of the Fisher information
+of the tailored parameter (θ^{123}_3) to the remaining η parameter with the smallest Fisher information is
+0.06%. On the other hand, the above two ratios become 7.58% and 94.45% (in η-coordinates) or 12.94%
+and 92.31% (in θ-coordinates), respectively. We can see that [ζ]_2 gives us a much better way to tell
+apart confident parameters from noisy ones.
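+
+The blocks A and B of Proposition 3 can be computed for this example along the following lines. This is a
+hedged sketch rather than the authors' procedure: p_{000} = 0.05 is filled in from normalization, G_η is
+obtained by numerically inverting G_θ (using the mutual-inverse property) instead of Equation (7), and
+all variable names are assumptions. It prints the diagonal confidences only and does not attempt to
+reproduce the exact loss ratios quoted above, whose precise definition is not given in this section.
+
```python
import numpy as np

n = 3
# Probabilities p_{x1 x2 x3} from the example; p_000 follows from normalization.
p_by_state = {"001": 0.15, "010": 0.10, "011": 0.05, "100": 0.20,
              "101": 0.10, "110": 0.05, "111": 0.30}
p_by_state["000"] = 1.0 - sum(p_by_state.values())

def to_mask(state):
    # Encode "x1x2x3" as a bitmask with bit (i-1) set when x_i = 1.
    return sum(1 << i for i, c in enumerate(state) if c == "1")

p = np.zeros(2 ** n)
for state, value in p_by_state.items():
    p[to_mask(state)] = value

def eta(I):
    return sum(p[K] for K in range(2 ** n) if K & I == I)

idx = list(range(1, 2 ** n))                  # non-empty index sets as bitmasks
order = [bin(I).count("1") for I in idx]

# Equation (6) for G_theta; G_eta via numerical inversion (mutual-inverse property).
G_theta = np.array([[eta(I | J) - eta(I) * eta(J) for J in idx] for I in idx])
G_eta = np.linalg.inv(G_theta)

l_order = 2                                   # two-mixed-coordinates [zeta]_2
low = [k for k, o in enumerate(order) if o <= l_order]    # I_eta: kept eta parameters
high = [k for k, o in enumerate(order) if o > l_order]    # J_theta: tailored theta parameters

A = np.linalg.inv(G_theta[np.ix_(low, low)])   # ((G_eta^{-1})_{I_eta})^{-1}
B = np.linalg.inv(G_eta[np.ix_(high, high)])   # ((G_theta^{-1})_{J_theta})^{-1}

print("confidences of eta parameters (diag A):", np.diag(A))
print("confidence of theta^{123} (diag B):", np.diag(B))
```
+
+In line with Proposition 4, every diagonal entry of A should be at least one and the single entry of B at
+most one.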
+
+3.2. The Distance-Based CIF: A Geometric Point-of-View
+
+In the previous section, the information-based CIF entails a submanifold of S determined by the
+l-tailored-mixed-coordinates [ζ]_{lt}. A more rigorous definition for the confidence of coordinates is the
+distance-based confidence used in the distance-based CIF, which relies on both the coordinate’s
+Fisher information and its fluctuation scaling. In this section, we will show that the submanifold
+M determined by [ζ]_{lt} is also an optimal submanifold in terms of the distance-based CIF. Note that,
+for other coordinate systems (e.g., arbitrarily rescaled coordinates), the information-based CIF may
+not entail the same submanifold as the distance-based CIF.
+Let q(x), with coordinate ζ_q, denote the exact solution to the physical phenomenon being
+measured. Additionally, the act of observation would cause small random perturbations to q(x),
+leading to some observation q′(x) with coordinate ζ_q + Δζ_q. When two distributions q(x) and q′(x) are
+close, the divergence between q(x) and q′(x) on manifold S could be assessed by the Fisher information
+distance: D(q, q′) = (Δζ_q · G_ζ · Δζ_q)^{1/2}, where G_ζ is the Fisher information matrix and the perturbation
+Δζ_q is small. The Fisher information distance between two close distributions q(x) and q′(x) on
+manifold S is the Riemannian distance under the Fisher–Rao metric, which is shown to be the square
+root of twice the Kullback–Leibler divergence from q(x) to q′(x) [8]. Note that we adopt the
+Fisher information distance as the distance measure between two close distributions, since it is shown
+to be the unique metric meeting a set of natural axioms for the distribution metrics [7,15,16], e.g., the
+invariance with respect to reparametrizations and the monotonicity with respect to the random
+maps on variables.
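+
+The local relation between the Fisher information distance and the K-L divergence can be illustrated
+numerically. The sketch below is an assumption-laden example: it works in θ-coordinates rather than
+mixed-coordinates (the distance itself does not depend on the parameterization), uses a hypothetical
+random distribution with n = 2, and introduces its own helper names.
+
```python
import numpy as np

n = 2
m = 2 ** n
idx = np.arange(1, m)
# F[x, I] = X_I(x): equals one when every variable in index set I is one in state x.
F = np.array([[1.0 if (x & I) == I else 0.0 for I in idx] for x in range(m)])

def dist(theta):
    # Distribution defined by Equation (2); the normalization plays the role of psi(theta).
    logits = F @ theta
    w = np.exp(logits - logits.max())
    return w / w.sum()

def fisher_theta(theta):
    # Covariance of the statistics X_I, i.e., eta_{I union J} - eta_I * eta_J.
    q = dist(theta)
    eta = F.T @ q
    return F.T @ (q[:, None] * F) - np.outer(eta, eta)

rng = np.random.default_rng(0)
theta = rng.normal(size=m - 1)                # hypothetical ideal distribution q
delta = 1e-3 * rng.normal(size=m - 1)         # small perturbation of the coordinates

q = dist(theta)
q_prime = dist(theta + delta)
kl = float(np.sum(q * np.log(q / q_prime)))
fisher_dist = float(np.sqrt(delta @ fisher_theta(theta) @ delta))
ratio = np.sqrt(2.0 * kl) / fisher_dist
print("sqrt(2 KL) / D(q, q'):", ratio)
```
+
+The ratio should approach one as the perturbation shrinks, since the two quantities agree only to
+second order.
+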
+Let M be a smooth k-dimensional submanifold in S (k < 2^n − 1). Given the point q(x) ∈ S,
+the projection [8] of q(x) on M is the point p(x) that belongs to M and is closest to q(x) with respect
+to the Kullback–Leibler divergence (K-L divergence) [17] from the distribution q(x) to p(x). On the
+submanifold M, the projections of q(x) and q′(x) are p(x) and p′(x), with coordinates ζ_p and ζ_p + Δζ_p,
+respectively, as shown in Figure 2.
+Let the preserved Fisher information distance be D(p, p′) after projecting on M. In order to retain
+the information contained in observations, we need the ratio D(p, p′)/D(q, q′) to be as large as possible in the
+expected sense, with respect to the given dimensionality k of M. The next two sections will illustrate
+that CIF leads to an optimal submanifold M based on different assumptions on the perturbations Δζ_q.
+
+Figure 2. By projecting a point q(x) on S to a submanifold M, the l-tailored mixed-coordinates [ζ]_{lt}
+give a desirable M that maximally preserves the expected Fisher information distance when projecting
+an ε-neighborhood centered at q(x) onto M.
+
+3.2.1. Perturbations in Uniform Neighborhood
+
+Let B_q be an ε-sphere surface centered at q(x) on manifold S, i.e., B_q = {q′ ∈ S | KL(q, q′) = ε},
+where KL(·, ·) denotes the K-L divergence and ε is small. Additionally, q′(x) is a neighbor of q(x)
+uniformly sampled on B_q, as illustrated in Figure 2. Recall that, for a small ε, the K-L divergence can be
+approximated by half of the squared Fisher information distance. Thus, in the parameterization
+of [ζ]_l, B_q is indeed the surface of a hyper-ellipsoid (centered at q(x)) determined by G_ζ. The
+following proposition shows that the general CIF would lead to an optimal submanifold M that
+maximally preserves the expected information distance, where the expectation is taken upon the
+uniform neighborhood, B_q.
+
+Proposition 5. Consider the manifold S in l-mixed-coordinates [ζ]_l. Let k be the number of free parameters
+in the l-tailored-mixed-coordinates [ζ]_{lt}. Then, among all k-dimensional submanifolds of S, the submanifold
+determined by [ζ]_{lt} can maximally preserve the expected information distance induced by the Fisher–Rao metric.
+
+Proof. in Appendix E.
+
+3.2.2. Perturbations in Typical Distributions
+
+To facilitate our analysis, we make a basic assumption on the underlying distributions q(x)
+that at least (2^n − 2^{n/2}) p-coordinates are of the scale ϵ, where ϵ is a sufficiently small value. Thus,
+the residual p-coordinates (at most 2^{n/2}) are all significantly larger than zero (of scale Θ(1/2^{n/2})), and
+their sum approximates one. Note that these assumptions are common situations in real-world data
+collections [18], since the frequent (or meaningful) patterns are only a small fraction of all of the
+system states.
+Next, we introduce a small perturbation Δp to the p-coordinates [p] for the ideal distribution
+q(x). The scale of each fluctuation Δp_I is assumed to be proportional to the standard deviation of the
+corresponding p-coordinate p_I by some small coefficient (upper bounded by a constant a), which can
+be approximated by the inverse of the square root of its Fisher information via the Cramér–Rao bound.
+It turns out that we can assume the perturbation Δp_I to be a√p_I.
+In this section, we adopt the l-mixed-coordinates [ζ]_l = (η_{l−}; θ_{l+}), where l = 2 is used in
+the following analysis. Let Δζ_q = (Δη_{2−}; Δθ_{2+}) be the increment of the mixed-coordinates after the
+perturbation. The squared Fisher information distance D^2(q, q′) = Δζ_q · G_ζ · Δζ_q could be decomposed
+into the direction of each coordinate in [ζ]_l. We will clarify that, under typical cases, the scale of the
+Fisher information distance in each coordinate of θ_{l+} (reduced by CIF) is asymptotically negligible,
+compared to that in each coordinate of η_{l−} (preserved by CIF).
+The scale of squared Fisher information distance in the direction of η_I is proportional to Δη_I ·
+(G_ζ)_{I,I} · Δη_I, where (G_ζ)_{I,I} is the Fisher information of η_I in terms of the mixed-coordinates [ζ]_2. From
+Equation (1), for any I of order one (or two), η_I is the sum of 2^{n−1} (or 2^{n−2}) p-coordinates, and the scale
+is Θ(1). Hence, the increment Δη_{2−} is proportional to Θ(1), denoted as a · Θ(1). It is difficult to give
+an explicit expression of (G_ζ)_{I,I} analytically. However, the Fisher information (G_ζ)_{I,I} of η_I is bounded
+by the (I, I)-th element of the inverse covariance matrix [19], which is exactly 1/g_{I,I}(θ) = 1/(η_I − η_I^2) (see
+Proposition 3). Hence, the scale of (G_ζ)_{I,I} is also Θ(1). It turns out that the scale of squared Fisher
+information distance in the direction of η_I is a^2 · Θ(1).
+Similarly, for the part θ_{2+}, the scale of squared Fisher information distance in the direction of
+θ_J is proportional to Δθ_J · (G_ζ)_{J,J} · Δθ_J, where (G_ζ)_{J,J} is the Fisher information of θ_J in terms of the
+mixed-coordinates [ζ]_2. The scale of θ_J is maximally f(k)|log(√ϵ)| based on Equation (2), where k is
+the order of θ_J and f(k) is the number of p-coordinates of scale Θ(1/2^{n/2}) that are involved in the
+calculation of θ_J. Since we assume that f(k) ≤ 2^{n/2}, the maximum scale of θ_J is 2^{n/2}|log(√ϵ)|. Thus,
+the increment Δθ_J is of a scale bounded by a · 2^{n/2}|log(√ϵ)|. Similar to our previous derivation, the
+Fisher information (G_ζ)_{J,J} of θ_J is bounded by the (J, J)-th element of the inverse covariance matrix,
+which is exactly 1/g^{J,J}(η) (see Proposition 3). Hence, the scale of (G_ζ)_{J,J} is (2^k − f(k))^{−1}ϵ. In summary,
+the scale of squared Fisher information distance in the direction of θ_J is bounded by the scale of
+a^2 · Θ(2^n ϵ |log(√ϵ)|^2 / (2^k − f(k))). Since ϵ is a sufficiently small value and a is constant, the scale of squared Fisher
+information distance in the direction of θ_J is asymptotically zero.
+In summary, in terms of modeling the fluctuated observations of typical cognitive systems, the
+original Fisher information distance between the physical phenomenon (q(x)) and observations (q′(x))
+is systematically reduced using CIF by projecting them on an optimal submanifold M. Based on our
+above analysis, the scale of Fisher information distance in the directions of [η_{l−}] preserved by CIF is
+significantly larger than that of the directions [θ_{l+}] reduced by CIF.
+
+4. Derivation of Boltzmann Machine by CIF
+
+In the previous section, the CIF principle is uncovered in the [ζ]_l coordinates. Now, we consider
+an implementation of CIF when l equals two, which gives rise to the single-layer Boltzmann machine
+without hidden units (SBM).
+
+4.1. Notations for SBM
+
+The energy function for SBM is given by:
+
+E_SBM(x; ξ) = −(1/2) x^T U x − b^T x    (9)
+
+where ξ = {U, b} are the parameters and the diagonals of U are set to zero. The Boltzmann
+distribution over x is p(x; ξ) = (1/Z) exp{−E_SBM(x; ξ)}, where Z is a normalization factor. Actually,
+the parametrization for SBM could be naturally expressed by the coordinate systems in IG (e.g.,
+[θ] = (θ^i_1 = b_i, θ^{ij}_2 = U_{ij}, θ^{ijk}_3 = 0, ..., θ^{1,2,...,n}_n = 0)).
+
+4.2. The Derivation of SBM using CIF
+
+Given any underlying probability distribution q(x) on the general manifold S over {x}, the
+logarithm of q(x) can be represented by a linear decomposition of θ-coordinates, as shown in
+Equation (2). Since it is impractical to recognize all coordinates for the target distribution, we would
+like to only approximate part of them and end up with a k-dimensional submanifold M of S, where k
+(≪ 2^n − 1) is the number of free parameters. Here, we set k to be the same dimensionality as SBM,
+i.e., k = n(n+1)/2, so that all candidate submanifolds are comparable to the submanifold endowed by
+SBM (denoted as M_sbm). Next, the rationale underlying the design of M_sbm can be illustrated using the
+general CIF.
+Let the two-mixed-coordinates of q(x) on S be [ζ]_2 = (η^1_i, η^2_{ij}, θ^{i,j,k}_3, . . . , θ^{1,...,n}_n). Applying the
+general CIF on [ζ]_2, our parametric reduction rule is to preserve the high-confidence parameters
+[η_{2−}] and replace the low-confidence parameters [θ_{2+}] by a fixed neutral value of zero. Thus, we derive
+the two-tailored-mixed-coordinates: [ζ]_{2t} = (η^1_i, η^2_{ij}, 0, . . . , 0), as the optimal approximation of q(x)
+by the k-dimensional submanifolds. On the other hand, given the two-mixed-coordinates of q(x),
+the projection p(x) ∈ M_sbm of q(x) is proven to be [ζ]_p = (η^1_i, η^2_{ij}, 0, . . . , 0) [8]. Thus, SBM defines a
+probabilistic parameter space that is derived from CIF.
+
+4.3. The Learning Algorithms for SBM
+
+Let q(x) be the underlying probability distribution from which samples D = {d1, d2, . . . , dN} are
+generated independently. Then, our goal is to train an SBM (with stationary probability p(x)) based on
+D that realizes q(x) as faithfully as possible. Here, we briefly introduce two typical learning algorithms
+for SBM: maximum-likelihood and contrastive divergence [11,20,21].
+Maximum-likelihood (ML) learning realizes a gradient ascent of log-likelihood of D:
+
+ΔU_{ij} = ε ∂l(ξ; D)/∂U_{ij} = ε(E_q[x_i x_j] − E_p[x_i x_j])    (10)
+
+where ε is the learning rate and l(ξ; D) = (1/N) ∑_{n=1}^N log p(d_n; ξ). E_q[·] and E_p[·] are expectations over
+q(x) and p(x), respectively. Actually, E_q[x_i x_j] and E_p[x_i x_j] are the coordinates η^2_{ij} of q(x) and p(x),
+respectively. E_q[x_i x_j] can be estimated without bias from the sample. Markov chain Monte Carlo [22]
+is often used to approximate E_p[x_i x_j] with an average over samples from p(x).
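+
+For a very small n, the update in Equation (10) can be run with exact expectations instead of MCMC.
+The sketch below makes several assumptions that are not in the text: E_p is computed by enumerating
+all 2^n states, the bias update Δb_i = ε(E_q[x_i] − E_p[x_i]) is the standard analogue of Equation (10),
+the target q is the three-variable example of Section 3.1 with p_{000} = 0.05 from normalization, and the
+step size and iteration count are arbitrary choices.
+
```python
import itertools
import numpy as np

def sbm_probs(X, U, b):
    # Boltzmann distribution with the energy of Equation (9), normalized by enumeration.
    energy = -0.5 * np.einsum("si,ij,sj->s", X, U, X) - X @ b
    w = np.exp(-(energy - energy.min()))
    return w / w.sum()

def ml_fit(X, q, lr=0.5, steps=20000):
    # Gradient ascent on the log-likelihood using exact expectations under q and p.
    n = X.shape[1]
    U, b = np.zeros((n, n)), np.zeros(n)
    for _ in range(steps):
        diff = q - sbm_probs(X, U, b)
        grad_U = X.T @ (diff[:, None] * X)    # E_q[x_i x_j] - E_p[x_i x_j]
        np.fill_diagonal(grad_U, 0.0)         # the diagonal of U stays at zero
        U += lr * grad_U
        b += lr * (X.T @ diff)                # E_q[x_i] - E_p[x_i] (assumed bias update)
    return U, b

# All states (x_1, x_2, x_3) in the order 000, 001, ..., 111, with their target probabilities.
X = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
q = np.array([0.05, 0.15, 0.10, 0.05, 0.20, 0.10, 0.05, 0.30])

U_fit, b_fit = ml_fit(X, q)
p_fit = sbm_probs(X, U_fit, b_fit)
print("second-order moments of q:\n", X.T @ (q[:, None] * X))
print("second-order moments of fitted SBM:\n", X.T @ (p_fit[:, None] * X))
```
+
+At convergence, the first- and second-order moments of the fitted SBM should match those of q, while the
+third-order interaction of q is left unmodeled.
+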
+Contrastive divergence (CD) learning realizes the gradient descent of a different objective function
+to avoid the difficulty of computing the log-likelihood gradient, shown as follows:
+
+ΔU_{ij} = −ε ∂(KL(q_0||p) − KL(p_m||p))/∂U_{ij} = ε(E_{q_0}[x_i x_j] − E_{p_m}[x_i x_j])    (11)
+
+where q_0 is the sample distribution, p_m is the distribution obtained by starting the Markov chain with the data
+and running m steps, and KL(·||·) denotes the K-L divergence. Taking samples in D as initial states, we
+could generate a set of samples for p_m(x). Those samples can be used to estimate E_{p_m}[x_i x_j].
+From the perspective of IG, we can see that ML/CD learning is to update parameters in SBM,
+so that its corresponding coordinates [η_{2−}] are getting closer to the data (along with the decreasing
+gradient). This is consistent with our theoretical analysis in Sections 3 and 4.2 that SBM uses
+the most confident information (i.e., [η_{2−}]) for approximating an arbitrary distribution in an expected
+sense.
+
+5. Experimental Study: Incorporate Data into CIF
+
+In the information-based CIF, the actual values of the data were not used to explicitly affect the
+output PDF (e.g., the derivation of SBM in Section 4). The data constrains the state of knowledge about
+the unknown PDF. In order to force the estimate of our probabilistic model to obey the data, we need
+to further reduce the difference between data information and physical information bound. How can
+this be done?
+In this section, the CIF principle will also be used to modify an existing SBM training algorithm (i.e.,
+CD-1) by incorporating data information. Given a particular dataset, the CIF can be used to further
+recognize less-confident parameters in SBM and to reduce them properly. Our solution here is to
+apply CIF to take effect on the learning trajectory with respect to specific samples and, hence, further
+confine the parameter space to the region indicated by the most confident information contained in
+the samples.
+
+Entropy 2014, 16, 3670–3688
+
+5.1. A Sample-Specific CIF-Based CD Learning for SBM
+
+The main modification of our CIF-based CD algorithm (CD-CIF for short) is that we generate
+the samples for pm(x) based on those parameters with confident information, where the confident
+information carried by certain parameter is inherited from the sample and could be assessed using its
+Fisher information computed in terms of the sample.
+For CD-1 (i.e., m = 1), the firing probability for the i-th neuron after a one-step transition from the
+initial state $x^{(0)} = \{x^{(0)}_1, x^{(0)}_2, \ldots, x^{(0)}_n\}$ is:
+
+\[ p(x^{(1)}_i = 1 \mid x^{(0)}) = \frac{1}{1 + \exp\{-\sum_{j \neq i} U_{ij} x^{(0)}_j - b_i\}} \tag{12} \]
+
+For CD-CIF, the firing probability for the i-th neuron in Equation (12) is modified as follows:
+
+\[ p(x^{(1)}_i = 1 \mid x^{(0)}) = \frac{1}{1 + \exp\{-\sum_{(j \neq i) \,\&\, (F(U_{ij}) > \tau)} U_{ij} x^{(0)}_j - b_i\}} \tag{13} \]
+
+where τ is a pre-selected threshold, $F(U_{ij}) = E_{q_0}[x_i x_j] - E_{q_0}[x_i x_j]^2$ is the Fisher information of Uij (see
+Equation (6)) and the expectations are estimated from the given sample D. We can see that those
+weights whose Fisher information is less than τ are considered to be unreliable w.r.t. D. In practice,
+we could set τ via the ratio r, which specifies the proportion of the total Fisher information TFI of all
+parameters that we would like to retain, i.e., $\sum_{F(U_{ij}) > \tau,\, i < j} F(U_{ij}) = r \cdot TFI$.
+In summary, CD-CIF is realized in two phases. In the first phase, we initially “guess” whether
+a given parameter could be faithfully estimated based on the finite sample. In the second phase, we
+approximate the gradient using the CD scheme, except that the CIF-based firing function in
+Equation (13) is used.
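The two phases can be sketched as below, under stated assumptions: the Fisher information of each weight is estimated from the sample as described for Equation (13), the confident weights are chosen greedily (largest Fisher information first) until a fraction r of the total is covered, and the firing probability then sums only over those weights. The greedy selection rule and all function names are illustrative choices, not taken from the paper.

```python
import math


def pair_fisher_information(D):
    """F(U_ij) = E_q0[x_i x_j] - E_q0[x_i x_j]^2, estimated from the sample D, for i < j."""
    n, N = len(D[0]), len(D)
    F = {}
    for i in range(n):
        for j in range(i + 1, n):
            e = sum(d[i] * d[j] for d in D) / N
            F[(i, j)] = e - e * e
    return F


def confident_pairs(F, r):
    """Phase 1: keep the weights with the largest Fisher information until r * TFI is covered."""
    total = sum(F.values())
    kept, covered = set(), 0.0
    for pair, f in sorted(F.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= r * total:
            break
        kept.add(pair)
        covered += f
    return kept


def cif_fire_prob(i, x, U, b, kept):
    """Phase 2: firing probability of Equation (13), using only the confident weights."""
    s = b[i]
    for j in range(len(x)):
        if j != i and (min(i, j), max(i, j)) in kept:
            s += U[i][j] * x[j]
    return 1.0 / (1.0 + math.exp(-s))
```

Substituting `cif_fire_prob` for the ordinary conditional in a CD-1 loop would give one possible reading of CD-CIF; the weights outside `kept` then never influence the samples drawn for p1(x).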
+
+5.2. Experimental Results
+In this section, we empirically investigate our justifications for the CIF principle, especially how
+the sample-specific CIF-based CD learning (see Section 5.1) works in the context of density estimation.
+Experimental Setup and Evaluation Metric: We use random distributions uniformly generated
+from the open probability simplex over 10 variables as underlying distributions, whose sample size
+N may vary. Three learning algorithms are investigated: ML, CD-1 and our CD-CIF. K-L divergence is
+used to evaluate the goodness-of-fit of the SBMs trained by the various algorithms. For each sample size N,
+we run 100 instances (20 randomly generated distributions × 5 random runs) and report the
+averaged K-L divergences. Note that we focus on the case where the number of variables is relatively small
+(n = 10) in order to evaluate the K-L divergence analytically and give a detailed study of the algorithms.
+Changing the number of variables has only a minor influence on the experimental results, since we
+obtained qualitatively similar observations for various numbers of variables (not reported here).
+Automatically Adjusting r for Different Sample Sizes: The Fisher information is additive for i.i.d.
+sampling. When sample sizes change, it is natural to require that the total amount of Fisher information
+contained in all tailored parameters is steady. Hence, we have α = (1 − r)N, where α indicates the
+amount of Fisher information and becomes a constant when the learning model and the underlying
+distribution family are given. It turns out that we can first identify α using the optimal r w.r.t several
+distributions generated from the underlying distribution family and then determine the optimal r’s for
+various sample sizes using r = 1 − α/N. In our experiments, we set α = 35.
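The schedule r = 1 − α/N can be written as a one-line helper. The default α = 35 is the value reported above; the clamp at zero for N < α is an assumption added here so that the ratio stays in a valid range, and the function name is hypothetical.

```python
def retained_ratio(sample_size, alpha=35.0):
    """Ratio r of total Fisher information to retain, r = 1 - alpha / N.

    The lower clamp at 0 is an assumption for very small samples (N < alpha).
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return max(0.0, 1.0 - alpha / sample_size)
```

With α fixed, r approaches 1 as N grows, so fewer parameters are tailored for larger samples, which matches the observation that the benefit of CD-CIF fades once most parameters can be estimated reliably.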
+Density Estimation Performance: The averaged K-L divergences between SBMs (learned by ML,
+CD-1 and CD-CIF with the r automatically determined) and the underlying distribution are shown in
+Figure 3a. In the case of relatively small samples (N ≤ 500) in Figure 3a, our CD-CIF method shows
+significant improvements over ML (from 10.3% to 16.0%) and CD-1 (from 11.0% to 21.0%). This is
+because we could not expect to have reliable identifications for all model parameters from insufficient
+samples, and hence, CD-CIF gains its advantages by using parameters that could be confidently
+estimated. This result is consistent with our previous theoretical insight that Fisher information gives a
+reasonable guidance for parametric reduction via the confidence criterion. As the sample size increases
+(N ≥ 600), CD-CIF, ML and CD-1 tend to have similar performances, since, with relatively large
+samples, most model parameters can be reasonably estimated, hence the effect of parameter reduction
+using CIF gradually becomes marginal. In Figure 3b and Figure 3c, we show how sample size affects
+the interval of r. For N = 100, CD-CIF achieves significantly better performance for a wide range of r,
+whereas for N = 1200, CD-CIF only marginally outperforms the baselines for a narrow range of r.
+
+[Figure 3, panels (a) to (d): plot images not recoverable from the extracted text.]
+
+Figure 3. (a): the performance of CD-CIF on different sample sizes; (b) and (c): The performances
+of CD-CIF with various values of r on two typical sample sizes, i.e., 100 and 1200; (d) illustrates one
+learning trajectory of the last 100 steps for ML (squares), CD-1 (triangles) and CD-CIF (circles).
+
+Effects on Learning Trajectory: We use the 2D visualization technique SNE [20] to investigate the learning
+trajectories and dynamical behaviors of the three algorithms under comparison. We start the three methods with the
+same parameter initialization. Then, each intermediate state is represented by a 55-dimensional vector
+formed by its current parameter values. From Figure 3d, we can see that: (1) In the final 100 steps, the
+three methods seem to end up staying in different regions of the parameter space, and CD-CIF confines
+the parameter in a relatively thinner region compared to ML and CD-1; (2) The true distribution is
+usually located on the side of CD-CIF, indicating its potential for converging to the optimal solution.
+Note that the above claims are based on general observations, and Figure 3d is shown as an illustration.
+Hence, we may conclude that CD-CIF regularizes the learning trajectories in a desired region of the
+parameter space using the sample-specific CIF.
+
+6. Conclusions
+
+Different from the traditional EPI, the CIF principle proposed in this paper aims at finding a
+way to derive computational models for universal cognitive systems by a dimensionality reduction
+approach in parameter spaces: specifically, by preserving the confident parameters and reducing the
+less confident parameters. In principle, the confidence of parameters should be assessed according
+to their contributions to the expected information distance between the ideal distribution and its
+fluctuated observations. This is called the distance-based CIF. For some coordinate systems, e.g.,
+the mixed-coordinate system [ζ]l, the confidence of a parameter can also be directly evaluated by
+its Fisher information, which establishes a connection with the inverse variance of any unbiased
+estimate for the parameter via the Cramér–Rao bound. This is called the information-based CIF.
+The criterion of information-based CIF (i.e., Fisher information) can be seen as an approximation to
+distance-based CIF, since it neglects the influence of parameter scaling on the expected information
+distance. However, considering the standard mixed-coordinates [ζ]l for the manifold of multivariate
+binary distributions, it turns out that both distance-based CIF and information-based CIF entail the
+same optimal submanifold M.
+The CIF provides a strategy for the derivation of probabilistic models. The SBM is a specific
+example in this regard. It has been theoretically shown that the SBM can achieve a reliable
+representation in parameter spaces by using the CIF principle.
+The CIF principle can also be used to modify existing SBM training algorithms by incorporating
+data information, such as CD-CIF. One interesting result shown in our experiments is that: although
+CD-CIF is a biased algorithm, it could significantly outperform ML when the sample is insufficient.
+This suggests that CIF gives us a reasonable criterion for utilizing confident information from the
+underlying data, while ML lacks a mechanism to do so.
+In the future, we will further develop the formal justification of CIF w.r.t various contexts (e.g.,
+distribution families or models).
+
+Acknowledgments: We would like to thank the anonymous reviewers for their valuable comments. We also
+thank Mengjiao Xie and Shuai Mi for their helpful discussions. This work is partially supported by the Chinese
+National Program on Key Basic Research Project (973 Program, Grant No. 2013CB329304 and 2014CB744604), the
+Natural Science Foundation of China (Grant Nos. 61272265, 61070044, 61272291, 61111130190 and 61105072).
+
+Appendix
+
+Appendix A Proof of Proposition 1
+
+Proof. By definition, we have:
+
+\[ g_{IJ} = \frac{\partial^2 \psi(\theta)}{\partial \theta_I \partial \theta_J} \]
+
+where ψ(θ) is defined by Equation (4). Hence, we have:
+
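+\[ g_{IJ} = \frac{\partial^2 \left( \sum_I \theta_I \eta_I - \phi(\eta) \right)}{\partial \theta_I \partial \theta_J} = \frac{\partial \eta_I}{\partial \theta_J} \]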
+
+By differentiating ηI, defined by Equation (1), with respect to θJ, we have:
+
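+\[ g_{IJ} = \frac{\partial \eta_I}{\partial \theta_J} = \frac{\partial \sum_x X_I(x) \exp\{\sum_I \theta_I X_I(x) - \psi(\theta)\}}{\partial \theta_J} = \sum_x X_I(x) [X_J(x) - \eta_J] \, p(x; \theta) = \eta_{I \cup J} - \eta_I \eta_J \]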
+
+This completes the proof.
+
+Appendix B Proof of Proposition 2
+
+Proof. By definition, we have:
+
+\[ g_{IJ} = \frac{\partial^2 \phi(\eta)}{\partial \eta_I \partial \eta_J} \]
+
+where φ(η) is defined by Equation (4). Hence, we have:
+
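+\[ g_{IJ} = \frac{\partial^2 \left( \sum_J \theta_J \eta_J - \psi(\theta) \right)}{\partial \eta_I \partial \eta_J} = \frac{\partial \theta_I}{\partial \eta_J} \]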
+
+Based on Equations (2) and (1), the θI and pK could be calculated by solving a linear equation of [p]
+and [η], respectively. Hence, we have:
+
+\[ \theta_I = \sum_{K \subseteq I} (-1)^{|I-K|} \log(p_K); \quad p_K = \sum_{K \subseteq J} (-1)^{|J-K|} \eta_J \]
+
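+Therefore, the partial derivative of θI with respect to ηJ is:
+
+\[ g_{IJ} = \frac{\partial \theta_I}{\partial \eta_J} = \sum_K \frac{\partial \theta_I}{\partial p_K} \cdot \frac{\partial p_K}{\partial \eta_J} = \sum_{K \subseteq I \cap J} (-1)^{|I-K|+|J-K|} \cdot \frac{1}{p_K} \]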
+
+This completes the proof.
+
+Appendix C Proof of Proposition 3
+
+Proof. The Fisher information matrix of [ζ] could be partitioned into four parts: $G_\zeta = \begin{pmatrix} A & C \\ D & B \end{pmatrix}$.
+
+It can be verified that in the mixed coordinate, the θ-coordinate of order k is orthogonal to any
+η-coordinate less than k-order, implying the corresponding element of the Fisher information matrix is
+zero (C = D = 0) [23]. Hence, Gζ is a block diagonal matrix.
+According to the Cramér–Rao bound [3], a parameter (or a pair of parameters) has a unique
+asymptotically tight lower bound of the variance (or covariance) of the unbiased estimate, which is
+given by the corresponding element of the inverse of the Fisher information matrix involving this
+parameter (or this pair of parameters). Recall that Iη is the index set of the parameters shared by [η]
+and [ζ]l and that Jθ is the index set of the parameters shared by [θ] and [ζ]l; we have
+$(G_\zeta^{-1})_{I_\zeta} = (G_\eta^{-1})_{I_\eta}$ and $(G_\zeta^{-1})_{J_\zeta} = (G_\theta^{-1})_{J_\theta}$, i.e.,
+
+\[ G_\zeta^{-1} = \begin{pmatrix} (G_\eta^{-1})_{I_\eta} & 0 \\ 0 & (G_\theta^{-1})_{J_\theta} \end{pmatrix}. \]
+
+Since Gζ is a block diagonal matrix, the proposition follows.
+
+Appendix D Proof of Proposition 4
+
+Proof. Assume the Fisher information matrix of [θ] to be $G_\theta = \begin{pmatrix} U & X \\ X^T & V \end{pmatrix}$, which is partitioned
+based on Iη and Jθ. Based on Proposition 3, we have $A = U^{-1}$. Obviously, the diagonal elements
+of U are all smaller than one. According to the succeeding Lemma A1, we can see that the diagonal
+elements of A (i.e., U−1) are greater than one.
+Next, we need to show that the diagonal elements of B are smaller than 1. Using the Schur
+complement of Gθ, the bottom-right block of $G_\theta^{-1}$, i.e., $(G_\theta^{-1})_{J_\theta}$, equals $(V - X^T U^{-1} X)^{-1}$. Thus, the
+diagonal elements of B satisfy $B_{jj} = (V - X^T U^{-1} X)_{jj} < V_{jj} < 1$. Hence, we complete the proof.
+
+Lemma A1. For an l × l positive definite matrix H, if Hii < 1, then (H−1)ii > 1, ∀i ∈ {1, 2, . . . , l}.
+
+Proof. Since H is positive definite, it is a Gramian matrix of l linearly independent vectors v1, v2, . . . , vl,
+i.e., Hij = ⟨vi, vj⟩ (⟨·, ·⟩ denotes the inner product). Similarly, H−1 is the Gramian matrix of l linearly
+independent vectors w1, w2, . . . , wl and (H−1)ij = ⟨wi, wj⟩. It is easy to verify that ⟨wi, vi⟩ = 1, ∀i ∈
+{1, 2, . . . , l}. If Hii < 1, we can see that the norm ∥vi∥ = √Hii < 1. Since ∥wi∥ × ∥vi∥ ≥ ⟨wi, vi⟩ = 1,
+we have ∥wi∥ > 1. Hence, (H−1)ii = ⟨wi, wi⟩ = ∥wi∥2 > 1.
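As a quick numerical sanity check of Lemma A1 (not a substitute for the proof), the sketch below inverts a hand-picked 2 × 2 positive definite matrix whose diagonal entries are below one and inspects the diagonal of the inverse. The example matrix and the helper names are assumptions chosen for illustration.

```python
def is_positive_definite_2x2(H):
    """Sylvester's criterion for a symmetric 2 x 2 matrix."""
    (a, b), (c, d) = H
    return b == c and a > 0 and a * d - b * c > 0


def inverse_2x2(H):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = H
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Diagonal entries 0.5 and 0.8 are both smaller than one.
H = [[0.5, 0.2], [0.2, 0.8]]
H_inv = inverse_2x2(H)
# Lemma A1 predicts that both diagonal entries of H_inv exceed one.
diagonal_exceeds_one = all(H_inv[i][i] > 1.0 for i in range(2))
```

For this matrix the determinant is 0.36, so the diagonal of the inverse is 0.8/0.36 and 0.5/0.36, both above one as the lemma requires.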
+
+Appendix E Proof of Proposition 5
+
+Proof. Let Bq be an ε-ball surface centered at q(x) on manifold S, i.e., Bq = {q′ ∈ S | KL(q, q′) = ε},
+where KL(·, ·) denotes the Kullback–Leibler divergence and ε is small. ζq is the coordinates of q(x). Let
+q(x) + dq be a neighbor of q(x) uniformly sampled on Bq and ζq(x)+dq be its corresponding coordinates.
+For a small ε, we can calculate the expected information distance between q(x) and q(x) + dq as follows:
+
+\[ E_{B_q} = \int \left[ (\zeta_{q(x)+dq} - \zeta_q)^T G_\zeta (\zeta_{q(x)+dq} - \zeta_q) \right]^{1/2} dB_q \tag{A1} \]
+
+where Gζ is the Fisher information matrix at q(x).
+Since Fisher information matrix Gζ is both positive definite and symmetric, there exists a singular
+value decomposition Gζ = UTΛU where U is an orthogonal matrix and Λ is a diagonal matrix with
+diagonal entries equal to the eigenvalues of Gζ (all ≥ 0).
+Applying the singular value decomposition into Equation (A1), the distance becomes:
+
+\[ E_{B_q} = \int \left[ (\zeta_{q(x)+dq} - \zeta_q)^T U^T \Lambda U (\zeta_{q(x)+dq} - \zeta_q) \right]^{1/2} dB_q \tag{A2} \]
+
+Note that U is an orthogonal matrix, and the transformation U(ζq(x)+dq − ζq) is a norm-preserving rotation.
+Now, we need to show that among all tailored k-dimensional submanifolds of S, [ζ]lt is the
+one that preserves maximum information distance. Assume IT = {i1, i2, . . . , ik} is the index of k
+coordinates that we choose to form the tailored submanifold T in the mixed-coordinates [ζ]. According
+to the fundamental analytical properties of the surface of the hyper-ellipsoid and the orthogonality of
+the mixed-coordinates, there exists a strict positive monotonicity between the expected information
+distance EBq for T and the sum of eigenvalues of the sub-matrix (Gζ)IT, where the sum equals the
+trace of (Gζ)IT. That is, the greater the trace of (Gζ)IT, the greater the expected information distance
+EBq for T.
+Next, we show that the sub-matrix of Gζ specified by [ζ]lt gives a maximum trace. Based on
+Proposition 4, the elements on the main diagonal of the sub-matrix A are lower bounded by one and
+those of B upper bounded by one. Therefore, [ζ]lt gives the maximum trace among all sub-matrices of
+Gζ. This completes the proof.
+
+Author Contributions: Theoretical study and proof: Yuexian Hou and Xiaozhao Zhao. Conceived and designed
+the experiments: Xiaozhao Zhao, Yuexian Hou, Dawei Song and Wenjie Li. Performed the
+experiments: Xiaozhao Zhao. Analyzed the data: Xiaozhao Zhao, Yuexian Hou. Wrote the manuscript:
+Xiaozhao Zhao, Dawei Song, Wenjie Li and Yuexian Hou. All authors have read and approved the
+final manuscript.
+
+Conflicts of Interest: The authors declare no conflict of interest.
+
+References
+
+1. Wheeler, J.A. Time Today; Cambridge University Press: Cambridge, UK, 1994; pp. 1–29.
+2. Frieden, B.R. Science from Fisher Information: A Unification; Cambridge University Press: Cambridge, UK, 2004.
+3. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta
+Math. Soc. 1945, 37, 81–91.
+4. Frieden, B.R.; Gatenby, R.A. Principle of maximum Fisher information from Hardy’s axioms applied to
+statistical systems. Phys. Rev. E 2013, 88, 042144.
+5. Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference: A Practical Information-Theoretic
+Approach; Springer: Berlin/Heidelberg, Germany, 2002.
+6. Vstovsky, G.V. Interpretation of the extreme physical information principle in terms of shift information.
+Phys. Rev. E 1995, 51, 975–979.
+7. Amari, S.; Nagaoka, H. Methods of Information Geometry; Translations of Mathematical Monographs;
+Oxford University Press: Oxford, UK, 1993.
+8. Amari, S.; Kurata, K.; Nagaoka, H. Information geometry of Boltzmann machines. IEEE Trans. Neural Netw.
+1992, 3, 260–271.
+9. Hou, Y.; Zhao, X.; Song, D.; Li, W. Mining pure high-order word associations via information geometry for
+information retrieval. ACM Trans. Inf. Syst. 2013, 31, 12:1–12:32.
+10. Kass, R.E. The geometry of asymptotic inference. Stat. Sci. 1989, 4, 188–219.
+11. Ackley, D.H.; Hinton, G.E.; Sejnowski, T.J. A learning algorithm for Boltzmann machines. Cogn. Sci. 1985,
+9, 147–169.
+12. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006,
+313, 504–507.
+13. Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational
+Research: Copenhagen, Denmark, 1960.
+14. Bond, T.; Fox, C. Applying the Rasch Model: Fundamental Measurement in the Human Sciences; Psychology Press:
+London, UK, 2013.
+15. Gibilisco, P. Algebraic and Geometric Methods in Statistics; Cambridge University Press: Cambridge, UK, 2010.
+16. Čencov, N.N. Statistical Decision Rules and Optimal Inference; American Mathematical Society: Washington,
+D.C., USA, 1982.
+17. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
+18. Buhlmann, P.; van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory And Applications;
+Springer: Berlin/Heidelberg, Germany, 2011.
+19. Bobrovsky, B.; Mayer-Wolf, E.; Zakai, M. Some classes of global Cramér-Rao bounds. Ann. Stat. 1987,
+15, 1421–1438.
+20. Hinton, G.E. Training products of experts by minimizing contrastive divergence. Neural Comput. 2002,
+14, 1771–1800.
+21. Carreira-Perpinan, M.A.; Hinton, G.E. On contrastive divergence learning. In Proceedings of the
+International Workshop on Artificial Intelligence and Statistics, Key West, FL, USA, 6–8 January 2005;
+pp. 33–40.
+22. Gilks, W.R.; Richardson, S.; Spiegelhalter, D. Introducing Markov chain Monte Carlo. In Markov Chain Monte
+Carlo in Practice; Chapman and Hall/CRC: London, UK, 1996; pp. 1–19.
+23. Nakahara, H.; Amari, S. Information geometric measure for neural spikes. Neural Comput. 2002,
+14, 2269–2316.
+
+© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
+article distributed under the terms and conditions of the Creative Commons Attribution
+(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
+
+
+MDPI
+St. Alban-Anlage 66
+4052 Basel
+
+Switzerland
+Tel. +41 61 683 77 34
+Fax +41 61 302 89 18
+www.mdpi.com
+
+Entropy Editorial Office
+E-mail: entropy@mdpi.com
+
+www.mdpi.com/journal/entropy
+
+ISBN 978-3-03897-633-2
+
+
diff --git a/papers/references/temp/Bastos2012.pdf b/papers/references/temp/Bastos2012.pdf
new file mode 100644
index 00000000..28ce3bd5
--- /dev/null
+++ b/papers/references/temp/Bastos2012.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:af1955d6a49759e0a78310c2e1948cf421e3c78babd18a294b7b621a0c7cd599
+size 533952
diff --git a/papers/references/temp/Bastos2012.txt b/papers/references/temp/Bastos2012.txt
new file mode 100644
index 00000000..652e6b8b
--- /dev/null
+++ b/papers/references/temp/Bastos2012.txt
@@ -0,0 +1,1759 @@
+Canonical microcircuits for predictive coding
+
+Andre M. Bastos (1,2,6), W. Martin Usrey (1,3,4), Rick A. Adams (8), George R. Mangun (2,3,5), Pascal
+Fries (6,7), and Karl J. Friston (8)
+
+(1) Center for Neuroscience, University of California-Davis, Davis, CA 95618 USA. (2) Center for Mind
+and Brain, University of California-Davis, Davis, CA 95618 USA. (3) Department of Neurology,
+University of California-Davis, Davis, CA 95618 USA. (4) Department of Neurobiology, Physiology
+and Behavior, University of California-Davis, Davis, CA 95618 USA. (5) Department of Psychology,
+University of California-Davis, Davis, CA 95618 USA. (6) Ernst Strüngmann Institute (ESI) for
+Neuroscience in Cooperation with Max Planck Society, Deutschordenstraße 46, 60528 Frankfurt,
+Germany. (7) Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen,
+Kapittelweg 29, 6525 EN Nijmegen, Netherlands. (8) The Wellcome Trust Centre for Neuroimaging,
+University College London, Queen Square, London WC1N 3BG, UK.
+
+Summary
+
+This review considers the influential notion of a canonical (cortical) microcircuit in light of recent
+theories about neuronal processing. Specifically, we conciliate quantitative studies of
+microcircuitry and the functional logic of neuronal computations. We revisit the established idea
+that message passing among hierarchical cortical areas implements a form of Bayesian inference –
+paying careful attention to the implications for intrinsic connections among neuronal populations.
+By deriving canonical forms for these computations, one can associate specific neuronal
+populations with specific computational roles. This analysis discloses a remarkable
+correspondence between the microcircuitry of the cortical column and the connectivity implied by
+predictive coding. Furthermore, it provides some intuitive insights into the functional asymmetries
+between feedforward and feedback connections and the characteristic frequencies over which they
+operate.
+
+Keywords
+
+neuronal; connectivity; cortical; microcircuit; computation; predictive coding; free energy
+principle; gamma oscillations; beta oscillations
+
+Correspondence: Karl Friston, The Wellcome Trust Centre for Neuroimaging, Institute of Neurology, University College London, 12
+Queen Square, London, WC1N 3BG, UK. Tel (44) 207 833 7454. k.friston@ucl.ac.uk.
+
+NIH Public Access Author Manuscript. Neuron. Author manuscript; available in PMC 2013 September 19.
+Published in final edited form as: Neuron. 2012 November 21; 76(4): 695–711. doi:10.1016/j.neuron.2012.10.038.
+
+Introduction
+
+The idea that the brain actively constructs explanations for its sensory inputs is now
+generally accepted. This notion builds on a long history of proposals that the brain uses
+internal or generative models to make inferences about the causes of its sensorium
+(Helmholtz, 1860; Gregory 1968, 1980; Dayan et al., 1995). In terms of implementation,
+predictive coding is, arguably, the most plausible neurobiological candidate for making
+these inferences (Srinivasan et al., 1982; Mumford, 1992; Rao and Ballard, 1999). This
+review considers the canonical microcircuit in light of predictive coding. We focus on the
+intrinsic connectivity within a cortical column and the extrinsic connections between
+columns in different cortical areas. We try to relate this circuitry to neuronal computations
+by showing the computational dependencies – implied by predictive coding – recapitulate
+the physiological dependencies implied by quantitative studies of intrinsic connectivity. This
+issue is important as distinct neuronal dynamics in different cortical layers are becoming
+increasingly apparent (de Kock et al., 2007; Sakata and Harris, 2009; Maier et al., 2010;
+Bollimunta et al., 2011). For example, recent findings suggest that the superficial layers of
+cortex show neuronal synchronization and spike-field coherence predominantly in the
+gamma frequencies, while deep layers prefer lower (alpha or beta) frequencies (Roopun et
+al., 2006, 2008; Maier et al., 2010; Buffalo et al., 2011). Since feedforward connections
+originate predominately from superficial layers and feedback connections from deep layers,
+these differences suggest that feedforward connections use relatively high frequencies,
+compared to feedback connections, as recently demonstrated empirically (Bosman et al.,
+2012). These asymmetries call for something quite remarkable: namely, a synthesis of
+spectrally distinct inputs to a cortical column and the segregation of its outputs. This
+segregation can only arise from local neuronal computations that are structured and
+precisely interconnected. It is the nature of this intrinsic connectivity – and the dynamics it
+supports – that we consider. The aim of this review is to speculate about the functional roles
+of neuronal populations in specific cortical layers in terms of predictive coding. Our long-
+term aim is to create computationally informed models of microcircuitry that can be tested
+with dynamic causal modelling (David et al., 2006; Moran et al., 2008, 2011).
+
+This review comprises three sections. We start with an overview of the anatomy and
+physiology of cortical connections – with an emphasis on quantitative advances. The second
+section considers the computational role of the canonical microcircuit that emerges from
+these studies. The third section provides a formal treatment of predictive coding and defines
+the requisite computations in terms of differential equations. We then associate the form of
+these equations with the canonical microcircuit to define a computational architecture. We
+conclude with some predictions about intrinsic connections and note some important
+asymmetries in feedforward and feedback connections that emerge from this treatment.
+
+The anatomy and physiology of cortical connections
+
+This section reviews laminar-specific connections that underlie the notion of a canonical
+microcircuit (Douglas et al., 1989; Douglas and Martin, 1991, 2004). We first focus on
+mammalian visual cortex and then consider whether visual microcircuitry can be generalized
+to a canonical circuit for the entire cortex. Both functional and anatomical techniques have
+been applied to study intrinsic (intracortical) and extrinsic connections. We will emphasise
+the insights from recent studies that combine both techniques.
+
+Intrinsic connections and the canonical microcircuit
+
+The seminal work of Douglas and Martin (1991), in the cat visual system, produced a model
+of how information flows through the cortical column. Douglas and Martin recorded
+intracellular potentials from cells in primary visual cortex during electrical stimulation of its
+thalamic afferents. They noted a stereotypical pattern of fast excitation, followed by slower
+and longer-lasting inhibition. The latency of the ensuing hyperpolarisation distinguished
+responses in supragranular and infragranular layers. Using conductance-based models, they
+showed that a simple model could reproduce these responses. Their model contained
+superficial and deep pyramidal cells with a common pool of inhibitory cells. All three
+neuronal populations received thalamic drive and were fully interconnected. The deep
+pyramidal cells received relatively weak thalamic drive but strong inhibition (Figure 1).
+These interconnections allowed the circuit to amplify transient thalamic inputs to generate
+sustained activity in the cortex, while maintaining a balance between excitation and
+inhibition, two tasks that must be solved by any cortical circuit. Their circuit, although based
+on recordings from cat visual cortex, was also proposed as a basic theme that might be
+present and replicated, with minor variations, throughout the cortical sheet (Douglas et al.,
+1989).
+
+Subsequent studies have used intracellular recordings and histology to measure spikes (and
+depolarisation) in pre and post-synaptic cells, whose cellular morphology can be determined.
+This approach quantifies both the connection probability – defined as the number of
+observed connections divided by total number of pairs recorded – and connection strength –
+defined in terms of post-synaptic responses. Thomson et al. (2002) used these techniques to
+study layers 2 to 5 (L2 to L5) of the cat and rat visual systems. The most frequently
+connected cells were located in the same cortical layer, where the largest interlaminar
+projections were the ‘feedforward’ connections from L4 to L3 and from L3 to L5. Excitatory
+reciprocal ‘feedback’ connections were not observed (L3 to L4) or less commonly observed
+(L5 to L3), suggesting that excitation spreads within the column in a feedforward fashion.
+Feedback connections were typically seen when pyramidal cells in one layer targeted
+inhibitory cells in another (see Thomson and Bannister, 2003 for a review).
+
+While many studies have focused on excitatory connections, a few have examined inhibitory
+connections. These are more difficult to study, because inhibitory cells are less common
+than excitatory cells, and because there are at least seven distinct morphological classes
+(Salin and Bullier, 1995). However, recent advances in optogenetics have made it possible
+to more easily target inhibitory cells: Kätzel and colleagues combined optogenetics and
+whole-cell recording to investigate the intrinsic connectivity of inhibitory cells in mouse
+cortical areas M1, S1, and V1 (Kätzel et al., 2010). They transgenically expressed
+channelrhodopsin in inhibitory neurons and activated them, while recording from pyramidal
+cells. This allowed them to assess the effect of inhibition as a function of laminar position
+relative to the recorded neuron.
+
+Several conclusions can be drawn from this approach (Kätzel et al., 2010): first, L4
+inhibitory connections are more restricted in their lateral extent, relative to other layers. This
+supports the notion that L4 responses are dominated by thalamic inputs, while the remaining
+laminae integrate afferents from a wider cortical patch. Second, the primary source of
+inhibition originates from cells in the same layer, reflecting the prevalence of inhibitory
+intralaminar connections. Third, several interlaminar motifs appeared to be general – at least
+in granular cortex: principally, a strong inhibitory connection from L4 onto supragranular
+L2/3 and from infragranular layers onto L4. For more information on the cell-type
+specificity of inhibitory connections, see Yoshimura and Callaway (2005). Figure 2
+provides a summary of key excitatory and inhibitory intralaminar connections.
+
+Microcircuits in the sensorimotor cortex
+
+Do the features of visual microcircuits generalize to other cortical areas? Recently, two
+studies have mapped the intrinsic connectivity of mouse sensory and motor cortices: Lefort
+and colleagues (2009) used multiple whole-cell recordings in mouse barrel cortex to
+determine the probability of monosynaptic connections – and the corresponding connection
+strength. As in visual cortex, the strongest connections were intralaminar and the strongest
+interlaminar connections were the ascending L4 to L2, and descending L3 to L5.
+
+One puzzle about canonical microcircuits is whether motor cortex has a local circuitry that is
+qualitatively similar to sensory cortex. This question is important because motor cortex lacks
+a clearly defined granular L4 (a property that earns it the name “agranular cortex”). Weiler
+and colleagues combined whole-cell recordings in mouse motor cortex with photostimulation
+to uncage glutamate (Weiler et al., 2008). This allowed them to systematically
+stimulate the cortical column in a grid – centred on the pyramidal neuron from which they
+
+Bastos et al.
+Page 3
+
+Neuron. Author manuscript; available in PMC 2013 September 19.
+
+NIH-PA Author Manuscript
+
+
+recorded. By recording from pyramidal neurons in L2-6 (L1 lacks pyramidal cells), the
+authors mapped the excitatory influence that each layer exerts over the others. They found
+that the L2/3 to L5A/B connection was the strongest – accounting for one-third of the total
+synaptic current in the circuit. The second strongest interlaminar connection was the
+reciprocal L5A to L2/3 connection. This pathway may be homologous to the prominent
+L4/5A to L2/3 pathway in sensory cortex. Also – as in sensory cortex – recurrent
+(intralaminar) connections were prominent, particularly in L2, L5A/B and L6. The largest
+fraction of synaptic input arrived in L5A/B, consistent with its key role in accumulating
+information from a wide range of afferents – before sending its output to the corticospinal
+tract. In summary, strong input-layer to superficial, and superficial to deep connectivity,
+together with strong intralaminar connectivity, suggests that the intrinsic circuitry of motor
+cortex is similar to other cortical areas.
+
+The anatomy and physiology of extrinsic connections
+
+Clearly, an account of microcircuits must refer to the layers of origin of extrinsic
+connections and their laminar targets. Although the majority of presynaptic inputs arise from
+intrinsic connections, cortical areas are also richly interconnected, where the balance
+between intrinsic and extrinsic processing mediates functional integration among specialised
+cortical areas (Engel et al., 2010). By numbers alone, intrinsic connections appear to
+dominate – 95% of all neurons labelled with a retrograde tracer lie within about 2 mm of the
+injection site (Markov et al., 2011). The remaining 5% represent cells giving rise to extrinsic
+connections, which – although sparse – can be extremely effective in driving their targets. A
+case in point is the LGN to V1 connection: although it is only the sixth strongest connection
+to V1, LGN afferents have a substantial effect on V1 responses (Markov et al., 2011).
+
+Hierarchies and functional asymmetries—Current dogma holds that the cortex is
+hierarchically organized. The idea of a cortical hierarchy rests on the distinction between
+three types of extrinsic connections: feedforward connections, that link an earlier area to a
+higher area, feedback connections, that link a higher to an earlier area, and lateral
+connections, that link areas at the same level (reviewed in Felleman and Van Essen, 1991).
+These connections are distinguished by their laminar origins and targets. Feedforward
+connections originate largely from superficial pyramidal cells and target L4, while feedback
+connections originate largely from deep pyramidal cells and terminate outside of L4
+(Felleman and Van Essen, 1991). Clearly, this description of cortical hierarchies is a
+simplification and can be nuanced in many ways: for example, as the hierarchical distance
+between two areas increases, the percentage of cells that send feedforward (resp. feedback)
+projections from a lower (resp. higher) level becomes increasingly biased towards the
+superficial (resp. deep) layers (Barone et al., 2000; Vezoli, 2004).
+
+In addition to the laminar specificity of their origins and targets, feedforward and feedback
+connections also differ in their synaptic physiology. The traditional view holds that
+feedforward connections are strong and driving, capable of eliciting spiking activity in their
+targets and conferring classical receptive field properties – the prototypical example being
+the synaptic connection between LGN and V1 (Sherman and Guillery, 1998). Feedback
+connections are thought to modulate (extra-classical) receptive field characteristics
+according to the current context; e.g., visual occlusion, attention, salience, etc. The
+prototypical example of a feedback connection is the cortical L6 to LGN connection.
+Sherman and Guillery identified several properties that distinguish drivers from modulators.
+Driving connections tend to show a strong ionotropic component in their synaptic response,
+evoke large EPSPs, and respond to multiple EPSPs with depressing synaptic effects.
+Modulatory connections produce metabotropic and ionotropic responses when stimulated,
+evoke weak EPSPs, and show paired-pulse facilitation (Sherman and Guillery, 1998; 2011).
+These distinctions were based upon the inputs to the LGN, where retinal input is driving and
+cortical input is modulatory. Until recently, little data were available to assess whether a
+similar distinction applies to cortico-cortical feedforward and feedback connections.
+However, recent studies show that cortical feedback connections express not only
+modulatory but also driving characteristics:
+
+Are feedback connections driving, modulatory or both?—Although it is generally
+thought that feedback connections are weak and modulatory (Crick and Koch, 1998;
+Sherman and Guillery, 1998), recent evidence suggests that feedback connections do more
+than modulate lower level responses: Sherman and colleagues recorded cells in mouse areas
+V1/V2 and A1/A2, while stimulating feedforward or feedback afferents. In both cases,
+driving-like responses as well as modulatory-like responses were observed (Covic and
+Sherman, 2011; De Pasquale and Sherman, 2011). This indicates that – for these
+hierarchically proximate areas – feedback connections can drive their targets just as strongly
+as feedforward connections. This is consistent with earlier studies showing that feedback
+connections can be driving: Mignard and Malpeli (1991) studied the feedback connection
+between areas 18 and 17, while layer A of the LGN was pharmacologically inactivated. This
+silenced the cells in L4 in area 17 but spared activity in superficial layers. However,
+superficial cells were silenced when area 18 was lesioned. This is consistent with a driving
+effect of feedback connections from area 18, in the absence of geniculate input. In summary,
+feedback connections can mediate modulatory and driving effects. This is important from
+the point of view of predictive coding, because top-down predictions have to elicit
+obligatory responses in their targets (cells reporting prediction errors):
+
+In predictive coding, feedforward connections convey prediction errors, while feedback
+connections convey predictions from higher cortical areas to suppress prediction errors in
+lower areas. In this scheme, feedback connections should therefore be capable of exerting
+strong (driving) influences on earlier areas to suppress or counter feedforward driving
+inputs. However, as we will see later, these influences also need to exert nonlinear or
+modulatory effects. This is because top-down predictions are necessarily context sensitive:
+e.g., the occlusion of one visual object by another. In short, predictive coding requires
+feedback connections to drive cells in lower levels in a context sensitive fashion, which
+necessitates a modulatory aspect to their postsynaptic effects.
+
+Are feedback connections excitatory or inhibitory?—Crucially, because feedback
+connections convey predictions – that serve to explain and thereby reduce prediction errors
+in lower levels – their effective (polysynaptic) connectivity is generally assumed to be
+inhibitory. An overall inhibitory effect of feedback connections is consistent with in vivo
+studies. For example, electrophysiological studies of the mismatch negativity suggest that
+neural responses to deviant stimuli – that violate sensory predictions established by a regular
+stimulus sequence – are enhanced relative to predicted stimuli (Garrido et al., 2009).
+Similarly, violating expectations of auditory repetition causes enhanced gamma-band
+responses in early auditory cortex (Todorovic et al., 2011). These enhanced responses are
+thought to reflect an inability of higher cortical areas to predict, and thereby suppress, the
+activity of populations encoding prediction error (Garrido et al., 2007; Wacongne et al.,
+2011). The suppression of predictable responses can also be regarded as repetition
+suppression, observed in single unit recordings from the inferior temporal cortex of macaque
+monkeys (Desimone, 1996). Furthermore, neurons in monkey inferotemporal cortex respond
+significantly less to a predicted sequence of natural images, compared to an unpredicted
+sequence (Meyer and Olson, 2011).
+
+The inhibitory effect of feedback connections is further supported by neuroimaging studies
+(Murray et al., 2002; Murray, 2005; Harrison et al., 2007; Summerfield et al., 2008, 2011;
+Alink et al., 2010). These studies show that predictable stimuli evoke smaller responses in
+early cortical areas. Crucially, this suppression cannot be explained in terms of local
+adaptation, because the attributes of the stimuli that can be predicted are not represented in
+early sensory cortex (e.g., Harrison et al. 2007). It should be noted that the suppression of
+responses to predictable stimuli can coexist with (top-down) attentional enhancement of
+signal processing (Wyart et al., 2012): in predictive coding, attention is mediated by
+increasing the gain of populations encoding prediction error (Spratling, 2008; Feldman and
+Friston, 2010). The resulting attentional modulation (e.g., Hopfinger et al., 2000) can
+interact with top-down predictions to override their suppressive influence – as demonstrated
+empirically (Kok et al., 2011). See Buschman and Miller (2007), Saalmann et al. (2007),
+Anderson et al. (2011), and Armstrong et al. (2012) for further discussion of top-down
+connections in attention.
+
+Further evidence for the inhibitory (suppressive) effect of feedback connections comes from
+neuropsychology: Patients with damage to the prefrontal cortex (PFC) show disinhibition of
+event related potential responses (ERP) to repeating stimuli (Knight et al., 1989; Yamaguchi
+and Knight, 1990; but see Barceló et al., 2000). In contrast, they show reduced-amplitude
+P300 ERPs in response to novel stimuli – as if there were a failure to communicate
+top-down predictions to sensory cortex (Knight, 1984). Furthermore, normal subjects show a
+rapid adaptation to deviant stimuli as they become predictable – an effect not seen in
+prefrontal patients.
+
+Several invasive studies complement these human studies in suggesting an overall inhibitory
+role for feedback connections. In a recent seminal study, Olsen et al. studied corticothalamic
+feedback between L6 of V1 and the LGN using transgenic expression of channelrhodopsin
+in L6 cells of V1. By driving these cells optogenetically – while recording units in V1 and
+the LGN – the authors showed that deep L6 principal cells inhibited their extrinsic targets in
+the LGN and their intrinsic targets in cortical layers 2 to 5 (Olsen et al., 2012). This
+suppression was powerful – in the LGN, visual responses were suppressed by 76%.
+Suppression was also high in V1, around 80-84% (Olsen et al., 2012). This evidence is in
+line with classical studies of corticogeniculate contributions to length tuning in the LGN,
+showing that cortical feedback contributes to the surround suppression of feline LGN cells:
+without feedback, LGN cells are disinhibited and show weaker surround suppression
+(Murphy and Sillito, 1987; Sillito et al., 1993; but see Alitto and Usrey, 2008).
+
+While these studies provide convincing evidence that cortical feedback to the LGN is
+inhibitory, the evidence is more complicated for corticocortical feedback connections
+(Sandell and Schiller, 1982; Johnson and Burkhalter, 1996, 1997). Hupé and colleagues
+cooled area V5/MT while recording from areas V1, V2, and V3 in the monkey. When visual
+stimuli were presented in the classical receptive field (CRF), cooling of area V5/MT
+decreased unit activity in earlier areas, suggesting an excitatory effect of extrinsic feedback
+(Hupé et al., 1998). However, when the authors used a stimulus that spanned the
+extra-classical RF, the responses of V1 neurons were – on average – enhanced after cooling area
+V5, consistent with the suppressive role of feedback connections. These results indicate that
+the inhibitory effects of feedback connections may depend on (natural) stimuli that require
+integration over the visual field. Similar effects were observed when area V2 was cooled and
+neurons were measured in V1: when stimuli were presented only to the CRF, cooling V2
+decreased V1 spiking activity; however, when stimuli were present in the CRF and the
+surround, cooling V2 increased V1 activity (Bullier et al., 1996). Finally, others have argued
+for an inhibitory effect of feedback based on the timing and spatial extent of surround
+suppression in monkey V1 – concluding that the far surround suppression effects were most
+likely mediated by feedback (Bair et al., 2003).
+
+The empirical finding that feedback connections can both facilitate and suppress firing in
+lower hierarchical areas – depending on the content of classical and extra-classical receptive
+fields – is consistent with predictive coding: Rao and Ballard (1999) trained a hierarchical
+predictive coding network to recognise natural images. They showed that higher levels in
+the hierarchy learn to predict visual features that extend across many CRFs in the lower
+levels (e.g. tree trunks or horizons). Hence, higher visual areas come to predict that visual
+stimuli will span the receptive fields of cells in lower visual areas. In this setting, a stimulus
+that is confined to a CRF would elicit a strong prediction error signal (because it cannot be
+predicted). This provides a simple explanation for the findings of Hupé et al. (1998) and
+Bullier et al. (1996): when feedback connections are deactivated, there are no top-down
+predictions to explain responses in lower areas – leading to a disinhibition of responses in
+earlier areas, when – and only when – stimuli can be predicted over multiple CRFs.
+
+Feedback connections and layer 1—How might the inhibitory effect of feedback
+connections be mediated? The established view is that extrinsic corticocortical connections
+are exclusively excitatory (using glutamate as their excitatory neurotransmitter) – although
+recent evidence suggests that inhibitory extrinsic connections exist and may play an
+important role in synchronizing distant regions (Melzer et al., 2012). However, one
+important route by which feedback connections could mediate selective inhibition is via
+their termination in L1 (Anderson and Martin, 2006; Shipp, 2007): layer 1 is sometimes
+referred to as acellular due to its pale appearance with Nissl staining (the classical method
+for separating layers that selectively labels cell bodies). Indeed, a recent study concluded
+that L1 contains less than 0.5% of all cells in a cortical column (Meyer et al., 2011). These
+L1 cells are almost all inhibitory and interconnect strongly with each other, via electrical
+connections and chemical synapses (Chu et al., 2003). Simultaneous whole cell patch clamp
+recordings show that they provide strong monosynaptic inhibition to L2/3 pyramidal cells,
+whose apical dendrites project into L1 (Chu et al., 2003; Wozny and Williams, 2011). This
+means L1 inhibitory cells are in a prime position to mediate inhibitory effects of extrinsic
+feedback. The laminar location highlighted by these studies – the bottom of L1 and the top
+of L2/3 – has recently been shown to be a “hotspot” of inhibition in the column (Meyer et
+al., 2011). Indeed, a study of rat barrel cortex – that stimulated (and inactivated) L1 –
+showed that it exerts a powerful inhibitory effect on whisker-evoked responses (Shlosberg et
+al., 2006). These studies suggest that corticocortical feedback connections could deliver
+strong inhibition, if they were to recruit the inhibitory potential of L1.
+
+In terms of the excitatory and modulatory effect of feedback connections, predictive input
+from higher cortical areas might have an important impact via the distal dendrites of
+pyramidal neurons (Larkum et al., 2009). Furthermore, there is a specific type of
+GABAergic neuron that appears to control distal dendritic excitability, gating top-down
+excitatory signals differentially during behavior (Gentet et al., 2012). Table 1 summarises
+the studies we have discussed in relation to the role of feedback connections.
+
+Feedforward and transthalamic connections
+
+While the evidence for an inhibitory effect of feedback connections has to be evaluated
+carefully, the evidence for an excitatory effect of feedforward connections is unequivocal.
+For example, in the monkey, V1 projects monosynaptically to V2, V3, V3a, V4, and V5/MT
+(Zeki, 1978; Zeki and Shipp, 1988). In all cases – when V1 is reversibly inactivated through
+cooling – single-cell activity in target areas is strongly suppressed (Girard and Bullier, 1989;
+Girard et al., 1991a, 1991b, 1992). In the cases of V2 and V3, the result of cooling area V1
+is a near-total silencing of single unit activity. These studies illustrate that activity in higher
+cortical areas depends on driving inputs from earlier cortical areas that establish their
+receptive field properties.
+
+Finally, while many studies have focused on extrinsic connections that project directly from
+one cortical area to the next, there is mounting evidence that feedforward driving
+connections (and perhaps feedback) in the cortex could be mediated by transthalamic
+pathways (Sherman and Guillery, 1998, 2011). The strongest evidence for this claim comes
+from the somatosensory system, where it was shown recently that the posterior medial
+nucleus of the thalamus (POm) – a higher-order thalamic nucleus that receives direct input
+from cortex – can relay information between S1 and S2 (Theyel et al., 2009). In addition, the
+thalamic reticular nucleus has been proposed to mediate the inhibition that might underlie
+cross-modal attention or top-down predictions (Yamaguchi and Knight, 1990; Crick, 1984;
+Wurtz et al., 2011). Furthermore, computational considerations and recent experimental
+findings point to a potentially important role for higher-order thalamic nuclei in coordinating
+and synchronizing cortical responses (Vicente et al., 2008; Saalmann et al., 2012). The
+degree to which cortical areas are integrated directly via corticocortical or indirectly via
+cortico-thalamo-cortical connections – and the extent to which transthalamic pathways
+dissociate feedforward from feedback connections in the same way as we have proposed for
+the cortico-cortical connections – are open questions.
+
+The canonical microcircuit
+
+Central to the idea of a canonical microcircuit is the notion that a cortical column contains
+the circuitry necessary to perform requisite computations, and that these circuits can be
+replicated with minor variations throughout the cortex. One of the clearest examples of how
+cortical circuits process simple inputs – to generate complex outputs – is the emergence of
+orientation tuning in V1. Orientation tuning is a distinctly cortical phenomenon because
+geniculocortical relay cells show no orientation preferences. A further elaboration of cortical
+responses can be found in the distinction between simple and complex cells – while simple
+cells possess spatially confined receptive fields, complex cells are orientation-tuned but
+show less preference for the location of an oriented bar. Hubel and Wiesel proposed a model
+for how intrinsic and extrinsic connectivity could establish a circuit explaining these
+receptive field properties. They proposed that orientation tuning in simple cells could be
+generated by a single cortical cell receiving input from several ON centre – OFF surround
+geniculate cells arranged along a particular orientation, thereby endowing it with a
+preference for bars oriented in a particular direction (Hubel and Wiesel, 1962). Complex
+cells were hypothesized to receive inputs from several simple cells – with the same
+orientation preference and slightly varying receptive field locations. Thus, complex cells
+were thought not to receive direct LGN input but to be higher order cells in cortex.
+Subsequent findings supported these predictions, showing that input layers 4Cα and 4Cβ
+contain the largest proportion of cells receiving monosynaptic geniculate input, while
+superficial and deep layers contain a larger number of cells receiving disynaptic or
+polysynaptic input (Bullier and Henry, 1980). Furthermore, simple cells project
+monosynaptically onto complex cells, where they exert a strong feedforward influence
+(Alonso and Martinez, 1998; Alonso, 2002). These models suggest that intrinsic cortical
+circuitry allows processing to proceed along discrete steps that are capable of producing
+response properties in outputs that are not present in inputs.
+
+Segregation of processing streams
+
+A key property of canonical circuits is the segregation of parallel streams of processing. For
+example, in primates, parvocellular input enters the cortex primarily in layer 4Cβ, whereas
+magnocellular inputs enter in 4Cα. The corticogeniculate feedback pathway from L6
+maintains this segregation, as upper L6 cells preferentially synapse onto parvocellular cells
+in the LGN, while lower L6 cells target the magnocellular LGN layers (Fitzpatrick et al.,
+1994; Briggs and Usrey, 2009). Further examples of stream segregation are also present in
+the dorsal “where” and the ventral “what” pathways, and in the projection from V1 to the
+thick, thin, and inter-stripe regions of V2 (Zeki and Shipp, 1988; Sincich and Horton, 2005).
+
+Superficial and deep layers are anatomically interconnected, but mounting evidence suggests
+that they constitute functionally distinct processing streams: in an elegant experiment,
+Roopun et al. (2006) showed that L2/3 of rat somatomotor cortex shows prominent gamma
+oscillations that are co-expressed with beta oscillations in L5. Both rhythms persisted when
+superficial and deep layers were disconnected at the level of L4. Maier and colleagues
+(2010) used multilaminar recordings to show strong LFP coherence amongst sites within the
+superficial layers (the superficial compartment), as well as strong coherence amongst sites in
+deep layers (the deep compartment), but weak inter-compartment coherence. These studies
+indicate a segregation of – potentially autonomous – supragranular and infragranular
+dynamics. Maier et al. found that supragranular sites had higher broadband gamma power
+than infragranular sites. This pattern was reversed in the alpha and beta range, with greater
+power in the infragranular and granular layers. Finally, the spiking activity of neurons in the
+superficial layers of visual cortex is more coherent with gamma frequency oscillations in
+the local field potential, while neurons in deep layers are more coherent with alpha
+frequency oscillations (Buffalo et al., 2011). This finding is consistent with an earlier study
+by Livingstone (1996) showing that 50% of cells in L2/3 of squirrel monkey V1 expressed
+gamma oscillations, compared to less than 20% of cells in L4C and infragranular layers. The
+different spectral behaviour of superficial and deep layers has led to the interesting proposal
+that feedforward and feedback signalling may be mediated by distinct (high and low)
+frequencies (reviewed in Wang, 2010, see also Buschman and Miller, 2007 in the context of
+attention), a proposal that has recently received experimental support – at least for the
+feedforward connections (Bosman et al., 2012; see also Gregoriou et al., 2009).
+
+Integration and segregation within canonical circuits—Given this functional and
+anatomical segregation into parallel streams, the question naturally arises, how are these
+streams integrated? It has been previously suggested that integration occurs through the
+synchronized firing of multiple neurons that form a neural ensemble (Gray et al., 1989;
+Singer, 1999), while others have emphasized inter-areal phase-synchronization or coherence
+(Varela et al., 2001; Fries, 2005; Fujisawa and Buzsáki, 2011). While a full treatment of this
+question is beyond the scope of the current review, we propose that the canonical
+microcircuit contains a clue for how the dialectic between segregation and integration might
+be resolved. While top-down and bottom-up inputs and outputs may be segregated in layers,
+streams, and frequency bands, the canonical microcircuit specifies the circuitry for how the
+basic units of cortex are interconnected and therefore how the intrinsic activity of the
+cortical column is entrained by extrinsic inputs. This intrinsic connectivity specifies how the
+cells of origin and termination of extrinsic projections are interconnected, and thus
+determines how top-down and bottom-up streams are integrated within each cortical column.
+
+Spatial segregation and cortical columns
+
+The notion of a canonical microcircuit implicitly assumes that each circuit is distinct from
+its neighbours, so that circuits can presumably carry out computations in parallel. Therefore, the
+canonical microcircuit specifies the spatial scale over which processing is integrated. The
+most likely candidate for this spatial scale is the cortical column – that can vary over three
+orders of magnitude between minicolumns, columns, and hypercolumns. Minicolumns are
+only a few cells wide, estimated to be about 50-60 micrometers in diameter by Mountcastle
+(1997) and are seen in Nissl sections of cortex as slight variations in cell density.
+Minicolumns were originally proposed as elementary units of cortex by Lorente de No
+(1949) and appear to reflect the migration of cells from the ventricular zone to the cortical
+sheet during foetal development (reviewed in Horton and Adams, 2005). Hubel and Wiesel
+estimated that orientation columns were on this order of magnitude, about 25-50
+micrometers wide, although they failed to establish a correspondence between orientation
+columns observed physiologically and the minicolumns seen in Nissl sections (Hubel and
+Wiesel, 1974). A cortical column was classically defined as a vertical alignment of cells
+containing neurons with similar receptive field properties, such as orientation preference and
+ocular dominance in V1; or touch in somatosensory cortex (Mountcastle, 1957; Hubel and
+Wiesel, 1972). These columns were suggested by Mountcastle to encompass a number of
+minicolumns, with a width of 300-400 micrometers (Mountcastle, 1997). Finally, Hubel and
+Wiesel defined a hypercolumn to be the unit of cortex necessary to traverse all possible
+values of a particular receptive field property, such as orientation or eye dominance –
+estimated to be between 0.5 and 1 mm wide (Hubel and Wiesel, 1974).
+
+Columns, connections and computations—So is the cortical column the basic unit
+of cortical computation? Some authors emphasize that even within a dendrite, there are all
+the necessary biophysical mechanisms for performing surprisingly advanced computations,
+such as direction selectivity, coincidence detection, or temporal integration (Häusser and
+Mel, 2003; London and Häusser, 2005). Others argue that single neurons can process their
+inputs at the dendrite, soma, and initial segment, such that the output spike trains of just two
+interconnected cells could mediate computations like independent components analysis
+(Klampfl et al., 2009). Others posit that cortical columns form the basic computational unit
+(Mountcastle, 1997; Hubel and Wiesel, 1972; but see Horton and Adams, 2005). Donald
+Hebb proposed that neurons distributed over several cortical areas could form a functional
+computational unit called a neural assembly (Hebb, 1949). This view has re-emerged in
+recent years, with the development of the requisite recording and analytic techniques for
+evaluating this proposal (Buzsáki, 2010; Canolty et al., 2010; Singer et al., 1997;
+Lopes-dos-Santos et al., 2011).
+
+Computational modelling studies indicate that cortical columns with structured connectivity
+are computationally more efficient than a network containing the same number of neurons
+but with random connectivity (Haeusler and Maass, 2007). Others suggest that this circuitry
+allows the cortex to organize and integrate bottom-up, lateral, and top-down information
+(Ullman, 1995; Raizada and Grossberg, 2003). Douglas and Martin suggest that the rich
+anatomical connectivity of L2/3 pyramidal cells allows them to collect information from
+top-down, lateral, and bottom-up inputs, and – through processing in the dendritic tree –
+select the most likely interpretation of their inputs. More recently, George and Hawkins have
+suggested that the canonical microcircuit implements a form of Bayesian processing
+(George and Hawkins, 2009). In the following section, we pursue similar ideas, but ground
+them in the framework of predictive coding, and propose a cortical circuit that could
+implement predictive coding through canonical interconnections. In particular, we find that
+the proposed circuitry agrees remarkably well with quantitative characterisations of the
+canonical microcircuit (Haeusler and Maass, 2007).
+
+A canonical microcircuit for predictive coding
+
+This section considers the computational role of cortical microcircuitry in more detail. We
+try to show that the computations performed by canonical microcircuits can be specified
+more precisely than one might imagine and that these computations can be understood
+within the framework of predictive coding. In brief, we will show that (hierarchical
+Bayesian) inference about the causes of sensory input can be cast as predictive coding. This
+is important because it provides formal constraints on the dynamics one would expect to
+find in neuronal circuits. Having established these constraints, we then attempt to match
+them with the neurobiological constraints afforded by the canonical microcircuit. The
+endpoint of this exercise is a canonical microcircuit for predictive coding.
+
+Bastos et al.
+Page 10
+
+Neuron. Author manuscript; available in PMC 2013 September 19.
+
+NIH-PA Author Manuscript
+
+
+Predictive coding and the free energy principle
+
+It might be thought impossible to specify the computations performed by the brain.
+However, there are some fairly fundamental constraints on the basic form of neuronal
+dynamics. The argument goes as follows – and can be regarded as a brief summary of the
+free energy principle (see Friston, 2010 for details):
+
+• Biological systems are homoeostatic (or allostatic), which means that they
+minimise the dispersion (entropy) of their interoceptive and exteroceptive states.
+
+• Entropy is the average of surprise over time, which means biological systems
+minimise the surprise associated with their sensory states at each point in time.
+
+• In statistics, surprise is the negative logarithm of Bayesian model evidence, which
+means biological systems – like the brain – must continually maximise the
+Bayesian evidence for their (generative) model of sensory inputs.
+
+• Maximising Bayesian model evidence corresponds to Bayesian filtering of sensory
+inputs. This is also known as predictive coding.
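+
+The chain of reasoning in these four points can be written compactly. The following is
+a sketch under assumed notation that does not appear in the text: s(t) denotes sensory
+states, m the generative model, \vartheta the hidden variables, and q an arbitrary
+(recognition) density over \vartheta. The time-average form of the entropy additionally
+assumes ergodicity.
+
+```latex
+% Assumed symbols (illustrative): s = sensory states, m = model,
+% \vartheta = hidden variables, q = recognition density.
+\begin{align}
+  % Entropy as the long-run time average of surprise (ergodic assumption)
+  H[p(s \mid m)] &= \lim_{T \to \infty} \frac{1}{T} \int_0^T -\ln p\bigl(s(t) \mid m\bigr)\, dt \\
+  % Surprise is the negative log of Bayesian model evidence
+  \text{surprise}(s) &= -\ln p(s \mid m) \\
+  % Variational free energy upper-bounds surprise, because the KL term is non-negative
+  F(s, q) &= -\ln p(s \mid m)
+             + D_{\mathrm{KL}}\bigl[\, q(\vartheta) \,\|\, p(\vartheta \mid s, m) \,\bigr]
+             \;\ge\; -\ln p(s \mid m)
+\end{align}
+```
+
+Because the Kullback-Leibler divergence is never negative, minimising F with respect to q
+tightens an upper bound on surprise, which is equivalent to maximising a lower bound on
+log model evidence.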
+
+These arguments mean that by minimising surprise, through selecting appropriate
+sensations, the brain is implicitly maximising the evidence for its own existence – this is
+known as active inference. In other words, to maintain homoeostasis the brain must predict
+its sensory states on the basis of a model. Fulfilling those predictions corresponds to
+accumulating evidence for that model – and the brain that embodies it. The implicit
+maximisation of Bayesian model evidence provides an important link to the Bayesian brain
+hypothesis (Hinton and van Camp, 1993; Dayan et al., 1995; Knill and Pouget, 2004) and
+many other compelling proposals about perceptual synthesis – including analysis by
+synthesis (Neisser, 1967; Yuille and Kersten, 2006), epistemological automata (MacKay,
+1956), the principle of minimum redundancy (Attneave, 1954; Barlow, 1961; Dan et
+al., 1996), the Infomax principle (Linsker, 1990; Atick, 1992; Kay and Phillips, 2011), and
+perception as hypothesis testing (Gregory, 1968; 1980).
+
+The most popular scheme – for Bayesian filtering in neuronal circuits – is predictive coding
+(Srinivasan et al., 1982; Buchsbaum and Gottschalk, 1983; Rao and Ballard, 1999). In this
+context, surprise corresponds (roughly) to prediction error. In predictive coding, top-down
+predictions are compared with bottom-up sensory information to form a prediction error.
+This prediction error is used to update higher-level representations – upon which top-down
+predictions are based. These optimised predictions then reduce prediction error at lower
+levels.
+
+To predict sensations, the brain must be equipped with a generative model of how its
+sensations are caused (Helmholtz, 1860). Indeed, this led Geoffrey Hinton and colleagues to
+propose that the brain is an inference (Helmholtz) machine (Hinton and Zemel, 1994; Dayan
+et al., 1995). A generative model describes how variables or causes in the environment
+conspire to produce sensory input. Generative models map from (hidden) causes to (sensory)
+consequences. Perception then corresponds to the inverse mapping from sensations to their
+causes, while action can be thought of as the selective sampling of sensations. Crucially, the
+form of the generative model dictates the form of the inversion – for example, predictive
+coding. Figure 3 depicts a general model as a probabilistic graphical model. A special case
+of these models are hierarchical dynamic models (see Figure 4), which grandfather most
+parametric models in statistics and machine learning (see Friston, 2008). These models
+explain sensory data in terms of hidden causes and states. Hidden causes and states are both
+hidden variables that cause sensations but they play slightly different roles: hidden causes
+link different levels of the model and mediate conditional dependencies among hidden states
+at each level. Conversely, hidden states model conditional dependencies over time (i.e.,
+memory) by modelling dynamics in the world. In short, hidden causes and states mediate
+structural and dynamic dependencies respectively.
+
+The details of the graph in Figure 3 are not important; it just provides a way of describing
+conditional dependencies among hidden states and causes responsible for generating sensory
+input. These dependencies mean that we can interpret neuronal activity as message passing
+among the nodes of a generative model, where each canonical microcircuit contains
+representations or expectations about hidden states and causes. In other words, the form of
+the underlying generative model defines the form of the predictive coding architecture used
+to invert the model. This is illustrated in Figure 4, where each node has a single parent. We
+will deal with this simple sort of model because it lends itself to an unambiguous description
+in terms of bottom-up (feedforward) and top-down (feedback) message passing. We now
+look at how perception or model inversion – recovering the hidden states and causes of this
+model given sensory data – might be implemented at the level of a microcircuit:
+
+Predictive coding and message passing
+
+In predictive coding, representations (or conditional expectations) generate top-down
+predictions to produce prediction errors. These prediction errors are then passed up the
+hierarchy in the reverse direction, to update conditional expectations. This ensures an
+accurate prediction of sensory input and all its intermediate representations. This hierarchical
+message passing can be expressed mathematically as a gradient descent on the (sum of
+squared) prediction errors, where the prediction errors are weighted by their
+precision (inverse variance).
+
+(1)
+
+The first pair of equalities just says that conditional expectations about hidden causes and
+states are updated based upon the way we would predict them to change – the first
+term – and subsequent terms that minimise prediction error. The second pair of equations
+simply expresses prediction error as the difference between conditional
+expectations about hidden causes and (the changes in) hidden states and their predicted
+values, weighted by their precisions. These predictions are nonlinear functions of
+conditional expectations (g(i), f(i)) at each level of the hierarchy and the level above.
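+
+A form of Equation (1) consistent with this description can be sketched as follows. The
+notation is an assumption borrowed from the generalised predictive coding literature (see
+Friston, 2008) and may differ in detail from the published equation: \tilde{\mu}_v^{(i)} and
+\tilde{\mu}_x^{(i)} stand for conditional expectations about hidden causes and hidden states
+at level i, \tilde{\varepsilon} for prediction errors, \xi for precision-weighted prediction
+errors, \Pi for precisions, and \mathcal{D} for a derivative operator giving the predicted
+motion of the expectations. The functions g^{(i)} and f^{(i)} are those written g(i), f(i)
+in the text.
+
+```latex
+% Sketch only: symbols are assumed, not taken verbatim from the source.
+\begin{align}
+  % First pair: expectations follow their predicted motion (first term)
+  % and descend the gradient of precision-weighted prediction error.
+  \dot{\tilde{\mu}}_v^{(i)} &= \mathcal{D}\tilde{\mu}_v^{(i)}
+      - \partial_{\tilde{v}}\tilde{\varepsilon}^{(i)} \cdot \xi^{(i)} - \xi_v^{(i+1)} \\
+  \dot{\tilde{\mu}}_x^{(i)} &= \mathcal{D}\tilde{\mu}_x^{(i)}
+      - \partial_{\tilde{x}}\tilde{\varepsilon}^{(i)} \cdot \xi^{(i)} \\
+  % Second pair: prediction error = expectation minus prediction, weighted by precision.
+  \xi_v^{(i)} &= \Pi_v^{(i)} \tilde{\varepsilon}_v^{(i)}
+      = \Pi_v^{(i)} \bigl( \tilde{\mu}_v^{(i-1)} - g^{(i)}(\tilde{\mu}_x^{(i)}, \tilde{\mu}_v^{(i)}) \bigr) \\
+  \xi_x^{(i)} &= \Pi_x^{(i)} \tilde{\varepsilon}_x^{(i)}
+      = \Pi_x^{(i)} \bigl( \mathcal{D}\tilde{\mu}_x^{(i)} - f^{(i)}(\tilde{\mu}_x^{(i)}, \tilde{\mu}_v^{(i)}) \bigr)
+\end{align}
+```
+
+Read this way, the nonlinear functions g^{(i)} and f^{(i)} enter only through the predictions
+that are subtracted to form prediction errors, whereas the expectations receive prediction
+errors through linear (gradient-weighted) terms.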
+
+It is difficult to overstate the generality and importance of Equation (1) – it grandfathers
+nearly every known statistical estimation scheme, under parametric assumptions about
+additive noise. These range from ordinary least squares to advanced Bayesian filtering
+schemes (see Friston, 2008). In this general setting, Equation (1) minimises variational free
+energy and corresponds to generalised predictive coding. Under linear models, it reduces to
+linear predictive coding, also known as Kalman-Bucy filtering (see Friston et al., 2010 for
+details).
+
+In neuronal network terms, Equation (1) says that prediction error units receive messages
+from the same level and the level above. This is because the hierarchical form of the model
+only requires conditional expectations from neighbouring levels to form prediction errors, as
+can be seen schematically in Figure 4. Conversely, expectations are driven by prediction
+error from the same level and the level below – updating expectations about hidden states
+and causes respectively. These constitute the bottom-up and lateral messages that drive
+conditional expectations to provide better predictions – or representations – that suppress
+prediction error. This updating corresponds to an accumulation of prediction errors – in that
+the rate of change of conditional expectations is proportional to prediction error.
+Electrophysiologically, this means that one would expect to see a transient prediction error
+response to bottom-up afferents (in neuronal populations encoding prediction error) that is
+suppressed to baseline firing rates by sustained responses (in neuronal populations encoding
+predictions). This is the essence of recurrent message passing between hierarchical levels to
+suppress prediction error (see Friston 2008 for a more detailed discussion).
+
+The nature of this message passing is remarkably consistent with the anatomical and
+physiological features of cortical hierarchies. An important prediction is that the nonlinear
+functions of the generative model – modelling context sensitive dependencies among hidden
+variables – appear only in the top-down and lateral predictions. This means,
+neurobiologically, we would predict feedback connections to possess nonlinear or
+neuromodulatory characteristics; in contrast to feedforward connections that mediate a linear
+mixture of prediction errors. This functional asymmetry is exactly consistent with the
+empirical evidence reviewed above. Another key feature of Equation (1) is that the top-
+down predictions produce prediction errors through subtraction. In other words, feedback
+connections should exert inhibitory effects, of the sort seen empirically. Table 2 summarises
+the features of extrinsic connectivity (reviewed in the previous section) that are explained by
+predictive coding. In the remainder of this review, we focus on intrinsic connections and
+cortical microcircuits.
+
+The cortical microcircuit and predictive coding
+
+We now try to associate the variables in Equation (1) with specific populations in the
+canonical microcircuit. Figure 5 illustrates a remarkable correspondence between the form
+of Equation (1) and the connectivity of the canonical microcircuit. Furthermore, the
+resulting scheme corresponds almost exactly to the computational architecture proposed by
+Mumford (1992). This correspondence rests upon the following intuitive steps:
+
+• First, we divide the excitatory cells in the superficial and deep layers into principal
+(pyramidal) cells and excitatory interneurons. This accommodates the fact that (in
+macaque V1) a significant percentage of superficial L2/3 cells (about half) and
+deep L5 excitatory cells (about 80%) do not project outside the cortical column
+(Callaway and Wiser, 1996; Briggs and Callaway, 2005).
+
+• Second, we know that the superficial and deep pyramidal cells provide feedforward
+and feedback connections respectively. This means that superficial pyramidal cells
+must encode and broadcast prediction errors on hidden causes, while deep
+pyramidal cells must encode conditional expectations so that they can
+elaborate feedback predictions.
+
+• Third, we know that the (spiny stellate) excitatory cells in the granular layer receive
+feedforward connections encoding prediction errors on the hidden causes of the
+level below.
+
+• This leaves the inhibitory interneurons in the granular layer, which, for symmetry,
+we associate with prediction errors on the hidden states.
+
+• The remaining populations are the excitatory and inhibitory interneurons in the
+supragranular layer, to which we assign expectations about hidden causes and
+states respectively. These are mapped through descending (intrinsic) feedforward
+connections to cells in the deep layers that generate predictions. We do not suppose
+that this is a simple one-to-one mapping – rather it mediates the nonlinear
+transformation of expectations to predictions required by the earlier cortical level.
+
+This arrangement accommodates the fact that the dependencies among hidden states are
+confined to each node (by the nature of graphical models), which means that their
+expectations and prediction errors should be encoded by interneurons. Furthermore, the
+splitting of excitatory cells in the upper layers into two populations (encoding expectations
+and prediction errors on hidden causes) is sensible, because there is a one-to-one mapping
+between the expectations on hidden causes and their prediction errors.
+
+The ensuing architecture bears a striking correspondence to the microcircuit of Haeusler
+and Maass (2007), shown in the left panel of Figure 5 – in the sense that nearly every connection
+required by the predictive coding scheme appears to be present in terms of quantitative
+measures of intrinsic connectivity. However, there are two exceptions that both involve
+connections to the inhibitory cells in the granular layer (shown as dotted lines in Figure 5).
+Predictive coding requires that these cells (that encode prediction errors on hidden states)
+compare the expected changes in hidden states with the actual changes. This suggests that
+there should be interlaminar projections from supragranular (inhibitory) and infragranular
+(excitatory) cells. In terms of their synaptic characteristics, one would predict that these
+intrinsic connections would be of a feedback sort – in the sense that they convey predictions.
+Although not considered in this Haeusler and Maass scheme, feedback connections from
+infragranular layers are an established component of the canonical microcircuit (see Figure
+2).
+
+Functional asymmetries in the microcircuit
+
+The circuitry in Figure 5 appears consistent with the broad scheme of ascending
+(feedforward) and descending (feedback) intrinsic connections: feedforward prediction
+errors from a lower cortical level arrive at granular layers and are passed forward to
+excitatory and inhibitory interneurons in supragranular layers, encoding expectations. Strong
+and reciprocal intralaminar connections couple superficial excitatory interneurons and
+pyramidal cells. Excitatory and inhibitory interneurons in supragranular layers then send
+strong feedforward connections to the infragranular layer. These connections enable deep
+pyramidal cells and excitatory interneurons to produce (feedback) predictions, which ascend
+back to L4 or descend to a lower hierarchical level. This arrangement recapitulates the
+functional asymmetries between extrinsic feedforward and feedback connections and is
+consistent with the empirical characteristics of intrinsic connections.
+
+If we focus on the superficial and deep pyramidal cells, the form of the recognition
+dynamics in Equation (1) tells us something quite fundamental: We would anticipate higher
+frequencies in the superficial pyramidal cells, relative to the deep pyramidal cells. One can
+see this easily by taking the Fourier transform of the first equality in Equation (1):
+
+(2)
+
+This equation says that the contribution of any (angular) frequency ω in the prediction errors
+(encoded by superficial pyramidal cells) to the expectations (encoded by the deep pyramidal
+cells) is suppressed in proportion to that frequency (Friston 2008). In other words, high
+frequencies should be attenuated when passing from superficial to deep pyramidal cells.
+There is nothing mysterious about this attenuation – it is a simple consequence of the fact
+that conditional expectations accumulate prediction errors, thereby suppressing high-
+frequency fluctuations to produce smooth estimates of hidden causes. This smoothing –
+inherent in Bayesian filtering – leads to an asymmetry in frequency content of superficial
+and deep cells: for example, superficial cells should express more gamma relative to beta,
+and deep cells should express more beta relative to gamma (Roopun et al., 2006, 2008;
+Maier et al., 2010).
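+
+The attenuation described here can be illustrated with a deliberately simplified scalar
+case. The sketch below is an assumption made for illustration, not the full form of
+Equation (2): it drops the prediction term and treats the gradient as a constant gain k, so
+that an expectation \mu(t) simply accumulates a prediction error \xi(t).
+
+```latex
+% Simplified scalar sketch (assumed): \mu = expectation, \xi = prediction error,
+% k = constant gain, \omega = angular frequency, j = imaginary unit.
+\begin{align}
+  \dot{\mu}(t) &= -k\, \xi(t)
+      && \text{expectation accumulates prediction error} \\
+  j\omega\, \hat{\mu}(\omega) &= -k\, \hat{\xi}(\omega)
+      && \text{Fourier transform of both sides} \\
+  \bigl|\hat{\mu}(\omega)\bigr| &= \frac{k}{\omega}\, \bigl|\hat{\xi}(\omega)\bigr|
+      && \text{amplitude falls as } 1/\omega
+\end{align}
+```
+
+Under this simplification, doubling the frequency halves the amplitude passed from
+prediction error to expectation, so a 40 Hz component is attenuated twice as strongly as a
+20 Hz component (frequencies chosen purely as examples).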
+
+Figure 6 provides a schematic illustration of the spectral asymmetry predicted by Equation
+(2). Note that predictions about the relative amplitudes of high and low frequencies in
+superficial and deep layers pertain to all frequencies – there is nothing in predictive coding
+per se to suggest characteristic frequencies in the gamma and beta ranges. However, one
+might speculate that the characteristic frequencies of canonical microcircuits have evolved to
+model and – through active inference – create the sensorium (Friston, 2010; Berkes et al.,
+2011; Engbert et al., 2011). Indeed, there is empirical evidence to support this notion in the
+visual (Lakatos et al., 2008; Meirovithz et al., 2012; Bosman et al., 2009) and motor
+(Gwin and Ferris, 2012) domains.
+
+In summary, predictions are formed by a linear accumulation of prediction errors.
+Conversely, prediction errors are nonlinear functions of predictions. This means that the
+conversion of prediction errors into predictions (Bayesian filtering) necessarily entails a loss
+of high frequencies. However, the nonlinearity in the mapping from predictions to prediction
+errors means that high frequencies can be created (consider the effect of squaring a sine
+wave, which would convert beta into gamma). In short, prediction errors should express
+higher frequencies than the predictions that accumulate them. This is another example of a
+potentially important functional asymmetry between feedforward and feedback message
+passing that emerges under predictive coding. It is particularly interesting given recent
+evidence that feedforward connections may use higher frequencies than feedback
+connections (Bosman et al., 2012).
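+
+The remark about squaring a sine wave can be made explicit with a standard trigonometric
+identity. The 20 Hz figure used below is only an illustrative choice of a beta-range
+frequency.
+
+```latex
+% Squaring a sinusoid doubles its frequency (plus a constant offset).
+\begin{align}
+  \sin^2(\omega t) &= \tfrac{1}{2}\bigl(1 - \cos(2\omega t)\bigr)
+\end{align}
+% Example: \omega = 2\pi \cdot 20\,\mathrm{Hz} (beta range) yields a component at
+% 2\omega = 2\pi \cdot 40\,\mathrm{Hz} (gamma range).
+```
+
+A quadratic nonlinearity applied to a 20 Hz oscillation therefore produces a constant
+offset and a 40 Hz oscillation, which is the sense in which a nonlinear mapping from
+predictions to prediction errors can create higher frequencies.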
+
+Conclusion
+
+In conclusion, there is a remarkable correspondence between the anatomy and physiology of
+the canonical microcircuit and the formal constraints implied by generalised predictive
+coding. Having said this, the mapping between computational and neuronal architectures
+is not unique: even if predictive coding is an appropriate implementation of
+Bayesian filtering, there are many variations on the arrangement shown in Figure 5. For
+example, feedback connections could arise directly from cells encoding conditional
+expectations in supragranular layers. Indeed, there is emerging evidence that feedback
+connections between proximate hierarchical levels originate from both deep and superficial
+layers (Markov et al., 2011). Note that this putative splitting of extrinsic streams is only
+predicted in the light of empirical constraints on intrinsic connectivity.
+
+One of our motivations – for considering formal constraints on connectivity – was to
+produce dynamic causal models of canonical microcircuits. Dynamic causal modelling
+enables one to compare different connectivity models, using empirical electrophysiological
+responses (David et al., 2006; Moran et al., 2008, 2011). This form of modelling rests upon
+Bayesian model comparison and allows one to assess the evidence for one microcircuit
+relative to another. In principle, this provides a way to evaluate different microcircuit
+models, in terms of their ability to explain observed activity. One might imagine that the
+particular circuits for predictive coding presented in this paper will be nuanced as more
+anatomical and physiological information becomes available. The ability to compare
+competing models or microcircuits – using optogenetics, local field potentials and
+electroencephalography – may be important for refining neurobiologically informed
+microcircuits. In short, many of the predictions and assumptions we have made about the
+specific form of the microcircuit for predictive coding may be testable in the near future.
+
+Acknowledgments
+
+This work was supported by the Wellcome Trust and the NSF Graduate Research Fellowship under Grant No.
+2009090358 to A.M.B. Support was also provided by NIH grants MH055714 (G.R.M.) and EY013588 (W.M.U.),
+and NSF grant 1228535 (G.R.M. and W.M.U.). The authors would like to thank Julien Vezoli, Will Penny, Dimitris
+Pinotsis, Stewart Shipp, Vladimir Litvak, Conrado Bosman, Laurent Perrinet and Henry Kennedy for helpful
+discussions. We would also like to thank our reviewers for helpful comments and guidance.
+
+References
+
+Alink A, Schwiedrzik CM, Kohler A, Singer W, Muckli L. Stimulus predictability reduces responses
+in primary visual cortex. J. Neurosci. 2010; 30:2960–2966. [PubMed: 20181593]
+Alitto HJ, Usrey WM. Origin and dynamics of extraclassical suppression in the lateral geniculate
+nucleus of the macaque monkey. Neuron. 2008; 57:135–146. [PubMed: 18184570]
+Alonso JM. Book Review: Neural Connections and Receptive Field Properties in the Primary Visual
+Cortex. The Neuroscientist. 2002; 8:443. [PubMed: 12374429]
+Alonso JM, Martinez LM. Functional connectivity between simple cells and complex cells in cat
+striate cortex. Nat. Neurosci. 1998; 1:395–403. [PubMed: 10196530]
+Anderson JC, Kennedy H, Martin KA. Pathways of Attention: Synaptic Relationships of Frontal Eye
+Field to V4, Lateral Intraparietal Cortex, and Area 46 in Macaque Monkey. The Journal of
+Neuroscience. 2011; 31:10872. [PubMed: 21795539]
+Anderson JC, Martin KAC. Synaptic connection from cortical area V4 to V2 in macaque monkey. The
+Journal of Comparative Neurology. 2006; 495:709–721. [PubMed: 16506191]
+Armstrong, KM.; Schafer, RJ.; Chang, MH.; Moore, T. Attention and action in the frontal eye field. In
+The Neuroscience of Attention: Attentional Control and Selection. Mangun, GR., editor. Oxford
+University Press; New York: 2012. p. 151-166.
+Arnal LH, Wyart V, Giraud A-L. Transitions in neural oscillations reflect prediction errors generated
+in audiovisual speech. Nature Neuroscience. 2011; 14:797–801.
+Atick JJ. Could information theory provide an ecological theory of sensory processing? Network:
+Computation in Neural Systems. 1992; 3:213–251.
+Attneave F. Some informational aspects of visual perception. Psychol Rev. 1954; 61:183–193.
+[PubMed: 13167245]
+Bair W, Cavanaugh JR, Movshon JA. Time course and time-distance relationships for surround
+suppression in macaque V1 neurons. The Journal of Neuroscience. 2003; 23:7690. [PubMed:
+12930809]
+Barceló F, Suwazono S, Knight RT. Prefrontal modulation of visual processing in humans. Nat.
+Neurosci. 2000; 3:399–403. [PubMed: 10725931]
+Barlow, HB. Possible principles underlying the transformations of sensory messages. In: Rosenblith,
+WA., editor. Sensory Communication. MIT Press; Cambridge, MA: 1961. p. 217-234.
+Barone P, Batardiere A, Knoblauch K, Kennedy H. Laminar distribution of neurons in extrastriate
+areas projecting to visual areas V1 and V4 correlates with the hierarchical rank and indicates the
+operation of a distance rule. The Journal of Neuroscience. 2000; 20:3263. [PubMed: 10777791]
+Berkes P, Orban G, Lengyel M, Fiser J. Spontaneous Cortical Activity Reveals Hallmarks of an
+Optimal Internal Model of the Environment. Science. 2011; 331:83–87. [PubMed: 21212356]
+Bollimunta A, Mo J, Schroeder CE, Ding M. Neuronal Mechanisms and Attentional Modulation of
+Corticothalamic Alpha Oscillations. The Journal of Neuroscience. 2011; 31:4935. [PubMed:
+21451032]
+Bosman CA, Schoffelen J-M, Brunet N, Oostenveld R, Bastos AM, Womelsdorf T, Rubehn B,
+Stieglitz T, De Weerd P, Fries P. Attentional Stimulus Selection through Selective
+Synchronization between Monkey Visual Areas. Neuron. 2012; 75:875–888. [PubMed: 22958827]
+Briggs F, Callaway EM. Laminar patterns of local excitatory input to layer 5 neurons in macaque
+primary visual cortex. Cerebral Cortex. 2005; 15:479. [PubMed: 15319309]
+Briggs F, Usrey WM. Parallel processing in the corticogeniculate pathway of the macaque monkey.
+Neuron. 2009; 62:135–146. [PubMed: 19376073]
+Buchsbaum G, Gottschalk A. Trichromacy, opponent colours coding and optimum colour information
+transmission in the retina. Proc. R. Soc. Lond., B, Biol. Sci. 1983; 220:89–113. [PubMed:
+6140684]
+Buffalo EA, Fries P, Landman R, Buschman TJ, Desimone R. Laminar differences in gamma and
+alpha coherence in the ventral stream. Proceedings of the National Academy of Sciences. 2011;
+108:11262.
+Bullier J, Henry GH. Ordinal position and afferent input of neurons in monkey striate cortex. J. Comp.
+Neurol. 1980; 193:913–935. [PubMed: 6253535]
+Bullier J, Hupé JM, James A, Girard P. Functional interactions between areas V1 and V2 in the
+monkey. Journal of Physiology-Paris. 1996; 90:217–220.
+Buschman TJ, Miller EK. Top-Down Versus Bottom-Up Control of Attention in the Prefrontal and
+Posterior Parietal Cortices. Science. 2007; 315:1860–1862. [PubMed: 17395832]
+Buzsáki G. Neural Syntax: Cell Assemblies, Synapsembles, and Readers. Neuron. 2010; 68:362–385.
+[PubMed: 21040841]
+Callaway EM. Local circuits in primary visual cortex of the macaque monkey. Annual Review of
+Neuroscience. 1998; 21:47–74.
+Callaway EM, Wiser AK. Contributions of individual layer 2-5 spiny neurons to local circuits in
+macaque primary visual cortex. Vis. Neurosci. 1996; 13:907–922. [PubMed: 8903033]
+Canolty RT, Ganguly K, Kennerley SW, Cadieu CF, Koepsell K, Wallis JD, Carmena JM. Oscillatory
+phase coupling coordinates anatomically dispersed functional cell assemblies. Proceedings of the
+National Academy of Sciences. 2010; 107:17356.
+Chu Z, Galarreta M, Hestrin S. Synaptic interactions of late-spiking neocortical neurons in layer 1. The
+Journal of Neuroscience. 2003; 23:96. [PubMed: 12514205]
+Covic EN, Sherman SM. Synaptic properties of connections between the primary and secondary
+auditory cortices in mice. Cereb. Cortex. 2011; 21:2425–2441. [PubMed: 21385835]
+Crick F. Function of the thalamic reticular complex: the searchlight hypothesis. Proceedings of the
+National Academy of Sciences of the United States of America. 1984; 81:4586. [PubMed:
+6589612]
+Crick F, Koch C. Constraints on cortical and thalamic projections: the no-strong-loops hypothesis.
+Nature. 1998; 391:245–250. [PubMed: 9440687]
+Dan Y, Atick JJ, Reid RC. Efficient coding of natural scenes in the lateral geniculate nucleus:
+experimental test of a computational theory. The Journal of Neuroscience. 1996; 16:3351–3362.
+[PubMed: 8627371]
+David O, Kiebel SJ, Harrison LM, Mattout J, Kilner JM, Friston KJ. Dynamic causal modeling of
+evoked responses in EEG and MEG. NeuroImage. 2006; 30:1255–1272. [PubMed: 16473023]
+Dayan P, Hinton GE, Neal RM, Zemel RS. The Helmholtz machine. Neural Comput. 1995; 7:889–
+904. [PubMed: 7584891]
+Desimone R. Neural mechanisms for visual memory and their role in attention. Proc. Natl. Acad. Sci.
+U.S.A. 1996; 93:13494–13499. [PubMed: 8942962]
+Douglas RJ, Martin K. A functional microcircuit for cat visual cortex. The Journal of Physiology.
+1991; 440:735. [PubMed: 1666655]
+Douglas RJ, Martin KA, Whitteridge D. A canonical microcircuit for neocortex. Neural Computation.
+1989; 1:480–488.
+Douglas RJ, Martin KAC. Neuronal Circuits of the Neocortex. Annu. Rev. Neurosci. 2004; 27:419–
+451. [PubMed: 15217339]
+Engbert R, Mergenthaler K, Sinn P, Pikovsky A. PNAS Plus: An integrated model of fixational eye
+movements and microsaccades. Proceedings of the National Academy of Sciences. 2011;
+108:E765–E770.
+Engel, AK.; Friston, KJ.; Kelso, JA.; König, P.; Kovács, I.; MacDonald, A.; Miller, EK.; Phillips,
+WA.; Silverstein, SM.; Tallon-Baudry, C., et al. Coordination in Behavior and Cognition. In: von
+der Malsburg, C.; Phillips, WA.; Singer, W., editors. Dynamic Coordination in the Brain: From
+Neurons to Mind. MIT Press; Cambridge, MA: 2010. p. 267-299.
+Feldman H, Friston KJ. Attention, uncertainty, and free-energy. Front Hum Neurosci. 2010; 4:215.
+[PubMed: 21160551]
+Felleman DJ, Van Essen DC. Distributed hierarchical processing in the primate cerebral cortex. Cereb.
+Cortex. 1991; 1:1–47. [PubMed: 1822724]
+Fitzpatrick D, Usrey WM, Schofield BR, Einstein G. The sublaminar organization of corticogeniculate
+neurons in layer 6 of macaque striate cortex. Vis. Neurosci. 1994; 11:307–315. [PubMed:
+7516176]
+Fries P. A mechanism for cognitive dynamics: neuronal communication through neuronal coherence.
+Trends in Cognitive Sciences. 2005; 9:474–480. [PubMed: 16150631]
+Fries P, Reynolds JH, Rorie AE, Desimone R. Modulation of oscillatory neuronal synchronization by
+selective visual attention. Science. 2001; 291:1560–1563. [PubMed: 11222864]
+Friston K. Hierarchical Models in the Brain. PLoS Comput Biol. 2008; 4:e1000211. [PubMed:
+18989391]
+Friston K. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience. 2010;
+11:127–138.
+Fujisawa S, Buzsáki G. A 4 Hz Oscillation Adaptively Synchronizes Prefrontal, VTA, and
+Hippocampal Activities. Neuron. 2011; 72:153–165. [PubMed: 21982376]
+Garrido MI, Kilner JM, Kiebel SJ, Friston KJ. Evoked brain responses are generated by feedback
+loops. Proc. Natl. Acad. Sci. U.S.A. 2007; 104:20961–20966. [PubMed: 18087046]
+Garrido MI, Kilner JM, Stephan KE, Friston KJ. The mismatch negativity: a review of underlying
+mechanisms. Clinical Neurophysiology. 2009; 120:453–463. [PubMed: 19181570]
+Gentet LJ, Kremer Y, Taniguchi H, Huang ZJ, Staiger JF, Petersen CCH. Unique functional properties
+of somatostatin-expressing GABAergic neurons in mouse barrel cortex. Nature Neuroscience.
+2012; 15:607–612.
+George D, Hawkins J. Towards a mathematical theory of cortical micro-circuits. PLoS Computational
+Biology. 2009; 5:e1000532. [PubMed: 19816557]
+Gilbert CD, Wiesel TN. Functional organization of the visual cortex. Prog. Brain Res. 1983; 58:209–
+218. [PubMed: 6138809]
+Girard P, Bullier J. Visual activity in area V2 during reversible inactivation of area 17 in the macaque
+monkey. Journal of Neurophysiology. 1989; 62:1287. [PubMed: 2600626]
+Girard P, Salin PA, Bullier J. Visual activity in areas V3a and V3 during reversible inactivation of area
+V1 in the macaque monkey. Journal of Neurophysiology. 1991a; 66:1493. [PubMed: 1765790]
+Girard P, Salin PA, Bullier J. Visual activity in macaque area V4 depends on area 17 input.
+Neuroreport. 1991b; 2:81–84. [PubMed: 1883988]
+Girard P, Salin PA, Bullier J. Response selectivity of neurons in area MT of the macaque monkey
+during reversible inactivation of area V1. Journal of Neurophysiology. 1992; 67:1437. [PubMed:
+1629756]
+Gray CM, König P, Engel AK, Singer W. Oscillatory responses in cat visual cortex exhibit inter-
+columnar synchronization which reflects global stimulus properties. Nature. 1989; 338:334–337.
+[PubMed: 2922061]
+Gregoriou GG, Gotts SJ, Zhou H, Desimone R. High-Frequency, Long-Range Coupling Between
+Prefrontal and Visual Cortex During Attention. Science. 2009; 324:1207–1210. [PubMed:
+19478185]
+Gregory RL. Perceptual illusions and brain models. Proc. R. Soc. Lond., B. Biol. Sci. 1968; 171:279–
+296. [PubMed: 4387405]
+Gregory RL. Perceptions as hypotheses. Philos. Trans. R. Soc. Lond., B. Biol. Sci. 1980; 290:181–197.
+[PubMed: 6106237]
+Gwin JT, Ferris DP. Beta- and gamma-range human lower limb corticomuscular coherence. Front
+Hum Neurosci. 2012; 6:258. [PubMed: 22973219]
+Haeusler S, Maass W. A statistical analysis of information-processing properties of lamina-specific
+cortical microcircuit models. Cerebral Cortex. 2007; 17:149. [PubMed: 16481565]
+Harrison LM, Stephan KE, Rees G, Friston KJ. Extra-classical receptive field effects measured in
+striate cortex with fMRI. Neuroimage. 2007; 34:1199–1208. [PubMed: 17169579]
+
+Bastos et al.
+Page 18
+
+Neuron. Author manuscript; available in PMC 2013 September 19.
+
+NIH-PA Author Manuscript
+NIH-PA Author Manuscript
+NIH-PA Author Manuscript
+
+
+Häusser M, Mel B. Dendrites: bug or feature? Current Opinion in Neurobiology. 2003; 13:372–383.
+[PubMed: 12850223]
+Hebb, DO. The Organization of Behavior: A Neuropsychological Theory. Wiley; New York: 1949.
+Helmholtz, H. Handbuch der Physiologischen Optik. English translation. Dover; New York: 1860.
+Hinton G, van Camp D. Keeping neural networks simple by minimizing the description length of
+weights. Proceedings of COLT-93. 1993:5–13.
+Hinton, GE.; Zemel, RS. Autoencoders, Minimum Description Length, and Helmholtz Free Energy.
+In: Cowan, JD.; Tesauro, G.; Alspector, J., editors. Advances in Neural Information Processing
+Systems 6. Morgan Kaufmann; San Mateo, CA: 1994.
+Hopfinger JB, Buonocore MH, Mangun GR. The neural mechanisms of top-down attentional control.
+Nature Neuroscience. 2000; 3:284–291.
+Horton JC, Adams DL. The cortical column: a structure without a function. Philosophical Transactions
+of the Royal Society B: Biological Sciences. 2005; 360:837–862.
+Hubel DH, Wiesel TN. Receptive fields, binocular interaction and functional architecture in the cat's
+visual cortex. The Journal of Physiology. 1962; 160:106. [PubMed: 14449617]
+Hubel DH, Wiesel TN. Laminar and columnar distribution of geniculo-cortical fibers in the macaque
+monkey. The Journal of Comparative Neurology. 1972; 146:421–450. [PubMed: 4117368]
+Hubel DH, Wiesel TN. Sequence regularity and geometry of orientation columns in the monkey striate
+cortex. J. Comp. Neurol. 1974; 158:267–293. [PubMed: 4436456]
+Hupé JM, James AC, Payne BR, Lomber SG, Girard P, Bullier J. Cortical feedback improves
+discrimination between figure and background by V1, V2 and V3 neurons. Nature. 1998;
+394:784–787. [PubMed: 9723617]
+Johnson RR, Burkhalter A. Microcircuitry of forward and feedback connections within rat visual
+cortex. J. Comp. Neurol. 1996; 368:383–398. [PubMed: 8725346]
+Johnson RR, Burkhalter A. A polysynaptic feedback circuit in rat visual cortex. The Journal of
+Neuroscience. 1997; 17:7129. [PubMed: 9278547]
+Kätzel D, Zemelman BV, Buetfering C, Wölfel M, Miesenböck G. The columnar and laminar
+organization of inhibitory connections to neocortical excitatory cells. Nature Neuroscience. 2010;
+14:100–107.
+Kay JW, Phillips WA. Coherent Infomax as a computational goal for neural systems. Bull. Math. Biol.
+2011; 73:344–372. [PubMed: 20821064]
+Klampfl S, Legenstein R, Maass W. Spiking neurons can learn to solve information bottleneck
+problems and extract independent components. Neural Computation. 2009; 21:911–959. [PubMed:
+19018708]
+Knight RT. Decreased response to novel stimuli after prefrontal lesions in man. Electroencephalogr
+Clin Neurophysiol. 1984; 59:9–20. [PubMed: 6198170]
+Knight RT, Scabini D, Woods DL. Prefrontal cortex gating of auditory transmission in humans. Brain
+Res. 1989; 504:338–342. [PubMed: 2598034]
+Knill DC, Pouget A. The Bayesian brain: the role of uncertainty in neural coding and computation.
+Trends in Neurosciences. 2004; 27:712–719. [PubMed: 15541511]
+de Kock CPJ, Bruno RM, Spors H, Sakmann B. Layer- and cell-type-specific suprathreshold stimulus
+representation in rat primary somatosensory cortex. J. Physiol. (Lond.). 2007; 581:139–154.
+[PubMed: 17317752]
+Kok P, Rahnev D, Jehee JFM, Lau HC, de Lange FP. Attention Reverses the Effect of Prediction in
+Silencing Sensory Signals. Cerebral Cortex. 2011; 22:2197–2206. [PubMed: 22047964]
+Lakatos P, Karmos G, Mehta AD, Ulbert I, Schroeder CE. Entrainment of neuronal oscillations as a
+mechanism of attentional selection. Science. 2008; 320:110–113. [PubMed: 18388295]
+Larkum ME, Nevian T, Sandler M, Polsky A, Schiller J. Synaptic Integration in Tuft Dendrites of
+Layer 5 Pyramidal Neurons: A New Unifying Principle. Science. 2009; 325:756–760. [PubMed:
+19661433]
+Linsker R. Perceptual neural organization: some approaches based on network models and information
+theory. Annual Review of Neuroscience. 1990; 13:257–281.
+Livingstone MS. Oscillatory firing and interneuronal correlations in squirrel monkey striate cortex.
+Journal of Neurophysiology. 1996; 75:2467–2485. [PubMed: 8793757]
+London M, Häusser M. Dendritic computation. Annu. Rev. Neurosci. 2005; 28:503–532. [PubMed:
+16033324]
+Lopes-dos-Santos V, Conde-Ocazionez S, Nicolelis MAL, Ribeiro ST, Tort ABL. Neuronal Assembly
+Detection and Cell Membership Specification by Principal Component Analysis. PLoS ONE.
+2011; 6:e20996. [PubMed: 21698248]
+MacKay, DM. Automata Studies. Shannon, CE.; McCarthy, J., editors. Princeton Univ. Press;
+Princeton, NJ: 1956. p. 235-251.
+Maier A, Adams GK, Aura C, Leopold DA. Distinct superficial and deep laminar domains of activity
+in the visual cortex during rest and stimulation. Frontiers in Systems Neuroscience. 2010; 4
+Markov NT, Misery P, Falchier A, Lamy C, Vezoli J, Quilodran R, Gariel MA, Giroud P, Ercsey-
+Ravasz M, Pilaz LJ, et al. Weight consistency specifies regularities of macaque cortical networks.
+Cerebral Cortex. 2011; 21:1254. [PubMed: 21045004]
+Meirovithz E, Ayzenshtat I, Werner-Reiss U, Shamir I, Slovin H. Spatiotemporal Effects of
+Microsaccades on Population Activity in the Visual Cortex of Monkeys during Fixation. Cerebral
+Cortex. 2011; 22:294–307. [PubMed: 21653284]
+Melzer S, Michael M, Caputi A, Eliava M, Fuchs EC, Whittington MA, Monyer H. Long-Range-
+Projecting GABAergic Neurons Modulate Inhibition in Hippocampus and Entorhinal Cortex.
+Science. 2012; 335:1506–1510. [PubMed: 22442486]
+Meyer HS, Schwarz D, Wimmer VC, Schmitt AC, Kerr JND, Sakmann B, Helmstaedter M. Inhibitory
+interneurons in a cortical column form hot zones of inhibition in layers 2 and 5A. Proceedings of
+the National Academy of Sciences. 2011; 108:16807–16812.
+Meyer T, Olson CR. Statistical learning of visual transitions in monkey inferotemporal cortex.
+Proceedings of the National Academy of Sciences. 2011; 108:19401–19406.
+Mignard M, Malpeli JG. Paths of information flow through visual cortex. Science. 1991; 251:1249.
+[PubMed: 1848727]
+Moran RJ, Stephan KE, Kiebel SJ, Rombach N, O'Connor WT, Murphy KJ, Reilly RB, Friston KJ.
+Bayesian estimation of synaptic physiology from the spectral responses of neural masses.
+Neuroimage. 2008; 42:272–284. [PubMed: 18515149]
+Moran RJ, Symmonds M, Stephan KE, Friston KJ, Dolan RJ. An in vivo assay of synaptic function
+mediating human cognition. Curr. Biol. 2011; 21:1320–1325. [PubMed: 21802302]
+Mountcastle VB. Modality and topographic properties of single neurons of cat's somatic sensory
+cortex. Journal of Neurophysiology. 1957; 20:408. [PubMed: 13439410]
+Mountcastle VB. The columnar organization of the neocortex. Brain. 1997; 120:701. [PubMed:
+9153131]
+Mumford D. On the computational architecture of the neocortex. II. The role of cortico-cortical loops.
+Biol Cybern. 1992; 66:241–251. [PubMed: 1540675]
+Murphy PC, Sillito AM. Corticofugal feedback influences the generation of length tuning in the visual
+pathway. Nature. 1987; 329:727–729. [PubMed: 3670375]
+Murray SO. Spatially Specific fMRI Repetition Effects in Human Visual Cortex. Journal of
+Neurophysiology. 2005; 95:2439–2445. [PubMed: 16394067]
+Murray SO, Kersten D, Olshausen BA, Schrater P, Woods DL. Shape perception reduces activity in
+human primary visual cortex. Proc. Natl. Acad. Sci. U.S.A. 2002; 99:15164–15169. [PubMed:
+12417754]
+Neisser, U. Cognitive psychology. Appleton-Century-Crofts; New York: 1967.
+de No, LR. Cerebral cortex: architecture, intracortical connections, motor projections. In: Fulton, JF.,
+editor. Physiology of the Nervous System. Oxford University Press; Oxford: 1949. p. 288-330.
+Olsen SR, Bortone DS, Adesnik H, Scanziani M. Gain control by layer six in cortical circuits of vision.
+Nature. 2012
+De Pasquale R, Sherman SM. Synaptic properties of corticocortical connections between the primary
+and secondary visual cortical areas in the mouse. J. Neurosci. 2011; 31:16494–16506. [PubMed:
+22090476]
+Raizada RDS, Grossberg S. Towards a theory of the laminar architecture of cerebral cortex:
+Computational clues from the visual system. Cerebral Cortex. 2003; 13:100–113. [PubMed:
+12466221]
+Rao RP, Ballard DH. Predictive coding in the visual cortex: a functional interpretation of some extra-
+classical receptive-field effects. Nature Neuroscience. 1999; 2:79–87.
+Roopun AK, Kramer MA, Carracedo LM, Kaiser M, Davies CH, Traub RD, Kopell NJ, Whittington
+MA. Period concatenation underlies interactions between gamma and beta rhythms in neocortex.
+Front Cell Neurosci. 2008; 2:1. [PubMed: 18946516]
+Roopun AK, Middleton SJ, Cunningham MO, LeBeau FE, Bibbig A, Whittington MA, Traub RD. A
+beta2-frequency (20–30 Hz) oscillation in nonsynaptic networks of somatosensory cortex.
+Proceedings of the National Academy of Sciences. 2006; 103:15646.
+Saalmann YB, Pigarev IN, Vidyasagar TR. Neural Mechanisms of Visual Attention: How Top-Down
+Feedback Highlights Relevant Locations. Science. 2007; 316:1612–1615. [PubMed: 17569863]
+Saalmann YB, Pinsk MA, Wang L, Li X, Kastner S. The Pulvinar Regulates Information Transmission
+Between Cortical Areas Based on Attention Demands. Science. 2012; 337:753–756. [PubMed:
+22879517]
+Sakata S, Harris KD. Laminar Structure of Spontaneous and Sensory-Evoked Population Activity in
+Auditory Cortex. Neuron. 2009; 64:404–418. [PubMed: 19914188]
+Salin PA, Bullier J. Corticocortical connections in the visual system: structure and function. Physiol.
+Rev. 1995; 75:107–154. [PubMed: 7831395]
+Sandell JH, Schiller PH. Effect of cooling area 18 on striate cortex cells in the squirrel monkey.
+Journal of Neurophysiology. 1982; 48:38. [PubMed: 6288886]
+Sherman SM, Guillery R. On the actions that one nerve cell can have on another: distinguishing
+“drivers” from “modulators.” Proceedings of the National Academy of Sciences of the United
+States of America. 1998; 95:7121. [PubMed: 9618549]
+Sherman SM, Guillery RW. Distinct functions for direct and transthalamic corticocortical connections.
+Journal of Neurophysiology. 2011; 106:1068–1077. [PubMed: 21676936]
+Shipp S. Structure and function of the cerebral cortex. Current Biology. 2007; 17:443–449.
+Shlosberg D, Amitai Y, Azouz R. Time-Dependent, Layer-Specific Modulation of Sensory Responses
+Mediated by Neocortical Layer 1. Journal of Neurophysiology. 2006; 96:3170–3182. [PubMed:
+17110738]
+Sillito AM, Cudeiro J, Murphy PC. Orientation sensitive elements in the corticofugal influence on
+centre-surround interactions in the dorsal lateral geniculate nucleus. Exp Brain Res. 1993; 93:6–
+16. [PubMed: 8467892]
+Sincich LC, Horton JC. The circuitry of V1 and V2: integration of color, form, and motion. Annu.
+Rev. Neurosci. 2005; 28:303–326. [PubMed: 16022598]
+Singer W. Neuronal synchrony: a versatile code for the definition of relations? Neuron. 1999; 24:49–
+65. 111–125. [PubMed: 10677026]
+Singer W, Engel AK, Kreiter AK, Munk MH, Neuenschwander S, Roelfsema PR. Neuronal
+assemblies: necessity, signature and detectability. Trends Cogn. Sci. (Regul. Ed.). 1997; 1:252–
+261. [PubMed: 21223920]
+Spratling MW. Reconciling predictive coding and biased competition models of cortical function.
+Front Comput Neurosci. 2008; 2:4. [PubMed: 18978957]
+Srinivasan MV, Laughlin SB, Dubs A. Predictive coding: a fresh view of inhibition in the retina. Proc.
+R. Soc. Lond., B, Biol. Sci. 1982; 216:427–459. [PubMed: 6129637]
+Summerfield C, Trittschuh EH, Monti JM, Mesulam M-M, Egner T. Neural repetition suppression
+reflects fulfilled perceptual expectations. Nature Neuroscience. 2008; 11:1004–1006.
+Summerfield C, Wyart V, Johnen VM, de Gardelle V. Human Scalp Electroencephalography Reveals
+that Repetition Suppression Varies with Expectation. Front Hum Neurosci. 2011; 5:67. [PubMed:
+21847378]
+Theyel BB, Llano DA, Sherman SM. The corticothalamocortical circuit drives higher-order cortex in
+the mouse. Nature Neuroscience. 2009; 13:84–88.
+Thomson AM, Bannister AP. Interlaminar connections in the neocortex. Cerebral Cortex. 2003; 13:5–
+14. [PubMed: 12466210]
+Todorovic A, van Ede F, Maris E, de Lange FP. Prior Expectation Mediates Neural Adaptation to
+Repeated Sounds in the Auditory Cortex: An MEG Study. Journal of Neuroscience. 2011;
+31:9118–9123. [PubMed: 21697363]
+Ullman S. Sequence seeking and counter streams: a computational model for bidirectional information
+flow in the visual cortex. Cereb. Cortex. 1995; 5:1–11. [PubMed: 7719126]
+Usrey WM, Fitzpatrick D. Specificity in the axonal connections of layer VI neurons in tree shrew
+striate cortex: evidence for distinct granular and supragranular systems. The Journal of
+Neuroscience. 1996; 16:1203. [PubMed: 8558249]
+Varela F, Lachaux JP, Rodriguez E, Martinerie J. The brainweb: phase synchronization and large-scale
+integration. Nature Reviews Neuroscience. 2001; 2:229–239.
+Vezoli J. Quantitative Analysis of Connectivity in the Visual Cortex: Extracting Function from
+Structure. The Neuroscientist. 2004; 10:476–482. [PubMed: 15359013]
+Vicente R, Gollo LL, Mirasso CR, Fischer I, Pipa G. Dynamical relaying can yield zero time lag
+neuronal synchrony despite long conduction delays. Proceedings of the National Academy of
+Sciences. 2008; 105:17157.
+Wacongne C, Labyt E, van Wassenhove V, Bekinschtein T, Naccache L, Dehaene S. Evidence for a
+hierarchy of predictions and prediction errors in human cortex. Proceedings of the National
+Academy of Sciences. 2011; 108:20754–20759.
+Wang X-J. Neurophysiological and computational principles of cortical rhythms in cognition. Physiol.
+Rev. 2010; 90:1195–1268. [PubMed: 20664082]
+Weiler N, Wood L, Yu J, Solla SA, Shepherd GMG. Top-down laminar organization of the excitatory
+network in motor cortex. Nat Neurosci. 2008; 11:360–366. [PubMed: 18246064]
+Wozny C, Williams SR. Specificity of Synaptic Connectivity between Layer 1 Inhibitory Interneurons
+and Layer 2/3 Pyramidal Neurons in the Rat Neocortex. Cerebral Cortex. 2011
+Wurtz RH, McAlonan K, Cavanaugh J, Berman RA. Thalamic pathways for active vision. Trends
+Cogn. Sci. (Regul. Ed.). 2011; 15:177–184. [PubMed: 21414835]
+Wyart V, Nobre AC, Summerfield C. Dissociable prior influences of signal probability and relevance
+on visual contrast sensitivity. Proceedings of the National Academy of Sciences. 2012;
+109:3593–3598.
+Yamaguchi S, Knight RT. Gating of somatosensory input by human prefrontal cortex. Brain Res.
+1990; 521:281–288. [PubMed: 2207666]
+Yoshimura Y, Callaway EM. Fine-scale specificity of cortical networks depends on inhibitory cell
+type and connectivity. Nat. Neurosci. 2005; 8:1552–1559. [PubMed: 16222228]
+Yuille A, Kersten D. Vision as Bayesian inference: analysis by synthesis? Trends Cogn. Sci. (Regul.
+Ed.). 2006; 10:301–308. [PubMed: 16784882]
+Zeki S, Shipp S. The functional logic of cortical connections. Nature. 1988; 335:311–317. [PubMed:
+3047584]
+Zeki SM. The cortical projections of foveal striate cortex in the rhesus monkey. J. Physiol. (Lond.).
+1978; 277:227–244. [PubMed: 418174]
+
+Figure 1.
+This is a schematic of the classical microcircuit adapted from Douglas and Martin (1991).
+This minimal circuitry comprises superficial (layers 2 and 3) and deep (layers 5 and 6)
+pyramidal cells and a population of smooth inhibitory cells. Feedforward inputs – from the
+thalamus – target all cell populations, but with an emphasis on inhibitory interneurons and
+superficial and granular layers. Note the symmetrical deployment of inhibitory and
+excitatory intrinsic connections that maintain a balance of excitation and inhibition.
+
+Figure 2.
+This is a simplified schematic of the key intrinsic connections among excitatory (E) and
+inhibitory (I) populations in granular (L4), supragranular (L1/2/3) and infragranular (L5/6)
+layers. The excitatory interlaminar connections are based largely on Gilbert and Wiesel
+(1983). Forward connections denote feedforward extrinsic corticocortical or thalamocortical
+afferents that are reciprocated by backward or feedback connections. Anatomical and
+functional data suggest that afferent input enters primarily into L4 and is conveyed to
+superficial layers L2/3 that are rich in pyramidal cells, which project forward to the next
+cortical area, forming a disynaptic route between thalamus and secondary cortical areas
+(Callaway, 1998). Information from L2/3 is then sent to L5 and L6, which sends (intrinsic)
+feedback projections back to L4 (Usrey and Fitzpatrick, 1996). L5 cells originate feedback
+connections to earlier cortical areas as well as to the pulvinar, superior colliculus, and brain
+stem. In summary, forward input is segregated by intrinsic connections into a superficial
+forward stream and a deep backward stream. In this schematic, we have juxtaposed densely
+interconnected excitatory and inhibitory populations within each layer.
+
+Figure 3.
+This schematic shows an example of a generative model. Generative models describe how
+(sensory) data are caused. In this figure, sensory states (blue circles on the periphery) are
+generated by hidden variables (in the centre). The left panel shows the model as a
+probabilistic graphical model, where unknown variables (hidden causes and states) are
+associated with the nodes of a dependency graph and conditional dependencies are indicated
+by arrows. Hidden states confer memory on the model by virtue of having dynamics, while
+hidden causes connect nodes. A graphical model describes the conditional dependencies
+among hidden variables generating data. These dependencies are typically modelled as
+(differential) equations with nonlinear mappings and random fluctuations with precision
+(inverse variance) Π(i) (see the equations in the insert on the left). This allows one to specify
+the precise form of the probabilistic generative model and leads to a simple and efficient
+inversion scheme (predictive coding; see next figure). Here 
+ denotes the set of hidden
+causes that constitute the parents of sensory s̃(i) or hidden x̃(i) states. The ~ indicates states in
+generalised coordinates of motion: x̃ = (x, x′, x″,...). An intuitive version of the model is
+shown on the right: here, we imagine that a singing bird is the cause of sensations, which –
+through a cascade of dynamical hidden states – produces modality-specific consequences
+(e.g., the auditory object of a bird song and the visual object of a song bird). These
+intermediate causes are themselves (hierarchically) unpacked to generate sensory signals.
+The generative model therefore maps from causes (e.g., concepts) to consequences (e.g.,
+sensations), while its inversion corresponds to mapping from sensations to concepts or
+representations. This inversion corresponds to perceptual synthesis – in which the generative
+model is used to generate predictions. Note that this inversion implicitly resolves the binding
+problem - by explaining multisensory cues with a single cause.
+
+Figure 4.
+This figure describes the predictive coding scheme associated with a simple hierarchical
+model shown on the left. In this model each node has a single parent. The ensuing inversion
+or generalised predictive coding scheme is shown on the right. The key quantities in this
+scheme are (conditional) expectations of the hidden states and causes and their associated
+prediction errors. The basic architecture – implied by the inversion of the graphical
+(hierarchical) model – suggests that prediction errors (caused by unpredicted fluctuations in
+hidden variables) are passed up the hierarchy to update conditional expectations. These
+conditional expectations now provide predictions that are passed down the hierarchy to form
+prediction errors. We presume that the forward and backward message passing between
+hierarchical levels is mediated by extrinsic (feedforward and feedback) connections.
+Neuronal populations encoding conditional expectations and prediction errors now have to
+be deployed in a canonical microcircuit to understand the computational logic of intrinsic
+connections – within each level of the hierarchy – as shown in the next figure.
+
+Figure 5.
+The left hand panel is the canonical microcircuit based on Haeusler and Maass (2007),
+where we have removed inhibitory cells from the deep layers – because they have very little
+interlaminar connectivity. The numbers denote connection strengths (mean amplitude of
+PSPs measured at soma in mV) and connection probabilities (in parentheses) according to
+Thomson et al. (2002). The right panel shows the proposed cortical microcircuit for
+predictive coding, where the quantities of the previous figure have been associated with
+various cell types. Here, prediction error populations are highlighted in pink. Inhibitory
+connections are shown in red, while excitatory connections are in black. The dotted lines
+refer to connections that are not present in the microcircuit on the left (but see Figure 2). In
+this scheme, expectations (about causes and states) are assigned to (excitatory and
+inhibitory) interneurons in the supragranular layers, which are passed to infragranular layers.
+The corresponding prediction errors occupy granular layers, while superficial pyramidal
+cells encode prediction errors that are sent forward to the next hierarchical level. Conditional
+expectations and prediction errors on hidden causes are associated with excitatory cell types,
+while the corresponding quantities for hidden states are assigned to inhibitory cells. Dark
+circles indicate pyramidal cells. Finally, we have placed the precision of the feedforward
+prediction errors against the superficial pyramidal cells. This quantity controls the
+postsynaptic sensitivity or gain to (intrinsic and top-down) pre-synaptic inputs. We have
+previously discussed this in terms of attentional modulation, which may be intimately linked
+to the synchronisation of pre-synaptic inputs and ensuing postsynaptic responses (Fries et al.,
+2001; Feldman and Friston, 2010).
+
+Figure 6.
+This schematic illustrates the functional asymmetry between the spectral activity of
+superficial and deep cells predicted theoretically. In this illustrative example, we have
+ignored the effects of influences on the expectations of hidden causes (encoded by deep
+pyramidal cells), other than the prediction error on causes (encoded by superficial pyramidal
+cells). The lower panel shows the spectral density of deep pyramidal cell activity, given the
+spectral density of superficial pyramidal cell activity in the upper panel. The equation
+expresses the spectral density of the deep cells as a function of the spectral density of the
+superficial cells; using Equation (2). This schematic is meant to illustrate how the relative
+amounts of low (beta) and high (gamma) frequency activity in superficial and deep cells can
+be explained by the evidence accumulation implicit in predictive coding.
+
+Table 1
+
+Electrophysiological and neuroimaging findings consistent with predictive coding.
+
+Prediction violated | Area studied | Neuronal expression of prediction error | Study
+Learned visual object pairings | Monkey inferotemporal cortex (IT) | Enhanced firing rate | Meyer and Olson, 2011
+Natural image statistics | Monkey V1, V2, V3 | Enhanced firing rate | Hupé et al., 1998; Bullier et al., 1996; Bair et al., 2003
+Repetitive auditory stream | Early human auditory cortex | Enhanced Event Related Potentials (ERPs), enhanced gamma-band power | Garrido et al., 2007, 2009; Todorovic et al., 2011
+Coherence of visual form and motion | Human V1, V2, V3, V4, V5/MT | Enhanced BOLD response | Murray et al., 2002; Murray et al., 2005; Harrison et al., 2007
+Audio-visual congruence of speech | Visual and auditory cortex | Gamma-band oscillatory activity | Arnal et al., 2011
+Predictability of visual stimuli as a function of attention | Human V1, V2, V3 | Enhanced BOLD response when unattended, reduced BOLD when attended | Kok et al., 2011
+Hierarchical expectations in auditory sequences | Human temporal cortex | Enhanced Event Related Potentials (ERPs) | Wacongne et al., 2011
+Expected repetition (or alternation) of face stimuli | FFA in fMRI, parietal and central electrodes of EEG | Enhanced BOLD response, diminished repetition suppression of ERP | Summerfield et al., 2008, 2011
+Apparent motion of visual stimulus | V1 | Enhanced BOLD response | Alink et al., 2010
+
+Table 2
+
+The functional (computational) correlates of the anatomy and physiology of cortical hierarchies and their
+extrinsic connections.
+
+Anatomy and physiology | Functional correlates
+Hierarchical organisation of cortical areas (Zeki and Shipp 1988; Felleman and Van Essen, 1991; Barone et al., 2000; Vezoli, 2004) | Encoding of conditional dependencies in terms of a graphical model (Mumford, 1992; Rao and Ballard, 1999; Friston 2008).
+Distinct (laminar-specific) neuronal responses (Douglas et al., 1989; Douglas and Martin, 1991) | Encoding expected states of the world (superficial pyramidal cells) and prediction errors (deep pyramidal cells) (Mumford, 1992; Friston 2008).
+Distinct (laminar-specific) extrinsic connections (Zeki and Shipp 1988; Felleman and Van Essen, 1991; Barone et al., 2000; Vezoli, 2004; Markov et al., 2011). | Forward connections convey prediction error (from superficial pyramidal cells) and backward connections convey predictions (from deep pyramidal cells) (Mumford, 1992; Friston 2008).
+Reciprocal extrinsic connectivity (Zeki and Shipp 1988; Felleman and Van Essen, 1991; Barone et al., 2000; Vezoli, 2004; Markov et al., 2011) | Recurrent dynamics are intrinsically stable because they are trying to suppress prediction error (Crick and Koch 1998; Friston 2008).
+Feedback extrinsic connections are (driving and) modulatory (Mignard and Malpeli 1991; Bullier et al., 1996; Sherman and Guillery 1998; Covic and Sherman, 2011; De Pasquale and Sherman, 2011). | Forwards (driving) and backwards (driving and modulatory) connections mediate the (linear) influence of prediction errors and the (linear and non-linear) construction of predictions (Friston 2008; 2010).
+Feedback extrinsic connections are inhibitory (Murphy and Sillito, 1987; Sillito et al., 1993; Chu et al., 2003; Olsen et al. 2012; Meyer et al., 2011; Wozny and Williams, 2011). | Top-down predictions suppress or counter prediction errors produced by bottom up inputs (Mumford, 1992; Rao and Ballard, 1999; Friston 2008).
+Differences in neuronal dynamics of superficial and deep layers (de Kock et al., 2007; Sakata and Harris, 2009; Maier et al., 2010; Bollimunta et al., 2011; Buffalo et al., 2011). | Principal cells elaborating predictions (deep pyramidal cells) may show distinct (low-pass) dynamics, relative to those encoding error (superficial pyramidal cells) (Friston 2008).
+Dense intrinsic and horizontal connectivity (Thomson and Bannister, 2003; Kätzel et al., 2010). | Lateral predictions and prediction errors mediating winnerless competition and competitive lateral dependencies (Desimone, 1996; Friston 2010).
+Predominance of nonlinear synaptic (dendritic and neuromodulatory) infrastructure in superficial layers (Häusser and Mel, 2003; London and Häusser, 2005; Gentet et al., 2012). | Required to scale prediction errors, in proportion to their precision, affording a form of cortical bias or gain control that encodes uncertainty (Feldman and Friston 2010; Spratling, 2008)
+
diff --git a/papers/references/temp/Schlosshauer2007.pdf b/papers/references/temp/Schlosshauer2007.pdf
new file mode 100644
index 00000000..6b0ad828
--- /dev/null
+++ b/papers/references/temp/Schlosshauer2007.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:eecf513088343ae57d312f1892d2e7069214be969b54f31fc64e14339bafd4cd
+size 2028534
diff --git a/papers/references/temp/Schlosshauer2007.txt b/papers/references/temp/Schlosshauer2007.txt
new file mode 100644
index 00000000..7f1a1127
--- /dev/null
+++ b/papers/references/temp/Schlosshauer2007.txt
@@ -0,0 +1,3798 @@
+The quantum-to-classical transition and decoherence
+
+Maximilian Schlosshauer
+Department of Physics, University of Portland, 5000 North Willamette Boulevard, Portland, Oregon 97203, USA
+
+I give a pedagogical overview of decoherence and its role in providing a dynamical account of the
+quantum-to-classical transition. The formalism and concepts of decoherence theory are reviewed,
+followed by a survey of master equations and decoherence models. I also discuss methods for
+mitigating decoherence in quantum information processing and describe selected experimental
+investigations of decoherence processes.
+Note: Please see arXiv:1911.06282 [quant-ph] (published as Phys. Rep. 831, 1–57, 2019) for
+a much more extensive and up-to-date review of decoherence.
+
+CONTENTS
+
+I. Introduction
+II. Basic formalism and concepts
+   A. Decoherence and interference damping
+   B. Environmental monitoring and information transfer
+   C. Environment-induced superselection and decoherence-free subspaces
+      1. Pointer states and the commutativity criterion
+      2. Decoherence-free subspaces
+   D. Proliferation of information and quantum Darwinism
+   E. Decoherence versus dissipation and noise
+III. Master equations
+   A. Born–Markov master equations
+   B. Lindblad master equations
+   C. Non-Markovian decoherence
+IV. Decoherence models
+   A. Collisional decoherence
+   B. Quantum Brownian motion
+   C. Spin–boson models
+   D. Spin-environment models
+V. Qubit decoherence, quantum error correction, and error avoidance
+   A. Correction of decoherence-induced quantum errors
+   B. Quantum computation on decoherence-free subspaces
+   C. Environment engineering and dynamical decoupling
+VI. Experimental studies of decoherence
+   A. Atoms in a cavity
+   B. Matter-wave interferometry
+   C. Superconducting systems
+VII. Decoherence and the foundations of quantum mechanics
+References
+
+I. INTRODUCTION
+
+Realistic quantum systems are never completely iso-
+lated from their environment. When a quantum system
+interacts with its environment, it will in general become
+entangled with a large number of environmental degrees
+of freedom. This entanglement influences what we can
+locally observe upon measuring the system. In partic-
+ular, quantum interference effects with respect to cer-
+tain physical quantities (most notably, “classical” quan-
+tities such as position) become effectively suppressed,
+making them prohibitively difficult to observe in most
+cases of practical interest. This is the process of deco-
+herence, sometimes also called dynamical decoherence or
+environment-induced decoherence [1–10]. Stated in gen-
+eral and interpretation-neutral terms, decoherence de-
+scribes how entangling interactions with the environment
+influence the statistics of results of future measurements
+on the system.
+Formally, decoherence can be viewed as a dynamical
+filter on the space of quantum states, singling out those
+states that, for a given system, can be stably prepared
+and maintained, while effectively excluding most other
+states, in particular, nonclassical superposition states of
+the kind popularized by Schrödinger’s cat. In this way,
+decoherence lies at the heart of the quantum-to-classical
+transition. It ensures consistency between quantum and
+classical predictions for systems observed to behave clas-
+sically. It provides a quantitative, dynamical account of
+the boundary between quantum and classical physics. In
+any concrete experimental situation, decoherence theory
+specifies the physical requirements, both qualitative and
+quantitative, for pushing the quantum–classical bound-
+ary toward the quantum realm. Decoherence is a pure
+quantum effect, to be distinguished from classical dissi-
+pation and stochastic fluctuations (noise).
+Decoherence processes are extremely efficient.
+Even
+when the environment does not, from a classical point
+of view, impart significant classical perturbations on the
+system, quantum-mechanically the system will in most
+circumstances become rapidly and strongly entangled
+with the environment. Furthermore, due to the many un-
+controllable degrees of freedom of the environment, such
+entanglement is usually irreversible for all practical purposes.
+Increasingly realistic models of decoherence processes
+have been developed, progressing from toy models
+to complex models tailored to specific experiments (see
+Sec. IV). Advances in experimental techniques have made
+it possible to observe the gradual action of decoherence
+in experiments such as matter-wave interferometry [11],
+cavity QED [12], and superconducting systems [13] (see
+Sec. VI).
+
+arXiv:1404.2635v2  [quant-ph]  20 Nov 2019
+
+The superposition states necessary for quantum in-
+formation processing are typically also those most sus-
+ceptible to decoherence. Thus, decoherence is a major
+barrier to implementing devices for quantum informa-
+tion processing such as quantum computers (see Sec. V).
+Qubit systems must be engineered to minimize environ-
+mental interactions detrimental to the preparation and
+longevity of the desired superposition states.
+At the
+same time, they must remain sufficiently open to al-
+low for their control.
+Quantum error correction can
+undo some of the decoherence-induced degradation of
+the superposition state and will be an integral part of
+quantum computers (see Sec. V A). Not only is deco-
+herence relevant to quantum information, but also vice
+versa. An information-centric view of quantum mechan-
+ics proves helpful in conveying the essence of the deco-
+herence process and is also used in recent explorations
+of the role of the environment as an information channel
+(see Sec. II B).
+It is a curious “historical accident” (Joos’s term [14,
+p. 13]) that the role of the environment in quantum me-
+chanics was appreciated only relatively late. While one
+can find—for example, in Heisenberg’s writings [15]—a
+few early anticipatory remarks about the role of environ-
+mental interactions in the quantum-mechanical descrip-
+tion of systems, it wasn’t until the 1970s that the ubiquity
+and implications of environmental entanglement were re-
+alized by Zeh [1, 16]. It took another decade for the for-
+malism of decoherence to be developed, chiefly by Zurek
+[2, 3], and for concrete models and numerical estimates
+of decoherence rates to be worked out [17, 18].
+Review papers on decoherence include Refs. [4–6, 10, 19].
+There are two books on decoherence: a volume
+by Joos et al. [8] (a collection of chapters written by
+different authors) and a monograph by this author [9].
+Ref. [20] also contains material on decoherence. Foun-
+dational implications of decoherence are discussed in
+Refs. [6, 7, 9, 21].
+
+II. BASIC FORMALISM AND CONCEPTS
+
+In the double-slit experiment, we cannot observe an in-
+terference pattern if we also measure which slit the parti-
+cle went through (that is, if we obtain perfect which-path
+information). In fact, there is a continuous tradeoff be-
+tween interference (phase information) and which-path
+information: the better we can distinguish the two pos-
+sible paths, the less visible the interference pattern be-
+comes [22]. What is more, for a decrease in interference
+visibility to occur it suffices that there are degrees of
+freedom somewhere in the world that, if they were measured,
+would allow us to make, with a certain degree of
+confidence, a statement about the path of the particle
+through the slits.
+While we cannot say that prior to
+their measurement, these degrees of freedom have en-
+coded information about a particular, definitive path of
+the particle—instead, we have merely correlations involv-
+ing both possible paths—no actual measurement is re-
+quired to bring about the decrease in interference visibil-
+ity. It is enough that, in principle, we could make such
+a measurement to obtain which-path information.
+This is somewhat loose talk, and conceptual caveats
+lurk. But it captures quite well the essence of what is
+happening in decoherence, where those “degrees of free-
+dom somewhere in the world” are the degrees of freedom
+of the system’s environment interacting with the system,
+leading to the creation of quantum correlations (entan-
+glement) between system and environment. Decoherence
+can thus be thought of as a process arising from the con-
+tinuous monitoring of the system by the environment;
+effectively, the environment is performing nondemolition
+measurements on the system (see Sec. II B). We now give
+a formal quantum-mechanical account of what we have
+just tried to convey in words, and then flesh out the con-
+sequences and details.
+
+A. Decoherence and interference damping
+
+Consider again the double-slit experiment and denote
+the quantum states of the particle (call it S, for “sys-
+tem”) corresponding to passage through slit 1 and 2 by
+|s1⟩ and |s2⟩, respectively. Suppose that the particle in-
+teracts with another system E—for example, a detec-
+tor or an environment—such that if the quantum state
+of the particle before the interaction is |s1⟩, then the
+quantum state of E will become |E1⟩ (and similarly for
+|s2⟩), resulting in the final composite states |s1⟩ |E1⟩ and
+|s2⟩ |E2⟩, respectively. For an initial superposition state
+α |s1⟩+β |s2⟩, the final composite state will be entangled,
+
+\[ |\Psi\rangle = \alpha\, |s_1\rangle |E_1\rangle + \beta\, |s_2\rangle |E_2\rangle . \tag{1} \]
+
+The statistics of all possible local measurements on S
+are exhaustively encoded in the reduced density matrix
+ρS,
+
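+\begin{align}
+\rho_S &= \mathrm{Tr}_E(\rho_{SE}) = \mathrm{Tr}_E |\Psi\rangle\langle\Psi| \notag \\
+&= |\alpha|^2 |s_1\rangle\langle s_1| + |\beta|^2 |s_2\rangle\langle s_2| \notag \\
+&\quad + \alpha\beta^* |s_1\rangle\langle s_2| \langle E_2|E_1\rangle + \alpha^*\beta |s_2\rangle\langle s_1| \langle E_1|E_2\rangle . \tag{2}
+\end{align}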
+
+For example, suppose we measure the particle’s position by
+letting the particle impinge on a distant detection screen.
+Statistically, the resulting particle probability density
+p(x) will be given by
+
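+\begin{align}
+p(x) &= \mathrm{Tr}_S(\rho_S\, x) \notag \\
+&= |\alpha|^2 |\psi_1(x)|^2 + |\beta|^2 |\psi_2(x)|^2 \notag \\
+&\quad + 2\, \mathrm{Re} \left\{ \alpha\beta^* \psi_1(x) \psi_2^*(x) \langle E_2|E_1\rangle \right\} , \tag{3}
+\end{align}
+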
+where ψi(x) ≡ ⟨x|si⟩. The last term represents the in-
+terference contribution. Thus, the visibility of the inter-
+ference pattern is quantified by the overlap ⟨E2|E1⟩, i.e.,
+by the distinguishability of |E1⟩ and |E2⟩. In the lim-
+iting case of perfect distinguishability, ⟨E2|E1⟩ = 0, no
+interference pattern will be observable and we obtain the
+classical prediction. Phase relations have become locally
+(i.e., with respect to S) inaccessible, and there is no mea-
+surement on S that can reveal coherence between |s1⟩ and
+|s2⟩. The coherence is now between the states |s1⟩ |E1⟩
+and |s2⟩ |E2⟩, requiring an appropriate global measure-
+ment (acting jointly on S and E) for it to be revealed.
+Conversely, if the interaction between S and E is such
+that E is completely unable to resolve the path of the
+particle, then |E1⟩ and |E2⟩ are indistinguishable and full
+coherence is retained at the level of S, as is also directly
+obvious from Eq. (1). In the intermediary regime where
+0 < |⟨E2|E1⟩| < 1, meaning that |E1⟩ and |E2⟩ can be
+distinguished in a one-shot measurement with nonzero
+probability $p = 1 - |\langle E_2|E_1\rangle|^2 < 1$, an interference pattern
+of reduced visibility is obtained. Equation (3) shows
+that the reduction in visibility increases as |E1⟩ and |E2⟩
+become more distinguishable.
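+
+As a worked illustration of Eq. (3), consider the special case of
+equal amplitudes and equal single-slit envelopes. The symbols A(x),
+φ1(x), φ2(x), r, and θ are introduced only for this sketch (with A
+real and 0 ≤ r ≤ 1) and do not appear in the derivation above.
+
+\begin{align*}
+&\alpha = \beta = \tfrac{1}{\sqrt{2}}, \qquad \psi_1(x) = A(x)\, e^{i\phi_1(x)}, \qquad \psi_2(x) = A(x)\, e^{i\phi_2(x)}, \qquad \langle E_2|E_1\rangle = r\, e^{i\theta}, \\
+&p(x) = A(x)^2 \left[ 1 + r \cos\big( \phi_1(x) - \phi_2(x) + \theta \big) \right], \\
+&V \equiv \frac{p_{\max} - p_{\min}}{p_{\max} + p_{\min}} = r = |\langle E_2|E_1\rangle| .
+\end{align*}
+
+Here the maximum and minimum are taken over the fringe phase at fixed
+envelope A(x). Under these assumptions the fringe visibility equals
+the magnitude of the environmental overlap: V = 1 for
+indistinguishable environment states and V = 0 for orthogonal ones.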
+Here is another way of putting the matter. Looking
+back at Eq. (1), we see that E encodes which-way infor-
+mation about S in the same “relative-state” sense [23] in
+which EPR correlations [24–26] may be said to encode
+“information.” That is, if ⟨E2|E1⟩ = 0 and we were to
+measure E and found it to be in state |E1⟩, we could, in
+EPR’s words [24, p. 777], “predict with certainty” that
+we will find S in |s1⟩ (see footnote 1). Whenever such a prediction is
+possible were we to measure E, no interference effects be-
+tween the components |s1⟩ and |s2⟩ can be measured at
+S, even if E is never actually measured. If |⟨E2|E1⟩| > 0,
+then E encodes only partial which-way information about
+S, in the sense that a measurement of E could not reliably
+distinguish between |E1⟩ and |E2⟩; instead, sometimes
+the measurement will result in an outcome compatible
+with both |E1⟩ and |E2⟩. Consequently, an interference
+experiment carried out on S would find reduced visibil-
+ity, representing diminished local coherence between the
+components |s1⟩ and |s2⟩.
+
+1 Of course, this must not be read as saying that S was already
+in |s1⟩ (i.e., “went through slit 1”) prior to the measurement
+of E. Nor does it mean that the result of a subsequent path
+measurement on S is necessarily determined, by virtue of the
+measurement on E, prior to this S-measurement’s actually being
+carried out. After all, as Peres has cautioned us [27], unperformed
+measurements have no outcomes. So while the picture
+of E as “encoding which-path information” about S is certainly
+suggestive and helpful, it should be used with an understanding
+of its conceptual pitfalls.
+
+As hinted above, the description developed so far describes
+the essence of the decoherence process if we identify
+the particle S more generally with an arbitrary quantum
+system and the second system E with the environment
+of S. Then an idealized account of the decoherence
+interaction has the form
+
+\[ \Big( \sum_i c_i |s_i\rangle \Big) |E_0\rangle \;\longrightarrow\; \sum_i c_i |s_i\rangle |E_i(t)\rangle . \tag{4} \]
+
+We have here introduced a time parameter t, where t = 0
+corresponds to the onset of the environmental interaction,
+with |Ei(t = 0)⟩ ≡ |E0⟩ for all i; at t < 0 the system
+and environment are assumed to be uncorrelated (an assumption
+common to most decoherence models).
+A single environmental particle interacting with the
+system will typically only insufficiently resolve the com-
+ponents |si⟩ in the system’s superposition state. But be-
+cause of the large number of such particles (and, hence,
+degrees of freedom), the overlap between their different
+joint states |Ei(t)⟩ will rapidly decrease as a result of
+the build-up of many interaction events. Specifically, in
+many decoherence models an exponential decay of over-
+lap is found [3, 5, 9, 17, 20, 28–31],
+
+\[ \langle E_i(t)|E_j(t)\rangle \propto e^{-t/\tau_d} \quad \text{for } i \neq j. \tag{5} \]
+
+Here τd is the characteristic decoherence timescale, which
+can be evaluated for particular choices of the parameters
+in each model (see Sec. IV).
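+
+A minimal sketch of how the exponential form of Eq. (5) can arise,
+under assumptions that are not part of the text above: the
+environment consists of independent particles that scatter off the
+system at a constant rate Γ, and each scattering event contributes a
+single-particle overlap of magnitude r < 1 between the states
+correlated with |si⟩ and |sj⟩. The symbols Γ, r, and n(t) are
+hypothetical. Because the overlap of product states is the product of
+the single-particle overlaps,
+
+\begin{align*}
+|\langle E_i(t)|E_j(t)\rangle| &= r^{\,n(t)}, \qquad n(t) = \Gamma t, \\
+&= e^{-\Gamma t \ln(1/r)} = e^{-t/\tau_d}, \qquad \tau_d = \frac{1}{\Gamma \ln(1/r)} .
+\end{align*}
+
+For weakly resolving collisions, r = 1 − ε with ε ≪ 1, this gives
+τd ≈ 1/(εΓ): even a small amount of which-state information per
+particle yields rapid decoherence when the scattering rate is high.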
+
+B. Environmental monitoring and information transfer
+
+We will now motivate, in a different and more rigorous
+way, the picture of decoherence as a process of environ-
+mental monitoring.
+First, we express the influence of
+the environment in a completely general way.
+We assume that at t = 0 there are no correlations between
+system S and environment E, ρSE(0) = ρS(0) ⊗ ρE(0).
+We write ρE(0) in its diagonal decomposition,
+ρE(0) = $\sum_i p_i |E_i\rangle\langle E_i|$, where $\sum_i p_i = 1$
+and the states |Ei⟩ form
+an orthonormal basis of the Hilbert space of E. If
+H denotes the Hamiltonian (here assumed to be time-independent)
+of SE and U(t) = $e^{-iHt}$ represents the unitary
+time evolution operator, then the density matrix of
+S evolves according to
+
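+\begin{align}
+\rho_S(t) &= \mathrm{Tr}_E \Big\{ U(t) \Big[ \rho_S(0) \otimes \Big( \sum_i p_i |E_i\rangle\langle E_i| \Big) \Big] U^\dagger(t) \Big\} \notag \\
+&= \sum_{ij} p_i\, \langle E_j| U(t) |E_i\rangle\, \rho_S(0)\, \langle E_i| U^\dagger(t) |E_j\rangle . \tag{6}
+\end{align}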
+
+Introducing the Kraus operators [32] defined by Eij ≡
+√pi ⟨Ej| U(t) |Ei⟩, we obtain
+
+\[ \rho_S(t) = \sum_{ij} E_{ij}\, \rho_S(0)\, E_{ij}^\dagger . \tag{7} \]
+
+It is customary to combine the two indices i and j into a
+single index and write the Kraus operators as
+
+\[ W_k \equiv \sqrt{p_{i_k}}\, \langle E_{j_k}| U(t) |E_{i_k}\rangle , \tag{8} \]
+
+such that
+
+\[ \rho_S(t) = \sum_k W_k\, \rho_S(0)\, W_k^\dagger . \tag{9} \]
+
+This Kraus-operator formalism (also called operator-sum
+formalism) represents the effect of the environment as
+a sequence of (in general nonunitary) transformations of
+ρS generated by the operators Wk. The Kraus operators
+exhaustively encode information about the initial state
+of the environment and about the dynamics of the joint
+SE system. Because the evolution of SE is unitary, the
+Kraus operators satisfy the completeness constraint
+
+\[ \sum_k W_k^\dagger W_k = I_S, \tag{10} \]
+
+where IS is the identity operator in the Hilbert space of
+S. Equations (9) and (10) together imply that the Wk are
+the generators of a completely positive map Φ : ρS(0) ↦
+ρS, also known as a quantum operation [32] or quantum
+channel (see footnote 2).
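+
+That the operators of Eq. (8) preserve the trace of ρS can be
+checked directly. The following short sketch uses only the
+orthonormality and completeness of the basis {|Ei⟩}, the
+normalization of the weights pi, and the unitarity of U(t). The
+adjoint is written on the left, which is the ordering for which the
+sum collapses to the identity for every unitary U(t):
+
+\begin{align*}
+\sum_k W_k^\dagger W_k &= \sum_{ij} p_i\, \langle E_i| U^\dagger(t) |E_j\rangle \langle E_j| U(t) |E_i\rangle \\
+&= \sum_i p_i\, \langle E_i| U^\dagger(t)\, U(t) |E_i\rangle \\
+&= \sum_i p_i\, I_S = I_S .
+\end{align*}
+
+Taking the trace of Eq. (9) and using the cyclic property of the
+trace then gives Tr ρS(t) = Tr ρS(0).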
+
+We will now use Eq. (9) to formally motivate the view
+that decoherence corresponds to an indirect measurement
+of the system by the environment, and that it thus re-
+sults from a transfer of information from the system to
+the environment (see also Ref. [19]).
+In such an indi-
+rect measurement, we let the system S interact with a
+probe—here the environment E—followed by a projec-
+tive measurement on E. The probe is treated as a quan-
+tum system. This procedure aims to yield information
+about S without performing a projective (and thus de-
+structive) direct measurement on S. To model such an
+indirect measurement, consider again an initial compos-
+ite density operator ρSE(0) = ρS(0) ⊗ ρE(0) evolving
+under the action of U(t) = $e^{-iHt}$, where H is the total
+Hamiltonian. Consider a projective measurement M
+on E with eigenvalues α and corresponding projectors
+Pα ≡ |α⟩⟨α|, with $P_\alpha^2 = P_\alpha^\dagger = P_\alpha$. The probability of
+obtaining outcome α in this measurement when S is described
+by the density operator ρS(t) is
+
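+\begin{align}
+\mathrm{Prob}(\alpha \mid \rho_S(t)) &= \mathrm{Tr}_E \big( P_\alpha\, \rho_E(t) \big) \notag \\
+&= \mathrm{Tr}_E \Big\{ P_\alpha\, \mathrm{Tr}_S \Big\{ U(t) \big( \rho_S(0) \otimes \rho_E(0) \big) U^\dagger(t) \Big\} \Big\} . \tag{11}
+\end{align}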
+
+2 The Kraus formalism is of limited use in calculating decoherence
+dynamics for concrete situations of physical interest. This is so
+because finding the Kraus operators corresponds to diagonalizing
+the full Hamiltonian of SE, usually a prohibitively difficult
+task. Moreover, the Kraus operators contain all contributions
+to the evolution of the reduced density matrix, while for
+considerations of decoherence we are typically interested only in
+the nonunitary terms, and certain contributions, such as back-action
+effects from the system on the environment, can often be
+neglected. (This is where master equations come into play; see
+Sec. III.)
+
+The density matrix of S conditioned on the particular
+outcome α is
+
+\begin{align}
+\rho_S^{(\alpha)}(t) &= \frac{\mathrm{Tr}_E \{ [I \otimes P_\alpha]\, \rho_{SE}(t)\, [I \otimes P_\alpha] \}}{\mathrm{Prob}(\alpha \mid \rho_S(t))} \notag \\
+&= \frac{\mathrm{Tr}_E \{ [I \otimes P_\alpha]\, U(t)\, [\rho_S(0) \otimes \rho_E(0)]\, U^\dagger(t)\, [I \otimes P_\alpha] \}}{\mathrm{Prob}(\alpha \mid \rho_S(t))} . \tag{12}
+\end{align}
+
+Inserting the diagonal decomposition
+ρE(0) = $\sum_k p_k |E_k\rangle\langle E_k|$
+and carrying out the trace gives [19]
+
+\[ \rho_S^{(\alpha)}(t) = \frac{\sum_k M_{\alpha,k}\, \rho_S(0)\, M_{\alpha,k}^\dagger}{\mathrm{Prob}(\alpha \mid \rho_S(t))} . \tag{13} \]
+
+Here we have introduced the measurement operators
+
+\[ M_{\alpha,k} \equiv \sqrt{p_k}\, \langle \alpha| U(t) |E_k\rangle , \tag{14} \]
+
+which obey the completeness constraint
+$\sum_{\alpha,k} M_{\alpha,k}^\dagger M_{\alpha,k} = I_S$.
+Equation (12) describes the
+effect of the indirect measurement on the state of the
+system. If, however, we do not actually inquire about
+the result of this measurement, we must assign to the
+system a density operator that is a sum over all the
+possible conditional states $\rho_S^{(\alpha)}(t)$ weighted by their
+probabilities Prob (α | ρS(t)),
+
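+\begin{align}
+\rho_S(t) &= \sum_\alpha \mathrm{Prob}(\alpha \mid \rho_S(t))\, \rho_S^{(\alpha)}(t) \notag \\
+&= \sum_{\alpha,k} M_{\alpha,k}\, \rho_S(0)\, M_{\alpha,k}^\dagger . \tag{15}
+\end{align}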
+
+Note that this expression is formally analogous to the
+Kraus-operator expression of Eq. (9), which described
+the effect of a general environmental interaction on the
+state of the system. Recall, further, that the situation we
+encounter in decoherence is precisely one in which we do
+not actually read out the environment—or, in the present
+picture, in which we do not inquire about the result of the
+indirect measurement.
+This suggests that decoherence
+can indeed be understood as an indirect measurement—
+a monitoring—of the system by its environment.
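+
+The operator-sum picture can be made concrete with a two-level toy
+model. The following sketch assumes a system qubit with basis
+{|0⟩, |1⟩}, a single environment qubit prepared in the pure state
+|E0⟩ = |0⟩, and a controlled rotation by an angle θ as the joint
+unitary; none of these choices come from the text above. The
+subscript E marks a partial matrix element taken over the
+environment only, and ρab denotes the matrix elements of ρS(0).
+
+\begin{align*}
+U &= |0\rangle\langle 0| \otimes I + |1\rangle\langle 1| \otimes R(\theta), \qquad R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \\
+W_0 &= \langle 0| U |0\rangle_E = |0\rangle\langle 0| + \cos\theta\, |1\rangle\langle 1| , \\
+W_1 &= \langle 1| U |0\rangle_E = \sin\theta\, |1\rangle\langle 1| , \\
+\rho_S(t) &= W_0\, \rho_S(0)\, W_0^\dagger + W_1\, \rho_S(0)\, W_1^\dagger = \begin{pmatrix} \rho_{00} & \cos\theta\, \rho_{01} \\ \cos\theta\, \rho_{10} & \rho_{11} \end{pmatrix} .
+\end{align*}
+
+The populations are untouched, while the coherences are multiplied
+by cos θ, which is the overlap of the two environment states |0⟩ and
+R(θ)|0⟩ correlated with the two system states. For θ = π/2 the
+environment qubit records the system state perfectly and the
+coherences vanish.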
+
+C. Environment-induced superselection and decoherence-free subspaces
+
+Decoherence can occur in any basis; which observable
+is monitored by the environment depends on the spe-
+cific form of the system–environment interaction. The
+preferred states (or preferred observables) of the system
+emerge dynamically as those states that are the most ro-
+bust to the interaction with the environment, in the sense
+that they become least entangled with the environment;
+thus, they are the states most immune to decoherence.
+This is the stability criterion for the selection of pre-
+ferred states, resulting in the dynamical selection of pre-
+ferred states (“environment-induced superselection”) [1–
+3, 16]. These environment-superselected preferred states
+(or observables) are sometimes also called pointer states
+(or pointer observables) [2], since they correspond to the
+physical quantities that are most easily “read off” at the
+level of the system, akin to the pointer on the dial of a
+measurement apparatus.
+
+1. Pointer states and the commutativity criterion
+
+To find the preferred states,
+we decompose the
+total system–environment Hamiltonian into the self-
+Hamiltonians of the system S and environment E rep-
+resenting the intrinsic dynamics, and a part Hint repre-
+senting the interaction between system and environment,
+
+\[ H = H_S + H_E + H_{\mathrm{int}} . \tag{16} \]
+
+In many cases of practical interest, Hint dominates
+the evolution of the system, H ≈ Hint (the quantum-
+measurement limit of decoherence). We look for system
+states |si⟩ such that the composite system–environment
+state, when starting from a product state |si⟩ |E0⟩ at
+t = 0, remains in the product form |si⟩ |Ei(t)⟩ for all
+t > 0 under the action of Hint (we shall assume here
+that Hint is not explicitly time-dependent). That is, we
+demand that (setting ℏ ≡ 1 from here on)
+
+\[ e^{-iH_{\mathrm{int}}t}\, |s_i\rangle |E_0\rangle = \lambda_i\, |s_i\rangle\, e^{-iH_{\mathrm{int}}t} |E_0\rangle \equiv |s_i\rangle |E_i(t)\rangle . \tag{17} \]
+Thus, the pointer states |si⟩ are the eigenstates of the
+part of the interaction Hamiltonian Hint pertaining to the
+Hilbert space of the system, with eigenvalues λi. These
+states will be stationary under Hint [2]. It follows that the
+pointer observable defined by $O_S = \sum_i o_i |s_i\rangle\langle s_i|$
+commutes with Hint,
+
+\[ \left[ O_S, H_{\mathrm{int}} \right] = 0. \tag{18} \]
+
+This commutativity criterion [2, 3] is particularly easy to
+apply when Hint takes the tensor-product form Hint =
+S ⊗ E, as is frequently the case. Then the environment-
+superselected observables will be those observables that
+commute with S.
+If S is Hermitian, it represents the
+physical quantity monitored by the environment. In general,
+any Hint can be written as a diagonal decomposition
+of (unitary but not necessarily Hermitian) system and
+environment operators Sα and Eα,
+$H_{\mathrm{int}} = \sum_\alpha S_\alpha \otimes E_\alpha$.
+If the Sα are Hermitian, such a Hamiltonian represents
+the simultaneous environmental monitoring of different
+observables Sα of the system. A sufficient condition for
+{|si⟩} to form a set of pointer states of the system is then
+given by the requirement that the |si⟩ be simultaneous
+eigenstates of the operators Sα,
+
+\[ S_\alpha |s_i\rangle = \lambda_i^{(\alpha)} |s_i\rangle \quad \text{for all } \alpha \text{ and } i. \tag{19} \]
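+
+A short worked case of the commutativity criterion, assuming a
+two-level system coupled through Hint = σz ⊗ B, where B is an
+arbitrary Hermitian environment operator (the symbol B is introduced
+here to avoid a clash with the environment label E) and σz|0⟩ = +|0⟩,
+σz|1⟩ = −|1⟩:
+
+\begin{align*}
+e^{-iH_{\mathrm{int}}t}\, |0\rangle |E_0\rangle &= |0\rangle\, e^{-iBt} |E_0\rangle \equiv |0\rangle |E_+(t)\rangle, \\
+e^{-iH_{\mathrm{int}}t}\, |1\rangle |E_0\rangle &= |1\rangle\, e^{+iBt} |E_0\rangle \equiv |1\rangle |E_-(t)\rangle, \\
+\langle E_-(t)|E_+(t)\rangle &= \langle E_0|\, e^{-2iBt}\, |E_0\rangle .
+\end{align*}
+
+The eigenstates of σz remain in product form and are therefore
+pointer states, whereas a superposition such as (|0⟩ + |1⟩)/√2
+becomes entangled with the environment and loses coherence as the
+overlap in the last line decays.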
+
+Interaction Hamiltonians frequently describe the scat-
+tering of surrounding particles (photons, air molecules,
+etc.), leading to collisional decoherence (see Sec. IV A).
+Since the force laws describing such processes typically
+depend on some power of distance, the interaction Hamil-
+tonian will then commute with the position operator.
+Thus, the pointer states will be approximate eigenstates
+of position (i.e., narrow position-space wave packets).
+This explains why superpositions of mesoscopically and
+macroscopically distinct positions are prohibitively diffi-
+cult to observe [2, 3, 17, 31, 33–39]. Collisional decoher-
+ence can also be dominant in microscopic systems when
+these systems occur in distinct spatial configurations that
+couple strongly to the surrounding medium. For exam-
+ple, chiral molecules such as sugar are always observed to
+be in chirality eigenstates (left-handed or right-handed),
+which are superpositions of different energy eigenstates.
+Any attempt to prepare such molecules in energy eigen-
+states leads to immediate decoherence into the environ-
+mentally stable chirality eigenstates [40, 41].
+The quantum limit of decoherence [42] arises when the
+modes of the environment are slow in comparison with
+the evolution of the system—that is, when the highest
+frequencies (i.e., energies) available in the environment
+are smaller than the separation between the energy eigen-
+states of the system. Then the environment will be able
+to monitor only quantities that are constants of motion.
+In the case of nondegeneracy, this quantity will be the en-
+ergy of the system, leading to the environment-induced
+superselection of energy eigenstates for the system [42] (see footnote 3).
+
+3 Textbooks on quantum mechanics usually attribute a special role
+to such energy eigenstates (for closed systems) since they are
+stationary under the action of the Hamiltonian. In this
+closed-system picture, however, arbitrary superpositions of energy
+eigenstates should nonetheless be perfectly legitimate. Thus, it
+is important to realize that the environment-induced superselection
+of energy eigenstates is not equivalent to a situation in which
+the presence of the environment could be neglected altogether;
+instead, the environment plays the crucial role of continuously
+monitoring the energy of the system, leading to a local suppression
+of coherence between energy eigenstates.
+
+In many realistic situations, the commutativity criterion,
+Eq. (18), can only be fulfilled approximately [43, 44].
+In addition, the self-Hamiltonian of the system and the
+interaction Hamiltonian may contribute in roughly equal
+strengths (e.g., in models for quantum Brownian motion
+[4, 45]; see Sec. IV B), rendering neither the quantum-measurement
+limit of negligible intrinsic dynamics nor
+the quantum limit of decoherence of a slow environment
+appropriate. In such cases, more general methods
+for determining the preferred states are required. The
+predictability-sieve strategy [43, 44, 46] computes the time
+dependence of the amount of decoherence introduced into
+the system for a large set of initial states of the system
+evolving under the total system–environment Hamiltonian.
+Typically, this decoherence is measured using either
+the purity $\mathrm{Tr}(\rho_S^2)$ or the von Neumann entropy
+$S(\rho_S) = -\mathrm{Tr}(\rho_S \log_2 \rho_S)$ of the reduced density matrix
+ρS. The states most immune to decoherence will be those
+which lead to the smallest decrease in purity or the small-
+est increase in von Neumann entropy.
+Application of
+this method leads to a ranking of the possible preferred
+states with respect to their robustness to the interac-
+tion with the environment. For particular models it has
+been explicitly shown that the states picked out by the
+predictability sieve are robust to the particular choice of
+the measure of decoherence. For example, in the model
+for quantum Brownian motion, different measures lead
+to the same minimum-uncertainty wave packets in phase
+space [5, 8, 16, 44, 47, 48].
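+
+As an illustration of the predictability sieve, consider a qubit
+subject to pure dephasing. The sketch assumes that the reduced state
+has the form of Eq. (2) with orthonormal |s1⟩ and |s2⟩, and writes
+r(t) for the overlap ⟨E2(t)|E1(t)⟩, a symbol introduced only here:
+
+\begin{align*}
+\rho_S(t) &= \begin{pmatrix} |\alpha|^2 & \alpha\beta^*\, r(t) \\ \alpha^*\beta\, r^*(t) & |\beta|^2 \end{pmatrix}, \\
+\mathrm{Tr}\big( \rho_S^2 \big) &= |\alpha|^4 + |\beta|^4 + 2\, |\alpha|^2 |\beta|^2\, |r(t)|^2 .
+\end{align*}
+
+For |α| = 1 or |β| = 1 the purity stays at 1 for all times, whereas
+for |α|² = |β|² = 1/2 it falls to (1 + |r(t)|²)/2 and reaches 1/2
+once r → 0. In this toy case the sieve ranks |s1⟩ and |s2⟩ at the
+top, in agreement with the commutativity criterion.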
+
+2. Decoherence-free subspaces
+
+The pointer-state condition of Eq. (19) can be
+strengthened to the concept of pointer subspaces [3] or
+decoherence-free subspaces (DFS) [49–58]. These are sub-
+spaces of the Hilbert space of the system in which every
+state in the subspace is immune to decoherence; this is
+a nontrivial requirement, since in general superpositions
+of pointer states will not be pointer states themselves.
+One important condition for this to happen is that the
+preferred states |si⟩ defined by Eq. (19) form an orthonormal
+basis of the subspace, and that the eigenvalues $\lambda_i^{(\alpha)}$
+in Eq. (19) are independent of i, i.e., that all |si⟩ are
+simultaneous degenerate eigenstates of each Sα,
+
+\[ S_\alpha |s_i\rangle = \lambda^{(\alpha)} |s_i\rangle \quad \text{for all } \alpha \text{ and } i. \tag{20} \]
+
+This condition states that the action of a given Sα must
+be the same for all basis states |si⟩ of the DFS, and thus
+the existence of a DFS corresponds to a symmetry in the
+structure of the system–environment interaction, i.e., to
+a dynamical symmetry. A necessary condition for such a
+symmetry to obtain is the absence of terms in Hint that
+act jointly on system and environment in a nontrivial
+manner.
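+An arbitrary state |ψ⟩ in the DFS can then be written
+as $|\psi\rangle = \sum_i c_i |s_i\rangle$ and will evolve according to
+
+\[ e^{-iH_{\mathrm{int}}t}\, |\psi\rangle |E_0\rangle = |\psi\rangle\, e^{-i \left( \sum_\alpha \lambda^{(\alpha)} E_\alpha \right) t} |E_0\rangle \equiv |\psi\rangle |E_\psi(t)\rangle . \tag{21} \]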
+
+Thus, the state |ψ⟩ does not become entangled with the
+environment and is therefore immune to decoherence.
+When the self-Hamiltonian HS of the system cannot be
+neglected, one needs to additionally ensure that none of
+the basis states |si⟩ of the DFS will drift out of the sub-
+space under the evolution generated by HS. Otherwise
+an initially decoherence-free state would again become
+prone to decoherence. The concept of DFS can be gener-
+alized to the formalism of noiseless subsystems (or noise-
+less quantum codes) [58–60].
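+
+A standard example, sketched here under the assumption of two qubits
+that couple to the environment only through a collective dephasing
+operator (this specific model is not discussed in the text above; B
+denotes an arbitrary Hermitian environment operator and
+σz|0⟩ = +|0⟩, σz|1⟩ = −|1⟩):
+
+\begin{align*}
+H_{\mathrm{int}} &= S \otimes B, \qquad S = \sigma_z^{(1)} + \sigma_z^{(2)}, \\
+S\, |01\rangle &= 0, \qquad S\, |10\rangle = 0, \qquad S\, |00\rangle = 2\, |00\rangle, \qquad S\, |11\rangle = -2\, |11\rangle, \\
+e^{-iH_{\mathrm{int}}t}\, \big( a|01\rangle + b|10\rangle \big) |E_0\rangle &= \big( a|01\rangle + b|10\rangle \big) |E_0\rangle .
+\end{align*}
+
+The states |01⟩ and |10⟩ share the eigenvalue λ = 0, so Eq. (20)
+holds and their span is a decoherence-free subspace, while
+superpositions involving |00⟩ or |11⟩ decohere.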
+
+D. Proliferation of information and quantum Darwinism
+
+Quantum Darwinism [61–69] builds on the ideas of de-
+coherence and environmental encoding of information, by
+broadening the role of the environment to that of a com-
+munication and amplification channel. Interactions be-
+tween the system and its environment lead to the redun-
+dant storage of selected information about the system in
+many fragments of the environment. By measuring some
+of these fragments, observers can indirectly obtain infor-
+mation about the system without appreciably disturbing
+the system itself. Indeed, this represents how we typi-
+cally observe objects. For example, we see an object not
+by directly interacting with it, but by intercepting scat-
+tered photons that encode information about the object’s
+spatial structure [67, 68].
+
+In this sense, quantum Darwinism provides a dynami-
+cal explanation for the robustness of states of (especially)
+macroscopic objects to observation. It was found that
+the observable of the system that can be imprinted most
+completely and redundantly in many distinct fragments
+of the environment coincides with the pointer observable
+selected by the system–environment interaction [62–65];
+conversely, most other states do not seem to be redun-
+dantly storable. Indeed, it has been shown that the re-
+dundant proliferation of information regarding pointer
+states is as inevitable as decoherence itself [70]. Quantum
+Darwinism has been studied in several concrete models,
+for example, in spin environments [64], quantum Brow-
+nian motion [71], and photon and photon-like environ-
+ments [67, 68, 70]. The efficiency of the amplification pro-
+cess described by quantum Darwinism can be expressed
+in terms of the quantum Chernoff information [70].
+
+The structure and amount of information that the
+environment encodes about the system can be quanti-
+fied using the measure of (classical [62, 63] or quantum
+[5, 64, 65]) mutual information. Classical mutual infor-
+mation is based on the choice of particular observables of
+the system S and the environment E and quantifies how
+well one can predict the outcome of a measurement of a
+given observable of S by measuring some observable on
+a fraction of E [62, 63]. Quantum mutual information is
+defined as S(ρS)+S(ρE)−S(ρ), where ρS, ρE, and ρ are
+the density matrices of S, E, and the composite system
+SE, respectively, and S(ρ) = −Tr (ρ log2 ρ) is the von
+Neumann entropy associated with ρ. Quantum mutual
+information quantifies the degree of quantum correlations
+between S and E. Classical and quantum mutual infor-
+mation give similar results [5, 62–65] because the differ-
+ence between the two measures, known as the quantum
+discord [72], disappears when decoherence is sufficiently
+effective to select a well-defined pointer basis [72].
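+
+A minimal worked example of redundant encoding, assuming an
+idealized state in which a system qubit is perfectly correlated with
+N environment qubits (a GHZ-type state chosen for illustration;
+realistic environments only approximate it). Here F denotes a
+fragment of m environment qubits, a label introduced for this sketch,
+and entropies are in bits:
+
+\begin{align*}
+|\Psi\rangle &= \tfrac{1}{\sqrt{2}} \big( |0\rangle_S |0\rangle^{\otimes N} + |1\rangle_S |1\rangle^{\otimes N} \big), \\
+S(\rho_S) &= 1, \qquad S(\rho_F) = 1, \qquad S(\rho_{SF}) = 1 \qquad (1 \le m < N), \\
+I(S\!:\!F) &= S(\rho_S) + S(\rho_F) - S(\rho_{SF}) = 1 \qquad (1 \le m < N), \\
+I(S\!:\!E) &= 1 + 1 - 0 = 2 \qquad (m = N).
+\end{align*}
+
+Every fragment, down to a single environment qubit, already carries
+the full classical bit about the pointer observable, so the record
+is stored N times over; only the complete environment reveals the
+additional, purely quantum bit.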
+
+E. Decoherence versus dissipation and noise
+
+While the presence of dissipation implies the pres-
+ence of decoherence, the converse is not necessarily true.
+When dissipation and decoherence are both present, they
+typically occur on vastly different timescales; the deco-
+herence timescale is typically many orders of magnitude
+shorter than the relaxation timescale. A rule-of-thumb
+estimate for the ratio of the relaxation timescale τr to the
+decoherence timescale τd for a massive object described
+by a superposition of two different positions a distance
+∆x apart is [18]
+
+\[ \frac{\tau_r}{\tau_d} \sim \left( \frac{\Delta x}{\lambda_{\mathrm{dB}}} \right)^2 , \tag{22} \]
+
+where $\lambda_{\mathrm{dB}} = (2mkT)^{-1/2}$ is the thermal de Broglie wavelength
+of the object. For an object of mass m = 1 g at
+room temperature in a coherent superposition of two locations
+a distance ∆x = 1 cm apart, τr/τd is on the order
+of $10^{40}$. Thus, for macroscopic objects the dissipative influence
+of the environment is usually completely negligible
+on the timescale relevant to the decoherence induced
+by this environment.
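+
+The order of magnitude quoted above can be checked by restoring ħ in
+the thermal de Broglie wavelength. The sketch below assumes
+T = 300 K for room temperature (so that kT ≈ 4.14 × 10⁻²¹ J) and
+uses a rounded SI value of ħ; both are choices made here rather than
+values taken from the text:
+
+\begin{align*}
+\lambda_{\mathrm{dB}} &= \frac{\hbar}{\sqrt{2mkT}} \approx \frac{1.05 \times 10^{-34}\ \mathrm{J\,s}}{\sqrt{2 \times (10^{-3}\ \mathrm{kg}) \times (4.14 \times 10^{-21}\ \mathrm{J})}} \approx 3.7 \times 10^{-23}\ \mathrm{m}, \\
+\frac{\tau_r}{\tau_d} &\sim \left( \frac{10^{-2}\ \mathrm{m}}{3.7 \times 10^{-23}\ \mathrm{m}} \right)^2 \approx 7 \times 10^{40} .
+\end{align*}
+
+This is consistent with the quoted order of magnitude of 10^40.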
+Decoherence is a consequence of environmental entan-
+glement. In the literature on quantum computing, how-
+ever, the term “decoherence” is often used to refer to
+any process that affects the qubits, including perturba-
+tions due to classical fluctuations and imperfections. Ex-
+amples for sources of such classical noise in the context
+of quantum computing are the fluctuations in the inten-
+sity [73] and duration [74] of the laser beam incident on
+qubits in an ion trap, inhomogeneities in the magnetic
+fields in NMR quantum computing [75], and bias fluctu-
+ations in superconducting qubits [76]. The distinction be-
+tween classical noise and quantum decoherence has been
+further blurred in quantum error correction, since the
+error-correcting schemes are insensitive to the physical
+origin of the qubit errors (see Sec. V A).
+Phenomenologically and formally the influence of clas-
+sical noise processes may be described in a manner simi-
+lar to the effect of environmental entanglement, namely,
+in terms of a decay of the off-diagonal elements (in-
+terference terms) in the local density matrix (in the
+environment-superselected basis).
+But in the case of
+noise, the decay of the off-diagonal elements occurs be-
+cause the system’s density matrix is identified with an
+average over a physical ensemble of systems (or, put dif-
+ferently, over the different instances of particular noise
+processes), while in the case of decoherence the decay is
+due to an entanglement-induced delocalization of phase
+coherence for individual systems. The fundamental dif-
+ference between these physical processes is masked by the
+density-matrix description. Indeed, one can always find
+an experimental procedure that would, at least in princi-
+ple, distinguish between the different physical processes
+underlying formally similar density-matrix descriptions.
+In contrast with decoherence, noise does not create
+system–environment entanglement and can in principle
+always be undone using only local operations (witness,
+for example, the reversal of ensemble dephasing in NMR
+experiments using the spin-echo technique). In any indi-
+vidual realization of the noise process the dynamics of the
+system are completely unitary, and thus no coherence is
+lost from the system. By contrast, if the system becomes
+entangled with environmental degrees of freedom, at the
+very least we would need to perform a pair of measure-
+ments on the environment before and after the interac-
+tion with the system in order to gather enough informa-
+tion to reverse the effect of decoherence by application
+of an appropriate countertransformation. Moreover, as
+also seen experimentally [77], these measurements would
+not always constitute a sufficient procedure for “undo-
+ing” decoherence (see also Sec. IV.C of Ref. [5]).
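+The ensemble picture of classical noise described above can be illustrated
+with a small Monte Carlo sketch. It assumes a single qubit prepared in
+(|0⟩ + |1⟩)/√2 whose relative phase receives a Gaussian random kick in each
+run; the function name, the noise model, and the parameter values are all
+hypothetical choices made for this example.
+
```python
import cmath
import math
import random


def averaged_coherence(sigma, n_runs, seed=0):
    """Ensemble-average the off-diagonal element rho_01 under random phase kicks.

    Each run starts in (|0> + |1>)/sqrt(2) and acquires a classical random
    phase phi drawn from N(0, sigma^2); this is a unitary operation per run.
    """
    rng = random.Random(seed)
    total = 0j
    for _ in range(n_runs):
        phi = rng.gauss(0.0, sigma)
        # Off-diagonal element of a single run: its magnitude stays exactly 1/2,
        # so no coherence is lost in any individual realization.
        rho_01 = 0.5 * cmath.exp(-1j * phi)
        assert abs(abs(rho_01) - 0.5) < 1e-12
        total += rho_01
    return total / n_runs


sigma = 1.5
mean_rho_01 = averaged_coherence(sigma, n_runs=200_000)
expected = 0.5 * math.exp(-sigma ** 2 / 2)  # Gaussian average of exp(-i phi)
print(abs(mean_rho_01), expected)
```
+
+Only the ensemble average shows a decay of the interference term; every
+single run remains in a pure state, which is why such dephasing can in
+principle be reversed by local operations.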
+The loss of phase coherence due to environmental
+entanglement is sometimes simulated (with the above
+caveats) by classical fluctuations perturbing the system,
+i.e., by the addition of certain time-dependent terms to
+the self-Hamiltonian of the system. This strategy was
+implemented, for example, in theoretical [73, 78] and ex-
+perimental [77, 79] studies of the influence of fluctuating
+parameters in ion-trap quantum computers.
+
+III. MASTER EQUATIONS
+
+In the usual approach to modeling decoherence, the
+reduced density matrix ρS(t) is obtained from
+
+\[ \rho_S(t) = \mathrm{Tr}_E\, \rho_{SE}(t) \equiv \mathrm{Tr}_E \left\{ U(t)\rho_{SE}(0)U^\dagger(t) \right\}, \tag{23} \]
+
+where U(t) is the time-evolution operator for the compos-
+ite system SE. The task of calculating ρSE(t) is often
+computationally cumbersome or even intractable. It is
+also unnecessarily detailed, because we are usually only
+interested in the dynamics of the system. A master equa-
+tion allows us to calculate ρS(t) directly from an expres-
+sion of the form
+
+\[ \rho_S(t) = \mathcal{V}(t)\rho_S(0), \tag{24} \]
+
+where the superoperator $\mathcal{V}(t)$ is the dynamical map generating
+the evolution of ρS(t). If the master equation
+is exact, then we merely have the identity
+$\mathcal{V}(t)\rho_S(0) \equiv \mathrm{Tr}_E\{U(t)\rho_{SE}(0)U^\dagger(t)\}$
+and no computational advantage is gained. Therefore,
+master equations are typically
+based on simplifying approximations.
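+To make the partial trace in Eq. (23) concrete, here is a minimal sketch for
+a system qubit and a one-qubit "environment" at a single instant (no time
+evolution is performed). The helper names and the two example states are
+assumptions chosen for illustration.
+
```python
import math


def partial_trace_env(rho_se):
    """Trace out the second (environment) qubit of a 4x4 density matrix.

    Basis ordering is |s e> with index 2*s + e.
    """
    rho_s = [[0j, 0j], [0j, 0j]]
    for s in range(2):
        for s_prime in range(2):
            for e in range(2):
                rho_s[s][s_prime] += rho_se[2 * s + e][2 * s_prime + e]
    return rho_s


def outer(psi):
    """Density matrix |psi><psi| of a state vector given as a list."""
    return [[a * complex(b).conjugate() for b in psi] for a in psi]


amp = 1 / math.sqrt(2)
product_state = outer([amp, 0, amp, 0])     # (|0> + |1>)/sqrt(2) times |0>
entangled_state = outer([amp, 0, 0, amp])   # (|00> + |11>)/sqrt(2)

print(partial_trace_env(product_state)[0][1])    # interference term is retained
print(partial_trace_env(entangled_state)[0][1])  # interference term is absent
```
+
+The reduced state of the product state keeps its off-diagonal element,
+whereas entanglement with the environment qubit removes it from ρS.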
+In modeling decoherence, we focus on master equations
+that are first-order time-local differential equations of the
+form
+
+\[ \frac{d}{dt}\rho_S(t) = \mathcal{L}[\rho_S(t)] \equiv -i\,[H'_S, \rho_S(t)] + \mathcal{D}[\rho_S(t)]. \tag{25} \]
+
+This equation is local in time in the sense that the change
+of ρS at time t depends only on ρS evaluated at t. The
+superoperator $\mathcal{L}$ acting on ρS(t) typically depends on the
+initial state of the environment and the different terms
+in the Hamiltonian. We have decomposed $\mathcal{L}$ into two
+parts to distinguish their physical interpretation. The
+first term, $-i\,[H'_S, \rho_S(t)]$, is unitary and given by the
+Liouville–von Neumann commutator with the “renormalized”
+Hamiltonian $H'_S$ of the system. (Because the environment
+typically leads to a renormalization of the energy
+levels of the system, this Hamiltonian does not in general
+coincide with the unperturbed free Hamiltonian $H_S$
+of S that would generate the evolution of S in the absence of
+the environment.) The second, nonunitary term $\mathcal{D}[\rho_S(t)]$
+represents decoherence (and often also dissipation) due to
+the environment.
+
+A. Born–Markov master equations
+
+Born–Markov master equations allow for many deco-
+herence problems to be treated in a mathematically sim-
+ple, and often closed, form. They are based on the fol-
+lowing two approximations:
+
+1. The Born approximation. The system–environment
+coupling is sufficiently weak and
+the environment is reasonably large such that
+changes of the density operator ρE of the environment
+are negligible and the system–environment
+density operator remains approximately
+factorized at all times, ρSE(t) ≈ ρS(t) ⊗ ρE.
+
+2. The Markov approximation. Memory effects of the
+environment are negligible, in the sense that any
+self-correlations within the environment created by
+the coupling to the system decay rapidly compared
+to the characteristic relaxation timescale of the
+open quantum system.
+
+Comparisons between the predictions of models based
+on Born–Markov master equations and experimental
+data indicate that the Born and Markov assumptions are
+reasonable in many physical situations (but see Sec. III C
+below for exceptions and non-Markovian models). If
+these assumptions hold and we write the interaction
+Hamiltonian as $H_{\mathrm{int}} = \sum_\alpha S_\alpha \otimes E_\alpha$, the
+Born–Markov master equation reads [9, 20]
+
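+\[ \frac{d}{dt}\rho_S(t) = -i\,[H_S, \rho_S(t)] - \sum_\alpha \left\{ [S_\alpha, B_\alpha \rho_S(t)] + [\rho_S(t) C_\alpha, S_\alpha] \right\}, \tag{26} \]
+
+where the system operators Bα and Cα are defined as
+
+\[ B_\alpha \equiv \int_0^\infty d\tau \sum_\beta c_{\alpha\beta}(\tau)\, S_\beta^{(I)}(-\tau), \tag{27a} \]
+
+\[ C_\alpha \equiv \int_0^\infty d\tau \sum_\beta c_{\beta\alpha}(-\tau)\, S_\beta^{(I)}(-\tau). \tag{27b} \]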
+
+Here $S_\alpha^{(I)}(-\tau)$ denotes the operator Sα in the interaction
+picture. In the following, we will simplify notation by
+omitting the superscript “I”; instead we use the convention
+that all operators bearing explicit time arguments
+are to be understood as interaction-picture operators.
+(For density operators, however, we will maintain the
+superscript notation in order to distinguish them from
+Schrödinger-picture density operators, which also carry
+a time argument.) The quantities cαβ(τ) appearing in
+Eq. (27) are given by
+
+\[ c_{\alpha\beta}(\tau) \equiv \langle E_\alpha(\tau) E_\beta \rangle_{\rho_E}. \tag{28} \]
+
+These environment self-correlation functions quantify
+how much information the environment retains over time
+about its interaction with the system. The Markov ap-
+proximation corresponds to the assumption of a rapid
+decay of the cαβ(τ) relative to the timescale set by the
+evolution of the system.
+In many situations of interest, the general form of the
+Born–Markov master equation, Eq. (26), simplifies con-
+siderably.
+For example, typically only a single system
+observable S is monitored by the environment, Hint =
+S ⊗E. Also, the time dependence of the operators Sα(τ)
+and Eα(τ) is often simple, facilitating the calculation of
+the quantities Bα and Cα.
+Examples are discussed in
+Sec. IV.
+
+B. Lindblad master equations
+
+Lindblad master equations constitute a particular, al-
+beit quite general, class of time-local Markovian mas-
+ter equations. They arise from the requirement that the
+evolution of the reduced density matrix generated by the
+master equation must ensure complete positivity [20, 80–
+85]. Complete positivity guarantees that the dynamical
+map $\rho_S(0) \mapsto \rho_S(t) = \mathcal{V}(t)\rho_S(0)$ described by the master
+equation generates physically consistent dynamics even
+when S is initially entangled with another system. While
+complete positivity is automatically fulfilled if the evo-
+lution is exact, approximate master equations will not
+necessarily ensure complete positivity [20, 83–86]. The
+Lindblad master equation is a special case of the gen-
+eral Born–Markov master equation that ensures complete
+positivity and takes the general form [81, 82]
+
+\[ \frac{d}{dt}\rho_S(t) = -i\,[H'_S, \rho_S(t)] + \frac{1}{2} \sum_{\alpha\beta} \gamma_{\alpha\beta} \left\{ [S_\alpha, \rho_S(t) S_\beta^\dagger] + [S_\alpha \rho_S(t), S_\beta^\dagger] \right\}, \tag{29} \]
+
+where $H'_S$ is the renormalized Hamiltonian of the system.
+The coefficients γαβ are time-independent and
+encapsulate all information about the physical
+parameters of the decoherence processes (and possibly
+dissipation processes). One can show that the matrix
+Γ ≡ (γαβ) formed by the coefficients γαβ is positive, i.e.,
+all its eigenvalues κµ are ≥ 0. Therefore, Eq. (29) can be
+simplified by diagonalizing Γ, which results in the diagonal
+form [82, 87]
+
+\[ \frac{d}{dt}\rho_S(t) = -i\,[H'_S, \rho_S(t)] - \frac{1}{2} \sum_\mu \kappa_\mu \left\{ L_\mu^\dagger L_\mu \rho_S(t) + \rho_S(t) L_\mu^\dagger L_\mu - 2 L_\mu \rho_S(t) L_\mu^\dagger \right\}. \tag{30} \]
+
+The Lindblad operators Lµ are linear combinations of the
+original operators Sα, with coefficients determined by the
+diagonalization of Γ. The Lindblad structure of a mas-
+ter equation can also be motivated from the requirement
+that it gives rise to the most general form of generators
+of quantum dynamical semigroups [20, 81, 82, 84, 87–89].
+It is possible to bring any Born–Markov master equation
+into Lindblad form by imposing the rotating-wave approximation.
+This assumption, ubiquitous in quantum
+optics, is justified whenever the timescale set by the typ-
+ical energy differences ℏ(ω − ω′) of the system Hamilto-
+nian is short in comparison with the relaxation timescale
+of the system. (See Sec. 3.3.1 of Ref. [20] for details.)
+Because the Sα are not necessarily Hermitian, the
+Lindblad operators do not always correspond to physical
+observables. But when they do, we can rewrite Eq. (30)
+in compact double-commutator form,
+
+\[ \frac{d}{dt}\rho_S(t) = -i\,[H'_S, \rho_S(t)] - \frac{1}{2} \sum_\mu \kappa_\mu\, [L_\mu, [L_\mu, \rho_S(t)]]. \tag{31} \]
+
+As an example, consider a situation in which the environment
+monitors the position of a system. With L = x
+and the “free”-particle Hamiltonian $H'_S = H_S = p^2/2m$,
+Eq. (31) becomes
+
+\[ \frac{d}{dt}\rho_S(t) = -\frac{i}{2m}\,[p^2, \rho_S(t)] - \frac{1}{2}\kappa\,[x, [x, \rho_S(t)]]. \tag{32} \]
+
+Expressing this master equation in the position representation
+results in
+
+\[ \frac{\partial \rho_S(x, x', t)}{\partial t} = -\frac{i}{2m} \left( \frac{\partial^2}{\partial x'^2} - \frac{\partial^2}{\partial x^2} \right) \rho_S(x, x', t) - \frac{1}{2}\kappa\,(x - x')^2 \rho_S(x, x', t). \tag{33} \]
+
+This is the classic equation of motion for decoherence due
+to environmental scattering first derived in Ref. [17].
+Lindblad master equations provide an intuitive and
+simple way of representing the environmental monitoring
+of an open quantum system. Most of the real physics be-
+hind this monitoring process is hidden in the coefficients
+κµ appearing in Eq. (30). If the Lindblad operators are
+chosen to be dimensionless, the κµ can be directly in-
+terpreted as decoherence rates, since they have units of
+inverse time.
+Equation (31) shows that the decoherence term van-
+ishes if
+
+\[ [L_\mu, \rho_S(t)] = 0 \quad \text{for all } \mu, t. \tag{34} \]
+
+In this case, ρS(t) evolves unitarily. Since the Lµ are lin-
+ear combinations of the Sα, Eq. (34) typically means that
+[Sα, ρS(t)] = 0 for all α, t. This implies that simultane-
+ous eigenstates of all Sα will be immune to decoherence,
+which is precisely the pointer-state criterion of Eq. (19).
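+The double-commutator form of Eq. (31) and the immunity condition of
+Eq. (34) can be checked numerically for the simplest case. The sketch below
+assumes a single qubit with $H'_S = 0$ and one Lindblad operator L = σz; the
+helper names, the forward-Euler integrator, and the parameter values are
+illustrative assumptions rather than part of the original treatment.
+
```python
import math


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def commutator(a, b):
    ab, ba = matmul(a, b), matmul(b, a)
    return [[ab[i][j] - ba[i][j] for j in range(2)] for i in range(2)]


SIGMA_Z = [[1, 0], [0, -1]]


def dephasing_rhs(rho, kappa):
    """Right-hand side of Eq. (31) with H'_S = 0 and a single L = sigma_z."""
    double = commutator(SIGMA_Z, commutator(SIGMA_Z, rho))
    return [[-0.5 * kappa * double[i][j] for j in range(2)] for i in range(2)]


def evolve(rho, kappa, t_final, steps):
    """Integrate the master equation with simple forward-Euler steps."""
    dt = t_final / steps
    for _ in range(steps):
        rhs = dephasing_rhs(rho, kappa)
        rho = [[rho[i][j] + dt * rhs[i][j] for j in range(2)] for i in range(2)]
    return rho


kappa, t_final = 0.7, 2.0
rho_0 = [[0.5, 0.5], [0.5, 0.5]]          # the state (|0> + |1>)/sqrt(2)
rho_t = evolve(rho_0, kappa, t_final, steps=20_000)
print(rho_t[0][0], rho_t[0][1], 0.5 * math.exp(-2 * kappa * t_final))
```
+
+For this choice the populations stay fixed while the off-diagonal element
+decays as exp(−2κt), and an eigenstate of σz such as |0⟩⟨0| gives a
+vanishing right-hand side, as Eq. (34) requires.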
+In quantum-jump and quantum-trajectory approaches,
+the evolution of the reduced density matrix is conditioned
+on an explicitly observed sequence of measurement results
+in the environment. This allows for the (formal)
+description of a single realization of the system evolving
+stochastically, conditioned on a particular measurement
+record. The dynamics are then described by a master
+equation of the Lindblad type, Eq. (31), for the reduced
+density matrix $\rho_S^C$ conditioned on the measurement
+records of the Lindblad operators Lµ,
+
+\[ d\rho_S^C = -i\,[H_S, \rho_S^C]\,dt - \frac{1}{2} \sum_\mu \kappa_\mu\, [L_\mu, [L_\mu, \rho_S^C]]\,dt + \sum_\mu \sqrt{\kappa_\mu}\, \mathcal{W}[L_\mu]\rho_S^C\, dW_\mu. \tag{35} \]
+
+Here, $\mathcal{W}[L]\rho \equiv L\rho + \rho L^\dagger - \rho\,\mathrm{Tr}\{L\rho + \rho L^\dagger\}$, and the $dW_\mu$
+denote so-called Wiener increments. Equation (35) corresponds
+to a diffusive unraveling of the Lindblad equation
+into individual quantum trajectories, which can then be
+expressed by means of a stochastic Schrödinger equation
+[90–102].
+
+C. Non-Markovian decoherence
+
+The derivation of the Born–Markov master equation
+assumes that the coupling between system and environ-
+ment is weak and memory effects of the environment can
+be neglected.
+These conditions, however, are not met
+in certain situations of physical interest.
+An example
+would be a superconducting qubit strongly coupled to a
+low-temperature environment of other two-level systems
+[103, 104]. Also, a recent experiment [105] has measured
+strongly non-Ohmic spectral densities for the environ-
+ment of a quantum nanomechanical system; such densi-
+ties lead to non-Markovian evolution.
+In many cases, pronounced memory effects in the envi-
+ronment will cause strong dependencies of the evolution
+of the reduced density operator on the past history of the
+system–environment composite and therefore make it im-
+possible to describe the reduced dynamics by a differen-
+tial equation that is local in time. Surprisingly, however,
+one can show that even non-Markovian dynamics some-
+times can still be described by a time-local differential
+equation of the form
+
+\[ \frac{d}{dt}\rho_S(t) = \mathcal{K}(t)\rho_S(t), \tag{36} \]
+
+where the superoperator $\mathcal{K}(t)$ depends only on t. For
+example, a non-Markovian master equation for quantum
+Brownian motion (see Sec. IV B) can be obtained through
+a formal modification of the Born–Markov master equation
+[4, 5]. In general, it is often possible to arrive at
+non-Markovian but time-local master equations via the
+so-called time-convolutionless projection operator technique
+[106–109].
+
+IV. DECOHERENCE MODELS
+
+Many physical systems can be represented either by
+a qubit if the state space of the system is discrete and
+effectively two-dimensional, or by a particle described by
+continuous phase-space coordinates. Needless to say, in
+the case of quantum information processing the qubit
+representation is of particular relevance.
+Similarly, a wide range of environments can be modeled
+as a collection of quantum harmonic oscillators or qubits.
+Harmonic-oscillator environments are of great generality.
+At low energies, many systems interacting with an en-
+vironment can effectively be represented by one or two
+coordinates of the system linearly coupled to an environ-
+ment of harmonic oscillators; indeed, sufficiently weak in-
+teractions with an arbitrary environment can be mapped
+onto a system linearly coupled to a harmonic-oscillator
+environment [110, 111].
+Environments represented by qubits are often the appropriate
+model in the low-temperature regime, where decoherence
+is typically dominated by interactions with localized
+modes, such as paramagnetic spins, paramagnetic
+electronic impurities, tunneling charges, defects, and nuclear
+spins [103, 104, 112]. Each of the localized modes is
+represented by a finite-dimensional Hilbert space with a
+finite energy cutoff. We can therefore model these modes
+as a set of discrete states. Typically, only two such states
+are relevant, and thus the localized modes can be mapped
+onto an environment of qubits. Since qubits can be formally
+represented by spin-$\frac{1}{2}$ particles, such models are
+known as spin-environment models.
+In the following, we will discuss four important stan-
+dard models, namely, collisional decoherence (Sec. IV A),
+quantum Brownian motion (Sec. IV B), the spin–boson
+model (Sec. IV C), and the spin–spin model (Sec. IV D).
+For details on these and other decoherence models, in-
+cluding derivations of the relevant master equations, see
+Secs. 3 and 5 of Ref. [9].
+
+A. Collisional decoherence
+
+Collisional decoherence arises from the scattering of en-
+vironmental particles by a massive free quantum particle.
+Models of collisional decoherence were first studied in the
+classic paper by Joos and Zeh [17]. A more rigorous and
+general treatment was later developed by Hornberger and
+collaborators [31, 36–39] (see also [34, 35, 113]), which,
+among other refinements, remedied a flaw in Joos and
+Zeh’s original derivation that had resulted in decoher-
+ence rates that were too large by a factor of 2π [31].
+
+If we assume that the central particle is much more
+massive than the environmental particles such that its
+center-of-mass state is not disturbed by the scattering
+events (no recoil), the time evolution of the reduced den-
+sity matrix is given by [9, 17, 31, 34, 35]
+
+\[ \frac{\partial \rho_S(x, x', t)}{\partial t} = -F(x - x')\,\rho_S(x, x', t). \tag{37} \]
+
+This master equation describes pure spatial decoherence
+without dissipation. The decoherence factor F(x − x′)
+plays the role of a localization rate.
+It represents the
+characteristic decoherence rate at which spatial coher-
+ences between two positions x and x′ become locally sup-
+pressed and is given by
+
+\[ F(x - x') = \int_0^\infty dq\, \varrho(q)\, v(q) \int \frac{d\hat{n}\, d\hat{n}'}{4\pi} \left( 1 - e^{iq(\hat{n} - \hat{n}') \cdot (x - x')} \right) |f(q\hat{n}, q\hat{n}')|^2. \tag{38} \]
+
+Here ϱ(q) denotes the number density of incoming particles
+with magnitude of momentum equal to $q = |\mathbf{q}|$, $\hat{n}$
+and $\hat{n}'$ are unit vectors (with $d\hat{n}$ and $d\hat{n}'$ representing the
+associated solid-angle differentials), and v(q) denotes the
+speed of particles with momentum q. For the scattering
+of massive environmental particles we have v(q) = q/m,
+where m is each particle’s mass, while for the scattering
+of photons and other massless particles v(q) is equal
+to the speed of light. The quantity $|f(q\hat{n}, q\hat{n}')|^2$ is the
+differential cross section for the scattering of an environmental
+particle from initial momentum $\mathbf{q} = q\hat{n}$ to final
+momentum $\mathbf{q}' = q\hat{n}'$.
+Whenever the mass of the central particle becomes
+comparable to the mass of the environmental particles (as
+in the case of air molecules scattered by small molecules
+and free electrons [114]), the no-recoil assumption does
+not hold and more general models for collisional deco-
+herence have to be considered [35, 36].
+The resulting
+dynamics include dissipation, as well as decoherence in
+both position and momentum.
+To further evaluate the decoherence factor F(x − x′),
+Eq. (38), we distinguish two important limiting cases. In
+the short-wavelength limit, the typical wavelength of the
+scattered environmental particles is much shorter than
+the coherent separation ∆x = |x − x′| between the well-
+localized wave packets in the spatial superposition state
+of the system.
+Then a single scattering event will be
+able to fully resolve this separation and thus carry away
+complete which-path information, leading to maximum
+spatial decoherence per scattering event. In this limit,
+F(x − x′) turns out to be simply equal to the total scat-
+tering rate Γtot [9]. This implies the existence of an upper
+limit for the decoherence rate when increasing the sepa-
+ration ∆x, in contrast with decoherence rates obtained
+from linear models [compare Eqs. (22) and (54)]. Equa-
+tion (37) then shows that spatial interference terms will
+become exponentially suppressed at a rate set by Γtot,
+
+\[ \rho_S(x, x', t) = \rho_S(x, x', 0)\, e^{-\Gamma_{\mathrm{tot}} t}. \tag{39} \]
+
+TABLE I. Estimates of decoherence timescales (in seconds)
+for the suppression of spatial interferences over a distance ∆x
+equal to the size a of the object ($\Delta x = a = 10^{-3}$ cm for a
+dust grain and $\Delta x = a = 10^{-6}$ cm for a large molecule). See
+Ref. [9] for details.
+
+\begin{tabular}{lcc}
+Environment & Dust grain & Large molecule \\
+\hline
+Cosmic background radiation & $1$ & $10^{24}$ \\
+Photons at room temperature & $10^{-18}$ & $10^{6}$ \\
+Best laboratory vacuum & $10^{-14}$ & $10^{-2}$ \\
+Air at normal pressure & $10^{-31}$ & $10^{-19}$ \\
+\end{tabular}
+
+In the opposite long-wavelength limit, the environmen-
+tal wavelengths are much larger than the coherent sep-
+aration ∆x = |x − x′|, which implies that an individual
+scattering event will reveal only incomplete which-path
+information. For this case, one can show that spatial co-
+herences become exponentially suppressed at a rate that
+depends on the square of the separation ∆x [9],
+
+\[ \rho_S(x, x', t) = \rho_S(x, x', 0)\, e^{-\Lambda (\Delta x)^2 t}, \tag{40} \]
+
+where Λ is a scattering constant that encapsulates the
+physical details of the interaction. Thus, the quantity
+$\Lambda(\Delta x)^2$ plays the role of a decoherence rate. The
+dependence of this rate on ∆x is reasonable: if the environmental
+wavelengths are much larger than ∆x, it will
+require a large number of scattering events to encode
+an appreciable amount of which-path information in the
+environment, and this amount will increase, for a given
+number of scattering events, as ∆x becomes larger. Note
+that if ∆x is increased beyond the typical wavelength of
+the environment, the short-wavelength limit needs to be
+considered instead, for which the decoherence rate is independent
+of ∆x and attains its maximum possible value.
+Numerical values of collisional decoherence rates ob-
+tained from Eqs. (39) and (40), with the physically rele-
+vant scattering parameters Γtot and Λ appropriately eval-
+uated, have shown the extreme efficiency of collisions in
+suppressing spatial interferences; Table I shows a few
+classic order-of-magnitude estimates [8, 9, 17].
+Excel-
+lent agreement between theory and experiment has been
+demonstrated for the decoherence of fullerenes due to col-
+lisions with background gas molecules in a Talbot–Lau
+interferometer [31, 115–118] (see Sec. VI B and Fig. 2),
+and for the decoherence of sodium atoms in a Mach–
+Zehnder interferometer due to the scattering of photons
+[119] and gas molecules [120].
+
+B. Quantum Brownian motion
+
+A classic and extensively studied model of decoherence
+and dissipation is the one-dimensional motion of a par-
+ticle weakly coupled to a thermal bath of noninteracting
+harmonic oscillators, a model known as quantum Brownian
+motion. The self-Hamiltonian HE of the environment
+is given by
+
+\[ H_E = \sum_i \left( \frac{1}{2m_i} p_i^2 + \frac{1}{2} m_i \omega_i^2 q_i^2 \right), \tag{41} \]
+
+where mi and ωi denote the mass and natural frequency
+of the ith oscillator, and qi and pi are the canonical position
+and momentum operators. The interaction Hamiltonian
+Hint describes the bilinear coupling of the system’s
+position coordinate x to the positions qi of the environmental
+oscillators, $H_{\mathrm{int}} = x \otimes \sum_i c_i q_i$, where the ci denote
+coupling strengths. This interaction represents the
+continuous environmental monitoring of the position
+coordinate of the system.
+The Born–Markov master equation describing the evo-
+lution of the density matrix ρS(t) of the system is given
+by [9, 45]
+
+\[ \frac{d}{dt}\rho_S(t) = -i\,[H_S, \rho_S(t)] - \int_0^\infty d\tau \left\{ \nu(\tau)\,[x, [x(-\tau), \rho_S(t)]] - i\eta(\tau)\,[x, \{x(-\tau), \rho_S(t)\}] \right\}. \tag{42} \]
+
+Here, x(τ) denotes the system’s position operator in the
+interaction picture, $x(\tau) = e^{iH_S\tau} x\, e^{-iH_S\tau}$. The curly
+brackets { · , · } in the last term denote the anticommutator
+{A, B} ≡ AB + BA. The functions
+
+\[ \nu(\tau) = \int_0^\infty d\omega\, J(\omega) \coth\left( \frac{\omega}{2kT} \right) \cos(\omega\tau), \tag{43} \]
+
+\[ \eta(\tau) = \int_0^\infty d\omega\, J(\omega) \sin(\omega\tau), \tag{44} \]
+
+are known as the noise kernel and dissipation kernel, re-
+spectively. The function J(ω), called the spectral density
+of the environment, is given by
+
+\[ J(\omega) \equiv \sum_i \frac{c_i^2}{2 m_i \omega_i}\, \delta(\omega - \omega_i). \tag{45} \]
+
+In general, spectral densities encapsulate the physi-
+cal properties of the environment.
+One frequently re-
+places the collection of individual environmental oscilla-
+tors by an (often phenomenologically motivated) contin-
+uous function J(ω) of the environmental frequencies ω.
+If we specialize to the important case of the system rep-
+resented by a harmonic oscillator with self-Hamiltonian
+
+\[ H_S = \frac{1}{2M} p^2 + \frac{1}{2} M \Omega^2 x^2, \tag{46} \]
+
+the resulting Born–Markov master equation is
+
+\[ \frac{d}{dt}\rho_S(t) = -i\left[ H_S + \tfrac{1}{2} M \tilde{\Omega}^2 x^2, \rho_S(t) \right] - i\gamma\,[x, \{p, \rho_S(t)\}] - D\,[x, [x, \rho_S(t)]] - f\,[x, [p, \rho_S(t)]]. \tag{47} \]
+
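+The coefficients $\tilde{\Omega}^2$, γ, D, and f are defined as
+
+\[ \tilde{\Omega}^2 \equiv -\frac{2}{M} \int_0^\infty d\tau\, \eta(\tau) \cos(\Omega\tau), \tag{48a} \]
+
+\[ \gamma \equiv \frac{1}{M\Omega} \int_0^\infty d\tau\, \eta(\tau) \sin(\Omega\tau), \tag{48b} \]
+
+\[ D \equiv \int_0^\infty d\tau\, \nu(\tau) \cos(\Omega\tau), \tag{48c} \]
+
+\[ f \equiv -\frac{1}{M\Omega} \int_0^\infty d\tau\, \nu(\tau) \sin(\Omega\tau). \tag{48d} \]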
+
+The first term on the right-hand side of Eq. (47) represents
+the unitary dynamics of a harmonic oscillator whose
+natural frequency is shifted by $\tilde{\Omega}$. The second term
+describes momentum damping (dissipation) at a rate
+proportional to γ, which depends only on the spectral density
+but not the temperature of the environment. The
+third term is of the Lindblad double-commutator form
+[see Eq. (31)] and describes decoherence of spatial coherences
+over a distance ∆X at a rate $D(\Delta X)^2$. Note that
+D depends on both the spectral density J(ω) and the
+temperature T of the environment. The fourth term also
+represents decoherence, but its influence on the dynamics
+of the system is usually negligible, especially at higher
+temperatures. In the long-time limit γt ≫ 1, the master
+equation (47) describes dispersion in position space given
+by
+
+\[ \Delta X^2(t) = \frac{D}{2M^2\gamma^2}\, t, \tag{49} \]
+
+where M is the mass of the system appearing in Eq. (46).
+That is, the width ∆X(t) of the ensemble in position
+space asymptotically scales as $\Delta X(t) \propto \sqrt{t}$, just as in
+classical Brownian motion; hence the term “quantum
+Brownian motion.”
+Figure 1 shows the time evolution of position-space and
+momentum-space superpositions of two Gaussian wave
+packets in the Wigner picture, as described by Eq. (47)
+[28]. Interference between the two wave packets is rep-
+resented by oscillations between the direct peaks. The
+interaction with the environment damps these oscilla-
+tions.
+The damping occurs on different timescales for
+the two initial conditions. While the momentum coordi-
+nate is not directly monitored by the environment, the
+intrinsic dynamics, through their creation of spatial su-
+perpositions from superpositions of momentum, result
+in decoherence in momentum space.
+This interplay of
+environmental monitoring and intrinsic dynamics leads
+to the emergence of pointer states that are minimum-
+uncertainty Gaussians (coherent states) well-localized in
+both position and momentum, thus approximating clas-
+sical points in phase space [5, 8, 16, 28, 44, 47, 48].
+FIG. 1. Evolution of superpositions of Gaussian wave packets
+in quantum Brownian motion as studied in Ref. [28], visualized
+in the Wigner representation (panel axes: position x and
+momentum p). Time increases from top
+to bottom. In the left column, the initial wave packets are
+separated in position; in the right column, the separation is
+in momentum.
+
+Let us consider the important case of an ohmic spectral
+density J(ω) ∝ ω with a high-frequency cutoff Λ,
+
+\[ J(\omega) = \frac{2M\gamma_0}{\pi}\, \omega\, \frac{\Lambda^2}{\Lambda^2 + \omega^2}. \tag{50} \]
+
+In the limit of a high-temperature environment (kT ≫ Ω
+and kT ≫ Λ), we arrive at the Caldeira–Leggett master
+equation [121],
+
+\[ \frac{d}{dt}\rho_S(t) = -i\,[H'_S, \rho_S(t)] - i\gamma_0\,[x, \{p, \rho_S(t)\}] - 2M\gamma_0 kT\,[x, [x, \rho_S(t)]], \tag{51} \]
+
+where
+
+\[ H'_S = H_S + \frac{1}{2} M \tilde{\Omega}^2 x^2 = \frac{1}{2M} p^2 + \frac{1}{2} M \left[ \Omega^2 - 2\gamma_0\Lambda \right] x^2 \tag{52} \]
+
+is the frequency-shifted Hamiltonian $H'_S$ of the system.
+This equation has been widely and successfully used to
+model decoherence and dissipation processes, even in
+cases where the assumptions were not strictly fulfilled
+(for example, in quantum-optical settings, where often
+kT ≲ Λ [122]).
+In the position representation, the final term on the
+right-hand side of Eq. (51) can be written as
+
+\[ -\gamma_0 \left( \frac{x - x'}{\lambda_{\mathrm{dB}}} \right)^2 \rho_S(x, x', t), \tag{53} \]
+
+where $\lambda_{\mathrm{dB}} = (2MkT)^{-1/2}$ is the thermal de Broglie wavelength.
+This term describes spatial localization with a
+decoherence rate $\tau^{-1}_{|x-x'|}$ given by [18]
+
+\[ \tau^{-1}_{|x-x'|} = \gamma_0 \left( \frac{x - x'}{\lambda_{\mathrm{dB}}} \right)^2. \tag{54} \]
+
+This reproduces Eq. (22) with the relaxation timescale
+$\tau_r = \gamma_0^{-1}$, and as discussed there, given that λdB is
+extremely small for macroscopic and even mesoscopic objects,
+we see that superpositions of macroscopically separated
+center-of-mass positions will typically be decohered
+on timescales many orders of magnitude shorter than the
+dissipation (relaxation) timescale $\gamma_0^{-1}$. Over timescales
+on the order of the decoherence time, we may therefore
+often neglect the dissipative term in Eq. (51), leading to
+the pure-decoherence master equation
+
+\[ \frac{d}{dt}\rho_S(t) = -i\,[H'_S, \rho_S(t)] - 2M\gamma_0 kT\,[x, [x, \rho_S(t)]]. \tag{55} \]
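+
+As a worked check, the step from the double commutator in Eq. (51) to the
+localization term of Eq. (53) can be spelled out as follows. It uses only
+the fact that x acts multiplicatively in the position representation,
+together with $\lambda_{\mathrm{dB}}^{-2} = 2MkT$ in units with ħ = 1.
+
```latex
\begin{align*}
\langle x | [x, [x, \rho_S]] | x' \rangle
  &= (x - x')\, \langle x | [x, \rho_S] | x' \rangle
   = (x - x')^2\, \rho_S(x, x', t), \\
-2M\gamma_0 kT\, \langle x | [x, [x, \rho_S]] | x' \rangle
  &= -\gamma_0\, (2MkT)\, (x - x')^2\, \rho_S(x, x', t)
   = -\gamma_0 \left( \frac{x - x'}{\lambda_{\mathrm{dB}}} \right)^2 \rho_S(x, x', t).
\end{align*}
```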
+
+C. Spin–boson models
+
+In the spin–boson model, a qubit interacts with an
+environment of harmonic oscillators. The seminal review
+paper by Leggett et al. [29] discusses the dynamics of the
+spin–boson model in great detail.
+Let us first consider a simplified spin–boson model
+where the self-Hamiltonian of the system is taken to be
+$H_S = \frac{1}{2}\omega_0\sigma_z$, with eigenstates |0⟩ and |1⟩. In contrast
+with the more general case discussed below, this Hamiltonian
+does not include a tunneling term $-\frac{1}{2}\Delta_0\sigma_x$, and thus
+HS does not generate any nontrivial intrinsic dynamics.
+We employ the familiar self-Hamiltonian, Eq. (41), for
+an environment of harmonic oscillators, and choose the
+bilinear interaction Hamiltonian $H_{\mathrm{int}} = \sigma_z \otimes \sum_i c_i q_i$. Using
+the raising and lowering operators $a_i^\dagger$ and $a_i$, we can
+recast the total Hamiltonian as
+
+\[ H = \frac{1}{2}\omega_0\sigma_z + \sum_i \omega_i a_i^\dagger a_i + \sigma_z \otimes \sum_i \left( g_i a_i^\dagger + g_i^* a_i \right). \tag{56} \]
+
+Note that since $[H, \sigma_z] = 0$, no transitions between |0⟩
+and |1⟩ can be induced by H. There is no energy exchange
+between the system and the environment, and we
+therefore deal with a model of decoherence without dissipation.
+Such a model is a good representation of rapid
+decoherence processes during which the amount of dissipation
+is negligible, as is often the case in physical applications.
+The resulting evolution can be solved exactly [9].
+For an ohmic spectral density with a high-frequency cutoff,
+it is found that superpositions of the form α |0⟩ + β |1⟩
+are exponentially decohered on a timescale set by the
+thermal correlation time $(kT)^{-1}$ of the environment.
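+The temperature dependence of this pure-dephasing model can be explored
+numerically. The sketch below assumes the standard form of the exact
+solution, in which the off-diagonal element is multiplied by exp[−Γ(t)] with
+Γ(t) = ∫ dω J(ω) coth(ω/2kT) (1 − cos ωt)/ω². The overall prefactor
+convention, the exponential cutoff J(ω) = J0 ω exp(−ω/Λ), the function name,
+and all parameter values are assumptions for illustration and are not taken
+from Ref. [9].
+
```python
import math


def decoherence_exponent(t, kT, j0=0.1, cutoff=1.0, n_steps=20_000):
    """Gamma(t) = int_0^inf dw J(w) coth(w / 2kT) (1 - cos(w t)) / w**2.

    Assumes J(w) = j0 * w * exp(-w / cutoff) and units with hbar = 1.
    The integral is approximated by the midpoint rule on [0, 40 * cutoff].
    """
    w_max = 40.0 * cutoff
    dw = w_max / n_steps
    total = 0.0
    for n in range(n_steps):
        w = (n + 0.5) * dw
        spectral = j0 * w * math.exp(-w / cutoff)
        coth = 1.0 / math.tanh(w / (2.0 * kT))
        total += spectral * coth * (1.0 - math.cos(w * t)) / w ** 2 * dw
    return total


for kT in (0.1, 1.0, 10.0):
    coherence = math.exp(-decoherence_exponent(t=5.0, kT=kT))
    print(f"kT = {kT:5.1f}: |rho_01(t)| / |rho_01(0)| = {coherence:.3e}")
```
+
+Under these assumptions the remaining coherence at a fixed time drops
+sharply as kT grows, and in the low-temperature limit the exponent
+approaches the vacuum value (J0/2) ln(1 + Λ²t²).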
+Inclusion of a tunneling term $-\frac{1}{2}\Delta_0\sigma_x$ yields the general
+spin–boson model defined by the Hamiltonian
+
+\[ H = \frac{1}{2}\omega_0\sigma_z - \frac{1}{2}\Delta_0\sigma_x + \sum_i \left( \frac{1}{2m_i} p_i^2 + \frac{1}{2} m_i \omega_i^2 q_i^2 \right) + \sigma_z \otimes \sum_i c_i q_i. \tag{57} \]
+
+The rich non-Markovian dynamics of this model have
+been analyzed in Refs. [29, 123]. The particular dynamics
+strongly depend on the various parameters, such as the
+temperature of the environment, the form of the spec-
+tral density (subohmic, ohmic, or supraohmic), and the
+system–environment coupling strength. For each param-
+eter regime, a characteristic dynamical behavior emerges:
+localization, exponential or incoherent relaxation, expo-
+nential decay, and strongly or weakly damped coherent
+oscillations [29].
+In the weak-coupling limit, one can derive the Born–
+Markov master equation in much the same way as in
+the case of quantum Brownian motion (note the similar
+structure of the Hamiltonians). The result is (see Ref. [9]
+for details)
+
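+\[ \frac{d}{dt}\rho_S(t) = -i \left( H'_S \rho_S(t) - \rho_S(t) H'^{\dagger}_S \right) - \tilde{D}\,[\sigma_z, [\sigma_z, \rho_S(t)]] + \zeta\, \sigma_z \rho_S(t) \sigma_y + \zeta^*\, \sigma_y \rho_S(t) \sigma_z. \tag{58} \]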
+
+The first term on the right-hand side of the master equation
+(58) represents the evolution under the environment-shifted
+self-Hamiltonian $H'_S$, the second term corresponds
+to decoherence in the σz eigenbasis of the system
+at a rate given by $\tilde{D}$, and the last two terms describe
+the decay of the two-level system. $H'_S$ is the renormalized
+(and in general non-Hermitian) Hamiltonian of the
+system. The coefficients $\zeta^*$, $\tilde{D}$, $\tilde{f}$, and $\tilde{\gamma}$ are given by
+
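+\zeta^* = \tilde{f} - i\tilde{\gamma}, (59a)
+
+\tilde{D} = \int_0^\infty d\tau\, \nu(\tau) \cos(\Delta_0 \tau), (59b)
+
+\tilde{f} = \int_0^\infty d\tau\, \nu(\tau) \sin(\Delta_0 \tau), (59c)
+
+\tilde{\gamma} = \int_0^\infty d\tau\, \eta(\tau) \sin(\Delta_0 \tau), (59d)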
+
+with the noise and the dissipation kernels ν(τ) and η(τ)
+taking the same form as in quantum Brownian motion
+[see Eqs. (43) and (44)].
+
+D. Spin-environment models
+
+A qubit linearly coupled to a collection of other
+qubits—known also as a spin–spin model—is often a good
+model of a single two-level system, such as a supercon-
+ducting qubit, strongly coupled to a low-temperature en-
+vironment [103, 104].
+The model of a harmonic oscil-
+lator interacting with a spin environment may be rele-
+vant to the description of decoherence and dissipation in
+quantum-nanomechanical systems and micron-scale ion
+traps [124]. For details on the theory of spin-environment
+models, see Refs. [104, 125–127].
+A simple version of a spin–spin model is described by
+the total Hamiltonian
+
+H = H_S + H_{int} = -\frac{1}{2}\Delta_0\sigma_x + \frac{1}{2}\sigma_z \otimes \sum_{i=1}^{N} g_i \sigma_z^{(i)} \equiv -\frac{1}{2}\Delta_0\sigma_x + \frac{1}{2}\sigma_z \otimes E. (60)
+
+Here, HS represents the intrinsic dynamics given by a
+tunneling term, while Hint describes the environmental
+monitoring of the observable σz.
+The model can be solved exactly [128, 129], and
+the resulting dynamics illustrate the dependence of the
+preferred basis on the relative strengths of the self-
+Hamiltonian of the system and the interaction Hamil-
+tonian.
+The preferred basis emerges as the local ba-
+sis that is most robust under the total Hamiltonian.
+When the interaction Hamiltonian dominates over the
+self-Hamiltonian, the pointer states are found to be eigen-
+states of the interaction Hamiltonian, in agreement with
+the commutativity criterion, Eq. (18). Conversely, when
+the modes of the environment are slow and the self-
+Hamiltonian dominates the evolution of the system (the
+quantum limit of decoherence [42]), the pointer states are
+the eigenstates of the Hamiltonian of the system.
+In the weak-coupling limit, spin environments can be
+mapped onto oscillator environments [110, 130]. Specifi-
+cally, the reduced dynamics of a system weakly coupled
+to a spin environment can be described by the system
+coupled to an equivalent oscillator environment described
+by an explicitly temperature-dependent spectral density
+of the form
+
+J_{eff}(\omega, T) \equiv J(\omega) \tanh\left( \frac{\omega}{2kT} \right), (61)
+
+where J(ω) is the original spectral density of the spin
+environment. (See Sec. 5.4.2 of Ref. [9] for details and
+examples.)
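As a small illustration of Eq. (61), the sketch below evaluates the effective spectral density at a cold and a hot temperature. The ohmic form of J(ω), the parameter values, and the function names are assumptions chosen for illustration only; units are such that k = 1 by default.

```python
import math

def ohmic_density(omega, coupling=0.1, cutoff=10.0):
    """Hypothetical spectral density J(omega) of the spin environment."""
    return coupling * omega * math.exp(-omega / cutoff)

def effective_density(omega, temperature, boltzmann=1.0):
    """Eq. (61): J_eff(omega, T) = J(omega) * tanh(omega / (2 k T))."""
    return ohmic_density(omega) * math.tanh(omega / (2.0 * boltzmann * temperature))

omega = 1.0
cold = effective_density(omega, temperature=0.01)
hot = effective_density(omega, temperature=100.0)
print("J_eff / J at low temperature: ", cold / ohmic_density(omega))
print("J_eff / J at high temperature:", hot / ohmic_density(omega))
```

For kT much smaller than ω the tanh factor saturates at one, so the effective density equals the original J(ω). For kT much larger than ω the factor is approximately ω/2kT, which suppresses the effective coupling.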
+
+V. QUBIT DECOHERENCE, QUANTUM ERROR CORRECTION, AND ERROR AVOIDANCE
+
+Quantum computation and quantum information pro-
+cessing rely on coherent superpositions of mesoscopically
+or macroscopically distinct states that are highly suscep-
+tible to decoherence. Avoiding, controlling, and mitigat-
+ing decoherence is therefore of paramount importance.
+While the qubits need to be protected from detrimental
+environmental interactions, we also need to be able to
+control and measure them via a macroscopic apparatus.
+The formidable challenge of designing a quantum com-
+puter consists of meeting both demands in a balanced
+way. Even so, decoherence induced by interactions with
+the environment and the control apparatus, as well as
+noise due to faulty gate operations, will likely be too
+strong to allow for useful quantum computations to be
+carried out [74, 131]. What is also needed is an active
+
+mitigation of the effects of decoherence through active
+quantum error correction [132–136].
+We may distinguish two limiting cases for modeling
+decoherence in qubits. The first limit is that of indepen-
+dent qubit decoherence. Here, each qubit couples indepen-
+dently to its own environment, without any interactions
+between these environments. For example, this may be
+the case if the qubits are spatially well-separated (rela-
+tive to the typical coherence length of the environment)
+and only couple to their immediate surroundings. Then
+the error processes affecting the qubits will be completely
+uncorrelated. Thus, if the probability of a particular er-
+ror to affect one qubit is p, the probability of this error
+to occur in K qubits will be p^K. Many error-correcting
+schemes are only efficient in correcting such single-qubit
+errors, and thus the assumption of independent decoher-
+ence frequently underlies these schemes. This assump-
+tion, however, is unrealistic when the qubits are located
+spatially close to each other. In this case, all qubits ap-
+proximately feel the same environment, and it is likely
+that errors will become correlated among multiple qubits.
+The limiting case corresponding to this situation is that
+of collective qubit decoherence, in which all qubits couple
+to exactly the same environment.
+
+A. Correction of decoherence-induced quantum errors
+
+Consider a single qubit S, initially described by a pure
+state |ψ⟩ and interacting with an environment E. One
+can show that an arbitrary evolution of the combined
+qubit–environment state can always be written in the
+form
+
+|\psi\rangle |e_0\rangle \longrightarrow I|\psi\rangle |e_I\rangle + \sum_{s=x,y,z} (\sigma_s |\psi\rangle) |e_s\rangle, (62)
+
+where the Pauli operators σs act on the Hilbert space of
+S, and |eI⟩ and {|es⟩} are environmental states that are
+not necessarily orthogonal or normalized. Thus, any in-
+fluence of the environment on the qubit can be expressed
+simply in terms of a weighted sum of the Pauli operators
+and the identity operator acting on the original state of
+the qubit. The effects of σx and σz on the qubit state are
+often referred to as a bit-flip error and a phase-flip error,
+respectively. If we restrict our attention to environmental
+entanglement and the resulting decoherence effects, then
+only phase-flip errors need to be taken into account.
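The statement that any influence on the qubit can be written as a weighted sum of the identity and the Pauli operators can be checked directly for a 2×2 operator. The sketch below is an illustrative assumption, not part of the original derivation: it uses the standard coefficients c_s = Tr(σ_s A)/2 and a hypothetical small phase kick as the example operator.

```python
import cmath

IDENTITY = [[1, 0], [0, 1]]
SIGMA_X = [[0, 1], [1, 0]]
SIGMA_Y = [[0, -1j], [1j, 0]]
SIGMA_Z = [[1, 0], [0, -1]]
BASIS = {"I": IDENTITY, "x": SIGMA_X, "y": SIGMA_Y, "z": SIGMA_Z}

def trace_of_product(a, b):
    """Tr(a b) for 2x2 matrices stored as nested lists."""
    return sum(a[i][k] * b[k][i] for i in range(2) for k in range(2))

def pauli_coefficients(op):
    """Coefficients c_s such that op = sum_s c_s * sigma_s, with c_s = Tr(sigma_s op) / 2."""
    return {name: trace_of_product(sigma, op) / 2 for name, sigma in BASIS.items()}

def reconstruct(coeffs):
    """Rebuild the 2x2 operator from its Pauli coefficients."""
    return [[sum(coeffs[n] * BASIS[n][i][j] for n in BASIS) for j in range(2)]
            for i in range(2)]

# Hypothetical example: a small relative phase between |0> and |1>.
theta = 0.2
phase_kick = [[cmath.exp(-1j * theta / 2), 0], [0, cmath.exp(1j * theta / 2)]]
coeffs = pauli_coefficients(phase_kick)
for name, value in coeffs.items():
    print(name, value)
```

For this example only the identity and σz coefficients are nonzero, which matches the remark that pure decoherence involves phase-flip errors only.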
+For N qubits, Eq. (62) generalizes to
+
+|\psi\rangle |e_0\rangle \longrightarrow \sum_i (E_i |\psi\rangle) |e_i\rangle. (63)
+
+Here |ψ⟩ is the initial N-qubit state, and the error op-
+erators Ei are tensor products of N operators involv-
+ing identity and Pauli operators. Equation (63) repre-
+sents a worst-case scenario.
+In many cases, simplified
+versions can be used. One important case is that of par-
+tial decoherence. Here, only a small number K < N of
+qubits become entangled with the environment between
+two successive applications of an error-correcting mech-
+anism. Then it will be sufficient to restrict our attention
+to the 2^K possible error operators made up of at most K
+operators σz and N − K identity operators. In the case
+of independent qubit decoherence, we only need to con-
+sider a collection of independent phase-flip errors acting
+on single qubits, represented by error operators of the
+form E = I ⊗ · · · ⊗ I ⊗ σz ⊗ I ⊗ · · · ⊗ I.
+Given the entangled state on the right-hand side of
+Eq. (63), the goal of quantum error correction is to re-
+store the initial (unknown) state |ψ⟩. We let an ancilla,
+described by an initial state |a0⟩, interact with the qubit
+system such that
+
+|a_0\rangle \left( \sum_i (E_i |\psi\rangle) |e_i\rangle \right) \longrightarrow \sum_i |a_i\rangle (E_i |\psi\rangle) |e_i\rangle. (64)
+
+Let us assume that the ancilla states |ai⟩ are at least
+approximately mutually orthogonal, such that they can
+be distinguished by measurement. We now measure the
+observable O_A = \sum_i a_i |a_i\rangle\langle a_i| on the ancilla, with a_i \neq a_j
+for i \neq j. The projective measurement will yield a
+particular outcome, say, a_k, and lead to the reduction of
+the entangled state,
+
+\sum_i |a_i\rangle (E_i |\psi\rangle) |e_i\rangle \longrightarrow |a_k\rangle (E_k |\psi\rangle) |e_k\rangle. (65)
+
+The outcome ak of the measurement tells us the counter-
+transformation needed to restore the initial qubit state.
+Applying E_k^{-1} = E_k^\dagger to the system gives
+
+|a_k\rangle (E_k |\psi\rangle) |e_k\rangle \xrightarrow{E_k^{-1}} |a_k\rangle |\psi\rangle |e_k\rangle. (66)
+
+Note that, as required in order to avoid introducing ad-
+ditional decoherence in the computational basis of the
+qubit system, we have obtained no information whatso-
+ever about the state of the system.
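The procedure of Eqs. (64) to (66) can be illustrated with a concrete code. The sketch below uses the textbook three-qubit phase-flip code, which is a swapped-in example and not a scheme taken from the text. As a simplifying assumption, the syndrome is read off directly as the values of the stabilizers σx σx on neighboring qubits instead of being copied to an explicit ancilla first. All names are hypothetical.

```python
import math

N_QUBITS = 3

def encode(alpha, beta):
    """Amplitudes of alpha|+++> + beta|---> in the computational basis."""
    norm = math.sqrt(2 ** N_QUBITS)
    return [(alpha + beta * (-1) ** bin(idx).count("1")) / norm
            for idx in range(2 ** N_QUBITS)]

def apply_z(state, qubit):
    """Phase-flip error (or its correction) sigma_z acting on one qubit."""
    return [-amp if (idx >> qubit) & 1 else amp for idx, amp in enumerate(state)]

def stabilizer_value(state, qubit_a, qubit_b):
    """Expectation of sigma_x(a) sigma_x(b); it is +1 or -1 for the states used here."""
    mask = (1 << qubit_a) | (1 << qubit_b)
    value = sum((state[idx].conjugate() * state[idx ^ mask]).real
                for idx in range(len(state)))
    return round(value)

# Syndrome (value on qubits 0,1; value on qubits 1,2) -> qubit to correct.
SYNDROME_TO_QUBIT = {(1, 1): None, (-1, 1): 0, (-1, -1): 1, (1, -1): 2}

def correct(state):
    """Identify a single phase flip from the syndrome and undo it."""
    syndrome = (stabilizer_value(state, 0, 1), stabilizer_value(state, 1, 2))
    qubit = SYNDROME_TO_QUBIT[syndrome]
    return state if qubit is None else apply_z(state, qubit)

psi = encode(0.6, 0.8j)
for damaged_qubit in range(N_QUBITS):
    restored = correct(apply_z(psi, damaged_qubit))
    error = max(abs(a - b) for a, b in zip(restored, psi))
    print(f"phase flip on qubit {damaged_qubit}: residual error {error:.1e}")
```

The syndrome depends only on which qubit was flipped and not on the amplitudes alpha and beta, which mirrors the remark that the measurement reveals nothing about the encoded state.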
+This account of quantum error correction has been
+highly idealized.
+Let us mention three complications.
+First, it is impossible to design an interaction between
+the computational qubits and the ancilla that would al-
+low us to distinguish, by measuring the ancilla, between
+all possible errors. Second, in realistic settings the error
+operators Ei may be very complex, and it remains to be
+seen whether and how the corresponding countertrans-
+formations can be applied without introducing signifi-
+cant additional decoherence.
+Third, the ancilla qubits
+are physically similar to the computational qubits and
+can therefore be expected to be equally prone to en-
+vironmental interactions (and thus decoherence) as the
+computational qubits themselves. Since the inclusion of
+ancilla qubits increases the total number of qubits in the
+quantum computer, and since decoherence rates typically
+scale exponentially with the size of the system, it will re-
+quire sophisticated experimental designs to ensure not
+only that quantum error correction works in practice,
+but also that it does not aggravate the problem of qubit
+decoherence.
+
+B. Quantum computation on decoherence-free subspaces
+
+We introduced the concept of decoherence-free subspaces
+(DFS) [49–58], or pointer subspaces [3], in Sec. II C 2.
+DFS allow us to encode quantum information
+in “quiet corners” of the Hilbert space to protect
+it from environmental effects. In contrast with quantum
+error correction, DFS prevent errors from happening in
+the first place and thus represent a strategy for intrinsic
+error avoidance.
+The two limiting cases of independent qubit decoher-
+ence and collective qubit decoherence delineate the lim-
+its on the size of a DFS. To illustrate this relation-
+ship, let us consider the case of collective decoherence
+of an N-qubit system interacting with an oscillator bath
+[49, 51, 53, 56, 137].
+The interaction Hamiltonian for
+this generalized spin–boson model is taken to be [com-
+pare Eq. (56)]
+
+H_{int} = \sum_{i=1}^{N} \sigma_z^{(i)} \otimes \sum_j \left( g_{ij} a_j^\dagger + g_{ij}^* a_j \right) \equiv \sum_{i=1}^{N} \sigma_z^{(i)} \otimes E_i. (67)
+
+The assumption of collective decoherence implies that
+the couplings gij (and thus the environment operators
+Ei) must be independent of the index i. Then Eq. (67)
+becomes
+
+H_{int} = \left( \sum_i \sigma_z^{(i)} \right) \otimes E \equiv S_z \otimes E. (68)
+
+Recall that a DFS is spanned by a degenerate set of
+eigenstates of the system operators Sα of the interaction
+Hamiltonian [see Eq. (20)]. Thus, in our case the DFS
+will be spanned by degenerate eigenstates of the collec-
+tive spin operator Sz. Any N-qubit product state of the
+computational basis states |0⟩ and |1⟩ (the eigenstates of
+σz with eigenvalues +1 and −1, respectively) will be an
+eigenstate of Sz. There are N + 1 different possible integer
+eigenvalues m, ranging in steps of 2 from m = −N (corresponding
+to the basis state |1 · · · 1⟩) to m = +N (corresponding
+to |0 · · · 0⟩). The largest number of mutually orthogonal
+computational-basis states with the same eigenvalue m
+of Sz is given by the set S0 of basis states with m = 0,
+i.e., those with N/2 qubits in the state |0⟩. There are
+n_0 = \binom{N}{N/2} such states in this set, spanning a DFS of
+dimension n_0. For large values of N, we can approximate
+the binomial coefficient using Stirling’s formula,
+
+\log_2 \binom{N}{N/2} \approx N - \frac{1}{2} \log_2(\pi N/2) \xrightarrow{N \gg 1} N. (69)
+
+Therefore, in the limiting case of collective decoherence,
+the dimension of our DFS approaches the dimension of
+the original Hilbert space, and the encoding efficiency
+approaches unity. For example, for N = 4 qubits, the set
+
+S0 = { |0011⟩ , |0101⟩ , |0110⟩ , |1001⟩ , |1010⟩ , |1100⟩ }
+(70)
+
+spans a maximum-size DFS of dimension six, to be compared
+with the dimension of the original Hilbert space,
+which is 2^4 = 16. Thus, given the model for collective
+decoherence considered here, using four physical qubits we
+can encode up to two logical qubits in a DFS (since encoding
+three logical qubits would already require a DFS
+of dimension 2^3 = 8).
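The counting in this passage can be reproduced with a few lines of code. The sketch below is illustrative only and its function names are hypothetical. It enumerates the m = 0 basis states, evaluates the binomial coefficient, compares it with the Stirling estimate of Eq. (69), and reports how many logical qubits fit into the resulting DFS.

```python
from itertools import product
from math import comb, log2, pi

def dfs_dimension(n_qubits):
    """n0 = C(N, N/2): number of computational-basis states with m = 0 (N even)."""
    if n_qubits % 2:
        raise ValueError("an m = 0 sector requires an even number of qubits")
    return comb(n_qubits, n_qubits // 2)

def zero_eigenvalue_states(n_qubits):
    """Bit strings with equally many 0s and 1s, i.e. eigenvalue m = 0 of S_z."""
    return ["".join(bits) for bits in product("01", repeat=n_qubits)
            if bits.count("0") == n_qubits // 2]

def stirling_log2(n_qubits):
    """Right-hand side of Eq. (69) before taking the large-N limit."""
    return n_qubits - 0.5 * log2(pi * n_qubits / 2)

def logical_qubits(n_qubits):
    """Largest k with 2**k <= n0, i.e. the number of encodable logical qubits."""
    return dfs_dimension(n_qubits).bit_length() - 1

print(zero_eigenvalue_states(4))
print("n0 for N = 4:", dfs_dimension(4), "-> logical qubits:", logical_qubits(4))
for n in (4, 20, 100):
    print(n, log2(dfs_dimension(n)), stirling_log2(n))
```

For N = 4 the enumeration gives the six states of Eq. (70), and the ratio of logical to physical qubits grows toward one as N increases.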
+As mentioned in Sec. II C 2, the existence of a DFS
+corresponds to a dynamical symmetry. Our model rep-
+resents a case of perfect dynamical symmetry, since
+the system–environment interaction, Eq. (68), is com-
+pletely symmetric with respect to any permutations of
+the qubits, thereby leading to a DFS of maximum size.
+What happens if the symmetry is broken by additional
+small independent coupling terms? It has been shown
+[50, 138] that, to first order in the perturbation strength,
+the storage of quantum information in DFS is stable to
+such perturbations to all orders in time, but that the pro-
+cessing of such quantum information encoded in DFS is
+robust only to first order in time.
+In the case of purely independent qubit decoherence,
+the environment operators Ei appearing in Eq. (67) will
+now differ from one another. To find a DFS, we follow
+the usual strategy [see Eq. (20)] of determining a set of
+orthonormal basis states {|si⟩} such that
+
+\left[ I^{(1)} \otimes \cdots \otimes I^{(j-1)} \otimes \sigma_z^{(j)} \otimes I^{(j+1)} \otimes \cdots \otimes I^{(N)} \right] |s_i\rangle = \lambda^{(j)} |s_i\rangle (71)
+
+for all i and 1 ≤ j ≤ N. The only state fulfilling this
+eigenvalue problem is |0 · · · 0⟩.
+Since we need at least
+a two-dimensional subspace to encode a single logical
+qubit, the case of independent decoherence in the spin–
+boson model does not allow for the existence of a DFS for
+quantum computation. In the language of pointer sub-
+spaces, there is only a single exact pointer state, and this
+environment-superselected preferred state of the system
+will be the ground state |0 · · · 0⟩.
+In realistic settings, neither the assumption of purely
+independent decoherence nor the limit of entirely collec-
+tive decoherence will be entirely appropriate. We can,
+however, use a DFS to protect the qubits from collective
+decoherence effects, and we can recover from single-qubit
+errors due to independent decoherence using active error-
+correction methods. These two approaches can be con-
+catenated [54] to enable universal fault-tolerant quantum
+computation even when the restriction to single-qubit er-
+rors is dropped [55, 139].
+
+C. Environment engineering and dynamical decoupling
+
+For reasonably large DFS to exist, the system–environment
+interaction must exhibit a sufficiently high degree of symmetry.
+Such symmetries are unlikely to arise naturally in typical
+experimental settings.
+
+One way of overcoming this limitation is based on envi-
+ronment engineering. Here, one tries to generate certain
+symmetries in the structure of the system–environment
+interactions. For example, an appropriately engineered
+symmetrization could make superposition states in Bose–
+Einstein condensates correspond to (approximate) de-
+generate eigenstates of the interaction Hamiltonian, in
+which case such states would lie within a DFS, thereby
+significantly enhancing their longevity [140].
+In ion
+traps, changing the parameters in the effective interac-
+tion Hamiltonian for the trapped ion allows one to se-
+lect different pointer subspaces and thereby control into
+which DFS the trapped ion is driven [77, 79, 141, 142].
+Another approach to the active creation of DFS is
+known as dynamical decoupling [143–148].
+Here time-
+dependent modifications are introduced into the Hamil-
+tonian of the system that counteract the influence of
+the environment. These modifications take the form of
+sequences of rapid projective measurements or strong
+control-field pulses acting on the system (“quantum
+bang-bang control” [143]).
+Even if the structure of
+the system–environment interaction Hamiltonian is not
+known, decoherence can be suppressed arbitrarily well
+in the limit of an infinitely fast rate of the decoupling
+control field, thus dynamically creating a DFS (which
+then represents a dynamically decoupled subspace). In
+the realistic case of a finite control rate, sufficient (albeit
+imperfect) protection from decoherence can be achieved
+via this decoupling technique, provided the control rate
+is larger than the fastest timescale set by the rate of for-
+mation of environmental entanglement.
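The benefit of faster control pulses can be illustrated with a toy simulation. The sketch below is a minimal model under stated assumptions, not the general scheme of Refs. [143–148]: the qubit suffers pure dephasing from a classical detuning that follows an Ornstein–Uhlenbeck process, the control pulses are ideal instantaneous π pulses at equally spaced times, and every parameter value and name is hypothetical.

```python
import cmath
import math
import random

def echo_coherence(n_pulses, total_time=4.0, n_steps=400, n_runs=1000,
                   sigma=1.0, tau_c=5.0, seed=0):
    """Average coherence |<exp(-i phase)>| after total_time with n_pulses pi pulses."""
    rng = random.Random(seed)
    dt = total_time / n_steps
    decay = math.exp(-dt / tau_c)                 # noise memory over one time step
    kick = sigma * math.sqrt(1.0 - decay ** 2)    # keeps the noise variance at sigma**2
    pulse_times = [(j + 0.5) * total_time / n_pulses for j in range(n_pulses)]
    accumulated = 0j
    for _ in range(n_runs):
        detuning = rng.gauss(0.0, sigma)
        phase, sign, next_pulse = 0.0, 1.0, 0
        for step in range(n_steps):
            t_mid = (step + 0.5) * dt
            # Each pi pulse reverses the direction of phase accumulation.
            while next_pulse < n_pulses and pulse_times[next_pulse] < t_mid:
                sign = -sign
                next_pulse += 1
            phase += sign * detuning * dt
            detuning = decay * detuning + kick * rng.gauss(0.0, 1.0)
        accumulated += cmath.exp(-1j * phase)
    return abs(accumulated / n_runs)

free = echo_coherence(0)
single_echo = echo_coherence(1)
eight_pulses = echo_coherence(8)
print("no pulses:   ", round(free, 3))
print("one pulse:   ", round(single_echo, 3))
print("eight pulses:", round(eight_pulses, 3))
```

In this model the remaining coherence increases with the number of pulses because the noise has less time to change between successive sign reversals, which is the finite-control-rate behavior described above.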
+
+VI. EXPERIMENTAL STUDIES OF DECOHERENCE
+
+Decoherence, of course, happens all around us, and
+in this sense its consequences are readily observed. But
+what we would like to do is to be able to experimen-
+tally study the gradual and controlled action of deco-
+herence. In this endeavor, several obstacles have to be
+overcome. We need to prepare the system in a superpo-
+sition of mesoscopically or even macroscopically distin-
+guishable states with a sufficiently long decoherence time
+such that the gradual action of decoherence can be re-
+solved. We must be able to monitor decoherence without
+introducing a significant amount of additional, unwanted
+decoherence. We would also like to have sufficient con-
+trol over the environment so we can tune the strength
+and form of its interaction with the system.
+Starting
+in the mid-1990s, several such experiments have been
+performed, for example, using cavity QED [12], meso-
+scopic molecules [149], and superconducting systems such
+as SQUIDs and Cooper-pair boxes [13]. Bose–Einstein
+condensates [150] and quantum nanomechanical systems
+[151, 152] are promising candidates for future experimen-
+tal tests of decoherence.
+These experiments are important for several reasons.
+They are impressive demonstrations of the possibility
+of generating nonclassical quantum states in mesoscopic
+and macroscopic systems. They show that the quantum–
+classical boundary is smooth and can be shifted by vary-
+ing the relevant experimental parameters.
+They allow
+us to test and improve decoherence models, and they
+help us design devices for quantum information process-
+ing that are good at evading the detrimental influence
+of the environment. Finally, such experiments may be
+used to test quantum mechanics itself [13]. Such tests re-
+quire sufficient shielding of the system from decoherence
+so that an observed (full or partial) collapse of the wavefunction
+could be unambiguously attributed to some novel
+nonunitary mechanism in nature, such as those proposed
+in dynamical reduction models [153–155]. This shielding,
+however, is difficult to implement in practice, because
+the large number of particles required for the reduction
+mechanism to become effective will also lead to strong
+decoherence [114, 156].
+The superpositions realized in
+current experiments are still not sufficiently macroscopic
+to rule out collapse theories, although it has been demon-
+strated [118] that matter-wave interferometry with large
+molecular clusters (in the mass range between 10^6 and
+10^8 amu) would be able to test the collapse theories proposed
+in Refs. [154, 155]; such experiments may soon
+become technologically feasible [11].
+
+A. Atoms in a cavity
+
+In 1996 Brune et al. generated a superposition of ra-
+diation fields with classically distinguishable phases in-
+volving several photons [12, 150, 157]. This experiment
+was the first to realize a mesoscopic Schrödinger-cat state
+and allowed for the controlled observation and manipu-
+lation of its decoherence. A rubidium atom is prepared
+in a superposition of energy eigenstates |g⟩ and |e⟩ cor-
+responding to two circular Rydberg states.
+The atom enters a cavity C containing a radiation field of
+a few photons. If the atom is in the state |g⟩, the
+field remains unchanged, whereas if it is in the state
+|e⟩, the coherent state |α⟩ of the field undergoes a phase
+shift φ, |α⟩ −→ |e^{iφ}α⟩; the experiment achieved φ ≈ π.
+An initial superposition of the atom is therefore amplified
+into an entangled atom–field state of the form
+(1/√2)(|g⟩ |α⟩ + |e⟩ |−α⟩). The atom then passes through
+an additional cavity, further transforming the superposi-
+tion. Finally, the energy state of the atom is measured.
+This disentangles the atom and the field and leaves the
+latter in a superposition of the mesoscopically distinct
+states |α⟩ and |−α⟩.
+To monitor the decoherence of this superposition, a
+second rubidium atom is sent through the apparatus. Af-
+ter interacting with the field superposition state in the
+cavity C, the atom will always be found in the same en-
+ergy state as the first atom if the superposition has not
+been decohered. This correlation rapidly decays with in-
+creasing decoherence. Thus, by recording the measure-
+
+ment correlation as a function of the wait time τ between
+sending the first and second atom through the appara-
+tus, the decoherence of the field state can be monitored.
+Experimental results were in excellent agreement with
+theoretical predictions [158, 159]. It was found that de-
+coherence became faster as the phase shift φ and the
+mean number n̄ = |α|^2 of photons in the cavity C were
+increased. Both results are expected, since an increase
+in φ and n̄ means that the components in the superposition
+become more distinguishable. Recent experiments
+have realized superposition states involving several tens
+of photons [160] and have monitored the gradual deco-
+herence of such states [161].
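The link between distinguishability and the parameters φ and n̄ can be made quantitative with the standard overlap of two coherent states, |⟨α|β⟩|² = exp(−|α − β|²). The sketch below applies it to the components |α⟩ and |e^{iφ}α⟩. The photon numbers and phase shifts are hypothetical values, not the experimental ones.

```python
import cmath
import math

def component_overlap(mean_photons, phase_shift):
    """|<alpha | e^{i phi} alpha>|^2 = exp(-nbar * |1 - e^{i phi}|^2), with nbar = |alpha|^2."""
    distance_sq = mean_photons * abs(1 - cmath.exp(1j * phase_shift)) ** 2
    return math.exp(-distance_sq)

for nbar in (1.0, 3.0, 10.0):
    for phi in (math.pi / 8, math.pi / 4, math.pi):
        print(f"nbar = {nbar:4.1f}  phi = {phi:.3f}  overlap = {component_overlap(nbar, phi):.3e}")
```

A smaller overlap means more distinguishable components. The overlap shrinks when either n̄ or φ (up to π) is increased, consistent with the faster decoherence reported for larger values of both.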
+
+B. Matter-wave interferometry
+
+In these experiments (see Ref. [11] for a review), spatial
+interference patterns are demonstrated for mesoscopic
+molecules ranging from fullerenes [162] to molecular clus-
+ters involving hundreds of atoms, with a total size of
+up to 60 Å and masses of several thousand amu (see
+Fig. 2) [163, 164]. Since the de Broglie wavelength of
+such molecules is on the order of picometers, standard
+double-slit interferometry is out of reach. Instead, the
+experiments make use of the Talbot effect, an interfer-
+ence phenomenon in which a plane wave incident on a
+diffraction grating creates an image of the grating at mul-
+tiples of a distance L behind the grating. In the experi-
+ment, the molecular density (at a macroscopic distance L
+from the grating) is scanned along the direction perpen-
+dicular to the molecular beam.
+An oscillatory density
+pattern (corresponding to the image of the slits in the
+grating) is observed, confirming the existence of coher-
+ence and interference between the different paths of each
+individual molecule passing through the grating. Recent
+experiments have used an improved version of the origi-
+nal Talbot–Lau setup [165], as well as optical ionization
+gratings [166].
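The picometer scale of the de Broglie wavelength and the macroscopic imaging distance can be checked with a short calculation. In the sketch below the molecular masses are the ones quoted in the caption of Fig. 2, while the beam velocity and the grating period are hypothetical values chosen for illustration. The Talbot length is taken as L = d²/λ. Some references include an additional factor of two.

```python
PLANCK = 6.62607015e-34      # Planck constant in J s
AMU = 1.66053906660e-27      # atomic mass unit in kg

def de_broglie_wavelength(mass_amu, velocity):
    """lambda = h / (m v), in meters."""
    return PLANCK / (mass_amu * AMU * velocity)

def talbot_length(grating_period, wavelength):
    """L = d^2 / lambda, in meters (convention without the factor of two)."""
    return grating_period ** 2 / wavelength

VELOCITY = 200.0             # m/s, hypothetical beam velocity
GRATING_PERIOD = 266e-9      # m, hypothetical grating period

for name, mass_amu in (("C60", 720), ("PFNS10", 6910)):
    wavelength = de_broglie_wavelength(mass_amu, VELOCITY)
    length = talbot_length(GRATING_PERIOD, wavelength)
    print(f"{name}: wavelength = {wavelength * 1e12:.2f} pm, Talbot length = {length * 100:.1f} cm")
```

With these assumed values the wavelengths come out at a few picometers or less, while the Talbot lengths are of the order of centimeters to tens of centimeters.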
+Decoherence is measured as a decrease of the visibil-
+ity of this pattern (Fig. 2). The controlled decoherence
+due to collisions with background gas particles [115, 116]
+and due to emission of thermal radiation from heated
+molecules [168] has been observed, showing a smooth de-
+cay of visibility in agreement with theoretical predictions
+[31, 117, 167]. These successes have led to speculations
+that one could perform similar experiments using even
+larger particles such as proteins and viruses [115, 169]
+or carbonaceous aerosols [170]. Such experiments will be
+limited by collisional and thermal decoherence and by
+noise due to inertial forces and vibrations [115, 169, 170].
+
+C. Superconducting systems
+
+Superconducting quantum interference devices (SQUIDs) and
+Cooper-pair boxes have important applications in quantum
+information processing. A
+
+
+[Figure 2. Left panel: gallery of molecules, reproduced from Ref. [163]. Right panel: plot of visibility (%) against pressure (in 10^{-6} mbar).]
+
+FIG. 2. Left: Molecular clusters used in recent interference
+experiments, drawn to scale (the scale bar represents 10 Å).
+Figure from Ref. [163].
+(a) Fullerene C60 (m = 720 amu,
+60 atoms). (b) Perfluoroalkylated nanosphere PFNS8 (m =
+5672 amu, 356 atoms).
+(c) PFNS10 (m = 6910 amu, 430
+atoms). (d) Tetraphenylporphyrin TPP (m = 614 amu, 78
+atoms).
+(e) TPPF84 (m = 2814 amu, 202 atoms).
+(f)
+TPPF152 (m = 5310 amu, 430 atoms). Right: Visibility of
+interference fringes of C70 fullerenes as a function of the pres-
+sure of the background gas. Measured values (circles) agree
+well with the theoretical prediction (solid line) [31, 117, 167]
+describing an exponential decay of visibility with pressure.
+Figure adapted from Ref. [115].
+
+SQUID consists of a ring of superconducting material
+interrupted by thin insulating barriers, the Josephson
+junctions (Fig. 3a).
+At sufficiently low temperatures,
+electrons of opposite spin condense into bosonic Cooper
+pairs.
+Quantum-mechanical tunneling of Cooper pairs
+through the junctions leads to the flow of a resistance-
+free supercurrent around the loop (Josephson effect),
+which creates a magnetic flux threading the loop. The
+collective center-of-mass motion of a macroscopic number
+(∼ 10^9) of Cooper pairs can then be represented by
+a wave function labeled by a single macroscopic variable,
+namely, the total trapped flux Φ through the loop.
+The two possible directions of the supercurrent define
+a qubit with basis states {|⟳⟩ , |⟲⟩}.
+By adjusting an
+external magnetic field, the SQUID can be biased such
+
+[Figure 3. (a) Schematic of a SQUID: superconducting ring, Josephson junctions, supercurrent. (b) Plot of the probability for ⟳ (axis ticks from 40% to 80%) against the delay time τ (0 to 35 ns).]
+
+FIG. 3. (a) Schematic illustration of a SQUID. A supercon-
+ducting ring is interrupted by Josephson junctions, leading
+to a dissipationless supercurrent.
+(b) Decoherence of a su-
+perposition of clockwise and counterclockwise supercurrents
+in a superconducting qubit. The damping of the oscillation
+amplitude corresponds to the gradual loss of coherence from
+the system. Figure adapted from Ref. [173].
+
+that the two lowest-lying energy eigenstates |0⟩ and |1⟩
+are equal-weight superpositions of the persistent-current
+states |⟳⟩ and |⟲⟩.
+Such superposition states involving µA currents were
+first experimentally observed in 2000 using spectroscopic
+measurements [171, 172]. Their decoherence was subse-
+quently measured using Ramsey interferometry [173], as
+follows. Two consecutive microwave pulses are applied to
+the system. During the delay time τ between the pulses,
+the system evolves freely. After application of the second
+pulse, the system is left in a superposition of |⟳⟩ and |⟲⟩,
+with the relative amplitudes exhibiting an oscillatory de-
+pendence on τ. A series of measurements in the basis
+{|⟳⟩ , |⟲⟩} over a range of delay times τ then allows one
+to trace out an oscillation of the occupation probabilities
+for |⟳⟩ and |⟲⟩ as a function of τ (Fig. 3b). The envelope
+of the oscillation is damped as a consequence of decoher-
+ence acting on the system during the free evolution of
+duration τ. From the decay of the envelope we can infer
+the decoherence timescale; the original experiment gave
+20 ns [173], while subsequent experiments have achieved
+decoherence times of several µs [174].
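Inferring a decoherence time from the decay of the envelope can be sketched as follows. The example is illustrative and does not reproduce the analysis of Ref. [173]: it generates noise-free synthetic data with an assumed exponential envelope, locates the extrema of the oscillation, and fits a straight line to the logarithm of their amplitudes. The oscillation frequency, baseline, and contrast are hypothetical. Only the 20 ns timescale is taken from the text.

```python
import math

def ramsey_probability(delay_ns, t2_ns=20.0, freq_ghz=0.2, baseline=0.6, contrast=0.2):
    """Assumed signal: exponentially damped cosine around a constant baseline."""
    envelope = contrast * math.exp(-delay_ns / t2_ns)
    return baseline + envelope * math.cos(2 * math.pi * freq_ghz * delay_ns)

def estimate_t2(delays, probabilities, baseline=0.6):
    """Fit log(amplitude) at the oscillation extrema with a line; T2 = -1 / slope."""
    amplitude = [abs(p - baseline) for p in probabilities]
    peaks = [(delays[k], amplitude[k]) for k in range(1, len(delays) - 1)
             if amplitude[k] > amplitude[k - 1] and amplitude[k] >= amplitude[k + 1]]
    xs = [t for t, _ in peaks]
    ys = [math.log(a) for _, a in peaks]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return -1.0 / slope

delays = [0.05 * k for k in range(701)]          # 0 to 35 ns
signal = [ramsey_probability(t) for t in delays]
t2_estimate = estimate_t2(delays, signal)
print(f"estimated decoherence time: {t2_estimate:.2f} ns")
```

Real data would carry measurement noise and possibly a non-exponential envelope, so a nonlinear least-squares fit of the full signal would usually be preferred over this peak-based estimate.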
+Superposition states and their decoherence have also
+been observed in superconducting devices whose key vari-
+able is charge (or phase), instead of the flux variable used
+in SQUIDs. Such Cooper-pair boxes consist of a small
+superconducting island onto which Cooper pairs can tun-
+nel from a reservoir through a Josephson junction. Two
+different charge states of the island, differing by at least
+one Cooper pair, define the basis states. Coherent os-
+cillations between such charge states were first observed
+in 1999 [175]. In 2002, Vion et al. [176] reported thou-
+sands of coherent oscillations with a decoherence time
+of 0.5 µs. Similar results have been obtained for phase
+qubits [177, 178], demonstrating decoherence times of
+several µs.
+
+VII. DECOHERENCE AND THE FOUNDATIONS OF QUANTUM MECHANICS
+
+Can decoherence address foundational problems? If so,
+which ones, and how? Addressing these subtle questions
+is beyond the scope of this review; a few brief remarks
+must suffice here.
+(See Refs. [6, 7, 9, 21] for in-depth
+discussions.) Decoherence, at its heart, is a technical re-
+sult concerning the dynamics and measurement statistics
+of open quantum systems. From this view, decoherence
+merely addresses a consistency problem, by explaining
+how and when the quantum probability distributions ap-
+proach the classically expected distributions. Since deco-
+herence follows directly from an application of the quan-
+tum formalism to interacting quantum systems, it is not
+tied to any particular interpretation of quantum mechan-
+ics, nor does it supply such an interpretation, nor does it
+amount to a theory that could make predictions beyond
+those of standard quantum mechanics.
+The predictively relevant part of decoherence theory
+relies on reduced density matrices, whose formalism and
+interpretation presume the collapse postulate and Born’s
+
+rule. If we understand the “quantum measurement problem”
+as the question of how to reconcile the linear, deterministic
+evolution described by the Schrödinger equation with the
+occurrence of random measurement outcomes, then
+decoherence has not solved this problem
+[6, 9].
+Decoherence does, however, address an aspect
+sometimes associated with the quantum measurement
+problem, namely the preferred-basis problem (at least
+in the sense described in Sec. II C). Further explorations
+of the role of the environment, such as in quantum Dar-
+winism (see Sec. II D), can help illuminate fundamental
+questions concerning information transfer and amplifica-
+tion in the quantum setting.
+Decoherence has been used to identify internal
+consistency issues in interpretations of quantum mechanics,
+and the picture associated with the decoherence process
+has sometimes been seen as suggestive of particular inter-
+pretations of quantum mechanics [6, 7]. Indeed, histori-
+cally decoherence theory arose in the context of Zeh’s [1]
+independent formulation of an Everett-style interpreta-
+tion (see Ref. [179] for a historical analysis). Ultimately,
+however, it seems that certain interpretations simply may
+be more in need of decoherence than others for defin-
+ing their structure; see Ref. [180] for the example of an
+Everett-style interpretation [23]. At the end of the day,
+any interpretation that does not involve entities, claims,
+or structures in contradiction with the predictions of de-
+coherence theory (which is to say, with the predictions of
+quantum mechanics) will arguably remain viable.
+
+[1] H.D. Zeh, Found. Phys. 1, 69 (1970)
+[2] W.H. Zurek, Phys. Rev. D 24, 1516 (1981)
+[3] W.H. Zurek, Phys. Rev. D 26, 1862 (1982).
+doi:
+10.1103/PhysRevD.26.1862
+
+[4] J.P. Paz, W.H. Zurek, in Coherent Atomic Matter
+Waves, Les Houches Session LXXII, Les Houches
+Summer School Series, vol. 72, ed. by R. Kaiser,
+C. Westbrook, F. David (Springer, Berlin, 2001),
+pp. 533–614
+
+[5] W.H. Zurek, Rev. Mod. Phys. 75, 715 (2003).
+doi:
+10.1103/RevModPhys.75.715
+
+[6] M. Schlosshauer, Rev. Mod. Phys. 76, 1267 (2004)
+[7] G.
+Bacciagaluppi,
+in
+The
+Stanford
+Encyclopedia
+of Philosophy, ed. by E.N. Zalta (2012).
+Online at
+http://plato.stanford.edu/archives/win2012/entries/qm-
+decoherence
+
+[8] E. Joos, H.D. Zeh, C. Kiefer, D. Giulini, J. Kupsch, I.O.
+Stamatescu, Decoherence and the Appearance of a Clas-
+sical World in Quantum Theory, 2nd edn. (Springer,
+New York, 2003)
+
+[9] M.
+Schlosshauer,
+Decoherence
+and
+the
+Quantum-
+to-Classical Transition (Springer, Berlin/Heidelberg,
+2007)
+
+[10] M. Schlosshauer, Phys. Rep. 831, 1 (2019).
+doi:
+10.1016/j.physrep.2019.10.001
+
+[11] K. Hornberger, S. Gerlich, S. Nimmrichter, P. Haslinger,
+M. Arndt, Rev. Mod. Phys. 84, 157 (2012)
+
+[12] J.M. Raimond, M. Brune, S. Haroche, Rev. Mod. Phys.
+
+73, 565 (2001)
+
+[13] A.J. Leggett, J. Phys.:
+Condens. Matter 14, R415
+(2002)
+
+[14] E. Joos, in Decoherence: Theoretical, Experimental, and
+Conceptual Problems, ed. by P. Blanchard, D. Giulini,
+E. Joos, C. Kiefer, I.O. Stamatescu (Springer, Berlin,
+2000), pp. 1–17
+
+[15] M. Schlosshauer, K. Camilleri, AIP Conf. Proc. 1327,
+26 (2011)
+
+[16] O. Kübler, H.D. Zeh, Ann. Phys. (N.Y.) 76, 405 (1973)
+[17] E. Joos, H.D. Zeh, Z. Phys. B: Condens. Matter 59, 223
+(1985)
+
+[18] W.H. Zurek, in Frontiers of Nonequilibrium Statistical
+Mechanics, ed. by G.T. Moore, M.O. Scully (Plenum
+Press, New York, 1986), pp. 145–149. First published
+in 1984 as Los Alamos report LAUR 84-2750
+
+[19] K. Hornberger,
+in Entanglement and Decoherence:
+Foundations and Modern Trends,
+Lecture Notes in
+Physics, vol. 768, ed. by A. Buchleitner, C. Viviescas,
+M. Tiersch (Springer, Berlin, 2009), pp. 221–276
+
+[20] H.P. Breuer, F. Petruccione, The Theory of Open Quan-
+tum Systems (Oxford University Press, Oxford, 2002)
+
+[21] M. Schlosshauer, A. Fine, in Quantum Mechanics at the
+Crossroads: New Perspectives from History, Philosophy
+and Physics, ed. by J. Evans, A. Thorndike (Springer,
+Berlin, 2006), pp. 125–148
+
+[22] W.K. Wootters, W.H. Zurek, Phys. Rev. D 19, 473
+(1979)
+
+
+20
+
+[23] H. Everett, Rev. Mod. Phys. 29, 454 (1957)
+[24] A. Einstein, B. Podolsky, N. Rosen, Phys. Rev. 47, 777
+(1935)
+
+[25] J.S. Bell, Physics 1, 195 (1964)
+[26] J.S. Bell, Rev. Mod. Phys. 38, 447 (1966)
+[27] A. Peres, Am. J. Phys. 46 (1978)
+[28] J.P. Paz, S. Habib, W.H. Zurek, Phys. Rev. D 47, 488
+(1993)
+
+[29] A.J. Leggett, S. Chakravarty, A.T. Dorsey, M.P.A.
+Fisher, A. Garg, Rev. Mod. Phys. 59, 1 (1987)
+
+[30] S.G. Mokarzel, A.N. Salgueiro, M.C. Nemes, Phys. Rev.
+A 65, 044101 (2002)
+
+[31] K. Hornberger, J.E. Sipe, Phys. Rev. A 68, 012105
+(2003)
+
+[32] K. Kraus, States, Effects, and Operations (Springer,
+Berlin, 1983)
+
+[33] W.H. Zurek, Phys. Today 44, 36 (1991). See also the
+updated version available as eprint quant-ph/0306072
+
+[34] M.R. Gallis, G.N. Fleming, Phys. Rev. A 42, 38 (1990)
+[35] L. Diósi, Europhys. Lett. 30, 63 (1995)
+[36] K. Hornberger, Phys. Rev. Lett. 97, 060601 (2006)
+[37] K. Hornberger, B. Vacchini, Phys. Rev. A 77, 022112
+(2008)
+
+[38] M. Busse, K. Hornberger, J. Phys. A: Math. Theor. 42,
+362001 (2009)
+
+[39] M. Busse, K. Hornberger, J. Phys. A: Math. Theor. 43,
+015303 (2010)
+
+[40] R.A. Harris, L. Stodolsky, J. Chem. Phys. 74, 2145
+(1981)
+
+[41] H.D. Zeh, in Decoherence:
+Theoretical, Experimen-
+tal, and Conceptual Problems, ed. by P. Blanchard,
+D. Giulini, E. Joos, C. Kiefer, I. Stamatescu, Lecture
+Notes in Physics No. 538 (Springer, Berlin, 2000), pp.
+19–42
+
+[42] J.P. Paz, W.H. Zurek, Phys. Rev. Lett. 82, 5181 (1999)
+[43] W.H. Zurek, S. Habib, J.P. Paz, Phys. Rev. Lett. 70,
+1187 (1993)
+
+[44] W.H. Zurek, Prog. Theor. Phys. 89, 281 (1993)
+[45] B.L. Hu, J.P. Paz, Y. Zhang, Phys. Rev. D 45, 2843
+(1992)
+
+[46] W.H. Zurek, Philos. Trans. R. Soc. London, Ser. A 356,
+1793 (1998)
+
+[47] L. Diósi, C. Kiefer, Phys. Rev. Lett. 85, 3552 (2000)
+[48] J. Eisert, Phys. Rev. Lett. 92, 210401 (2004)
+[49] G.M. Palma, K.A. Suominen, A.K. Ekert, Proc. R. Soc.
+Lond. A 452, 567 (1996)
+
+[50] D.A. Lidar, I.L. Chuang, K.B. Whaley, Phys. Rev. Lett.
+81, 2594 (1998)
+
+[51] P. Zanardi, M. Rasetti, Phys. Rev. Lett. 79, 3306 (1997)
+[52] P. Zanardi, M. Rasetti, Mod. Phys. Lett. B 11, 1085
+(1997)
+
+[53] P. Zanardi, Phys. Rev. A 57, 3276 (1998)
+[54] D.A. Lidar, D. Bacon, K.B. Whaley, Phys. Rev. Lett.
+82, 4556 (1999)
+
+[55] D. Bacon, J. Kempe, D.A. Lidar, K.B. Whaley, Phys.
+Rev. Lett. 85, 1758 (2000)
+
+[56] L.M. Duan, G.C. Guo, Phys. Rev. A 57, 737 (1998)
+[57] P. Zanardi, Phys. Rev. A 63, 012301 (2001)
+[58] E. Knill, R. Laflamme, L. Viola, Phys. Rev. Lett. 82,
+2525 (2000)
+
+[59] J. Kempe, D. Bacon, D.A. Lidar, K.B. Whaley, Phys.
+Rev. A 63, 042307 (2001)
+
+[60] D.A. Lidar, K.B. Whaley, in Irreversible Quantum Dy-
+namics, Springer Lecture Notes in Physics, vol. 622, ed.
+
+by F. Benatti, R. Floreanini (Springer, Berlin, 2003),
+pp. 83–120. Also available as eprint quant-ph/0301032
+
+[61] W.H. Zurek, in Science and Ultimate Reality, ed. by
+J.D. Barrow, P.C.W. Davies, C.H. Harper (Cambridge
+University Press, Cambridge, England, 2004), pp. 121–
+137
+
+[62] H. Ollivier, D. Poulin, W.H. Zurek, Phys. Rev. Lett. 93,
+220401 (2004)
+
+[63] H. Ollivier, D. Poulin, W.H. Zurek, Phys. Rev. A 72,
+042113 (2005)
+
+[64] R. Blume-Kohout, W.H. Zurek, Found. Phys. 35, 1857
+(2005)
+
+[65] R. Blume-Kohout, W.H. Zurek, Phys. Rev. A 73,
+062310 (2006)
+
+[66] W.H. Zurek, Nature Phys. 5, 181 (2009)
+[67] C.J. Riedel, W.H. Zurek, Phys. Rev. Lett. 105, 020404
+(2010)
+
+[68] C.J. Riedel, W.H. Zurek, New J. Phys. 13, 073038
+(2011)
+
+[69] C.J. Riedel, W.H. Zurek, M. Zwolak, New J. Phys. 14,
+083010 (2012)
+
+[70] M. Zwolak, C.J. Riedel, W.H. Zurek, Phys. Rev. Lett.
+112, 140406 (2014)
+
+[71] R. Blume-Kohout, W.H. Zurek, Phys. Rev. Lett. 101,
+240405 (2008)
+
+[72] H. Ollivier, W.H. Zurek, Phys. Rev. Lett. 88, 017901
+(2002)
+
+[73] S. Schneider, G.J. Milburn, Phys. Rev. A 57, 3748
+(1998)
+
+[74] C. Miquel, J.P. Paz, W.H. Zurek, Phys. Rev. Lett. 78,
+3971 (1997)
+
+[75] L.M.K. Vandersypen, I.L. Chuang, Rev. Mod. Phys. 76,
+1037 (2004)
+
+[76] J.M. Martinis, S. Nam, J. Aumentado, K.M. Lang,
+C. Urbina, Phys. Rev. B 67, 094510 (2003)
+
+[77] C.J. Myatt, B.E. King, Q.A. Turchette, C.A. Sackett,
+D. Kielpinski, W.M. Itano, C. Monroe, D.J. Wineland,
+Nature 403, 269 (2000)
+
+[78] S. Schneider, G.J. Milburn, Phys. Rev. A 59, 3766
+(1999)
+
+[79] Q.A. Turchette, C.J. Myatt, B.E. King, C.A. Sackett,
+D. Kielpinski, W.M. Itano, C. Monroe, D.J. Wineland,
+Phys. Rev. A 62, 053807 (2000)
+
+[80] K. Kraus, Ann. Phys. 64, 311 (1971)
+[81] V. Gorini, A. Kossakowski, E.C.G. Sudarshan, J. Math.
+Phys. 17, 821 (1976)
+
+[82] G. Lindblad, Commun. Math. Phys. 48, 119 (1976)
+[83] R. Alicki, M. Fannes, Quantum Dynamical Systems
+(Oxford University Press, Oxford, 2001)
+
+[84] R. Alicki, K. Lendi, Quantum Dynamical Semigroups
+and Applications, Lect. Notes Phys., vol. 717, 2nd edn.
+(Springer, Berlin/Heidelberg, 2007)
+
+[85] F. Benatti, R. Floreanini, Int. J. Mod. Phys. B 19, 3063
+(2005). doi:10.1142/S0217979205032097
+
+[86] R. Dümcke, H. Spohn, Z. Phys. B 34, 419 (1979)
+[87] V. Gorini, A. Frigerio, M. Verri, A. Kossakowski, E.C.G.
+Sudarshan, Rep. Math. Phys. 13, 149 (1978)
+
+[88] E.B. Davies, Commun. Math. Phys. 39, 91 (1974)
+[89] A. Kossakowski, Rep. Math. Phys. 3, 247 (1972)
+[90] A. Barchielli, V.P. Belavkin, J. Phys. A: Math. Gen. 24,
+1495 (1991)
+
+[91] V.P. Belavkin, in Lecture Notes in Control and Infor-
+mation Sciences, vol. 121 (Springer, Berlin, 1989), pp.
+245–265
+
+
+21
+
+[92] V.P. Belavkin, J. Phys. A: Math. Gen. 22, L1109 (1989)
+[93] V.P. Belavkin, Phys. Lett. A 140, 355 (1989)
+[94] V.P. Belavkin,
+in Chaos:
+The Interplay Between
+Stochastic and Deterministic Behaviour, ed. by
+P. Garbaczewski, M. Wolf, A. Veron, Lecture Notes in Physics
+(Springer, 1995), pp. 21–41
+
+[95] L. Diósi, Phys. Lett. A 129, 419 (1988)
+[96] L. Diósi, Phys. Lett. A 132, 233 (1988)
+[97] L. Diósi, J. Phys. A 21, 2885 (1988)
+[98] N. Gisin, Phys. Rev. Lett. 52, 1657 (1984)
+[99] N. Gisin, Helv. Phys. Acta 62, 363 (1989)
+
+[100] H.M. Wiseman, Phys. Rev. A 49, 2133 (1994)
+[101] H.S. Goan, G.J. Milburn, H.M. Wiseman, H.B. Sun,
+Phys. Rev. B 63, 125326 (2001)
+
+[102] M.B. Plenio, P.L. Knight, Rev. Mod. Phys. 70, 101
+(1998)
+
+[103] N.V. Prokof’ev, P.C.E. Stamp, Rep. Prog. Phys. 63,
+669 (2000)
+
+[104] M. Dubé, P.C.E. Stamp, Chem. Phys. 268, 257 (2001)
+[105] S. Gröblacher, A. Trubarov, N. Prigge, M. Aspelmeyer,
+J. Eisert, Nature Comm. 6, 7606 (2015)
+
+[106] S. Chaturvedi, F. Shibata, Z. Phys. B 35, 297 (1979)
+[107] F. Shibata, T. Arimitsu, J. Phys. Soc. Jpn. 49, 891
+(1980)
+
+[108] A. Royer, Phys. Rev. A 6, 1741 (1972)
+[109] A. Royer, Phys. Lett. A 315, 335 (2003)
+[110] R. Feynman, F.L. Vernon, Ann. Phys. (N.Y.) 24, 118
+(1963)
+
+[111] A. Caldeira, A. Leggett, Ann. Phys. (N.Y.) 149, 374
+(1983)
+
+[112] O.V. Lounasmaa, Experimental Principles and Methods
+below 1 K (Academic Press, New York, 1974)
+
+[113] S.L. Adler, J. Phys. A: Math. Gen. 39, 14067 (2006)
+[114] M. Tegmark, Found. Phys. Lett. 6, 571 (1993)
+[115] L.
+Hackermüller,
+K.
+Hornberger,
+B.
+Brezger,
+A. Zeilinger,
+M. Arndt,
+Appl. Phys. B 77,
+781
+(2003)
+
+[116] K. Hornberger, S. Uttenthaler, B. Brezger,
+L. Hackermüller, M. Arndt, A. Zeilinger, Phys. Rev. Lett. 90,
+160401 (2003)
+
+[117] K. Hornberger, J.E. Sipe, M. Arndt, Phys. Rev. A 70,
+053608 (2004)
+
+[118] S. Nimmrichter, K. Hornberger, P. Haslinger, M. Arndt,
+Phys. Rev. A 83, 043621 (2011)
+
+[119] D.A. Kokorowski, A.D. Cronin, T.D. Roberts, D.E.
+Pritchard, Phys. Rev. Lett. 86, 2191 (2001)
+
+[120] H. Uys, J.D. Perreault, A.D. Cronin, Phys. Rev. Lett.
+95, 150403 (2005)
+
+[121] A.O. Caldeira, A.J. Leggett, Physica A 121, 587 (1983)
+[122] D.F. Walls, M.J. Collett, G.J. Milburn, Phys. Rev. D
+32, 3208 (1985)
+
+[123] U. Weiss, Quantum Dissipative Systems (World Scien-
+tific, Singapore, 1999)
+
+[124] M. Schlosshauer, A.P. Hines, G.J. Milburn, Phys. Rev.
+A 77, 022111 (2008)
+
+[125] P.C.E. Stamp, in Tunnelling in Complex Systems, ed.
+by S. Tomsovic (World Scientific, Singapore, 1998), pp.
+101–197
+
+[126] N.V. Prokof’ev, P.C.E. Stamp, (1995)
+[127] N.V. Prokof’ev, P.C.E. Stamp, J. Phys. Chem. Lett. 5,
+L663 (1993)
+
+[128] V.V. Dobrovitski, H.A. De Raedt, M.I. Katsnelson,
+B.N. Harmon, Phys. Rev. Lett. 90, 210401 (2003). doi:
+10.1103/PhysRevLett.90.210401
+
+[129] F.M. Cucchietti, J.P. Paz, W.H. Zurek, Phys. Rev. A
+72, 052113 (2005). doi:10.1103/PhysRevA.72.052113
+
+[130] A.O. Caldeira, A.H. Castro Neto, T.O. de Carvalho,
+Phys. Rev. B 48, 13974 (1993)
+
+[131] C. Miquel, J.P. Paz, R. Perazzo, Phys. Rev. A 54, 2605
+(1996)
+
+[132] A.M. Steane, Phys. Rev. Lett. 77, 793 (1996)
+[133] P.W. Shor, Phys. Rev. A 52, R2493 (1995)
+[134] A.M. Steane, in Decoherence and Its Implications in
+Quantum Computation and Information Transfer, ed.
+by P. Turchi, A. Gonis (IOS Press, Amsterdam, 2001),
+pp. 284–298. Also available as eprint quant-ph/0304016
+
+[135] E. Knill, R. Laflamme, A. Ashikhmin, H. Barnum,
+L. Viola, W. Zurek, LA Science 27, 188 (2002)
+
+[136] M.A. Nielsen, I.L. Chuang, Quantum Computation and
+Quantum Information (Cambridge University Press,
+Cambridge, 2000)
+
+[137] J.H. Reina, L. Quiroga, N.F. Johnson, Phys. Rev. A
+65, 032326 (2002)
+
+[138] D. Bacon, D.A. Lidar, K.B. Whaley, Phys. Rev. A 60,
+1944 (1999)
+
+[139] D.A. Lidar, D. Bacon, J. Kempe, K.B. Whaley, Phys.
+Rev. A 63, 022307 (2001)
+
+[140] D.A.R. Dalvit, J. Dziarmaga, W.H. Zurek, Phys. Rev.
+A 62, 013607 (2000)
+
+[141] J.F. Poyatos, J.I. Cirac, P. Zoller, Phys. Rev. Lett. 77,
+4728 (1996)
+
+[142] A.R.R. Carvalho, P. Milman, R.L. de Matos Filho,
+L. Davidovich, Phys. Rev. Lett. 86, 4988 (2001)
+
+[143] L. Viola, S. Lloyd, Phys. Rev. A 58, 2733 (1998)
+[144] L. Viola, E. Knill, S. Lloyd, Phys. Rev. Lett. 82, 2417
+(1999)
+
+[145] P. Zanardi, Phys. Lett. A 258, 77 (1999)
+[146] L. Viola, E. Knill, S. Lloyd, Phys. Rev. Lett. 85, 3520
+(2000)
+
+[147] L.A. Wu, D.A. Lidar, Phys. Rev. Lett. 88, 207902
+(2002)
+
+[148] L.A. Wu, M.S. Byrd, D.A. Lidar, Phys. Rev. Lett. 89,
+127901 (2002)
+
+[149] M. Arndt, K. Hornberger, A. Zeilinger, Phys. World 18,
+35 (2005)
+
+[150] R. Kaiser, C. Westbrook, F. David (eds.).
+Coherent
+Atomic Matter Waves, Les Houches Session LXXII, Les
+Houches Summer School Series (Springer, Berlin, 2001)
+
+[151] M. Blencowe, Phys. Rep. 395, 159 (2004)
+[152] M. Aspelmeyer, T.J. Kippenberg, F. Marquardt, Rev.
+Mod. Phys. 86, 1391 (2014)
+
+[153] A. Bassi, G.C. Ghirardi, Phys. Rep. 379, 257 (2003)
+[154] S.L. Adler, J. Phys. A 40, 2935 (2007)
+[155] A. Bassi, D.A. Deckert, L. Ferialdi, EPL 92, 50006
+(2010)
+
+[156] S. Nimmrichter, K. Hornberger, Phys. Rev. Lett. 110,
+160403 (2013)
+
+[157] M. Brune, E. Hagley, J. Dreyer, X. Maître, A. Maali,
+C. Wunderlich, J.M. Raimond, S. Haroche, Phys. Rev.
+Lett. 77, 4887 (1996)
+
+[158] L. Davidovich, M. Brune, J.M. Raimond, S. Haroche,
+Phys. Rev. A 53, 1295 (1996)
+
+[159] X. Maître, E. Hagley, J. Dreyer, A. Maali, C.W.M.
+Brune, J.M. Raimond, S. Haroche, J. Mod. Opt. 44,
+2023 (1997)
+
+[160] A.
+Auffeves,
+P.
+Maioli,
+T.
+Meunier,
+S.
+Gleyzes,
+G. Nogues, M. Brune, J.M. Raimond, S. Haroche, Phys.
+Rev. Lett. 91, 230405 (2003)
+
+
+22
+
+[161] S. Deléglise, I. Dotsenko, C. Sayrin, J. Bernu, M. Brune,
+J.M. Raimond, S. Haroche, Nature 455, 510 (2008)
+
+[162] M.
+Arndt,
+O.
+Nairz,
+J.
+Vos-Andreae,
+C.
+Keller,
+G. van der Zouw, A. Zeilinger, Nature 401, 680 (1999)
+
+[163] S. Gerlich, S. Eibenberger, M. Tomandl,
+S. Nimmrichter, K. Hornberger, P.J. Fagan, J. Tüxen, M. Mayor,
+M. Arndt, Nature Comm. 2, 263 (2012)
+
+[164] S. Eibenberger,
+S. Gerlich,
+M. Arndt,
+M. Mayor,
+J. Tüxen, Phys. Chem. Chem. Phys. 15, 14696 (2013)
+
+[165] S. Gerlich, L. Hackermüller, K. Hornberger, A. Stibor,
+H. Ulbricht, F. Goldfarb, T. Savas, M. Müri, M. Mayor,
+M. Arndt, Nature Phys. 3, 711 (2007)
+
+[166] P. Haslinger, N. Dörre, P. Geyer, J. Rodewald,
+S. Nimmrichter, M. Arndt, Nature Phys. 9, 144 (2013)
+
+[167] K. Hornberger, L. Hackermüller, M. Arndt, Phys. Rev.
+A 71, 023601 (2005)
+
+[168] L.
+Hackermüller,
+K.
+Hornberger,
+B.
+Brezger,
+A. Zeilinger, M. Arndt, Nature 427, 711 (2004)
+
+[169] M.
+Arndt,
+O.
+Nairz,
+A.
+Zeilinger,
+in
+Quantum
+[Un]Speakables:
+From Bell to Quantum Information,
+ed. by R.A. Bertlmann, A. Zeilinger (Springer, Berlin,
+2002), pp. 333–351
+
+[170] K. Hornberger, Phys. Rev. A 73, 052102 (2006)
+[171] J.R. Friedman, V. Patel, W. Chen, S.K. Tolpygo, J.E.
+
+Lukens, Nature 406, 43 (2000)
+
+[172] C.H. van der Wal, A.C.J. ter Haar, F.K. Wilhelm, R.N.
+Schouten, C.J.P.M. Harmans, T.P. Orlando, S. Lloyd,
+J.E. Mooij, Science 290, 773 (2000)
+
+[173] I. Chiorescu, Y. Nakamura, C.J.P.M. Harmans, J.E.
+Mooij, Science 299, 1869 (2003)
+
+[174] P. Bertet, I. Chiorescu, G. Burkard, K. Semba, C.J.P.M.
+Harmans, D.P. DiVincenzo, J.E. Mooij, Phys. Rev.
+Lett. 95, 257002 (2005)
+
+[175] Y. Nakamura, Y.A. Pashkin, J.S. Tsai, Nature 398, 786
+(1999)
+
+[176] D. Vion, A. Aassime, A. Cottet, P. Joyez, H. Pothier,
+C. Urbina, D. Esteve, M.H. Devoret, Science 296, 886
+(2002)
+
+[177] Y. Yu, S. Han, X. Chu, S.I. Chu, Z. Wang, Science 296,
+889 (2002)
+
+[178] J.M. Martinis, S. Nam, J. Aumentado, C. Urbina, Phys.
+Rev. Lett. 89, 117901 (2002)
+
+[179] K. Camilleri, Stud. Hist. Phil. Mod. Phys. 40, 290
+(2009)
+
+[180] D. Wallace, in Many Worlds? Everett, Quantum Theory
+and Reality, ed. by S. Saunders, J. Barrett, A. Kent,
+D. Wallace (Oxford University Press, Oxford, 2010), pp.
+53–72
+
+
diff --git a/papers/references/temp/Sparso2001.pdf b/papers/references/temp/Sparso2001.pdf
new file mode 100644
index 00000000..0507b33e
--- /dev/null
+++ b/papers/references/temp/Sparso2001.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:35bad8700412d0272a2d3b70216bb7581007c6ae945b927201e31b359965e589
+size 2602288
diff --git a/papers/references/temp/Sparso2001.txt b/papers/references/temp/Sparso2001.txt
new file mode 100644
index 00000000..a1dc2b4b
--- /dev/null
+++ b/papers/references/temp/Sparso2001.txt
@@ -0,0 +1,24580 @@
+PRINCIPLES OF
+ASYNCHRONOUS CIRCUIT DESIGN
+– A Systems Perspective
+
+Edited by
+JENS SPARSØ
+Technical University of Denmark
+
+STEVE FURBER
+The University of Manchester, UK
+
+Kluwer Academic Publishers
+Boston/Dordrecht/London
+
+
+Contents
+
+Preface
+xi
+
+Part I
+Asynchronous circuit design – A tutorial
+Author: Jens Sparsø
+
+1
+Introduction
+3
+1.1
+Why consider asynchronous circuits?
+3
+1.2
+Aims and background
+4
+1.3
+Clocking versus handshaking
+5
+1.4
+Outline of Part I
+8
+
+2
+Fundamentals
+9
+2.1
+Handshake protocols
+9
+2.1.1
+Bundled-data protocols
+9
+2.1.2
+The 4-phase dual-rail protocol
+11
+2.1.3
+The 2-phase dual-rail protocol
+13
+2.1.4
+Other protocols
+13
+2.2
+The Muller C-element and the indication principle
+14
+2.3
+The Muller pipeline
+16
+2.4
+Circuit implementation styles
+17
+2.4.1
+4-phase bundled-data
+18
+2.4.2
+2-phase bundled data (Micropipelines)
+19
+2.4.3
+4-phase dual-rail
+20
+2.5
+Theory
+23
+2.5.1
+The basics of speed-independence
+23
+2.5.2
+Classification of asynchronous circuits
+25
+2.5.3
+Isochronic forks
+26
+2.5.4
+Relation to circuits
+26
+2.6
+Test
+27
+2.7
+Summary
+28
+
+3
+Static data-flow structures
+29
+3.1
+Introduction
+29
+3.2
+Pipelines and rings
+30
+
+v
+
+
+vi
+PRINCIPLES OF ASYNCHRONOUS CIRCUIT DESIGN
+
+3.3
+Building blocks
+31
+3.4
+A simple example
+33
+3.5
+Simple applications of rings
+35
+3.5.1
+Sequential circuits
+35
+3.5.2
+Iterative computations
+35
+3.6
+FOR, IF, and WHILE constructs
+36
+3.7
+A more complex example: GCD
+38
+3.8
+Pointers to additional examples
+39
+3.8.1
+A low-power filter bank
+39
+3.8.2
+An asynchronous microprocessor
+39
+3.8.3
+A fine-grain pipelined vector multiplier
+40
+3.9
+Summary
+40
+
+4
+Performance
+41
+4.1
+Introduction
+41
+4.2
+A qualitative view of performance
+42
+4.2.1
+Example 1: A FIFO used as a shift register
+42
+4.2.2
+Example 2: A shift register with parallel load
+44
+4.3
+Quantifying performance
+47
+4.3.1
+Latency, throughput and wavelength
+47
+4.3.2
+Cycle time of a ring
+49
+4.3.3
+Example 3: Performance of a 3-stage ring
+51
+4.3.4
+Final remarks
+52
+4.4
+Dependency graph analysis
+52
+4.4.1
+Example 4: Dependency graph for a pipeline
+52
+4.4.2
+Example 5: Dependency graph for a 3-stage ring
+54
+4.5
+Summary
+56
+
+5
+Handshake circuit implementations
+57
+5.1
+The latch
+57
+5.2
+Fork, join, and merge
+58
+5.3
+Function blocks – The basics
+60
+5.3.1
+Introduction
+60
+5.3.2
+Transparency to handshaking
+61
+5.3.3
+Review of ripple-carry addition
+64
+5.4
+Bundled-data function blocks
+65
+5.4.1
+Using matched delays
+65
+5.4.2
+Delay selection
+66
+5.5
+Dual-rail function blocks
+67
+5.5.1
+Delay insensitive minterm synthesis (DIMS)
+67
+5.5.2
+Null Convention Logic
+69
+5.5.3
+Transistor-level CMOS implementations
+70
+5.5.4
+Martin’s adder
+71
+5.6
+Hybrid function blocks
+73
+5.7
+MUX and DEMUX
+75
+5.8
+Mutual exclusion, arbitration and metastability
+77
+5.8.1
+Mutual exclusion
+77
+5.8.2
+Arbitration
+79
+5.8.3
+Probability of metastability
+79
+
+
+Contents
+vii
+
+5.9
+Summary
+80
+
+6
+Speed-independent control circuits
+81
+6.1
+Introduction
+81
+6.1.1
+Asynchronous sequential circuits
+81
+6.1.2
+Hazards
+82
+6.1.3
+Delay models
+83
+6.1.4
+Fundamental mode and input-output mode
+83
+6.1.5
+Synthesis of fundamental mode circuits
+84
+6.2
+Signal transition graphs
+86
+6.2.1
+Petri nets and STGs
+86
+6.2.2
+Some frequently used STG fragments
+88
+6.3
+The basic synthesis procedure
+91
+6.3.1
+Example 1: a C-element
+92
+6.3.2
+Example 2: a circuit with choice
+92
+6.3.3
+Example 2: Hazards in the simple gate implementation
+94
+6.4
+Implementations using state-holding gates
+96
+6.4.1
+Introduction
+96
+6.4.2
+Excitation regions and quiescent regions
+97
+6.4.3
+Example 2: Using state-holding elements
+98
+6.4.4
+The monotonic cover constraint
+98
+6.4.5
+Circuit topologies using state-holding elements
+99
+6.5
+Initialization
+101
+6.6
+Summary of the synthesis process
+101
+6.7
+Petrify: A tool for synthesizing SI circuits from STGs
+102
+6.8
+Design examples using Petrify
+104
+6.8.1
+Example 2 revisited
+104
+6.8.2
+Control circuit for a 4-phase bundled-data latch
+106
+6.8.3
+Control circuit for a 4-phase bundled-data MUX
+109
+6.9
+Summary
+113
+
+7
+Advanced 4-phase bundled-data
+protocols and circuits
+115
+
+7.1
+Channels and protocols
+115
+7.1.1
+Channel types
+115
+7.1.2
+Data-validity schemes
+116
+7.1.3
+Discussion
+116
+7.2
+Static type checking
+118
+7.3
+More advanced latch control circuits
+119
+7.4
+Summary
+121
+
+8
+High-level languages and tools
+123
+8.1
+Introduction
+123
+8.2
+Concurrency and message passing in CSP
+124
+8.3
+Tangram: program examples
+126
+8.3.1
+A 2-place shift register
+126
+8.3.2
+A 2-place (ripple) FIFO
+126
+
+
+viii
+PRINCIPLES OF ASYNCHRONOUS CIRCUIT DESIGN
+
+8.3.3
+GCD using while and if statements
+127
+8.3.4
+GCD using guarded commands
+128
+8.4
+Tangram: syntax-directed compilation
+128
+8.4.1
+The 2-place shift register
+129
+8.4.2
+The 2-place FIFO
+130
+8.4.3
+GCD using guarded repetition
+131
+8.5
+Martin’s translation process
+133
+8.6
+Using VHDL for asynchronous design
+134
+8.6.1
+Introduction
+134
+8.6.2
+VHDL versus CSP-type languages
+135
+8.6.3
+Channel communication and design flow
+136
+8.6.4
+The abstract channel package
+138
+8.6.5
+The real channel package
+142
+8.6.6
+Partitioning into control and data
+144
+8.7
+Summary
+146
+Appendix: The VHDL channel packages
+148
+A.1
+The abstract channel package
+148
+A.2
+The real channel package
+150
+
+Part II
+Balsa - An Asynchronous Hardware Synthesis System
+Author: Doug Edwards, Andrew Bardsley
+
+9
+An introduction to Balsa
+155
+9.1
+Overview
+155
+9.2
+Basic concepts
+156
+9.3
+Tool set and design flow
+159
+9.4
+Getting started
+159
+9.4.1
+A single-place buffer
+161
+9.4.2
+Two-place buffers
+163
+9.4.3
+Parallel composition and module reuse
+164
+9.4.4
+Placing multiple structures
+165
+9.5
+Ancillary Balsa tools
+166
+9.5.1
+Makefile generation
+166
+9.5.2
+Estimating area cost
+167
+9.5.3
+Viewing the handshake circuit graph
+168
+9.5.4
+Simulation
+168
+
+10
+The Balsa language
+173
+10.1
+Data types
+173
+10.2
+Data typing issues
+176
+10.3
+Control flow and commands
+178
+10.4
+Binary/unary operators
+181
+10.5
+Program structure
+181
+10.6
+Example circuits
+183
+10.7
+Selecting channels
+190
+
+
+Contents
+ix
+11
+Building library components
+193
+11.1
+Parameterised descriptions
+193
+11.1.1 A variable width buffer definition
+193
+11.1.2 Pipelines of variable width and depth
+194
+11.2
+Recursive definitions
+195
+11.2.1 An n-way multiplexer
+195
+11.2.2 A population counter
+197
+11.2.3 A Balsa shifter
+200
+11.2.4 An arbiter tree
+202
+
+12
+A simple DMA controller
+205
+12.1
+Global registers
+205
+12.2
+Channel registers
+206
+12.3
+DMA controller structure
+207
+12.4
+The Balsa description
+211
+12.4.1 Arbiter tree
+211
+12.4.2 Transfer engine
+212
+12.4.3 Control unit
+213
+
+Part III
+Large-Scale Asynchronous Designs
+
+13
+Descale
+221
+Joep Kessels & Ad Peeters, Torsten Kramer and Volker Timm
+13.1
+Introduction
+222
+13.2
+VLSI programming of asynchronous circuits
+223
+13.2.1 The Tangram toolset
+223
+13.2.2 Handshake technology
+225
+13.2.3 GCD algorithm
+226
+13.3
+Opportunities for asynchronous circuits
+231
+13.4
+Contactless smartcards
+232
+13.5
+The digital circuit
+235
+13.5.1 The 80C51 microcontroller
+236
+13.5.2 The prefetch unit
+239
+13.5.3 The DES coprocessor
+241
+13.6
+Results
+243
+13.7
+Test
+245
+13.8
+The power supply unit
+246
+13.9
+Conclusions
+247
+
+14
+An Asynchronous Viterbi Decoder
+249
+Linda E. M. Brackenbury
+14.1
+Introduction
+249
+14.2
+The Viterbi decoder
+250
+14.2.1 Convolution encoding
+250
+14.2.2 Decoder principle
+251
+14.3
+System parameters
+253
+14.4
+System overview
+254
+
+
+x
+PRINCIPLES OF ASYNCHRONOUS CIRCUIT DESIGN
+
+14.5
+The Path Metric Unit (PMU)
+256
+14.5.1 Node pair design in the PMU
+256
+14.5.2 Branch metrics
+259
+14.5.3 Slot timing
+261
+14.5.4 Global winner identification
+262
+14.6
+The History Unit (HU)
+264
+14.6.1 Principle of operation
+264
+14.6.2 History Unit backtrace
+264
+14.6.3 History Unit implementation
+267
+14.7
+Results and design evaluation
+269
+14.8
+Conclusions
+271
+14.8.1 Acknowledgement
+272
+14.8.2 Further reading
+272
+
+15
+Processors
+273
+Jim D. Garside
+15.1
+An introduction to the Amulet processors
+274
+15.1.1 Amulet1 (1994)
+274
+15.1.2 Amulet2e (1996)
+275
+15.1.3 Amulet3i (2000)
+275
+15.2
+Some other asynchronous microprocessors
+276
+15.3
+Processors as design examples
+278
+15.4
+Processor implementation techniques
+279
+15.4.1 Pipelining processors
+279
+15.4.2 Asynchronous pipeline architectures
+281
+15.4.3 Determinism and non-determinism
+282
+15.4.4 Dependencies
+288
+15.4.5 Exceptions
+297
+15.5
+Memory – a case study
+302
+15.5.1 Sequential accesses
+302
+15.5.2 The Amulet3i RAM
+303
+15.5.3 Cache
+307
+15.6
+Larger asynchronous systems
+310
+15.6.1 System-on-Chip (DRACO)
+310
+15.6.2 Interconnection
+310
+15.6.3 Balsa and the DMA controller
+312
+15.6.4 Calibrated time delays
+313
+15.6.5 Production test
+314
+15.7
+Summary
+315
+
+Epilogue
+317
+
+References
+319
+
+Index
+333
+
+
+Preface
+
+This book was compiled to address a perceived need for an introductory text
+on asynchronous design. There are several highly technical books on aspects of
+the subject, but no obvious starting point for a designer who wishes to become
+acquainted for the first time with asynchronous technology. We hope this book
+will serve as that starting point.
+The reader is assumed to have some background in digital design. We as-
+sume that concepts such as logic gates, flip-flops and Boolean logic are famil-
+iar. Some of the latter sections also assume familiarity with the higher levels of
+digital design such as microprocessor architectures and systems-on-chip, but
+readers unfamiliar with these topics should still find the majority of the book
+accessible.
+The intended audience for the book comprises the following groups:
+
+Industrial designers with a background in conventional (clocked) digital
+design who wish to gain an understanding of asynchronous design in
+order, for example, to establish whether or not it may be advantageous
+to use asynchronous techniques in their next design task.
+
+Students in Electronic and/or Computer Engineering who are taking a
+course that includes aspects of asynchronous design.
+
+The book is structured in three parts. Part I is a tutorial in asynchronous
+design. It addresses the most important issue for the beginner, which is how to
+think about asynchronous systems. The first big hurdle to be cleared is that of
+mindset – asynchronous design requires a different mental approach from that
+normally employed in clocked design. Attempts to take an existing clocked
+system, strip out the clock and simply replace it with asynchronous handshakes
+are doomed to disappoint. Another hurdle is that of circuit design methodol-
+ogy – the existing body of literature presents an apparent plethora of disparate
+approaches. The aim of the tutorial is to get behind this and to present a single
+unified and coherent perspective which emphasizes the common ground. In
+this way the tutorial should enable the reader to begin to understand the char-
+acteristics of asynchronous systems in a way that will enable them to ‘think
+
+xi
+
+
+xii
+PRINCIPLES OF ASYNCHRONOUS CIRCUIT DESIGN
+
+outside the box’ of conventional clocked design and to create radical new de-
+sign solutions that fully exploit the potential of clockless systems.
+Once the asynchronous design mindset has been mastered, the second hur-
+dle is designer productivity. VLSI designers are used to working in a highly
+productive environment supported by powerful automatic tools. Asynchronous
+design lags in its tools environment, but things are improving. Part II of the
+book gives an introduction to Balsa, a high-level synthesis system for asyn-
+chronous circuits. It is written by Doug Edwards (who has managed the Balsa
+development at the University of Manchester since its inception) and Andrew
+Bardsley (who has written most of the software). Balsa is not the solution to all
+asynchronous design problems, but it is capable of synthesizing very complex
+systems (for example, the 32-channel DMA controller used on the DRACO
+chip described in Chapter 15) and it is a good way to develop an understanding
+of asynchronous design ‘in the large’.
+Knowing how to think about asynchronous design and having access to suit-
+able tools leaves one question: what can be built in this way? In Part III we
+offer a number of examples of complex asynchronous systems as illustrations
+of the answer to this question. In each of these examples the designers have
+been asked to provide descriptions that will provide the reader with insights
+into the design process. The examples include a commercial smart card chip
+designed at Philips and a Viterbi decoder designed at the University of Manch-
+ester. Part III closes with a discussion of the issues that come up in the design
+of advanced asynchronous microprocessors, focusing on the Amulet processor
+series, again developed at the University of Manchester.
+Although the book is a compilation of contributions from different authors,
+each of these has been specifically written with the goals of the book in mind –
+to provide answers to the sorts of questions that a newcomer to asynchronous
+design is likely to ask. In order to keep the book accessible and to avoid it
+becoming an intimidating size, much valuable work has had to be omitted. Our
+objective in introducing you to asynchronous design is that you might become
+acquainted with it. If your relationship develops further, perhaps even into the
+full-blown affair that has smitten a few, included among whose number are the
+contributors to this book, you will, of course, want to know more. The book
+includes an extensive bibliography that will provide food enough for even the
+most insatiable of appetites.
+
+JENS SPARSØ AND STEVE FURBER, SEPTEMBER 2001
+
+
+Acknowledgments
+
+Many people have helped significantly in the creation of this book. In addi-
+tion to writing their respective chapters, several of the authors have also read
+and commented on drafts of other parts of the book, and the quality of the work
+as a whole has been enhanced as a result.
+The editors are also grateful to Alan Williams, Russell Hobson and Steve
+Temple, for their careful reading of drafts of this book and their constructive
+suggestions for improvement.
+Part I of the book has been used as a course text and the quality and con-
+sistency of the content improved by feedback from the students on the spring
+2001 course “49425 Design of Asynchronous Circuits” at DTU.
+Any remaining errors or omissions are the responsibility of the editors.
+
+The writing of this book was initiated as a dissemination effort within the
+European Low-Power Initiative for Electronic System Design (ESD-LPD), and
+this book is part of the book series from this initiative. As will become clear,
+the book goes far beyond the dissemination of results from projects within
+the ESD-LPD cluster, and the editors would like to acknowledge the support
+of the working group on asynchronous circuit design, ACiD-WG, that has
+provided a fruitful forum for interaction and the exchange of ideas. The ACiD-
+WG has been funded by the European Commission since 1992 under several
+Framework Programmes: FP3 Basic Research (EP7225), FP4 Technologies
+for Components and Subsystems (EP21949), and FP5 Microelectronics (IST-
+1999-29119).
+
+
+
+Foreword
+
+This book is the third in a series on novel low-power design architectures,
+methods and design practices. It results from a large European project started
+in 1997, whose goal is to promote the further development and the faster and
+wider industrial use of advanced design methods for reducing the power con-
+sumption of electronic systems.
+Low-power design became crucial with the widespread use of portable in-
+formation and communication terminals, where a small battery has to last for a
+long period. High-performance electronics, in addition, suffers from a contin-
+uing increase in the dissipated power per square millimeter of silicon, due to
+increasing clock-rates, which causes cooling and reliability problems or other-
+wise limits performance.
+The European Union’s Information Technologies Programme ‘Esprit’ there-
+fore launched a ‘Pilot action for Low-Power Design’, which eventually grew
+to 19 R&D projects and one coordination project, with an overall budget of 14
+million EUROs. This action is known as the European Low-Power Initiative
+for Electronic System Design (ESD-LPD) and will be completed in the year
+2002. It aims to develop or demonstrate new design methods for power reduc-
+tion, while the coordination project ensures that the methods, experiences and
+results are properly documented and publicised.
+The initiative addresses low-power design at various levels. These include
+system and algorithmic level, instruction set processor level, custom proces-
+sor level, register transfer level, gate level, circuit level and layout level. It
+covers data-dominated, control-dominated and asynchronous architectures. 10
+projects deal mainly with digital circuits, 7 with analog and mixed-signal cir-
+cuits, and 2 with software-related aspects. The principal application areas are
+communication, medical equipment and e-commerce devices.
+
+The following list describes the objectives of the 20 projects. It is sorted by
+decreasing funding budget.
+
+CRAFT CMOS Radio Frequency Circuit Design for Wireless Application
+
+Advanced CMOS RF circuit design including blocks such as LNA, down converter
+mixers & phase shifters, oscillators and frequency synthesisers, integrated
+filters, delta sigma conversion, power amplifiers
+
+Development of novel models for active and passive devices as well as fine-tuning
+and validation based on first silicon prototypes
+
+Analysis and specification of sophisticated architectures to meet, in particular,
+low-power single-chip implementation
+
+PAPRICA Power and Part Count Reduction Innovative Communication Architecture
+
+Feasibility assessment of DQIF, through physical design and characterisation of
+the core blocks
+
+Low-power RF design techniques in standard CMOS digital processes
+
+RF design tools and framework; PAPRICA Design Kit
+
+Demonstration of a practical implementation of a specific application
+
+MELOPAS Methodology for Low Power Asic design
+
+To develop a methodology to evaluate the power consumption of a complex ASIC
+early on in the design flow
+
+To develop a hardware/software co-simulation tool
+
+To quickly achieve a drastic reduction in the power consumption of electronic
+equipment
+
+TARDIS Technical Coordination and Dissemination
+
+To organise the communication between design experiments and to exploit their
+potential synergy
+
+To guide the capturing of methods and experiences gained in the design experi-
+ments
+
+To organise and promote the wider dissemination and use of the gathered design
+know-how and experience
+
+LUCS Low-Power Ultrasound Chip Set.
+
+Design methodology on low-power ADC, memory and circuit design
+
+Prototype demonstration of a hand-held medical ultrasound scanner
+
+ALPINS Analog Low-Power Design for Communications Systems
+
+Low-voltage voice band smoothing filters and analog-to-digital and digital-to-
+analog converters for an analog front-end circuit for a DECT system
+
+High linear transconductor-capacitor (gm-C) filter for GSM Analog Interface Cir-
+cuit operating at supply voltages as low as 2.5V
+
+Formal verification tools, which will be implemented in the industrial partner’s
+design environment. These tools support the complete design process from sys-
+tem level down to transistor level
+
+SALOMON System-level analog-digital trade-off analysis for low power
+
+A general top-down design flow for mixed-signal telecom ASICs
+
+High-level models of analog and digital blocks and power estimators for these
+blocks
+
+A prototype implementation of the design flow with particular software tools to
+demonstrate the general design flow
+
+DESCALE Design Experiment on a Smart Card Application for Low Energy
+
+The application of highly innovative handshake technology
+
+Aiming at some 3 to 5 times less power and some 10 times smaller peak currents
+compared to synchronously operated solutions
+
+SUPREGE A low-power SUPerREGEnerative transceiver for wireless data transmission at
+short distances
+
+Design trade-offs and optimisation of the micro power receiver/transmitter as a
+function of various parameters (power consumption, area, bandwidth, sensitivity,
+etc)
+
+Modulation/demodulation and interface with data transmission systems
+
+Realisation of the integrated micro power receiver/transmitter based on the super-
+regeneration principle
+
+PREST Power REduction for System Technologies
+
+Survey of contemporary Low-Power Design techniques and commercial power
+analysis software tools
+
+Investigation of architectural and algorithmic design techniques with a power
+consumption comparison
+
+Investigation of Asynchronous design techniques and Arithmetic styles
+
+Set-up and assessment of a low-power design flow
+
+Fabrication and characterisation of a Viterbi demonstrator to assess the most
+promising power reduction techniques
+
+DABLP Low-Power Exploration for Mapping DAB Applications to Multi-Processors
+
+A DAB channel decoder architecture with reduced power consumption
+
+Refined and extended ATOMIUM methodology and supporting tools
+
+COSAFE Low-Power Hardware-Software Co-Design for Safety-Critical Applications
+
+The development of strategies for power-efficient assignment of safety critical
+mechanisms to hardware or software
+
+The design and implementation of a low-power, safety-critical ASIP, which re-
+alises the control unit of a portable infusion pump system
+
+AMIED Asynchronous Low-Power Methodology and Implementation of an Encryption/De-
+cryption System
+
+Implementation of the IDEA encryption/decryption method with drastically re-
+duced power consumption
+
+Advanced low-power design flow with emphasis on algorithm and architecture
+optimisations
+
+Industrial demonstration of the asynchronous design methodology based on com-
+mercial tools
+
+LPGD A Low-Power Design Methodology/Flow and its Application to the Implementation of
+a DCS1800-GSM/DECT Modulator/Demodulator
+
+To complete the development of a top-down, low-power design methodology/flow
+for DSP applications
+
+To demonstrate the methods on the example of an integrated GFSK/GMSK Modu-
+lator-Demodulator (MODEM) for DCS1800-GSM/DECT applications
+
+SOFLOPO Low-Power Software Development for Embedded Applications
+
+Develop techniques and guidelines for mapping a specific algorithm code onto
+appropriate instruction subsets
+
+Integrate these techniques into software for the power-conscious ARM-RISC and
+DSP code optimisation
+
+I-MODE Low-Power RF to Base Band Interface for Multi-Mode Portable Phone
+
+To raise the level of integration in a DECT/DCS1800 transceiver, by implement-
+ing the necessary analog base band low-pass filters and data converters in CMOS
+technology using low-power techniques
+
+COOL-LOGOS Power Reduction through the Use of Local don’t Care Conditions and Global
+Gate Resizing Techniques: An Experimental Evaluation.
+
+To apply the developed low-power design techniques to an existing 24-bit DSP,
+which is already fabricated
+
+To assess the merit of the new techniques using experimental silicon through com-
+parisons of the projected power reduction (in simulation) and actually measured
+reduction of new DSP; assessment of the commercial impact
+
+LOVO Low Output VOltage DC/DC converters for low-power applications
+
+Development of technical solutions for the power supplies of advanced low-
+power systems
+
+New methods for synchronous rectification for very low output voltage power
+converters
+
+PCBIT Low-Power ISDN Interface for Portable PC’s
+
+Design of a PC-Card board that implements the PCBIT interface
+
+Integrate levels 1 and 2 of the communication protocol in a single ASIC
+
+Incorporate power management techniques in the ASIC design:
+
+– system level: shutdown of idle modules in the circuit
+– gate level: precomputation, gated-clock FSMs
+
+COLOPODS Design of a Cochlear Hearing Aid Low-Power DSP System
+
+Selection of a future oriented low-power technology enabling future power re-
+duction through integration of analog modules
+
+Design of a speech processor IC yielding a power reduction of 90% compared to
+the 3.3 Volt implementation
+
+The low power design projects have achieved the following results:
+
+Projects that have designed prototype chips can demonstrate power re-
+ductions of 10 to 30 percent.
+
+New low-power design libraries have been developed.
+
+New proven low-power RF architectures are now available.
+
+New smaller and lighter mobile equipment has been developed.
+
+Instead of running a number of Esprit projects at the same time indepen-
+dently of each other, during this pilot action the projects have collaborated
+strongly. This is achieved mostly by the novel feature of this action, which
+is the presence and role of the coordinator: DIMES - the Delft Institute of
+Microelectronics and Submicron-technology, located in Delft, the Netherlands
+(http://www.dimes.tudelft.nl). The task of the coordinator is to co-ordinate,
+facilitate, and organize:
+
+the information exchange between projects;
+
+the systematic documentation of methods and experiences;
+
+the publication and the wider dissemination to the public.
+
+The most important achievements, credited to the presence of the coordina-
+tor are:
+
+New personnel contacts have been made, and as a consequence the re-
+sulting synergy between partners resulted in better and faster develop-
+ments.
+
+The organization of low-power design workshops, special sessions at
+conferences, and a low-power design web site:
+
+http://www.esdlpd.dimes.tudelft.nl.
+
+At this site all of the public reports from the projects can be found, as
+can all kinds of information about the initiative itself.
+
+The design methodology, design methods and/or design experience are
+disclosed, are well-documented and available.
+
+Based on the work of the projects, and in cooperation with the projects,
+the publication of a low-power design book series is planned. Written
+by members of the projects, this series of books on low-power design
+will disseminate novel design methodologies and design experiences
+that were obtained during the run-time of the European Low Power Ini-
+tiative for Electronic System Design, to the general public.
+
+In conclusion, the major contribution of this project cluster is, in addition
+to the technical achievements already mentioned, the acceleration of the in-
+troduction of novel knowledge on low-power design methods into mainstream
+development processes.
+We would like to thank all project partners from all of the different compa-
+nies and organizations who make the Low-Power Initiative a success.
+
+Rene van Leuken, Reinder Nouta, Alexander de Graaf, Delft, June 2001
+
+
+I
+
+ASYNCHRONOUS CIRCUIT DESIGN
+– A TUTORIAL
+
+Author: Jens Sparsø
+Technical University of Denmark
+jsp@imm.dtu.dk
+
+Abstract
+Asynchronous circuits have characteristics that differ significantly from those
+of synchronous circuits and, as will be clear from some of the later chapters
+in this book, it is possible to exploit these characteristics to design circuits with
+very interesting performance parameters in terms of their power, performance,
+electromagnetic emissions (EMI), etc.
+Asynchronous design is not yet a well-established and widely-used design
+methodology. There are textbooks that provide comprehensive coverage of the
+underlying theories, but the field has not yet matured to a point where there
+is an established curriculum and university tradition for teaching courses on
+asynchronous circuit design to electrical engineering and computer engineering
+students.
+As this author sees the situation there is a gap between understanding the fun-
+damentals and being able to design useful circuits of some complexity. The aim
+of Part I of this book is to provide a tutorial on asynchronous circuit design that
+fills this gap.
+More specifically the aims are: (i) to introduce readers with a background in
+synchronous digital circuit design to the fundamentals of asynchronous circuit
+design such that they are able to read and understand the literature, and (ii)
+to provide readers with an understanding of the “nature” of asynchronous
+circuits such that they are able to design non-trivial circuits with interesting
+performance parameters.
+The material is based on experience from the design of several asynchronous
+chips, and it has evolved over the last decade from tutorials given at a number
+of European conferences and from a number of special topics courses taught
+at the Technical University of Denmark and elsewhere. In May 1999 I gave a
+one-week intensive course at Delft University of Technology and it was when
+preparing for this course that I felt the material was shaping up, and I set out
+to write the following text. Most of the material has recently been used and
+debugged in a course at the Technical University of Denmark in the spring of 2001.
+Supplemented by a few journal articles and a small design project, the text may
+be used for a one semester course on asynchronous design.
+
+Keywords:
+asynchronous circuits, tutorial
+
+
+
+Chapter 1
+
+INTRODUCTION
+
+1.1. Why consider asynchronous circuits?
+
+Most digital circuits designed and fabricated today are “synchronous”. In
+essence, they are based on two fundamental assumptions that greatly simplify
+their design: (1) all signals are binary, and (2) all components share a common
+and discrete notion of time, as defined by a clock signal distributed throughout
+the circuit.
+Asynchronous circuits are fundamentally different; they also assume bi-
+nary signals, but there is no common and discrete time. Instead the circuits
+use handshaking between their components in order to perform the necessary
+synchronization, communication, and sequencing of operations. Expressed in
+‘synchronous terms’ this results in a behaviour that is similar to systematic
+fine-grain clock gating and local clocks that are not in phase and whose period
+is determined by actual circuit delays – registers are only clocked where and
+when needed.
+This difference gives asynchronous circuits inherent properties that can be
+(and have been) exploited to advantage in the areas listed and motivated below.
+The interested reader may find further introduction to the mechanisms behind
+the advantages mentioned below in [140].
+
+Low power consumption [136, 138, 42, 45, 99, 102]: due to fine-grain clock
+gating and zero standby power consumption.
+
+High operating speed [156, 157, 88]: operating speed is determined by actual
+local latencies rather than global worst-case latency.
+
+Less emission of electro-magnetic noise [136, 109]: the local clocks tend to
+tick at random points in time.
+
+Robustness towards variations in supply voltage, temperature, and fabrication
+process parameters [87, 98, 100]: timing is based on matched delays (and can
+even be insensitive to circuit and wire delays).
+
+Better composability and modularity [92, 80, 142, 128, 124]: because of the
+simple handshake interfaces and the local timing.
+
+No clock distribution and clock skew problems: there is no global signal that
+needs to be distributed with minimal phase skew across the circuit.
+
+On the other hand there are also some drawbacks. The asynchronous con-
+trol logic that implements the handshaking normally represents an overhead
+in terms of silicon area, circuit speed, and power consumption. It is therefore
+pertinent to ask whether or not the investment pays off, i.e. whether the use of
+asynchronous techniques results in a substantial improvement in one or more
+of the above areas. Other obstacles are a lack of CAD tools and strategies and
+a lack of tools for testing and test vector generation.
+Research in asynchronous design goes back to the mid 1950s [93, 92], but
+it was not until the late 1990s that projects in academia and industry demon-
+strated that it is possible to design asynchronous circuits which exhibit signifi-
+cant benefits in nontrivial real-life examples, and that commercialization of the
+technology began to take place. Recent examples are presented in [106] and in
+Part III of this book.
+
+1.2. Aims and background
+
+There are already several excellent articles and book chapters that introduce
+asynchronous design [54, 33, 34, 35, 140, 69, 124] as well as several mono-
+graphs and textbooks devoted to asynchronous design including [106, 14, 25,
+18, 95] – why then write yet another introduction to asynchronous design?
+There are several reasons:
+
+My experience from designing several asynchronous chips [123, 103],
+and from teaching asynchronous design to students and engineers over
+the past 10 years, is that it takes more than knowledge of the basic prin-
+ciples and theories to design efficient asynchronous circuits. In my ex-
+perience there is a large gap between the introductory articles and book
+chapters mentioned above explaining the design methods and theories
+on the one side, and the papers describing actual designs and current re-
+search on the other side. It takes more than knowing the rules of a game
+to play and win the game. Bridging this gap involves experience and a
+good understanding of the nature of asynchronous circuits. An experi-
+ence that I share with many other researchers is that “just going asyn-
+chronous” results in larger, slower and more power consuming circuits.
+The crux is to use asynchronous techniques to exploit characteristics in
+the algorithm and architecture of the application in question. This further
+implies that asynchronous techniques may not always be the right solution to
+the problem.
+
+Another issue is that asynchronous design is a rather young discipline.
+Different researchers have proposed different circuit structures and design
+methods. At first glance they may seem different – an observation that is
+supported by different terminologies; but a closer look often reveals that
+the underlying principles and the resulting circuits are rather similar.
+
+Finally, most of the above-mentioned introductory articles and book
+chapters are comprehensive in nature. While being appreciated by those
+already working in the field, the multitude of different theories and ap-
+proaches in existence represents an obstacle for the newcomer wishing
+to get started designing asynchronous circuits.
+
+Compared to the introductory texts mentioned above, the aims of this tu-
+torial are: (1) to provide an introduction to asynchronous design that is more
+selective, (2) to stress basic principles and similarities between the different ap-
+proaches, and (3) to take the introduction further towards designing practical
+and useful circuits.
+
+1.3. Clocking versus handshaking
+
+Figure 1.1(a) shows a synchronous circuit. For simplicity the figure shows a
+pipeline, but it is intended to represent any synchronous circuit. When design-
+ing ASICs using hardware description languages and synthesis tools, designers
+focus mostly on the data processing and assume the existence of a global clock.
+For example, a designer would express the fact that data clocked into register
+R3 is a function CL3 of the data clocked into R2 at the previous clock as the
+following assignment of variables: R3 := CL3(R2). Figure 1.1(a) represents
+this high-level view with a universal clock.
+When it comes to physical design, reality is different. Today's ASICs use a
+structure of clock buffers resulting in a large number of (possibly gated) clock
+signals as shown in figure 1.1(b). It is well known that it takes CAD tools
+and engineering effort to design the clock gating circuitry and to minimize
+and control the skew between the many different clock signals. Guaranteeing
+the two-sided timing constraints – the setup to hold time window around the
+clock edge – in a world that is dominated by wire delays is not an easy task.
+The buffer-insertion-and-resynthesis process that is used in current commercial
+CAD tools may not converge and, even if it does, it relies on delay models that
+are often of questionable accuracy.
+
+Figure 1.1.
+(a) A synchronous circuit, (b) a synchronous circuit with clock drivers and clock
+gating, (c) an equivalent asynchronous circuit, and (d) an abstract data-flow view of the asyn-
+chronous circuit. (The figure shows a pipeline, but it is intended to represent any circuit topol-
+ogy).
+
+Asynchronous design represents an alternative to this. In an asynchronous
+circuit the clock signal is replaced by some form of handshaking between
+neighbouring registers; for example the simple request-acknowledge based
+handshake protocol shown in figure 1.1(c). In the following chapter we look
+at alternative handshake protocols and data encodings, but before departing
+into these implementation details it is useful to take a more abstract view as
+illustrated in figure 1.1(d):
+
+think of the data and handshake signals connecting one register to the
+next in figure 1.1(c) as a “handshake channel” or “link,”
+
+think of the data stored in the registers as tokens tagged with data values
+(that may be changed along the way as tokens flow through combina-
+tional circuits), and
+
+think of the combinational circuits as being transparent to the handshaking
+between registers; a combinational circuit simply absorbs a token on each of
+its input links, performs its computation, and then emits a token on each of
+its output links (much like a transition in a Petri net, cf. section 6.2.1).
+
+Viewed this way, an asynchronous circuit is simply a static data-flow struc-
+ture [36]. Intuitively, correct operation requires that data tokens flowing in the
+circuit do not disappear, that one token does not overtake another, and that new
+tokens do not appear out of nowhere. A simple rule that can ensure this is the
+following:
+
+A register may input and store a new data token from its predecessor if its
+successor has input and stored the data token that the register was previ-
+ously holding. [The states of the predecessor and successor registers are
+signaled by the incoming request and acknowledge signals respectively.]
+
+Following this rule data is copied from one register to the next along the path
+through the circuit. In this process subsequent registers will often be holding
+copies of the same data value but the old duplicate data values will later be
+overwritten by new data values in a carefully ordered manner, and a handshake
+cycle on a link will always enclose the transfer of exactly one data-token. Un-
+derstanding this “token flow game” is crucial to the design of efficient circuits,
+and we will address these issues later, extending the token-flow view to cover
+structures other than pipelines. Our aim here is just to give the reader an intu-
+itive feel for the fundamentally different nature of asynchronous circuits.
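+
+As a rough illustration of the rule, the following Python sketch models a
+pipeline as a list of registers, each with a flag saying whether its token has
+yet been copied by its successor. The function name run_pipeline, the flag
+array and the output-to-input scan order are assumptions made for this sketch
+only; they are not taken from the book, and a real circuit would perform these
+steps concurrently through request and acknowledge signals.
+

```python
def run_pipeline(tokens, n_regs=4):
    """Push tokens through n_regs registers and return them in arrival order."""
    source = list(tokens)        # tokens waiting at the pipeline input
    sink = []                    # tokens delivered at the pipeline output
    data = [None] * n_regs       # value currently latched in each register
    fresh = [False] * n_regs     # True until the successor has copied the token

    while source or any(fresh):
        # Visit registers from output to input; this is one legal interleaving.
        for i in reversed(range(n_regs)):
            if i == n_regs - 1 and fresh[i]:
                sink.append(data[i])      # the environment absorbs the token
                fresh[i] = False
            pred_has_token = bool(source) if i == 0 else fresh[i - 1]
            if pred_has_token and not fresh[i]:
                # Rule: the successor already holds our old token,
                # so it is safe to overwrite it with a new one.
                data[i] = source.pop(0) if i == 0 else data[i - 1]
                fresh[i] = True
                if i > 0:
                    fresh[i - 1] = False  # predecessor's copy is now a stale duplicate
    return sink
```

+
+In this model the tokens arrive in their original order, with none lost,
+duplicated or overtaken; for example, run_pipeline([1, 2, 3], n_regs=2)
+returns [1, 2, 3].
+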
+An important message is that the “handshake-channel and data-token view”
+represents a very useful abstraction that is equivalent to the register transfer
+level (RTL) used in the design of synchronous circuits. This data-flow ab-
+straction, as we will call it, separates the structure and function of the circuit
+from the implementation details of its components.
+
+Another important message is that it is the handshaking between the regis-
+ters that controls the flow of tokens, whereas the combinational circuit blocks
+must be fully transparent to this handshaking. Ensuring this transparency is not
+always trivial; it takes more than a traditional combinational circuit, so we will
+use the term ‘function block’ to denote a combinational circuit whose input
+and output ports are handshake-channels or links.
+Finally, some more down-to-earth engineering comments may also be rele-
+vant. The synchronous circuit in figure 1.1(b) is “controlled” by clock pulses
+that are in phase with a periodic clock signal, whereas the asynchronous circuit
+in figure 1.1(c) is controlled by locally derived clock pulses that can occur at
+any time; the local handshaking ensures that clock pulses are generated where
+and when needed. This tends to randomize the clock pulses over time, and is
+likely to result in less electromagnetic emission and a smoother supply current
+without the large di/dt spikes that characterize a synchronous circuit.
+
+1.4. Outline of Part I
+
+Chapter 2 presents a number of fundamental concepts and circuits that are
+important for the understanding of the following material. Read through it but
+don’t get stuck; you may want to revisit relevant parts later.
+Chapters 3 and 4 address asynchronous design at the data-flow level: chap-
+ter 3 explains the operation of pipelines and rings, introduces a set of hand-
+shake components and explains how to design (larger) computing structures,
+and chapter 4 addresses performance analysis and optimization of such struc-
+tures, both qualitatively and quantitatively.
+Chapter 5 addresses the circuit level implementation of the handshake com-
+ponents introduced in chapter 3, and chapter 6 addresses the design of hazard-
+free sequential (control) circuits. The latter includes a general introduction to
+the topics and in-depth coverage of one specific method: the design of speed-
+independent control circuits from signal transition graph specifications. These
+techniques are illustrated by control circuits used in the implementation of
+some of the handshake components introduced in chapter 3.
+All of the above chapters 2–6 aim to explain the basic techniques and meth-
+ods in some depth. The last two chapters are briefer. Chapter 7 introduces
+more advanced topics related to the implementation of circuits using the 4-
+phase bundled-data protocol, and chapter 8 addresses hardware description
+languages and synthesis tools for asynchronous design. Chapter 8 is by no
+means comprehensive; it focuses on CSP-like languages and syntax-directed
+compilation, but also describes how asynchronous design can be supported by
+a standard language, VHDL.
+
+
+Chapter 2
+
+FUNDAMENTALS
+
+This chapter provides explanations of a number of topics and concepts that
+are of fundamental importance for understanding the following chapters and
+for appreciating the similarities between the different asynchronous design
+styles. The presentation style will be somewhat informal and the aim is to
+provide the reader with intuition and insight.
+
+2.1. Handshake protocols
+
+The previous chapter showed one particular handshake protocol known as a
+return-to-zero handshake protocol, figure 1.1(c). In the asynchronous commu-
+nity it is given a more informative name: the 4-phase bundled-data protocol.
+
+2.1.1 Bundled-data protocols
+
+The term bundled-data refers to a situation where the data signals use nor-
+mal Boolean levels to encode information, and where separate request and
+acknowledge wires are bundled with the data signals, figure 2.1(a). In the 4-
+phase protocol illustrated in figure 2.1(b) the request and acknowledge wires
+also use normal Boolean levels to encode information, and the term 4-phase
+refers to the number of communication actions: (1) the sender issues data and
+sets request high, (2) the receiver absorbs the data and sets acknowledge high,
+(3) the sender responds by taking request low (at which point data is no longer
+guaranteed to be valid) and (4) the receiver acknowledges this by taking ac-
+knowledge low. At this point the sender may initiate the next communication
+cycle.
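+
+The four communication actions can be written out as a short Python sketch
+that records the (Req, Ack) levels after each action. The function name
+four_phase_transfer and the use of None to stand for "data not guaranteed
+valid" are assumptions made for illustration; the sketch is sequential and
+ignores all timing.
+

```python
def four_phase_transfer(value):
    """Return (received value, trace) for one 4-phase bundled-data transfer."""
    req, ack, data = 0, 0, None
    trace = []

    # (1) The sender issues data and sets request high.
    data, req = value, 1
    trace.append((req, ack))

    # (2) The receiver absorbs the data and sets acknowledge high.
    received, ack = data, 1
    trace.append((req, ack))

    # (3) The sender takes request low; data is no longer guaranteed valid.
    req, data = 0, None
    trace.append((req, ack))

    # (4) The receiver takes acknowledge low; the channel is idle again.
    ack = 0
    trace.append((req, ack))

    return received, trace
```

+
+The trace is always (1, 0), (1, 1), (0, 1), (0, 0): both wires return to zero
+before the next cycle can start.
+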
+The 4-phase bundled data protocol is familiar to most digital designers, but
+it has a disadvantage in the superfluous return-to-zero transitions that cost un-
+necessary time and energy. The 2-phase bundled-data protocol shown in fig-
+ure 2.1(c) avoids this. The information on the request and acknowledge wires
+is now encoded as signal transitions on the wires and there is no difference
+between a 0 → 1 and a 1 → 0 transition, they both represent a “signal event”.
+Ideally the 2-phase bundled-data protocol should lead to faster circuits than
+the 4-phase bundled-data protocol, but often the implementation of circuits
+responding to events is complex and there is no general answer as to which
+protocol is best.
+
+Figure 2.1.
+(a) A bundled-data channel. (b) A 4-phase bundled-data protocol. (c) A 2-phase
+bundled-data protocol.
+[Figure: a (push) channel with n bundled Data wires plus Req and Ack; Req, Ack and Data waveforms for the 4-phase and the 2-phase protocol.]
+
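
The difference between the two bundled-data protocols can be made concrete by listing the signal events for a few transfers. The following is a minimal sketch only: the function names and the event-tuple format are illustrative assumptions, and the model ignores all timing (including the delay matching discussed below).

```python
def four_phase_trace(words):
    """Ordered signal events for sending `words` over a 4-phase push channel."""
    events = []
    for word in words:
        events.append(("data", word))  # (1) sender issues data ...
        events.append(("req", 1))      #     ... and sets request high
        events.append(("ack", 1))      # (2) receiver absorbs data, acknowledge high
        events.append(("req", 0))      # (3) sender takes request low
        events.append(("ack", 0))      # (4) receiver takes acknowledge low
    return events


def two_phase_trace(words):
    """Ordered signal events for a 2-phase push channel (transition signaling)."""
    events = []
    req = ack = 0
    for word in words:
        events.append(("data", word))
        req ^= 1                       # any transition on req is a signal event
        events.append(("req", req))
        ack ^= 1                       # any transition on ack is a signal event
        events.append(("ack", ack))
    return events


def handshake_transitions(events):
    """Count transitions on the request and acknowledge wires."""
    return sum(1 for name, _ in events if name in ("req", "ack"))
```

For three words the 4-phase trace contains twelve handshake transitions and the 2-phase trace six, which is the saving referred to above; the sketch says nothing about how complex the circuits responding to the events are.
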
+At this point some discussion of terminology is appropriate. Instead of the
+term bundled-data that is used throughout this text, some texts use the term
+single-rail. The term ‘bundled-data’ hints at the timing relationship between
+the data signals and the handshake signals, whereas the term ‘single-rail’ hints
+at the use of one wire to carry one bit of data. Also, the term single-rail may
+be considered consistent with the dual-rail data representation discussed in the
+next section. Instead of the term 4-phase handshaking (or signaling) some texts
+use the terms return-to-zero (RTZ) signaling or level signaling, and instead of
+the term 2-phase handshaking (or signaling) some texts use the terms non-
+return-to-zero (NRZ) signaling or transition signaling. Consequently a return-
+to-zero single-rail protocol is the same as a 4-phase bundled-data protocol,
+etc.
+The protocols introduced above all assume that the sender is the active party
+that initiates the data transfer over the channel. This is known as a push chan-
+nel. The opposite, the receiver asking for new data, is also possible and is
+called a pull channel. In this case the directions of the request and acknowl-
+edge signals are reversed, and the validity of data is indicated in the acknowl-
+edge signal going from the sender to the receiver. In abstract circuit diagrams
+showing links/channels as one symbol we will often mark the active end of a
+channel with a dot, as illustrated in figure 2.1(a).
+To complete the picture we mention a number of variations: (1) a channel
+without data that can be used for synchronization, and (2) a channel where
+data is transmitted in both directions and where req and ack indicate validity
+of the data that is exchanged. The latter could be used to interface a read-
+only memory: the address would be bundled with req and the data would be
+bundled with ack. These alternatives are explained later in section 7.1.1. In the
+following sections we will restrict the discussion to push channels.
+All the bundled-data protocols rely on delay matching, such that the order
+of signal events at the sender’s end is preserved at the receiver’s end. On a
+push channel, data is valid before request is set high, expressed formally as
+Valid(Data) ≺ Req. This ordering should also be valid at the receiver’s end,
+and it requires some care when physically implementing such circuits. Possible
+solutions are:
+
+To control the placement and routing of the wires, possibly by routing
+all signals in a channel as a bundle. This is trivial in a tile-based datapath
+structure.
+
+To have a safety margin at the sender’s end.
+
+To insert and/or resize buffers after layout (much as is done in today’s
+synthesis and layout CAD tools).
+
+An alternative is to use a more sophisticated protocol that is robust to wire
+delays. In the following sections we introduce a number of such protocols that
+are completely insensitive to delays.
+
+2.1.2
+The 4-phase dual-rail protocol
+
+The 4-phase dual-rail protocol encodes the request signal into the data sig-
+nals using two wires per bit of information that has to be communicated, fig-
+ure 2.2. In essence it is a 4-phase protocol using two request wires per bit of
+information d; one wire d.t is used for signaling a logic 1 (or true), and
+another wire d.f is used for signaling logic 0 (or false). When observing a 1-bit
+channel one will see a sequence of 4-phase handshakes where the participating
+“request” signal in any handshake cycle can be either d.t or d.f. This protocol
+is very robust; two parties can communicate reliably regardless of delays in the
+wires connecting the two parties – the protocol is delay-insensitive.
+
+Figure 2.2.
+A delay-insensitive channel using the 4-phase dual-rail protocol.
+[Figure: a dual-rail (push) channel with 2n Data/Req wires and one Ack wire; a 4-phase waveform in which Data {d.t, d.f} alternates Empty, Valid, Empty, Valid; and the code table below.]
+  d.t  d.f
+   0    0    Empty ("E")
+   0    1    Valid "0"
+   1    0    Valid "1"
+   1    1    Not used
+
+Viewed together the {x.f, x.t} wire pair is a codeword; {x.f, x.t} = {1, 0}
+and {x.f, x.t} = {0, 1} represent “valid data” (logic 0 and logic 1 respectively)
+and {x.f, x.t} = {0, 0} represents “no data” (or “spacer” or “empty value” or
+“NULL”). The codeword {x.f, x.t} = {1, 1} is not used, and a transition from
+one valid codeword to another valid codeword is not allowed, as illustrated in
+figure 2.2.
+This leads to a more abstract view of 4-phase handshaking: (1) the sender
+issues a valid codeword, (2) the receiver absorbs the codeword and sets ac-
+knowledge high, (3) the sender responds by issuing the empty codeword, and
+(4) the receiver acknowledges this by taking acknowledge low. At this point
+the sender may initiate the next communication cycle. An even more abstract
+view of what is seen on a channel is a data stream of valid codewords separated
+by empty codewords.
+Let’s now extend this approach to bit-parallel channels. An N-bit data chan-
+nel is formed simply by concatenating N wire pairs, each using the encoding
+described above. A receiver is always able to detect when all bits are valid (to
+which it responds by taking acknowledge high), and when all bits are empty
+(to which it responds by taking acknowledge low). This is intuitive, but there
+is also some mathematical background – the dual-rail code is a particularly
+simple member of the family of delay-insensitive codes [147], and it has some
+nice properties:
+
+any concatenation of dual-rail codewords is itself a dual-rail codeword.
+
+for a given N (the number of bits to be communicated), the set of all
+possible codewords can be disjointly divided into 3 sets:
+
+– the empty codeword where all N wire pairs are {0,0}.
+
+– the intermediate codewords where some wire-pairs assume the
+empty state and some wire pairs assume valid data.
+
+– the 2^N different valid codewords.
+
+Figure 2.3 illustrates the handshaking on an N-bit channel: a receiver will
+see the empty codeword, a sequence of intermediate codewords (as more and
+more bits/wire-pairs become valid) and eventually a valid codeword. After
+receiving and acknowledging the codeword, the receiver will see a sequence of
+intermediate codewords (as more and more bits become empty), and eventually
+the empty codeword to which the receiver responds by driving acknowledge
+low again.
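
The partition of N-bit dual-rail codewords into empty, intermediate and valid can be checked with a few lines of code. This is a sketch under stated assumptions: each bit is represented as a (t, f) tuple, and the names `encode`, `decode` and `classify` are illustrative, not taken from the text.

```python
from collections import Counter
from itertools import product

EMPTY = (0, 0)  # the (d.t, d.f) pair used as spacer


def encode(bits):
    """Map each bit to a dual-rail (t, f) pair: 1 -> (1, 0), 0 -> (0, 1)."""
    return [(1, 0) if b else (0, 1) for b in bits]


def classify(word):
    """Return 'empty', 'intermediate' or 'valid' for an N-bit dual-rail word."""
    if any(pair == (1, 1) for pair in word):
        raise ValueError("the codeword (1, 1) is not used")
    has_data = [pair != EMPTY for pair in word]
    if all(has_data):
        return "valid"
    if not any(has_data):
        return "empty"
    return "intermediate"


def decode(word):
    """Recover the bits of a valid word; a receiver would do this before acknowledging."""
    if classify(word) != "valid":
        raise ValueError("word is not (yet) valid")
    return [t for t, _f in word]


def count_classes(n):
    """Count the legal n-bit words in each of the three disjoint sets."""
    legal_pairs = [(0, 0), (1, 0), (0, 1)]
    return Counter(classify(w) for w in product(legal_pairs, repeat=n))
```

For N = 3 there are 27 legal words: one empty word, 8 valid words, and 18 intermediate words. A receiver that sets acknowledge high only on `valid` and low only on `empty` is insensitive to the order in which the individual wire pairs change.
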
+
+Figure 2.3.
+Illustration of the handshaking on a 4-phase dual-rail channel.
+[Figure: waveforms over time of Data (alternating between all empty and all valid) and Acknowledge (0 / 1).]
+
+2.1.3
+The 2-phase dual-rail protocol
+
+The 2-phase dual-rail protocol also uses 2 wires {d.t, d.f} per bit, but the
+information is encoded as transitions (events) as explained previously. On an
+N-bit channel a new codeword is received when exactly one wire in each of
+the N wire pairs has made a transition. There is no empty value; a valid mes-
+sage is acknowledged and followed by another message that is acknowledged.
+Figure 2.4 shows the signal waveforms on a 2-bit channel using the 2-phase
+dual-rail protocol.
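
A receiver on a 2-phase dual-rail channel has to compare the current wire levels with the levels at the previous codeword. The sketch below illustrates the rule stated above; the function name and the (t, f) tuple representation are assumptions made for the example.

```python
def decode_two_phase(previous, current):
    """Decode one N-bit word from wire levels before and after.

    `previous` and `current` are lists of (t, f) wire levels, one pair per bit.
    Returns the list of bits once exactly one wire in every pair has made a
    transition, and None while the new codeword is still incomplete.
    """
    bits = []
    for (prev_t, prev_f), (cur_t, cur_f) in zip(previous, current):
        t_event = prev_t != cur_t
        f_event = prev_f != cur_f
        if t_event and f_event:
            raise ValueError("both wires of a pair changed: not a legal codeword")
        if not (t_event or f_event):
            return None               # this pair has not signalled yet
        bits.append(1 if t_event else 0)
    return bits
```

Starting from levels `[(0, 0), (0, 0)]`, the levels `[(1, 0), (0, 1)]` decode to `[1, 0]`; a following change to `[(1, 1), (0, 0)]` decodes to `[0, 0]`, since only the direction-independent events matter and there is no empty value in between.
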
+
+Figure 2.4.
+Illustration of the handshaking on a 2-phase dual-rail channel.
+[Figure: waveforms of d1.t, d1.f, d0.t, d0.f and Ack on a 2-bit channel carrying the values 00, 01, 00, 11.]
+
+2.1.4
+Other protocols
+
+The previous sections introduced the four most common channel protocols:
+the 4-phase bundled-data push channel, the 2-phase bundled-data push chan-
+nel, the 4-phase dual-rail push channel and the 2-phase dual-rail push channel;
+but there are many other possibilities. The two wires per bit used in the dual-
+rail protocol can be seen as a one-hot encoding of that bit and often it is useful
+to extend to 1-of-n encodings in control logic and higher-radix data encodings.
+If the focus is on communication rather than computation, m-of-n encodings
+may be of relevance. The solution space can be expressed as the cross product
+of a number of options including:
+
+{2-phase, 4-phase} × {bundled-data, dual-rail, 1-of-n, ...} × {push, pull}
+
+The choice of protocol affects the circuit implementation characteristics
+(area, speed, power, robustness, etc.). Before continuing with these imple-
+mentation issues it is necessary to introduce the concept of indication or ac-
+knowledgement, as well as a new component, the Muller C-element.
+
+2.2.
+The Muller C-element and the indication principle
+
+In a synchronous circuit the role of the clock is to define points in time
+where signals are stable and valid. In between the clock-ticks, the signals
+may exhibit hazards and may make multiple transitions as the combinational
+circuits stabilize. This does not matter from a functional point of view. In
+asynchronous (control) circuits the situation is different. The absence of a
+clock means that, in many circumstances, signals are required to be valid all
+the time, that every signal transition has a meaning and, consequently, that
+hazards and races must be avoided.
+Intuitively a circuit is a collection of gates (normally including some feed-
+back loops), and when the output of a gate changes it is seen by other gates
+that in turn may decide to change their outputs accordingly. As an example fig-
+ure 2.5 shows one possible implementation of the CTL circuit in figure 1.1(c).
+The intention here is not to explain its function, just to give an impression of
+the type of circuit we are discussing. It is obvious that hazards on the Ro,
+Ai, and Lt signals would be disastrous if the circuit is used in the pipeline of
+figure 1.1(c).
+
+Figure 2.5.
+An example of an asynchronous control circuit. Lt is a “local” clock that is in-
+tended to control a latch.
+[Figure: gate-level CTL circuit with signals Ri, Ao, Ro, Ai and Lt.]
+
+Figure 2.6.
+A normal OR gate
+  a  b  y
+  0  0  0
+  0  1  1
+  1  0  1
+  1  1  1
+
+Figure 2.7.
+The Muller C-element: symbol, possible implementation, and some alternative
+specifications.
+Some specifications:
+1: if a = b then y := a
+2: a = b ↦ y := a
+3: y = ab + y(a + b)
+4:
+  a  b  y
+  0  0  0
+  0  1  no change
+  1  0  no change
+  1  1  1
+
+The concept of indication or acknowledgement plays an important role in
+the design of such circuits. Consider the simple 2-input OR gate in figure 2.6.
+An observer seeing the output change from 1 to 0 may conclude that both
+inputs are now at 0. However, when seeing the output change from 0 to 1 the
+observer is not able to make conclusions about both inputs. The observer only
+knows that at least one input is 1, but it does not know which. We say that
+the OR gate only indicates or acknowledges when both inputs are 0. Through
+similar arguments it can be seen that an AND gate only indicates when both
+inputs are 1.
+Signal transitions that are not indicated or acknowledged in other signal
+transitions are the source of hazards and should be avoided. We will address
+this issue in greater detail later in section 2.5.1 and in chapter 6.
+A circuit that is better in this respect is the Muller C-element shown in fig-
+ure 2.7. It is a state-holding element much like an asynchronous set-reset latch.
+When both inputs are 0 the output is set to 0, and when both inputs are 1 the
+output is set to 1. For other input combinations the output does not change.
+Consequently, an observer seeing the output change from 0 to 1 may conclude
+that both inputs are now at 1; and similarly, an observer seeing the output
+change from 1 to 0 may conclude that both inputs are now 0.
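
The behaviour just described is easy to capture in a small behavioural model. The sketch below is illustrative only (the class name and method are assumptions); it also checks that the rule “copy the inputs when they agree, otherwise hold” matches the Boolean form y = ab + y(a + b).

```python
class CElement:
    """Behavioural model of a 2-input Muller C-element with state y."""

    def __init__(self, y=0):
        self.y = y

    def update(self, a, b):
        # The output follows the inputs only when they agree; otherwise it holds.
        if a == b:
            self.y = a
        return self.y


def c_element_equation(a, b, y):
    """The same behaviour written as a next-state equation."""
    return (a & b) | (y & (a | b))


def indicated_inputs(old_y, new_y):
    """What an observer may conclude about (a, b) from an output change."""
    if old_y == new_y:
        return None                 # no transition, nothing is indicated
    return (new_y, new_y)           # 0->1 indicates (1, 1); 1->0 indicates (0, 0)
```

Unlike the OR gate, both output transitions of the C-element indicate the state of both inputs, which is why `indicated_inputs` never has to return a partial answer.
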
+Combining this with the observation that all asynchronous circuits rely on
+handshaking that involves cyclic transitions between 0 and 1, it should be clear
+that the Muller C-element is indeed a fundamental component that is exten-
+sively used in asynchronous circuits.
+
+2.3.
+The Muller pipeline
+
+Figure 2.8 shows a circuit that is built from C-elements and inverters. The
+circuit is known as a Muller pipeline or a Muller distributor. Variations and ex-
+tensions of this circuit form the (control) backbone of almost all asynchronous
+circuits. It may not always be obvious at a first glance, but if one strips off the
+cluttering details, the Muller pipeline is always there as the crux of the matter.
+The circuit has a beautiful and symmetric behaviour, and once you understand
+its behaviour, you have a very good basis for understanding most asynchronous
+circuits.
+The Muller pipeline in figure 2.8 is a mechanism that relays handshakes.
+After all of the C-elements have been initialized to 0 the left environment
+may start handshaking. To understand what happens let’s consider the ith C-
+element, C[i]: It will propagate (i.e. input and store) a 1 from its predecessor,
+C[i-1], only if its successor, C[i+1], is 0. In a similar way it will propagate
+(i.e. input and store) a 0 from its predecessor if its successor is 1. It is often
+useful to think of the signals propagating in an asynchronous circuit as a se-
+quence of waves, as illustrated at the bottom of figure 2.8. Viewed this way, the
+role of a C-element stage in the pipeline is to propagate crests and troughs of
+waves in a carefully controlled way that maintains the integrity of each wave.
+On any interface between C-element pipeline stages an observer will see
+correct handshaking, but the timing may differ from the timing of the hand-
+shaking on the left hand environment; once a wave has been injected into the
+Muller pipeline it will propagate with a speed that is determined by actual de-
+lays in the circuit.
+Eventually the first handshake (request) injected by the left hand environ-
+ment will reach the right hand environment. If the right hand environment
+does not respond to the handshake, the pipeline will eventually fill. If this hap-
+pens the pipeline will stop handshaking with the left hand environment – the
+Muller pipeline behaves like a ripple through FIFO!
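
The filling of the pipeline can be reproduced with a small gate-level simulation. This is a sketch under explicit assumptions: the left environment is modeled as an inverter that keeps handshaking (request = not acknowledge), the right environment never responds (its acknowledge stays 0), and one excited gate is fired at a time in a fixed order chosen purely for illustration.

```python
def c_element(a, b, y):
    """Muller C-element: follow the inputs when they agree, else hold y."""
    return a if a == b else y


def next_values(req, c):
    """Next-state value of the left environment and of every C-element."""
    n = len(c)
    next_req = 1 - c[0]                       # left environment: inverter on ack
    next_c = []
    for i in range(n):
        pred = req if i == 0 else c[i - 1]
        succ = c[i + 1] if i + 1 < n else 0   # right environment never acknowledges
        next_c.append(c_element(pred, 1 - succ, c[i]))
    return next_req, next_c


def run_until_stable(n, max_firings=10_000):
    """Fire excited gates one at a time until no gate is excited."""
    req, c = 0, [0] * n                       # all C-elements initialized to 0
    for _ in range(max_firings):
        next_req, next_c = next_values(req, c)
        if next_req != req:                   # environment gate is excited
            req = next_req
            continue
        for i in range(n):
            if next_c[i] != c[i]:             # fire the first excited C-element
                c[i] = next_c[i]
                break
        else:
            return req, c                     # nothing excited: pipeline is full
    raise RuntimeError("pipeline did not settle")
```

With four stages the simulation settles with the C-elements holding 0, 1, 0, 1 and the left environment stalled with its request pending, which is the ripple-through FIFO behaviour described above.
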
+In addition to this elegant behaviour, the pipeline has a number of beautiful
+symmetries. Firstly, it does not matter if you use 2-phase or 4-phase handshak-
+ing. It is the same circuit. The difference is in how you interpret the signals
+and use the circuit. Secondly, the circuit operates equally well from right to
+left. You may reverse the definition of signal polarities, reverse the role of the
+request and acknowledge signals, and operate the circuit from right to left. It
+is analogous to electrons and holes in a semiconductor; when current flows in
+one direction it may be carried by electrons flowing in one direction or by holes
+flowing in the opposite direction.
+
+Figure 2.8.
+The Muller pipeline or Muller distributor.
+[Figure: a chain of C-elements with inverters between the Left and Right environments, Req/Ack signals at every interface, and the rule: if C[i-1] ≠ C[i+1] then C[i] := C[i-1].]
+
+Finally, the circuit has the interesting property that it works correctly regard-
+less of delays in gates and wires – the Muller-pipeline is delay-insensitive.
+
+2.4.
+Circuit implementation styles
+
+As mentioned previously, the choice of handshake protocol affects the cir-
+cuit implementation (area, speed, power, robustness, etc.). Most practical cir-
+cuits use one of the following protocols introduced in section 2.1:
+
+4-phase bundled-data – which most closely resembles the design of syn-
+chronous circuits and which normally leads to the most efficient circuits,
+due to the extensive use of timing assumptions.
+
+2-phase bundled-data – introduced under the name Micropipelines by Ivan
+Sutherland in his 1988 Turing Award lecture.
+
+4-phase dual-rail – the classic approach rooted in David Muller’s pioneering
+work in the 1950s.
+
+Common to all protocols is the fact that the corresponding circuit imple-
+mentations all use variations of the Muller pipeline for controlling the storage
+elements. Below we explain the basics of pipelines built using simple transparent
+latches as storage elements. More optimized and elaborate circuit implementations
+and more complex circuit structures are the topics of later chapters.
+
+Figure 2.9.
+A simple 4-phase bundled-data pipeline.
+[Figure: (a) latches with enable inputs (EN) controlled by a Muller pipeline of C-elements, with Req, Ack and Data signals; (b) the same pipeline with combinational circuits (Comb. F) between the latches.]
+
+2.4.1
+4-phase bundled-data
+
+A 4-phase bundled-data pipeline is particularly simple. A Muller pipeline
+is used to generate local clock pulses. The clock pulse generated in one stage
+overlaps with the pulses generated in the neighbouring stages in a carefully
+controlled interlocked manner. Figure 2.9(a) shows a FIFO, i.e. a pipeline
+without data processing, and figure 2.9(b) shows how combinational circuits
+(also called function blocks) can be added between the latches. To maintain
+correct behaviour matching delays have to be inserted in the request signal
+paths.
+You may view this circuit as a traditional “synchronous” data-path, con-
+sisting of latches and combinational circuits that are clocked by a distributed
+gated-clock driver, or you may view it as an asynchronous data-flow structure
+composed of two types of handshake components: latches and function blocks,
+as indicated by the dashed boxes.
+The pipeline implementation shown in figure 2.9 is particularly simple but
+it has some drawbacks: when it fills the state of the C-elements is (0, 1, 0,
+1, etc.), and as a consequence only every other latch is storing data. This
+is no worse than in a synchronous circuit using master-slave flip-flops, but
+it is possible to design asynchronous pipelines and FIFOs that are better in
+this respect. Another disadvantage is speed. The throughput of a pipeline
+or FIFO depends on the time it takes to complete a handshake cycle and for
+the above implementation this involves communication with both neighbours.
+Chapter 7 addresses alternative implementations that are both faster and have
+better occupancy when full.
+
+Figure 2.10.
+A simple 2-phase bundled-data pipeline.
+[Figure: a Muller pipeline of C-elements driving the C and P inputs of capture-pass latches, with Req, Ack and Data signals.]
+
+2.4.2
+2-phase bundled data (Micropipelines)
+
+A 2-phase bundled-data pipeline also uses a Muller pipeline as the backbone
+control circuit, but the control signals are interpreted as events or transitions,
+figure 2.10. For this reason special capture-pass latches are needed: events
+on the C and P inputs alternate, causing the latch to alternate between cap-
+ture mode and pass mode. This calls for a special latch design as shown in
+figure 2.11 and explained below. The switch symbol in figure 2.11 is a multi-
+plexer, and the event controlled latch can be understood as two ordinary level
+sensitive latches (operating in an alternating fashion) followed by a multiplexer
+and a buffer.
+Figure 2.10 shows a pipeline without data processing. Combinational cir-
+cuits with matching delay elements can be inserted between latches in a similar
+way to the 4-phase bundled-data approach in figure 2.9.
+The 2-phase bundled-data approach was pioneered by Ivan Sutherland in the
+late 1980s and an excellent introduction is given in his 1988 Turing Award Lec-
+ture [128]. The title Micropipelines is often used synonymously with the use
+of the 2-phase bundled-data protocol, but it also refers to the use of a particular
+set of components that are based on event signalling. In addition to the latch in
+figure 2.11 these are: AND, OR, Select, Toggle, Call and Arbiter. The above
+figures 2.10 and 2.11 are similar to figures 15 and 12 in [128], but they emphasise
+more strongly the fact that the control structure is a Muller pipeline. Some
+alternative latch designs that are (significantly) smaller and (significantly) slower
+are also presented in [128].
+
+Figure 2.11.
+Implementation and operation of a capture-pass event controlled latch. At time
+t0 the latch is transparent (i.e. in pass mode) and signals C and P are both low. An event on the
+C input turns the latch into capture mode, etc.
+[Figure: latch with inputs In, C, P and output Out; states t0: C=0, P=0 (pass); t1: C=1, P=0 (capture); t2: C=1, P=1 (pass); t3: C=0, P=1 (capture).]
+
+At the conceptual level the 2-phase bundled-data approach is elegant and ef-
+ficient; compared to the 4-phase bundled-data approach it avoids the power and
+performance loss that is incurred by the return-to-zero part of the handshaking.
+However, as illustrated by the latch design, the implementation of components
+that respond to signal transitions is often more complex than the implemen-
+tation of components that respond to normal level signals. In addition to the
+storage elements explained above, conditional control logic that responds to
+signal transitions tends to be complex as well. This has been experienced by
+this author [123], by the University of Manchester [42, 45] and by many others.
+Having said this, the 2-phase bundled-data approach may be the preferred
+solution in systems with unconditional data-flows and very high speed require-
+ments. But as just mentioned, the higher speed comes at a price: larger silicon
+area and higher power consumption. In this respect asynchronous design is no
+different from synchronous design.
+
+2.4.3
+4-phase dual-rail
+
+A 4-phase dual-rail pipeline is also based on the Muller pipeline, but in a
+more elaborate way that has to do with the combined encoding of data and
+request. Figure 2.12 shows the implementation of a 1-bit wide and three stage
+deep pipeline without data processing. It can be understood as two Muller
+pipelines connected in parallel, using a common acknowledge signal per stage
+to synchronize operation. The pair of C-elements in a pipeline stage can store
+the empty codeword {d.t, d.f} = {0, 0}, causing the acknowledge signal out
+of that stage to be 0, or it can store one of the two valid codewords {0, 1}
+and {1, 0}, causing the acknowledge signal out of that stage to be logic 1.
+At this point, and referring back to section 2.2, the reader should notice that
+because the codeword {1, 1} is illegal and does not occur, the acknowledge
+signal generated by the OR gate safely indicates the state of the pipeline stage
+as being “valid” or “empty.”
+
+Figure 2.12.
+A simple 3-stage 1-bit wide 4-phase dual-rail pipeline.
+[Figure: three stages, each with a pair of C-elements on d.t and d.f and an OR gate generating Ack.]
+
+Figure 2.13.
+An N-bit latch with completion detection.
+[Figure: inputs di[0..2].t/.f, outputs do[0..2].t/.f, ack_i and ack_o; a completion detector built from OR gates and a C-element, and an alternative completion detector with “All valid” and “All empty” signals.]
+
+An N-bit wide pipeline can be implemented by using a number of 1-bit
+pipelines in parallel. This does not guarantee to a receiver that all bits in a
+word arrive at the same time, but often the necessary synchronization is done
+
+in the function blocks. In [124, 125] we describe a design of this style using
+the DIMS combinational circuits explained below.
+
+Figure 2.14.
+A 4-phase dual-rail AND gate: symbol, truth table, and implementation.
+  a    b    y.t  y.f
+  E    E    0    0
+  F    F    0    1
+  F    T    0    1
+  T    F    0    1
+  T    T    1    0
+  (any other input combination: NO CHANGE)
+[Figure: four C-elements detect the input minterms 00, 01, 10 and 11 from a.t, a.f, b.t and b.f; an OR gate over the first three drives y.f and the fourth drives y.t.]
+
+If bit-parallel synchronization is needed, the individual acknowledge signals
+can be combined into one global acknowledge using a C-element. Figure 2.13
+shows an N-bit wide latch. The OR gates and the C-element in the dashed box
+form a completion detector that indicates whether the N-bit dual-rail codeword
+stored in the latch is empty or valid. The figure also shows an implementation
+of a completion detector using only a 2-input C-element.
+Let us now look at how combinational circuits for 4-phase dual-rail cir-
+cuits are implemented. As mentioned in chapter 1 combinational circuits must
+be transparent to the handshaking between latches. Therefore, all outputs of
+a combinational circuit must not become valid until after all inputs have be-
+come valid. Otherwise the receiving latch may prematurely set acknowledge
+high (before all signals from the sending latch have become valid). In a sim-
+ilar way all outputs of a combinational circuit must not become empty until
+after all inputs have become empty. Otherwise the receiving latch may pre-
+maturely set acknowledge low (before all signals from the sending latch have
+become empty). Consequently a combinational circuit for the 4-phase dual-
+rail approach involves state holding elements and it exhibits a hysteresis-like
+behaviour in the empty-to-valid and valid-to-empty transitions.
+A particularly simple approach, using only C-elements and OR gates, is
+illustrated in figure 2.14, which shows the implementation of a dual-rail AND
+gate. The circuit can be understood as a direct mapping from sum-of-minterms
+expressions for each of the two output wires into hardware. The circuit waits
+for all its inputs to become valid. When this happens exactly one of the four
+C-elements goes high. This again causes the relevant output wire to go high
+corresponding to the gate producing the desired valid output. When all inputs
+become empty the C-elements are all set low, and the output of the dual-rail
+AND gate becomes empty again. Note that the C-elements provide both the
+necessary ’and’ operator and the hysteresis in the empty-to-valid and valid-to-
+empty transitions that is required for transparent handshaking. Note also that
+(again) the OR gate is never exposed to more than one input signal being high.
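
The hysteresis of the dual-rail AND gate can be observed in a behavioural sketch of the circuit just described: four C-elements, one per input minterm, and an OR gate collecting the three minterms that produce a false output. The class name, the (t, f) tuple representation and the constants E, T and F are assumptions made for this example.

```python
E, T, F = (0, 0), (1, 0), (0, 1)    # dual-rail values as (t, f) wire pairs


def c_element(a, b, y):
    """Muller C-element: follow the inputs when they agree, else hold y."""
    return a if a == b else y


class DualRailAnd:
    """Dual-rail AND gate built from four minterm C-elements and an OR gate."""

    def __init__(self):
        # One state bit per minterm of (a, b); all C-elements start low.
        self.m = {"00": 0, "01": 0, "10": 0, "11": 0}

    def step(self, a, b):
        a_t, a_f = a
        b_t, b_f = b
        m = self.m
        m["00"] = c_element(a_f, b_f, m["00"])
        m["01"] = c_element(a_f, b_t, m["01"])
        m["10"] = c_element(a_t, b_f, m["10"])
        m["11"] = c_element(a_t, b_t, m["11"])
        y_f = m["00"] | m["01"] | m["10"]   # at most one of these is ever high
        y_t = m["11"]
        return (y_t, y_f)
```

Applying the input sequence (E, E), (T, E), (T, F), (E, F), (E, E) gives the outputs E, E, F, F, E: the output becomes valid only after both inputs are valid and becomes empty only after both inputs are empty.
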
+Other dual-rail gates such as OR and EXOR can be implemented in a sim-
+ilar fashion, and a dual-rail inverter involves just a swap of the true and false
+wires. The transistor count in these basic dual-rail gates is obviously rather
+high, and in chapter 5 we explore more efficient circuit implementations. Here
+our interest is in the fundamental principles.
+Given a set of basic dual-rail gates one can construct dual-rail combinational
+circuits for arbitrary Boolean expressions using normal combinational circuit
+synthesis techniques. The transparency to handshaking that is a property of
+the basic gates is preserved when composing gates into larger combinational
+circuits.
+The fundamental ideas explained above all go back to David Muller’s work
+in the late 1950s and early 1960s [93, 92]. While [93] develops the fundamen-
+tal theory for the design of speed-independent circuits, [92] is a more practical
+introduction including a design example: a bit-serial multiplier using latches
+and gates as explained above.
+
+2.5.
+Theory
+
+Asynchronous circuits can be classified, as we will see below, as being self-
+timed, speed-independent or delay-insensitive depending on the delay assump-
+tions that are made. In this section we introduce some important theoretical
+concepts that relate to this classification. The goal is to communicate the basic
+ideas and provide some intuition on the problems and solutions, and a reader
+who wishes to dig deeper into the theory is referred to the literature. Some
+recent starting points are [95, 54, 69, 35, 18].
+
+2.5.1
+The basics of speed-independence
+
+We will start by reviewing the basics of David Muller’s model of a cir-
+cuit and the conditions for it being speed-independent [93]. A circuit is mod-
+eled along with its (dummy) environment as a closed network of gates, closed
+meaning that all inputs are connected to outputs and vice versa. Gates are
+modeled as Boolean operators with arbitrary non-zero delays, and wires are
+assumed to be ideal. In this context the circuit can be described as a set of
+concurrent Boolean functions, one for each gate output. The state of the circuit
+is the set of all gate outputs. Figure 2.15 illustrates this for a stage of a Muller
+pipeline with an inverter and a buffer mimicking the handshaking behaviour of
+the left and right hand environments.
+A gate whose output is consistent with its inputs is said to be stable; its
+“next output” is the same as its “current output”, zi′ = zi. A gate whose inputs
+have changed in such a way that an output change is called for is said to be
+excited; its “next output” is different from its “current output”, i.e. zi′ ≠ zi.
+After an arbitrary delay an excited gate may spontaneously change its output
+and become stable. We say that the gate fires, and as excited gates fire and
+become stable with new output values, other gates in turn become excited, etc.
+
+Figure 2.15.
+Muller model of a Muller pipeline stage with “dummy” gates modeling the envi-
+ronment behaviour.
+  ri′   = not(ci)
+  ci′   = ri yi + (ri + yi) ci
+  yi′   = not(ai+1)
+  ai+1′ = ci
+
+To illustrate this, suppose that the circuit in figure 2.15 is in state
+(ri, yi, ci, ai+1) = (0, 1, 0, 0). In this state (the inverter) ri is excited corresponding to the
+left environment being about to take request high. After the firing of ri↑ the
+circuit reaches state (ri, yi, ci, ai+1) = (1, 1, 0, 0) and ci now becomes excited.
+For synthesis and analysis purposes one can construct the complete state graph
+representing all possible sequences of gate firings. This is addressed in detail
+in chapter 6. Here we will restrict the discussion to an explanation of the
+fundamental ideas.
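
The notions of stable and excited gates can be tried out on the pipeline stage of figure 2.15. The sketch below assumes the next-state functions r′ = not c, y′ = not a, c′ = r·y + (r + y)·c and a′ = c (writing r, y, c, a for ri, yi, ci, ai+1); the dictionary representation and the function names are illustrative. The last function explores all reachable states and checks the condition that firing one excited gate never makes another excited gate stable without firing.

```python
def next_state(s):
    """Next-output value of every gate in state s (keys 'r', 'y', 'c', 'a')."""
    r, y, c, a = s["r"], s["y"], s["c"], s["a"]
    return {
        "r": 1 - c,                       # left environment: inverter
        "y": 1 - a,                       # inverter on the acknowledge
        "c": (r & y) | ((r | y) & c),     # the C-element
        "a": c,                           # right environment: buffer
    }


def excited(s):
    """Names of the gates whose next output differs from the current output."""
    nxt = next_state(s)
    return sorted(g for g in s if nxt[g] != s[g])


def fire(s, gate):
    """Return the state reached when the excited gate `gate` fires."""
    t = dict(s)
    t[gate] = next_state(s)[gate]
    return t


def check_reachable_states(start):
    """True if no excited gate is ever disabled by the firing of another gate."""
    seen, todo = set(), [start]
    while todo:
        s = todo.pop()
        key = tuple(sorted(s.items()))
        if key in seen:
            continue
        seen.add(key)
        ex = excited(s)
        for g in ex:
            t = fire(s, g)
            if any(h not in excited(t) for h in ex if h != g):
                return False
            todo.append(t)
    return True
```

Starting from (r, y, c, a) = (0, 1, 0, 0) only `r` is excited; after it fires only `c` is excited, and after `c` fires both `a` and `r` are excited at the same time, which is the situation discussed next.
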
+In the general case it is possible that several gates are excited at the same
+time (i.e. in a given state). If one of these gates, say zi, fires the interesting
+thing is what happens to the other excited gates which may have zi as one
+of their inputs: they may remain excited, or they may find themselves with a
+different set of input signals that no longer calls for an output change. A circuit
+is speed-independent if the latter never happens. The practical implication of
+an excited gate becoming stable without firing is a potential hazard. Since
+delays are unknown the gate may or may not have changed its output, or it
+may be in the middle of doing so when the ‘counter-order’ comes calling for
+the gate output to remain unchanged.
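The notions of excited gates, firing and speed-independence can be tried out on the Muller pipeline stage of figure 2.15. The following is a minimal Python sketch, not code from the tutorial. It assumes the next-state equations as read from the figure (r_i′ = not(c_i), y_i′ = not(a_{i+1}), c_i′ = r_i·y_i + (r_i + y_i)·c_i, a_{i+1}′ = c_i); all function names are hypothetical. It lists the excited gates of a state, fires one of them, and searches the reachable states for a case where an excited gate becomes stable without firing.

```python
from collections import deque

# State order follows the text: (r_i, y_i, c_i, a_{i+1}).
NAMES = ("r_i", "y_i", "c_i", "a_i+1")


def next_values(state):
    """Next output of every gate, assuming the equations read from figure 2.15."""
    r, y, c, a = state
    return (
        1 - c,                      # inverter modeling the left environment
        1 - a,                      # inverter modeling the right environment
        (r & y) | ((r | y) & c),    # Muller C-element
        c,                          # buffer modeling the right environment
    )


def excited(state):
    """Names of the gates whose next output differs from the current output."""
    return [n for n, cur, nxt in zip(NAMES, state, next_values(state)) if cur != nxt]


def fire(state, name):
    """Let one gate take on its next output; all other signals are unchanged."""
    k = NAMES.index(name)
    return state[:k] + (next_values(state)[k],) + state[k + 1:]


def is_speed_independent(start):
    """True if no excited gate ever becomes stable without firing."""
    seen, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        for g in excited(s):
            t = fire(s, g)
            still_pending = set(excited(s)) - {g}
            if not still_pending <= set(excited(t)):
                return False
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return True


print(excited((0, 1, 0, 0)))               # gates excited in the state from the text
print(fire((0, 1, 0, 0), "r_i"))           # state reached after r_i fires
print(is_speed_independent((0, 1, 0, 0)))
```

The search enumerates at most 16 states here; as the text notes, the state space grows quickly for larger circuits, which is why a more abstract representation is preferred for synthesis.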
+Since the model involves a Boolean state variable for each gate (and for
+each wire segment in the case of delay-insensitive circuits) the state space be-
+comes very large even for very simple circuits. In chapter 6 we introduce signal
+transition graphs as a more abstract representation from which circuits can be
+synthesized.
+Now that we have a model for describing and reasoning about the behaviour
+of gate-level circuits let’s address the classification of asynchronous circuits.
+
+
+Chapter 2: Fundamentals
+25
+
+Figure 2.16.
+A circuit fragment with gate and wire delays. The output of gate A forks to inputs
+of gates B and C. Gates A, B and C have delays dA, dB and dC; the wire segments
+have delays d1, d2 and d3.
+
+2.5.2
+Classification of asynchronous circuits
+
+At the gate level, asynchronous circuits can be classified as being self-timed,
+speed-independent or delay-insensitive depending on the delay assumptions
+that are made. Figure 2.16 serves to illustrate the following discussion. The
+figure shows three gates: A, B, and C, where the output signal from gate A is
+connected to inputs on gates B and C.
+A speed-independent (SI) circuit as introduced above is a circuit that oper-
+ates “correctly” assuming positive, bounded but unknown delays in gates and
+ideal zero-delay wires. Referring to figure 2.16 this means arbitrary dA, dB,
+and dC, but d1 = d2 = d3 = 0. Assuming ideal zero-delay wires is not very
+realistic in today’s semiconductor processes. By allowing arbitrary d1 and d2
+and by requiring d2 = d3 the wire delays can be lumped into the gates, and
+from a theoretical point of view the circuit is still speed-independent.
+A circuit that operates “correctly” with positive, bounded but unknown de-
+lays in wires as well as in gates is delay-insensitive (DI). Referring to fig-
+ure 2.16 this means arbitrary dA, dB, dC, d1, d2, and d3. Such circuits are obvi-
+ously extremely robust. One way to show that a circuit is delay-insensitive is to
+use a Muller model of the circuit where wire segments (after forks) are modeled
+as buffer components. If this equivalent circuit model is speed-independent,
+then the circuit is delay-insensitive.
+Unfortunately the class of delay-insensitive circuits is rather small. Only
+circuits composed of C-elements and inverters can be delay-insensitive [82],
+and the Muller pipeline in figures 2.5, 2.8, and 2.15 is one important example.
+Circuits that are delay-insensitive with the exception of some carefully
+identified wire forks where d2 = d3 are called quasi-delay-insensitive (QDI). Such
+wire forks, where signal transitions occur at the same time at all end-points, are
+called isochronic (and discussed in more detail in the next section). Typically
+these isochronic forks are found in gate-level implementations of basic build-
+ing blocks where the designer can control the wire delays. At the higher levels
+of abstraction the composition of building blocks would typically be delay-
+insensitive. After these comments it is obvious that a distinction between DI,
+QDI and SI makes good sense.
+
+Because the class of delay-insensitive circuits is so small, basically exclud-
+ing all circuits that compute, most circuits that are referred to in the literature
+as delay-insensitive are only quasi-delay-insensitive.
+Finally a word about self-timed circuits: speed-independence and delay-
+insensitivity as introduced above are (mathematically) well defined properties
+under the unbounded gate and wire delay model. Circuits whose correct opera-
+tion relies on more elaborate and/or engineering timing assumptions are simply
+called self-timed.
+
+2.5.3
+Isochronic forks
+
+From the above it is clear that the distinction between speed-independent
+circuits and delay-insensitive circuits relates to wire forks and, more specifi-
+cally, to whether the delays to all end-points of a forking wire are identical or
+not. If the delays are identical, the wire-fork is called isochronic.
+The need for isochronic forks is related to the concept of indication intro-
+duced in section 2.2. Consider a situation in figure 2.16 where gate A has
+changed its output. Eventually this change is observed on the inputs of gates
+B and C, and after some time gates B and C may respond to the new input by
+producing a new output. If this happens we say that the output change on gate
+A is indicated by output changes on gates B and C. If, on the other hand, only
+gate B responds to the new input, it is not possible to establish whether gate C
+has seen the input change as well. In this case it is necessary to strengthen the
+assumptions to d2 = d3 (i.e. that the fork is isochronic) and conclude that since
+the input signal change was indicated by the output of B, gate C has also seen
+the change.
+
+2.5.4
+Relation to circuits
+
+In the 2-phase and 4-phase bundled-data approaches the control circuits are
+normally speed-independent (or in some cases even delay-insensitive), but the
+data-path circuits with their matched delays are self-timed. Circuits designed
+following the 4-phase dual-rail approach are generally quasi-delay-insensitive.
+In the circuits shown in figures 2.12 and 2.14 the forks that connect to the inputs
+of several C-elements must be isochronic, whereas the forks that connect to the
+inputs of several OR gates are delay-insensitive.
+The different circuit classes, DI, QDI, SI and self-timed, are not mutually-
+exclusive ways to build complete systems, but useful abstractions that can be
+used at different levels of design. In most practical designs they are mixed.
+For example, in the Amulet processors [44, 43, 48] SI design is used for lo-
+cal asynchronous controllers, bundled-data for local data processing, and DI
+is used for high-level composition. Another example is the hearing-aid filter
+bank design presented in [103]. It uses the DI dual-rail 4-phase protocol inside
+RAM-modules and arithmetic circuits to provide robust completion indication,
+and 4-phase bundled-data with SI control at the top levels of design, i.e. some-
+what different from the Amulet designs. This emphasizes that the choice of
+handshake protocol and circuit implementation style is among the factors to
+consider when optimizing an asynchronous digital system.
+It is important to stress that speed-independence and delay-insensitivity are
+mathematical properties that can be verified for a given implementation. If an
+abstract component – such as a C-element or a complex And-Or-Invert gate
+– is replaced by its implementation using simple gates and possibly some
+wire-forks, then the circuit may no longer be speed-independent or delay-
+insensitive.
+As an illustrative example we mention that the simple Muller
+pipeline stage in figures 2.8 and 2.15 is no longer delay-insensitive if the C-
+element is replaced by the gate-level implementation shown in figure 2.5 that
+uses simple AND and OR gates. Furthermore, even simple gates are abstrac-
+tions; in CMOS the primitives are N and P transistors, and even the simplest
+gates include forks.
+In chapter 6 we will explore the design of SI control circuits in great detail
+(because theory and synthesis tools are well developed). As SI circuits ignore
+wire delays completely some care is needed when physically implementing
+these circuits. In general one might think that the zero wire-delay assumption
+is trivially satisfied in small circuits involving 10-20 gates, but this need not be
+the case: a normal place and route CAD tool might spread the gates of a small
+controller all over the chip. Even if the gates are placed next to each other
+they may have different logic thresholds on their inputs which in combination
+with slowly rising or falling signals can cause (and have caused!) circuits
+to malfunction. For static CMOS and for circuits operating with low supply
+voltages (e.g. VDD ≈ VtN + |VtP|) this is less of a problem, but for dynamic
+circuits using a larger VDD (e.g. 3.3 V or 5.0 V) the logic thresholds can be
+very different. This often overlooked problem is addressed in detail in [134].
+
+2.6.
+Test
+
+When it comes to the commercial exploitation of asynchronous circuits the
+problem of test comes to the fore. Test is a major topic in its own right, and
+it is beyond the scope of this tutorial to do anything more than mention a few
+issues and challenges. Although the following text is brief it assumes some
+knowledge of testing. The material does not constitute a foundation for the
+following chapters and it may be skipped.
+The previous discussion about Muller circuits (excited gates and the firing
+of gates), the principle of indication, and the discussion of isochronic forks
+tie in nicely with a discussion of testing for stuck-at faults. In the stuck-at
+fault model defects are modeled at the gate level as (individual) inputs and
+outputs being stuck-at-1 or stuck-at-0. The principle of indication says that all
+input signal transitions on a gate must be indicated by an output signal tran-
+sition on the gate. Furthermore, asynchronous circuits make extensive use of
+handshaking and this causes signals to exhibit cyclic transitions between 0 and
+1. In this scenario, the presence of a stuck-at fault is likely to cause the cir-
+cuit to halt; if one component stops handshaking the stall tends to “propagate”
+to neighbouring components, and eventually the entire circuit halts. Conse-
+quently, the development of a set of test patterns that exhaustively tests for all
+stuck-at faults is simply a matter of developing a set of test patterns that toggle
+all nodes, and this is generally a comparatively simple task.
+Since isochronic forks are forks where a signal transition in one or more
+branches is not indicated in the gates that take these signals as inputs, it follows
+that isochronic forks imply untestable stuck-at faults.
+Testing asynchronous circuits incurs additional problems. As we will see
+in the following chapters, asynchronous circuits tend to implement registers
+using latches rather than flip-flops. In combination with the absence of a global
+clock, this makes it less straightforward to connect registers into scan-paths.
+Another consequence of the distributed self-timed control (i.e. the lack of a
+global clock) is that it is less straightforward to single-step the circuit through
+a sequence of well-defined states. This makes it less straightforward to steer
+the circuit into particular quiescent states, which is necessary for IDDQ testing
+– the technique that is used to test for shorts and opens, which are faults that
+are typical in today’s CMOS processes.
+The extensive use of state-holding elements (such as the Muller C-element),
+together with the self-timed behaviour, makes it difficult to test the feed-back
+circuitry that implements the state holding behaviour. Delay-fault testing rep-
+resents yet another challenge.
+The above discussion may leave the impression that the problem of testing
+asynchronous circuits is largely unsolved. This is not correct. The truth is
+rather that the techniques for testing synchronous circuits are not directly ap-
+plicable. The situation is quite similar to the design of asynchronous circuits
+that we will address in detail in the following chapters. Here a mix of new
+and well-known techniques is also needed. A good starting point for reading
+about the testing of asynchronous circuits is [120]. Finally, we mention that
+testing is also touched upon in chapters 13 and 15.
+
+2.7.
+Summary
+
+This chapter introduced a number of fundamental concepts. We will now
+return to the main track of designing circuits. The reader will probably want to
+revisit some of the material in this chapter again while reading the following
+chapters.
+
+
+Chapter 3
+
+STATIC DATA-FLOW STRUCTURES
+
+In this chapter we will develop a high-level view of asynchronous design
+that is equivalent to RTL (register transfer level) in synchronous design. At
+this level the circuits may be viewed as static data-flow structures. The aim is
+to focus on the behaviour of the circuits, and to abstract away the details of the
+handshake signaling which can be considered an orthogonal implementation
+issue.
+
+3.1.
+Introduction
+
+The various handshake protocols and the associated circuit implementa-
+tion styles presented in the previous chapters are rather different. However,
+when looking at the circuits at a more abstract level – the data-flow handshake-
+channel level introduced in chapter 1 – these differences diminish, and it makes
+good sense to view the choice of handshake protocol and circuit implementa-
+tion style as low level implementation decisions that can be made largely in-
+dependently from the more abstract design decisions that establish the overall
+structure and operation of the circuit.
+Throughout this chapter we will assume a 4-phase protocol since this is
+most common. From a data-flow point of view this means that we will be
+dealing with data streams composed of alternating valid and empty values – in
+a two-phase protocol we would see only a sequence of valid values, but apart
+from that everything else would be the same. Furthermore we will be dealing
+with simple latches as storage elements. The latches are controlled according
+to the simple rule stated in chapter 1:
+
+A latch may input and store a new token (valid or empty) from its pre-
+decessor if its successor latch has input and stored the token that it was
+previously holding.
+
+Latches are the only components that initiate and take an active part in hand-
+shaking; all other components are “transparent” to the handshaking. To ease
+the distinction between latches and combinational circuits and to emphasize
+the token flow in circuit diagrams, we will use a box symbol with double verti-
+cal lines to represent latches throughout the rest of this tutorial (see figure 3.1).
+
+Figure 3.1.
+A possible state of a five stage pipeline.
+
+Latch:  L0     L1      L2     L3      L4
+Value:  E      V       V      E       E
+Role:   Token  Bubble  Token  Bubble  Token
+
+Figure 3.2.
+Ring: (a) a possible state; and (b) a sequence of data transfers. Panel (b) shows
+the ring at times t0, t1, t2 and t3; the ring holds two tokens and one bubble.
+
+3.2.
+Pipelines and rings
+
+Figure 3.1 shows a snapshot of a pipeline composed of five latches. The
+“box arrows” represent channels or links consisting of request, acknowledge
+and data signals (as explained on page 5). The valid value in L1 has just
+been copied into L2 and the empty value in L3 has just been copied into
+L4. This means that L1 and L3 are now holding old duplicates of the val-
+ues now stored in L2 and L4. Such old duplicates are called “bubbles”, and the
+newest/rightmost valid and empty values are called “tokens”. To distinguish
+tokens from bubbles, tokens are represented with a circle around the value. In
+this way a latch may hold a valid token, an empty token or a bubble. Bubbles
+can be viewed as catalysts: a bubble allows a token to move forward, and in
+supporting this the bubble moves backwards one step.
+Any circuit should have one or more bubbles, otherwise it will be in a dead-
+lock state. This is a matter of initializing the circuit properly, and we will
+elaborate on this shortly. Furthermore, as we will see later, the number of
+bubbles also has a significant impact on performance.
+In a pipeline with at least three latches, it is possible to connect the output
+of the last stage to the input of the first, forming a ring in which data tokens
+can circulate autonomously. Assuming the ring is initialized as shown in fig-
+ure 3.2(a) at time t0 with a valid token, an empty token and a bubble, the first
+steps of the circulation process are shown in figure 3.2(b), at times t1, t2 and
+t3. Rings are the backbone structures of circuits that perform iterative compu-
+tations. The cycle time of the ring in figure 3.2 is 6 “steps” (the state at t6 will
+be identical to the state at t0). Both the valid token and the empty token have
+to make one round trip. A round trip involves 3 “steps” and as there is only
+one bubble to support this the cycle time is 6 “steps”. It is interesting to note
+that a 4-stage ring initialized to hold a valid token, an empty token and two
+bubbles can iterate in 4 “steps”. It is also interesting to note that the addition
+of one more latch does not re-time the circuit or alter its function (as would be
+the case in a synchronous circuit); it is still a ring in which a single data token
+is circulating.
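The cycle times quoted above can be checked with a small token-and-bubble simulation. This is a minimal sketch under simplifying assumptions that are not part of the tutorial: a bubble is represented by the placeholder "B" rather than by an old duplicate value, all tokens that can move do so in the same “step”, and the function names are hypothetical.

```python
def step(ring):
    """Advance every token whose successor latch currently holds a bubble."""
    n = len(ring)
    nxt = list(ring)
    for i, value in enumerate(ring):
        succ = (i + 1) % n
        if value != "B" and ring[succ] == "B":
            nxt[succ] = value   # the token is copied into the successor latch
            nxt[i] = "B"        # the old copy left behind acts as a bubble
    return nxt


def cycle_time(ring, limit=1000):
    """Number of steps until the initial state of the ring recurs."""
    start = list(ring)
    state = start
    for steps in range(1, limit + 1):
        state = step(state)
        if state == start:
            return steps
    raise ValueError("initial state does not recur within the step limit")


print(cycle_time(["V", "E", "B"]))        # 3-stage ring: valid, empty, one bubble
print(cycle_time(["V", "B", "E", "B"]))   # 4-stage ring: valid, empty, two bubbles
```

With one bubble only one token can move per step, whereas with two suitably placed bubbles both tokens advance in every step. The 4-stage example starts with the bubbles interleaved between the tokens so that the initial state lies on the repeating cycle.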
+
+3.3.
+Building blocks
+
+Figure 3.3 shows a minimum set of components that is sufficient to im-
+plement asynchronous circuits (static data-flow structures with deterministic
+behaviour, i.e. without arbiters). The components can be grouped in four cat-
+egories as explained below. In the next section we will see examples of the
+token-flow behaviour in structures composed of these components. Compo-
+nents for mutual exclusion and arbitration are covered in section 5.8.
+
+Latches provide storage for variables and implement the handshaking that
+supports the token flow. In addition to the normal latch a number of
+degenerate latches are often needed: a latch with only an output channel
+is a source that produces tokens (with the same constant value), and a
+latch with only an input channel is a sink that consumes tokens. Fig-
+ure 2.9 shows the implementation of a 4-phase bundled-data latch, fig-
+ure 2.11 shows the implementation of a 2-phase bundled-data latch, and
+figures 2.12 – 2.13 show the implementation of a 4-phase dual-rail latch.
+
+Function blocks are the asynchronous equivalent of combinatorial circuits.
+They are transparent/passive from a handshaking point of view. A func-
+tion block will: (1) wait for tokens on its inputs (an implicit join), (2)
+perform the required combinatorial function, and (3) issue tokens on its
+outputs. Both empty and valid tokens are handled in this way. Some
+implementations assume that the inputs have been synchronized. In this
+case it may be necessary to use an explicit join component. The imple-
+mentation of function blocks is addressed in detail in chapter 5.
+
+Unconditional flow control: Fork and join components are used to handle
+parallel threads of computation. In engineering terms, forks are used
+when the output from one component is input to more components, and
+joins are used when data from several independent channels needs to
+be synchronized – typically because they are (independent) inputs to a
+circuit. In the following we will often omit joins and forks from circuit
+diagrams: the fan-out of a channel implies a fork, and the fan-in of
+several channels implies a join.
+
+Figure 3.3.
+A minimum and, for most cases, sufficient set of asynchronous components. The
+symbols shown are: latch, source, sink, function block, fork, join, merge, MUX
+and DEMUX. A function block behaves like a join, combinatorial logic and a fork
+carried out in sequence.
+
+A merge component has two or more input channels and one output
+channel. Handshakes on the input channels are assumed to be mutually
+exclusive and the merge relays input tokens/handshakes to the output.
+
+Conditional flow control: MUX and DEMUX components perform the usual
+functions of selecting among several inputs or steering the input to one
+of several outputs. The control input is a channel just like the data in-
+puts and outputs. A MUX will synchronize the control channel and the
+relevant input channel and send the input data to the data output. The
+other input channel is ignored. Similarly a DEMUX will synchronize
+the control and data input channels and steer the input to the selected
+output channel.
+
+As mentioned before the latches implement the handshaking and thereby the
+token flow in a circuit. All other components must be transparent to the
+handshaking. This has significant implications for the implementation of
+these components!
+
+3.4.
+A simple example
+
+Figure 3.4 shows an example of a circuit composed of latches, forks and
+joins that we will use to illustrate the token-flow behaviour of an asynchronous
+circuit. The structure can be described as pipeline segments and a ring con-
+nected into a larger structure using fork and join components.
+
+Figure 3.4.
+An example asynchronous circuit composed of latches, forks and joins.
+
+Assume that the circuit is initialized as shown in figure 3.5 at time t0: all
+latches are initialized to the empty value except for the bottom two latches in
+the ring that are initialized to contain a valid value and an empty value. Values
+enclosed in circles are tokens and the rest are bubbles. Assume further that
+the left and right hand environments (not shown) take part in the handshakes
+that the circuit is prepared to perform. Under these conditions the operation
+of the circuit (i.e. the flow of tokens) is as illustrated in the snapshots labeled
+t0 – t11. The left hand environment performs one handshake cycle inputting a
+valid value followed by an empty value. In a similar way the right environment
+takes part in one handshake cycle and consumes a valid value and an empty
+value.
+Because the flow of tokens is controlled by local handshaking the circuit
+could exhibit many other behaviours. For example, at time t5 the circuit is
+ready to accept a new valid value from its left environment. Notice also that
+if the initial state had no tokens in the ring, then the circuit would deadlock
+after a few steps. It is highly recommended that the reader tries to play the
+token-bubble data-flow game; perhaps using the same circuit but with different
+initial states.
+
+Figure 3.5.
+A possible operation sequence of the example circuit from figure 3.4. The figure
+shows snapshots of the circuit at successive times starting from t0. Legend: a
+circled V is a valid token, a circled E is an empty token, and an uncircled value
+is a bubble.
+
+3.5.
+Simple applications of rings
+
+This section presents a few simple and obvious circuits based on a single
+ring.
+
+3.5.1
+Sequential circuits
+
+Figure 3.6 shows a straightforward implementation of a finite state machine.
+Its structure is similar to a synchronous finite state machine; it consists of a
+function block and a ring that holds the current state. The machine accepts an
+“input token” that is joined with the “current state token”. Then the function
+block computes the output and the next state, and finally the fork splits these
+into an “output token” and a “next state token.”
+
+Figure 3.6.
+Implementation of an asynchronous finite state machine using a ring. The figure
+labels the function block F, its “Input” and “Output” channels, and the
+“Current state” and “Next state” channels of the ring.
+
+3.5.2
+Iterative computations
+
+A ring can also be used to build circuits that implement iterative computa-
+tions. Figure 3.7 shows a template circuit. The idea is that the circuit will:
+(1) accept an operand, (2) sequence through the same operation a number of
+times until the computation terminates and (3) output the result. The necessary
+control is not shown. The figure shows one particular implementation. Possible
+variations involve locating the latches and the function block differently
+in the ring as well as decomposing the function block and putting these
+(simpler) function blocks between more latches. In [156] Ted Williams presents
+a circuit that performs division using a self-timed 5-stage ring. This design
+was later used in a floating point coprocessor in a commercial microprocessor
+[157].
+
+Figure 3.7.
+Implementation of an iterative computation using a ring. The figure labels the
+operand input (“Operand(s)”), the result output (“Result”) and the function block F.
+
+3.6.
+FOR, IF, and WHILE constructs
+
+Very often the desired function of a circuit is expressed using a program-
+ming language (C, C++, VHDL, Verilog, etc.). In this section we will show
+implementation templates for a number of typical conditional structures and
+loop structures. A reader who is familiar with control-data-flow graphs, per-
+haps from high-level synthesis, will recognize the great similarities between
+asynchronous circuits and control-data-flow graphs [36, 127].
+
+if <cond> then <body1> else <body2>
+An asynchronous circuit template
+for implementing an if statement is shown in figure 3.8(a). The data-type of
+the input and output channels to the if circuit is a record containing all vari-
+ables in the <cond> expression and the variables manipulated by <body1> and
+<body2>. The data-type of the output channel from the cond block is a Boolean
+that controls the DEMUX and MUX components. The FORK associated with
+this channel is not shown.
+Since the execution of <body1> and <body2> is mutually exclusive it is
+possible to replace the controlled MUX in the bottom of the circuit with a
+simpler MERGE as shown in figure 3.8(b). The circuit in figure 3.8 contains
+no feedback loops and no latches – it can be considered a large function block.
+The circuit can be pipelined for improved performance by inserting latches.
+
+Figure 3.8.
+A template for implementing if statements: (a) with a controlled MUX at the
+output; (b) with a MERGE at the output.
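The token flow through the if template can be mimicked in software. The following is a minimal Python sketch, not an implementation from the tutorial. It assumes that each channel is modeled as a list of valid tokens only (empty tokens and handshaking are omitted) and that a control value of 1 selects <body1>; all function names are hypothetical.

```python
def cond_block(records, cond):
    """Produce one Boolean control token (1 or 0) per input record."""
    return [1 if cond(r) else 0 for r in records]


def demux(control, data):
    """Steer each data token to output 1 or output 0 according to its control token."""
    out0, out1 = [], []
    for c, d in zip(control, data):
        (out1 if c else out0).append(d)
    return out0, out1


def mux(control, in0, in1):
    """For each control token, take the next token from the selected input."""
    it0, it1 = iter(in0), iter(in1)
    return [next(it1) if c else next(it0) for c in control]


def if_template(records, cond, body1, body2):
    """cond feeds both the DEMUX and the MUX; the bodies sit between them."""
    control = cond_block(records, cond)
    to_body2, to_body1 = demux(control, records)
    from_body1 = [body1(r) for r in to_body1]
    from_body2 = [body2(r) for r in to_body2]
    return mux(control, from_body2, from_body1)


# if x > 0 then x := x - 1 else x := x + 10
print(if_template([3, -2, 0, 5], lambda x: x > 0, lambda x: x - 1, lambda x: x + 10))
```

Because the control stream fixes which body produces the next result, the MUX restores the original token order; this is the same mutual exclusion that allows a MERGE to replace the MUX in the circuit.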
+
+for <count> do <body>
+An asynchronous circuit template for implementing
+a for statement is shown in figure 3.9. The data-type of the input channel to
+the for circuit is a record containing all variables manipulated in the <body>
+and the loop count, <count>, that is assumed to be a non-negative integer. The
+data-type of the output channel is a record containing all variables manipulated
+in the <body>.
+
+Figure 3.9.
+A template for implementing for statements. The figure marks the initial tokens
+held by the latches on the MUX control input.
+
+The data-type of the output channel from the count block is a Boolean,
+and one handshake on the input channel of the count block encloses <count>
+handshakes on the output channel: <count> - 1 handshakes providing the
+Boolean value “1” and one (final) handshake providing the Boolean value “0”.
+Notice the two latches on the control input to the MUX. They must be initial-
+ized to contain a data token with the value “0” and an empty token in order to
+enable the for circuit to read the variables into the loop.
+After executing the for statement once, the last handshake of the count block
+will steer the variables in the loop onto the output channel and put a “0” token
+and an empty token into the two latches, thereby preparing the for circuit for
+a subsequent activation. The FORK in the input and the FORK on the output
+of the count block are not shown. Similarly a number of latches are omitted.
+Remember: (1) all rings must contain at least 3 latches and (2) for each latch
+initialized to hold a data token there must also be a latch initialized to hold an
+empty token (when using 4-phase handshaking).
+
+while <cond> do <body>
+An asynchronous circuit template for implement-
+ing a while statement is shown in figure 3.10. Inputs to (and outputs from) the
+circuit are the variables in the <cond> expression and the variables manipu-
+lated by <body>. As before in the for circuit, it is necessary to put two latches
+initialized to contain a data token with the value “0” and an empty token on
+the control input of the MUX. And as before a number of latches are omitted
+in the two rings that constitute the while circuit. When the while circuit termi-
+nates (after zero or more iterations) data is steered out of the loop and this also
+causes the latches on the MUX control input to become initialized properly for
+the subsequent activation of the circuit.
+
+Figure 3.10.
+A template for implementing while statements. The figure marks the initial tokens
+held by the latches on the MUX control input.
+
+3.7.
+A more complex example: GCD
+
+Using the templates just introduced we will now design a small example
+circuit, GCD, that computes the greatest common divisor of two integers. GCD
+is often used as an introductory example, and figure 3.11 shows a programming
+language specification of the algorithm.
+In addition to its role as a design example in the current context, GCD can
+also serve to illustrate the similarities and differences between different design
+techniques. In chapter 8 we will use the same example to illustrate the Tangram
+language and the associated syntax-directed compilation process (section 8.3.3
+on pages 127–128).
+The implementation of GCD is shown in figure 3.12. It consists of a while
+template whose body is an if template. Figure 3.12 shows the circuit including
+all the necessary latches (with their initial states). The implementation makes
+no attempt at sharing resources – it is a direct mapping following the imple-
+mentation templates presented in the previous section.
+
+input (a,b);
+while a ≠ b do
+  if a > b
+  then a := a - b;
+  else b := b - a;
+output (a);
+
+Figure 3.11.
+A programming language specification of GCD.
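For readers who want to run the algorithm of figure 3.11, here is a direct Python transcription. It is a sketch of the specification, not of the circuit; the function name is hypothetical, and the check for positive operands is an added assumption, since the subtraction loop does not terminate when an operand is zero.

```python
def gcd_by_subtraction(a, b):
    """Greatest common divisor by repeated subtraction, as in figure 3.11."""
    if a <= 0 or b <= 0:
        # Assumption added here: the specification is only used with positive integers.
        raise ValueError("both operands must be positive integers")
    while a != b:          # the while template: condition A == B
        if a > b:          # the if template: condition A > B
            a = a - b      # body1: A - B
        else:
            b = b - a      # body2: B - A
    return a


print(gcd_by_subtraction(12, 18))
print(gcd_by_subtraction(35, 14))
```

Each loop iteration corresponds to one trip of the (a, b) record around the outer ring of the circuit, with the comparison deciding which subtraction block the record is steered through.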
+
+Figure 3.12.
+An asynchronous circuit implementation of GCD. The function blocks in the figure
+are labeled A==B, A>B, A-B and B-A; the input channel carries A,B and the output
+channel carries GCD(A,B).
+
+3.8.
+Pointers to additional examples
+
+3.8.1
+A low-power filter bank
+
+In [103] we reported on the design of a low-power IFIR filter bank for a
+digital hearing aid. It is a circuit that was designed following the approach
+presented in this chapter. The paper also provides some insight into the design
+of low power circuits as well as the circuit level implementation of memory
+structures and datapath units.
+
+3.8.2
+An asynchronous microprocessor
+
+In [23] we reported on the design of a MIPS microprocessor, called ARISC.
+Although there are many details to be understood in a large-scale design like
+a microprocessor, the basic architecture shown in figure 3.13 can be under-
+stood as a simple data-flow structure. The solid-black rectangles represent
+latches, the box-arrows represent channels, and the text-boxes represent
+function blocks (combinatorial circuits).
+The processor is a simple pipelined design with instructions retiring in
+program order. It consists of a fetch-decode-issue ring with a fixed number of
+tokens. This ensures a fixed instruction prefetch depth. The issue stage forks
+decoded instructions into the execute pipeline and initiates the fetch of one more
+instruction. Register forwarding is avoided by a locking mechanism: when an
+instruction is issued for execution the destination register is locked until the
+write-back has taken place. If a subsequent instruction has a read-after-write
+data hazard this instruction is stalled until the register is unlocked. The tokens
+flowing in the design contain all operands and control signals related to the
+execution of an instruction, i.e. similar to what is stored in a pipeline stage in a
+synchronous processor. For further information the interested reader is referred
+to [23]. Other asynchronous microprocessors are based on similar principles.
+
+Figure 3.13.
+Architecture of the ARISC microprocessor. Block labels in the figure include
+Inst. Mem., Decode, Issue, Flush, REG Read, Arith., Logic, Shift, CP0, Data Mem.,
+REG Write, PC Read, PC ALU, Lock and UnLock.
+
+3.8.3
+A fine-grain pipelined vector multiplier
+
+The GCD circuit and the ARISC presented in the preceding sections use bit-
+parallel communication channels. An example of a static data-flow structure
+that uses 1-bit channels and fine grain pipelining is the serial-parallel vector
+multiplier design reported in [124, 125]. Here all necessary word-level syn-
+chronization is performed implicitly by the function blocks. The large number
+of interacting rings and pipeline segments in the static data-flow representa-
+tion of the design makes it rather complex. After reading the next chapter on
+performance analysis the interested reader may want to look at this design; it
+contains several interesting optimizations.
+
+3.9.
+Summary
+
+This chapter developed a high-level view of asynchronous design that is
+equivalent to RTL (register transfer level) in synchronous design – static data
+flow structures. The next chapter addresses performance analysis at this level of
+abstraction.
+
+
+Chapter 4
+
+PERFORMANCE
+
+In this chapter we will address the performance analysis and optimization
+of asynchronous circuits. The material extends and builds upon the “static
+data-flow structures view” introduced in the previous chapter.
+
+4.1.
+Introduction
+
+In a synchronous circuit, performance analysis and optimization is a matter
+of finding the longest latency signal path between two registers; this determines
+the period of the clock signal. The global clock partitions the circuit into many
+combinatorial circuits that can be analyzed individually. This is known as static
+timing analysis and it is a rather simple task, even for a large circuit.
+For an asynchronous circuit, performance analysis and optimization is a
+global and therefore much more complex problem. The use of handshaking
+makes the timing in one component dependent on the timing of its neighbours,
+which again depends on the timing of their neighbours, etc. Furthermore, the
+performance of a circuit does not depend only on its structure, but also on
+how it is initialized and used by its environment. The performance of an asyn-
+chronous circuit can even exhibit transients and oscillations.
+We will first develop a qualitative understanding of the dynamics of the to-
+ken flow in asynchronous circuits. A good understanding of this is essential for
+designing circuits with good performance. We will then introduce some quan-
+titative performance parameters that characterize individual pipeline stages and
+pipelines and rings composed of identical pipeline stages. Using these param-
+eters one can make first-level design decisions. Finally we will address how
+more complex and irregular structures can be analyzed.
+The following text represents a major revision of material from [124] and
+it is based on original work by Ted Williams [153, 154, 155]. If consulting
+these references the reader should be aware of the exact definition of a token.
+Throughout this book a token is defined as a valid data value or an empty data
+value, whereas in the cited references (that deal exclusively with 4-phase hand-
+shaking) a token is a valid-empty data pair. The definition used here accentu-
+ates the similarity between a token in an asynchronous circuit and the token in
+
+41
+
+
+42
+Part I: Asynchronous circuit design – A tutorial
+
+a Petri net. Furthermore it provides some unification between 4-phase hand-
+shaking and 2-phase handshaking – 2-phase handshaking is the same game,
+but without empty-tokens.
+In the following we will assume 4-phase handshaking, and the examples we
+provide all use bundled-data circuits. It is left as an exercise for the reader
+to make the simple adaptations that are necessary for dealing with 2-phase
+handshaking.
+
+4.2.
+A qualitative view of performance
+
+4.2.1
+Example 1: A FIFO used as a shift register
+
+The fundamental concepts can be illustrated by a simple example: a FIFO
+composed of a number of latches in which there are N valid tokens separated
+by N empty tokens, and whose environment alternates between reading a token
+from the FIFO and writing a token into the FIFO (see figure 4.1(a)). In this way
+the number of tokens in the FIFO is invariant. This example is relevant because
+many designs use FIFOs in this way, and because it models the behaviour of
+shift registers as well as rings – structures in which the number of tokens is
+also invariant.
+A relevant performance figure is the throughput, which is the rate at which
+tokens are input to or output from the shift register. This figure is proportional
+to the time it takes to shift the contents of the chain of latches one position to
+the right.
+Figure 4.1(b) illustrates the behaviour of an implementation in which there
+are 2N latches (two per valid token) and figure 4.1(c) illustrates the behaviour
+of an implementation in which there are 3N latches (three per valid token). In
+both examples the number of valid tokens in the FIFO is N = 3, and the only difference
+between the two situations in figure 4.1(b) and 4.1(c) is the number of bubbles.
+In figure 4.1(b) at time t1 the environment reads the valid token, D1, as
+indicated by the solid channel symbol. This introduces a bubble that enables
+data transfers to take place one at a time (t2–t5). At time t6 the environment
+inputs a valid token, D4, and at this point all elements have been shifted one
+position to the right. Hence, the time used to move all elements one place to
+the right is proportional to the number of tokens, in this case 2N = 6 time steps.
+Adding more latches increases the number of bubbles, which again increases
+the number of data transfers that can take place simultaneously, thereby im-
+proving the performance. In figure 4.1(c) the shift register has 3N stages and
+therefore one bubble per valid-empty token-pair. The effect of this is that N
+data transfers can occur simultaneously and the time used to move all elements
+one place to the right is constant; 2 time steps.
+If the number of latches was increased to 4N there would be one token per
+bubble, and the time to move all tokens one step to the right would be only
+
+
+Chapter 4: Performance
+43
+
+[Figure 4.1 graphic not reproduced. Panels: (a) A FIFO and its environment; (b) N data tokens and N empty tokens in 2N stages; (c) N data tokens and N empty tokens in 3N stages. Panels (b) and (c) show successive snapshots (labelled t0, t1, ...) in which each latch holds a data token (D1 to D4), an empty token (E) or a bubble.]
+
+Figure 4.1.
+A FIFO and its environment. The environment alternates between reading a token
+from the FIFO and writing a token into the FIFO.
+
+
+44
+Part I: Asynchronous circuit design – A tutorial
+
+1 time step. In this situation the pipeline is half full and the latches holding
+bubbles act as slave latches (relative to the latches holding tokens). Increasing
+the number of bubbles further would not increase the performance further. Fi-
+nally, it is interesting to notice that the addition of just one more latch holding
+a bubble to figure 4.1(b) would double the performance. The asynchronous
+designer has great freedom in trading more latches for performance.
+As the number of bubbles in a design depends on the number of latches per
+token, the above analysis illustrates that performance optimization of a given
+circuit is primarily a task of structural modification – circuit level optimization
+like transistor sizing is of secondary importance.
+
+4.2.2
+Example 2: A shift register with parallel load
+
+In order to illustrate another point – that the distribution of tokens and bub-
+bles in a circuit can vary over time, depending on the dynamics of the circuit
+and its environment – we offer another example: a shift register with parallel
+load. Figure 4.2 shows an initial design of a 4-bit shift register. The circuit has
+a bit-parallel input channel, din[3:0], connecting it to a data producing envi-
+ronment. It also has a 1-bit data channel, do, and a 1-bit control channel, ctl,
+connecting it to a data consuming environment. Operation is controlled by the
+data consuming environment which may request the circuit to: (ctl = 0) perform
+a parallel load and to provide the least significant bit from the bit-parallel
+channel on the do channel, or (ctl = 1) to perform a right shift and provide
+the next bit on the do channel. In this way the data consuming environment
+always inputs a control token (valid or empty) to which the circuit always re-
+sponds by outputting a data token (valid or empty). During a parallel load, the
+previous content of the shift register is steered into the “dead end” sink-latches.
+During a right shift the constant 0 is shifted into the most significant position
+– corresponding to a logical right shift. The data consuming environment is
+not required to read all the input data bits, and it may continue reading zeros
+beyond the most significant input data bit.
+The initial design shown in figure 4.2 suffers from two performance lim-
+iting inexpediencies: firstly, it has the same problem as the shift register in
+figure 4.1(b) – there are too few bubbles, and the peak data rate on the bit-
+serial output reduces linearly with the length of the shift register. Secondly,
+the control signal is forked to all of the MUXes and DEMUXes in the design.
+This implies a high fan-out of the request and data signals (which requires a
+couple of buffers) and synchronization of all the individual acknowledge sig-
+nals (which requires a C-element with many inputs, possibly implemented as
+a tree of C-elements). The first problem can be avoided by adding a 3rd latch
+to the datapath in each stage of the circuit corresponding to the situation in
+
+
+Chapter 4: Performance
+45
+
+[Figure 4.2 graphic not reproduced. Recoverable labels: data producing environment driving din[3:0] (bits din[3], din[2], din[1], din[0]); data consuming environment connected through channels ctl and do; multiplexer/demultiplexer ports marked 0 and 1; latch contents d1, d2, d3, E and the constant 0.]
+
+Figure 4.2.
+Initial design of the shift register with parallel load.
+
+
+46
+Part I: Asynchronous circuit design – A tutorial
+
+[Figure 4.3 graphic not reproduced. Panels (a), (b) and (c) show the improved circuit in three successive states. Recoverable labels: inputs din[3], din[2], din[1], din[0]; channels ctl and dout; multiplexer/demultiplexer ports marked 0 and 1; latch contents d0 to d3, 0, 1 and E.]
+
+Figure 4.3.
+Improved design of the shift register with parallel load.
+
+
+Chapter 4: Performance
+47
+
+figure 4.1(c), but if the extra latches are added to the control path instead, as
+shown in figure 4.3(a) on page 46, they will solve both problems.
+This improved design exhibits an interesting and illustrative dynamic be-
+haviour: initially, the data latches are densely packed with tokens and all the
+control latches contain bubbles, figure 4.3(a). The first step of the parallel load
+cycle is shown in figure 4.3(b), and figure 4.3(c) shows a possible state after
+the data consuming environment has read a couple of bits. The most-significant
+stage is just about to perform its “parallel load” and the bubbles are now in the
+chain of data latches. If at this point the data consuming environment paused,
+the tokens in the control path would gradually disappear while tokens in the
+datapath would pack again. Note that at any time the total number of tokens in
+the circuit is constant!
+
+4.3.
+Quantifying performance
+
+4.3.1
+Latency, throughput and wavelength
+
+When the overall structure of a design is being decided, it is important to
+determine the optimal number of latches or pipeline stages in the rings and
+pipeline fragments from which the design is composed. In order to establish a
+basis for first order design decisions, this section will introduce some quantita-
+tive performance parameters. We will restrict the discussion to 4-phase hand-
+shaking and bundled-data circuit implementations and we will consider rings
+with only a single valid token. Subsection 4.3.4, which concludes this section
+on performance parameters, will comment on adapting to other protocols and
+implementation styles.
+The performance of a pipeline is usually characterized by two parameters:
+latency and throughput (or its inverse called period or cycle time). For an asyn-
+chronous pipeline a third parameter, the dynamic wavelength, is important as
+well. With reference to figure 4.4 and following [153, 154, 155] these param-
+eters are defined as follows:
+
+Latency: The latency is the delay from the input of a data item until the corre-
+sponding output data item is produced. When data flows in the forward
+direction, acknowledge signals propagate in the reverse direction. Con-
+sequently two parameters are defined:
+
+The forward latency, $L_f$, is the delay from new data on the input of
+a stage (Data[i-1] or Req[i-1]) to the production of the corresponding
+output (Data[i] or Req[i]) provided that the acknowledge
+signals are in place when data arrives. $L_{f.V}$ and $L_{f.E}$ denote the
+tively. It is assumed that these latencies are constants, i.e. that
+they are independent of the value of the data. [As forward propa-
+
+
+48
+Part I: Asynchronous circuit design – A tutorial
+
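+[Figure 4.4 graphic not reproduced. Bundled-data pipeline: latch L[i] and function block F[i] with signals Req[i-1], Data[i-1], Ack[i], Req[i], Data[i], Ack[i+1] and a matching delay element d. Dual-rail pipeline: latch L[i] and function block F[i] with signals Data[i-1], Ack[i], Data[i], Ack[i+1].]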
+
+Figure 4.4.
+Generic pipelines for definition of performance parameters.
+
+gation of an empty token does not “compute” it may be desirable
+to minimize $L_{f.E}$. In the 4-phase bundled-data approach this can
+be achieved through the use of an asymmetric delay element.]
+
+The reverse latency, $L_r$, is the delay from receiving an acknowledge
+from the succeeding stage (Ack[i+1]) until the corresponding
+acknowledge is produced to the preceding stage (Ack[i]) provided
+that the request is in place when the acknowledge arrives. $L_{r\uparrow}$ and
+$L_{r\downarrow}$ denote the latencies of propagating Ack↑ and Ack↓ respectively.
+
+Period: The period, P, is the delay between the input of a valid token (fol-
+lowed by its succeeding empty token) and the input of the next valid
+token, i.e. a complete handshake cycle. For a 4-phase protocol this in-
+volves: (1) forward propagation of a valid data value, (2) reverse propa-
+gation of acknowledge, (3) forward propagation of the empty data value,
+and (4) reverse propagation of acknowledge. Therefore a lower bound
+on the period is:
+    $P \geq L_{f.V} + L_{r\uparrow} + L_{f.E} + L_{r\downarrow}$    (4.1)
+
+Many of the circuits we consider in this book are symmetric, i.e. $L_{f.V} = L_{f.E}$
+and $L_{r\uparrow} = L_{r\downarrow}$, and for these circuits the period is simply:
+
+    $P = 2L_f + 2L_r$    (4.2)
+
+We will also consider circuits where $L_{f.V} > L_{f.E}$ and, as we will see in
+section 4.4.1 and again in section 7.3, the actual implementation of the
+latches may lead to a period that is larger than the minimum possible
+
+
+Chapter 4: Performance
+49
+
+given by equation 4.1. In section 4.4.1 we analyze a pipeline whose
+period is:
+    $P = 2L_r + 2L_{f.V}$    (4.3)
+
+Throughput: The throughput, T, is the number of valid tokens that flow
+through a pipeline stage per unit time: $T = 1/P$
+
+Dynamic wavelength: The dynamic wavelength, Wd, of a pipeline is the num-
+ber of pipeline stages that a forward-propagating token passes through
+during P:
+
+    $W_d = \frac{P}{L_f}$    (4.4)
+
+Explained differently: Wd is the distance – measured in pipeline stages
+– between successive valid or empty tokens, when they flow unimpeded
+down a pipeline. Think of a valid token as the crest of a wave and its
+associated empty token as the trough of the wave. If $L_{f.V} \neq L_{f.E}$ the
+average forward latency $L_f = \frac{1}{2}(L_{f.V} + L_{f.E})$ should be used in the above
+equation.
+
+Static spread: The static spread, S, is the distance – measured in pipeline
+stages – between successive valid (or empty) tokens in a pipeline that is
+full (i.e. contains no bubbles). Sometimes the term occupancy is used;
+this is the inverse of S.
+
+4.3.2
+Cycle time of a ring
+
+The parameters defined above are local performance parameters that char-
+acterize the implementation of individual pipeline stages. When a number of
+pipeline stages are connected to form a ring, the following parameter is rele-
+vant:
+
+Cycle time: The cycle time of a ring, TCycle, is the time it takes for a token
+(valid or empty) to make one round trip through all of the pipeline stages
+in the ring. To achieve maximum performance (i.e. minimum cycle
+time), the number of pipeline stages per valid token must match the dynamic
+wavelength, in which case $T_{Cycle} = P$. If the number of pipeline
+stages is smaller, the cycle time will be limited by the lack of bubbles,
+and if there are more pipeline stages the cycle time will be limited by
+the forward latency through the pipeline stages. In [153, 154, 155] these
+two modes of operation are called bubble limited and data limited, re-
+spectively.
+
+
+50
+Part I: Asynchronous circuit design – A tutorial
+
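+[Figure 4.5 graphic not reproduced. Plot of Tcycle (vertical axis) against the number of stages N (horizontal axis); the minimum, Tcycle = P, is reached at N = Wd. For N < Wd (bubble limited): Tcycle = (2N / (N - 2)) Lr. For N > Wd (data limited): Tcycle = N Lf.]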
+
+Figure 4.5.
+Cycle time of a ring as a function of the number of pipeline stages in it.
+
+The cycle time of an N-stage ring in which there is one valid token,
+one empty token and $N - 2$ bubbles can be computed from one of the
+following two equations (illustrated in figure 4.5):
+
+When $N \geq W_d$ the cycle time is limited by the forward latency
+through the N stages:
+
+    $T_{Cycle}(DataLimited) = N \cdot L_f$    (4.5)
+
+If $L_{f.V} \neq L_{f.E}$ use $L_f = \max\{L_{f.V}; L_{f.E}\}$.
+
+When $N \leq W_d$ the cycle time is limited by the reverse latency. With
+N pipeline stages, one valid token and one empty token, the ring
+contains $N - 2$ bubbles, and as a cycle involves 2N data transfers
+(N valid and N empty), the cycle time becomes:
+
+    $T_{Cycle}(BubbleLimited) = \frac{2N}{N - 2} L_r$    (4.6)
+
+If $L_{r\uparrow} \neq L_{r\downarrow}$ use $L_r = \frac{1}{2}(L_{r\uparrow} + L_{r\downarrow})$.
+
+For the sake of completeness it should be mentioned that a third possible
+mode of operation called control limited exists for some circuit config-
+urations [153, 154, 155]. This is, however, not relevant to the circuit
+implementation configurations presented in this book.
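+
+As a consistency check, the two expressions for the cycle time of a ring can be shown to meet at $N = W_d$. The derivation below is a sketch that assumes a symmetric stage, so that the period is given by equation 4.2, and it treats $W_d$ as if it were an integer.
+
+```latex
+\begin{align*}
+N = W_d = \frac{P}{L_f} = \frac{2L_f + 2L_r}{L_f}
+  \quad&\Longrightarrow\quad N - 2 = \frac{2L_r}{L_f} \\
+T_{Cycle}(BubbleLimited) = \frac{2N}{N - 2}\,L_r
+  &= 2N L_r \cdot \frac{L_f}{2L_r} = N L_f \\
+  &= T_{Cycle}(DataLimited) = W_d L_f = P
+\end{align*}
+```
+
+Under these assumptions the minimum cycle time of the ring equals the period of a single stage, which is the matching condition stated in the definition of the cycle time.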
+
+The topic of performance analysis and optimization has been addressed in
+some more recent papers [31, 90, 91, 37] and in some of these the term “slack
+matching” is used (referring to the process of balancing the timing of forward
+flowing tokens and backward flowing bubbles).
+
+
+Chapter 4: Performance
+51
+
+4.3.3
+Example 3: Performance of a 3-stage ring
+
+ 
+
+[Figure 4.6 graphic not reproduced. Panel (a): pipeline stage [i] with latch L, combinatorial circuit CL, C-element, inverter and delay element, and signals Req[i-1], Data[i-1], Ack[i], Req[i], Data[i], Ack[i+1]. Panel (b): the same stage with the forward latency (Lf) and reverse latency (Lr) signal paths marked. Component delays: tc = 2, ti = 1, td = 3.]
+
+Figure 4.6.
+A simple 4-phase bundled-data pipeline stage, and an illustration of its forward
+and reverse latency signal paths.
+
+Let us illustrate the above by a small example: a 3-stage ring composed of
+identical 4-phase bundled-data pipeline stages that are implemented as illus-
+trated in figure 4.6(a). The data path is composed of a latch and a combinatorial
+circuit, CL. The control part is composed of a C-element and an inverter that
+controls the latch and a delay element that matches the delay in the combinato-
+rial circuit. Without the combinatorial circuit and the delay element we have a
+simple FIFO stage. For illustrative purposes the components in the control part
+are assigned the following latencies: C-element: $t_c = 2$ ns, inverter: $t_i = 1$ ns,
+and delay element: $t_d = 3$ ns.
+Figure 4.6(b) shows the signal paths corresponding to the forward and reverse
+latencies, and table 4.1 lists the expressions and the values of these parameters.
+From these figures the period and the dynamic wavelength for the
+two circuit configurations are calculated. For the FIFO, $W_d = 5.0$ stages, and
+for the pipeline, $W_d = 3.2$. A ring can only contain an integer number of stages
+and if $W_d$ is not integer it is necessary to analyze rings with $\lfloor W_d \rfloor$ and $\lceil W_d \rceil$
+
+Table 4.1.
+Performance of different simple ring configurations.
+
+                     FIFO                         Pipeline
+Parameter            Expression       Value       Expression           Value
+
+Lr                   tc + ti          3 ns        tc + ti              3 ns
+Lf                   tc               2 ns        tc + td              5 ns
+P = 2Lf + 2Lr        4tc + 2ti        10 ns       4tc + 2ti + 2td      16 ns
+
+Wd                                    5 stages                         3.2 stages
+
+TCycle (3 stages)    6 Lr             18 ns       6 Lr                 18 ns
+TCycle (4 stages)    4 Lr             12 ns       4 Lf                 20 ns
+TCycle (5 stages)    3.3 Lr = 5 Lf    10 ns       5 Lf                 25 ns
+TCycle (6 stages)    6 Lf             12 ns       6 Lf                 30 ns
+
+
+52
+Part I: Asynchronous circuit design – A tutorial
+
+stages and determine which yields the smallest cycle time. Table 4.1 shows the
+results of the analysis including cycle times for rings with 3 to 6 stages.
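+
+As a check on table 4.1, some of its entries can be reproduced by hand from equations 4.4, 4.5 and 4.6. The sketch below uses only the component delays given for this example and assumes the symmetric period of equation 4.2.
+
+```latex
+\begin{align*}
+\text{FIFO:}\quad W_d &= \frac{P}{L_f} = \frac{10\ \text{ns}}{2\ \text{ns}} = 5.0 \\
+\text{Pipeline:}\quad W_d &= \frac{16\ \text{ns}}{5\ \text{ns}} = 3.2 \\
+\text{Pipeline, } N = 3 < W_d:\quad T_{Cycle} &= \frac{2 \cdot 3}{3 - 2}\,L_r = 6 \cdot 3\ \text{ns} = 18\ \text{ns} \\
+\text{Pipeline, } N = 4 > W_d:\quad T_{Cycle} &= 4\,L_f = 4 \cdot 5\ \text{ns} = 20\ \text{ns} \\
+\text{FIFO, } N = 5 = W_d:\quad T_{Cycle} &= \frac{2 \cdot 5}{5 - 2}\,L_r = \frac{10}{3} \cdot 3\ \text{ns} = 10\ \text{ns} = 5\,L_f = P
+\end{align*}
+```
+
+For the pipeline the 3-stage ring is therefore the better of the two neighbouring choices (18 ns against 20 ns), although neither reaches the period P = 16 ns, because $W_d$ is not an integer. For the FIFO the 5-stage ring matches $W_d$ exactly and achieves the period.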
+
+4.3.4
+Final remarks
+
+The above presentation made a number of simplifying assumptions: (1)
+only rings and pipelines composed of identical pipeline stages were considered,
+(2) it assumed function blocks with symmetric delays (i.e. circuits where
+$L_{f.V} = L_{f.E}$), (3) it assumed function blocks with constant latencies (i.e.
+ignoring the important issue of data-dependent latencies and average-case
+performance), (4) it considered rings with only a single valid token, and (5) the
+analysis considered only 4-phase handshaking and bundled-data circuits.
+For 4-phase dual-rail implementations (where request is embedded in the
+data encoding) the performance parameter equations defined in the previous
+section apply without modification. For designs using a 2-phase protocol, some
+straightforward modifications are necessary: there are no empty tokens and
+hence there is only one value for the forward latency Lf and one value for the
+reverse latency Lr. It is also a simple matter to state expressions for the cycle
+time of rings with more tokens.
+It is more difficult to deal with data-dependent latencies in the function
+blocks and to deal with non-identical pipeline stages. Despite these deficien-
+cies the performance parameters introduced in the previous sections are very
+useful as a basis for first-order design decisions.
+
+4.4.
+Dependency graph analysis
+
+When the pipeline stages incorporate different function blocks, or function
+blocks with asymmetric delays, it is a more complex task to determine the crit-
+ical path. It is necessary to construct a graph that represents the dependencies
+between signal transitions in the circuit, and to analyze this graph and identify
+the critical path cycle [19, 153, 154, 155]. This can be done in a systematic or
+even mechanical way but the amount of detail makes it a complex task.
+The nodes in such a dependency graph represent rising or falling signal
+transitions, and the edges represent dependencies between the signal
+transitions. Formally, a dependency graph is a marked graph [28]. Let us look at a couple
+of examples.
+
+4.4.1
+Example 4: Dependency graph for a pipeline
+
+As a first example let us consider a (very long) pipeline composed of identical
+stages using a function block with asymmetric delays causing $L_{f.E} < L_{f.V}$.
+Figure 4.7(a) shows a 3-stage section of this pipeline. Each pipeline stage has
+
+
+Chapter 4: Performance
+53
+
+the following latency parameters:
+
+    $L_{f.V} = t_{d(0 \to 1)} + t_c = 5\ \text{ns} + 2\ \text{ns} = 7\ \text{ns}$
+    $L_{f.E} = t_{d(1 \to 0)} + t_c = 1\ \text{ns} + 2\ \text{ns} = 3\ \text{ns}$
+    $L_{r\uparrow} = L_{r\downarrow} = t_i + t_c = 3\ \text{ns}$
+
+There is a close relationship between the circuit diagram and the dependency
+graph. As signals alternate between rising transitions (↑) and falling transitions
+(↓) – or between valid and empty data values – the graph has two nodes per
+circuit element. Similarly the graph has two edges per wire in the circuit.
+Figure 4.7(b) shows the two graph fragments that correspond to a pipeline
+stage, and figure 4.7(c) shows the dependency graph that corresponds to the 3
+pipeline stages in figure 4.7(a).
+A label outside a node denotes the circuit delay associated with the signal
+transition. We use a particular style for the graphs that we find illustrative: the
+nodes corresponding to the forward flow of valid and empty data values are
+organized as two horizontal rows, and nodes representing the reverse flowing
+acknowledge signals appear as diagonal segments connecting the rows.
+The cycle time or period of the pipeline is the time from a signal transition
+until the same signal transition occurs again. The cycle time can therefore be
+determined by finding the longest simple cycle in the graph, i.e. the cycle with
+the largest accumulated circuit delay which does not contain a sub-cycle. The
+dotted cycle in figure 4.7(c) is the longest simple cycle. Starting at point A the
+corresponding period is:
+
+    $P = \underbrace{t_{d(0 \to 1)} + t_c}_{L_{f.V}} + \underbrace{t_i + t_c}_{L_{r\uparrow}} + \underbrace{t_{d(0 \to 1)} + t_c}_{L_{f.V}} + \underbrace{t_i + t_c}_{L_{r\downarrow}} = 2L_r + 2L_{f.V} = 20\ \text{ns}$
+
+Note that this is the period given by equation 4.3 on page 49. An alternative
+cycle time candidate is the following:
+
+    $\underbrace{R[i]\uparrow;\ Req[i]\uparrow}_{L_{f.V}};\ \underbrace{A[i-1]\downarrow;\ Req[i-1]\downarrow}_{L_{r\uparrow}};\ \underbrace{R[i]\downarrow;\ Req[i]\downarrow}_{L_{f.E}};\ \underbrace{A[i-1]\uparrow;\ Req[i-1]\uparrow}_{L_{r\downarrow}}$
+
+and the corresponding period is:
+
+    $P = 2L_r + L_{f.V} + L_{f.E} = 16\ \text{ns}$
+
+Note that this is the minimum possible period given by equation 4.1 on page 48.
+The period is determined by the longest cycle which is 20 ns. Thus, this ex-
+ample illustrates that for some (simple) latch implementations it may not be
+possible to reduce the cycle time by using function blocks with asymmetric
+delays ($L_{f.E} < L_{f.V}$).
+
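+The gap between the two candidate cycles can be written out with the latency values of this example. This is only a restatement of the figures given above; the last line is an added observation that follows from the first expression not containing $L_{f.E}$.
+
+```latex
+\begin{align*}
+P_{\text{dotted}} &= 2L_r + 2L_{f.V} = 2 \cdot 3\ \text{ns} + 2 \cdot 7\ \text{ns} = 20\ \text{ns} \\
+P_{\text{min}}    &= 2L_r + L_{f.V} + L_{f.E} = 6\ \text{ns} + 7\ \text{ns} + 3\ \text{ns} = 16\ \text{ns} \\
+P_{\text{dotted}} - P_{\text{min}} &= L_{f.V} - L_{f.E} = 4\ \text{ns} \\
+L_{f.E} \to 0 \quad&\Longrightarrow\quad P_{\text{dotted}} \text{ is unchanged at } 20\ \text{ns}
+\end{align*}
+```
+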
+
+54
+Part I: Asynchronous circuit design – A tutorial
+
+[Figure 4.7 graphic not reproduced. Panel (a): circuit diagram of Stage[i-1], Stage[i] and Stage[i+1], each with a latch L, a combinatorial circuit CL, a C-element and an inverter; delays tc = 2, ti = 1, td(0->1) = 5, td(1->0) = 1. Panel (b): graph fragments for the 0->1 and the 1->0 transition of Req[i], with nodes R[i], A[i], Req[i], Ack[i] and delays td[i], tc, ti. Panel (c): dependency graph for the three stages with delays tc = 2 ns, ti = 1 ns, td(0->1) = 5 ns, td(1->0) = 1 ns, and with point A marked.]
+
+Figure 4.7.
+Data dependency graph for a 3-stage section of a pipeline: (a) the circuit dia-
+gram, (b) the two graph fragments corresponding to a pipeline stage, and (c) the resulting data-
+dependency graph.
+
+4.4.2
+Example 5: Dependency graph for a 3-stage ring
+
+As another example of dependency graph analysis let us consider a three
+stage 4-phase bundled-data ring composed of different pipeline stages, fig-
+ure 4.8(a): stage 1 with a combinatorial circuit that is matched by a symmetric
+
+
+Chapter 4: Performance
+55
+
+[Figure 4.8 graphic not reproduced. Panel (a): circuit diagram of the ring with Stage 1, Stage 2 and Stage 3, signals Req1 to Req3, Ack1 to Ack3, Data1 to Data3, and points A and B marked; delay labels tc = 2 ns, ti = 1 ns, td2 = 2 ns, td3(0->1) = 6 ns, td3(1->0) = 1 ns. Panel (b): dependency graph with nodes R1 to R3, A1 to A3, Req1 to Req3, Ack1 to Ack3; delay labels tc = 2 ns, ti = 1 ns, td1(0->1) = 2 ns, td1(1->0) = 2 ns, td3(0->1) = 6 ns, td3(1->0) = 1 ns.]
+
+Figure 4.8.
+Data dependency graph for an example 3-stage ring: (a) the circuit diagram for
+the ring and (b) the resulting data-dependency graph.
+
+delay element, stage 2 without combinatorial logic, and stage 3 with a combi-
+natorial circuit that is matched by an asymmetric delay element.
+The dependency graph is similar to the dependency graph for the 3-stage
+pipeline from the previous section. The only difference is that the output port
+of stage 3 is connected to the input port of stage 1, forming a closed graph.
+There are several “longest simple cycle” candidates:
+1 A cycle corresponding to the forward flow of valid-tokens:
+
+(R1↑; Req1↑; R2↑; Req2↑; R3↑; Req3↑)
+
+For this cycle, the cycle time is $T_{Cycle} = 14$ ns.
+
+2 A cycle corresponding to the forward flow of empty-tokens:
+
+(R1↓; Req1↓; R2↓; Req2↓; R3↓; Req3↓)
+
+For this cycle, the cycle time is $T_{Cycle} = 9$ ns.
+
+
+56
+Part I: Asynchronous circuit design – A tutorial
+
+3 A cycle corresponding to the backward flowing bubble:
+
+(A1↑; Req1↑; A3↓; Req3↓; A2↑; Req2↑; A1↓; Req1↓; A3↑; Req3↑;
+A2↓; Req2↓)
+
+For this cycle, the cycle time is $T_{Cycle} = 6L_r = 18$ ns.
+
+The 3-stage ring contains one valid-token, one empty-token and one bub-
+ble, and it is interesting to note that the single bubble is involved in six
+data transfers, and therefore makes two reverse round trips for each for-
+ward round trip of the valid-token.
+
+4 There is, however, another cycle with a slightly longer cycle time, as
+illustrated in figure 4.8(b). It is the cycle corresponding to the backward-
+flowing bubble where the sequence:
+
+(A1↓; Req1↓; A3↑)
+is replaced by
+(R3↑)
+
+For this cycle the cycle time is $T_{Cycle} = 20$ ns.
+
+A dependency graph analysis of a 4-stage ring is very similar. The only
+difference is that there are two bubbles in the ring. In the dependency graph
+this corresponds to the existence of two “bubble cycles” that do not interfere
+with each other.
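+
+The search for the longest simple cycle can be sketched in a few lines of Python. This is a minimal illustration only: the graph TOY_EDGES is a hypothetical toy example (not the graph of figure 4.8), the function name longest_simple_cycle is our own, and the exhaustive search is only practical for small graphs.
+
```python
# Hypothetical toy dependency graph: (source, target) -> delay in ns.
TOY_EDGES = {
    ("a", "b"): 2,
    ("b", "c"): 2,
    ("c", "a"): 1,
    ("b", "a"): 6,
}


def longest_simple_cycle(edges):
    """Return (cycle_time, nodes) for the slowest simple cycle of a small graph."""
    graph = {}
    for (src, dst), delay in edges.items():
        graph.setdefault(src, []).append((dst, delay))

    best = (0, [])

    def visit(start, node, path, total):
        nonlocal best
        for nxt, delay in graph.get(node, []):
            if nxt == start:
                # Closed a cycle back to the start node.
                if total + delay > best[0]:
                    best = (total + delay, list(path))
            elif nxt not in path and nxt > start:
                # Only extend through nodes "larger" than the start node, so
                # that each cycle is enumerated once, from its smallest node.
                path.append(nxt)
                visit(start, nxt, path, total + delay)
                path.pop()

    for start in sorted(graph):
        visit(start, start, [start], 0)
    return best


if __name__ == "__main__":
    cycle_time, nodes = longest_simple_cycle(TOY_EDGES)
    print(cycle_time, nodes)
```
+
+In the toy graph the two-node cycle through the 6 ns edge dominates the three-node cycle, which mirrors how one slow edge determines the cycle time of the ring.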
+The dependency graph approach presented above assumes a closed circuit
+that results in a closed dependency graph. If a component such as a pipeline
+fragment is to be analyzed it is necessary to include a (dummy) model of its
+environment as well – typically in the form of independent and eager token
+producers and token consumers, i.e. dummy circuits that simply respond to
+handshakes. Figure 2.15 on page 24 illustrated this for a single pipeline stage
+control circuit.
+Note that a dependency graph as introduced above is similar to a signal
+transition graph (STG) which we will introduce more carefully in chapter 6.
+
+4.5. Summary
+
+This chapter addressed the performance analysis of asynchronous circuits
+at several levels: firstly, by providing a qualitative understanding of perfor-
+mance based on the dynamics of tokens flowing in a circuit; secondly, by in-
+troducing quantitative performance parameters that characterize pipelines and
+rings composed of identical pipeline stages and, thirdly, by introducing de-
+pendency graphs that enable the analysis of pipelines and rings composed of
+non-identical stages.
+At this point we have covered the design and performance analysis of asyn-
+chronous circuits at the “static data-flow structures” level, and it is time to
+address low-level circuit design principles and techniques. This will be the
+topic of the next two chapters.
+
+
+Chapter 5
+
+HANDSHAKE CIRCUIT IMPLEMENTATIONS
+
+In this chapter we will address the implementation of handshake compo-
+nents. First, we will consider the basic set of components introduced in sec-
+tion 3.3 on page 32: (1) the latch, (2) the unconditional data-flow control el-
+ements join, fork and merge, (3) function blocks, and (4) the conditional flow
+control elements MUX and DEMUX. In addition to these basic components
+we will also consider the implementation of mutual exclusion elements and
+arbiters and touch upon the (unavoidable) problem of metastability. The major
+part of the chapter (sections 5.3–5.6) is devoted to the implementation of func-
+tion blocks and the material includes a number of fundamental concepts and
+circuit implementation styles.
+
+5.1. The latch
+
+As mentioned previously, the role of latches is: (1) to provide storage for
+valid and empty tokens, and (2) to support the flow of tokens via handshak-
+ing with neighbouring latches. Possible implementations of the handshake
+latch were shown in chapter 2: Figure 2.9 on page 18 shows how a 4-phase
+bundled-data handshake latch can be implemented using a conventional latch
+and a control circuit (the figure shows several such examples assembled into
+pipelines). In a similar way figure 2.11 on page 20 shows the implementation
+of a 2-phase bundled-data latch, and figures 2.12-2.13 on page 21 show the
+implementation of a 4-phase dual-rail latch.
+A handshake latch can be characterized in terms of the throughput, the dy-
+namic wavelength and the static spread of a FIFO that is composed of identical
+latches. Common to the two 4-phase latch designs mentioned above is that a
+FIFO will fill with every other latch holding a valid token and every other latch
+holding an empty token (as illustrated in figure 4.1(b) on page 43). Thus, the
+static spread for these FIFOs is S = 2.
+A 2-phase implementation does not involve empty tokens and consequently
+it may be possible to design a latch whose static spread is S = 1. Note, however,
+that the implementation of the 2-phase bundled-data handshake latch in
+figure 2.11 on page 20 involves several level-sensitive latches; the utilization
+of the level sensitive latches is no better.
+Ideally, one would want to pack a valid token into every level-sensitive latch,
+and in chapter 7 we will address the design of 4-phase bundled-data handshake
+latches that have a smaller static spread.
+
+5.2. Fork, join, and merge
+
+Possible 4-phase bundled-data and 4-phase dual-rail implementations of the
+fork, join, and merge components are shown in figure 5.1. For simplicity the
+figure shows a fork with two output channels only, and join and merge compo-
+nents with two input channels only. Furthermore, all channels are assumed to
+be 1-bit channels. It is, of course, possible to generalize to three or more inputs
+and outputs respectively, and to extend to n-bit channels. Based on the expla-
+nation given below this should be straightforward, and it is left as an exercise
+for the reader.
+
+4-phase fork and join
+A fork involves a C-element to combine the acknowl-
+edge signals on the output channels into a single acknowledge signal on the
+input channel. Similarly a 4-phase bundled-data join involves a C-element to
+combine the request signals on the input channels into a single request signal
+on the output channel. The 4-phase dual-rail join does not involve any active
+components as the request signal is encoded into the data.
+The particular fork in figure 5.1 duplicates the input data, and the join con-
+catenates the input data. This happens to be the way joins and forks are mostly
+used in our static data-flow structures, but there are many alternatives: for ex-
+ample, the fork could split the input data which would make it more symmetric
+to the join in figure 5.1. In any case the difference is only in how the input data
+is transferred to the output. From a control point of view the different alter-
+natives are identical: a join synchronizes several input channels and a fork
+synchronizes several output channels.
+
+4-phase merge
+The implementation of the merge is a little more elaborate.
+Handshakes on the input channels are mutually exclusive, and the merge sim-
+ply relays the active input handshake to the output channel.
+Let us consider the implementation of the 4-phase bundled-data merge first.
+It consists of an asynchronous control circuit and a multiplexer that is con-
+trolled by the input request. The control circuit is explained below.
+
+[Figure 5.1: table with one row per component (fork, join, merge) and one column per protocol (4-phase bundled-data, 4-phase dual-rail), showing circuits built from C-elements, OR gates and a MUX on the channels x, y and z.]
+
+Figure 5.1.
+4-phase bundled-data and 4-phase dual-rail implementations of the fork, join and
+merge components.
+
+The request signals on the input channels are mutually exclusive and may
+simply be ORed together to produce the request signal on the output channel.
+For each input channel, a C-element produces an acknowledge signal in response
+to an acknowledge on the output channel provided that the input channel
+has valid data. For example, the C-element driving the x-ack signal is set high
+when x-req and z-ack have both gone high, and it is reset when both signals have
+gone low again. As z-ack goes low in response to x-req going low, it will suffice to
+reset the C-element in response to z-ack going low. This optimization is possible
+if asymmetric C-elements are available, figure 5.2. Similar arguments apply
+for the C-element that drives the y-ack signal. A more detailed introduction to
+generalized C-elements and related state-holding devices is given in chapter 6,
+sections 6.4.1 and 6.4.5.
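+
+The difference between the symmetric and the asymmetric C-element in the merge control can be illustrated with a small behavioural model. This is a sketch under our own assumptions: the class names CElement and AsymmetricCElement are hypothetical, signals are modelled as 0/1 integers, and gate delays are ignored.
+
```python
class CElement:
    """Symmetric Muller C-element: the output changes only when both inputs agree."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


class AsymmetricCElement:
    """Set when x_req and z_ack are both high; reset as soon as z_ack is low."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, x_req, z_ack):
        if x_req and z_ack:
            self.out = 1
        elif not z_ack:
            self.out = 0
        return self.out


# One 4-phase handshake seen by the gate as (x_req, z_ack) pairs:
# x_req rises, z_ack rises, x_req falls, z_ack falls.
HANDSHAKE = [(1, 0), (1, 1), (0, 1), (0, 0)]


def trace(gate, events=HANDSHAKE):
    """Return the x_ack value after each input event."""
    return [gate.update(x_req, z_ack) for x_req, z_ack in events]


if __name__ == "__main__":
    print(trace(CElement()))
    print(trace(AsymmetricCElement()))
```
+
+For this event order both models produce the same x-ack trace, which is the reason the simpler asymmetric gate suffices here.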
+
+[Figure 5.2: asymmetric C-element with inputs x-req and z-ack, output x-ack, and set/reset annotations.]
+
+Figure 5.2.
+A possible implementation of the upper asymmetric C-element in the 4-phase
+bundled-data merge in figure 5.1.
+
+The implementation of the 4-phase dual-rail merge is fairly similar. As
+request is encoded into the data signals an OR gate is used for each of the
+two output signals z.t and z.f. Acknowledge on an input channel is produced
+in response to an acknowledge on the output channel provided that the input
+channel has valid data. Since the example assumes 1-bit wide channels, the
+latter is established using an OR gate (marked “1”), but for N-bit wide channels
+a completion detector (as shown in figure 2.13 on page 21) would be required.
+
+2-phase fork, join and merge
+Finally a word about 2-phase bundled-data
+implementations of the fork, join and merge components: the implementation
+of 2-phase bundled-data fork and join components is identical to the imple-
+mentation of the corresponding 4-phase bundled-data components (assuming
+that all signals are initially low).
+The implementation of a 2-phase bundled-data merge, on the other hand,
+is complex and rather different, and it provides a good illustration of why the
+implementation of some 2-phase bundled-data components is complex. When
+observing an individual request or acknowledge signal the transitions will obvi-
+ously alternate between rising and falling, but since nothing is known about the
+sequence of handshakes on the input channels there is no relationship between
+the polarity of a request signal transition on an input channel and the polarity
+of the corresponding request signal transition on the output channel. Similarly
+there is no relationship between the polarity of an acknowledge signal transi-
+tion on the output channel and the polarity of the corresponding acknowledge
+signal transition on the input channel. This calls for some kind of storage
+element on each request and acknowledge signal produced by the circuit.
+This brings complexity, as does the associated control logic.
+
+5.3. Function blocks – The basics
+
+This section will introduce the fundamental principles of function block de-
+sign, and subsequent sections will illustrate function block implementations
+for different handshake protocols. The running example will be an N-bit ripple
+carry adder.
+
+5.3.1 Introduction
+
+A function block is the asynchronous equivalent of a combinatorial circuit:
+it computes one or more output signals from a set of input signals. The term
+“function block” is used to stress the fact that we are dealing with circuits with
+a purely functional behaviour.
+However, in addition to computing the desired function(s) of the input signals,
+a function block must also be transparent to the handshaking that is implemented
+by its neighbouring latches. This transparency to handshaking is
+what makes function blocks different from combinatorial circuits and, as we
+will see, there are greater depths to this than is indicated by the word “transparent”
+– in particular for function blocks that implicitly indicate completion
+(which is the case for circuits using dual-rail signals).
+
+[Figure 5.3: an ADD function block whose operand channels A, B and cin are combined by a join and whose result channels SUM and cout are produced by a fork.]
+
+Figure 5.3.
+A function block whose operands and results are provided on separate channels
+requires a join of the inputs and a fork on the output.
+
+The most general scenario is where a function block receives its operands
+on separate channels and produces its results on separate channels, figure 5.3.
+The use of several independent input and output channels implies a join on the
+input side and a fork on the output side, as illustrated in the figure. These can
+be implemented separately, as explained in the previous section, or they can be
+integrated into the function block circuitry. In what follows we will restrict the
+discussion to a scenario where all operands are provided on a single channel
+and where all results are provided on a single channel.
+We will first address the issue of handshake transparency and then review
+the fundamentals of ripple carry addition, in order to provide the necessary
+background for discussing the different implementation examples that follow.
+A good paper on the design of function blocks is [97].
+
+5.3.2 Transparency to handshaking
+
+The general concepts are best illustrated by considering a 4-phase dual-rail
+scenario – function blocks for bundled data protocols can be understood as
+a special case. Figure 5.4(a) shows two handshake latches connected directly
+and figure 5.4(b) shows the same situation with a function block added between
+the two latches. The function block must be transparent to the handshaking.
+Informally this means that if observing the signals on the ports of the latches,
+one should see the same sequence of handshake signal transitions; the only
+difference should be some slow-down caused by the latency of the function
+block.
+A function block is obviously not allowed to produce a request on its output
+before receiving a request on its input; put the other way round, a request on the
+output of the function block should indicate that all of the inputs are valid and
+that all (relevant) internal signals and all output signals have been computed.
+(Here we are touching upon the principle of indication once again.) In 4-phase
+protocols a symmetric set of requirements applies for the return-to-zero part of
+the handshaking.
+
+[Figure 5.4: (a) two LATCH blocks connected by Data and Ack signals; (b) the same two latches with a function block F between them, labelled input data and output data.]
+
+Figure 5.4.
+(a) Two latches connected directly by a handshake channel and (b) the same situation
+with a function block added between the latches. The handshaking as seen by the latches
+in the two situations should be the same, i.e. the function block must be designed such that it is
+transparent to the handshaking.
+
+Function blocks can be characterized as either strongly indicating or weakly
+indicating depending on how they behave with respect to this handshake transparency.
+The signalling that can be observed on the channel between the two
+latches in figure 5.4(a) was illustrated in figure 2.3 on page 13. We can illustrate
+the handshaking for the situation in figure 5.4(b) in a similar way.
+
+[Figure 5.5: signal traces over time for input data, output data and acknowledge, alternating between all valid and all empty, annotated with the following event orderings:]
+(1) “All inputs become defined” precedes “Some outputs become defined”
+(2) “All outputs become defined” precedes “Some inputs become undefined”
+(3) “All inputs become undefined” precedes “Some outputs become undefined”
+(4) “All outputs become undefined” precedes “Some inputs become defined”
+
+Figure 5.5.
+Signal traces and event orderings for a strongly indicating function block.
+
+[Figure 5.6: signal traces over time for input data, output data and acknowledge, alternating between all valid and all empty, annotated with the following event orderings:]
+(1) “Some inputs become defined” precedes “Some outputs become defined”
+(2) “All inputs become defined” precedes “All outputs become defined”
+(3) “All outputs become defined” precedes “Some inputs become undefined”
+(4) “Some inputs become undefined” precedes “Some outputs become undefined”
+(5) “All inputs become undefined” precedes “All outputs become undefined”
+(6) “All outputs become undefined” precedes “Some inputs become defined”
+
+Figure 5.6.
+Signal traces and event orderings for a weakly indicating function block.
+
+A function block is strongly indicating, as illustrated in figure 5.5, if (1)
+it waits for all of its inputs to become valid before it starts to compute and
+produce valid outputs, and if (2) it waits for all of its inputs to become
+empty before it starts to produce empty outputs.
+
+A function block is weakly indicating, as illustrated in figure 5.6, if (1)
+it starts to compute and produce valid outputs as soon as possible, i.e.
+when some but not all input signals have become valid, and if (2) it
+starts to produce empty outputs as soon as possible, i.e. when some but
+not all input signals have become empty.
+
+For a weakly indicating function block to behave correctly, it is necessary
+to require that it never produces all valid outputs until after all inputs have be-
+come valid, and that it never produces all empty outputs until after all inputs
+have become empty. This behaviour is identical to Seitz’s weak conditions in
+[121]. In [121] Seitz further explains that it can be proved that if the individual
+components satisfy the weak conditions then any “valid combinatorial circuit
+structure” of function blocks also satisfies the weak conditions, i.e. that func-
+tion blocks may be combined to form larger function blocks. By “valid com-
+binatorial circuit structure” we mean a structure where no components have
+inputs or outputs left unconnected and where there are no feed-back signal
+paths. Strongly indicating function blocks have the same property – a “valid
+combinatorial circuit structure” of strongly indicating function blocks is itself
+a strongly indicating function block.
+Notice that both weakly and strongly indicating function blocks exhibit a
+hysteresis-like behaviour in the valid-to-empty and empty-to-valid transitions:
+(1) some/all outputs must remain valid until after some/all inputs have become
+empty, and (2) some/all outputs must remain empty until after some/all inputs
+have become valid. It is this hysteresis that ensures handshake transparency,
+and the implementation consequence is that one or more state holding circuits
+(normally in the form of C-elements) are needed.
+Finally, a word about the 4-phase bundled-data protocol. Since Req↑ is
+equivalent to “all data signals are valid” and since Req↓ is equivalent to “all
+data signals are empty,” a 4-phase bundled-data function block can be categorized
+as strongly indicating.
+As we will see in the following, strongly indicating function blocks have
+worst-case latency. To obtain actual case latency weakly indicating function
+blocks must be used. Before addressing possible function block implementa-
+tion styles for the different handshake protocols it is useful to review the basics
+of binary ripple-carry addition, the running example in the following sections.
+
+5.3.3 Review of ripple-carry addition
+
+Figure 5.7 illustrates the implementation principle of a ripple-carry adder.
+A 1-bit full adder stage implements:
+
+s = a ⊕ b ⊕ c          (5.1)
+d = ab + ac + bc       (5.2)
+
+[Figure 5.7: a chain of 1-bit full adders, stages 1 ... i ... n, each with inputs ai, bi and ci and outputs si and di; cin enters stage 1 and cout leaves stage n.]
+
+Figure 5.7.
+A ripple-carry adder. The carry output of one stage di is connected to the carry
+input of the next stage ci+1.
+
+In many implementations inputs a and b are recoded as:
+
+p = a ⊕ b      (“propagate” carry)      (5.3)
+g = ab         (“generate” carry)       (5.4)
+k = ¬a ¬b      (“kill” carry)           (5.5)
+
+and the output signals are computed as follows:
+
+s = p ⊕ c                               (5.6)
+d = g + pc     or alternatively         (5.7a)
+¬d = k + p ¬c                           (5.7b)
+
+For a ripple-carry adder, the worst case critical path is a carry rippling across
+the entire adder. If the latency of a 1-bit full adder is tadd the worst case latency
+of an N-bit adder is N · tadd. This is a very rare situation and in general the
+longest carry ripple during a computation is much shorter. Assuming random
+and uncorrelated operands the average latency is log(N) · tadd and, if numerically
+small operands occur more frequently, the average latency is even less.
+Using normal Boolean signals (as in the bundled-data protocols) there is no
+way to know when the computation has finished and the resulting performance
+is thus worst-case.
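+
+The gap between worst-case and typical carry ripple can be explored with a short simulation. The sketch below is illustrative only: the function names are our own, and the longest run of consecutive propagate stages is used as a simple proxy for the longest carry ripple.
+
```python
import random


def add_with_ripple_length(a, b, n, cin=0):
    """Add two n-bit integers stage by stage.

    Returns (sum, carry_out, longest_run), where longest_run is the longest
    run of consecutive propagate stages, used as a proxy for the carry ripple.
    """
    carry, total, run, longest = cin, 0, 0, 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        p, g = ai ^ bi, ai & bi            # propagate and generate (5.3, 5.4)
        total |= (p ^ carry) << i          # s = p xor c            (5.6)
        carry = g | (p & carry)            # d = g + pc             (5.7a)
        run = run + 1 if p else 0
        longest = max(longest, run)
    return total, carry, longest


def average_longest_ripple(n, trials=2000, seed=1):
    """Average the ripple proxy over random, uncorrelated n-bit operands."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a, b = rng.getrandbits(n), rng.getrandbits(n)
        total += add_with_ripple_length(a, b, n)[2]
    return total / trials


if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        print(width, average_longest_ripple(width))
```
+
+Running the loop for increasing word widths gives a feel for how slowly the typical ripple grows compared with the worst case of N stages.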
+By using dual-rail carry signals (d.t, d.f) it is possible to design circuits
+that indicate completion as part of the computation and thus achieve actual
+case latency. The crux is that a dual-rail carry signal, d, conveys one of the
+following 3 messages:
+
+(d.t, d.f) = (0,0) = Empty    “The carry has not been computed yet” (possibly because it depends on c)
+(d.t, d.f) = (1,0) = True     “The carry is 1”
+(d.t, d.f) = (0,1) = False    “The carry is 0”
+
+Consequently it is possible for a 1-bit adder to output a valid carry without
+waiting for the incoming carry if its inputs make this possible (a = b = 0 or
+a = b = 1). This idea was first put forward in 1955 in a paper by Gilchrist [52].
+The same idea is explained in [62, pp. 75-78] and in [121].
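+
+The three messages above and the early carry can be modelled directly. This is a stateless sketch under our own assumptions: the names EMPTY, TRUE, FALSE and early_carry are hypothetical, dual-rail values are (t, f) tuples, and the hysteresis needed for the return-to-zero phase is deliberately left out.
+
```python
# Dual-rail encoding as (t, f) pairs, following the three messages in the text.
EMPTY, TRUE, FALSE = (0, 0), (1, 0), (0, 1)


def early_carry(a, b, c):
    """Carry output d of a 1-bit adder for dual-rail inputs a, b and carry-in c."""
    if a == EMPTY or b == EMPTY:
        return EMPTY          # operands not yet valid: nothing can be decided
    if a == b:
        return a              # kill (both False) or generate (both True)
    return c                  # propagate: d follows c, including Empty


if __name__ == "__main__":
    print(early_carry(TRUE, TRUE, EMPTY))    # generate: valid without c
    print(early_carry(TRUE, FALSE, EMPTY))   # propagate: still waiting for c
```
+
+Only the propagate case has to wait for the incoming carry, which is what shortens the typical latency.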
+
+5.4. Bundled-data function blocks
+
+5.4.1 Using matched delays
+
+A bundled-data implementation of the adder in figure 5.7 is shown in figure 5.8.
+It is composed of a traditional combinatorial circuit adder and a matching
+delay element. The delay element provides a constant delay that matches
+the worst case latency of the combinatorial adder. This includes the worst case
+critical path in the circuit – a carry rippling across the entire adder – as well as
+the worst case operating conditions. For reliable operation some safety margin
+is needed.
+
+[Figure 5.8: a combinatorial circuit with inputs a[n:1], b[n:1] and c and outputs s[n:1] and d, a matched delay between Req-in and Req-out, and the signals Ack-in and Ack-out.]
+
+Figure 5.8.
+A 4-phase bundled data implementation of the N-bit handshake adder from figure 5.7.
+
+In addition to the combinatorial circuit itself, the delay element represents
+a design challenge for the following reasons: to a first order the delay element
+will track delay variations that are due to the fabrication process spread as well
+as variations in temperature and supply voltage. On the other hand, wire de-
+lays can be significant and they are often beyond the designer’s control. Some
+design policy for matched delays is obviously needed. In a full custom de-
+sign environment one may use a dummy circuit with identical layout but with
+weaker transistors. In a standard cell automatic place and route environment
+one will have to accept a fairly large safety margin or do post-layout timing
+analysis and trimming of the delays. The latter sounds tedious but it is similar
+to the procedure used in synchronous design where setup and hold times are
+checked and delays trimmed after layout.
+In a 4-phase bundled-data design an asymmetric delay element may be
+preferable from a performance point of view, in order to perform the return-to-
+zero part of the handshaking as quickly as possible. Another issue is the power
+consumption of the delay element. In the ARISC processor design reported in
+[23] the delay elements consumed 10 % of the total power.
+
+5.4.2 Delay selection
+
+In [105] Nowick proposed a scheme called “speculative completion”. The
+basic principle is illustrated in figure 5.9. In addition to the desired function
+some additional circuitry is added that selects among several matched delays.
+The estimate must be conservative, i.e. on the safe side. The estimation can
+be based on the input signals and/or on some internal signals in the circuit that
+implements the desired function.
+
+[Figure 5.9: a function block (Funct.) between Inputs and Outputs, an Estimate block, and a MUX that selects among small, medium and large matched delays on the path from Req_in to Req_out.]
+
+Figure 5.9.
+The basic idea of “speculative completion”.
+
+For an N-bit ripple-carry adder the propagate signals (cf. equation 5.3)
+that form the individual 1-bit full adders (cf. figure 5.7) may be used for the
+estimation. As an example of the idea consider a 16-bit adder. If p8 = 0 the
+longest carry ripple can be no longer than 8 stages, and if p12 = p8 = p4 = 0
+the longest carry ripple can be no longer than 4 stages. Based on such simple
+estimates a sufficiently large matched delay is selected. Again, if a 4-phase
+protocol is used, asymmetric delay elements are preferable from a performance
+point of view.
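+
+The 16-bit estimate described above can be written down as a small selection function. This is an illustrative sketch, not the circuit of [105]: the function names and the labels "small", "medium" and "large" are our own, and bit 1 is assumed to be the least significant stage as in figure 5.7.
+
```python
def propagate_bits(a, b, n=16):
    """Map the 1-based stage index i to the propagate bit p_i = a_i xor b_i."""
    return {i + 1: ((a >> i) ^ (b >> i)) & 1 for i in range(n)}


def select_delay(a, b):
    """Choose a conservative matched-delay class for a 16-bit ripple-carry adder."""
    p = propagate_bits(a, b)
    if p[4] == 0 and p[8] == 0 and p[12] == 0:
        return "small"     # no carry can ripple across more than 4 stages
    if p[8] == 0:
        return "medium"    # no carry can ripple across more than 8 stages
    return "large"         # assume a ripple across the entire adder


if __name__ == "__main__":
    print(select_delay(0x0000, 0x0000))
    print(select_delay(0x0008, 0x0000))
    print(select_delay(0x0080, 0x0000))
```
+
+The estimate is conservative by construction: it may select a longer delay than needed, but never a shorter one.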
+To the designer the trade-off is between an aggressive estimate with a large
+circuit overhead (area and power) or a less aggressive estimate with less over-
+head. For more details on the implementation and the attainable performance
+gains the reader is referred to [105, 107].
+
+5.5. Dual-rail function blocks
+
+5.5.1 Delay insensitive minterm synthesis (DIMS)
+
+In chapter 2 (page 22 and figure 2.14) we explained the implementation of
+an AND gate for dual-rail signals. Using the same basic topology it is possible
+to implement other simple gates such as OR, EXOR, etc. An inverter involves
+no active circuitry as it is just a swap of the two wires.
+Arbitrary functions can be implemented by combining gates in exactly the
+same way as when one designs combinatorial circuits for a synchronous cir-
+cuit. The handshaking is implicitly taken care of and can be ignored when
+composing gates and implementing Boolean functions. This has the important
+implication that existing logic synthesis techniques and tools may be used, the
+only difference is that the basic gates are implemented differently.
+The dual-rail AND gate in figure 2.14 is obviously rather inefficient: 4 C-
+elements and 1 OR gate totaling approximately 30 transistors – a factor five
+greater than a normal AND gate whose implementation requires only 6 tran-
+sistors. By implementing larger functions the overhead can be reduced. To
+illustrate this figure 5.10(b)-(c) shows the implementation of a 1-bit full adder.
+We will discuss the circuit in figure 5.10(d) shortly.
+
+[Figure 5.10: (a) ADD symbol with dual-rail inputs a, b, c and dual-rail outputs s, d; (b) truth table, given below; (c) and (d) gate-level circuits in which C-elements on the rails a.t, a.f, b.t, b.f, c.t, c.f feed OR gates that drive s.t, s.f, d.t and d.f.]
+
+Truth table of figure 5.10(b) (E = empty, F = false, T = true):
+
+a  b  c  |  s.t  s.f  d.t  d.f
+E  E  E  |   0    0    0    0
+intermediate codewords  |  NO CHANGE
+F  F  F  |   0    1    0    1     Kill
+F  F  T  |   1    0    0    1     Kill
+F  T  F  |   1    0    0    1
+F  T  T  |   0    1    1    0
+T  F  F  |   1    0    0    1
+T  F  T  |   0    1    1    0
+T  T  F  |   0    1    1    0     Generate
+T  T  T  |   1    0    1    0     Generate
+
+Figure 5.10.
+A 4-phase dual-rail full-adder: (a) Symbol, (b) truth table, (c) DIMS implementation
+and (d) an optimization that makes the full adder weakly indicating.
+
+The PLA-like structure of the circuit in figure 5.10(c) illustrates a general
+principle for implementing arbitrary Boolean functions. In [124] we called this
+approach DIMS – Delay-Insensitive Minterm Synthesis – because the circuits
+are delay-insensitive and because the C-elements in the circuits generate all
+minterms of the input variables. The truth tables have 3 groups of rows speci-
+fying the output when the input is: (1) the empty codeword to which the circuit
+responds by setting the output empty, (2) an intermediate codeword which does
+not affect the output, or (3) a valid codeword to which the circuit responds by
+setting the output to the proper valid value.
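+
+The three groups of truth-table rows can be reproduced with a behavioural model of the DIMS structure. This is a sketch under our own assumptions: the class name DimsFullAdder is hypothetical, dual-rail values are (t, f) tuples, each minterm is modelled as a 3-input C-element that is set when all its inputs are high and reset when all are low, and gate delays are ignored.
+
```python
from itertools import product

EMPTY, TRUE, FALSE = (0, 0), (1, 0), (0, 1)


class DimsFullAdder:
    """Behavioural model of a DIMS-style dual-rail full adder."""

    def __init__(self):
        # One state-holding 3-input C-element per minterm of (a, b, c).
        self.minterm = {m: 0 for m in product((0, 1), repeat=3)}

    def step(self, a, b, c):
        """Apply dual-rail inputs and return the dual-rail outputs (s, d)."""
        for m in self.minterm:
            # A minterm uses the .t rail for a 1 literal and the .f rail for a 0.
            rails = [sig[0] if bit else sig[1] for sig, bit in zip((a, b, c), m)]
            if all(rails):
                self.minterm[m] = 1      # valid codeword matching this minterm
            elif not any(rails):
                self.minterm[m] = 0      # all three rails low again
            # otherwise: intermediate codeword, the C-element holds its state
        high = [m for m, v in self.minterm.items() if v]
        s = (int(any(sum(m) % 2 == 1 for m in high)),
             int(any(sum(m) % 2 == 0 for m in high)))
        d = (int(any(sum(m) >= 2 for m in high)),
             int(any(sum(m) <= 1 for m in high)))
        return s, d


if __name__ == "__main__":
    fa = DimsFullAdder()
    print(fa.step(TRUE, TRUE, EMPTY))    # intermediate codeword: outputs stay empty
    print(fa.step(TRUE, TRUE, FALSE))    # valid codeword: s = False, d = True
    print(fa.step(EMPTY, EMPTY, FALSE))  # intermediate codeword: outputs held
    print(fa.step(EMPTY, EMPTY, EMPTY))  # empty codeword: outputs return to empty
```
+
+In this model the outputs change only on a complete valid or a complete empty codeword.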
+The fundamental ideas explained above all go back to David Muller’s work
+in the late 1950s and early 1960s [93, 92]. While [93] develops the funda-
+mental theorem for the design of speed-independent circuits, [92] is a more
+practical introduction including a design example: a bit-serial multiplier using
+latches and gates as explained above.
+Referring to section 5.3.2, the DIMS circuits as explained here can be cat-
+egorized as strongly indicating, and hence they exhibit worst case latency. In
+an N-bit ripple-carry adder the empty-to-valid and valid-to-empty transitions
+will ripple in strict sequence from the least significant full adder to the most
+significant one.
+If we change the full-adder design slightly as illustrated in figure 5.10(d) a
+valid d may be produced before the c input is valid (“kill” or “generate”), and
+an N-bit ripple-carry adder built from such full adders will exhibit actual-case
+latency – the circuits are weakly indicating function blocks.
+The designs in figure 5.10(c) and 5.10(d), and ripple-carry adders built from
+these full adders, are all symmetric in the sense that the latency of propagating
+an empty value is the same as the latency of propagating the preceding valid
+value. This may be undesirable. Later in section 5.5.4 we will introduce an
+elegant design that propagates empty values in constant time (with the latency
+of 2 full adder cells).
+
+5.5.2 Null Convention Logic
+
+The C-elements and OR gates from the previous sections can be seen as n-
+of-n and 1-of-n threshold gates with hysteresis, figure 5.11. By using arbitrary
+m-of-n threshold gates with hysteresis – an idea proposed by Theseus Logic,
+Inc., [39] – it is possible to reduce the implementation complexity. An m-
+of-n threshold gate with hysteresis will set its output high when any m inputs
+have gone high and it will set its output low when all its inputs are low. This
+elegant circuit implementation idea is the key element in Theseus Logic’s Null
+Convention Logic. At the higher levels of design NCL is no different from
+the data-flow view presented in chapter 3 and NCL has great similarities to the
+circuit design styles presented in [92, 122, 124, 97]. Figure 5.11 shows that
+OR gates and C-elements can be seen as special cases in the world of threshold
+gates. The digit inside a gate symbol is the threshold of the gate. Figure 5.12
+shows the implementation of a dual-rail full adder using NCL threshold gates.
+The circuit is weakly indicating.
+
+[Figure 5.11: an array of threshold gate symbols, each labelled with its threshold (1 to 5), with the rows and columns annotated as inverter, OR-gates and C-elements.]
+
+Figure 5.11.
+NCL gates: m-of-n threshold gates with hysteresis (1 ≤ m ≤ n).
+
+[Figure 5.12: threshold gates with thresholds 2 and 3 connecting the rails a.t, a.f, b.t, b.f, c.t, c.f to the outputs s.t, s.f, d.t and d.f.]
+
+Figure 5.12.
+A full adder using NCL gates.
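+
+The behaviour of an m-of-n threshold gate with hysteresis is easy to capture in a behavioural model. The sketch below is our own illustration: the class name ThresholdGate is hypothetical, and the model ignores delays and transistor-level detail.
+
```python
class ThresholdGate:
    """m-of-n threshold gate with hysteresis.

    The output goes high when at least m inputs are high, goes low when all
    inputs are low, and otherwise keeps its previous value.
    """

    def __init__(self, m, n):
        if not 1 <= m <= n:
            raise ValueError("threshold must satisfy 1 <= m <= n")
        self.m, self.n, self.out = m, n, 0

    def update(self, inputs):
        if len(inputs) != self.n:
            raise ValueError("expected %d inputs" % self.n)
        high = sum(1 for x in inputs if x)
        if high >= self.m:
            self.out = 1
        elif high == 0:
            self.out = 0
        return self.out


if __name__ == "__main__":
    gate = ThresholdGate(2, 3)
    for inputs in ([1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 0, 0]):
        print(inputs, gate.update(inputs))
```
+
+With m = n the model behaves as a C-element and with m = 1 as an OR gate, matching the special cases named in the text.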
+
+5.5.3 Transistor-level CMOS implementations
+
+The last two adder designs we will introduce are based on CMOS transistor-
+level implementations using dual-rail signals. Dual-rail signals are essentially
+what are produced by precharged differential logic circuits that are used in
+memory structures and in logic families like DCVSL, figure 5.13 [151, 55].
+In a bundled-data design the precharge signal can be the request signal on
+the input channel to the function block. In a dual-rail design the precharge
+p-type transistors may be replaced by transistor networks that detect when all
+inputs are empty. Similarly the pull down n-type transistor signal paths should
+only conduct when the required input signals are valid.
+
+[Figure 5.13: an N transistor network driven by the inputs, with precharge transistors on the outputs Out.t and Out.f and optional devices labelled “A” and “B”.]
+
+Figure 5.13.
+A precharged differential CMOS combinatorial circuit. By adding the cross-coupled
+p-type transistors labeled “A” or the (weak) feedback-inverters labeled “B” the circuit
+becomes (pseudo)static.
+
+[Figure 5.14: transistor networks for the outputs d.t and d.f, with transistors controlled by the rails a.t, a.f, b.t, b.f, c.t and c.f.]
+
+Figure 5.14.
+Transistor-level implementation of the carry signal for the strongly indicating full
+adder from figure 5.10(c).
+
+Transistor implementations of the DIMS and NCL gates introduced above
+are thus straightforward. Figure 5.14 shows a transistor-level implementation
+of a carry circuit for a strongly-indicating full adder. In the pull-down circuit
+each column of transistors corresponds to a minterm. In general when imple-
+menting DCVSL gates it is possible to share transistors in the two pull-down
+networks, but in this particular case it has not been done in order to illustrate
+better the relationship between the transistor implementation and the gate im-
+plementation in figure 5.10(c).
+The high stacks of p-type transistors are obviously undesirable. They may
+be replaced by a single transistor controlled by an “all empty” signal generated
+elsewhere. Finally, we mention that the weakly-indicating full adder design
+presented in the next section includes optimizations that minimize the p-type
+transistor stacks.
+
+5.5.4
+Martin’s adder
+
+In [85] Martin addresses the design of dual-rail function blocks in general
+and he illustrates the ideas using a very elegant dual-rail ripple-carry adder.
+The adder has a small transistor count, it exhibits actual case latency when
+adding valid data, and it propagates empty values in constant time – the adder
+represents the ultimate in the design of weakly indicating function blocks.
+Looking at the strongly-indicating transistor-level carry circuit in figure 5.14
+we see that d remains valid until a, b, and c are all empty. If we designed a
+similar sum circuit its output s would also remain valid until a, b, and c are all
+empty. The weak conditions in figure 5.6 only require that one output remains
+
+
+
+Figure 5.15. (a) A 3-stage ripple-carry adder and graphs illustrating how valid data (b) and
+empty data (c) propagate through the circuit (Martin [85]).
+
+valid until all inputs have become invalid. Hence it is allowed to split the
+indication of a, b and c being empty among the carry and the sum circuits.
+In [85] Martin uses some very illustrative directed graphs to express how
+the output signals indicate when input signals and internal signals are valid or
+empty. The nodes in the graphs are the signals in the circuit and the directed
+edges represent indication dependencies. Solid edges represent guaranteed de-
+pendencies and dashed edges represent possible dependencies. Figure 5.15(a)
+shows three full adder stages of a ripple-carry adder, and figures 5.15(b) and
+5.15(c) show how valid and empty inputs respectively propagate through the
+circuit.
+The propagation and indication of valid values is similar to what we dis-
+cussed above in the other adder designs, but the propagation and indication of
+empty values is different and exhibits constant latency. When the outputs d3,
+s3, s2, and s1 are all valid this indicates that all input signals and all inter-
+nal carry signals are valid. Similarly when the outputs d3, s3, s2, and s1 are
+all empty this indicates that all input signals and all internal carry signals are
+empty – the ripple-carry adder satisfies the weak conditions.
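+
+The kill/generate/propagate distinction used in figure 5.15 can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, and the longest run of propagate stages is used as a rough proxy for the latency of valid data in a weakly indicating ripple-carry adder, not as a timing model of Martin's circuit.

```python
def classify_stage(a_bit, b_bit):
    """Classify one full-adder stage from its two operand bits."""
    if a_bit != b_bit:
        return "propagate"            # carry out must wait for carry in
    return "generate" if a_bit else "kill"  # carry out known from a and b alone


def longest_propagate_run(a, b, width):
    """Length of the longest chain of consecutive propagate stages."""
    longest = run = 0
    for i in range(width):
        kind = classify_stage((a >> i) & 1, (b >> i) & 1)
        if kind == "propagate":
            run += 1
            longest = max(longest, run)
        else:
            run = 0                   # a kill or generate stage breaks the chain
    return longest


def average_longest_run(width):
    """Average of longest_propagate_run over all operand pairs of this width."""
    total = sum(longest_propagate_run(a, b, width)
                for a in range(1 << width)
                for b in range(1 << width))
    return total / (1 << (2 * width))
```

+For operands such as 0101 and 1010 every stage propagates and the chain spans the whole adder, whereas identical operands break the chain at every stage; the average over all operand pairs lies between these extremes.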
+
+
+
+Figure 5.16. The CMOS transistor implementation of Martin’s adder [85, Fig. 3].
+
+The corresponding transistor implementation of a full adder is shown in
+figure 5.16. It uses 34 transistors, which is comparable to a traditional combi-
+natorial circuit implementation.
+The principles explained above apply to the design of function blocks in
+general. “Valid/empty indication (or acknowledgement), dependency graphs”
+as shown in figure 5.15 are a very useful technique for understanding and de-
+signing circuits with low latency and the weakest possible indication.
+
+5.6.
+Hybrid function blocks
+
+The final adder we will present has 4-phase bundled-data input and output
+channels and a dual-rail carry chain. The design exhibits characteristics sim-
+ilar to Martin’s dual-rail adder presented in the previous section: actual case
+latency when propagating valid data, constant latency when propagating empty
+data, and a moderate transistor count. The basic structure of this hybrid adder
+is shown in figure 5.17. Each full adder is composed of a carry circuit and
+a sum circuit. Figure 5.18(a)-(b) shows precharged CMOS implementations
+of the two circuits. The idea is that the circuits precharge when Req_in = 0,
+evaluate when Req_in = 1, detect when all carry signals are valid and use this
+information to indicate completion, i.e. Req_out↑. If the latency of the completion
+detector does not exceed the latency in the sum circuit in a full adder then
+a matched delay element is needed as indicated in figure 5.17.
+The size and latency of the completion detector in figure 5.17 grows with the
+size of the adder, and in wide adders the latency of the completion detector may
+significantly exceed the latency of the sum circuit. An interesting optimization
+that reduces the completion detector overhead – possibly at the expense of
+a small increase in overall latency (Req_in↑ to Req_out↑) – is to use a mix of
+strongly and weakly indicating function blocks [101]. Following the naming
+convention established in figure 5.7 on page 64 we could make, for example,
+
+
+
+Figure 5.17. Block diagram of a hybrid adder with 4-phase bundled-data input and output
+channels and with an internal dual-rail carry chain.
+
+Figure 5.18. The CMOS transistor implementation of a full adder for the hybrid adder in
+figure 5.17: (a) a weakly indicating carry circuit, (b) the sum circuit and (c) a strongly indicating
+carry circuit.
+
+
+
+adders 1, 4, 7, ... weakly indicating and all other adders strongly indicating. In
+this case only the carry signals out of stages 3, 6, 9, ... need to be checked to
+detect completion. For $i = 3, 6, 9, \ldots$, $d_i$ indicates the completion of $d_{i-1}$ and
+$d_{i-2}$ as well. Many other schemes for mixing strongly and weakly indicating
+full adders are possible. The particular scheme presented in [101] exploited the
+fact that typical-case operands (sampled audio) are numerically small values,
+and the design detects completion from a single carry signal.
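+
+The bookkeeping behind such mixed schemes can be sketched in a few lines of Python. The rule encoded here is an assumption drawn from the argument above: a carry output has to be checked explicitly only if the next stage is weakly indicating (and therefore does not vouch for its carry input) or if there is no next stage. The function names are hypothetical.

```python
def carries_to_check(strongly_indicating):
    """Return the 1-based stage numbers whose carry outputs must be checked.

    strongly_indicating[k] is True if full-adder stage k + 1 is strongly
    indicating, i.e. its carry output indicates all of its inputs.
    """
    n = len(strongly_indicating)
    checked = []
    for stage in range(1, n + 1):
        is_last = stage == n
        # strongly_indicating[stage] describes stage number stage + 1
        next_is_weak = (not is_last) and (not strongly_indicating[stage])
        if is_last or next_is_weak:
            checked.append(stage)
    return checked


def every_third_weak(n):
    """Stages 1, 4, 7, ... weakly indicating, all others strongly indicating."""
    return [stage % 3 != 1 for stage in range(1, n + 1)]
```

+For a 9-stage adder with the pattern from the text, this rule selects the carries out of stages 3, 6 and 9.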
+
+Summary – function block design
+
+The previous sections have explained the basics of how to implement func-
+tion blocks and have illustrated this using a variety of ripple-carry adders. The
+main points were “transparency to handshaking” and “actual case latency”
+through the use of weakly-indicating components.
+Finally, a word of warning to put things into the right perspective: to some
+extent the ripple-carry adders explained above over-sell the advantages of aver-
+age-case performance. It is easy to get carried away with elegant circuit de-
+signs but it may not be particularly relevant at the system level:
+
+In many systems the worst-case latency of a ripple-carry adder may sim-
+ply not be acceptable.
+
+In a system with many concurrently active components that synchronize
+and exchange data at high rates, the slowest component at any given time
+tends to dominate system performance; the average-case performance of
+a system may not be nearly as good as the average-case latency of its
+individual components.
+
+In many cases addition is only one part of a more complex compound
+arithmetic operation. For example, the final design of the asynchronous
+filter bank presented in [103] did not use the ideas presented above. In-
+stead we used entirely strongly-indicating full adders because this al-
+lowed an efficient two-dimensional precharged compound add-multiply-
+accumulate unit to be implemented.
+
+5.7.
+MUX and DEMUX
+
+Now that the principles of function block design have been covered we are
+ready to address the implementation of the MUX and DEMUX components,
+cf. figure 3.3 on page 32. Let’s recapitulate their function: a MUX will synchronize
+the control channel and relay the data and the handshaking of the
+selected input channel to the output data channel. The other input channel is
+ignored (and may have a request pending). Similarly a DEMUX will synchro-
+nize the control and the data input channels and steer the input to the selected
+output channel. The other output channel is passive and in the idle state.
+
+
+
+If we consider only the “active” channels then the MUX and the DEMUX
+can be understood and designed as function blocks – they must be transparent
+to the handshaking in the same way as function blocks. The control chan-
+nel and the (selected) input data channel are first joined and then an output is
+produced. Since no data transfer can take place without the control channel
+and the (selected) input data channel both being active, the implementations
+become strongly indicating function blocks.
+Let’s consider implementations using 4-phase protocols. The simplest and
+most intuitive designs use a dual-rail control channel. Figure 5.19 shows the
+implementation of the MUX and the DEMUX using the 4-phase bundled-data
+
+Figure 5.19. Implementation of MUX and DEMUX. The input and output data channels x,
+y, and z use the 4-phase bundled-data protocol and the control channel ctl uses the 4-phase
+dual-rail protocol (in order to simplify the design).
+
+
+
+protocol on the input and output data channels and the 4-phase dual-rail protocol
+on the control channel. In both circuits ctl.t and ctl.f can be understood as
+two mutually exclusive requests that select between the two alternative
+input-to-output data transfers, and in both cases ctl.t and ctl.f are joined with the
+relevant input requests (at the C-elements marked “Join”). The rest of the
+MUX implementation is then similar to the 4-phase bundled-data MERGE in
+figure 5.1 on page 59. The rest of the DEMUX should be self explanatory; the
+handshaking on the two output ports is mutually exclusive and the acknowledge
+signals y_ack and z_ack are ORed to form x_ack = ctl_ack.
+All 4-phase dual-rail implementations of the MUX and DEMUX compo-
+nents are rather similar, and all 4-phase bundled-data implementations may be
+obtained by adding 4-phase bundled-data to 4-phase dual-rail protocol con-
+version circuits on the control input. At the end of chapter 6, an all 4-phase
+bundled-data MUX will be one of the examples we use to illustrate the design
+of speed-independent control circuits.
+
+5.8.
+Mutual exclusion, arbitration and metastability
+
+5.8.1
+Mutual exclusion
+
+Some handshake components (including MERGE) require that the commu-
+nication along several (input) channels is mutually exclusive. For the simple
+static data-flow circuit structures we have considered so far this has been the
+case, but in general one may encounter situations where a resource is shared
+between several independent parties/processes.
+The basic circuit needed to deal with such situations is a mutual exclusion
+element (MUTEX), figure 5.20 (we will explain the implementation shortly).
+The input signals R1 and R2 are two requests that originate from two inde-
+pendent sources, and the task of the MUTEX is to pass these inputs to the
+corresponding outputs G1 and G2 in such a way that at most one output is ac-
+tive at any given time. If only one input request arrives the operation is trivial.
+If one input request arrives well before the other, the latter request is blocked
+until the first request is de-asserted. The problem arises when both input sig-
+
+Figure 5.20. The mutual exclusion element: symbol and possible implementation.
+
+
+
+nals are asserted at the same time. Then the MUTEX is required to make an
+arbitrary decision, and this is where metastability enters the picture.
+The problem is exactly the same as when a synchronous circuit is exposed
+to an asynchronous input signal (one that does not satisfy set-up and hold time
+requirements). For a clocked flip-flop that is used to synchronize an asyn-
+chronous input signal, the question is whether the data signal made its tran-
+sition before or after the active edge of the clock. As with the MUTEX the
+question is again which signal transition occurred first, and as with the MUTEX
+a random decision is needed if the transition of the data signal coincides
+with the active edge of the clock signal.
+The fundamental problem in a MUTEX and in a synchronizer flip-flop is
+that we are dealing with a bi-stable circuit that receives requests to enter each
+of its two stable states at the same time. This will cause the circuit to enter a
+metastable state in which it may stay for an unbounded length of time before
+randomly settling in one of its stable states. The problem of synchronization
+is covered in most textbooks on digital design and VLSI, and the analysis of
+metastability that is presented in these textbooks applies to our MUTEX com-
+ponent as well. A selection of references is: [95, sect. 9.4] [53, sect. 5.4 and
+6.5] [151, sect. 5.5.7] [115, sect. 6.2.2 and 9.4-5] [150, sect. 8.9].
+For the synchronous designer the problem is that metastability may per-
+sist beyond the time interval that has been allocated to recover from potential
+metastability. It is simply not possible to obtain a decision within a bounded
+length of time. The asynchronous designer, on the other hand, will eventually
+obtain a decision, but there is no upper limit on the time he will have to wait
+for the answer. In [22] the terms “time safe” and “value safe” are introduced
+to denote and classify these two situations.
+A possible implementation of the MUTEX, as shown in figure 5.20, in-
+volves a pair of cross coupled NAND gates and a metastability filter. The cross
+coupled NAND gates enable one input to block the other. If both inputs are
+asserted at the same time, the circuit becomes metastable with both signals
+x1 and x2 halfway between supply and ground. The metastability filter pre-
+vents these undefined values from propagating to the outputs; G1 and G2 are
+both kept low until signals x1 and x2 differ by more than a transistor threshold
+voltage.
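+
+The behaviour of the metastability filter can be summarized by a small Python sketch. It is a static, idealized model and rests on assumptions not spelled out in the text: x1 and x2 are taken to be the active-low outputs of the cross coupled NAND gates (so a low x1 means that R1 has won), voltages are plain numbers, and the function name is hypothetical.

```python
def metastability_filter(x1, x2, v_threshold):
    """Return (G1, G2) for internal node voltages x1 and x2.

    An output goes high only when its own node is lower than the other
    node by more than one transistor threshold voltage.
    """
    if v_threshold <= 0:
        raise ValueError("v_threshold must be positive")
    g1 = 1 if (x2 - x1) > v_threshold else 0
    g2 = 1 if (x1 - x2) > v_threshold else 0
    return g1, g2


# Both nodes halfway between supply and ground: neither grant is issued.
metastable = metastability_filter(1.65, 1.65, v_threshold=0.7)
# The bistable has resolved in favour of R1: only G1 is issued.
resolved = metastability_filter(0.0, 3.3, v_threshold=0.7)
```

+Because the threshold is positive, the two conditions can never hold at the same time, so the model never asserts both grants.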
+The metastability filter in figure 5.20 is a CMOS transistor-level implemen-
+tation from [83]. An NMOS predecessor of this circuit appeared in [121].
+Gate-level implementations are also possible: the metastability filter can be
+implemented using two buffers whose logic thresholds have been made partic-
+ularly high (or low) by “trimming” the strengths of the pull-up and pull-down
+transistor paths ([151, section 2.3]). For example, a 4-input NAND gate with
+all its inputs tied together implements a buffer with a particularly high logic
+threshold. The use of this idea in the implementation of mutual exclusion ele-
+ments is described in [6, 139].
+
+5.8.2
+Arbitration
+
+The MUTEX can be used to build a handshake arbiter that can be used to
+control access to a resource that is shared between several autonomous inde-
+pendent parties. One possible implementation is shown in figure 5.21.
+
+Figure 5.21. The handshake arbiter: symbol and possible implementation.
+
+The MUTEX ensures that signals G1 and G2 at the a’–aa’ interface are
+mutually exclusive. Following the MUTEX are two AND gates whose purpose
+it is to ensure that handshakes on the (y1, A1) and (y2, A2) channels at the b’–bb’
+interface are mutually exclusive: y2 can only go high if A1 is low and
+y1 can only go high if signal A2 is low. In this way, if handshaking is in
+progress along one channel, it blocks handshaking on the other channel. As
+handshaking along channels (y1, A1) and (y2, A2) is mutually exclusive the
+rest of the arbiter is simply a MERGE, cf. figure 5.1 on page 59. If data needs
+to be passed to the shared resource a multiplexer is needed in exactly the same
+way as in the MERGE. The multiplexer may be controlled by signals y1 and/or
+y2.
+
+5.8.3
+Probability of metastability
+
+Let us finally take a quantitative look at metastability: if $P(\mathrm{met}_t)$ denotes
+the probability of the MUTEX being metastable for a period of time of t or
+longer (within an observation interval of one second), and if this situation is
+considered a failure, then we may calculate the mean time between failures as:
+
+$$\mathrm{MTBF} = \frac{1}{P(\mathrm{met}_t)} \qquad (5.8)$$
+
+The probability $P(\mathrm{met}_t)$ may be calculated as:
+
+$$P(\mathrm{met}_t) = P(\mathrm{met}_t \mid \mathrm{met}_{t=0}) \cdot P(\mathrm{met}_{t=0}) \qquad (5.9)$$
+
+
+
+where:
+
+$P(\mathrm{met}_t \mid \mathrm{met}_{t=0})$ is the probability that the MUTEX is still metastable at
+time t given that it was metastable at time t = 0.
+
+$P(\mathrm{met}_{t=0})$ is the probability that the MUTEX will enter metastability
+within a given observation interval.
+
+The probability $P(\mathrm{met}_{t=0})$ can be calculated as follows: the MUTEX will go
+metastable if its inputs R1 and R2 are exposed to transitions that occur almost
+simultaneously, i.e. within some small time window ∆. If we assume that
+the two input signals are uncorrelated and that they have average switching
+frequencies $f_{R1}$ and $f_{R2}$ respectively, then:
+
+$$P(\mathrm{met}_{t=0}) = \Delta \cdot f_{R1} \cdot f_{R2} \qquad (5.10)$$
+
+which can be understood as follows: within an observation interval of one
+second the input signal R2 makes $f_{R2}$ attempts at hitting one of the $f_{R1}$
+time intervals of duration ∆ where the MUTEX is vulnerable to metastability.
+The probability $P(\mathrm{met}_t \mid \mathrm{met}_{t=0})$ is determined as:
+
+$$P(\mathrm{met}_t \mid \mathrm{met}_{t=0}) = e^{-t/\tau} \qquad (5.11)$$
+
+where τ expresses the ability of the MUTEX to exit the metastable state spon-
+taneously. This equation can be explained in two different ways and experi-
+mental results have confirmed its correctness. One explanation is that the cross
+coupled NAND gates have no memory of how long they have been metastable,
+and that the only probability distribution that is “memoryless” is an exponen-
+tial distribution. Another explanation is that a small-signal model of the cross-
+coupled NAND gates at the metastable point has a single dominating pole.
+Combining equations 5.8–5.11 we obtain:
+
+$$\mathrm{MTBF} = \frac{e^{t/\tau}}{\Delta \cdot f_{R1} \cdot f_{R2}} \qquad (5.12)$$
+
+Experiments and simulations have shown that this equation is reasonably
+accurate provided that t is not very small, and experiments or simulations may
+be used to determine the two parameters ∆ and τ. Representative values for
+good circuit designs implemented in a 0.25 µm CMOS process are ∆ = 30 ps
+and τ = 25 ps.
+
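+To get a feeling for the numbers, the MTBF expression (5.12) can be evaluated with a short Python sketch. The values of ∆ and τ are the representative ones quoted above; the request frequencies of 100 MHz and the allowed settling time of 1 ns are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
import math


def mtbf_seconds(t, tau, delta, f_r1, f_r2):
    """Mean time between failures: exp(t / tau) / (delta * f_r1 * f_r2)."""
    return math.exp(t / tau) / (delta * f_r1 * f_r2)


def settling_time_for_mtbf(target_mtbf, tau, delta, f_r1, f_r2):
    """Invert the MTBF expression: time t needed to reach a target MTBF."""
    return tau * math.log(target_mtbf * delta * f_r1 * f_r2)


# Assumed operating point: both requests toggle at 100 MHz, and a
# metastable period of 1 ns or longer counts as a failure.
example_mtbf = mtbf_seconds(t=1e-9, tau=25e-12, delta=30e-12,
                            f_r1=100e6, f_r2=100e6)
```

+Under these assumptions the result is on the order of 10^11 to 10^12 seconds. Because t appears in the exponent, each additional τ of allowed settling time multiplies the MTBF by e.
+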
+5.9.
+Summary
+
+This chapter addressed the implementation of the various handshake components:
+latch, fork, join, merge, function blocks, mux, demux, mutex and
+arbiter. A significant part of the material addressed principles and techniques
+for implementing function blocks.
+
+
+Chapter 6
+
+SPEED-INDEPENDENT CONTROL CIRCUITS
+
+This chapter provides an introduction to the design of asynchronous sequen-
+tial circuits and explains in detail one well-developed specification and synthe-
+sis method: the synthesis of speed-independent control circuits from signal
+transition graph specifications.
+
+6.1.
+Introduction
+
+Over time many different formalisms and theories have been proposed for
+the design of asynchronous control circuits (e.g. sequential circuits or state
+machines). The multitude of approaches arises from the combination of: (a)
+different specification formalisms, (b) different assumptions about delay mod-
+els for gates and wires, and (c) different assumptions about the interaction
+between the circuit and its environment. Full coverage of the topic is far be-
+yond the scope of this book. Instead we will first present some of the basic
+assumptions and characteristics of the various design methods and give point-
+ers to relevant literature and then we will explain in detail one method: the
+design of speed-independent circuits from signal transition graphs – a method
+that is supported by a well-developed public domain tool, Petrify.
+A good starting point for further reading is a book by Myers [95]. It provides
+in-depth coverage of the various formalisms, methods, and theories for the
+design of asynchronous sequential circuits and it provides a comprehensive
+list of references.
+
+6.1.1
+Asynchronous sequential circuits
+
+To start the discussion figure 6.1 shows a generic synchronous sequential
+circuit and two alternative asynchronous control circuits: a Huffman style fun-
+damental mode circuit with buffers (delay elements) in the feedback signals,
+and a Muller style input-output mode circuit with wires in the feedback path.
+The synchronous circuit is composed of a set of registers holding the current
+state and a combinational logic circuit that computes the output signals and the
+next state signals. When the clock ticks the next state signals are copied into the
+registers thus becoming the current state. Reliable operation only requires that
+
+
+Figure 6.1. (a) A synchronous sequential circuit. (b) A Huffman style asynchronous sequential
+circuit with buffers in the feedback path, and (c) a Muller style asynchronous sequential circuit
+with wires in the feedback path.
+
+the next state output signals from the combinational logic circuit are stable in a
+time window around the rising edge of the clock, an interval that is defined by
+the setup and hold time parameters of the register. Between two clock ticks the
+combinational logic circuit is allowed to produce signals that exhibit hazards.
+The only thing that matters is that the signals are ready and stable when the
+clock ticks.
+In an asynchronous circuit there is no clock and all signals have to be valid
+at all times. This implies that at least the output signals that are seen by the
+environment must be free from all hazards. To achieve this, it is sometimes
+necessary to avoid hazards on internal signals as well. This is why the synthesis
+of asynchronous sequential circuits is difficult. Because it is difficult,
+researchers have proposed different methods that are based on different
+(simplifying) assumptions.
+
+6.1.2
+Hazards
+
+For the circuit designer a hazard is an unwanted glitch on a signal. Fig-
+ure 6.2 shows four possible hazards that may be observed. A circuit that is in
+a stable state does not spontaneously produce a hazard – hazards are related
+to the dynamic operation of a circuit. This again relates to the dynamics of
+the input signals as well as the delays in the gates and wires in the circuit. A
+discussion of hazards is therefore not possible without stating precisely which
+delay model is being used and what assumptions are made about the interaction
+between the circuit and its environment. There are greater theoretical depths
+in this area than one might think at a first glance.
+Gates are normally assumed to have delays. In section 2.5.3 we also dis-
+cussed wire delays, and in particular the implications of having different delays
+in different branches of a forking wire. In addition to gate and wire delays it is
+also necessary to specify which delay model is being used.
+
+
+
+Figure 6.2. Possible hazards that may be observed on a signal. (Panels: static-1 hazard,
+static-0 hazard, dynamic-10 hazard and dynamic-01 hazard; each compares the desired signal
+with the actual signal.)
+
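+The four hazard types can be told apart mechanically from a sampled waveform, as the following Python sketch illustrates. It uses the usual textbook definitions (a static hazard is a glitch on a signal that should stay constant, a dynamic hazard is more than one change on a signal that should change once); the function name and the list-of-samples representation are assumptions made for the example.

```python
def classify_hazard(samples):
    """Classify a sampled binary waveform as a hazard type, or None.

    `samples` is a non-empty list of 0/1 values observed over time.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    # Collapse runs of equal samples so only level changes remain.
    levels = [samples[0]]
    for value in samples[1:]:
        if value != levels[-1]:
            levels.append(value)
    transitions = len(levels) - 1
    start, end = levels[0], levels[-1]
    if start == end:
        # The signal should have stayed constant: any change is a glitch.
        return f"static-{start}" if transitions > 0 else None
    # The signal should have changed exactly once.
    return f"dynamic-{start}{end}" if transitions > 1 else None
```

+For example, the samples 1, 0, 1 are reported as a static-1 hazard and 1, 0, 1, 0 as a dynamic-10 hazard, while a clean 0 to 1 transition yields None.
+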
+6.1.3
+Delay models
+
+A pure delay that simply shifts any signal waveform later in time is perhaps
+what first comes to mind. In the hardware description language VHDL this is
+called a transport delay. It is, however, not a very realistic model as it implies
+that the gates and wires have infinitely high bandwidth. A more realistic delay
+model is the inertial delay model. In addition to the time shifting of a signal
+waveform, an inertial delay suppresses short pulses. In the inertial delay model
+used in VHDL two parameters are specified, the delay time and the reject time,
+and pulses shorter than the reject time are filtered out. The inertial delay model
+is the default delay model used in VHDL.
+These two fundamental delay models come in several flavours depending on
+how the delay time parameter is specified. The simplest is a fixed delay where
+the delay is a constant. An alternative is a min-max delay where the delay is
+unknown but within a lower and upper bound: $t_{\min} \le t_{\mathrm{delay}} \le t_{\max}$. A more
+pessimistic model is the unbounded delay where delays are positive (i.e. not
+zero), unknown and unbounded from above: $0 < t_{\mathrm{delay}} < \infty$. This is the delay
+model that is used for gates in speed-independent circuits.
+It is intuitive that the inertial delay model and the min-max delay model
+both have properties that help filter out some potential hazards.
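+
+The difference between a transport delay and an inertial delay can be illustrated with a minimal Python sketch that operates on a list of (time, value) events. It is a deliberately simplified model and not a reimplementation of the VHDL semantics: the signal is assumed to be binary, the first event is taken as the initial value, and the function names are hypothetical.

```python
def transport_delay(events, delay):
    """Pure (transport) delay: shift every event later in time."""
    return [(t + delay, v) for t, v in events]


def inertial_delay(events, delay, reject):
    """Simplified inertial delay for a binary signal.

    `events` is a time-ordered list of (time, value) pairs whose first
    entry gives the initial value. A level that lasts for less than
    `reject` is dropped before the remaining events are shifted by `delay`.
    """
    filtered = [events[0]]
    for t, v in events[1:]:
        last_t, _ = filtered[-1]
        if len(filtered) > 1 and t - last_t < reject:
            filtered.pop()            # the previous level was too short: swallow it
        if v != filtered[-1][1]:
            filtered.append((t, v))   # keep only real changes of level
    return [(t + delay, v) for t, v in filtered]


# A 1-time-unit glitch at t = 10 followed by a proper pulse from 20 to 30.
waveform = [(0, 0), (10, 1), (11, 0), (20, 1), (30, 0)]
shifted = transport_delay(waveform, delay=5)
cleaned = inertial_delay(waveform, delay=5, reject=2)
```

+The transport model passes the short glitch on unchanged, whereas the inertial model removes it, which is why the latter can hide some potential hazards.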
+
+6.1.4
+Fundamental mode and input-output mode
+
+In addition to the delays in the gates and wires, it is also necessary to for-
+malize the interaction between the circuit being designed and its environment.
+Again, strong assumptions may simplify the design of the circuit. The design
+methods that have been proposed over time all have their roots in one of the
+following assumptions:
+
+Fundamental mode: The circuit is assumed to be in a state where all input
+signals, internal signals, and output signals are stable. In such a sta-
+ble state the environment is allowed to change one input signal. After
+that, the environment is not allowed to change the input signals again
+until the entire circuit has stabilized. Since internal signals such as state
+variables are unknown to the environment, this implies that the longest
+delay in the circuit must be calculated and the environment is required
+to keep the input signals stable for at least this amount of time. For this
+to make sense, the delays in gates and wires in the circuit have to be
+bounded from above. The limitation on the environment is formulated
+as an absolute time requirement.
+
+The design of asynchronous sequential circuits based on fundamental
+mode operation was pioneered by David Huffman in the 1950s [59, 60].
+
+Input-output mode: Again the circuit is assumed to be in a stable state. Here
+the environment is allowed to change the inputs. When the circuit has
+produced the corresponding output (and it is allowable that there are no
+output changes), the environment is allowed to change the inputs again.
+There are no assumptions about the internal signals and it is therefore
+possible that the next input change occurs before the circuit has stabi-
+lized in response to the previous input signal change.
+
+The restrictions on the environment are formulated as causal relations
+between input signal transitions and output signal transitions. For this
+reason the circuits are often specified using trace based methods where
+the designer specifies all possible sequences of input and output signal
+transitions that can be observed on the interface of the circuit. Signal
+transition graphs, introduced later, are such a trace-based specification
+technique.
+
+The design of asynchronous sequential circuits based on the input-output
+mode of operation was pioneered by David Muller in the 1950s [93, 92].
+As mentioned in section 2.5.1, these circuits are speed-independent.
+
+6.1.5
+Synthesis of fundamental mode circuits
+
+In the classic work by Huffman the environment was only allowed to change
+one input signal at a time. In response to such an input signal change, the
+combinational logic will produce new outputs, of which some are fed back,
+figure 6.1(b). In the original work it was further required that only one feed-
+back signal changes (at a time) and that the delay in the feedback buffer is
+large enough to ensure that the entire combinational circuit has stabilized be-
+fore it sees the change of the feedback signal. This change may, in turn, cause
+the combinational logic to produce new outputs, etc. Eventually through a se-
+quence of single signal transitions the circuit will reach a stable state where
+the environment is again allowed to make a single input change. Another way
+of expressing this behaviour is to say that the circuit starts out in a stable state
+
+
+
+Figure 6.3. Some alternative specifications of a Muller C-element: a Mealy state diagram, a
+primitive flow table, and a burst-mode state diagram.
+
+(which is defined to be a state that will persist until an input signal changes). In
+response to an input signal change the circuit will step through a sequence of
+transient, unstable states, until it eventually settles in a new stable state. This
+sequence of states is such that from one state to the next only one variable
+changes.
+The interested reader is encouraged to consult [75], [133] or [95] and to
+specify and synthesize a C-element. The following gives a flavour of the design
+process and the steps involved:
+
+The design may start with a state graph specification that is very simi-
+lar to the specification of a synchronous sequential circuit. This is op-
+tional. Figure 6.3 shows a Mealy type state graph specification of the
+C-element.
+
+The classic design process involves the following steps:
+
+The intended sequential circuit is specified in the form of a primitive
+flow table (a state table with one row per stable state). Figure 6.3 shows
+the primitive flow table specification of a C-element.
+
+A minimum-row reduced flow table is obtained by merging compatible
+states in the primitive flow table.
+
+The states are encoded.
+
+Boolean equations for output variables and state variables are derived.
+
+Later work has generalized the fundamental mode approach by allowing a
+restricted form of multiple-input and multiple-output changes. This approach
+
+
+86
+Part I: Asynchronous circuit design – A tutorial
+
+is called burst mode [32, 27]. When in a stable state, a burst-mode circuit
+will wait for a set of input signals to change (in arbitrary order). After such
+an input burst has completed the machine computes a burst of output signals
+and new values of the internal variables. The environment is not allowed to
+produce a new input burst until the circuit has completely reacted to the pre-
+vious burst – fundamental mode is still assumed, but only between bursts of
+input changes. For comparison, figure 6.3 also shows a burst-mode specifica-
+tion of a C-element. Burst-mode circuits are specified using state graphs that
+are very similar to those used in the design of synchronous circuits. Several
+mature tools for synthesizing burst-mode controllers have been developed in
+academia [40, 160]. These tools are available in the public domain.
+
+6.2.
+Signal transition graphs
+
+The rest of this chapter will be devoted to the specification and synthesis
+of speed-independent control circuits. These circuits operate in input-output
+mode and they are naturally specified using signal transition graphs (STGs).
+An STG is a Petri net and it can be seen as a formalization of a timing diagram.
+The synthesis procedure that we will explain in the following consists
+of: (1) Capturing the behaviour of the intended circuit and its environment in
+an STG. (2) Generating the corresponding state graph, and adding state vari-
+ables if needed. (3) Deriving Boolean equations for the state variables and
+outputs.
+
+6.2.1
+Petri nets and STGs
+
+Briefly, a Petri net [3, 113, 94] is a graph composed of directed arcs and
+two types of nodes: transitions and places. Depending on the interpretation
+that is assigned to places, transitions and arcs, Petri nets can be used to model
+and analyze many different (concurrent) systems. Some places can be marked
+with tokens and the Petri net model can be “executed” by firing transitions. A
+transition is enabled to fire if there are tokens on all of its input places, and
+an enabled transition must eventually fire. When a transition fires, a token is
+removed from each input place and a token is added to each output place. We
+will show an example shortly. Petri nets offer a convenient way of expressing
+choice and concurrency.
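+
+The firing rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the net (places p0 to p3, transitions t_fork and t_join) and the helper names enabled and fire are hypothetical and do not come from the text or from any tool.

```python
from collections import Counter

def enabled(marking, pre):
    """A transition is enabled if every input place holds at least one token."""
    return all(marking[p] >= 1 for p in pre)

def fire(marking, pre, post):
    """Fire an enabled transition: remove one token from each input place
    and add one token to each output place. Returns a new marking."""
    if not enabled(marking, pre):
        raise ValueError("transition is not enabled")
    new = Counter(marking)
    for p in pre:
        new[p] -= 1
    for p in post:
        new[p] += 1
    return +new  # unary plus drops places that now hold zero tokens

# Hypothetical net: t_fork splits one token into two concurrent branches,
# t_join needs a token on both branches before it can fire.
transitions = {
    "t_fork": (["p0"], ["p1", "p2"]),
    "t_join": (["p1", "p2"], ["p3"]),
}

m0 = Counter({"p0": 1})
m1 = fire(m0, *transitions["t_fork"])
m2 = fire(m1, *transitions["t_join"])
```

+In this sketch a marking is a multiset of places, so the same code would also handle nets that are not 1-bounded.
+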
+It is important to stress that there are many variations of and extensions to
+Petri nets – Petri nets are a family of related models and not a single, unique
+and well defined model. Often certain restrictions are imposed in order to make
+the analysis for certain properties practical. The STGs we will consider in the
+following belong to such a restricted subclass: an STG is a 1-bounded Petri net
+in which only simple forms of input choice are allowed. The exact meaning of
+“1-bounded” and “simple forms of input choice” will be defined at the end of
+this section.
+
+Figure 6.4. A C-element and its ‘well behaved’ dummy environment, its specification in the form of a timing diagram, a Petri net, and an STG formalization of the timing diagram.
+
+In an STG the transitions are interpreted as signal transitions and the places
+and arcs capture the causal relations between the signal transitions. Figure 6.4
+shows a C-element and a ‘well behaved’ dummy environment that maintains
+the input signals until the C-element has changed its outputs. The intended be-
+haviour of the C-element could be expressed in the form of a timing diagram
+as shown in the figure. Figure 6.4 also shows the corresponding Petri net specification.
+The Petri net is marked with tokens on the input places to the a+ and
+b+ transitions, corresponding to state (a, b, c) = (0, 0, 0). The a+ and b+ transitions
+may fire in any order, and when they have both fired the c+ transition
+becomes enabled to fire, etc. Often STGs are drawn in a simpler form where
+most places have been omitted. Every arc that connects two transitions is then
+thought of as containing a place. Figure 6.4 shows the STG specification of the
+C-element.
+A given marking of a Petri net corresponds to a possible state of the sys-
+tem being modeled, and by executing the Petri net and identifying all possible
+markings it is possible to derive the corresponding state graph of the system.
+The state graph is generally much more complex than the corresponding Petri
+net.
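+
+As a rough illustration of how a state graph follows from the markings, the sketch below plays the token game on the C-element STG of figure 6.4. The arc list is an assumption read off the description above (a+ and b+ both precede c+, c+ precedes a- and b-, and so on), with one implicit place per arc; the function names are hypothetical.

```python
# Each arc between two transitions carries an implicit place (1-bounded net).
arcs = [("a+", "c+"), ("b+", "c+"), ("c+", "a-"), ("c+", "b-"),
        ("a-", "c-"), ("b-", "c-"), ("c-", "a+"), ("c-", "b+")]
transitions = sorted({t for arc in arcs for t in arc})

# Initial marking: tokens on the input places of a+ and b+, state (a,b,c) = (0,0,0).
initial_marking = frozenset({("c-", "a+"), ("c-", "b+")})
initial_values = (("a", 0), ("b", 0), ("c", 0))

def successors(marking, values):
    """Yield (transition, new marking, new signal values) for each enabled transition."""
    for t in transitions:
        inputs = {arc for arc in arcs if arc[1] == t}
        if inputs <= marking:
            outputs = {arc for arc in arcs if arc[0] == t}
            new_marking = frozenset((marking - inputs) | outputs)
            new_values = dict(values)
            new_values[t[0]] = 1 if t[1] == "+" else 0
            yield t, new_marking, tuple(sorted(new_values.items()))

def state_graph():
    """Explore all markings reachable from the initial marking."""
    start = (initial_marking, initial_values)
    seen, todo, edges = {start}, [start], []
    while todo:
        marking, values = todo.pop()
        for t, m2, v2 in successors(marking, values):
            edges.append((values, t, v2))
            if (m2, v2) not in seen:
                seen.add((m2, v2))
                todo.append((m2, v2))
    return seen, edges

states, edges = state_graph()
codes = sorted("".join(str(bit) for _, bit in v) for _, v in states)
print(codes)
```

+Under these assumptions the exploration visits eight markings, one per value of (a, b, c), even though the STG itself has only six transitions.
+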
+An STG describing a meaningful circuit enjoys certain properties, and for
+the synthesis algorithms used in tools like Petrify to work, additional properties
+and restrictions may be required. An STG is a Petri net with the following
+characteristics:
+
+1 Input free choice: The selection among alternatives must only be con-
+trolled by mutually exclusive inputs.
+
+2 1-bounded: There must never be more than one token in a place.
+
+3 Liveness: The STG must be free from deadlocks.
+
+An STG describing a meaningful speed-independent circuit has the following
+characteristics:
+
+4 Consistent state assignment: The transitions of a signal must strictly
+alternate between + and - in any execution of the STG.
+
+5 Persistency: If a signal transition is enabled it must take place, i.e. it
+must not be disabled by another signal transition. The STG specification
+of the circuit must guarantee persistency of internal signals (state vari-
+ables) and output signals, whereas it is up to the environment to guaran-
+tee persistency of the input signals.
+
+In order to be able to synthesize a circuit implementation the following char-
+acteristic is required:
+
+6 Complete state coding (CSC): Two or more different markings of the
+STG must not have the same signal values (i.e. correspond to the same
+state). If this is not the case, it is necessary to introduce extra state
+variables such that different markings correspond to different states. The
+synthesis tool Petrify will do this automatically.
+
+6.2.2
+Some frequently used STG fragments
+
+For the newcomer it may take a little practice to become familiar with spec-
+ifying and designing circuits using STGs. This section explains some of the
+most frequently used templates from which one can construct complete speci-
+fications.
+The basic constructs are: fork, join, choice and merge, see figure 6.5. The
+choice is restricted to what is called input free choice: the transitions follow-
+ing the choice place must represent mutually exclusive input signal transitions.
+This requirement is quite natural; we will only specify and design deterministic
+circuits. Figure 6.6 shows an example Petri net that illustrates the use
+of fork, join, free choice and merge. The example system will either perform
+transitions T6 and T7 in sequence, or it will perform T1 followed by the concurrent
+execution of transitions T2, T3 and T4 (which may occur in any order),
+followed by T5.
+
+Figure 6.5. Petri net fragments for fork, join, free choice and merge constructs.
+
+Figure 6.6. An example Petri net that illustrates the use of fork, join, free choice and merge.
+
+Towards the end of this chapter we will design a 4-phase bundled-data ver-
+sion of the MUX component from figure 3.3 on page 32. For this we will need
+some additional constructs: a controlled choice and a Petri net fragment for the
+input end of a bundled-data channel.
+Figure 6.7 shows a Petri net fragment where place P1 and transitions T3
+and T4 represent a controlled choice: a token in place P1 will engage in either
+transition T3 or transition T4. The choice is controlled by the presence of
+a token in either P2 or P3. It is crucial that there can never be a token in
+both these places at the same time, and in the example this is ensured by the
+mutually exclusive input signal transitions T1 and T2.
+
+Figure 6.7. A Petri net fragment including a controlled choice (P0 is a free choice place, P1 is a controlled choice place).
+
+Figure 6.8 shows a Petri net fragment for a component with a one-bit input
+channel using a 4-phase bundled-data protocol. It could be the control chan-
+nel used in the MUX and DEMUX components introduced in figure 3.3 on
+page 32. The two transitions dummy1 and dummy2 do not represent transitions
+on the three signals in the channel, they are dummy transitions that facilitate
+expressing the specification. These dummy transitions represent an extension
+to the basic class of STGs.
+Note also that the four arcs connecting:
+place P5 and transition Ctl+
+place P5 and transition Ctl-
+place P6 and transition dummy2
+place P7 and transition dummy1
+have arrows at both ends. This is a shorthand notation for an arc in each direction.
+Note also that there are several instances where a place is both an input
+place and an output place for a transition. Place P5 and transition Ctl+ is an
+example of this.
+The overall structure of the Petri net fragment can be understood as follows:
+at the top is a sequence of transitions and places that capture the handshaking
+on the Req and Ack signals. At the bottom is a loop composed of places P6
+and P7 and transitions Ctl+ and Ctl- that captures the control signal changing
+between high and low. The absence of a token in place P5 when Req is high
+expresses the fact that Ctl is stable in this period. When Req is low and a
+token is present in place P5, Ctl is allowed to make as many transitions as it
+desires. When Req+ fires, a token is put in place P4 (which is a controlled
+choice place). The Ctl signal is now stable, and depending on its value one of
+the two transitions dummy1 or dummy2 will become enabled and eventually
+fire. At this point the intended input-to-output operation that is not included in
+this example may take place, and finally the handshaking on the control port
+finishes (Ack+; Req-; Ack-).
+
+Figure 6.8. A Petri net fragment for a component with a one-bit input channel using a 4-phase bundled-data protocol.
+
+6.3.
+The basic synthesis procedure
+
+The starting point for the synthesis process is an STG that satisfies the re-
+quirements listed on page 88. From this STG the corresponding state graph is
+derived by identifying all of the possible markings of the STG that are reach-
+able given its initial marking. The last step of the synthesis process is to derive
+Boolean equations for the state variables and output variables.
+We will go through a number of examples by hand in order to illustrate the
+techniques used. Since the state of a circuit includes the values of all of the
+signals in the circuit, the computational complexity of the synthesis process
+can be large, even for small circuits. In practice one would always use one
+of the CAD tools that has been developed – for example Petrify that we will
+introduce later.
+
+6.3.1
+Example 1: a C-element
+
+Figure 6.9. State graph and Boolean equation for the C-element STG from figure 6.4. The states (a b c, excited signals starred) are 0*0*0, 10*0, 0*10, 110*, 1*1*1, 01*1, 1*01 and 001*. The Karnaugh map for c gives c = ab + ac + bc.
+
+Figure 6.9 shows the state graph corresponding to the STG specification in
+figure 6.4 on page 87. Variables that are excited in a given state are marked
+with an asterisk. Also shown in figure 6.9 is the Karnaugh map for output
+signal c. The Boolean expression for c must cover states in which c = 1 and
+states where it is excited, c = 0* (changing to 1). In order to better distinguish
+excited variables from stable ones in the Karnaugh maps, we will use R (rising)
+instead of 0* and F (falling) instead of 1* throughout the rest of this book.
+It is comforting to see that we can successfully derive the implementation of
+a known circuit, but the C-element is really too simple to illustrate all aspects
+of the design process.
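+
+A quick way to convince oneself that c = ab + ac + bc behaves as a C-element is to evaluate it in every state and see where the computed value differs from the current value of c, i.e. where c is excited. The sketch below does this; the function name c_next is hypothetical.

```python
from itertools import product

def c_next(a, b, c):
    """Value the equation c = ab + ac + bc drives onto c in state (a, b, c)."""
    return (a & b) | (a & c) | (b & c)

# c is excited wherever the equation disagrees with the current value of c.
excited_states = [(a, b, c) for a, b, c in product((0, 1), repeat=3)
                  if c_next(a, b, c) != c]

for a, b, c in product((0, 1), repeat=3):
    tag = "excited" if (a, b, c) in excited_states else "stable"
    print(a, b, c, tag)
```

+The only excited states are (1, 1, 0), where c is rising, and (0, 0, 1), where c is falling, matching the states 110* and 001* of the state graph.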
+
+6.3.2
+Example 2: a circuit with choice
+
+The following example provides a better illustration of the synthesis pro-
+cedure, and in a subsequent section we will come back to this example and
+explain more efficient implementations. The example is simple – the circuit
+has only 2 inputs and 2 outputs – and yet it brings forward all relevant issues.
+The example is due to Chris Myers of the University of Utah who presented it
+in his 1996 course EE 587 “Asynchronous VLSI System Design.” The example
+has roots in the papers [12, 13].
+Figure 6.10 shows a specification of the circuit. The circuit has two inputs
+a and b and two outputs c and d, and the circuit has two alternative behaviours
+as illustrated in the timing diagram. The corresponding STG specification is
+shown in figure 6.11 along with the state graph for the circuit. The STG includes
+only the free choice place P0 and the merge place P1. All arcs that
+directly connect two transitions are assumed to include a place. The states in
+the state diagram have been labeled with decimal numbers to ease filling out
+the Karnaugh maps.
+
+Figure 6.10. The example circuit from [12, 13].
+
+Figure 6.11. The STG specification and the corresponding state graph. The states (a b c d, excited signals starred, decimal label in parentheses) are 0*0*00 (0), 10*00 (8), 1100* (12), 110*1 (13), 1111* (15), 1*110 (14), 01*10 (6), 001*0 (2) and 010*0 (4).
+
+Figure 6.12. The Karnaugh map, the Boolean equation $c = d + \bar{a}b + bc$, and two alternative gate-level implementations of output signal c: an atomic complex gate, and simple gates with internal signals x1, x2 and x3.
+
+The STG satisfies all of the properties 1-6 listed on page 88 and it is thus
+possible to proceed and derive Boolean equations for output signals c and d.
+[Note: In state 0 both inputs are marked to be excited, (a, b) = (0*, 0*), and
+in states 4 and 8 one of the signals is still 0 but no longer excited. This is a
+problem of notation only. In reality only one of the two variables is excited in
+state 0, but we don’t know which one. Furthermore, the STG is only required
+to be persistent with respect to the internal signals and the output signals. Per-
+sistency of the input signals must be guaranteed by the environment].
+For output c, figure 6.12 shows the Karnaugh map, the Boolean equation and
+two alternative gate implementations: one using a single atomic And-Or-Invert
+gate, and one using simple AND and OR gates. Note that there are states that
+are not reachable by the circuit. In the Karnaugh map these states correspond
+to don’t cares. The implementation of output signal d is left as an exercise for
+the reader ($d = ab\bar{c}$).
+
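+The equations for c and d can be checked against the state graph with a short script. This is a sketch under two assumptions: the state table is transcribed from the starred labels of figure 6.11 (decimal label = abcd read as a binary number), and the complemented literals are reconstructed as c = d + a'b + bc and d = abc', where a prime denotes inversion. The names STATES, c_eq, d_eq and check are hypothetical.

```python
# Reachable states: label -> ((a, b, c, d), set of excited non-input signals).
STATES = {
    0:  ((0, 0, 0, 0), set()),
    8:  ((1, 0, 0, 0), set()),
    12: ((1, 1, 0, 0), {"d"}),
    13: ((1, 1, 0, 1), {"c"}),
    15: ((1, 1, 1, 1), {"d"}),
    14: ((1, 1, 1, 0), set()),
    6:  ((0, 1, 1, 0), set()),
    2:  ((0, 0, 1, 0), {"c"}),
    4:  ((0, 1, 0, 0), {"c"}),
}

def c_eq(a, b, c, d):
    """c = d + a'b + bc (assumed reconstruction of the printed equation)."""
    return d | ((1 - a) & b) | (b & c)

def d_eq(a, b, c, d):
    """d = abc' (assumed reconstruction of the exercise answer)."""
    return a & b & (1 - c)

def check():
    """Return the (state, signal) pairs where an equation disagrees with the graph."""
    mismatches = []
    for label, ((a, b, c, d), excited) in STATES.items():
        # An excited signal must be driven to its new value, a stable one held.
        want_c = 1 - c if "c" in excited else c
        want_d = 1 - d if "d" in excited else d
        if c_eq(a, b, c, d) != want_c:
            mismatches.append((label, "c"))
        if d_eq(a, b, c, d) != want_d:
            mismatches.append((label, "d"))
    return mismatches

mismatches = check()
print(mismatches)
```

+An empty list means that, in every reachable state, each equation evaluates to the value the state graph requires. The unreachable states are don't cares and are not checked.
+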
+6.3.3
+Example 2: Hazards in the simple gate
+implementation
+
+The STG in figure 6.10 satisfies all of the implementation conditions 1-6
+(including persistency), and consequently an implementation where each out-
+put signal is implemented by a single atomic complex gate is hazard free. In
+the case of c we need a complex And-Or gate with inversion of input signal a.
+In general such an atomic implementation is not feasible and it is necessary to
+decompose the implementation into a structure of simpler gates. Unfortunately
+this will introduce extra variables, and these extra variables may not satisfy the
+persistency requirement that an excited signal transition must eventually fire.
+Speed-independence preserving logic decomposition is therefore a very inter-
+esting and relevant topic [20, 76].
+The implementation of c using simple gates that is shown in figure 6.12 is
+not speed-independent; it may exhibit both static and dynamic hazards, and it
+provides a good illustration of the mechanisms behind hazards. The problem
+is that the signals x1, x2 and x3 are not included in the original STG and state
+graph. A detailed analysis that includes these signals would not satisfy the
+persistency requirement. Below we explain possible failure sequences that may
+cause a static-1 hazard and a dynamic-10 hazard on output signal c. Figure 6.13
+illustrates the discussion.
+
+A static-1 hazard may occur when the circuit goes through the following sequence
+of states: 12, 13, 15, 14. The transition from state 12 to state 13
+corresponds to d going high and the transition from state 15 to state 14
+corresponds to d going low again. In state 13 c is excited (R) and it is
+supposed to remain high throughout states 13, 15, 14, and 6. States 13
+and 15 are covered by the cube d, and state 14 is covered by cube bc that
+is supposed to “take over” and maintain c = 1 after d has gone low. If
+the AND gate with output signal x2 that corresponds to cube bc is slow
+we have the problem: the static-1 hazard.
+
+Figure 6.13. The Karnaugh maps for output signal c showing state sequences that may lead to hazards (one panel for the potential static-1 hazard, one for the potential dynamic-10 hazard).
+
+A dynamic-10 hazard may occur when the circuit goes through the following
+sequence of states: 4, 6, 2, 0. This situation corresponds to the upper
+AND gate (with output signal x1) and the OR gate relaying b+ into
+c+ and b- into c-. However, after the c+ transition the lower AND
+gate, x2, becomes excited (R) as well, but the firing of this gate is not
+indicated by any other signal transition – the OR gate already has one
+input high. If the lower AND gate (x2) fires, it will later become excited
+(F) in response to c-. The net effect of this is that the lower AND gate
+(x2) may superimpose a 0-1-0 pulse onto the c output after the intended
+c- transition has occurred.
+
+In the above we did not consider the inverter with input signal a and output
+signal x3. Since a is not an input to any other gate, this decomposition is SI.
+In summary both types of hazard are related to the circuit going through a
+sequence of states that are covered by several cubes that are supposed to main-
+tain the signal at the same (stable) level. The cube that “takes over” represents
+a signal that may not be indicated by any other signal. In essence it is the same
+problem that we touched upon in section 2.2 on page 14 and in section 2.4.3
+on page 20 – an OR gate can only indicate when the first input goes high.
+
+6.4.
+Implementations using state-holding gates
+
+6.4.1
+Introduction
+
+During operation each variable in the circuit will go through a sequence of
+states where it is (stable) 0, followed by one or more states where it is excited
+(R), followed by a sequence of states where it is (stable) 1, followed by one or
+more states where it is excited (F), etc. In the above implementation we were
+covering all states where a variable, z, was high or excited to go high (z = 1
+and z = R = 0*).
+An alternative is to use a state-holding device such as a set-reset latch. The
+Boolean equations for the set and reset signals need only cover the z = R = 0*
+states and the z = F = 1* states respectively. This will lead to simpler equations
+and potentially simpler decompositions. Figure 6.14 shows the implementation
+template using a standard set-reset latch and an alternative solution based on a
+standard C-element. In the latter case the reset signal must be inverted. Later,
+in section 6.4.5, we will discuss alternative and more elaborate implementa-
+tions, but for the following discussion the basic topologies in figure 6.14 will
+suffice.
+
+Figure 6.14. Possible implementation templates using (simple) state holding elements: an SR flip-flop implementation and a standard C-element implementation, each driven by set logic and reset logic.
+
+At this point it is relevant to mention that the equations for when to set
+and reset the state-holding element for signal z can be found by rewriting the
+original equation (that covers states in which z = R and z = 1) into the following
+form:
+
+$z = \text{Set} + z \cdot \overline{\text{Reset}}$   (6.1)
+
+For signal c in the above example (figure 6.12 on page 93) we would get the
+following set and reset functions: $c_{set} = d + \bar{a}b$ and $c_{reset} = \bar{b}$ (which is identical
+to the result in figure 6.15 in section 6.4.3). Furthermore it is obvious that for
+all reachable states (only) the set and reset functions for a signal z must never
+be active at the same time:
+
+$\text{Set} \cdot \text{Reset} = 0$
+
+The following sections will develop the idea of using state-holding elements
+and we will illustrate the techniques by re-implementing example 2 from the
+previous section.
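+
+The rewriting into the form of equation 6.1 can be checked numerically for signal c of example 2. The sketch below assumes the reconstructed functions c = d + a'b + bc, c-set = d + a'b and c-reset = b' (a prime denotes inversion) and the decimal state labels of the state graph; all function names are hypothetical.

```python
from itertools import product

REACHABLE = [0, 8, 12, 13, 15, 14, 6, 2, 4]  # decimal labels, abcd with a as MSB

def bits(n):
    """Decode a decimal state label into (a, b, c, d)."""
    return (n >> 3) & 1, (n >> 2) & 1, (n >> 1) & 1, n & 1

def c_original(a, b, c, d):
    return d | ((1 - a) & b) | (b & c)

def c_set(a, b, c, d):
    return d | ((1 - a) & b)

def c_reset(a, b, c, d):
    return 1 - b

def c_latched(a, b, c, d):
    """Equation 6.1: z = Set + z * not(Reset)."""
    return c_set(a, b, c, d) | (c & (1 - c_reset(a, b, c, d)))

# The two forms agree in all 16 input combinations, not only the reachable ones.
same_everywhere = all(c_original(*s) == c_latched(*s)
                      for s in product((0, 1), repeat=4))

# Set and Reset must never be active together in a reachable state.
conflicts_reachable = [n for n in REACHABLE if c_set(*bits(n)) & c_reset(*bits(n))]
conflicts_anywhere = [n for n in range(16) if c_set(*bits(n)) & c_reset(*bits(n))]
print(same_everywhere, conflicts_reachable, conflicts_anywhere)
```

+With these assumptions Set and Reset overlap only in states 1, 3, 9 and 11, none of which is reachable, which illustrates why the condition is stated for reachable states only.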
+
+6.4.2
+Excitation regions and quiescent regions
+
+The above idea of using a state-holding device for each variable can be formal-
+ized as follows:
+
+An excitation region, ER, for a variable z is a maximally-connected set of
+states in which the variable is excited:
+
+ER(z+) denotes a region of states where z = R = 0*
+
+ER(z-) denotes a region of states where z = F = 1*
+
+A quiescent region, QR, for a variable z is a maximally-connected set of states
+in which the variable is not excited:
+
+QR(z+) denotes a region of states where z = 1
+
+QR(z-) denotes a region of states where z = 0
+
+For a given circuit the state space can be disjointly divided into one or more
+regions of each type.
+
+The set function (cover) for a variable z:
+
+must contain all states in the ER(z+) regions
+
+may contain states from the QR(z+) regions
+
+may contain states not reachable by the circuit
+
+The reset function (cover) for a variable z:
+
+must contain all states in the ER(z-) regions
+
+may contain states from the QR(z-) regions
+
+may contain states not reachable by the circuit
+
+In section 6.4.4 below we will add what is known as the monotonic cover
+constraint or the unique entry constraint in order to avoid hazards:
+
+A cube (product term) in the set or reset function of a variable must only
+be entered through a state where the variable is excited.
+
+Having mentioned this last constraint, we have above a complete recipe
+for the design of speed-independent circuits where each non-input signal is
+implemented by a state holding device. Let us continue with example 2.
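+
+The division of the state space into excitation and quiescent regions can be sketched as a connected-components computation on the state graph. The edge list and the set of states in which c is excited are assumptions transcribed from the state graph of example 2 (decimal labels, abcd with a as the most significant bit); the helper names are hypothetical.

```python
# State graph of example 2 as directed edges between decimal state labels.
EDGES = [(0, 8), (8, 12), (12, 13), (13, 15), (15, 14),
         (14, 6), (6, 2), (2, 0), (0, 4), (4, 6)]
C_EXCITED = {13, 4, 2}  # states whose label stars signal c

def c_value(n):
    """Current value of c in state n (bit 1 of the label abcd)."""
    return (n >> 1) & 1

def kind(n):
    """Classify a state with respect to signal c."""
    if n in C_EXCITED:
        return "ER(c+)" if c_value(n) == 0 else "ER(c-)"
    return "QR(c+)" if c_value(n) == 1 else "QR(c-)"

def regions():
    """Group states into maximally connected sets of the same kind."""
    states = {s for edge in EDGES for s in edge}
    seen, result = set(), []
    for start in sorted(states):
        if start in seen:
            continue
        component, todo = set(), [start]
        while todo:
            s = todo.pop()
            if s in component:
                continue
            component.add(s)
            for u, v in EDGES:
                # Connectivity ignores edge direction.
                for x, y in ((u, v), (v, u)):
                    if x == s and kind(y) == kind(start):
                        todo.append(y)
        seen |= component
        result.append((kind(start), sorted(component)))
    return result

for name, members in regions():
    print(name, members)
```

+Under these assumptions signal c has two separate ER(c+) regions (states 4 and 13), one ER(c-) region (state 2), one QR(c+) region (states 6, 14, 15) and one QR(c-) region (states 0, 8, 12).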
+
+
+6.4.3
+Example 2: Using state-holding elements
+
+Figure 6.15 illustrates the above procedure for example 2 from sections 6.3.2
+and 6.3.3. As before, the Boolean equations (for the set and reset functions)
+may need to be implemented using atomic complex gates in order to ensure
+that the resulting circuit is speed-independent.
+
+Figure 6.15. Excitation and quiescent regions in the state diagram for signal c in the example circuit from figure 6.10, and the corresponding derivation of equations for the set and reset functions: $c_{set} = d + \bar{a}b$ and $c_{reset} = \bar{b}$.
+
+Figure 6.16. Implementation of c using a standard C-element and simple gates, along with the Karnaugh map from which the set and reset functions were derived.
+
+6.4.4
+The monotonic cover constraint
+
+A standard C-element based implementation of signal c from above, with
+the set and reset functions implemented using simple gates, is shown in fig-
+ure 6.16 along with the Karnaugh map from which the set and reset functions
+are derived. The set function involves two cubes d and $\bar{a}b$ that are input signals
+to an OR gate. This implementation may exhibit a dynamic-10 hazard on the
+cset-signal in a similar way to that discussed previously. The Karnaugh map in
+figure 6.16 shows the sequence of states that may lead to a malfunction: (8, 12,
+13, 15, 14, 6, 0). Signal d is low in state 12, high in states 13 and 15, and low
+again in state 14. This sequence of states corresponds to a pulse on d. Through
+the OR gate this will create a pulse on the cset signal that will cause c to go
+high. Later in state 2, c will go low again. This is the desired behaviour. The
+problem is that the internal signal x1 that corresponds to the other cube in the
+expression for cset becomes excited (x1 = R) in state 6. If this AND gate is
+slow this may produce an unintended pulse on the cset signal after c has been
+reset again.
+If the cube $\bar{a}b$ (that covers states 4, 5, 7, and 6) is reduced to include only
+states 4 and 5 corresponding to $c_{set} = d + \bar{a}b\bar{c}$ we would avoid the problem.
+The effect of this modification is that the OR gate is never exposed to more
+than one input signal being high, and when this is the case we do not have
+problems with the principle of indication (c.f. the discussion of indication and
+dual-rail circuits in chapter 2). Another way of expressing this is that a cover
+cube must only be entered through states belonging to an excitation region.
+This requirement is known as:
+
+the monotonic cover constraint: only one product term in a sum-of-
+products implementation is allowed to be high at any given time. Obvi-
+ously the requirement need only be satisfied in the states that are reach-
+able by the circuit, or alternatively
+
+the unique entry constraint: cover cubes may only be entered through
+excitation region states.
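+
+The unique entry constraint lends itself to a mechanical check: walk every edge of the state graph and flag any edge that enters a cube in a state outside the excitation region. The sketch below applies this to the set function of c, assuming the reconstructed cubes d, a'b and a'bc' (a prime denotes inversion) and the state graph edges of example 2; the names bad_entries, CUBES_ORIGINAL and CUBES_REDUCED are hypothetical.

```python
# State graph of example 2 (decimal labels, abcd with a as MSB).
EDGES = [(0, 8), (8, 12), (12, 13), (13, 15), (15, 14),
         (14, 6), (6, 2), (2, 0), (0, 4), (4, 6)]
ER_C_PLUS = {4, 13}  # states in which c is excited to rise

def bits(n):
    """Decode a decimal state label into (a, b, c, d)."""
    return (n >> 3) & 1, (n >> 2) & 1, (n >> 1) & 1, n & 1

CUBES_ORIGINAL = {
    "d":   lambda a, b, c, d: d,
    "a'b": lambda a, b, c, d: (1 - a) & b,
}
CUBES_REDUCED = {
    "d":     lambda a, b, c, d: d,
    "a'bc'": lambda a, b, c, d: (1 - a) & b & (1 - c),
}

def bad_entries(cubes, excitation_region):
    """List (cube, from_state, to_state) where a cube is entered outside the ER."""
    bad = []
    for name, cube in cubes.items():
        for src, dst in EDGES:
            entered = not cube(*bits(src)) and cube(*bits(dst))
            if entered and dst not in excitation_region:
                bad.append((name, src, dst))
    return bad

print(bad_entries(CUBES_ORIGINAL, ER_C_PLUS))
print(bad_entries(CUBES_REDUCED, ER_C_PLUS))
```

+With the original cover the cube a'b is entered on the edge from state 14 to state 6, which lies in a quiescent region; this is the entry behind the hazard discussed above. With the reduced cover no such entry remains.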
+
+6.4.5
+Circuit topologies using state-holding elements
+
+In addition to the set-reset flip-flop and the standard C-element based tem-
+plates presented above, there are a number of alternative solutions for imple-
+menting variables using a state-holding device.
+A popular approach is the generalized C-element that is available to the
+CMOS transistor-level designer. Here the state-holding mechanism and the set
+and reset functions are implemented in one (atomic) compound structure of
+n- and p-type transistors. Figure 6.17 shows a gate-level symbol for a circuit
+where $z_{set} = ab$ and $z_{reset} = \bar{b}\bar{c}$ along with dynamic and static CMOS implementations.
+An alternative implementation that may be attractive to a designer using a
+standard cell library that includes (complex) And-Or-Invert gates is shown in
+figure 6.18. The circuit has the interesting property that it produces both the
+desired signal z and its complement $\bar{z}$ and during transitions it never produces
+$(z, \bar{z}) = (1, 1)$. Again, the example is a circuit where $z_{set} = ab$ and $z_{reset} = \bar{b}\bar{c}$.
+
+Figure 6.17. A generalized C-element: gate-level symbol, and some CMOS transistor implementations (a dynamic and pseudostatic implementation, and a static implementation).
+
+Figure 6.18. An SR implementation based on two complex And-Or-Invert gates.
+
+6.5.
+Initialization
+
+Initialization is an important aspect of practical circuit design, and unfortu-
+nately it has not been addressed in the above. The synthesis process assumes
+an initial state that corresponds to the initial marking of the STG, and the re-
+sulting synthesized circuit is a correct speed-independent implementation of
+the specification provided that the circuit starts out in the same initial state.
+Since the synthesized circuits generally use state-holding elements or circuitry
+with feedback loops it is necessary to actively force the circuit into the intended
+initial state.
+Consequently, the designer has to do a manual post-synthesis hack and ex-
+tend the circuit with an extra signal which, when active, sets all state-holding
+constructs into the desired state.
+Normally the circuits will not be speed-
+independent with respect to this initialization signal; it is assumed to be as-
+serted for long enough to cause the desired actions before it is de-asserted.
+For circuit implementations using state-holding elements such as set-reset
+latches and standard C-elements, initialization is trivial provided that these
+components have special clear/preset signals in addition to their normal inputs.
+In all other cases the designer has to add an initialization signal to the relevant
+Boolean equations explicitly. If the synthesis process is targeting a given cell
+library, the modified logic equations may need further logic decomposition,
+and as we have seen this may compromise speed-independence.
+The fact that initialization is not included in the synthesis process is ob-
+viously a drawback, but normally one would implement a library of control
+circuits and use these as building blocks when designing circuits at the more
+abstract “static data-flow structures” level as introduced in chapter 3.
+Initializing all control circuits as outlined above is a simple and robust
+approach. However, initialization of asynchronous circuits based on handshake
+components may also be achieved by an implicit approach that exploits
+the function of the circuit to “propagate” initial signal values into the circuit.
+In Tangram (section 8.3, and chapter 13 in part III) this is called
+self-initialization [135].
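
As an illustration of explicit initialization, the following is a minimal Python sketch of a standard C-element extended with a clear input. The function name `c_element` and the active-high polarity of `reset` are assumptions made for this example only; they are not taken from any cell library.

```python
def c_element(a, b, prev, reset=0):
    """Next output of a C-element with an asynchronous clear (assumed active high)."""
    if reset:
        return 0      # forced into the initial state regardless of a and b
    if a == b:
        return a      # inputs agree: the output follows them
    return prev       # inputs differ: the previous value is held


state = 1                                  # arbitrary power-up value
state = c_element(1, 0, state)             # inputs differ, so the stray 1 is held
held = state
state = c_element(1, 0, state, reset=1)    # reset asserted: output forced to 0
cleared = state
state = c_element(1, 1, state)             # reset released: normal operation
```

The sketch shows why a state-holding element cannot be initialized through its normal inputs alone: as long as the inputs differ, the power-up value is simply kept.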
+
+6.6.
+Summary of the synthesis process
+
+The previous sections have covered the basic theory for synthesizing SI con-
+trol circuits from STG specifications. The style of presentation has deliberately
+been chosen to be an informal one with emphasis on examples and the intuition
+behind the theory and the synthesis procedure.
+The theory has roots in work done by the following Universities and groups:
+University of Illinois [93], MIT [26, 24], Stanford [13], IMEC [145, 159], St.
+Petersburg Electrical Engineering Institute [146], and the multinational group
+of researchers who have developed the Petrify tool [29] that we will introduce
+
+
+102
+Part I: Asynchronous circuit design – A tutorial
+
+in the next section. This author has attended several discussions from which
+it is clear that in some cases the concepts and theories have been developed
+independently by several groups, and I will refrain from attempting a precise
+history of the evolution. The reader who is interested in digging deeper into the
+subject is encouraged to consult the literature; in particular the book by Myers
+[95].
+In summary the synthesis process outlined in the previous sections involves
+the following steps:
+
+1 Specify the desired behaviour of the circuit and its (dummy) environ-
+ment using an STG.
+
+2 Check that the STG satisfies properties 1-5 on page 88: 1-bounded, con-
+sistent state assignment, liveness, only input free choice and controlled
+choice and persistency. An STG satisfying these conditions is a valid
+specification of an SI circuit.
+
+3 Check that the specification satisfies property 6 on page 88: complete
+state coding (CSC). If the specification does not satisfy CSC it is neces-
+sary to add one or more state variables or to change the specification
+(which is often possible in 4-phase control circuits where the down-
+going signal transitions can be shuffled around). Some tools (including
+Petrify) can insert state variables automatically, whereas re-shuffling of
+signals – which represents a modification of the specification – is a task
+for the designer.
+
+4 Select an implementation template and derive the Boolean equations for
+the variables themselves, or for the set and reset functions when state
+holding devices are used. Also decide if these equations can be imple-
+mented in atomic gates (typically complex AOI-gates) or if they are to
+be implemented by structures of simpler gates. These decisions may be
+set by switches in the synthesis tools.
+
+5 Derive the Boolean equations for the desired implementation template.
+
+6 Manually modify the implementation such that the circuit can be forced
+into the desired initial state by an explicit reset or initialization signal.
+
+7 Enter the design into a CAD tool and perform simulation and layout of
+the circuit (or the system in which the circuit is used as a component).
+
+6.7.
+Petrify: A tool for synthesizing SI circuits from STGs
+
+Petrify is a public domain tool for manipulating Petri nets and for
+synthesizing SI control circuits from STG specifications. It is available from
+http://www.lsi.upc.es/~jordic/petrify/petrify.html.
+
+
+Chapter 6: Speed-independent control circuits
+103
+
+Petrify is a typical UNIX program with many options and switches. As a
+circuit designer one would probably prefer a push-button synthesis tool that
+accepts a specification and produces a circuit. Petrify can be used this way but
+it is more than this. If you know how to play the game it is an interactive tool
+for specifying, checking, and manipulating Petri nets, STGs and state graphs.
+In the following section we will show some examples of how to design speed-
+independent control circuits.
+Input to Petrify is an STG described in a simple textual format. Using the
+program draw_astg that is part of the Petrify distribution (and that is based
+on the graph visualization package ‘dot’ developed at AT&T) it is possible to
+produce a drawing of the STGs and state graphs. The graphs are “nice” but the
+topological organization may be very different from how the designer thinks
+about the problem. Even the simple task of checking that an STG entered in
+textual form is indeed the intended STG may be difficult.
+To help ease this situation a graphical STG entry and simulation tool called
+VSTGL (Visual STG Lab) has been developed at the Technical University of
+Denmark. To help the designer obtain a correct specification VSTGL includes
+an interactive simulator that allows the designer to add tokens and to fire tran-
+sitions. It also carries out certain simple checks on the STG.
+VSTGL is available from http://vstgl.sourceforge.net/ and it is the
+result of a small student project done by two 4th year students. VSTGL is sta-
+ble and reliable, though naming of signal transitions may seem a bit awkward.
+Petrify can solve CSC violations by inserting state variables, and it can be
+controlled to target the implementation templates introduced in section 6.4:
+
+The -cg option will produce a complex-gate circuit (one where each non-
+input signal is implemented in a single complex gate).
+
+The -gc option will produce a generalized C-element circuit. The outputs
+from Petrify are the Boolean equations for the set and reset functions for
+each non-input signal.
+
+The -gcm option will produce a generalized C-element solution where
+the set and reset functions satisfy the monotonic cover requirement. Con-
+sequently the solution can also be mapped onto a standard C-element
+implementation where the set and reset functions are implemented us-
+ing simple AND and OR gates.
+
+The -tm option will cause Petrify to perform technology mapping onto
+a gate library that can be specified by the user. Technology mapping can
+obviously not be combined with the -cg and -gc options.
+
+Petrify comes with a manual and some examples. In the following section
+we will go through some examples drawn from the previous chapters of the
+book.
+
+
+6.8.
+Design examples using Petrify
+
+In the following we will illustrate the use of Petrify by specifying and syn-
+thesizing: (a) example 2 – the circuit with choice, (b) a control circuit for the
+4-phase bundled-data implementation of the latch from figure 3.3 on page 32
+and (c) a control circuit for the 4-phase bundled-data implementation of the
+MUX from figure 3.3 on page 32. For all of the examples we will assume push
+channels only.
+
+6.8.1
+Example 2 revisited
+
+As a first example, we will synthesize the different versions of example 2
+that we have already designed manually. Figure 6.19 shows the STG as it is
+entered into VSTGL. The corresponding textual input to Petrify (the ex2.g file)
+and the STG as it may be visualized by Petrify are shown in figure 6.20. Note
+in figure 6.20 that an index is added when a signal transition appears more than
+once in order to facilitate the textual input.
+
+Figure 6.19.
+The STG of example 2 as it is entered into VSTGL.
+
+Using complex gates
+
+> petrify ex2.g -cg -eqn ex2-cg.eqn
+
+The STG has CSC.
+# File generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# from <ex2.g> on 6-Mar-01 at 8:30 AM
+....
+
+
+
+.model ex2
+.inputs a b
+.outputs c d
+.graph
+P0 a+ b+
+c+ P1
+b+ c+
+P1 b-
+c- P0
+b- c-
+a- P1
+d- a-
+c+/1 d-
+a+ b+/1
+b+/1 d+
+d+ c+/1
+.marking { P0 }
+.end
+
+Figure 6.20.
+The textual description of the STG for example 2 and the drawing of the STG
+that is produced by Petrify.
+
+# The original TS had (before/after minimization) 9/9 states
+# Original STG:
+2 places,
+10 transitions,
+13 arcs
+...
+# Current STG:
+4 places,
+9 transitions,
+18 arcs
+...
+# It is a Petri net with 1 self-loop places
+...
+
+> more ex2-cg.eqn
+
+# EQN file for model ex2
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 7.00
+
+INORDER = a b c d;
+OUTORDER = [c] [d];
+[c] = b (c + a’) + d;
+[d] = a b c’;
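
The complex-gate equations can be checked against the specification with a small simulation. The sketch below is a Python illustration, not part of Petrify: the helper names `next_outputs` and `run` are made up for this example, and the environment is assumed to wait until the outputs have settled before it changes another input.

```python
def next_outputs(a, b, c, d):
    """Complex-gate equations from ex2-cg.eqn, on 0/1 values."""
    c_next = (b & (c | (1 - a))) | d      # [c] = b (c + a') + d
    d_next = a & b & (1 - c)              # [d] = a b c'
    return c_next, d_next


def run(inputs):
    """Apply input transitions such as "a+" one at a time, let the outputs
    settle after each one, and return the complete trace of transitions."""
    s = {"a": 0, "b": 0, "c": 0, "d": 0}  # all signals low in the initial marking
    trace = []
    for t in inputs:
        s[t[0]] = 1 if t[1] == "+" else 0
        trace.append(t)
        while True:
            c_next, d_next = next_outputs(s["a"], s["b"], s["c"], s["d"])
            if c_next != s["c"]:
                s["c"] = c_next
                trace.append("c" + "-+"[c_next])
            elif d_next != s["d"]:
                s["d"] = d_next
                trace.append("d" + "-+"[d_next])
            else:
                break
    return trace


short_branch = run(["b+", "b-"])
long_branch = run(["a+", "b+", "a-", "b-"])
```

Both traces follow the two branches of the choice in the STG of ex2.g: the branch starting with b+ only toggles c, whereas the branch starting with a+ produces d+, c+ and d- before the environment lowers a.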
+
+Using generalized C-elements:
+
+> petrify ex2.g -gc -eqn ex2-gc.eqn
+
+...
+
+> more ex2-gc.eqn
+
+# EQN file for model ex2
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 12.00
+
+
+INORDER = a b c d;
+OUTORDER = [c] [d];
+[0] = a’ b + d;
+[1] = a b c’;
+[d] = d c’ + [1];
+# mappable onto gC
+[c] = c b + [0];
+# mappable onto gC
+
+The equations for the generalized C-elements should be “interpreted”
+according to equation 6.1 on page 96.
+
+Using standard C-elements and set/reset functions that satisfy the monotonic
+cover constraint:
+
+> petrify ex2.g -gcm -eqn ex2-gcm.eqn
+
+...
+
+> more ex2-gcm.eqn
+
+# EQN file for model ex2
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 10.00
+
+INORDER = a b c d;
+OUTORDER = [c] [d];
+[0] = a’ b c’ + d;
+[d] = a b c’;
+[c] = c b + [0];
+# mappable onto gC
+
+Again, the equations for the generalized C-element should be “interpreted”
+according to equation 6.1 on page 96.
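
The three sets of equations look different, but they only need to agree on the states that the circuit can actually reach. The Python sketch below compares them on the nine states (a, b, c, d) visited by the two branches of the STG. The state list is read off the traces of ex2.g by hand and is therefore an assumption of this example, as are the function names.

```python
# Reachable states (a, b, c, d), read off the two branches of the STG in ex2.g.
REACHABLE = [
    (0, 0, 0, 0), (0, 1, 0, 0), (0, 1, 1, 0), (0, 0, 1, 0), (1, 0, 0, 0),
    (1, 1, 0, 0), (1, 1, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0),
]


def cg(a, b, c, d):
    """Complex gates (ex2-cg.eqn)."""
    return (b & (c | (1 - a))) | d, a & b & (1 - c)


def gc(a, b, c, d):
    """Generalized C-elements (ex2-gc.eqn)."""
    t0 = ((1 - a) & b) | d                 # [0] = a' b + d
    t1 = a & b & (1 - c)                   # [1] = a b c'
    return (c & b) | t0, (d & (1 - c)) | t1


def gcm(a, b, c, d):
    """Monotonic-cover solution (ex2-gcm.eqn)."""
    t0 = ((1 - a) & b & (1 - c)) | d       # [0] = a' b c' + d
    return (c & b) | t0, a & b & (1 - c)


disagreements = [s for s in REACHABLE if not (cg(*s) == gc(*s) == gcm(*s))]
unreachable_example = (cg(0, 0, 0, 1), gc(0, 0, 0, 1))
```

On the reachable states the list of disagreements is empty. Outside them the functions may differ, as the state (0, 0, 0, 1) shows; this is harmless because the circuit never enters that state when it starts in the initial marking.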
+
+6.8.2
+Control circuit for a 4-phase bundled-data latch
+
+Figure 6.21 shows an asynchronous handshake latch with a dummy environ-
+ment on its left and right side. The latch can be implemented using a normal
+N-bit wide transparent latch and the control circuit we are about to design.
+A driver may be needed for the latch control signal Lt. In order to make the
+latch controller robust and independent of the delay in this driver, we may feed
+the buffered signal (Lt) back such that the controller knows when the signal
+has been presented to the latch. Figure 6.21 also shows fragments of the STG
+specification – the handshaking of the left and right hand environments and
+ideas about the behaviour of the latch controller. Initially Lt is low and the
+latch is transparent, and when new input data arrives they will flow through
+the latch. In response to Rin+, and provided that the right hand environment
+is ready for another handshake (Aout = 0), the controller may generate Rout+
+right away. Furthermore the data should be latched, Lt+, and an acknowledge
+sent to the left hand environment, Ain+. A symmetric scenario is possible in
+response to Rin- when the latch is switched back into the transparent mode.
+
+Figure 6.21.
+A 4-phase bundled-data handshake latch and some STG fragments that capture
+ideas about its behaviour. EN = 0: latch is transparent; EN = 1: latch holds data.
+
+Figure 6.22.
+The resulting STG for the latch controller (as input to VSTGL).
+
+Combining these STG fragments results in the STG shown in figure 6.22.
+Running Petrify yields the following:
+
+> petrify lctl.g -cg -eqn lctl-cg.eqn
+
+The STG has CSC.
+# File generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# from <lctl.g> on 6-Mar-01 at 11:18 AM
+...
+# The original TS had (before/after minimization) 16/16 states
+# Original STG:
+0 places,
+10 transitions,
+14 arcs (
+0 pt + ...
+# Current STG:
+0 places,
+10 transitions,
+12 arcs (
+0 pt + ...
+# It is a Marked Graph
+.model lctl
+.inputs
+Aout Rin
+.outputs
+Lt Rout Ain
+.graph
+Rout+ Aout+ Lt+
+Lt+ Ain+
+Aout+ Rout-
+Rin+ Rout+
+Ain+ Rin-
+Rin- Rout-
+Ain- Rin+
+Rout- Lt- Aout-
+Aout- Rout+
+Lt- Ain-
+.marking { <Aout-,Rout+> <Ain-,Rin+> }
+.end
+
+> more lctl-cg.eqn
+
+# EQN file for model lctl
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 7.00
+
+INORDER = Aout Rin Lt Rout Ain;
+OUTORDER = [Lt] [Rout] [Ain];
+[Lt] = Rout;
+[Rout] = Rin (Rout + Aout’) + Aout’ Rout;
+[Ain] = Lt;
+
+The equation for [Rout] may be rewritten as:
+
+[Rout] = Rin Aout’ + Rout (Rin + Aout’)
+
+which can be recognized to be a C-element with inputs Rin and Aout’.
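
The rewriting step is easy to confirm by exhaustive enumeration. The Python sketch below compares the equation produced by Petrify, the rewritten form, and a behavioural C-element model over all eight input combinations; the function names are chosen for this example only.

```python
from itertools import product


def rout_petrify(rin, aout, rout):
    # [Rout] = Rin (Rout + Aout') + Aout' Rout
    return (rin and (rout or not aout)) or ((not aout) and rout)


def rout_rewritten(rin, aout, rout):
    # [Rout] = Rin Aout' + Rout (Rin + Aout')
    return (rin and not aout) or (rout and (rin or not aout))


def c_element(x, y, prev):
    # Output follows the inputs when they agree, otherwise it holds.
    return x if x == y else prev


mismatches = [
    (rin, aout, rout)
    for rin, aout, rout in product([False, True], repeat=3)
    if not (
        bool(rout_petrify(rin, aout, rout))
        == bool(rout_rewritten(rin, aout, rout))
        == c_element(rin, not aout, rout)
    )
]
```

An empty `mismatches` list means that the three descriptions are the same Boolean function, i.e. a C-element with inputs Rin and the inverse of Aout.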
+
+
+6.8.3
+Control circuit for a 4-phase bundled-data MUX
+
+After the above two examples, where we have worked out already well-
+known circuit implementations, let us now consider a more complex example
+that cannot (easily) be done by hand. Figure 6.23 shows the handshake multi-
+plexer from figure 3.3 on page 32. It also shows how the handshake MUX can
+be implemented by a “regular” combinational circuit multiplexer and a control
+circuit. Below we will design a speed-independent control circuit for a 4-phase
+bundled-data MUX.
+
+Figure 6.23.
+The handshake MUX and the structure of a 4-phase bundled-data implementation.
+(Signals: In0Req/In0Ack/In0Data, In1Req/In1Ack/In1Data, Ctl/CtlReq/CtlAck and
+OutReq/OutAck/OutData.)
+
+The MUX has three input channels and we must assume they are connected
+to three independent dummy environments. The dots remind us that the chan-
+nels are push channels. When specifying the behaviour of the MUX control
+circuit and its (dummy) environment it is important to keep this in mind. A
+typical error when drawing STGs is to specify an environment with a more
+limited behaviour than the real environment. For each of the three input
+channels the STG has cycles involving (Req+; Ack+; Req-; Ack-; etc.), and each
+of these cycles is initialized to contain a token.
+As mentioned previously, it is sometimes easier to deal with control
+channels using dual-rail (or in general 1-of-N) data encodings since this implies
+dealing with one-hot (decoded) control signals. As a first step towards the STG
+for a MUX using entirely 4-phase bundled-data channels, figure 6.24 shows an
+STG for a MUX where the control channel uses dual-rail signals (Ctl.t, Ctl.f
+and CtlAck). This STG can then be combined with the STG-fragment for a
+4-phase bundled-data channel from figure 6.8 on page 91, resulting in the STG
+in figure 6.25. The “intermediate” STG in figure 6.24 emphasizes the fact that
+the MUX can be seen as a controlled join – the two mutually exclusive and
+structurally identical halves are basically the STGs of a join.
+
+
+Figure 6.24.
+The STG specification of the control circuit for a 4-phase bundled-data MUX
+using a 4-phase dual-rail control channel. Combined with the STG fragment for a bundled-data
+(control) channel the resulting STG for an all 4-phase bundled-data MUX is obtained (figure 6.25).
+
+Figure 6.25.
+The final STG specification of the control circuit for the 4-phase bundled-data
+MUX. All channels, including the control channel, are 4-phase bundled-data.
+
+
+Below is the result of running Petrify, this time with the -o option that writes
+the resulting STG (possibly with state signals added) in a file rather than to
+stdout.
+
+> petrify MUX4p.g -o MUX4p-csc.g -gcm -eqn MUX4p-gcm.eqn
+
+State coding conflicts for signal In1Ack
+State coding conflicts for signal In0Ack
+State coding conflicts for signal OutReq
+The STG has no CSC.
+Adding state signal: csc0
+The STG has CSC.
+
+> more MUX4p-gcm.eqn
+
+# EQN file for model MUX4p
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 29.00
+
+INORDER = In0Req OutAck In1Req Ctl CtlReq In1Ack In0Ack OutReq
+CtlAck csc0;
+OUTORDER = [In1Ack] [In0Ack] [OutReq] [CtlAck] [csc0];
+[In1Ack] = OutAck csc0’;
+[In0Ack] = OutAck csc0;
+[2] = CtlReq (In1Req csc0’ + In0Req Ctl’);
+[3] = CtlReq’ (In1Req’ csc0’ + In0Req’ csc0);
+[OutReq] = OutReq [3]’ + [2];
+# mappable onto gC
+[5] = OutAck’ csc0;
+[CtlAck] = CtlAck [5]’ + OutAck;
+# mappable onto gC
+[7] = OutAck’ CtlReq’;
+[8] = CtlReq Ctl;
+[csc0] = csc0 [8]’ + [7];
+# mappable onto gC
+
+As can be seen, the STG does not satisfy CSC (complete state coding) as
+several markings correspond to the same state vector, so Petrify adds an
+internal state signal csc0. The intuition is that after CtlReq- the Boolean signal
+Ctl is no longer valid but the MUX control circuit has not yet finished its job.
+If the circuit cannot tell from its input signals how to continue, it needs
+an internal state variable in which to keep this information. The signal csc0 is
+an active-low signal: it is set low if Ctl = 1 when CtlReq+ (the reset term [8] is
+CtlReq Ctl) and it is set back to high when OutAck and CtlReq are both low. The
+fact that the signal csc0 is high when all channels are idle (all handshake
+signals are low) should be kept in mind when dealing with reset, cf. section 6.5.
+The exact details of how the state variable is added can be seen from the
+STG that includes csc0 which is produced by Petrify before it synthesizes the
+logic expressions for the circuit.
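
To see the state signal at work, the equations in MUX4p-gcm.eqn can be simulated for one complete handshake cycle. The Python sketch below is an illustration under stated assumptions: the helper names are invented, the environment is assumed to raise CtlReq together with the selected input request and to lower both after OutAck has gone high, and excited outputs are fired one at a time in a fixed order. This is one of several orderings that the STG allows.

```python
def next_values(s):
    """Next-state functions transcribed from MUX4p-gcm.eqn (0/1 values)."""
    inv = lambda x: 1 - x
    t2 = s["CtlReq"] & ((s["In1Req"] & inv(s["csc0"])) | (s["In0Req"] & inv(s["Ctl"])))
    t3 = inv(s["CtlReq"]) & ((inv(s["In1Req"]) & inv(s["csc0"]))
                             | (inv(s["In0Req"]) & s["csc0"]))
    t5 = inv(s["OutAck"]) & s["csc0"]
    t7 = inv(s["OutAck"]) & inv(s["CtlReq"])
    t8 = s["CtlReq"] & s["Ctl"]
    return {
        "In1Ack": s["OutAck"] & inv(s["csc0"]),
        "In0Ack": s["OutAck"] & s["csc0"],
        "OutReq": (s["OutReq"] & inv(t3)) | t2,
        "CtlAck": (s["CtlAck"] & inv(t5)) | s["OutAck"],
        "csc0": (s["csc0"] & inv(t8)) | t7,
    }


def settle(s):
    """Fire excited outputs one at a time until the circuit is stable."""
    fired = []
    while True:
        nxt = next_values(s)
        excited = [k for k in nxt if nxt[k] != s[k]]
        if not excited:
            return fired
        k = excited[0]
        s[k] = nxt[k]
        fired.append(k + ("+" if s[k] else "-"))


def cycle(ctl):
    """One handshake cycle selecting In1 (ctl = 1) or In0 (ctl = 0)."""
    req = "In1Req" if ctl else "In0Req"
    s = dict.fromkeys(["In0Req", "OutAck", "In1Req", "Ctl", "CtlReq",
                       "In1Ack", "In0Ack", "OutReq", "CtlAck"], 0)
    s["csc0"] = 1                      # idle value of the state signal
    trace = []
    for changes in ({"Ctl": ctl, req: 1, "CtlReq": 1}, {"OutAck": 1},
                    {req: 0, "CtlReq": 0}, {"OutAck": 0}):
        s.update(changes)              # environment step
        trace += settle(s)             # circuit response
    return trace


trace_in1 = cycle(1)
trace_in0 = cycle(0)
```

In this sketch csc0 toggles only in the cycle where Ctl is 1 when CtlReq rises, and it returns to 1 once OutAck and CtlReq are both low again. In the cycle with Ctl equal to 0 it stays high throughout, and its value steers OutAck to In0Ack rather than In1Ack.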
+
+
+It is sometimes possible to avoid adding a state variable by re-shuffling sig-
+nal transitions. It is not always obvious what yields the best solution. In prin-
+ciple more concurrency should improve performance, but it also results in a
+larger state-space for the circuit and this often tends to result in larger and
+slower circuits. A discussion of performance also involves the interaction with
+the environment. There is plenty of room for exploring alternative solutions.
+
+Figure 6.26.
+The modified STG specification of the 4-phase bundled-data MUX control circuit.
+
+In figure 6.26 we have removed some concurrency from the MUX STG by
+ordering the transitions on In0Ack/In1Ack and CtlAck (In0Ack+ → CtlAck+,
+In1Ack+ → CtlAck+ etc.). This STG satisfies CSC and the resulting circuit is
+marginally smaller:
+
+> more MUX4p-gcm.eqn
+
+# EQN file for model MUX4pB
+# Generated by petrify 4.0 (compiled 22-Dec-98 at 6:58 PM)
+# Outputs between brackets "[out]" indicate a feedback to input "out"
+# Estimated area = 27.00
+
+INORDER = In0Req OutAck In1Req Ctl CtlReq In1Ack In0Ack OutReq CtlAck;
+OUTORDER = [In1Ack] [In0Ack] [OutReq] [CtlAck];
+[0] = Ctl CtlReq OutAck;
+[1] = Ctl’ CtlReq OutAck;
+[2] = CtlReq (Ctl’ In0Req + Ctl In1Req);
+[3] = CtlReq’ (In0Ack’ In1Req’ + In0Req’ In0Ack);
+[OutReq] = OutReq [3]’ + [2];
+# mappable onto gC
+[CtlAck] = In1Ack + In0Ack;
+[In1Ack] = In1Ack OutAck + [0];
+# mappable onto gC
+[In0Ack] = In0Ack OutAck + [1];
+# mappable onto gC
+
+6.9.
+Summary
+
+This chapter has provided an introduction to the design of asynchronous se-
+quential (control) circuits with the main focus on speed-independent circuits
+and specifications using STGs. The material was presented from a practical
+view in order to enable the reader to go ahead and design his or her own speed-
+independent control circuits. This, rather than comprehensiveness, has been
+our goal, and as mentioned in the introduction we have largely ignored impor-
+tant alternative approaches including burst-mode and fundamental-mode cir-
+cuits.
+
+
+
+Chapter 7
+
+ADVANCED 4-PHASE BUNDLED-DATA
+PROTOCOLS AND CIRCUITS
+
+The previous chapters have explained the basics of asynchronous circuit
+design. In this chapter we will address 4-phase bundled-data protocols and
+circuits in more detail. This will include: (1) a variety of channel types, (2)
+protocols with different data-validity schemes, and (3) a number of more so-
+phisticated latch control circuits. These latch controllers are interesting for
+two reasons: they are very useful in optimizing the circuits for area, power and
+speed, and they are nice examples of the types of control circuits that can be
+specified and synthesized using the STG-based techniques from the previous
+chapter.
+
+7.1.
+Channels and protocols
+
+7.1.1
+Channel types
+
+So far we have considered only push channels where the sender is the active
+party that initiates the communication of data, and where the receiver is the
+passive party. The opposite situation, where the receiver is the active party
+that initiates the communication of data, is also possible, and such a channel
+is called a pull channel. A channel that carries no data is called a nonput
+channel and is used for synchronization purposes. Finally, it is also possible
+to communicate data from a receiver to a sender along with the acknowledge
+signal. Such a channel is called a biput channel. In a 4-phase bundled-data
+implementation data from the receiver is bundled with the acknowledge, and in
+a 4-phase dual-rail protocol the passive party will acknowledge the reception
+of a codeword by returning a codeword rather than just an acknowledge
+signal. Figure 7.1 illustrates these four channel types (nonput, push, pull, and
+biput) assuming a bundled-data protocol. Each channel type may, of course,
+use any of the handshake protocols (2-phase or 4-phase) and data encodings
+(bundled-data, dual-rail, m-of-n, etc.) introduced previously.
+
+7.1.2
+Data-validity schemes
+
+For the bundled-data protocols it is also relevant to define the time interval
+in which data is valid, and figure 7.2 illustrates the different possibilities.
+For a push channel the request signal carries the message “here is new data
+for you” and the acknowledge signal carries the information “thank you, I have
+absorbed the data, and you may release the data wires.” Similarly, for a pull
+channel the request signal carries the message “please send new data” and the
+acknowledge signal carries the message “here is the data that you requested.”
+It is the signal transitions on the request and acknowledge wires that are in-
+terpreted in this way. A 4-phase handshake involves two transitions on each
+wire and, depending on whether it is the rising or the falling transitions on
+the request and acknowledge signals that are interpreted in this way, several
+data-validity schemes emerge: early, broad, late and extended early.
+Since 2-phase handshaking does not involve any redundant signal transitions
+there is only one data-validity scheme for each channel type (push or pull), as
+illustrated in figure 7.2.
+It is common to all of the data-validity schemes that the data is valid some
+time before the event that indicates the start of the interval, and that it remains
+stable until some time after the event that indicates the end of the interval.
+Furthermore, all of the data-validity schemes express the requirements of the
+party that receives the data. The fact that a receiver signals “thank you, I have
+absorbed the data, and you may go ahead and release the data wires,” does
+not mean that this actually happens – the sender may prolong the data-validity
+interval, and the receiver may even rely on this.
+A typical example of this is the extended-early data-validity schemes in
+figure 7.2. On a push channel the data-validity interval begins some time before
+Req+ and ends some time after Req-.
+
+7.1.3
+Discussion
+
+The above classification of channel types and handshake protocols stems
+mostly from Peeters’ Ph.D. thesis [112]. The choice of channel type, hand-
+shake protocol and data-validity scheme obviously affects the implementation
+of the handshake components in terms of area, speed, and power. Just as a
+design may use a mix of different bundled-data and dual-rail protocols, it may
+also use a mix of channel types and data-validity schemes.
+For example, a 4-phase bundled-data push channel using a broad or an
+extended-early data-validity scheme is a very convenient input to a function
+block that is implemented using precharged CMOS circuitry: the request signal
+may directly control the precharge and evaluate transistors because the broad
+and the extended-early data-validity schemes guarantee that the input data is
+stable during the evaluate phase.
+
+
+Chapter 7: Advanced 4-phase bundled-data protocols and circuits
+117
+
+Figure 7.1.
+The four fundamental channel types: nonput, push, biput, and pull.
+
+Figure 7.2.
+Data-validity schemes for 2-phase and 4-phase bundled data.
+(Panels: 4-phase push channel, 4-phase pull channel, and 2-phase protocols; the
+4-phase panels show early, broad, late and extended early data.)
+
+
+Another interesting option in a 4-phase bundled-data design is to use func-
+tion blocks that assume a broad data validity scheme on the input channel and
+that produce a late data validity scheme on the output channel. Under these
+assumptions it is possible to use a symmetric delay element that matches only
+half of the latency of the combinatorial circuit. The idea is that the sum of the
+delay of Req+ and Req- matches the latency of the combinatorial circuit, and
+that Req- indicates valid output data. In [112, p.46] this is referred to as true
+single phase because the return-to-zero part of the handshaking is no longer
+redundant. This approach also has implications for the implementation of the
+components that connect to the function block.
+It is beyond the scope of this text to enter into a discussion of where and
+when to use the different options. The interested reader is referred to [112, 77]
+for more details.
+
+7.2.
+Static type checking
+
+When designing circuits it is useful to think of the combination of channel
+type and data-validity scheme as being similar to a data type in a programming
+language, and to do some static type checking of the circuit being designed
+by asking questions like: “what types are allowed on the input ports of this
+handshake component?” and “what types are produced on the output ports
+of this handshake component?”. The latter may depend on the type that was
+provided on the input port. A similar form of type checking for synchronous
+circuits using two-phase non-overlapping clocks has been proposed in [104]
+and used in the Genesil silicon compiler [67].
+
+"broad"
+
+"extended early"
+
+"late"
+"early"
+
+Figure 7.3.
+Hierarchical ordering of the four data-validity schemes for the 4-phase bundled-
+data protocol.
+
+Figure 7.3 shows a hierarchical ordering of the four possible types (data
+validity schemes) for a 4-phase bundled-data push channel: “broad” is the
+strongest type and it can be used as input to circuits that require any of the
+weaker types. Similarly “extended early” may be used where only “early” is
+required. Circuits that are transparent to handshaking (function blocks, join,
+fork, merge, mux, demux) produce outputs whose type is at most as strong as
+their (weakest) input type. In general the input and output types are the same
+but there are examples where this is not the case. The only circuit that can
+
+
+Chapter 7: Advanced 4-phase bundled-data protocols and circuits
+119
+
+produce outputs whose type is stronger than the input type is a latch. Let us
+look at some examples:
+
+A join that concatenates two inputs of type “extended early” produces
+outputs that are only “early.”
+
+From the STG fragments in figure 6.21 on page 107 it may be seen that
+the simple 4-phase bundled-data latch controller from the previous chap-
+ters (figure 2.9 on page 18) assumes “early” inputs and that it produces
+“extended-early” outputs.
+
+The 4-phase bundled-data MUX design in section 6.8.3 assumes “extended
+early” on its control input (the STG in figure 6.25 on page 110
+specifies stable input from CtlReq+ to CtlReq-).
+
+The reader is encouraged to continue this exercise and perhaps draw the as-
+sociated timing diagrams from which the types of the outputs may be deduced.
+The type checking explained here is a very useful technique for debugging
+circuits that exhibit erroneous behaviour.
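
The ordering of the types can be expressed as interval containment. The Python sketch below is a simplified model for a push channel. Only the extended-early interval (Req+ to Req-) is stated in the text above; the intervals used for early, broad and late are assumptions based on the usual definitions of these schemes. The names `VALIDITY` and `can_drive` are made up for the example.

```python
# Events of one 4-phase handshake on a push channel, in order of occurrence.
EVENTS = ["Req+", "Ack+", "Req-", "Ack-"]

# Data-validity interval of each scheme as (start event, end event).
# Only "extended early" is taken from the text; the others are assumed.
VALIDITY = {
    "early": ("Req+", "Ack+"),
    "broad": ("Req+", "Ack-"),
    "late": ("Req-", "Ack-"),
    "extended early": ("Req+", "Req-"),
}


def can_drive(provided, required):
    """True if data of type `provided` is valid throughout the interval
    that a consumer of type `required` relies on."""
    p_start, p_end = (EVENTS.index(e) for e in VALIDITY[provided])
    r_start, r_end = (EVENTS.index(e) for e in VALIDITY[required])
    return p_start <= r_start and r_end <= p_end


broad_drives_all = all(can_drive("broad", t) for t in VALIDITY)
```

Under these assumptions “broad” can drive every other type and “extended early” can drive “early”, whereas “early” cannot drive “extended early” and “extended early” cannot drive “late”. The model does not cover components such as the join, whose output type can be weaker than both of its input types.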
+
+7.3.
+More advanced latch control circuits
+
+In previous chapters we have only considered 4-phase bundled-data hand-
+shake latches using a latch controller consisting of a C-element and an inverter
+(figure 2.9 on page 18). In [41] this circuit is called a simple latch controller,
+and in [77] it is called an un-decoupled latch controller.
+
+Figure 7.4.
+(a) A FIFO based on handshake latches, and (b) its implementation using simple
+latch controllers and level-sensitive latches. The FIFO fills with valid data in every other latch.
+A latch is transparent when EN = 0 and it is opaque (holding data) when EN = 1.
+
+Figure 7.5.
+A FIFO where every level-sensitive latch holds valid data when the FIFO is full.
+The semi-decoupled and fully-decoupled latch controllers from [41] allow this behaviour.
+
+When a pipeline or FIFO that uses the simple latch controller fills, every
+second handshake latch will be holding a valid token and every other handshake
+latch will be holding an empty token as illustrated in figure 7.4(a) – the
+static spread of the pipeline is S = 2.
+This token picture is a bit misleading. The empty tokens correspond to
+the return-to-zero part of the handshaking and in reality the latches are not
+“holding empty tokens” – they are transparent, and this represents a waste of
+hardware resource.
+Ideally one would want to store a valid token in every level-sensitive latch
+as illustrated in figure 7.5 and just “add” the empty tokens to the data stream
+on the interfaces as part of the handshaking. In [41] Furber and Day explain
+the design of two such improved 4-phase bundled-data latch control circuits:
+a semi-decoupled and a fully-decoupled latch controller. In addition to these
+specific circuits the paper also provides a nice illustration of the use of STGs
+for designing control circuits following the approach explained in chapter 6.
+The three latch controllers have the following characteristics:
+
+The simple or un-decoupled latch controller has the problem that new
+input data can only be latched when the previous handshake on the output
+channel has completed, i.e., after Aout↓. Furthermore, the handshakes
+on the input and output channels interact tightly: Rout↑ ≼ Ain↑
+and Rout↓ ≼ Ain↓.
+
+The semi-decoupled latch controller relaxes these requirements somewhat:
+new inputs may be latched after Rout↓, and the controller may
+produce Ain↑ independently of the handshaking on the output channel –
+the interaction between the input and output channels has been relaxed
+to: Aout↑ ≼ Ain↓.
+
+The fully-decoupled latch controller further relaxes these requirements:
+new inputs may be latched after Aout↑ (i.e. as soon as the downstream
+latch has indicated that it has latched the current data), and the handshaking
+on the input channel may complete without any interaction with
+the output channel.
+
+Another potential drawback of the simple latch controller is that it is unable
+to take advantage of function blocks with asymmetric delays as explained in
+
+
+
+Latch controller      Static spread, S    Period, P
+“Simple”              2                   2Lr + 2Lf.V
+“Semi-decoupled”      1                   2Lr + 2Lf.V
+“Fully-decoupled”     1                   2Lr + Lf.V + Lf.E
+
+Table 7.1.
+Summary of the characteristics of the latch controllers in [41].
+
+section 4.4.1 on page 52. The fully-decoupled latch controller presented in
+[41] does not have this problem. Due to the decoupling of the input and out-
+put channels the dependency graph critical cycle that determines the period, P,
+only visits nodes related to two neighbouring pipeline stages and the period be-
+comes minimum (c.f. section 4.4.1). Table 7.1 summarizes the characteristics
+of the simple, semi-decoupled and fully-decoupled latch controllers.
+All of the above-mentioned latch controllers are “normally transparent”
+and this may lead to excessive power consumption because inputs that make
+multiple transitions before settling will propagate through several consecutive
+pipeline stages. By using “normally opaque” latch controllers every latch will
+act as a barrier. If a handshake latch that is holding a bubble is exposed to
+a token on its input, the latch controller switches the latch into the transpar-
+ent mode, and when the input data have propagated safely into the latch, it
+will switch the latch back into the opaque mode in which it will hold the data.
+In the design of the asynchronous MIPS processor reported in [23] we expe-
+rienced approximately a 50 % power reduction when using normally opaque
+latch controllers instead of normally transparent latch controllers.
+Figure 7.6 shows the STG specification and the circuit implementation of
+the normally opaque latch controller used in [23]. As seen from the STG there
+is quite a strong interaction between the input and output channels, but the
+dependency graph critical cycle that determines the period only visits nodes
+related to two neighbouring pipeline stages and the period is minimum. It may
+be necessary to add some delay into the Lt↓ to Rout↑ path in order to ensure
+that input signals have propagated through the latch before Rout↑. Furthermore
+the duration of the Lt = 0 pulse that causes the latch to be transparent is
+determined by gate delays in the latch controller itself, and the pulse must be
+long enough to ensure safe latching of the data. The latch controller assumes
+a broad data-validity scheme on its input channel and it provides a broad data-
+validity scheme on its output channel.
+
+7.4.
+Summary
+
+This chapter introduced a selection of channel types, data-validity schemes
+and a selection of latch controllers. The presentation was rather brief; the aim
+was just to present the basics and to introduce some of the many options and
+
+
+
+Figure 7.6.
+The STG specification and the circuit implementation of the normally opaque
+fully-decoupled latch controller from [23]. Lt = 0: latch is transparent;
+Lt = 1: latch is opaque (holding data).
+
+possibilities for optimizing the circuits. The interested reader is referred to the
+literature for more details.
+Finally a warning: the static data-flow view of asynchronous circuits pre-
+sented in chapter 3 (i.e. that valid and empty tokens are copied forward con-
+trolled by the handshaking between latch controllers) and the performance
+analysis presented in chapter 4 assume that all handshake latches use the sim-
+ple normally transparent latch controller. When using semi-decoupled or fully-
+decoupled latch controllers, it is necessary to modify the token flow view, and
+to rework the performance analysis. To a first order one might substitute each
+semi-decoupled or fully-decoupled latch controller with a pair of simple latch
+controllers. Furthermore a ring need only include two handshake latches if
+semi-decoupled or fully-decoupled latch controllers are used.
+
+
+Chapter 8
+
+HIGH-LEVEL LANGUAGES AND TOOLS
+
+This chapter addresses languages and CAD tools for the high-level modeling
+and synthesis of asynchronous circuits. The aim is briefly to introduce some
+basic concepts and a few representative and influential design methods. The
+interested reader will find more details elsewhere in this book (in Part II and
+chapter 13) as well as in the original papers that are cited in the text. In the last
+section we address the use of VHDL for the design of asynchronous circuits.
+
+8.1.
+Introduction
+
+Almost all work on the high-level modeling and synthesis of asynchronous
+circuits is based on the use of a language that belongs to the CSP family of
+languages, rather than one of the two industry-standard hardware description
+languages, VHDL and Verilog. Asynchronous circuits are highly concurrent
+and communication between modules is based on handshake channels. Con-
+sequently a hardware description language for asynchronous circuit design
+should provide efficient primitives supporting these two characteristics. The
+CSP language proposed by Hoare [57, 58] meets these requirements. CSP
+stands for “Communicating Sequential Processes” and its key characteristics
+are:
+
+Concurrent processes.
+
+Sequential and concurrent composition of statements within a process.
+
+Synchronous message passing over point-to-point channels (supported
+by the primitives send, receive and – possibly – probe).
+
+CSP is a member of a large family of languages for programming concurrent
+systems in general: OCCAM [68], LOTOS [108, 16], and CCS [89], as well as
+languages defined specifically for designing asynchronous circuits: Tangram
+[142, 135], CHP [81], and Balsa [9, 10]. Further details are presented else-
+where in this book on Tangram (in Part III, chapter 13) and Balsa (in Part II).
+In this chapter we first take a closer look at the CSP language constructs
+supporting communication and concurrency. This will include a few sample
+
+
+programs to give a flavour of this type of language. Following this we briefly
+explain two rather different design methods that both take a CSP-like program
+as the starting point for the design:
+
+At Philips Research Laboratories, van Berkel, Peeters, Kessels et al. have
+developed a proprietary language, Tangram, and an associated silicon
+compiler [142, 141, 135, 112]. Using a process called syntax-directed
+compilation, the synthesis tool maps a Tangram program into a structure
+of handshake components. Using these tools several significant asyn-
+chronous chips have been designed within Philips [137, 138, 144, 73,
+74]. The last of these is a smart-card circuit that is described in chap-
+ter 13 on page 221.
+
+At Caltech Martin has developed a language CHP – Communicating
+Hardware Processes – and a set of tools that supports a partly manual,
+partly automated design flow that targets highly optimized transistor-
+level implementations of QDI 4-phase dual-rail circuits [80, 83].
+
+CHP has a syntax that is similar to CSP (using various special symbols)
+whereas Tangram has a syntax that is more like a traditional programming
+language (using keywords); but in essence they are both very similar to CSP.
+In the last section of this chapter we will introduce a VHDL-package that
+provides CSP-like message passing and explain an associated VHDL-based
+design flow that supports a manual step-wise refinement design process.
+
+8.2.
+Concurrency and message passing in CSP
+
+The “sequential processes” part of the CSP acronym denotes that each pro-
+cess is described by a program whose statements are executed in sequence one
+by one. A semicolon is used to separate statements (as in many other program-
+ming languages). The semicolon can be seen as an operator that combines
+statements into programs. In this respect a process in CSP is very similar to a
+process in VHDL. However, CSP also allows the parallel composition of statements
+within a process. The symbol “∥” denotes parallel composition. This
+feature is not found in VHDL, whereas the fork-join construct in Verilog does
+allow statement-level concurrency within a process.
+The “communicating” part of the CSP acronym refers to synchronous mes-
+sage passing using point-to-point channels as illustrated in figure 8.1, where
+two processes P1 and P2 are connected by a channel named C. Using a send
+statement, C!x, process P1 sends (denoted by the ‘!’ symbol) the value of its
+variable x on channel C, and using a receive statement, C?y, process P2 re-
+ceives (denoted by the ‘?’ symbol) from channel C a value that is assigned
+to its variable y. The channel is memoryless and the transfer of the value of
+variable x in P1 into variable y in P2 is an atomic action. This has the effect
+
+
+
+P1:                 P2:
+var x ...           var y,z ...
+....                ....
+x:= 17;             C?y;
+C!x;                z:= y -17;
+....                ....
+
+Figure 8.1.
+Two processes P1 and P2 connected by a channel C. Process P1 sends the value of
+its variable x to the channel C, and process P2 receives the value and assigns it to its variable y.
+
+of synchronizing processes P1 and P2. Whichever comes first will wait for
+the other party, and the send and receive statements complete at the same time.
+The term rendezvous is sometimes used for this type of synchronization.
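+
+The rendezvous behaviour described above can be sketched in a general-purpose
+language. The following Python fragment is only an illustration, not part of
+CSP or Tangram: the class name Channel and the helper run_example are
+hypothetical, and the semaphore-based implementation is one of several
+possible ways to make send and receive wait for each other. It mirrors the
+processes P1 and P2 of figure 8.1.

```python
import threading


class Channel:
    """Memoryless point-to-point channel with rendezvous semantics (illustrative)."""

    def __init__(self):
        self._send_lock = threading.Lock()         # one sender at a time
        self._item_ready = threading.Semaphore(0)  # sender -> receiver
        self._item_taken = threading.Semaphore(0)  # receiver -> sender
        self._value = None

    def send(self, value):
        """C!x: returns only after the receiver has taken the value."""
        with self._send_lock:
            self._value = value
            self._item_ready.release()   # offer the value
            self._item_taken.acquire()   # wait for the receiver

    def receive(self):
        """C?y: blocks until a sender offers a value."""
        self._item_ready.acquire()
        value = self._value
        self._item_taken.release()       # let the sender continue
        return value


def run_example():
    """P1 sends x = 17 on channel c; P2 receives it into y and computes z = y - 17."""
    c = Channel()
    result = {}

    def p1():
        x = 17
        c.send(x)

    def p2():
        y = c.receive()
        result["z"] = y - 17

    threads = [threading.Thread(target=p) for p in (p1, p2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["z"]
```

+Whichever thread reaches the channel first blocks until the other arrives,
+which is the synchronizing effect that the text attributes to the channel.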
+When a process executes a send (or receive) statement, it commits to the
+communication and suspends until the process at the other end of the channel
+performs its receive (or send) statement. This may not always be desirable, and
+Martin has extended CSP with a probe construct [79] which allows the process
+at the passive end of a channel to probe whether or not a communication is
+pending on the channel, without committing to any communication. The probe
+is a function which takes a channel name as its argument and returns a Boolean.
+The syntax for probing channel C is C̄ (the channel name with an overbar).
+As an aside we mention that some languages for programming concurrent
+systems assume channels with (possibly unbounded) buffering capability. The
+implication of this is that the channel acts as a FIFO, and the communicating
+processes do not synchronize when they communicate. Consequently this form
+of communication is called asynchronous message passing.
+Going back to our synchronous message passing, it is obvious that the phys-
+ical implementation of a memoryless channel is simply a set of wires together
+with a protocol for synchronizing the communicating processes. It is also obvi-
+ous that any of the protocols that we have considered in the previous chapters
+may be used. Synchronous message passing is thus a very useful language
+construct that supports the high-level modeling of asynchronous circuits by
+abstracting away the exact details of the data encoding and handshake protocol
+used on the channel.
+Unfortunately both VHDL and Verilog lack such primitives. It is possible
+to write low-level code that implements the handshaking, but it is highly unde-
+sirable to mix such low-level details into code whose purpose is to capture the
+high-level behaviour of the circuit.
+In the following section we will provide some small program examples to
+give a flavour of this type of language. The examples will be written in Tan-
+
+
+
+gram as they also serve the purpose of illustrating syntax-directed compilation
+in a subsequent section. The source code, handshake circuit figures, and frag-
+ments of the text have been kindly provided by Ad Peeters from Philips.
+Manchester University has recently developed a similar language and syn-
+thesis tool that is available in the public domain [10], and is introduced in Part
+II of this book. Other examples of related work are presented in [17] and [21].
+
+8.3.
+Tangram: program examples
+
+This section provides a few simple Tangram program examples: a 2-place
+shift register, a 2-place ripple FIFO, and a greatest common divisor function.
+
+8.3.1
+A 2-place shift register
+
+Figure 8.2 shows the code for a 2-place shift register named ShiftReg. It is
+a process with an input channel In and an output channel Out, both carrying
+variables of type [0..255]. There are two local variables x and y that are
+initialized to 0. The process performs an unbounded repetition of a sequence
+of three statements: out!y; y:=x; in?x.
+
+T = type [0..255]
+& ShiftReg : main proc(in? chan T & out! chan T).
+begin
+  & var x,y: var T := 0
+|
+  forever do
+    out!y ; y:=x ; in?x
+  od
+end
+
+Figure 8.2.
+A Tangram program for a 2-place shift register.
+
+8.3.2
+A 2-place (ripple) FIFO
+
+Figure 8.3 shows the Tangram program for a 2-place first-in first-out buffer
+named Fifo. It can be understood as two 1-place buffers that are operating in
+parallel and that are connected by a channel c. At first sight it appears very
+similar to the 2-place shift register presented above, but a closer examination
+will show that it is more flexible and exhibits greater concurrency.
+
+
+
+T = type [0..255]
+& Fifo : main proc(in? chan T & out! chan T).
+begin
+  & x,y: var T
+  & c : chan T
+|
+     forever do in?x ; c!x od
+  || forever do c?y ; out!y od
+end
+
+Figure 8.3.
+A Tangram program for a 2-place (ripple) FIFO.
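+
+One way to see the difference in concurrency between the shift register and
+the FIFO is to ask which sequences of external communications each program
+can take part in. The Python sketch below is an illustrative model only: the
+function names and the encoding of events as the strings "in" and "out" are
+assumptions made here, and the internal communication on channel c is
+assumed to happen as soon as both sides are ready for it.

```python
def shiftreg_accepts(trace):
    """ShiftReg: forever do out!y ; y:=x ; in?x od.

    The external events must strictly alternate: out, in, out, in, ...
    """
    expected = "out"
    for event in trace:
        if event != expected:
            return False
        expected = "in" if expected == "out" else "out"
    return True


def fifo_accepts(trace):
    """Fifo: (forever do in?x ; c!x od) || (forever do c?y ; out!y od)."""
    x_full = False  # left process holds a value it still has to send on c
    y_full = False  # right process holds a value it still has to send on out
    for event in trace:
        # Internal rendezvous on c: possible when x is full and y is empty.
        if x_full and not y_full:
            x_full, y_full = False, True
        if event == "in" and not x_full:
            x_full = True
        elif event == "out" and y_full:
            y_full = False
        else:
            return False
    return True
```

+In this model the FIFO accepts two inputs before any output, and it also
+accepts alternating inputs and outputs, whereas the shift register only
+accepts the single fixed order that starts with an output.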
+
+8.3.3
+GCD using while and if statements
+
+Figure 8.4 shows the code for a module that computes the greatest common
+divisor, the example from section 3.7. The “do x<>y then ... od” is a while
+statement and, apart from the syntactical differences, the code in figure 8.4 is
+identical to the code in figure 3.11 on page 39.
+The module has an input channel from which it receives the two operands,
+and an output channel on which it sends the result.
+
+int = type [0..255]
+& gcd_if : main proc (in?chan <<int,int>> & out!chan int).
+begin x,y:var int ff
+| forever do
+    in?<<x,y>>
+  ; do x<>y then
+      if x<y then y:=y-x
+             else x:=x-y
+      fi
+    od
+  ; out!x
+  od
+end
+
+Figure 8.4.
+A Tangram program for GCD using while and if statements.
+
+
+
+8.3.4
+GCD using guarded commands
+
+Figure 8.5 shows an alternative version of GCD. This time the module has
+separate input channels for the two operands and its body is based on the repe-
+tition of a guarded command. The guarded repetition can be seen as a general-
+ization of the while statement. The statement repeats until all guards are false.
+When at least one of the guards is true, exactly one command corresponding to
+such a true guard is selected (either deterministically or non-deterministically)
+and executed.
+
+int = type [0..255]
+& gcd_gc : main proc (in1,in2?chan int & out!chan int).
+begin x,y:var int ff
+| forever do
+    in1?x || in2?y
+  ; do x<y then y:=y-x
+    or y<x then x:=x-y
+    od
+  ; out!x
+  od
+end
+
+Figure 8.5.
+A Tangram program for GCD using guarded repetition.
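+
+The semantics of guarded repetition can be illustrated with a small Python
+sketch. The helper guarded_do and the dictionary used to hold x and y are
+hypothetical names introduced here for illustration; the random choice among
+enabled commands stands in for the non-deterministic selection mentioned
+above. The sketch assumes strictly positive operands, since otherwise the
+repetition does not terminate.

```python
import random


def guarded_do(guarded_commands, state, choose=random.choice):
    """Repeat until all guards are false.

    In each round exactly one command whose guard is true is selected
    (here: at random) and executed.
    """
    while True:
        enabled = [cmd for guard, cmd in guarded_commands if guard(state)]
        if not enabled:
            return state
        choose(enabled)(state)


def gcd_gc(x, y):
    """GCD by guarded repetition; x and y are assumed to be positive integers."""
    state = {"x": x, "y": y}
    guarded_do(
        [
            # do x<y then y:=y-x
            (lambda s: s["x"] < s["y"], lambda s: s.update(y=s["y"] - s["x"])),
            # or y<x then x:=x-y
            (lambda s: s["y"] < s["x"], lambda s: s.update(x=s["x"] - s["y"])),
        ],
        state,
    )
    return state["x"]
```

+For this particular program the two guards are mutually exclusive, so the
+selection is in fact deterministic; the repetition ends when x equals y.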
+
+8.4.
+Tangram: syntax-directed compilation
+
+Let us now address the synthesis process. The design flow uses an inter-
+mediate format based on handshake circuits. The front-end design activity
+is called VLSI programming and, using syntax-directed compilation, a Tan-
+gram program is mapped into a structure of handshake components. There is a
+one-to-one correspondence between the Tangram program and the handshake
+circuit as will be clear from the following examples. The compilation process
+is thus fully transparent to the designer, who works entirely at the Tangram
+program level.
+The back-end of the design flow involves a library of handshake circuits
+that the compiler targets as well as some tools for post-synthesis peephole
+optimization of the handshake circuits (i.e. replacing common structures of
+handshake components by more efficient equivalent ones). A number of hand-
+shake circuit libraries exist, allowing implementations using different hand-
+shake protocols (4-phase dual-rail, 4-phase bundled-data, etc.), and different
+implementation technologies (CMOS standard cells, FPGAs, etc.). The hand-
+shake components can be specified and designed: (i) manually, or (ii) using
+STGs and Petrify as explained in chapter 6, or (iii) using the lower steps in
+Martin’s transformation-based method that is presented in the next section.
+
+
+
+It is beyond the scope of this text to explain the details of the compilation
+process. We will restrict ourselves to providing a flavour of “syntax-directed
+compilation” by showing handshake circuits corresponding to the example
+Tangram programs from the previous section.
+
+8.4.1
+The 2-place shift register
+
+As a first example of syntax-directed compilation figure 8.6 shows the hand-
+shake circuit corresponding to the Tangram program in figure 8.2.
+
+Figure 8.6.
+The compiled handshake circuit for the 2-place shift register.
+
+Handshake components are represented by circular symbols, and the chan-
+nels that connect the components are represented by arcs. The small dots on
+the component symbols represent ports. An open dot denotes a passive port
+and a solid dot denotes an active port. The arrowhead represents the direction
+of the data transfer. A nonput channel does not involve the transfer of data and
+consequently it has no direction and no arrowhead. As can be seen in figure 8.6
+a handshake circuit uses a mix of push and pull channels.
+The structure of the program is a forever-do statement whose body consists
+of three statements that are executed sequentially (because they are separated
+by semicolons). Each of the three statements is a kind of assignment statement:
+the value of variable y is “assigned” to output channel out, the value of variable
+x is assigned to variable y, and the value received on input channel in is assigned
+to variable x. The structure of the handshake circuit is exactly the same:
+
+At the top is a repeater that implements the forever-do statement. A
+repeater waits for a request on its passive input port and then it performs
+an unbounded repetition of handshakes on its active output channel. The
+handshake on the input channel never completes.
+
+Below is a 3-way sequencer that implements the semicolons in the pro-
+gram text. The sequencer waits for a request on its passive input channel,
+then it performs in sequence a full handshake on each of its active out-
+
+
+
+put channels (in the order indicated by the numbers in the symbol) and
+finally it completes the handshaking on the passive input channel. In
+this way the sequencer activates in turn the handshake circuit constructs
+that correspond to the individual statements in the body of the forever-do
+statement.
+
+The bottom row of handshake components includes two variables, x and
+y, and three transferers, denoted by ‘→’. Note that variables have passive
+read and write ports. The transferers implement the three statements
+(out!y; y:=x; in?x) that form the body of the forever-do statement,
+each a form of assignment. A transferer waits for a request on its passive
+nonput channel and then initiates a handshake on its pull input channel.
+The handshake on the pull input channel is relayed to the push output
+channel. In this way the transferer pulls data from its input channel and
+pushes it onto its output channel. Finally, it completes the handshaking
+on the passive nonput channel.
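+
+The one-to-one correspondence between program structure and handshake
+components can be mimicked in software. The Python sketch below is a loose
+analogy under stated assumptions, not a model of the actual Tangram
+components: a handshake is approximated by a procedure call (the call is the
+request, the return is the acknowledge), all names are hypothetical, and the
+repeater is bounded so that the example terminates.

```python
class Variable:
    """Handshake variable with passive read and write ports."""

    def __init__(self, value=0):
        self.value = value

    def read(self):
        return self.value

    def write(self, value):
        self.value = value


def transferer(pull, push):
    """On activation, pull a value from the input and push it to the output."""
    def activate():
        push(pull())
    return activate


def sequencer(*activations):
    """On activation, perform a full handshake on each output in order."""
    def activate():
        for act in activations:
            act()
    return activate


def repeater(activation, iterations):
    """A real repeater never completes; it is bounded here for the demo."""
    def activate():
        for _ in range(iterations):
            activation()
    return activate


def shift_register(inputs):
    """Build the circuit for: forever do out!y ; y:=x ; in?x od."""
    x, y = Variable(0), Variable(0)
    source = iter(inputs)
    outputs = []
    body = sequencer(
        transferer(y.read, outputs.append),         # out!y
        transferer(x.read, y.write),                # y:=x
        transferer(lambda: next(source), x.write),  # in?x
    )
    repeater(body, len(inputs))()
    return outputs
```

+Because x and y are initialized to 0, the first two outputs are 0 and each
+input value appears on the output two iterations later.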
+
+8.4.2
+The 2-place FIFO
+
+Figure 8.7 shows the handshake circuit corresponding to the Tangram pro-
+gram in figure 8.3. The component labeled ‘psv’ in the handshake circuit of
+figure 8.7 is a so-called passivator. It relates to the internal channel c of the
+Fifo and implements the synchronization and communication between the ac-
+tive sender (c!x) and the active receiver (c?y).
+
+Figure 8.7.
+Compiled handshake circuit for the FIFO program.
+
+An optimization of the handshake circuit for Fifo is shown in figure 8.8.
+The synchronization in the datapath using a passivator has been replaced by a
+synchronization in the control using a ‘join’ component. One may observe that
+the datapath of this handshake circuit for the FIFO design is the same as that
+of the shift register, shown in figure 8.2. The only difference is in the control
+part of the circuits.
+
+
+
+Figure 8.8.
+Optimized handshake circuit for the FIFO program.
+
+8.4.3
+GCD using guarded repetition
+
+As a more complex example of syntax-directed compilation figure 8.9 shows
+the handshake circuit compiled from the Tangram program in figure 8.5. Com-
+pared with the previous handshake circuits, the handshake circuit for the GCD
+program introduces two new classes of components that are treated in more
+detail below.
+Firstly, the circuit contains a ‘bar’ and a ‘do’ component, both of which are
+data-dependent control components. Secondly, the handshake circuit contains
+components that do not directly correspond to language constructs, but rather
+implement sharing: the multiplexer (denoted by ‘mux’), the demultiplexer
+(denoted by ‘dmx’), and the fork component (denoted by ‘•’).
+Warning: the Tangram fork is identical to the fork in figure 3.3 but the Tan-
+gram multiplexer and demultiplexer components are different. The Tangram
+multiplexer is identical to the merge in figure 3.3 and the Tangram demulti-
+plexer is a kind of “inverse merge.” Its output ports are passive and it requires
+the handshakes on the two outputs to be mutually exclusive.
+
+The ‘bar’ and the ‘do’ components:
+The do and bar components together
+implement the guarded command construct with two guards, in which the do
+component implements the iteration part (the do od part, including the evalu-
+ation of the disjunction of the two guards), and the bar component implements
+the choice part (the then or then part of the command).
+The do component, when activated through its passive port, first collects the
+disjunction of the value of all guards through a handshake on its active data
+port. When the value thus collected is true, it activates its active nonput port
+(to activate the selected command), and after completion starts a new evalua-
+tion cycle. When the value collected is false, the do component completes its
+operation by completing the handshake on the passive port.
+
+
+
+Figure 8.9.
+Compiled handshake circuit for the GCD program using guarded repetition.
+
+The bar component can be activated either through its passive data port, or
+through its passive control port. (The do component, for example, sequences
+these two activations.) When activated through the data port, it collects the
+value of two guards through a handshake on the active data ports, and then
+sends the disjunction of these values along the passive data port, thus complet-
+ing that handshake. When activated through the control port, the bar compo-
+nent activates an active control port of which the associated data port returned a
+‘true’ value in the most recent data cycle. (For simplicity, this selection is typ-
+ically implemented in a deterministic fashion, although this is not required at
+the level of the program.) One may observe that bar components can be com-
+
+
+
+bined in a tree or list to implement a guarded command list of arbitrary length.
+Furthermore, not every data cycle has to be followed by a control cycle.
+
+The ‘mux’, ‘demux’, and ‘fork’ components
+The program for GCD in
+figure 8.5 has two occurrences of variable x in which a value is written into x,
+namely input action in1?x and assignment x:=x-y. In the handshake circuit
+of figure 8.9, these two write actions for Tangram variable x are merged by the
+multiplexer component so as to arrive at the write port of handshake variable
+x.
+Variable x occurs at five different locations in the program as an expres-
+sion, once in the output expression out!x, twice in the guard expressions x<y
+and y<x, and twice in the assignment expressions x-y and y-x. These five in-
+spections of variable x could be implemented as five distinct read ports on the
+handshake variable x, which is shown in the handshake circuit in [135, Fig. 2.7,
+p.34]. In figure 8.9, a different compilation is shown, in which handshake vari-
+able x has three read ports:
+
+A read port dedicated to the occurrence in the output action.
+
+A read port dedicated to the guard expressions. Their evaluation is mu-
+tually inclusive, and hence can be combined using a synchronizing fork
+component.
+
+A read port dedicated to the assignment expressions. Their evaluation is
+mutually exclusive, and hence can be combined using a demultiplexer.
+
+The GCD example is discussed in further detail in chapter 13.
+
+8.5.
+Martin’s translation process
+
+The work of Martin and his group at Caltech has made fundamental contri-
+butions to asynchronous design and it has influenced the work of many other
+researchers. The methods have been used at Caltech to design several sig-
+nificant chips, most recently and most notably an asynchronous MIPS R3000
+processor [88]. As the following presentation of the design flow hints, the de-
+sign process is elaborate and sophisticated and is probably only an option to a
+person who has spent time with the Caltech group.
+The mostly manual design process involves the following steps (semantics-
+preserving transformations):
+(1) Process decomposition where each process is refined into a collection
+of interacting simpler processes. This step is repeated until all processes are
+simple enough to be dealt with in the next step in the process.
+(2) Handshake expansion where each communication channel is replaced
+by explicit wires and where each communication action (e.g. send or receive)
+
+
+
+is replaced by the signal transitions required by the protocol that is being used.
+For example a receive statement such as:
+
+C?y
+
+is replaced by a sequence of simpler statements – for example:
+
+[Creq]; y := data; Cack↑; [¬Creq]; Cack↓
+
+which is read as: “wait for request to go high”, “read the data”, “drive ac-
+knowledge high”, “wait for request to go low”, and “drive acknowledge low”.
+At this level it may be necessary to add state variables and/or to reshuffle
+signal transitions in order to obtain a specification that satisfies a condition
+similar to the CSC condition in chapter 6.
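+
+The effect of handshake expansion can be illustrated with a small Python
+simulation of a 4-phase push channel. This is only a sketch: the receiver
+follows the five steps listed above, whereas the sender expansion, the
+signal names req, ack and data, and the simple round-robin scheduler are
+assumptions introduced here for illustration. Each process is a generator
+that yields whenever it has to wait or has driven a signal.

```python
def receiver(sig, received):
    """Expansion of C?y repeated forever: wait req high, read, ack up, wait req low, ack down."""
    while True:
        while not sig["req"]:          # wait for request to go high
            yield
        received.append(sig["data"])   # read the data
        sig["ack"] = True              # drive acknowledge high
        yield
        while sig["req"]:              # wait for request to go low
            yield
        sig["ack"] = False             # drive acknowledge low
        yield


def sender(sig, values):
    """Assumed matching expansion of C!x for each value in turn."""
    for value in values:
        sig["data"] = value
        sig["req"] = True              # drive request high
        yield
        while not sig["ack"]:          # wait for acknowledge to go high
            yield
        sig["req"] = False             # drive request low
        yield
        while sig["ack"]:              # wait for acknowledge to go low
            yield


def simulate(values, max_steps=1000):
    """Alternate between sender and receiver until the sender has finished."""
    sig = {"req": False, "ack": False, "data": None}
    received = []
    tx, rx = sender(sig, values), receiver(sig, received)
    for _ in range(max_steps):
        try:
            next(tx)
        except StopIteration:
            return received
        next(rx)
    raise RuntimeError("handshake did not finish within max_steps")
```

+Every value is transferred by one complete request and acknowledge cycle,
+and both signals are low again before the next transfer starts.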
+(3) Production rule expansion where each handshaking expansion is re-
+placed by a set of production rules (or guarded commands), for example:
+
+a ∧ b ↦ c↑
+and
+¬b ∧ ¬c ↦ c↓
+
+A production rule consists of a condition and an action, and the action is performed
+whenever the condition is true. As an aside we mention that the above
+two production rules express the same as the set and reset functions for the
+signal c on page 96. The production rules specify the behaviour of the internal
+signals and output signals of the process. The production rules are themselves
+simple concurrent processes and the guards must ensure that the signal tran-
+sitions occur in program order (i.e. that the semantics of the original CHP
+program are maintained). This may require strengthening the guards. Further-
+more, in order to obtain simpler circuit implementations, the guards may be
+modified and made symmetric.
+(4) Operator reduction where production rules are grouped into clusters and
+where each cluster is mapped onto a basic hardware component similar to a
+generalized C-element. The above two production rules would be mapped into
+the generalized C-element shown in figure 6.17 on page 100.
+
+8.6.
+Using VHDL for asynchronous design
+
+8.6.1
+Introduction
+
+In this section we will introduce a couple of VHDL packages that provide
+the designer with primitives for synchronous message passing between pro-
+cesses – similar to the constructs found in the CSP-family of languages (send,
+receive and probe).
+The material was developed in an M.Sc. project and used in the design of a
+32-bit floating-point ALU using the IEEE floating-point number representation
+[110], and it has subsequently been used in a course on asynchronous circuit
+
+
+
+design. Others, including [95, 118, 149, 78], have developed related VHDL
+packages and approaches.
+The channel packages introduced in the following support only one type
+of channel, using a 32-bit 4-phase bundled-data push protocol. However, as
+VHDL allows the overloading of procedures and functions, it is straightfor-
+ward to define channels with arbitrary data types. All it takes is a little cut-and-
+paste editing. Providing support for protocols other than the 4-phase bundled-
+data push protocol will require more significant extensions to the packages.
+
+8.6.2
+VHDL versus CSP-type languages
+
+The previous sections introduced several CSP-like hardware description lan-
+guages for asynchronous design. The advantages of these languages are their
+support of concurrency and synchronous message passing, as well as a limited
+and well-defined set of language constructs that makes syntax-directed compi-
+lation a relatively simple task.
+Having said this, there is nothing that prevents a designer from using one
+of the industry-standard languages, VHDL or Verilog, for the design of
+asynchronous circuits. In fact, some of the fundamental concepts in these languages
+– concurrent processes and signal events – are “nice fits” with the modeling
+and design of asynchronous circuits. To illustrate this, figure 8.10 shows how
+the Tangram program from figure 8.2 could be expressed in plain VHDL. In
+addition to demonstrating the feasibility, the figure also highlights the
+limitations of VHDL when it comes to modeling asynchronous circuits: most of
+the code expresses low-level handshaking details, and this greatly clutters the
+description of the function of the circuit.
+VHDL obviously lacks built-in primitives for synchronous message passing
+on channels similar to those found in CSP-like languages. Another feature of
+the CSP family of languages that VHDL lacks is statement-level concurrency
+within a process. On the other hand there are also some advantages of using an
+industry standard hardware description language such as VHDL:
+
+It is well supported by existing CAD tool frameworks that provide sim-
+ulators, pre-designed modules, mixed-mode simulation, and tools for
+synthesis, layout and the back annotation of timing information.
+
+The same simulator and test bench can be used throughout the entire de-
+sign process from the first high-level specification to the final implemen-
+tation in some target technology (for example a standard cell layout).
+
+It is possible to perform mixed-mode simulations where some entities
+are modeled using behavioural specifications and others are implemented
+using the components of the target technology.
+
+
+136
+Part I: Asynchronous circuit design – A tutorial
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+-- Note: in compilable VHDL this subtype has to be declared in a package.
+subtype T is std_logic_vector(7 downto 0);
+
+entity ShiftReg is
+  port ( in_req   : in  std_logic;
+         in_ack   : out std_logic;
+         in_data  : in  T;
+         out_req  : out std_logic;
+         out_ack  : in  std_logic;
+         out_data : out T );
+end ShiftReg;
+
+architecture behav of ShiftReg is
+begin
+  process
+    variable x, y: T;
+  begin
+    loop
+      out_req <= '1';              -- out!y
+      out_data <= y;
+      wait until out_ack = '1';
+      out_req <= '0';
+      wait until out_ack = '0';
+      y := x;                      -- y := x
+      wait until in_req = '1';     -- in?x
+      x := in_data;
+      in_ack <= '1';
+      wait until in_req = '0';
+      in_ack <= '0';
+    end loop;
+  end process;
+end behav;
+
+Figure 8.10. VHDL description of the 2-place shift register FIFO stage from figure 8.2.
+
+Many real-world systems include both synchronous and asynchronous
+subsystems, and such hybrid systems can be modeled without any prob-
+lems in VHDL.
+
+8.6.3 Channel communication and design flow
+
+The design flow presented in what follows is motivated by the advantages
+mentioned above. The goal is to augment VHDL with CSP-like channel com-
+munication primitives, i.e. the procedures send(<channel>, <variable>)
+and receive(<channel>,<variable>) and the function probe(<channel>).
+Another goal is to enable mixed-mode simulations where one end of a channel
+connects to an entity whose architecture body is a circuit implementation and
+the other end connects to an entity whose architecture body is a behavioural de-
+scription using the above communication primitives, figure 8.11(b). In this way
+
+
+[Figure 8.11, three panels, each showing Entity 1 connected to Entity 2 by a channel.
+(a) High-level model: Entity 1 executes Send(<channel>,<var>) and Entity 2 executes
+Receive(<channel>,<var>). (b) Mixed-mode model: Entity 1 executes Send(<channel>,<var>)
+and Entity 2 is a circuit (control, latches and combinational logic with Req, Ack and
+Data signals). (c) Low-level model: both entities are circuits.]
+
+Figure 8.11. The VHDL packages for channel communication support high-level, mixed-mode
+and gate-level/standard cell simulations.
+
+a manual top-down stepwise refinement design process is supported, where the
+same test bench is used throughout the entire design process from high-level
+specification to low-level circuit implementation, figure 8.11(a-c).
+In VHDL all communication between processes takes place via signals.
+Channels therefore have to be declared as signals, preferably one signal per
+channel. Since (for a push channel) the sender drives the request and data part
+of a channel, and the receiver drives the acknowledge part, there are two drivers
+to one signal. This is allowed in VHDL if the signal is a resolved signal. Thus,
+it is possible to define a channel type as a record with a request, an acknowl-
+edge and a data field, and then define a resolution function for the channel type
+which will determine the resulting value of the channel. This type of channel,
+with separate request and acknowledge fields, will be called a real channel and
+is described in section 8.6.5. In simulations there will be three traces for each
+channel, showing the waveforms of request and acknowledge along with the
+data that is communicated.
+A channel can also be defined with only two fields: one that describes the
+state of the handshaking (called the “handshake phase” or simply the “phase”)
+and one containing the data. The type of the phase field is an enumerated type,
+whose values can be the handshake phases a channel can assume, as well as
+the values with which the sender and receiver can drive the field. This type of
+channel will be called an abstract channel. In simulations there will be two
+traces for each channel, and it is easy to read the phases the channel assumes
+and the data values that are transferred.
+The procedures and definitions are organized into two VHDL-packages: one
+called “abstpack.vhd” that can be used for simulating high-level models and
+one called “realpack.vhd” that can be used at all levels of design. Full listings
+can be found in appendix 8.A at the end of this chapter. The design flow
+enabled by these packages is as follows:
+
+The circuit and its environment or test bench are first modelled and
+simulated using abstract channels. All it takes is the following statement in
+the top-level design unit: “use work.abstract_channels.all;”.
+
+The circuit is then partitioned into simpler entities. The entities still
+communicate using channels and the simulation still uses the abstract
+channel package. This step may be repeated.
+
+At some point the designer changes to using the real channel package
+by changing to: “use work.real_channels.all;” in the top-level
+design unit. Apart from this simple change, the VHDL source code is
+identical.
+
+It is now possible to partition entities into control circuitry (that can be
+designed as explained in chapter 6) and data circuitry (that consists of ordinary
+latches and combinational circuitry). Mixed-mode simulations as
+illustrated in figure 8.11(b) are possible. Simulation models of the control
+circuits may be their actual implementation in the target technology
+or simply an entity containing a set of concurrent signal assignments –
+for example the Boolean equations produced by Petrify.
+
+Eventually, when all entities have been partitioned into control and data,
+and when all leaf entities have been implemented using components of
+the target technology, the design is complete. Using standard technology
+mapping tools an implementation may be produced, and the circuit can
+be simulated with back annotated timing information.
+
+Note that the same simulation test bench can be used throughout the entire
+design process from the high-level specification to the low-level implementa-
+tion using components from the target technology.
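+
+To give an impression of what the channel primitives buy, the shift register of
+figure 8.10 might be written roughly as follows. This is only a sketch under stated
+assumptions: it uses the 32-bit channel_fp type of the abstract channel package
+instead of the 8-bit type T, the entity name ShiftReg_ch and the port names inp and
+outp are hypothetical (in and out are reserved words in VHDL), and no reset is modelled.
+
+```vhdl
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.abstract_channels.all;
+
+entity ShiftReg_ch is                 -- hypothetical entity name
+  port ( inp  : inout channel_fp;     -- input channel
+         outp : inout channel_fp );   -- output channel
+end ShiftReg_ch;
+
+architecture behav of ShiftReg_ch is
+begin
+  process
+    variable x, y : fp := (others => '0');
+  begin
+    initialize_in(inp);
+    initialize_out(outp);
+    loop
+      send(outp, y);       -- out!y
+      y := x;              -- y := x
+      receive(inp, x);     -- in?x
+    end loop;
+  end process;
+end behav;
+```
+
+Each handshake of figure 8.10 collapses into a single procedure call, so the body of
+the loop reads almost like the original Tangram program.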
+
+8.6.4 The abstract channel package
+
+An abstract channel is defined in figure 8.12 with a data type called fp (a
+32-bit standard logic vector representing an IEEE floating-point number). The
+
+
+type handshake_phase is
+  (
+    u,      -- uninitialized
+    idle,   -- no communication
+    swait,  -- sender waiting
+    rwait,  -- receiver waiting
+    rcv,    -- receiving data
+    rec1,   -- recovery phase 1
+    rec2,   -- recovery phase 2
+    req,    -- request signal
+    ack,    -- acknowledge signal
+    error   -- protocol error
+  );
+
+subtype fp is std_logic_vector(31 downto 0);
+
+type uchannel_fp is
+  record
+    phase : handshake_phase;
+    data  : fp;
+  end record;
+
+type uchannel_fp_vector is array(natural range <>) of uchannel_fp;
+
+function resolved(s : uchannel_fp_vector) return uchannel_fp;
+
+subtype channel_fp is resolved uchannel_fp;
+
+Figure 8.12. Definition of an abstract channel.
+
+actual channel type is called channel_fp. It is necessary to define a channel
+for each data type used in the design. The data type can be an arbitrary type,
+including record types, but it is advisable to use data types that are built from
+std_logic because this is typically the type used by target component libraries
+(such as standard cell libraries) that are eventually used for the implementation.
+The meaning of each value of the type handshake_phase is described in
+detail below:
+
+u: Uninitialized channel. This is the default value of the drivers. As long as
+either the sender or the receiver drives the channel with this value, the channel
+stays uninitialized.
+
+idle: No communication. Both the sender and receiver drive the channel with
+the idle value.
+
+swait: The sender is waiting to perform a communication. The sender is driv-
+ing the channel with the req value and the receiver drives with the idle
+value.
+
+rwait: The receiver is waiting to perform a communication. The sender is
+driving the channel with the idle value and the receiver drives with the
+rwait value. This value is used both as a driving value and as a resulting
+value for a channel, just like the idle and u values.
+
+rcv: Data is transferred. The sender is driving the channel with the req value
+and the receiver drives it with the rwait value. After a predefined
+amount of time (tpd at the top of the package, see later in this section)
+the receiver changes its driving value to ack, and the channel changes
+its phase to rec1. In a simulation it is only possible to see the transferred
+value during the rcv phase and the swait phase. At all other times the
+data field assumes a predefined default data value.
+
+rec1: Recovery phase. This phase is not seen in a simulation, since the channel
+changes to the rec2 phase with no time delay.
+
+rec2: Recovery phase. This phase is not seen in a simulation, since the channel
+changes to the idle phase with no time delay.
+
+req: The sender drives the channel with this value, when it wants to perform
+a communication. A channel can never assume this value.
+
+ack: The receiver drives the channel with this value to acknowledge that it
+has received the data. A channel can never assume this value.
+
+error: Protocol error. A channel assumes this value when the resolution func-
+tion detects an error. It is an error if there is more than one driver with
+an rwait, req or ack value. This could be the result if more than two
+drivers are connected to a channel, or if a send command is accidentally
+used instead of a receive command or vice versa.
+
+Figure 8.13 shows a graphical illustration of the protocol of the abstract
+channel. The values in large letters are the resulting values of the channel, and
+the values in smaller letters below them are the driving values of the sender
+and receiver respectively. Both the sender and receiver are allowed to initiate
+a communication. This makes it possible in a simulation to see if either the
+
+[Figure 8.13: state diagram of the abstract channel. The phases, each with the driving
+values of the sender and receiver, are IDLE (idle, idle), SWAIT (req, idle),
+RWAIT (idle, rwait), RCV (req, rwait), REC1 (req, ack) and REC2 (idle, ack), plus the
+uninitialized phase U.]
+
+Figure 8.13. The protocol for the abstract channel. The values in large letters are the resulting
+resolved values of the channel, and the values in smaller letters below them are the driving
+values of the sender and receiver respectively.
+
+
+sender or receiver is waiting to communicate. It is the procedures send and
+receive that follow this protocol.
+Because channels with different data types are defined as separate types,
+the procedures send, receive and probe have to be defined for each of these
+channel types. Fortunately VHDL allows overloading of procedure names, so
+it is possible to make these definitions. The only differences between the def-
+initions of the channels are the data types, the names of the channel types and
+the default values of the data fields in the channels. So it is very easy to copy
+the definitions of one channel to make a new channel type. It is not necessary
+to redefine the type handshake_phase. All these definitions are conveniently
+collected in a VHDL package. This package can then be referenced wherever
+needed. An example of such a package with only one channel type can be
+seen in appendix A.1. The procedures initialize_in and initialize_out
+are used to initialize the input and output ends of a channel. If a sender or
+receiver does not initialize a channel, no communications can take place on that
+channel.
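+
+As an illustration of the copy-and-rename step described above, the declarations for an
+8-bit channel might look as follows. This is a sketch, not part of the packages in the
+appendix: the names byte_channels, byte, uchannel_byte, uchannel_byte_vector and
+channel_byte are hypothetical, and the package body (not shown) is assumed to be a copy
+of the fp version with the types and the default data value replaced.
+
+```vhdl
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.abstract_channels.all;   -- reuses handshake_phase and tpd
+
+package byte_channels is          -- hypothetical package name
+
+  subtype byte is std_logic_vector(7 downto 0);
+
+  type uchannel_byte is
+    record
+      phase : handshake_phase;
+      data  : byte;
+    end record;
+
+  type uchannel_byte_vector is array(natural range <>) of uchannel_byte;
+
+  function resolved(s : uchannel_byte_vector) return uchannel_byte;
+
+  subtype channel_byte is resolved uchannel_byte;
+
+  -- Same names as the fp versions; the overloads are told apart
+  -- by the type of the channel parameter.
+  procedure initialize_in(signal ch : out channel_byte);
+  procedure initialize_out(signal ch : out channel_byte);
+  procedure send(signal ch : inout channel_byte; d : in byte);
+  procedure receive(signal ch : inout channel_byte; d : out byte);
+  function probe(signal ch : in channel_byte) return boolean;
+
+end byte_channels;
+```
+
+A design unit that uses both packages can then call send and receive on either kind
+of channel, and the type of the channel argument selects the matching overload.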
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.abstract_channels.all;
+
+entity fp_latch is
+  generic(delay : time);
+  port ( d      : inout channel_fp;  -- input data channel
+         q      : inout channel_fp;  -- output data channel
+         resetn : in    std_logic
+       );
+end fp_latch;
+
+architecture behav of fp_latch is
+begin
+
+  process
+    variable data : fp;
+  begin
+    initialize_in(d);
+    initialize_out(q);
+    wait until resetn = '1';
+    loop
+      receive(d, data);
+      wait for delay;
+      send(q, data);
+    end loop;
+  end process;
+
+end behav;
+
+Figure 8.14. Description of a FIFO stage.
+
+
+[Figure 8.15: three fp_latch instances (FIFO_stage_1, FIFO_stage_2 and FIFO_stage_3), each
+with ports d, q and resetn, connected in series; the channels ch_in and ch_out are on
+either side of the middle stage.]
+
+Figure 8.15. A FIFO built using the latch defined in figure 8.14.
+
+Figure 8.16. Simulation of the FIFO using the abstract channel package.
+
+A simple example of a subcircuit is the FIFO stage fp_latch shown in
+figure 8.14. Notice that the channels in the entity have the mode inout, and
+that the FIFO stage waits for the reset signal resetn after the initialization. In that
+way it waits for other subcircuits which may actually use this reset signal for
+initialization.
+The FIFO stage uses a generic parameter delay. This delay is inserted for
+experimental reasons in order to show the different phases of the channels.
+Three FIFO stages are connected in a pipeline (figure 8.15) and fed with data
+values. The middle stage has a delay that is twice as long as that of the other two
+stages. This will result in a blocked channel just before the slow FIFO stage
+and a starved channel just after the slow FIFO stage.
+The result of this experiment can be seen in figure 8.16. The simulator
+used is the Synopsys VSS. It is seen that ch_in is predominantly in the swait
+phase, which characterizes a blocked channel, and ch_out is predominantly in
+the rwait phase, which characterizes a starved channel.
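+
+A test bench for this experiment could be sketched as shown below. The original test
+bench is not listed in the text, so everything here is an assumption made for
+illustration: the entity name fifo_tb, the signal names ch_src and ch_sink, the stage
+delays of 10 ns and 20 ns, the reset release time and the alternating data pattern.
+
+```vhdl
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.abstract_channels.all;
+
+entity fifo_tb is                       -- hypothetical test bench name
+end fifo_tb;
+
+architecture test of fifo_tb is
+
+  component fp_latch
+    generic (delay : time);
+    port ( d      : inout channel_fp;
+           q      : inout channel_fp;
+           resetn : in    std_logic );
+  end component;
+
+  signal ch_src, ch_in, ch_out, ch_sink : channel_fp;
+  signal resetn : std_logic := '0';
+
+begin
+
+  resetn <= '1' after 20 ns;            -- assumed reset release time
+
+  -- The middle stage is twice as slow as its neighbours.
+  FIFO_stage_1 : fp_latch generic map (delay => 10 ns)
+    port map (d => ch_src, q => ch_in,   resetn => resetn);
+  FIFO_stage_2 : fp_latch generic map (delay => 20 ns)
+    port map (d => ch_in,  q => ch_out,  resetn => resetn);
+  FIFO_stage_3 : fp_latch generic map (delay => 10 ns)
+    port map (d => ch_out, q => ch_sink, resetn => resetn);
+
+  producer : process
+    variable v : fp := (others => '0');
+  begin
+    initialize_out(ch_src);
+    wait until resetn = '1';
+    loop
+      send(ch_src, v);
+      v := not v;                       -- alternate all-zeros and all-ones
+    end loop;
+  end process;
+
+  consumer : process
+    variable v : fp;
+  begin
+    initialize_in(ch_sink);
+    wait until resetn = '1';
+    loop
+      receive(ch_sink, v);
+    end loop;
+  end process;
+
+end test;
+```
+
+Because the producer and consumer never pause, any waiting seen on ch_in and ch_out
+in such a simulation is caused by the slow middle stage.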
+
+8.6.5 The real channel package
+
+At some point in the design process it is time to separate communicating
+entities into control and data entities. This is supported by the real channel
+types, in which the request and acknowledge signals are separate std_logic
+signals – the type used by the target component models. The data type is the
+same as the abstract channel type, but the handshaking is modeled differently.
+A real channel type is defined in figure 8.17.
+
+subtype fp is std_logic_vector(31 downto 0);
+
+type uchannel_fp is
+  record
+    req  : std_logic;
+    ack  : std_logic;
+    data : fp;
+  end record;
+
+type uchannel_fp_vector is array(natural range <>) of uchannel_fp;
+
+function resolved(s : uchannel_fp_vector) return uchannel_fp;
+
+subtype channel_fp is resolved uchannel_fp;
+
+Figure 8.17. Definition of a real channel.
+
+All definitions relating to the real channels are collected in a package (sim-
+ilar to the abstract channel package) and use the same names for the channel
+types, procedures and functions. For this reason it is very simple to switch
+to simulating using real channels. All it takes is to change the name of the
+package in the use statements in the top level design entity. Alternatively, one
+can use the same name for both packages, in which case it is the last analyzed
+package that is used in simulations.
+An example of a real channel package with only one channel type can be
+seen in appendix A.2. This package defines a 32-bit standard logic 4-phase
+bundled-data push channel. The constant tpd in this package is the delay from
+a transition on the request or acknowledge signal to the response to this tran-
+sition. “Synopsys compiler directives” are inserted in several places in the
+package. This is because Synopsys needs to know the channel types and the
+resolution functions belonging to them when it generates an EDIF netlist to the
+floor planner, but not the procedures in the package.
+Figure 8.18 shows the result of repeating the simulation experiment from the
+previous section, this time using the real channel package. Notice the sequence
+of four-phase handshakes.
+Note that the data value on a channel is, at all times, whatever value the
+sender is driving onto the channel. An alternative would be to make the resolu-
+tion function put out the default data value outside the data-validity period, but
+this may cause the setup and hold times of the latches to be violated. The proce-
+dure send provides a broad data-validity scheme, which means that it can com-
+municate with receivers that require early, broad or late data-validity schemes
+on the channel. The procedure receive requires an early data-validity scheme,
+
+
+Figure 8.18. Simulation of the FIFO using the real channel package.
+
+which means that it can communicate with senders that provide early or broad
+data-validity schemes.
+The resolution functions for the real channels (and the abstract channels)
+can detect protocol errors. Examples of errors are more than one sender or
+receiver on a channel, and using a send command or a receive command at
+the wrong end of a channel. In such cases the channel assumes the X value on
+the request or acknowledge signals.
+
+8.6.6 Partitioning into control and data
+
+This section describes how to separate an entity into control and data enti-
+ties. This is possible when the real channel package is used but, as explained
+below, this partitioning has to follow certain guidelines.
+To illustrate how the partitioning is carried out, the FIFO stage in figure 8.14
+in the preceding section will be separated into a latch control circuit called
+latch_ctrl and a latch called std_logic_latch. The VHDL code is shown
+in figure 8.19, and figure 8.20 is a graphical illustration of the partitioning that
+includes the unresolved signals ud and uq as explained below.
+In VHDL a driver that drives a compound resolved signal has to drive all
+fields in the signal. Therefore a control circuit cannot drive only the acknowledge
+field in a channel. To overcome this problem a signal of the corresponding
+unresolved channel type has to be declared inside the partitioned entity. This
+is the function of the signals ud and uq in figure 8.19, which are of the type
+uchannel_fp defined in figure 8.17.
+The control circuit then drives only the acknowledge field in this signal; this
+is allowed since the signal is unresolved. The rest of the fields remain uninitialized.
+The unresolved signal then drives the channel; this is allowed since it
+drives all of the fields in the channel. The resolution function for the channel
+should ignore the uninitialized values that the channel is driven with. Components
+that use the send and receive procedures also drive those fields in the
+
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.real_channels.all;
+
+entity fp_latch is
+  port ( d      : inout channel_fp;  -- input data channel
+         q      : inout channel_fp;  -- output data channel
+         resetn : in    std_logic
+       );
+end fp_latch;
+
+architecture struct of fp_latch is
+
+  component latch_ctrl
+    port ( rin, aout, resetn : in  std_logic;
+           ain, rout, lt     : out std_logic
+         );
+  end component;
+
+  component std_logic_latch
+    generic (width : positive);
+    port ( lt : in  std_logic;
+           d  : in  std_logic_vector(width-1 downto 0);
+           q  : out std_logic_vector(width-1 downto 0)
+         );
+  end component;
+
+  signal lt : std_logic;
+  signal ud, uq : uchannel_fp;
+
+begin
+
+  latch_ctrl1 : latch_ctrl
+    port map (d.req,q.ack,resetn,ud.ack,uq.req,lt);
+  std_logic_latch1 : std_logic_latch
+    generic map (width => 32)
+    port map (lt,d.data,uq.data);
+
+  d <= connect(ud);
+  q <= connect(uq);
+
+end struct;
+
+Figure 8.19. Separation of the FIFO stage into an ordinary data latch and a latch control circuit.
+
+channel that they do not control with uninitialized values. For example, an out-
+put to a channel drives the acknowledge field in the channel with the U value.
+The fields in a channel that are used as inputs are connected directly from the
+channel to the circuits that have to read those fields.
+Notice in the description that the signals ud and uq do not drive d and q
+directly but through a function called connect. This function simply returns
+its parameter. It may seem unnecessary, but it has proved to be necessary when
+some of the subcircuits are described with a standard cell implementation. In
+a simulation a special “gate-level simulation engine” is used to simulate the
+
+
+[Figure 8.20: each fp_latch FIFO stage between ch_in and ch_out is split into a latch_ctrl
+block (ports rin, ain, rout, aout, resetn and Lt) and a std_logic_latch block (ports d, q
+and Lt); the unresolved signals ud and uq connect these blocks to the channels d and q.]
+
+Figure 8.20. Separation of control and data.
+
+standard cells [129]. During initialization it will set some of the signals to the
+value X instead of to the value U as it should. It has not been possible to get
+the channel resolution function to ignore these X values, because the gate-level
+simulation engine sets some of the values in the channel. By introducing the
+connect function, which is a behavioural description, the normal simulator
+takes over and evaluates the channel by means of the corresponding resolution
+function. It should be emphasized that it is a bug in the gate-level simulation
+engine that necessitates the addition of the connect function.
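+
+The latch_ctrl component instantiated in figure 8.19 is not listed in this chapter. A
+simulation model written as a set of concurrent signal assignments could look like the
+sketch below. It assumes the simple 4-phase latch controller that consists of a single
+C-element with inputs rin and the inverted aout, an active-low reset, and a latch that
+is transparent when lt is '0' and holds its data when lt is '1'. The architecture name
+and the internal signal c are hypothetical, and other controllers are equally possible.
+
+```vhdl
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+entity latch_ctrl is
+  port ( rin, aout, resetn : in  std_logic;
+         ain, rout, lt     : out std_logic );
+end latch_ctrl;
+
+architecture equations of latch_ctrl is
+  signal c : std_logic;   -- output of the C-element
+begin
+
+  -- C-element with inputs rin and (not aout): the output follows the
+  -- inputs when they agree and keeps its value otherwise.
+  -- The active-low reset forces the output to '0'.
+  c <= '0' when resetn = '0' else
+       (rin and not aout) or (c and (rin or not aout));
+
+  ain  <= c;   -- acknowledge to the sender
+  rout <= c;   -- request to the receiver
+  lt   <= c;   -- latch enable (polarity is an assumption)
+
+end equations;
+```
+
+The internal signal c is needed because ports of mode out cannot be read inside
+the architecture in VHDL-87 and VHDL-93.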
+
+8.7. Summary
+
+This chapter addressed languages and CAD tools for high-level modeling
+and synthesis of asynchronous circuits. The text focused on a few representative
+and influential design methods that are based on languages that are similar
+to CSP. The reason for preferring these languages is that they support
+channel-based communication between processes (synchronous message passing)
+as well as concurrency at both process and statement level – two features that
+are important for modeling asynchronous circuits. The text also illustrated a
+synthesis method known as syntax-directed translation. Subsequent chapters
+in this book will elaborate much more on these issues.
+Finally the chapter illustrated how channel-based communication can be
+implemented in VHDL, and we provided two packages containing all the necessary
+procedures and functions including: send, receive and probe. These
+packages support a manual top-down stepwise-refinement design flow where
+the same test bench can be used to simulate the design throughout the entire
+design process from high-level specification to low-level circuit implementation.
+This chapter on languages and CAD tools for asynchronous design concludes
+the tutorial on asynchronous circuit design and it is time to wrap up:
+Chapter 2 presented the fundamental concepts and theories, and provided pointers
+to the literature. Chapters 3 and 4 presented an RTL-like abstract view
+of asynchronous circuits (tokens flowing in static data-flow structures) that is
+very useful for understanding their operation and performance. This material
+is probably where this tutorial supplements the existing body of literature the
+most. Chapters 5 and 6 addressed the design of datapath operators and control
+circuits. The focus in chapter 6 was on speed-independent circuits, but this
+is not the only approach. In recent years there has also been great progress
+in synthesizing multiple-input-change fundamental-mode circuits. Chapter 7
+discussed more advanced 4-phase bundled-data protocols and circuits. Finally
+chapter 8 addressed languages and tools for high-level modeling and synthesis
+of asynchronous circuits.
+The tutorial deliberately made no attempt to cover all corners of the
+field – the aim was to pave a road into “the world of asynchronous design”.
+Now you are here at the end of the road, hopefully with enough background
+to carry on digging deeper into the literature and, equally importantly, with
+sufficient understanding of the characteristics of asynchronous circuits that
+you can start designing your own circuits. And finally, asynchronous circuits
+do not represent an alternative to synchronous circuits. They have advantages
+in some areas and disadvantages in other areas and they should be seen as a
+supplement, and as such they add new dimensions to the solution space that
+the digital designer explores. Even today, many circuits cannot be categorized
+as either synchronous or asynchronous; they contain elements of both.
+The following chapters will introduce some recent industrial-scale
+asynchronous chips. Additional designs are presented in [106].
+
+
+Appendix: The VHDL channel packages
+
+A.1. The abstract channel package
+
+-- Abstract channel package: (4-phase bundled-data push channel, 32-bit data)
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+package abstract_channels is
+
+  constant tpd : time := 2 ns;
+
+  -- Type definition for abstract handshake protocol
+
+  type handshake_phase is
+    (
+      u,      -- uninitialized
+      idle,   -- no communication
+      swait,  -- sender waiting
+      rwait,  -- receiver waiting
+      rcv,    -- receiving data
+      rec1,   -- recovery phase 1
+      rec2,   -- recovery phase 2
+      req,    -- request signal
+      ack,    -- acknowledge signal
+      error   -- protocol error
+    );
+
+  -- Floating point channel definitions
+
+  subtype fp is std_logic_vector(31 downto 0);
+
+  type uchannel_fp is
+    record
+      phase : handshake_phase;
+      data  : fp;
+    end record;
+
+  type uchannel_fp_vector is array(natural range <>) of uchannel_fp;
+
+  function resolved(s : uchannel_fp_vector) return uchannel_fp;
+
+  subtype channel_fp is resolved uchannel_fp;
+
+  procedure initialize_in(signal ch : out channel_fp);
+
+  procedure initialize_out(signal ch : out channel_fp);
+
+  procedure send(signal ch : inout channel_fp; d : in fp);
+
+  procedure receive(signal ch : inout channel_fp; d : out fp);
+
+  function probe(signal ch : in channel_fp) return boolean;
+
+end abstract_channels;
+
+
+package body abstract_channels is
+
+  -- Resolution table for abstract handshake protocol
+
+  type table_type is array(handshake_phase, handshake_phase) of
+    handshake_phase;
+
+  constant resolution_table : table_type := (
+  ------------------------------------------------------------------------------------
+  -- 2. parameter:                                                         |         |
+  -- u  idle   swait  rwait  rcv    rec1   rec2   req    ack    error      | 1. par: |
+  ------------------------------------------------------------------------------------
+    (u, u,     u,     u,     u,     u,     u,     u,     u,     u    ),  --| u       |
+    (u, idle,  swait, rwait, rcv,   rec1,  rec2,  swait, rec2,  error),  --| idle    |
+    (u, swait, error, rcv,   error, error, rec1,  error, rec1,  error),  --| swait   |
+    (u, rwait, rcv,   error, error, error, error, rcv,   error, error),  --| rwait   |
+    (u, rcv,   error, error, error, error, error, error, error, error),  --| rcv     |
+    (u, rec1,  error, error, error, error, error, error, error, error),  --| rec1    |
+    (u, rec2,  rec1,  error, error, error, error, rec1,  error, error),  --| rec2    |
+    (u, error, error, error, error, error, error, error, error, error),  --| req     |
+    (u, error, error, error, error, error, error, error, error, error),  --| ack     |
+    (u, error, error, error, error, error, error, error, error, error)); --| error   |
+
+  -- Fp channel
+
+  constant default_data_fp : fp := "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX";
+
+  function resolved(s : uchannel_fp_vector) return uchannel_fp is
+    variable result : uchannel_fp := (idle, default_data_fp);
+  begin
+    for i in s'range loop
+      result.phase := resolution_table(result.phase, s(i).phase);
+      if (s(i).phase = req) or (s(i).phase = swait) or
+         (s(i).phase = rcv) then
+        result.data := s(i).data;
+      end if;
+    end loop;
+    if not((result.phase = swait) or (result.phase = rcv)) then
+      result.data := default_data_fp;
+    end if;
+    return result;
+  end resolved;
+
+  procedure initialize_in(signal ch : out channel_fp) is
+  begin
+    ch.phase <= idle after tpd;
+  end initialize_in;
+
+  procedure initialize_out(signal ch : out channel_fp) is
+  begin
+    ch.phase <= idle after tpd;
+  end initialize_out;
+
+  procedure send(signal ch : inout channel_fp; d : in fp) is
+  begin
+    if not((ch.phase = idle) or (ch.phase = rwait)) then
+      wait until (ch.phase = idle) or (ch.phase = rwait);
+    end if;
+    ch <= (req, d);
+    wait until ch.phase = rec1;
+    ch.phase <= idle;
+  end send;
+
+  procedure receive(signal ch : inout channel_fp; d : out fp) is
+  begin
+    if not((ch.phase = idle) or (ch.phase = swait)) then
+      wait until (ch.phase = idle) or (ch.phase = swait);
+    end if;
+    ch.phase <= rwait;
+    wait until ch.phase = rcv;
+    wait for tpd;
+    d := ch.data;
+    ch.phase <= ack;
+    wait until ch.phase = rec2;
+    ch.phase <= idle;
+  end receive;
+
+  function probe(signal ch : in channel_fp) return boolean is
+  begin
+    return (ch.phase = swait);
+  end probe;
+
+end abstract_channels;
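+
+The probe function above returns true when a sender is waiting on a channel, that is
+when the channel is in the swait phase. One possible use, sketched here with a
+hypothetical entity fp_merge and hypothetical port names a, b and c, is a process that
+forwards data from whichever of two input channels has a pending request. The sketch
+gives priority to channel a and makes no attempt at fair arbitration.
+
+```vhdl
+library IEEE;
+use IEEE.std_logic_1164.all;
+use work.abstract_channels.all;
+
+entity fp_merge is                       -- hypothetical example entity
+  port ( a, b   : inout channel_fp;      -- input channels
+         c      : inout channel_fp;      -- output channel
+         resetn : in    std_logic );
+end fp_merge;
+
+architecture behav of fp_merge is
+begin
+  process
+    variable data : fp;
+  begin
+    initialize_in(a);
+    initialize_in(b);
+    initialize_out(c);
+    wait until resetn = '1';
+    loop
+      -- Block until a sender is waiting on at least one input.
+      if not (probe(a) or probe(b)) then
+        wait on a, b until probe(a) or probe(b);
+      end if;
+      if probe(a) then
+        receive(a, data);
+      else
+        receive(b, data);
+      end if;
+      send(c, data);
+    end loop;
+  end process;
+end behav;
+```
+
+The explicit sensitivity list in the wait statement makes the process re-evaluate
+the probes whenever either input channel changes.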
+
+A.2. The real channel package
+
+-- Low-level channel package (4-phase bundled-data push channel, 32-bit data)
+
+library IEEE;
+use IEEE.std_logic_1164.all;
+
+package real_channels is
+
+  -- synopsys synthesis_off
+  constant tpd : time := 2 ns;
+  -- synopsys synthesis_on
+
+  -- Floating point channel definitions
+
+  subtype fp is std_logic_vector(31 downto 0);
+
+  type uchannel_fp is
+    record
+      req  : std_logic;
+      ack  : std_logic;
+      data : fp;
+    end record;
+
+  type uchannel_fp_vector is array(natural range <>) of uchannel_fp;
+
+  function resolved(s : uchannel_fp_vector) return uchannel_fp;
+
+  subtype channel_fp is resolved uchannel_fp;
+
+  -- synopsys synthesis_off
+  procedure initialize_in(signal ch : out channel_fp);
+
+  procedure initialize_out(signal ch : out channel_fp);
+
+  procedure send(signal ch : inout channel_fp; d : in fp);
+
+  procedure receive(signal ch : inout channel_fp; d : out fp);
+
+  function probe(signal ch : in uchannel_fp) return boolean;
+  -- synopsys synthesis_on
+
+  function connect(signal ch : in uchannel_fp) return channel_fp;
+
+end real_channels;
+
+package body real_channels is
+
+-- Resolution table for 4-phase handshake protocol
+
+-- synopsys synthesis_off
+type stdlogic_table is array(std_logic, std_logic) of std_logic;
+
+constant resolution_table : stdlogic_table := (
+-- ----------------------------------------------------------------
+-- |  2. parameter:                                    |         |
+-- |   U    X    0    1    Z    W    L    H    -       | 1. par: |
+-- ----------------------------------------------------------------
+    ( 'U', 'X', '0', '1', 'X', 'X', 'X', 'X', 'X' ),  -- |   U   |
+    ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   X   |
+    ( '0', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   0   |
+    ( '1', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   1   |
+    ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   Z   |
+    ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   W   |
+    ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   L   |
+    ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' ),  -- |   H   |
+    ( 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X' )); -- |   -   |
+-- synopsys synthesis_on
+
+-- Fp channel
+
+-- synopsys synthesis_off
+constant default_data_fp : fp := "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX";
+-- synopsys synthesis_on
+
+function resolved(s : uchannel_fp_vector) return uchannel_fp is
+-- pragma resolution_method three_state
+-- synopsys synthesis_off
+  variable result : uchannel_fp := ('U','U',default_data_fp);
+-- synopsys synthesis_on
+begin
+-- synopsys synthesis_off
+  for i in s'range loop
+    result.req := resolution_table(result.req,s(i).req);
+    result.ack := resolution_table(result.ack,s(i).ack);
+    if (s(i).req = '1') or (s(i).req = '0') then
+      result.data := s(i).data;
+    end if;
+  end loop;
+  if not((result.req = '1') or (result.req = '0')) then
+    result.data := default_data_fp;
+  end if;
+  return result;
+-- synopsys synthesis_on
+end resolved;
+
+-- synopsys synthesis_off
+procedure initialize_in(signal ch : out channel_fp) is
+begin
+  ch.ack <= '0' after tpd;
+end initialize_in;
+
+procedure initialize_out(signal ch : out channel_fp) is
+begin
+  ch.req <= '0' after tpd;
+end initialize_out;
+
+procedure send(signal ch : inout channel_fp; d : in fp) is
+begin
+  -- wait until the previous handshake has returned to zero
+  if ch.ack /= '0' then
+    wait until ch.ack = '0';
+  end if;
+  -- raise the request together with the data
+  ch.req <= '1' after tpd;
+  ch.data <= d after tpd;
+  wait until ch.ack = '1';
+  -- withdraw the request once the data has been acknowledged
+  ch.req <= '0' after tpd;
+end send;
+
+procedure receive(signal ch : inout channel_fp; d : out fp) is
+begin
+  -- wait for a request from the sender
+  if ch.req /= '1' then
+    wait until ch.req = '1';
+  end if;
+  wait for tpd;
+  d := ch.data;
+  ch.ack <= '1';
+  -- complete the return-to-zero phase
+  wait until ch.req = '0';
+  ch.ack <= '0' after tpd;
+end receive;
+
+function probe(signal ch : in uchannel_fp) return boolean is
+begin
+  return (ch.req = '1');
+end probe;
+-- synopsys synthesis_on
+
+function connect(signal ch : in uchannel_fp) return channel_fp is
+begin
+return ch;
+end connect;
+
+end real_channels;
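+
+The effect of the resolution table can be checked outside VHDL. The Python fragment
+below is only a sketch, assuming a reading of the table in which 'U' means "no driver
+yet", a single '0' or '1' driver passes through unchanged, and every other combination
+gives 'X'. The names resolve_pair and resolve are hypothetical and are not part of the
+package.

```python
def resolve_pair(current, driver):
    """Combine the running result with one more driver (one table lookup)."""
    if current == "U":
        # Row 'U': an undriven result takes over 'U', '0' or '1'; anything else is 'X'.
        return driver if driver in ("U", "0", "1") else "X"
    if current in ("0", "1") and driver == "U":
        # Rows '0' and '1': only an undriven second source leaves the value intact.
        return current
    # Every remaining entry of the table is 'X' (conflict or illegal value).
    return "X"


def resolve(drivers):
    """Fold resolve_pair over all drivers, starting from 'U' as the VHDL code does."""
    result = "U"
    for driver in drivers:
        result = resolve_pair(result, driver)
    return result
```

+With this reading, one active driver among otherwise undriven sources wins, and two
+active drivers on the same wire are reported as 'X'.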
+
+
+II
+
+BALSA - AN ASYNCHRONOUS HARDWARE
+SYNTHESIS SYSTEM
+
+Author: Doug Edwards, Andrew Bardsley
+Department of Computer Science
+The University of Manchester
+{doug,bardsley}@cs.man.ac.uk
+
+Abstract
+Balsa is a system for describing and synthesising asynchronous circuits based
+on syntax-directed compilation into communicating handshake circuits. In these
+chapters, the basic Balsa design flow is described and several simple circuit ex-
+amples are used to illustrate the Balsa language in an informal tutorial style. The
+section concludes with a walk-through of a major design exercise – a 4 channel
+DMA controller described entirely in Balsa.
+
+Keywords:
+asynchronous circuits, high-level synthesis
+
+
+
+Chapter 9
+
+AN INTRODUCTION TO BALSA
+
+9.1.
+Overview
+
+Balsa is both a framework for synthesising asynchronous hardware systems
+and a language for describing such systems. The approach adopted is that of
+syntax-directed compilation into communicating handshaking components and
+closely follows the Tangram system ([141, 135] and Chapter 13 on page 221)
+of Philips. The advantage of this approach is that the compilation is trans-
+parent: there is a one-to-one mapping between the language constructs in the
+specification and the intermediate handshake circuits that are produced. It is
+relatively easy for an experienced user to envisage the micro-architecture of the
+circuit that results from the original description. Incremental changes made at
+the language level result in predictable changes at the circuit implementation
+level. This is important if optimisations and design trade-offs are to be made
+easily and contrasts with synchronous VHDL synthesis in which small changes
+in the specification may make radical alterations to the resulting circuit.
+It is important to understand what Balsa offers the designer and what obli-
+gations are still placed upon the designer. The tight “edit description – syn-
+thesise – simulate – revise description” loop made possible by the fast com-
+pilation process makes it very easy for the design space of a system to be
+explored and prototypes rapidly evaluated. However, there is no substitute
+for creativity. Poor designs may be created as easily as elegant designs and
+some experience in designing asynchronous circuits is required before even a
+good designer of conventional clocked circuits will best be able to exploit the
+system. Be warned that although Balsa guarantees correct-by-construction cir-
+cuits, it does not guarantee correct systems. In particular, it is quite feasible,
+as in any asynchronous system, to describe an elegant circuit which will ex-
+hibit deadlock. Furthermore, post-layout simulation is still required in order
+to check that when the instantiated circuit has been placed and routed by con-
+ventional CAD tools, it meets basic timing requirements. On the other hand,
+a choice of implementation libraries is available allowing the designer to trade
+the greater process portability of a delay-insensitive implementation against,
+perhaps, smaller circuit area which may require a larger post-layout validation
+effort.
+Although Balsa has evolved from a research environment, it is not a toy sys-
+tem unsuited for large-scale designs; Balsa has been used to synthesise the 32
+channel DMA controller [11] for the Amulet3i asynchronous microprocessor
+macro-cell [48]. The controller has a complex specification and the resulting
+implementation occupies 2 mm² on a 0.35 µm 3-layer metal process. Balsa is at
+the time of writing being used to synthesise a complete Amulet core as part of
+the EU funded G3 smartcard project [46].
+As noted earlier, Balsa is very similar to Tangram. It is a less mature pack-
+age lacking some useful tools contained within the Tangram package such as
+the power performance analyser. However, Balsa is freely available whereas
+Tangram is not generally available outside Philips. As far as the expressiveness
+of the languages is concerned, Balsa adds powerful parameterisation using re-
+cursive expansion definition facilities whereas Tangram allows more flexibility
+in interacting with non delay-insensitive external interfaces. Balsa has delib-
+erately chosen not to add such features to ensure that its channels-only delay-
+insensitive model is not compromised.
+The reader should be aware that not all aspects of the Balsa language or its
+syntax are explored in the material that follows: a more detailed introduction
+is available in the Balsa User Guide available from [7]. The Balsa system is
+freely available from the same site. The system is still evolving: the description
+here refers to Balsa release 3.1.0.
+
+9.2.
+Basic concepts
+
+A circuit described in Balsa is compiled into a communicating network com-
+posed from a small (about 40) set of handshake components. The components
+are connected by channels over which atomic communications or handshakes
+take place. Channels may have datapaths associated with them (in which case
+a handshake involves the transfer of data), or may be purely control (in which
+case the handshake acts as a synchronisation or rendezvous point).
+Each channel connects exactly one passive port of a handshake component
+to one active port of another handshake component. An active port is a port
+which initiates a communication. A passive port responds (when it is ready) to
+the request from the active port by an acknowledge signal.
+Data channels may be push channels or pull channels. In a push channel,
+the direction of the data flow is from the active port to the passive port. This
+is similar to the communication style of micropipelines. Data validity is sig-
+nalled by request and released on acknowledge. In a pull channel, the direction
+of data flow is from the passive port to the active port. The active port requests
+a transfer, and data validity is signalled by an acknowledge from the passive port.
+An example of a circuit composed from handshake components is shown in
+figure 9.1. Active ports are denoted by filled bubbles on a handshake compo-
+nent and passive ports are denoted by open bubbles.
+
+[Figure 9.1: request, acknowledge and bundled-data wires connecting a Fetch component (“→”) and a Case component (“@”, specification "0;1", outputs 0 and 1).]
+
+Figure 9.1.
+Two connected handshake components.
+
+Here, a Fetch component, or Transferrer (denoted by “→”), and a Case component
+(denoted by “@”) are connected by an internal data-bearing channel.
+Circuit action is activated by a request to the Transferrer which in turn issues
+a request to the environment on its active pull input port (on the left of the
+diagram). The environment supplies the demanded data indicating its validity
+by the acknowledgement signal. The Transferrer then presents a handshake re-
+quest and data to the Case component on its active push output port which the
+Case component receives on a passive port. Depending on the data value, the
+Case component issues a handshake to its environment on either the top right
+or bottom right port. Finally, when the acknowledgement is received by the
+Case component, an acknowledgement is returned along the original channel,
+terminating this handshake. The circuit is ready to operate once more.
+Data follows the direction of the request in this example and the acknowl-
+edgement to that request flows in the opposite direction. In this figure, indi-
+vidual physical request, acknowledgement and data wires are explicitly shown.
+Data is carried on separate wires from the signalling (it is “bundled” with the
+control) although this is not necessarily true for other data/signalling encoding
+schemes.
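+One informal way to picture the handshakes described above is to treat a request as
+a procedure call and the acknowledgement as its return: an active pull port is a call
+that returns data, and an active push port is a call that carries data as its
+argument. The Python sketch below uses this picture for the Transferrer and Case
+pair. All names are hypothetical, and the mapping of data value 0 to output 0 and
+1 to output 1 is an assumption read from the "0;1" label in the figure.

```python
def make_case(outputs, log):
    """Return the passive push port of a Case component.

    outputs maps a data value to a callable standing in for the handshake on the
    selected output port (assumed mapping: value 0 -> port 0, value 1 -> port 1).
    """
    def push(value):
        log.append(f"case: request with data {value}")
        outputs[value]()                 # handshake with the environment
        log.append("case: acknowledge")  # returning models the acknowledge
    return push


def make_transferrer(pull_source, push_sink, log):
    """Return the activation port of a Fetch (Transferrer) component."""
    def activate():
        log.append("fetch: activated")
        value = pull_source()  # active pull port: the acknowledge returns the data
        push_sink(value)       # active push port: the request carries the data
        log.append("fetch: activation acknowledged")
    return activate


log = []
outputs = {
    0: lambda: log.append("environment: handshake on output 0"),
    1: lambda: log.append("environment: handshake on output 1"),
}
activate = make_transferrer(lambda: 1, make_case(outputs, log), log)
activate()
```

+The nesting of the log entries mirrors the text: the handshake on the selected output
+completes before the Case component acknowledges, and that acknowledgement in turn
+precedes the completion of the original activation.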
+The bundled-data scheme illustrated in figure 9.1 is not the only imple-
+mentation possible. Methodologies exist to implement channel connections
+with delay-insensitive signalling where timing relationships between individ-
+ual wires of an implemented channel do not affect the functionality of the cir-
+cuit. Handshake circuits can be implemented using these methodologies which
+are robust to naïve realisations, process variations and interconnect delay properties.
+Future releases of Balsa will include several alternative back-ends. A
+more detailed discussion of handshake protocols can be found in section 2.1
+on page 9 and section 7.1 on page 115.
+
+
+Normally, handshake circuit diagrams are not shown at the level of detail
+of figure 9.1, a channel usually being shown as a single arc with the direc-
+tion of data being denoted by an arrow head on the arc. Similarly, control
+only channels, comprising only request/acknowledge wires, are indicated by
+an arc without an arrowhead. The circuit complexity of handshake circuits is
+often low: for example, a Transferrer may be implemented using only wires.
+An example of a handshake circuit for a modulo-10 counter (see page 185) is
+shown in figure 9.2. The corresponding gate-level implementation is shown in
+figure 9.3.
+
+[Figure 9.2: handshake circuit with ports activate, aclk and count, variables count_reg and tmp, the functions x + 1 and x /= 9, and a Case component with specification "0;1".]
+
+Figure 9.2.
+Handshake circuit of a modulo-10 counter.
+
+[Figure 9.3: gate-level circuit with ports activate, aclk (no ack) and count, showing control sequencing components (3 gates each), a comparator (/= 9?), an incrementer and latches.]
+
+Figure 9.3.
+Gate-level circuit of a modulo-10 counter.
+
+
+9.3.
+Tool set and design flow
+
+An overview of the Balsa design flow is shown in figure 9.4. Behavioural
+simulation is provided by LARD [38], a language developed within the Amulet
+group for modelling asynchronous systems. However, the target CAD sys-
+tem can also be used to perform more accurate simulations and to validate
+the design.
+Most of the Balsa tools are concerned with manipulating the
+Breeze handshake intermediate files produced by compiling Balsa descrip-
+tions. Breeze files can be used by back-end tools to provide implementations
+for Balsa descriptions, but also contain procedure and type definitions passed
+on from Balsa source files allowing Breeze to be used as the package descrip-
+tion format for Balsa.
+The Balsa system comprises the following collection of tools:
+
+balsa-c: the compiler for the Balsa language. The compiler produces
+Breeze from Balsa descriptions.
+
+balsa-netlist: produces a netlist, currently EDIF, Compass or Verilog,
+from a Breeze description, performing technology mapping and hand-
+shake expansion.
+
+breeze2ps: a tool which produces a PostScript file of the handshake cir-
+cuit graph.
+
+breeze2lard: a translator that converts a Breeze file to a LARD behavioural
+model.
+
+breeze-cost: a tool which gives an area cost estimate of the circuit.
+
+balsa-md: a tool for generating Makefiles for make(1).
+
+balsa-mgr: a graphical front-end to balsa-md with project management
+facilities.
+
+The interfaces between the Balsa and target CAD systems are handled by
+the following scripts:
+
+balsa-pv: uses powerview tools to produce an EDIF file from a top-level
+powerview schematic which incorporates Balsa generated circuits.
+
+balsa-xi: produces a Xilinx download file from an EDIF description of
+a compiled circuit.
+
+balsa-ihdl: an interface to the Cadence Verilog-XL environment.
+
+9.4.
+Getting started
+
+In this section, simple buffer circuits are described in Balsa introducing the
+basic elements of a Balsa description.
+
+
+[Figure 9.4: flow from a Balsa description through balsa-c to Breeze, and from Breeze through breeze-cost (cost estimate), breeze2lard (LARD behavioural simulation) and balsa-netlist (EDIF, Verilog or Compass netlists) to the Powerview/Xilinx, Cadence and Compass tools. The legend distinguishes Balsa tools, non-Balsa tools and file formats/data.]
+
+Figure 9.4.
+Design flow.
+
+9.4.1
+A single-place buffer
+
+This buffer circuit is the HDL equivalent of the “hello, world” program. Its
+Balsa description is:
+
+import [balsa.types.basic]
+-- a single line comment
+-- buffer1a: A single place buffer
+procedure buffer1 (input i : byte; output o : byte) is
+  variable x : byte
+begin
+  loop
+    i -> x   -- Input communication
+    ;        -- sequence the two communications
+    o <- x   -- Output communication
+  end
+end
+
+Commentary on the code
+
+This Balsa description builds a single-place buffer, 8-bits wide. The circuit
+requests a byte from the environment which, when ready, transfers the data
+to the register. The circuit signals to the environment on its output channel
+that data is available and the environment reads it when it chooses. This small
+program introduces:
+
+comments:
+Balsa supports both multi-line comments and single-line com-
+ments.
+
+modular compilation:
+Balsa supports modular compilation. The import
+statement in this example includes the definition of some standard data types
+such as byte, nibble, etc. The search path given in the import statement
+is a dot-separated directory path similar to that of Java (although multi-file
+packages are not implemented). The import statement may be used to include
+other precompiled Balsa programs thereby acting as a library mechanism. Any
+import statements must precede other declarations in the files.
+
+procedures:
+The procedure declaration introduces an object that looks sim-
+ilar to a procedure definition in a conventional programming language. A Balsa
+procedure is a process. The parameters of the procedure define the interface to
+the environment outside the circuit block. In this case, the module has an 8-bit
+input and an 8-bit output. The body of the procedure definition defines an al-
+gorithmic behaviour for the circuit; it also implies a structural implementation.
+In this example, a variable x (of type byte) is declared implying that an 8-bit
+wide storage element will appear in the synthesised circuit.
+
+
+The behaviour of the circuit is obvious from the code: 8-bit values are trans-
+ferred from the environment to the storage variable, x, and then sequentially
+output from the variable to the environment. This sequence of events is continually
+repeated (loop ... end).
+
+channel communication:
+the communication operators “->” and “<-” are
+channel assignments and imply a communication or handshake over the chan-
+nel. Because of the sequencing explicit in the description, the variable x will
+only accept a new value when it is ready; the value will only be passed out to
+the environment when requested. Note that the channel is always on the left-
+hand side of the operator and the corresponding variable or expression on the
+right-hand side.
+
+sequencing:
+The “;” operator separating the two assignments is not merely a
+syntactic statement separator, it explicitly denotes sequentiality. The contents
+of x are transferred to the output port after the input transfer has completed.
+Because a “;” connects two sequenced statements or blocks, it is an error to
+place a “;” after the last statement in a block.
+
+repetition:
+The loop ... end construct causes infinite repetition of the code
+contained within its body. Procedures without loop ... end are permitted and
+will terminate, allowing procedure calls to be sequenced if required.
+
+Compiling the circuit
+
+balsa-c buffer1a
+
+The compiler produces an output file buffer1a.breeze. This is a file in an in-
+termediate format which can be imported back into other Balsa source files
+(thereby providing a simple library mechanism). Breeze is a textual format file
+designed for ease of parsing and it is therefore somewhat opaque. A primitive
+graphical representation of the compiled circuit in terms of handshake compo-
+nents can be produced (as buffer1a.ps) by:
+
+breeze2ps buffer1a
+
+The synthesised circuit
+
+The resulting handshake circuit is shown in figure 9.5. This is not actually
+taken from the output of breeze2ps, but has been redrawn to make the diagram
+more readable. Although it is not necessary to understand the exact opera-
+tion of the compiled circuit, a knowledge of the structure is helpful to gain an
+understanding of how best to describe circuits which can be synthesised efficiently
+using Balsa. A brief description of the operation of the circuit is given
+below. The circuit has been annotated with the names of the various handshake
+components.
+
+[Figure 9.5: activation port (“➤”) at the top, Loop (“#”), Sequence (“;”), two Fetch (“→”) components and Variable x, with ports i and o.]
+
+Figure 9.5.
+Handshake circuit for a single-place buffer.
+
+The port at the top, denoted by “>”, is an activation port generating a hand-
+shake enclosing the behaviour of the circuit. It can be thought of as a reset
+signal which, when de-asserted, initiates the operation of the circuit. All com-
+piled Balsa programs contain an activation port.
+The activation port starts the operation of the Repeater (“#”) which initiates
+a handshake with the Sequencer. The Repeater corresponds directly to the
+loop ... end construct, and the Sequencer to the “;” operator. The Sequencer
+first issues a handshake to the left-hand Fetch component, causing data to be
+moved to the storage element in the Variable element. The Sequencer then
+handshakes with the right-hand Fetch component, causing data to be read from
+the Variable element. When these operations are complete, the Sequencer com-
+pletes its handshake with the Repeater which starts the cycle again.
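+The order of handshakes just described can be traced with a small Python sketch.
+It is an illustration only: each handshake component is reduced to a hypothetical
+Python function, a handshake is modelled as a call that returns on completion, and
+the Repeater is bounded by the length of the input list, whereas the hardware loop
+never ends.

```python
def run_buffer1(inputs):
    """Trace the handshakes of the single-place buffer for a finite input list."""
    log, outputs = [], []
    x = [None]  # stands in for the Variable component holding x

    def fetch_in(value):   # left-hand Fetch: i -> x
        x[0] = value
        log.append(f"fetch i -> x ({value})")

    def fetch_out():       # right-hand Fetch: o <- x
        outputs.append(x[0])
        log.append(f"fetch x -> o ({x[0]})")

    def sequencer(value):  # ";" operator: strictly one handshake after the other
        fetch_in(value)
        fetch_out()

    for value in inputs:   # Repeater "#", bounded here for the sake of the example
        sequencer(value)
    return outputs, log
```

+Calling run_buffer1 with a two-element list yields four log entries in which every
+input transfer is followed by the matching output transfer before the next input is
+accepted.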
+
+9.4.2
+Two-place buffers
+
+1st design
+
+Having built a single-place buffer, an obvious goal is a pipeline of single
+buffer stages. Initially consider a two-place buffer; there are a number of ways
+we might describe this. One choice is to define a circuit with two storage
+elements:
+
+-- buffer2a: Sequential 2-place buffer with assignment
+--           between variables
+import [balsa.types.basic]
+
+procedure buffer2 (input i : byte; output o : byte) is
+  variable x1, x2 : byte
+begin
+  loop
+    i -> x1;   -- Input communication
+    x2 := x1;  -- Implied communication
+    o <- x2    -- Output communication
+  end
+end
+
+In this example we explicitly introduce two storage elements, x1 and x2.
+The contents of the variable x1 are caused to be transferred to the variable x2
+by means of the assignment operator “:=”. However, transfer is still effected
+by means of a handshaking communication channel. This assignment operator
+is merely a way of concealing the channel for convenience.
+
+2nd design
+
+The implicit channel can be made explicit as shown in buffer2b.balsa:
+
+-- buffer2b: Sequential version with an explicit
+--           internal channel
+import [balsa.types.basic]
+
+procedure buffer2 (input i : byte; output o : byte) is
+  variable x1, x2 : byte
+  channel chan : byte
+begin
+  loop
+    i -> x1;                   -- Input communication
+    chan <- x1 || chan -> x2;  -- Transfer x1 to x2
+    o <- x2                    -- Output communication
+  end
+end
+
+The channel which, in the previous example, was concealed behind the use
+of the “:=” assignment operator, has been made explicit. The handshake circuit
+produced (after some simple optimisations) is identical to buffer2a. The “||”
+operator is explained in the next example.
+It is important to understand the significance of the operation of the circuits
+produced by buffer2a and buffer2b. Remember that “;” is more than a syntac-
+tic separator: it is an operator denoting sequence. Thus, first the input, i, is
+transferred to x1. When this operation is complete, x1 is transferred to x2 and
+finally the contents of x2 are written to the environment on port o. Only after
+this sequence of operations is complete can new data from the environment be
+read into x1 again.
+
+9.4.3
+Parallel composition and module reuse
+
+The operation above is unnecessarily constrained: there is no reason why
+the circuit cannot be reading a new value into x1 at the same time that x2 is
+writing out its data to the environment. The program in buffer2c achieves this
+optimisation.
+
+-- buffer2c: a 2-place buffer using parallel composition
+import [buffer1a]
+
+procedure buffer2 (input i : byte; output o : byte) is
+channel c : byte
+begin
+buffer1 (i, c) ||
+buffer1 (c, o)
+end
+
+Commentary on the code
+
+In the program above, a 2-place buffer is composed from 2 single-place
+buffers. The output of the first buffer is connected to the input of the second
+buffer by their respective output and input ports. However, apart from com-
+munications across the common channel, the operation of the two buffers is
+independent.
+The deceptively simple program above illustrates a number of new features
+of the Balsa language:
+
+modular compilation:
+The buffer1a circuit is included by the import mech-
+anism described earlier. The circuit must have been compiled previously. The
+Makefile generation program balsa-md (see page 166) can be used to generate
+a Makefile which will automatically take care of such dependencies.
+
+connectivity by naming:
+The output of the first buffer is connected to the
+input of the second buffer because of the common channel name, c, in the
+parameter list in the instantiation of the buffers.
+
+parallel composition:
+The “||” operator specifies that the two units which it
+connects should operate in parallel. This does not mean that the two units may
+operate totally independently: in this example, the output of one buffer writes
+to the input of the other buffer, creating a point of synchronisation. Note also
+that the parallelism referred to is a temporal parallelism. The two buffers are
+physically connected in series.
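+The difference between the sequential and the parallel two-place buffer can be
+made concrete with a small behavioural model. The Python sketch below is not part of
+the Balsa tool set: processes are written as generators, a channel communication is
+a rendezvous between one sender and one receiver, and the scheduler run is a
+hypothetical helper. The environment offers inputs on channel "i" but never reads
+channel "o", so each design accepts inputs only until it is full.

```python
def buffer1(inp, out):
    """Single-place buffer: loop i -> x ; o <- x end."""
    while True:
        x = yield ("recv", inp)
        yield ("send", out, x)


def buffer2_sequential(inp, out):
    """buffer2a: input, internal assignment and output, strictly in sequence."""
    while True:
        x1 = yield ("recv", inp)
        x2 = x1
        yield ("send", out, x2)


def producer(out, items):
    """Environment offering the next item as soon as the previous one is accepted."""
    for item in items:
        yield ("send", out, item)


def run(processes):
    """Pair senders with receivers until no channel has both; return the log.

    Every process passed in is assumed to request at least one communication.
    """
    pending = {proc: next(proc) for proc in processes}
    log = []
    progressed = True
    while progressed:
        progressed = False
        senders = {req[1]: (p, req[2]) for p, req in pending.items() if req[0] == "send"}
        for receiver, req in list(pending.items()):
            if req[0] != "recv" or req[1] not in senders:
                continue
            sender, value = senders[req[1]]
            log.append((req[1], value))
            # Resume both parties; the receiver is handed the transferred value.
            for proc, reply in ((receiver, value), (sender, None)):
                try:
                    pending[proc] = proc.send(reply)
                except StopIteration:
                    del pending[proc]
            progressed = True
            break
    return log


def inputs_accepted(log):
    """Values that were transferred on the input channel "i"."""
    return [value for channel, value in log if channel == "i"]


sequential = run([producer("i", [1, 2, 3]), buffer2_sequential("i", "o")])
parallel = run([producer("i", [1, 2, 3]), buffer1("i", "c"), buffer1("c", "o")])
```

+In this model the sequential design stalls after accepting one value, while the
+parallel composition accepts a second value because the first stage is free again as
+soon as it has handed its data to the second stage over channel c.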
+
+9.4.4
+Placing multiple structures
+
+If we wish to extend the number of places in the buffer, the previous tech-
+nique of explicitly enumerating every buffer becomes tedious. What is required
+is a means of parameterising the buffer length (though in any real hardware
+implementation the number of buffers cannot be variable and must be known
+beforehand). One technique, shown in buffer_n, is to use the for construct
+together with compile-time constants:
+
+-- buffer_n: an n-place parameterised buffer
+import [buffer1a]
+constant n = 8
+
+procedure buffer_n (input i : byte; output o : byte) is
+  array 1 .. n-1 of channel c : byte
+begin
+  buffer1 (i, c[1]) ||        -- First buffer
+  buffer1 (c[n-1], o) ||      -- Last buffer
+  for || i in 1 .. n-2 then   -- Buffer i
+    buffer1 (c[i], c[i+1])
+  end
+end
+
+Commentary on the code
+
+constants:
+the value of an expression (of any type) may be bound to a name.
+The value of the expression is evaluated at compile time and the type of the
+name when used will be the same as the original expression in the constant
+declaration. Numbers can be given in decimal (starting with one of 1..9), hexa-
+decimal (0x prefix), octal (0 prefix) and binary (0b prefix).
+
+arrayed channels:
+procedure ports and locally declared channels may be
+arrayed. Each channel can be referred to by a numeric or enumerated index,
+but from the point of view of handshaking, each channel is distinct and no
+indexed channel has any relationship with any other such channel other than
+the name they share. Arraying is not part of a channel’s type.
+
+for loops:
+a for loop allows iteration over the instantiation of a subcircuit.
+The composition of the circuits may either be a parallel composition – as in the
+example above – or sequential. In the latter case, “;” should be substituted for
+“||” in the loop specifier. The iteration range of the loop must be resolvable at
+compile time.
+A more flexible approach uses parameterised procedures and is discussed
+later in chapter 11 on page 193.
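+The index arithmetic in buffer_n is easy to get wrong by one, so it can help to
+enumerate the connections. The Python sketch below is an illustration only, and
+buffer_n_wiring is a hypothetical helper name. It lists the (input, output) channel
+pair of every buffer1 instance for a given n, following the first buffer, the for
+loop over 1 .. n-2, and the last buffer.

```python
def buffer_n_wiring(n):
    """Return the (input, output) channel names of each of the n buffer stages."""
    if n < 2:
        raise ValueError("this sketch needs at least two stages")
    stages = [("i", "c[1]")]                                  # first buffer
    for k in range(1, n - 1):                                 # k = 1 .. n-2
        stages.append((f"c[{k}]", f"c[{k + 1}]"))             # buffer k
    stages.append((f"c[{n - 1}]", "o"))                       # last buffer
    return stages
```

+For n = 8 this gives eight stages joined by the seven internal channels c[1] to c[7],
+with the output of every stage feeding the input of the next.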
+
+9.5.
+Ancillary Balsa tools
+
+9.5.1
+Makefile generation
+
+Makefiles are commonly used in Unix by the utility make(1) to specify and
+control the processes by which complicated programs are compiled. Speci-
+fying the dependencies involved is often tedious and error prone. The Balsa
+system has a utility, balsa-md, to generate the Makefile for a given program
+automatically. The generated Makefile knows not only how to compile a Balsa
+module with multiple imports, but also how to generate and run test-harnesses
+for the simulation environment, LARD, used by Balsa. Balsa-mgr provides
+a convenient, intuitive, GUI front-end to balsa-md and considerably simpli-
+fies project management, in particular the handling of multiple test harnesses.
+However, since a textual description of any GUI is tedious, balsa-mgr will not
+be discussed further and only the facilities to which the underlying balsa-md
+provides a gateway will be described in the examples that follow. The interface
+presented by balsa-mgr is shown in figure 9.6.
+
+Figure 9.6.
+balsa-mgr IDE.
+
+9.5.2
+Estimating area cost
+
+The area cost of a circuit may be estimated by executing the Makefile rule
+cost. For example, an extract of the output produced for the 2-place buffer is
+shown below:
+
+Part: buffer2
+(0 (component "$BrzFetch" (8) (10 2 9)))
+(0 (component "$BrzFetch" (8) (8 6 7)))
+(0 (component "$BrzFetch" (8) (5 4 3)))
+(20.75 (component "$BrzLoop" () (1 11)))
+(99.0 (component "$BrzSequence" (3) (11 (10 8 5))))
+(198.0 (component "$BrzVariable" (8 1 "x1[0..7]") (9 (6))))
+(198.0 (component "$BrzVariable" (8 1 "x2[0..7]") (7 (4))))
+
+Total cost: 515.75
+
+The exact format of the report produced is somewhat obscure. Each line
+corresponds to a handshake component. Its area cost is the first number on the
+line. The parameters after the component name correspond to the width of the
+various channels of that component and the internal channel names. The area
+reported is proportional to the cost of implementing the circuit in a particular
+silicon process and is of most use in comparing different circuit descriptions.
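+As a cross-check of the report above, the per-component costs can be summed with a
+few lines of Python. This is a sketch that assumes only the line layout visible in
+the extract (a leading cost followed by the component name); it is not a parser for
+the full report format, and component_costs is a hypothetical helper name.

```python
import re

REPORT = """\
Part: buffer2
(0 (component "$BrzFetch" (8) (10 2 9)))
(0 (component "$BrzFetch" (8) (8 6 7)))
(0 (component "$BrzFetch" (8) (5 4 3)))
(20.75 (component "$BrzLoop" () (1 11)))
(99.0 (component "$BrzSequence" (3) (11 (10 8 5))))
(198.0 (component "$BrzVariable" (8 1 "x1[0..7]") (9 (6))))
(198.0 (component "$BrzVariable" (8 1 "x2[0..7]") (7 (4))))

Total cost: 515.75
"""

# Leading number is the area cost; the quoted name after "component" is the type.
LINE = re.compile(r'\((\d+(?:\.\d+)?) \(component "\$(\w+)"')


def component_costs(report):
    """Sum the area cost per component type; lines that do not match are skipped."""
    costs = {}
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if match:
            cost, name = float(match.group(1)), match.group(2)
            costs[name] = costs.get(name, 0.0) + cost
    return costs


costs = component_costs(REPORT)
total = sum(costs.values())
```

+The sum of the component lines agrees with the reported total of 515.75, and the
+breakdown shows that the two variables account for most of the area.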
+
+9.5.3
+Viewing the handshake circuit graph
+
+A PostScript view of the handshake circuit graph can be produced by running
+the rule make ps. A (flattened) view of the handshake circuit graph for
+the example buffer2c is shown in figure 9.7.
+The two single-place buffers from which the circuit is composed are recog-
+nisable in the circuit. Apart from minor differences in the labelling of the
+handshake component symbols, the circuit is identical to that shown in fig-
+ure 8.6 discussed in section 8.4 on page 128 and the same optimisations have
+been (automatically) applied.
+
+9.5.4
+Simulation
+
+Ignoring the various simulation possibilities available once the design has
+been converted to a silicon layout, there are three strategies for evaluating and
+simulating the design from Balsa:
+
+1 Default LARD test harness.
+
+The command make sim will generate a LARD test harness and run it.
+The test harness reads data from a file for each input port of the module
+under test. Data sent to output channels appears on the standard output.
+This method needs no knowledge of LARD at all.
+
+2 Balsa test harness.
+
+If a more sophisticated test sequence is required, Balsa is a sufficiently
+flexible language in its own right to be able to specify most test se-
+quences. A default LARD test harness can then be generated for the
+Balsa test harness. Again no detailed knowledge of LARD is required.
+
+3 Custom LARD test harness.
+
+For some applications, it may be necessary to write a custom test harness
+in LARD. The Makefile-generated test harness may be used as a template.
+
+[Figure 9.7: handshake circuit graph of buffer2c showing the activate, i and o ports and two x[0..7] variables.]
+
+Figure 9.7.
+Flattened view of buffer2c.
+
+The default test harness exercises the target Balsa block by repeatedly hand-
+shaking on all external channels; input data channels receive the value 0 on
+each handshake, although it is possible to associate an input channel with a
+data file.
+
+Simulating buffer2c
+
+A simulation can be generated by invoking the appropriate simulation rule
+from the Makefile, producing the following output:
+
+0: chan ‘i’: writing 0
+6: chan ‘i’: writing 0
+15: chan ‘o’: reading 0
+19: chan ‘i’: writing 0
+28: chan ‘o’: reading 0
+32: chan ‘i’: writing 0
+41: chan ‘o’: reading 0
+45: chan ‘i’: writing 0
+54: chan ‘o’: reading 0
+58: chan ‘i’: writing 0
+67: chan ‘o’: reading 0
+71: chan ‘i’: writing 0
+80: chan ‘o’: reading 0
+
+The simulation runs forever unless terminated (by Ctrl-C). The numbers
+reported on the left hand side of each channel activity line are simulation times.
+LARD uses a unit delay model so these values should be treated with caution.
+
+Simulation data file
+
+This particular simulation stimulus is not very informative. A better strategy
+is to arrange for the data on the input channel i to be externally defined. In
+the next example, a file contains the following set of test data (in a variety of
+number representations):
+
+1
+0x10
+022
+0b011101
+5
+
+The Makefile can be forced to generate a rule for running a simulation from
+this stimulus file. If the simulation is now run, the following output is pro-
+duced:
+
+3: chan ‘i’: writing 1
+15: chan ‘o’: reading 1
+16: chan ‘i’: writing 16
+28: chan ‘o’: reading 16
+29: chan ‘i’: writing 18
+41: chan ‘o’: reading 18
+42: chan ‘i’: writing 29
+54: chan ‘o’: reading 29
+55: chan ‘i’: writing 5
+67: chan ‘o’: reading 5
+Program terminated
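+
+The mapping from the stimulus file to the values written on channel i can be
+cross-checked with a short Python sketch. It is not part of the Balsa or LARD
+tools; the function name parse_stimulus is hypothetical, and the rule that a
+leading 0 means octal is an assumption inferred from the trace above, where 022
+is written as 18.
+
```python
def parse_stimulus(text):
    """Parse one integer per line, inferring the base from the prefix.

    Assumed conventions (inferred from the trace, not from a specification):
    0x -> hexadecimal, 0b -> binary, leading 0 -> octal, otherwise decimal.
    """
    values = []
    for line in text.splitlines():
        token = line.strip().lower()
        if not token:
            continue  # skip blank lines
        if token.startswith("0x"):
            values.append(int(token[2:], 16))
        elif token.startswith("0b"):
            values.append(int(token[2:], 2))
        elif len(token) > 1 and token.startswith("0"):
            values.append(int(token[1:], 8))
        else:
            values.append(int(token, 10))
    return values


STIMULUS = """1
0x10
022
0b011101
5
"""

if __name__ == "__main__":
    print(parse_stimulus(STIMULUS))
```
+
+The parsed values agree with the data reported on channel i in the trace
+(1, 16, 18, 29 and 5).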
+
+Channel viewer
+
+In the previous examples, the output of the simulation is textual, appearing
+on the standard output. LARD has a graphical interface which displays the
+handshakes and data values associated with the internal and external channels.
+Assuming the building of a test harness rule has been specified to balsa-md,
+the channel viewer can be invoked causing two windows to appear on the
+screen: the LARD interpreter control window and the channel viewer window
+itself.
+Starting the simulation will cause a trace of the various channels in the de-
+sign to appear in the channel view window. For each channel the request and
+acknowledge signals and data values are displayed.
+
+Figure 9.8.
+Channel viewer window.
+
+
+
+Chapter 10
+
+THE BALSA LANGUAGE
+
+In this chapter, a tutorial overview of the language is given together with
+several small designs which illustrate various aspects of the language.
+
+10.1. Data types
+
+Balsa is strongly typed with data types based on bit vectors. The results
+of expressions must be guaranteed to fit within the range of the underlying bit
+vector representation. Types are either anonymous or named. Type equivalence
+for anonymous types is checked on the basis of the size and properties of the
+type, whereas type equivalence for named types is checked against the point of
+declaration.
+There are two classes of anonymous types: numeric types which are de-
+clared with the bits keyword, and arrays of other types. Numeric types can
+be either signed or unsigned. Signedness has an effect on expression operators
+and casting. Only numeric types and arrays of other types may be used without
+first binding a name to those types. Balsa has three separate name spaces: one
+for procedure and function names, a second for variable and channel names
+and a third for type declarations.
+
+Numeric types
+
+Numeric types support the number range $[0, 2^{n}-1]$ for $n$-bit unsigned
+numbers or $[-2^{n-1}, 2^{n-1}-1]$ for $n$-bit signed numbers. Named numeric
+types are just aliases of the same range. An example of a numeric type
+declaration is:
+
+type word is 16 bits
+
+This defines a new type word which is unsigned (there is no unsigned key-
+word) covering the range $[0, 2^{16}-1]$. Alternatively, a signed type could have
+been declared as:
+
+type sword is 16 signed bits
+
+which defines a new type sword covering the range $[-2^{15}, 2^{15}-1]$.
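+
+The ranges quoted for the word and sword types can be checked numerically.
+The following Python sketch is only an illustration of the arithmetic; the
+helper name numeric_range is hypothetical and is not a Balsa facility.
+
```python
def numeric_range(bits, signed=False):
    """Return (lowest, highest) value representable in `bits` bits."""
    if bits < 1:
        raise ValueError("a numeric type needs at least one bit")
    if signed:
        # two's complement: [-2^(n-1), 2^(n-1) - 1]
        return (-(1 << (bits - 1)), (1 << (bits - 1)) - 1)
    # unsigned: [0, 2^n - 1]
    return (0, (1 << bits) - 1)


if __name__ == "__main__":
    print("word :", numeric_range(16))
    print("sword:", numeric_range(16, signed=True))
```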
+
+The only predefined type is bit. However the standard Balsa distribution
+comes with a set of library declarations for such types as byte, nibble,
+boolean and cardinal as well as the constants true and false.
+
+Enumerated types
+
+Enumerated types consist of named numeric values. The named values are
+given values starting at zero and incrementing by one from left to right. Ele-
+ments with explicit values reset the counter and many names can be given to
+the same value, for example:
+
+type Colour is enumeration
+    Black, Brown, Red, Orange, Yellow, Green, Blue,
+    Violet, Purple=Violet, Grey, Gray=Grey, White
+end
+
+The value of the Violet element of Colour is 7, as is Purple. Both Grey
+and Gray have value 8. The total number of elements is 10. An enumeration
+can be padded to a fixed size by use of the over keyword:
+
+type SillyExample is enumeration
+    e1=1, e2 over 4 bits
+end
+
+Here 2 bits are sufficient to specify the 3 possible values of the enumeration
+(0 is not bound to a name, e1 has the value 1 and e2 has the value 2). The over
+keyword ensures that the representation of the enumerated type is actually 4
+bits. Enumerated types must be bound to names by a type declaration before
+use.
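+
+The numbering rule described above (count from zero, with explicit values
+resetting the counter) can be modelled in a few lines of Python. This is a
+sketch of the rule as stated in the text, not of the Balsa compiler; the names
+enumerate_values and min_bits are hypothetical.
+
```python
def enumerate_values(elements):
    """Assign values to enumeration elements.

    `elements` is a list of (name, explicit) pairs, where `explicit` is None,
    an integer, or the name of an earlier element (as in Purple=Violet).
    """
    values = {}
    counter = 0
    for name, explicit in elements:
        if explicit is not None:
            # an explicit value resets the counter
            counter = values[explicit] if isinstance(explicit, str) else explicit
        values[name] = counter
        counter += 1
    return values


def min_bits(values):
    """Smallest number of bits able to hold the largest element value."""
    return max(1, max(values.values()).bit_length())


COLOUR = enumerate_values([
    ("Black", None), ("Brown", None), ("Red", None), ("Orange", None),
    ("Yellow", None), ("Green", None), ("Blue", None), ("Violet", None),
    ("Purple", "Violet"), ("Grey", None), ("Gray", "Grey"), ("White", None),
])
SILLY = enumerate_values([("e1", 1), ("e2", None)])

if __name__ == "__main__":
    print(COLOUR)
    print(SILLY, "needs", min_bits(SILLY), "bits before padding")
```
+
+Under these assumptions Violet and Purple both map to 7, Grey and Gray to 8,
+and SillyExample needs only 2 bits before the over keyword pads it to 4.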
+
+Constants
+
+Constant values can be defined in terms of an expression resolvable at com-
+pile time. Constants may be declared in terms of a predefined type otherwise
+they default to a numeric type. Examples are:
+
+constant minx = 5
+constant maxx = minx + 10
+constant hue = Red : Colour
+constant colour = Colour'Green    -- explicit enumeration element
+
+Record types
+
+Records are bit-wise compositions of named elements of possibly different
+(pre-declared) types with the first element occupying the least significant bit
+positions, e.g.:
+
+type Resistor is record
+    FirstBand, SecondBand, Multiplier : Colour;
+    Tolerance : ToleranceColour
+end
+
+Resistor has four elements: FirstBand, SecondBand and Multiplier of
+type Colour, and Tolerance of type ToleranceColour (both types must have
+been declared previously). FirstBand is the first element and so represents the
+least significant portion of the bit-wise value of a type Resistor. Selection of
+elements within the record structure is accomplished with the usual dot nota-
+tion. Thus if R15 is a variable of type Resistor, the value of its SecondBand
+can be extracted by R15.SecondBand. As with enumerations, record types can be
+padded using the over notation.
+
+Array types
+
+Arrays are numerically indexed compositions of same-typed values. An
+example of the declaration of an array type is:
+
+type RegBank_t is array 0..7 of byte
+
+This introduces a new type RegBank_t which is an array type of 8 elements
+indexed across the range [0, 7], each element being of type byte. The ordering
+of the range specifier is irrelevant; array 0..7 is equivalent to array 7..0.
+In general a single expression, expr, can be used to specify the array size: this
+is equivalent to a range of 0..expr-1. Anonymous array types are allowed in
+Balsa, so that variables can be declared as an array without first defining the
+array type:
+
+variable RegBank : array 0..7 of byte
+
+Arbitrary ranges of elements within an array can be accessed by an array
+slicing mechanism e.g. a[5..7] extracts elements a[5], a[6] and a[7]. As with all
+range specifiers, the ordering of the range is irrelevant. In general Balsa packs
+all composite typed structures in a least significant to most significant, left to
+right manner. Array slices always return values which are based at index 0.
+Arrays can be constructed by a tupling mechanism or by concatenation of
+other arrays of the same base type:
+
+variable a, b, c, d, e, f : byte
+variable z2 : array 2 of byte
+variable z4 : array 4 of byte
+variable z6 : array 6 of byte
+
+z4 := {a, b, c, d}            -- array construction
+z6 := z4 @ {e, f}             -- array concatenation
+z2 := (z4 @ {e, f}) [3..4]    -- element extraction by array slicing
+
+In the last example, the first element of z2 is set to d and the second element
+is set to e. Array slicing is useful for extracting arbitrary bitfields from other
+datatypes.
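+
+The construction, concatenation and slicing steps above can be mimicked with
+Python lists. This is only an analogy for the indexing behaviour; the helper
+slice_range and the byte values chosen for a to f are hypothetical.
+
```python
def slice_range(array, first, last):
    """Return elements first..last inclusive; the order of the bounds is irrelevant.

    The result is a fresh list, so it is always based at index 0.
    """
    low, high = sorted((first, last))
    return array[low:high + 1]


# Arbitrary example byte values standing in for the Balsa variables a..f.
a, b, c, d, e, f = 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F

z4 = [a, b, c, d]                     # z4 := {a, b, c, d}
z6 = z4 + [e, f]                      # z6 := z4 @ {e, f}
z2 = slice_range(z4 + [e, f], 3, 4)   # z2 := (z4 @ {e, f}) [3..4]

if __name__ == "__main__":
    print([hex(v) for v in z2])
```
+
+As in the Balsa example, the first element of z2 is d and the second is e.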
+
+Arrayed channels
+
+Channels may be arrayed, that is they may consist of several distinct chan-
+nels which can be referred to by a numeric or enumerated index. This is similar
+to the way in which variables can have an array type except that each channel
+is distinct for the purposes of handshaking and each indexed channel has no
+relationship to the other channels in the array other than the single name they
+share. The syntax for arrayed channels is different from that of array typed
+variables making it easier to disambiguate arrays from arrayed channels. As
+an example:
+
+array 4 of channel XYZ : array 4 of byte
+
+declares 4 channels, XYZ[0] to XYZ[3]; each channel carries the 32-bit wide type
+array 0..3 of byte. An example of the use of arrayed channels is shown
+in section 9.4.4 on page 165.
+
+10.2. Data typing issues
+
+As stated previously, Balsa is strongly typed: the left and right sides of as-
+signments are expected to have the same type. The only form of implicit type-
+casting is the promotion of numeric literals and constants to a wider numeric
+type. In particular, care must be taken to ensure that the result of an arithmetic
+operation will always be compatible with the declared result type. Consider
+the assignment statement x := x + 1. This is not a valid Balsa statement be-
+cause potentially the result is one bit wider than the width of the variable x. If
+the carry-out from the addition is to be ignored, the user must explicitly force
+the truncation by means of a cast.
+
+Casts
+
+If the variable x was declared as 32 bits, the correct form of the assignment
+above is:
+
+x := (x + 1 as 32 bits)
+
+The keyword as indicates the cast operation. The parentheses are a neces-
+sary part of the syntax. If the carry out of the addition of two 32-bit numbers
+is required, a record type can be used to hold the composite result:
+
+type AddResult is record
+    Result : 32 bits;
+    Carry : bit;
+end
+variable r : AddResult
+
+r := (a + b as AddResult)
+
+The expression r.Carry accesses the required carry bit, r.Result yields
+the 32-bit addition result.
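+
+The effect of casting the 33-bit sum onto the AddResult record can be
+sketched in Python, assuming (as stated for records earlier) that the first
+element, Result, occupies the least significant bits. The function name
+add_with_carry is hypothetical.
+
```python
MASK32 = (1 << 32) - 1


def add_with_carry(a, b):
    """Split the 33-bit sum of two 32-bit operands into Result and Carry."""
    total = (a & MASK32) + (b & MASK32)   # at most 33 bits wide
    return {
        "Result": total & MASK32,         # low 32 bits (first record element)
        "Carry": (total >> 32) & 1,       # bit 32 (second record element)
    }


if __name__ == "__main__":
    print(add_with_carry(0xFFFFFFFF, 1))
    print(add_with_carry(1, 2))
```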
+Casts are required when extracting bit fields. Here is an example from the
+instruction decoder of a simple microprocessor. The bottom 5 bits of the 16-bit
+instruction word contain a 5-bit signed immediate. It is required to extract the
+immediate field and sign-extend it to 16 bits:
+
+type Word is 16 signed bits
+type Imm5 is 5 signed bits
+
+variable Instr : 16 bits -- bottom 5 bits contain an immediate
+variable Imm16 : Word
+
+Imm16 := (((Instr as array 16 of bit) [0..4] as Imm5) as Word)
+
+First, the instruction word, Instr, is cast into an array of bits from which
+an arbitrary sub-range can be extracted:
+
+(Instr as array 16 of bit)
+
+Next the bottom (least significant) 5 bits must be extracted:
+
+(Instr as array 16 of bit) [0..4]
+
+The extracted 5 bits must now be cast back into a 5-bit signed number:
+
+((Instr as array 16 of bit) [0..4] as Imm5)
+
+The 5-bit signed number is then sign extended to the 16-bit immediate
+value:
+
+(((Instr as array 16 of bit) [0..4] as Imm5) as Word)
+
+The double cast is required because a straightforward cast from 5 bits to
+the variable Imm16 of type Word would have merely zero filled the topmost bit
+positions even though Word is a signed type. However, a cast from a signed
+numeric type to another (wider) signed numeric type will sign extend the nar-
+rower value into the width of the wider target type.
+Extracting bits from a field is a fairly common operation in many hardware
+designs. In general, the original datatype has to be cast into an array first before
+bitfield extraction. The smash operator “#” provides a convenient shorthand for
+casting an object into an array of bits. Thus the sign extension example above
+is more simply written:
+
+((#Instr [0..4] as Imm5) as Word)
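+
+The extract-then-sign-extend sequence can be checked with a small Python
+sketch. It models only the arithmetic of the casts; the helper name
+sign_extend_low_bits and the sample instruction words are hypothetical.
+
```python
def sign_extend_low_bits(value, width):
    """Take the bottom `width` bits of `value` and interpret them as signed."""
    field = value & ((1 << width) - 1)    # (#Instr [0..4]): keep the low bits
    sign_bit = 1 << (width - 1)
    return (field ^ sign_bit) - sign_bit  # two's complement sign extension


if __name__ == "__main__":
    for instr in (0b1010000000011111, 0x000F, 0x0010):
        imm = sign_extend_low_bits(instr, 5)
        # show the signed value and its 16-bit two's complement pattern
        print(f"{instr:#06x} -> {imm:4d} ({imm & 0xFFFF:#06x})")
```
+
+A plain zero-filling cast would instead leave the upper 11 bits at zero, which
+is why the intermediate cast to a signed 5-bit type is needed.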
+
+Table 10.1. Balsa commands.
+
+Command                         Notes
+
+sync                            control only (dataless) handshake
+<-                              handshake data transfer from an expression to an output port
+->                              handshake data transfer to a variable from an input port
+:=                              assigns a value to a variable
+;                               sequence operator
+||                              parallel composition operator
+continue                        a no-op
+halt                            causes deadlock (useful in simulation)
+loop ... end                    repeat forever
+while ... else ... end          conditional loop
+for ... end                     structural (not temporal) iteration
+if ... then ... else ... end    conditional execution, may have multiple guarded commands
+case ... end                    conditional execution based on constant expressions
+select                          non-arbitrated choice operator
+arbitrate                       arbitrated choice operator
+print                           compile time printing of diagnostics
+
+Auto-assignment
+
+Statements of the form:
+
+x := f(x)
+
+are allowed in Balsa. However, the implementation generates an auxiliary vari-
+able which is then assigned back to the variable visible to the programmer –
+the variable is enclosed within a single handshake and cannot be read from
+and written to simultaneously. Since auto-assignment generates twice as many
+variables as might be suspected, it is probably better practice to avoid the auto-
+assignment, explicitly introduce the extra variable and then rewrite the program
+to hide the sequential update thereby avoiding any time penalty. An example
+of this approach is given in count10b on page 184.
+
+10.3. Control flow and commands
+
+Balsa’s command set is listed in table 10.1.
+
+Dataless handshakes
+
+sync <Channel> – awaits a handshake on the named channel. Circuit action
+does not proceed until the handshake is completed.
+
+Channel communications
+
+Data can be transferred between a variable and a channel, between channels
+or from a channel to a command code block as shown below:
+
+<Channel> <- <Variable> – transfers data from a variable to the named channel.
+This may either be an internal channel local to a procedure or an output port
+listed in the procedure declaration.
+
+<Channel> -> <Variable> – transfers data from the named channel to a
+variable. The channel may either be an internal channel local to a procedure or
+an input port listed in the procedure declaration.
+
+<Channel1> -> <Channel2> – transfers data between channels.
+
+<Channel> -> then <Command> – allows the data to be accessed throughout the
+command block. However, the handshake on the channel is not completed and
+thus the data is not released until the command block itself has terminated.
+
+Variable assignment
+
+<Variable> := <Expression> – transfers the result of an expression into a vari-
+able. The result type of the expression and that of the variable must agree.
+
+Sequential composition
+
+<Command1> ; <Command2> – the two commands execute sequentially. The first
+must terminate before the second commences.
+
+Parallel composition
+
+<Command1> || <Command2> – composes two commands such that they oper-
+ate concurrently and independently. Both commands must complete before
+the circuit action proceeds. Beware of inadvertently introducing dependencies
+between the two commands so that neither can proceed until the other has com-
+pleted. The “||” operator binds tighter than “;”. If that is not what is intended,
+then commands may be grouped in blocks as shown below:
+
+[ <Command1> ; <Command2> ] || <Command3>
+
+Note the use of square brackets to group commands rather than parentheses.
+Alternatively, the keywords begin ... end may be used and are mandatory if
+variables local to a block are to be declared.
+
+Continue and halt commands
+
+continue is effectively a no-op.
+The command halt causes a process
+thread to deadlock.
+
+Looping constructs
+
+The loop command causes an infinite repetition of a block of code. Finite
+loops may be constructed using the while construct. The simplest example of
+its use is:
+
+while <Condition> then <Command> end
+
+However, multiple guards are allowed, so a more general form of the con-
+struct is:
+
+while
+    <Condition1> then <Command1>
+  | <Condition2> then <Command2>
+  | <Condition3> then <Command3>
+else
+    <Command4>
+end
+
+The ability of the while construct to take an else clause is a minor conve-
+nience. The code sequence above could have been written, with only a small
+difference in the resultant handshake circuit implementation, as:
+
+while
+    <Condition1> then <Command1>
+  | <Condition2> then <Command2>
+  | <Condition3> then <Command3>
+end;
+<Command4>
+
+If more than one guard is satisfied, the particular command that is executed
+is unspecified.
+
+Structural iteration
+
+Balsa has a structural looping construct. In many programming languages
+it is a matter of convenience or style as to whether a loop is written in terms of
+a for loop or a while loop. This is not so in Balsa. The for loop is similar
+to VHDL’s for ... generate command and is used for iteratively laying out
+repetitive structures. An example of its use was given earlier in section 9.4.4 on
+page 165. An illustration of the inappropriate use of the for command is given
+in the example count10e on page 189. Structures may be iteratively instantiated
+to operate either sequentially or concurrently with one another depending on
+whether for ; or for || is employed.
+
+Conditional execution
+
+Balsa has two constructs to achieve conditional execution. Balsa’s case
+statement is similar to that found in conventional programming languages. A
+single guard may match more than one value of the guard expression.
+
+case x+y of
+    1 .. 4, 11 then o <- x
+  | 5 .. 10 then o <- y
+else o <- z
+end
+
+An if ... then ... else statement allows conditional execution based on
+the evaluation of expressions at run-time. Its syntax is similar to that of the
+while loop. Note the sequencing implicit in nested if statements, such as that
+shown below:
+
+if <Condition1> then
+    <Command1>
+else
+    if <Condition2> then
+        <Command2>
+    end
+end
+
+The test for Condition2 is made after the test for Condition1. If it is
+known that the two conditions are mutually exclusive, the expression may be
+written:
+
+if <Condition1> then <Command1>
+ | <Condition2> then <Command2>
+end
+
+The “|” separator causes Condition1 and Condition2 to be evaluated in
+parallel. The result is undefined if more than one guard (condition) is satisfied.
+
+10.4. Binary/unary operators
+
+Balsa’s binary/unary operators are shown in order of decreasing precedence
+in table 10.2.
+
+10.5. Program structure
+
+File structure
+
+A typical design will consist of several files containing procedure/type/cons-
+tant declarations which come together in a top-level procedure that constitutes
+the overall design. This top-level procedure would typically be at the end of
+a file which imports all the other relevant design files. This importing feature
+forms a simple but effective way of allowing component reuse and maps sim-
+ply onto the notion of the imported procedures being either pre-compiled hand-
+shake circuits or existing (possibly hand crafted) library components. Declara-
+tions have a syntactically defined order (left to right, top to bottom) with each
+declaration having its scope defined from the point of declaration to the end of
+
+the current (or importing) file. Thus Balsa has the same simple “declare before
+use” rule of C and Modula, though without any facility for prototypes.
+
+Table 10.2. Balsa binary/unary operators.
+
+Symbol        Operation            Valid types      Notes
+
+.             record indexing      record
+#             smash                any              takes value from any type and reduces it
+                                                    to an array of bits
+[]            array indexing       array            non-constant index possible (can generate
+                                                    lots of hardware)
+not, log,     unary operators      numeric          log only works on constants and returns
+- (unary)                                           the ceiling: e.g. log 15 returns 4
+                                                    “-” returns a result 1 bit wider than the
+                                                    argument
+^             exponentiation       numeric
+*, /, %       multiply, divide,    numeric          only applicable to constants
+              remainder
++, -          add, subtract        numeric          results are 1 or 2 bits longer than
+                                                    the largest argument
+@             concatenation        arrays
+<, >, <=, >=  inequalities         numeric,
+                                   enumerations
+=, /=         equals,              all types        comparison is by sign extended
+              not equals                            value for signed numeric types
+and           bitwise AND          numeric          Balsa uses type 1 bits for if/while
+                                                    guards so bitwise and logical operators
+                                                    are the same.
+or, xor       bitwise OR, XOR      numeric
+
+Declarations
+
+Declarations introduce new type, constant or procedure names into the global
+name spaces from the point of declaration until the end of the enclosing block
+(or file in the case of top-level declarations). There are three disjoint name
+spaces: one for types, one for procedures and a third for all other declarations.
+At the top level, only constants are in this last category. However, variables and
+channels may be included in procedure local declarations. Where a declaration
+within an enclosed/inner block has the same name as one previously made in
+an outer/enclosing context, the local declaration will hide the outer declaration
+for the remainder of that inner block.
+
+Procedures
+
+Procedures form the bulk of a Balsa description. Each procedure has a name,
+a set of ports and an accompanying behavioural description. The sync key-
+word introduces dataless channels. Both dataless and data bearing channels can
+be members of “arrayed channels”. Arrayed channels allow numeric/enumer-
+ated indexing of otherwise functionally separate channels. Procedures can also
+carry a list of local declarations which may include other procedures, types and
+constants.
+
+Shared procedures
+
+Normally, each call to a procedure generates separate hardware to instan-
+tiate that procedure. A procedure may be shared, in which case calls to that
+procedure access common hardware thereby avoiding duplication of the cir-
+cuit at the cost of some multiplexing to allow sharing to occur. The use of
+shared procedures is discussed further on page 187.
+
+Functions
+
+In many programming languages, functions can be thought of as procedures
+without side effects that return a result. However, in Balsa there is a fundamen-
+tal difference between functions and procedures. Parameters to a procedure
+define handshaking channels that interface to the circuit block defined by the
+procedure. Function parameters, on the other hand, are just expression aliases
+returning values. An example of the use of function definitions can be found
+in the arbiter tree design on page 202.
+
+10.6. Example circuits
+
+In this section, various designs of counter are described in Balsa. In flavour,
+they resemble the specifications of conventional synchronous counters, since
+these designs are more familiar to newcomers to asynchronous systems. More
+sophisticated systolic counters, better suited to an asynchronous approach, are
+described by van Berkel [14].
+In this design, the role of the clock which updates the state of the counter is
+taken by a dataless sync channel, named aclk. The counter issues a handshake
+request over the sync channel, the environment responds with an acknowledge-
+ment completing the handshake and the counter state is updated.
+
+A modulo-16 counter
+
+-- count16a.balsa: modulo 16 counter
+import [balsa.types.basic]
+
+procedure count16 (sync aclk; output count : nibble) is
+    variable count_reg : nibble
+begin
+    loop
+        sync aclk ;
+        count <- count_reg ;
+        count_reg := (count_reg + 1 as nibble)
+    end
+end
+
+This counter interfaces to its environment by means of two channels: the
+dataless aclk channel and the output channel count which carries the current
+count value. The internal register which implements the variable count_reg
+and the output channel are of the predefined type nibble (4-bits wide). After
+count_reg is incremented, the result must be cast back to type nibble. For
+simplicity, issues of initialisation/reset have been ignored. A LARD simulation
+of this circuit will give a harmless warning when uninitialised variables are
+accessed.
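+
+The sequence of values the counter places on the count channel can be
+previewed with a behavioural Python sketch. It models only the data values,
+not handshaking or timing, and it assumes the register starts at 0, whereas the
+Balsa description leaves count_reg uninitialised. The function name count16 and
+its parameters are hypothetical.
+
```python
def count16(handshakes, initial=0):
    """Return the values output over `handshakes` activations of aclk."""
    count_reg = initial & 0xF                # nibble: 4 bits
    outputs = []
    for _ in range(handshakes):              # one iteration per sync aclk
        outputs.append(count_reg)            # count <- count_reg
        count_reg = (count_reg + 1) & 0xF    # (count_reg + 1 as nibble)
    return outputs


if __name__ == "__main__":
    print(count16(18))
```
+
+The mask plays the role of the cast to nibble: the carry out of the 4-bit
+addition is discarded, so the count wraps from 15 back to 0.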
+
+Removing auto-assignment
+
+The auto-assignment statement in the example above, although concise and
+expressive, hides the fact that, in most back-ends, an auxiliary variable is cre-
+ated so that the update can be carried out in a race-free manner. By making this
+auxiliary variable explicit, advantage may be taken of its visibility to overlap
+its update with other activity as shown in the example below.
+
+-- count16b.balsa: write-back overlaps output assignment
+import [balsa.types.basic]
+
+procedure count16 (sync aclk; output count : nibble) is
+    variable count_reg, tmp : nibble
+begin
+    loop
+        sync aclk;
+        tmp := (count_reg + 1 as nibble) ||
+        count <- count_reg;
+        count_reg := tmp
+    end
+end
+
+In this example, the transfer of the count register to the output channel is
+overlapped with the incrementing of the auxiliary shadow register. There is
+some slight area overhead involved in parallelisation and any potential speed-
+up may be minimal in this case, but the principle of making trade-offs at the
+level of the source code is illustrated.
+
+A modulo-10 counter
+
+The basic counter description above can easily be modified to produce a
+modulo-10 counter. A simple test is required to detect when the internal regis-
+ter reaches its maximum value and then to reset it to zero.
+
+-- count10a.balsa: an asynchronous decade counter
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 9
+
+procedure count10(sync aclk; output count: C_size) is
+    variable count_reg : C_size
+    variable tmp : C_size
+begin
+    loop
+        sync aclk;
+        if count_reg /= max_count then
+            tmp := (count_reg + 1 as C_size)
+        else
+            tmp := 0
+        end || count <- count_reg ;
+        count_reg := tmp
+    end -- loop
+end -- begin
+
+A loadable up/down decade counter
+
+This example describes a loadable up/down decade counter. It introduces
+many of the language features discussed earlier in the chapter. The counter
+requires two control bits, one to determine the direction of count, and the other
+to determine whether the counter should load or increment/decrement on the next
+operation. There are several valid design options; in this example, count10b
+below, the control bits and the data to be loaded are bundled together in a
+single channel, in_sigs.
+
+-- count10b.balsa: an asynchronous up/down decade counter
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 9
+
+type dir is enumeration down, up end
+type mode is enumeration load, count end
+type In_bundle is record
+    data : C_size ;
+    mode : mode;
+    dir : dir
+end
+
+procedure updown10 (
+    input in_sigs: In_bundle;
+    output count: C_size
+) is
+    variable count_reg : C_size
+    variable tmp : In_bundle
+begin
+    loop
+        in_sigs -> tmp; -- read control+data bundle
+        if tmp.mode = count then
+            case tmp.dir of
+                down then -- counting down
+                    if count_reg /= 0 then
+                        tmp.data := (count_reg - 1 as C_size)
+                    else
+                        tmp.data := max_count
+                    end -- if
+              | up then -- counting up
+                    if count_reg /= max_count then
+                        tmp.data := (count_reg + 1 as C_size)
+                    else
+                        tmp.data := 0
+                    end -- if
+            end -- case tmp.dir
+        end; -- if
+        count <- tmp.data || count_reg:= tmp.data
+    end -- loop
+end
+
+The example above illustrates the use of if ... then ... else and case
+control constructs as well as the use of record structures and enumerated types.
+The use of symbolic values within enumerated types makes the code more
+readable. Test harnesses which can be generated automatically by the Balsa
+system can also read the symbolic enumerated values. For example, here is a
+test file which initialises the counter to 8, counts up, testing that the counter
+wraps round to zero and then counts down allowing the user to check that the
+counter correctly wraps to 9.
+
+{8, load, up}       load counter with 8
+{0, count, up}      count to 9
+{0, count, up}      count & wrap to 0
+{0, count, up}      count to 1
+{0, count, down}    count down to 0
+{0, count, down}    count down to 9
+{1, load, down}     load counter with 1
+{0, count, down}    count down to 0
+{0, count, down}    count down & wrap to 9
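+
+The expected outputs for this test file can be worked out with a behavioural
+Python model of count10b. This is a sketch of the data flow only (no
+handshaking or timing); the function name updown10, the tuple encoding of the
+stimulus and the initial register value of 0 are assumptions made for the
+illustration.
+
```python
MAX_COUNT = 9


def updown10(stimulus, initial=0):
    """Apply (data, mode, direction) bundles and return the count outputs."""
    count_reg = initial
    outputs = []
    for data, mode, direction in stimulus:    # in_sigs -> tmp
        if mode == "count":
            if direction == "down":
                data = count_reg - 1 if count_reg != 0 else MAX_COUNT
            else:  # "up"
                data = count_reg + 1 if count_reg != MAX_COUNT else 0
        # in "load" mode the data field passes through unchanged
        outputs.append(data)                  # count <- tmp.data
        count_reg = data                      # count_reg := tmp.data
    return outputs


STIMULUS = [
    (8, "load", "up"),
    (0, "count", "up"),
    (0, "count", "up"),
    (0, "count", "up"),
    (0, "count", "down"),
    (0, "count", "down"),
    (1, "load", "down"),
    (0, "count", "down"),
    (0, "count", "down"),
]

if __name__ == "__main__":
    print(updown10(STIMULUS))
```
+
+Under these assumptions the model yields 8, 9, 0, 1, 0, 9, 1, 0, 9, which
+matches the annotations beside each line of the test file.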
+
+Sharing hardware
+
+In Balsa, every statement instantiates hardware in the resulting circuit. It
+is therefore worth examining descriptions to see if there are any repeated con-
+structs that could either be moved to a common point in the code or replaced by
+shared procedures. In count10b above, the description instantiates two adders:
+one used for incrementing and the other for decrementing. Since these two
+units are not used concurrently, area can be saved by sharing a single adder
+(which adds either +1 or -1 depending on the direction of count) described by
+a shared procedure. The code below illustrates how count10b can be rewritten
+to use a shared procedure. The shared procedure add_sub computes the next
+count value by adding the current count value to a variable, inc, which can
+take values of +1 or -1. Note that to accommodate these values, inc must be
+declared as 2 signed bits.
+The area advantage of the approach is shown by examining the cost of the
+circuit reported by breeze-cost: count10b has a cost of 2141 units, whereas the
+shared procedure version has a cost of only 1760. The relative advantage, of
+course, becomes more pronounced as the size of the counter increases.
+
+-- count10c.balsa: introducing shared procedures
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 9
+
+type dir is enumeration down, up end
+type mode is enumeration load, count end
+type inc is 2 signed bits
+
+type In_bundle is record
+    data : C_size ;
+    mode : mode;
+    dir : dir
+end
+
+procedure updown10 (
+    input in_sigs: In_bundle;
+    output count: C_size
+) is
+    variable count_reg : C_size
+    variable tmp : In_bundle
+    variable inc : inc
+
+    shared add_sub is
+    begin
+        tmp.data:= (count_reg + inc as C_size)
+    end
+
+begin
+    loop
+        in_sigs -> tmp; -- read control+data bundle
+        if tmp.mode = count then
+            case tmp.dir of
+                down then -- counting down
+                    if count_reg /= 0 then
+                        inc:= -1;
+                        add_sub()
+                    else
+                        tmp.data := max_count
+                    end -- if
+              | up then -- counting up
+                    if count_reg /= max_count then
+                        inc := +1;
+                        add_sub()
+                    else
+                        tmp.data := 0
+                    end -- if
+            end -- case tmp.dir
+        end; -- if
+        count <- tmp.data || count_reg:= tmp.data
+    end -- loop
+end
+
+In order to guarantee the correctness of implementations, there are a number
+of minor restrictions on the use of shared procedures:
+
+shared procedures cannot have any arguments;
+
+shared procedures cannot use local channels;
+
+shared procedures using elements of the channel referenced by a select
+statement (see section 10.7 on page 190) must be declared as local within
+the body of that select block.
+
+“while” loop description
+
+An alternative description of the modulo-10 counter employs the while con-
+struct:
+
+-- count10d.balsa: mod-10 counter alternative implementation
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 10
+
+procedure count10(sync aclk; output count : C_size) is
+    variable count_reg : C_size
+begin
+    loop
+        while count_reg < max_count then
+            sync aclk;
+            count <- count_reg;
+            count_reg:= (count_reg + 1 as C_size)
+        end; -- while
+        count_reg:= 0
+    end -- loop
+end
+
+Structural “for” loops
+
+for loops are a potential pitfall for beginners to Balsa. In many program-
+ming languages, while loops and for loops can be used interchangeably. This
+is not the case in Balsa: a for loop implements structural iteration, in other
+words, separate hardware is instantiated for each pass through the loop. The
+following description, which superficially appears very similar to the while
+loop example of count10d previously, appears to be correct: it compiles with-
+out problems and a LARD simulation appears to give the correct behaviour.
+However, examination of the cost reveals an area cost of 11577, a large in-
+crease. It is important to understand why this is the case. The for loop is un-
+rolled at compile time and 10 instances of the circuit to increment the counter
+are created. Each instance of the loop is activated sequentially. The PostScript
+plot of the handshake circuit graph is rather unreadable; setting max_count to
+3 produces a more readable plot.
+
+-- count10e.balsa: beware the "for" construct
+import [balsa.types.basic]
+
+type C_size is nibble
+constant max_count = 10
+
+procedure count10(sync aclk; output count: C_size) is
+  variable count_reg : C_size
+begin
+  loop
+    for ; i in 1 .. max_count then
+      sync aclk;
+      count <- count_reg;
+      count_reg:= (count_reg + 1 as C_size)
+    end; -- for ; i
+    count_reg:= 0
+  end -- loop
+end -- begin
+
+If, instead of using the sequential for construct, the parallel for construct
+had been employed (for || ...), the compiler would give an error message
+complaining about read/write conflicts from parallel threads. In this case, all
+instances of the counter circuit would attempt to update the counter register
+at the same time, leading to possible conflicts. A reader who understands the
+resulting potential handshake circuit is well on the way to a good understanding
+of the methodology.
+
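+The difference between behavioural and structural iteration can be pictured
+by counting how many copies of the loop body are instantiated. The Python
+below is only an analogy, under the assumption that each body copy stands for
+one incrementer circuit; the helper names instantiate_while and
+instantiate_for are invented for this sketch.
+
```python
def instantiate_while(body):
    """A while loop reuses a single copy of its body hardware on every pass."""
    return [body]


def instantiate_for(body, low, high):
    """A sequential `for ;` loop is unrolled: one body copy per index value."""
    return [f"{body}  -- copy for i = {i}" for i in range(low, high + 1)]


BODY = "sync aclk; count <- count_reg; count_reg := (count_reg + 1 as C_size)"

while_hw = instantiate_while(BODY)       # count10d style
for_hw = instantiate_for(BODY, 1, 10)    # count10e style, i in 1 .. max_count

# Report how many incrementer copies each description would create.
print(len(while_hw), len(for_hw))
```
+
+Setting the upper bound to 3 instead of 10 shrinks the unrolled list in the
+same way that it makes the handshake circuit plot more readable.
+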
+10.7.
+Selecting channels
+
+The asynchronous circuit described below merges two input channels into
+a single output channel; it may be thought of as a self-selecting multiplexer.
+The select statement chooses between the two input channels a and b by
+waiting for data on either channel to arrive. When a handshake on either a or b
+commences, data is held valid on the input, and the handshake is not completed,
+until the end of the select ... end block.
+This circuit is an example of handshake enclosure and avoids the need for an
+internal latch to be created to store the data from the input channel; a possible
+disadvantage is that, because of the delayed completion of the handshake, the
+input is not released immediately to continue processing independently. In this
+example, data is transferred to the output channel and the input handshake will
+complete as soon as data has been removed from the output channel. An exam-
+ple of a more extended enclosure can be found in the code for the population
+counter on page 197.
+
+-- mux1.balsa: unbuffered Merge
+import [balsa.types.basic]
+
+procedure mux (input a, b :byte; output c :byte) is
+begin
+  loop
+    select a then c <- a -- channel behaves like a variable
+    | b then c <- b -- ditto
+    end -- select
+  end -- loop
+end
+
+Because of the enclosed nature of the handshake associated with select,
+inputs a and b should be mutually exclusive for the duration of the block of
+code enclosed by the select. In many cases, this is not a difficult obligation
+to satisfy. However, if a and b are truly independent, select can be replaced
+by arbitrate which allows an arbitrated choice to be made. Arbiters are rel-
+atively expensive in terms of speed and may not be possible to implement in
+some technologies. Further, the loss of determinism in circuits with arbitra-
+tion can also introduce testing and design correctness verification problems.
+Designers should therefore not use arbiters unnecessarily.
+
+
+-- mux2.balsa: unbuffered arbitrated Merge.
+import [balsa.types.basic]
+
+procedure mux (input a, b :byte; output c :byte) is
+begin
+  loop
+    arbitrate a then c <- a -- channel behaves like a variable
+    | b then c <- b -- ditto
+    end -- arbitrate
+  end -- loop
+end
+
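+The contrast between select and arbitrate can be sketched in software. The
+Python below is a rough behavioural analogy, not a description of the
+handshake circuits: None stands for an idle channel, and the function names
+select_merge and arbitrate_merge are assumptions of this sketch.
+
```python
import random


def select_merge(a, b):
    """Model of `select`: the inputs must be mutually exclusive (None = idle)."""
    if a is not None and b is not None:
        raise RuntimeError("select requires mutually exclusive inputs")
    return a if a is not None else b


def arbitrate_merge(a, b, rng=random):
    """Model of `arbitrate`: when both inputs are pending, either may win."""
    if a is not None and b is not None:
        return rng.choice([a, b])  # nondeterministic, like a hardware arbiter
    return a if a is not None else b
```
+
+The nondeterministic choice in arbitrate_merge is the software counterpart of
+the loss of determinism that makes arbitrated circuits harder to test.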
+
+
+Chapter 11
+
+BUILDING LIBRARY COMPONENTS
+
+11.1.
+Parameterised descriptions
+
+Parameterised procedures allow designers to develop a library of commonly
+used components and then to instantiate those structures later with varying
+parameters. A simple example is the specification of a buffer as a library part
+without knowing the width of the buffer. Similarly, a pipeline of buffers can be
+defined in the library without requiring any knowledge of the depth chosen for
+the pipeline when it is instantiated.
+
+11.1.1
+A variable width buffer definition
+
+The example pbuffer1 below defines a single place buffer with a parame-
+terised width.
+
+-- pbuffer1.balsa - parameterised buffer example
+import [balsa.types.basic]
+
+procedure Buffer (
+  parameter X : type ;
+  input i : X; output o : X
+) is
+  variable x : X
+begin
+  loop
+    i -> x ;
+    o <- x
+  end
+end
+
+-- now define a byte wide buffer
+procedure Buffer8 is Buffer (byte)
+
+-- now use the definition
+procedure test1 (input a : byte ; output b : byte) is
+begin
+  Buffer8 (a, b)
+end
+
+-- alternatively
+procedure test2 (input a : byte ; output b : byte) is
+begin
+  Buffer (byte, a, b)
+end
+
+The definition of the single place buffer given earlier on page 161 is modi-
+fied by the addition of the parameter declaration which defines X to be of type
+type. In other words X is identified as being a type to be refined later. Once a
+parameter type has been declared, it can be used in later declarations and state-
+ments: for example, input channel i is defined as being of type X. No hardware
+is generated for the parameterised procedure definition itself.
+Having defined the procedure, it can be used in other procedure definitions.
+Buffer8 defines a byte wide buffer that can be instantiated as required as
+shown, for example, in procedure test1. Alternatively, a concrete realisation
+of the parameterised procedure can be used directly as shown in procedure
+test2.
+
+11.1.2
+Pipelines of variable width and depth
+
+The next example illustrates how multiple parameters to a procedure may
+be specified. The parameterised buffer element is included in a pipeline whose
+depth is also parameterised.
+
+-- pbuffer2.balsa - parameterised pipeline example
+import [balsa.types.basic]
+import [pbuffer1]
+
+-- BufferN: an n-place parameterised, variable width buffer
+procedure BufferN (
+  parameter n : cardinal ;
+  parameter X : type ;
+  input i : X ;
+  output o : X
+) is
+begin
+  if n = 1 then -- single place pipeline
+    Buffer(X, i, o)
+  | n >= 2 then -- parallel evaluation
+    local array 1 .. n-1 of channel c : X
+    begin
+      Buffer(X, i, c[1]) || -- first buffer
+      Buffer(X, c[n-1], o) || -- last buffer
+      for || i in 1 .. n-2 then
+        Buffer(X, c[i], c[i+1])
+      end -- for || i
+    end
+  else print error, "zero length pipeline specified"
+  end -- if
+end
+
+-- Now define a 4 deep, byte-wide pipeline.
+procedure Buffer4 is BufferN(4, byte)
+
+Buffer is the single place parameterised width buffer of the previous exam-
+ple and this is reused by means of the library statement import[pbuffer1].
+In this code, BufferN is defined in a very similar manner to the example in sec-
+tion 9.4.4 on page 165 except that the number of stages in the pipeline, n, is
+not a constant but is a parameter to the definition of type cardinal. Note that
+this definition includes some error checking. If an attempt is made to build a
+zero length pipeline during a definition, an error message is printed.
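+
+The structure that BufferN elaborates to can be listed with a short software
+sketch. The Python below is an illustration, not compiler output: the function
+name elaborate_buffer_n and the use of strings for channel names are
+assumptions. It returns the (input, output) channel pair of every Buffer
+instance, following the three branches of the if statement.
+
```python
def elaborate_buffer_n(n):
    """List the (input, output) channels of each Buffer instance in BufferN."""
    if n < 1:
        raise ValueError("zero length pipeline specified")
    if n == 1:
        return [("i", "o")]  # single place pipeline
    stages = [("i", "c[1]")]  # first buffer
    # for || i in 1 .. n-2: the interior buffers
    stages += [(f"c[{k}]", f"c[{k + 1}]") for k in range(1, n - 1)]
    stages.append((f"c[{n - 1}]", "o"))  # last buffer
    return stages


for stage in elaborate_buffer_n(4):
    print(stage)
```
+
+For n = 4 the list has four entries joined by the three internal channels
+c[1] to c[3], which is the shape of Buffer4.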
+
+11.2.
+Recursive definitions
+
+Balsa allows a form of recursion in definitions (as long as the resulting struc-
+tures can be statically determined at compile time). Many structures can be
+described elegantly using this technique which forms a natural extension to
+the powerful parameterisation mechanism. The remainder of this chapter il-
+lustrates recursive parameterisation by means of some interesting (and useful)
+examples.
+
+11.2.1
+An n-way multiplexer
+
+An n-way multiplexer can be constructed from a tree of 2-way multiplexers.
+A recursive definition suggests itself as the natural specification technique: an
+n-way multiplexer can be split into two n/2-way multiplexers connected by
+internal channels to a 2-way multiplexer as shown in figure 11.1 on page 196.
+
+[Figure 11.1. Decomposition of an n-way multiplexer. Before decomposition: a
+single multiplexer with inputs inp[0] .. inp[n-1] and output out. After
+decomposition: two half-size multiplexers, with inputs inp[0] .. inp[n/2-1]
+and inp[n/2] .. inp[n-1] and outputs out0 and out1, feeding a 2-way
+multiplexer that drives out.]
+
+-- Pmux1.balsa: A recursive parameterised MUX definition
+import [balsa.types.basic]
+
+procedure PMux (
+  parameter X : type;
+  parameter n : cardinal;
+  array n of input inp : X; -- note use of arrayed port
+  output out : X
+) is
+begin
+  if n = 0 then print error,"Parameter n should not be zero"
+  | n = 1 then
+    loop
+      select inp[0] then
+        out <- inp[0]
+      end -- select
+    end -- loop
+  | n = 2 then
+    loop
+      select inp[0] then
+        out <- inp[0]
+      | inp[1] then
+        out <- inp[1]
+      end -- select
+    end -- loop
+  else
+    local -- local block with local definitions
+      channel out0, out1 : X
+      constant mid = n/2
+    begin
+      PMux (X, mid, inp[0..mid-1], out0) ||
+      PMux (X, n-mid, inp[mid..n-1], out1) ||
+      PMux (X, 2, {out0,out1},out)
+    end
+  end -- if
+end
+
+-- Here is a 5-way multiplexer
+procedure PMux5Byte is PMux(byte, 5)
+
+Commentary on the code
+
+The multiplexer is parameterised in terms of the type of the inputs and the
+number of channels n. The code is straightforward. A multiplexer of size
+greater than 2 is decomposed into two multiplexers half the size connected by
+internal channels to a 2-to-1 multiplexer. Notice how the arrayed channels,
+out0 and out1 are specified as a tuple. The recursive decomposition stops
+when the number of inputs is 2 or 1 (specifying a multiplexer with zero inputs
+generates an error). A 1-input multiplexer makes no choice of inputs.
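+
+The recursion described above can be traced by counting the base-case
+multiplexers it produces. The Python below is a sketch for illustration only:
+the function name pmux_cells is an assumption, and it counts instances rather
+than modelling any handshaking.
+
```python
def pmux_cells(n):
    """Return (two_way, one_way): base multiplexers PMux creates for n inputs."""
    if n == 0:
        raise ValueError("Parameter n should not be zero")
    if n == 1:
        return (0, 1)  # a 1-input multiplexer makes no choice of inputs
    if n == 2:
        return (1, 0)  # the recursion stops at a 2-way multiplexer
    mid = n // 2
    low_two, low_one = pmux_cells(mid)         # PMux (X, mid, inp[0..mid-1], out0)
    high_two, high_one = pmux_cells(n - mid)   # PMux (X, n-mid, inp[mid..n-1], out1)
    # One more 2-way multiplexer joins out0 and out1.
    return (low_two + high_two + 1, low_one + high_one)


print(pmux_cells(5), pmux_cells(8))
```
+
+Under this count PMux5Byte consists of four 2-way multiplexers and one
+1-input stage, while an 8-way multiplexer is a full tree of seven 2-way cells.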
+
+A Balsa test harness
+
+The code below illustrates how a simple Balsa program can be used as a test
+harness to generate test values for the multiplexer. The test program is actually
+rather naïve.
+
+-- test_pmux.balsa - A test-harness for Pmux1
+import [balsa.types.basic]
+import [pmux1]
+
+procedure test (output out : byte) is
+  type ttype is sizeof byte + 1 bits
+  array 5 of channel inp : byte
+  variable i : ttype
+begin
+  begin
+    i:= 1;
+    while i <= 0x80 then
+      inp[0] <- (i as byte);
+      inp[1] <- (i+1 as byte);
+      inp[2] <- (i+2 as byte);
+      inp[3] <- (i+3 as byte);
+      inp[4] <- (i+4 as byte);
+      i:= (i + i as ttype)
+    end
+  end ||
+  PMux5Byte(inp, out)
+end
+
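+The harness declares its loop variable one bit wider than a byte. The source
+does not say why; one plausible reading, sketched in Python below, is that
+the doubling i + i must be able to exceed 0x80 for the while loop to exit.
+The function name harness_values and the step cap are assumptions of this
+sketch.
+
```python
def harness_values(width_bits, step_cap=20):
    """Values of i seen by `while i <= 0x80`, with i held in width_bits bits."""
    mask = (1 << width_bits) - 1
    seen = []
    i = 1
    while i <= 0x80 and len(seen) < step_cap:  # cap guards a non-terminating loop
        seen.append(i)
        i = (i + i) & mask  # models (i + i as ttype)
    return seen


nine_bit = harness_values(9)   # ttype as declared: sizeof byte + 1 bits
eight_bit = harness_values(8)  # what would happen with a plain byte
print(nine_bit)
print(len(eight_bit))
```
+
+With 9 bits the loop visits the eight powers of two and stops; with 8 bits
+the value wraps to 0 after 0x80 and the loop only ends at the artificial cap.
+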
+11.2.2
+A population counter
+
+This next example counts the number of bits set in a word. It comes from
+the requirement in an Amulet processor to know the number of registers to be
+restored/saved during LDM/STM (Load/Store Multiple) instructions.
+The approach taken is to partition the problem into two parts. Initially, ad-
+jacent bits are added together to form an array of 2-bit channels representing
+the numbers of bits that are set in each of the adjacent pairs. The array of
+2-bit numbers is then added in a recursively defined tree of adders (procedure
+AddTree). The structure of the bit-counter is shown in figure 11.2.
+
+[Figure 11.2. Structure of a bit population counter. PopulationCount (8) adds
+the adjacent bits #i[0] .. #i[7] in pairs; AddTree (4,2,4) then sums the four
+2-bit results using two AddTree (2,2,3) instances and one AddTree (2,3,4).]
+
+-- popcount: count the number of bits set in a word
+import [balsa.types.basic]
+
+procedure AddTree (
+  parameter inputCount : cardinal;
+  parameter inputSize : cardinal;
+  parameter outputSize : cardinal;
+  array inputCount of input i : inputSize bits;
+  output o : outputSize bits
+) is
+begin
+  if inputCount = 1 then
+    select i[0] then o <- (i[0] as outputSize bits) end
+  | inputCount = 2 then
+    select i[0], i[1] then
+      o <- (i[0] + i[1] as outputSize bits)
+    end -- select
+  else
+    local
+      constant lowHalfInputCount = inputCount / 2
+      constant highHalfInputCount = inputCount - lowHalfInputCount
+
+      channel lowO, highO : outputSize - 1 bits
+    begin
+      AddTree (lowHalfInputCount, inputSize, outputSize - 1,
+        i[0..lowHalfInputCount-1], lowO) ||
+      AddTree (highHalfInputCount, inputSize, outputSize - 1,
+        i[lowHalfInputCount..inputCount-1], highO) ||
+      AddTree (2, outputSize - 1, outputSize, {lowO, highO}, o)
+    end
+  end -- if
+end
+
+procedure PopulationCount (
+  parameter n : cardinal;
+  input i : n bits;
+  output o : log (n+1) bits
+) is
+begin
+  if n % 2 = 1 then
+    print error, "number of bits must be even"
+  end; -- if
+  loop
+    select i then
+      if n = 1 then
+        o <- i
+      | n = 2 then
+        o <- (#i[0] + #i[1]) -- add bits 0 and 1
+      else
+        local
+          constant pairCount = n - (n / 2)
+          array pairCount of channel addedPairs : 2 bits
+        begin
+          for || c in 0..pairCount-1 then
+            addedPairs[c] <- (#i[c*2] + #i[(c*2)+1])
+          end ||
+          AddTree (pairCount, 2, log (n+1), addedPairs, o)
+        end
+      end -- if
+    end -- select
+  end -- loop
+end
+
+procedure PopCount16 is PopulationCount (16)
+
+Commentary on the code
+
+parameterisation:
+Procedures AddTree and PopulationCount are param-
+eterised. PopulationCount can be used to count the number of bits set in any
+sized word. AddTree is parameterised to allow a recursively defined adder of
+any number of arbitrary width vectors.
+
+enclosed selection:
+The semantics of the enclosed handshake of select al-
+low the contents of the input i to be referred to several times in the body of the
+select block without the need for an internal latch.
+
+avoiding deadlock:
+Note that the formation of the sum of adjacent bits is
+specified by a parallel for loop.
+
+for || c in 0..pairCount-1 then
+addedPairs[c] <- (#i[c*2] + #i[(c*2)+1])
+end
+
+
+It might be thought that a serial for ; loop could be used at, perhaps, the
+expense of speed. This is not the case: the system will deadlock, illustrating
+why designing asynchronous circuits requires some real understanding of the
+methodology. In this case the adder to which the array of addedPairs is
+connected requires pairs of inputs to be ready before it can complete the
+addition and release its inputs. However, if the sum of adjacent bits is
+computed serially, the next pair will not be computed until the handshake for
+the previous pair has been completed, which is not possible because AddTree is
+awaiting all pairs to become valid. The result is deadlock!
+
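+The arithmetic performed by the population counter (though not its
+handshaking or the deadlock issue) can be checked with a software model. The
+Python below is a sketch under that limitation; the function names add_tree
+and population_count are chosen here to echo the Balsa procedures.
+
```python
def add_tree(values):
    """Sum a list by splitting it into low and high halves, as AddTree does."""
    if len(values) == 1:
        return values[0]
    if len(values) == 2:
        return values[0] + values[1]
    low = len(values) // 2  # lowHalfInputCount
    return add_tree(values[:low]) + add_tree(values[low:])


def population_count(word, n):
    """Count the set bits of an n-bit word: pair sums first, then the adder tree."""
    if n < 2 or n % 2 == 1:
        raise ValueError("number of bits must be even (and at least 2 here)")
    bits = [(word >> k) & 1 for k in range(n)]                       # models #i
    pairs = [bits[2 * c] + bits[2 * c + 1] for c in range(n // 2)]  # addedPairs
    return add_tree(pairs)


print(population_count(0b10110010, 8))
```
+
+Each entry of pairs lies in the range 0 to 2, which is why the addedPairs
+channels in the Balsa description are 2 bits wide.
+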
+11.2.3
+A Balsa shifter
+
+General shifters are an essential element of all microprocessors including
+the Amulet processors. The following description forms the basis of such a
+shifter. It implements only a rotate right function, but it is easily extensible to
+other shift functions.
+The main work of the shifter is done by the local procedure rorBody, which
+recursively creates sub-shifters capable of optionally rotating by 1, 2, 4, 8
+etc. bits. The structure of the shifter is shown in figure 11.3.
+
+[Figure 11.3. Structure of a rotate right shifter. ror (8 bits) chains
+rorBody (4), rorBody (2) and rorBody (1) between input i and output o; each
+rorStage contains a mux, controlled by #d[log distance], that selects between
+its unrotated input (0) and the rotated input (1).]
+
+import [balsa.types.basic]
+
+-- ror: rotate right shifter
+procedure ror (
+  parameter X : type;
+  input d : log sizeof X bits;
+  input i : X;
+  output o : X
+) is
+begin
+  loop
+    select d then
+      local
+        constant typeWidth = sizeof X
+
+        procedure rorBody (
+          parameter distance : cardinal;
+          input i : X;
+          output o : X
+        ) is
+        local
+          procedure rorStage (
+            output o : X
+          ) is
+          begin
+            select i then
+              if #d[log distance] then
+                o <- (#i[typeWidth-1..distance] @
+                  #i[distance-1..0] as X) -- shift
+              else
+                o <- i -- don’t shift
+              end -- if
+            end -- select
+          end
+          channel c : X
+        begin
+          if distance > 1 then
+            rorStage (c) ||
+            rorBody (distance/2, c, o)
+          else
+            rorStage (o)
+          end -- if
+        end
+      begin
+        rorBody (typeWidth/2, i, o)
+      end
+    end -- select
+  end -- loop
+end
+
+procedure ror32 is ror (32 bits)
+
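+The staged rotation can be checked against a direct rotate in software. The
+Python below is a sketch that assumes a power-of-two word width and a distance
+smaller than the width; the function name ror_staged is an assumption, and the
+model ignores handshaking entirely.
+
```python
def ror_staged(value, distance, width):
    """Rotate right using one optional stage per bit of `distance`."""
    if width < 2 or width & (width - 1):
        raise ValueError("this sketch assumes a power-of-two width >= 2")
    mask = (1 << width) - 1
    stage = width // 2  # rorBody (typeWidth/2, i, o)
    while stage >= 1:
        bit = stage.bit_length() - 1  # log distance for this stage
        if (distance >> bit) & 1:     # #d[log distance] selects the rotated path
            value = ((value >> stage) | (value << (width - stage))) & mask
        stage //= 2                   # rorBody (distance/2, c, o)
    return value


print(hex(ror_staged(7, 1, 32)))
```
+
+Because each stage rotates by a distinct power of two, the stages taken
+together rotate by exactly the binary value carried on d.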
+
+Testing the shifter
+
+This next code example builds another test routine in Balsa to exercise the
+shifter.
+
+import [balsa.types.basic]
+import [ror]
+
+-- test ror32
+procedure test_ror32(output o : 32 bits) is
+  variable i : 5 bits
+  channel shiftchan : 32 bits
+  channel distchan : 5 bits
+begin
+  begin
+    i:= 1;
+    while i < 31 then
+      shiftchan <- 7 || distchan <- i;
+      i:= (i+1 as 5 bits)
+    end -- while
+  end || ror32(distchan, shiftchan, o)
+end
+
+11.2.4
+An arbiter tree
+
+The final example builds a parameterised arbiter. This circuit forms part
+of the DMA controller of chapter 12. The architecture of an 8-input arbiter
+is shown in figure 12.3 on page 212. ArbFunnel is a parameterisable tree
+composed of two elements: ArbHead and ArbTree. Pairs of incoming sync
+requests are arbitrated and combined into single bit decisions by ArbHead el-
+ements. These single bit channels are then arbitrated between by ArbTree
+elements. An ArbTree takes a number of decision bits from each of a number
+of inputs (on the i ports) and produces a rank of 2-input arbiters to reduce the
+problem to half as many inputs each with 1 extra decision bit. Recursive calls
+to ArbTree reduce the number of input channels to one (whose final decision
+value is returned on port o).
+
+-- ArbHead: 2 way arbcall with channel no. output
+import [balsa.types.basic]
+procedure ArbHead (
+  sync i0, i1;
+  output o : bit
+) is
+begin
+  loop
+    arbitrate i0 then o <- 0
+    | i1 then o <- 1
+    end -- arbitrate
+  end -- loop
+end -- begin
+
+-- ArbTree: a tree arbcall which outputs a channel number
+-- prepended onto the input channel’s data. (invokes itself
+-- recursively to make the tree)
+
+procedure ArbTree (
+  parameter inputCount : cardinal;
+  parameter depth : cardinal; -- bits to carry from inputs
+  array inputCount of input i : depth bits;
+  output o : (log inputCount) + depth bits
+) is
+  type BitArray is array 1 of bit
+  type BitArray2 is array 2 of bit
+  function AddTopBit (hd : bit; tl : depth bits) =
+    (#tl @ {hd} as depth + 1 bits)
+  function AddTopBit2 (hd : bit; tl : depth + 1 bits) =
+    (#tl @ {hd} as depth + 2 bits)
+  function AddTop2Bits (hd0 : bit; hd1 : bit; tl : depth bits) =
+    (#tl @ {hd0,hd1} as depth + 2 bits)
+begin
+  case inputCount of
+    0, 1 then print error, "Can’t build an ArbTree with fewer than 2 inputs"
+  | 2 then loop
+      arbitrate i[0] -> i0 then o <- AddTopBit (0, i0)
+      | i[1] -> i1 then o <- AddTopBit (1, i1)
+      end -- arbitrate
+    end -- loop
+  | 3 then local channel lo : 1 + depth bits
+    begin
+      ArbTree (2, depth, i[0 .. 1], lo) ||
+      loop
+        arbitrate lo then o <- AddTopBit2 (0, lo)
+        | i[2] -> i2 then o <- AddTop2Bits (1, 0, i2)
+        end -- arbitrate
+      end -- loop
+    end
+  else local
+      constant halfCount = inputCount / 2
+      constant halfBits = depth + log halfCount
+      channel l, r : halfBits bits
+    begin
+      ArbTree (halfCount, depth, i[0 .. halfCount-1], l) ||
+      ArbTree (inputCount - halfCount, depth,
+        i[halfCount .. inputCount-1], r) ||
+      ArbTree (2, halfBits, {l,r}, o)
+    end -- begin
+  end -- case inputCount
+end
+
+
+-- ArbFunnel: build a tree arbcall (balanced apart from the last
+-- channel which is faster than the rest) which produces a channel
+-- number from an array of sync inputs
+procedure ArbFunnel (
+  parameter inputCount : cardinal;
+  array inputCount of sync i;
+  output o : log inputCount bits
+) is
+  constant halfCount = inputCount / 2
+  constant oddInputCount = inputCount % 2
+begin
+  if inputCount < 2 then
+    print error, "can’t build an ArbFunnel with fewer than 2 inputs"
+  | inputCount = 2 then
+    ArbHead (i[0], i[1], o)
+  | inputCount > 2 then
+    local
+      array halfCount + 1 of channel li : bit
+    begin
+      for || j in 0 .. halfCount - 1 then
+        ArbHead (i[j*2], i[j*2+1], li[j])
+      end ||
+      if oddInputCount then
+        ArbTree (halfCount + 1, 1, li[0 .. halfCount], o) ||
+        loop
+          select i[inputCount - 1] then li[halfCount] <- 0
+          end -- select
+        end -- loop
+      else
+        ArbTree (halfCount, 1, li[0 .. halfCount-1], o)
+      end -- if
+    end
+  end -- if
+end
+
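+For power-of-two input counts, the way ArbFunnel assembles a channel number
+from successive decision bits can be traced in software. The Python below is
+a sketch with several assumptions: the names arb_tree and arb_funnel are
+chosen here, the winning input is supplied by the caller instead of being
+arbitrated, AddTopBit is read as placing the new decision bit above the
+carried bits, and odd input counts (the 3-input case and the oddInputCount
+branch) are deliberately left out.
+
```python
def arb_tree(input_count, depth, index, value):
    """Return (output, width) when input `index` wins, carrying `value` (depth bits)."""
    if input_count < 2 or input_count & (input_count - 1):
        raise ValueError("this sketch handles power-of-two input counts >= 2 only")
    if input_count == 2:
        # AddTopBit (index, value): decision bit placed above the carried bits
        return (index << depth) | value, depth + 1
    half = input_count // 2                       # halfCount
    side, sub_index = divmod(index, half)         # which half holds the winner
    sub_value, sub_width = arb_tree(half, depth, sub_index, value)
    return arb_tree(2, sub_width, side, sub_value)  # ArbTree (2, halfBits, {l,r}, o)


def arb_funnel(input_count, index):
    """Return (channel_number, width) for a winning sync input `index`."""
    if input_count == 2:
        return index, 1                           # a single ArbHead
    head_bit = index % 2                          # decision made by the ArbHead pair
    return arb_tree(input_count // 2, 1, index // 2, head_bit)


print([arb_funnel(8, k)[0] for k in range(8)])
```
+
+Under these assumptions the number delivered on o is simply the index of the
+winning input, built up one bit per level of the tree.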
+
+Chapter 12
+
+A SIMPLE DMA CONTROLLER
+
+A simple 4 channel DMA controller is presented as a practical example of a
+reasonably large-scale Balsa design. The controller is written entirely in
+Balsa and so can be compiled for any of the technologies which the Balsa
+back-end supports.
+Readers should note that this controller is not the same as the Amulet3i DMA
+controller referred to in chapter 15. A more detailed description of this con-
+troller and the motivation for its design can be found in [8]. A complete listing
+of the code for the controller can be downloaded from [7].
+The simplified controller provides:
+
+4 full address range channels each with independent source, destination
+and count registers.
+
+8 client DMA request inputs with matching acknowledgements.
+
+Peripheral to peripheral, memory to memory and peripheral to memory
+transfers. Each channel has both source and destination client requests
+so “true” peripheral to peripheral transfers can be performed by waiting
+for requests from both parties.
+
+Figure 12.1 shows the programmer’s view of the controller’s register mem-
+ory map. The register bank is split into two parts: the channel registers and the
+global registers.
+
+12.1.
+Global registers
+
+The global registers contain control state pertaining to the state of currently
+active channels and of interrupts signalled by the termination of transfer runs.
+There are 4 global registers:
+
+genCtrl: General control.
+In this controller, the general control register
+only contains one bit: the global enable – gEnable. The global enable is the
+only controller bit reset at power-up. All other controller state bits must be
+initialised before gEnable is set. Using a global enable bit in this way allows
+the initialisation part of the Balsa description to remain small and cheap.
+
+[Figure 12.1. DMA controller programmer’s model. The channel registers
+chan[n].src, chan[n].dst, chan[n].count and chan[n].ctrl (n = 0 .. 3) occupy
+offsets 000 to 03C; the global registers genCtrl, chanStatus, IRQMask and
+IRQReq occupy offsets 200 to 20C. chan[n].ctrl holds the fields enable,
+srcInc, dstInc, countDec, srcDRQ, dstDRQ, srcClientNo and dstClientNo;
+genCtrl holds the single bit gEnable.]
+
+chanStatus: Channel end-of-run status.
+The chanStatus register contains
+4 bits, one per DMA channel. When set by the DMA controller, a bit in this
+register indicates that the corresponding channel has come to the end of its run
+of transfers.
+
+IRQMask, IRQReq: Interrupt mask and status.
+The IRQMask register
+contains one bit per channel (like chanStatus) with set bits specifying that
+an interrupt should be raised at the end of a transfer run of that channel (when
+the corresponding chanStatus bit becomes set). IRQReq contains the current
+interrupt status for each channel.
+The channel status, IRQ mask and IRQ status bits are kept in global registers
+in order to reduce the number of DMA register reads which must be performed
+by the CPU after receiving an interrupt in order to determine which channel to
+service.
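+
+One plausible reading of how these three registers interact is sketched
+below in Python. It is an assumption-laden illustration, not the controller’s
+actual logic: the function names end_of_run and channels_to_service are
+invented here, and the rule that IRQReq is set only when the matching IRQMask
+bit is set is inferred from the description above.
+
```python
def end_of_run(chan_status, irq_req, irq_mask, channel):
    """Update (chanStatus, IRQReq) when `channel` finishes its run (assumed rule)."""
    bit = 1 << channel
    chan_status |= bit           # end-of-run indication for this channel
    if irq_mask & bit:
        irq_req |= bit           # raise an interrupt only if it is unmasked
    return chan_status, irq_req


def channels_to_service(irq_req, channels=4):
    """Decode a single IRQReq read into the list of channels needing service."""
    return [ch for ch in range(channels) if irq_req & (1 << ch)]


status, req = end_of_run(0, 0, irq_mask=0b0100, channel=2)
print(status, req, channels_to_service(req))
```
+
+The second function reflects the stated motivation: one read of a global
+register is enough for the CPU to decide which channel to service.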
+
+12.2.
+Channel registers
+
+Each channel has 4 registers associated with it in the same way as the
+Amulet3i DMA controller. The two address registers (channel[n].src and
+channel[n].dst) specify the 32-bit source and destination addresses for trans-
+fers. The count register (channel[n].count) is a 32-bit count of remaining
+transfers to perform; transfer runs terminate when the count register is decre-
+mented to zero. The control register (channel[n].ctrl) specifies the updates
+to be performed on the other three registers and the clients to which this chan-
+nel is connected. Writing to the control register has the effect of clearing inter-
+rupts and end-of-run indication on that channel. The control register contains
+8 fields:
+
+enable: Transfer enable.
+If the enable bit is set, this channel should be
+considered for transfers when a new DMA request arrives. Channel enables
+are not cleared on power-up. The genCtrl.gEnable bit can be used to pre-
+vent transfers from occurring whilst the channel enable bits are cleared during
+startup.
+
+srcInc, dstInc, countDec: Increment/decrement control.
+These bits are
+used to enable source, destination and count register update after a transfer.
+Source and destination registers are incremented by 4 after transfers (since
+only word transfers are supported in this version of the controller) if srcInc
+and dstInc (respectively) are set. Note that the bottom 2 bits of these ad-
+dresses are preserved. The count register is decremented by 1 after each trans-
+fer if countDec is set. Resetting either srcInc or dstInc results in the corre-
+sponding address remaining unchanged between transfers. This is useful for
+nominating peripheral (rather than memory) addresses. Resetting countDec
+results in “free-running” transfers.
+
+srcDRQ, dstDRQ: Initial DMA requests.
+Transfers can take place on a
+channel when a pair of DMA requests have been received, one for the source
+client and the other for the destination client (the requests-pending registers).
+The srcDRQ and dstDRQ bits specify the initial states for those two requests.
+Setting both of these bits indicates that the source and destination requests
+should be considered to have already arrived. Resetting one or both of the bits
+specifies that requests from the corresponding srcClientNo or dstClientNo numbered
+client should trigger a transfer (both client requests are required when both
+control bits are reset).
+
+srcClientNo, dstClientNo: Client to channel mapping.
+These fields spec-
+ify the client numbers from which this channel receives source and destination
+DMA requests. These fields are only of use when either srcDRQ or dstDRQ (or
+both) are reset.
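+
+The register updates controlled by srcInc, dstInc and countDec can be
+summarised in a short software model. The Python below is a sketch: the
+function name after_transfer and the 32-bit wrap-around of the registers are
+assumptions made here, and the requests-pending logic is not modelled.
+
```python
MASK32 = 0xFFFFFFFF  # assumed: registers wrap at 32 bits


def after_transfer(src, dst, count, src_inc, dst_inc, count_dec):
    """Apply the post-transfer updates; returns (src, dst, count, run_ended)."""
    if src_inc:
        src = (src + 4) & MASK32   # word transfers: bottom 2 address bits unchanged
    if dst_inc:
        dst = (dst + 4) & MASK32
    run_ended = False
    if count_dec:
        count = (count - 1) & MASK32
        run_ended = count == 0     # a run ends when the count is decremented to zero
    return src, dst, count, run_ended


# Memory-to-peripheral style setup: source advances, destination stays fixed.
print(after_transfer(0x1002, 0x2000, 2, True, False, True))
```
+
+With countDec reset the count never changes and run_ended stays False, which
+corresponds to the “free-running” transfers mentioned above.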
+
+12.3.
+DMA controller structure
+
+The structure of the simplified DMA controller is shown in Figure 12.2. The
+simplified DMA controller is composed of 5 units:
+
+
+[Figure 12.2. DMA controller structure. Blocks shown: ARBITER, control unit,
+transfer engine, MARBLE target I/F and MARBLE initiator I/F. Labelled
+connections: DMA requests (DRQ), TECommand, TEAck (transfer acknowledge),
+busCommand, busResponse, address (A) and data (D), and interrupts to the CPU.]
+
+MARBLE target interface
+
+The controller is assumed to be attached to the MARBLE asynchronous
+bus which connects the subsystems in the Amulet3i system-on-chip (see chap-
+ter 15). It is relatively easy to provide an interface to other forms of on-chip
+system connect busses.
+The MARBLE target interface provides a connection to MARBLE through
+which the controller can be programmed. Accesses to the registers from this
+interface are arbitrated with incoming DMA requests and transfer acknowl-
+edgements from the transfer engine. This arbitration and the decoupling of
+transfer engine from control unit allow the DMA controller to avoid potential
+bus access deadlock situations.
+The MARBLE interface used here carries an 8-bit address (8-bit word address,
+10-bit byte address) similar to that of the Amulet3i DMA controller. This
+allows the same address mapping of channel registers and the possibility of
+extending the number of channels to 32 without changing the global register
+addresses.
+
+MARBLE initiator interface
+
+The initiator interface is used by the DMA controller to perform its transfers.
+Only the address and control bits to this interface are connected to the Balsa
+synthesised controller hardware. The data to and from the initiator interface is
+handled by a latch (shown as the shaded box in Figure 12.2). Only word-wide
+transfers are supported and so this latch is all that is needed to hold data values
+between the read and write bus transactions of a transfer. Supporting different
+transfer widths is relatively easy but has not been implemented in this example
+in order to simplify the code.
+
+Control unit
+
+Each DMA channel has a pair of register bits, the requests-pending bits,
+which record the arrival of requests for that channel’s source and destination
+clients. After marking-up an incoming request, the control unit examines the
+requests-pending registers of each channel in turn to find a channel on which
+to perform a transfer. If a transfer is to be performed, the register contents for
+that channel are forwarded to the transfer engine and the register contents are
+updated to reflect the incremented addresses and decremented count. DMA
+requests are acknowledged straight away when no transfer engine command
+is issued, or just after the command is issued where a transfer command is
+issued to the transfer engine. The acknowledgement of DMA requests does
+not guarantee the timely completion of the related transfer; peripherals must
+observe bus accesses made to themselves for this purpose. The acknowledgement
+serves only to confirm the receipt of the DMA transfer request. A request
+must be removed after an acknowledgement is signalled so that other requests
+can be received through the request arbitration tree to mark-up potential
+transfers for other channels.
+
+Transfer engine
+
+The controller’s transfer engine takes commands from the control unit when
+a DMA transfer is due to be performed and performs no DMA request mapping
+or filtering of its own. The only reason for having the transfer engine is to
+prevent the potential bus deadlock situation if an access to the register bank
+is made across MARBLE while the DMA controller is trying to perform a
+transfer. In this situation, control of the bus belongs to the initiator (usually
+the CPU) trying to access the DMA controller. This initiator cannot proceed
+as the DMA controller is engaged in trying to gain the bus for itself. With a
+transfer engine, and the decoupling of DMA request/CPU access from transfer
+operations, the control unit is free to fulfil the initiator’s register request while
+the transfer engine is waiting for the bus to become available.
+After performing a transfer, the transfer engine will signal to the control
+unit to provide a new transfer command; it does this by a handshake on the
+transfer acknowledge channel (marked TEAck in Figure 12.2). This channel
+passes through the control unit’s command arbitration hardware and serves to
+inform the control unit that the transfer engine is free and that the request-
+pending register can be polled to find the next suitable transfer candidate. The
+acknowledgement not only provides the self-looping activation required to per-
+form memory to memory transfers but also allows the looping required to ser-
+vice requests for other types of transfer which are received during the period
+when the transfer engine was busy.
+A flag register, TEBusy, held in the control unit, is used to record the status
+of the transfer engine so that commands are not issued to it while a transfer
+is in progress. This flag is set each time a transfer command is issued to the
+transfer engine and cleared each time a transfer acknowledgement is received
+by the control unit. The request-pending registers are not re-examined (and a
+transfer command issued) if TEBusy is set.
+
+Arbiter tree
+
+The DMA controller receives DMA requests on an array of 8 sync channels
+connected to the input of the ARBITER unit shown in Figure 12.2. This arbiter
+unit is a tree of 2-way arbiter cells that combines these 8 inputs into a single
+DMA request number which it provides to the control unit. DMA requests
+are acknowledged as soon as the control unit has recorded them. Only the
+successful transfer of data between peripherals should be used as an indication
+of the actual completion of a DMA operation. When a transfer is begun (i.e.
+passed from control unit to transfer engine), that transfer’s channel registers
+and requests-pending registers are updated before another arbitrated access to
+the control unit is accepted. As a consequence, a new request on a channel can
+arrive (and be correctly observed) as soon as any transfer-related bus activity
+occurs for that transfer.
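+
+As an illustration of how a tree of 2-way cells can reduce 8 request inputs to
+a single request number, the following Python sketch models each cell as
+forwarding one pending request and prepending a bit that records which input
+won. It is a behavioural model only, not the Balsa ArbFunnel: the names
+arbitrate2 and arb_funnel are hypothetical, and ties are resolved in favour of
+the lower input, whereas a real arbiter cell may choose either.
+

```python
from typing import List, Optional


def arbitrate2(left: Optional[int], right: Optional[int], width: int) -> Optional[int]:
    """Model of a 2-way arbiter cell.

    Forwards one pending request and sets bit `width` of its number to record
    which input won (0 = left, 1 = right). None means "no request pending".
    """
    if left is not None:
        # Assumption: the left input wins ties; a real cell chooses arbitrarily.
        return left
    if right is not None:
        return (1 << width) | right
    return None


def arb_funnel(requests: List[bool]) -> Optional[int]:
    """Reduce a power-of-two list of request flags to one request number."""
    level: List[Optional[int]] = [0 if r else None for r in requests]
    width = 0  # number of index bits accumulated so far
    while len(level) > 1:
        level = [arbitrate2(level[i], level[i + 1], width)
                 for i in range(0, len(level), 2)]
        width += 1
    return level[0]
```

+
+With a single request pending on input i, the model returns i; with several
+pending it returns the lowest-numbered one, which is a property of this sketch
+and not a claim about the hardware.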
+
+12.4. The Balsa description
+
+The Balsa description of the DMA controller is composed of 3 parts: the
+arbiter tree, the control unit and the transfer engine. The two MARBLE inter-
+faces sit outside the Balsa block and are controlled through the target (mta and
+mtd) ports (corresponding to command and response ports) and the initiator
+address/control (mia) port. The top level of the DMA controller is:
+
+procedure DMAArb is ArbFunnel (NoOfClients)
+
+procedure dma (
+input mta : MARBLE8bACommand;
+output mtd : MARBLEResponse;
+output mia : MARBLECommandNoData;
+output irq : bit;
+array NoOfClients of sync drq
+) is
+channel DRQClientNo : ClientNo
+channel TECommand : array 2 of Word -- srcAddr, dstAddr
+sync TEAck
+begin
+DMAArb (drq, DRQClientNo) ||
+DMAControl (mta, mtd, DRQClientNo, TECommand, TEAck, irq) ||
+DMATransferEngine (TECommand, TEAck, mia)
+end
+
+Interrupts are signalled by writing a 0 or 1 to the irq port. This interrupt
+value must then be caught by an external latch to generate a bare interrupt
+signal.
+
+12.4.1 Arbiter tree
+
+DMA requests from the client peripherals arrive on the sync channels drq;
+these channels connect to the request arbiter DMAArb. The procedure
+declaration for DMAArb is given in the top level as a parameterised version of
+the procedure ArbFunnel, which was described in chapter 11 on page 202.
+Figure 12.3 shows the organisation of an 8-input ArbFunnel.
+
+[Figure 12.3. 8-input arbiter – ArbFunnel. Labels: inputs i[0]..i[7], ArbHead,
+ARB., ArbTree over 2,2, ArbTree over 4,1, ArbFunnel over 8, output o.]
+
+12.4.2 Transfer engine
+
+The transfer engine is, like the arbiter unit, quite simple. It exists only as
+a buffer stage between the control unit and the MARBLE initiator interface.
+This function is reflected in the sequencing in the Balsa description and the
+latches used to store the outgoing addresses.
+
+procedure DMATransferEngine (
+input command : array 2 of Word;
+sync ack;
+output busCommand : MARBLECommandNoData
+) is
+variable commandV : array 2 of Word
+begin
+loop
+command -> commandV;
+busCommand <- {commandV[0],read,word};
+busCommand <- {commandV[1],write,word};
+sync ack
+end
+end
+
+
+12.4.3 Control unit
+
+The bulk of the controller is contained in the control unit. It contains all the
+channel register latch bits and register access multiplexers/demultiplexers. The
+reduced number of channels and single channel type makes this arrangement
+practical. There are in total 445 bits of programmer accessible state. The ports,
+local variables and local channels of the control unit’s Balsa description are:
+
+procedure DMAControl (
+input busCommand : MARBLE8bACommand;
+output busResponse : MARBLEResponse;
+input DRQ : ClientNo;
+output TECommand : array 2 of Word;
+sync TEAck;
+output IRQ : bit
+) is
+-- combined channel registers
+variable channelRegisters :
+array NoOfChannels of ChannelRegister
+variable channelR, channelW : ChannelRegister
+array over ChannelRegType of bit
+variable channelNo : ChannelNo
+variable clientNo : ClientNo
+
+variable TEBusy : bit
+
+variable gEnable : bit
+variable chanStatus : array NoOfChannels of bit
+variable IRQMask, IRQReq : array NoOfChannels of bit
+
+variable requestsPending :
+array NoOfChannels of RequestPair
+
+channel commandSourceC : DMACommandSource
+channel busCommandC : MARBLE8bACommand
+channel DRQC : ClientNo
+variable commandSource : DMACommandSource
+. . .
+
+The ChannelRegister is the combined source, destination, count and con-
+trol registers for one channel. The variable channelRegisters is accessed
+by reading or writing these 108-bit wide registers (32 + 32 + 32 + 12). The
+two registers, channelR and channelW, are used as read and write buffers
+to the channel registers. This allows the partial writes required for CPU ac-
+cess to individual 32-bit words to fragment only these two registers, not all
+of the channel registers. The variables channelNo and clientNo are used to
+hold channel and client numbers between operations. DMA request arrival
+and request mark-up can modify clientNo and channel register accesses and
+ready-to-transfer polling can modify channelNo.
+The three channel declarations are used to communicate with RequestHandler,
+a sub-procedure of DMAControl which arbitrates requests from the arbiter tree,
+the MARBLE target interface and the transfer engine acknowledge for service
+by the control unit. RequestHandler’s description is fairly uninteresting
+and so will not be discussed.
+The body of the control unit, with the less interesting portions removed, is
+as follows:
+
+begin
+Init ();
+-- RequestHandler is an ArbFunnel
+-- with accompanying data
+RequestHandler (busCommand, DRQ, TEAck, commandSourceC,
+busCommandC, DRQC) ||
+loop
+-- find source of service requests
+commandSourceC -> commandSource;
+case commandSource of
+DRQ then DRQC -> clientNo; MarkUpClientRequest ()
+| bus then
+select busCommandC then
+if (busCommandC.a as RegAddrType).globalNchannel
+then . . . -- global R/W from the CPU
+else -- channel regs
+channelNo :=
+(busCommandC.a as ChannelRegAddr).channelNo;
+ReadChannelRegisters ();
+case busCommandC.rNw of
+. . . -- most of CPU reg. access code omitted
+-- CPU ctrl register write
+| ctrl then channelW.ctrl :=
+(busCommandC.d as ControlRegister) ||
+requestsPending[channelNo] := {0,0} ||
+ClearChanStatus ()
+end;
+WriteChannelRegisters ()
+end
+end
+end
+else -- TEAck
+TEBusy := 0;
+if gEnable then AssessInterrupts () end
+end;
+if gEnable and not TEBusy then
+TryToIssueTransfer ()
+end
+end
+end
+
+A number of procedure calls are made by the control unit body, for example,
+AssessInterrupts (). These are calls to shared procedures whose definitions
+follow the local variables in DMAControl’s description. In Balsa, a local
+procedure declared to be “shared” is instantiated in only one place in the
+containing procedure’s handshake circuit, whereas a normal procedure call
+places a new copy of the procedure’s body at each call location. Calls to a
+shared procedure are combined using a Call component, which makes shared
+procedures cheaper to use than normal ones.
+
+DMA request handling – MarkUpClientRequest
+
+Incoming DMA requests are marked up in the request pending registers as
+previously described. The procedure MarkUpClientRequest performs this
+operation by testing all the channels’ srcClientNo and dstClientNo con-
+trol bits with clientNo (the client ID of the incoming request) in parallel.
+MarkUpClientRequest’s description is:
+
+shared MarkUpClientRequest is
+begin
+for || i in 0..NoOfChannels-1 then
+if channelRegisters[i].ctrl.srcClientNo = clientNo
+then requestsPending[i].src := 1
+end ||
+if channelRegisters[i].ctrl.dstClientNo = clientNo
+then requestsPending[i].dst := 1
+end
+end
+end
+
+The for || loop in this description performs parallel structural instantiation
+of NoOfChannels copies of the body if statements.
+
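+The effect of MarkUpClientRequest can be imitated in Python as a rough sketch.
+The hardware compares all channels at once; the loop below visits them one
+after another, which gives the same result because each channel is updated
+independently. The class and field names (Ctrl, RequestPair, src_client_no,
+dst_client_no) are hypothetical stand-ins for the Balsa record fields.
+

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ctrl:
    """Hypothetical stand-in for the client-number fields of a control register."""
    src_client_no: int
    dst_client_no: int


@dataclass
class RequestPair:
    """Request-pending bits for one channel."""
    src: int = 0
    dst: int = 0


def mark_up_client_request(ctrls: List[Ctrl], pending: List[RequestPair],
                           client_no: int) -> None:
    """Set the pending bits of every channel that names client_no."""
    for ctrl, req in zip(ctrls, pending):
        if ctrl.src_client_no == client_no:
            req.src = 1  # this client is the channel's source
        if ctrl.dst_client_no == client_no:
            req.dst = 1  # this client is the channel's destination
```

+
+Note that one incoming request may mark up several channels, and may set both
+bits of a channel whose source and destination name the same client.
+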
+Register Access – ReadChannelRegisters,
+WriteChannelRegisters
+
+The shared procedures used to access the channel registers are very short.
+They make the only variable-indexed accesses to the channel registers. The
+two procedures are:
+
+shared ReadChannelRegisters is begin
+channelR := channelRegisters[channelNo]
+end
+
+shared WriteChannelRegisters is begin
+channelRegisters[channelNo] := channelW
+end
+
+Notice that no individual word write enables are present and so, in order
+to modify a single word, a whole channel register must be read, modified,
+then written back. The ReadChannelRegisters followed by channelW :=
+channelR in the description of the CPU’s access to the channel registers uses
+this update method.
+
+Channel status and interrupts – ClearChanStatus,
+AssessInterrupts
+
+The interrupt output bit is asserted by AssessInterrupts. Interrupts are
+signalled when the IRQReq register is non-zero and are reassessed each time
+an action which could clear an interrupt occurs. ClearChanStatus is called
+during channel control register updates to clear interrupts and channel status
+(end-of-run) indications. Their descriptions are:
+
+shared AssessInterrupts is begin
+IRQ <- (IRQReq as NoOfChannels bits) /= 0
+end
+
+shared ClearChanStatus is begin
+chanStatus[channelNo] := 0 ||
+IRQReq[channelNo] := 0;
+AssessInterrupts ()
+end
+
+Ready-to-transfer polling – TryToIssueTransfer, IssueTransfer
+
+Whenever the DMA controller is stimulated by its command interfaces, it
+tries to perform a transfer. The request-pending, and ctrl[n].enable bits
+for each channel are examined in turn to determine if that channel is ready
+to transfer. Incrementing the channel number during this search is performed
+using a local channel to allow the incremented value to be accessed in parallel
+from two places. TryToIssueTransfer’s Balsa description is:
+
+shared TryToIssueTransfer is local
+variable foundChannel : bit
+variable newChannelNo : ChannelNo
+begin
+foundChannel := 0 || channelNo := 0;
+
+while not foundChannel then
+-- source and destination requests arrived?
+if requestsPending[channelNo] = {1,1}
+and channelRegisters[channelNo].ctrl.enable then
+ReadChannelRegisters ();
+requestsPending[channelNo] :=
+{channelR.ctrl.srcDRQ, channelR.ctrl.dstDRQ} ||
+foundChannel := 1 ||
+IssueTransfer () ||
+UpdateRegistersAfterTransfer ()
+else
+local
+channel incChanNo : array ChannelNoLen + 1 of bit
+begin
+incChanNo <- (channelNo + 1 as
+array ChannelNoLen + 1 of bit) ||
+select incChanNo then
+foundChannel := incChanNo[ChannelNoLen] ||
+newChannelNo := (incChanNo[0..ChannelNoLen-1]
+as ChannelNo)
+end;
+channelNo := newChannelNo
+end
+end
+end
+end
+
+Notice that if a transfer is taken, the requestsPending bits for that channel
+are re-initialised from the ctrl.{srcDRQ,dstDRQ} control bits for that
+channel. The procedure IssueTransfer actually passes the transfer on to the
+transfer engine. Its definition is:
+
+shared IssueTransfer is begin
+TEBusy := 1 ||
+TECommand <- {channelR.src, channelR.dst}
+end
+
+The interlock formed by checking TEBusy before attempting a transfer, and
+the setting/resetting of TEBusy by transfer initiation/completion ensures that
+no transfer is presented to the transfer engine (deadlocking the control unit)
+while it is occupied. The TEAck communication back to the control unit also
+provides stimuli for re-triggering the DMA controller to perform outstanding
+requests. This re-triggering, combined with the sequential polling of channels,
+allows outstanding requests (received while the transfer engine was busy) to be
+serviced correctly. Notice that a static prioritisation on pre-arrived requests is
+enforced by sequential channel polling.
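+
+The polling loop of TryToIssueTransfer can be mimicked in Python to show how
+the extra carry bit of the incremented channel number ends the search. This is
+a sketch under assumptions: the number of channels is taken to be a power of
+two (2**channel_no_len), the function name and its list arguments are
+hypothetical, and the register reads and the transfer itself are reduced to
+returning the chosen channel number.
+

```python
from typing import List, Optional, Tuple


def try_to_issue_transfer(pending: List[Tuple[int, int]],
                          enable: List[bool],
                          channel_no_len: int) -> Optional[int]:
    """Return the first ready channel, or None if no channel is ready."""
    assert len(pending) == len(enable) == 1 << channel_no_len
    channel_no = 0
    found = False
    issued: Optional[int] = None
    while not found:
        # Ready means: source and destination requests arrived, channel enabled.
        if pending[channel_no] == (1, 1) and enable[channel_no]:
            issued = channel_no
            found = True
        else:
            inc = channel_no + 1  # one bit wider than a channel number
            # The carry out of the increment means every channel was examined.
            found = bool(inc >> channel_no_len)
            channel_no = inc & ((1 << channel_no_len) - 1)
    return issued
```

+
+Because the search always starts at channel 0, lower-numbered channels win
+when several are ready, which is the static prioritisation noted above.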
+
+Register increment/decrement –
+UpdateRegistersAfterTransfer
+
+After a transfer has been issued, the registers for that transfer’s channel must
+be updated. Procedure UpdateRegistersAfterTransfer performs this task:
+
+shared UpdateRegistersAfterTransfer is begin
+channelW.ctrl := channelR.ctrl ||
+if channelR.ctrl.srcInc then
+channelW.src := (channelR.src + 1 as Word)
+end ||
+if channelR.ctrl.dstInc then
+channelW.dst := (channelR.dst + 1 as Word)
+end ||
+if channelR.ctrl.countDec then
+channelW.count := (channelR.count - 1 as Word)
+end;
+if channelW.count = 0 then
+chanStatus[channelNo] := 1 ||
+if IRQMask[channelNo] then
+IRQReq[channelNo] := 1
+end ||
+channelW.ctrl.enable := 0
+end;
+WriteChannelRegisters ()
+end
+
+This procedure uses two incrementers and a decrementer to modify the
+channel’s source address, destination address and count respectively. If the
+channel’s transfer count is decremented to zero, the chanStatus bit, interrupt
+status and channel enable are all updated to indicate an end-of-run.
+This concludes the descriptions of the more illustrative aspects of the Balsa
+DMA controller description. For further details see [8] and, as was noted at
+the beginning of this chapter, a full code listing is available from [7].
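+
+The register update performed by UpdateRegistersAfterTransfer can be sketched
+in Python as follows. The sketch assumes 32-bit wrap-around arithmetic and
+assumes that a field which is not incremented or decremented keeps the value
+read from the channel register; the class and function names are hypothetical
+and the control register is flattened into boolean fields for brevity.
+

```python
from dataclasses import dataclass, replace
from typing import Tuple

WORD_MASK = 0xFFFFFFFF  # assumption: Word is 32 bits wide


@dataclass(frozen=True)
class ChannelRegister:
    """Hypothetical flattened view of one channel's registers."""
    src: int
    dst: int
    count: int
    src_inc: bool
    dst_inc: bool
    count_dec: bool
    enable: bool


def update_after_transfer(r: ChannelRegister,
                          irq_mask: bool) -> Tuple[ChannelRegister, bool, bool]:
    """Return (new register value, end-of-run status, interrupt request)."""
    src = (r.src + 1) & WORD_MASK if r.src_inc else r.src
    dst = (r.dst + 1) & WORD_MASK if r.dst_inc else r.dst
    count = (r.count - 1) & WORD_MASK if r.count_dec else r.count
    end_of_run = count == 0
    # At end-of-run the channel is disabled and, if unmasked, raises an interrupt.
    w = replace(r, src=src, dst=dst, count=count,
                enable=r.enable and not end_of_run)
    return w, end_of_run, end_of_run and irq_mask
```

+
+A channel with count 1 and count decrement enabled therefore finishes its run
+on the next transfer, while one with a larger count simply stays enabled.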
+
+
+III
+
+LARGE-SCALE ASYNCHRONOUS DESIGNS
+
+Abstract
+In this final part of the book we describe some large-scale asynchronous VLSI
+designs to illustrate the capabilities of this technology. The first two of these
+designs – the contactless smart card chip developed at Philips and the Viterbi
+decoder developed at the University of Manchester – were designed within EU-
+funded projects in the low-power design initiative that is the sponsor of this book
+series. The third chapter describes aspects of the Amulet microprocessor series,
+again from the University of Manchester, developed in several other EU-funded
+projects which, although outside the low-power design initiative, nevertheless
+had low power as a significant objective.
+The chips described in this part of the book are some of the largest and most
+complex asynchronous designs ever developed. Fully detailed descriptions of them
+are far beyond the scope of this book, but they are included to demonstrate that
+asynchronous design is fully capable of supporting large-scale designs, and they
+show what can be done with skilled and experienced design teams. The descrip-
+tions presented here have been written to give insight into the thinking processes
+that a designer of state-of-the-art asynchronous systems might go through in de-
+veloping such designs.
+
+Keywords:
+asynchronous circuits, large-scale designs
+
+
+
+Chapter 13
+
+DESCALE*:
+a Design Experiment for a Smart Card Application
+consuming Low Energy
+
+Joep Kessels & Ad Peeters
+Philips Research, NL-5656AA Eindhoven, The Netherlands
+{Joep.Kessels, Ad.Peeters}@philips.com
+
+Torsten Kramer
+Kramer-Consulting, D-21079 Hamburg, Germany
+
+Kramer@kramer-consulting.de
+
+Volker Timm
+Philips Semiconductors, D-22529 Hamburg, Germany
+
+Volker.Timm@philips.com
+
+Abstract
+We have designed an asynchronous chip for contactless smart cards. Asyn-
+chronous circuits have two power properties that make them very suitable for
+contactless devices: low average power and small current peaks. The fact that
+asynchronous circuits operate over a wide range of the supply voltage, while
+automatically adapting their speed, has been used to obtain a circuit that is very
+resilient to voltage drops while giving maximum performance for the power be-
+ing received. The asynchronous circuit has been built, tested and evaluated.
+
+Keywords:
+low-power asynchronous circuits, smart cards, contactless devices, DES cryp-
+tography
+
+*Part of the work described in this paper was funded by the European Commission under Esprit TCS/ESD-
+LPD contract 25519 (Design Experiment DESCALE).
+
+
+13.1. Introduction
+
+Since their introduction in the eighties, smart cards have been used in a con-
+tinuously growing number of applications, such as banking, telephony (tele-
+phone and SIM cards), access control (Pay-TV), health-care, tickets for public
+transport, electronic signatures, and identification. Currently, most cards have
+contacts and, for that reason, need to be inserted into a reader. For applications
+in which the fast handling of transactions is important, contactless smart cards
+have been introduced requiring only close proximity to a reader (typically sev-
+eral centimeters). The chip on such a card must be extremely power efficient,
+since it is powered by electromagnetic radiation.
+Asynchronous CMOS circuits have the potential for very low power con-
+sumption, since they only dissipate when and where active. However, asyn-
+chronous circuits are difficult to design at the level of gates and registers.
+Therefore the high-level design language Tangram was defined [141] and a
+silicon compiler has been implemented that translates Tangram programs into
+asynchronous circuits.
+The Tangram compiler generates a special class of asynchronous circuits
+called handshake circuits [135, 112]. Handshake circuits are constructed from
+a set of about forty basic components that use handshake signalling for com-
+munication.
+Several chips have been designed in Tangram [136, 144] and if we compare
+these chips to existing clocked implementations, then the asynchronous ver-
+sions are generally about 20% larger in area and consume about 25% of the
+power.
+In order to find out what advantages asynchronous circuits have to offer in
+contactless smart cards, we have designed an asynchronous smart card chip. In
+this chapter, we indicate which properties of asynchronous circuits have been
+exploited and we present the results. The rest of the chapter is organized as fol-
+lows. Section 13.2 presents the Tangram method for designing asynchronous
+circuits. Section 13.3 summarizes the differences in power behaviour between
+synchronous and asynchronous circuits. Section 13.4 first provides some back-
+ground to contactless smart cards, then identifies the power characteristics in
+which contactless devices differ from battery-powered ones, and finally indi-
+cates why asynchronous circuits are very suited for contactless devices. Sec-
+tion 13.5 presents the digital circuit, Section 13.6 the results of the silicon, and
+Section 13.8 the power regulator, which also exploits an asynchronous oppor-
+tunity. We conclude with a summary of the merits of this asynchronous design.
+
+13.2. VLSI programming of asynchronous circuits
+
+The design flow that is used in the design of the smart card IC reported here
+is based on the programming language Tangram and its associated compiler
+and tool set. An important aspect of this design approach is the transparency
+of the silicon compiler [142], which allows the designer to reason about de-
+sign characteristics such as area, power, performance, and testability, at the
+programming (Tangram) level.
+This section first introduces the Tangram tool set, then briefly describes the
+underlying handshake technology, and finally illustrates VLSI-programming
+techniques using the GCD algorithm presented in Chapter 8.
+
+13.2.1 The Tangram toolset
+
+Fig. 13.1 shows the Tangram toolset. The designer describes a design in
+Tangram, which is a conventional programming language, similar to C and
+Pascal, extended to include constructs for expressing concurrency and com-
+munication in a way similar to those in the language CSP [58]. In addition to
+this, there are language constructs for expressing hardware-specific issues, like
+sharing of blocks and waiting for clock-edges.
+A compiler translates Tangram programs into so-called handshake circuits,
+which are netlists composed from a library of some 40 handshake components.
+Each handshake component implements a language construct, such as sequenc-
+ing, repetition, communication, and sharing.
+The handshake circuit simulator and corresponding performance analyzer
+give feedback to the designer about aspects such as function, area, timing,
+power, and testability.
+The actual mapping onto a conventional (synchronous) standard-cell library
+is done in two steps. In the first step the component expander uses the com-
+ponent library to generate an abstract netlist consisting of combinational logic,
+registers, and asynchronous cells, such as Muller C-elements. This step also
+determines the data encoding and the handshake protocol; generally a four-
+phase single-rail implementation is generated. In the second step commercial
+synthesis tools and technology mappers are used to generate the standard-cell
+netlist. No dedicated (asynchronous) cells are required in this mapping, be-
+cause all asynchronous cells are decomposed into cells from the standard-cell
+library at hand.
+Similar language-based approaches using handshake circuits as intermedi-
+ate format are described in [17, 9]. Design approaches in which asynchronous
+details are not hidden from the designer have also proven successful [80, 21,
+83, 30, 64]. A general overview of design methods for asynchronous circuits
+is given in [69] and in Part I of this book.
+
+[Figure 13.1. The Tangram toolset: boxes denote tools, ovals denote (design)
+representations. Flow: Tangram Program, Tangram Compiler, Handshake Circuit,
+Component Expander, Abstract Netlist, Technology Mapper, Mapped Netlist.
+Other labels: Component Library, Asynchronous Library, Standard-cell Library,
+Handshake Circuit Simulator (Timed Traces, Fault Coverage), Performance
+Analyzer (Function, Area, Timing, Power, Test).]
+
+[Figure 13.2. Handshake channel: abstract figure (top) and implementation
+(bottom). Both show an Active and a Passive partner; the implementation
+connects them by a Req wire and an Ack wire.]
+
+13.2.2 Handshake technology
+
+The design of large-scale asynchronous ICs demands a timing discipline to
+replace the clock regime that is used in conventional VLSI design. We have
+chosen handshake signaling [121] as the asynchronous timing discipline, since
+it supports plug-and-play composition of components into systems, and is also
+easy to implement in VLSI. An alternative to handshaking would be to com-
+pose asynchronous finite-state machines that communicate using fundamental
+mode or burst-mode assumptions [27, 132].
+Fig. 13.2 shows a handshake channel, which is a point-to-point connection
+between an active and a passive partner. In the abstract figure, the solid circle
+indicates the channel’s active side and the open circle its passive side. The
+implementation shows that both partners are connected by two wires: a request
+(Req) and an acknowledge (Ack) wire. A handshake requires the cooperation of
+both partners. It is initiated by the active party, which starts by sending a signal
+via Req, and then waits until a signal arrives via Ack. The passive side waits
+until a request arrives, and then sends an acknowledge. Handshake channels
+can be used not only for synchronization, but also for communication. To that
+end, data can be encoded in the request, the acknowledge, or in both.
+The protocol used in most asynchronous VLSI circuits is a four-phase hand-
+shake, in which the channel starts in a state with both Req and Ack low. The
+active side starts a handshake by making Req high. When this is observed by
+the passive side, it pulls Ack high. After this a return-to-zero cycle follows,
+during which first Req and then Ack go low, thus returning to the initial state.
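+
+The order of wire transitions in one four-phase handshake, as just described,
+can be written out as a short Python trace. This is only an illustration of
+the sequence of (Req, Ack) states; the function name is hypothetical and no
+timing or data is modelled.
+

```python
from typing import List, Tuple


def four_phase_handshake() -> List[Tuple[int, int]]:
    """Return the (Req, Ack) wire states visited during one handshake."""
    req, ack = 0, 0
    trace = [(req, ack)]  # initial state: both wires low
    req = 1               # active side starts the handshake
    trace.append((req, ack))
    ack = 1               # passive side observes Req high and acknowledges
    trace.append((req, ack))
    req = 0               # return-to-zero: Req falls first
    trace.append((req, ack))
    ack = 0               # then Ack falls, restoring the initial state
    trace.append((req, ack))
    return trace
```

+
+The trace starts and ends in the state with both wires low, so the channel is
+immediately ready for the next handshake.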
+Handshake components interact with their environment using handshake
+channels. One can build handshake components implementing language con-
+structs. Fig. 13.3 shows two examples: the sequencer and the parallel compo-
+nent.
+The sequencer, when activated via a, performs first a handshake via b and
+then via c. It is used to control the sequential execution of commands con-
+nected to b and c. After receiving a request along a, it sends a request along b,
+waits for the corresponding acknowledge, then sends a request along c, waits
+
+[Figure 13.3. Handshake components: sequencer (left) and parallel (right).
+Each component, SEQ and PAR, has channels a, b and c.]
+
+[Figure 13.4. Handshake circuit for while loop. Labels: DO, Guard, Command.]
+
+for the acknowledge on c, and finally signals completion of its operation by
+sending an acknowledge along channel a.
+The parallel component, when activated by a request along a, sends requests
+along channels b and c concurrently, waits until both acknowledges have ar-
+rived, and then sends an acknowledge along channel a.
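+
+The behaviour of the sequencer and the parallel component can be sketched in
+Python by treating a complete handshake as a function call: the call is the
+request and the return is the acknowledge. This is an abstraction chosen for
+illustration; the names sequencer and parallel are hypothetical, and threads
+merely stand in for the concurrency of the hardware.
+

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Handshake = Callable[[], None]  # calling = request, returning = acknowledge


def sequencer(b: Handshake, c: Handshake) -> Handshake:
    """Build channel a of a SEQ component: handshake on b, then on c."""
    def a() -> None:
        b()  # request on b and wait for its acknowledge
        c()  # only then request on c and wait for its acknowledge
    return a


def parallel(b: Handshake, c: Handshake) -> Handshake:
    """Build channel a of a PAR component: handshake on b and c concurrently."""
    def a() -> None:
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(b), pool.submit(c)]
            for f in futures:
                f.result()  # wait until both acknowledges have arrived
    return a
```

+
+In both cases the call on a returns only after b and c have completed; the
+two components differ in whether c may start before b has finished.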
+Components for storage of data (variables) and operation on data (such as
+addition and bit-manipulation) can also be constructed. Tangram programs
+are compiled into handshake circuits (composition of handshake components)
+in a syntax-directed way (see also Chapter 8 on page 123). For instance, the
+compilation of a while loop in Tangram, which is written as
+
+do
+Guard
+then
+Command
+od
+
+results in the handshake circuit shown in Fig. 13.4. The do-component, when
+activated, collects the value of the guard through a handshake on its active data
+port. When this guard is false, it completes the handshake on its passive port,
+otherwise it activates the command through a handshake on its active control
+port, and after this has completed, re-evaluates the guard to start a new cycle.
+Details about handshake circuits, the compilation from Tangram into this
+intermediate architecture, the four-phase handshake protocols that are applied,
+and of the gate-level implementation of handshake components can be found
+in [141, 135, 112].
+
+13.2.3 GCD algorithm
+
+When designing in Tangram, the emphasis for an initial design typically is
+on functional correctness. When this is the only point of attention, the result
+
+int = type [0..255]
+& gcd_gc : main proc (in?chan <<int,int>> & out!chan int).
+begin x,y:var int ff
+| forever do
+in?<<x,y>>
+; do x<>y then
+if x<y then y:=y-x
+else x:=x-y
+fi
+od
+; out!x
+od
+end
+
+Figure 13.5. GCD algorithm in Tangram.
+
+is generally too large, too slow, and too power hungry. The design of a suit-
+able datapath and control architecture, targetting some optimization criteria or
+cost function, can be approached in a transformational way. One can use the
+Tangram tool set to evaluate and analyze the design, and based on that, de-
+cide which transformations to apply. The transparency of the silicon compiler
+(‘What you program is what you get’) helps in predicting the effect of these
+transformations.
+The GCD algorithm is used as an example to illustrate VLSI programming
+based on a transparent silicon compiler (see also the discussion in section 8.3.3
+on page 127). We start with the algorithm of Fig. 13.5, which is functionally
+correct but far from optimal when it comes to implementation cost in VLSI.
+The corresponding handshake circuit is shown in Fig. 13.6.
+The cost report for this GCD design, as presented by the Tangram toolset, is
+the following:
+
+celltype          #occ    area   gate eq. [eqv]     (%)
+--------------------------------------------------------------------
+Delay Matchers      19    2052             38.0    12.5
+Memory              16    3744             69.3    22.8
+C-elements          12    1242             23.0     7.5
+Logic               81    9414            174.3    57.2
+--------------------------------------------------------------------
+Total:             128   16452            304.7   100.0
+
+An important aspect of the design is that it contains four operators in the dat-
+apath. We can improve this by changing the Tangram description in such a
+way that only one subtractor is needed, instead of two. A way to achieve this
+is to change the timing behaviour of the algorithm, and use a higher number
+of iterations of the do-loop by either swapping or subtracting the two numbers
+
+[Figure 13.6. Compiled handshake circuit for initial GCD program. Labels: in,
+out, x, y, mux, dmx, do, and a sequencer (;) with branches 0, 1, 2.]
+
+int = type [0..255]
+& gcd_gc : main proc (in?chan <<int,int>> & out!chan int).
+begin
+xy : var <<int,int>> ff
+& x = alias xy.0
+& y = alias xy.1
+| forever do
+in?xy
+; do x<>y then
+if x<y then xy:= <<x,y-x>>
+else xy:= <<y,x>>
+fi
+od
+; out!x
+od
+end
+
+Figure 13.7. GCD algorithm in Tangram with optimized control.
+
+x and y. This requires that x and y are stored in a single flip-flop variable.
+The new Tangram algorithm thus obtained is shown in Fig. 13.7. Its gate-level
+cost has improved from 305 to 274 gate equivalents, as its cost report shows.
+
+celltype          #occ    area   gate eq. [eqv]     (%)
+--------------------------------------------------------------------
+Delay Matchers      14    1512             28.0    10.2
+Memory              16    3744             69.3    25.3
+C-elements          10    1008             18.7     6.8
+Logic               86    8532            158.0    57.7
+--------------------------------------------------------------------
+Total:             126   14796            274.0   100.0
+
+An additional transformation is to compute x<y and y-x using only one op-
+erator, and to combine the two assignments to variable xy into one assignment,
+so as to further simplify the control, at the price of always requiring the worst-
+case computation time for the conditional expression, even if it involves just a
+swap of x and y. Furthermore, one can allow one additional step in the com-
+putation, so that the termination condition of the loop simplifies from x<>y to
+y<>0. This final implementation is shown in Fig. 13.8; its handshake circuit is
+shown in Fig. 13.9, in which the datapath operations have been represented in
+an abstract way, rather than as separate components.
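+
+To make the trade-off between operators and iterations concrete, the three
+Tangram variants can be transcribed into Python and their loop iterations
+counted. This is a sketch under assumptions: inputs are taken to be in the
+range 1..255, the function names are hypothetical, and the condition -comp.1
+is read as "the 9-bit subtraction y-x produced no borrow", which is an
+interpretation of the cast and not something the text spells out.
+

```python
from math import gcd as reference_gcd
from typing import Tuple


def gcd_initial(x: int, y: int) -> Tuple[int, int]:
    """Fig. 13.5 style: two subtractors. Returns (result, loop iterations)."""
    steps = 0
    while x != y:
        if x < y:
            y = y - x
        else:
            x = x - y
        steps += 1
    return x, steps


def gcd_swap(x: int, y: int) -> Tuple[int, int]:
    """Fig. 13.7 style: one subtractor, swapping when x is not below y."""
    steps = 0
    while x != y:
        if x < y:
            x, y = x, y - x
        else:
            x, y = y, x
        steps += 1
    return x, steps


def gcd_single_op(x: int, y: int) -> Tuple[int, int]:
    """Fig. 13.8 style: one operator yields both y-x and the comparison."""
    steps = 0
    while y != 0:
        diff = (y - x) & 0x1FF            # 9-bit result of the subtraction
        low, borrow = diff & 0xFF, diff >> 8
        if not borrow:                    # y >= x: keep x, replace y by y-x
            x, y = x, low
        else:                             # y < x: swap the two numbers
            x, y = y, x
        steps += 1
    return x, steps


def agree(x: int, y: int) -> bool:
    """Check all three variants against Python's math.gcd."""
    expected = reference_gcd(x, y)
    return all(f(x, y)[0] == expected
               for f in (gcd_initial, gcd_swap, gcd_single_op))
```

+
+For the inputs 6 and 4, for example, the three variants take 2, 4 and 5 loop
+iterations respectively: hardware is saved at the cost of extra iterations,
+with the last variant taking the one additional step mentioned above.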
+The handshake circuit of the optimized design has a simpler structure than
+that of the initial design. The number of logic blocks (and, in the single-rail
+implementation, the number of delay-matching gate chains) has been reduced
+from four to two. This improvement is also apparent in the area statistics given
+below.
+
+int = type [0..255]
+& gcd_gc : main proc (in?chan <<int,int>> & out!chan int).
+begin
+xy : var <<int,int>> ff
+& x = alias xy.0
+& y = alias xy.1
+& comp: func(). (y-x) cast <<int,bool>>
+| forever do
+in?xy
+; do y<>0 then
+xy:= if -comp.1 then <<x,comp.0>> else <<y,x>> fi
+od
+; out!x
+od
+end
+
+Figure 13.8. GCD algorithm in Tangram with optimized datapath.
+
+[Figure 13.9. Compiled handshake circuit for optimized GCD program. Labels: in,
+out, mux, xy, x, y, f(x,y), a comparison of y with 0, do, and a sequencer (;)
+with branches 0, 1, 2.]
+
+celltype          #occ    area   gate eq. [eqv]     (%)
+--------------------------------------------------------------------
+Delay Matchers      10    1080             20.0     9.4
+Memory              16    3744             69.3    32.7
+C-elements           6     576             10.7     5.0
+Logic               42    6048            112.0    52.8
+--------------------------------------------------------------------
+Total:              74   11448            212.0   100.0
+
+One may observe that area-wise, the design has improved in the control (the
+number of C-elements was reduced from 12 to 6), in the logic (the area for
+combinational logic was reduced from 174 to 112 gate equivalents), and in the
+total number of delay elements required (which was reduced from 19 to 10).
+
+13.3.
+Opportunities for asynchronous circuits
+
+When the asynchronous circuits generated by the Tangram compiler are
+compared to synchronous ones, three differences stand out, leading to four
+attractive properties of these asynchronous circuits.
+
+1 The subcircuits in a synchronous circuit are clock-driven, whereas they
+are demand-driven in an asynchronous one. This means that the subcir-
+cuits in an asynchronous circuit are only active when and where needed.
+Asynchronous circuits will therefore generally dissipate less power than
+synchronous ones.
+
+2 The operations in a synchronous circuit are synchronized by a central
+clock, whereas they are synchronized by distributed handshakes in an
+asynchronous circuit. Therefore
+
+a) a synchronous circuit shows large current peaks at the clock edges,
+whereas the power consumption of an asynchronous circuit is more
+uniformly distributed over time;
+
+b) the strict periodicity of the current peaks in a synchronous circuit
+leads to higher clock harmonics in the emission spectrum, which
+are absent in the spectrum of an asynchronous design.
+
+3 Synchronous circuits use an external time reference, whereas asynchronous
+circuits are self-timed. This means that asynchronous circuits operate over
+a wide range of the supply voltage (for instance, from 1 up to 3.3 V) while
+automatically adapting their speed. This property, called automatic
+performance adaptation, implies that asynchronous circuits are resilient to
+supply voltage variations. It can also be used to reduce the power
+consumption by adaptive voltage scaling, which means adapting the supply
+voltage to the performance required [100]. Adaptive voltage scaling
+techniques are also applied in synchronous circuits, but then special
+measures must be taken to adapt the clock frequency.
+
+Asynchronous circuits also have drawbacks. The most important one is that
+these circuits are unconventional: designers and mainstream tools and libraries
+are all oriented towards synchronous design methods. Additional drawbacks
+of asynchronous circuits come from the fact that they use gates to control
+registers (latches and flip-flops), instead of the relatively straightforward
+clock-distribution network in synchronous circuits. Although this enables low
+power consumption, it also leads to circuits that are typically larger, slower,
+and harder to test. Testability (for fabrication defects) is probably the most
+fundamental issue of these. For a discussion on testing asynchronous circuits
+we refer to [61, 120].
+Property 2.b was the main reason for Philips Semiconductors to design a
+family of asynchronous pager chips [114]. However, as is addressed in the next
+section, it is the other three properties that can be used most advantageously in
+contactless smart card chips.
+
+13.4.
+Contactless smartcards
+
+Contactless smart cards have a number of advantages when compared to
+contacted ones: they are convenient and fast to use, insensitive to dirt and
+grease and, since their readers have no slots, they are less susceptible to
+vandalism.
+The communication between a contactless smart card and reader is estab-
+lished through an electromagnetic field, emitted by the reader. The card has a
+coil through which it can retrieve energy from this field. The amount of energy
+that can be retrieved depends on the distance to the reader, the number of turns
+in the coil, and the orientation of the card in the field.
+Fig. 13.10 shows the functional parts of a contactless smartcard consisting
+of a VLSI circuit (in the dotted box) and an external coil. The tuned circuit
+formed by the coil and capacitor C0 is used for three purposes:
+
+receiving power;
+
+receiving the clock frequency (equal to the carrier frequency); and
+
+supporting the communication.
+
+The complete circuit consists of a power supply unit and a digital circuit with
+a buffer capacitor (C1) for storing energy.
+Several standards for contactless smart cards currently exist:
+
+ISO/IEC 10536, which specifies close coupling cards, operating at a dis-
+tance of 1cm from the reader.
+
+
+[Figure 13.10 diagram: power supply unit and digital circuit, with capacitors C0 and C1.]
+
+Figure 13.10.
+Contactless smart card.
+
+Table 13.1.
+Main characteristics of ISO/IEC 14443 standard.
+
+ISO 14443 standard            A (Mifare)      B
+--------------------------------------------------------------------
+Carrier frequency             13.56 MHz       13.56 MHz
+Throughput (up and down)      106 kbit/sec    106 kbit/sec
+Down link (reader to card)    ASK 100%        ASK 10%
+  encoding                    Miller          NRZ
+Up link (card to reader)      ASK             BPSK
+  frequency                   847.5 kHz       847.5 kHz
+  modulation                  Manchester      NRZ
+
+ISO/IEC 14443, for so-called proximity integrated circuit cards (PICCs),
+operating at a distance of up to 10cm from the reader, typically using 5
+turns in the on-card coil. This standard defines two types, A and B, the
+main characteristics of which are given in Table 13.1.
+
+ISO/IEC 15693 specifies vicinity integrated circuit cards (VICCs), op-
+erating at some 50cm from the reader, and typically requiring a coil with
+a few hundreds of turns.
+
+The Mifare [63] standard (ISO/IEC 14443 type A) has hundreds of millions
+of cards in operation today. Fig. 13.11 shows a Mifare card with both the chip
+and the coil visible. Mifare is a proximity card (it can be used up to 10 cm
+from the reader) supporting two-way communication. Performance is impor-
+tant, since the transaction time must be less than 200 msec. One of the first
+companies to deploy Mifare technology en masse was the Seoul Bus Asso-
+ciation, which has millions of such bus cards in use, generating hundreds of
+millions of transactions per month.
+This chapter describes an asynchronous Mifare smart card IC that was reported
+earlier in [73]. Both synchronous [116] and asynchronous [2] circuits for smart
+cards of the ISO/IEC 14443 type B standard have also been reported. Due to
+the 100% ASK modulation scheme in the type A standard, a Mifare IC is
+exposed to periods during which no power comes in at all, in contrast to the
+type B standard, which is based on 10% ASK modulation.
+Since on average (over time) contactless smart card chips receive only a few
+milliwatts of power, power efficiency is very important. Although low power
+is also important in battery-powered devices, there are two crucial differences
+between these two types of device.
+
+1 To maximize the battery life-time in battery-powered devices one should
+minimize the average power consumption. In contactless devices, how-
+ever, one should in addition minimize the peak power, since the peaks
+must be kept below a certain level, which depends on the incoming
+power and the buffer capacitor.
+
+2 The supply voltage is nearly constant in battery-powered devices whereas
+in contactless ones it may vary over time during a transaction due to fluc-
+tuations in both the incoming and the consumed power.
+
+Figure 13.11.
+Mifare card, showing IC (bottom left) and coil.
+
+In the bullets below, we give some facts about conventional synchronous
+chips for contactless smart cards, which, as we will see later, offer opportuni-
+ties for improvement by using asynchronous circuits.
+
+A synchronous circuit runs at a fixed speed dictated by the clock, despite
+the fact that both the incoming and the effectively consumed power
+vary over time. Synchronous circuits must, therefore, be designed so as
+to allow the most power-hungry operations to be performed when minimum
+power is coming in. Consequently, if too much power is being received,
+that superfluous power is thrown away. If, on the other hand,
+too little power is being received, the supply voltage drops, making the
+circuit slower and, as soon as the circuit has become too slow to meet
+the time requirements set by the clock, the transaction must be canceled.
+For this reason contactless smart card chips contain subcircuits that detect
+when the voltage drops below a certain threshold and then abort the
+transaction.
+
+Currently, the performance of the microcontroller in a contactless smart
+card chip is usually not limited by the speed of the circuit, but by the
+RF-power being received.
+
+A synchronous circuit requires a buffer capacitor of several nanofarads
+and the area needed for such a capacitor is of the same order of magnitude
+as the area needed for the microcontroller.
+
+The communication from the smart card to the reader is based on modulating
+the load, which implies that normal functional load fluctuations
+may interfere with the communication.
+
+[Figure 13.12 diagram: blocks DES, RSA, UART, 80C51 microcontroller, RAM, ROM and EEPROM.]
+
+Figure 13.12.
+Global design of the smart-card circuit.
+
+13.5.
+The digital circuit
+
+We have built the digital circuit shown in Fig. 13.12 that consists of:
+
+an 80C51 microcontroller;
+
+three kinds of low-power memory, the sizes and access times of which
+are given in Table 13.2 (64 bytes can be written simultaneously in one
+write access to the EEPROM);
+
+two encryption coprocessors:
+
+– an RSA converter [119] for public key conversions and
+
+
+– a triple DES converter [96] for private key conversions;
+
+a UART for the external communication.
+
+The EEPROM contains program parts as well as data such as encryption keys
+and e-money. Both the ROM and the RAM are equipped with matching delay
+lines and for the EEPROM we designed a similar function based on a counter.
+These delay lines have been used to provide all three memories with a hand-
+shake interface, which made it extremely easy to deal with the differences in
+access time as well as variations in both temperature and supply voltage. An
+additional advantage is that the controller automatically runs faster when exe-
+cuting code from ROM than when executing code from EEPROM.
+The circuit is meant to be used in a dual-interface card, which is a card with
+both a contacted and a contactless interface. Apart from the RSA converter,
+which will not be used in contactless operation, all circuits are asynchronous.
+In contactless operation, the average supply voltage will be about 2 V. The
+simulations, however, are done at 3.3 V, which is the voltage at which the
+library has been characterized.
+
+Table 13.2.
+Memory sizes and access times.
+
+Memory type    Size [kbyte]    Access time [ns]
+                               read       write
+--------------------------------------------------------------------
+RAM                       2      10          10
+ROM                      38      30
+EEPROM                   32     180       4,000
+
+13.5.1
+The 80C51 microcontroller
+
+The 80C51 microcontroller is a modified version of the one described in
+[144, 143]. The four most important modifications are described below.
+To deal with the slow memories a prefetch unit has been included in the
+80C51 architecture. At 3.3 V the average instruction execution time in free-
+running mode is about 100 ns provided it takes no time to fetch the code from
+memory. If, however, code is fetched from the EEPROM and the microcon-
+troller has to wait during the read accesses, the performance would be drasti-
+cally reduced, since most instructions are one or two bytes long, taking 180 or
+360 ns to fetch. To avoid this performance degradation a form of instruction
+prefetching has been introduced in which a process running concurrently with
+the 80C51 core fetches code bytes as long as a two-byte FIFO is not full.
+The prefetch unit gives an increase in performance of about 30%. A simplified
+version of the prefetch unit is described in Section 13.5.2.
+We also introduced early write completion, which means that the microcon-
+troller continues execution as soon as it has issued a write access. This has
+been introduced to prevent the microcontroller from waiting during the 4 msec
+it takes to do a write access to the EEPROM (for instance to change the e-
+money), but also to speed up the write accesses to the RAM. To exploit this
+feature when doing a write access to the EEPROM, the corresponding code
+must be in the ROM.
+The controller has been provided with an immediate halt input signal by
+which its execution can be halted within a short time. This provision is nec-
+essary to deal with the fact that the information, which the reader sends to the
+card, is coded by suppressing the carrier during periods of 3 µsec. Since the
+card does not receive any power during these periods, the controller has to be
+halted immediately (only some basic functions continue to operate). In the
+synchronous design this halting function came naturally, since the clock would
+stop during these periods.
+We have introduced a quasi-synchronous mode, in which the microcontroller
+is, at instruction level, fully timing compatible with its synchronous
+counterpart. In this mode, the asynchronous microcontroller waits after each
+instruction until the number of clock ticks required by a synchronous version
+has elapsed. This mode is necessary when time-dependent functions are designed
+in software. Since this mode is under software control, the microcontroller can
+easily switch modes depending on the function it is executing. This feature
+was also of great help to demonstrate the guaranteed performance, which is
+the maximum clock rate at which each instruction terminates within the given
+number of clock ticks. For most programs, the free-running performance is
+about twice as high as the guaranteed performance.
+We have compared the asynchronous circuit as compiled from Tangram with
+a synchronous circuit that is functionally equivalent, and which has been
+compiled from synthesizable VHDL to the same CMOS technology. These ICs
+have a comparable performance. The asynchronous microcontroller nicely
+demonstrates the three properties of asynchronous circuits that we want to
+exploit in the design of the smart-card chip:
+
+The average power consumption of the asynchronous 80C51 is about
+three times lower than the power consumption of its synchronous coun-
+terpart when delivering the same performance at the same supply volt-
+age.
+
+Fig. 13.13 shows the current peaks of both the synchronous and the
+asynchronous 80C51 at 3.3 V, where the asynchronous version is running in
+quasi-synchronous mode, giving a performance that is 2.5 times higher
+than the synchronous design (the synchronous version runs at 10 MHz
+and the asynchronous version at 25 MHz). Despite the fact that the figure
+does not give a fair impression of the average power being consumed,
+it clearly shows that the current peaks of the asynchronous 80C51 are
+about five times smaller than those of the synchronous equivalent.
+
+[Figure 13.13 plot: current I [mA] against time [µsec], one panel for the synchronous version and one for the asynchronous version.]
+
+Figure 13.13.
+Current peaks of 80C51 microcontroller.
+
+The performance adaptation property of asynchronous circuits is demon-
+strated in Fig. 13.14, which shows the free-running performance of the
+microcontroller, when executing code from ROM, as a function of the
+supply voltage. As is expected, the performance depends linearly on the
+supply voltage. When the supply voltage goes up from 1.5 to 3.3 V, the
+performance increases from 3 to 8.7 MIPS (about a factor 3). Since the
+ROM containing the program does not function properly when the sup-
+ply voltage is below 1.5 V, we could not measure the performance for
+lower values. We observed, however, that the DES coprocessor, which
+does not need a memory, still functions correctly at a supply voltage
+level as low as 0.5 V.
+
+The figure also shows the supply current as a function of the supply
+voltage. Note that the current increases in this range from 0.7 to 6 mA
+(about a factor 9). Since in CMOS circuits the current is the product
+of the transition rate (performance) and the charge being switched per
+transition (both of which depend linearly on the supply voltage), the
+current increases with the square of the voltage. From this it follows that
+the power, being the product of the current and the voltage, goes up with
+the cube of the voltage.
+
+From this data one can compute the third curve showing the energy
+needed to execute an instruction, which increases with the square of the
+supply voltage from 0.35 to 2.25 nJ.
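+
+The energy figures can be re-derived from the two operating points quoted
+above. The sketch below is only a consistency check on the published
+numbers, assuming that the energy per instruction is the supply power divided
+by the instruction rate; the helper name energy_per_instruction_nj is
+illustrative.
+
```python
# Operating points quoted in the text for code executing from ROM.
LOW = {"voltage_v": 1.5, "current_ma": 0.7, "mips": 3.0}
HIGH = {"voltage_v": 3.3, "current_ma": 6.0, "mips": 8.7}


def energy_per_instruction_nj(voltage_v: float, current_ma: float, mips: float) -> float:
    """Energy per instruction in nJ, assuming E = (V * I) / instruction rate."""
    power_mw = voltage_v * current_ma
    # mW / MIPS = (1e-3 J/s) / (1e6 instructions/s) = 1 nJ per instruction
    return power_mw / mips


e_low = energy_per_instruction_nj(**LOW)
e_high = energy_per_instruction_nj(**HIGH)

if __name__ == "__main__":
    print(f"1.5 V: {e_low:.2f} nJ per instruction")
    print(f"3.3 V: {e_high:.2f} nJ per instruction")
    print(f"performance ratio: {HIGH['mips'] / LOW['mips']:.1f}")
    print(f"current ratio:     {HIGH['current_ma'] / LOW['current_ma']:.1f}")
```
+
+Under these assumptions the two points give roughly 0.35 nJ and 2.3 nJ, in
+line with the quoted range of 0.35 to 2.25 nJ.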
+
+[Figure 13.14 plot: performance, supply current and energy per instruction as functions of the supply voltage.]
+
+Figure 13.14.
+Measured performance of the asynchronous 80C51 for various supply voltages.
+
+13.5.2
+The prefetch unit
+
+Fig. 13.15 gives the Tangram code of a simplified version of the prefetch
+unit. The prefetch unit communicates with the 80C51 core through two chan-
+nels: it receives the address from which to start fetching code bytes via channel
+StartAddress and then it sends these bytes through channel CodeByte. Since
+the prefetch unit plays the passive role in both communications, it can probe
+each channel to see whether the core has started a communication through that
+channel. The state of the prefetch unit consists of program counter, pc, and
+a two-place buffer, which is implemented by means of an array Buffer, an
+integer count, and two one-bit pointers getptr and putptr.
+
+
+forever do
+  sel probe(StartAddress)
+      then ( StartAddress?pc
+          || putptr := getptr
+          || count := 0
+          || AbortMemAcc()
+           )
+  or probe(CodeByte) and (count>0)
+      then CodeByte!Buffer[getptr]
+         ; ( getptr := next(getptr)
+          || count := count-1
+           )
+  or MemAck
+      then Buffer[putptr] := MemData
+         ; ( putptr := next(putptr)
+          || count := count+1
+          || pc := pc+1
+          || CompleteMemAcc()
+           )
+  les
+; if (count<2) and -MemReq
+  then MemReq := true
+  fi
+od
+
+Figure 13.15.
+Tangram code of a simplified version of the prefetch unit.
+
+The prefetch unit executes an infinite loop and in each step it first executes a
+selection command (denoted by sel ... les), in which it can select among three
+guarded commands, which are separated by the keyword or. Each guarded
+command is of the form
+
+Guard
+then
+Command,
+
+in which Guard is a Boolean condition, and Command a command that may
+be executed only if the guard holds. A command is said to be enabled if the
+corresponding guard holds. Executing a selection command implies waiting
+until at least one of the commands is enabled, then selecting such a command
+—in an arbitrated choice— and executing it.
+In the first guarded command, channel StartAddress is probed to find out
+whether the core is sending a new start address. In that case, program counter
+pc is set to the address received, the buffer is flushed, and a possible outstand-
+ing memory access is aborted (by resetting both MemReq and the delay counter).
+All four subcommands in this guarded command are executed in parallel (‘A
+|| B’ means execute commands A and B concurrently, whereas ‘A ; B’ means
+execute A and B sequentially).
+
+
+The second guarded command takes care of sending the next program byte
+via channel CodeByte to the core if the core is ready to receive that byte and
+the buffer is not empty. The third guarded command gets enabled if MemAck
+goes high indicating that the data signals in a read access are valid. In that
+case the value read from memory is put in the buffer after which the memory
+handshake is completed.
+After each event, if the buffer is not full and no memory access is being
+performed, a next memory access is started. Since (count<2) or -MemReq is
+an invariant of the loop, the last (conditional) command in the loop can be
+simplified to the unconditional assignment MemReq:= (count<2).
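+
+The simplification can be checked by enumerating the small state space. The
+sketch below assumes count ranges over 0..2 (a two-place buffer) and treats
+MemReq as a Boolean; it compares the conditional command with the
+unconditional assignment in every state and records where they agree. The
+function names are illustrative.
+
```python
from itertools import product


def conditional(count: int, mem_req: bool) -> bool:
    """if (count<2) and -MemReq then MemReq := true fi"""
    if count < 2 and not mem_req:
        mem_req = True
    return mem_req


def unconditional(count: int, mem_req: bool) -> bool:
    """MemReq := (count<2)"""
    return count < 2


def invariant(count: int, mem_req: bool) -> bool:
    """Buffer not full, or no memory request outstanding."""
    return count < 2 or not mem_req


states = list(product(range(3), (False, True)))
agree = [s for s in states if conditional(*s) == unconditional(*s)]
holds = [s for s in states if invariant(*s)]

if __name__ == "__main__":
    print("states where both commands agree:", agree)
    print("states satisfying the invariant: ", holds)
```
+
+The two commands differ only in the state with a full buffer and an
+outstanding request, which is exactly the state the invariant excludes.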
+Note that the value (pc-count) is equal to the program counter in the core,
+since it is set to the destination address in the case of a jump, increased by
+1 if a code byte is transferred to the core, and kept invariant if a code byte
+is read from memory. Therefore the core does not need to hold the program
+counter and instead, when the information is needed for a relative branch, it can
+retrieve the counter value from the prefetch unit. Clearly, this feature requires
+an extension of the Tangram code shown in Fig. 13.15.
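+
+The relation between pc, count and the program counter of the core can be
+illustrated with a small sequential model of the three guarded commands. This
+is a sketch under simplifying assumptions: the memory handshake (MemReq,
+MemAck, abort) is left out, memory is a plain Python list, and the class and
+method names are hypothetical.
+
```python
class PrefetchModel:
    """Toy model of the prefetch state: pc, count and a two-place buffer."""

    def __init__(self, memory):
        self.memory = memory
        self.pc = 0
        self.count = 0
        self.buffer = [None, None]
        self.getptr = 0
        self.putptr = 0

    def start_address(self, address):
        # First guarded command: take the new address and flush the buffer.
        self.pc = address
        self.putptr = self.getptr
        self.count = 0

    def code_byte(self):
        # Second guarded command: hand the next buffered byte to the core.
        assert self.count > 0, "buffer is empty"
        byte = self.buffer[self.getptr]
        self.getptr ^= 1            # next() on a one-bit pointer
        self.count -= 1
        return byte

    def mem_ack(self):
        # Third guarded command: store the byte that was read from memory.
        assert self.count < 2, "buffer is full"
        self.buffer[self.putptr] = self.memory[self.pc]
        self.putptr ^= 1
        self.count += 1
        self.pc += 1

    def core_pc(self):
        # Address of the next byte the core will receive.
        return self.pc - self.count


if __name__ == "__main__":
    unit = PrefetchModel(memory=list(range(100, 116)))
    unit.start_address(4)
    unit.mem_ack()
    unit.mem_ack()
    print(unit.core_pc(), unit.code_byte(), unit.core_pc())
```
+
+In this model core_pc() is unchanged by mem_ack(), advances by one on
+code_byte(), and is reset by start_address(), matching the three cases listed
+above.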
+
+13.5.3
+The DES coprocessor
+
+A transaction may need up to ten single DES conversions, where each con-
+version takes about 10 ms if it is executed in software. Therefore a hardware
+solution is needed, since these conversions would otherwise consume about
+half of the transaction time.
+Fig. 13.16 shows the datapath of the DES coprocessor. The processor sup-
+ports both single- and triple-DES conversions and, for the latter type of con-
+version, contains two keys: a foreground and a background key. Single-DES
+conversions use the foreground key, whereas triple-DES conversions use the
+foreground key for the first and third conversion and the background key for
+the second conversion. The foreground key is stored in register CD0 consisting
+of 56 flipflops (the DES key size is 56 bits), whereas the background key re-
+sides in variable CD1 consisting of 56 latches. The text value resides in variable
+LR consisting of 64 latches (DES words contain 64 bits).
+A single-DES conversion consists of 16 steps and, in each step, the key is
+permuted and a new text value is computed from the old text value and the key.
+In order to have the key return to its original value at the end of a conversion,
+the key makes two basic permutations in 12 of the 16 steps and only one in
+the remaining 4, where 28 basic permutations are needed for a complete cycle.
+The permutations are performed in flipflop register CD0.
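+
+The step counts can be checked against the standard DES key schedule. The
+sketch below assumes that a "basic permutation" corresponds to a one-bit
+rotation of each 28-bit key half, and uses the left-shift schedule from the
+DES specification; the helper names are illustrative.
+
```python
# Left-shift amounts for the 16 rounds, as given in the DES specification.
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]
HALF_BITS = 28
MASK = (1 << HALF_BITS) - 1


def rotate_left(value: int, amount: int) -> int:
    """Rotate a 28-bit value left by the given number of bit positions."""
    amount %= HALF_BITS
    return ((value << amount) | (value >> (HALF_BITS - amount))) & MASK


def run_key_schedule(half: int) -> int:
    """Apply all 16 per-round rotations to one 28-bit key half."""
    for shift in SHIFTS:
        half = rotate_left(half, shift)
    return half


double_steps = SHIFTS.count(2)
single_steps = SHIFTS.count(1)
total_rotations = sum(SHIFTS)

if __name__ == "__main__":
    print(double_steps, single_steps, total_rotations)
    print(hex(run_key_schedule(0x1234567)))
```
+
+Twelve double and four single rotations add up to 28, so a 28-bit half is
+back at its starting value after the sixteenth step.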
+[Figure 13.16 diagram: datapath with blocks labelled cd(56), CD0(56), CD1(56), LR(64), lr(64), Data(8), DES, LRotMuxLR, RotMuxCD, Swap, Inp/Enc/Dec and SFRwrite.]
+
+Figure 13.16.
+DES coprocessor architecture.
+
+Most of the area is taken by the combinational circuit called DES. Since this
+circuit is also dominant in power dissipation, one should minimize the number
+of transitions at its inputs. For this purpose, we have introduced two latch
+registers: cd for the key and lr for the text. If two basic permutations are done
+in one step, cd hides the effect of the first one from combinational circuit DES.
+In addition, all inputs of combinational circuit DES change only once in each
+step by loading the two registers lr and cd simultaneously and then storing the
+result in register LR as described by the following piece of Tangram text.
+
+( lr:= LR || cd:= CD0 ) ; LR:= DES(lr,cd)
+
+Therefore, latch register lr also serves as a kind of slave register. Latch
+register cd also serves a functional purpose, since the two keys are swapped by
+executing the following three transfers.
+
+cd:= CD0 ; CD0:= CD1 ; CD1:= cd
+
+The size of the DES coprocessor is 3,250 gate equivalents, of which 57%
+is taken by the combinational logic and 35% by latches and flip-flops. Con-
+sequently, the overhead in area due to the asynchronous design style (delay
+matching and C-elements) is marginal at 8%. At 3.3 V, a single-DES conver-
+sion takes 1.25 µs and 12 nJ.
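+
+The case for a hardware solution follows from simple arithmetic on the
+figures quoted in this chapter. The sketch below assumes the worst case of
+ten conversions and the 200 msec transaction limit mentioned for Mifare, and
+uses the 3.3 V conversion time, so the speed-up is indicative only (in
+contactless operation the supply voltage, and hence the speed, is lower).
+
```python
TRANSACTION_BUDGET_MS = 200.0   # transaction time limit quoted for Mifare
CONVERSIONS = 10                # "up to ten" single-DES conversions
SOFTWARE_MS = 10.0              # per conversion when executed in software
HARDWARE_US = 1.25              # per conversion in the coprocessor at 3.3 V

software_total_ms = CONVERSIONS * SOFTWARE_MS
hardware_total_ms = CONVERSIONS * HARDWARE_US / 1000.0

software_share = software_total_ms / TRANSACTION_BUDGET_MS
hardware_share = hardware_total_ms / TRANSACTION_BUDGET_MS
speedup = (SOFTWARE_MS * 1000.0) / HARDWARE_US

if __name__ == "__main__":
    print(f"software: {software_total_ms:.1f} ms ({software_share:.0%} of the budget)")
    print(f"hardware: {hardware_total_ms:.4f} ms ({hardware_share:.4%} of the budget)")
    print(f"speed-up per conversion: {speedup:.0f}x")
```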
+
+
+Fig. 13.17 shows the simulated current of the DES coprocessor at 3.3 V
+(the microcontroller is active before and after the DES computation). The real
+current peaks will be much smaller due to a lower supply voltage (the DES
+processor functions properly at a supply voltage as low as 0.5 V) as well as the
+buffer capacitor (the resolution in the simulation is 1 ns).
+
+[Figure 13.17 plot: current i [mA] against time t [µs].]
+
+Figure 13.17.
+DES coprocessor current at 3.3 V.
+
+The conversion time, of a few microseconds, is so small that we used the
+handshaking mechanism to obtain synchronization between the microcontroller
+and the coprocessor. After starting the coprocessor, the microcontroller can
+continue executing instructions, and only when reading the result will it be
+held up in a handshake until the result is available. Note that a synchronous
+design would require a form of busy-waiting.
+
+13.6.
+Results
+
+Fig. 13.18 shows the layout of the chip, which is in a five-layer metal,
+0.35 µm technology and has a size of 4.52 × 4.16 ≈ 18 mm2, including the
+bond pads. Many bond pads are only included for measurement and evaluation
+purposes. A production chip only needs about 10 bond pads.
+The two horizontal blocks on top form the buffer capacitor (in a production
+chip, the capacitor would only require about one quarter of the area). The
+memories are on the next row, from left to right: two RAMs, one ROM and
+the EEPROM, which is the large block to the right. The asynchronous circuit
+is located in the lower left quadrant, near the centre.
+
+Figure 13.18.
+Layout of smart card chip.
+
+Table 13.3a gives the area of the blocks constituting the contactless digital
+circuit, which is the asynchronous circuit together with the memories.
+The other modules are either synchronous or analog circuits, where the
+synchronous modules are not used in contactless operation. From this table it
+follows that the asynchronous logic takes only 12% of the total contactless
+digital circuit.
+
+Table 13.3a.
+Areas of the contactless digital circuit blocks.
+
+Block           Area [mm2]
+--------------------------
+RAM                    1.2
+ROM                    1.0
+EEPROM                 5.6
+Async. circ.           1.1
+--------------------------
+Total                  8.9
+
+Table 13.3b.
+Areas of the asynchronous modules.
+
+Module          Area [GE]
+-------------------------
+CPU                 7,800
+Pref. Unit            700
+DES                 3,250
+UART                2,040
+Interfaces          3,680
+Timer               1,080
+-------------------------
+Total              18,550
+
+
+The sizes of the different asynchronous modules are given in Table 13.3b.
+In the standard cell library used, a gate equivalent (GE) is 54 µm2 with a typical
+layout density of 17,500 gates per mm2.
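+
+These two library figures link the gate-equivalent counts of Table 13.3b to
+the silicon area of Table 13.3a. The sketch below is a consistency check
+only; it assumes that the asynchronous circuit area is the total module count
+divided by the typical layout density, and the variable names are
+illustrative.
+
```python
GE_AREA_UM2 = 54.0           # area of one gate equivalent in this library
LAYOUT_DENSITY = 17_500.0    # typical gate equivalents per mm2 after layout

# Module sizes in gate equivalents, copied from Table 13.3b.
modules_ge = {
    "CPU": 7_800,
    "Pref. Unit": 700,
    "DES": 3_250,
    "UART": 2_040,
    "Interfaces": 3_680,
    "Timer": 1_080,
}

total_ge = sum(modules_ge.values())
cell_area_mm2 = total_ge * GE_AREA_UM2 / 1e6    # cells only, no routing overhead
layout_area_mm2 = total_ge / LAYOUT_DENSITY     # at the typical layout density

if __name__ == "__main__":
    print(f"total: {total_ge} GE")
    print(f"cell area:   {cell_area_mm2:.2f} mm2")
    print(f"layout area: {layout_area_mm2:.2f} mm2")
```
+
+At the typical density the 18,550 gate equivalents occupy a little over
+1 mm2, in agreement with the 1.1 mm2 listed for the asynchronous circuit.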
+
+Table 13.3c.
+Power of the contactless digital circuit.
+
+Block     Power
+---------------
+Core        56%
+ROM         27%
+RAM         17%
+
+Table 13.3d.
+Effect of asynchronous design on power and area at different levels.
+
+Level             Power    Area
+-------------------------------
+Async. circ.       -70%    +18%
+Async. + Mem.      -60%     +2%
+
+Table 13.3c shows the power dissipation of the digital circuit blocks when
+the controller is executing code from ROM (being the normal situation).
+Table 13.3d shows the effects on power and area of an asynchronous design
+at two different levels. The asynchronous circuit gives a reduction in power
+dissipation of about 70% for 18% additional area. At the level of the contact-
+less digital circuit, however, we obtain a power reduction of 60% for only 2%
+additional area. Note that this analysis does not include the synchronous RSA
+converter and the analog circuits needed in a production chip, such as
+the buffer capacitor and the power supply unit. Therefore at chip level
+the relative reduction in power dissipation will be about the same, whereas the
+overhead in area will be reduced even further.
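+
+The drop from 18% to 2% area overhead follows from the block areas in
+Table 13.3a. The sketch below assumes that the 18% is measured relative to a
+synchronous implementation of the same logic and that the memories are
+unchanged between the two designs; it covers the area figure only.
+
```python
ASYNC_AREA_MM2 = 1.1    # asynchronous circuit, Table 13.3a
TOTAL_AREA_MM2 = 8.9    # contactless digital circuit, Table 13.3a
AREA_OVERHEAD = 0.18    # extra area at the asynchronous-circuit level

# Area the same logic would need without the overhead (assumption: the 18%
# is relative to the synchronous equivalent).
sync_logic_mm2 = ASYNC_AREA_MM2 / (1.0 + AREA_OVERHEAD)
extra_mm2 = ASYNC_AREA_MM2 - sync_logic_mm2

# The memories are assumed identical, so only the extra logic area differs.
sync_total_mm2 = TOTAL_AREA_MM2 - extra_mm2
overhead_with_memories = extra_mm2 / sync_total_mm2

if __name__ == "__main__":
    print(f"extra area: {extra_mm2:.3f} mm2")
    print(f"overhead including memories: {overhead_with_memories:.1%}")
```
+
+Because the memories dominate the area, the extra logic amounts to roughly
+2% of the contactless digital circuit.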
+
+13.7.
+Test
+
+The testing of asynchronous circuits for manufacturing defects is known to
+be a difficult problem [61, 120]. The main problem is that asynchronous cir-
+cuits have many feedback loops that cannot be controlled by an external signal.
+This makes the introduction of scan testing expensive, and forces the designer
+to become involved in the development of a functional test, either in producing
+the patterns directly or in implementing the design-for-test measures.
+A functional test approach was chosen for the chip described in this chapter.
+During test, the microcontroller is connected to an external ROM that contains
+a test program. This program computes a signature, and a circuit is said to be
+defective if the signature is not correct. In addition, current measurements are
+performed to increase the test coverage.
+The functional tests were developed using the test for the 80C51 microcon-
+troller [144] as a starting point. Both this test and its extension were developed
+using the test-evaluation tool described in [152]. This tool evaluates the struc-
+tural test coverage (controllability and observability) of functional traces.
+
+
+For the datapath logic, the tool can be used to achieve a 100% coverage for
+the stuck-at-input fault model, even though actually achieving this may be a
+real challenge to the test engineer. For the 80C51 microcontroller, however,
+with the inherent controllability and observability of its registers and buses,
+this appears to be feasible.
+In the absence of full-scan, it is known that a 100% coverage for the stuck-
+at-input fault model can only be achieved for the asynchronous control logic
+by using a combination of functional patterns and current measurements in a
+circuit that has been modified so as to be pausable in the middle of carefully
+selected handshakes [120]. Since this modification has not been implemented
+in the smartcard circuit reported here, the fault coverage can never be 100%.
+The exact fault coverage of the traces that were used here is not known,
+as this would demand unrealistic levels of compute power using a verification
+tool, but is estimated to be around 90% for the asynchronous control and data-
+path subsystem.
+
+13.8.
+The power supply unit
+
+Fig. 13.19 shows the power supply unit consisting of a rectifier and a power
+regulator, which are both completely analog circuits. The design of the recti-
+fier is conventional, and of the regulator we discuss only those aspects of the
+behaviour that are relevant to the design of the digital circuit without going
+into the details of its design.
+
+[Figure 13.19 diagram: rectifier (Rect) with output v0, i0 feeding the power regulator (Pow Reg) with output v1, i1.]
+
+Figure 13.19.
+Power supply unit.
+
+To avoid interference with the communication, a power regulator has been
+designed that shows an almost constant load at its input. Fig. 13.20 shows
+Spice-level simulation results of such a power regulator when the input volt-
+age V0 is fixed at 5V. On the horizontal axis we have the activity (number of
+transitions per second) of the digital circuit. The input load is almost constant,
+since input current i0 is almost constant over the whole range.
+When the activity is low, output voltage V1 is constant at about 3 V. In this
+range, too much power is coming in and the regulator functions as a voltage
+source with output current i1 increasing when the activity increases. The
+superfluous power is shunted to ground. However, i1 reaches a saturation point
+when it reaches i0. From this point on, no more power is shunted to ground and
+the regulator starts to function as a current source with output voltage V1
+decreasing when the activity increases. The regulator delivers maximum power
+in the middle of the range where both the outgoing voltage and the outgoing
+current are high. Note that these simulation results assume constant incoming
+RF power. The variations in the incoming RF power during a transaction,
+however, are an additional source of fluctuations in V1, since these
+variations result in shifts of the saturation point.
+
+[Figure: input current i0 and output current i1 (mA) and output voltage V1 (V) plotted against activity.]
+
+Figure 13.20.
+Power regulator behaviour.
+
+A power source with these characteristics burdens the designer of a syn-
+chronous circuit with the problem of trading off between performance and ro-
+bustness. Going for maximum performance means assuming a supply voltage
+of 3 V in which case a transaction must be aborted if the voltage drops below
+2.5 V, for example. On the other hand, if the designer opts for a more robust
+design by choosing 2 V as the operating voltage, performance is lost when
+the regulator delivers 3 V. Such trade-offs are not needed for an asynchronous
+circuit, since it always automatically gives the maximum performance for the
+power received.
+
+13.9.
+Conclusions
+
+We have designed, built and evaluated an asynchronous chip for contactless
+smart cards in which we have exploited the fact that asynchronous circuits:
+
+use little average power,
+
+show small current peaks, and
+
+operate over a wide range of the supply voltage.
+
+
+248
+Part III: Large-Scale Asynchronous Designs
+
+Measurements and simulations showed the following advantages of this de-
+sign when compared to a conventional synchronous one:
+
+The asynchronous circuit gives the maximum performance for the power
+received. This comes mainly from the fact that the asynchronous de-
+sign needs less of what is the main limiting factor for the performance,
+namely power. Compared to a synchronous design, the asynchronous
+circuit needs about 40% of the power for less than 2% additional area.
+In addition, the automatic speed adaptation property of asynchronous
+circuits saves the designer from trading off between performance and
+robustness. Due to this property the asynchronous circuit will give free-running
+instead of guaranteed performance, where the difference between
+the two is about a factor of two.
+
+The asynchronous design is more resilient to voltage drops, since it still
+operates correctly for voltages down to 1.5 V.
+
+The current peaks of an asynchronous circuit are less pronounced, mak-
+ing the requirements with respect to the buffer capacitor more modest.
+
+The combination of the power regulator with the asynchronous circuit
+gives little communication interference. In this case, the smaller current
+peaks and the self-adaptation property are of importance.
+
+Acknowledgements
+
+We gratefully acknowledge the other members of the Tangram team: Kees
+van Berkel, Marc Verra and Erwin Woutersen, and we thank Klaus Ully for
+helping us to get the DES converter right.
+
+
+Chapter 14
+
+AN ASYNCHRONOUS VITERBI
+DECODER
+
+*
+
+Linda E. M. Brackenbury
+Department of Computer Science, The University of Manchester
+
+lbrackenbury@cs.man.ac.uk
+
+Abstract
+Viterbi decoders are used for decoding data encoded using convolutional for-
+ward error correcting codes. Such codes are used in a large proportion of digital
+transmission and digital recording systems because, even when the transmitted
+signal is subjected to significant noise, the decoder is still able efficiently to de-
+termine the most likely transmitted data.
+This chapter describes a novel Viterbi decoder aimed at being power efficient
+through adopting an asynchronous approach. The new design is based upon se-
+rial unary arithmetic for the computation and storage of the metrics required;
+this arithmetic replaces the add-compare-select parallel arithmetic performed by
+conventional synchronous systems. Like all Viterbi decoders, a history of com-
+putational results is built up over many data bits to determine the data most likely
+to have been transmitted at an earlier time. The identification of a starting point
+to this tracing operation allows the storage requirement to be greatly reduced
+compared with that in conventional decoders where the starting point is random.
+Furthermore, asynchronous operation in the system described enables multiple,
+independent, concurrent tracing operations to be performed which are decoupled
+from the placing of new data in the history memory.
+
+Keywords:
+low-power asynchronous circuits, Viterbi, convolution decoder
+
+14.1.
+Introduction
+
+The PREST (Power REduction for Systems Technology) [1] project was a
+collaborative project where each partner designed a low-power alternative to
+an industry standard Viterbi decoder. The Manchester University team’s aim
+was to obtain a power-efficient design through the use of asynchronous timing.
+The Viterbi decoder function [148] was chosen as it is a key function in
+digital TV and mobile communications. It detects and corrects transmission
+errors, outputting the data stream which, according to its calculations, is the
+most likely to have been transmitted. Viterbi coding is popular because it can
+deal with a continual data stream and requires no framing information such as
+a block start or finish. Furthermore, the way in which the output stream is
+constructed means that even if output data is incorrectly indicated, this situation
+will correct itself in time.
+
+*The Viterbi design work was supported by the EPSRC/MoD PowerPack project GR/L27930 and the EU
+PREST project EP25242, and this support is gratefully acknowledged.
+
+14.2.
+The Viterbi decoder
+
+14.2.1
+Convolution encoding
+
+In order to understand the functions performed by the decoder, the way the
+data is encoded needs to be described. This is illustrated in figure 14.1. The
+input data stream enters a shift register which is initially set to zero. The 2-
+bit register stores the previous two input bits and these are combined with the
+current bit in two modulo-2 adders to provide the two binary digits which form
+the encoded output for the current input bit.
+
+[Figure: the input i(t) passes through two 1-bit delays giving i(t−1) and i(t−2); two modulo-2 adders form the encoded outputs X and Y.]
+
+Figure 14.1.
+Four-state encoder.
+
+For example, suppose that the input stream comprises 011 with the current
+input bit (‘1’) on the right and the oldest bit (‘0’) on the left, then the encoded
+X-output, which adds all three bits modulo 2, is 0 and the encoded Y-output,
+which adds the current and oldest bits, is 1; so ‘01’ would be transmitted in this
+case. As each input bit causes two encoded bits to be transmitted, the code is
+referred to as 1/2 rate.
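+
+The encoding rule just described can be sketched in a few lines of Python. This is
+only an illustration of the four-state example, not code from the original design;
+the function name encode_four_state and its variable names are assumptions
+introduced here.
+
```python
def encode_four_state(bits):
    """Rate-1/2 encoder with a 2-bit shift register (constraint length 3)."""
    prev1 = prev2 = 0              # i(t-1) and i(t-2); the register starts at zero
    encoded = []
    for bit in bits:
        x = bit ^ prev1 ^ prev2    # X adds all three bits modulo 2
        y = bit ^ prev2            # Y adds the current and oldest bits modulo 2
        encoded.append((x, y))
        prev2, prev1 = prev1, bit  # shift the register by one place
    return encoded


# The last output for the stream 0, 1, 1 is the pair transmitted in the example.
print(encode_four_state([0, 1, 1])[-1])   # (0, 1)
```
+
+Under the same assumptions, the input sequence 1, 1, 0, 1 from the all-zeros state
+produces the pairs 11, 01, 01 and 00.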
+
+The use of three input bits (called the constraint length k) means that the
+encoder has 2^(k-1) = 4 possible states, ranging from state S0 when both
+previous bits are zero to S3 when both these bits are ones. So, a current state n
+will become state 2n modulo 4 if the current input bit is a zero and (2n+1)
+modulo 4 if it is a one; for example, if the current state is 2, the next state will
+be 0 or 1. That is, each state leads to two known states. This state transition
+information versus time is normally drawn in the form of a trellis diagram as
+shown in figure 14.2 for a four-state system. States are also referred to as nodes,
+and the node-to-node connection network is referred to as a butterfly connection
+due to its predictability, regularity and shape.
+
+[Figure: trellis of states S0 to S3 against time 0 to 4, with each branch labelled by its encoder output; the path for the data 1, 1, 0, 1 is drawn with a thicker line.]
+
+Figure 14.2.
+Four-node trellis.
+
+By knowing the starting state of the trellis at any time and the subsequent
+input stream, the route taken by the encoder and the state it reaches can be
+traced. For example, starting at state 0 at time 0, an input pattern of 1 then 1
+then 0 followed by 1 will cause the trellis path to travel from state
+S0 -> S1 -> S3 -> S2 -> S1; this route is indicated by the thicker black line on figure 14.2.
+This figure also shows the encoder outputs associated with each path or
+branch on the trellis. For example, moving from state S3 to S2, corresponding
+to a current input of ‘0’, the input stream must be 110 (with the oldest bit
+leftmost), so the encoded X and Y outputs are 0 and 1; this path is labelled
+with 01 to indicate the encoder output in moving from S3 to S2. The other
+paths are calculated in a similar way and labelled appropriately.
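+
+The state transitions and branch labels of the four-node trellis can be reproduced
+with the following sketch. It assumes, consistently with the text, that the least
+significant state bit holds the most recent input bit; the helper names
+next_state, branch_output and trace_path are illustrative only.
+
```python
def next_state(state, bit, num_states=4):
    """State n becomes 2n or 2n+1 (modulo the number of states)."""
    return (2 * state + bit) % num_states


def branch_output(state, bit):
    """Encoder output (X, Y) on the branch leaving `state` for input `bit`."""
    prev1 = state & 1           # most recent previous bit, i(t-1)
    prev2 = (state >> 1) & 1    # oldest bit, i(t-2)
    return (bit ^ prev1 ^ prev2, bit ^ prev2)


def trace_path(bits, start=0):
    """List of states visited, beginning with the starting state."""
    path = [start]
    for bit in bits:
        path.append(next_state(path[-1], bit))
    return path


print(trace_path([1, 1, 0, 1]))   # [0, 1, 3, 2, 1]
print(branch_output(3, 0))        # (0, 1), the label on the S3 to S2 branch
```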
+
+14.2.2
+Decoder principle
+
+The decoder attempts to reconstruct the most likely path taken by the en-
+coder through the trellis. It does this by constructing a trellis and attaching
+weights to the nodes and each possible path at each timeslot; these indicate
+how likely it is that the node and path were the route taken by the encoder.
+Consider again the four-node example above. The encoded output for transmission
+resulting from the same input sequence of 1 then 1 then 0 then 1 would
+be 11 followed by 01, 01 and finally 00 (from an initial encoder state of all-
+zeros). Suppose that instead of receiving this sequence, the decoder receives
+corrupted data of 11 followed by 00, 01 and 00. Also assume that node 0 has
+an initialised weight of zero and that the other nodes have a higher initialised
+weight, say 2, at time 0; this corresponds to the encoder starting in state S0.
+
+[Figure: decoding trellis for states S0 to S3 over times 0 to 4, showing the received pairs 11, 00, 01 and 00, the branch weights in each timeslot and the resulting node weights.]
+
+Figure 14.3.
+Decoding trellis.
+
+For each branch, the distance between each received bit and the bit expected
+by the branch gives the weight of the branch. Where X and Y are encoded as
+single bits, referred to as hard coding, the number of bits difference between
+the received bits and the ideal encoded state for the branch gives the branch
+weight. The weights allocated in each timeslot for the received data are shown
+in figure 14.3. The received bits in the first timeslot are 11. Branches rep-
+resenting an encoded output of 11 are given a weight 0 while the branches
+representing an encoded output of 00 are given a weight of 2, and branches
+with ideal encoded outputs of 01 or 10 are given a weight of 1; these branch
+weights represent the distance between the branch and the received input. The
+branch weights are then added to the node weights to give an overall weight.
+Thus for example, state S2 has a node weight of 2 at time 0 and its branches
+have weights of 0 and 2 giving an overall weight of 2 and 4 going into the
+receiving node at the end of the timeslot. The overall weight indicates how
+likely it is that the route through the encoder corresponded to this sequence;
+the lower the overall weight the more likely this is.
+From the trellis diagram, it can be seen that each state can be reached from
+two other states and that each state will therefore have two incoming overall
+weights. The weight that is chosen is the lower of these two since this rep-
+resents the most likely branch that the encoder would traverse in reaching a
+particular state. So for example, state S1 can be entered from either state S0
+or S2 and at time 1, the overall node plus branch weight from S0 is 0 + 0 = 0
+while from S2 it is 2 + 2 = 4. The weight arising from S0 is chosen as this is
+the lower and therefore more likely route into S1; the weight arising from the
+S2 branch is discarded and the new weight for node S1 at the end of this first
+timeslot is that from S0, i.e. 0.
+This process continues in a similar way for each timeslot with weights given
+to each branch according to the difference between the encoded branch pattern
+and the received bits. This difference is then added to the node weight to give
+an overall weight with the next weight for a node equal to the lower incoming
+weight. This results in the weights given in figure 14.3.
+To form the output from the decoder, a path is traced through its trellis using
+the accumulated history of node weight information. The weights at time 4
+indicate that the encoder is most likely to be in state S1 as this has the lowest
+node weight. Furthermore, this state was most likely to have been reached
+from S2. Continuing tracing backwards from S2 taking the lower overall node
+count, the most likely path taken from S2 is S3, S1 and S0 (initialisation). Note
+that despite having received corrupted data, this is exactly the sequence taken
+by the encoder in sending this data!
+The output data can be ‘rescued’ from the decoder by noting that to reach
+states S0 and S2 the current data input to the encoder is a ‘0’ while to reach
+states S1 and S3, the current encoder input is a ‘1’; that is, the least significant
+state bit indicates the state of the current data. Thus since the optimum states
+the decoder reaches at times 1, 2, 3 and 4 are S1, S3, S2 and S1 respectively, the
+decoder would output a data stream of 1101 as being the most likely input to
+the encoder.
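+
+The worked example above can be checked with a small hard-decision decoder. The
+sketch below is a conventional add-compare-select formulation written for this
+four-node example only; it is not the asynchronous design described later. The
+function name, the tie-breaking rule (the first predecessor examined wins) and the
+data layout are assumptions.
+
```python
def viterbi_decode(received, initial_weights=(0, 2, 2, 2)):
    """Decode hard-coded (X, Y) pairs for the four-state example code."""
    num_states = 4
    weights = list(initial_weights)
    history = []                              # chosen predecessor per state, per timeslot
    for rx in received:
        new_weights = [None] * num_states
        predecessors = [None] * num_states
        for state in range(num_states):
            prev1, prev2 = state & 1, (state >> 1) & 1
            for bit in (0, 1):
                target = (2 * state + bit) % num_states
                expected = (bit ^ prev1 ^ prev2, bit ^ prev2)
                branch = sum(e != r for e, r in zip(expected, rx))  # bits that differ
                total = weights[state] + branch                    # add
                if new_weights[target] is None or total < new_weights[target]:
                    new_weights[target] = total                    # compare and select
                    predecessors[target] = state
        weights = new_weights
        history.append(predecessors)
    # Trace back from the node with the lowest weight.
    state = weights.index(min(weights))
    winner = state
    states = []
    for predecessors in reversed(history):
        states.append(state)
        state = predecessors[state]
    states.reverse()
    return [s & 1 for s in states], winner    # least significant state bit is the data


corrupted = [(1, 1), (0, 0), (0, 1), (0, 0)]
print(viterbi_decode(corrupted))   # ([1, 1, 0, 1], 1)
```
+
+With the corrupted sequence from the text the sketch ends in state S1 and recovers
+the data 1101, matching the path S1, S3, S2, S1 described above.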
+
+14.3.
+System parameters
+
+In practice the decoder designed is larger and more complicated than indi-
+cated by the simple example given. The encoder uses the current bit and the
+six previous bits in various combinations to provide the two encoded output
+streams; this spreads each input bit out over a longer transmission period, in-
+creasing the probability of error-free reception in the presence of noise. If the
+current bit is bit 0 with the oldest bit in the shift register being bit 6, then the
+encoded X-output is obtained by adding bits 0, 1, 2, 3 and 6 and the encoded
+Y-output from adding bits 0, 2, 3, 5 and 6. Since the constraint length is 7, the
+system has 64 nodes and therefore 128 paths or branches. Thus state n, where
+n is an integer from 0 to 63, leads to states 2n modulo 64 and (2n+1) modulo
+64.
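+
+As a sketch of the larger encoder, the tap positions quoted above can be applied to
+a seven-bit window. The names X_TAPS, Y_TAPS and encode_k7 are illustrative
+assumptions, and the register is assumed to start at all zeros as in the four-state
+example.
+
```python
X_TAPS = (0, 1, 2, 3, 6)   # bits added (modulo 2) to form the X-output
Y_TAPS = (0, 2, 3, 5, 6)   # bits added (modulo 2) to form the Y-output


def encode_k7(bits):
    """Rate-1/2 encoder with constraint length 7 (64 states)."""
    window = [0] * 7                     # window[0] is the current bit, window[6] the oldest
    encoded = []
    for bit in bits:
        window = [bit] + window[:-1]     # shift the new bit in
        x = sum(window[t] for t in X_TAPS) % 2
        y = sum(window[t] for t in Y_TAPS) % 2
        encoded.append((x, y))
    return encoded


# A single one followed by zeros reads the tap patterns out directly.
print(encode_k7([1, 0, 0, 0, 0, 0, 0]))
```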
+Furthermore, the received bits are not hard coded (just straight ones and
+zeros) but soft coded. Three bits are used to represent values, with 100 used
+for a strong zero value and 011 for a strong one value. Noise in transmission
+means that a received character can be indicated by any 3-bit value from 0 to
+7 and in interpreting the received value, it is helpful to regard the number as
+signed. Thus 011 indicates a strong one while 000 denotes received data that
+weakly indicates a one. Similarly, 100 implies a strong zero while 111 denotes
+a probable but weak zero. To make the text easier to follow hereafter, unsigned
+values will be used with the code for a 3-bit perfect zero taken as 000 (i.e.
+value 0) while a 3-bit perfect one is taken as 111 (i.e. value 7).
+The interface to the asynchronous decoder is synchronous with the validity of
+the encoded characters on a particular positive clock edge indicated by a Block-
+Valid signal; encoded data is only present when this signal is high. Code rates
+other than the 1/2, primarily described in this chapter, are also possible. These
+are achieved by using the Block-Valid with a Puncture and a Puncture-X-nY
+signal; if Block-Valid is active (high) then a high Puncture signal indicates that
+only one of the X or Y symbols is valid and Puncture-X-nY indicates which.
+Both encoded characters are present if Block-Valid is high and Puncture is low.
+All the code rates originate from the 1/2 rate encoder with data for the other
+code rates then obtained by omitting to send some of these encoded characters.
+In this way, in addition to the 1/2 code rate, the system receives and decodes
+rates of 2/3 (two input bits result in the transmission of three encoded charac-
+ters), 3/4, 5/6, 6/7 and 7/8 (7 input bits defined by 8 encoded characters with
+the remaining 6 not transmitted). As the code rate progressively increases, less
+redundancy is included in the transmitted data resulting in an increased error
+rate from the decoder; a rate of 1/2 will yield the most error-free output.
+The system is expected to be operated from a 90 MHz clock. Since this
+clock is used by all code rates, the code rate not only defines the ratio of the
+number of input bits to the number of transmitted encoded characters but also
+specifies the ratio of the number of clocks containing encoded information (of
+1 or 2 valid characters) to the number of clock cycles. For example, a 3/4 rate
+signifies that three input bits result in four transmitted characters and also that
+three clocks out of four contain some encoded information. Thus in a 3/4 code,
+every fourth clock contains no encoded information (Block-Valid is low). Any
+clock for which Block-Valid is high is said to contain a symbol. Thus with
+a 90 MHz clock rate, a 1/2 rate code which has valid data on alternate clock
+cycles yields a data rate of 45 Msymbols/sec and a 7/8 code rate is equivalent
+to 90 x 7/8 Msymbols/sec = 78.75 Msymbols/sec.
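+
+The arithmetic relating code rate, clock and symbol rate can be written out as a
+short sketch. The function names are assumptions; the 90 MHz default is the clock
+rate quoted above, and the count of omitted characters follows from every input bit
+producing two characters at the 1/2 rate encoder.
+
```python
from fractions import Fraction


def symbol_rate_msymbols(code_rate, clock_mhz=90):
    """Symbols per second (in millions): the fraction of clocks carrying a symbol."""
    return float(Fraction(code_rate) * clock_mhz)


def omitted_characters(input_bits, transmitted):
    """Encoded characters dropped when `input_bits` bits are sent as `transmitted` characters."""
    return 2 * input_bits - transmitted


print(symbol_rate_msymbols("1/2"))   # 45.0
print(symbol_rate_msymbols("7/8"))   # 78.75
print(omitted_characters(7, 8))      # 6
```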
+
+14.4.
+System overview
+
+To perform the decoder computation previously outlined, the Viterbi de-
+coder comprises three units as shown in figure 14.4. The Branch Metric Unit
+(BMU) receives the transmitted data and computes the distance between the
+ideal branch pattern symbols of (0,0), (0,7), (7,0) or (7,7) and the received
+data; these weights are then passed to the Path Metric Unit (PMU). The PMU
+holds the node weights and performs the node plus branch weight calculations,
+selecting the lower overall weight as the next weight for a node in a particular
+timeslot. The computed node weights are then fed back (within the PMU) and
+become the node weights for the next timeslot.
+
+[Figure: the Branch Metric Unit, Path Metric Unit and History Unit connected in sequence by Req/Ack handshakes; the input from the receiver enters the Branch Metric Unit, branch metrics pass to the Path Metric Unit, the local winners and the global winner pass to the History Unit, and the History Unit produces the decoded output. A Clock signal serves the external interface.]
+
+Figure 14.4.
+Decoder units.
+
+As well as computing the next slot node weights, the PMU remembers
+whether the winner at a node came from the upper or lower branch leading
+in to it; this is only a single bit per node. On each timeslot, this local winner
+information is passed to the History Unit (HU), called the Survivor Memory
+Unit in synchronous systems. This information enables the HU to keep a his-
+tory of the routes taken through the trellis by each node. In addition to this
+local node information, in our asynchronous design the state with the lowest
+node weight is identified in the PMU and its identity passed to the HU. Locat-
+ing this global winner gives a known starting point in searching back through
+the trellis history to find the best data to output.
+Conventional synchronous designs do not undertake an overall node winner
+identification and consequently start the search at a random point. They need
+to store a relatively large number of timeslots in order to ensure that there is
+sufficient history to make it likely that the correct route through the trellis is
+eventually found. In the asynchronous design, the identification of the overall
+node winner in the PMU was relatively easy to perform and it seemed the
+natural way to proceed. It has had the desirable effect of enabling the amount
+of timeslot history kept in the HU to be reduced and it also reduces the activity
+in the HU, saving power.
+The HU uses both the overall winning node information (the global winner)
+and the local node winners in order to reconstruct the trellis and trace back the
+path from the current overall winner to find the node indicated by the oldest
+timeslot stored; the bit for this node is then output. The HU can be visualised
+as a circular buffer. Once data is output from the oldest slot, this information
+is overwritten with the current (newest) winner data so that the next oldest data
+becomes the oldest data in the next timeslot.
+Figure 14.4 shows the bundled-data interface used between units; four-phase
+signalling is used for the Request and Acknowledge handshake signals. The
+Clock signal is required because the external system to the decoder is synchronous.
+Input data is supplied and output data removed on the positive clock
+edge provided that the bits on that clock edge are valid as indicated by Block-
+Valid. In practice, all units contain buffering at their interfaces in order to
+decouple the operation of the units, to cater for local variations in operating
+times, and to satisfy latency requirements imposed by the external system.
+
+14.5.
+The Path Metric Unit (PMU)
+
+14.5.1
+Node pair design in the PMU
+
+The PMU performs the core of the computation in the decoder and is the
+starting point for the design. Conventionally, the computation performed is
+that of Add (node to branch weight), Compare (upper and lower weights to
+a node) and Select (lower weight as next weight for a node). Because of the
+butterfly connection, the branch weights associated with nodes j and j+32 and
+their connections lead to nodes 2j and 2j+1, as shown in figure 14.5 where
+BMa and BMb represent the branch weights; it should be noted that since a
+branch represents ideal convolved characters of (0,0), (0,7), (7,0) or (7,7), it is
+only necessary to compute a total of four weights in any system representing
+their distance from the received soft-coded characters.
+
+[Figure: previous node metrics for nodes j and j+32 connect through branches with weights BMa and BMb to the next node metrics for nodes 2j and 2j+1.]
+
+Figure 14.5.
+Node pair computation.
+
+As the logic for this pair of nodes is self-contained and all the logic can be
+similarly partitioned into self-contained pairs of nodes, the basic unit of logic
+design in the PMU is the node pair; this is then replicated the required number
+of times (32 in this system). Furthermore, since 8-bit parallel arithmetic is nor-
+mally used, in a 64-node system this leads to 512 data signals in the butterfly
+connection and 1024 interconnections within the PMU.
+In an effort to simplify this routing problem and to achieve an associated
+power reduction from this simplification, serial arithmetic was proposed for
+the asynchronous design; in principle, this would reduce the butterfly to just
+64 wires. The adoption of serial arithmetic significantly impinges upon the
+node pair design and the way weights are stored. Performing serial arithmetic
+on a conventional binary representation of values was not a practical option
+as it would lead to a system throughput appreciably below that specified. This
+led to the idea of indicating values by the number of entries occupied in a FIFO
+buffer, so for example a count of five would require that the bottom five stages
+of the buffer showed that they were full; note that the buffer stages don’t need
+to store any data but merely require a full/empty indication.
+
+Table 14.1.
+Serial unary arithmetic.
+
+number   transition representation
+zero     000000
+one      111111
+two      000001
+three    111101
+four     000101
+five     110101
+six      010101
+
+[Figure: a chain of six Muller C-elements (stages 1 to 6) with request Rin and acknowledge Ain on the input side, request Rout and acknowledge Aout on the output side, and internal request and acknowledge signals R1 to R5 and A1 to A5 between stages.]
+
+Figure 14.6.
+Six-bit 2-phase dataless FIFO.
+
+The speed and simplicity of this full/empty dataless FIFO scheme is further
+enhanced by adopting serial unary arithmetic for representing the data in the
+buffer (rather than a ‘1’, say, to represent full and a ‘0’ for empty). This is
+essentially a 2-phase representation for values, so that the number of transi-
+tions considering the full/empty bits as a whole represents the count. This is
+illustrated in table 14.1 for a 6-stage FIFO where the input enters on the left
+hand side (and exits from the right hand side).
+The FIFOs used to hold the path and state metrics are Muller pipelines as
+shown in figure 14.6 (see also figure 2.8 on page 17). The encoding of a metric,
+M, is simply the state of an initially empty Muller pipeline after it has been
+exposed to M 2-phase handshakes on its input. Since a Muller C-gate in the
+technology used has a propagation delay of around 1 nsec the FIFO can transfer
+data in and out at a rate of around 500 MHz.
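+
+The patterns in table 14.1 can be reproduced by a simple behavioural model of the
+Muller pipeline. The sketch below is an idealised simulation, not a description of
+the actual circuit: it assumes the pipeline starts in the all-zeros state, that the
+output side never acknowledges (Aout stays low), and that each C-element copies its
+predecessor when the predecessor equals the inverse of its successor. The function
+names are illustrative.
+
```python
def muller_pipeline_state(count, stages=6):
    """State of an initially empty pipeline after `count` 2-phase input handshakes."""
    if not 0 <= count <= stages:
        raise ValueError("count must fit in the FIFO")
    state = [0] * stages
    rin = 0
    for _ in range(count):
        rin ^= 1                        # one 2-phase handshake is one input transition
        changed = True
        while changed:                  # let the transition ripple until nothing moves
            changed = False
            for i in range(stages):
                a = rin if i == 0 else state[i - 1]
                successor = 0 if i == stages - 1 else state[i + 1]
                b = 1 - successor       # inverted acknowledge from the next stage
                if a == b and state[i] != a:
                    state[i] = a        # C-element fires when both inputs agree
                    changed = True
    return "".join(str(bit) for bit in state)


def count_transitions(pattern):
    """Recover the count: transitions between neighbours, with a 0 beyond the output end."""
    bits = [int(ch) for ch in pattern] + [0]
    return sum(bits[i] != bits[i + 1] for i in range(len(pattern)))


for n in range(7):
    print(n, muller_pipeline_state(n))
```
+
+Under these assumptions the model yields the seven patterns of table 14.1, and
+counting transitions in a pattern returns the stored value.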
+
+Using serial unary arithmetic, the major design component in a node pair
+for adding the node and branch weights and transferring the smaller to be the
+new node weight is an increment-decrement unit. The basic scheme for a node
+pair is illustrated in figure 14.7.
+
+[Figure: node pair n/2, showing the Path Metric FIFOs (loaded with the branch metrics and fed through the butterfly network from nodes n/2 and n/2+32), the Select elements, the State Metric FIFOs for nodes n and n+1 (feeding node pairs n and n+1 through the butterfly network), and the local winner and global winner outputs to the History Unit.]
+
+Figure 14.7.
+Node pair logic.
+
+The new weights for each state are stored in the State Metric FIFOs on the
+right hand side of figure 14.7. When the global and local winners of these have
+been sent to the HU and acknowledged, the next timeslot commences with the
+parallel loading of the branch weights into the Path Metric FIFOs on the left
+in figure 14.7; these overwrite any existing content in these FIFOs. Parallel
+loading here, rather than serial entry, was selected on the grounds of speed and
+the need to clear the Path Metric FIFOs of any existing count.
+The branch weights loaded are those computed by the BMU. The BMU first
+computes the conventional branch weights based on the difference between the
+two ideal 3-bit characters expected on the trellis branch and the two received
+values. The BMU then translates this to a transition pattern. This is made
+more complicated by the fact that the external environment to the Path Metric
+FIFOs is sometimes in the ‘1’ state necessitating a 2-phase inversion of the
+computed pattern.
+Once the branch weights are loaded in, the node weights are then added to
+them. The node weights are transferred serially from the State Metric FIFOs
+across the butterfly connection into the Path Metric FIFOs. For each event
+transferred, two Path Metric FIFOs are incremented by one and the State Met-
+ric FIFO decremented by one. The transfer is complete when the feeding State
+Metric FIFO is empty. The Path Metric FIFOs for the node pair can commence
+the comparison and selection of the lower count as the node weight for the
+receiving State Metric FIFO once the receiving State Metric FIFO is empty.
+The simplest way of performing this comparison and selection of the lower
+Path Metric FIFO count is to pair transitions (events) in the upper and lower
+Path Metric FIFOs connected to a receiving State Metric FIFO. Each paired
+event decrements each Path Metric FIFO by one and produces a transition
+which is used to increment the State Metric FIFO. The observant reader will
+note that the pairing of events to produce an output transition is exactly the
+action performed by a two-input Muller C-gate and this in principle is all that
+is required for the Select element shown in figure 14.7.
+The pairing action ceases when the lower count in the two Path Metric FI-
+FOs has decremented to zero. At this point, this Path Metric FIFO is the local
+path winner and the new node weight in the receiving State Metric FIFO is
+complete. The identity of the winning Path Metric FIFO (upper or lower) is
+needed to reconstruct the trellis; this information is buffered in the PMU and
+sent to the HU when all local winners are known and the overall winning node
+has been identified. This completes the actions required in the current timeslot
+and the PMU is then free to commence the next timeslot.
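+
+The pairing of events described above amounts to taking the minimum of two unary
+counts. A minimal behavioural sketch follows; the function name and the choice of
+the upper FIFO as winner when both counts are equal are assumptions, since the text
+does not say how ties are resolved.
+
```python
def select_by_pairing(upper, lower):
    """Pair events from two Path Metric FIFOs until the smaller count is exhausted."""
    state_metric = 0
    while upper > 0 and lower > 0:
        upper -= 1              # each paired event decrements both Path Metric FIFOs
        lower -= 1
        state_metric += 1       # and increments the receiving State Metric FIFO
    winner = "upper" if upper == 0 else "lower"   # tie goes to upper (assumption)
    return state_metric, winner


print(select_by_pairing(5, 2))   # (2, 'lower')
```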
+
+14.5.2
+Branch metrics
+
+The proposed scheme is only viable if the numbers that need to be trans-
+ferred between the FIFOs can be kept small. A simulator was written in order
+to establish the minimum values that were consistent with meeting the bit error
+rates specified for the industrial standard device.
+
+[Figure: the received point (a,c) plotted in a square whose corners are the ideal points 0,0, 0,7, 7,0 and 7,7; a, b, c and d are the axis distances from the point to the sides of the square, and d00, d01, d10 and d11 are its distances to the four corners.]
+
+Figure 14.8.
+Computing the branch metric.
+
+Table 14.2.
+Branch metric weight generation.
+
+received 3-bit character   0  1  2  3  4  5  6  7
+Weight referenced to 0:    0  0  0  0  1  3  5  7
+Weight referenced to 7:    7  5  3  1  0  0  0  0
+
+In the BMU, the distance of the incoming data from the ideal branch
+representations of (0,0), (0,7), (7,0) and (7,7) needs to be computed. This
+calculation is depicted in figure 14.8. The incoming data is assumed to have
+a value of (a,c) which does not correspond to any of the ideal points. The
+squares of the distances d00, d01, d10 and d11 are a^2 + c^2, a^2 + d^2,
+b^2 + c^2 and b^2 + d^2 respectively. Only the relative values of these
+quantities are of interest. Substituting (7 - a) for b and (7 - c) for d in the
+quadratic expressions gives squared distances of a^2 + c^2, a^2 + (7 - c)^2,
+(7 - a)^2 + c^2 and (7 - a)^2 + (7 - c)^2. Expanding out and subtracting
+a^2 + c^2 gives distance values of 0, 49 - 14c, 49 - 14a and 98 - 14a - 14c.
+Dividing by 7, adding a + c and then substituting back b for (7 - a) and d for
+(7 - c) yields the linear metrics a + c, a + d, b + c and b + d.
+Thus in this particular system, the Euclidean distance squared is equivalent
+to the Manhattan distance, which is a somewhat surprising and unexpected re-
+sult. It indicates that to use squared distances offers no advantage over using
+the much simpler linear distances; indeed, using squared distances followed
+by scaling to reduce the number size (which is adopted in some systems) in-
+troduces unnecessary circuit complexity and some inaccuracy.
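+
+The algebra above can be checked exhaustively over all 64 possible received pairs.
+The sketch below applies the same three steps (subtract a^2 + c^2, divide by 7,
+add a + c) to the squared distances and compares the result with the linear
+metrics; the function name is an assumption.
+
```python
def linear_from_squared(a, c):
    """Apply the reduction steps from the text to the four squared distances."""
    b, d = 7 - a, 7 - c
    squared = [a * a + c * c, a * a + d * d, b * b + c * c, b * b + d * d]
    base = a * a + c * c
    # Each difference is a multiple of 7, so integer division is exact.
    return [(s - base) // 7 + a + c for s in squared]


for a in range(8):
    for c in range(8):
        b, d = 7 - a, 7 - c
        assert linear_from_squared(a, c) == [a + c, a + d, b + c, b + d]

print(linear_from_squared(5, 6))   # [11, 6, 8, 3]
```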
+The linear weights are further minimised by subtracting the x and y distance
+to the nearest ideal point from the branch weights, so the smallest linear metric
+is always made zero. For example, if the incoming soft codes are exactly 7,7
+in figure 14.8 then linear metrics of 14, 7, 7 and 0 are generated. However, if
+the incoming bits are at co-ordinate 5,6 (due to noise) then the metrics become
+11, 6, 8 and 3 which by subtracting 3 from all values (2 for x value and 1
+for y value) become branch metrics of 8, 3, 5 and 0. With the distance to the
+nearest point subtracted in this way, so that it becomes zero, the weights
+generated for each 3-bit character are reduced to those shown in table 14.2.
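+
+The per-character weights of table 14.2 follow from subtracting the distance to the
+nearer ideal value. A minimal sketch, with function names chosen here for
+illustration and the branches listed in the order (0,0), (0,7), (7,0), (7,7):
+
```python
def character_weights(r):
    """Weights of a received 3-bit value r referenced to 0 and to 7."""
    nearest = min(r, 7 - r)                  # distance to the nearer ideal value
    return r - nearest, (7 - r) - nearest


def normalised_branch_metrics(x, y):
    """Linear branch metrics with the smallest metric reduced to zero."""
    x0, x7 = character_weights(x)
    y0, y7 = character_weights(y)
    return [x0 + y0, x0 + y7, x7 + y0, x7 + y7]


print([character_weights(r)[0] for r in range(8)])   # [0, 0, 0, 0, 1, 3, 5, 7]
print(normalised_branch_metrics(5, 6))               # [8, 3, 5, 0]
```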
+The maximum metric is now 14, and this will always arise when the incom-
+ing data coincides with one of the ideal input combinations. This is still too
+large for a system which needs to operate serially at close to 90 MHz. Therefore
+the metric is further scaled non-linearly and, to preserve the relative value
+of the weights, this is performed separately on each of the two incoming 3-bit
+soft values.
+Referring to table 14.2, weights of zero remain zero while weights of 1, 3, 5
+and 7 are scaled to 1, 2, 3 and 4. Clearly, the weights for both 3-bit soft codes
+need to be added to obtain the overall branch metric weight. Thus for example,
+the incoming soft-codes of 7,7 generate weights of 4+4, 4+0, 0+4 and 0+0, i.e. 8, 4,
+4 and 0, while incoming soft-codes of 5,6 generate weights of 2+3, 2+0, 0+3
+and 0+0, i.e. 5, 2, 3 and 0. Although the scaling here is non-linear and will therefore
+introduce some inaccuracy, simulation showed that this was not significant in
+relation to the results obtained with and without the scaling.
+Furthermore, simulation also reveals that, unsurprisingly, paths with the
+largest weights are rarely part of the most likely path chosen during the
+path reconstruction that determines the output. Therefore, the weights generated
+can be capped, and a limit of 6 is used in the BMU. Thus the weights
+actually generated and loaded by the BMU for a 7,7 soft code input are 6, 4, 4
+and 0; however, the weights for a 5,6 input code remain as above.
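+
+The weight generation described above can be sketched in Python. This is an
+illustrative model only, not the BMU logic: the function names are hypothetical,
+the closed form (w + 1) // 2 is an assumption chosen because it reproduces the
+quoted mapping of 0, 1, 3, 5, 7 to 0, 1, 2, 3, 4, and the ideal points are
+assumed to be taken in the order (0,0), (0,7), (7,0), (7,7).
+
```python
IDEAL = (0, 7)  # assumed ideal 3-bit soft values for a received 0 and 1


def axis_weight(soft, ideal):
    """Scaled weight of one 3-bit soft value against one ideal value."""
    linear = abs(soft - ideal)                    # linear distance
    linear -= min(abs(soft - i) for i in IDEAL)   # nearest ideal point -> 0
    return (linear + 1) // 2                      # 0,1,3,5,7 -> 0,1,2,3,4


def branch_weights(x, y, cap=6):
    """Capped weights for ideal points (0,0), (0,7), (7,0), (7,7)."""
    return [min(axis_weight(x, ix) + axis_weight(y, iy), cap)
            for ix in IDEAL for iy in IDEAL]


print(branch_weights(7, 7))  # [6, 4, 4, 0]
print(branch_weights(5, 6))  # [5, 2, 3, 0]
```
+
+Both printed lists agree with the 7,7 and 5,6 examples worked through in the
+text, including the cap of 6 on the largest weight.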
+Six-bit FIFOs are used throughout the PMU and again numbers in the PMU
+are capped at 6. To deal with cases where the serial addition of the node metric
+to a branch weight would exceed this number, logic referred to as the Overflow
+Unit is placed at the input of each Path Metric FIFO. This receives the incoming
+request handshake but does not pass it to the FIFO; instead it returns it to
+the sending State Metric FIFO as an acknowledge signal.
+
+14.5.3
+Slot timing
+
+The overall or global winner from the PMU in a particular timeslot is the
+node having the lowest state metric count. In the same way that the BMU
+values can be adjusted so that the lowest weight is zero, the state metric values
+can also be decremented so that their minimum value is zero. As a result,
+numbers in the BMU and PMU are guaranteed to range only between zero and
+six. Furthermore if the soft bits contain no noise, which is the situation most of
+the time, then one (and only one) State Metric FIFO will contain a zero count
+indicating the best path through the trellis. This means that in the majority of
+timeslots, there is no need to perform any subtraction on the state metrics.
+Detecting that a count is zero is in itself easy since it is indicated by an all-
+zeros state in the FIFO. This, as well as establishing the local winner, is done
+locally within a node and is timed from the control signals applied to the node;
+each node has a control section which generates the timing signals required by
+the node, and the timing within a node is independent of the timing of all other
+nodes.
+A slot commences with the loading of the BMU branch weights. The node
+timing then passes to the next stage where the state-to-path metric transfer is
+performed. Following this, detecting that the sending State Metric FIFOs and
+the State Metric FIFO to receive the lower path metric count are empty causes
+the generation of a state-to-path metric done signal. The node timing then
+moves on to the next phase which generates a path-to-state metric enable. If
+at the time this signal is activated one of the Path Metric FIFOs for the node
+is empty, then a flip-flop is set indicating that this node is a global winner
+candidate; in this case no transfer to the State Metric FIFO is required and
+the path-to-state metric done signal is generated; this signal is used to clock
+the upper/lower branch (local) winner into a flip-flop and also to set a ‘local
+winner found’ flip-flop.
+If neither Path Metric FIFO is empty then the path-to-state enable signal
+allows the transfer to its State Metric FIFO until one of the Path Metric FIFOs
+becomes empty; at this point, the path-to-state metric done signal is generated,
+setting the ‘local winner’ and the ‘local winner found’ flip-flops only.
+The ‘local winner found’ and ‘global winner found’ signals now progress
+up the PMU logic hierarchy to the top level because all information needed
+to be passed to the HU has to be present before the request out signal to the
+HU is generated by the PMU. Furthermore, when the local and global winner
+data have been assembled for the HU, all nodes need to be informed that the
+slot has ended and the timing can be shifted to the start of the next slot. It
+should therefore be noted that while the timing within the nodes is local, the
+communication of the winner information to the HU and the subsequent release
+of the nodes has to be global.
+
+14.5.4
+Global winner identification
+
+The formation of the ‘all local winners found’ and the global winner identi-
+fication is partitioned across the various levels of the PMU logic hierarchy. At
+the level of a pair of node pairs, the four global winner candidate signals are
+input to a priority function which produces a 2-bit node address and a ‘global
+winner found’ signal if one of the inputs was a candidate. This logic is repeated
+with four such pairs of node pairs so that using the ‘global winner found’ sig-
+nal generated by each pair of node pairs, the two bit address obtained indicates
+which pair of node pairs contains the global winner; these are then combined
+with the two-bit address generated by that pair of node pairs to form a 4-bit
+node address. Finally, this logic is repeated at the top level where there are
+four sets of four pairs of node pairs. Again the ‘global winner found’ signal
+from each set is used in a priority logic function to produce the two-bit address
+of the winning set and this is combined with the four address bits identifying
+the winning node within the winning set; this is the 6-bit node identification
+that is sent to the HU as the global winner.
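+
+The three-level address formation can be modelled behaviourally as below. This
+is a sketch under stated assumptions rather than the PMU logic itself: the
+function names are hypothetical, the lowest-numbered input is assumed to have
+priority (the text does not say which input wins), and the address bits from
+the higher levels are assumed to be the more significant ones.
+
```python
def priority4(flags):
    """4-input priority function: (found, 2-bit index of first set input)."""
    for index, flag in enumerate(flags):
        if flag:
            return True, index
    return False, 0


def global_winner(candidates):
    """Combine 64 candidate flags into (found, 6-bit node address)."""
    assert len(candidates) == 64
    # Level 1: each pair of node pairs (4 nodes) gives found + 2-bit address.
    quads = [priority4(candidates[i:i + 4]) for i in range(0, 64, 4)]
    # Level 2: four pairs of node pairs (16 nodes) give found + 4-bit address.
    sets = []
    for i in range(0, 16, 4):
        found, which = priority4([f for f, _ in quads[i:i + 4]])
        sets.append((found, (which << 2) | quads[i + which][1]))
    # Top level: four sets give found + the 6-bit node identification.
    found, which = priority4([f for f, _ in sets])
    return found, (which << 4) | sets[which][1]


flags = [False] * 64
flags[37] = True
print(global_winner(flags))  # (True, 37)
```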
+The ‘local winner found’ signals only require combining in NAND or NOR
+gates and this is done at the node pair, four pairs of node pairs and top levels.
+At the top level, all the ‘local winner found’ signals must be true in order to
+generate the request out signal to the HU. Since the global winner identification
+is generated from the node whose local winner is the first to be indicated, the
+global winner is guaranteed to be identified prior to the last local winner being
+found. The acknowledge signal from the HU, in response to the PMU request
+out, causes a reset signal to all nodes which resets the global candidate and the
+‘local winner found’ flip-flops and moves the timing on.
+The case where no State Metric FIFO contains zero is detected, and
+indicates that noisy data has been received. Here, the global winner can be
+identified by performing a subtraction such that one or more State Metric FI-
+FOs with the smallest count become zero. This could be time consuming and
+a decision was taken not to perform the decrement in the current timeslot. In-
+stead the local winners are sent to the HU in the normal way but a Not-Valid
+signal accompanies the request handshake indicating that while the local win-
+ner information is genuine the global winner identification should be ignored.
+The decrementing of all the state metric weights is performed in the next
+cycle by all the Overflow Units (which precede the Path Metric FIFOs). A
+signal to this unit indicates that decrementing is required. This results in the
+first incoming request from a State Metric FIFO to transfer its count to the
+Path Metric FIFO being ignored. The Overflow Unit sends an acknowledge
+back to its sending State Metric FIFO but does not pass the request on to its
+Path Metric FIFO. This effectively decrements the State Metric FIFOs by one
+by discarding the first item sent by them to the Path Metric FIFOs.
+Only a count of one is decremented in this way on any timeslot. This may
+still leave all state metrics with a non-zero count in them but simulation re-
+vealed that this was highly unlikely. Furthermore, if the Overflow Units were
+used to decrement the state metrics by the smallest count then either consider-
+able logic to determine the size of this count would be needed, or time consum-
+ing logic which decremented all state metrics by one and then tested to see if all
+these metrics were still non-zero (repeating these steps if necessary) would be
+required. Instead the much simpler approach of detecting a zero-valued state
+metric and identifying when all state metric counts are non-zero is used.
+In retrospect, it would have been better to have decremented the State Met-
+ric down to zero in the current timeslot. The decrementing has to occur at some
+point and postponing to the next timeslot merely shifts when the operation is
+performed. More importantly, the failure to identify the global winner when
+all State Metric FIFOs hold a non-zero count means that information
+which is in the PMU is not passed to the HU, and therefore the HU has less
+information on which to base its decisions as to the data output.
+
+14.6.
+The History Unit (HU)
+
+The global and local winner information from the PMU to the HU is accom-
+panied by a Request handshake signal in the normal way. Having specified the
+interface to the PMU, the design of the asynchronous HU can be decoupled
+from the design of the rest of the system. As previously mentioned, the iden-
+tification of a global winner means that the number of timeslots of local and
+global winner history that need be kept by the HU can be reduced compared
+with systems that need to start the tracing back through the trellis information
+from a random point. A rule of thumb for the minimum number of timeslots
+that need to be stored for determining the correct output is around 5 times the
+constraint length. With a length of seven for the system described, a minimum
+history of 35 timeslots is required and this was confirmed by simulation. On
+this basis, a 65-slot HU was developed.
+
+14.6.1
+Principle of operation
+
+Figure 14.9 illustrates the principle of operation of the HU which, for sim-
+plicity, is shown as having only four states and storing only five time slots
+indicated by the rectangular outline. At each time step, indicated by T1, T2,
+etc., the PMU supplies the HU with the local winner information (an arrow
+points back from each state to the upper/lower winner state in the previous
+time step) and a global winner indicated by a solid circle.
+Consider the situation when the latest data to have been supplied is at T6.
+The global winner at T6 is S3, and following the arrows back, the global winner
+at T2 is S0. The next data output bit is therefore 0 (the least significant bit
+of S0’s state number), and this is output as the buffer window slides forward
+to time step T7. At T7 the received data has been corrupted by noise, and
+the global winner is (erroneously) S3. Following the local winners back, the
+backtrace modifies the global winners in time steps T6 and T5, but the path
+converges with the old path at T4. The next data output (from the global winner
+at T3) is 1 and is still correct. Moving on to T8, the global winner is S0 and,
+tracing back, the global winners at T5, T6 and T7 are changed, with T5 and T6
+resuming their previous correct values. The noise that corrupted the path at T7
+has no lasting effect and the output data stream is error-free.
+
+14.6.2
+History Unit backtrace
+
+The data structure used in the HU is illustrated in table 14.3. Each of the
+65 slots contains 64 bits of local winner information and a 6-bit global winner
+identifier. There is also a ‘global winner valid’ flag which indicates whether or
+not the global winner has been computed. The 65 slots form a circular buffer
+with the start (and end) of the buffer stepping around the slots in sequence.
+
+[Trellis diagram: states S0 to S3 against time steps T1 to T8; data_out: 0 0 1 0 1 1 1 0.]
+
+Figure 14.9.
+History Unit backtrace example.
+
+Table 14.3.
+History Unit data structure.
+
+slot number   local winner (64 bits)   global winner (6 bits)   valid (1 bit)
+0             L00[63..0]               G00[5..0]                V00
+1             L01[63..0]               G01[5..0]                V01
+...           ...                      ...                      ...
+18            L18[63..0]               G18[5..0]                V18
+19            L19[63..0]               G19[5..0]                V19
+20            L20[63..0]               G20[5..0]                V20   <- head
+21            L21[63..0]               G21[5..0]                V21
+22            L22[63..0]               G22[5..0]                V22
+...           ...                      ...                      ...
+64            L64[63..0]               G64[5..0]                V64
+
+At each step the next output bit is issued from the least significant bit of the
+current head-slot global winner identifier. Then the new local and global win-
+ner information is written into the head slot and the head pointer moves to the
+next slot. The new local and global winner information is used to initiate a
+backtrace, which updates the current optimum path held in the global winner
+memories.
+The trellis arrangement produces a simple arithmetic relationship between
+one state and the next state so that, given a global winner identity in one slot,
+the previous global winner identity is readily computed. The parent identity
+can be derived from the child identity by shifting the child state one place to
+the right and inserting the relevant local winner bit into the most significant bit
+position. For example, if the global winner is node 23 in a slot, then the global
+winner in the previous slot will be node 11 (if the current slot local winner for
+node 23 is 0) or node 11+32 = node 43 (if the local winner for node 23 is 1).
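+
+The shift-and-insert rule can be checked with a one-line Python function. The
+name parent_state and the state_bits parameter are hypothetical, introduced
+only for this illustration of the 64-state (6-bit) case described above.
+
```python
def parent_state(child, local_winner_bit, state_bits=6):
    """Shift the child right one place; the local winner bit becomes the msb."""
    return (local_winner_bit << (state_bits - 1)) | (child >> 1)


print(parent_state(23, 0))  # 11
print(parent_state(23, 1))  # 43
```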
+Where the current global winner is the child of the previous global winner,
+the current winner continues the good path already stored in the global winner
+memories. This makes it unnecessary to search back through the local winner
+information in order to reconstruct the trellis and hence saves power. There-
+fore, when data is received from the PMU, if the incoming global winner is the
+child of the last winner, then it is only necessary to output data from the oldest
+global winner entry and then to overwrite this memory line with the incoming
+local and global winner information.
+However, if sufficient noise is present (or noise has been present and the
+data now switches back to a good stream), then there may be a discontinu-
+ity between the incoming and previous global winner; this is recognised by
+the current global winner not being the child of the previous winner. In this
+case, the global winner memories do not hold a good path and this path is re-
+constructed using the local winner information. Here, the output data is read
+out and the winner information is written in as before. In addition, starting
+from the current global winner, this node identification is used to select its up-
+per/lower branch winner from the current local winner information. The parent
+identity is then constructed as described above. This computed parent identity
+is compared with the global winner identity for the previous slot. If they are
+the same then the backtrace has now converged onto the good path kept in the
+global winner memories and no further action is required. If, however, they are
+not the same then the computed parent identity needs to overwrite the global
+winner in the previous timeslot. The backtrace process now repeats in order to
+construct a good path to the next previous timeslot, and this process continues
+until the computed parent identity does match the stored global winner.
+Backtracing slot by slot thus proceeds until the computed path converges
+with the stored path. The algorithm is shown in a Balsa-like pseudo-code in
+figure 14.10. (Note, however, that Balsa does not have the ‘<<’ and ‘>>’
+shift operators; they are borrowed here from C to improve clarity.) In practice,
+simulation shows that path convergence usually occurs within eight or fewer
+slots. So, although the most recent items may be over-written, the oldest items
+tend to be static and the output data from the oldest slots does not change.
+Overwriting the entire path is a rare occurrence and, in this circumstance, the
+data output from the system is almost certainly erroneous.
+No backtrace is commenced in any slot where the global winner is invalid;
+the global winner entry is marked as invalid but the local winner information is
+written in the normal way. Any subsequent backtrace that encounters an invalid
+global winner will force a non-equivalence with the incoming computed global
+winner at that slot, so that the computed global winner replaces the invalid
+stored value and the entry is then marked as valid.
+
+loop
+  c := head;                            -- child starts at head
+  data_out <- Gc[0];                    -- output lsb of Gc
+  Lc := In.local_winners;               -- update head local winners,
+  Gc := In.global_winner;               --   global winner, and
+  Vc := In.global_winner_valid;         --   global winner valid bit
+  head := (head + 1) % 65;              -- step head pointer to next slot
+  if Vc then                            -- backtrace only from valid head
+    p := (c-1) % 65;                    -- parent slot number
+    while (c /= head                    -- detect buffer wrap-around
+      and (not Vp                       -- over-write invalid parent
+        or Gp /= (Lc[Gc] << 5) + (Gc >> 1)))  -- not converged
+    then
+      Gp := (Lc[Gc] << 5) + (Gc >> 1);  -- update parent
+      Vp := TRUE;                       -- mark as valid
+      c := p;                           -- next slot
+      p := (c-1) % 65                   -- next parent slot number
+    end -- while
+  end -- if
+end -- loop
+
+Figure 14.10.
+History Unit backtrace sequential algorithm.
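+
+For readers who want to experiment, the sequential algorithm can be transcribed
+into runnable Python roughly as follows. This is a software sketch, not the
+hardware: the class and method names are hypothetical, each slot's 64 local
+winner bits are assumed to be packed into one integer with bit i belonging to
+node i, and the update count returned by step() is added purely for inspection.
+
```python
SLOTS = 65  # circular buffer length used in the text


def parent_of(local_winners, state):
    """Parent state: the local winner bit of `state` becomes the msb."""
    return (((local_winners >> state) & 1) << 5) | (state >> 1)


class HistoryUnit:
    def __init__(self):
        self.L = [0] * SLOTS       # 64-bit local winner words
        self.G = [0] * SLOTS       # 6-bit global winner identifiers
        self.V = [False] * SLOTS   # 'global winner valid' flags
        self.head = 0

    def step(self, local_winners, global_winner, valid):
        """Process one timeslot; return (output bit, entries overwritten)."""
        c = self.head
        data_out = self.G[c] & 1   # lsb of the oldest global winner
        self.L[c], self.G[c], self.V[c] = local_winners, global_winner, valid
        self.head = (self.head + 1) % SLOTS
        updates = 0
        if valid:                  # backtrace only from a valid head
            p = (c - 1) % SLOTS
            while c != self.head:  # stop on buffer wrap-around
                computed = parent_of(self.L[c], self.G[c])
                if self.V[p] and self.G[p] == computed:
                    break          # converged with the stored path
                self.G[p], self.V[p] = computed, True
                updates += 1
                c, p = p, (p - 1) % SLOTS
        return data_out, updates
```
+
+With a noise-free stream, in which every global winner is the child of the
+previous one, this model performs no overwrites after the first slot and
+reproduces the input bits 65 slots later, which is the power-saving behaviour
+the text describes.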
+
+14.6.3
+History Unit implementation
+
+The type of memory used in the HU is the dominant factor in determining
+its implementation. Initially, RAM elements were considered for this storage
+as single and dual-port read elements were present in the available cell library.
+However, their use makes it difficult to keep track of incomplete backtraces
+when a new backtrace needs to be started. In addition, the global and local
+winner memories need to be separate entities but this introduces some ineffi-
+ciency in the address decoding. Furthermore, there are difficulties in providing
+the many specific timed signals required to drive the memory. The RAM tim-
+ings are equivalent to many simple gate delays. Such gates would be used to
+form the reference timing signals, and it is not clear that the gate propagation
+delays due to supply voltage changes vary in the same way as the RAM delays.
+For these reasons, the memory was implemented with flip-flop storage and
+the system is shown in figure 14.11. It comprises 65 lines made up of 64
+slots of replicated storage and one further slot which, on reset, becomes the
+slot holding the head token; the head slot receives the new local and global
+winner information from the PMU. The control block holds the global winner
+identification plus the token handling and backtrace logic.
+The concurrent asynchronous algorithm is illustrated in Balsa-like pseudo-code
+in figure 14.12, which represents a single stage in the History Unit. The
+complete HU comprises 65 such stages, one of which is initialised to be the
+head of the circular buffer. This can be compared with the sequential
+algorithm shown earlier in figure 14.10. The transformation from the sequential
+to the concurrent algorithm is illustrative of high-level asynchronous design
+methodologies even when the design is being carried out manually (as was this
+Viterbi decoder) and not in a high-level language such as Balsa.
+
+[Block diagram; labels: control, token, evaluate, addr, data, strobe, local
+winners memory, global winner, Rin, Ain, data_out, Aout.]
+
+Figure 14.11.
+History Unit implementation.
+
+The head slot contains the oldest winner information and determines the 1-
+bit data output from the system. Remembering that odd states signify a ‘1’ in-
+put and even states a ‘0’ input, the head slot outputs the least significant global
+winner bit on the data-out line. This data enters a buffer and the acknowledge
+Aout signifies its acceptance. The head is then free to write the current winner
+information to its memory. The Token signal then passes (leftwards) to the
+adjacent slot which now becomes the new head.
+The parent node of the current global winner is computed as described and
+this is passed (rightwards) to the adjacent slot with an Evaluate signal. The
+computed parent is compared with the stored winner in the previous stage.
+Equivalence results in no further backtracing and the backtrace is said to be
+retired. Non-equivalence causes overwriting of the global winner, and this
+winner, accompanied by Strobe, is used to address (Addr) the local memory. The
+data bit returned on Data is used to compute the parent of this winner which is
+then passed rightwards to the preceding timeslot with an Evaluate signal. This
+process repeats until the backtrace converges with the existing global winners
+and can be retired. A backtrace has to be forcibly retired if it is in danger of
+running into the head slot; an arbiter is used to test for this in the control
+and it is the only place where arbiters need to be used in the system.
+Fortunately, the meeting of the head and backtrace processing is a rare
+occurrence.
+
+loop
+  arbitrate In then
+    if head then                           -- head...
+      data_out <- G[0];                    --   output next data bit
+      L := In.local_winners ||             --   update local values
+      G := In.global_winner ||
+      V := In.global_winner_valid;
+      if V then                            --   if global winner is valid
+        addrOut <- (L[G] << 5) + (G >> 1)  --   start backtrace
+      end; -- if V
+      sync tokenOut ||                     --   pass on head token
+      head := false                        --   and clear head Boolean
+    end -- if head
+  | addrIn then                            -- backtrace input?
+    if not head then
+      if addrIn /= G or not V then         -- path converged? If not...
+        G := addrIn ||                     --   update global winner
+        V := true;
+        addrOut <- (L[G] << 5) + (G >> 1)  --   propagate backtrace
+      end -- if ..
+    end -- if not head
+  | tokenIn then                           -- head token arrived
+    head := true                           --   set head Boolean
+  end -- arbitrate
+end -- loop
+
+Figure 14.12.
+History Unit backtrace stage.
+
+It should be noted that, unlike a conventional system, path reconstruction is
+only undertaken if necessary and then for only as long as required; both strate-
+gies save power. Furthermore, the use of asynchronous techniques within the
+HU enables the writing of winner information from the PMU to be indepen-
+dent of and run concurrently with any path reconstruction activity. The use
+of flip-flop storage rather than RAM has resulted in a simpler and more flexi-
+ble design. It also has the distinct advantage of enabling multiple backtraces,
+whose frontiers are all at different slots, to be run concurrently.
+
+14.7.
+Results and design evaluation
+
+The asynchronous Viterbi decoder system was implemented as described
+using the industrial partner’s cell library which was designed to operate from
+3.3V. Non-standard elements such as the Muller C-gate were constructed from
+the cell library; the only full-custom element which had to be designed was an
+arbiter.
+Following simulation, the decoder was fabricated on a 0.35 micron CMOS
+process. Results for a 1/2 code rate, random input bit stream with no errors
+show an overall power dissipation of 1333 mW at 45 Msymbols/sec. Of this,
+the dissipation in the PMU dominates at 1233 mW while the HU takes only
+37 mW. The difference (about 60 mW) between these figures and the
+overall consumption is accounted for by the dissipation in the BMU and in
+the small amount of ‘glue’ logic prior to the BMU.
+Errors in the input data to the decoder result in only small variations in the
+dissipation with the overall dissipation falling slightly with an increasing input
+error rate; internally, this decrease comprises a small reduction in the PMU
+dissipation and a smaller rise in the HU dissipation. The results for other code
+rates are a scaled version of those obtained for the 1/2 code rate as might be
+expected. For example, a 3/4 code rate which receives 3 symbols for every 4
+clocks exhibits 1.5 times the dissipation of the 1/2 rate code with its 2 symbols
+every 4 clocks.
+The asynchronous PMU performs approximately the same amount of work
+regardless of the number of errors in the data stream. This results from capping
+numbers at 6. This means that for a good data stream, all nodes have a weight
+of six except the one node on the good (correct) path. Thus the PMU is almost
+permanently ‘saturated’ and practically all work performed relates to paths
+which will never be selected! Errors cause some spread in the node weights,
+with the higher weights (4, 5 and 6) predominating, and the slightly smaller
+counts on some nodes account for the slight drop in dissipation in the PMU
+under error conditions.
+The asynchronous PMU dissipation is very high and does not compare well
+with synchronous PMUs using conventional parallel arithmetic to perform the
+add, compare and select operation [15]. In order to understand why this occurs,
+it is necessary to examine the operation and logic used in the asynchronous
+PMU in more detail. With good (i.e. no error) data, 63 nodes have weights
+of 6 and one node has a weight of 0. This translates to PMU activity on each
+timeslot where all State Metric FIFOs but one contain counts of 6 which are
+then transferred to Path Metric FIFOs, whose counts of 6 are in turn removed
+from the Path Metric FIFOs, paired and transferred to the receiving State Met-
+ric FIFOs. Entering or removing a count of 6 from a FIFO involves 21 changes
+of state in the stages. Furthermore, the number of transitions actually involved
+is higher due to (around 5) internal transitions on the elements making up the
+C-gates forming each FIFO stage. Thus, each of 63 nodes experiences around
+650 transitions just on the data path per timeslot. The control and other over-
+heads on the data path can be expected to form (say) an additional 30% of
+logic. This indicates a node activity of around 850 transitions/timeslot and
+overall the PMU can be expected to make a maximum of 54,400 transitions on
+each timeslot.
+Unfortunately, the design of the FIFOs, particularly the Path Metric FIFOs,
+has led to high capacitive loading on the true and inverse C-gate outputs at
+each stage. A dissipation of 1233 mW with 54,400 transitions per slot and an
+energy cost of 5.45 × C joules per transition, where C is the average track plus
+gate loading capacitance (in Farads), indicates an average loading of 92 fF and
+this is confirmed by measurements.
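+
+The arithmetic behind the loading estimate can be reproduced in a few lines of
+Python. Two points are assumptions rather than statements from the text: that
+one timeslot is processed per symbol at 45 Msymbols/sec, and that the energy
+per transition is C V^2 / 2 at the 3.3 V supply (which gives about 5.45 joules
+per farad). The variable names are hypothetical.
+
```python
SUPPLY_V = 3.3           # cell library supply voltage quoted in the text
PMU_POWER_W = 1.233      # measured PMU dissipation (1233 mW)
SLOT_RATE_HZ = 45e6      # assumption: one timeslot per symbol at 45 Msymbols/s
PER_NODE = 850           # transitions per node per timeslot (text's estimate)
NODES = 64

per_slot = PER_NODE * NODES                 # transitions per timeslot
energy_per_farad = 0.5 * SUPPLY_V ** 2      # C*V^2/2 with C = 1 F
load_f = PMU_POWER_W / (per_slot * SLOT_RATE_HZ * energy_per_farad)

print(per_slot, "transitions per timeslot")
print(f"average load of about {load_f * 1e15:.1f} fF")
```
+
+Under these assumptions the result lands within a femtofarad of the 92 fF
+quoted above, which suggests the published figure was derived the same way.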
+By contrast, the HU power efficiency is excellent and is the real success
+story in this design. Its dissipation is low and is far smaller than that in any
+other system the designers are aware of. The HU dissipation demonstrates that
+by keeping a good path, very little computing is required to output the data
+stream when there are no errors. Furthermore, when lots of noise is present,
+so that the backtrace process is active with many good paths in the process of
+being reconstructed concurrently, the dissipation in the HU only rises slightly;
+this indicates that accessing the local winner memory in flip-flops and over-
+writing the global winner information is not a power-costly operation.
+The HU dissipation also compares favourably with a (synchronous) system
+built along similar principles to the HU described here but using RAM ele-
+ments from the library instead of flip-flops. Due to the limitations of the RAM
+devices previously mentioned, these introduce additional complexity because
+only one backtrace is performed at any time; it is therefore necessary to keep
+track of the depth reached by incomplete backtraces which are abandoned for a
+new backtrace leaving a partially complete global winner path reconstruction.
+The difference in dissipation between the asynchronous HU using flip-flops
+and the other using RAMs reflects the power cost of accessing the local win-
+ner RAM and the associated significant additional computation involved in the
+backtrace. This points to the power efficiency of storing the HU information
+in a manner best suited to the task.
+
+14.8.
+Conclusions
+
+As in many asynchronous designs, the system design has had to be ap-
+proached from first principles and has caused a complete rethink about how
+to implement the Viterbi algorithm. This has resulted in a number of novel
+features being incorporated in the PMU and HU units. In the PMU, the deci-
+sion to use serial unary arithmetic has enabled the conventional parallel add,
+compare and select logic to be dispensed with and replaced by dataless FIFOs
+which perform the arithmetic serially.
+While the PMU is an interesting and different design from that convention-
+ally used, its power consumption is not good. Its design illustrates that power
+efficiency at the algorithmic and architecture levels needs to be combined with
+efficient design at the logic, circuit and layout levels to realise the true po-
+tential of a system. This is demonstrated by a synchronous PMU constructed
+along similar architectural principles to those described but implemented using
+a low-power logic family and full custom layout which dissipates only 70 mW
+at 45 Msymbols/sec [15]. It is clear that while a full custom design of the asyn-
+chronous PMU datapath would reduce the current power levels significantly, a
+major revision of the PMU logic for the datapath, paying particular attention to
+loading, is required for a design which has better power efficiency than other
+systems.
+The identification of a global winner is probably the most important advance
+in the PMU design. This has meant that both a good path and a local winner
+history can be kept by the HU, leading to a greatly reduced amount of overall
+storage required to deduce the output data. The use of flip-flop storage has also
+greatly contributed to the power efficiency of this unit and it does demonstrate
+the power advantages of optimising design at all levels in the design hierarchy
+down to and including the logic.
+The HU also illustrates the advantages of asynchronous design in that the
+placing of current information is decoupled from any backtracing operations of
+which there may be many running concurrently. Furthermore, the speed of the
+backtracing is only dependent on the logic required to perform this operation
+and not on any other internal or external system timing. Such a decoupled,
+multiple backtracing activity would clearly be more difficult to organise in the
+context of a synchronous timing environment.
+
+14.8.1
+Acknowledgement
+
+As in any large project, a number of people have been engaged in the design
+and implementation of the Viterbi decoder described in this chapter. It is there-
+fore a pleasure to acknowledge the other colleagues in the Amulet group in the
+Computer Science Department at Manchester University involved at all stages
+in this project, namely Mike Cumpstey, Steve Furber and Peter Riocreux. I am
+also grateful to them for comments on the draft of this chapter.
+
+14.8.2
+Further reading
+
+Further information on the Viterbi algorithm may be found in [148], [71]
+and [70]. Further information on the PREST project is in [1].
+
+
+Chapter 15
+
+PROCESSORS*
+
+Jim D. Garside
+Department of Computer Science, The University of Manchester
+
+jgarside@cs.man.ac.uk
+
+Abstract
+Computer design becomes ever more complex. Small asynchronous systems
+may be intriguing and even elegant but unless asynchronous logic can not only be
+competitive with ‘conventional’ logic but can show some significant advantages
+it cannot be taken seriously in the commercial world.
+There can be no better way to demonstrate the feasibility of something than
+by doing it. To this end several research groups around the world have been
+putting together real, large, asynchronous systems. These have taken several
+forms, but many groups have chosen to start with microprocessors; a processor
+is a good demonstrator because it is well defined, self-contained and forces a de-
+signer to solve problems which are already well understood. If an asynchronous
+implementation of a microprocessor can compare favourably with a synchronous
+device performing an identical function then the case is proven.
+This chapter describes a number of processors that have been fabricated and
+discusses in some detail some of the solutions employed. The primary source of
+the material is the Amulet series of ARM implementations – because these are
+the most familiar to the author – but other devices are included as appropriate.
+The later parts of the chapter widen the descriptions to include memory systems,
+cacheing and on-chip interconnect, illustrating how a complete asynchronous
+System on Chip (SoC) can be produced.
+
+Keywords:
+low-power asynchronous circuits, processor architecture
+
+*The majority of the work described in this chapter has been made possible by grants from the European
+Union Open Microprocessor systems Initiative (OMI). The primary sources of funding have been OMI-
+MAP (Amulet1), OMI/DE-ARM (Amulet2e) and OMI/ATOM (Amulet3). Without this funding none of
+these devices would have been made and this support is gratefully acknowledged.
+
+
+
+274
+Part III: Large-Scale Asynchronous Designs
+
+15.1.
+An introduction to the Amulet processors
+
+Most of the examples in this chapter are based around the Amulet series
+of microprocessors, developed at the University of Manchester. All of these
+have been asynchronous implementations of the ARM architecture [65] and,
+as such, allow direct comparisons with their synchronous contemporaries. It
+should be noted that the primary intention, as in other ARM designs, was to
+produce power-efficient rather than high-performance processors.
+Brief descriptions of the three fabricated Amulet processors and some other
+notable examples are given below.
+
+15.1.1
+Amulet1 (1994)
+
+Figure 15.1.
+Amulet1.
+
+Amulet1 [158] (figure 15.1) was a feasibility study in asynchronous design,
+using techniques based extensively on Sutherland’s Micropipelines [128]. Although
+two-phase signalling was used for communications, standard transparent
+latches were used internally rather than Sutherland’s capture-pass latch (see
+figure 2.11 on page 20). The external two-phase interface proved difficult to
+interface with external commodity parts.
+Amulet1 comprised 60,000 transistors in a 1.0 µm, 2-layer metal process and
+ran the ARM6 instruction set (with the exception of the multiply-accumulate
+operation).
+It achieved about half the instruction throughput of an ARM6
+manufactured on the same process with roughly the same energy efficiency
+(MIPS/W).
+
+
+Chapter 15: Processors
+275
+
+Figure 15.2.
+Amulet2e.
+
+15.1.2
+Amulet2e (1996)
+
+Amulet2e [44] (figure 15.2) was an ARM7 compatible device with complete
+instruction set compliance. In addition to the CPU it included an asynchronous
+4 KByte cache memory and a flexible external interface making it much easier
+to integrate with commodity parts. A few other optimisations such as (limited)
+result forwarding and branch prediction were added.
+Internally this device used four-phase rather than two-phase handshake pro-
+tocols. It occupied 450,000 transistors (mostly in the cache memory) in a
+0.5 µm 3-layer metal process; although about three times faster than Amulet1 it
+was still only half the performance of a contemporary synchronous chip.
+
+15.1.3
+Amulet3i (2000)
+
+Amulet3i [48] was intended as a macrocell for supporting System on Chip
+(SoC) applications rather than a stand-alone device. It is an ARM9 compatible
+device comprising around 800,000 transistors in a 0.35 µm 3-layer metal
+process. It comprises an Amulet3 CPU, 8 KBytes of pseudo-dual port RAM,
+16 KBytes of ROM, a powerful DMA controller and an external memory/test
+interface, all based around a MARBLE [4] asynchronous on-chip bus.
+
+Figure 15.3.
+DRACO.
+
+Amulet3i achieves roughly the same performance as a contemporary, syn-
+chronous ARM with an equal or marginally better energy efficiency. It was
+integrated with a number of synchronous peripheral devices, designed by Ha-
+genuk GmbH, to form the DRACO (DECT Radio Communications Controller)
+device (figure 15.3).
+
+15.2.
+Some other asynchronous microprocessors
+
+Several other groups around the world have also been developing asyn-
+chronous microprocessors over the past decade or so. In this section a (non-
+exhaustive) selection of these are briefly described.
+Caltech has produced two asynchronous processors: the ‘Caltech Asyn-
+chronous Microprocessor’ (1989) [86] was a locally-designed 16-bit RISC
+which was the first single chip asynchronous processor; the ‘MiniMIPS’ [88]
+was an implementation of the R3000 architecture [72]. Both of these proces-
+sors were custom designed using delay-insensitive coding rather than the ‘bun-
+dled data’ used in the Amulets. This, together with a design philosophy aimed
+at speed rather than low-power consumption results in high-performance, high-
+power devices.
+Another MIPS-style microprocessor is the University of Tokyo’s ‘TITAC-2’
+(1997) [130] (figure 15.4) which is an R2000. This was developed in another
+
+Figure 15.4.
+TITAC-2. (Picture courtesy of the University of Tokyo.)
+
+Figure 15.5.
+ASPRO-216.
+(Picture courtesy of the TIMA Laboratory, CIS Group, IMAG
+(Grenoble).)
+
+different design style (quasi-delay insensitive). As may be apparent from the
+figure TITAC-2 employed considerable manual layout.
+‘ASPRO-216’ (1998) [117] from IMAG in Grenoble is slightly different in
+that it is a 16-bit signal processor rather than a general-purpose microprocessor.
+More significantly its design was largely automated and synthesised from a
+CHP (Communicating Hardware Processes) [84, 118] description. This shows
+in the more ‘amorphous’ appearance of the processor, although the chip is
+dominated by its memories (figure 15.5).
+All the devices mentioned above have been research prototypes. Commer-
+cial take up of asynchronous processor technology has been slower; neverthe-
+less there are some significant examples.
+Philips Research Laboratories in Eindhoven have been developing the ‘Tan-
+gram’ [135] circuit synthesis system, primarily aimed at low-performance,
+very low power systems. This has been used to produce an asynchronous im-
+plementation of the 80C51 (1998) [144] which has been deployed in commer-
+cial pager devices where its low power and low EMI properties are particularly
+attractive. It is also intended for use in smartcard applications (see chapter 13
+on page 221).
+Although not strictly a microprocessor the Sharp DDMP (Data-Driven Me-
+dia Processor) (1997) [131] merits inclusion here. Intended for multimedia
+applications this provides a number of parallel processing elements which are
+employed or left idle according to the demand at any time. Asynchronous
+technology was attractive here because of the ease of power management.
+Finally the DRACO device (figure 15.3) was designed specifically for com-
+mercial use although not (yet) marketed due to company reorganisation. As
+a processor in a radio ‘base station’ the low EMI properties of asynchronous
+logic were the reasons for adoption of this technology.
+
+15.3.
+Processors as design examples
+
+Why build an asynchronous microprocessor? Part of the answer must be the
+various advantages of low power, low EMI etc. claimed for any asynchronous
+design and demonstrated in the commercial devices mentioned above, but why
+is a processor a good demonstration of these features?
+In many ways it is not. A better demonstrator of asynchronous advantages
+may well be an application with a regular structure which is amenable to very
+fine grain pipelining: some signal processing or network switching applica-
+tions have these characteristics. However there is a great deal of appeal in con-
+structing a microprocessor. Firstly, it is a well-defined and self-contained prob-
+lem; it is easy to define what a microprocessor should do and to demonstrate
+that it fulfils that specification. Secondly, it forces an asynchronous designer to
+confront and solve a number of implementation problems which might not oc-
+cur if a ‘tailor made’ demonstration was chosen. Lastly, it is often possible to
+compare the result with contemporary, synchronous devices in order to quan-
+tify and assess the results of the work. Of course, microprocessors are also an
+intensely fast-moving and competitive market in which it is hard to compete,
+even in a familiar technology!
+
+15.4.
+Processor implementation techniques
+
+15.4.1
+Pipelining processors
+
+Pipelining [56] the operation of a device such as a microprocessor is an effi-
+cient way to improve performance. At its simplest, pipelining can subdivide a
+time-consuming operation into a number of faster operations whose execution
+can be overlapped. If done ‘perfectly’ a five-stage pipeline can speed up an
+operation by (almost) five times with only a small hardware cost in pipeline
+latches.
+A typical synchronous pipeline should divide the logic into equally timed
+slices. If there is an imbalance in the partitioning the slowest pipeline stage
+sets the clock period for the whole system; faster logic is slowed down by the
+clock resulting in some time wastage (figure 15.6). This is more emphasised
+if the timing of a particular pipeline stage is, in some way, variable or ‘data
+dependent’. Data dependencies can be quite common in a microprocessor: a
+simple example is an ALU operation, where a ‘move’ operation is faster than
+an addition because the former requires no carry propagation. A
+more ‘exaggerated’ example is a memory access where a cache hit is much
+faster than a cache miss.
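The cost of an unbalanced synchronous pipeline can be made concrete with a small calculation. The following is only an illustrative sketch: the stage names follow figure 15.6, but the delay values are invented for the example and are not measurements from any processor discussed here.

```python
# Hypothetical stage delays in nanoseconds (invented for illustration).
STAGE_DELAYS_NS = {
    "fetch": 8.0,
    "decode": 5.0,
    "evaluate": 10.0,
    "transfer": 4.0,
    "write": 3.0,
}


def clock_period(delays):
    """A synchronous clock must be at least as long as the slowest stage."""
    return max(delays.values())


def utilisation(delays):
    """Fraction of each clock period spent on useful work, per stage."""
    period = clock_period(delays)
    return {stage: delay / period for stage, delay in delays.items()}


def pipelined_speedup(delays):
    """Speed-up over performing the whole operation unpipelined.

    Latch overheads are ignored, so this is an upper bound.
    """
    return sum(delays.values()) / clock_period(delays)


if __name__ == "__main__":
    print("clock period:", clock_period(STAGE_DELAYS_NS), "ns")
    for stage, used in utilisation(STAGE_DELAYS_NS).items():
        print(f"{stage:9s} {used:.0%} of the cycle is useful work")
    print("speed-up:", pipelined_speedup(STAGE_DELAYS_NS))
```

With these assumed delays the five-stage pipeline gives a speed-up of 3 rather than 5, because every stage is clocked at the pace of the 10 ns 'evaluate' stage.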
+
+Figure 15.6.
+Synchronous pipeline usage. (Diagram labels: useful work and clock period, shown for the stages Fetch, Decode, Evaluate, Transfer and Write.)
+
+In most such cases the clock must allow for the slowest possible
+cycle. This either slows down the clock or causes the designer to invest
+considerable hardware in speeding up the worst-case operations, for example
+adding fast carry propagation mechanisms. Whilst the latter is a good trade
+if a high proportion of the operations are slow this is poor economics if the
+worst-case operations are rare.
+Another possible solution open to the synchronous designer is to allow cer-
+tain slow operations to occupy more than one clock cycle; this is clearly expe-
+dient when a cache miss occurs and the processor must idle until an off-chip
+memory can be read. However multi-cycle operations introduce the need for
+system-wide clock control to stall other parts of the system; even then the tim-
+ing resolution is still limited to whole clock cycles.
+Asynchronous pipelining is conceptually much easier. Not only is all control
+localised, but it is implicitly adaptable to pipeline stages with variable delays.
+This means that rare, worst-case operations can be accommodated without
+either excessive hardware investment or a significant impact on the overall
+processing time by altering the delay of a pipeline stage dynamically. This,
+combined with the fact that operations can flow at their own rate rather than
+being stacked one to a stage, gives the pipeline considerable ‘elasticity’. In
+such a pipeline a cache miss still occupies a single cycle, although that cycle
+will be a particularly slow one!
+Another clear example of this is visible in the ARM instruction set [65]
+where data processing operations and address calculations may specify an
+operand shift before the ALU operation. In early ARM implementations this
+was contained in a single pipeline stage and hidden by the slower memory ac-
+cess time. To avoid a performance penalty more modern synchronous designs
+have the options of:
+
+an additional shifter pipeline stage (increases latency);
+
+stalling for an extra clock when a shift is required (increases complex-
+ity).
+
+An asynchronous pipeline can simply stretch the execution cycle when re-
+quired. As the additional time is unlikely to be as long as the ALU stage delay
+this is more flexible than either of the synchronous options, and the overall
+impact is small because shift/operate operations are quite rare.
+
+‘Average case’ performance.
+The elasticity of an asynchronous pipeline has
+led to the myth that an asynchronous pipeline can perform its processing in an
+‘average’ rather than worst case time for each pipeline stage. This is true only
+if the unit under consideration is kept constantly busy. This will not be true in
+the general case: on some occasions the unit will be ready before its operands
+and at other times it will have completed before subsequent stages can accept
+its output; in each case an interlude of idleness is enforced. This effect can be
+reduced by providing fast buffers between processing elements to allow some
+‘play’ in the timing, but true average case behaviour can only be achieved with
+buffers of infinite size. Any buffering increases the pipeline latency and should
+therefore be used with circumspection.
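The effect of buffering on an elastic pipeline can be explored with a toy timing model. The sketch below is an assumption-laden illustration, not a model of any Amulet circuit: each stage holds one item, a stage cannot accept a new item until the following stage has taken the previous one, and a FIFO slot is modelled as an extra stage with zero delay. The function names and the delay pattern are invented for the example.

```python
def finish_times(delays):
    """Return the time at which each item leaves the last stage.

    delays[i][s] is the processing time of item i in stage s. Each stage
    holds one item and cannot accept the next until its current item has
    been taken by the following stage (a simple blocking handshake).
    """
    n_stages = len(delays[0])
    prev_start = [0.0] * n_stages   # start times of the previous item
    prev_finish = [0.0] * n_stages  # finish times of the previous item
    result = []
    for row in delays:
        start = [0.0] * n_stages
        finish = [0.0] * n_stages
        for s in range(n_stages):
            operand_ready = finish[s - 1] if s > 0 else 0.0
            if s + 1 < n_stages:
                stage_free = prev_start[s + 1]  # previous item moved on
            else:
                stage_free = prev_finish[s]     # previous item left the pipe
            start[s] = max(operand_ready, stage_free)
            finish[s] = start[s] + row[s]
        prev_start, prev_finish = start, finish
        result.append(finish[-1])
    return result


def with_buffer(delays, slots):
    """Insert `slots` zero-delay FIFO stages between stage 0 and stage 1."""
    return [[row[0]] + [0.0] * slots + list(row[1:]) for row in delays]


# Two stages whose delays alternate slow (3) and fast (1) in step.
DELAYS = [[3, 3], [1, 1], [3, 3], [1, 1]]

if __name__ == "__main__":
    print(finish_times(DELAYS))                  # no buffering
    print(finish_times(with_buffer(DELAYS, 1)))  # one FIFO slot
```

For this deliberately regular pattern the unbuffered pipeline completes items at times 6, 7, 12, 13 (three time units per item in steady state, no better than the worst case), while a single buffer slot gives 6, 7, 10, 11 (two units per item, the mean stage delay). Less regular delay patterns would need more slots to approach the average, and every slot adds latency when the pipeline is empty.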
+In practice an asynchronous pipeline should be balanced in a similar fashion
+to a synchronous pipeline. The difference is that occasional, time consuming
+operations can be accommodated without either pipeline disruption or signif-
+icant extra hardware. An added bonus is that the pipeline latency, especially
+when filling an ‘empty’ pipeline, can be reduced because the first instruction
+is not retarded by the clock at each stage (figure 15.6). The problem then de-
+volves into partitioning the system and solving the resulting problem of internal
+operand dependences.
+
+15.4.2
+Asynchronous pipeline architectures
+
+Once an asynchronous pipeline has been developed it is very easy to add
+pipeline stages to a design. However, pipelining indiscriminately can be a Bad
+Thing, at least in a general-purpose processor. The major reason for this is
+the need to resolve dependencies [56], where one operation must gather its
+operands from the results of previous instructions; if many operations are in
+the pipeline simultaneously it is quite likely that any new instructions will be
+forced to wait for some of these to complete. Resolving dependencies in an
+asynchronous environment can be a relatively complex task and is discussed in
+a later section.
+A less obvious consequence is the increase in latency in the pipeline, not
+only due to added latch delays but because, in some circumstances, the pipeline
+needs to drain and refill. Consider a processor pipeline with a fast FIFO buffer
+acting as a prefetch buffer; this initially seems like a good idea as it can help
+reduce stalls due to (for example) occasional slow cycles in the prefetch (e.g.
+cache miss) and execution (e.g. multiplication) stages. However, at least in
+a general purpose processor, this benefit is masked because a single stall can
+fill up the prefetch buffer and, typically shortly thereafter, a branch requires it
+to be drained. This was an architectural error evidenced in Amulet1, which
+suffered a noticeable performance penalty due to its four-stage prefetch buffer.
+Of course in other applications, where pipeline flushes are rare, the ease of
+adding buffering can provide significant gains; however experience has shown
+that for a general purpose CPU a more conventional approach producing a
+reasonably balanced pipeline based around a known cycle time (such as the
+cache read-hit cycle time) is the ‘best’ approach.
+One definite advantage of an asynchronous pipeline is that the pipeline flow
+can be controlled locally. Consider that bane of the RISC architecture, the
+multi-cycle instruction; in a synchronous environment such an operation must
+be able to suspend large parts or all of the processing pipeline, necessitating a
+widespread control network. In an asynchronous environment the system need
+not be aware of operations in other parts of the system; instead a multi-cycle
+operation simply appears as a longer delay, possibly causing a stall if other
+processing elements require interaction with the busy unit(s).
+In the ARM architecture there is one case where this ability is very useful;
+the multiple register load and store operations (LDM/STM) can transfer an
+arbitrary subset of the sixteen current registers to or from memory. Amulet3
+implements this by generating several local ‘instruction’ packets for the exe-
+cution stages for a single input handshake. At this point it is likely that the
+prefetch will fill up the intervening latches and stall, but this is a natural con-
+sequence of the pipeline’s operation.
+There are other examples of this behaviour in Amulet3 (figure 15.7), notably
+the Thumb decoder which ingests 32-bit packets which can contain either one
+ARM instruction or two of the ‘compressed’, 16-bit Thumb instructions. In
+the latter case two output handshakes are generated for each input. This pro-
+vides an advantage over earlier (synchronous) Thumb implementations, which
+fetch instructions sixteen bits at a time, because it reduces the number of in-
+struction fetch memory cycles and, with a slow memory, uses the full available
+bandwidth. The power consumption in the memory system is also reduced
+commensurately.
+Local control is also possible in instructions such as ‘CMP’ (compare) which
+do not need to traverse the entire pipeline length; it is just as easy to remove a
+packet from the pipeline by generating no output handshakes as it is to generate
+extra packets. In Amulet3 comparison operations only affect the processor’s
+flags which reside in the ‘execute’ stage and therefore cause no activity further
+down the pipeline.
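The idea that a stage may produce several output handshakes, or none, for one input handshake maps naturally onto generators. The sketch below is purely illustrative: the packet tuples, the stage functions and the order in which the two Thumb halves are issued are all assumptions made for the example, not the Amulet3 packet format.

```python
def decode_stage(packet):
    """Yield zero or more output packets for one input packet.

    Packets are plain tuples invented for this sketch:
      ("LDM", [regs])      multiple-register load
      ("THUMB2", lo, hi)   one 32-bit fetch holding two Thumb instructions
      anything else        passed through unchanged
    """
    kind = packet[0]
    if kind == "LDM":
        for reg in packet[1]:
            yield ("LOAD", reg)          # one local packet per register
    elif kind == "THUMB2":
        yield ("THUMB", packet[1])       # two output handshakes ...
        yield ("THUMB", packet[2])       # ... for one input handshake
    else:
        yield packet


def execute_stage(packet, flags):
    """A compare updates the flags and emits nothing; others pass on."""
    if packet[0] == "CMP":
        flags["Z"] = packet[1] == packet[2]
        return                           # packet removed from the pipeline
    yield packet


def run(stream):
    """Push a stream of packets through both stages."""
    flags = {"Z": False}
    passed_on = []
    for packet in stream:
        for decoded in decode_stage(packet):
            passed_on.extend(execute_stage(decoded, flags))
    return passed_on, flags


if __name__ == "__main__":
    stream = [("LDM", [0, 1, 2]), ("CMP", 4, 4),
              ("THUMB2", "a", "b"), ("ADD", 1, 2)]
    print(run(stream))
```

No stage needs to know how many packets its neighbours generate or absorb; each simply handshakes with whatever arrives, which is the local-control property described above.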
+A final benefit of the localised control is that the pipeline operation can
+be regulated by any active stage. Both Amulet2 and Amulet3 have retrofitted a
+‘halt’ (until interrupted) instruction into the ARM instruction set (implemented
+transparently from an instruction which branches to itself). This can be de-
+tected and ‘executed’ anywhere within the pipeline with the same effect of
+causing the processor to stall. Indeed Amulet2 instigates halts in its execution
+unit whereas Amulet3 halts by suspending instruction prefetch, but the over-
+all effect is the same. Halting an asynchronous processor (or part thereof) is
+equivalent to stopping the clock in a synchronous processor and, in a CMOS
+process, can drop the power consumption to near zero. This facility therefore
+makes power management particularly easy and – in many cases – near auto-
+matic.
+
+15.4.3
+Determinism and non-determinism
+
+Before examining specific architectural techniques which can be employed
+in an asynchronous processor it is worth considering something of the de-
+sign philosophies employed. The most significant is probably that of non-
+
+Figure 15.7.
+Amulet3 core organisation. (Diagram labels: Prefetch, Instr Memory, Thumb, Decode, Reg. Rd., Execute, Data Interface, Data Memory, Reorder Buffer, FIFO, Register Write, with latches between stages; annotations: FIQ, IRQ, skip memory, branch addresses, forwarding, indirect PC load, store data, load data, addr., ‘may generate additional packets’.)
+
+determinism because, unlike a synchronous processor, an asynchronous pro-
+cessor can behave non-deterministically and yet still function correctly.
+An advantage in the analysis and design of a synchronous system is that the
+state in the next cycle can be determined entirely from the current state. This
+may also be true in an asynchronous system, but the timing freedom means
+that this is not the only choice of action. Within a small asynchronous state
+machine it is possible to achieve the same behaviour with internal transitions
+ordered differently (e.g. the inputs to a Muller C-element can change in any
+order) and this is also true on a macroscopic level. The first example used here
+is a processor’s prefetch behaviour, chosen because different philosophies have
+been adopted in different projects.
+All the Amulet processors have had a non-deterministic prefetch depth. This
+is achieved by allowing the prefetch unit to run freely, normally only con-
+strained by the rate at which the instruction memory is able to accept instruc-
+tions. In order to branch the prefetch process is ‘interrupted’ and a new pro-
+gramme counter value inserted; this is an asynchronous process which requires
+arbitration and therefore happens at a non-deterministic point.
+An alternative approach, for example used in the ASPRO-216 processor
+[117], is to prefetch a fixed number of instructions.
+This can be done by
+prompting the prefetch unit to begin a new fetch for each instruction which
+completes. If a branch is required this can be signalled; if not, this too is signalled.
+In effect the processing pipeline becomes a ring in which a number of
+tokens are circulated and reused. (See also section 3.8.2 on page 39.)
+Having a deterministic prefetch simplifies certain tasks, notably dealing
+with speculative prefetches and branch delay slots. As it is possible to say
+exactly how many instructions will be prefetched following a branch these can
+be processed or discarded with a simple counting process. However keeping
+tokens flowing around a ring places an extra constraint on the elasticity of the
+pipeline which – in some circumstances – may sacrifice performance.
+With a non-deterministic prefetch depth it is possible to have fetched zero
+or more instructions speculatively, although there will be an upper bound set
+when the pipeline ‘backs up’. In an architecture without delay slots (such
+as ARM) the lower limit is not a problem, but some means other than in-
+struction counting must be provided to indicate that the prefetch stream has
+changed. The Amulet processors do this by ‘colouring’ the prefetch streams.
+To illustrate: imagine the processor prefetches a stream of ‘red’ instructions.
+Eventually, as these are executed, a branch is encountered which causes the
+execution unit to request prefetches starting at a new address and in a different
+colour (‘green’, say). Subsequent red instructions must then be discarded, the
+first green instruction being the next operation completed. A subsequent green
+branch will cause another colour change; because all the former red operations
+
+Figure 15.8.
+Branch deadlock and its prevention. (a) Deadlock; (b) deadlock avoided with extra latch.
+
+must have been flushed at this point it is possible to switch back to red, thus
+only two colours (i.e. a single colour bit) are required.
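The colouring scheme can be sketched in a few lines. This is a sequential toy model under stated assumptions: the instruction tuples, the `run` function and the `prefetch_depth` parameter (which stands in for the genuinely non-deterministic run-ahead of the prefetch unit) are all invented for illustration.

```python
from collections import deque

# Toy program: address -> instruction. ("B", n) branches to address n.
PROGRAM = [("NOP",), ("B", 4), ("NOP",), ("NOP",),
           ("NOP",), ("B", 7), ("NOP",), ("HALT",)]


def run(program, prefetch_depth):
    """Execute `program` and return the addresses actually executed."""
    queue = deque()          # prefetched (colour, address) pairs
    fetch_pc = 0
    fetch_colour = 0         # colour attached to newly fetched instructions
    exec_colour = 0          # colour the execution unit currently accepts
    executed = []
    while True:
        # Prefetch runs ahead freely until the pipeline 'backs up'.
        while len(queue) < prefetch_depth and fetch_pc < len(program):
            queue.append((fetch_colour, fetch_pc))
            fetch_pc += 1
        if not queue:
            return executed
        colour, address = queue.popleft()
        if colour != exec_colour:
            continue         # stale instruction from before a branch: discard
        executed.append(address)
        op = program[address]
        if op[0] == "HALT":
            return executed
        if op[0] == "B":
            exec_colour ^= 1             # a single bit is enough
            fetch_colour = exec_colour
            fetch_pc = op[1]             # redirect the prefetch unit


if __name__ == "__main__":
    for depth in (1, 3, 8):
        print(depth, run(PROGRAM, depth))
```

However far the prefetch has run ahead, the same instructions (addresses 0, 1, 4, 5, 7) are executed; only the number of discarded, wrongly coloured instructions varies. Because the queue is first-in first-out, every instruction of the old colour has been drained before an instruction of the new colour can execute a second branch, which is why toggling one bit suffices.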
+Before leaving this issue there is one other, less obvious, problem with a
+non-deterministic prefetch depth which can cause deadlock if not considered
+in the design. In this architecture the act of branching uses an arbiter to insert
+a token into the processor’s pipeline. If the pipeline is already full – and there
+is nothing to limit this – then the arbiter cannot acknowledge the operation
+and thus the pipeline deadlocks (figure 15.8(a)). Because a branch could occur
+when the pipeline is not full the deadlock is not inevitable, but it could happen
+each time a branch is attempted.
+This problem is easily solved; an extra latch which is normally empty can
+decouple the branch operation from the main pipeline flow (figure 15.8(b)),
+leaving ‘normal’ operation to continue. As a second, valid branch cannot be
+taken until after the first has been accepted and its target fetched and decoded,
+a single latch will always prevent deadlocks here.
+This class of problem is generic. A manifestation of a similar problem be-
+came evident early in the design of Amulet1 where the instruction fetch com-
+peted non-deterministically for the bus with data loads and stores. In a situation
+where the processing pipeline is full it is possible that an instruction fetch can
+occupy the memory bus but be unable to complete because there is no latch
+ready for the result. If a data transfer is pending then it is blocked, resulting
+in the pipeline remaining full (figure 15.9(a)). This can be rectified if the in-
+struction fetch is throttled so that it cannot gain the bus until it is known that
+it can relinquish it again (figure 15.9(b)). The converse of the problem does
+
+Figure 15.9.
+Bus contention deadlock and its prevention. (a) Deadlock; (b) deadlock avoided by throttling prefetch. (Diagram labels: Memory, Throttle.)
+
+not occur because the data transfer, once started, cannot be stalled, so only
+instruction prefetch requires throttling.
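The bus contention deadlock and its cure by throttling can be reproduced in a very small step-by-step model. Everything here is an assumption made for illustration: the pipeline starts full, every instruction is a load needing one data transfer on the shared bus before it can retire, and the arbiter always favours instruction fetch when it is requesting (the unlucky case).

```python
def simulate(throttled, capacity=2, steps=20):
    """Count instructions retired by a toy shared-bus pipeline."""
    occupancy = capacity       # latches currently holding instructions
    head_needs_data = True     # oldest instruction still awaits its data
    bus_owner = None           # None, "fetch" or "data"
    retired = 0
    for _ in range(steps):
        # Complete whatever currently owns the bus, if it can complete.
        if bus_owner == "fetch" and occupancy < capacity:
            occupancy += 1             # fetched instruction finds a latch
            bus_owner = None
        elif bus_owner == "data":
            head_needs_data = False    # data transfers never stall
            bus_owner = None
        # Arbitrate for a free bus.
        if bus_owner is None:
            # A throttled fetch only requests when a latch is known free.
            fetch_requests = (occupancy < capacity) if throttled else True
            if fetch_requests:
                bus_owner = "fetch"
            elif head_needs_data:
                bus_owner = "data"
        # Retire the oldest instruction once its data has arrived.
        if occupancy > 0 and not head_needs_data:
            occupancy -= 1
            retired += 1
            head_needs_data = True     # the next instruction is also a load
    return retired


if __name__ == "__main__":
    print("unthrottled:", simulate(throttled=False))
    print("throttled:  ", simulate(throttled=True))
```

Without the throttle the fetch acquires the bus, finds no empty latch and never releases it, so nothing retires. With the throttle the fetch cannot be granted the bus unless it is already certain to complete, and the pipeline keeps moving.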
+Amulet3, with its separate instruction and data buses, does not exhibit this
+problem within the processor, although the memory system can still suffer an
+analogous deadlock. This is described further in the section on memory.
+If a deterministic, asynchronous solution is preferred this could be imple-
+mented by adding a data transfer phase to every instruction. There is still an
+asynchronous advantage here in that an ‘unwanted’ data access would be very
+fast but there is a price in limiting the adaptability of the processor’s pipeline.
+
+Counterflow Pipeline Processor.
+Although most asynchronous micropro-
+cessor designs have a ‘conventional’ architecture (other than lacking a clock!)
+it may be practicable to implement radically different processor structures, and
+several have been studied. One interesting – and highly non-deterministic –
+idea is the Counterflow Pipeline Processor (CFPP) [126] in which instructions
+flow freely along a pipeline containing processing units towards the register
+bank while operands flow equally freely towards them. This is intended to re-
+duce stalls in waiting for operand dependences between instructions; as well
+as evaluating and carrying its result to completion an instruction can cast the
+result backwards to be picked up as required by subsequent operations.
+Whilst expensive in the number of buses required, the CFPP allows con-
+siderable flexibility in the number and arrangement of functional units (fig-
+ure 15.10). Functional units may be ALUs of full or limited function, mem-
+ory access units, multipliers etc. The only rule is that the operation must be
+attempted by one of the available units, so stalls are only needed if an uncompleted
+instruction reaches the last appropriate functional unit still without all its
+operands.
+
+Figure 15.10.
+Principle of the Counterflow Pipeline Processor. (Diagram labels: Fetch, Instructions, three ‘ALU/...’ functional units each with Instructions and Operands paths, Results, Branch, Registers.)
+
+To ensure that an instruction does not miss any of its operands, passing
+in the other direction, a degree of synchronisation is needed at each pipeline
+stage. Because there is no ‘favoured’ direction in the pipeline an arbiter is
+required at each stage to ensure that two packets do not ‘cross over’ without
+checking each other’s contents. Because every movement of both instructions
+and data requires an arbitration, fast, efficient arbiters are essential to the
+performance of such an architecture.
+Of course deepening the pipeline to accommodate many functional units
+increases the penalty due to branches – which need to propagate ‘backwards’
+to the fetch stage – so good branch prediction is important.
+
+Arbitration and deadlock.
+It is possible to build an asynchronous processor
+which is as deterministic in its operation sequences as its synchronous coun-
+terpart (i.e. deterministic except for such events as interrupts). Alternatively it
+is possible to make a highly non-deterministic processor.
+Each scheme has both advantages and disadvantages. Enforcing synchroni-
+sation could lead to a reduction in performance; for example the memory in-
+terface may not begin a prefetch until told that an instruction does not require
+the memory for a data transfer, with a consequent reduction in available band-
+width. On the other hand the predictability of the system’s behaviour can make
+testing significantly easier by reducing the reachable state space of the system.
+The decision as to what (if anything) should be allowed to be non-deterministic
+is a decision for the designer which must be reviewed in the particular circum-
+stances. However it must be remembered that every non-determinism has an
+associated arbiter (which, theoretically, can require an infinite resolution time)
+and is likely to introduce a potential deadlock which must be identified and
+prevented.
+In the general case a good rule for avoiding deadlock is to examine carefully
+any instances of arbitration for a ‘shared’ resource (such as the bus) and ensure
+that no unit can be granted access until it is certain that it can relinquish it
+again regardless of the behaviour of other parts of the system. Each arbiter
+increases the number of states reachable by the processor and makes the design
+problem harder, but it increases the system’s flexibility. Non-determinism can
+be beneficial if used with caution.
+
+15.4.4
+Dependencies
+
+When a processor is pipelined beyond a certain depth it is necessary to in-
+sert interlocks to ensure that dependences between instructions are satisfied
+and programmes execute correctly. Even devices such as the MIPS R3000
+[72] – which was a ‘Microprocessor without Interlocked Pipeline Stages’ – are
+interlocked in the sense that the programmer/compiler could use a clock cycle
+count to ensure correct operation; an expedient which is disqualified in an
+asynchronous environment. Similar constraints are applied in the ARM archi-
+tecture.
+
+PC pipeline.
+ARM does not use the clock explicitly in the way MIPS does,
+but there is one aspect of the architecture which is similar. The Programme
+Counter (PC) is available to the programmer in the general-purpose register
+set and when it is read the value obtained is the address of the current in-
+struction plus two instructions. This is a historical consequence of the early
+ARM implementations where there were two clock cycles between generating
+the instruction’s address and executing the operation. Compatibility with this
+must be maintained, even in an asynchronous processor where the prefetch and
+execution are autonomous.
+
+Figure 15.11.
+PC pipeline. (Diagram labels: Fetch, Instruction memory, Decode, Reg. Read, PC pipeline.)
+
+Because the generation and subsequent, possible use of the PC are unsyn-
+chronised in an Amulet processor a method of transmitting the value must be
+found. To do this all Amulet processors have maintained a copy of the PC with
+each fetched instruction (figure 15.11). These flow down the pipeline with the
+instruction and can be read whenever the PC may be needed. The PC may be
+required explicitly as an operand or address, implicitly in branch instructions,
+or to find a return address in the case of exceptions such as interrupts or mem-
+ory aborts. Different Amulet cores have varied the exact value (e.g. PC+8)
+held with this ‘PC pipeline’ in attempts to minimise the later calculation over-
+heads, but in Amulet3 the PC is held without any premodification which allows
+any of the required values PC+2, PC+4, PC+8 to be calculated with a simple
+incrementer.
+It is worth noting that the PC values need not be bundled with the instruction
+directly. The PC pipeline can be a separate, asynchronous pipeline from the
+instruction fetch which can have a different depth, providing that a ‘one in,
+one out’ correspondence is maintained. This is a feature which is exploited in
+Amulet3 to throttle the prefetch unit and prevent instruction fetches causing a
+deadlock; this mechanism is described more fully in section 15.5.2.
+
+Register dependencies.
+The greatest register dependency problems are read-
+after-write dependencies. One of these occurs in the case of a fragment of code
+such as:
+
+    LDR     R1, [R0]        ; Load ...
+    ADD     R2, R1, R3      ; ... and read
+
+In this example it is essential that the value is (at least apparently) loaded
+into R1 before the subsequent instruction uses it. As soon as the execution
+path is pipelined there is a risk that this will not be assured and this uncertainty
+is increased in an asynchronous device where the load could take an arbitrary
+time.
+Three solutions for ensuring register bank dependencies are satisfied are
+given below.
+
+Don’t pipeline.
+
+Lock.
+
+Forward.
+
+The first solution was the approach taken in the earliest, synchronous ARM
+implementations. This involves reading register operands, performing an oper-
+ation and writing the result back in a single cycle so that a subsequent operation
+always has a ‘clean’ register view. This is simple but makes the evaluation cy-
+cle very long and is unacceptable in a high-performance processor.
+A locking approach allows selective pipelining of the instruction execution
+by retarding instructions whose operand registers are not yet valid. This in-
+volves setting a ‘lock’ flag when an instruction is decoded which wishes to
+
+
+290
+Part III: Large-Scale Asynchronous Designs
+
+modify a register and clearing the flag again when the value finally arrives. A
+subsequent instruction can test the locks on its operands and be delayed until
+they are all clear. This mechanism is eminently suited to an asynchronous im-
+plementation because a stalled instruction is simply caused to wait, which can
+be done without recourse to arbitration.
+In practice it is convenient to allow more than one result to be outstanding
+for a single register. Partly this is a consequence of the ARM’s extensive use
+of conditional instructions, such as:
+
+    CMP     R1, R2          ; Set flags
+    MOVNE   R0, #-1         ; If R1 ≠ R2
+    MOVEQ   R0, #1          ; If R1 = R2
+
+In such a case a single lock flag (on R0 in this instance) is inadequate and
+some form of semaphore is needed. It turns out that the operation of such
+a semaphore is fairly simple to implement in an asynchronous environment
+provided that testing and incrementing are mutually exclusive. At issue time
+an instruction therefore:
+
+attempts to read its operands and waits until they are all unlocked, then
+
+locks its register destination(s) by incrementing the semaphore.
+
+The semaphore is decremented again when the result is returned; this action
+may take the semaphore to zero and, possibly, free up a waiting instruction.
+This can happen at any time.
+The example above illustrates another potential problem in an asynchronous
+system: the two ‘MOV’ operations are mutually exclusive and so only one will
+be executed. As this is not known at issue time both have incremented the
+semaphore and so both must decrement it, otherwise R0 will be permanently
+locked. In general if a speculative operation is begun it must complete – the
+‘write’ operation therefore always takes place, although sometimes the register
+contents are not changed and only the unlocking is performed.
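+
+The issue and write-back rules above can be sketched in software. This is a
+minimal, sequential Python model and not the Amulet circuit: the class name
+RegisterLocks and its methods are hypothetical, and the mutual exclusion
+between testing and incrementing is assumed rather than modelled.
+

```python
class RegisterLocks:
    """One counting 'semaphore' per register: results still outstanding."""

    def __init__(self, n_regs=16):
        self.pending = [0] * n_regs

    def can_issue(self, operands):
        # An instruction may read its operands only when none is locked.
        return all(self.pending[r] == 0 for r in operands)

    def issue(self, operands, destinations):
        # Read-then-lock: test the operands first, then lock the destinations.
        if not self.can_issue(operands):
            return False
        for r in destinations:
            self.pending[r] += 1
        return True

    def write_back(self, dest):
        # Always unlock, even if a failed condition left the register unchanged.
        assert self.pending[dest] > 0
        self.pending[dest] -= 1


locks = RegisterLocks()
locks.issue(operands=[1, 2], destinations=[])   # CMP   R1, R2
locks.issue(operands=[], destinations=[0])      # MOVNE R0, #-1
locks.issue(operands=[], destinations=[0])      # MOVEQ R0, #1
blocked = not locks.can_issue([0])              # a reader of R0 must wait
locks.write_back(0)                             # first MOV returns (or is skipped)
still_blocked = not locks.can_issue([0])        # one result is still outstanding
locks.write_back(0)                             # second MOV returns (or is skipped)
free = locks.can_issue([0])                     # R0 is readable again
```

+
+Both conditional moves increment the count on R0, so both must decrement it
+before a later reader of R0 can proceed.
+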
+The principle of semaphores was designed and implemented as the ‘lock
+FIFO’ in Amulet1 and Amulet2. In these processors the semaphore also held
+the destination register addresses to avoid carrying them with the instruction.
+As the instructions flowed (largely) in order, results and destinations could be
+paired at write time.
+The lock FIFO was implemented as an asynchronous pipeline as shown in
+figure 15.12. Because the cells in each latch (horizontal) are transparent latches
+the entries are copied from one to the next, thus ensuring the outputs of the OR
+gates are glitch free. These can then be used to stall any read operations on
+the particular register until the write is complete. The only hazard would be
+if a register was actively locked whilst a read was being attempted; this is
+prevented by a sequencing of read-then-lock by the instruction decoder.
+
+
+Figure 15.12. Lock FIFO. (Diagram labels: Lock indicator, FIFO Control.)
+
+Unlocking can happen safely at any time, relative to both the reading and the
+locking of other registers. The destination address, in its decoded form, is
+already available at the bottom of the data FIFO.
+
+Reorder buffer.
+Although the lock FIFO works successfully it can intro-
+duce inefficiency in that it enforces in-order completion on the instructions
+and stalls each instruction until its operands are available. Therefore it is an
+effective, cheap method to guarantee functionality but is less than ideal for
+high-performance architectures.
+In Amulet3 register dependencies are resolved using an asynchronous re-
+order buffer [66, 51]. Whilst the major incentive for this was to facilitate a less
+intrusive page fault mechanism (see below) this allows instructions to complete
+in an arbitrary order and results can be forwarded at any time. It is therefore a
+significant step towards a complete out-of-order asynchronous processor.
+
+Figure 15.13. Reorder buffer position in Amulet3. (Diagram labels: Look-up, Read, Registers, ALU, Memory, Reorder, Allocate, Arrival, Forward, Writeback.)
+
+
+Table 15.1. Reorder buffer allocation.
+
+Instruction           Slot0  Slot1  Slot2  Slot3
+LDR   R0, [R2]        R0     ?      ?      ?
+MOV   R4, #17         R0     R4     ?      ?
+LDR   R7, [R0+4]!     R0     R4     R0     R7
+ADD   R7, R7, R4      R7     R4     R0     R7
+CMP   R3, R4          R7     R4     R0     R7
+ADDNE R7, R7, R6      R7     R7     R0     R7
+SUB   R1, R7, R0      R7     R7     R1     R7
+
+The reorder buffer is positioned between the various data processing units
+and the register bank (figure 15.13) and crosses several timing domains. An
+instruction first encounters the reorder buffer control at decode time when its
+operands may be discovered and forwarded; any results are allocated space
+at this time. Subsequently an instruction may be subdivided (and, in princi-
+ple, issued at any time) and results can arrive separately and independently.
+Operands can be forwarded any number of times between when they arrive
+and when they are overwritten. Finally an ordered writeback occurs where the
+results are retired and the reorder buffer slots freed for reallocation.
+In the decode phase the operand register numbers are compared with a small
+content addressable memory (CAM) to determine which reorder buffer slots
+may contain appropriate values for forwarding. This list always terminates
+with the register itself, so a value is always available from somewhere. Once
+this map is known the reorder buffer slots can safely be reassigned.
+The assignment process cyclically associates each reorder buffer entry with
+a particular register address and the instruction packet carries forward just the
+reorder buffer ‘slot’ number. Consider the following code fragment:
+
+    LDR     R0, [R2]        ;
+    MOV     R4, #17         ;
+    LDR     R7, [R0+4]!     ;
+    ADD     R7, R7, R4      ;
+    CMP     R3, R4          ;
+    ADDNE   R7, R7, R6      ;
+    SUB     R1, R7, R0      ;
+
+Assuming that the reorder buffer has four entries and the next free entry
+is (arbitrarily) 0, the reorder buffer assignment will proceed as shown in ta-
+ble 15.1. In each case the italicized entry is the latest one to have been as-
+signed.
+
+
+Note:
+
+the second load (LDR) operation uses the ARM’s base writeback mode
+and therefore requires two destinations;
+
+the comparison has no register destinations;
+
+the same register address can appear multiple times in the reorder buffer;
+
+a slot is assigned even if the instruction is conditional (ADDNE) and
+may not produce a valid result.
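+
+The cyclic assignment in table 15.1 can be reproduced with a short Python
+sketch. The function name allocate and the way each instruction is paired
+with its destination list are illustrative assumptions; the sketch also
+assumes that a free slot is always available and that the base register is
+allocated before the loaded register for the writeback load.
+

```python
def allocate(program, n_slots=4, next_free=0):
    """Cyclically assign reorder buffer slots to destination registers."""
    slots = [None] * n_slots              # None = slot not yet assigned
    history = []
    for text, destinations in program:
        for reg in destinations:          # slots are assigned serially
            slots[next_free] = reg
            next_free = (next_free + 1) % n_slots
        history.append((text, list(slots)))
    return history


program = [
    ("LDR   R0, [R2]",    ["R0"]),
    ("MOV   R4, #17",     ["R4"]),
    ("LDR   R7, [R0+4]!", ["R0", "R7"]),  # base writeback: two destinations
    ("ADD   R7, R7, R4",  ["R7"]),
    ("CMP   R3, R4",      []),            # no register destination
    ("ADDNE R7, R7, R6",  ["R7"]),        # assigned even though conditional
    ("SUB   R1, R7, R0",  ["R1"]),
]

history = allocate(program)
for text, slots in history:
    print(f"{text:20s}", slots)
```

+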
+
+The instruction decoder still retains the reorder buffer map prior to the start
+of the instruction. In parallel with the reassignment it proceeds as follows,
+using the final instruction as an example:
+
+The locations of the appropriate registers are examined; these may be in
+the process of being reassigned but cannot be overwritten yet because
+the instruction execution has not begun.
+
+For R7 try: slot 1, slot 0, slot 3, register bank.
+
+For R0 try: slot 2, register bank.
+
+Try to read from each location in the list until a value is obtained.
+
+Note that a list of possibilities is required because the assigned slots need
+not contain a valid value. The most obvious cause of invalidation in an ARM
+is an instruction which fails its condition code check (e.g. the value of R7 in
+slot1), but other conditions – such as a preceding branch – can also result in an
+instruction being abandoned.
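+
+The search order quoted above for the final instruction (SUB R1, R7, R0) can
+be derived from the slot map held before that instruction is assigned. The
+following Python sketch is an illustration only: the function names are
+hypothetical, every result is assumed to have arrived already, and an
+invalidated result is represented by None. Waiting for a result that has not
+yet arrived is not modelled.
+

```python
def candidate_slots(reg, slot_map, next_free):
    """Slots that may hold `reg`, most recently assigned first."""
    n = len(slot_map)
    newest_first = [(next_free - 1 - i) % n for i in range(n)]
    return [s for s in newest_first if slot_map[s] == reg]


def forward(reg, slot_map, next_free, results, register_bank):
    """Return the newest valid value of `reg`, else the register bank value."""
    for s in candidate_slots(reg, slot_map, next_free):
        if results[s] is not None:        # None marks an invalidated result
            return results[s]
    return register_bank[reg]


# Map before SUB is assigned (table 15.1, ADDNE row); the next free slot is 2.
slot_map = ["R7", "R7", "R0", "R7"]
r7_order = candidate_slots("R7", slot_map, next_free=2)
r0_order = candidate_slots("R0", slot_map, next_free=2)

# Example values: ADDNE failed its condition, so slot 1 holds no valid result.
results = [40, None, 5, 7]
bank = {"R7": 0, "R0": 0}
r7_value = forward("R7", slot_map, 2, results, bank)
r0_value = forward("R0", slot_map, 2, results, bank)
```

+
+With these example values the search for R7 skips slot 1 and takes the
+result of the earlier ADD from slot 0.
+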
+
+Figure 15.14. Register processes in the Amulet3 decode unit. (Diagram labels: Read CAM, Read register, Forward, Assign slot.)
+
+The flow of control through the decode phase is summarised in figure 15.14.
+Note that the forwarding time can vary depending on external factors while the
+slot assignment time depends on the number of slots which are required (from
+zero to two) and (occasionally) may have to wait for a slot to be available. Re-
+order buffer slots are assigned serially, even within a single instruction which
+simplifies the asynchronous implementation. There is a small performance
+impact on more ‘complex’ instructions, but these are relatively rare.
+
+
+Subsequent to assignment the instruction packets proceed to further stages
+carrying their reorder buffer slot number. Although Amulet3 issues only sin-
+gle instructions in order the ARM instruction set effectively allows two semi-
+independent operations via the internal execution unit and via the external data
+interface (figure 15.15). Each of these may produce a result at any time so each
+has its own port into the reorder buffer. Whilst these inputs are asynchronous
+they are guaranteed to target different slots and can therefore be independent.
+
+Figure 15.15. Sub-instruction routing. (Diagram panels: Internal operation; Data Load; LDM, first cycle; LDM, subsequent cycles. Stages shown: Decode, Execute, Data Int.)
+
+The writeback process simply copies results back to the register bank. On its
+arrival each result ‘fills’ a reorder buffer slot. The ‘writeback’ process therefore
+waits until a particular slot is ready, copies it out, and moves to the next one.
+The fact that the slots may become ready in an arbitrary order is not a concern.
+Superimposed on this process is a forwarding mechanism which waits until
+a result is ready and copies it back into the decode stage. This process can hap-
+pen before, during or after the writeback process; in fact the processes are asyn-
+chronous and concurrent. The key is that both processes use non-destructive
+copying and therefore leave the data available for the other as required. A
+result in the reorder buffer remains until it is overwritten.
+Either of the above processes may be required to wait if they need a result
+which has not yet arrived. In order to control this in a non-interacting way there
+are two separate ‘flags’ which indicate the presence of data in a slot. One flag
+is raised to indicate that the slot is ‘full’; this is cleared when the writeback
+process has copied the result to the register bank. This is at the heart of the
+control circuit shown in figure 15.16 (which will be described shortly). The
+second flag (‘Fcol’) is merely changed to indicate the arrival of a result. As the
+state of this flag alternates on each pass through the cyclic reorder buffer it is
+possible for a forwarding request to test to see if a result has arrived without
+changing the flag’s state (table 15.2). This is essential because a value may be
+forwarded zero or more times, the number not being known at issue time.
+
+Figure 15.16. Reorder buffer copy-back circuit. (Diagram labels: Result_req, Result_ack, WPr_n, WPa_n, Full_n, Fcol, Try, Match, Token_in, Token_out, T_n, T_n+1, Rout_n, Aout, Write back, C.)
+
+Table 15.2. ‘Fcol’ state as results arrive in a reorder buffer.
+
+Result number    0        1        2        3
+0                0 → 1    0        0        0
+1                1        0 → 1    0        0
+2                1        1        0 → 1    0
+3                1        1        1        0 → 1
+4                1 → 0    1        1        1
+5                0        1 → 0    1        1
+6                0        0        1 → 0    1
+7                0        0        0        1 → 0
+8                0 → 1    0        0        0
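+
+The alternating ‘Fcol’ pattern of table 15.2 can be illustrated with a small
+Python sketch. It assumes, as the table does, four slots whose results arrive
+in slot order. The helper has_arrived and its pass_number argument are
+hypothetical: they stand in for whatever record the decoder keeps of the
+colour it expects on the current pass.
+

```python
def fcol_trace(n_slots=4, n_results=9):
    """Fcol bits after each result arrives, results filling slots cyclically."""
    fcol = [0] * n_slots
    rows = []
    for k in range(n_results):
        fcol[k % n_slots] ^= 1            # an arrival toggles the slot's colour
        rows.append(list(fcol))
    return rows


def has_arrived(fcol, slot, pass_number):
    """Non-destructive test: compare with the colour expected on this pass."""
    expected = (pass_number + 1) % 2      # 1 on pass 0, 0 on pass 1, ...
    return fcol[slot] == expected


rows = fcol_trace()
for number, row in enumerate(rows):
    print(number, row)
```

+
+Because the test only compares and never writes, it can be repeated any
+number of times for the same slot without disturbing the flag.
+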
+
+
+As a final embellishment a result returned to the reorder buffer may be in-
+valid (due, for example, to a condition code failure). Each slot is therefore
+accompanied by a flag which can prevent the data being written to the register
+bank (the ‘full’ flag is still cleared) or being forwarded. In the latter case fur-
+ther forwarding may be attempted from a less recent result, culminating in the
+use of the default register value.
+The writeback circuit (figure 15.16) is a good example of the working of
+such asynchronous control circuits. It operates as follows.
+
+The arrival of a result (top left) causes the ‘Full’ bit to be set and ac-
+knowledges itself (top right). The request comes from one of a number
+of mutually exclusive sources.
+
+This event also toggles the forwarding colour (‘Fcol’) which allows any
+instructions issued subsequent to this one to use the result.
+
+Note that the input can be on any of the mutually exclusive input chan-
+nels; two are shown here but more are possible.
+
+The input request can be removed at any time leaving the slot marked as
+‘Full’.
+
+When it becomes this slot’s turn to copy back a token is received (bot-
+tom left) which initiates an output request (bottom centre). If the token
+is received first the circuit waits for the result to arrive. The token is
+acknowledged.
+
+This process also allows the ‘Full’ bit to be reset, waiting for the input
+request to fall if it has not already done so.
+
+When the copy back is complete the (broadcast) acknowledge is picked
+up by the circuit to complete the token input handshake and pass the
+token to the next, similar circuit to the right. The next slot cannot attempt
+to output until the four-phase output acknowledge is complete.
+
+These slots are connected in a ring and reset so that there is a single token
+present for the first stage to output after which it runs freely, emptying each
+slot in turn when it contains a result.
+The experimental (and often unsystematic!) way in which Amulet3 was de-
+signed meant that the original state machine was defined and refined to a ver-
+sion close to that shown in figure 15.16 as a schematic and only subsequently
+subjected to a more formal analysis. The analysis was carried out using Petrify
+(which was introduced in section 6.7 on page 102).
+At first this caused problems due to the choices within the system. The
+output channel ‘calls’ the register bank and it was expedient to broadcast the
+output acknowledge to all the copy back circuits and use the (unique) active
+request to sensitize the relevant location. This proved hard to model in a sin-
+gle subcircuit. However on a suggestion from one of the Petrify developers
+(Alex Yakovlev) it proved easier to model the whole system than a single part.
+The resultant signal transition graph (STG) (see figure 15.17) clearly shows the
+four implemented subcircuits which pass control around cyclically. The rest of
+the processor is abstracted in the four small rings which, when an output hand-
+shake has been completed, can reset the circuit to ‘Full’ again at an arbitrary
+time.
+The analysis with Petrify both verified the circuit’s operation and removed
+a redundant transistor in each subcircuit.
+There are two hazards in the asynchronous processes as described here. The
+first is that there is no local way of preventing a slot being overwritten before
+it has been emptied. In Amulet3 this is guaranteed elsewhere by ensuring that
+a slot is never assigned until it has been freed by the copy back process. In
+effect a count of free slots is maintained which is decremented when a slot is
+assigned and incremented when it is released again. Because assignment and
+release are both cyclic and in-order it is not necessary to pass individual slot
+identification, the presence of a token is adequate. This ‘throttle’ is imple-
+mented as a simple, dataless FIFO which also acts as an asynchronous buffer
+between the unsynchronized processes. This is shown in figure 15.18 for a
+system with four reorder buffer slots; in the state shown one result has been
+produced but not yet committed to the register bank, two more are being gen-
+erated and the decoder could issue another instruction which generated a single
+result.
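+
+The free-slot count described above behaves like a simple counter of tokens.
+The Python sketch below is a sequential illustration under that reading; the
+class name SlotThrottle is hypothetical and the real mechanism is a dataless
+asynchronous FIFO, not a shared integer.
+

```python
class SlotThrottle:
    """Counts free reorder buffer slots; no slot identity is needed."""

    def __init__(self, n_slots=4):
        self.free = n_slots

    def can_assign(self, n_results=1):
        return self.free >= n_results

    def assign(self, n_results=1):
        # The decoder takes one token per result it will produce.
        if not self.can_assign(n_results):
            return False                  # the decoder must wait
        self.free -= n_results
        return True

    def release(self):
        # The copy-back process returns a token when a slot is emptied.
        self.free += 1


# State described for figure 15.18: three slots in use, one still free.
throttle = SlotThrottle(n_slots=4)
for _ in range(3):
    throttle.assign()
one_more = throttle.can_assign(1)         # a single-result instruction may issue
two_more = throttle.can_assign(2)         # a two-result instruction must wait
throttle.release()                        # one result is committed
two_after_release = throttle.can_assign(2)
```

+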
+The other hazard arises because the forwarding and writeback processes are not
+synchronised: the register value which is read (as a default) could be
+changed during the read process, resulting in ‘random’ data being read. However
+this can only happen if there is a valid value for that register in the reorder
+buffer and therefore it is certain that this value will be forwarded in preference
+to the register contents. Providing that the implementor can ensure that the
+‘random’ data does not cause problems within the circuit, the mechanism is
+secure against this asynchronous interaction.
+Studies of ARM code indicated that for the Amulet3 architecture a reorder
+buffer with five or more entries was unlikely ever to fill [50]. Amulet3 therefore
+implements a four entry buffer; however the mechanism described is extensible
+to any chosen size.
+
+Figure 15.17. An STG for all four reorder buffer token-passing circuits. (Signals shown for each circuit n = 1 to 4: WPr_n, WPa_n, Full_n, Rout_n, T_n, with a shared Aout.)
+
+Figure 15.18. Token passing throttle on reorder buffer. (Diagram labels: Decode, Execute, Memory, Reorder.)
+
+15.4.5
+Exceptions
+
+An ‘exception’ is an unexpected event which occurs in the course of running
+a programme. The Amulet processors are compatible with the ARM instruction
+set and, therefore, the types of exceptions and the behaviour when they
+occur are predefined. Ignoring reset, ARM has six types of exception:
+
+Prefetch abort - a memory fault (e.g. page fault) during an instruction
+prefetch;
+
+Data abort - a memory fault during a read or write transfer;
+
+Illegal instruction - an emulator trap;
+
+Software interrupt - a system call (not really an exception, but has similar
+behaviour);
+
+Interrupt - a normal, level sensitive interrupt;
+
+Fast interrupt - similar to normal interrupts, but higher priority.
+
+Of these the majority are quite easy to deal with: software interrupts and
+illegal instructions can be detected at instruction decode time, as can prefetch
+aborts (when there is no instruction to decode); interrupts are imprecise and
+therefore can be inserted anywhere; only data aborts have the ability to cause
+serious problems because they only become evident after the instruction has been
+decoded and started to execute.
+
+Interrupts.
+In any processor an interrupt is an asynchronous event. In one
+sense the arrival of an interrupt can be thought of as the insertion of a ‘call’ in-
+struction in the normal sequence of instruction fetches. At first glance it would
+seem that simply arbitrating an extra ‘instruction’ packet into the instruction
+stream would suffice; however this simplistic view can cause problems.
+The chief problem is that there is some interaction between the interrupt
+‘instruction’ and prefetch stream; the interrupt needs a PC value to synthesise
+its return address, and the interrupting device cannot know what this is. Fur-
+thermore the return address must be valid; if an interrupt is accepted just after
+a branch has been prefetched it could be inserted into code which should not
+be run.
+Amulet3 implements interrupts by ‘hijacking’ rather than inserting instruc-
+tions, the whole operation being performed in the prefetch unit. Although the
+interrupt signals (in this case they are level-sensitive) change asynchronously
+with respect to the prefetch unit the mutual exclusion element is better thought
+of as a synchroniser than as a typical asynchronous arbiter. Figure 15.19 shows
+a method of implementing this. Here any change in the interrupt input will re-
+tard the normal request flow until the synchronised state has been latched; the
+synchronised interrupt signal only changes when done is low.
+
+Figure 15.19. Interrupt synchroniser. (Diagram labels: Request, Interrupt, MUTEX, Latch, Synchronised Interrupt, Done.)
+
+When it is known that an interrupt signal has become active the current PC
+value effectively becomes the address of the interrupt ‘instruction’ and can be
+used to form the return address. This can be sent as an instruction but can save
+time by bypassing the memory. The interrupt can then be disabled to prevent
+further acknowledgement.
+Because this action takes place in the prefetch unit, Amulet3 can treat the
+interrupt entry as a predicted branch and jump directly to the appropriate
+interrupt service routine which, in an ARM, is at a fixed address.
+The problem still arises that the interrupt entry may be speculative. If a
+branch is pending the return address sent to the execution stage may be invalid
+– in any case it will be wrongly coloured and therefore discarded! However
+the act of branching updates both the PC and any associated information, in-
+cluding the interrupt enables. As the interrupt has not been serviced the (level-
+sensitive) request will still be active and another attempt to enter the service
+routine will be made. This time the branch target address will be saved and
+there can be no further impediments.
+
+Data aborts.
+Although it solves the register dependency problems, the re-
+order buffer was originally introduced to simplify the implementation of data
+aborts. The ARM architecture specifies that, if an abort occurs, no effects
+from following instructions will be retained. Earlier Amulet processors did not
+speculate on memory operations, relying on a fast ‘go/no go’ decision from
+any memory management unit. Amulet3 allows for more sophisticated (i.e.
+slower!) memory management by outputting memory transfer requests and
+only checking for aborts at the end of the operation. To be of any worth this
+must allow other, speculative operations to take place in parallel, but these
+operations cannot be ‘retired’ until the outcome of the load is known.
+The reorder buffer provides a place for speculative results to be held – and
+forwarded for reuse if necessary – until they can be retired into registers. In the
+(rare) case of a data abort the speculative results can be discarded, leaving the
+register bank intact. The discard can be achieved either by using a colouring
+scheme, tested by the register writeback process, or by marking speculative
+results as invalid using the same flag as an operation which has failed for other
+reasons. For certain reasons of implementational expediency Amulet3 uses
+the latter method, although the asynchronous hazard of invalidating a result
+whilst a forwarding operation is being attempted must be avoided. (This is
+achieved by implementing two validity bits and nullifying only one of them;
+the copy back process uses an AND of these whereas forwarding uses an OR.
+Forwarding is therefore not disturbed, although the result will be discarded
+later.)
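+
+The effect of the two validity bits can be shown with a very small Python
+sketch. The function names are hypothetical, and it is an assumption of this
+sketch that a result which failed for other reasons has both bits clear.
+

```python
def may_write_back(valid_a, valid_b):
    # Copy back requires both bits: a nullified result never reaches a register.
    return valid_a and valid_b


def may_forward(valid_a, valid_b):
    # Forwarding requires either bit: nullifying one does not disturb it.
    return valid_a or valid_b


speculative = (True, True)                # good result, outcome still unknown
nullified = (True, False)                 # data abort: only one bit is cleared
failed = (False, False)                   # assumed encoding of a failed result

states = {
    "speculative": (may_forward(*speculative), may_write_back(*speculative)),
    "nullified": (may_forward(*nullified), may_write_back(*nullified)),
    "failed": (may_forward(*failed), may_write_back(*failed)),
}
```

+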
+The reorder buffer accounts only for the register state however; the ARM
+holds other state bits which also require preservation. Two separate mecha-
+nisms are used for these, depending on the frequency of changes.
+The first is the current programme status register (CPSR) which holds the
+processor’s flags, operating mode et alia, the whole of which can be repre-
+sented in ten bits. The flags clearly change frequently during execution and
+there are many dependences on these bits due, for example, to compare-branch
+sequences. When attempting a memory operation Amulet3 simply copies the
+current CPSR state into a history buffer [66] FIFO; successful completion of
+the transaction discards this entry, but an abort can restore the CPSR state to
+that at the start of the failed operation.
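+
+A software analogue of the CPSR history buffer is sketched below in Python.
+The class and method names are hypothetical. The sketch assumes that memory
+operations complete in issue order, so that an abort always concerns the
+oldest entry, and that all younger entries are then discarded.
+

```python
from collections import deque


class CpsrHistory:
    """Keeps a copy of the CPSR for each outstanding memory operation."""

    def __init__(self, cpsr=0):
        self.cpsr = cpsr & 0x3FF          # the state fits in ten bits
        self.history = deque()

    def start_memory_op(self):
        self.history.append(self.cpsr)    # snapshot at the start of the transfer

    def complete(self):
        self.history.popleft()            # success: the snapshot is discarded

    def abort(self):
        self.cpsr = self.history.popleft()  # restore the state at the failed op
        self.history.clear()              # assumed: younger snapshots are dropped


state = CpsrHistory(cpsr=0b0000000001)
state.start_memory_op()                   # load issued, outcome unknown
state.cpsr = 0b1000000001                 # a speculative compare changes a flag
state.abort()                             # the load aborts
restored = state.cpsr
```

+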
+The other non-register state is a set of five saved programme status regis-
+ters (SPSRs) which act as a temporary store for the CPSR in various exception
+entries. These change very rarely and it is uneconomic to enlarge the history
+buffer to encompass them, although – in theory – they could be changed be-
+tween a load being issued and an abort being signalled. The solution here was
+simply to use a semaphore to lock the SPSRs whilst any memory operations
+are outstanding. This delivers the required functionality very cheaply and the
+performance penalty is tiny because SPSRs change so rarely.
+
+
+15.5.
+Memory – a case study
+
+It seems reasonable that an asynchronous processor should interact with an
+asynchronous memory system. This implies the need for handshake interfaces
+on a range of memory systems, including RAM, ROM and caches. This is the
+subject of the following sections.
+An individual memory array is a very regular structure and – under steady
+voltage, temperature etc. conditions – will produce data in a constant time.
+At first glance this may suggest that there is not much scope for asynchronous
+design within a memory system. However each part of the memory will have
+its own characteristic timing; in some cases even a simple memory will have a
+variable cycle time. An example is a RAM which will typically take longer to
+read from than it will to write to.
+In fact the memory system is one part of a computer which has extremely
+variable timing; even a clocked computer will take different times to service a
+cache hit and a cache miss. An asynchronous system will accommodate such
+cycle time variation quite naturally and is able to exploit many more subtle
+variations which would be padded to fill a clock cycle in a synchronous ma-
+chine.
+
+15.5.1
+Sequential accesses
+
+A static RAM (SRAM) stores data in small flip-flops which have only a
+very weak output drive. To accelerate read access it is normal to use ‘sense
+amplifiers’ to detect small (sub-digital) swings in voltage and produce an early
+digital output. Sense amplifiers, being analogue parts, are quite power-hungry.
+Sense amplification is only useful when there has been enough voltage
+swing for the read bits to be discriminated; it is also only required until the
+bits can be latched. As this period is certainly less than even half a clock cycle
+this is an ideal application for a self-timed system. A delay can be inserted to
+keep the sense amplifiers ‘switched off’ when the read cycle commences and
+only switch them on when they may be useful. An extra (known) bit in the
+RAM may then be discriminated and, when it has been read, used to latch the
+entire RAM line and disable the sense amplifiers. The same signal can be used
+to indicate the completion of the read cycle, possibly returning the RAM array
+to the precharge state in preparation for another cycle (figure 15.20).
+When designing such a circuit the RAM completion is easy to detect but the
+delay before enabling the sense amplifiers is harder to judge. The designer can
+choose this to be short – to ensure maximum speed – or somewhat longer –
+to ensure the memory is ready before the amplifiers are enabled thus ensuring
+minimum power wastage. If the designer errs then either speed or power may
+be compromised slightly, however functionality is retained.
+
+
+Figure 15.20. Self-timed sense amplifiers. (Diagram labels: Delay, RAM array, Latch.)
+
+A typical SRAM array is organised to be roughly square; a 1 Kbyte RAM
+might therefore be organised as (say) 64 × 128 rather than 256 × 32 even
+though the processor requires only 32 bits on a given cycle. This presents
+the RAM designer with two choices:
+
+multiplex 32 sense amplifiers to the required word;
+
+amplify all 128 bits and ignore the unwanted ones.
+
+The first choice appears better when a read is considered in isolation but
+cycles are rarely so arranged; typical access patterns (especially code fetches)
+exhibit considerable sequentiality and this can be exploited in the hardware
+design.
+When using the first option it is possible to delay the RAM precharge and
+provide a subsequent read operation with a shorter read delay. The Amulet2e
+cache [49] uses this technique and is therefore able to provide subsequent ac-
+cesses within a RAM ‘line’ faster than the first such access. This variation
+in access time is much less than a whole ‘cycle’ and therefore would be of
+no interest to a synchronous designer, but it is exploited automatically in an
+asynchronous system.
+The second option given above can latch the entire RAM line after amplifi-
+cation. It can then service subsequent requests from this latch. This frees the
+RAM array to be precharged and – possibly – used for other purposes. This
+technique is exploited in the Amulet3i RAM which is described below.
+
+15.5.2
+The Amulet3i RAM
+
+As shown in figure 15.7 on page 283, the Amulet3 processor has a Harvard
+architecture with separate instruction and data buses. However in the Amulet3i
+SoC the memory model is unified; this implies that the buses must ‘merge’
+somewhere outside the processor core.
+In practice the local buses are merged in two places: once to get onto the
+on-chip MARBLE bus (see below) and once for access to the local, high-speed
+RAM (figure 15.21). In Amulet3i the local RAM is memory-mapped rather
+than forming a cache, although there is no reason why a cache could not have
+been implemented here; cache design is discussed later.
+
+Figure 15.21. Amulet3i local memory. (Diagram blocks: AMULET3 Core, InstDec Logic, DataDec/Arbiter, 8Kbyte RAM with Inst Port and Data Port, MARBLE Initiator and Target/Initiator interfaces. Signal labels: AmuInstAdr, InstBus, AmuDataAdr, AmuWriteData, DataBus, RIA, RRI, RDA, RRD, RWD, MIA, MID, MDA, MRD, MWD, SDA, SWD, IAI, DAI.)
+
+The local RAM (8 Kbytes) is divided into 1 Kbyte blocks; both buses run
+to each block (figure 15.22). The blocks are interleaved so that – most of the
+time – there need be no interaction between instruction and data fetches. Only
+if there is a conflict with both buses requiring data from the same RAM block
+is there any need for arbitration.
+In the case of a conflict there is no clock on which to control an adjudica-
+tion; access to the block is restricted by a mutual exclusion element (‘mutex’),
+within the Arbiter blocks in figure 15.22, on a ‘first-come, first served’ basis.
+Note that, in general, data and instruction accesses are not synchronised and
+therefore the average wait will be about half the typical RAM access time.
+Collisions are further minimised by using the latch at the output of the sense
+amplifiers (see preceding section) as a form of cache. Here separate latches
+are provided for instruction and data reads, so sequential accesses rarely need
+to compete for the arbitrated resource. In practice this gives performance ap-
+
+
+Chapter 15: Processors
+305
+
+Figure 15.22. Memory block organisation. (Diagram labels: AMULET3 microprocessor,
+local instruction bus, local data bus, 1 Kbyte RAM blocks each with an Arbiter,
+Ibuffer and Dbuffer, and MARBLE initiator and initiator/target interfaces.)
+
+proaching that of a dual-port RAM despite being implemented with standard
+SRAM.
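+The ‘about half the typical RAM access time’ estimate quoted above can be checked
+with a short calculation. As a simplifying assumption (ours, not stated in the text),
+let the access already in progress take a fixed time $T$ and let the conflicting
+request arrive at a time $t$ uniformly distributed over that access. The losing
+request must then wait for the remaining $T - t$:
+\[
+  \mathbb{E}[\text{wait}] = \frac{1}{T}\int_0^T (T - t)\,dt = \frac{T}{2}.
+\]
+This applies only when a collision actually occurs; accesses to different blocks, or
+those satisfied from the sense-amplifier latches, incur no such wait.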
+The local RAM architecture thus provides memory cycles with two differ-
+ent delays (‘random’ and ‘sequential’) with the potential of an added (variable)
+delay in the rare case of a collision. In Amulet3i this is further complicated
+because the two local buses cycle in different times; the instruction bus is sim-
+plified as a read-only bus and runs noticeably faster than the full-function local
+data bus, which also permits external bus mastery to allow DMA and test ac-
+cess (fig. 20). The implications of these various timings are absorbed by the
+asynchronous nature of the system – for instance it is not necessary to slow the
+instruction fetches down by around 25% to fit a clock period set by the data
+bus.
+The inclusion of arbiters within the memory blocks implies that the access
+patterns are non-deterministic. Care must therefore be taken to ensure that the
+system cannot reach a deadlock state. The only possible deadlock that could
+occur in the memory would occur as follows:
+
+1 A (non-sequential) data transfer needs access to a particular RAM block.
+
+2 This is prevented because an instruction fetch is already using the RAM
+array.
+
+
+
+3 The instruction fetch cannot complete because the instruction decoder is
+still busy.
+
+4 The processor pipeline is full and is blocked by the data fetch.
+
+5 Deadlock!
+
+To avoid this it is important not to gain access to the shared resource (the
+RAM array) until it is known that the operation will be able to release it again.
+A data transfer can always do this but provision has to be made restricting the
+generation of instruction fetches until they can be guaranteed to release the
+RAM. In practice the latch following the sense amplifiers forms a convenient,
+dedicated buffer in which to hold the instruction and allow data accesses to
+proceed. In Amulet3 the processor throttles its requests so only a single in-
+struction fetch is outstanding at any given time and this must be removed from
+the ‘I buffer’ before the next address can be sent (figure 15.23).
+
+Figure 15.23. Memory block arbitration and throttling. (Diagram labels: RAM block,
+Arbiter, latches, I tag, I buffer, D tag, D buffer, instructions, data transfers,
+throttling within processor.)
+
+The need for arbitration is rare and thus the possibility of discovering a
+deadlock by random simulation even rarer. It is therefore essential to analyse
+such a non-deterministic system thoroughly to ensure that the opportunities for
+deadlock are removed.
+
+
+
+15.5.3 Cache
+
+An asynchronous cache is very similar to an asynchronous RAM; most of the
+design is a combination of the preceding description of asynchronous RAM
+and standard cache design techniques. However in order to be efficient there
+are certain problems, not present in a synchronous cache, which require solu-
+tion.
+The most significant problems are in managing any conflicts between the
+processor/cache interactions and cache/bus interactions. The first of these to
+address is the issue of line fetch.
+A line fetch generally occurs when a cache miss results from an attempted
+access to a cacheable location. The required word and a small number of
+adjacent words (together, a cache line) are copied from memory into
+the cache. The simplest solution to this is to halt the processor, fetch the entire
+cache line, and allow the processor to proceed as the access is now a cache
+hit. However this requires a processor stall which is considerably longer than
+is strictly necessary.
+A more efficient scheme is to begin fetching the cache line, forward the re-
+quired word to the processor as soon as it arrives (it can often be arranged to be
+the first word fetched) and then allow the processor and line fetch to continue
+independently. Performance is further enhanced by allowing the processor to
+use other parts of the cache whilst the line fetch is proceeding (‘hit-under-
+miss’) and to use the incoming words as soon as they arrive. Unfortunately
+in an asynchronous environment this is difficult because the fetched words are
+arriving with no fixed timing relationship with the processor’s cycles.
+Initial thoughts may suggest arbitration for the cache. However it is possi-
+ble to solve this problem without arbitration while maintaining all the desired
+functions by including a dedicated latch for holding the last fetched line. This
+latch is called the Line Fetch Latch.
+
+Figure 15.24. Control circuit request steering. (Diagram labels: Req., Hit, LFL Hit,
+Select, Read, Read Ack., SYNC., LF LATCH, LF ENGINE, MAIN CACHE ARRAY, ADDR, DATA.)
+
+
+
+The line fetch latch (LFL) (figure 15.24) is actually a set of latches residing
+just outside the true RAM array. It normally holds the last-fetched cache line.
+It has its own tag and comparator which allow it to function much like the other
+cache lines. Note that the LFL holds the only copy of this data. (Incidentally,
+because the LFL is static and requires no sense amplification when it is read it
+can provide faster access in an asynchronous system.)
+When, as a result of a cache miss, a fetch is needed, a line from the RAM is
+selected for rejection. For the moment assume that the cache is write-through
+and therefore the RAM can simply be overwritten. The LFL contents, together
+with its tag, are then copied into the chosen line and the LFL is marked as
+‘empty’. This can happen in parallel with the start of the external access.
+The processor is then assumed to have a cache hit from within the LFL and
+attempts to read the appropriate word; this causes a stall at the synchronisation
+point because the word is empty and – unless the external memory is excep-
+tionally fast – will not have been refilled yet.
+As words arrive they are stored in the LFL and individually flagged as ‘full’.
+As soon as the processor can read a word from the LFL (typically after the
+completion of the first fetch cycle) it can continue. From this time the processor
+can continue in parallel with the remaining words being fetched.
+A subsequent cache cycle could be:
+
+a cache hit: this can proceed independently and without interaction with
+the LFL;
+
+a cache miss: this will cause a stall until the line fetch process is com-
+plete and the fetch process can be repeated;
+
+an LFL hit: this attempts to read the LFL whilst it is being filled. The
+possibilities are that the required word is already present (the processor
+continues) or the word is still pending (the processor must stall until it
+arrives).
+
+Only in the last case is there any interaction between the asynchronous pro-
+cesses. However this interaction is merely a wait which can be implemented
+with a flip-flop and an AND gate (figure 15.25). The potential wait begins
+when the LFL is first emptied (caused by, and therefore synchronised with,
+the processor’s action). The wait can end at an arbitrary time but this merely
+delays a transition; it cannot abort or change an action and can therefore be
+implemented without arbitration or the risk of metastability. This mechanism
+was first implemented in the Amulet2e cache system [49] and was also used in
+TITAC-2 [130].
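+The wait described above can be written as a simple gating condition. As a sketch in
+our own notation (the names $\mathit{req}$, $\mathit{ack}$ and $\mathit{full}_w$ are
+assumptions, not taken from the circuit), let $\mathit{full}_w$ be the flip-flop for
+word $w$ of the LFL, reset when the LFL is emptied and set when word $w$ arrives from
+memory. A processor read request for word $w$ is then acknowledged only when
+\[
+  \mathit{ack} = \mathit{req} \land \mathit{full}_w .
+\]
+While $\mathit{req}$ is asserted, $\mathit{full}_w$ can only change from 0 to 1, so
+$\mathit{ack}$ can be delayed but never withdrawn; this is why no arbiter is needed.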
+Both these processors used a simple write-through cache for simplicity. For
+higher performance a copy-back mechanism is needed. This too can be pro-
+vided using an extension of the LFL mechanism. In this case the process of
+
+
+
+Figure 15.25. Line-fetch latch read synchronisation. (Diagram labels: memory line
+fetch interface, processor read interface, set/reset flip-flops, LF_data0 to
+LF_data3, Data in, En, word address, LF_req, LF_ack, LF_complete.)
+
+line fetching is complicated by the need first to copy the victim line from the
+cache array before overwriting it. The victim line can be placed in a separate
+write buffer together with its address as supplied by its tag field. Note that the
+line fetch is caused by a cache miss, so the rejected line can never be the same
+as the line being fetched (this becomes more important later). The LFL is then
+emptied into the RAM array as before and the refilling begins. The writing
+of the rejected line can be delayed because it is less urgent than satisfying the
+cache miss.
+Each cache line (and the write buffer) also contains a ‘dirty’ flag which is set
+if the cache line has been modified. This can be checked and used to determine
+if the write buffer should be written out (i.e. ‘dirty’ is true) or is already coher-
+ent with the memory; in the latter case the copy-back process can be bypassed.
+This process reduces the write traffic but, with a single entry write buffer,
+does not greatly assist with reducing fetch latency because the write buffer
+must be emptied before a subsequent fetch. However the write buffer can be
+extended, albeit at the cost of introducing an arbiter and the potential pitfalls
+therefrom.
+If a second line fetch is needed before an earlier fetch is complete it is the-
+oretically possible for this to overtake any pending write operations. This is
+also desirable. As the write operation will already be pending – merely wait-
+ing for the bus to become free – it is necessary to determine that the new fetch
+request arrived before the write could begin, which requires arbitration in an
+asynchronous environment. However, once the decision is made the operations
+can proceed as before. Two problems remain however:
+
+if the write buffer becomes full there will be nowhere to evict a line to
+and the system can deadlock;
+
+
+
+if the required line is one which has recently been evicted the fetch could
+overtake a pending write and thus lose cache coherency.
+
+The first problem is relatively easy to solve; a simple counter can ensure
+that only a certain amount of overtaking is allowed and that one space in the
+write buffer always remains free (i.e. when the last entry is filled the next bus
+operation must be a write – unless, of course, it is a ‘clean’ line where the write
+can be assumed and bypassed). This can be implemented, for example, as a
+semaphore.
+The second problem is harder, but can be solved by forwarding in a similar
+manner to the forwarding from a processor’s reorder buffer. A line fetch checks
+the addresses of the entries in the write buffer and, if it finds a match, satisfies
+itself from there instead of requesting the memory bus. In such a case the
+‘fetch’ can be performed with no bus latency and much more rapidly, as it is an internal
+operation and can copy an entire cache line in a single operation. This is, of
+course, irrelevant to the functioning of other parts of the system as the whole
+is self-timed anyway. Such forwarding does not affect the write process (a re-
+fetched ‘dirty’ line is returned clean) and can take place regardless of whether
+the copy-back is pending, in progress, complete or not needed. The write buffer
+acts, in effect, as a write-through victim cache [56].
+There is one caveat to this process; the line fetch occurs in parallel with
+the eviction of a cache line. It is therefore possible that one entry in the write
+buffer could be changing during the comparison process. As noted above this
+entry is reserved for the freshly evicted line and can safely be excluded from
+the comparison, thus averting any possibilities of a false positive due to signals
+changing.
+
+15.6. Larger asynchronous systems
+
+15.6.1 System-on-Chip (DRACO)
+
+DRACO (DECT Radio Communications Controller) (figure 15.26) is a sys-
+tem on chip based around the Amulet3 processor. In terms of area about half
+of the (7 mm square) device is an asynchronous ‘island’ – hence Amulet3i –
+and the other half comprises synchronous peripherals compiled from VHDL.
+The asynchronous subsystem (figure 15.27) is a computer in itself and was
+developed both with commercial intent and with a view to investigating some
+new techniques. The processor and RAM have already been discussed. Some
+other novel asynchronous features are outlined in this section.
+
+15.6.2 Interconnection
+
+Ideally an asynchronous system should be based around an asynchronous
+bus. Indeed it is arguable that large, fast synchronous systems should also use
+
+
+
+Figure 15.26. DRACO layout.
+
+Figure 15.27. Amulet3i asynchronous subsystem. (Diagram labels include: AMULET3,
+MARBLE, RAM, ROM, DMA controller, test interface controller, memory interface,
+MARBLE/synchronous bridge, synchronous peripherals, delay; the diagram marks the
+boundary between the asynchronous and synchronous parts.)
+
+
+
+asynchronous interconnection between their synchronous subsystems to alle-
+viate problems with high-speed clock distribution and clock skew. This model
+is sometimes referred to as “GALS” (Globally Asynchronous, Locally Syn-
+chronous) and may represent an early commercial opportunity for the inclusion
+of asynchronous circuits in ‘conventional’ systems.
+
+MARBLE. As a step in developing such an interconnection standard,
+Amulet3i contains the first implementation of MARBLE [5], a 32-bit,
+multi-master, on-chip bus which communicates by using handshakes rather than a
+clock. Apart from this the signal definitions, with 32-bit address and data,
+look very similar to a conventional bus. MARBLE separates address and data
+communications, allowing pipelining and interleaving of operations in order to
+increase the available bandwidth when several devices require global access.
+MARBLE is supported by ‘initiator’ and ‘target’ interfaces which can be
+attached to any asynchronous component. These, their address, and the bus
+wiring provide all that is needed for communication between the various com-
+ponents. In Amulet3i there are four initiators and seven targets. For example
+the processor’s two local buses each terminate in a MARBLE initiator and the
+local data bus is also a MARBLE target which allows DMA and test data in
+and out of the RAM from other initiators.
+
+Chain. Chain (‘Chip area interconnect’) is currently under development as
+a possible replacement for a conventional bus for on-chip communications.
+Chain is based around narrow, high-speed, point-to-point links forming a net-
+work rather than a bus. The idea is to exploit the potential for fast symbol
+transmission within an asynchronous system while reducing the number of
+long distance wires.
+By using a delay-insensitive coding scheme Chain relieves the chip designer
+of the need to ensure timing closure across the whole chip; it also provides
+tolerance of potential problems such as induced crosstalk on the long inter-
+connection wires. Again the user need only communicate with ‘conventional’
+parallel interfaces.
+
+15.6.3 Balsa and the DMA controller
+
+The DMA controller is a complex, multi-channel unit which was evolved
+according to an external specification. Whilst the function of the unit is rela-
+tively straightforward, even in the asynchronous domain, the unit is notable for
+being the first real application of Balsa synthesis [11].
+The DMA controller comprises a large set of control registers for the many
+DMA channels and a control unit which chooses an active request and services
+it. The registers were designed as blocks of custom VLSI to optimise their area.
+The control logic was written in Balsa, and modified several times as the spec-
+
+
+
+ification changed. The modifications proved remarkably easy to accommodate
+in this environment.
+Such synthesis is not (yet) suitable for high-performance units, but proved
+extremely useful in such an application where the performance was limited by
+other constraints (such as the bus speed, here) and development time predom-
+inates. Of course, in an asynchronous environment, it is easy to accommodate
+devices in any performance range without affecting the overall system func-
+tionality.
+Part II of this book is an introduction to Balsa and includes a complete
+source listing of a simpler 4-channel DMA controller.
+
+15.6.4 Calibrated time delays
+
+To be useful in real systems an asynchronous processor must be able to
+interface with currently available commodity parts. Whilst it is possible – and
+perhaps even desirable – to have memory devices with handshake interfaces,
+these are not available ‘off the shelf’. Instead commodity memories rely on
+absolute timing assumptions to guarantee their correct operation and thus a
+system using them must have an accurate timing reference. This is the one
+thing lacking in an asynchronous design.
+Introducing a clock purely to provide memory timing would negate many of
+the advantages of the asynchronous system; it is therefore preferable to retain
+the idea of providing data ‘on demand’, so long as an adequately precise delay
+is available. This delay need not be a clock; a monostable which can be
+triggered on demand is preferable.
+Amulet1 and Amulet2e relied on an externally supplied delay line to pro-
+vide a bus timing reference. This is a flexible solution in that a short delay
+can be used repeatedly to provide longer timing intervals, providing a flexible,
+programmable interface. For example an area of address space could be set
+up for DRAM devices with (say) 1 delay address set up time, 2 delays RAS
+address hold, etc. The bus interface can then count delays as appropriate.
+This is a reasonable solution but suffers from certain drawbacks:
+
+the on-chip delay for each delay cycle is not factored;
+
+delay lines are not particularly precise;
+
+driving an off-chip delay is power hungry.
+
+A much better solution would be to use an on-chip delay. The chief problem
+with this is that a fixed delay will be imprecise, varying from chip to chip and
+also changing with temperature and supply voltage fluctuations. Any on-board
+delay must therefore be calibratable against an external reference.
+Amulet3i uses such a delay. This comprises a chain of delay elements (fig-
+ure 15.28) which can be ‘shorted’ at a determined point. This can be calibrated
+
+
+
+Figure 15.28. Controllable delay circuit. (Diagram labels: In, Out, and binary
+control inputs.)
+
+by counting the number of cycles completed within a known interval and ad-
+justed accordingly using the control wires shown. Once calibrated the delay
+will only change slowly (e.g. with temperature drift) unless the external condi-
+tions change; calibration can therefore be repeated at infrequent intervals under
+software control. The external timing reference can be a long delay such as a
+period of a 32 kHz ‘watch’ oscillator – hardly power expensive!
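+As a worked illustration of this calibration (the numbers are hypothetical), suppose
+the delay circuit completes $N$ cycles during one reference interval $T_{\text{ref}}$.
+The delay per cycle is then estimated as
+\[
+  d \approx \frac{T_{\text{ref}}}{N}.
+\]
+Taking the reference as one period of a watch oscillator, and assuming the common
+32.768 kHz crystal, $T_{\text{ref}} = 1/32768\ \text{s} \approx 30.5\ \mu\text{s}$;
+a count of $N = 1000$ would then give $d \approx 30.5$ ns. If $d$ is longer than the
+target, the control wires are set to shorten the chain, and vice versa.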
+
+15.6.5 Production test
+
+Figure 15.27 on page 311 includes a ‘test interface controller’ block. The
+DRACO chip was designed as a commercial product, and must therefore be
+testable in production. The test approach adopted in the design of DRACO is
+based upon exploiting the MARBLE bus to access the various on-chip mod-
+ules and their test features. In normal use the external memory interface is a
+MARBLE target, but for test purposes it can be reconfigured as a bus initiator,
+enabling an external tester to control the bus and thereby read and write the
+on-chip memory. Circuits in the test interface controller make this efficient,
+with support for automatic sequential address generation, and so on.
+All of the production tests for the asynchronous subsystem use the exter-
+nal tester to load a test program into the on-chip RAM and then instruct the
+Amulet3 processor to execute the program. The test runs without intervention
+from the tester, which must simply wait a given time before inspecting an on-
+chip memory location to read a pass/fail code. Inevitably the tests run at full
+speed; there is no external timing input to control them.
+Certain on-chip modules are very difficult to test in purely functional mode,
+so additional access routes (via test registers accessible from MARBLE) are
+provided to make the tests more efficient. The calibrated time delay is one
+such circuit, and the processor branch predictor is another. In the latter case,
+the branch predictor is taken off line so that the processor (of which it is a part)
+
+
+
+can manipulate its internal state to run optimised tests on its CAM and RAM
+components.
+Although DRACO is not, at the time of writing, in full-scale production, the
+tests ran without difficulty on prototype sample parts.
+
+15.7. Summary
+
+Devices such as DRACO have demonstrated that it is feasible to build large,
+functional asynchronous systems. Like any prototype system the chip has its
+problems. There are two (known) bugs in the asynchronous half of the system:
+an electrical drive strength problem within the multiplier (which did not
+show up during simulation) and a logic oversight in the prefetch unit which
+falsely indicates a cycle is ‘sequential’ under certain specific conditions (a
+problem if running code from off-chip DRAM). Neither of these is attributable
+to the asynchronous nature of the device (indeed, there are slightly more bugs
+in the synchronous part of the device!) and both are readily fixable. The pro-
+cessor is comparable with an ARM9 manufactured on the same process in
+terms of physical size, performance and energy efficiency; preliminary
+measurements suggest that it is significantly ‘quieter’ in terms of EMI.
+This chapter has presented possible solutions (though certainly not the only
+ones!) to many of the problems facing the designer of complex asynchronous
+processing and memory systems. The majority of the designs described at the
+beginning of the chapter have been produced by academic groups and could be
+classified as “research”; however the complexity of a modern system on chip is
+such that these designs stretch the capability of even a large university group. It
+has been demonstrated that large, functional asynchronous designs are not only
+possible, but can be competitive and have some unique advantages in terms of
+power management and EMI. Asynchronous interconnection may be the only
+solution for large devices, even those with local clocks. Asynchronous chip
+design is ready to move to “development”.
+
+
+
+Epilogue
+
+Asynchronous technology has existed since the first days of digital elec-
+tronics – many of the earliest computers did not employ a central clock signal.
+However, with the development of integrated circuits the need for a straightfor-
+ward design discipline that could scale up rapidly with the available transistor
+resource was pressing, and clocked design became the dominant approach.
+Today, most practising digital designers know very little about asynchronous
+techniques, and what they do know tends to discourage them from venturing
+into the territory. But clocked design is beginning to show signs of stress – its
+ability to scale is waning, and it brings with it growing problems of excessive
+power dissipation and electromagnetic interference.
+During the reign of the clock, a few designers have remained convinced that
+asynchronous techniques have merit, and new techniques have been developed
+that are far better suited to the VLSI era than were the approaches employed on
+early machines. In this book we have tried to illuminate these new techniques
+in a way that is accessible to any practising digital circuit designer, whether or
+not they have had prior exposure to asynchronous circuits.
+In this account of asynchronous design techniques we have had to be selec-
+tive in order not to obscure the principal goal with arcane detail. Much work of
+considerable quality and merit has been omitted, and the reader whose interest
+has been ignited by this book will find that there is a great deal of published
+material available that exposes aspects of asynchronous design that have not
+been touched upon here.
+Although there are commercial examples of VLSI devices based on asyn-
+chronous techniques (a couple of which have been described in this book),
+these are exceptions – most asynchronous development is still taking place in
+research laboratories. If this is to change in the future, where will this change
+first manifest itself?
+The impending demise of clocked design has been forecast for many years
+and still has not happened. If it does happen, it will be for some compelling
+reason, since designers will not lightly cast aside their years of experience in
+one design style in favour of another style that is less proven and less well
+supported by automated tools.
+
+
+
+318
+PRINCIPLES OF ASYNCHRONOUS DESIGN
+
+There are many possible reasons for considering asynchronous design, but
+no single ‘killer application’ that makes its use obligatory. Several of the argu-
+ments for adopting asynchronous techniques mentioned at the start of this book
+– low power, low electromagnetic interference, modularity, etc. – are applica-
+ble in their own niches, but only the modularity argument has the potential to
+gain universal adoption. Here a promising approach that will support heterogeneous
+timing environments is GALS (Globally Asynchronous Locally Synchronous)
+system design. An asynchronous on-chip interconnect – a ‘chip area
+network’ such as Chain (described on page 312) – is used to connect clocked
+modules. The modules themselves can be kept small enough for clock skew to
+be well-contained so that straightforward synchronous design techniques work
+well, and different modules can employ different clocks or the same clock with
+different phases. Once this framework is in place, it is then clearly straightfor-
+ward to make individual modules asynchronous on a case-by-case basis.
+Here, perhaps unsurprisingly, we see the need to merge asynchronous tech-
+nology with established synchronous design techniques, so most of the func-
+tional design can be performed using well-understood tools and approaches.
+This evolutionary approach contrasts with the revolutionary attacks described
+in Part III of this book, and represents the most likely scenario for the wide-
+spread adoption of the techniques described in this book in the medium-term
+future.
+In the shorter term, however, the application niches that can benefit from
+asynchronous technology are important and viable. It is our hope in writing
+this book that more designers will come to understand the principles of asyn-
+chronous design and its potential to offer new solutions to old and new prob-
+lems. Clocks are useful but they can become straitjackets. Don’t be afraid to
+think outside the box!
+
+For further information on asynchronous design see the bibliography at the
+end of this book, the Asynchronous Bibliography on the Internet [111], and
+the general information on asynchronous design available at the Asynchronous
+Logic Homepage, also on the Internet [47].
+
+
+References
+
+[1] G. Abouyannis et al. Project PREST EP25242. European Low Power
+Initiative for Electronic System Design (ESDLPD) Third International
+Workshop, pages 5–49, 2000.
+
+[2] A. Abrial, J. Bouvier, M. Renaudin, P. Senn, and P. Vivet. A new contactless
+smartcard IC using an on-chip antenna and an asynchronous micro-controller.
+IEEE Journal of Solid-State Circuits, 36(7):1101–1107, July 2001.
+
+[3] T. Agerwala. Putting Petri nets to work. IEEE Computer, 12(12):85–94,
+December 1979.
+
+[4] W.J. Bainbridge and S.B. Furber. Asynchronous macrocell intercon-
+nect using MARBLE. In Proc. International Symposium on Advanced
+Research in Asynchronous Circuits and Systems (Async’98), pages 122–
+132. IEEE Computer Society Press, April 1998.
+
+[5] W.J. Bainbridge and S.B. Furber. MARBLE: An asynchronous on-chip
+macrocell bus. Microprocessors and Microsystems, 24(4):213–222,
+April 2000.
+
+[6] T.S. Balraj and M.J. Foster. Miss Manners: A specialized silicon compiler
+for synchronizers. In Charles E. Leiserson, editor, Advanced Research
+in VLSI, pages 3–20. MIT Press, April 1986.
+
+[7] A. Bardsley. The Balsa web pages.
+http://www.cs.man.ac.uk/amulet/balsa/projects/balsa.
+
+[8] A. Bardsley. Implementing Balsa Handshake Circuits. PhD thesis, De-
+partment of Computer Science, University of Manchester, 2000.
+
+[9] A. Bardsley and D.A. Edwards. Compiling the language Balsa to delay-
+insensitive hardware. In C. D. Kloos and E. Cerny, editors, Hardware
+Description Languages and their Applications (CHDL), pages 89–91,
+April 1997.
+
+
+
+
+[10] A. Bardsley and D.A. Edwards. The Balsa asynchronous circuit synthe-
+sis system. In Forum on Design Languages, September 2000.
+
+[11] A. Bardsley and D.A. Edwards. Synthesising an asynchronous DMA
+controller with Balsa. Journal of Systems Architecture, 46:1309–1319,
+2000.
+
+[12] P.A. Beerel, C.J. Myers, and T.H.-Y. Meng. Automatic synthesis of
+gate-level speed-independent circuits. Technical Report CSL-TR-94-648,
+Stanford University, November 1994.
+
+[13] P.A. Beerel, C.J. Myers, and T.H.-Y. Meng. Covering conditions and
+algorithms for the synthesis of speed-independent circuits. IEEE Trans-
+actions on Computer-Aided Design, March 1998.
+
+[14] G. Birtwistle and A. Davis, editors. Proceedings of the Banff VIII Workshop:
+Asynchronous Digital Circuit Design, Banff, Alberta, Canada,
+August 28–September 3, 1993. Springer Verlag, Workshops in Computing
+Science, 1995. Contributions from: S.B. Furber, “Computing
+without Clocks: Micropipelining the ARM Processor,” A. Davis,
+“Practical Asynchronous Circuit Design: Methods and Tools,” C.H. van
+Berkel, “VLSI Programming of Asynchronous Circuits for Low Power,”
+J. Ebergen, “Parallel Program and Asynchronous Circuit Design,” A.
+Davis and S. Nowick, “Introductory Survey”.
+
+[15] I. Bogdan, M. Munteau, P.A. Ivey, N.L. Seed, and N. Powell. Power
+reduction techniques for a Viterbi decoder implementation. European
+Low Power Initiative for Electronic System Design (ESDLPD) Third In-
+ternational Workshop, pages 28–48, 2000.
+
+[16] E. Brinksma and T. Bolognesi. Introduction to the ISO specification
+language LOTOS. Computer Networks and ISDN Systems, 14(1), 1987.
+
+[17] E. Brunvand and R.F. Sproull. Translating concurrent programs into
+delay-insensitive circuits. In Proc. International Conf. Computer-Aided
+Design (ICCAD), pages 262–265. IEEE Computer Society Press,
+November 1989.
+
+[18] J.A. Brzozowski and C.-J.H. Seger. Asynchronous Circuits. Springer
+Verlag, Monographs in Computer Science, 1994. ISBN: 0-387-94420-6.
+
+[19] S.M. Burns. Performance Analysis and Optimization of Asynchronous
+Circuits. PhD thesis, Computer Science Department, California Institute
+of Technology, 1991. Caltech-CS-TR-91-01.
+
+[20] S.M. Burns.
+General condition for the decomposition of state hold-
+ing elements. In Proc. International Symposium on Advanced Research
+in Asynchronous Circuits and Systems. IEEE Computer Society Press,
+March 1996.
+
+[21] S.M. Burns and A.J. Martin. Syntax-directed translation of concurrent
+programs into self-timed circuits. In J. Allen and F. Leighton, editors,
+Advanced Research in VLSI, pages 35–50. MIT Press, 1988.
+
+[22] D.M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems.
+PhD thesis, Stanford University, October 1984.
+
+[23] K.T. Christensen, P. Jensen, P. Korger, and J. Sparsø. The design of an
+asynchronous TinyRISC TR4101 microprocessor core. In Proc. Inter-
+national Symposium on Advanced Research in Asynchronous Circuits
+and Systems, pages 108–119. IEEE Computer Society Press, 1998.
+
+[24] T.-A. Chu. Synthesis of Self-Timed VLSI Circuits from Graph-Theoretic
+Specifications. PhD thesis, MIT Laboratory for Computer Science, June
+1987.
+
+[25] T.-A. Chu and R.K. Roy (editors). Special issue on asynchronous cir-
+cuits and systems. IEEE Design & Test, 11(2), 1994.
+
+[26] T.-A. Chu and L.A. Glasser. Synthesis of self-timed control circuits
+from graphs: An example. In Proc. International Conf. Computer De-
+sign (ICCD), pages 565–571. IEEE Computer Society Press, 1986.
+
+[27] B. Coates, A. Davis, and K. Stevens.
+The Post Office experience:
+Designing a large asynchronous chip. Integration, the VLSI journal,
+15(3):341–366, October 1993.
+
+[28] F. Commoner, A.W. Holt, S. Even, and A. Pnueli.
+Marked directed
+graphs. J. Comput. System Sci., 5(1):511–523, October 1971.
+
+[29] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and
+A. Yakovlev. Petrify: a tool for manipulating concurrent specifications
+and synthesis of asynchronous controllers. In XI Conference on Design
+of Integrated Circuits and Systems, Barcelona, November 1996.
+
+[30] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and
+A. Yakovlev. Petrify: a tool for manipulating concurrent specifications
+and synthesis of asynchronous controllers. IEICE Transactions on In-
+formation and Systems, E80-D(3):315–325, March 1997.
+
+[31] U. Cummings, A. Lines, and A. Martin. An asynchronous pipelined lat-
+tice structure filter. In Proc. International Symposium on Advanced Re-
+search in Asynchronous Circuits and Systems, pages 126–133, Novem-
+ber 1994.
+
+[32] A. Davis. A data-driven machine architecture suitable for VLSI imple-
+mentation. In Proceedings of the First Caltech Conference on VLSI,
+pages 479–494, Pasadena, CA, January 1979.
+
+[33] A. Davis and S.M. Nowick. Asynchronous circuit design: Motivation,
+background, and methods. In G. Birtwistle and A. Davis, editors, Asyn-
+chronous Digital Circuit Design, Workshops in Computing, pages 1–49.
+Springer-Verlag, 1995.
+
+[34] A. Davis and S.M. Nowick. An introduction to asynchronous circuit
+design. Technical Report UUCS-97-013, Department of Computer Sci-
+ence, University of Utah, September 1997.
+
+[35] A. Davis and S.M. Nowick. An introduction to asynchronous circuit
+design. In A. Kent and J. G. Williams, editors, The Encyclopedia of
+Computer Science and Technology, volume 38. Marcel Dekker, New
+York, February 1998.
+
+[36] J.B. Dennis. Data Flow Computation. In Control Flow and Data Flow
+— Concepts of Distributed Programming, International Summer School,
+pages 343–398, Marktoberdorf, West Germany, July 31 – August 12,
+1984. Springer, Berlin.
+
+[37] J.C. Ebergen and R. Berks. Response time properties of linear asyn-
+chronous pipelines. Proceedings of the IEEE, 87(2):308–318, February
+1999.
+
+[38] P.B. Endecott and S.B. Furber.
+Modelling and simulation of asyn-
+chronous systems using the LARD hardware description language. In
+Proceedings of the 12th European Simulation Multiconference, Manch-
+ester, Society for Computer Simulation International, pages 39–43, June
+1994. ISBN 1-56555-148-6.
+
+[39] K.M. Fant and S.A. Brandt. Null Conventional Logic: A complete and
+consistent logic for asynchronous digital circuit synthesis. In Interna-
+tional Conference on Application-specific Systems, Architectures, and
+Processors, pages 261–273, 1996.
+
+[40] R.M. Fuhrer, S.M. Nowick, M. Theobald, N.K. Jha, B. Lin, and
+L. Plana.
+Minimalist: An environment for the synthesis, verifica-
+tion and testability of burst-mode asynchronous machines.
+Techni-
+cal Report TR CUCS-020-99, Columbia University, NY, July 1999.
+http://www.cs.columbia.edu/~nowick/minimalist.pdf.
+
+[41] S.B. Furber and P. Day. Four-phase micropipeline latch control circuits.
+IEEE Transactions on VLSI Systems, 4(2):247–253, June 1996.
+
+[42] S.B. Furber, P. Day, J.D. Garside, N.C. Paver, S. Temple, and J.V.
+Woods. The design and evaluation of an asynchronous microproces-
+sor. In Proc. Int’l. Conf. Computer Design, pages 217–220, October
+1994.
+
+[43] S.B. Furber, D.A. Edwards, and J.D. Garside. AMULET3: a 100 MIPS
+asynchronous embedded processor. In Proc. International Conf. Com-
+puter Design (ICCD), September 2000.
+
+[44] S.B. Furber, J.D. Garside, P. Riocreux, S. Temple, P. Day, J. Liu, and
+N.C. Paver. AMULET2e: An asynchronous embedded controller. Pro-
+ceedings of the IEEE, 87(2):243–256, February 1999.
+
+[45] S.B. Furber, J.D. Garside, S. Temple, J. Liu, P. Day, and N.C. Paver.
+AMULET2e: An asynchronous embedded controller. In Proc. Interna-
+tional Symposium on Advanced Research in Asynchronous Circuits and
+Systems, pages 290–299. IEEE Computer Society Press, 1997.
+
+[46] EU IST-1999-13515, G3Card – generation 3 smartcard, January 2000.
+
+[47] J.D. Garside. The Asynchronous Logic Homepages.
+http://www.cs.man.ac.uk/async/.
+
+[48] J.D. Garside, W.J. Bainbridge, A. Bardsley, D.A. Edwards, S.B. Furber,
+J. Liu, D.W. Lloyd, S. Mohammadi, J.S. Pepper, O. Petlin, S. Temple,
+and J.V. Woods. AMULET3i – an asynchronous system-on-chip. In
+Proc. International Symposium on Advanced Research in Asynchronous
+Circuits and Systems, pages 162–175. IEEE Computer Society Press,
+April 2000.
+
+[49] J.D. Garside, S. Temple, and R. Mehra. The AMULET2e cache sys-
+tem. In Proc. International Symposium on Advanced Research in Asyn-
+chronous Circuits and Systems (Async’96), pages 208–217. IEEE Com-
+puter Society Press, March 1996.
+
+[50] D.A. Gilbert. Dependency and Exception Handling in an Asynchronous
+Microprocessor. PhD thesis, Department of Computer Science, Univer-
+sity of Manchester, 1997.
+
+[51] D.A. Gilbert and J.D. Garside.
+A result forwarding mechanism for
+asynchronous pipelined systems. In Proc. International Symposium on
+Advanced Research in Asynchronous Circuits and Systems (Async’97),
+pages 2–11. IEEE Computer Society Press, April 1997.
+
+[52] B. Gilchrist, J.H. Pomerene, and S.Y. Wong. Fast carry logic for digital
+computers. IRE Transactions on Electronic Computers, EC-4(4):133–
+136, December 1955.
+
+[53] L.A. Glasser and D.W. Dobberpuhl. The Design and Analysis of VLSI
+Circuits. Addison-Wesley, 1985.
+
+[54] S. Hauck. Asynchronous design methodologies: An overview. Proceed-
+ings of the IEEE, 83(1):69–93, January 1995.
+
+[55] L.G. Heller, W.R. Griffin, J.W. Davis, and N.G. Thoma. Cascode voltage
+switch logic: A differential CMOS logic family.
+Proc. International
+Solid State Circuits Conference, pages 16–17, February 1984.
+
+[56] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantita-
+tive Approach (2nd edition). Morgan Kaufmann, 1996.
+
+[57] C.A.R. Hoare. Communicating sequential processes. Communications
+of the ACM, 21(8):666–677, August 1978.
+
+[58] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall,
+Englewood Cliffs, 1985.
+
+[59] D.A. Huffman.
+The synthesis of sequential switching circuits.
+J.
+Franklin Inst., pages 161–190, 275–303, March/April 1954.
+
+[60] D.A. Huffman. The synthesis of sequential switching circuits. In E. F.
+Moore, editor, Sequential Machines: Selected Papers. Addison-Wesley,
+1964.
+
+[61] H. Hulgaard, S.M. Burns, and G. Borriello. Testing asynchronous cir-
+cuits: A survey. Integration, the VLSI journal, 19(3):111–131, Novem-
+ber 1995.
+
+[62] K. Hwang. Computer Arithmetic: Principles, Architecture, and Design.
+John Wiley & Sons, 1979.
+
+[63] ISO/IEC. Mifare identification cards - contactless integrated circuit(s)
+cards - proximity cards. Standard ISO/IEC Standard 14443 Type A.
+
+[64] H. Jacobson, E. Brunvand, G. Gopalakrishnan, and P. Kudva. High-level
+asynchronous system design using the ACK framework. In Proc. Inter-
+national Symposium on Advanced Research in Asynchronous Circuits
+and Systems, pages 93–103. IEEE Computer Society Press, April 2000.
+
+[65] D. Jaggar. Advanced RISC Machines Architecture Reference Manual.
+Prentice Hall, 1996.
+
+[66] M. Johnson. Superscalar Microprocessor Design. Series in Innovative
+Technology. Prentice Hall, 1991.
+
+[67] S.C. Johnson and S. Mazor. Silicon compiler lets system makers design
+their own VLSI chips. Electronic Design, 32(20):167–181, 1984.
+
+[68] G. Jones. Programming in OCCAM. Prentice-Hall International, 1987.
+
+[69] M.B. Josephs, S.M. Nowick, and C.H. van Berkel. Modeling and de-
+sign of asynchronous circuits. Proceedings of the IEEE, 87(2):234–242,
+February 1999.
+
+[70] G.C. Clark Jr. and J.B. Cain. Error correcting coding for digital com-
+munication. Plenum, 1981.
+
+[71] G.D. Forney Jr. The Viterbi algorithm. Proc. IEEE, 13(3):268–278,
+1973.
+
+[72] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice Hall, 1992.
+
+[73] J. Kessels, T. Kramer, G. den Besten, A. Peeters, and V. Timm. Apply-
+ing asynchronous circuits in contactless smart cards. In Proc. Interna-
+tional Symposium on Advanced Research in Asynchronous Circuits and
+Systems, pages 36–44. IEEE Computer Society Press, April 2000.
+
+[74] J. Kessels, T. Kramer, A. Peeters, and V. Timm. DESCALE: a design ex-
+periment for a smart card application consuming low energy. In R. van
+Leuken, R. Nouta, and A. de Graaf, editors, European Low Power Ini-
+tiative for Electronic System Design, pages 247–262. Delft Institute of
+Microelectronics and Submicron Technology, July 2000.
+
+[75] Z. Kohavi. Switching and Finite Automata Theory. McGraw-Hill, 1978.
+
+[76] A. Kondratyev, J. Cortadella, M. Kishinevsky, L. Lavagno, and
+A. Yakovlev. Logic decomposition of speed-independent circuits. Pro-
+ceedings of the IEEE, 87(2):347–362, February 1999.
+
+[77] J. Liu. Arithmetic and control components for an asynchronous micro-
+processor. PhD thesis, Department of Computer Science, University of
+Manchester, 1997.
+
+[78] D.W. Lloyd. VHDL models of asynchronous handshaking. (Personal
+communication, August 1998).
+
+[79] A.J. Martin.
+The probe: An addition to communication primitives.
+Information Processing Letters, 20(3):125–130, 1985.
+Erratum: IPL
+21(2):107, 1985.
+
+[80] A.J. Martin. Compiling communicating processes into delay-insensitive
+VLSI circuits. Distributed Computing, 1(4):226–234, 1986.
+
+[81] A.J. Martin. Formal program transformations for VLSI circuit synthesis.
+In E.W. Dijkstra, editor, Formal Development of Programs and Proofs,
+UT Year of Programming Series, pages 59–80. Addison-Wesley, 1989.
+
+[82] A.J. Martin. The limitations to delay-insensitivity in asynchronous cir-
+cuits. In W.J. Dally, editor, Advanced Research in VLSI: Proceedings of
+the Sixth MIT Conference, pages 263–278. MIT Press, 1990.
+
+[83] A.J. Martin. Programming in VLSI: From communicating processes
+to delay-insensitive circuits.
+In C.A.R. Hoare, editor, Developments
+in Concurrency and Communication, UT Year of Programming Series,
+pages 1–64. Addison-Wesley, 1990.
+
+[84] A.J. Martin. Synthesis of asynchronous VLSI circuits, 1991.
+
+[85] A.J. Martin. Asynchronous datapaths and the design of an asynchronous
+adder. Formal Methods in System Design, 1(1):119–137, July 1992.
+
+[86] A.J. Martin, S.M. Burns, T. K. Lee, D. Borkovic, and P.J. Hazewindus.
+The design of an asynchronous microprocessor.
+In Charles L. Seitz,
+editor, Advanced Research in VLSI, pages 351–373. MIT Press, 1989.
+
+[87] A.J. Martin, S.M. Burns, T.K. Lee, D. Borkovic, and P.J. Hazewindus.
+The first asynchronous microprocessor: The test results. Computer Ar-
+chitecture News, 17(4):95–98, 1989.
+
+[88] A.J. Martin, A. Lines, R. Manohar, M. Nyström, P. Penzes, R. Southworth,
+U.V. Cummings, and T.-K. Lee. The design of an asynchronous
+MIPS R3000. In Proceedings of the 17th Conference on Advanced Re-
+search in VLSI, pages 164–181. MIT Press, September 1997.
+
+[89] R. Milner. Communication and Concurrency. Prentice-Hall, 1989.
+
+[90] C.E. Molnar, I.W. Jones, W.S. Coates, and J.K. Lexau. A FIFO ring
+oscillator performance experiment. In Proc. International Symposium
+on Advanced Research in Asynchronous Circuits and Systems, pages
+279–289. IEEE Computer Society Press, April 1997.
+
+[91] C.E. Molnar, I.W. Jones, W.S. Coates, J.K. Lexau, S.M. Fairbanks, and
+I.E. Sutherland. Two FIFO ring performance experiments. Proceedings
+of the IEEE, 87(2):297–307, February 1999.
+
+[92] D.E. Muller. Asynchronous logics and application to information pro-
+cessing. In H. Aiken and W. F. Main, editors, Proc. Symp. on Applica-
+tion of Switching Theory in Space Technology, pages 289–297. Stanford
+University Press, 1963.
+
+[93] D.E. Muller and W.S. Bartky. A theory of asynchronous circuits. In
+Proceedings of an International Symposium on the Theory of Switch-
+ing, Cambridge, April 1957, Part I, pages 204–243. Harvard University
+Press, 1959. The annals of the computation laboratory of Harvard Uni-
+versity, Volume XXIX.
+
+[94] T. Murata. Petri Nets: Properties, Analysis and Applications. Proceed-
+ings of the IEEE, 77(4):541–580, April 1989.
+
+[95] C.J. Myers. Asynchronous Circuit Design. John Wiley & Sons, July
+2001. ISBN: 0-471-41543-X.
+
+[96] National Bureau of Standards. Data encryption standard, January 1977.
+Federal Information Processing Standards Publication 46.
+
+[97] C.D. Nielsen. Evaluation of function blocks for asynchronous design.
+In Proc. European Design Automation Conference (EURO-DAC), pages
+454–459. IEEE Computer Society Press, September 1994.
+
+[98] C.D. Nielsen, J. Staunstrup, and S.R. Jones. Potential performance ad-
+vantages of delay-insensitivity. In M. Sami and J. Calzadilla-Daguerre,
+editors, Proceedings of IFIP workshop on Silicon Architectures for
+Neural Nets, StPaul-de-Vence, France, November 1990. North-Holland,
+Amsterdam, 1991.
+
+[99] L.S. Nielsen. Low-power Asynchronous VLSI Design. PhD thesis, De-
+partment of Information Technology, Technical University of Denmark,
+1997. IT-TR:1997-12.
+
+[100] L.S. Nielsen, C. Niessen, J. Sparsø, and C.H. van Berkel. Low-power
+operation using self-timed circuits and adaptive scaling of the supply
+voltage. IEEE Transactions on VLSI Systems, 2(4):391–397, 1994.
+
+[101] L.S. Nielsen and J. Sparsø. A low-power asynchronous data-path for a
+FIR filter bank. In Proc. International Symposium on Advanced Re-
+search in Asynchronous Circuits and Systems, pages 197–207. IEEE
+Computer Society Press, 1996.
+
+[102] L.S. Nielsen and J. Sparsø.
+An 85 µW asynchronous filter-bank for
+a digital hearing aid. In Proc. IEEE International Solid State circuits
+Conference, pages 108–109, 1998.
+
+[103] L.S. Nielsen and J. Sparsø. Designing asynchronous circuits for low
+power: An IFIR filter bank for a digital hearing aid. Proceedings of the
+IEEE, 87(2):268–281, February 1999. Special issue on “Asynchronous
+Circuits and Systems” (Invited Paper).
+
+[104] D.C. Noice.
+A Two-Phase Clocking Discipline for Digital Integrated
+Circuits. PhD thesis, Department of Electrical Engineering, Stanford
+University, February 1983.
+
+[105] S.M. Nowick. Design of a low-latency asynchronous adder using specu-
+lative completion. IEE Proceedings, Computers and Digital Techniques,
+143(5):301–307, September 1996.
+
+[106] S.M. Nowick, M.B. Josephs, and C.H. van Berkel (editors).
+Special
+issue on asynchronous circuits and systems. Proceedings of the IEEE,
+87(2), February 1999.
+
+[107] S.M. Nowick, K.Y. Yun, and P.A. Beerel. Speculative completion for
+the design of high-performance asynchronous dynamic adders. In Proc.
+International Symposium on Advanced Research in Asynchronous Cir-
+cuits and Systems, pages 210–223. IEEE Computer Society Press, April
+1997.
+
+[108] International Standards Organization. LOTOS — a formal description
+technique based on the temporal ordering of observational behaviour.
+ISO IS 8807, 1989.
+
+[109] N.C. Paver, P. Day, C. Farnsworth, D.L. Jackson, W.A. Lien, and J. Liu.
+A low-power, low-noise configurable self-timed DSP. In Proc. Interna-
+tional Symposium on Advanced Research in Asynchronous Circuits and
+Systems, pages 32–42, 1998.
+
+[110] M. Pedersen.
+Design of asynchronous circuits using standard CAD
+tools. Technical Report IT-E 774, Technical University of Denmark,
+Dept. of Information Technology, 1998. (In Danish).
+
+[111] A.M.G. Peeters. The ‘Asynchronous’ Bibliography.
+http://www.win.tue.nl/~wsinap/async.html.
+Corresponding e-mail address: async-bib@win.tue.nl.
+
+[112] A.M.G. Peeters. Single-Rail Handshake Circuits. PhD thesis, Eind-
+hoven University of Technology, June 1996.
+http://www.win.tue.nl/~wsinap/pdf/Peeters96.pdf.
+
+[113] J.L. Peterson. Petri nets. Computing Surveys, 9(3):223–252, September
+1977.
+
+[114] Philips Semiconductors. PCA5007 handshake-technology pager IC data
+sheet. http://www.semiconductors.philips.com/pip/PCA5007H.
+
+[115] J. Rabaey. Digital Integrated Circuits: A Design Perspective. Prentice-
+Hall, 1996.
+
+[116] P. Rakers, L. Connell, T. Collins, and D. Russell. Secure contactless
+smartcard ASIC with DPA protection. IEEE Journal of Solid-State Cir-
+cuits, 36(3):559–565, March 2001.
+
+[117] M. Renaudin, P. Vivet, and F. Robin. ASPRO-216: A standard-cell QDI
+16-bit RISC asynchronous microprocessor. In Proc. International Sym-
+posium on Advanced Research in Asynchronous Circuits and Systems
+(Async’98), pages 22–31. IEEE Computer Society Press, April 1998.
+
+[118] M. Renaudin, P. Vivet, and F. Robin. A design framework for asyn-
+chronous/synchronous circuits based on CHP to HDL translation. In
+Proc. International Symposium on Advanced Research in Asynchronous
+Circuits and Systems, pages 135–144, April 1999.
+
+[119] R. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital
+signatures and public-key crypto systems, June 1978.
+
+[120] M. Roncken. Defect-oriented testability for asynchronous ICs. Pro-
+ceedings of the IEEE, 87(2):363–375, February 1999.
+
+[121] C.L. Seitz. System timing. In C.A. Mead and L.A. Conway, editors,
+Introduction to VLSI Systems, chapter 7. Addison-Wesley, 1980.
+
+[122] N.P. Singh. A design methodology for self-timed systems. Master’s
+thesis, Laboratory for Computer Science, MIT, 1981. MIT/LCS/TR-
+258.
+
+[123] J. Sparsø, C.D. Nielsen, L.S. Nielsen, and J. Staunstrup. Design of self-
+timed multipliers: A comparison. In S. Furber and M. Edwards, editors,
+Asynchronous Design Methodologies, volume A-28 of IFIP Transac-
+tions, pages 165–180. Elsevier Science Publishers, 1993.
+
+[124] J. Sparsø and J. Staunstrup. Delay-insensitive multi-ring structures. IN-
+TEGRATION, the VLSI Journal, 15(3):313–340, October 1993.
+
+[125] J. Sparsø, J. Staunstrup, and M. Dantzer-Sørensen. Design of delay in-
+sensitive circuits using multi-ring structures. In G. Musgrave, editor,
+Proc. of EURO-DAC ’92, European Design Automation Conference,
+Hamburg, Germany, September 7-10, 1992, pages 15–20. IEEE Com-
+puter Society Press, 1992.
+
+[126] R.F. Sproull, I.E. Sutherland, and C.E. Molnar. The counterflow pipeline
+processor architecture. IEEE Design & Test of Computers, 11(3):48–59,
+1994.
+
+[127] L. Stok. Architectural Synthesis and Optimization of Digital Systems.
+PhD thesis, Eindhoven University of Technology, 1991.
+
+[128] I.E. Sutherland.
+Micropipelines.
+Communications of the ACM,
+32(6):720–738, June 1989.
+
+[129] Synopsys, Inc. Synopsys VSS Family Core Programs Manual, 1997.
+
+[130] A. Takamura, M. Kuwako, M. Imai, T. Fujii, M. Ozawa, I. Fukasaku,
+Y. Ueno, and T. Nanya. TITAC-2: An asynchronous 32-bit micropro-
+cessor based on scalable-delay-insensitive model. In Proc. International
+Conf. Computer Design (ICCD’97), pages 288–294. MIT Press, Octo-
+ber 1997.
+
+[131] H. Terada, S. Miyata, and M. Iwata.
+DDMPs: Self-timed super-
+pipelined data-driven multimedia processors. Proceedings of the IEEE,
+87(2):282–296, February 1999.
+
+[132] M. Theobald and S.M. Nowick. Transformations for the synthesis and
+optimization of asynchronous distributed control. In Proc. ACM/IEEE
+Design Automation Conference, 2001.
+
+[133] S.H. Unger.
+Asynchronous Sequential Switching Circuits.
+Wiley-
+Interscience, John Wiley & Sons, Inc., New York, 1969.
+
+[134] C.H. van Berkel. Beware the isochronic fork. INTEGRATION, the VLSI
+journal, 13(3):103–128, 1992.
+
+[135] C.H. van Berkel. Handshake Circuits: an Asynchronous Architecture
+for VLSI Programming, volume 5 of International Series on Parallel
+Computation. Cambridge University Press, 1993.
+
+[136] C.H. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, and
+F. Schalij. Asynchronous circuits for low power: a DCC error corrector.
+IEEE Design & Test, 11(2):22–32, 1994.
+
+[137] C.H. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, and
+F. Schalij. A fully asynchronous low-power error corrector for the DCC
+player. In ISSCC 1994 Digest of Technical Papers, volume 37, pages
+88–89. IEEE, 1994. ISSN 0193-6530.
+
+[138] C.H. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken,
+F. Schalij, and R. van de Wiel.
+A single-rail re-implementation of a
+DCC error detector using a generic standard-cell library. In 2nd Work-
+ing Conference on Asynchronous Design Methodologies, London, May
+30-31, 1995, pages 72–79, 1995.
+
+[139] C.H. van Berkel, F. Huberts, and A. Peeters.
+Stretching quasi delay
+insensitivity by means of extended isochronic forks. In Asynchronous
+Design Methodologies, pages 99–106. IEEE Computer Society Press,
+May 1995.
+
+[140] C.H. van Berkel, M.B. Josephs, and S.M. Nowick. Scanning the tech-
+nology: Applications of asynchronous circuits.
+Proceedings of the
+IEEE, 87(2):223–233, February 1999.
+
+[141] C.H. van Berkel, J. Kessels, M. Roncken, R. Saeijs, and F. Schalij. The
+VLSI-programming language Tangram and its translation into hand-
+shake circuits. In Proc. European Conference on Design Automation
+(EDAC), pages 384–389, 1991.
+
+[142] C.H. van Berkel, C. Niessen, M. Rem, and R. Saeijs. VLSI program-
+ming and silicon compilation. In Proc. International Conf. Computer
+Design (ICCD), pages 150–166, Rye Brook, New York, 1988. IEEE
+Computer Society Press.
+
+[143] H. van Gageldonk.
+An Asynchronous Low-Power 80C51 Microcon-
+troller. PhD thesis, Dept. of Math. and C.S., Eindhoven Univ. of Tech-
+nology, September 1998.
+
+[144] H. van Gageldonk, D. Baumann, C.H. van Berkel, D. Gloor, A. Peeters,
+and G. Stegmann. An asynchronous low-power 80c51 microcontroller.
+In Proc. International Symposium on Advanced Research in Asyn-
+chronous Circuits and Systems, pages 96–107. IEEE Computer Society
+Press, April 1998.
+
+[145] P. Vanbekbergen.
+Synthesis of Asynchronous Control Circuits from
+Graph-Theoretic Specifications. PhD thesis, Catholic University of Leu-
+ven, September 1993.
+
+[146] V.I. Varshavsky, M.A. Kishinevsky, V.B. Marakhovsky, V.A. Peschan-
+sky, L.Y. Rosenblum, A.R. Taubin, and B.S. Tzirlin. Self-timed Con-
+trol of Concurrent Processes.
+Kluwer Academic Publisher, 1990.
+V.I.Varshavsky Ed., (Russian edition: 1986).
+
+[147] T. Verhoeff. Delay-insensitive codes - an overview. Distributed Com-
+puting, 3(1):1–8, 1988.
+
+[148] A.J. Viterbi. Error bounds for convolutional codes and an asymptotically
+optimum decoding algorithm.
+In IEEE Transactions on Information
+Theory, volume 13, pages 260–269, 1967.
+
+[149] P. Vivet and M. Renaudin. CHP2VHDL, a CHP to VHDL translator
+- towards asynchronous-design simulation. In L. Lavagno and M.B.
+Josephs, editors, Handouts from the ACiD-WG Workshop on Specification
+models and languages and technology effects of asynchronous design.
+Dipartimento di Elettronica, Politecnico di Torino, Italy, January
+1998.
+
+[150] J.F. Wakerly. Digital Design: Principles and Practices, 3/e. Prentice-
+Hall, 2001.
+
+[151] N. Weste and K. Eshraghian. Principles of CMOS VLSI Design – A Systems
+Perspective, 2nd edition. Addison-Wesley, 1993.
+
+[152] Rik van de Wiel. High-level test evaluation of asynchronous circuits.
+In Asynchronous Design Methodologies, pages 63–71. IEEE Computer
+Society Press, May 1995.
+
+[153] T.E. Williams. Self-Timed Rings and their Application to Division. PhD
+thesis, Department of Electrical Engineering and Computer Science,
+Stanford University, 1991. CSL-TR-91-482.
+
+[154] T.E. Williams. Analyzing and improving latency and throughput in self-
+timed rings and pipelines. In Tau-92: 1992 Workshop on Timing Issues
+in the Specification and Synthesis of Digital Systems. ACM/SIGDA,
+March 1992.
+
+[155] T.E. Williams. Performance of iterative computation in self-timed rings.
+Journal of VLSI Signal Processing, 6(3), October 1993.
+
+[156] T.E. Williams and M.A. Horowitz.
+A zero-overhead self-timed 160
+ns. 54 bit CMOS divider.
+IEEE Journal of Solid State Circuits,
+26(11):1651–1661, 1991.
+
+[157] T.E. Williams, N. Patkar, and G. Shen. SPARC64: A 64-b 64-active-
+instruction out-of-order-execution MCM processor.
+IEEE Journal of
+Solid-State Circuits, 30(11):1215–1226, November 1995.
+
+[158] J.V. Woods, P. Day, S.B. Furber, J.D. Garside, N.C. Paver, and S. Tem-
+ple. AMULET1: An asynchronous ARM processor. IEEE Transactions
+on Computers, 46(4):385–398, April 1997.
+
+[159] C. Ykman-Couvreur, B. Lin, and H. de Man. Assassin: A synthesis sys-
+tem for asynchronous control circuits. Technical report, IMEC, Septem-
+ber 1994. User and Tutorial manual.
+
+[160] K.Y. Yun and D.L. Dill. Automatic synthesis of extended burst-mode
+circuits: Part II (automatic synthesis). IEEE Transactions on Computer-
+Aided Design, 18(2):118–132, February 1999.
+
+
+Index
+
+Acknowledgement (or indication), 15
+Activation port, 163
+Active port, 156
+Actual case latency, 65
+Adaptive voltage scaling, 231
+Addition (ripple-carry), 64
+Amulet microprocessors, 274
+Amulet1, 274, 281, 285, 290
+Amulet2, 290
+Amulet2e, 275, 303
+Amulet3, 282, 286, 291, 297, 300
+Amulet3i, 156, 205, 275, 303, 313
+DRACO, 278, 310
+And-or-invert (AOI) gates, 102
+Arbitration, 79, 202, 210–211, 269, 287, 304
+ARM, 274, 280, 297
+ASPRO-216, 278, 284
+Asymmetric delay, 48, 53
+Asynchronous advantages, 3, 231
+Asynchronous disadvantages, 232
+Asynchronous synthesis, 155
+Atomic complex gate, 94, 103
+Automatic performance adaptation, 231
+Average power consumption, 237
+Balsa, 123, 155, 312
+communications, 179
+area cost, 167
+array types, 175
+arrayed channels, 166, 176
+auto-assignment, 178, 184
+channel viewer, 171
+conditional execution, 180
+constants, 166, 174
+data types, 173
+design flow, 159
+DMA controller, 211
+enumerated types, 174
+for loops, 166
+hardware sharing, 187
+looping constructs, 180
+modular compilation, 165
+numeric types, 173
+operators, 181
+parallel composition, 165
+parameterised descriptions, 193
+program structure, 181
+record types, 174
+recursive definitions, 195
+simulation, 168
+structural iteration, 180, 189
+test harness, 168, 197
+tools, 159
+Branch colour, 284, 300
+Breeze, 159, 162
+Bubble limited, 49
+Bubble, 30
+Bundled-data, 9, 157, 255
+Burst mode, 86
+input burst, 86
+output burst, 86
+C-element, 14, 58, 92, 257
+asymmetric, 100, 59
+generalized, 100, 103, 105
+implementation, 15
+specification, 15, 92
+Cache, 275, 303–304, 307
+Calibrated time delays, 313
+Caltech, 133, 276
+Capture-pass latch, 19
+Cast, 176
+CCS (calculus of communicating systems), 123
+Channel (or link), 7, 30, 156
+communication in Balsa, 162
+Channel type
+biput, 115
+nonput, 115
+pull, 10, 115
+push, 10, 115
+Chip area interconnect (Chain), 312
+CHP (communicating hardware processes),
+123–124, 278
+Circuit templates:
+for statement, 37
+if statement, 36
+while statement, 38
+Classification
+delay-insensitive (DI), 25
+quasi delay-insensitive (QDI), 25
+self-timed, 26
+speed-independent (SI), 25
+Closed circuit, 23
+Codeword (dual-rail), 12
+empty, 12
+intermediate, 12
+valid, 12
+Compatible states, 85
+Complete state coding (CSC), 88
+Completion indication, 65
+Completion
+detection, 21–22, 302
+indication, 62
+strong, 62
+weak, 62
+Complex gates, 104
+Concurrent processes, 123
+Concurrent statements, 123
+Consistent state assignment, 88
+Control limited, 50
+Control logic for transition signaling, 20
+Control-data-flow graphs, 36
+Convolution encoding, 250
+Counterflow Pipeline Processor (CFPP), 286
+CSP (communicating sequential processes), 123,
+223
+Cycle time of a ring, 49
+Data dependency, 279
+Data encoding
+bundled-data, 9
+dual-rail, 11
+m-of-n, 14
+one-hot (or 1-of-n), 13
+single-rail, 10
+Data limited, 49
+Data types, 173
+Data validity scheme (4-phase bundled-data)
+broad, 116
+early, 116
+extended early, 116
+late, 116
+Data validity, 156
+Data-flow abstraction, 7
+DCVSL, 70
+Deadlock, 30, 155, 199, 285, 287, 305
+Delay assumptions, 23
+Delay insensitive minterm synthesis (DIMS), 67
+Delay matching, 11, 236
+Delay model
+fixed delay, 83
+inertial delay, 83
+delay time, 83
+reject time, 83
+min-max delay, 83
+transport delay, 83
+unbounded delay, 83
+Delay selection, 66, 305
+
+Delay-insensitive (DI), 12, 17, 25, 156
+codes, 12, 312
+Demultiplexer (DEMUX), 32
+Dependency graph, 52
+DES coprocessor, 241
+Design for test, 314
+Determinism, 282
+Differential logic, 70
+DIMS, 67–68
+DMA controller, 202, 205
+Balsa description, 211
+control unit, 209, 213
+on DRACO, 312
+transfer engine, 210, 212
+DRACO, 276, 278, 310, 315
+Dual-rail carry signals, 65
+Dual-rail encoding, 11
+Dummy environment, 87
+Dynamic wavelength, 49
+Electromagnetic interference (EMI), 278, 315
+emission spectrum, 231
+Electromagnetic radiation (as power source), 222
+Empty word, 12, 29–30
+Environment, 83
+Event, 9
+Exceptions, 297
+Excitation region, 97
+Excited gate/variable, 24
+FIFO, 16, 257
+Finite state machine (using a ring), 35
+Firing (of a gate), 24
+For statement, 37, 166
+Fork, 31
+Forward latency, 47
+Four-phase handshake, 225
+Function block, 31, 60–61
+bundled-data (“speculative completion”), 66
+bundled-data, 18, 65
+dual-rail (DIMS), 67
+dual-rail (Martin’s adder), 71
+dual-rail (null convention logic), 69
+dual-rail (transistor level CMOS), 70
+dual-rail, 22
+hybrid, 73
+strongly indicating, 62
+weakly indicating, 62
+Fundamental mode, 81, 83–84
+Generalized C-element, 103, 105
+Generate (carry), 65
+Globally Asynchronous, Locally Synchronous
+(GALS), 312
+Greatest common divisor (GCD), 38, 131, 226
+Guarded command, 128, 240
+Guarded repetition, 128
+Halt, 237, 282
+Handshake channel, 115, 156, 225
+biput, 115
+nonput, 115, 129
+pull, 10, 115, 129, 156
+push, 10, 115, 129, 156
+Handshake circuit, 128, 162, 223
+2-place ripple FIFO, 130–131
+2-place shift register, 129
+greatest common divisor (GCD), 132, 226
+Handshake component, 156, 225
+arbiter, 79
+bar, 131
+case, 157
+demultiplexer, 32, 76, 131
+do, 131, 226
+fetch, 157, 163
+fork, 31, 58, 131, 133
+join, 31, 58, 130
+latch, 29, 31, 57
+2-phase bundled-data, 19
+4-phase bundled-data, 18, 106
+4-phase dual-rail, 21
+merge, 32, 58
+multiplexer, 32, 76, 109, 131
+parallel, 225–226
+passivator, 130
+repeater, 129, 163
+sequencer, 129, 163, 225
+transferer, 130
+variable, 130, 163
+Handshake expansion, 133
+Handshake protocol, 7, 9
+2-phase bundled-data, 9, 274–275
+2-phase dual-rail, 13
+4-phase bundled-data, 9, 117, 255
+4-phase dual-rail, 11
+non-return-to-zero (NRZ), 10
+return-to-zero (RTZ), 10
+Handshaking, 7, 155
+Hazard, 297
+dynamic-01, 83
+dynamic-10, 83, 95
+static-0, 83
+static-1, 83, 94
+Huffman, D. A., 84
+Hysteresis, 22, 64
+If statement, 36, 181
+IFIR filter bank, 39
+Indication (or acknowledgement), 15
+of completion, 65
+dependency graphs, 73
+distribution of valid/empty indication, 72
+strong, 62
+weak, 62
+Initial state, 101
+Initialization, 30, 101
+Input free choice, 88
+Input-output mode, 81, 84
+Instruction prefetching, 236
+
+Intermediate codeword, 12
+Interrupts, 299
+Isochronic fork, 26
+Iterative computation (using a ring), 35
+Join, 31
+Kill (carry), 65
+LARD, 159
+Latch (see also: handshake comp.), 18
+Latch controller, 106
+fully-decoupled, 120
+normally opaque, 121
+normally transparent, 121
+semi-decoupled, 120
+simple/un-decoupled, 119
+Latency, 47
+actual case, 65
+Line fetch latch (LFL), 308
+Link (or channel), 7, 30
+Liveness, 88
+Lock FIFO, 290
+Logic decomposition, 94
+Logic thresholds, 27
+LOTOS, 123
+M-of-n threshold gates with hysteresis, 69
+Makefile, 165–166
+MARBLE bus, 209, 304, 312
+Matched delay, 11, 65
+Memory, 302
+Merge, 32
+Metastability, 78
+filter, 78
+mean time between failure, 79
+probability of, 79
+Micropipelines, 19, 156, 274
+Microprocessors
+80C51, 236
+Amulet series, 274
+ASPRO-216, 278, 284
+asynchronous MIPS R3000, 133
+asynchronous MIPS, 39
+CFPP, 286
+MiniMIPS, 276
+TITAC-2, 276
+Minterm, 22, 67
+Modulo-10 counter, 158, 185
+Modulo-16 counter, 183
+Monotonic cover constraint, 97, 99, 103
+Muller C-element, 15
+Muller model of a closed circuit, 23
+Muller pipeline/distributor, 16, 257
+Muller, D., 84
+Multi-cycle instruction, 281
+Multiplexer (MUX), 32, 109
+Mutual exclusion, 58, 77, 300, 304
+mutual exclusion element (MUTEX), 77
+N-way multiplexer, 195
+NCL adder, 70
+Non-determinism, 282, 305
+Non-return-to-zero (NRZ), 10
+Null Convention Logic (NCL), 69
+NULL, 12
+OCCAM, 123
+Occupancy (or static spread), 49
+One-hot encoding, 13
+Operator reduction, 134
+Optimization, 227
+Parallel composition, 164
+Passive port, 156
+Performance parameters:
+cycle time of a ring, 49
+dynamic wavelength, 49
+forward latency, 47
+latency, 47
+period, 48
+reverse latency, 48
+throughput, 49
+Performance
+analysis and optimization, 41
+average case, 280
+Period, 48
+Persistency, 88
+Petri net, 86
+merge, 88
+1-bounded, 88
+controlled choice, 89
+firing, 86
+fork, 88
+input free choice, 88
+join, 88
+liveness, 88
+places, 86
+token, 86
+transition, 86
+Petrify, 102, 296
+Pipeline, 5, 30, 279
+2-phase bundled-data, 19
+4-phase bundled-data, 18
+4-phase dual-rail, 20
+balance, 281
+Place, 86
+Power consumption, 231, 234
+Power efficiency, 250
+Power supply, 246
+Precharged CMOS circuitry, 116
+Prefetch unit (80C51), 239
+Primitive flow table, 85
+Probe, 123, 125
+Process decomposition, 133
+Processors, 274
+Production rule expansion, 134
+Propagate (carry), 65
+Pull channel, 10, 115, 156
+Push channel, 10, 115, 156
+Quasi delay-insensitive (QDI), 25
+
+Quiescent region, 97
+Re-shuffling signal transitions, 102, 112
+Read-after-write data hazard, 40
+Receive, 123, 125
+Reduced flow table, 85
+Register
+dependency, 289
+locking, 40, 289
+Rendezvous, 125
+Reorder buffer, 291
+Reset function, 97
+Reset signal, 163
+Return-to-zero (RTZ), 9–10
+Reverse latency, 48
+Ring, 30, 296
+finite state machine, 35
+iterative computation, 35
+Ripple FIFO, 16
+Self-timed, 26
+Semantics-preserving transformations, 133
+Send, 123, 125
+Sequencer, 225
+Sequencing, 162
+Serial unary arithmetic, 257
+Set function, 97
+Set-Reset implementation, 96
+Shared resource, 77
+Sharing, 187, 223
+Sharp DDMP, 278
+Shift register
+with parallel load, 44
+Signal transition graph (STG), 86, 297
+Signal transition, 9
+Silicon compiler, 124, 223
+Simulation, 168
+Single input change, 84
+Single-place buffer, 161
+Single-rail, 10
+Smart cards, 222, 232
+Spacer, 12
+Speculative completion, 66
+Speed adaptation, 248
+Speed-independent (SI), 23–25, 83
+Stable gate/variable, 23
+Standard C-element, 106
+implementation, 96
+State graph, 85
+Static data-flow structure, 7, 29
+Static data-flow structure
+examples:
+greatest common divisor (GCD), 38
+IFIR filter bank, 39
+MIPS microprocessor, 39
+simple example, 33
+vector multiplier, 40
+Static spread (or occupancy), 49, 120
+Static type checking, 118
+Stuck-at fault model, 27
+Supply voltage variations, 231, 236
+Synchronizer flip-flop, 78
+Synchronous message passing, 123
+Syntax-directed compilation, 128, 155
+Tangram examples:
+2-place ripple FIFO, 127
+2-place shift register, 126
+GCD using guarded repetition, 128
+GCD using while and if statements, 127
+Tangram, 123, 155, 222–223, 278
+Technology mapping, 103, 224
+Test, 27, 245, 314
+IDDQ testing, 28
+halting of circuit, 28, 246
+isochronic forks, 28
+short and open faults, 28
+stuck-at faults, 27
+toggle test, 28
+untestable stuck-at faults, 28
+Throughput, 42, 49
+Thumb decoder, 282
+Time safe, 78
+TITAC-2, 276, 308
+Token, 7, 30, 86
+Transition, 86
+
+Transparent to handshaking, 7, 23, 33, 61
+Two-place buffer, 163
+Unique entry constraint, 97, 99
+Up/down decade counter, 185
+Valid codeword, 12
+Valid data, 12, 29
+Valid token, 30
+Value safe, 78
+Vector multiplier, 40
+Verilog, 124
+VHDL, 124, 155, 237
+Viterbi decoder, 249
+backtrace, 264
+branch metric, 260
+constraint length, 251
+global winner, 262
+History Unit, 264
+Path Metric Unit (PMU), 256
+soft codes, 253
+trellis diagram, 251
+VLSI programming, 128, 223
+VSTGL (Visual STG Lab), 103
+Wave, 16
+crest, 16
+trough, 16
+While statement, 38
+Write-back, 40
+
+
diff --git a/papers/references/temp/Tegmark2000.pdf b/papers/references/temp/Tegmark2000.pdf
new file mode 100644
index 00000000..08fe95ed
--- /dev/null
+++ b/papers/references/temp/Tegmark2000.pdf
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e3f207a1170ff3edb0765236fce4fbcaf4c73a9745664e3f35bdc38dafc20aab
+size 365231
diff --git a/papers/references/temp/Tegmark2000.txt b/papers/references/temp/Tegmark2000.txt
new file mode 100644
index 00000000..049d0e7f
--- /dev/null
+++ b/papers/references/temp/Tegmark2000.txt
@@ -0,0 +1,2024 @@
+arXiv:quant-ph/9907009v2  10 Nov 1999
+
+The Importance of Quantum Decoherence in Brain Processes
+
+Max Tegmark
+Institute for Advanced Study, Olden Lane, Princeton, NJ 08540; max@ias.edu
+Dept. of Physics, Univ. of Pennsylvania, Philadelphia, PA 19104
+(Submitted to Phys. Rev. E July 2 1999, accepted October 25)
+
+Based on a calculation of neural decoherence rates, we argue
+that the degrees of freedom of the human brain that
+relate to cognitive processes should be thought of as a classical
+rather than quantum system, i.e., that there is nothing funda-
+mentally wrong with the current classical approach to neural
+network simulations. We find that the decoherence timescales
+(∼ 10^{-13} to 10^{-20} seconds) are typically much shorter than
+the relevant dynamical timescales (∼ 10^{-3} to 10^{-1} seconds),
+both for regular neuron firing and for kink-like polarization
+excitations in microtubules. This conclusion disagrees with
+suggestions by Penrose and others that the brain acts as a
+quantum computer, and that quantum coherence is related
+to consciousness in a fundamental way.
+
+I. INTRODUCTION
+
+In most current mainstream biophysics research on
+cognitive processes, the brain is modeled as a neural net-
+work obeying classical physics. In contrast, Penrose [1,2],
+and others have argued that quantum mechanics may
+play an essential role, and that successful brain simula-
+tions can only be performed with a quantum computer.
+The main purpose of this paper is to address this issue
+with quantitative decoherence calculations.
+The field of artificial neural networks (for an introduc-
+tion, see, e.g., [4–6]) is currently booming, driven by a
+broad range of applications and improved computing re-
+sources. Although the popular neurological models come
+in various levels of abstraction, none involve effects of
+quantum coherence in any fundamental way. Encouraged
+by successes in modeling memory, learning, visual pro-
+cessing, etc. [7,8], many workers in the field have boldly
+conjectured that a sufficiently complex neural network
+could in principle perform all cognitive processes that we
+associate with consciousness.
+On the other hand, many authors have argued that
+consciousness can only be understood as a quantum ef-
+fect. For instance, Wigner [9] suggested that conscious-
+ness was linked to the quantum measurement problem1,
+and this idea has been greatly elaborated by Stapp [3].
+There have been numerous suggestions that consciousness
+is a macroquantum effect, involving superconductivity [12],
+superfluidity [13], electromagnetic fields [14],
+Bose condensation [15,16], superfluorescence [17] or some
+other mechanism [18,19]. Perhaps the most concrete one
+is that of Penrose [2], proposing that this takes place
+in microtubules, the ubiquitous hollow cylinders that
+among other things help cells maintain their shapes. It
+has been argued that microtubules can process information
+like a cellular automaton [20], and Penrose suggests
+that they operate as a quantum computer. This idea has
+been further elaborated employing string theory methods
+[21–27].
+
+1 Interestingly, Wigner changed his mind and gave up this
+idea [10] after he became aware of the first paper on
+decoherence in 1970 [11].
+
+The make-or-break issue for all these quantum mod-
+els is whether the relevant degrees of freedom of the
+brain can be sufficiently isolated to retain their quan-
+tum coherence, and opinions are divided. For instance,
+Stapp has argued that interaction with the environment
+is probably small enough to be unimportant for cer-
+tain neural processes [28], whereas Zeh [29], Zurek [30],
+Scott [31], Hawking [32] and Hepp [33] have conjectured
+that environment-induced decoherence will rapidly destroy
+macrosuperpositions in the brain. It is therefore timely
+to try to settle the issue with detailed calculations of the
+relevant decoherence rates. This is the purpose of the
+present work.
+The rest of this paper is organized as follows. In Sec-
+tion II, we briefly review the open system quantum me-
+chanics necessary for our calculations, and introduce a
+decomposition into three subsystems to place the problem
+in its proper context. In Section III, we evaluate
+decoherence rates both for neuron firing and for the
+microtubule processes proposed by Penrose et al., relegating
+some technical details to the Appendix. We conclude in
+Section IV by discussing the implications of our results,
+both for modeling cognitive brain processes and for in-
+corporating them into a quantum-mechanical treatment
+of the rest of the world.
+
+II. SYSTEMS AND SUBSYSTEMS
+
+In this section, we review those aspects of quantum
+mechanics for open systems that are needed for our cal-
+culations, and introduce a classification scheme and a
+subsystem decomposition to place the problem at hand
+in its appropriate context.
+
+A. Notation
+
+Let us first briefly review the quantum mechanics of
+subsystems. The state of an arbitrary quantum system
+is described by its density matrix ρ, which left in isolation
+will evolve in time according to the Schrödinger equation
+
+\[ \dot{\rho} = -i[H, \rho]/\hbar. \tag{1} \]
+
+It is often useful to view a system as composed of two
+subsystems, so that some of the degrees of freedom cor-
+respond to the 1st and the rest to the 2nd. The state of
+subsystem i is described by the reduced density matrix
+ρ_i obtained by tracing (marginalizing) over the degrees
+of freedom of the other: ρ_1 ≡ tr_2 ρ, ρ_2 ≡ tr_1 ρ. Let us
+decompose the Hamiltonian as
+
+H = H1 + H2 + Hint,
+(2)
+
+where the operator H1 affects only the 1st subsystem
+and H2 affects only the 2nd subsystem. The interaction
+Hamiltonian Hint is the remaining nonseparable part, de-
+fined as Hint ≡ H − H1 − H2, so such a decomposition
+is always possible, although it is generally only useful if
+Hint is in some sense small.
+If Hint = 0, i.e., if there is no interaction between
+the two subsystems, then it is easy to show that
+dρ_i/dt = −i[H_i, ρ_i]/ħ, i = 1, 2, that is, we can treat each subsystem
+as if the rest of the Universe did not exist, ignoring
+any correlations with the other subsystem that may have
+been present in the full non-separable density matrix ρ.
+It is of course this property that makes density matrices
+so useful in the first place, and that led von Neumann
+to invent them [34]: the full system is assumed to obey
+equation (1) simply because its interactions with the rest
+of the Universe are negligible.
+
+B. Fluctuation, dissipation, communication and
+decoherence
+
+In practice, the interaction Hint between subsystems
+is usually not zero. This has a number of qualitatively
+different effects:
+
+1. Fluctuation
+
+2. Dissipation
+
+3. Communication
+
+4. Decoherence
+
+The first two involve transfer of energy between the sub-
+systems, whereas the last two involve exchange of infor-
+mation. The first three occur in classical physics as well
+- only the last one is a purely quantum-mechanical phe-
+nomenon.
+For example, consider a tiny colloid grain (subsystem
+1) in a jar of water (subsystem 2). Collisions with water
+
+molecules will cause fluctuations in the center-of-mass
+position of the colloid (Brownian motion). If its initial velocity
+is high, dissipation (friction) will slow it down to
+a mean speed corresponding to thermal equilibrium with
+the water. The dissipation timescale τdiss, defined as the
+time it would take to lose half of the initial excess energy,
+will in this case be of order τcoll × (M/m), where τcoll is
+the mean-free time between collisions, M is the colloid mass
+and m is the mass of a water molecule. We will define
+communication as exchange of information. The infor-
+mation that the two subsystems have about each other,
+measured in bits, is
+
+I12 ≡ S1 + S2 − S,
+(3)
+
+where Si ≡ −tr iρi log ρi is the entropy of the ith subsys-
+tem, S ≡ −tr ρ log ρ is the entropy of the total system,
+and the logarithms are base 2. If this mutual informa-
+tion is zero, then the states of the two systems are un-
+correlated and independent, with the density matrix of
+the separable form ρ = ρ1 ⊗ ρ2. If the subsystems start
+out independent, any interaction will at least initially
+increase the subsystem entropies Si, thereby increasing
+the mutual information, since the entropy S of the total
+system always remains constant.
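As a small illustration of equation (3), the sketch below evaluates the mutual information for a two-qubit pure state cos(θ)|00⟩ + sin(θ)|11⟩. This state and the helper names are assumptions chosen for illustration and do not appear in the paper. For a pure state the total entropy S vanishes, so I12 = S1 + S2.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of the two-outcome distribution (p, 1 - p)."""
    return -sum(x * math.log2(x) for x in (p, 1.0 - p) if x > 0.0)


def mutual_information(theta: float) -> float:
    """I12 = S1 + S2 - S for the pure state cos(theta)|00> + sin(theta)|11>.

    Illustrative assumption: each reduced density matrix is diagonal with
    eigenvalues cos^2(theta) and sin^2(theta), and S = 0 for a pure state.
    """
    p = math.cos(theta) ** 2
    s1 = s2 = binary_entropy(p)   # subsystem entropies S1 and S2
    s_total = 0.0                 # entropy of the total (pure) state
    return s1 + s2 - s_total


if __name__ == "__main__":
    for theta in (0.0, math.pi / 8, math.pi / 4):
        print(f"theta = {theta:.3f}: I12 = {mutual_information(theta):.3f} bits")
```

A product state (θ = 0) gives zero mutual information, while the maximally entangled case (θ = π/4) gives two bits, the largest value possible for two qubits.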
+This apparent entropy increase of subsystems, which
+is related to the arrow of time and the 2nd law of thermodynamics
+[35], occurs also in classical physics. However,
+quantum mechanics produces a qualitatively new
+effect as well, known as decoherence [11,36,37], sup-
+pressing off-diagonal elements in the reduced density ma-
+trices ρi. This effect destroys the ability to observe long-
+range quantum superpositions within the subsystems,
+and is now rather well-understood and uncontroversial
+[30,38–42] – the interested reader is referred to [43] and
+a recent book on decoherence [44] for details.
+For instance, if our colloid was initially in a superposition of
+two locations separated by a centimeter, this macrosuperposition
+would for all practical purposes be destroyed
+by the first collision with a water molecule, i.e., on a
+timescale τdec of order τcoll, with the quantum superposition
+surviving only on scales below the de Broglie
+wavelength of the water molecules [45,46] (footnote 2). This means
+that τdiss/τdec ∼ M/m in our example, i.e., that decoherence
+is much faster than dissipation for macroscopic objects,
+and this qualitative result has been shown to hold
+quite generally as well (see [43] and references therein).
+Loosely speaking, this is because each microscopic particle
+that scatters off of the subsystem carries away only
+a tiny fraction m/M of the total momentum, but essentially
+all of the necessary information.
+
+2 Decoherence picks out a preferred basis in the quantum-mechanical
+Hilbert space, termed the “pointer basis” by
+Zurek [36], in which superpositions are rapidly destroyed and
+classical behavior is approached. This normally includes the
+position basis, which is why we never experience superpositions
+of objects in macroscopically different positions. Decoherence
+is quite generic. Although it has been claimed that
+this preferred basis consists of the maximal set of commuting
+observables that also commute with Hint (the “microstable
+basis” of Omnes [43]), this is in fact merely a sufficient condition,
+not a necessary one. If [Hint, x] = 0 for some observable
+x but [Hint, p] ≠ 0 for its conjugate p, then the interaction
+will indeed cause decoherence for x as advertised. But this
+will happen even if [Hint, x] ≠ 0; all that matters is that
+[Hint, p] ≠ 0, i.e., that the interaction Hamiltonian contains
+(“measures”) x.
+
+[Figure 1: regions labeled QUANTUM SYSTEM, CLASSICAL SYSTEM,
+NOT INDEPENDENT SYSTEM and IMPOSSIBLE; axes are
+"Dissipation time/Decoherence time" and
+"Dynamical time/Decoherence time", each with ticks at 0.1, 1, 10 and 100.]
+
+FIG. 1. The qualitative behavior of a subsystem depends on
+the timescales for dynamics, dissipation and decoherence. This
+classification is by necessity quite crude, so the boundaries should
+not be thought of as sharp.
+
+C. Classification of systems
+
+Let us define the dynamical timescale τdyn of a subsystem
+as that which is characteristic of its internal dynamics.
+For a planetary system or an atom, τdyn would be
+the orbital period.
+The qualitative behavior of a system depends on the
+ratio of these timescales, as illustrated in Figure 1. If
+τdyn ≪ τdec, we are dealing with a true quantum system,
+since its superpositions can persist long enough to
+be dynamically important. If τdyn ≫ τdiss, it is hardly
+meaningful to view it as an independent system at all,
+since its internal forces are so weak that they are dwarfed
+by the effects of the surroundings. In the intermediate
+case where τdec ≪ τdyn ≲ τdiss, we have a familiar classical
+system.
+The relation between τdec and τdiss depends only on
+the form of Hint, whereas the question of whether τdyn
+falls between these values depends on the normalization
+of Hint in equation (2). Since τdec ∼ τdiss for microscopic
+(atom-sized) systems and τdec ≪ τdiss for macroscopic
+ones, Figure 1 shows that whereas macroscopic systems
+can behave quantum-mechanically, microscopic ones can
+never behave classically.
+
+D. Three systems: subject, object and environment
+
+Most discussions of quantum statistical mechanics split
+the Universe into two subsystems [47]: the object under
+consideration and everything else (referred to as the en-
+vironment). Since our purpose is to model the observer,
+we need to include a third subsystem as well, the subject.
+As illustrated in Figure 2, we therefore decompose the
+total system into three subsystems:
+
+• The subject consists of the degrees of freedom as-
+sociated with the subjective perceptions of the ob-
+server. This does not include any other degrees of
+freedom associated with the brain or other parts of
+the body.
+
+• The object consists of the degrees of freedom that
+the observer is interested in studying, e.g., the
+pointer position on a measurement apparatus.
+
+• The environment consists of everything else, i.e.,
+all the degrees of freedom that the observer is not
+paying attention to. By definition, these are the
+degrees of freedom that we always perform a partial
+trace over.
+
+[Figure 2: diagram of three subsystems, SUBJECT, OBJECT and
+ENVIRONMENT, with subsystem Hamiltonians Hs, Ho, He and interaction
+Hamiltonians Hso, Hoe, Hse. Labels in the figure: object decoherence;
+subject decoherence, finalizing decisions; measurement, observation,
+"wavefunction collapse", willful action; (Always traced over);
+(Always zero entropy).]
+
+FIG. 2. An observer can always decompose the world into three
+subsystems: the degrees of freedom corresponding to her subjective
+perceptions (the subject), the degrees of freedom being studied (the
+object), and everything else (the environment). As indicated, the
+subsystem Hamiltonians Hs, Ho, He and the interaction Hamilto-
+nians Hso, Hoe, Hse can cause qualitatively very different effects,
+which is why it is often useful to study them separately. This paper
+focuses on the interaction Hse.
+
+Note that the first two definitions are very restrictive.
+Whereas the subject would include the entire body of
+the observer in the common way of speaking, only very
+few degrees of freedom qualify as our subject or object.
+For instance, if a physicist is observing a Stern-Gerlach
+apparatus, the vast majority of the ∼ 10^{28} degrees of
+freedom in the observer and apparatus are counted
+as environment, not as subject or object.
+The term “perception” is used in a broad sense in item
+1, including thoughts, emotions and any other attributes
+of the subjectively perceived state of the observer.
+The practical usefulness of this decomposition lies in the fact
+that one can often neglect everything except the object
+and its internal dynamics (given by Ho) to first order,
+using simple prescriptions to correct for the interactions
+with the subject and the environment. The effects of
+both Hso and Hoe have been extensively studied in the
+literature. Hso involves quantum measurement, and gives
+rise to the usual interpretation of the diagonal elements of
+the object density matrix as probabilities. Hoe produces
+decoherence, selecting a preferred basis and making the
+object act classically if the conditions in Figure 1 are met.
+In contrast, Hse, which causes decoherence directly in
+the subject system, has received relatively little attention.
+It is the focus of the present paper, and the next
+section is devoted to quantitative calculations of decoherence
+in brain processes, aimed at determining whether
+the subject system should be classified as classical or
+quantum in the sense of Figure 1. We will return to
+subsystem interactions in Section IV.
+
+III. DECOHERENCE RATES
+
+In this section, we will make quantitative estimates
+of decoherence rates for neurological processes. We first
+analyze the process of neuron firing, widely assumed to be
+central to cognitive processes. We also analyze electrical
+excitations in microtubules, which Penrose and others
+have suggested may be relevant to conscious thought.
+
+A. Neuron firing
+
+Neurons (see Figure 3) are one of the key building
+blocks of the brain’s information processing system. It is
+widely believed that the complex network of ∼ 10^{11} neurons
+with their nonlinear synaptic couplings is in some
+way linked to our subjective perceptions, i.e., to the sub-
+ject degrees of freedom. If this picture is correct, then if
+Hs or Hso puts the subject into a superposition of two
+distinct mental states, some neurons will be in a super-
+position of firing and not firing. How fast does such a
+superposition of a firing and non-firing neuron decohere?
+Let us consider this process in more detail. For introductory
+reviews of neuron dynamics, the reader is referred
+to, e.g., [48–50]. Like virtually all animal cells,
+neurons have ATP driven pumps in their membranes
+which push sodium ions out of the cell into the surround-
+ing fluids and potassium ions the other way. The former
+process is slightly more efficient, so the neuron contains a
+slight excess of negative charge in its “resting” state, cor-
+responding to a potential difference U0 ≈ −0.07 V across
+the axon membrane (“axolemma”). There is an inher-
+ent instability in the system, however. If the potential
+becomes substantially less negative, then voltage-gated
+sodium channels in the axon membrane open up, allow-
+ing Na+ ions to come gushing in. This makes the poten-
+tial still less negative, causes still more opening, etc. This
+chain reaction, “firing”, propagates down the axon at a
+speed of up to 100 m/s, changing the potential difference
+to a value U1 that is typically of order +0.03 V [49].
+The axon quickly recovers. After less than ∼ 1 ms, the
+sodium channels close regardless of the voltage, and large
+potassium channels (also voltage gated, but with a time
+delay) open up allowing K+ ions to flow out and restore
+the resting potential U0. The ATP driven pumps quickly
+restore the Na+ and K+ concentrations to their initial
+values, making the neuron ready to fire again if triggered.
+Fast neurons can fire over 10^3 times per second.
+
+[Figure 3: schematic with labels: dendrites, cell body, axon, myelin
+insulation, fraction f not insulated, length L, diameter d, pulse
+direction, axon membrane of thickness h, voltage-sensitive gate, and
+Na+ ions marked "here if firing" and "here if not firing".]
+
+FIG. 3. Schematic illustration of a neuron (left), a section of
+the myelinated axon (center) and a piece of its axon membrane
+(right). The axon is typically insulated (myelinated) with small
+bare patches every 0.5 mm or so (so-called Nodes of Ranvier) where
+the voltage-sensitive sodium and potassium gates are concentrated
+[51,52]. If the neuron is in a superposition of firing and not firing,
+then N ∼ 10^6 Na+ ions are in a superposition of being inside and
+outside the cell (right).
+
+Consider a small patch of the membrane, assumed to
+be roughly flat with uniform thickness h as in Figure 3.
+If there is an excess surface density ±σ of charge near
+the inside/outside membrane surfaces, giving a voltage
+differential U across the membrane, then application of
+Gauss’ law tells us that σ = ε_0 E, where the electric field
+strength in the membrane is E = U/h and ε_0 is the vacuum
+permittivity. Consider an axon of length L and
+diameter d, with a fraction f of its surface area bare (not
+insulated with myelin). The total active surface area is
+thus A = πdLf, so the total number of Na+ ions that
+migrate in during firing is
+
+\[ N = \frac{A\sigma}{q} = \frac{\pi d L f \epsilon_0 (U_1 - U_0)}{q h}, \tag{4} \]
+
+where q is the ionic charge (q = q_e, the absolute value
+of the electron charge). Taking values typical for central
+nervous system axons [52,53], h = 8 nm, d = 10 µm,
+L = 10 cm, f = 10^{-3}, U0 = −0.07 V and U1 = +0.03 V
+gives N ≈ 10^6 ions, and reasonable variations in our
+parameters can change this number by a few orders of
+magnitude.
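The estimate in equation (4) can be checked numerically. The sketch below plugs the quoted axon parameters into the formula. The function name and the rounded SI constants are assumptions made for this illustration.

```python
import math

EPSILON_0 = 8.8541878128e-12   # vacuum permittivity in F/m
Q_E = 1.602176634e-19          # elementary charge in C


def ions_per_firing(d, length, f, h, u0, u1, q=Q_E):
    """Number of Na+ ions crossing the membrane during firing, equation (4).

    d: axon diameter (m), length: axon length (m), f: bare surface fraction,
    h: membrane thickness (m), u0/u1: resting and firing potentials (V).
    """
    area = math.pi * d * length * f          # active surface area A
    sigma = EPSILON_0 * (u1 - u0) / h        # change in surface charge density
    return area * sigma / q


# Parameter values quoted in the text for central nervous system axons.
N = ions_per_firing(d=10e-6, length=0.10, f=1e-3, h=8e-9, u0=-0.07, u1=0.03)

if __name__ == "__main__":
    print(f"N ~ {N:.1e} ions")
```

With these inputs the result is about two million ions, consistent with the order-of-magnitude figure N ≈ 10^6 used in the rest of the section.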
+
+B. Neuron decoherence mechanisms
+
+Above we saw that a quantum superposition of the
+neuron states “resting” and “firing” involves of order a
+million ions being in a spatial superposition of inside and
+outside the axon membrane, separated by a distance of
+order h ∼ 10 nm. In this subsection, we will compute the
+timescale on which decoherence destroys such a superpo-
+sition.
+
+In this analysis, the object is the neuron, and the su-
+perposition will be destroyed by any interaction with
+other (environment) degrees of freedom that is sensitive
+to where the ions are located. We will consider the fol-
+lowing three sources of decoherence for the ions:
+
+1. Collisions with other ions
+
+2. Collisions with water molecules
+
+3. Coulomb interactions with more distant ions
+
+There are many more decoherence mechanisms [44–46].
+Exotic candidates such as quantum gravity [54] and
+modified quantum mechanics [55,56] are generally much
+weaker [46]. A number of decoherence effects may be even
+stronger than those listed, e.g., interactions as the ions
+penetrate the membrane — the listed effects will turn out
+to be so strong that we can make our argument by sim-
+ply using them as lower limits on the actual decoherence
+rate.
+Let ρ denote the density matrix for the position r of a
+single Na+ ion. As reviewed in the Appendix, all three
+of the listed processes cause ρ to evolve as
+
+ρ(x, x′, t0 + t) = ρ(x, x′, t0)f(x, x′, t)
+(5)
+
+for some function f that is independent of the ion state
+ρ and depends only on the interaction Hamiltonian Hint.
+This assumes that we can neglect the motion of the ion
+itself on the decoherence timescale — we will see that
+this condition is met with a broad margin.
+
+1. Ion–ion collisions
+
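+For scattering of environment particles (processes 1
+and 2) that have a typical de Broglie wavelength λ, we
+have [46]
+
+\[ f(x,x',t) = \exp\!\left[-\Lambda t\left(1 - e^{-|x'-x|^2/2\lambda^2}\right)\right] \approx \begin{cases} e^{-|x'-x|^2\Lambda t/2\lambda^2} & \text{for } |x'-x| \ll \lambda, \\ e^{-\Lambda t} & \text{for } |x'-x| \gg \lambda. \end{cases} \tag{6} \]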
+
+Here Λ is the scattering rate, given by Λ ≡ n⟨σv⟩, where
+n is the density of scatterers, σ is the scattering cross
+section and v is the velocity. The product σv is averaged
+over the velocity distribution, which we take to
+be a thermal (Boltzmann) distribution corresponding
+to T = 37 °C ≈ 310 K. The gist of equation (6) is
+that a single collision decoheres the ion down to the
+de Broglie wavelength of the scattering particle. The
+information I12 communicated during the scattering is
+I12 ∼ log2(∆x/λ) bits, where ∆x is the initial spread in
+the position of our particle.
+Since the typical de Broglie wavelength of a Na+ ion
+(mass m ≈ 23 m_p) or H2O molecule (m ≈ 18 m_p) is
+
+\[ \lambda = \frac{2\pi\hbar}{\sqrt{3 m k T}} \approx 0.03\ \text{nm} \tag{7} \]
+
+at 310 K, far smaller than the membrane thickness
+h ∼ 10 nm over which we need to maintain quantum
+coherence, we are clearly in the |x′ − x| ≫ λ limit of
+equation (6). This means that the spatial superposition
+of an ion decays exponentially on a timescale Λ^{-1}, of order its mean
+free time between collisions. Since the superposition of
+the neuron states “resting” and “firing” involves N such
+superposed ions, it thus gets destroyed on a timescale
+τ ≡ (NΛ)^{-1}.
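As a quick check of equation (7), the sketch below evaluates the thermal de Broglie wavelength for a sodium ion and a water molecule at 310 K and compares it with the membrane thickness. The function name and the rounded physical constants are assumptions for this illustration.

```python
import math

H_PLANCK = 6.62607015e-34    # Planck constant h = 2*pi*hbar, in J s
K_B = 1.380649e-23           # Boltzmann constant in J/K
M_P = 1.67262192369e-27      # proton mass in kg


def thermal_de_broglie(mass_kg: float, temperature_k: float) -> float:
    """Thermal de Broglie wavelength lambda = 2*pi*hbar / sqrt(3 m k T), in m."""
    return H_PLANCK / math.sqrt(3.0 * mass_kg * K_B * temperature_k)


lam_na = thermal_de_broglie(23 * M_P, 310.0)    # Na+ ion
lam_h2o = thermal_de_broglie(18 * M_P, 310.0)   # H2O molecule

if __name__ == "__main__":
    print(f"Na+ : {lam_na * 1e9:.3f} nm")
    print(f"H2O : {lam_h2o * 1e9:.3f} nm")
    print(f"h / lambda for h = 8 nm: {8e-9 / lam_na:.0f}")
```

Both wavelengths come out near 0.03 nm, a few hundred times smaller than the membrane thickness, which is why the |x′ − x| ≫ λ limit of equation (6) applies.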
+Let us now evaluate τ. Coulomb scattering between
+two ions of unit charge gives substantial deflection angles
+(θ ∼ 1) with a cross section of order (footnote 3)
+
+\[ \sigma \sim \left(\frac{g q^2}{m v^2}\right)^2, \tag{9} \]
+
+where v is the relative velocity and g ≡ 1/4πε_0 is the
+Coulomb constant. In thermal equilibrium, the kinetic
+energy mv^2/2 is of order kT, so v ∼ \sqrt{kT/m}. For the
+ion density, let us write n = η n_H2O, where the density
+of water molecules n_H2O is about (1 g/cm^3)/(18 m_p) ∼
+10^{23}/cm^3 and η is the relative concentration of ions (positive
+and negative combined). Typical ion concentrations
+during the resting state are [Na+] = 9.2 (120) mmol/l and
+[K+] = 140 (2.5) mmol/l inside (outside) the axon membrane
+[48], corresponding to total Na+ + K+ concentrations
+of η ≈ 0.00027 (0.00022) inside (outside). To be
+conservative, we will simply use η ≈ 0.0002 throughout.
+Ion–ion collisions therefore destroy the superposition on
+a timescale
+
+\[ \tau \sim \frac{1}{N n \sigma v} \sim \frac{\sqrt{m (kT)^3}}{N g^2 q_e^4 n} \sim 10^{-20}\ \text{s}. \tag{10} \]
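The order of magnitude in equation (10) can be reproduced with a short calculation. The sketch below is an illustration under stated assumptions: it takes N = 10^6, η = 0.0002 and the rounded water density 10^23 per cm^3 from the text, uses the Na+ mass for m, and sets all order-unity prefactors to one. The function name is hypothetical.

```python
import math

K_B = 1.380649e-23           # Boltzmann constant in J/K
M_P = 1.67262192369e-27      # proton mass in kg
Q_E = 1.602176634e-19        # elementary charge in C
G_COULOMB = 8.9875517923e9   # Coulomb constant g = 1/(4 pi eps0), N m^2/C^2


def ion_ion_decoherence_time(n_ions, eta, n_water, mass, temperature):
    """Order-of-magnitude tau ~ 1 / (N n sigma v) for ion-ion scattering.

    Uses sigma ~ (g q^2 / (m v^2))^2 with m v^2 ~ k T and v ~ sqrt(k T / m),
    which reproduces the closed form of equation (10).
    """
    kt = K_B * temperature
    sigma = (G_COULOMB * Q_E**2 / kt) ** 2     # cross section in m^2
    v = math.sqrt(kt / mass)                   # thermal velocity in m/s
    n = eta * n_water                          # ion number density in 1/m^3
    return 1.0 / (n_ions * n * sigma * v)


# Assumed inputs: 1e23 per cm^3 equals 1e29 per m^3.
tau_ion = ion_ion_decoherence_time(
    n_ions=1e6, eta=2e-4, n_water=1e29, mass=23 * M_P, temperature=310.0
)

if __name__ == "__main__":
    print(f"tau (ion-ion) ~ {tau_ion:.1e} s")
```

With these inputs the estimate lands at a few times 10^-20 s, in line with the quoted 10^-20 s. Changing N or n by an order of magnitude shifts the result by the same factor.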
+
+2. Ion–water collisions
+
+Since H2O molecules are electrically neutral, the cross-
+section is dominated by their electric dipole moment
+p ≈ 1.85 Debye ≈ (0.0385 nm) × qe. We can model this
+
+3 If the first ion starts at rest at r1 = (0, 0, 0) and the sec-
+ond is incident with r2 = (vt, b, 0), then a very weak scatter-
+ing with deflection angle θ ≪ 1 will leave these trajectories
+roughly unchanged, the radial force F = gq2/|r1 −r2|2 merely
+causing a net transverse acceleration [57]
+
+∆vy =
+� ∞
+
+−∞
+
+�y · F
+
+m dt =
+� ∞
+
+−∞
+
+gq2b dt
+
+[b2 + (vt)2]3/2 = 2gq2
+
+mvb .
+(8)
+
+The approximation breaks down as the deflection angle θ ≈
+∆vy/v approaches unity. This occurs for b ∼ gq2/mv2, giving
+σ = πb2 as in equation (9).
+
+dipole as two opposing unit charges separated by a dis-
+tance y ≡ p/qe ≪ b, so summing the two corresponding
+contributions from equation (8) gives a deflection angle
+
+θ ≈ 2gqep
+
+mv2b2 .
+(11)
+
+This gives a cross section
+
+\[ \sigma = \pi b^2 \sim \frac{g q_e p}{m v^2} \tag{12} \]
+
+for scattering with large (θ ∼ 1) deflections. Although σ
+is smaller than for the case of ion–ion collisions, n is larger
+because the concentration factor η drops out, giving a
+final result
+
+\[ \tau \sim \frac{1}{N n \sigma v} \sim \frac{\sqrt{m kT}}{N g q_e p\, n} \sim 10^{-20}\,{\rm s}. \tag{13} \]
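+
+The steps behind equations (11)–(13) can be sketched as follows. The numerical inputs $N \sim 10^6$, $m \approx 23\,m_p$ and $T = 310$ K are illustrative assumptions, and $n_{\rm H_2O} \sim 10^{29}\,{\rm m^{-3}}$ is the rounded water density estimated in the ion–ion case.
+
+```latex
+% Single-charge deflection from equation (8): \theta_1(b) = 2 g q_e^2 / (m v^2 b).
+% Two opposite charges at impact parameters b and b + y (y << b) nearly cancel.
+% Assumed inputs for the last line (illustrative): N ~ 10^6, m ~ 23 m_p, T = 310 K.
+\begin{align*}
+\theta &\approx \theta_1(b) - \theta_1(b+y)
+\approx \frac{2 g q_e^2}{m v^2}\,\frac{y}{b^2}
+= \frac{2 g q_e p}{m v^2 b^2}, \qquad p \equiv q_e y,\\
+\theta &\sim 1 \;\Rightarrow\; b^2 \sim \frac{g q_e p}{m v^2}, \qquad
+\sigma v \sim \frac{g q_e p}{m v} = \frac{g q_e p}{\sqrt{m kT}} \quad (v \sim \sqrt{kT/m}),\\
+\tau &\sim \frac{\sqrt{m kT}}{N g q_e p\, n}
+\approx \frac{1.3\times10^{-23}}{10^{6}\times 8.9\times10^{-39}\times10^{29}}\,{\rm s}
+\approx 1\times10^{-20}\,{\rm s}.
+\end{align*}
+```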
+
+3. Interactions with distant ions
+
+As shown in the Appendix, long-range interaction with
+a distant (environment) particle gives
+
+\[ f(\mathbf{r}, \mathbf{r}', t) = \hat{p}_2\!\left[M(\mathbf{r}' - \mathbf{r})\,t/\hbar\right], \tag{14} \]
+
+up to a phase factor that is irrelevant for decoherence.
+Here $\hat{p}_2$ is the Fourier transform of $p_2(\mathbf{r}) \equiv \rho_2(\mathbf{r}, \mathbf{r})$, the
+probability distribution for the location of the environment
+particle. M is the 3 × 3 Hessian matrix of second
+derivatives of the interaction potential of the two particles
+at their mean separation. A slightly less general formula
+was derived in the seminal paper [45]. For roughly
+thermal states, $\rho_2$ (and thus $p_2$) is likely to be well approximated
+by a Gaussian [58,59]. This gives
+
+\[ f(\mathbf{r}, \mathbf{r}', t) = \exp\!\left[-\tfrac{1}{2}(\mathbf{r}'-\mathbf{r})^t M^t \Sigma M (\mathbf{r}'-\mathbf{r})\,t^2/\hbar^2\right], \tag{15} \]
+
+where $\Sigma = \langle \mathbf{r}_2 \mathbf{r}_2^t\rangle - \langle \mathbf{r}_2\rangle\langle \mathbf{r}_2^t\rangle$ is the covariance matrix of
+the location of the environment particle. Coherence
+is destroyed when the exponent becomes of order unity,
+i.e., on a timescale
+
+\[ \tau \equiv \left[(\mathbf{r}'-\mathbf{r})^t M^t \Sigma M (\mathbf{r}'-\mathbf{r})\right]^{-1/2}\hbar. \tag{16} \]
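+
+The Gaussian step leading from equation (14) to equations (15) and (16) can be sketched as below. The Fourier sign convention is an assumption of this sketch; it affects only the phase, not the magnitude.
+
+```latex
+% Assumes p_2 is a Gaussian with covariance \Sigma; a nonzero mean contributes only a phase.
+\begin{align*}
+\hat{p}_2(\mathbf{k}) &\equiv \int p_2(\mathbf{x})\,e^{-i\mathbf{k}\cdot\mathbf{x}}\,d^3x,
+\qquad |\hat{p}_2(\mathbf{k})| = \exp\!\left(-\tfrac12\,\mathbf{k}^t\Sigma\,\mathbf{k}\right),\\
+\mathbf{k} &= M(\mathbf{r}'-\mathbf{r})\,t/\hbar
+\;\Rightarrow\;
+|f| = \exp\!\left[-\tfrac12 (\mathbf{r}'-\mathbf{r})^t M^t\Sigma M(\mathbf{r}'-\mathbf{r})\,t^2/\hbar^2\right]
+= e^{-t^2/2\tau^2},
+\end{align*}
+```
+
+so the off-diagonal elements are suppressed as a Gaussian in time with the width τ of equation (16).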
+
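+Assuming a Coulomb potential $V = g q^2/|\mathbf{r}_2 - \mathbf{r}_1|$ gives
+$M = (3\hat{a}\hat{a}^t - I)\,g q^2/a^3$, where $\mathbf{a} \equiv \mathbf{r}_2 - \mathbf{r}_1 = a\hat{a}$, $|\hat{a}| = 1$.
+For thermal states, we have the isotropic case $\Sigma = (\Delta x)^2 I$,
+so equation (16) reduces to
+
+\[ \tau = \frac{\hbar a^3}{g q^2 |\mathbf{r}'-\mathbf{r}|\,\Delta x}\left(1 + 3\cos^2\theta\right)^{-1/2}, \tag{17} \]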
+
+where $\cos\theta \equiv \hat{a}\cdot(\mathbf{r}'-\mathbf{r})/|\mathbf{r}'-\mathbf{r}|$. To be conservative,
+we take ∆x to be as small as the uncertainty principle
+allows. With the thermal constraint $(\Delta p)^2/m \lesssim kT$ on
+the momentum uncertainty, this gives
+
+\[ \Delta x = \frac{\hbar}{2\Delta p} \sim \frac{\hbar}{\sqrt{m kT}}. \tag{18} \]
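+
+The reduction of equation (16) to equation (17), and the size of ∆x in equation (18), can be checked with the short sketch below. The shorthand $\kappa$ and $\mathbf{d}$ is introduced here for convenience, and the values $m \approx 23\,m_p$ and $T = 310$ K are illustrative assumptions.
+
+```latex
+% Shorthand (introduced here): \kappa \equiv g q^2/a^3, \mathbf{d} \equiv \mathbf{r}' - \mathbf{r}; uses \hat{a}^t\hat{a} = 1.
+% Assumed inputs for the last line (illustrative): m ~ 23 m_p, T = 310 K.
+\begin{align*}
+M^t M &= \kappa^2\,(3\hat{a}\hat{a}^t - I)^2
+= \kappa^2\,(9\hat{a}\hat{a}^t - 6\hat{a}\hat{a}^t + I)
+= \kappa^2\,(3\hat{a}\hat{a}^t + I),\\
+\mathbf{d}^t M^t \Sigma M \mathbf{d} &= (\Delta x)^2 \kappa^2\,|\mathbf{d}|^2\,(1 + 3\cos^2\theta)
+\;\Rightarrow\;
+\tau = \frac{\hbar a^3}{g q^2 |\mathbf{d}|\,\Delta x}\,(1+3\cos^2\theta)^{-1/2},\\
+\Delta x &\sim \frac{\hbar}{\sqrt{m kT}} \approx \frac{1.05\times10^{-34}}{1.3\times10^{-23}}\,{\rm m} \approx 8\times10^{-12}\,{\rm m}.
+\end{align*}
+```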
+
+Substituting this into equation (17) and dividing by the
+number of ions N, we obtain the decoherence timescale
+
+\[ \tau \sim \frac{a^3\sqrt{m kT}}{N g q^2 |\mathbf{r}'-\mathbf{r}|} \tag{19} \]
+
+caused by a single environment ion a distance a away.
+Each such ion will produce its own suppression factor f,
+so we need to sum the exponent in equation (15) over all
+ions. Since the tidal force $M \propto a^{-3}$ causes the exponent
+to drop as $a^{-6}$, this sum will generally be dominated by
+the very closest ion, which will typically be a distance
+$a \sim n^{-1/3}$ away. We are interested in decoherence for
+separations $|\mathbf{r}'-\mathbf{r}| = h$, the membrane thickness, which
+gives
+
+\[ \tau \sim \frac{\sqrt{m kT}}{N g q_e^2\, n h} \sim 10^{-19}\,{\rm s}. \tag{20} \]
+
+The relation between these different estimates is dis-
+cussed in more detail in the Appendix.
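+
+For orientation, the nearest-ion estimate can be evaluated numerically as in the sketch below. The inputs $N \sim 10^6$, $h \sim 8$ nm, $m \approx 23\,m_p$ and $T = 310$ K are assumed here for illustration and are not stated in this section; $n \approx 2\times10^{25}\,{\rm m^{-3}}$ follows from η ≈ 0.0002.
+
+```latex
+% Assumed inputs (illustrative): N ~ 10^6, h ~ 8 nm, m ~ 23 m_p, T = 310 K
+\begin{align*}
+a^3 &\sim 1/n, \quad |\mathbf{r}'-\mathbf{r}| = h
+\;\Rightarrow\;
+\tau \sim \frac{a^3\sqrt{m kT}}{N g q_e^2\, h} = \frac{\sqrt{m kT}}{N g q_e^2\, n h},\\
+\tau &\approx \frac{1.3\times10^{-23}}{10^{6}\times 2.3\times10^{-28}\times 2\times10^{25}\times 8\times10^{-9}}\,{\rm s}
+\approx 3.5\times10^{-19}\,{\rm s}.
+\end{align*}
+```
+
+With these inputs the estimate lands at a few times $10^{-19}$ s, consistent with the quoted order of magnitude.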
+
+C. Microtubules
+
+Microtubules are a major component of the cytoskele-
+ton, the “scaffolding” that helps cells maintain their
+shapes.
+They are hollow cylinders of diameter D =
+24 nm made up of 13 filaments that are strung together
+out of proteins known as tubulin dimers. These dimers
+can make transitions between two states known as α
+and β, corresponding to different electric dipole moments
+along the axis of the tube. It has been argued that micro-
+tubules may have additional functions as well, serving as
+a means of energy and information transfer [20]. A model
+has been presented whereby the dipole-dipole interac-
+tions between nearby dimers can lead to long-range po-
+larization and kink-like excitations that may travel down
+the microtubules at speeds exceeding 1 m/s [60].
+Penrose has gone further and suggested that the dy-
+namics of such excitations can make a microtubule act
+like a quantum computer, and that microtubules are the
+site of human consciousness [2]. This idea has been further
+elaborated [21–24] employing methods from string
+theory, with the conclusion that quantum superpositions
+of coherent excitations can persist for as long as a second
+before being destroyed by decoherence. See also [61,62].
+This was hailed as a success for the model, the interpretation
+being that the quantum gravity effect on microtubules
+was identified with the human thought process on
+this same timescale.
+This decoherence timescale τ ∼ 1 s was computed assuming
+that quantum gravity is the main decoherence source.
+Since this quantum gravity model is somewhat contro-
+versial [32] and its effect has been found to be more than
+
+20 orders of magnitude weaker than other decoherence
+sources in some cases [46], it seems prudent to evalu-
+ate other decoherence sources for the microtubule case
+as well, to see whether they are in fact dominant. We
+will now do so.
+Using coordinates where the x-axis is along the tube
+axis, the above-mentioned models all focus on the time-
+evolution of p(x), the average x-component of the electric
+dipole moment of the tubulin dimers at each x. In terms
+of this polarization function p(x), the net charge per unit
+length of tube is −p′(x). The propagating kink-like exci-
+tations [60] are of the form
+
+\[ p(x) = \begin{cases} +p_0 & \text{for } x \ll x_0,\\ -p_0 & \text{for } x \gg x_0, \end{cases} \tag{21} \]
+
+where the kink location $x_0$ propagates with constant
+speed and has a width of order a few tubulin dimers.
+The polarization strength $p_0$ is such that the total charge
+around the kink is $Q = -\int p'(x)\,dx = 2p_0 \sim 940\,q_e$, due
+to the presence of 18 Ca2+ ions on each of the 13 filaments
+contributing to $p_0$ [60].
+Suppose that such a kink is in two different places
+in superposition, separated by some distance |r′ − r|.
+How rapidly will the superposition be destroyed by de-
+coherence?
+To be conservative, we will ignore colli-
+sions between polarized tubulin dimers and nearby water
+molecules, since it has been argued that these may be in
+some sense ordered and part of the quantum system [24]
+– although this argument is difficult to maintain for the
+water outside the microtubule, which permeates the entire
+cell volume. Let us instead apply equation (19), with
+$N = Q/q_e \sim 10^3$. The distance to the nearest ion will
+generally be less than $a = D + n^{-1/3} \sim 26$ nm, where the
+microtubule diameter D = 24 nm dominates over the inter-ion
+separation $n^{-1/3} \sim 2$ nm in the fluid surrounding
+the microtubule. Superpositions spanning many tubulin
+dimers ($|\mathbf{r}'-\mathbf{r}| \gg D$) therefore decohere on a timescale
+
+\[ \tau \sim \frac{D^2\sqrt{m kT}}{N g q_e^2} \sim 10^{-13}\,{\rm s} \tag{22} \]
+
+due to the nearest ion alone. This is quite a conservative
+estimate, since the other $nD^3 \sim 10^3$ ions that are
+merely a small fraction further away will also contribute
+to the decoherence rate, but it is nonetheless 6-7 orders
+of magnitude shorter than the estimates of Mavromatos
+& Nanopoulos [25–27]. We will comment on screening
+effects below.
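+
+The numbers entering the microtubule estimate can be reproduced roughly as follows. The ion mass $m \approx 23\,m_p$ and temperature $T = 310$ K are assumptions made for illustration, as is the reading that each Ca2+ ion contributes a charge $2q_e$ to $p_0$.
+
+```latex
+% Assumed inputs (illustrative): m ~ 23 m_p, T = 310 K, charge 2 q_e per Ca^{2+}; D and N as quoted in the text
+\begin{align*}
+Q &= 2p_0 \approx 2\times(18\times13\times2)\,q_e = 936\,q_e \sim 940\,q_e, \qquad N = Q/q_e \sim 10^{3},\\
+\tau &\sim \frac{D^2\sqrt{m kT}}{N g q_e^2}
+\approx \frac{(2.4\times10^{-8})^2\times1.3\times10^{-23}}{10^{3}\times2.3\times10^{-28}}\,{\rm s}
+\approx 3\times10^{-14}\,{\rm s}.
+\end{align*}
+```
+
+Under these assumptions the result is a few times $10^{-14}$ s, which matches the quoted $10^{-13}$ s to within the accuracy expected of such order-of-magnitude estimates.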
+
+1. Decoherence summary
+
+Our decoherence rates are summarized in Table 1. How
+accurate are they likely to be?
+In the calculations above, we generally tried to be conservative,
+erring on the side of underestimating the decoherence
+rate. For instance, we neglected that N potassium
+ions also end up in superposition once the neuron
+firing is quenched, we neglected the contribution of other
+abundant ions such as Cl− to η, and we ignored collisions
+with water molecules in the microtubule case.
+Since we were only interested in order-of-magnitude
+estimates, we made a number of crude approximations,
+e.g., for the cross sections. We neglected screening ef-
+fects because the decoherence rates were dominated by
+the particles closest to the system, i.e., the very same par-
+ticles that are responsible for screening the charge from
+more distant ones.
+
+Table 1. Decoherence timescales.
+
+Object        Environment     τdec
+Neuron        Colliding ion   $10^{-20}$ s
+Neuron        Colliding H2O   $10^{-20}$ s
+Neuron        Nearby ion      $10^{-19}$ s
+Microtubule   Distant ion     $10^{-13}$ s
+
+IV. DISCUSSION
+
+A. The classical nature of brain processes
+
+The calculations above enable us to address the question
+of whether cognitive processes in the brain constitute
+a classical or quantum system in the sense of Figure 1.
+If we take the characteristic dynamical timescale
+for such processes to be $\tau_{\rm dyn} \sim 10^{-2}$ s to $10^{0}$ s (the apparent
+timescale of e.g., speech, thought and motor response),
+then a comparison of $\tau_{\rm dyn}$ with $\tau_{\rm dec}$ from Table 1
+shows that processes associated with either conventional
+neuron firing or with polarization excitations in microtubules
+fall squarely in the classical category, by a margin
+exceeding ten orders of magnitude. Neuron firing itself
+is also highly classical, since it occurs on a timescale
+$\tau_{\rm dyn} \sim 10^{-3}$ to $10^{-4}$ s [53]. Even a kink-like microtubule
+excitation is classical by many orders of magnitude, since
+it traverses a short tubule on a timescale $\tau_{\rm dyn} \sim 5\times10^{-7}$ s
+[60].
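+
+To make the margins explicit, the ratios below pair each dynamical timescale quoted above with a decoherence time from Table 1. The particular pairings are our reading of the comparison and should be taken as illustrative.
+
+```latex
+% Ratios of dynamical to decoherence timescales, using Table 1 and the values quoted in the text
+\begin{align*}
+\text{cognitive processes vs. microtubule:}\quad
+& \frac{\tau_{\rm dyn}}{\tau_{\rm dec}} \gtrsim \frac{10^{-2}\,{\rm s}}{10^{-13}\,{\rm s}} = 10^{11},\\
+\text{neuron firing vs. nearby ion:}\quad
+& \frac{\tau_{\rm dyn}}{\tau_{\rm dec}} \gtrsim \frac{10^{-4}\,{\rm s}}{10^{-19}\,{\rm s}} = 10^{15},\\
+\text{microtubule kink vs. distant ion:}\quad
+& \frac{\tau_{\rm dyn}}{\tau_{\rm dec}} \sim \frac{5\times10^{-7}\,{\rm s}}{10^{-13}\,{\rm s}} = 5\times10^{6}.
+\end{align*}
+```
+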
+What about other mechanisms?
+It is worth noting
+that if (as is commonly believed) different neuron fir-
+ing patterns correspond in some way to different con-
+scious perceptions, then consciousness itself cannot be
+of a quantum nature even if there is a yet undiscovered
+physical process in the brain with a very long decoherence
+time. As mentioned above, suggestions for such candidates
+have involved, e.g., superconductivity [12], superfluidity
+[13], electromagnetic fields [14], Bose condensation
+[15,16], superfluorescence [17] and other mechanisms
+[18,19]. The reason is that as soon as such a quantum
+subsystem communicates with the constantly decohering
+neurons to create conscious experience, everything deco-
+heres.
+How extreme a variation in the decoherence rates can
+we obtain by changing our model assumptions? Although
+the rates can be altered by a few orders of magnitude
+by pushing parameters such as the neuron dimensions,
+the myelination fraction or the microtubule kink charge
+to the limits of plausibility, it is clearly impossible to
+change the basic conclusion that $\tau_{\rm dec} \ll 10^{-3}$ s, i.e., that
+we are dealing with a classical system in the sense of Figure 1.
+Even the tiniest neuron imaginable, with only a
+single ion (N = 1) traversing the cell wall during firing,
+would have $\tau_{\rm dec} \sim 10^{-14}$ s. Likewise, reducing the effective
+microtubule kink charge to a small fraction of $q_e$
+would not help.
+How are we to understand the above-mentioned claims
+that brain subsystems can be sufficiently isolated to
+exhibit macroquantum behavior?
+It appears that the
+subtle distinction between dissipation and decoherence
+timescales has not always been appreciated.
+
+B. Implications for the subject-object-environment
+decomposition
+
+Let us now discuss the subsystem decomposition of
+Figure 2 in more detail in light of our results. As the
+figure indicates, the virtue of this decomposition into
+subject, object and environment is that the subsystem
+Hamiltonians Hs, Ho, He and the interaction Hamiltoni-
+ans Hso, Hoe, Hse can cause qualitatively very different
+effects. Let us now briefly discuss each of them in turn.
+Most of these processes are schematically illustrated
+in Figure 4 and Figure 5, where for purposes of illus-
+tration, we have shown the extremely simple case where
+both the subject and object have only a single degree of
+freedom that can take on only a few distinct values (3
+for the subject, 2 for the object). For definiteness, we
+denote the three subject states |¨- ⟩, | ¨⌣⟩ and | ¨⌢⟩, and in-
+terpret them as the observer feeling neutral, happy and
+sad, respectively. We denote the two object states |↑⟩
+and |↓⟩, and interpret them as the spin component (“up”
+or “down”) in the z-direction of a spin-1/2 system, say a
+silver atom. The joint system consisting of subject and
+object therefore has only 2 × 3 = 6 basis states: |¨- ↑⟩,
+|¨- ↓⟩, | ¨⌣↑⟩, | ¨⌣↓⟩, | ¨⌢↑⟩, | ¨⌢↓⟩. In Figures 4 and 5, we
+have therefore plotted ρ as a 6 × 6 matrix consisting of
+nine two-by-two blocks.
+
+[Figure 4 panels: object evolution under Ho (entropy constant); object decoherence under Hoe (entropy increases); observation/measurement under Hso (entropy decreases).]
+
+FIG. 4. Time evolution of the 6×6 density matrix for the basis
+states |¨- ↑⟩, |¨- ↓⟩, | ¨⌣↑⟩, | ¨⌣↓⟩, | ¨⌢↑⟩, | ¨⌢↓⟩ as the object evolves in
+isolation, then decoheres, then gets observed by the subject. The
+final result is a statistical mixture of the states | ¨⌣↑⟩ and | ¨⌢↓⟩,
+simple zero-entropy states like the one we started with.
+
+1. Effect of Ho: constant entropy
+
+If the object were to evolve during a time interval t
+without interacting with the subject or the environment
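+($H_{so} = H_{oe} = 0$), then according to equation (1) its
+reduced density matrix $\rho_o$ would evolve into $U\rho_o U^\dagger$ with
+the same entropy, since the time-evolution operator
+$U \equiv e^{-iH_o t}$ is unitary.
+Suppose the subject stays in the state |¨- ⟩ and the
+object starts out in the pure state |↑⟩. Let the object
+Hamiltonian $H_o$ correspond to a magnetic field in the y-direction
+causing the spin to precess to the x-direction,
+i.e., to the state $(|{\uparrow}\rangle + |{\downarrow}\rangle)/\sqrt{2}$. The object density matrix
+$\rho_o$ then evolves into
+
+\[ \rho_o = U|{\uparrow}\rangle\langle{\uparrow}|U^\dagger = \tfrac12\left(|{\uparrow}\rangle + |{\downarrow}\rangle\right)\left(\langle{\uparrow}| + \langle{\downarrow}|\right) = \tfrac12\left(|{\uparrow}\rangle\langle{\uparrow}| + |{\uparrow}\rangle\langle{\downarrow}| + |{\downarrow}\rangle\langle{\uparrow}| + |{\downarrow}\rangle\langle{\downarrow}|\right), \tag{23} \]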
+
+corresponding to the four entries of 1/2 in the second
+matrix of Figure 4.
+This is quite typical of pure quantum time evolution: a
+basis state eventually evolves into a superposition of ba-
+sis states, and the quantum nature of this superposition
+is manifested by off-diagonal elements in $\rho_o$. Another familiar
+example of this is the spreading out of the
+wave packet of a free particle.
+
+2. Effect of Hoe: increasing entropy
+
+This was the effect of Ho alone. In contrast, Hoe will
+generally cause decoherence and increase the entropy of
+the object. As discussed in detail in Section III and the
+Appendix, it entangles it with the environment, which
+suppresses the off-diagonal elements of the reduced den-
+sity matrix of the object as illustrated in Figure 4. If Hoe
+couples to the z-component of the spin, this destroys the
+terms |↑⟩⟨↓| and |↓⟩⟨↑|. Complete decoherence therefore
+converts the final state of equation (23) into
+
+\[ \rho_o = \tfrac12\left(|{\uparrow}\rangle\langle{\uparrow}| + |{\downarrow}\rangle\langle{\downarrow}|\right), \tag{24} \]
+
+corresponding to the two entries of 1/2 in the third ma-
+trix of Figure 4.
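+
+Written as explicit 2 × 2 matrices, the two object states before and after decoherence can be compared directly. The sketch below assumes the entropy is the von Neumann entropy measured in bits, $S = -{\rm tr}\,(\rho\log_2\rho)$; the superscripts simply label the equations the matrices come from.
+
+```latex
+% Object density matrix in the basis {|up>, |down>}; assumed entropy S = -tr(rho log_2 rho) in bits
+\begin{align*}
+\rho_o^{(23)} &= \frac12\begin{pmatrix}1 & 1\\ 1 & 1\end{pmatrix},
+\quad \text{eigenvalues } \{1, 0\}, \quad S = 0 \text{ bits (pure state)},\\
+\rho_o^{(24)} &= \frac12\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix},
+\quad \text{eigenvalues } \{\tfrac12, \tfrac12\}, \quad S = -2\cdot\tfrac12\log_2\tfrac12 = 1 \text{ bit}.
+\end{align*}
+```
+
+Decoherence removes the off-diagonal entries and raises the object entropy from zero to one bit, while the diagonal entries are unchanged.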
+
+3. Effect of Hso: decreasing entropy
+
+Whereas Hoe typically causes the apparent entropy of
+the object to increase, Hso typically causes it to decrease.
+
+Figure 4 illustrates the case of an ideal measurement,
+where the subject starts out in the state |¨- ⟩ and Hso is of
+such a form that the subject gets perfectly correlated with the object.
+In the language of Section II, an ideal measurement is a
+type of communication where the mutual information $I_{12}$
+between the subject and object systems is increased to its
+maximum possible value. Suppose that the measurement
+is caused by Hso becoming large during a time interval so
+brief that we can neglect the effects of Hs and Ho. The
+joint subject+object density matrix $\rho_{so}$ then evolves as
+$\rho_{so} \mapsto U\rho_{so}U^\dagger$, where $U \equiv \exp\left[-i\int H_{so}\,dt\right]$. If observing
+|↑⟩ makes the subject happy and |↓⟩ makes the subject
+sad, then we have U|¨-↑⟩ = | ¨⌣↑⟩ and U|¨-↓⟩ = | ¨⌢↓⟩. The
+state given by equation (24) would therefore evolve into
+
+ρso = ½ U [(|¨-⟩⟨¨-|) ⊗ (|↑⟩⟨↑| + |↓⟩⟨↓|)] U†    (25)
+    = ½ (U|¨-↑⟩⟨¨-↑|U† + U|¨-↓⟩⟨¨-↓|U†)        (26)
+    = ½ (| ¨⌣↑⟩⟨ ¨⌣↑| + | ¨⌢↓⟩⟨ ¨⌢↓|),          (27)
+
+as illustrated in Figure 4.
+This final state contains a
+mixture of two subjects, corresponding to definite but
+opposite knowledge of the object state.
+According to
+both of them, the entropy of the object has decreased
+from one bit to zero bits.
+In general, we see that the object decreases its en-
+tropy when it exchanges information with the subject
+and increases when it exchanges information with the
+environment (see footnote 4). Loosely speaking, the entropy of an object
+decreases while you look at it and increases while
+you don’t (see footnote 5).
+
+Footnote 4: If n bits of information are exchanged with the environment,
+then equation (3) shows that the object entropy will
+increase by this same amount if the environment is in thermal
+equilibrium (with maximal entropy) throughout. If we
+were to know the state of the environment initially (by our
+definition of environment, we do not), then both the object
+and environment entropy will typically increase by n/2 bits.
+
+Footnote 5: Here and throughout, we are assuming that the total
+system, which is by definition isolated, evolves according to
+the Schrödinger equation (1). Although modifications of the
+Schrödinger equation have been suggested by some authors,
+either in a mathematically explicit form as in [55,56] or verbally
+as a so-called reduction postulate, there is so far no
+experimental evidence suggesting that modifications are necessary.
+The original motivations for such modifications were
+
+1. to be able to interpret the diagonal elements of the
+density matrix as probabilities and
+
+2. to suppress off-diagonal elements of the density matrix.
+
+The subsequent discovery by Everett [64] that the probability
+interpretation automatically appears to hold for almost all
+observers in the final superposition solved problem 1, and is
+discussed in more detail in, e.g., [29,66–74]. The still more
+
+[Figure 5 panels: subject evolution under Hs (snap decision); subject decoherence under Hse.]
+
+FIG. 5. Time evolution of the same 6 × 6 density matrix as in
+Figure 4 when the subject evolves in isolation, then decoheres. The
+object remains in the state |↑⟩ the whole time. The final result is
+a statistical mixture of the two states | ¨⌣↑⟩ and | ¨⌢↑⟩.
+
+4. Effect of Hs: the thought process
+
+So far, we have focused on the object and discussed
+effects of its internal dynamics (Ho) and its interactions
+with the environment (Hoe) and subject (Hso). Let us
+now turn to the subject and consider the role played by
+its internal dynamics (Hs) and interactions with the en-
+vironment (Hse).
+In his seminal 1993 book, Stapp [3]
+presents an argument about brain dynamics that can be
+summarized as follows.
+
+1. Since the brain contains ∼ $10^{11}$ synapses connected
+together by neurons in a highly nonlinear fashion,
+there must be a huge number of metastable reverberating
+patterns of pulses into which the brain can
+evolve.
+
+2. Neural network simulations have indicated that the
+metastable state into which a brain does in fact
+evolve depends sensitively on the initial conditions
+in small numbers of synapses.
+
+3. The latter depends on the locations of a small number
+of calcium atoms, which might be expected to
+be in quantum superpositions.
+
+4. Therefore, one would expect the brain to evolve
+into a quantum superposition of many such
+metastable configurations.
+
+5. Moreover, the fatigue characteristics of the synaptic
+junctions will cause any given metastable state
+to become, after a short time, unstable: the
+subject will then be forced to search for a new
+metastable configuration, and will therefore continue
+to evolve into a superposition of increasingly
+disparate states.
+
+Footnote 5 (continued): recent discovery of decoherence [11,36,37] solved problem 2,
+as well as explaining so-called superselection rules for the first
+time (why for instance the position basis has a special status)
+[44].
+
+If different states (perceptions) of the subject correspond
+to different metastable states of neuron firing patterns, a
+definite perception would eventually evolve into a super-
+position of several subjectively distinguishable percep-
+tions.
+We will follow Stapp in making this assumption about
+Hs. For illustrative purposes, let us assume that this can
+happen even at the level of a single thought or snap de-
+cision where the outcome feels unpredictable to us. Con-
+sider the following experiment: the subject starts out
+with a blank face and counts silently to three, then makes
+a snap decision on whether to smile or frown. The time-evolution
+operator $U \equiv \exp\left[-i\int H_s\,dt\right]$ will then have
+the property that U|¨- ⟩ = (| ¨⌣⟩ + | ¨⌢⟩)/√2, so the subject
+density matrix ρs will evolve into
+
+ρs = U|¨- ⟩⟨¨- |U† = ½ (| ¨⌣⟩ + | ¨⌢⟩)(⟨ ¨⌣| + ⟨ ¨⌢|)
+   = ½ (| ¨⌣⟩⟨ ¨⌣| + | ¨⌣⟩⟨ ¨⌢| + | ¨⌢⟩⟨ ¨⌣| + | ¨⌢⟩⟨ ¨⌢|),    (28)
+
+corresponding to the four entries of 1/2 in the second
+matrix in Figure 5.
+
+5. Effect of Hse: subject decoherence
+
+Just as Hoe can decohere the object, Hse can decohere
+the subject. The difference is that whereas the object can
+be either a quantum system (with small Hoe) or a classi-
+cal system (with large Hoe), a human subject always has
+a large interaction with the environment. As we showed
+in Section III, τdec ≪ τdyn for the subject, i.e., the ef-
+fect of Hse is faster than that of Hs by many orders of
+magnitude. This means that we should strictly speaking
+not think of macrosuperpositions such as equation (28)
+as first forming and then decohering as in Figure 5 —
+rather, subject decoherence is so fast that such superpo-
+sitions decohere already during their process of forma-
+tion. Therefore we are never even close to being able to
+perceive superpositions of different perceptions. Reduc-
+ing object decoherence (from Hoe) during measurement
+would make no difference, since decoherence would take
+place in the brain long before the transmission of the ap-
+propriate sensory input through sensory nerves had been
+completed.
+
+C. He and Hsoe
+
+The environment is of course the most complicated sys-
+tem, since it contains the vast majority of the degrees of
+freedom in the total system. It is therefore very fortu-
+nate that we can so often ignore it, considering only those
+limited aspects of it that affect the subject and object.
+For the most general H, there can also be an ugly
+irreducible residual term Hsoe ≡ H − Hs − Ho − He −
+Hso − Hoe − Hse.
+
+D. Implications for modeling cognitive processes
+
+For the neural network community, the implication of
+our result is “business as usual”, i.e., there is no need
+to worry about the fact that current simulations do not
+incorporate effects of quantum coherence. The only rem-
+nant from quantum mechanics is the apparent random-
+ness that we subjectively perceive every time the subject
+system evolves into a superposition as in equation (28),
+but this can be simply modeled by including a random
+number generator in the simulation. In other words, the
+recipe used to prescribe when a given neuron should fire
+and how synaptic coupling strengths should be updated
+may have to involve some classical randomness to cor-
+rectly mimic the behavior of the brain.
+
+1. Hyper-classicality
+
+If a subject system is to be a good model of us, Hso
+and Hse need to meet certain criteria: decoherence and
+communication are necessary, but fluctuation and dissi-
+pation must be kept low enough that the subject does
+not lose its autonomy completely.
+In our study of neural processes, we concluded that the
+subject is not a quantum system, since τdec ≪ τdyn. How-
+ever, since the dissipation time τdiss for neuron firing is
+of the same order as its dynamical timescale, we see that
+in the sense of Figure 1, the subject is not a simple clas-
+sical system either. It is therefore somewhat misleading
+to think of it as simply some classical degrees of freedom
+evolving fairly undisturbed (only interacting enough to
+stay decohered and occasionally communicate with the
+outside world). Rather, the semi-autonomous degrees of
+freedom that constitute the subject are to be found at a
+higher level of complexity, perhaps as metastable global
+patterns of neuron firing.
+These degrees of freedom might be termed “hyper-classical”:
+although there is nothing quantum-mechanical
+about their equations of motion (except that
+they can be stochastic), they may bear little resemblance
+to the underlying classical equations from which they
+were derived.
+Energy conservation and other familiar
+concepts from Hamiltonian dynamics will be irrelevant
+for these more abstract equations, since neurons are en-
+ergy pumped and highly dissipative. Other examples of
+such hyper-classical systems include the time-evolution
+of the memory contents of a regular (highly dissipative)
+
+digital computer as well as the motion on the screen of
+objects in a computer game.
+
+2. Nature of the subject system
+
+In this paper, we have tacitly assumed that conscious-
+ness is synonymous with certain brain processes. This is
+what Lockwood terms the “identity theory” [66]. It dates
+back to Hobbes (∼1660) and has been espoused by, e.g.,
+Russell, Feigl, Smart, Armstrong, Churchland and Lock-
+wood himself. Let us briefly explore the more specific
+assumption that the subject degrees of freedom are our
+perceptions. In this picture, some of the subject degrees
+of freedom would have to constitute a “world model”,
+with the interaction Hso such that the resulting commu-
+nication keeps these degrees of freedom highly correlated
+with selected properties of the outside world (object +
+environment). Some such properties, i.e.,
+
+• the intensity of the electromagnetic field on the retina,
+averaged through three narrow-band filters (color
+vision) and one broad-band filter (black-and-white
+vision),
+
+• the spectrum of air pressure fluctuations in the ears
+(sound),
+
+• the chemical composition of gas in the nose (smell)
+and solutions in the mouth (taste),
+
+• heat and pressure at a variety of skin locations,
+
+• locations of body parts,
+
+are tracked rather continuously, with the corresponding
+mutual information I12 between subject and surround-
+ings remaining fairly constant.
+Persisting correlations
+with properties of the past state of the surroundings
+(memories) further contribute to the mutual information
+I12. Much of I12 is due to correlations with quite subtle
+aspects of the surroundings, e.g., the contents of books.
+The total mutual information I12 between a person and
+the external world is fairly low at birth, gradually grows
+through learning, and falls when we forget. In contrast,
+most inanimate objects have a very small mutual information
+with the rest of the world, books and diskettes being
+notable exceptions.
+The extremely limited selection of properties that the
+subject correlates with has presumably been determined
+by evolutionary utility, since it is known to differ between
+species: birds perceive four primary colors but cats only
+one, bees perceive light polarization, etc. In this picture,
+we should therefore not consider these particular (“classi-
+cal”) aspects of our surroundings to be more fundamental
+than the vast majority that the subject system is uncorrelated
+with. Moreover, our perception of e.g. space is as
+subjective as our perception of color, just as suggested
+by e.g. [50].
+
+3. The binding problem
+
+One of the motivations for models with quantum co-
+herence in the brain was the so-called binding problem.
+In the words of James [75,76], “the only realities are the
+separate molecules, or at most cells. Their aggregation
+into a ‘brain’ is a fiction of popular speech”. James’ con-
+cern, shared by many after him, was that consciousness
+did not seem to be spatially localized to any one small
+part of the brain, yet subjectively feels like a coherent
+entity. Because of this, Stapp [3] and many others have
+appealed to quantum coherence, arguing that this could
+make consciousness a holistic effect involving the brain
+as a whole.
+However, non-local degrees of freedom can be important
+even in classical physics. For instance, oscillations
+in a guitar string are local in Fourier space, not in real
+space, so in this case the “binding problem” can be solved
+by a simple change of variables. As Eddington remarked
+[77], when observing the ocean we perceive the moving
+waves as objects in their own right because they display a
+certain permanence, even though the water itself is only
+bobbing up and down. Similarly, thoughts are presum-
+ably highly non-local excitation patterns in the neural
+network of our brain, except of a non-linear and much
+more complex nature.
+In short, this author feels that
+there is no binding problem.
+
+4. Outlook
+
+In summary, our decoherence calculations have indicated
+that there is nothing fundamentally quantum-mechanical
+about cognitive processes in the brain, supporting
+Hepp’s conjecture [33]. Specifically, the computations
+in the brain appear to be of a classical rather
+than quantum nature, and the argument by Lisewski [78]
+that quantum corrections may be needed for accurate
+modeling of some details, e.g., non-Markovian noise in
+neurons, does of course not change this conclusion. This
+means that although the current state-of-the-art in neu-
+ral network hardware is clearly still very far from be-
+ing able to model and understand cognitive processes as
+complex as those in the brain, there are no quantum me-
+chanical reasons to doubt that this research is on the
+right track.
+
+Acknowledgements:
+The author wishes to thank
+the organizers of the Spaatind-98 and Gausdal-99 win-
+ter schools, where much of this work was done, and
+Mark Alford, Philippe Blanchard, Carlton Caves, Angel-
+ica de Oliveira-Costa, Matthew Donald, Andrei Gruzi-
+nov, Piet Hut, Nick Mavromatos, Henry Stapp, Hans-
+Dieter Zeh and Woitek Zurek for stimulating discussions
+and helpful comments. Support for this work was provided
+by the Sloan Foundation and by NASA through
+grant NAG5-6034 and Hubble Fellowship HF-01084.01-96A
+from STScI, operated by AURA, Inc. under NASA
+contract NAS5-26555.
+
+APPENDIX: DECOHERENCE FORMULAS
+
+The quantitative effect of decoherence from both short
+range interactions (scattering) and long-range interac-
+tions was first derived in a seminal paper by Joos & Zeh
+[45]. Since our application involved scattering between
+particles of comparable mass, we used a generalized ver-
+sion of these results that included the effect of recoil [46].
+In this Appendix, we derive a slightly generalized formula
+for long-range interactions, and briefly comment on the
+relation between these short-range and long-range limit-
+ing cases.
+
+1. Decoherence due to tidal forces
+
+Even if the dissipation and fluctuation caused by Hint
+is dynamically unimportant, H1 and H2 can be neglected
+in equation (2) when calculating the decoherence effect in
+the many cases where the interaction Hamiltonian deco-
+heres the object on a timescale far below the dynamical
+time. In this approximation, we consider two particles
+with an interaction H = Hint = V (r2 − r1) for some
+potential V . According to equation (1), the two-particle
+density matrix ρ therefore evolves as
+
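+\[ \rho(\mathbf{r}_1, \mathbf{r}_1', \mathbf{r}_2, \mathbf{r}_2', t_0 + t) = \rho(\mathbf{r}_1, \mathbf{r}_1', \mathbf{r}_2, \mathbf{r}_2', t_0)\,e^{-i[V(\mathbf{r}_2-\mathbf{r}_1) - V(\mathbf{r}_2'-\mathbf{r}_1')]\,t/\hbar}. \tag{A1} \]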
+
+Following [45], we assume that the two particles are fairly
+localized near their initial average positions
+
+\[ \mathbf{r}_i^0 \equiv \langle \mathbf{r}_i\rangle_0 = {\rm tr}\,[\mathbf{r}_i\,\rho_i(t_0)], \tag{A2} \]
+
+i = 1, 2, and approximate the potential by its second
+order Taylor expansion
+
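+\[ V(\mathbf{r}_2 - \mathbf{r}_1) \approx V(\mathbf{a}) - \mathbf{F}\cdot(\mathbf{x}_2 - \mathbf{x}_1) + \tfrac12(\mathbf{x}_2 - \mathbf{x}_1)^t M (\mathbf{x}_2 - \mathbf{x}_1). \tag{A3} \]
+
+Here $\mathbf{F} \equiv -\nabla V(\mathbf{a})$ is the average force, M is the Hessian
+matrix $M_{ij} \equiv \partial_i\partial_j V(\mathbf{a})$ and $\mathbf{a} \equiv \mathbf{r}_2^0 - \mathbf{r}_1^0$. We have introduced
+relative coordinates $\mathbf{x}_i \equiv \mathbf{r}_i - \mathbf{r}_i^0$. Assuming that
+the two particles are independent initially as in [45], i.e.,
+that ρ(t0) takes the separable form
+$\rho(\mathbf{x}_1, \mathbf{x}_1', \mathbf{x}_2, \mathbf{x}_2', t_0) = \rho_1(\mathbf{x}_1, \mathbf{x}_1', t_0)\,\rho_2(\mathbf{x}_2, \mathbf{x}_2', t_0)$, this gives
+
+\[ \rho_1(\mathbf{x}_1, \mathbf{x}_1', t_0 + t) = {\rm tr}_2\,\rho(t_0 + t) = \int \rho(\mathbf{x}_1, \mathbf{x}_1', \mathbf{x}, \mathbf{x}, t_0 + t)\,d^3x = \rho_1(\mathbf{x}_1, \mathbf{x}_1', t_0)\,f(\mathbf{x}_1, \mathbf{x}_1', t), \tag{A4} \]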
+
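+where
+
+\[ f(\mathbf{x}_1, \mathbf{x}_1', t) \approx e^{i\phi(\mathbf{x}_1,\mathbf{x}_1',t)} \int \rho_2(\mathbf{x}_2, \mathbf{x}_2, t_0)\,e^{-it(\mathbf{x}_1'-\mathbf{x}_1)^t M \mathbf{x}_2/\hbar}\,d^3x_2 = e^{i\phi(\mathbf{x}_1,\mathbf{x}_1',t)}\,\hat{p}_2\!\left[M(\mathbf{x}_1' - \mathbf{x}_1)\,t/\hbar\right]. \tag{A5} \]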
+
+Here the phase factor
+
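+\[ e^{i\phi(\mathbf{x},\mathbf{x}',t)} \equiv e^{\frac{it}{\hbar}\left[\mathbf{F}\cdot(\mathbf{x}'-\mathbf{x}) + \frac12 \mathbf{x}'^t M \mathbf{x}' - \frac12 \mathbf{x}^t M \mathbf{x}\right]} \tag{A6} \]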
+
+is of no importance for decoherence, since it does not
+suppress the magnitude $|\rho_1(\mathbf{x}_1, \mathbf{x}_1', t)|$ of the off-diagonal
+elements; it merely causes momentum transfer related
+to fluctuation and dissipation. It is the other term
+that causes decoherence. $\hat{p}_2$ is the Fourier transform of
+$p_2(\mathbf{x}) \equiv \rho_2(\mathbf{x}, \mathbf{x}, t_0)$, the probability distribution for the
+location of the environment particle.
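+
+The separation into a phase factor and a decohering factor can be verified with the short calculation below. It assumes that M is symmetric, that the elapsed time t multiplies the potential difference in the exponent, and that the partial trace sets $\mathbf{x}_2' = \mathbf{x}_2$; the notation $V(\mathbf{a}+\cdot)$ is shorthand introduced here for the expansion (A3).
+
+```latex
+% Assumptions: M symmetric; partial trace sets x_2' = x_2; expansion (A3) inserted for both arguments
+\begin{align*}
+V(\mathbf{a}+\mathbf{x}_2-\mathbf{x}_1) - V(\mathbf{a}+\mathbf{x}_2-\mathbf{x}_1')
+&= -\mathbf{F}\cdot(\mathbf{x}_1'-\mathbf{x}_1)
+ - \tfrac12\left(\mathbf{x}_1'^t M \mathbf{x}_1' - \mathbf{x}_1^t M \mathbf{x}_1\right)
+ + (\mathbf{x}_1'-\mathbf{x}_1)^t M \mathbf{x}_2,\\
+e^{-i[\,\cdots\,]\,t/\hbar}
+&= \underbrace{e^{\frac{it}{\hbar}\left[\mathbf{F}\cdot(\mathbf{x}_1'-\mathbf{x}_1) + \frac12 \mathbf{x}_1'^t M \mathbf{x}_1' - \frac12 \mathbf{x}_1^t M \mathbf{x}_1\right]}}_{\text{phase, independent of } \mathbf{x}_2}
+ \;\times\; \underbrace{e^{-it(\mathbf{x}_1'-\mathbf{x}_1)^t M \mathbf{x}_2/\hbar}}_{\text{integrated against } p_2(\mathbf{x}_2)}.
+\end{align*}
+```
+
+Only the second factor depends on the environment coordinate, so integrating it against $p_2$ produces the Fourier transform $\hat{p}_2$ evaluated at $M(\mathbf{x}_1'-\mathbf{x}_1)\,t/\hbar$.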
+
+2. Properties of the effect
+
+Let us briefly discuss some qualitative features of equation (A5).
+Since $\hat{p}_2(0) = \int p_2(x_2)\, d^3x_2 = {\rm tr}\, \rho_2 = 1$,
+$\rho_1(x, x')$ remains unchanged on the diagonal $x = x'$.
+This is because $H_{\rm int}$ is not changing the position of our
+object particle, merely its momentum. Since the
+mean position $\langle x_2 \rangle = \int p_2\, x_2\, d^3x_2 = {\rm tr}\,[x_2 \rho_2] = 0$
+vanishes (using equation (A2)), we have $\nabla \hat{p}_2(0) = 0$. In
+fact, $|f|$ takes a maximum on the diagonal, and the
+Riemann-Lebesgue Lemma shows that $|f| = |\hat{p}_2| \le 1$
+whenever $x \neq x'$, with equality only for the unphysical
+case where $p_2$ is a delta function, i.e., where the
+location of the environment particle is perfectly known.
+$\partial_i \partial_j |f(0)| = -M \langle x_2 x_2^t \rangle M t^2/2\hbar^2$, so the larger $\langle x_2 x_2^t \rangle$
+is (i.e., the more spread out the environment particle is),
+the closer to the diagonal decoherence will suppress our
+density matrix.
+Since $M$ is the shear matrix of the force field $-\nabla V$, we
+see that it is tidal forces that are causing the decoherence;
+the average force $F$ simply contributes to the phase
+factor $e^{i\phi}$. Specifically, the rate at which our object
+degrees of freedom $r_1$ decohere grows with the tidal force
+that it exerts on the environment: if the environment
+particle is spread out with $\langle x_2 x_2^t \rangle$ large, experiencing a
+wide range of forces from the object, object decoherence
+is rapid. In the opposite situation, where the object is
+spread out and the environment is not, the object will
+experience strong classical tidal forces but no decoherence.
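+
+A concrete case may help fix the scales. The following is an illustrative sketch, not part of the derivation above: it assumes that the environment particle has a Gaussian position distribution with zero mean and covariance $\Sigma \equiv \langle x_2 x_2^t \rangle$, and the symbols $\Delta$, $\sigma$ and $\tau$ are introduced here only for this example. The Fourier transform of such a Gaussian is $\hat{p}_2(k) = e^{-k^t \Sigma k/2}$, so with $k = M \Delta\, t/\hbar$ and $\Delta \equiv x_1' - x_1$,
+\[
+|f(x_1, x_1', t)| = \exp\left[-\frac{t^2}{2\hbar^2}\, \Delta^t M \Sigma M \Delta\right] .
+\]
+For an isotropic spread $\Sigma = \sigma^2 I$ this becomes
+\[
+|f| = \exp\left[-\frac{t^2}{2\tau^2}\right], \qquad \tau \equiv \frac{\hbar}{\sigma\, |M \Delta|} ,
+\]
+so under these assumptions the off-diagonal element at separation $\Delta$ is suppressed on a timescale that shortens as either the environment spread $\sigma$ or the tidal term $|M\Delta|$ grows, while a perfectly localized environment particle ($\sigma \to 0$) produces no decoherence at all.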
+
+3. Relation between long-range and short-range
+decoherence
+
+Above we derived the effect of decoherence from long-range
+tidal forces. Another interesting case that has been
+solved analytically [45] is that of short-range interactions
+that can be modeled as scattering events. If the scattering
+takes place during short enough a time interval that
+we can neglect the internal dynamics of the object, then
+its reduced density matrix changes as [46]
+
+\[
+\rho_1(r, r') \mapsto \rho_1(r, r')\, \hat{p}\left(\frac{r' - r}{\hbar}\right), \tag{A7}
+\]
+
+where $p(q)$ is the probability distribution for the momentum
+transfer $q$ in the collision. This equation generalizes
+the scattering result of [45] by including the effect of recoil.
+The larger the uncertainty in momentum transfer,
+the stronger the decoherence effect becomes, since widening
+$p$ narrows its Fourier transform $\hat{p}$. Changing the mean
+momentum transfer $\langle q \rangle$ does not affect the decoherence;
+it merely contributes a phase factor, just as $F$ did above.
+Typically, the last factor in equation (A7) destroys coherence
+down to scales of order the de Broglie wavelength
+of the scatterer, with directional modulations from the
+angular dependence of the scattering cross section.
+Generalization to a steady flux of scattering particles [46]
+gives equation (6).
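+
+To illustrate how the width of $p$ sets the coherence length, consider a hypothetical Gaussian momentum-transfer distribution with mean $\langle q \rangle$ and isotropic variance $\sigma_q^2$ per component. Both the Gaussian form and the symbol $\sigma_q$ are assumptions made for this example only. With the convention $\hat{p}(k) = \int p(q)\, e^{-ik \cdot q}\, d^3q$,
+\[
+\hat{p}\left(\frac{r' - r}{\hbar}\right) = e^{-i \langle q \rangle \cdot (r' - r)/\hbar}\, \exp\left[-\frac{\sigma_q^2\, |r' - r|^2}{2\hbar^2}\right] .
+\]
+The mean enters only through the phase, while the magnitude falls off once $|r' - r|$ exceeds roughly $\hbar/\sigma_q$. If $\sigma_q$ is comparable to the momentum of the scatterer, this length is of order its de Broglie wavelength.
+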
+Equation (A7) has striking similarities with the tidal
+force result of equation (A5): in both cases, the density
+matrix gets multiplied by the Fourier transform of a probability
+distribution. In fact, up to uninteresting phase
+factors, we can rewrite our equation (A5) in exactly the
+form of equation (A7) by redefining $p$ to be the probability
+distribution for momentum transfer $q = M(x_2 - x_1)t$
+due to tidal forces for a fixed $x_1$, i.e.,
+
+\[
+p(q) \equiv p_2(x_2)\, \frac{d^3x_2}{d^3q} = \frac{p_2(x_1 + M^{-1}q/t)}{t^3 \det M}. \tag{A8}
+\]
+
+Fourier transforming this expression and substituting the
+result into equation (A7), we recover equation (A5) up
+to a phase factor.
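+
+The Fourier-transform step can be written out as follows. This is a sketch assuming a symmetric $M$ and the convention $\hat{p}(k) = \int p(q)\, e^{-ik \cdot q}\, d^3q$, which the text does not state explicitly. Changing variables from $q$ back to $x_2$ with $p(q)\, d^3q = p_2(x_2)\, d^3x_2$ and $q = M(x_2 - x_1)t$,
+\[
+\hat{p}(k) = \int p_2(x_2)\, e^{-it\, k^t M (x_2 - x_1)}\, d^3x_2 = e^{it\, k^t M x_1}\, \hat{p}_2(t M k) .
+\]
+Evaluating at $k = (x_1' - x_1)/\hbar$, which equals $(r' - r)/\hbar$ for the object particle, gives
+\[
+\hat{p}\left(\frac{x_1' - x_1}{\hbar}\right) = e^{it (x_1' - x_1)^t M x_1/\hbar}\, \hat{p}_2[M(x_1' - x_1)t/\hbar] ,
+\]
+whose magnitude is the decoherence factor $|\hat{p}_2|$ of equation (A5); the prefactor is a pure phase.
+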
+Perhaps the simplest way to understand all these results
+is in terms of Wigner functions [79]. If $W(x_1, p_1)$ is
+the Wigner phase space distribution for the object particle,
+then any of the momentum-transferring interactions
+that we have considered will take the form
+
+\[
+W(x_1, p_1) \mapsto \int W(x_1, p_1 - q)\, p(q, x_1)\, d^3q \tag{A9}
+\]
+
+for some probability distribution p that may or may not
+depend on x1. Since the density matrix
+
+\[
+\rho_1(x_1, x_1') = \int W\left(\frac{x_1 + x_1'}{2}, p\right) e^{-i(x_1 - x_1') \cdot p}\, d^3p \tag{A10}
+\]
+
+is just the Wigner function Fourier transformed in the
+momentum direction (and rotated by $45^\circ$), the convolution
+with $p$ in equation (A9) reduces to a simple multiplication
+with $\hat{p}$ in equation (A7).
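+
+The convolution-to-multiplication step can be sketched in a few lines. The sketch assumes that $p$ does not depend on $x_1$, uses units with $\hbar = 1$ as in equation (A10), and writes $X \equiv (x_1 + x_1')/2$ as a shorthand introduced here. Applying (A9) inside (A10) and shifting the integration variable to $p' = p - q$,
+\[
+\int \left[\int W(X, p - q)\, p(q)\, d^3q\right] e^{-i(x_1 - x_1') \cdot p}\, d^3p = \left[\int p(q)\, e^{-i(x_1 - x_1') \cdot q}\, d^3q\right] \left[\int W(X, p')\, e^{-i(x_1 - x_1') \cdot p'}\, d^3p'\right] ,
+\]
+that is,
+\[
+\rho_1(x_1, x_1') \mapsto \hat{p}(x_1 - x_1')\, \rho_1(x_1, x_1') .
+\]
+Because $p$ is real, $\hat{p}(-k) = \hat{p}(k)^*$, so the sign of the argument relative to equation (A7) affects only the phase and not the suppression of the off-diagonal elements.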
+
+
+[1] R. Penrose, The Emperor’s New Mind (Oxford, Oxford Univ. Press, 1989).
+[2] R. Penrose, in The Large, the Small and the Human Mind, ed. M. Longair (Cambridge, Cambridge Univ. Press, 1997).
+[3] H. P. Stapp, Mind, Matter and Quantum Mechanics (Berlin, Springer, 1993).
+[4] D. J. Amit, Modeling Brain Functions (Cambridge, Cambridge Univ. Press, 1989).
+[5] M. Mézard, G. Parisi, and M. Virasoro, Spin Glass Theory and Beyond (Singapore, World Scientific, 1993).
+[6] R. L. Harvey, Neural Network Principles (Englewood Cliffs, Prentice Hall, 1994).
+[7] F. H. Eeckman and J. M. Bower, Computation and Neural Systems (Boston, Kluwer, 1993).
+[8] D. R. McMillen, G. M. T. D’Eleuterio, and J. R. P. Halperin, Phys. Rev. E 59, 6 (1999).
+[9] E. P. Wigner, in The Scientist Speculates: an Anthology of Partly-Baked Ideas, p284-302, ed. I. J. Good (London, Heinemann, 1962).
+[10] J. Mehra and A. S. Wightman, The Collected Works of E. P. Wigner, Vol. VI, p271 (Berlin, Springer, 1995).
+[11] H. D. Zeh, Found. Phys. 1, 69 (1970).
+[12] E. H. Walker, Mathematical Biosciences 7, 131 (1970).
+[13] L. H. Domash, in Scientific Research on TM, ed. D. W. Orme-Johnson and J. T. Farrow (Weggis, Switzerland, Maharishi Univ. Press, 1977).
+[14] H. P. Stapp, Phys. Rev. D 28, 1386 (1983).
+[15] I. N. Marshall, New Ideas in Psychology 7, 73 (1989).
+[16] D. Zohar, The Quantum Self (New York, William Morrow, 1990).
+[17] H. Rosu, Metaphysical Review 3, 1, gr-qc/9409007 (1997).
+[18] L. M. Ricciardi and H. Umezawa, Kibernetik 4, 44 (1967).
+[19] A. Vitiello, Int. J. Mod. Phys. B9, 973-89 (1996).
+[20] S. R. Hameroff and R. C. Watt, Journal of Theoretical Biology 98, 549 (1982).
+S. R. Hameroff, Ultimate Computing: Biomolecular Consciousness and Nanotechnology (Amsterdam, North-Holland, 1987).
+[21] D. V. Nanopoulos 1995, hep-ph/9505374
+[22] N. Mavromatos and D. V. Nanopoulos 1995, hep-ph/9505401
+[23] N. Mavromatos and D. V. Nanopoulos 1995, quant-ph/9510003
+[24] N. Mavromatos and D. V. Nanopoulos 1995, quant-ph/9512021
+[25] N. Mavromatos and D. V. Nanopoulos, Int. J. Mod. Phys B 12, 517, quant-ph/9708003 (1998).
+[26] N. Mavromatos and D. V. Nanopoulos 1998, quant-ph/9802063
+[27] N. Mavromatos 1999, J. Bioelectrochemistry & Bioenergetics 48, 273
+[28] H. P. Stapp, Found. Phys. 21, 1451 (1991).
+[29] H. D. Zeh, quant-ph/9908084, Epistemological Letters of the Ferdinand-Gonseth Association 63:0 (Biel, Switzerland, 1981)
+[30] W. H. Zurek, Phys. Today 44 (10), 36 (1991).
+[31] A. Scott, J. Consciousness Studies 6, 484 (1996).
+
+[32] S. Hawking, in The Large, the Small and the Human Mind, ed. M. Longair (Cambridge, Cambridge Univ. Press, 1997).
+[33] K. Hepp, in Quantum Future, ed. P. Blanchard and A. Jadczyk (Berlin, Springer, 1999).
+[34] J. von Neumann, Mathematische Grundlagen der Quanten-Mechanik (Springer, Berlin, 1932).
+[35] H. D. Zeh, The Arrow of Time, 3rd ed. (Springer, Berlin, 1999).
+[36] W. H. Zurek, Phys. Rev. D 24, 1516 (1981).
+[37] W. H. Zurek, Phys. Rev. D 26, 1862 (1982).
+[38] W. H. Zurek, reprint LAUR 84-2750, in Non-Equilibrium Statistical Physics, ed. G. Moore and M. O. Scully (New York, Plenum, 1984).
+[39] E. Peres, Am. J. Phys. 54, 688 (1986).
+[40] P. Pearle, Phys. Rev. A 39, 2277 (1989).
+[41] M. R. Gallis and G. N. Fleming, Phys. Rev. A 42, 38 (1989).
+[42] W. H. Unruh and W. H. Zurek, Phys. Rev. D 40, 1071 (1989).
+[43] R. Omnès, Phys. Rev. A 56, 3383 (1997).
+[44] D. Giulini, E. Joos, C. Kiefer, J. Kupsch, I. O. Stamatescu, and H. D. Zeh, Decoherence and the Appearance of a Classical World in Quantum Theory (Springer, Berlin, 1996).
+[45] E. Joos and H. D. Zeh, Z. Phys. B 59, 223 (1985).
+[46] M. Tegmark, Found. Phys. Lett. 6, 571 (1993).
+[47] R. P. Feynman, Statistical Mechanics (Reading, Benjamin, 1972).
+[48] B. Katz, Nerve, Muscle, and Synapse (New York, McGraw-Hill, 1966).
+[49] J. P. Schadé and D. H. Ford, Basic Neurology, 2nd ed. (Amsterdam, Elsevier, 1973).
+[50] A. G. Cairns-Smith, Evolving the Mind (Cambridge, Cambridge Univ. Press, 1996).
+[51] P. Morell and W. T. Norton, Sci. Am. 242, 74 (1980).
+[52] A. Hirano and J. A. Llena, in The Axon, ed. S. G. Waxman, J. D. Kocsis, and P. K. Stys (New York, Oxford Univ. Press, 1995).
+[53] J. M. Ritchie, in The Axon, ed. S. G. Waxman, J. D. Kocsis, and P. K. Stys (New York, Oxford Univ. Press, 1995).
+[54] J. Ellis, S. Mohanty, and D. V. Nanopoulos, Phys. Lett. B 221, 113 (1989).
+[55] P. Pearle, Phys. Rev. D 13, 857 (1976).
+[56] G. C. Ghirardi, A. Rimini, and T. Weber, Phys. Rev. D 34, 470 (1986).
+[57] J. D. Jackson, Classical Electrodynamics (New York, Wiley, 1975).
+[58] W. H. Zurek, S. Habib, and J. P. Paz, Phys. Rev. Lett. 70, 1187 (1993).
+[59] M. Tegmark and H. S. Shapiro, Phys. Rev. E 50, 2538 (1994).
+[60] M. V. Satarić, J. A. Tuszyński, and R. B. Žakula, Phys. Rev. E 48, 589 (1993).
+[61] H. C. Rosu, Phys. Rev. E 55, 2038 (1997).
+[62] H. C. Rosu, Nuovo Cimento D 20, 369 (1998).
+[63] H. S. Stapp 1999, Attention, Intention, and Mind in Quantum Physics and Quantum Ontology and Mind-Matter Synthesis, available at www-physics.lbl.gov/ stapp/stappfiles.html.
+[64] H. Everett III, Rev. Mod. Phys. 29, 454 (1957).
+H. Everett III, The Many-Worlds Interpretation of Quantum Mechanics, B. S. DeWitt and N. Graham (Princeton, Princeton Univ. Press, 1986).
+[65] J. A. Wheeler, Rev. Mod. Phys. 29, 463 (1957).
+L. M. Cooper and D. van Vechten, Am. J. Phys 37, 1212 (1969).
+B. S. DeWitt, Phys. Today 23, 30 (1971).
+[66] M. Lockwood, Mind, Brain and the Quantum (Cambridge, Blackwell, 1989).
+[67] D. Deutsch, The Fabric of Reality (Allen Lane, New York, 1997).
+[68] D. N. Page A 1995, gr-qc/9507025
+[69] M. J. Donald 1997, quant-ph/9703008
+[70] M. J. Donald 1999, quant-ph/9904001
+[71] L. Vaidman 1996, quant-ph/9609006, Int. Stud. Phil. Sci., in press
+[72] T. Sakaguchi 1997, quant-ph/9704039
+[73] M. Tegmark, quant-ph/9709032, Fortschr. Phys. 46, 855 (1997).
+[74] M. Tegmark, gr-qc/9704009, Annals of Physics 270, 1 (1998).
+[75] W. James, The Principles of Psychology (New York, Holt, 1890).
+[76] W. James 1904, in The Writings of William James, pp169-183, ed. J. J. McDermott (Chicago, Univ. Chicago Press, 1977).
+[77] A. Eddington, Space, Time & Gravitation (Cambridge, Cambridge Univ. Press, 1920).
+[78] A. M. Lisewski 1999, quant-ph/9907052
+[79] E. P. Wigner, Phys. Rev. 40, 749 (1932).
+M. Hillery, R. H. O’Connell, M. O. Scully & E. P. Wigner, Phys. Rep. 106, 121 (1984).
+Y. S. Kim and M. E. Noz, Phase Space Picture of Quantum Mechanics: Group Theoretical Approach (Singapore, World Scientific, 1991).
+
+